Tuesday, May 22, 2012

LightDHT - A lightweight python implementation of the Bittorrent distributed hashtable

I have always been fascinated by distributed data structures, but my efforts have always been dampened by the fact that they are really awkward to play with. In order to get anything that resembles real-life usage, you need to be running hundreds of nodes. Needless to say, wrangling hundreds of processes gets rather cumbersome. Fortunately, there are already thousands of people out there that are participating in one of the largest distributed data structures in operation today: the bittorrent distributed hashtable.
In order to facilitate experimentation with this wonderful system, I have written a simple python implementation of a DHT node. It is mostly (with a few exceptions -- see below) compliant with BEP0005. It should be able to act as a full member of the DHT, processing requests without disrupting the network.
Interested in playing around? Get the code on github!
From the documentation:
The aim of LightDHT is to provide a simple, flexible implementation of the Bittorrent DHT for use in research applications. If you want to trade files, you have come to the wrong place. LightDHT does not implement the actual file transfer parts of the bittorrent protocol. It only takes part in the DHT. 
The main philosophy of LightDHT comes down to two considerations: ease of use over performance and adaptability over scalability. 
In order to keep LightDHT easy to use, all DHT RPC calls are performed synchronously. This means that when you call a DHT method, your program will block until you have an answer to your request. That answer will be the return value of the function. This has the advantage that it keeps the logical program flow intact, and makes it more comfortable to use. 
In order to maintain O(log N) scaling across the network, BEP0005 (the standard governing the DHT) mandates that implementations use a bucket-based approach to the routing table. This enables the node to fulfill all requests in constant time and (more or less) constant memory. In LightDHT, we throw that recommendation to the wind. 
Since the main focus of LightDHT is reseach, we are going to keep around all the data we can. This means that we keep around every single node we know about. Since in practice the number of nodes is limited and the request rates are rather low, we do not bother keeping the routing table organized in a tree structure for quick lookups. Instead we keep it in a dictionary and sort on-demand. The performance penalty is well worth the reduced complexity of the implementation, and the flexibility of having all nodes in an easy to use data structure.

2 comments:

  1. DrX, hi,

    every now and then i revive attempts to get gittorrent up and running. this was where i left off, last time: git@gitorious.org:python-libbittorrent/pybtlib.git

    i'm *not* restricted to sticking with bittorrent as such, but it's easiest to start with.

    my goal is to make git *completely* independent of *any* single points of failure. would you be interested to help me integrate lightdht into pybtlib?

    thanks.

    ReplyDelete
  2. Hey there!

    My primary interest is tinkering with the DHT, since it's pretty much the largest distributed data structure I can get my hands on. I'm not interested in working on an actual file sharing application.

    That said, you are welcome to use the code of LightDHT in your program as long as you abide by the terms of the GPLv2 or we agree on different terms. Also, I'd be more than happy to help you out with specific issues relating to the code from time to time. Just ask here or on github.

    ReplyDelete