Tuesday, November 3, 2009

Looking Up Data in P2P Systems

H. Balakrishnan, F. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Looking Up Data in P2P Systems," Communications of the ACM, vol. 46, no. 2, February 2003.
This paper gives a great overview of the state of the art in P2P computing as of 2003: its advantages, the proposed solutions, and their limitations. The authors highlight that P2P systems are attractive for the following reasons:
  • Low initial setup cost.
  • Enormous aggregate computing capacity from pooling the resources of many nodes.
  • Fault tolerance, since there is no single point of failure.
Most P2P system designs essentially try to solve the 'lookup' problem: finding a data item X that may be stored on any of a dynamic set of nodes in the system. In the past, the following approaches have been considered:
  • Maintain a central database: a server holds mappings from file names to the servers that store them; Napster did this. The advantages are that it is simple to implement and doesn't require many hops, but it results in scalability and resilience problems.
  • Hierarchy: the idea, along the lines of DNS, that lookup should be hierarchical. The disadvantage is that nodes higher in the hierarchy are 'more important' than others, and their failure may be catastrophic to some or all parts of the system.
  • Symmetric lookup algorithms: here, all nodes are treated as equals. These schemes allow nodes to self-organize into an efficient overlay structure, and they come in many flavors, from Gnutella (a broadcast approach) to Chord (a logarithmic-hop approach). A sketch of the key-placement idea these schemes share appears after this list.
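To make the symmetric idea concrete, here is a minimal Python sketch of consistent hashing, the key-placement scheme these systems build on (my own illustration, not code from the paper): nodes and keys are hashed into one circular ID space, and a key belongs to the first node clockwise from its ID.

    import hashlib
    from bisect import bisect_left

    def ring_id(name: str, bits: int = 32) -> int:
        """Hash a node name or key onto a circle of 2**bits points."""
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** bits)

    class Ring:
        """A key is owned by the first node clockwise from its ID."""
        def __init__(self, nodes):
            self.ids = sorted(ring_id(n) for n in nodes)
            self.names = {ring_id(n): n for n in nodes}

        def lookup(self, key: str) -> str:
            i = bisect_left(self.ids, ring_id(key)) % len(self.ids)  # wrap past the top
            return self.names[self.ids[i]]

    ring = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(ring.lookup("some-file.mp3"))  # deterministic owner, no central index

Because the hash spreads IDs roughly uniformly, each node ends up owning a similar share of the keys, and a node joining or leaving only moves the keys on its own arc of the circle.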
Further, the authors discuss various issues in implementing a DHT, which requires a load-balanced way of mapping keys onto nodes (which I will discuss more in the Chord blog entry), and survey the design of routing/lookup algorithms: one-dimensional schemes like Chord, tree-based schemes like Pastry, and multi-dimensional schemes like CAN. Each has its own set of advantages and disadvantages, and it is really a matter of trade-offs and performance metrics that decides which is best.
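Since the headline property of Chord-style routing is the logarithmic hop count, here is a self-contained toy simulation (my own sketch under simplifying assumptions, not the paper's code): every node keeps a 'finger' at the successor of its ID plus 2**i, and each lookup step jumps to the farthest finger that does not overshoot the key.

    import random

    M = 16                                   # ID space: 0 .. 2**M - 1

    def in_interval(x, a, b):
        """True if x lies in the half-open ring interval (a, b]."""
        return (a < x <= b) if a < b else (x > a or x <= b)

    def successor(ring, ident):
        """First node clockwise from ident (linear scan: simulation only)."""
        for nid in ring:
            if nid >= ident:
                return nid
        return ring[0]                       # wrap around the top of the circle

    def fingers(ring, n):
        """Finger i of node n points at successor(n + 2**i)."""
        return [successor(ring, (n + 2 ** i) % 2 ** M) for i in range(M)]

    def lookup(ring, start, key, hops=0):
        succ = successor(ring, (start + 1) % 2 ** M)
        if in_interval(key, start, succ):
            return succ, hops + 1            # the immediate successor owns the key
        for f in reversed(fingers(ring, start)):
            if in_interval(f, start, key):   # farthest finger short of the key
                return lookup(ring, f, key, hops + 1)

    ring = sorted(random.sample(range(2 ** M), 200))   # 200 random node IDs
    _, hops = lookup(ring, ring[0], random.randrange(2 ** M))
    print(f"resolved in {hops} hops among 200 nodes")  # typically ~log2(200)

Each greedy jump at least halves the remaining clockwise distance to the key, which is where the O(log N) bound comes from.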

Comments

In the end, the authors raise some very good questions about the future of P2P computing. I found the concept of proximity routing particularly interesting.
  • Proximity Routing: Broadly, I would define it as fetching a file from the peer closest to me in terms of routing distance, thereby minimizing network bandwidth utilization as well as latency. However, such a design involves a trade-off: assigning consecutive IDs to nodes in the same network means that network's failure can take a consecutive chunk of nodes offline, hurting performance. It would be interesting to know whether some system navigates this trade-off with a hybrid heuristic of sorts. A toy illustration of the basic idea appears below.
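As a toy illustration of the proximity idea (entirely hypothetical; probe() stands in for a real round-trip-time measurement such as a small ping), picking the nearest of several replicas is just a min over measured latencies:

    import random, time

    def probe(replica: str) -> float:
        """Stand-in for an RTT measurement to a replica."""
        time.sleep(0.001)                    # pretend to send a tiny ping
        return random.uniform(5.0, 150.0)    # simulated RTT in milliseconds

    def nearest_replica(replicas):
        """Fetch from whichever replica answers the probe fastest."""
        return min(replicas, key=probe)

    replicas = ["node-paris", "node-tokyo", "node-boston"]
    print("fetching from", nearest_replica(replicas))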
Further, as an interesting aside, these systems always describe a unidirectional way of locating files: if I want to search for key K, I initiate a request in only one direction and move toward the key. What about a system that sends multiple requests to adjacent nodes, along the lines of how gossip protocols work? One obvious negative is the additional network utilization, but it would be interesting to know whether it results in quicker lookups. This option makes even more sense with multicore processors and multithreaded applications, which can all send lookup requests simultaneously; a toy sketch follows.
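Here is a toy sketch of that aside (my own, and hypothetical: query_neighbor() stands in for a real remote lookup). The same query is fired at several neighbors at once and the first answer wins, trading extra traffic for lower lookup latency:

    import random, time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def query_neighbor(neighbor: str, key: str) -> str:
        """Simulated remote lookup with variable network latency."""
        time.sleep(random.uniform(0.01, 0.2))
        return f"{neighbor} resolved key {key}"

    def parallel_lookup(neighbors, key):
        """Send the lookup to every neighbor; return the first answer."""
        with ThreadPoolExecutor(max_workers=len(neighbors)) as pool:
            futures = [pool.submit(query_neighbor, n, key) for n in neighbors]
            for done in as_completed(futures):
                return done.result()         # later answers are wasted work

    print(parallel_lookup(["n1", "n2", "n3", "n4"], "K"))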
