Saturday, December 19, 2009

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

S. Nedevschi, J. Chandrashekar, J. Liu, B. Nordman, S. Ratnasamy, N. Taft, "Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems," NSDI'09, (April 2009).
This paper argues that networked systems draw a great deal of power even when they are idle, which should not be the case. Although various solutions such as wake-on-LAN (WoL) and network connectivity proxies (NCP) have been proposed over the years to reduce this waste, there has been no in-depth evaluation of the potential energy savings. The authors propose an extensible proxy architecture and ground it in an analysis of a four-week trace of common network protocol traffic from about 250 enterprise users (90% laptops and 10% desktops) at the Intel Berkeley Research lab. Their work is essentially a trace-driven evaluation of the broader benefits and design tradeoffs in designing proxy architectures.

The trace evaluation shows some really interesting results. On average, desktops were idle more than 50% of the time, wasting more than 60% of their energy; this scales to roughly 6 billion dollars wasted per year (assuming 170 million desktop PCs in the US). However, a strict wake-on-LAN scheme that wakes a node on receiving any packet does not work well, because incoming traffic remains high even when the node is idle. To tackle this, a hybrid scheme is needed: one that proxies for certain protocol packets and wakes the host only for important packets that require its attention. The authors analyzed the protocols seen in the trace and identified the "key offenders", using packet volume and the resulting loss of sleep time as metrics. Once identified, a key offender's packets can be ignored, answered by the proxy, or used as a cue to wake the client.

Having identified the key offenders, the authors designed a general proxy architecture built around a set of rules of the form (trigger, action). The trigger is based on the incoming packet and the proxy state, while the action can be drop, wake, respond, or redirect. The system requires no modification of end hosts and could be implemented as a separate network product, which makes it extremely useful and easily deployable.
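
To make the rule structure concrete, here is a minimal sketch of how such (trigger, action) rules might be expressed; the protocols, ports, and fields are illustrative, not the paper's actual rule set:

    # Minimal sketch of a (trigger, action) rule table for an idle-time proxy.
    # Protocols, ports, and field names are illustrative only.
    def classify(packet, host):
        """Return the proxy's action for a packet arriving while the host sleeps."""
        if packet.proto == "ARP" and packet.target_ip == host.ip:
            return "respond"                     # answer ARP on the host's behalf
        if packet.proto in ("SSDP", "NetBIOS"):  # chatty broadcast "key offenders"
            return "drop"                        # safe to ignore while idle
        if packet.proto == "TCP" and packet.dst_port == 22:
            return "wake"                        # incoming SSH needs the real host
        return "redirect"                        # hand anything unclear to a fuller proxy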

Cutting the Electric Bill for Internet-Scale Systems

A. Qureshi, R. Weber, H. Balakrishnan, J. Guttag, B. Maggs, "Cutting the Electric Bill for Internet-Scale Systems," ACM SIGCOMM Conference, (August 2009).
This paper discusses how data center operating costs can be reduced by millions of dollars by shifting computation to the locations where electricity is cheapest at that moment. The authors argue that electricity prices at different places in the US are highly variable and show little correlation with one another. Because prices also change rapidly over time, computation must be moved between locations continuously and dynamically. They further introduce the concept of energy elasticity: the degree to which the energy consumed by a cluster depends on the load placed on it. The maximum cost reduction their approach can achieve hinges on this factor. The analysis also assumes that the extra bandwidth used to route additional data does not affect overall costs.

The work mainly involves an extensive survey of electricity prices and a feasibility study of such a scheme using data from Akamai's systems. It models energy consumption and the added routing cost to show, as a proof of concept, that such a scheme could be profitable for an organization with large data centers in different geographical locations. The main caveat is latency: offloading computation to the cheapest location can increase latency, which for a company like Google may matter more than the few million dollars saved.
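
A hedged sketch of the core routing decision, assuming hourly electricity prices and a simple client-latency bound; all names and numbers below are made up for illustration:

    # Re-weight request routing toward the data center with the cheapest current
    # electricity, subject to a latency bound. Prices/latencies are hypothetical.
    def pick_datacenter(datacenters, price, latency_ms, max_latency_ms):
        eligible = [dc for dc in datacenters if latency_ms[dc] <= max_latency_ms]
        return min(eligible or datacenters, key=lambda dc: price[dc])

    price = {"virginia": 41.0, "texas": 28.5, "oregon": 33.0}   # $/MWh, made up
    latency_ms = {"virginia": 20, "texas": 45, "oregon": 80}    # ms, made up
    print(pick_datacenter(list(price), price, latency_ms, 50))  # -> texas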

A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing

S. Floyd, V. Jacobson, S. McCanne, C-G Liu, L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing," ACM SIGCOMM Conference, (August 1995).
This paper describes SRM (Scalable Reliable Multicast), a reliable multicast framework built around application-level framing and light-weight sessions and layered on the IP Multicast group delivery model. The paper starts off by identifying key problems in extending basic TCP-style reliability to multicast: ACK implosion, the burden of maintaining dynamic group membership information, and the very notion of a single-receiver RTT. Keeping these problems in mind, the authors argue that a sender-based reliability scheme like TCP's would not work in this setting and that a receiver-based scheme is needed instead. Furthermore, reliability should be maintained on an 'application data unit' (ADU) basis. This puts the application in control, and since ADU names can be made independent of the sending host, the scheme can use the anonymity of IP Multicast to exploit the redundancy of multiple receivers.

The system is presented in the context of wb, a shared-whiteboard network conferencing tool. The underlying idea is to prevent duplicate requests for the same ADU from the many receivers in the multicast group, which is achieved by having every member listen promiscuously to the retransmit requests and responses of other receivers. Whenever a receiver detects a loss, it schedules a repair request for a random time in the future, uniformly chosen from [C1*d, (C1+C2)*d], where d is the receiver's estimate of the one-way delay to the original source of the data. Similarly, when host B receives a request from A that it is capable of answering, B sets a repair timer to a value drawn from [D1*d, (D1+D2)*d], where d is now B's estimate of the delay to A. The constants C1, C2, D1, and D2 are chosen so that as few duplicate requests and repairs as possible are generated when packets are lost.
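
A small sketch of these randomized timers, with d and the constants as defined above; this shows only the timer-setting step, not the suppression bookkeeping:

    import random

    def request_timer(d, C1, C2):
        # A receiver that detects a loss schedules its repair request uniformly
        # in [C1*d, (C1+C2)*d] and cancels it if it overhears the same request.
        return random.uniform(C1 * d, (C1 + C2) * d)

    def repair_timer(d, D1, D2):
        # A host able to answer schedules its retransmission uniformly in
        # [D1*d, (D1+D2)*d] and suppresses it if another repair is heard first.
        return random.uniform(D1 * d, (D1 + D2) * d)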

The authors work through examples on various topologies such as chains and stars, and run simulations on random trees as well as large bounded-degree trees to show the scalability of the protocol. They also propose an adaptive adjustment of the constants via a random timer algorithm driven by the observed average number of duplicate requests. To conclude, the authors discuss how the framework is generic in nature and not limited to the wb-style multicast model.

Scalable Application Layer Multicast

S. Banerjee, B. Bhattacharjee, C. Kommareddy, "Scalable Application Layer Multicast," ACM SIGCOMM Conference, (August 2002).

This paper presents an application-layer multicast protocol specifically designed for low-bandwidth data-streaming applications with large receiver sets. Because it is based on hierarchical clustering of the application-layer multicast nodes, it is highly scalable and not very sensitive to churn. The authors use two measures of goodness for an application-layer multicast overlay: per-link stress (the number of identical packets the protocol sends over each underlying link) and per-member stretch (the ratio of the path length from the source to the member along the overlay to the length of the direct unicast path).

The authors propose NICE, an application-layer multicast scheme that reduces the control overhead at any member to O(log N). They also show that the average member maintains state for only a constant number of other members and incurs constant control overhead for topology creation and maintenance. The overall idea is as follows. All receivers are divided into clusters of size between k and 3k-1 (where k is a constant), each consisting of hosts that are close to one another. Each cluster has a leader, preferably a node roughly equidistant from all other nodes in the cluster. The cluster leaders are in turn clustered at the next layer of the hierarchy, and this continues until a single node sits at the top. Data is forwarded hierarchically: it reaches the topmost node, which forwards it to its children, and they in turn forward it down the tree. On average the control overhead incurred by a node is O(k), while in the worst case it can be up to O(k log N).
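
A back-of-envelope sketch of how the hierarchy scales with the cluster parameter k; this is my own illustration of the O(k) / O(k log N) figures above, not code from the paper:

    import math

    def num_layers(N, k):
        # Clusters hold between k and 3k-1 members, so each layer shrinks the
        # population by roughly a factor of k: about log_k(N) layers in total.
        return max(1, math.ceil(math.log(N, k)))

    def worst_case_state(N, k):
        # The topmost leader belongs to one cluster per layer, so it keeps O(k)
        # state per layer, i.e. O(k log_k N) in the worst case; an average
        # member sits in a single cluster and keeps only O(k) state.
        return 3 * k * num_layers(N, k)

    print(num_layers(100_000, 3), worst_case_state(100_000, 3))   # e.g. 11 layers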

The simulation results compare average link stress between NICE and Narada-5, and average path length (in hops) among NICE, Narada-5, IP Multicast, and unicast. NICE performed better than Narada-5 over long periods of time, and the results make a promising case for application-layer multicast.

Tuesday, November 24, 2009

BotGraph: Large Scale Spamming Botnet Detection

Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, E. Gillum, "BotGraph: Large Scale Spamming Botnet Detection," NSDI'09, (April 2009).
This paper also tackles botnet detection, albeit with an entirely different approach than the previous paper. The authors focus on the web-account abuse attack: a spammer creates many accounts (bot-users) on popular mail services and then uses a large number of infected computers (bots) to send spam through these accounts. The authors design and implement a system called BotGraph to detect these attacks at large scale (Hotmail logs of 500 million users, identifying over 26 million botnet-created accounts). The methodology revolves around building a giant user-user graph and then analyzing it for bot-user groups by extracting connected components. The analysis considers the following three strategies adopted by spammers:
  1. Assigning bot-users (email accounts) to random bots (infected machines).
  2. Keeping a queue of bot-users; whenever bots come online (in random order), the spammer assigns the requesting bot the top k bot-users from the queue.
  3. Placing no limit on the number of requests a bot can make, and assigning it a single bot-user per request.
These three strategies result in different graphs and call for a recursive algorithm. Each node represents a bot-user, and an edge connects two users if they share a login IP; the edge weight is the number of shared IPs. The algorithm first prunes all edges with weight less than 2 and extracts the connected components, then recursively extracts components within these subgraphs using progressively higher weight thresholds. Apart from the algorithmic side, the computational aspect of the paper is equally challenging: building and analyzing this graph over 500 million email accounts is extremely expensive. The authors implemented BotGraph on Dryad/DryadLINQ, a programming environment for distributed parallel computing, and used a method called selective filtering that partitions the input by user ID and ships only the related records across partitions.
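
The recursive grouping step might look roughly like the following sketch (using networkx for convenience; the real system runs this at massive scale on Dryad/DryadLINQ, and the thresholds here are only illustrative):

    import networkx as nx

    def extract_groups(G, threshold=2, min_size=100):
        # Keep only edges whose weight (number of shared IPs) meets the threshold.
        pruned = nx.Graph([(u, v, d) for u, v, d in G.edges(data=True)
                           if d["weight"] >= threshold])
        groups = []
        for comp in nx.connected_components(pruned):
            if len(comp) < min_size:
                continue                          # too small to be a bot-user group
            # Recurse with a stricter threshold to split loosely connected groups.
            finer = extract_groups(pruned.subgraph(comp), threshold + 1, min_size)
            groups.extend(finer if finer else [comp])
        return groups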

Comments

There is an interesting comparison to be made between the solutions provided by NAB and BotGraph. Though both papers try to solve the same problem, their approaches are entirely different: NAB separates bot traffic from human traffic, whereas BotGraph identifies bot-user accounts. In the long term, NAB has the additional advantage of catching any bot traffic (DDoS or click-fraud), not just spam email, whereas BotGraph has the advantage of being immediately practical: email providers can quickly analyze the user-user graph and ban bot accounts. It would be interesting to discuss which is more beneficial for, say, Google, which runs an email service and also worries about DDoS attacks and click fraud. At this point, the best answer appears to be 'both'.

Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks

R. Gummadi, H. Balakrishnan, P. Maniatis, S. Ratnasamy, "Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks," NSDI'09, (April 2009).
This paper is about identifying traffic sent from the compromised machines that make up botnets. It aims to separate human-generated traffic from bot-generated traffic by exploiting the observation that human activity is generally accompanied by keyboard or mouse input. The authors design a system called NAB ("Not-A-Bot") to approximately identify human-generated activity. It relies on a trusted software component called the 'attester', which runs on the client machine alongside an untrusted OS. The authors implemented their attester in the Xen hypervisor but claim that other implementations, such as building the attester into trusted hardware or running it in software without virtualization, are also possible. (The last option actually looks very promising, and I wonder why it was not pursued.)


Design

The authors build the attester on top of a TPM. A TPM provides many security services, among them the ability to measure and attest to the integrity of trusted software loaded at boot time. The attester relies on two key TPM primitives: direct anonymous attestation (DAA) and sealed storage. The basis of the design is that whenever an application sends a message, it must obtain an attestation from the attester. The attestation contains a hash of the message, a nonce, and range values, signed with the attester's private key and accompanied by its certificate. An attestation is granted only if the request is generated within some interval (delta-k and delta-m) of keyboard or mouse activity, with certain exceptions such as scripted mail, which is handled via the requesting process's PID. On the other end, a verifier implements policies based on the attestation. It could drop non-attested packets altogether, but practically the authors advocate giving non-attested packets lower priority, i.e., making sure attested packets always get through. On top of this, the paper builds and evaluates spam mitigation, DDoS prevention, and click-fraud mitigation.
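
A rough sketch of the attester's core decision, assuming it tracks the timestamps of the last keyboard and mouse events; the delta values, argument layout, and sign() hook are placeholders, not NAB's actual interface:

    import time, hashlib

    DELTA_K = 1.0   # allowed seconds since the last keystroke (placeholder value)
    DELTA_M = 1.0   # allowed seconds since the last mouse event (placeholder value)

    def attest(message, nonce, last_key_ts, last_mouse_ts, sign):
        # message and nonce are bytes; sign() stands in for the TPM-backed signature.
        now = time.time()
        if (now - last_key_ts) > DELTA_K and (now - last_mouse_ts) > DELTA_M:
            return None                      # no recent human activity: refuse
        digest = hashlib.sha256(message + nonce).hexdigest()
        # The real attester signs with a TPM-protected key and attaches its certificate.
        return sign(digest)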

Comments

Overall, this was an excellent paper. It tackled a very practical problem and came up with a partially practical solution too. However, I have certain comments on the effectiveness of the protocol:

  1. It is entirely possible to build adaptive botnets that send data only when there is a mouse or keyboard event. In the worst case this merely forces the botnet to send its data in bursts, throttling the sender rather than stopping it.
  2. The current design and implementation involve modifying applications and protocols (to accommodate attestation data), which raises concerns about deployability.
  3. Although the authors state that they do not intend to 'punish' applications that do not use attesters, giving priority to attested traffic seems to work against that very goal.
  4. The evaluation shows that 0.08% of human-generated mail is still misclassified as spam. Shouldn't this ideally be zero now? Given that scripted mail is already handled, what causes these misclassifications?
  5. It was a bit of a surprise when the authors claimed that 600 GB of disk overhead for storing nonces is not much for an ISP. Is that scalable?

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

R. Fonseca, G. Porter, R. H. Katz, S. Shenker, I. Stoica, "X-Trace: A Pervasive Network Tracing Framework," NSDI'07, (April 2007).
This paper presents X-Trace, a pervasive network tracing framework. The motivation is that applications spanning multiple administrative domains (ADs), layers, and contexts make it extremely difficult to get a comprehensive view of the system. Various diagnosis tools exist, but most are either layer-specific or application-specific. X-Trace aims to be pervasive and generic in its approach (at the cost of requiring the participating ADs to agree to support it).

The user invokes X-Trace when initiating an application task by inserting X-Trace metadata into the resulting request. The metadata consists of flags (specifying which components of the metadata are present), a unique TaskID, a TreeInfo field (a ParentID, OpID, EdgeType tuple), an optional Destination field indicating the network address to which reports should be sent, and an Options field to accommodate future extensions. The metadata propagates through the network via pushDown() and pushNext() calls: pushDown() copies metadata down to the next lower logical layer, while pushNext() carries it to the next component at the same layer. Each node reports the X-Trace metadata it sees to the report server, which uses the reports to construct a task tree. This tree lets the report server reconstruct how requests traveled through the network, and incomplete or broken traces help the administrator localize possible network problems.
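
A minimal sketch of how this metadata might be carried and updated, with the fields simplified to the TaskID and the ParentID/OpID/EdgeType chain; the helper names are mine, not the X-Trace API:

    import os
    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class XTraceMD:
        task_id: str
        parent_id: str
        op_id: str
        edge_type: str = "NEXT"

    def new_op_id():
        return os.urandom(4).hex()

    def push_next(md):
        # Same layer, next component: the old OpID becomes the parent, edge NEXT.
        return replace(md, parent_id=md.op_id, op_id=new_op_id(), edge_type="NEXT")

    def push_down(md):
        # Copied into the layer below with a fresh OpID and edge DOWN; the TaskID
        # stays the same so all reports join into one task tree.
        return replace(md, parent_id=md.op_id, op_id=new_op_id(), edge_type="DOWN")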

Further, the authors discuss additional uses of X-Trace for tunneling, ISP connectivity troubleshooting, and link-layer tracing. They evaluate the system in three deployment scenarios: DNS resolution, a three-tiered photo-hosting website, and a service accessed through the i3 overlay network.

Comments
Another well-written paper: it addresses a very real problem and proposes a very generic solution. However, a few questions arise about its effectiveness and ease of deployment:
  1. It requires modifications to clients, servers, and network devices to carry X-Trace metadata, which limits quick deployment.
  2. Though the authors claim that partial deployment can still find faults at some granularity, it remains an open question how useful these partial graphs will be.
  3. Reports lost on their way back to the report server may be mistaken for failures.
  4. Though I do not understand this point fully, the authors admit that a tree structure does not capture all possible network actions; they mention quorum protocols, or a controller that sends jobs to many worker nodes and waits for all of them to complete.
Overall, it would also be interesting to discuss how X-Trace could be extended along the lines of NetMedic to automatically detect network failures or misconfiguration errors as well.

Tuesday, November 10, 2009

NetFPGA: A Tool for Network Research and Education

G. Watson, N. McKeown, and M. Casado. NetFPGA: A Tool for Network Research and Education. In 2nd workshop on Architectural Research using FPGA Platforms (WARFP), 2006.
This paper describes NetFPGA, a platform developed at Stanford that lets users build real networking hardware, written in a hardware description language such as Verilog, that can be deployed and debugged in an operational network. FPGAs (Field-Programmable Gate Arrays) are chips that users configure after manufacture using standard hardware description languages; they consist of logic blocks that can be programmatically wired together to perform complex functions. The NetFPGA project leverages this technology to build network-oriented FPGA boards that can implement an Ethernet switch or an Internet router (and probably other middleboxes too?). At the time the paper was written, the project had released two versions:

NetFPGA-v1
This consisted of:
  • Three Altera EP20K400 APEX devices
  • An 8-port Ethernet controller
  • Three 1 MB SRAMs
  • Ancillary logic
- Synthesis and Design tool: Synopsys
- Place and route tool: Altera Quartus

Further, a locally developed tool, the Virtual Network System, was used to map the NetFPGA-v1 ports onto the campus network. The design flow emphasized simulating first and debugging in hardware last. Because there was no on-chip CPU, the vendor's analysis tool could not be used, which led the authors to develop their own simple logic analyzer. Version 1 had various limitations: the board format required an obscure rack that was difficult to obtain and needed self-assembly, the line rate was limited to 10 Mb/s, there was no on-board CPU, and the development environment was limited to Linux and Solaris. These shortcomings led the authors to work on NetFPGA-v2.


NetFPGA-v2
The main changes in v2 were:
  1. The PCI card format was adopted. This provided a format familiar to most users as well as greater bandwidth.
  2. Further tools were provided to enable development under Linux, Solaris, and Windows XP.
- Synthesis and Design tool: Synopsys VCS (under Linux), Mentor's ModelSim (Linux and XP)
- Place and route tool: Xilinx ISE tool (Linux or XP)

Using a CPU within the Virtex devices is still being explored. The authors had built 18 NetFPGA-v2 boards and had a bus-master DMA driver for the 4-port Ethernet interface. At the time of writing, they were also in the process of implementing RCP (the Rate Control Protocol) on NetFPGA-v2.

Comments

NetFPGA is definitely a great technology. It helps researchers actually implement a system and test how it works in a real environment, taking into account hardware issues that software routers cannot capture. I have a few questions, though, that would be great to discuss in class:
  1. As I see it, there are three classes of packet-processing platforms: software routers, FPGAs, and ASICs. NetFPGA is clearly better than software routers; it would be interesting to discuss how close it comes to application-specific ICs (ASICs).
  2. Aren't software routers sufficient to test the 'correctness' of a protocol? If so, why is NetFPGA an exciting tool for researchers as well?

A Policy-aware Switching Layer for Data Centers

D. Joseph, A. Tavakoli, I. Stoica, "A Policy-aware Switching Layer for Data Centers," ACM SIGCOMM Conference, (August 2008).
To protect, manage, and improve the performance of data center applications, data centers deploy a large variety of middleboxes, including firewalls, load balancers, SSL offloaders, web caches, and intrusion prevention boxes. Today, deploying these middleboxes means manually placing them on the physical path and then coercing traffic through them by overloading layer-2 mechanisms. This paper instead proposes a policy-aware switching layer that lets switches direct the packets passing through them to the appropriate 'off-path' middleboxes. The system is designed with the following three goals in mind:
  1. Correctness: Traffic should traverse the specified middleboxes in the correct order.
  2. Flexibility: The sequence should be easily reconfigurable.
  3. Efficiency: Traffic should not traverse unnecessary middleboxes.
Keeping these goals in mind, the authors propose PLayer, which is built around two principles:
  • Separating policy from reachability: Policies, expressed as tuples that map a traffic selector to the sequence of middleboxes it must traverse, are installed in the switches; a pswitch matches each frame against these policies and redirects it to the appropriate middlebox (a minimal sketch of such a lookup follows the list). Policies are pushed to individual switches as updates from a single policy manager, and care must be taken that two adjacent pswitches run the same policy version to avoid loops or the bypassing of critical middleboxes.
  • Taking middleboxes off the physical network path: Middleboxes no longer sit on the physical path; instead they hang off pswitch interfaces. Middleboxes themselves need not be modified, because a SRCMACREWRITER element placed between the middlebox and the regular Ethernet switch handles the required frame rewriting.
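
A minimal sketch of the policy lookup a pswitch might perform; the selectors, middlebox names, and frame representation are all illustrative:

    POLICY_TABLE = [
        # (traffic selector, ordered middlebox sequence)
        ({"dst_port": 80},  ["firewall", "load_balancer"]),
        ({"dst_port": 443}, ["firewall", "ssl_offloader", "load_balancer"]),
    ]

    def next_hop(frame, hops_done):
        for selector, sequence in POLICY_TABLE:
            if all(frame.get(k) == v for k, v in selector.items()):
                if hops_done < len(sequence):
                    return sequence[hops_done]   # forward to the next middlebox
                return "final_destination"       # sequence fully traversed
        return "final_destination"               # no policy matches: plain forwarding
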
Comments

I found this paper extremely well written and relevant. It tackles a real problem in a fairly general manner. However, there are some rough edges that need to be worked out:
  1. The 15-byte frame encapsulation overhead may push frame sizes beyond the 1500-byte MTU.
  2. MAC address and VLAN spoofing attacks can render the protocol ineffective.
  3. Stateful PSwitches may not always be good for scalability.

Internet Indirection Infrastructure

I. Stoica, D. Adkins, S. Zhuang, S. Shenker, S. Surana, "Internet Indirection Infrastructure," ACM SIGCOMM Conference, (August 2002).
This paper attempts to generalize the Internet's point-to-point communication abstraction. Traditionally, the Internet has been based on a push model: the sender pushes a packet to a given receiver by explicitly addressing it. i3, on the other hand, proposes what we can think of as a push-pull model, where the sender pushes a packet into the network and it is the receiver's responsibility to pull the packet toward itself.

Broadly, each packet is associated with an identifier that the receiver uses to obtain delivery of the packet. The sender sends packets as (id, data), where id is an m-bit identifier and data is the payload; the packet is directed to one of many i3 servers. Receivers, on the other hand, insert triggers of the form (id, addr) to indicate their interest in packets with that identifier. It is the job of the i3 servers, which form an overlay network, to match identifiers against triggers and forward the packet to the given addr. The authors also propose a generalization in which the sender sends (id_stack, data) and triggers take the form (id, id_stack); the i3 servers are organized as a Chord ring, and identifiers are mapped to servers based on a prefix (the first k bits) of their m-bit identifiers.
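
A simplified sketch of the matching an i3 server performs; real i3 uses 256-bit identifiers with exact matching on a long prefix and runs over Chord, so the prefix length and helpers here are only illustrative:

    PREFIX_BITS = 24
    triggers = {}                    # id prefix -> list of receiver addresses

    def prefix(ident):
        return ident >> (256 - PREFIX_BITS)      # assume 256-bit identifiers

    def insert_trigger(ident, addr):
        triggers.setdefault(prefix(ident), []).append(addr)

    def deliver(ident, data, send):
        # Forward the packet to every receiver whose trigger matches the id;
        # multicast falls out for free when several triggers share an identifier.
        for addr in triggers.get(prefix(ident), []):
            send(addr, data)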

Based on this, the authors show that a variety of services can be implemented efficiently. Applications involving mobile receivers become easy, since a receiver can update the i3 servers on the fly by inserting a new trigger whenever it moves. Multicast and anycast can be implemented similarly, without the sender needing to know individual receiver addresses. The authors also show that powerful applications such as service composition can be implemented at the indirection servers without the sender's involvement, which I feel is the main strength of the paper.

Comments

Overall, the paper builds on a great idea! The push-pull model is nicely explained and makes intuitive sense. However, some issues come to mind:
  1. i3, like other overlay networks, suffers from latency stretch: every packet is routed through an i3 server instead of being sent directly to the receiver. Though the authors propose several techniques for storing triggers on nearby i3 servers, it will always incur higher latency than direct delivery.
  2. The challenge-based protocol to prevent DDoS attacks, which consists of sending a nonce that the receiver must echo back, seemed too simplistic and could easily be broken.
  3. The authors do not say much about how long packets should be stored on each server, the trigger validity period, and so on. These values are crucial if i3 is to be deployed at large scale.

Thursday, November 5, 2009

DNS Performance and the Effectiveness of Caching

J. Jung, E. Sit, H. Balakrishnan, "DNS Performance and the Effectiveness of Caching," IEEE/ACM Transactions on Networking, V. 10, N. 5, (October 2002).
This paper presents a detailed analysis of DNS traces and associated traffic collected at two locations -- MIT LCS and KAIST. The analysis yields insights into how clients interact with the DNS, the effects of caching, and the performance clients perceive. It was widely believed that two factors contributed to the scalability of DNS:
  • Hierarchical Design around administratively delegated namespaces.
  • Aggressive use of caching.
Caching is effective only when TTLs are reasonably high, since otherwise cached entries expire before they can be reused. High TTLs, however, are undesirable in settings such as mobile servers, where name-to-address mappings change frequently. This tension led the authors to dig deeper into the effectiveness of DNS caching by measuring the amount of DNS traffic within the wide-area traffic. Some of the interesting findings were:

  1. Low TTLs on A records (hostname-to-IP mappings) do not really hurt performance, since in practice they barely affect network bandwidth; caching of NS records, on the other hand, substantially improves DNS performance.
  2. The major cause of client-observed latency is recursive referrals: for lookups involving two or more referrals, more than 95% took over 100 ms and about 50% took over 1 second to resolve.
  3. Almost half of the lookups were not associated with any TCP connection, contrary to the initial assumption.

Development of the Domain Name System

P. Mockapetris, K. Dunlap, "Development of the Domain Name System," ACM SIGCOMM Conference, 1988.
This paper examines the ideas and issues behind the initial design of the DNS (Domain Name System) in 1983, exploring the improvements it brought over existing systems, its successes and shortcomings, and its performance characteristics.

Broadly speaking, the job of the DNS is to provide a mapping between hostnames and IP addresses. Queries usually take a hostname and return its IP address, though the reverse lookup is also possible. Prior to DNS, there were two popular designs, which the authors discuss:
  • HOSTS.TXT file: a single file maintained by SRI-NIC containing all hostname-to-IP mappings. This simple solution worked well while the number of hosts was small but had obvious scalability problems as the network grew.
  • XEROX Grapevine: quite sophisticated, with heavy use of replication, light use of caching, and a fixed number of hierarchy levels. The fixed hierarchy tied it to a particular protocol architecture, so importing its design would have meant importing elements of that architecture. These limitations led to the development of a new design.
DNS Architecture

The architecture had to balance many tradeoffs. Ideally, such a system would behave like a perfect 'dynamic database', with atomicity, voting, backups, and so on, but it was felt that these features would make it too complex to be adopted readily. The active components of the system are of two major types -- name servers and resolvers. Name servers hold repositories of information, while resolvers interface with client programs and embody the algorithms for finding the right name servers. The DNS name space is a variable-depth tree in which each node has an associated label. Subtrees can be managed by different organizations, which are free to assign labels within them: for instance, berkeley.edu is a subtree of the edu node and manages its own subtrees, which can in turn be delegated to different administrative units (e.g. cs.berkeley.edu, me.berkeley.edu). The third important concept is that the data for each name is organized as a set of resource records (RRs), essentially <type, class> tuples: the type represents an abstract resource (e.g. a host address or a mailbox), while the class identifies the protocol family.
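
A tiny illustration of this resource-record view; the zone contents below are invented, and real records carry more fields:

    from collections import namedtuple

    RR = namedtuple("RR", "name rtype rclass ttl rdata")

    zone = [
        RR("berkeley.edu",    "NS", "IN", 86400, "ns.berkeley.edu"),
        RR("cs.berkeley.edu", "A",  "IN", 3600,  "192.0.2.10"),        # example address
        RR("cs.berkeley.edu", "MX", "IN", 3600,  "mail.cs.berkeley.edu"),
    ]

    def lookup(name, rtype):
        return [rr for rr in zone if rr.name == name and rr.rtype == rtype]

    print(lookup("cs.berkeley.edu", "A"))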

The authors also discuss the importance of caching in DNS and its dependence on TTLs; low TTL values tend to make caching ineffective. One important observation is that even though the root servers process the vast majority of requests in under 100 ms, clients see typical response times of 500 ms to 5 s, likely due to the recursive nature of resolution and other network latencies.

Comments

It was nice to read this paper for a historical perspective on the design decisions that led to the DNS. In the present context, though, a few questions come to mind:
  1. Given that DNS is essentially about looking up a mapping, how about designing it on top of a DHT? Such a system could help with the ever-increasing scalability issues. Especially as the world moves toward URIs (Uniform Resource Identifiers) and the namespace keeps growing, it seems sensible to have such a system in place.
  2. Not much was done about DNS security at the time. That made some sense when networks were mostly trusted, but it would be interesting to know how today's systems deal with this problem.

Tuesday, November 3, 2009

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications

I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications," ACM SIGCOMM Conference, 2001.
This paper presents Chord, a distributed lookup protocol that addresses the lookup problem (finding the node to which a particular key is mapped) in a simple and provably correct manner.


Chord Protocol Design
  • Consistent hashing: The hash function assigns each node and each key an m-bit identifier, using SHA-1 on the node's IP address and on the key value, respectively. An interesting issue the authors address is: what if a malicious party keeps producing keys that hash to one particular node, overloading it? They argue that producing such a set of colliding keys would amount to inverting SHA-1, which is believed to be computationally infeasible.
  • Scalable key location: Every node maintains a finger table. The ith entry at node n is the tuple <n+2^(i-1), successor(n+2^(i-1)), IP address>. When a query for key k arrives, n forwards it to the finger that most closely precedes k, i.e., to successor(n+2^(i-1)) for the largest i such that n+2^(i-1) still precedes k on the ring. Queries proceed in this fashion until the node responsible for the key is found, which takes O(log N) hops with high probability (a small local sketch of this lookup appears after the list).
  • Node joins: To handle concurrent joins, Chord maintains two invariants: each node's successor pointer is correct, and for every key k the node successor(k) is responsible for k. These invariants ensure that each join completes correctly. The joining node finds its place with the help of any node it already knows and takes over its share of keys; a join requires O(log^2 N) messages.
  • Stabilization: This keeps the system consistent under frequent joins and departures. It updates predecessor entries, periodically verifies each node's immediate successor, and periodically refreshes finger-table entries. An interesting result is that as long as fingers are updated faster than the network doubles in size, lookups continue to take O(log N) hops.
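
A small local sketch of the lookup logic, run as a local simulation with correct successor and finger tables; the real protocol executes these steps across nodes:

    M = 16   # identifier bits; the ring has 2**M positions

    def between(x, a, b, inclusive_right=False):
        """Ring-interval test: is x in (a, b), or (a, b] if inclusive_right."""
        if a < b:
            return a < x < b or (inclusive_right and x == b)
        return x > a or x < b or (inclusive_right and x == b)

    def find_successor(n, key, succ, fingers):
        # succ[v] is v's successor; fingers[v][i] holds successor(v + 2**i).
        while not between(key, n, succ[n], inclusive_right=True):
            # Hop to the closest finger preceding the key; each hop at least
            # halves the remaining ring distance, giving O(log N) hops.
            n = next(f for f in reversed(fingers[n]) if between(f, n, key))
        return succ[n]
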
Finally, the authors evaluate the protocol in an iterative simulator and analyze load balance, path lengths, behavior under node failures, and lookups during stabilization. The results match the design expectations and are encouraging. Apart from a slight mismatch in figure numbering in a few places, the graphs are easy to read.


Comments

This paper deserves to be one of the most highly cited papers in computer science. It is well written and presents a novel solution; the protocol is very simple yet has strong mathematical backing for its correctness. However, there are a few issues Chord does not tackle:
  1. Anonymity: Chord doesn't provide anonymity. Its lookup operation runs in predictable time and always results in success or definitive failure.
  2. Network locality: node IDs are assigned randomly, so Chord cannot exploit network locality in its design.
  3. Temporary failures: during node joins, nodes in the affected region may have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, so lookups can fail temporarily until the stabilization protocol repairs the state.

Looking Up Data in P2P Systems

H. Balakrishnan, F. Kaashoek, D. Karger, R. Morris, I. Stoica, "Looking Up Data in P2P Systems," Communications of the ACM, V. 46, N. 2, (February 2003).
This paper gives a great overview of the state of the art in P2P computing as of 2003: its advantages, the proposed solutions, and their limitations. The authors highlight that P2P systems are attractive for the following reasons:
  • Low initial setup cost.
  • Tremendous computational ability from aggregating resources.
  • Fault tolerance due to the absence of a single point of failure.
Most P2P system designs essentially try to solve the 'lookup' problem: finding a data item X that may be stored on a dynamically changing set of nodes in the system. Historically, the following approaches have been considered:
  • Maintain a central database: a single database maps file names to the servers that store them; Napster did this. It is simple to implement and needs few hops, but it creates scalability and resilience problems.
  • Hierarchy: along the lines of DNS, make lookup hierarchical. The disadvantage is that nodes higher in the hierarchy are 'more important' than others, and their failure may be catastrophic for part or all of the system.
  • Symmetric lookup algorithms: all nodes are treated as equals and self-organize into an efficient overlay structure. These algorithms come in many flavors, from Gnutella (flooding) to Chord (logarithmic-hop routing).
The authors then discuss various issues in implementing a DHT: load-balanced mapping of keys onto nodes (which I discuss more in the Chord entry) and the design of routing algorithms, whether one-dimensional like Chord, tree-based like Pastry, or multi-dimensional like CAN. Each has its own advantages and disadvantages, and the tradeoffs and performance metrics of interest decide which is best.

Comments

In the end, the authors raised some very good questions on the future of P2P computing. I particularly found the concept of proximity routing quite interesting.
  • Proximity routing: broadly, I would define it as the goal of fetching a file from the node closest to me in routing distance, minimizing both network bandwidth and latency. There is a trade-off, however, in assigning consecutive IDs to nodes in the same network, since that network's failure takes a consecutive chunk of nodes offline and hurts performance. It would be interesting to know whether some system has explored this trade-off and come up with a hybrid heuristic of sorts.
As an interesting aside, these systems always locate data in a single direction: to search for key K, I issue one request that moves steadily toward the key. What about a system that sends multiple requests to adjacent nodes, along the lines of gossip protocols? The obvious downside is extra network utilization, but it would be interesting to know whether it yields quicker lookups. The option makes even more sense with multicore processors and multithreaded applications that can issue lookup requests in parallel.

Thursday, October 29, 2009

Active Network Vision and Reality: Lessons from a Capsule-Based System

D. Wetherall, "Active Network Vision and Reality: Lessons from a Capsule-Based System," 17th Symposium on Operating Systems Principles," (December 1999).
Active networks are a novel approach to network architecture in which customized programs are executed within the network at elements along the path to the destination. The source node is empowered to run routines on the active nodes along its path, which not only aids experimentation with new Internet services and eases their deployment but also enables per-application forwarding behavior. ANTS deals with the performance and security issues that arise from executing untrusted, mobile code as follows:
  • Security issues: constraints are placed on the forwarding routines (for example, they cannot alter a packet's path), certification by a third-party authority imposes additional control, and cryptographic hashes are used for integrity checks.
  • Performance issues: upper limits are placed on code size (16 KB) and running time, for both security and performance reasons. The ANTS toolkit also caches code at active nodes and loads it on demand before execution.
Overall, this seems like a great idea: it lets the source dynamically choose the best-suited forwarding routine on a per-application basis. This service would need to be deployed incrementally in the Internet, something the author does not address explicitly. Despite being a good clean-slate design, it raises quite a few issues:
  1. It violates the strict end-to-end principle; does the performance gain justify this?
  2. The size and runtime limits on code execution restrict the types of services that can be built.
  3. Security issues arising from the execution of third-party code remain a concern. I felt that the reliance on a third-party certifying authority was not clearly explained, and even with the size and runtime limits, the system may still be vulnerable to coordinated, DDoS-like attacks.
  4. Maybe I missed it, but I failed to see what incentive an ISP would have to allow active routers on its paths.

Resilient Overlay Networks

D. Andersen. H. Balakrishnan, F. Kaashoek, R. Morris, "Resilient Overlay Networks," 18th Symposium on Operating Systems Principles, (December 2001).
Resilient Overlay Network (RON) is an architecture that allows distributed Internet applications to detect and recover from path outages and periods of degraded performance within seconds. The idea is that an application-level overlay network can improve both performance and reliability. RON consists of a small group of nodes spread across multiple ISPs. They continuously probe one another, build a global view of 'link health' (similar in spirit to link-state routing), and, based on it, route packets between hosts either directly or through one or more intermediate nodes. This helps circumvent policy restrictions imposed by ISPs, adapt quickly to network conditions, and customize routing for applications such as video or voice.
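
A minimal sketch of the path-selection idea, assuming each node already has latency probes to every other RON node; real RON also scores paths by loss rate and throughput, which this ignores:

    def best_path(src, dst, latency, nodes):
        # latency[(a, b)]: most recent probe of the a->b Internet path.
        best = (latency[(src, dst)], [src, dst])          # direct path as baseline
        for hop in nodes:
            if hop in (src, dst):
                continue
            cost = latency[(src, hop)] + latency[(hop, dst)]
            if cost < best[0]:
                best = (cost, [src, hop, dst])            # route via one intermediate
        return best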

Overall, the analysis is encouraging. The overlay reacts to failures within seconds, whereas BGP can take many minutes to restore connectivity. Interestingly, RON paths were sometimes actually shorter than the corresponding BGP paths. My main concern is how feasible wide deployment would be. Wouldn't ISPs object to overlay networks interfering with their traffic engineering, and take steps against them? Issues of large-scale deployment (and the congestion likely caused by control-plane probing) are not really tackled in the paper. It would be interesting to discuss these aspects in class.

Tuesday, October 27, 2009

On Inferring Autonomous System Relationships in the Internet

L. Gao, "On Inferring Autonomous System Relationships in the Internet," IEEE/ACM Transactions on Networks, V. 9, N. 6, (December 2001), pp. 733-745.
This paper classifies the types of routes that appear in BGP routing tables based on AS relationships and presents a heuristic algorithm that infers these relationships from the tables. The underlying theme is that connectivity does not imply reachability: to study the structure of the Internet, it is not enough to study the graph topology, because reachability also depends heavily on the inter-domain policies between the ASes run by different ISPs.

With respect to a given neighbor, an AS may act as a provider, a customer, a peer, or a sibling. Based on this, the export rules are:
  • Exporting to a customer: an AS exports its own routes, its other customers' routes, and its provider and peer routes.
  • Exporting to a provider: an AS exports its own routes and its customers' routes, but usually not its provider or peer routes.
  • Exporting to a peer: likewise, an AS exports its own and its customers' routes, but usually not its provider or peer routes.
  • Exporting to a sibling: an AS exports everything -- its own routes, its customers' routes, and its provider and peer routes.
The essential theme of this paper is the 'valley-free' property of AS paths in the Internet. Intuitively, treating a customer-to-provider edge as "going up", a provider-to-customer edge as "going down", and peer-to-peer and sibling-to-sibling edges as staying at the "same level", Gao shows that a legitimate BGP path never contains a "valley" (and traverses at most one peer edge, at its top). This property underpins an algorithm that identifies quite accurately how the various ASes are related. The first phase walks the AS paths in the routing table, computing each AS's neighbor set and degree, and uses the highest-degree AS on each path as the top of the "hill" to classify AS pairs into provider-customer or sibling relationships. The second phase identifies AS pairs that cannot have a peering relationship, and the third phase assigns peering relationships to the remaining pairs.
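
A small sketch of the valley-free check itself, with each edge of an AS path annotated as "up" (customer to provider), "down" (provider to customer), "peer", or "sibling":

    def valley_free(edge_types):
        # Valid: zero or more "up" edges, at most one "peer" edge, then zero or
        # more "down" edges; "sibling" edges may appear anywhere.
        going_down = False
        for t in edge_types:
            if t == "up" and going_down:
                return False                  # climbing back up after going down
            if t == "peer":
                if going_down:
                    return False              # second peer edge, or peer after a down edge
                going_down = True
            if t == "down":
                going_down = True
        return True

    print(valley_free(["up", "up", "peer", "down"]))   # True
    print(valley_free(["up", "down", "up"]))           # False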


Comments

The paper is well written and clearly explained, and the incremental way the algorithm is presented makes it particularly easy to follow. My only concern is how accurate these inferred relationships can be: as the paper notes, the available BGP data is not always accurate, and peering relationships may be protected by contractual agreements. Doesn't this leave an open question over the results?

On Power-Law Relationships of the Internet Topology

M. Faloutsos, P. Faloutsos, C. Faloutsos, "On Power-Law Relationships of the Internet Topology," ACM SIGCOMM Conference, (September 1999).
This paper presents some very interesting power laws of the Internet topology. Despite the apparent randomness of the Internet, the authors discover surprisingly simple regularities, obtained from extensive real-world measurement data and subsequent least-squares fits on the plots. Correlation coefficients of 96% or higher suggest that the power laws are more than a coincidence.

Power laws are expressions of the form y = k * x^a, where k and a are constants and x and y are the measures of interest. The entire work is based on extensive measurements of the Internet: the authors had four datasets at their disposal, three inter-domain topology snapshots from Nov 1997 to Dec 1998 and a router-level topology from 1995. The three power laws discovered are:

Power-Law 1 (rank exponent): The outdegree, dv, of a node v, is proportional to the rank of the node, rv, to the power of a constant R.

Power-Law 2 (outdegree exponent): The frequency fd of an outdegree d is proportional to the outdegree to the power of a constant O.

Power-Law 3 (eigen exponent): The eigenvalues of a graph are proportional to their order (index) to the power of a constant E.
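
The exponents are obtained by a least-squares line fit in log-log space; here is a short sketch with synthetic data (the paper of course fits the measured datasets):

    import numpy as np

    def fit_power_law(x, y):
        logx, logy = np.log10(x), np.log10(y)
        exponent, intercept = np.polyfit(logx, logy, 1)   # slope = power-law exponent
        corr = np.corrcoef(logx, logy)[0, 1]              # correlation coefficient
        return exponent, 10 ** intercept, corr

    rank = np.arange(1, 1001)
    outdegree = 300.0 * rank ** -0.8                      # synthetic rank-exponent data
    print(fit_power_law(rank, outdegree))                 # ~(-0.8, 300.0, -1.0)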

Comments

An important takeaway is that most distributions of interest are heavily skewed (as the power laws illustrate), which makes 'average values' misleading when studying the Internet. Power laws are undoubtedly helpful for predicting the growth of the Internet and for studying the performance of new protocols designed for it. My only criticism of this otherwise very well written paper is the lack of analysis of why these power laws hold. As the authors note at the outset, drastic changes in policies or topologies could invalidate them; it would have been great had they said more about what such 'drastic changes' might be, and specifically which underlying assumptions make the power laws hold. This matters especially now that the Internet is divided into many ASes whose policy decisions can considerably affect Internet routing.

Monday, October 19, 2009

White Space Networking with Wi-Fi like Connectivity

P. Bahl, R. Chandra, T. Moscibroda, R. Murty, M. Welsh, "White Space Networking with Wi-Fi like Connectivity", ACM SIGCOMM Conference, (August 2009).
This paper presents WhiteFi, the first WiFi-like system built on top of the UHF white spaces. WhiteFi builds on a technique called SIFT (Signal Interpretation before Fourier Transform), which reduces the time needed to detect transmissions in variable-width systems by analyzing the raw signal in the time domain. To elaborate on the problem: the FCC recently permitted unlicensed devices to use the unoccupied portions of the UHF TV band -- roughly 180 MHz of spectrum spread over 30 channels of 6 MHz each (channels 21 to 51, excluding 37). This opens up a great opportunity to build a WiFi-like system in this white space instead of the conventional ISM bands. However, the FCC ruling requires that such communication not interfere with the existing users of these channels, namely TV stations and wireless microphones. Three characteristics make the problem particularly interesting:
  • Spatial variation: channel occupancy depends on the locations of both sender and receiver, so the two ends must be able to tell each other which channels are busy and which are free.
  • Spectrum fragmentation: the UHF white spaces are often fragmented by incumbents, and each fragment can range from one channel to several. A wider contiguous band means more throughput, so the protocol must seek out contiguous free channels.
  • Temporal variation: the availability of the UHF band also changes over time, for example when wireless microphones switch on, so the protocol must be robust to such changes.
WhiteFi takes all of the above into account and was built on the KNOWS hardware prototype. For spectrum assignment, the AP and the client exchange a spectrum map and an airtime-utilization vector to determine which channels are free at both ends, which addresses spatial variation. Once the AP has the utilization vectors, it computes MCham (a multichannel airtime metric) over the candidate UHF channel ranges and picks the set of channels with the maximum MCham. SIFT is then used for efficient variable-bandwidth signal detection during AP discovery; the authors propose two discovery algorithms, Linear SIFT (L-SIFT) and Jump SIFT (J-SIFT), with J-SIFT winning when the number of channels exceeds about 10.

The evaluation combines simulations and experiments, and the results largely match expectations. J-SIFT clearly outperforms both the baseline non-SIFT approach and L-SIFT in the time taken to discover an AP. Overall, WhiteFi's adaptive nature lets it beat any static choice of channel assignment. Figure 14 in particular is a treat, showing how well the protocol responds to temporal changes in the spectrum.

Critiques

This is definitely one of the better papers we have read in this course. Not only does it tackle a very real problem, it presents an efficient solution that works despite the stringent constraints laid down by the FCC. However, I have a few comments:
  1. I thought the paper lacked analysis of how the protocol behaves when multiple microphone users interfere on a regular basis (that is, how efficient the SIFT-based channel switching really is). Fig. 14 provides some insight, but a deeper analysis of throughput and overhead over a longer period would have been great.
  2. Theoretically, every channel could be busy when someone wants to communicate. Isn't that a big usability issue -- 'my Internet may or may not work'? Is this realistic in practice? Can all of the UHF white-space channels really be occupied by TV and microphone transmissions at once?

Interactive WiFi Connectivity For Moving Vehicles

A. Balasubramanian, R. Mahajan, A. Venkataramani, B. N. Levine, J. Zahorjan, "Interactive WiFi Connectivity For Moving Vehicles", ACM SIGCOMM Conference, (August 2008)
This paper describes ViFi (vehicular WiFi), a protocol for providing cheap WiFi connectivity from moving vehicles for common applications such as web browsing (short TCP transfers) and VoIP. It is motivated by the observation that clients can overcome many disruptions by communicating with multiple base stations (BSes) simultaneously, a concept the authors call macrodiversity. The idea is that base stations that opportunistically overhear a packet, but not its ACK, probabilistically relay the packet toward its next intended hop, minimizing wasted transmissions. Designing the protocol first called for evaluating six handoff policies (four practical, two theoretical): RSSI, BRR, Sticky, History, BestBS, and AllBSes. ViFi aims to sit between BRR and AllBSes, choosing BSes with a high exponentially averaged beacon reception ratio but not limiting itself to a single BS. Using multiple BSes is a good strategy mainly for two reasons:
  • The vehicles were often in range of multiple BSeS.
  • Packet losses were bursty and roughly independent across senders and receivers i.e. bursts were path dependent (due to multipath fading) rather than receiver dependent.
Protocol Design
  1. Each vehicle designates one of the nearby BSeS as the anchor (using BRR).
  2. All other BSes that the vehicle can hear are designated as auxiliaries.
  3. Each vehicle periodically broadcasts its anchor and auxiliary BSes in beacon packets.
  4. Whenever the vehicle and its anchor exchange packets, the auxiliary BSes overhear them; if a packet is not ACKed within a short window, an auxiliary relays the overheard packet with a computed relaying probability p (a simplified sketch of this decision follows the list).
  5. If the vehicle moves out of range of its current anchor before packets arriving from the Internet can be delivered, ViFi lets the newly designated anchor salvage those packets by contacting previous anchors over the wired backplane.
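
A heavily simplified sketch of the relaying decision in step 4: an auxiliary relays with a probability chosen so that the expected number of relayed copies across all auxiliaries is about one. ViFi's actual computation also weights each contender by how well it reaches the packet's destination, which this ignores.

    import random

    def relay_probability(contenders, p_heard):
        # p_heard[b]: estimated probability (from beacon reception ratios) that
        # auxiliary b also overheard this packet and will contend to relay it.
        expected_contenders = 1.0 + sum(p_heard[b] for b in contenders)  # +1 for me
        return 1.0 / expected_contenders

    def maybe_relay(packet, contenders, p_heard, send_to_next_hop):
        if random.random() < relay_probability(contenders, p_heard):
            send_to_next_hop(packet)
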
The authors evaluate ViFi on two experimental platforms: VanLan (11 BSes and 2 vehicles in Redmond, WA) and DieselNet (10 BSes and (?) vehicles in Amherst, MA). They focus on short TCP transfers and VoIP, arguing that web browsing and VoIP are the two most common uses of WiFi. ViFi clearly outperforms BRR on all fronts in both testbeds. However, comparing only against BRR leaves many questions about how much further the protocol could be optimized.

Critiques

Overall, the paper is very interesting: it tackles a real problem and backs up the proposal with a real implementation. However, I am a bit skeptical about the following points:

  1. I might have misunderstood something, but since the vehicle is moving at 40 Km/h (~11 metres/second), I find it a bit intriguing that exchanging huge control information regarding probabilities of transfer between vehicle and BSes were not a problem. Computing relaying probabilities uses these values continuosly to make decisions. Isn't that supposed to result in a huge overhead of periodic beacons?
  2. The authors didn't really consider the effects of reordering of packets and claimed that it would be easy to order packets using a sequencing buffer at anchor BSes and vehicles; however how is this going to affect the auxiliary BSeS (which in turn depends on whether the buffered packets are ACKed or not) remains unanswered.
  3. ViFi's coordination mechanism aims to make the expected number of packets relayed equal to 1; however, it would arguably make more sense to use a formulation that makes the expected number of packets received by the destination equal to 1.

Thursday, October 15, 2009

XORs in the Air: Practical Wireless Network Coding

S. Katti, H. Rahul, W. Hu, D. Katabi, M. Medard, J. Crowcroft, "XORs in the Air: Practical Wireless Network Coding," ACM SIGCOMM Conference, (September 2006).
This paper talks about COPE, a new architecture for wireless mesh networks. The idea is to cleverly use network coding to mix packets from different sources in order to increase the information content of each transmission. COPE's design is based on 3 main principles:
  • It effectively utilizes the broadcast nature of the wireless channel to exploit opportunistic listening. All nodes are kept in promiscuous mode and passively listen for broadcast packets.
  • Use of network coding in a practical scenario: the sender XORs several packets together and relies on its neighbors being able to decode the individual packets (a small sketch of the encode/decode step follows this list).
  • Use of reception reports, through which adjacent nodes learn which packets their neighbors already hold.
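Here is a minimal Python sketch of the XOR step, my simplification rather than COPE's full coding engine: a relay XORs packets destined to different next hops into one transmission, and a neighbor decodes by XORing out the packets it already holds (the padding and length handling below are crude, illustrative choices).

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        # Pad to equal length, then XOR byte-wise.
        n = max(len(a), len(b))
        a, b = a.ljust(n, b'\x00'), b.ljust(n, b'\x00')
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(packets):
        """XOR a list of native packets into one coded transmission."""
        coded = b''
        for p in packets:
            coded = xor_bytes(coded, p)
        return coded

    def decode(coded, known_packets):
        """Recover the one unknown packet, assuming all others are already known
        (COPE only codes packets together when each next hop can decode)."""
        for p in known_packets:
            coded = xor_bytes(coded, p)
        return coded.rstrip(b'\x00')   # crude length handling for the sketch

    # Example: relay XORs p1 (for Alice) with p2 (for Bob); Alice already has p2.
    p1, p2 = b'to-alice', b'to-bob'
    assert decode(encode([p1, p2]), [p2]) == p1

The coding gain comes from the fact that one coded transmission replaces two native ones, which is exactly the effect measured in the UDP experiments below.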
COPE definitely performed well with UDP, resulting in a throughput increase of up to 3-4x; however, it didn't result in any significant improvement with TCP. This difference in COPE's behavior across protocols and topologies raises questions about the practicality of its actual deployment. However, due credit goes to the authors for cleverly using network coding to increase throughput. It would be interesting to know about techniques that may have attempted to combine COPE with ExOR in a more conclusive manner.

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks

S. Biswas, R. Morris, "ExOR: Opportunistic Multi-Hop Routing for Wireless Networks," ACM SIGCOMM Conference, (August 2005).
This paper talked about ExOR (Ex-Opportunistic Routing; what does "Ex" stand for?), an efficient integrated routing and MAC protocol. The idea is that ExOR chooses each hop of a packet's route after the transmission for that hop, so that the choice can reflect which intermediate nodes actually received the transmission. Basically, the paper questioned the very premise of routing in wireless networks. Traditional routing protocols choose the best sequence of nodes between source and destination and forward each packet through that sequence. The authors argued that such a scheme doesn't work well in wireless networks, since transmission happens over lossy links and is probabilistic in nature. To tackle this, they propose a cooperative diversity scheme which operates as follows:
  • The source broadcasts its batch of packets, and the neighboring nodes try to 'hear' and buffer as many packets as they can.
  • Next, the node closest to the destination (as identified by the forwarder list), say A, transmits the buffered packets of the batch that the destination has not yet received.
  • Further, the next node B, which overheard A, knows which packets A did not transmit, broadcasts those, and this process continues back toward the sender.
  • The forwarders continue cycling through the priority list until the destination has 90% of the packets; the remaining packets are transferred using traditional routing (a rough sketch of the per-node forwarding decision follows this list).
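As a rough Python sketch of the forwarding decision in the ExOR spirit (my simplification, not the paper's exact batch-map protocol): nodes are ordered by closeness to the destination, and each node only transmits the buffered packets that, according to the batch map it has overheard, no higher-priority node already holds.

    def packets_to_forward(my_buffer, my_priority, batch_map):
        """my_buffer: {packet_id: payload} overheard and buffered by this node.
        my_priority: this node's index in the forwarder list (0 = closest to dest).
        batch_map: {packet_id: best_priority_seen}, gossiped in packet headers."""
        return {
            pid: payload
            for pid, payload in my_buffer.items()
            if batch_map.get(pid, float('inf')) > my_priority  # nobody closer has it
        }

The destination keeps collecting until roughly 90% of the batch has arrived, and the stragglers are then fetched with conventional routing, as described above.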
Comments

This is definitely a great idea! The authors brilliantly exploit the inevitable use of broadcast in a wireless network and come up with a smart scheme of opportunistically forwarding packets. They do assume that receptions at different nodes are independent of each other and that there are not many collisions (which may or may not hold in general), and further that delivery probability decreases gradually with distance (which is generally true, though intermittent noise sources may affect this claim). Still, the paper was novel in its approach and successful in putting forth its point that ExOR's multi-hop cooperative routing beats traditional routing.

Thursday, October 8, 2009

A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols

J. Broch, D. Maltz, D. Johnson, Y-C Hu, J. Jetcheva, "A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols," ACM Mobicom Conference, (October 1998).
This paper did a detailed simulation-based performance comparison of 4 multi-hop wireless ad hoc network routing protocols: DSDV, TORA, DSR and AODV. To enable these protocols to be simulated over ns-2, the authors extended the simulator to include the following:

  • Node mobility
  • Realistic physical layer (with support of propagation delay, capture effects and carrier sense)
  • Radio Network Interfaces
  • IEEE 802.11 MAC protocol using DCF
The following routing protocols were simulated and studied as part of this paper:
  • DSDV: This protocol keeps a tuple of the form <destination, next hop, sequence number, metric> for each destination. It requires each node to periodically broadcast routing updates so that adjacent nodes can learn the best available routes. The authors evaluated 2 variants, DSDV-SQ and plain DSDV: in the former, receipt of a new sequence number for a destination causes a triggered update, while in the latter only receipt of a new metric does (a small sketch of the table-update rule follows this list).
  • TORA: This is a distributed routing protocol based on a 'link reversal' algorithm that discovers routes on demand. The idea is that whenever node A needs a route to B, it sends a QUERY, and a node C on the path sends an UPDATE back to A advertising the path as well as its 'height' with respect to B. When A gets the packet, it adjusts its own height to be greater than that of C. These heights help in differentiating between good and bad routes.
  • DSR: It uses source routing rather than hop-by-hop routing, with each packet to be routed carrying the complete ordered list of nodes that it must pass in its header. It consists of ROUTE DISCOVERY and ROUTE MAINTENANCE phases. 
  • AODV: This was essentially a combination of both DSR (on demand mechanism for Route Discovery) and DSDV (Hop-by-Hop routing, sequence numbers and periodic beacons). 
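A tiny Python sketch of the DSDV table-update rule described above (my paraphrase of the standard behavior, not code from the paper): keep one entry per destination and replace it only if the advertised route carries a newer sequence number, or the same sequence number with a lower metric.

    routes = {}   # destination -> {'next_hop', 'metric', 'seq'}

    def handle_update(dest, next_hop, metric, seq):
        cur = routes.get(dest)
        if cur is None or seq > cur['seq'] or (seq == cur['seq'] and metric < cur['metric']):
            routes[dest] = {'next_hop': next_hop, 'metric': metric, 'seq': seq}
            return True   # accepted: in DSDV-SQ a new sequence number triggers
                          # an update, in plain DSDV only a changed metric does
        return False

The single-entry-per-destination structure is also what causes the broken-link problem noted in the evaluation below: once the one stored route breaks, there is no alternative to fall back on until a fresher update arrives.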
The evaluation of these protocols was done in a simulated environment of 50 wireless nodes forming an ad hoc network, moving about over a rectangular (1500m X 300m) flat space for 900 seconds of simulated time. The evaluation found that DSR was the clear winner based on both packet delivery ratio and routing overhead. DSDV had an intrinsic problem: it was unable to deliver packets to nodes whose paths contained broken links, because it keeps only one route per destination. As far as routing overhead was concerned, DSR did best (probably because it promiscuously hears and caches routes), followed by DSDV-SQ. TORA and AODV, on the other hand, had considerable routing overhead. When node mobility was brought into the picture, pretty much all the protocols behaved similarly on the delivery-ratio metric, while DSR clearly outperformed everyone on routing overhead.

A High Throughput Path Metric for Multi-Hop Wireless Routing

D. De Couto, D. Aguayo, J. Bicket, R. Morris, "A High Throughput Path Metric for Multi-Hop Wireless Routing," ACM Mobicom Conference, (September 2003).
This paper presented ETX (Expected Transmission Count) metric which finds high throughput paths on multi-hop wireless networks and is much superior to the minimum hop count metric used before this paper was written. The authors implemented this metric in DSDV and DSR routing protocols and measured actual data on a 29 node 802.11b test-bed.

The authors started off by highlighting that metrics like minimum hop count assume black-and-white links, i.e., a given link either 'works' or doesn't. While this is reasonable in wired networks, wireless links have considerable intermediate loss rates, which can make some links much more useful than others in ways the existing metric ignores. This led the authors to design a more realistic metric, which must account for the following issues:
  • Link loss ratios.
  • Asymmetric link ratios.
  • Successive hop interference.
Though an end-to-end delay metric pretty much accounts for all of the above, the authors claimed that a congestion-dependent metric can result in wide oscillations in path routes. Moreover, they claimed that load balancing/congestion control can anyway be performed by separate algorithms. This led to the proposal of the ETX metric. Simply put, the ETX of a link predicts the number of data transmissions required to send a packet over that link, including retransmissions. Mathematically, ETX = 1/(df * dr), where df and dr are the forward and reverse delivery ratios respectively. These are measured by each node periodically broadcasting probe packets and its neighbors keeping count of the probes received versus the probes that should have been received during a window of w seconds. Further, the authors implemented this metric in the DSDV and DSR algorithms and observed that ETX provided considerably better routes than the minimum hop-count metric. However, they observed that for a run with 1,386-byte packets, though ETX still offered an improvement over minimum hop count, the gain was not as large as for small packets. This was because ETX uses small probes to estimate the link metric, and their delivery counts did not accurately reflect the actual loss rate of larger packets.
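A minimal Python sketch of the ETX computation described above (the numbers are illustrative; the window-based probe counting is exactly what the paragraph describes):

    def delivery_ratio(probes_received, probes_expected):
        return probes_received / probes_expected if probes_expected else 0.0

    def etx(df, dr):
        """df, dr: forward and reverse delivery ratios in (0, 1]."""
        return float('inf') if df * dr == 0 else 1.0 / (df * dr)

    # A route's metric is the sum of its links' ETX values, so a 2-hop path over
    # good links can beat a lossy 1-hop path.
    route_etx = etx(0.9, 0.9) + etx(0.95, 0.95)   # ~1.23 + ~1.11 = ~2.34

Summing per-link ETX values is what lets the metric trade a few extra good hops against one lossy hop, which is the whole point of moving away from minimum hop count.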

Critique

My criticism of this paper has mostly to do with the question of how much actual improvement ETX provides over the minimum hop-count metric. Given the practical packet-size issue, we saw that ETX doesn't always work as nicely as it is theoretically expected to. Not to mention that computing (and, most importantly, maintaining) ETX metrics involves considerable overhead compared to the simple minimum hop-count metric. Moreover, minimum hop count is not a very good baseline; it is more of a workaround that performs badly on lossy links, as the authors themselves clearly show. If that is indeed the case, why didn't the authors compare their work to more realistic metrics (like the product approximation of per-link delivery ratios proposed by Yarvis et al., which they did mention in their related work)?

Monday, October 5, 2009

Architecture and Evaluation of an Unplanned 802.11b Mesh Network

J. Bicket, D. Aguayo, S. Biswas, R. Morris, "Architecture and Evaluation of an Unplanned 802.11b Mesh Network," ACM Mobicom Conference, (September 2005).
This paper evaluated Roofnet, an unplanned wireless mesh architecture involving omni-directional antennas and multi-hop routing. The whole design was based on assuring high throughput while making the entire system very easy to deploy, without any constraints imposed by node-placement planning.

In brief, Roofnet was deployed over a 4 sq km area in Cambridge, MA, and consisted of 37 nodes, each with an omni-directional antenna. It assumes that a small fraction of nodes will voluntarily share their wired Internet access; these are called gateways (the authors' deployment had 4). Each node further allocated 192.168.1.x IP address blocks to user hosts attached to its Ethernet port and used NAT. Roofnet's routing protocol, Srcr, maintained a partial dynamic database of link metrics between pairs of nodes and ran Dijkstra's algorithm over the database to find routes. DSR-style flooding was used when a node had to route a packet without a link metric for the corresponding route. The link metric was Estimated Transmission Time (ETT), which predicts the total amount of time it would take to send a data packet along a multi-hop route (a sketch of this metric appears below).
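A sketch of how an ETT-style link metric could be derived, under my assumption of the usual formulation (roughly ETT = ETX * packet size / link bit rate, i.e., the expected air time to get one packet across the link including retries); the packet size and bit rate below are illustrative, not taken from the paper.

    def ett_seconds(delivery_fwd, delivery_rev, packet_bits=1500 * 8, bit_rate=11e6):
        etx = 1.0 / (delivery_fwd * delivery_rev)   # expected transmissions
        return etx * packet_bits / bit_rate          # expected time on the air

    # Srcr then runs Dijkstra over its link database using the summed ETT of each path.
    print(ett_seconds(0.9, 0.9))   # ~0.00135 s for a fairly good 11 Mbit/s link

Summing expected air time rather than expected transmission count lets the metric prefer faster links, not just less lossy ones.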

The authors presented an extensive evaluation of the real system and showed that, in a random setting, paths to the gateways averaged 2.9 hops, yielding an average throughput of 627 kbit/s and an average latency of 39 ms. Further, the authors simulated a denser network and showed that it had an increased hop count (possibly due to a larger number of eligible links) as well as higher throughput (possibly because of shorter, higher-quality hops). One notable gap between practice and simulation was that, in practice, the authors observed inter-hop collisions which reduced throughput below the value calculated at each node for routing; however, they claimed this is not so bad, and that avoiding it with an RTS/CTS scheme would degrade performance even further due to the additional overhead.

Critique

Overall, I really liked the Roofnet technology. It was simple, easy to implement and robust to failures, which makes it an ideal candidate for deployment. Moreover, the best thing about it was that the majority of the claims made by the authors were based on the actual deployment of a working system. However, I had 2 questions in mind which the paper left unanswered. Firstly, how often do the link metrics change? In a continuously running system, I would assume that the metrics would eventually converge to static values... am I missing something here (probably dynamic congestion considerably affects throughput)? Secondly, assuming that this system is a candidate for commercial deployment, I would expect a single antenna per building, with all the apartments getting Internet access through the node's Ethernet ports. In such a scenario, I would have liked to see the effect of varying the number of end hosts on a particular node. It may well be that a node is more congested during some parts of the day and less at others. It would be interesting to discuss these aspects in class.

Modeling Wireless Links for Transport Protocols

Andrei Gurtov, Sally Floyd, "Modeling Wireless Links for Transport Protocols," ACM SIGCOMM Computer Communications Review, Volume 34, Number 2, (April 2004).
This paper reviews various models for cellular, WLAN and satellite links that are used in the design of transport protocols over wireless links and attempts to strike a balance between realism, generality and detail. The purpose of the models discussed in this paper was to evaluate the effect of link-level mechanisms on end-to-end transport protocols and higher layers.

The authors started off by highlighting the following essential aspect of models:
  • Type of Wireless Link: Cellular, wireless LAN or satellite? Each of these links has different bandwidth and link latency and must be modeled accordingly.
  • Topologies: It deals with questions like whether a single wireless link is located at the end or in the middle of a path, or multiple wireless links constitute the path. Further, the number and location of wireless links in the path greatly affects the performance and is an important modeling parameter.
  • Wireless Traffic: It deals with questions like whether the traffic is one way or two way? What is the intrinsic nature of data being transferred over wireless links (web browsing data, GPRS traffic etc.)?
  • Performance Metrics: This deals with the basic question of what defines good performance. Various performance metrics such as throughput, delay, fairness, dynamics or goodput may or may not be taken into account.
 Further, the authors presented ways to model specific link characteristics as follows:
  • Error Losses and Corruption: Errors may result due to the fact that link layer only does 3 re-transmissions per data block. Further, they may be due to handovers or mobility of nodes. 
    • Model: These can be modeled by dropping packets on a per-packet or per-bit basis, or by using a time-based loss probability. Bursty losses should be modeled with a two-state Gilbert model having error-free and error-prone states (a small sketch follows this list). Another important distinction is whether errors are due to data corruption or to handovers.
  • Delay Variation: This refers to the link delay (or delay spike) in the network. Apart from the inherent nature of the link, delay spikes can result from link-layer recovery, handovers, or scheduling. 
    • Model: These can be modelled simply by suspending the data transmission on a simulated link. 
  • Packet Reordering: This refers to changes in the order in which packets arrive relative to the order in which they were sent. While sometimes beneficial (e.g. with multiple wireless links in a path), reordering can also incorrectly trigger packet retransmissions and, consequently, TCP's congestion control responses.
    • Model: These can be modeled by swapping 2 packets in a queue at any given time or delaying one packet and letting others pass it.
  • On-Demand Resource Allocation: Wireless links often allocate channels based on availability of user traffic. This introduces allocation delays which must be accurately modeled.
    • Model: On demand allocation delay can be modeled by introducing an additional delay when a packet arrives to a queue that has been empty longer than the channel hold time. 
  • Bandwidth Variation:  Bandwidth variations may occur when a wireless scheduler assigns a high speed channel to a user for a limited time. 
    • Model: This can be modeled by changing the bandwidth of a link using periods of low and high bandwidth, or by having a sine-pattern variation.
  • Asymmetry in Bandwidth and Latency: Cellular links typically show asymmetry between the uplink and downlink directions, while WLAN links show essentially none. Satellite links have significant asymmetry in bandwidth and moderate asymmetry in latency.
    • Model: This can be modeled simply by having different parameters for bandwidth and latency for uplink and downlinks. 
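As a concrete example of the loss modeling mentioned above, here is a small Python sketch of a two-state Gilbert loss model; all the parameter values are illustrative, not taken from the paper. Packets are dropped with high probability in the "bad" state and rarely in the "good" state, which produces bursty rather than independent losses.

    import random

    class GilbertLoss:
        def __init__(self, p_good_to_bad=0.01, p_bad_to_good=0.3,
                     loss_good=0.001, loss_bad=0.5):
            self.state = 'good'
            self.p_gb, self.p_bg = p_good_to_bad, p_bad_to_good
            self.loss = {'good': loss_good, 'bad': loss_bad}

        def drop(self) -> bool:
            # Possibly switch state, then decide the fate of this packet.
            if self.state == 'good' and random.random() < self.p_gb:
                self.state = 'bad'
            elif self.state == 'bad' and random.random() < self.p_bg:
                self.state = 'good'
            return random.random() < self.loss[self.state]

    model = GilbertLoss()
    dropped = sum(model.drop() for _ in range(10_000))   # losses come in clusters

The transition probabilities control the average burst length, which is exactly the knob that distinguishes this from a simple independent per-packet loss rate.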
Further, there are also the issues of modeling queue management (accommodating bursty traffic while controlling queueing delay) and the effects of mobility. Queue management in cellular and WLAN links can be modeled using a drop-tail buffer with a configurable maximum size in packets, while mobility is modeled by first defining what the protocol designer considers mobility (intersystem vs. intrasystem) and then modeling it through mobility models such as linear or ping-pong movement. Further, the authors highlight that there is an ongoing tussle over whether wireless links or transport protocols should change. Because of this, there is no clear answer as to which layer should handle issues such as bit errors, reordering or delay variations.

Critique

I found this paper to be more like '101 ways not to model your wireless network'. The majority of the paper consisted of identifying various flaws in existing models, which may or may not accurately depict the shortcomings of those models. It may very well have been the case that a model worked well in a specific scenario, thereby making the broader point its authors were trying to make. This reminded me of something Prof. Joe Hellerstein said in our systems class: many systems designers make the mistake of exposing so many gears to the user that, in the end, the user has no idea which gear to turn to do a specific job. I think that having such a generic model may create exactly that scenario, with so many parameters to manipulate that it ends up making modeling much harder for researchers who do not have complete and exhaustive knowledge of the system (which may be good or bad...).

Wednesday, September 30, 2009

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

H. Balakrishnan, V. Padmanabhan, S. Seshan, R. H. Katz, "A Comparison of Mechanisms for Improving TCP Performance over Wireless Links," IEEE/ACM Transactions on Networking, (December 1997).

Traditionally, TCP assumed an underlying wired network and hence believed that a packet loss could occur only due to congestion. TCP congestion control therefore primarily consisted of retransmitting after 3 duplicate ACKs and halving its congestion window on each timeout. Moreover, the timeouts were pretty coarse-grained (200 ms), in keeping with their congestion-control purpose. However, in wireless networks, packet loss may very well be due to noise, bit errors or handoffs, and invoking congestion control in this scenario obviously degrades end-to-end performance. Several classes of protocols have been proposed to improve the performance of TCP over wireless links:
  • Link-layer Protocols: These employ forward error correction (FEC) and retransmission of lost packets in response to ARQ messages. Advantage: fits naturally into the layered structure of the network stack and is independent of higher-layer protocols. Disadvantage: can interact adversely with TCP's own timers and retransmissions.
  • Split Connection Protocols: These split each TCP connection between sender and receiver into 2 separate connections at the base station (sender -- base station -- receiver). Advantage: a specialized protocol tuned to the wireless environment, such as SRP-TCP, can be used on the wireless hop. Disadvantages: per-packet processing overhead at the base station, violation of TCP's end-to-end semantics, and the need to maintain state at the base station.
  • Snoop Protocol: A snoop agent module sits at the base station, monitors every packet that passes through a TCP connection and caches those packets that have not yet been acknowledged. Whenever it detects a packet loss (via duplicate acknowledgements or a local timeout), the snoop agent retransmits the cached copy to the receiver and suppresses the duplicate ACKs (a condensed sketch of this logic follows this list). Advantages: suppressing dup ACKs avoids fast retransmission and congestion control at the sender, and the agent maintains only soft state. Disadvantages: it doesn't completely shield the sender from wireless losses and incurs the overhead of per-packet copying into the cache.
  • Selective Acknowledgements: SACK and SMART. Advantage: Resilience to reordering and lost ACKs. Disadvantages: Overhead in ACK generation/transmission.
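A very condensed Python sketch of the snoop idea (my simplification, not the actual implementation): the base station caches unacknowledged segments heading to the mobile host and, on seeing a duplicate ACK, retransmits locally and withholds the dup ACK so the fixed sender never invokes congestion control. The two callbacks are hypothetical stand-ins for the real base-station send paths.

    class SnoopAgent:
        def __init__(self, forward_to_sender, retransmit_to_mobile):
            self.cache = {}        # seq -> cached segment heading to the mobile
            self.last_ack = -1
            self.forward_to_sender = forward_to_sender
            self.retransmit_to_mobile = retransmit_to_mobile

        def on_data_from_sender(self, seq, segment):
            self.cache[seq] = segment          # cache unacknowledged data

        def on_ack_from_mobile(self, ack_seq):
            if ack_seq > self.last_ack:        # new ACK: clean cache, pass it on
                self.cache = {s: p for s, p in self.cache.items() if s >= ack_seq}
                self.last_ack = ack_seq
                self.forward_to_sender(ack_seq)
            elif ack_seq in self.cache:        # duplicate ACK: recover locally and
                self.retransmit_to_mobile(self.cache[ack_seq])
                # do *not* forward the dup ACK, shielding the sender from the loss

Keeping only this cache (and no hard connection state) is what makes the agent's state "soft": if the base station reboots, the end-to-end connection still works, just less efficiently.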
Further, the authors implemented and evaluated the following protocols:
  • End-to-End Schemes: E2E-NEWRENO (Fast Recovery mode), E2E-SMART (TCP with SMART ACKs), E2E-IETF-SACK (TCP with Selective ACKs), E2E-ELN (TCP with an Explicit Loss Notification bit in the ACKs that notifies the sender that the loss is not due to congestion) and E2E-ELN-RXMT. 
  • Link-Layer Schemes: LL-TCP-AWARE,  LL-SMART-TCP-AWARE which leveraged TCP based ACKs instead of generating its own. 
  • Split Connection Schemes: SPLIT and SPLIT-SMART schemes were evaluated with a clever pointer passing trick that avoided data copying at the base station.
Through the evaluations, the authors found that the TCP-aware link-layer protocols performed best, primarily because they suppressed excess duplicate ACKs and thus avoided unnecessary congestion control at the sender. Further, E2E protocols with ELN and SACK were better than plain E2E protocols. Split-connection protocols also improved performance but were outperformed by the TCP-aware link-layer protocols. Overall, LL-SMART-TCP-AWARE seemed to emerge as the winner!

Critiques:

Overall, the paper was very well written, had nice and brief explanations of all the existing modifications of wireless communication protocols, and evaluated all of them thoroughly. However, I think it was wrong to assume sufficient knowledge at the receiver about wireless losses to generate ELN information; I was not fully convinced by the implementation of E2E-ELN. I am also curious as to why an exponentially distributed error model was assumed to be an accurate representation of wireless errors. Thirdly, doesn't LL-SMART-TCP-AWARE go against the basic layered design philosophy of the network stack? Is that justifiable?

MACAW: A Media Access Protocol for Wireless LANs

V. Bharghaven, A. Demers, S. Shenker, L. Zhang, "MACAW: A Media Access Protocol for Wireless LANs," ACM SIGCOMM Conference, (August 1994)

This paper analyzed CSMA and MACA, two possible media access protocols for single-channel wireless LANs, and highlighted the problems associated with both of them. While CSMA suffered from the hidden-terminal and exposed-terminal problems, MACA, which proposed an RTS-CTS-DATA exchange with binary exponential backoff (BEB), suffered from basic fairness problems and could even starve a connection. This led the authors to propose MACAW, which used an RTS-CTS-DS-DATA-ACK message exchange and a better backoff algorithm. The following were its key features:

  • Fairness: MACA used the binary exponential backoff algorithm, under which each end host doubles its backoff time on every unsuccessful transmission. As a result, it was possible for one connection to always back off while another took the full available bandwidth, raising fairness concerns. MACAW proposed backoff copying, so that all stations in range share the same backoff value and contend fairly, and introduced MILD (multiplicative increase, linear decrease) to prevent wild oscillations in the backoff value (a small sketch of this backoff scheme follows the list). As a further refinement, fairness was defined in terms of streams rather than stations.
  • Basic Message Exchange: To start with, the authors recognized that since noise is likely to be present, an RTS-CTS-DATA-ACK exchange should be used for reliable communication. Further, to provide synchronization among stations, before sending a DATA packet the sender transmits a 30-byte Data-Sending (DS) packet; every station that overhears it knows that the RTS-CTS exchange was successful. The authors also proposed the RRTS (Request-for-Request-to-Send) packet to help receivers that were unable to respond to an RTS while they were deferring.
  • Multicast: Due to the inherent nature of RTS-CTS, multicast cannot be implemented straightaway. So, as a hack, the authors proposed sending RTS immediately followed by the DATA packet. Obviously, this design had the same flaws as CSMA. 
  • 'Leakage' of backoff values: The backoff-copying algorithm requires all nodes in the same cell to have the same backoff value. In some cases, the proximity of nodes at the edges of cells led to backoff values 'leaking' into adjacent cells, causing wasteful idle time or collisions. Secondly, there is an inherent problem with BEB: the absence of a CTS is interpreted as collision/congestion, while it may very well be due to noise or the destination being offline. This led the authors to propose that each station maintain a separate backoff counter for each stream, and that all stations attempting to communicate with the same receiving station use the same backoff value.
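A small Python sketch of the MILD backoff with backoff copying, as I read the scheme (the 1.5 factor and decrease-by-one step are from the paper's description; the bounds are illustrative):

    BO_MIN, BO_MAX = 2, 64        # illustrative bounds, in backoff slots

    def on_failure(backoff):
        return min(int(backoff * 1.5), BO_MAX)    # multiplicative increase (x1.5)

    def on_success(backoff):
        return max(backoff - 1, BO_MIN)           # linear decrease (by one step)

    def on_overheard(my_backoff, advertised_backoff):
        return advertised_backoff                 # copy the overheard value for fairness

Compared to plain BEB, the gentle increase and the copying keep all contending stations near the same backoff, which is what removes the starvation behavior described above.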
Critique

Overall, this paper was nicely written; it surveyed the state of the art in wireless technology at the time and then presented its solution incrementally, clearly highlighting the need for each design decision. One of my concerns was the lack of a good solution for multicast; the proposal was more of a working hack. Could a spanning-tree-like solution be implemented? Further, since all the evaluation was simulated in a controlled environment, it may not accurately reflect practical environments. Thirdly, I was intrigued by the fact that MILD uses a multiplicative factor of 1.5 (and not 2, which is just a bit left-shift). Since we discussed in class, while reading about AIMD, that such factors are often chosen for ease of implementation, I am curious what the motivation behind choosing 1.5 was. Lastly, all the analysis was done for fixed stations. (Why) weren't the authors concerned about moving end points in 1994?

Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

Y. Chen, R. Griffith, J. Liu, A. Joseph, R. H. Katz, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," Workshop on Research in Enterprise Networks (WREN'09), (August 2009).
This paper followed up on CMU's SIGCOMM 2009 incast paper, critically analyzed that work, and presented the authors' own take on the problem. They started off by describing the basic TCP incast problem in datacenters, which results from N servers trying to send data to a single receiver at once. The authors highlighted that the 2 main contributions of the CMU paper were reducing the minimum TCP RTO and providing high-resolution timers. They then listed some intuitive fixes a system designer might think of, such as decreasing the minimum TCP RTO value, using a smaller multiplier for exponential backoff, randomizing the backoff multiplier, and randomizing each RTO value as it occurs. Based on initial experiments, the authors concluded that high-resolution timers were indeed necessary; however, they found that many of the other modifications were not really helpful: neither a smaller nor a randomized RTO backoff multiplier helped much, and randomizing the minimum and initial RTO timer values was also unhelpful.

This led the authors to delve deeper into the problem, and they found plenty of interesting insights: different networks suffer from incast to different degrees, there are tradeoffs in ACK feedback and switch buffer sizing, and, most importantly, an analytical model is hard to build and would most likely not work. Instead, the authors built a quantitative model in which the net goodput across S senders essentially depends on S, the total data D and the minimum RTO value r. Finally, a variety of quantitative refinements were proposed, one of the most interesting being that disabling delayed ACKs does not work out as well as the CMU paper suggests.

Overall, I liked this paper because of the approach it followed. Rather than basing the analysis on a few discrete observations, designing a quantitative model that can be refined with further observations was definitely one of the key strengths of the paper.

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, B. Mueller, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication," ACM SIGCOMM Conference, (August 2009).

This paper presented the TCP incast problem, which arises when multiple high-bandwidth TCP workloads converge in datacenter Ethernet. These workloads quickly fill the small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds (due to the 200 ms minimum timeout in the vanilla TCP stack).

The authors started off by briefly describing the problem of 'barrier-synchronized' queries, wherein a client cannot make further progress until it hears from all the servers it has queried. Data blocks striped across multiple servers are sent towards the client simultaneously, overfilling the switch's packet buffers and getting dropped. The result is a timeout lasting at least 200 ms, determined by TCP's minimum retransmission timeout. Since this is essentially a real-world problem, the authors got down to presenting a practical solution straightaway. They created a cluster with N (= 48) servers in which a test client issued requests to fetch a block (= 1 MB) of data striped across the N servers, so each server sent blocksize/N bytes to the requesting client. They tested the system on a 16-node cluster using an HP Procurve 2848 switch and a 48-node cluster using a Force10 S50 switch. Their solution mainly consisted of randomizing the timeouts so that different senders retransmit at different times. They proposed:
Timeout = (RTO + rand(0.5) * RTO) * 2^backoff
Secondly, they implemented high-resolution timers in the Linux kernel and the corresponding TCP stack using the GTOD framework, enabling fine-grained retransmission timeouts on the order of hundreds of microseconds (a small sketch of the randomized timeout appears below). Finally, they discussed the problems of spurious retransmissions and delayed ACKs and advised that delayed ACKs be disabled.
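A Python sketch of the randomized retransmission timeout in the formula above, with rand(0.5) taken here as a uniform draw in [0, 0.5] and the base RTO value purely illustrative:

    import random

    def incast_timeout(rto_base, backoff_count):
        return (rto_base + random.uniform(0, 0.5) * rto_base) * (2 ** backoff_count)

    # With a fine-grained base RTO of, say, 1 ms, desynchronized senders retransmit
    # at slightly different times instead of refilling the switch buffer in lockstep.
    print(incast_timeout(rto_base=0.001, backoff_count=2))

The randomization only helps if the base RTO is already small, which is why the high-resolution timers are the other half of the fix.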

Overall, this paper definitely deserves merit for exploring the TCP incast problem in detail and providing a very practical, tested solution. However, at some points I felt that the analysis was based on intuition rather than strict empirical data or mathematical models. For instance, it was not very clear what the motivation was for the striped block workload (blocksize/N) used in the implementation. Further, I found that the reasoning behind disabling delayed ACKs is not well explained. It would be great to discuss in class some practical datacenter scenarios on which these supposedly intuitive decisions were based.

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," ACM SIGCOMM 2009, (August 2009).

This paper presented VL2 (Virtual Layer 2) based Data Center Network with the following key features:
  1. Flat addressing to allow service instances to be placed anywhere in the network.
  2. 'Valiant' load balancing (VLB)
  3. End System based address resolution. 
The authors established high utilization and agility as the foremost goals in their design. Current datacenters suffer from over-subscription of resources, and nothing prevents a traffic flood from one service from affecting others. Moreover, IP addresses and VLANs are used to achieve scale, which makes things like VM migration extremely complex and introduces broadcasts.

The authors did an extensive analysis of the traffic patterns in datacenters. However, they found no particular pattern in datacenter traffic that could be exploited in the design of the routing protocol. So, like PortLand, they decided to exploit the topology instead. They assume a Clos network, which is a rooted tree in which the links between the intermediate switches and the aggregation switches form a bipartite graph. The underlying idea of VL2 addressing and routing is to use 2 kinds of addresses: application-specific addresses (AAs) and location-specific addresses (LAs). The VL2 agent at each server traps packets from the host and encapsulates each packet with the LA of the ToR switch of the destination; the agent learns the AA-to-LA mapping via a directory lookup service (a simplified sketch of this send path follows). Further, Valiant load balancing (VLB) spreads the flows over randomized paths.
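A simplified Python sketch of the VL2 agent's send path as I read it; the directory contents, helper names and packet representation are invented for illustration, and the real lookup goes to the directory service rather than a local dict.

    directory = {            # AA (application address) -> LA of destination's ToR
        '10.0.0.7': '20.1.1.1',
        '10.0.0.9': '20.1.2.1',
    }

    def vl2_send(packet, dst_aa, underlay_send):
        """Encapsulate a host packet addressed to dst_aa and hand it to the underlay."""
        la = directory.get(dst_aa)       # stands in for a directory-service lookup
        if la is None:
            raise LookupError(f'no LA mapping for {dst_aa}')
        encapsulated = {'outer_dst': la, 'inner_dst': dst_aa, 'payload': packet}
        underlay_send(encapsulated)      # VLB then spreads such flows over random paths

Keeping the AA-to-LA mapping out of the switches is what lets a VM keep its AA when it migrates: only the directory entry has to change.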

Critique
Overall, I think the main strength of the paper lies in the fact that all the observations are supported by large-scale experiments and data. This paper is also based on the Clos topology, which again raises the issue of how apt it is to design a generic protocol around a given topology. Another issue in VLB is path stretch; however, I was not really convinced when the authors said this is not much of a problem because of the environment (propagation delay is small) and the underlying redundant topology. It would be really interesting to discuss PortLand and VL2 together in class and consider which design decisions could be cross-implemented to improve both protocols. I would recommend reading an excellent article, "It's Microsoft vs. the professors with competing data center architectures," and Prof. Amin Vahdat's blog post comparing PortLand and VL2.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric", ACM SIGCOMM, (August 2009).

This paper presented PortLand, a Layer 2 Network Fabric for modern Data Centers. The authors laid down the following requirements in order to achieve their goal:

  1. Permit VM migration to any physical machine without needing to change the IP address.
  2. Plug and Play Switches.
  3. Efficient Communication.
  4. No forwarding loops.
  5. Rapid and efficient failure detection.
It is interesting to note that, strictly speaking, these functionalities cannot all be provided by current protocols at either layer 2 or layer 3. While Req 1 can be tackled at layer 3 (IP), layer 3 fails to provide plug-and-play functionality for switches. Req 4 can be tackled at both layer 2 and layer 3 (by the spanning tree protocol or TTL respectively). However, Req 5 is not met by either layer, since the routing protocols (IS-IS/OSPF) are broadcast based!

As a result, the authors proposed PortLand based on the assumption of a 'Fat Tree Network Topology':

  1. A lightweight protocol that enables switches to discover their position in the topology. It involves a separate centralized component known as the fabric manager, a user process on a dedicated machine responsible for ARP resolution, fault tolerance and multicast. It maintains soft state about the network topology, including IP-to-PMAC mappings.
  2. A 48-bit pseudo MAC (PMAC) of the form pod.position.port.vmid, which encodes a machine's position in the topology. Any switch can query the fabric manager for a PMAC and hence learn the exact location of the destination (a small sketch of this encoding follows the list).
  3. Location Discovery Protocol (LDP) which enables switches to detect their exact topology.
  4. Support for ARP, multicast and broadcast.
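To make the PMAC idea concrete, here is a small Python sketch of packing and unpacking a PortLand-style 48-bit address; the 16/8/8/16-bit field split is my recollection of the paper's pod.position.port.vmid layout and should be checked against the paper before relying on it.

    def make_pmac(pod, position, port, vmid):
        # pod: 16 bits, position: 8 bits, port: 8 bits, vmid: 16 bits -> 48 bits total
        return (pod << 32) | (position << 24) | (port << 16) | vmid

    def parse_pmac(pmac):
        return {
            'pod':      (pmac >> 32) & 0xFFFF,
            'position': (pmac >> 24) & 0xFF,
            'port':     (pmac >> 16) & 0xFF,
            'vmid':     pmac & 0xFFFF,
        }

    pmac = make_pmac(pod=3, position=1, port=2, vmid=42)
    assert parse_pmac(pmac)['pod'] == 3

Because location is encoded in the address itself, switches can forward on prefixes of the PMAC instead of keeping a flat table of every host MAC, which is where the scalability comes from.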

 Critiques

Overall, the paper was very clear about its requirements and clearly showed that the design met all those goals. However, my main concern is that the paper depends heavily on the assumption of a fat-tree network topology, and I am not really sure if designing an entire protocol around a specific topology is a good idea. Though the authors claimed that the design would generalize to any multi-rooted topology (over any link mesh?), no concrete data was given to support those claims. Secondly, I was curious as to what PortLand actually stands for. Did I miss reading it somewhere in the paper :-) ?

Thursday, September 17, 2009

Detailed Diagnosis in Enterprise Networks

S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, "Detailed Diagnosis in Enterprise Networks," ACM SIGCOMM Conference, (August 2009).
This paper aimed at identifying the causes of faults in an enterprise network by formulating the detailed diagnosis as an inference problem that captured the behavior and interactions of various fine grained network components. The authors developed a system NetMedic which had the following features:
  • Detailed diagnosis was framed as an inference problem which was more generalized than rule based/classifier based approaches.
  • It included a technique to estimate whether two components are impacting each other without actually having knowledge of how they interact.
  • Introduced many state variables to capture the network state to make the analysis more generalized instead of having just one "health" variable.
  • It had the ability to detect complex interactions between components and construct a dependency graph between them.
  • Finally, it applied history-based reasoning, statistical abnormality detection and other learning techniques to identify the likely cause of a fault (a heavily simplified illustration of such history-based scoring follows this list).
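The following is a purely illustrative Python sketch of history-based abnormality scoring; this is my own toy formulation, not NetMedic's actual inference procedure. The idea it tries to convey is that a component's current state variables are compared against their historical distribution, and components whose state looks most unusual are ranked as likelier causes.

    from statistics import mean, pstdev

    def abnormality(current_value, history):
        """Z-score-like measure of how unusual the current observation is."""
        mu, sigma = mean(history), pstdev(history)
        return abs(current_value - mu) / sigma if sigma > 0 else 0.0

    def rank_suspects(current_state, histories):
        """current_state: {component: value}; histories: {component: [past values]}."""
        scores = {c: abnormality(v, histories[c]) for c, v in current_state.items()}
        return sorted(scores, key=scores.get, reverse=True)   # most abnormal first

The real system works over many state variables per component and over a dependency graph rather than a flat ranking, but the reliance on historical state is the part this sketch is meant to highlight.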
Overall, the idea was definitely novel and was put forward in a very clear manner. However, the authors mentioned that, in spite of collecting 450,000 cases, they actually used only about 148 cases for their analysis. This places a question mark over the generality of the study: what if they had not encountered complex interactions between components in these few cases and missed important issues? For a framework designed around statistical abnormality detection and history-based reasoning, 148 cases seems like quite a small data set. Secondly, I was not very convinced by how the authors discover redundant variables; assuming that cliques always represent redundant variables seems to be a bit of an over-assumption. There can very well be a set of highly intricate tasks that go 'bad' together in the majority of cases but behave independently in some. Thirdly, I wonder how sound their assumption is that correlation implies causality, which is central to their inference engine. I think it would be really interesting to discuss these aspects in class.