Wednesday, September 30, 2009

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

H. Balakrishnan, V. Padmanabhan, S. Seshan, R. H. Katz, "A Comparison of Mechanisms for Improving TCP Performance over Wireless Links," IEEE/ACM Transactions on Networking, (December 1997).

Traditionally, TCP assumed an underlying wired network and hence believed that a packet loss could only be due to congestion. TCP congestion control therefore primarily consisted of retransmitting on 3 duplicate ACKs and halving its congestion window on each timeout. Moreover, the timeouts were pretty coarse grained (around 200ms) given the purpose of congestion control. However, in wireless networks, packet loss may very well be due to noise, bit errors or handoffs. Invoking congestion control in this scenario obviously degrades end-to-end performance. Several protocols have been proposed to improve the performance of TCP over wireless links:
  • Link-layer Protocols: These employ Forward Error Correction (FEC) and retransmission of lost packets in response to ARQ messages. Advantages: Fits naturally into the layered structure of network protocols and is independent of higher-layer protocols. Disadvantage: link-layer retransmissions can interact badly with TCP's own retransmission timers, adversely affecting TCP.
  • Split Connection Protocols: Splits each TCP connection between sender and receiver into two separate connections at the base station: Sender -- Base Station -- Receiver. Advantage: a specialized protocol tuned to the wireless environment, such as SRP-TCP, can be used on the wireless hop. Disadvantages: per-packet duplication/processing overhead at the base station, violation of TCP end-to-end semantics and the need to maintain state at the base station.
  • Snoop Protocol: A snoop agent module sits at the base station, monitors every packet that passes through a TCP connection and caches those packets that have not yet been acknowledged. Whenever it detects a packet loss (by duplicate acknowledgements or a local timeout), the snoop agent retransmits the cached copy to the receiver and suppresses the dup ACKs (a toy sketch of this caching/suppression logic follows this list). Advantages: suppressing dup ACKs avoids fast retransmissions and congestion control at the sender, and the agent maintains only soft state. Disadvantages: it doesn't completely shield the sender from wireless losses and incurs the overhead of per-packet copying into the cache.
  • Selective Acknowledgements: SACK and SMART. Advantage: Resilience to reordering and lost ACKs. Disadvantages: Overhead in ACK generation/transmission.
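To make the snoop idea concrete, here is a minimal sketch of the caching and dup-ACK suppression logic in Python. This is my own illustration, not the paper's code; the class and method names are invented, and real snoop also manages local retransmission timers, which I omit.

    # Toy snoop-style agent at the base station: cache unacked packets headed
    # to the wireless host, retransmit locally on duplicate ACKs, and suppress
    # those dup ACKs so the fixed sender never invokes congestion control.
    class SnoopAgent:
        def __init__(self, send_to_mobile):
            self.send = send_to_mobile   # function that forwards a packet on the wireless link
            self.cache = {}              # seq -> packet; soft state, safe to lose
            self.last_ack = -1
            self.dup_acks = 0

        def on_data_from_sender(self, seq, packet):
            self.cache[seq] = packet
            self.send(packet)

        def on_ack_from_receiver(self, ack):
            if ack > self.last_ack:                 # new ACK: clean the cache, forward it
                for s in [s for s in self.cache if s <= ack]:
                    del self.cache[s]
                self.last_ack, self.dup_acks = ack, 0
                return ack                          # pass on to the fixed sender
            self.dup_acks += 1                      # duplicate ACK: wireless loss suspected
            if ack + 1 in self.cache:
                self.send(self.cache[ack + 1])      # local retransmission of the lost packet
            return None                             # suppress the dup ACK

    agent = SnoopAgent(send_to_mobile=print)
    agent.on_data_from_sender(1, "pkt1")
    agent.on_data_from_sender(2, "pkt2")
    print(agent.on_ack_from_receiver(1))   # new ACK: forwarded to the sender
    print(agent.on_ack_from_receiver(1))   # dup ACK: pkt2 resent locally, nothing forwarded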
Further, the authors implemented and evaluated the following protocols:
  • End-to-End Schemes: E2E-NEWRENO (fast recovery mode), E2E-SMART (TCP with SMART ACKs), E2E-IETF-SACK (TCP with selective ACKs), E2E-ELN (TCP with an Explicit Loss Notification bit in the ACKs that tells the sender the loss is not due to congestion) and E2E-ELN-RXMT (which additionally retransmits on the first duplicate ACK carrying ELN).
  • Link-Layer Schemes: LL-TCP-AWARE and LL-SMART-TCP-AWARE, which leveraged TCP's own ACKs instead of generating their own.
  • Split Connection Schemes: SPLIT and SPLIT-SMART schemes were evaluated with a clever pointer passing trick that avoided data copying at the base station.
Through the evaluations, the authors found that the TCP-aware link-layer protocols performed best, primarily because they suppressed excess dup ACKs at the base station. Further, E2E protocols with ELN and SACK were better than plain E2E protocols. Split connection protocols also improved performance but were outperformed by the TCP-aware link-layer protocols. Overall, it seemed to me that LL-SMART-TCP-AWARE emerged as the winner!

Critiques:

Overall, the paper was very well written, gave nice and brief explanations of all the existing modifications of wireless communication protocols and did thorough evaluations of all of them. However, I think it was wrong to assume sufficient knowledge at the receiver about wireless losses to generate ELN information; I was not fully convinced by the implementation of E2E-ELN. Further, I am also curious why an exponential error distribution was assumed to be an accurate error model. Thirdly, doesn't LL-SMART-TCP-AWARE go against the basic layered design philosophy of the network stack? Is that justifiable?

MACAW: A Media Access Protocol for Wireless LANs

V. Bharghaven, A. Demers, S. Shenker, L. Zhang, "MACAW: A Media Access Protocol for Wireless LANs," ACM SIGCOMM Conference, (August 1994)

This paper analyzed CSMA and MACA, two possible media access protocols for single channel wireless LANs, and highlighted the problems associated with both of them. While CSMA suffered from the hidden terminal and exposed terminal problems, MACA, which used an RTS-CTS-DATA exchange with binary exponential backoff (BEB), suffered from basic fairness problems and could even starve a connection. This led the authors to propose MACAW, which used an RTS-CTS-DS-DATA-ACK message exchange and a better backoff algorithm. Its key features were the following:

  • Fairness: MACA used the binary exponential backoff algorithm, in which each host doubles its backoff time on every unsuccessful transmission/collision detection. Because of this, it was possible for one connection to keep backing off while another took the full available bandwidth, raising fairness concerns. MACAW proposed backoff copying, in which all stations in range adopt the same backoff value, thereby restoring fairness. Further, to prevent wild oscillations, MILD (multiplicative increase, linear decrease) was introduced (a toy sketch of this backoff adjustment follows this list). As a further optimization, fairness was defined in terms of streams and not stations.
  • Basic Message Exchange: To start with, the authors realized that since noise is likely to be present, an RTS-CTS-DATA-ACK exchange should be used for reliable communication. Further, to provide synchronization among stations, before sending a DATA packet the station sends a 30 byte Data-Sending (DS) packet; every station that overhears it then knows the RTS-CTS exchange was successful. The authors also proposed the RRTS (Request-for-Request-to-Send) packet, which lets a receiver that had to defer its CTS contend on behalf of the original sender in the next cycle.
  • Multicast: Due to the inherent nature of RTS-CTS, multicast cannot be implemented straightaway. So, as a hack, the authors proposed sending RTS immediately followed by the DATA packet. Obviously, this design had the same flaws as CSMA. 
  • 'Leakage' of backoff values: Backoff copying requires all nodes in the same cell to have the same backoff value. In some cases, the proximity of nodes at the edges of cells led to backoff values 'leaking' into adjacent cells, causing wasteful idle time or wasteful collisions. Secondly, there is an inherent problem with BEB: the absence of a CTS is taken to imply collision/congestion, while it may very well be due to noise or the destination being offline. This led the authors to propose that each station maintain a separate backoff counter for each stream, and that all stations attempting to communicate with the same receiving station use the same backoff value.
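Here is a toy sketch of the MILD adjustment with backoff copying, as I understand it from the paper. The BO_MIN/BO_MAX bounds are illustrative values of my own, not the paper's constants.

    # Toy MACAW-style backoff: multiplicative increase (x1.5) on failure,
    # linear decrease (-1) on success, and copying of overheard backoff values.
    BO_MIN, BO_MAX = 2, 64   # illustrative bounds, not the paper's

    class Station:
        def __init__(self):
            self.backoff = BO_MIN

        def on_failure(self):                  # no CTS heard: multiplicative increase
            self.backoff = min(int(self.backoff * 1.5), BO_MAX)

        def on_success(self):                  # exchange completed: linear decrease
            self.backoff = max(self.backoff - 1, BO_MIN)

        def on_overheard_packet(self, advertised_backoff):
            # Backoff copying: adopt the value carried in overheard packets, so all
            # stations in range converge to the same backoff (fairness).
            self.backoff = advertised_backoff

    s = Station()
    s.on_failure(); s.on_failure()
    print(s.backoff)   # grew multiplicatively to 4
    s.on_success()
    print(s.backoff)   # shrank by one slot to 3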
Critique

Overall, this paper was nicely written, surveyed the state of the art in wireless technology at the time and then proposed its solution in an incremental fashion, clearly highlighting the need for each and every design decision taken. One of my concerns was the lack of a good solution for multicast; the solution proposed by the authors was more of a working hack. Can a spanning-tree-like solution be implemented? Further, since all the evaluation was simulated in a controlled environment, it may not clearly reflect practical environments. Thirdly, I was intrigued by the fact that MILD had a multiplicative factor of 1.5 (and not 2, which is just a bit-shift operation). Since we discussed in class while reading about AIMD that these factors are often chosen for ease of implementation, I am curious what the motivation behind choosing a factor of 1.5 was. Lastly, all the analysis was done for fixed stations. (Why) Weren't the authors concerned about moving end points in 1994?

Tuesday, September 29, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

Y. Chen, R. Griffith, J. Liu, A. Joseph, R. H. Katz, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," Workshop on Research in Enterprise Networks (WREN'09), (August 2009).
This paper followed up on CMU's SIGCOMM 2009 incast paper, critically analyzed the work done as part of that implementation and presented the authors' own view of the problem. The authors started off by describing the basic TCP incast problem in datacenters, which results from N servers trying to send data to a single receiver. They highlighted that the two main contributions of the CMU paper were reducing the minimum TCP RTO and providing high resolution timers. Then they came up with some intuitive fixes a system designer might try, such as decreasing the minimum TCP RTO timer value, setting a smaller multiplier for exponential backoff, using a randomized multiplier for exponential backoff and possibly randomizing the timer values of each RTO as they occur. Based on initial experiments, the authors concluded that high resolution timers were indeed necessary. However, they found that many of the other modifications were not really helpful: a smaller or randomized multiplier for RTO backoff did not help much, and randomizing the minimum and initial RTO timer values was also unhelpful.

This led the authors to delve deeper into the problem, and they found plenty of interesting insights: different networks suffered to different degrees from incast, there are tradeoffs in ACK feedback and switch buffer sizing, and, more importantly, an analytical model is hard to build and would most likely not work. Instead the authors developed a quantitative model in which the net goodput across S senders essentially depends on S, the total data D and the minimum RTO value r. Finally, a variety of quantitative refinements were proposed, one of the most interesting being that disabling delayed ACKs does not work out as well as the CMU paper suggested.
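To give a feel for the shape of such a model, here is a toy goodput calculation of my own. It is not the authors' formulation; the link speed, the assumption that RTO events grow with the number of senders S, and all constants are mine and purely illustrative.

    # Toy incast goodput model: transfer time = ideal time + time wasted idling
    # during RTO events, so the minimum RTO r and sender count S dominate goodput.
    def goodput_bps(S, D_bytes, r_sec, link_bps=1e9, rto_events=None):
        if rto_events is None:
            rto_events = max(1, S // 8)        # crude assumption: more senders, more overflows
        ideal_time = (D_bytes * 8) / link_bps  # transfer time with no timeouts
        wasted_time = rto_events * r_sec       # the link sits idle during each RTO
        return (D_bytes * 8) / (ideal_time + wasted_time)

    for r in (0.200, 0.001):                   # 200 ms vs 1 ms minimum RTO
        print(r, round(goodput_bps(S=48, D_bytes=1_000_000, r_sec=r) / 1e6, 1), "Mbps")

Even this crude sketch shows why shaving the minimum RTO matters so much: with a 200 ms RTO the 1 MB block spends almost all of its time waiting out timeouts.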

Overall, I liked this paper because of the approach it followed. Rather than basing the analysis on a few discrete observations, designing a quantitative model that can be refined with further observations was definitely one of the key strengths of the paper.

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, B. Mueller, "Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication," ACM SIGCOMM Conference, (August 2009).

This paper presented the TCP incast problem: a pathology arising when multiple high bandwidth TCP workloads converge in datacenter Ethernet. The small switch memory buffers fill quickly, causing TCP timeouts that last hundreds of milliseconds (due to the 200ms minimum timeout in the vanilla TCP protocol).

The authors started off by briefly describing the problem of 'barrier synchronized' queries, wherein a client cannot make further progress until it hears from all the servers that it has queried. Data blocks striped across multiple servers are sent towards the client simultaneously, overfilling the packet buffers and getting dropped. This results in a timeout lasting a minimum of 200ms, determined by the TCP minimum retransmission timeout. Since the problem is essentially a real world problem, the authors got down to presenting a practical solution straightaway. They created a cluster with N (= 48) servers, and a test client issued a request to get a block (= 1 MB) of data striped across these N servers, so each server sent blocksize/N bytes of data to the requesting client. They tested the system on a 16 node cluster using an HP Procurve 2848 switch and a 48 node cluster using a Force10 S50 switch. Their solution mainly consisted of proposing that the timeouts be randomized, which makes different senders retransmit at different times. They proposed:
Timeout = (RTO + (rand(0.5) * RTO)) * 2^backoff
Secondly, they implemented high resolution timers in the Linux kernel and the corresponding TCP stack using the GTOD framework, enabling fine grained retransmission timeouts on the order of hundreds of microseconds. Finally, they discussed the problems of spurious retransmissions and delayed ACKs and advised that delayed ACKs be disabled.
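Here is a small sketch of the randomized timeout formula above, interpreting rand(0.5) as a uniform draw in [0, 0.5]; the base RTO values are illustrative, not measurements from the paper.

    # Randomized, fine-grained retransmission timeout, per the formula above:
    # Timeout = (RTO + rand(0, 0.5) * RTO) * 2^backoff, so senders sharing a
    # loss event desynchronize their retransmissions.
    import random

    def next_timeout(rto_sec, backoff):
        return (rto_sec + random.uniform(0, 0.5) * rto_sec) * (2 ** backoff)

    # With microsecond-granularity timers the base RTO can be ~200 us rather
    # than the classic 200 ms minimum.
    print(next_timeout(rto_sec=200e-6, backoff=0))
    print(next_timeout(rto_sec=200e-6, backoff=3))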

Overall, this paper definitely deserves merit for exploring the TCP incast problem in detail and providing a very practical and tested solution to the problem. However, at some points I felt that the analysis was based on intuition rather than strict empirical data or mathematical models. For instance, it was not very clear what the motivation was for the striped block workload (blocksize/N) used by the authors in their implementation. Further, I found that the reasoning behind disabling delayed ACKs was not well explained. It would be really great to discuss in class some practical datacenter scenarios on which these supposedly intuitive decisions were based.

Tuesday, September 22, 2009

VL2: A Scalable and Flexible Data Center Network

A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, S. Sengupta, "VL2: A Scalable and Flexible Data Center Network," ACM SIGCOMM 2009, (August 2009).

This paper presented VL2 (Virtual Layer 2) based Data Center Network with the following key features:
  1. Flat addressing to allow services instances to be placed anywhere in the network.
  2. 'Valiant' load balancing (VLB)
  3. End System based address resolution. 
The authors established high utilization and agility as the foremost goals in their design. Current datacenters suffer from over-subscription of resources, and nothing prevents a traffic flood in one service from affecting others. Moreover, IP addresses/VLANs are used to achieve scale, which makes things like VM migration extremely complex and introduces broadcasts.

The authors did an extensive analysis of the traffic patterns in datacenters. However, they found that there is no particular pattern in datacenter traffic that could be exploited in the design of the routing protocol. So, like PortLand, they decided to exploit the topology instead. They assume a Clos network, a rooted tree in which the links between intermediate switches and aggregation switches form a bipartite graph. The underlying idea of VL2 addressing and routing is to have two addresses: an application-specific IP address (AA) and a location-specific IP address (LA). The VL2 agent at each server traps packets from the host and encapsulates each packet with the LA of the destination's ToR switch. The VL2 agent learns the AA-to-LA mapping from a directory lookup service. Further, valiant load balancing (VLB) spreads flows over randomized paths.
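To make the AA/LA split concrete, here is a toy sketch of what the agent's job amounts to. The directory contents, addresses and structure are my own placeholders, not VL2 code.

    # Toy VL2-style agent: look up the destination's LA in the directory,
    # pick a random intermediate switch (VLB), and encapsulate.
    import random

    directory = {                        # AA (application address) -> LA of the
        "10.0.0.5": "20.1.1.1",          # destination's ToR switch
        "10.0.0.9": "20.1.2.1",
    }
    intermediate_switches = ["30.0.0.1", "30.0.0.2", "30.0.0.3"]

    def vl2_agent_send(payload, dst_aa):
        la = directory[dst_aa]                           # directory lookup: AA -> LA
        bounce = random.choice(intermediate_switches)    # VLB: random intermediate hop
        # Outer headers steer the packet over the randomized path; the inner
        # packet still carries the flat AA that the application sees.
        return {"outer_dst": bounce, "inner_dst": la, "payload": payload}

    print(vl2_agent_send(b"data", "10.0.0.5"))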

Critique
Overall, I think the main strength of the paper lies in the fact that all the observations are supported by large scale experiments and data. This paper is also based on a Clos topology, which again raises the issue of how apt it is to design a generic protocol over a given topology. Another issue in VLB is path stretch; however, I was not really convinced when the authors said that this is not much of a problem thanks to the environment (propagation delay is small) and the underlying redundant topology. It would be really interesting to discuss the aspects of PortLand and VL2 in class and the various design decisions that could be cross-implemented to improve both protocols. I would recommend reading an excellent article, It's Microsoft vs. the professors with competing data center architectures, and Prof. Amin Vahdat's blog post that compares PortLand and VL2.

Monday, September 21, 2009

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, A. Vahdat, "PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric", ACM SIGCOMM, (August 2009).

This paper presented PortLand, a Layer 2 Network Fabric for modern Data Centers. The authors laid down the following requirements in order to achieve their goal:

  1. Permit VM migration to any physical machine without having the need to change IP address.
  2. Plug and Play Switches.
  3. Efficient Communication.
  4. No forwarding loops.
  5. Rapid and efficient failure detection.
It is interesting to note that, strictly speaking, with current protocols these functionalities cannot all be provided at either layer 2 or layer 3. While Req 1 can be tackled at layer 3 (IP), layer 3 fails to provide plug and play functionality for switches. Further, Req 4 can be tackled at both layer 2 and layer 3 (by the spanning tree protocol or TTL respectively). However, Req 5 is not met by either layer since the routing protocols (IS-IS/OSPF) are broadcast based!

As a result, the authors proposed PortLand based on the assumption of a 'Fat Tree Network Topology':

  1. A lightweight protocol to enable switches to discover their position in the topology. It involves a separate centralized system known as the Fabric Manager, a user process on a dedicated machine responsible for ARP resolution, fault tolerance and multicast. It maintains soft state about the network topology, including IP-to-PMAC mappings.
  2. A 48 bit pseudo MAC (PMAC) of the form pod.position.port.vmid which encodes a machine's position in the topology (a toy encoding sketch follows this list). Any switch can query the fabric manager for the PMAC and hence learn the exact location of the destination.
  3. Location Discovery Protocol (LDP) which enables switches to detect their exact topology.
  4. Support for ARP, multicast and broadcast.
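Here is the toy PMAC encoder/decoder promised above. I am assuming a 16/8/8/16-bit field split, which matches my reading of the paper but should be treated as illustrative rather than authoritative.

    # Toy 48-bit PMAC of the form pod.position.port.vmid (assumed 16/8/8/16 bits).
    def encode_pmac(pod, position, port, vmid):
        return (pod << 32) | (position << 24) | (port << 16) | vmid

    def decode_pmac(pmac):
        return ((pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF,
                (pmac >> 16) & 0xFF, pmac & 0xFFFF)

    pmac = encode_pmac(pod=3, position=7, port=2, vmid=11)
    print(hex(pmac), decode_pmac(pmac))   # switches can forward on the pod/position
                                          # prefix alone, keeping tables small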

 Critiques

Overall, the paper was very clear in its requirements and clearly showed that the implementation met all those goals. However, my main concern is that this paper depends heavily on the assumption of a fat-tree network topology, and I am not really sure if designing an entire protocol around a specific topology is a good idea. Though the authors claimed that this will generalize to any multi-rooted topology (over any link mesh?), no concrete data was given to support the claim. Secondly, I was curious what PortLand actually stands for. Did I miss reading it somewhere in the paper :-) ?

Thursday, September 17, 2009

Detailed Diagnosis in Enterprise Networks

S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, "Detailed Diagnosis in Enterprise Networks," ACM SIGCOMM Conference, (August 2009).
This paper aimed at identifying the causes of faults in an enterprise network by formulating detailed diagnosis as an inference problem that captures the behavior and interactions of fine grained network components. The authors developed a system called NetMedic with the following features:
  • Detailed diagnosis was framed as an inference problem which was more generalized than rule based/classifier based approaches.
  • It provided a technique to estimate whether two components are impacting each other without actually having knowledge of how they interact.
  • Introduced many state variables to capture the network state to make the analysis more generalized instead of having just one "health" variable.
  • Had the ability to detect complex interactions between components and construct a dependency graph between them.
  • Finally, it applied history based reasoning, statistical abnormality detection and other learning techniques to identify the cause of faults.
Overall, the idea was definitely novel and was put forward in a very clear manner. However, the authors mentioned that in spite of collecting about 450,000 cases, they actually used only about 148 cases for their analysis. This places a question mark on the generality of the study: what if they had not encountered complex interactions between components in these few cases and missed important issues? For a framework designed around statistical abnormality detection and history based reasoning, 148 cases seems quite a small data set. Secondly, I was not very convinced by how the authors discover redundant variables; assuming that cliques always represent redundant variables seems to be a bit of an over-assumption. There can very well be a set of highly intricate tasks that go 'bad' together in a majority of cases, yet behave independently in others. Thirdly, I wonder how sound their assumption is that correlation always implies causality, which is the main premise of their inference engine. I think it would be really interesting to discuss these aspects in class.

Floodless in SEATTLE

C. Kim, M. Caesar, J. Rexford, "Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises," ACM SIGCOMM Conference, (August 2008).

This paper describes a network architecture called SEATTLE (Scalable Ethernet Architecture for Large Enterprises) which aims to achieve the scalability of IP combined with the simplicity of Ethernet. The most important aspects of SEATTLE are its plug and play functionality and its ability to operate alongside existing infrastructure and protocols.

The paper started off by discussing the problems of Ethernet bridging (namely inflexible route selection and its dependence on broadcasting for basic operations), hybrid IP (configuration overhead due to hierarchical addressing, complexity encountered in implementing network policies and limited support for mobility of end hosts) and VLANs (trunk configuration overhead, limited control-plane scalability and insufficient support for richer topologies).

The authors then proposed a network layer one-hop DHT. The concept is to run a link state protocol at the switch level so that each switch knows the other switches' positions in the network. Each end host is then hashed, based on its MAC and IP addresses, onto one of the switches. Whenever a request arrives at a switch, it computes the hash of the destination's address and queries the resulting switch; that switch knows the location of the switch to whose domain the destination host belongs, and the packet is routed there. Moreover, the value is cached at the source node's switch for further transmissions, the logic being the observation that most communication in the modern Internet involves a small, fixed number of destinations. This also enables flexible service discovery, as services like "PRINTER" or "DHCP_SERVER" can be cached in a similar manner on the switches. The system is also much more tolerant of topology changes, since switches communicate and update key-value pairs. Further, the authors discussed the scalability of the system by proposing a multi-level one hop DHT with separate regional and backbone hash rings. Finally, a simulation study based on real-world traffic traces was provided, evaluated on Emulab using the Click and XORP routing platforms.
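A minimal sketch of the one-hop DHT idea, as I understand it; the switch names, hash function choice and data layout are my own simplifications, not SEATTLE's implementation.

    # Toy one-hop DHT: hash a host's MAC onto the ring of switches; the switch
    # that owns that point ("resolver") stores the host -> attachment-switch mapping.
    import hashlib
    from bisect import bisect_left

    switches = sorted(["sw-a", "sw-b", "sw-c", "sw-d"])

    def h(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    ring = sorted((h(sw), sw) for sw in switches)

    def resolver_for(host_mac):
        # Consistent hashing: the first switch clockwise from hash(mac).
        idx = bisect_left([k for k, _ in ring], h(host_mac)) % len(ring)
        return ring[idx][1]

    location = {}                                 # state kept at resolver switches

    def publish(host_mac, attachment_switch):     # done when a host is learned
        location[(resolver_for(host_mac), host_mac)] = attachment_switch

    def lookup(host_mac):                         # done by an ingress switch; the
        return location[(resolver_for(host_mac), host_mac)]   # result is then cached

    publish("00:11:22:33:44:55", "sw-b")
    print(resolver_for("00:11:22:33:44:55"), lookup("00:11:22:33:44:55"))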

Overall, this paper is very straightforward in its approach and clearly discusses the current scenario in enterprise networks. It does an impressive job of tackling many of the issues that affect the scalability of current networks, which rely heavily on broadcasts and flooding as a means of path detection. However, I felt that the authors gave only a fledgling treatment of the hierarchical multi-level one-hop DHT. Since the boundary-level switch ring has to maintain hash values for every host, this raises the question of what size of enterprise network the authors were targeting; it was not really clearly defined. Further, it would have been great if the authors had also discussed the cost/feasibility of the smart switches the protocol requires.

Tuesday, September 15, 2009

Congestion Control for High Bandwidth-Delay Product Networks

D. Katabi, M. Handley, C. Rohrs, "Congestion Control for High Bandwidth-Delay Product Networks," ACM SIGCOMM Conference, (August 2002)
This paper presented the eXplicit Control Protocol (XCP) as a replacement for the currently deployed Transmission Control Protocol (TCP), introducing the new concept of decoupling utilization control from fairness control. The advantages of XCP (and disadvantages of TCP) were:

1. Unlike TCP, where efficiency and fairness are coupled together in AIMD, XCP decouples the two. As a result, XCP effectively implements MIMD for its efficiency controller and AIMD for its fairness controller.

2. In XCP, each packet carries a congestion header to the routers containing the sender's current window size and RTT estimate. The routers write their feedback back into the same header, so the protocol does not need to maintain any per-flow state in routers (a toy sketch of this feedback computation appears after this list).

3. TCP's additive increase policy makes it less responsive in its ability to acquire spare bandwidth.

4. XCP facilitates the detection of misbehaving sources.

5. XCP provides a good incentive for users to deploy it. However, it can very well co-exist with existing TCP which allows it to be implemented gradually.

6. The parameters used in XCP are pre-specified, which makes it easy to deploy without much application-specific tuning.
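To make the decoupling concrete, here is a toy sketch of an XCP-like router loop. The constants and structure follow my reading of the paper, and the fairness step is grossly simplified (equal split), so treat this as an illustration rather than the protocol's actual algorithm.

    # Efficiency controller: one aggregate feedback value from spare bandwidth
    # and persistent queue. Fairness controller: divide it among flows/packets.
    ALPHA, BETA = 0.4, 0.226   # constants as I recall them from the paper

    def aggregate_feedback(capacity_bps, input_rate_bps, queue_bytes, avg_rtt_sec):
        spare = capacity_bps - input_rate_bps            # negative if overloaded
        # Feedback proportional to spare bandwidth, minus a term that drains
        # any persistent queue (all terms in bits).
        return ALPHA * avg_rtt_sec * spare - BETA * queue_bytes * 8

    def per_packet_feedback(phi, packets_this_rtt):
        # Simplified fairness controller: equal split. The paper instead gives
        # positive feedback equally per flow (additive increase) and negative
        # feedback proportionally to each flow's rate (multiplicative decrease).
        return phi / max(packets_this_rtt, 1)

    phi = aggregate_feedback(1e9, 0.9e9, queue_bytes=5000, avg_rtt_sec=0.05)
    print(phi, per_packet_feedback(phi, packets_this_rtt=400))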


The design of XCP was backed by extensive simulations and a stability analysis. The authors showed that XCP clearly outperformed TCP over RED, CSFQ, REM and AVQ. Overall, this relatively recent paper deals with some of the core issues that affect the Internet today and questions the very assumptions of TCP. I really liked this approach of stepping back a bit and questioning the core assumption of why packet loss should be an indicator of congestion! Many of the ideas in this paper build on things we had been discussing in the last few classes involving bursty traffic and high bandwidth-delay links. On the other hand, I felt that identifying misbehaving hosts does require keeping per-flow information, which was not really explained well in the paper. To conclude, it would be really interesting to discuss the applicability of this protocol and how successful its gradual deployment has been.

Random Early Detection Gateways for Congestion Avoidance

 S. Floyd, V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Transactions on Networking, (August 1993).
This paper talks about Random Early Detection (RED) gateways for congestion avoidance. The underlying idea is that the gateway drops or marks each arriving packet with a probability that is a function of the average queue size and is roughly proportional to the connection's share of the bandwidth. The main advantages of RED gateways are:
  • No bias against bursty traffic
  • Avoids global synchronization
  • Gradual deployment  is possible in the current Internet
The paper started off by discussing some early work on congestion avoidance gateways, namely Early Drop gateways (which, being drop-tail gateways, resulted in global synchronization and were biased against bursty traffic) and Early Random Drop gateways (which were not successful in controlling misbehaving users). Many of the early congestion avoidance schemes focused on instantaneous queue lengths rather than average queue lengths, which obviously created a bias against bursty traffic or traffic with a high bandwidth-delay product.

The design of the RED algorithm was pretty neat in the way it dealt with the average queue length, calculated as avg = (1 - w)*avg + w*q, where q is the instantaneous queue size and w is the queue weight. This smooths the average by making it depend on the previously calculated average as well as the current queue length, so momentary queue buildups due to traffic bursts don't change the average much (and hence don't cause packets to be dropped/marked). The authors backed up their algorithm with adequate simulations which confirmed their results.
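A minimal sketch of this averaging and the drop decision, condensed by me: the parameter values are illustrative, and I omit the paper's count-based probability correction and its idle-period handling.

    # RED in miniature: EWMA of the queue length, then a drop probability that
    # grows linearly between the two thresholds.
    import random

    W_Q, MIN_TH, MAX_TH, MAX_P = 0.002, 5, 15, 0.02   # queue weight, thresholds (pkts)
    avg = 0.0

    def on_packet_arrival(queue_len):
        global avg
        avg = (1 - W_Q) * avg + W_Q * queue_len        # smoothed average queue length
        if avg < MIN_TH:
            return "enqueue"                           # bursts barely move the average
        if avg >= MAX_TH:
            return "drop"
        p = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH) # drop probability grows linearly
        return "drop" if random.random() < p else "enqueue"

    for q in [2, 4, 8, 12, 20]:
        decision = on_packet_arrival(q)
        print(q, round(avg, 3), decision)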
 
Overall, this paper was pretty clean in its presentation and put forth its point in a clear manner. However, I felt that it has quite a lot of parameters to tune, such as w (the queue weight), the max/min thresholds and the maximum probability, and not enough guidelines were given for determining their values. I personally was quite curious to see how various values of w affect the average queue length (and hence the throughput). Moreover, all the simulations assumed equal packet sizes. As shown, in the case of variable packet sizes the dropping probability would become p = p*(packet size)/(max packet size). It would have been interesting to see the overhead of this operation: unlike the other operations, which involve bit shifting and addition, this one would have noticeably affected the computation time at the gateway. Moreover, it would be really interesting to discuss how successful RED eventually was and how it stands in the current scenario (now that bursty and high bandwidth-delay product traffic has increased a lot in the Internet).

Thursday, September 10, 2009

Core-Stateless Fair Queueing

I. Stoica, S. Shenker, H. Zhang, "Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocations in High Speed Networks," ACM SIGCOMM, (August 1998).

This paper notes that although fair queueing does a pretty good job of managing bandwidth and buffer space and ensuring promptness, it needs to maintain state, manage buffers and perform packet scheduling on a per-flow basis, which increases complexity and decreases cost-effectiveness. This paper instead proposes a different architecture to tackle the complexity; the idea is that the architecture may not be as good as strict FQ but has a very good benefit-to-cost ratio.

The architecture consists of an island of routers (which is quite vague in definition and may refer to a collection of routers owned by one ISP, or may even spread across many ISPs if there are no trust issues) forming a contiguous region of the network. The authors distinguish between edge routers and core routers: the edge routers maintain per-flow state and insert a label into each packet based on their rate estimates, while the core routers, which are essentially stateless, implement FIFO packet scheduling and a probabilistic dropping algorithm based on the packet labels. Two important assumptions made by the authors were:
  • fair allocation mechanisms play an important role in congestion control
  • the complexity of the current fair allocation algorithms is a substantial hindrance to their adoption.
The probabilistic dropping algorithm used by the core routers doesn't give nearly-perfect fairness, but it does achieve a reasonable approximation of fair bandwidth allocation. The authors compared this scheme with other queueing schemes (FIFO, FRED, RED and DRR) and found that CSFQ performed better than FIFO and RED and comparably to FRED. Its edge-router complexity is comparable to FRED's, while its core routers are comparable to RED. DRR performed better, but at a higher cost.
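Here is a toy version of the core-router drop rule, as I understand it: the edge labels each packet with its flow's estimated arrival rate, and the core drops with probability max(0, 1 - fair_share / label). I omit the subtle part, the core's iterative estimation of the fair share, and the rates shown are arbitrary.

    # Stateless core drop rule: no per-flow queues, just the label in the packet
    # and the router's current fair-share estimate.
    import random

    def core_forward(packet_label_bps, fair_share_bps):
        p_drop = max(0.0, 1.0 - fair_share_bps / packet_label_bps)
        return "forward" if random.random() >= p_drop else "drop"

    # A flow labeled at 4x the fair share sees ~75% of its packets dropped, so
    # every flow's forwarded rate converges toward the fair share.
    fair = 10e6
    for label in (5e6, 10e6, 40e6):
        print(label, core_forward(label, fair))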

Overall, this paper was very straight-forward and honest in its approach. It freely questioned its own assumptions and the authors even admitted that they were unsure how to interpret some of the final results! Moreover, even though the paper skipped some mathematical proofs and only gave the final result, it did provide an intuitive understanding of the concept which was really great.

Analysis and Simulation of a Fair Queueing Algorithm

A. Demers, S. Keshav, S. Shenker, "Analysis and Simulation of a Fair Queueing Algorithm," Internetworking: Research and Experience, 1 (1990), pp. 3-26.

This paper presents the design of a fair queueing algorithm which independently controls bandwidth allocation (which packets get transmitted), promptness (how soon they get transmitted) and buffer space (which/when packets are discarded). It shows that fair queueing provides several important advantages over the standard FCFS algorithm used by most routers. One of the important ones is that FCFS assumes there are no ill-behaved sources in the network, while fair queueing provides protection from ill-behaved sources by appropriately penalizing them.

The paper started off by highlighting the work done by Nagle on fair queueing, in which the gateways maintained separate queues of packets for each source and serviced them in round robin fashion. However, this simplistic algorithm didn't handle fair bandwidth allocation in the face of different packet sizes; e.g., connections with large packets effectively ended up getting more bandwidth. Moreover, due to the lack of simulation data and comparison metrics for this algorithm, it was necessary to know exactly 'how much' of an improvement fair queueing algorithms are over standard FCFS. This led the authors to design a better FQ algorithm and back it up with adequate simulation/comparison data. The max-min criterion of fairness is that an allocation is fair if:
  • no user receives more than its request
  • no other allocation scheme satisfying the above condition has higher minimum allocation
  • the above condition remains recursively true as the minimal user is removed and the total resource is reduced accordingly.
The allocation unit used in this paper is the conversation -- a source-destination pair. A hypothetical implementation of such a fairness criterion would be a bit-by-bit round robin (BR) scheme at the gateways, where one bit of data is forwarded for each flow in round robin order. The authors quite elegantly showed that this hypothetical scheme can be emulated with whole packets by the simple rule that whenever a packet finishes transmission, the next packet sent is the one with the smallest value of F (its finishing time in the bit-by-bit emulation). The authors also introduced a term 'delta', a kind of history element that helps with promptness by rewarding connections using less than their share of bandwidth. Moreover, dropped packets from a host are still counted against its throughput, which rightly penalizes ill-behaved senders. The authors did a really good job of simulating the generic, JK and DEC flow control schemes with both the FCFS and FQ queueing algorithms in a variety of synthetic scenarios, which were essentially border cases of evaluation and nicely backed their arguments.
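A small sketch of the finish-number bookkeeping, in my own code: F = max(F_prev_for_this_flow, current_round) + packet_size, and the packet with the smallest F is sent next. I leave out the 'delta' promptness term and the buffer management.

    # Packet-by-packet emulation of bit-by-bit round robin via finish numbers.
    import heapq

    class FairQueue:
        def __init__(self):
            self.last_finish = {}      # flow id -> finish number of its latest packet
            self.heap = []             # (finish_number, flow, packet_size)

        def enqueue(self, flow, size, round_now):
            f = max(self.last_finish.get(flow, 0), round_now) + size
            self.last_finish[flow] = f
            heapq.heappush(self.heap, (f, flow, size))

        def dequeue(self):
            return heapq.heappop(self.heap)   # smallest finish number goes out next

    fq = FairQueue()
    fq.enqueue("A", 1500, round_now=0)        # a large-packet flow
    fq.enqueue("B", 500, round_now=0)         # a small-packet flow
    fq.enqueue("B", 500, round_now=0)
    print([fq.dequeue() for _ in range(3)])   # both of B's packets leave before A's big one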

Overall, the paper was very well presented and highlighted the authors' deep insight in this area. The experimental data adequately backed up all the claims. One thing which could have been taken into account is how the protocol would behave with multiple connections from the same host: since FQ is done on a 'per conversation' basis, a single ill-behaving host could very well establish a series of connections. Moreover, as a young student in network research, an interesting question that comes to my mind is how researchers choose evaluation scenarios. Why is it apt to claim that the whole protocol works if it works for 6 boundary case scenarios? Couldn't there have been a 7th scenario which would have uncovered an interesting insight into this problem or flaws in the algorithm?

Tuesday, September 8, 2009

Congestion Avoidance and Control

Jacobson, V. 1988. Congestion avoidance and control. In Symposium Proceedings on Communications Architectures and Protocols (Stanford, California, United States, August 16 - 18, 1988) 

The underlying idea highlighted by this paper is that most of the problems arise from incorrect implementations of transport layer protocols rather than the protocols themselves. In view of this, the paper highlighted plenty of instances where things could go wrong and described some algorithms to set them right. The whole paper is set in a very practical approach, and is a result of the work done in solving the problem of a sudden massive throughput decrease between LBL and Berkeley due to network congestion in October 1986.

 The following contributions were proposed by the paper:
  • Slow start algorithm: This is an algorithm to establish the initial flow. It starts off with a one-packet window and keeps doubling the window size whenever all the packets sent in the current window are acknowledged (limited by the maximum window size wmax).
  • Round-trip-time variance estimation: The classic estimator of the mean round trip time is R = aR + (1 - a)M, where R is the average RTT estimate and M is the most recently measured RTT from the network; the paper argues that the variance should be estimated as well, so the retransmit timeout can track both the mean and the deviation of the RTT (a small sketch of the resulting estimator follows this list).
  • Exponential retransmit timer backoff: This deals with how should the transmits be spaced if the packet has to be re-transmitted. The author claims (with a little bit of intuitive explanation) that exponential retransmit works best in this case.
  • Dynamic window resizing on congestion: This is more of a carryover from the previous paper, and involves additive increase (by 1/(window size) per ACK) and multiplicative decrease (by 0.5) of the congestion window.
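The sketch below shows the mean-plus-deviation RTT estimator in Python. The gains and the "mean + 4 deviations" rule follow my reading of the paper's appendix; treat the constants as illustrative rather than the paper's exact fixed-point code.

    # RTT estimator keeping both a smoothed mean and a smoothed mean deviation;
    # the retransmit timeout is the mean plus four deviations.
    def make_rto_estimator(g=0.125, h=0.25):
        state = {"srtt": None, "rttvar": 0.0}

        def update(measured_rtt):
            if state["srtt"] is None:
                state["srtt"], state["rttvar"] = measured_rtt, measured_rtt / 2
            else:
                err = measured_rtt - state["srtt"]
                state["srtt"] += g * err                             # mean estimate
                state["rttvar"] += h * (abs(err) - state["rttvar"])  # mean deviation
            return state["srtt"] + 4 * state["rttvar"]               # retransmit timeout

        return update

    rto = make_rto_estimator()
    for m in (0.10, 0.12, 0.30, 0.11):   # a variable path keeps the RTO padded
        print(round(rto(m), 3))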
Further, the authors claim that if fair sharing has to be taken into account, the congestion detection must be implemented at gateways instead of end points (End-to-end Principle violation?).  In the end, the paper has a very informative appendix which gives a good insight into the implementation of the above proposed algorithms.

Overall, I felt the paper was perfectly complementary to the previous paper: it carefully highlighted the practical aspects of the congestion avoidance problem raised there. I felt that if I had taken a course in queueing theory, I would have been in a much better position to appreciate the subtleties in this paper; it would be great if we could have a short optional reading on queueing theory along with it. Moreover, I am not very convinced by the author's argument for using b = 0.5 instead of the 7/8 in the original paper. Though the stated reason is the nature of the slow start algorithm, I somehow find it difficult to accept. Doesn't slow start run only at the beginning, to establish the flow? Once the flow is established, subsequent throughput is adjusted by additive increase/multiplicative decrease, and in that case halving the window size seems pretty harsh!

Monday, September 7, 2009

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

Chiu, D. and Jain, R. Analysis of the increase and decrease algorithms for congestion avoidance in computer networks. Comput. Netw. ISDN Syst. 1989

Congestion control mechanisms can be broadly divided into two categories: congestion avoidance and congestion recovery. Both are aimed at solving the underlying network resource management problem, but with different approaches. The former is a 'preventive' approach, where appropriate steps are taken so that the network doesn't reach the congested state in the first place, while the latter is a 'responsive' approach which kicks in when the network is already congested. This paper discusses the nuances of the congestion avoidance approach, taking into account various performance metrics such as efficiency, fairness, convergence time and size of oscillations.

At the heart of the congestion avoidance approach is the 'binary feedback scheme', which acts as a network monitor and sends a single feedback bit back to the host (1 if overloaded and 0 if underloaded). When a host learns the status of the network through this feedback, it must either increase (bit = 0) or decrease (bit = 1) its load accordingly. Considering a linear control for the new rate (which the authors argue is simple and sufficient), the new load can be written as:
X(t+1) = a + bX(t)
where a and b can 'theoretically' be positive or negative for the increase and decrease policies. However, once the authors took the properties of efficiency, fairness and distributedness into account, the feasible choices pretty much reduced to b = 1 for increase and a = 0 for decrease. In other words, additive increase (X(t+1) = a + X(t) with a > 0) and multiplicative decrease (X(t+1) = bX(t) with 0 < b < 1) work best. At the end, the authors also introduce nonlinear controls in passing and argue that they unnecessarily complicate the task of finding the right scaling factors.
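To see why this combination converges toward fairness, here is a toy two-flow simulation of my own (the capacity, increments and step count are arbitrary); it mirrors the paper's vector-diagram argument in code.

    # Two flows under additive increase / multiplicative decrease with a shared
    # binary feedback bit: the gap between their rates halves on every decrease
    # and stays constant on increases, so they drift toward equal shares.
    def aimd(x1, x2, capacity=100.0, a=1.0, b=0.5, steps=200):
        for _ in range(steps):
            if x1 + x2 > capacity:          # feedback bit = 1: overloaded
                x1, x2 = b * x1, b * x2     # multiplicative decrease
            else:                           # feedback bit = 0: underloaded
                x1, x2 = x1 + a, x2 + a     # additive increase
        return x1, x2

    print(aimd(80.0, 10.0))   # starts very unfair, ends with the two rates nearly equal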

Overall, this paper was a good read because of the simplicity and straightforwardness of its explanations. I loved the way it explained the convergence to efficiency and the fairness feasibility conditions through vector graphs! The paper considers under-utilization and over-utilization of the network channel to be equally undesirable, something the congestion algorithm should address. In spite of the convincing math and graphs, I somehow find the additive (slow?) increase and multiplicative (faster?) decrease intuitively at odds with treating the two as equally wrong. Secondly, as the authors highlight, nonlinear controls offer far more flexibility in reaching equilibrium, but there are obviously performance and complexity trade-offs; it would be really interesting to discuss these trade-offs in class. Furthermore, it would also be great to discuss the behavior of congestion avoidance algorithms in a dynamic environment such as P2P sharing, where hosts join and leave often.

Thursday, September 3, 2009

Understanding BGP Misconfiguration

Ratul Mahajan , David Wetherall , Tom Anderson, Understanding BGP misconfiguration, ACM SIGCOMM 2002
The motivation behind the paper is that accidental BGP misconfigurations have the potential to disrupt Internet-wide connectivity, which makes it important to delve deeper into their patterns so as to better understand their frequency as well as their causes. To achieve this goal, the authors attempted to answer the following questions:
  • Frequency of misconfigurations?
  • Their impact on global connectivity?
  • Their causes?
  • What are the solutions to reduce their frequency and impact?
The study spanned 21 days, and the entire stream of BGP updates seen from 23 different vantage points across the Internet was analyzed. Moreover, to validate their results, the authors subsequently surveyed the ISP operators involved in each incident. What makes this problem exciting is that there is no trivial way to detect these misconfigurations, so the paper claims to have studied only a subset of them (and hence provides a 'lower bound' on the scenario). The methodology involved studying two broad classes of globally visible faults:
  • Origin Misconfiguration: This refers to the unintentional insertion of a route into the global BGP tables. This further talked about self-deaggregation, related-origin and self-origin misconfigurations.
  • Export Misconfiguration: This refers to the inadvertent export of a route to a BGP peer in violation of the exporter's policy (thus violating the valley-free property).
The methodology for identifying misconfigurations was to treat short-lived (less than a day) announcements as potential misconfigs and then confirm with the AS operators whether each indeed was one: short-lived new routes could be origin misconfigs, and policy-violating short-lived AS paths could be export misconfigs. The AS relationships were inferred using Gao's algorithm. The authors found that misconfigurations were surprisingly frequent, that the majority of origin misconfigs occurred due to faulty route redistribution or initialization bugs, and that prefix-based configuration was responsible for 22% of the export misconfigs.

Finally, the paper proposed a variety of fixes such as configuration checkers, automatic verification and better user interfaces. Overall, I liked the approach of this paper in the sense that it tackled the problem at a real hands-on level. No doubt misconfigurations are commonplace and, except in certain scenarios, they do not completely disrupt connectivity, but they do have a considerable impact on routing load. I would really like it if we could discuss the steps that could be taken towards developing automatic verification techniques to minimize these misconfigurations, and how the related design decisions would change in the case of S-BGP.

Wednesday, September 2, 2009

Interdomain Internet Routing

Hari Balakrishnan, Interdomain Internet Routing, MIT lecture notes, January 2009.
This paper discusses the fundamentals of how routing between different administrative domains works in the Internet and how it is influenced by various technical as well as administrative issues, introducing the Border Gateway Protocol (BGP) version 4, the current standard inter-domain routing protocol.

The paper starts off by highlighting that the fundamental reason for the scalability of the Internet is its topological addressing scheme, which allows Internet addresses to be aggregated in the forwarding tables of routers. The current Internet infrastructure comprises many ISPs that 'cooperate' and 'compete' amongst themselves to provide global connectivity. Based on their size, they can be broadly divided into three tiers, with many Tier 3 ISPs serving a small number of geographically localized customers and a few Tier 1 ISPs operating at a global level. These ISPs operate as autonomous systems (ASes), and for effective routing on the Internet it is essential for them to exchange reachability information. For this, there are basically two types of inter-AS relationships:

  • Transit: This is a provider-customer relationship wherein the smaller ISP pays the larger one to provide access to all destinations in its routing tables.
  • Peering: This is more of a peer based relationship based on mutual consent wherein two similar sized ISPs exchange mutually beneficial routing information amongst themselves. The author mentions that these peering deals are almost always confidential and not disclosed publicly. (...which I fail to understand why?)
BGP was created by keeping in mind the following design goals:

  • Scalability: It is important that the entire system of BGP updates scale well with the increasing number of ASes as well as the churn in the network.
  •  Policy:  It is important that ASes have the freedom to enforce various forms of routing policies such as route filtering or internal route rankings.
  • Cooperation: This is also an important aspect of BGP design. Cooperation among ASes, resulting in a mutually beneficial environment, is extremely necessary even though each AS applies its own route filters and ranking policies for its own profit.
Another important aspect of BGP is its two session types: eBGP and iBGP. eBGP (exterior BGP) sessions exchange routing information between different ASes, while iBGP (interior BGP) sessions disseminate externally learned routes to all the routers internal to an AS. Further, BGP updates comprise various route attributes in addition to the address prefixes. These include Next Hop, AS path, Local preference and the Multiple-Exit Discriminator (MED), and they are used to implement various policy decisions in ASes such as route filtering and internal route rankings. In the end, the author presents a great set of practical examples, such as the YouTube incident, highlighting the fragility of the whole interdomain routing system, which is discussed in greater detail by Ratul Mahajan et al. in Understanding BGP Misconfiguration.
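As a reminder to myself of how these attributes interact, here is a hedged sketch of the usual route-selection order (highest LOCAL_PREF, then shortest AS path, then lowest MED). Real routers apply several more tie-breakers, and the example routes and ASNs are made up, so this is illustrative only.

    # Simplified BGP decision process over candidate routes for one prefix.
    def best_route(routes):
        # Each route: dict with local_pref, as_path (list of ASNs), med.
        return min(routes, key=lambda r: (-r["local_pref"], len(r["as_path"]), r["med"]))

    routes = [
        {"via": "peer1",     "local_pref": 100, "as_path": [7018, 3356], "med": 10},
        {"via": "customer1", "local_pref": 200, "as_path": [7018, 2914, 3356], "med": 0},
    ]
    # Policy dominates path length: the customer-learned route wins on LOCAL_PREF
    # even though its AS path is longer, which is how transit/peering economics
    # get expressed in the protocol.
    print(best_route(routes)["via"])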

Overall, I really liked the paper, especially because of the ease and lucidity of its explanation of BGP and its implications. However, I would really like a deeper discussion in class on the scalability and convergence issues of BGP. Moreover, it is important to realize that BGP doesn't always give optimal routes, and what can be done to make it perform better in this area is very much an open question. The security aspects of BGP (S-BGP) are also an interesting area that remains to be explored. As a side note, the author mentions somewhere in the paper that some ISPs now offer 'delay guarantees' for certain kinds of traffic; given the nature of the Internet, this statement was kind of intriguing to me. Another point of curiosity was the author's mention that there are exceptions to point-to-point eBGP sessions. It would be really great if we could discuss some scenarios that call for such multi-hop eBGP sessions.