CS 268, Fall 2009: Computer Networks: Understanding TCP Incast Throughput Collapse in Datacenter Networks

Y. Chen, R. Griffith, J. Liu, A. Joseph, R. H. Katz, "Understanding TCP Incast Throughput Collapse in Datacenter Networks," Workshop on Research in Enterprise Networks (WREN'09), (August 2009).

This paper follows CMU's SIGCOMM 2009 incast paper, critically analyzes that work, and presents the authors' own view of the problem. The authors start by describing the basic TCP incast problem in datacenters, which arises when N servers try to send data to a single receiver. They highlight that the two main contributions of the CMU paper were reducing the minimum TCP RTO and providing high-resolution timers. They then consider some intuitive fixes that a system designer might think of: decreasing the minimum TCP RTO timer value, using a smaller multiplier for exponential backoff, randomizing that multiplier, and possibly randomizing the timer value of each RTO as it occurs. Based on initial experiments, the authors conclude that high-resolution timers are indeed necessary. However, they find that many of the other modifications do not really help: a smaller or randomized multiplier for RTO backoff makes little difference, and randomizing the minimum and initial RTO timer values is also unhelpful.
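To make the knobs discussed above concrete, here is a minimal sketch of how a retransmission timeout grows under exponential backoff. It is not code from either paper: the function name `rto_schedule`, its parameters, and the example values are assumptions chosen for illustration, and real TCP stacks also derive the RTO from measured round-trip times, which this sketch ignores.

```python
import random


def rto_schedule(rto_min, multiplier=2.0, retries=5, jitter=0.0, rng=None):
    """Return the successive timeout values (in seconds) a sender would wait.

    rto_min    -- starting (minimum) RTO, hypothetical value supplied by caller
    multiplier -- exponential backoff factor applied after each timeout
    retries    -- how many consecutive timeouts to list
    jitter     -- if > 0, each timeout is scaled by a random factor
                  in [1, 1 + jitter) to desynchronize senders
    """
    rng = rng or random.Random(0)  # fixed seed so runs are repeatable
    timeouts = []
    rto = rto_min
    for _ in range(retries):
        timeouts.append(rto * (1.0 + jitter * rng.random()))
        rto *= multiplier  # back off before the next retry
    return timeouts


if __name__ == "__main__":
    # Illustrative values only: a coarse 200 ms minimum vs. a 1 ms minimum.
    print("coarse:", rto_schedule(0.2, retries=3))
    print("fine:  ", rto_schedule(0.001, retries=3))
    print("jitter:", rto_schedule(0.2, retries=3, jitter=0.5))
```

Changing `multiplier` or `jitter` only reshapes the later retries, while `rto_min` sets the cost of the very first timeout, which may be one way to see why the minimum value matters so much in the experiments summarized above.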

This led the authors to delve deeper into the problem, and they report several interesting insights: different networks suffer from incast to different degrees; there are tradeoffs in ACK feedback and issues with switch buffers; and, more importantly, an analytical model is hard to construct and most likely would not work. Instead, the authors built a quantitative model in which the net goodput across S senders depends essentially on S, the total data D, and the minimum RTO timer value r. Finally, they propose a variety of quantitative refinements. One of the most interesting findings is that disabling delayed ACKs does not work as well as the CMU paper suggested.
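The following toy calculation illustrates why goodput is so sensitive to the minimum RTO r. It is not the model from the paper. It simply assumes that a transfer of D bytes takes its ideal transmission time plus r for every timeout stall. The function name `toy_goodput`, the 1 Gbps link rate, and the timeout count are all hypothetical; in particular, how the number of timeouts depends on S is left to the caller.

```python
def toy_goodput(total_bytes, rto_min, num_timeouts, link_bps=1e9):
    """Estimate goodput in bits per second under a deliberately crude assumption.

    total_bytes  -- total data D fetched from all senders
    rto_min      -- minimum RTO r, in seconds
    num_timeouts -- number of RTO stalls during the transfer (assumed input;
                    in practice this would grow with the number of senders S)
    link_bps     -- bottleneck link rate (hypothetical default: 1 Gbps)
    """
    bits = total_bytes * 8
    ideal_time = bits / link_bps          # time with no losses at all
    stall_time = num_timeouts * rto_min   # link sits idle during each RTO
    return bits / (ideal_time + stall_time)


if __name__ == "__main__":
    D = 1_000_000  # 1 MB in total, an illustrative size
    for r in (0.2, 0.001):
        for timeouts in (0, 1):
            mbps = toy_goodput(D, r, timeouts) / 1e6
            print(f"r={r}s timeouts={timeouts}: {mbps:.1f} Mbps")
```

Under these assumptions, a single stall at a coarse minimum RTO dwarfs the ideal transfer time of a small block, while the same stall at a fine-grained RTO costs comparatively little.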

Overall, I liked this paper because of its approach. Rather than basing the analysis on a few discrete observations, the authors design a quantitative model that can be refined as further observations come in, which I see as one of the key advantages of the paper.

Tuesday, September 29, 2009
