This paper presented the TCP incast problem, which arises when multiple high-bandwidth TCP workloads converge in datacenter Ethernet. The small switch memory buffers fill up quickly, leading to packet loss and TCP timeouts lasting hundreds of milliseconds (due to the 200 ms minimum retransmission timeout in vanilla TCP).
The authors started off by briefly describing the problem of 'barrier synchronized' queries, wherein a client cannot make further progress until it has heard from all the servers it queried. Data blocks striped across multiple servers were sent towards the client simultaneously, so the packets overfilled the switch's packet buffers and were dropped. Each such loss resulted in a timeout lasting at least 200 ms, as determined by the TCP minimum retransmission timeout. Since this was essentially a real-world problem, the authors moved straight to presenting a practical solution. They created a cluster with N (= 48) servers, and a test client issued a request for a block (= 1 MB) of data striped across these N servers, so each server sent blocksize/N bytes of data to the requesting client. They tested the entire system on a 16-node cluster using an HP Procurve 2848 switch and on a 48-node cluster using a Force10 S50 switch. Their solution mainly consisted of randomizing the timeouts, so that different senders retransmit at different times. They proposed:
Timeout = (RTO + (rand(0.5) * RTO)) * 2^(backoff)

Secondly, they implemented high-resolution timers in the Linux kernel and the corresponding TCP stack, using the GTOD framework, to track fine-grained round-trip times on the order of hundreds of microseconds. Finally, they discussed the problems of spurious retransmissions and delayed ACKs, and advised that delayed ACKs should be disabled.
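To make the randomized timeout formula quoted above concrete, here is a minimal Python sketch. It assumes that rand(0.5) means a value drawn uniformly from [0, 0.5], which the formula itself does not spell out. The function name `randomized_timeout`, its parameters, and the sample values (a 1 ms RTO, 8 senders) are hypothetical and chosen only for illustration; this is not the authors' kernel implementation.

```python
import random


def randomized_timeout(rto, backoff, rng=random):
    """Return (RTO + rand(0.5) * RTO) * 2^backoff, in the same unit as rto.

    Assumption: rand(0.5) is a uniform draw from [0, 0.5].
    """
    jitter = rng.uniform(0.0, 0.5) * rto  # random extra delay, up to 50% of RTO
    return (rto + jitter) * (2 ** backoff)


def sender_timeouts(num_senders, rto, backoff, seed=None):
    """Compute one independently randomized timeout per sender."""
    rng = random.Random(seed)
    return [randomized_timeout(rto, backoff, rng) for _ in range(num_senders)]


if __name__ == "__main__":
    # Hypothetical values: 8 senders sharing a 1 ms RTO, no backoff yet.
    for i, t in enumerate(sender_timeouts(8, rto=0.001, backoff=0, seed=1)):
        print(f"sender {i}: retransmit after {t * 1e6:.0f} us")
```

Without the random term, all senders that lost packets in the same burst would time out and retransmit at the same instant, which can overflow the switch buffer again. With it, each timeout falls somewhere between RTO and 1.5 × RTO (scaled by the backoff factor), so the retransmissions are spread out in time.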
Overall, this paper definitely deserves credit for exploring the TCP incast problem in detail and providing a very practical and tested solution to it. However, at some points I felt that the analysis was based on intuition and insight rather than on strict empirical data or mathematical models. For instance, the motivation for the striped block workload (blocksize/N) used in the authors' implementation was not very clear. Further, I found that the reasoning behind disabling delayed ACKs was not well explained. It would be great to discuss in class some practical datacenter scenarios on which these supposedly intuitive decisions were based.