TCP Maintenance and Minor Extensions M. Kuehlewind, Ed. (tcpm) University of Stuttgart Internet-Draft R. Scheffenegger Intended status: Informational NetApp, Inc. Expires: February 14, 2014 August 13, 2013 Problem Statement and Requirements for Fine-Grained ECN Feedback draft-ietf-tcpm-accecn-reqs-03-bb Abstract Explicit Congestion Notification (ECN) is an IP/TCP mechanism where network nodes can mark IP packets instead of dropping them to indicate congestion to the end-points. An ECN-capable receiver will feedback this information to the sender. ECN is specified for TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT). Recently, new TCP mechanisms like ConEx or DCTCP need fine-grained ECN feedback information in the case where more than one marking is received in one RTT. This document specifies requirements for an update to the TCP protocol so that it can provide ECN feedback signals that are more fine-grained than just once per round trip. Review Comments, dated 13 Aug 2013. This is a review by Bob Briscoe, not the work of the authors. The changes suggested in this review are just that: suggestions. They have been written as mods to the XML source of the draft merely for convenience. A diff against the original draft-03 will be the best way to read this. It is expected that the authors may accept some changes and reject others. There is no implication that any of these changes are acceptable to the authors. The motivation for some of the suggested changes is in the accompanying email sent to the tcpm list. Status of this Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 1] Internet-Draft Requirements for More Accurate ECN August 2013 time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on February 14, 2014. Copyright Notice Copyright (c) 2013 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 2] Internet-Draft Requirements for More Accurate ECN August 2013 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 5 2. Recap of Classic ECN and ECN Nonce in IP/TCP . . . . . . . . . 6 3. Requirements . . . . . . . . . . . . . . . . . . . . . . . . . 7 4. Design Approaches . . . . . . . . . . . . . . . . . . . . . . 9 4.1. Re-use of ECN/NS Header Bits . . . . . . . . . . . . . . . 9 4.2. Using Other Header Bits . . . . . . . . . . . . . . . . . 10 4.3. Using a TCP Option . . . . . . . . . . . . . . . . . . . . 11 5. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 11 6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 11 7. Security Considerations . . . . . . . . . . . . . . . . . . . 11 8. References . . . . . . . . . . . . . . . . . . . . . . . . . . 12 8.1. Normative References . . . . . . . . . . . . . . . . . . . 12 8.2. Informative References . . . . . . . . . . . . . . . . . . 12 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 13 Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 3] Internet-Draft Requirements for More Accurate ECN August 2013 1. Introduction Explicit Congestion Notification (ECN) [RFC3168] is an IP/TCP mechanism where network nodes can mark IP packets instead of dropping them to indicate congestion to the end-points. An ECN-capable receiver will feedback this information to the sender. ECN is specified for TCP in such a way that only one feedback signal can be transmitted per Round-Trip Time (RTT). This is sufficient for pre- existing congestion control mechanisms that perform only one reduction in sending rate per RTT, independent of the number of ECN congestion marks. But recently proposed/deployed mechanisms like Congestion Exposure (ConEx) [RFC6789] or DCTCP [Ali10] need more fine-grained ECN feedback information to work correctly in the case where more than one marking is received in any one RTT. ConEx is an experimental approach that allows the sender to re-insert the congestion feedback it sees into the forward data path. This is primarily so that any traffic management can be proportionate to actual congestion caused by traffic, rather than limiting traffic based on rate or volume in case it might cause congestion [RFC6789]. A ConEx sender uses selective acknowledgements (SACK [RFC2018]) for fine-grained feedback of loss signals, but currently TCP offers no equivalent fine-grained feedback for ECN. DCTCP offers very low and predictable queueing delay. DCTCP requires switches/routers to have ECN enabled and configured with no signal smoothing, so it is currently only used in private networks, e.g. internal to data centres. DCTCP was released in Microsoft Windows 8, and implementations exist for Linux and FreeBSD. The changes DCTCP makes to TCP are not currently the subject of any IETF standardisation activity. The different DCTCP implementations alter TCP's ECN feedback protocol [RFC3168] in unspecified proprietary ways, and they either omit capability negotiation, or they use non-interoperable negotiation. A primary motivation for this document is to prevent each proprietary implementation from inventing its own handshake, which could lead to _de facto_ consumption of the few flags that remain available for standardising capability negotiation. Also, those variants that use the feedback protocol proposed in [Ali10] only work if there are no losses at all, and otherwise they become confused. To remedy these problems, Section 3 of this document lists requirements for a robust and interoperable fine-grained TCP/ECN feedback protocol that all implementations of ConEx and/or DCTCP can use. A few solutions have already been proposed, so Section 4 demonstrates how to use the requirements to compare them, by briefly sketching their high level design choices and discussing the benefits Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 4] Internet-Draft Requirements for More Accurate ECN August 2013 and drawbacks of each. The following scenarios briefly show where fine-grained feedback is needed or adds value: An RFC5681 TCP sender that supports ConEx: In this case the ConEx mechanism uses the extra information per RTT to re-echo the precise congestion information, but the congestion control algorithm still ignores multiple marks per RTT [RFC5681]. A sender using DCTCP congestion control without ConEx: The DCTCP congestion control algorithm uses the extra feedback information per RTT to decrease its rate depending on the extent of congestion marks (not just the existence of at least one mark per RTT). A sender using DCTCP congestion control and supports ConEx: Both the congestion control algorithm and ConEx use the fine- grained ECN feedback mechanism. An RFC5681 TCP sender without ConEx: No fine-grained feedback is necessary here. The congestion control algorithm still reacts on only one signal per RTT. But it is best to have one generic feedback mechanism, whether it is used or not. 1.1. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119]. We use the following terminology from [RFC3168] and [RFC3540]: The ECN field in the IP header: CE: the Congestion Experienced codepoint, ECT(0): the first ECN-capable Transport codepoint, and ECT(1): the second ECN-capable Transport codepoint. The ECN flags in the TCP header: Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 5] Internet-Draft Requirements for More Accurate ECN August 2013 CWR: the Congestion Window Reduced flag, ECE: the ECN-Echo flag, and NS: ECN Nonce Sum. In this document, the ECN feedback scheme as specified in [RFC3168] is called the 'classic ECN' and any new proposal the 'fine-grained ECN feedback' scheme. A 'congestion mark' is defined as an IP packet where the CE codepoint is set. A 'congestion episode' refers to one or more congestion marks belonging to the same overload situation in the network (usually during one RTT). A TCP segment with the acknowledgment flag set is simply called an ACK. 2. Recap of Classic ECN and ECN Nonce in IP/TCP ECN requires two bits in the IP header. The ECN capability of a packet is indicated when either one of the two bits is set. An ECN sender can set one or the other bit to indicate an ECN-capable transport (ECT) which results in two signals, ECT(0) and ECT(1). A network node can set both bits simultaneously when it experiences congestion. When both bits are set the packet is regarded as "Congestion Experienced" (CE). In the TCP header the first two bits in byte 14 are defined as ECN feedback for each half-connection. A TCP receiver signals the reception of a congestion mark using the ECN-Echo (ECE) flag in the TCP header. For reliability, the receiver continues to set the ECE flag on every ACK. To enable the TCP receiver to determine when to stop setting the ECN-Echo flag, the sender sets the CWR flag upon reception of an ECE feedback signal. This always leads to a full RTT of ACKs with ECE set. Thus the receiver cannot signal back any additional CE markings arriving within the same RTT. The ECN Nonce [RFC3540] is an experimental addition to ECN that the TCP sender can use to protect itself against accidental or malicious concealment of marked or dropped packets. This addition defines the last bit of byte 13 in the TCP header as the Nonce Sum (NS) flag. The receiver maintains a nonce sum that counts the occurrence of ECT(1) packets, and signals the least significant bit of this sum on the NS flag. Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 6] Internet-Draft Requirements for More Accurate ECN August 2013 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | | | N | C | E | U | A | P | R | S | F | | Header Length | Reserved | S | W | C | R | C | S | S | Y | I | | | | | R | E | G | K | H | T | N | N | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ Figure 1: The (post-ECN Nonce) definition of the TCP header flags However, it is believed that the ECN nonce has never been deployed. Therefore, if a sender tried to protect itself with the nonce, any receiver wishing to conceal marked or dropped packets merely has to appear like all the other receivers that have not implemented the nonce, and simply not provide any nonce feedback. An alternative for a sender to assure feedback integrity has been proposed where the sender occasionally inserts an ECN mark or loss itself, and checks that the receiver feeds it back faithfully [I-D.moncaster-tcpm-rcv-cheat]. This alternative requires no standardisation and consumes no header bits or codepoints, as well as releasing the ECT(1) codepoint in the IP header and the NS flag in the TCP header for other uses. 3. Requirements At minimum, a new feedback scheme should deliver feedback no worse than classic ECN feedback. However, to be useful for e.g. ConEx or DCTCP, a fine-grained ECN feedback protocol will also need to be fairly accurate (not necessarily perfect), timely and amenable to integrity protection. This leads to the following requirements, which should all be addressed in the specification of a fine-grained ECN feedback scheme: Resilience The ECN feedback signal is carried within the ACK. Pure TCP ACKs can be lost without recovery. Therefore, a fine-grained ECN feedback extension has to take ACK loss into account. Timeliness A CE mark is induced by a network node on the transmission path and echoed by the receiver in the TCP ACK. Thus when this information arrives at the sender, it is naturally already about one RTT old. With a sufficient ACK rate a further delay of a small number of ACKs can be tolerated. However, this information will become stale with larger delays, given the dynamic nature of networks. TCP congestion control (which itself partly introduces these dynamics) operates on a time scale of one RTT. Thus, to be timely, Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 7] Internet-Draft Requirements for More Accurate ECN August 2013 congestion feedback information should be delivered within about one RTT. Integrity Given the problems with the ECN nonce identified above, this document only requires that the integrity of fine-grained ECN feedback can be assured; it does not require that the ECN nonce is the mechanism employed to achieve this. Indeed, it entertains the possibility that a fine-grained ECN feedback scheme might re-use the nonce sum (NS) flag in the TCP header. If fine-grained ECN feedback does re-use the NS flag, an alternative should be provided that assures the integrity of the feedback at least as well as the ECN nonce or that gives strong incentives for the receiver and network nodes to cooperate honestly. Accuracy Classic ECN feeds back one congestion notification per RTT, which is sufficient for classic TCP congestion control which reduces the sending rate at most once per RTT. The fine- grained ECN feedback scheme has to ensure that, if a congestion episode occurs, at least one congestion notification is echoed and received per RTT as classic ECN would do. Of course, the goal of this extension is to reconstruct the number of CE markings more accurately and in the best case even to reconstruct the exact number of payload bytes that a CE marked packet was carrying. However, a sender should not assume to get the exact number of congestion markings or marked bytes in all situations. Delayed ACKs are commonly used with TCP. That means in most cases only every second data packet triggers an ACK. Thus a fine-grained ECN feedback extension has to take delayed ACKs into account. In a high congestion situation where most of the packets are marked with CE, a fine-grained feedback mechanism must still be able to signal sufficient congestion information. Ideally, it would be possible for the sender to determine which of the packets covered by a delayed ACK were congestion marked, e.g. if the flow consists of packets of different sizes, or to allow for future protocols where the order of the markings may be important. Also, an ideal fine- grained feedback protocol would still work if delayed ACKs covered more than two packets. Complexity The implementation should be as simple as possible and only a minimum of additional state information should be needed. Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 8] Internet-Draft Requirements for More Accurate ECN August 2013 Overhead A fine-grained ECN feedback signal should limit the additional network load, because ECN feedback is ultimately not critical information (in the worst case, loss will still be available as a congestion signal of last resort). As feedback information has to be provided frequently and in a timely fashion, potentially all or a large fraction of TCP acknowledgments will carry this information. Ideally, no additional segments should be exchanged compared to an RFC3168 TCP session, and the overhead in each segment should be minimised. Backward and forward compatibility Given fine-grained ECN feedback will involve a change to the TCP protocol, it will need to be negotiated between the two TCP endpoints. If either end does not support fine-grained feedback, they should both be able to fall-back to classic ECN feedback. A fine-grained ECN feedback extension should aim to be able to traverse most existing middleboxes. Further, a feedback mechanism should provide a method to fall-back to classic RFC3168 signaling if the new signal is suppressed by certain middleboxes. In order to avoid a fork in the TCP protocol specifications, if experiments with the new fine-grained ECN feedback protocol are successful, it is intended to eventually update RFC3168 for any TCP/ECN sender, not just for ConEx or DCTCP senders. Therefore, even if only one ECN feedback signal per RTT is needed, it should be possible to use fine-grained ECN feedback. 4. Design Approaches The schemes proposed so far are outlined below. The main differentiator is their resilience in the face of loss of pure ACKs, which largely depends on the number of bits used for the encoding. 4.1. Re-use of ECN/NS Header Bits The three ECN header flags (ECE, CWR and NS) are re-used both during the TCP handshake for capability negotiation and during the subsequent TCP session for the receiver to signal the current value of its congestion signal counter. This approach provides resilience against ACK loss by repeating the CE counter on each ACK, but Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 9] Internet-Draft Requirements for More Accurate ECN August 2013 resilience against loss of a string of pure ACKs is limited, dependent on the number of bits used. Several codings have been proposed so far: o A one bit scheme sends one ECE for each CE received (to increase the robustness against ACK loss CWR could be used to introduce redundant information on the next ACK); o A 3-bit counter scheme continuously feeds back the three least significant bits of a CE counter; o A 3-bit codepoint scheme encodes either a CE counter or an ECT(1) counter in 8 codepoints. The proposed schemes provide accumulated information on ECN-CE- marking feedback, similar to the number of acknowledged bytes in the TCP header. Due to the limited number of bits the ECN feedback information will wrap much more often than the acknowledgement field. Thus feedback information could be lost due to a relatively small sequence of pure-ACK losses. Resilience could be increased by introducing redundancy, e.g. send each counter increase two or more times. Of course any of these additional mechanisms will increase the complexity. If the congestion rate is larger that the ACK rate (multiplied by the number of congestion marks that can be signaled per ACK), the congestion information cannot be correctly fed back. Thus an accurate ECN feedback mechanism needs to be able to cover the worst case situation where every packet is CE marked. This can potentially be realized by dynamically adapting the ACK rate and redundancy, which again increases complexity and perhaps the signaling overhead as well. 4.2. Using Other Header Bits As seen in Figure 1, there are currently three unused flag bits in the TCP header. The proposed 3 bit or codepoint schemes could be extended by one or more bits, to add higher resilience against ACK loss. The relative gain would be proportionally higher resilience against ACK loss, while the respective drawbacks would remain identical. Alternatively, the receiver could use bits in the Urgent Pointer field to signal more bits of its congestion signal counter, but only whenever it does not set the Urgent Flag. As this is often the case, resilience could be increased without additional header overhead. Any proposal to use such bits would need to check the likelihood that some middleboxes might discard or 'normalise' the currently unused Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 10] Internet-Draft Requirements for More Accurate ECN August 2013 flag bits or a non-zero Urgent Pointer when the Urgent Flag is cleared. 4.3. Using a TCP Option Alternatively, a new TCP option could be introduced, to help maintaining the accuracy and integrity of ECN feedback between receiver and sender. Such an option could provide higher resilience and even more information. For instance, ECN for RTP/UDP provides the explicit the number of ECT(0), ECT(1), CE, non-ECT marked and lost packets. However, deploying new TCP options has its own challenges. Moreover, to achieve a high resilience, this option would need to be carried by most or all ACKs, which would add considerable signaling overhead. Anyway, such a TCP option could be used in addition to a more accurate ECN feedback scheme in the TCP header or in addition to classic ECN, only when available and needed. 5. Acknowledgements 6. IANA Considerations This memo includes no request to IANA. 7. Security Considerations Given ECN feedback is used as input for congestion control, the respective algorithm would not react appropriately if fine-grained ECN feedback were lost and the resilience mechanism to recover it was inadequate. This resilience requirement is articulated in Section 3. However, it should be noted that fine-grained ECN feedback is not the last resort against congestion collapse, because if there is insufficient response to ECN, loss will ensue, and TCP will still react appropriately to loss. A receiver could suppress ECN feedback information leading to its connections consuming excess sender or network resources. This problem is similar to that seen with the classic ECN feedback scheme and should be addressed by integrity checking as required in Section 3. 8. References Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 11] Internet-Draft Requirements for More Accurate ECN August 2013 8.1. Normative References [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001. [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, June 2003. 8.2. Informative References [Ali10] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP: Efficient Packet Transport for the Commoditized Data Center", Jan 2010. [I-D.briscoe-tsvwg-re-ecn-tcp] Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-09 (work in progress), October 2010. [I-D.kuehlewind-tcpm-accurate-ecn-option] Kuehlewind, M. and R. Scheffenegger, "Accurate ECN Feedback Option in TCP", draft-kuehlewind-tcpm-accurate-ecn-option-01 (work in progress), July 2012. [I-D.moncaster-tcpm-rcv-cheat] Moncaster, T., "A TCP Test to Allow Senders to Identify Receiver Non-Compliance", draft-moncaster-tcpm-rcv-cheat-01 (work in progress), June 2007. [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996. [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 2009. [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009. Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 12] Internet-Draft Requirements for More Accurate ECN August 2013 [RFC5690] Floyd, S., Arcia, A., Ros, D., and J. Iyengar, "Adding Acknowledgement Congestion Control to TCP", RFC 5690, February 2010. [RFC6789] Briscoe, B., Woundy, R., and A. Cooper, "Congestion Exposure (ConEx) Concepts and Use Cases", RFC 6789, December 2012. Authors' Addresses Mirja Kuehlewind (editor) University of Stuttgart Pfaffenwaldring 47 Stuttgart 70569 Germany Email: mirja.kuehlewind@ikr.uni-stuttgart.de Richard Scheffenegger NetApp, Inc. Am Euro Platz 2 Vienna, 1120 Austria Phone: +43 1 3676811 3146 Email: rs@netapp.com Kuehlewind & Scheffenegger Expires February 14, 2014 [Page 13]