Congestion Exposure (ConEx) M. Kuehlewind, Ed. Internet-Draft ETH Zurich Intended status: Experimental R. Scheffenegger Expires: September 9, 2015 NetApp, Inc. March 8, 2015 TCP modifications for Congestion Exposure draft-ietf-conex-tcp-modifications-07 Abstract Congestion Exposure (ConEx) is a mechanism by which senders inform the network about expected congestion based on congestion feedback from previous packets in the same flow. This document describes the necessary modifications to use ConEx with the Transmission Control Protocol (TCP). Status of This Memo This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet- Drafts is at http://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on September 9, 2015. Copyright Notice Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 1] Internet-Draft TCP Modifications for ConEx March 2015 the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 18 1.1. Requirements Language . . . . . . . . . . . . . . . . . . 3 2. Sender-side Modifications . . . . . . . . . . . . . . . . . . 3 3. Counting congestion . . . . . . . . . . . . . . . . . . . . . 4 3.1. Loss Detection . . . . . . . . . . . . . . . . . . . . . 5 3.1.1. General Approach . . . . . . . . . . . . . . . . . . 6 3.1.2. Without SACK Support . . . . . . . . . . . . . . . . 6 3.2. ECN . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2.1. Accurate ECN feedback . . . . . . . . . . . . . . . . 9 3.2.2. Classic ECN support . . . . . . . . . . . . . . . . . 9 4. Setting the ConEx Flags . . . . . . . . . . . . . . . . . . . 10 4.1. Setting the E or the L Flag . . . . . . . . . . . . . . . 10 4.2. Setting the Credit Flag . . . . . . . . . . . . . . . . . 11 5. Loss of ConEx information . . . . . . . . . . . . . . . . . . 13 6. Timeliness of the ConEx Signals . . . . . . . . . . . . . . . 14 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 14 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 9. Security Considerations . . . . . . . . . . . . . . . . . . . 14 10. References . . . . . . . . . . . . . . . . . . . . . . . . . 15 10.1. Normative References . . . . . . . . . . . . . . . . . . 15 10.2. Informative References . . . . . . . . . . . . . . . . . 16 Appendix A. Revision history . . . . . . . . . . . . . . . . . . 17 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 20 1. Introduction Congestion Exposure (ConEx) is a mechanism by which senders inform the network about expected congestion based on congestion feedback from previous packets in the same flow. ConEx concepts and use cases are further explained in [RFC6789]. The abstract ConEx mechanism is explained in [draft-ietf-conex-abstract-mech]. This document describes the necessary modifications to use ConEx with the Transmission Control Protocol (TCP). The markings for ConEx signaling are defined in the ConEx Destination Option (CDO) for IPv6 [draft-ietf-conex-destopt]. Specifically, the use of four flags are defined: X (ConEx-capable), L (loss experienced), E (ECN experienced) and C (credit). ConEx signaling is based on loss or Explicit Congestion Notification (ECN) marks [RFC3168] as congestion indications. The sender collects this congestion information based on existing TCP feedback mechanisms from the receiver to the sender. No changes are needed at the Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 2] Internet-Draft TCP Modifications for ConEx March 2015 receiver to implement ConEx signaling. Therefore no additional negotiation is needed to implement and use ConEx at the sender. This document specifies the sender's actions that are needed to provide meaningful ConEx information to the network. Section 2 provides an overview of the modifications needed for TCP senders to implement ConEx. First congestion information has to be extracted from TCP's loss or ECN feedback as described in section 3. Section 4 details how to set the CDO marking based on this congestion information. Section 5 discusses loss of packets carrying ConEx information. Section 6 [CREF1]discusses timeliness of the ConEx feedback signal, given congestion is a temporary state. This document describes congestion accounting for TCP with and without the Selective Acknowledgment (SACK) extension [RFC2018] (in section 3.1). However, ConEx benefits from the more accurate information that SACK provides about the number of bytes dropped in the network. It is therefore preferable[CREF2] to use the SACK extension when using TCP with ConEx. The detailed mechanism to set the L flag in response to loss-based congestion feedback signal is given in section 4.1. Whereas loss has to be minimized, ECN can provide more fine-grained feedback information. ConEx-based traffic measurement or management mechanisms could benefit from this. Unfortunately, the current ECN feedback mechanism does not reflect multiple congestion markings if they occur within the same Round-Trip Time (RTT). A more accurate feedback extension to ECN (AccECN) is proposed in a separate document [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for other mechanisms. Congestion accounting for both classic ECN feedback and AccECN feedback is explained in detail in section 3.2. Setting the E flag in response to ECN-based congestion feedback is again detailed in section 4.1. 1.1. Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. 2. Sender-side Modifications This section gives an overview of actions that need to be taken by a TCP sender modified to use ConEx signaling. Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 3] Internet-Draft TCP Modifications for ConEx March 2015 In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN preferably with AccECN feedback. Therefore a ConEx sender MUST also implement SACK and ECN. Depending on the capability of the receiver, the following operation modes exist: +------+-----+ | SACK | ECN | +------+-----+ | S | A | | S | C | | S | - | | - | A | | - | C | | - | - | +------+-----+ S: SACK enabled; A: AccECN enabled; C: Classic ECN [RFC3168] enabled Table 1: ConEx modes. A ConEx sender MUST expose all congestion information to the network according to the congestion information received by ECN or based on loss information provided by the TCP feedback loop. A TCP sender SHOULD count congestion byte-wise (rather than packet-wise; see next paragraph). After any congestion notification, a sender MUST mark subsequent packets with the appropriate ConEx flag in the IP header. Furthermore, a ConEx sender must send enough credit to cover all experienced congestion for the connection so far, as well as the risk of congestion for the current transmission (see Section 4.2). With SACK the number of lost payload bytes is known, but not the number of packets carrying these bytes. With classic ECN only an indication is given that a marking occurred but not the exact number of payload bytes nor packets. As network congestion is usually byte- congestion [RFC7141], the byte-size of a packet marked with a CDO flag is defined to represent that number of bytes of congestion signalling [draft-ietf-conex-destopt]. Therefore the exact number of bytes should be taken into account, if available, to make the ConEx signal as exact as possible. Detailed mechanisms for congestion accounting in each operation mode are described in the next section. 3. Counting congestion A ConEx TCP sender maintains two counters: one that counts congestion based on the information retrieved by loss detection, and a second that accounts for ECN based congestion feedback. These counters hold Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 4] Internet-Draft TCP Modifications for ConEx March 2015 the number of outstanding bytes that should be ConEx marked with respectively the E flag or the L flag in subsequent packets. The outstanding bytes for congestion indications based on loss are added to the loss exposure gauge (LEG), as explained in Section 3.1. The outstanding bytes counted based on ECN feedback information are added to the congestion exposure gauge (CEG)as explained in Section 3.2. When the sender sends a ConEx capable packet with the E or L flag set it reduces the respective counter by the byte-size of the packet. This is explained for both counters in Section 4.1. Usually all bytes of an IP packet must be counted. Therefore the sender SHOULD take the payload and headers into account, up to and including the IP header. Therefore, as well as the TCP payload bytes, an appropriate number of header bytes SHOULD be added to the gauge for each packet of congestion feedback. And the sender SHOULD subtract header bytes from the gauge for each marked packet sent. If equal-sized packets, or at least equally distributed packet sizes can be assumed, the sender MAY only add and subtract TCP payload bytes,. In this case there should be about the same number of ConEx marked packets as the original packets that were causing the congestion. Thus both contain about the same number of header bytes so they will cancel out. This case is assumed for simplicity in the following sections. Otherwise, if a sender sends different sized packets (with unequally distributed packet sizes), the sender needs to memorize or estimate the number of lost or ECN-marked packets. A sender might be able to reconstruct the number of packets and thus the header bytes if the packet sizes of all packets that were sent during the last RTT are known. Otherwise, if no additional information is available, the conservative or even worst case number of packets and thus header bytes should be estimated, e.g. based on the minimum packet size (of all packets sent in the last RTT). If the number of ConEx marked packets is smaller (or larger) than the estimated number of lost or ECN-marked packets, the additional header bytes should be added to (or can be subtracted from) the respective counter.[CREF3] 3.1. Loss Detection Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 5] Internet-Draft TCP Modifications for ConEx March 2015 3.1.1. General Approach This section applies whether or not SACK support is available. The following section deals with the case when SACK is not available. TCP feedback is designed so that the sender can detect losses in order to retransmit the lost data. Therefore, it might be naively assumed that a TCP sender only needs to set the ConEx L flag on all retransmissions in order to signal the amount of bytes lost. However, this will not always be the case. Therefore the process of loss detection is described here and separately the process of ConEx marking is described in Section 4.1.[CREF4] A ConEx sender needs to[CREF5] maintain a local signed counter that shall be called the loss exposure gauge (LEG), indicating the number of outstanding bytes to be sent with the ConEx L flag. When a TCP sender decides that a data segment needs to be retransmitted, it will increase LEG by the size of the TCP payload bytes in the retransmission (assuming equal sized segments such that the retransmitted packet will have the same number of header bytes as the original ones). Any retransmission may be spurious. To accommodate that, a ConEx sender SHOULD make use of heuristics to detect such spurious retransmissions (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel [RFC3522], [RFC4015]). When such a heuristic has determined that a certain number of packets were retransmitted erroneously, the ConEx sender SHOULD subtract the payload size of these TCP packets from LEG.[CREF6] 3.1.2. Without SACK Support If multiple losses occur within one RTT and SACK is not used, it may take several RTTs until all lost data is retransmitted. With the scheme described above, the ConEx information will be delayed considerably, but timeliness is important for ConEx. For ConEx it is not important to know which data got lost but only how much.[CREF7] During the first RTT after the initial loss detection, the amount of received data and thus also the amount of lost data can be estimated based on the number of received ACKs. Thus without SACK, the information needed for ConEx feedback can be available with an additional delay of one RTT by using the following estimation algorithm and an additional Loss Estimation Counter (LEC): Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 6] Internet-Draft TCP Modifications for ConEx March 2015 flight_bytes: current flight size in bytes retransmit_bytes: payload size of the retransmission At the first retransmission in a congestion event LEC is set: LEC = flight_bytes - 3*SMSS (At this point of time in the transmission, in the worst case, all packets in flight minus three that trigged the dupACks could have been lost.) Then during the first RTT of the congestion event: For each retransmission: LEG += retransmit_bytes LEC -= retransmit_bytes For each ACK: LEC -= SMSS After one RTT: LEG += LEC (The LEC now estimates the number of outstanding bytes that should be ConEx L marked.) After the first RTT for each following retransmissions: if (LEC > 0): LEC -= retransmit_bytes else if (LEC==0): LEG += retransmit_bytes if (LEC < 0): LEG += -LEC (The LEG is not increased for those bytes that were already counted.) 3.2. ECN ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to mark packets with the Congestion Experienced (CE) mark instead of dropping them when congestion occurs. A receiver might support 'classic' ECN, the more accurate ECN feedback scheme (AccECN), or neither. In the case that ECN is not supported for a connection, of course, no ECN marks will occur; thus the sender will never set the E flag. Otherwise, a ConEx sender must Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 7] Internet-Draft TCP Modifications for ConEx March 2015 maintain a signed counter, the congestion exposure gauge (CEG), for the number of outstanding bytes that have to be ConEx marked with the E flag. The CEG is increased when ECN information is received from an ECN- capable receiver supporting the 'classic' ECN scheme or the accurate ECN feedback scheme. When the ConEx sender receives an ACK indicating one or more segments were received with a CE mark, CEG is increased by the appropriate number of bytes as described further below. Unfortunately in case of duplicate acknowledgements the number of newly acknowledged bytes will be zero even though (CE marked) data has been received. Therefore, we increase the CEG by DeliveredData, as defined below: DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS - (is_after_dup)*num_dup*1SMSS DeliveredData covers the number of bytes that has been newly delivered to the receiver. Therefore on each arrival of an ACK, DeliveredData will be increased by the newly acknowledged bytes (acked_bytes) as indicated by the current ACK, relative to all past ACKs. The formula depends on whether SACK is available, as follows: With SACK: DeliveredData is increased by the number of bytes provided by (new) SACK information (SACK_diff). Note, if less unacknowledged bytes are announced in the new SACK information than in the previous ACK, SACK_diff can be negative. In this case, data is newly acknowledged (in acked_bytes), that has previously already been accumulated into DeliveredData based on SACK information. Without SACK: DeliveredData is estimated to be 1 SMSS on duplicate acknowledgements. For the subsequent partial or full ACK, DeliveredData is estimated to be the newly acknowledged bytes, minus one SMSS for each preceding duplicate ACK. Therefore is_dup is one if the current ACK is a duplicated ACK without SACK, and zero otherwise. is_after_dup is only one for the next full or partial ACK after a number of duplicated ACKs without SACK and num_dup counts the number of duplicated ACKs in a row.[CREF8] With classic ECN, as soon as a CE mark is seen at the receiver, it will feed this information back to the sender by setting the Echo Congestion Experienced (ECE) flag in the TCP header of subsequent ACKs. Once the sender receives the first ECE of a congestion notification, it sets the CWR flag in the TCP header once. When this packet with Congestion Window Reduced (CWR) flag in the TCP header Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 8] Internet-Draft TCP Modifications for ConEx March 2015 arrives at the receiver, acknowledging its first ECE feedback, the receiver stops setting ECE. Thus, with classic ECN, one congestion marked packet causes continuous congestion feedback for a whole round trip, thus hiding the arrival of any further congestion marked packets during that round trip. The more accurate ECN feedback scheme (AccECN) has been defined to ensure that feedback properly reflects the extent of congestion marking. The two cases, with and without a receiver capable of AccECN, are discussed in the following sections. 3.2.1. Accurate ECN feedback With the [CREF9] more accurate ECN feedback scheme (AccECN) either the number of marked packets or the number of marked bytes is known. In the latter case the CEG can directly be increased by the number of marked bytes. Otherwise if D is assumed to be the number of marks, the gauge (CEG) will be conservatively increased by one SMSS for each marking or at max the number of newly acknowledged bytes: CEG += min(SMSS*D, DeliveredData) 3.2.2. Classic ECN support If the ConEx sender fully conforms to the semantics of ECN signaling as defined by [RFC5562],[CREF10] it will receive one full RTT of ACKs with the ECE flag set whenever at least one CE mark was received by the receiver. As the sender cannot estimate how many packets have actually been CE marked during this RTT, the most conservative assumption MAY be taken, namely assuming that all packets were marked. This can be achieved by increasing the CEG by DeliveredData for each ACK with the ECE flag: CEG += DeliveredData Optionally a ConEx sender could implement the following technique, called advanced compatibility mode, to considerably improve its estimate of the number of ECN-marked packets: To extract more than one ECE indication per RTT, a ConEx sender could set the CWR flag continuously to force the receiver to signal only one ECE per CE mark. Unfortunately, the use of delayed ACKs [RFC5681] (which is common) will prevent feedback of every CE mark; if a CWR confirmation is received before the ECE can be sent out on the next ACK, ECN feedback information could get lost. Thus a sender SHOULD set CWR only on those data segments that will actually trigger a (delayed) ACK. The sender would need an additional control loop to estimated which data segments will trigger an ACK in order to extract Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 9] Internet-Draft TCP Modifications for ConEx March 2015 more timely congestion notifications. Still the CEG SHOULD be increased by DeliveredData, as one or more CE marked packets could be acknowledged by one delayed ACK. The repetition of ECE in classic ECN is intended to ensure reliable delivery of congestion feedback. The following argument is intended to prove that suppressing repetitions of ECE is safe against possible congestion collapse due to lost congestion feedback. With advanced compatibility mode, if an ACK containing ECE is lost, the continual CWRs prevent it being repeated, so it will remain lost. Therefore, if congestion is light on the forward path and heavy on the reverse, most of the light congestion signals will be lost. If loss of feedback exacerbates congestion on the forward path, more forward packets will be CE marked, increasing the likelihood that feedback from at least one CE will get through per RTT. As long as one ECE reaches the sender per RTT, the sender's congestion response will be the same as if CWR were not continuous. The only way that heavy congestion on the forward path could be completely hidden would be if all ACKs on the reverse path were lost. If total ACK loss persisted, the sender would time out and do a congestion response anyway.Therefore, the problem seems confined to potential suppression of a congestion response during light congestion. Anyway, even if loss of all ECN feedback led to no congestion response, the worst that could happen would be loss instead of ECN- signalled congestion on the forward path. Given compatibility mode does not affect loss feedback, there would be no risk of congestion collapse. 4. Setting the ConEx Flags By setting the X flag, a packet is marked as ConEx-capable. All packets carrying payload MUST be marked with the X flag set, including retransmissions. No congestion feedback information is available about control packets such as pure ACKs which are not carrying any payload. Thus these packets should not be taken into account when determining ConEx information. These packet MUST carry a ConEx Destination Option with the X flag unset.[CREF11] 4.1. Setting the E or the L Flag As long as the LEG or CEG counter is positive, the sender MUST mark each ConEx-capable packet with L or E respectively, and decrease the LEG or CEG counter by the TCP payload bytes carried in the marked packet (assuming headers are not being counted because packet sizes are regular). No matter how small the value of LEG or CEG, if it is positive, to ensure ConEx signals are timely, the sender MUST NOT Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 10] Internet-Draft TCP Modifications for ConEx March 2015 defer packet marking. Therefore the value of LEG and CEG will commonly be negative. Multiple ConEx flags may be required for signaling at the same time. This may happen, for example, during excessive congestion when an ACK is received by the sender that simultaneously indicates that at least one segment has been lost, and that one or more ECN marks were received. Another case when this might happen is when ACKs are lost, so that a subsequent ACK carries summary information not previously available to the sender. Whenever both LEG and CEG are positive, the sender MUST mark each ConEx-capable packet with both L and E. If a credit signal is also pending (see Section 4.2), the C flag can be set as well. 4.2. Setting the Credit Flag The ConEx abstract mechanism [draft-ietf-conex-abstract-mech] requires that sufficient credit must be signaled in advance to cover the expected congestion during the feedback delay of one RTT. This section proposes concrete algorithms for determining how much credit to signal during congestion avoidance and slow start. However, experimentation in better credit setting algorithms is expected and encouraged. The wider goal of ConEx is to reflect the 'cost' of the risk of causing congestion on those that contribute most to it. Thus, experimentation is encouraged in better ways to improve or maintain performance while reducing the risk of causing congestion, and therefore reducing the need to signal so much credit. For a simple credit algorithm, a ConEx sender SHOULD maintain a counter of the sent credits c in bytes. If congestion occurs, credits will be consumed and the c counter SHOULD be reduced by the number of bytes that where lost or estimated to be ECN-marked. If the risk of congestion was estimated wrongly and thus too few credits were sent, the c counter becomes zero but cannot go negative. During TCP congestion avoidance, the amount of credit sent SHOULD exceed the amount of congestion experienced by at least the number of bytes in flight, as all packets could potentially get lost or congestion marked.[CREF12] Thus a ConEx sender should monitor the number of bytes in flight f. Whenever f becomes larger than c, the ConEx sender SHOULD set the C flag on each ConEx-capable packet and increase c by the size of each marked packet until it is no less than f again. Recall that c will be decreased whenever congestion occurs, therefore c will need to be replenished as soon as c drops below f. Also Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 11] Internet-Draft TCP Modifications for ConEx March 2015 recall that the sender can set the C flag on a ConEx-capable packet whether or not the E or L flags are also set. In TCP slow start, the congestion window might grow much larger than during the rest of the transmission. Thus a sender could consider sending fewer than f credits but risking being penalized by an audit function. In any case the credits SHOULD at least cover the increase in sending rate.[CREF13] Given the sending rate doubles every RTT in Slow Start, a ConEx sender should at least cover half the number of packets in flight by credits. Note that the number of losses or markings within one RTT does not solely depend on the sender's actions. In general, the behavior of the cross traffic, whether active queue management (AQM) is used and how it is parameterized influence how many packets might be dropped or marked. As long as any AQM encountered is not overly aggressive with ECN marking, sending half the flight size as credits should be sufficient whether congestion is signaled by loss or ECN. Marking C on every second packet in the initial window and every fourth packet in slow start will introduce the correct amount of credit as can be seen in Figure 1.[CREF14] This behaviour is most easily achieved by using the following formula to update c as every packet is sent during slow start: c = (f+1)/2, using integer division. Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 12] Internet-Draft TCP Modifications for ConEx March 2015 f c=(f+1)/2 RTT1 |------XC------>| 1 1 |------X------->| 2 1 |------XC------>| 3 2 | | RTT2 |------X------->| 3 2 |------X------->| 4 2 |------X------->| 4 2 |------XC------>| 5 3 |------X------->| 5 3 |------X------->| 6 3 | | RTT3 |------X------->| 6 3 |------XC------>| 7 4 |------X------->| 7 4 |------X------->| 8 4 |------X------->| 8 4 |------XC------>| 9 5 |------X------->| 9 5 |------X------->| 10 5 |------X------->| 10 5 |------XC------>| 11 6 |------X------->| 11 6 |------X------->| 12 6 | . | | : | Figure 1: Credits in Slow Start (with an initial window of 3) It is possible that a TCP flow will encounter an audit function without relevant flow state, due to e.g. rerouting or memory limitations. Therefore, the sender needs to detect this case and resend credits. Thus a ConEx sender should reset the credit count c to zero if losses occur in two subsequent RTTs (assuming that the sending rate was correctly reduced based on the received congestion signal). [CREF15] 5. Loss of ConEx information Packets carrying ConEx signals could be discarded themselves. This will be a second order problem (e.g. if the loss probability is 0.1%, the probability of losing a loss signal will be 0.1% of 0.1% = 0.0001%). Therefore, an implementer MAY choose to ignore this problem, accepting instead the risk that an audit function might slightly increase the loss level (e.g. from 0.1000% to 0.1001%). Nonetheless, a ConEx sender SHOULD remember which packet was marked with either the L, the E or the C flag. If one of these packets is Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 13] Internet-Draft TCP Modifications for ConEx March 2015 detected as lost, the sender SHOULD increase the respective gauge(s), LEG or CEG, by the number of lost payload bytes in addition to increasing LEG for the loss. 6. Timeliness of the ConEx Signals ConEx signals will only be useful to a network node within a time delay of about one RTT after the congestion occurred. To avoid further delays, a ConEx sender SHOULD send the ConEx signaling on the next available packet. Any or all of the ConEx flags can be used in the same packet, which allows delay to be minimised when multiple signals are pending. If a flow becomes application-limited, there could be insufficient bytes to send to reduce the gauges to zero or below. In such cases, the sender cannot help but delay ConEx signals. Nonetheless, as long as the sender is marking all outgoing packets, an audit function is unlikely to penalize ConEx-marked packets. Therefore, no matter how long a gauge has been positive, a sender MUST NOT reduce the gauge by more than the ConEx marked bytes it has sent. If the CEG or LEG counter is negative, the respective counter SHOULD be reset to zero within one RTT after it was decreased the last time or one RTT after recovery if no further congestion occurred. [CREF16] If SACK information is not available spurious retransmission are more likely. In this case it might be valuable to slightly delay the ConEx loss feedback until a spurious retransmission might be detected. But the ConEx signal MUST NOT be delayed more than one RTT if as long as data packets are sent out.[CREF17] 7. Acknowledgements The authors would like to thank Bob Briscoe who contributed with this initial ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback. Moreover, thanks to Jana Iyengar who provided valuable feedback. 8. IANA Considerations This document does not have any requests to IANA. 9. Security Considerations General ConEx security considerations are covered extensively in the ConEx abstract mechanism [draft-ietf-conex-abstract-mech]. This section covers TCP-specific concerns. Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 14] Internet-Draft TCP Modifications for ConEx March 2015 The ConEx modifications to TCP provide no mechanism for a receiver to force a sender not to use ConEx. A receiver can degrade the accuracy of ConEx by claiming that it does not support SACK, AccECN or ECN, but the sender will never have to turn ConEx off. The receiver cannot force the sender to have to mark ConEx more conservatively, in order to cover the risk of any inaccuracy. Instead the sender can choose to mark inaccurately, which will only increase the likelihood of loss at an audit function. Thus the receiver will only harm itself. Assuming the sender is limited in some way by a congestion allowance or quota, a receiver could spoof more loss or ECN congestion feedback than it actually experiences, in an attempt to make the sender draw down its allowance faster than necessary. However, over-declaring congestion simply makes the sender slow down. If the receiver is interested in the content it will not want to harm its own performance. However, if the receiver is solely interested in making the sender draw down its allowance, the net effect will depend on the sender's congestion control algorithm. With New Reno [RFC5681], doubling congestion feedback causes the sender to consume sqrt(2) = 1.4 times more congestion allowance. However, to improve scaling, congestion control algorithms are tending towards less responsive algorithms like Cubic or Compound TCP, and ultimately to linear algorithms like DCTCP [DCTCP]. In each case, if the receiver doubles congestion feedback, it causes the sender to respectively consume more allowance by a factor of 1.2, 1.15 or 1, where 1 implies the attack has become completely ineffective. 10. References 10.1. Normative References [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996. [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001. [RFC5681] Allman, M., Paxson, V., and E. Blanton, "TCP Congestion Control", RFC 5681, September 2009. Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 15] Internet-Draft TCP Modifications for ConEx March 2015 [draft-ietf-conex-abstract-mech] Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx) Concepts and Abstract Mechanism", draft-ietf-conex- abstract-mech-06 (work in progress), October 2012. [draft-ietf-conex-destopt] Krishnan, S., Kuehlewind, M., and C. Ucendo, "IPv6 Destination Option for ConEx", draft-ietf-conex-destopt-04 (work in progress), March 2013. 10.2. Informative References [DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP: Efficient Packet Transport for the Commoditized Data Center", Jan 2010. [I-D.briscoe-conex-re-ecn-tcp] Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", draft-briscoe-conex-re-ecn-tcp-04 (work in progress), July 2014. [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm for TCP", RFC 3522, April 2003. [RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream Control Transmission Protocol (SCTP) Duplicate Transmission Sequence Numbers (TSNs) to Detect Spurious Retransmissions", RFC 3708, February 2004. [RFC4015] Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm for TCP", RFC 4015, February 2005. [RFC5562] Kuzmanovic, A., Mondal, A., Floyd, S., and K. Ramakrishnan, "Adding Explicit Congestion Notification (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June 2009. [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP", RFC 5682, September 2009. [RFC6789] Briscoe, B., Woundy, R., and A. Cooper, "Congestion Exposure (ConEx) Concepts and Use Cases", RFC 6789, December 2012. Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 16] Internet-Draft TCP Modifications for ConEx March 2015 [RFC7141] Briscoe, B. and J. Manner, "Byte and Packet Congestion Notification", BCP 41, RFC 7141, February 2014. [draft-kuehlewind-tcpm-accurate-ecn] Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02 (work in progress), Jun 2013. Appendix A. Revision history RFC Editor: This section is to be removed before RFC publication. 00 ... initial draft, early submission to meet deadline. 01 ... refined draft, updated LEG "drain" from per-packet to RTT- based. 02 ... added Section 5 and expanded discussion about ECN interaction. 03 ... expanded the discussion around credit bits. 04 ... review comments of Jana addressed. (Change in full compliance mode.) 05 ... changes on Loss Detection without SACK, support of classic ECN and credit handling. Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 17] Internet-Draft TCP Modifications for ConEx March 2015 Editorial Comments [CREF1] BB: 'finally" here would mean "At last (sigh), here's what you've all been waiting for." :-) [CREF2] BB: Avoid 'recommended', which could be confused with the normative upper-cased word. The normative language later is good and sufficient. [CREF3] BB: I don't understand this last sentence. How does the sender suddenly know something it didn't know before? [CREF4] BB: I've added this sentence, but only to give you an excuse for having devised all this mechanism. However, I really don't know why you're going to all this trouble to be so accurate and timely. TCP never retransmits less data than is lost. And over the years TCP designers have been reducing the amount of unnecessary retransmission, and reducing retransmission delay. So I suggest we just mark retransmissions with the L flag. Done! No need even for a loss exposure gauge. ...If the sender is faced with insufficient information such that the universe of TCP designers has been unable to minimise unnecessary or delayed retransmissions, why try to do better than everyone has so far managed? Just accept that you will be over-declaring or sluggishly declaring ConEx. And assume that deployment of all the techniques to reduce late or spurious losses is proceeding, and we can walk on their shoulders. [CREF5] BB: I suggest removing MUST, because we cannot mandate a particular implementation technique. [CREF6] BB: If these mechanisms are being used, surely they will be being used to /prevent/ spurious retransmissions (not just count them but still retransmit anyway). So, if we increase LEG only when a retransmission actually occurs, is that not sufficient? [CREF7] BB: OK, I get that. But, as above, why worry about optimising a case that is becoming rare, because everyone recognised late retransmission was a problem, so SACK is pretty much universally deployed. Would you be unhappy if all this was deleted? Perhaps relegate to an appendix? But is it really so necessary? [CREF8] BB: I think 3 has been used instead of num_dup in the LEC algorithm earlier. [CREF9] BB: I changed 'a' to 'the'. Did you mean a generally more accurate scheme, or the AccECN scheme in particular? If the Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 18] Internet-Draft TCP Modifications for ConEx March 2015 latter, as it stands, the AccECN scheme doesn't give marked bytes. [CREF10] BB: Surely RFC5562 only adds ECT on the SYN/ACK. Is it really necessary to even refer to it in this draft? Whatever, it doesn't seem particularly relevant to this sentence. Or did you mean RFC3168? [CREF11] BB: I thought the result of the discussion about how to say whether the X flag is set in conex-destopt was that X is set irrespective of whether loss or ECN marking of the packet itself can be detected. The relevant sentence in conex-destopt is: "This [X=0] can be the case if no congestion feedback is (currently) available e.g. in TCP if one endpoint has been receiving data but sending nothing but pure ACKs (no user data) for some time." [CREF12] BB: I would prefer if this were stated at the maximum required, not a recommended value. The idea is to hold as much credit as the /likely/ worst-case congestion, not the /absolute/ worst case (I did experiments to find the variance of congestion in my PhD). [CREF13] BB: Again, rather than a SHOULD, can we make this a recommendation that is part of the reason for ConEx experimentation? - especially if variants like hybrid SS are enabled. [CREF14] BB: Just marking every fourth packet doesn't work for a general IW. During the IW, mark the first packet and every other packet, then after IW mark every fourth packet (to determine precisely which is the first packet to mark after the IW, maintain a packet counter and double it when IW ends). [CREF15] BB: Whoa! This is rather excessively conservative isn't it? There will often be a loss in 2 consecutive RTTs due to normal congestion. If there's a re-route, I think the new audit will drop a whole window, so the sender will naturally send a whole window's worth of credit with the retransmissions. Am I wrong? [CREF16] BB: This adds complexity. I would suggest this is a MAY. It depends on how audit is done whether it is necessary, so this will depend on experiments. For instance, in the audit function I designed, there was a long term and a short term comparison, and the long term one became more relaxed the longer the flow had been behaving. (Note I have also suggested moving this and the next para from "Setting E/L" to "Timeliness") Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 19] Internet-Draft TCP Modifications for ConEx March 2015 [CREF17] BB: As before, I disagree with the need for this para - this is trying to optimise a case that is rare because it's known to be sub-optimal, by compromising ConEx timeliness. SACK is nearly universal .If SACK isn't available, things are bound to be non- optimal. The solution is for the receiver to deploy SACK like nearly every other receiver has done, not to add more complexity to the sender and more delay to ConEx. Authors' Addresses Mirja Kuehlewind (editor) ETH Zurich Switzerland Email: mirja.kuehlewind@tik.ee.ethz.ch Richard Scheffenegger NetApp, Inc. Am Euro Platz 2 Vienna 1120 Austria Phone: +43 1 3676811 3146 Email: rs@netapp.com Kuehlewind & ScheffeneggExpires September 9, 2015 [Page 20]