< draft-ietf-conex-tcp-modifications-07.txt | draft-ietf-conex-tcp-modifications-07-bb.txt > | |||
---|---|---|---|---|
Congestion Exposure (ConEx) M. Kuehlewind, Ed. | Congestion Exposure (ConEx) M. Kuehlewind, Ed. | |||
Internet-Draft ETH Zurich | Internet-Draft ETH Zurich | |||
Intended status: Experimental R. Scheffenegger | Intended status: Experimental R. Scheffenegger | |||
Expires: August 18, 2015 NetApp, Inc. | Expires: September 9, 2015 NetApp, Inc. | |||
February 14, 2015 | March 8, 2015 | |||
TCP modifications for Congestion Exposure | TCP modifications for Congestion Exposure | |||
draft-ietf-conex-tcp-modifications-07 | draft-ietf-conex-tcp-modifications-07 | |||
Abstract | Abstract | |||
Congestion Exposure (ConEx) is a mechanism by which senders inform | Congestion Exposure (ConEx) is a mechanism by which senders inform | |||
the network about the congestion encountered by previous packets on | the network about expected congestion based on congestion feedback | |||
the same flow. This document describes the necessary modifications | from previous packets in the same flow. This document describes the | |||
to use ConEx with the Transmission Control Protocol (TCP). | necessary modifications to use ConEx with the Transmission Control | |||
Protocol (TCP). | ||||
Status of This Memo | Status of This Memo | |||
This Internet-Draft is submitted in full conformance with the | This Internet-Draft is submitted in full conformance with the | |||
provisions of BCP 78 and BCP 79. | provisions of BCP 78 and BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF). Note that other groups may also distribute | Task Force (IETF). Note that other groups may also distribute | |||
working documents as Internet-Drafts. The list of current Internet- | working documents as Internet-Drafts. The list of current Internet- | |||
Drafts is at http://datatracker.ietf.org/drafts/current/. | Drafts is at http://datatracker.ietf.org/drafts/current/. | |||
Internet-Drafts are draft documents valid for a maximum of six months | Internet-Drafts are draft documents valid for a maximum of six months | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
This Internet-Draft will expire on August 18, 2015. | This Internet-Draft will expire on September 9, 2015. | |||
Copyright Notice | Copyright Notice | |||
Copyright (c) 2015 IETF Trust and the persons identified as the | Copyright (c) 2015 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | document authors. All rights reserved. | |||
This document is subject to BCP 78 and the IETF Trust's Legal | This document is subject to BCP 78 and the IETF Trust's Legal | |||
Provisions Relating to IETF Documents | Provisions Relating to IETF Documents | |||
(http://trustee.ietf.org/license-info) in effect on the date of | (http://trustee.ietf.org/license-info) in effect on the date of | |||
publication of this document. Please review these documents | publication of this document. Please review these documents | |||
carefully, as they describe your rights and restrictions with respect | carefully, as they describe your rights and restrictions with respect | |||
to this document. Code Components extracted from this document must | to this document. Code Components extracted from this document must | |||
include Simplified BSD License text as described in Section 4.e of | include Simplified BSD License text as described in Section 4.e of | |||
the Trust Legal Provisions and are provided without warranty as | the Trust Legal Provisions and are provided without warranty as | |||
described in the Simplified BSD License. | described in the Simplified BSD License. | |||
Table of Contents | Table of Contents | |||
1. Introduction | 1. Introduction | |||
1.1. Requirements Language | 1.1. Requirements Language | |||
2. Sender-side Modifications | 2. Sender-side Modifications | |||
3. Accounting congestion | 3. Counting congestion | |||
3.1. Loss Detection | 3.1. Loss Detection | |||
3.1.1. Without SACK Support | 3.1.1. General Approach | |||
3.1.2. Without SACK Support | ||||
3.2. ECN | 3.2. ECN | |||
3.2.1. Accurate ECN feedback | 3.2.1. Accurate ECN feedback | |||
3.2.2. Classic ECN support | 3.2.2. Classic ECN support | |||
4. Setting the ConEx Bits | 4. Setting the ConEx Bits | |||
4.1. Setting the E and the L Bit | 4.1. Setting the E or the L Flag | |||
4.2. Credit Bits | 4.2. Setting the Credit Flag | |||
5. Loss of ConEx information | 5. Loss of ConEx information | |||
6. Timeliness of the ConEx Signals | 6. Timeliness of the ConEx Signals | |||
7. Acknowledgements | 7. Acknowledgements | |||
8. IANA Considerations | 8. IANA Considerations | |||
9. Security Considerations | 9. Security Considerations | |||
10. References | 10. References | |||
10.1. Normative References | 10.1. Normative References | |||
10.2. Informative References | 10.2. Informative References | |||
Appendix A. Revision history | Appendix A. Revision history | |||
Authors' Addresses | Authors' Addresses | |||
1. Introduction | 1. Introduction | |||
Congestion Exposure (ConEx) is a mechanism by which senders inform | Congestion Exposure (ConEx) is a mechanism by which senders inform | |||
the network about the congestion encountered by previous packets on | the network about expected congestion based on congestion feedback | |||
the same flow. ConEx concepts and use cases are further explained in | from previous packets in the same flow. ConEx concepts and use cases | |||
[RFC6789]. The abstract ConEx mechanism is explained in | are further explained in [RFC6789]. The abstract ConEx mechanism is | |||
[draft-ietf-conex-abstract-mech]. This document describes the | explained in [draft-ietf-conex-abstract-mech]. This document | |||
necessary modifications to use ConEx with the Transmission Control | describes the necessary modifications to use ConEx with the | |||
Protocol (TCP). | Transmission Control Protocol (TCP). | |||
The needed markings to provide ConEx signaling are defined in the | The markings for ConEx signaling are defined in the ConEx Destination | |||
ConEx Destination Option (CDO) for IPv6 [draft-ietf-conex-destopt]. | Option (CDO) for IPv6 [draft-ietf-conex-destopt]. Specifically, the | |||
Specifically, the use of four bits is defined: the X (ConEx- | use of four flags is defined: X (ConEx-capable), L (loss | |||
capable), the L (loss experienced), the E (ECN experienced) and C | experienced), E (ECN experienced) and C (credit). | |||
(credit) bit. | ||||
ConEx signaling is based on loss or Explicit Congestion Notification | ConEx signaling is based on loss or Explicit Congestion Notification | |||
(ECN) marks [RFC3168] as a congestion indication. This congestion | (ECN) marks [RFC3168] as congestion indications. The sender collects | |||
information is retrieved by the sender based on existing feedback | this congestion information based on existing TCP feedback mechanisms | |||
mechanisms from the receiver to the sender in TCP. No changes are | from the receiver to the sender. No changes are needed at the | |||
needed at the receiver to implement ConEx signaling. Therefore no | receiver to implement ConEx signaling. Therefore no additional | |||
additional negotiation is needed to implement and use ConEx at the | negotiation is needed to implement and use ConEx at the sender. This | |||
sender. This document specifies actions needed by sender to provide | document specifies the sender's actions that are needed to provide | |||
meaningful ConEx information to the network. | meaningful ConEx information to the network. | |||
Section 2 provides an overview of the needed modifications for TCP | Section 2 provides an overview of the modifications needed for TCP | |||
senders to implement ConEx. First congestion information have to be | senders to implement ConEx. First congestion information has to be | |||
extracted from loss or ECN feedback in TCP as described in section 3. | extracted from TCP's loss or ECN feedback as described in section 3. | |||
Section 4 details how to set the CDO marking based on the accounted | Section 4 details how to set the CDO marking based on this congestion | |||
congestion information. Section 6 finally discusses timeliness of | information. Section 5 discusses loss of packets carrying ConEx | |||
the ConEx feedback signal as congestion is a temporary state. | information. Section 6 [CREF1]discusses timeliness of the ConEx | |||
feedback signal, given congestion is a temporary state. | ||||
This document describes congestion accounting for both TCP with and | This document describes congestion accounting for TCP with and | |||
without the Selective Acknowledgment (SACK) extension [RFC2018] in | without the Selective Acknowledgment (SACK) extension [RFC2018] (in | |||
section 3.1. However, ConEx benefits from more accurate information | section 3.1). However, ConEx benefits from the more accurate | |||
about the number of packets dropped in the network. It is therefore | information that SACK provides about the number of bytes dropped in | |||
recommended to use the SACK extension when using TCP with ConEx. The | the network. It is therefore preferable[CREF2] to use the SACK | |||
detailed mechanism to respectively set the L bit in response to loss- | extension when using TCP with ConEx. The detailed mechanism to set | |||
based congestion feedback signal is given in section 4.1. | the L flag in response to loss-based congestion feedback signal is | |||
given in section 4.1. | ||||
While loss-based congestion feedback should be minimized, ECN could | Whereas loss has to be minimized, ECN can provide more fine-grained | |||
actually provide more fine-grained feedback information. ConEx-based | feedback information. ConEx-based traffic measurement or management | |||
traffic measurement or management mechanisms would benefit from this. | mechanisms could benefit from this. Unfortunately, the current ECN | |||
Unfortunately, the current ECN feedback mechanism does not reflect | feedback mechanism does not reflect multiple congestion markings if | |||
multiple congestion markings which occur within the same Round-Trip | they occur within the same Round-Trip Time (RTT). A more accurate | |||
Time (RTT). A more accurate feedback extension to ECN is proposed in | feedback extension to ECN (AccECN) is proposed in a separate document | |||
a separate document [draft-kuehlewind-tcpm-accurate-ecn], as this is | [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for | |||
also useful for other mechanisms. | other mechanisms. | |||
The congestion accounting for both, with the classic ECN feedback as | Congestion accounting for both classic ECN feedback and AccECN | |||
well as a more accurate ECN feedback are explained in detail in | feedback is explained in detail in section 3.2. Setting the E flag | |||
section 3.2 while the setting of the E bit in response to ECN-based | in response to ECN-based congestion feedback is again detailed in | |||
congestion feedback is again detailed in section 4.1. | section 4.1. | |||
1.1. Requirements Language | 1.1. Requirements Language | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
2. Sender-side Modifications | 2. Sender-side Modifications | |||
This section gives an overview of actions that need to be taken by a | This section gives an overview of actions that need to be taken by a | |||
TCP sender that would like to use ConEx signaling. | TCP sender modified to use ConEx signaling. | |||
A ConEx sender MUST negotiate for both SACK and ECN or the more | In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN | |||
accurate ECN feedback in the TCP handshake if these TCP extension are | preferably with AccECN feedback. Therefore a ConEx sender MUST also | |||
available at the sender. Therefore a ConEx sender SHOULD also | ||||
implement SACK and ECN. Depending on the capability of the receiver, | implement SACK and ECN. Depending on the capability of the receiver, | |||
the following operation modes exist: | the following operation modes exist: | |||
o SACK-accECN-ConEx (SACK and accurate ECN feedback) | +------+-----+ | |||
| SACK | ECN | | ||||
o accECN-ConEx (no SACK but accurate ECN feedback) | +------+-----+ | |||
| S | A | | ||||
o ECN-ConEx (no SACK and no accurate ECN feedback but 'classic' ECN) | | S | C | | |||
| S | - | | ||||
o SACK-ECN-ConEx (SACK and 'classic' instead of accurate ECN) | | - | A | | |||
| - | C | | ||||
| - | - | | ||||
+------+-----+ | ||||
o SACK-ConEx (SACK but no ECN at all) | S: SACK enabled; A: AccECN enabled; C: Classic ECN [RFC3168] enabled | |||
o Basic-ConEx (neither SACK nor ECN) | Table 1: ConEx modes. | |||
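As an illustration only, the operation modes listed above can be derived from what the TCP handshake negotiated. The following Python sketch is not part of the draft: the function name conex_mode and the string values used for the ECN feedback type are assumptions made for this example, while the returned mode names follow the list above.

```python
from typing import Optional


def conex_mode(sack_enabled: bool, ecn_feedback: Optional[str]) -> str:
    """Return the ConEx operation mode name for the negotiated capabilities.

    ecn_feedback is a hypothetical encoding chosen for this sketch:
    "accurate" (AccECN), "classic" (RFC 3168 ECN), or None (no ECN).
    """
    if ecn_feedback not in (None, "classic", "accurate"):
        raise ValueError("unknown ECN feedback type: %r" % (ecn_feedback,))

    parts = []
    if sack_enabled:
        parts.append("SACK")
    if ecn_feedback == "accurate":
        parts.append("accECN")
    elif ecn_feedback == "classic":
        parts.append("ECN")
    if not parts:
        # Neither SACK nor ECN was negotiated.
        parts.append("Basic")
    return "-".join(parts + ["ConEx"])
```

For example, a connection with SACK and classic ECN maps to "SACK-ECN-ConEx", and one with neither extension maps to "Basic-ConEx".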
A ConEx sender MUST expose all congestion information to the network | A ConEx sender MUST expose all congestion information to the network | |||
according to the congestion information received by ECN or based on | according to the congestion information received by ECN or based on | |||
loss information provided by the TCP feedback loop. A TCP sender | loss information provided by the TCP feedback loop. A TCP sender | |||
SHOULD account congestion byte-wise (and not packet-wise). A sender | SHOULD count congestion byte-wise (rather than packet-wise; see next | |||
MUST mark subsequent packets (after the congestion notification) with | paragraph). After any congestion notification, a sender MUST mark | |||
the respective ConEx bit in the IP header. Furthermore, a ConEx | subsequent packets with the appropriate ConEx flag in the IP header. | |||
sender must send enough credit to cover all experienced congestion | Furthermore, a ConEx sender must send enough credit to cover all | |||
for the connection so far, as well as the risk of congestion for the | experienced congestion for the connection so far, as well as the risk | |||
current transmission (see Section 4.2). | of congestion for the current transmission (see Section 4.2). | |||
With SACK only the number of lost payload bytes is known, but not the | With SACK the number of lost payload bytes is known, but not the | |||
number of packets carrying these bytes. With classic ECN only an | number of packets carrying these bytes. With classic ECN only an | |||
indication is given that a marking occurred but not the exact number | indication is given that a marking occurred but not the exact number | |||
of payload bytes nor packets. As network congestion is usually byte- | of payload bytes nor packets. As network congestion is usually byte- | |||
congestion [draft-briscoe-tsvwg-byte-pkt-mark], the exact number of | congestion [RFC7141], the byte-size of a packet marked with a CDO | |||
flag is defined to represent that number of bytes of congestion | ||||
signalling [draft-ietf-conex-destopt]. Therefore the exact number of | ||||
bytes should be taken into account, if available, to make the ConEx | bytes should be taken into account, if available, to make the ConEx | |||
signal as exact as possible. | signal as exact as possible. | |||
Detailed mechanisms for congestion accounting in each operation mode | Detailed mechanisms for congestion accounting in each operation mode | |||
are described in the next section. Further handling of the IPv6 bits | are described in the next section. | |||
itself if congestion was accounted is described in the subsequent | ||||
section afterwards. | ||||
3. Accounting congestion | 3. Counting congestion | |||
A ConEx sender maintains two counters: one that accounts congestion | A ConEx TCP sender maintains two counters: one that counts congestion | |||
based on the information retrived by loss detection, and a second | based on the information retrieved by loss detection, and a second | |||
that accounts for ECN based congestion feedback (in TCP). These | that accounts for ECN based congestion feedback. These counters hold | |||
counters hold the number of outstanding bytes that should be ConEx | the number of outstanding bytes that should be ConEx marked with | |||
marked either with the E bit or the L bit in subsequent packets. | respectively the E flag or the L flag in subsequent packets. | |||
The outstanding bytes for congestion indications based on loss are | The outstanding bytes for congestion indications based on loss are | |||
maintained in the loss exposure gauge (LEG) and the accounting is | added to the loss exposure gauge (LEG), as explained in Section 3.1. | |||
explained in Section 3.1. | ||||
The outstanding bytes accounted based on ECN feedback information are | The outstanding bytes counted based on ECN feedback information are | |||
maintained in the congestion exposure gauge (CEG). The accounting of | added to the congestion exposure gauge (CEG) as explained in | |||
these bytes from the ECN feedback is explained in more detail next in | ||||
Section 3.2. | Section 3.2. | |||
Furthermore, those counters will be reduced every time a ConEx | When the sender sends a ConEx capable packet with the E or L flag set | |||
capable packet with the E or L bit set is sent. This is explained | it reduces the respective counter by the byte-size of the packet. | |||
for both counters in Section 4.1. | This is explained for both counters in Section 4.1. | |||
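The bookkeeping described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the draft's implementation: the class name ConExGauges and its method names are hypothetical, and the gauges are modelled as plain signed integers counted in bytes.

```python
from dataclasses import dataclass


@dataclass
class ConExGauges:
    """Two signed byte counters kept by a ConEx sender (names are hypothetical)."""

    leg: int = 0  # loss exposure gauge: bytes still to be sent with the L flag
    ceg: int = 0  # congestion exposure gauge: bytes still to be sent with the E flag

    def add_loss(self, nbytes: int) -> None:
        # Congestion learned through loss detection.
        self.leg += nbytes

    def add_ecn(self, nbytes: int) -> None:
        # Congestion learned through ECN feedback.
        self.ceg += nbytes

    def on_marked_packet_sent(self, flag: str, packet_bytes: int) -> None:
        # Sending a ConEx-marked packet reduces the matching gauge
        # by the byte-size of that packet.
        if flag == "L":
            self.leg -= packet_bytes
        elif flag == "E":
            self.ceg -= packet_bytes
        else:
            raise ValueError("flag must be 'L' or 'E'")
```

Because the gauges are signed, marking a packet larger than the outstanding amount simply drives the counter below zero, so later feedback is offset against bytes that were already exposed.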
Usually all bytes of an IP packet must be accounted. Therefore the | Usually all bytes of an IP packet must be counted. Therefore the | |||
sender SHOULD take the headers into account, too. If equal sized | sender SHOULD take the payload and headers into account, up to and | |||
packets, or at least equally distributed packet sizes can be assumed, | including the IP header. Therefore, as well as the TCP payload | |||
the sender MAY only account the TCP payload bytes. In this case | bytes, an appropriate number of header bytes SHOULD be added to the | |||
there should be about the same number of ConEx marked packets as the | gauge for each packet of congestion feedback. And the sender SHOULD | |||
original packets that were causing the congestion. Thus both contain | subtract header bytes from the gauge for each marked packet sent. | |||
about the same number of header bytes. This case is assumed for | ||||
simplification in the following sections. | ||||
Otherwise if this is not the case and a sender sends different sized | If equal-sized packets, or at least equally distributed packet sizes | |||
packets (with unequally distributed packet sizes), the sender needs | can be assumed, the sender MAY only add and subtract TCP payload | |||
to memorize or estimate the number of ECN-marked or lost packets. A | bytes. In this case there should be about the same number of ConEx | |||
sender might be able to reconstruct the number of packets and thus | marked packets as the original packets that were causing the | |||
the header bytes if the packet sizes of all packets that were sent | congestion. Thus both contain about the same number of header bytes | |||
during the last RTT are known. Otherwise if no additional | so they will cancel out. This case is assumed for simplicity in the | |||
information is available the worst case number of packets and thus | following sections. | |||
header bytes should be estimated in a conservative way based on a | ||||
minimum packet size (of all packets sent in the last RTT). If the | Otherwise, if a sender sends different sized packets (with unequally | |||
number of ConEx marked packets is smaller (or larger) than the | distributed packet sizes), the sender needs to memorize or estimate | |||
estimated number of ECN-marked or lost packets, the additional header | the number of lost or ECN-marked packets. A sender might be able to | |||
bytes should the added to (or can be subtracted from) the respective | reconstruct the number of packets and thus the header bytes if the | |||
counter. | packet sizes of all packets that were sent during the last RTT are | |||
known. Otherwise, if no additional information is available, the | ||||
conservative or even worst case number of packets and thus header | ||||
bytes should be estimated, e.g. based on the minimum packet size (of | ||||
all packets sent in the last RTT). If the number of ConEx marked | ||||
packets is smaller (or larger) than the estimated number of lost or | ||||
ECN-marked packets, the additional header bytes should be added to | ||||
(or can be subtracted from) the respective counter.[CREF3] | ||||
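One conservative reading of the estimate described above is sketched below in Python. The function name, its parameters, and the use of a fixed per-packet header size are assumptions made for this example; the text itself only states that the estimate should be based on the minimum packet size seen in the last RTT.

```python
def estimate_header_bytes(congested_payload_bytes: int,
                          min_payload_bytes: int,
                          header_bytes_per_packet: int) -> int:
    """Worst-case header bytes for a given amount of lost or ECN-marked payload.

    Assumes every affected packet was as small as the smallest packet sent
    in the last RTT, which maximises the packet count and thus the headers.
    """
    if min_payload_bytes <= 0:
        raise ValueError("min_payload_bytes must be positive")
    # Ceiling division without floating point.
    packets = -(-congested_payload_bytes // min_payload_bytes)
    return packets * header_bytes_per_packet
```

The result would be added to the respective gauge on top of the payload bytes; if fewer header bytes turn out to be needed, the difference can be subtracted again as the text describes.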
3.1. Loss Detection | 3.1. Loss Detection | |||
3.1.1. General Approach | ||||
A ConEx sender MUST maintain a loss exposure gauge (LEG), indicating | This section applies whether or not SACK support is available. The | |||
the number of outstanding bytes that must be sent with the ConEx L | following section deals with the case when SACK is not available. | |||
bit. When a data segment is retransmitted, LEG will be increased by | ||||
the size of the TCP payload bytes contained by the retransmission, | TCP feedback is designed so that the sender can detect losses in | |||
assuming equal sized segments such that the retransmitted packet will | order to retransmit the lost data. Therefore, it might be naively | |||
have the same number of header bytes as the original ones. | assumed that a TCP sender only needs to set the ConEx L flag on all | |||
retransmissions in order to signal the amount of bytes lost. | ||||
However, this will not always be the case. Therefore the process of | ||||
loss detection is described here and separately the process of ConEx | ||||
marking is described in Section 4.1.[CREF4] | ||||
A ConEx sender needs to[CREF5] maintain a local signed counter that | ||||
shall be called the loss exposure gauge (LEG), indicating the number | ||||
of outstanding bytes to be sent with the ConEx L flag. When a TCP | ||||
sender decides that a data segment needs to be retransmitted, it will | ||||
increase LEG by the size of the TCP payload bytes in the | ||||
retransmission (assuming equal sized segments such that the | ||||
retransmitted packet will have the same number of header bytes as the | ||||
original ones). | ||||
Any retransmission may be spurious. To accommodate that, a ConEx | Any retransmission may be spurious. To accommodate that, a ConEx | |||
sender SHOULD make use of heuristics to detect such spurious | sender SHOULD make use of heuristics to detect such spurious | |||
retransmissions (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel | retransmissions (e.g. F-RTO [RFC5682], DSACK [RFC3708], and Eifel | |||
[RFC3522], [RFC4015]). When such a heuristic has determined, that a | [RFC3522], [RFC4015]). When such a heuristic has determined that a | |||
certain number of packets were retransmitted erroneously, the ConEx | certain number of packets were retransmitted erroneously, the ConEx | |||
sender should subtract the payload size of these TCP packets from | sender SHOULD subtract the payload size of these TCP packets from | |||
LEG. | LEG.[CREF6] | |||
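The interplay between retransmissions and spurious-retransmission detection can be sketched as follows. This is a hedged Python illustration, not the draft's method: the class LossAccounting, the per-sequence-number record, and the method names are all hypothetical. How a retransmission is judged spurious (F-RTO, DSACK, Eifel) is outside the sketch.

```python
class LossAccounting:
    """Tracks LEG updates per retransmitted segment so they can be undone."""

    def __init__(self) -> None:
        self.leg = 0        # loss exposure gauge, in TCP payload bytes
        self._counted = {}  # sequence number -> payload bytes added to LEG

    def on_retransmission(self, seq: int, payload_bytes: int) -> None:
        # Each retransmission increases LEG by the retransmitted payload size.
        self.leg += payload_bytes
        self._counted[seq] = self._counted.get(seq, 0) + payload_bytes

    def on_spurious(self, seq: int) -> int:
        # A heuristic decided this retransmission was unnecessary:
        # subtract what was added for it and report the amount undone.
        undone = self._counted.pop(seq, 0)
        self.leg -= undone
        return undone
```

Keeping a per-segment record is a design choice of this sketch; it ensures that a spurious verdict removes exactly the bytes that were added, and that a repeated verdict for the same segment has no further effect.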
3.1.1. Without SACK Support | 3.1.2. Without SACK Support | |||
If multiple losses occur within one RTT and SACK is not used, it may | If multiple losses occur within one RTT and SACK is not used, it may | |||
take several RTTs until all lost data is retransmitted. With the | take several RTTs until all lost data is retransmitted. With the | |||
scheme described above, the ConEx information will be delayed | scheme described above, the ConEx information will be delayed | |||
strongly but timeliness is important for ConEx. | considerably, but timeliness is important for ConEx. | |||
For ConEx it is not important to know which data got lost but only | For ConEx it is not important to know which data got lost but only | |||
how much. During the first RTT after the initial loss detection, the | how much.[CREF7] During the first RTT after the initial loss | |||
amount of received data and thus also the amount of lost data can be | detection, the amount of received data and thus also the amount of | |||
estimated based on the number of received ACKs. Thus without SACK, | lost data can be estimated based on the number of received ACKs. | |||
the needed information for the ConEx feedback can be available with | Thus without SACK, the information needed for ConEx feedback can be | |||
an additionally delay of one RTT by using the following estimation | available with an additional delay of one RTT by using the following | |||
algorithm and an additional Loss Estimation Counter (LEC): | estimation algorithm and an additional Loss Estimation Counter (LEC): | |||
flight_bytes: current flight size in bytes | flight_bytes: current flight size in bytes | |||
retransmit_bytes: payload size of the retransmission | retransmit_bytes: payload size of the retransmission | |||
At the first retransmission in a congestion event LEC is set: | At the first retransmission in a congestion event LEC is set: | |||
LEC = flight_bytes - 3*SMSS | LEC = flight_bytes - 3*SMSS | |||
(At this point of time in the transmission, in the worst case, | (At this point of time in the transmission, in the worst case, | |||
all packets in flight minus three that triggered the dupACKs | all packets in flight minus three that triggered the dupACKs | |||
skipping to change at page 7, line 13 | skipping to change at page 7, line 40 | |||
that should be ConEx L marked.) | that should be ConEx L marked.) | |||
After the first RTT, for each following retransmission: | After the first RTT, for each following retransmission: | |||
if (LEC > 0): LEC -= retransmit_bytes | if (LEC > 0): LEC -= retransmit_bytes | |||
else if (LEC==0): LEG += retransmit_bytes | else if (LEC==0): LEG += retransmit_bytes | |||
if (LEC < 0): LEG += -LEC | if (LEC < 0): LEG += -LEC | |||
(The LEG is not increased for those bytes that were | (The LEG is not increased for those bytes that were | |||
already accounted.) | already counted.) | |||
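The two steps of the estimation algorithm that are visible in this comparison (the initial LEC assignment and the rule applied after the first RTT) can be sketched in Python as below. The steps in between are not shown here and are therefore not modelled. The class and function names are hypothetical, and resetting LEC to zero once its negative remainder has been moved into LEG is an assumption of this sketch that the text does not state.

```python
from dataclasses import dataclass


@dataclass
class LossEstimator:
    leg: int = 0  # loss exposure gauge (bytes to be L-marked)
    lec: int = 0  # Loss Estimation Counter (bytes)


def on_first_retransmission(state: LossEstimator, flight_bytes: int, smss: int) -> None:
    # Worst case: everything in flight was lost except the three
    # segments that triggered the duplicate ACKs.
    state.lec = flight_bytes - 3 * smss


def on_later_retransmission(state: LossEstimator, retransmit_bytes: int) -> None:
    # Applies to retransmissions after the first RTT of the congestion event.
    if state.lec > 0:
        # These bytes were already covered by the estimate.
        state.lec -= retransmit_bytes
    elif state.lec == 0:
        state.leg += retransmit_bytes
    if state.lec < 0:
        # Only the part that exceeds the estimate is new congestion.
        state.leg += -state.lec
        state.lec = 0  # assumption: avoid counting the overshoot twice
```

With LEC at 1000 bytes, a 1460-byte retransmission adds only the 460 bytes that were not yet counted; any further retransmission then adds its full payload size to LEG.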
3.2. ECN | 3.2. ECN | |||
ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to | ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to | |||
mark packets with the Congestion Experienced (CE) mark instead of | mark packets with the Congestion Experienced (CE) mark instead of | |||
(early) dropping them when congestion occurs. As soon as a CE mark | dropping them when congestion occurs. | |||
is seen at the receiver, with classic ECN it will feed this | ||||
information back to the sender by setting the Echo Congestion | ||||
Experienced (ECE) bit in the TCP header of all subsequent ACKs until | ||||
a packet with Congestion Window Reduced (CWR) bit in the TCP header | ||||
is received to acknowledge the reception of the congestion | ||||
notification. The sender sets the CWR bit in the TCP header once | ||||
when the first ECE of a congestion notification is received. | ||||
A receiver can support 'classic' ECN, a more accurate ECN feedback | A receiver might support 'classic' ECN, the more accurate ECN | |||
scheme, or neither. In the case ECN is not supported at all, of | feedback scheme (AccECN), or neither. In the case that ECN is not | |||
course, no ECN marks will occur, thus the E bit will never be set. | supported for a connection, of course, no ECN marks will occur; thus | |||
Otherwise, a ConEx sender must maintain a counter, the congestion | the sender will never set the E flag. Otherwise, a ConEx sender must | |||
exposure gauge (CEG), for the number of outstanding bytes that have | maintain a signed counter, the congestion exposure gauge (CEG), for | |||
to be ConEx marked with the E bit. | the number of outstanding bytes that have to be ConEx marked with the | |||
E flag. | ||||
The CEG is increased when ECN information is received from an ECN- | The CEG is increased when ECN information is received from an ECN- | |||
capable receiver supporting the 'classic' ECN scheme or the accurate | capable receiver supporting the 'classic' ECN scheme or the accurate | |||
ECN feedback scheme. When the ConEx sender receives an ACK | ECN feedback scheme. When the ConEx sender receives an ACK | |||
indicating one or more segments were received with a CE mark, CEG is | indicating one or more segments were received with a CE mark, CEG is | |||
increased by the appropriate number of bytes as described further | increased by the appropriate number of bytes as described further | |||
below. | below. | |||
Unfortunately in case of duplicate acknowledgements the number of | Unfortunately in case of duplicate acknowledgements the number of | |||
newly acknowledged bytes will be zero even though (CE marked) data | newly acknowledged bytes will be zero even though (CE marked) data | |||
has been received. Therefore, we increase the CEG by DeliveredData, | has been received. Therefore, we increase the CEG by DeliveredData, | |||
as defined below: | as defined below: | |||
DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS - | DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS - | |||
(is_after_dup)*num_dup*1SMSS | (is_after_dup)*num_dup*1SMSS | |||
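As a worked illustration, the formula above translates directly into a small Python function. The function name and the choice to pass is_dup and is_after_dup as booleans are assumptions of this sketch; the parameter names otherwise mirror the terms of the formula.

```python
def delivered_data(acked_bytes: int,
                   sack_diff: int,
                   is_dup: bool,
                   is_after_dup: bool,
                   num_dup: int,
                   smss: int) -> int:
    """Bytes newly delivered to the receiver, as indicated by one ACK."""
    return (acked_bytes
            + sack_diff
            + int(is_dup) * smss
            - int(is_after_dup) * num_dup * smss)
```

Without SACK, each duplicate ACK yields one SMSS (for instance 1460 bytes), and a following ACK that newly acknowledges 5840 bytes after three duplicate ACKs yields 5840 - 3*1460 = 1460 bytes, so the same data is not counted twice.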
DeliveredData covers the number of bytes which has been newly | DeliveredData covers the number of bytes that has been newly | |||
delivered to the receiver. Therefore on each arrival of an ACK, | delivered to the receiver. Therefore on each arrival of an ACK, | |||
DeliveredData will be increased by the newly acknowledged bytes | DeliveredData will be increased by the newly acknowledged bytes | |||
(acked_bytes) as indicated by the current ACK, relative to all past | (acked_bytes) as indicated by the current ACK, relative to all past | |||
ACKs. | ACKs. The formula depends on whether SACK is available, as follows: | |||
Moreover with SACK, DeliveredData is increased by the number of bytes | With SACK: DeliveredData is increased by the number of bytes | |||
provided by (new) SACK information (SACK_diff). Note, if fewer | provided by (new) SACK information (SACK_diff). Note, if fewer | |||
unacknowledged bytes are announced in the new SACK information than | unacknowledged bytes are announced in the new SACK information | |||
in the previous ACK, SACK_diff can be negative. In this case, data | than in the previous ACK, SACK_diff can be negative. In this | |||
is newly acknowledged (in acked_byte), that has previously already | case, data is newly acknowledged (in acked_bytes), that has | |||
been accounted to DeliveredData based on SACK information. | previously already been accumulated into DeliveredData based on | |||
SACK information. | ||||
Without SACK, DeliveredData is estimated to be 1 SMSS on duplicate | Without SACK: DeliveredData is estimated to be 1 SMSS on duplicate | |||
acknowledgements. For the subsequent partial or full ACK, | acknowledgements. For the subsequent partial or full ACK, | |||
DeliveredData is estimated to be the newly acknowledged bytes, minus | DeliveredData is estimated to be the newly acknowledged bytes, | |||
one SMSS for each preceding duplicate ACK. Therefore is_dup is one | minus one SMSS for each preceding duplicate ACK. Therefore is_dup | |||
if the current ACK is a duplicated ACK without SACK, and zero | is one if the current ACK is a duplicated ACK without SACK, and | |||
otherwise. is_after_dup is only one for the next full or partial ACK | zero otherwise. is_after_dup is only one for the next full or | |||
after a number of duplicated ACKs without SACK and num_dup counts the | partial ACK after a number of duplicated ACKs without SACK and | |||
number of duplicated ACKs in a row. | num_dup counts the number of duplicated ACKs in a row.[CREF8] | |||
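The DeliveredData formula above can be sketched as follows. This is a minimal illustration, not part of the draft: the function name, the keyword arguments and the default SMSS of 1460 bytes are assumptions chosen for the example, and the caller is assumed to have already classified the ACK (duplicate, first ACK after duplicates) and computed SACK_diff.

```python
def delivered_data(acked_bytes, sack_diff=0, is_dup=False,
                   is_after_dup=False, num_dup=0, smss=1460):
    """Estimate the bytes newly delivered to the receiver by one ACK.

    acked_bytes  -- bytes newly cumulatively acknowledged by this ACK
    sack_diff    -- change in SACKed bytes versus the previous ACK (may be < 0)
    is_dup       -- True for a duplicate ACK when SACK is not in use
    is_after_dup -- True for the first full/partial ACK after such duplicates
    num_dup      -- number of duplicate ACKs seen in a row before this ACK
    smss         -- sender maximum segment size in bytes (assumed value)
    """
    return (acked_bytes + sack_diff
            + int(is_dup) * smss
            - int(is_after_dup) * num_dup * smss)
```

For example, without SACK, three duplicate ACKs each contribute one SMSS, and the following full ACK covering four segments contributes only the one segment not yet counted.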
The two cases, with and without more accurate ECN depending on the | With classic ECN, as soon as a CE mark is seen at the receiver, it | |||
receiver capability, are discussed in the following sections. | will feed this information back to the sender by setting the Echo | |||
Congestion Experienced (ECE) flag in the TCP header of subsequent | ||||
ACKs. Once the sender receives the first ECE of a congestion | ||||
notification, it sets the CWR flag in the TCP header once. When this | ||||
packet with Congestion Window Reduced (CWR) flag in the TCP header | ||||
arrives at the receiver, acknowledging its first ECE feedback, the | ||||
receiver stops setting ECE. | ||||
Thus, with classic ECN, one congestion marked packet causes | ||||
continuous congestion feedback for a whole round trip, thus hiding | ||||
the arrival of any further congestion marked packets during that | ||||
round trip. The more accurate ECN feedback scheme (AccECN) has been | ||||
defined to ensure that feedback properly reflects the extent of | ||||
congestion marking. The two cases, with and without a receiver | ||||
capable of AccECN, are discussed in the following sections. | ||||
3.2.1. Accurate ECN feedback | 3.2.1. Accurate ECN feedback | |||
With a more accurate ECN feedback scheme either the number of marked | With the [CREF9] more accurate ECN feedback scheme (AccECN) either | |||
packets/received CE marks or directly the number of marked bytes is | the number of marked packets or the number of marked bytes is known. | |||
known. In the later case the CEG can directly be increased by the | In the latter case the CEG can directly be increased by the number of | |||
number of marked bytes. Otherwise if D is assumed to be the number | marked bytes. Otherwise if D is assumed to be the number of marks, | |||
of marks, the gauge CEG will be conservatively increased by one SMSS | the gauge (CEG) will be conservatively increased by one SMSS for each | |||
for each marking or at most the number of newly acknowledged bytes: | marking or at most the number of newly acknowledged bytes: | |||
CEG += min(SMSS*D, DeliveredData) | CEG += min(SMSS*D, DeliveredData) | |||
3.2.2. Classic ECN support | 3.2.2. Classic ECN support | |||
If the ConEx sender fully conforms to the semantics of the ECN | If the ConEx sender fully conforms to the semantics of ECN signaling | |||
signaling as defined by [RFC5562], it will receive one full RTT of | as defined by [RFC5562],[CREF10] it will receive one full RTT of ACKs | |||
ACKs with the ECE flag set whenever at least one CE mark was received | with the ECE flag set whenever at least one CE mark was received by | |||
by the receiver. As the sender cannot estimate how much packets have | the receiver. As the sender cannot estimate how many packets have | |||
actually been CE marked during this RTT, the most conservative | actually been CE marked during this RTT, the most conservative | |||
assumption should be taken, namely assuming that all packets were | assumption MAY be taken, namely assuming that all packets were | |||
marked. This can be achieved by increasing the CEG by DeliveredData | marked. This can be achieved by increasing the CEG by DeliveredData | |||
for each ACK with the ECE flag: | for each ACK with the ECE flag: | |||
CEG += DeliveredData | CEG += DeliveredData | |||
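The two CEG update rules (Sections 3.2.1 and 3.2.2) can be put side by side in a short sketch. The function and parameter names are hypothetical, the SMSS default of 1460 bytes is an assumption, and DeliveredData is assumed to have been computed for the ACK beforehand.

```python
def ceg_increase(delivered, *, accecn, ce_marks=0, ce_bytes=None,
                 ece=False, smss=1460):
    """Return the number of bytes to add to the CEG for one ACK.

    delivered -- DeliveredData for this ACK, in bytes
    accecn    -- True if the receiver supports accurate ECN feedback
    ce_marks  -- D, the number of CE marks reported (AccECN, packet count)
    ce_bytes  -- number of CE-marked bytes, if the feedback reports bytes
    ece       -- True if the ACK carries the ECE flag (classic ECN)
    """
    if accecn:
        if ce_bytes is not None:
            # Byte-accurate feedback: use the reported value directly.
            return ce_bytes
        # Packet-count feedback: one SMSS per mark, capped by DeliveredData.
        return min(smss * ce_marks, delivered)
    # Classic ECN: conservatively assume all delivered bytes were marked.
    return delivered if ece else 0
```

A caller would then apply `ceg += ceg_increase(...)` on each ACK arrival.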
Optionally a ConEx sender could implement an Advanced Compatibility | Optionally a ConEx sender could implement the following technique, | |||
Mode: | called advanced compatibility mode, to considerably improve its | |||
estimate of the number of ECN-marked packets: | ||||
To extract more than one ECE indication per RTT, a ConEx sender could | To extract more than one ECE indication per RTT, a ConEx sender could | |||
set the CWR flag opportunistically to force the receiver to signal | set the CWR flag continuously to force the receiver to signal only | |||
only one ECE per CE mark. Unfortunately, the use of delayed ACKs | one ECE per CE mark. Unfortunately, the use of delayed ACKs | |||
[RFC5681], as it is usually done today, will prevent a feedback of | [RFC5681] (which is common) will prevent feedback of every CE mark; | |||
every CE mark. If an CWR confirmation will be received before the | if a CWR confirmation is received before the ECE can be sent out on | |||
ECE can be sent out with the next ACK, ECN feedback information | the next ACK, ECN feedback information could get lost. Thus a sender | |||
information could get lost. Thus a sender should set CWR only on | SHOULD set CWR only on those data segments that will actually trigger | |||
those data segments, that will actually trigger a (delayed) ACK. The | a (delayed) ACK. The sender would need an additional control loop to | |||
sender would need an additional control loop to estimate which data | estimate which data segments will trigger an ACK in order to extract | |||
segment will trigger an ACK. But such a more sophisticated | more timely congestion notifications. Still the CEG SHOULD be | |||
heuristics could extract congestion notifications more timely. Still | increased by DeliveredData, as one or more CE marked packets could be | |||
the CEG need to be increased by DeliveredData, as one or more CE | acknowledged by one delayed ACK. | |||
marked packets could be acknowledged by one delayed ACK. | ||||
The repetition of ECE in classic ECN is intended to ensure reliable | ||||
delivery of congestion feedback. The following argument is intended | ||||
to prove that suppressing repetitions of ECE is safe against possible | ||||
congestion collapse due to lost congestion feedback. | ||||
With advanced compatibility mode, if an ACK containing ECE is lost, | ||||
the continual CWRs prevent it being repeated, so it will remain lost. | ||||
Therefore, if congestion is light on the forward path and heavy on | ||||
the reverse, most of the light congestion signals will be lost. If | ||||
loss of feedback exacerbates congestion on the forward path, more | ||||
forward packets will be CE marked, increasing the likelihood that | ||||
feedback from at least one CE will get through per RTT. As long as | ||||
one ECE reaches the sender per RTT, the sender's congestion response | ||||
will be the same as if CWR were not continuous. The only way that | ||||
heavy congestion on the forward path could be completely hidden would | ||||
be if all ACKs on the reverse path were lost. If total ACK loss | ||||
persisted, the sender would time out and do a congestion response | ||||
anyway. Therefore, the problem seems confined to potential suppression | ||||
of a congestion response during light congestion. | ||||
Anyway, even if loss of all ECN feedback led to no congestion | ||||
response, the worst that could happen would be loss instead of ECN- | ||||
signalled congestion on the forward path. Given compatibility mode | ||||
does not affect loss feedback, there would be no risk of congestion | ||||
collapse. | ||||
4. Setting the ConEx Bits | 4. Setting the ConEx Bits | |||
By setting the X bit a packet is marked as ConEx-capable. All | By setting the X flag, a packet is marked as ConEx-capable. All | |||
packets carrying payload MUST be marked with the X bit set including | packets carrying payload MUST be marked with the X flag set, | |||
retransmissions. No congestion feedback information are available | including retransmissions. No congestion feedback information is | |||
about control packets such as pure ACKs which are not carrying any | available about control packets such as pure ACKs which are not | |||
payload. Thus these packets should not be taken into account when | carrying any payload. Thus these packets should not be taken into | |||
determining ConEx information. These packets MUST carry a ConEx | account when determining ConEx information. These packets MUST carry | |||
Destination Option with the X bit unset. | a ConEx Destination Option with the X flag unset.[CREF11] | |||
4.1. Setting the E and the L Bit | 4.1. Setting the E or the L Flag | |||
As long as the CEG or LEG counter is positive, ConEx-capable packets | As long as the LEG or CEG counter is positive, the sender MUST mark | |||
SHOULD be marked with E or L respectively, and the CEG or LEG counter | each ConEx-capable packet with L or E respectively, and decrease the | |||
is decreased by the TCP payload bytes carried in this packet. If the | LEG or CEG counter by the TCP payload bytes carried in the marked | |||
CEG or LEG counter is negative, the respective counter SHOULD be | packet (assuming headers are not being counted because packet sizes | |||
reset to zero within one RTT after it was decreased the last time or | are regular). No matter how small the value of LEG or CEG, if it is | |||
one RTT after recovery if no further congestion occurred. | positive, to ensure ConEx signals are timely, the sender MUST NOT | |||
defer packet marking. Therefore the value of LEG and CEG will | ||||
commonly be negative. | ||||
If SACK information is not available, spurious retransmissions are more | Multiple ConEx flags may be required for signaling at the same time. | |||
likely. In this case it might be valuable to slightly delay the | This may happen, for example, during excessive congestion when an ACK | |||
ConEx loss feedback until a spurious retransmission might be | is received by the sender that simultaneously indicates that at least | |||
detected. But the ConEx signal MUST NOT be delayed more than one RTT | one segment has been lost, and that one or more ECN marks were | |||
as long as data packets are sent out. | received. Another case when this might happen is when ACKs are lost, | |||
so that a subsequent ACK carries summary information not previously | ||||
available to the sender. | ||||
4.2. Credit Bits | Whenever both LEG and CEG are positive, the sender MUST mark each | |||
ConEx-capable packet with both L and E. If a credit signal is also | ||||
pending (see Section 4.2), the C flag can be set as well. | ||||
The ConEx abstract mechanism requires that sufficient credit must be | 4.2. Setting the Credit Flag | |||
signaled in advance to cover the expected congestion during the | ||||
feedback delay of one RTT. A ConEx sender should maintain a counter | ||||
of the sent credits c in bytes. If congestion occurs, credits will | ||||
be consumed and the c counter should be reduced by the number of | ||||
bytes that were lost or estimated to be ECN-marked. If the risk of | ||||
congestion was estimated wrongly and thus too few credits were sent, | ||||
the c counter becomes zero but can not get negative. | ||||
The number of credits sent should always equal the number of bytes in | The ConEx abstract mechanism [draft-ietf-conex-abstract-mech] | |||
flight, as all packets could potentially get lost or congestion | requires that sufficient credit must be signaled in advance to cover | |||
marked. Thus a ConEx sender should monitor the number of bytes in | the expected congestion during the feedback delay of one RTT. | |||
flight f. If f ever becomes larger than c, the ConEx sender SHOULD | ||||
send new credits. Remember that c will be decreased if congestion | ||||
occurs. | ||||
In TCP Slow Start, the congestion window might grow much larger than | This section proposes concrete algorithms for determining how much | |||
during the rest of the transmission. Thus a sender could consider to | credit to signal during congestion avoidance and slow start. | |||
sent fewer than f credits but risking potential penalization by an | However, experimentation in better credit setting algorithms is | |||
audit. In any case the credits should at least cover the increase in | expected and encouraged. The wider goal of ConEx is to reflect the | |||
sending rate. As the sending rate increases exponentially in Slow | 'cost' of the risk of causing congestion on those that contribute | |||
Start, thus double every RTT, a ConEx sender should at least cover | most to it. Thus, experimentation is encouraged in better ways to | |||
half the number of packets in flight by credits. Note, that the | improve or maintain performance while reducing the risk of causing | |||
number of losses or markings within one RTT does not only depend | congestion, and therefore reducing the need to signal so much credit. | |||
actions taken by the sender. In general, the behavior of the cross | ||||
traffic, and if Active Queue Management (AQM) is used, the respective | ||||
parameterization influence how many packets get dropped or marked. | ||||
But if the used AQM is not overly aggressive with ECN marking, | ||||
sending halve the flight size as credits should be sufficient for | ||||
both, congestion signaled by loss or ECN. Marking every fourth | ||||
packet will allow the respective number of credits in Slow Start as | ||||
it can be seen in Figure Figure 1. | ||||
RTT1 |------XC------>| | For a simple credit algorithm, a ConEx sender SHOULD maintain a | |||
|------X------->| | counter of the sent credits c in bytes. If congestion occurs, | |||
|------X------->| credit=1 in_flight=3 | credits will be consumed and the c counter SHOULD be reduced by the | |||
number of bytes that were lost or estimated to be ECN-marked. If | ||||
the risk of congestion was estimated wrongly and thus too few credits | ||||
were sent, the c counter becomes zero but cannot go negative. | ||||
During TCP congestion avoidance, the amount of credit sent SHOULD | ||||
exceed the amount of congestion experienced by at least the number of | ||||
bytes in flight, as all packets could potentially get lost or | ||||
congestion marked.[CREF12] Thus a ConEx sender should monitor the | ||||
number of bytes in flight f. Whenever f becomes larger than c, the | ||||
ConEx sender SHOULD set the C flag on each ConEx-capable packet and | ||||
increase c by the size of each marked packet until it is no less than | ||||
f again. | ||||
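A minimal sketch of the simple credit counter for congestion avoidance follows. The class and method names are hypothetical, and it is an assumption of this sketch that the bytes in flight f passed to `on_send` already include the packet being sent.

```python
class CreditCounter:
    """Tracks sent credit c (in bytes) against the bytes in flight f."""

    def __init__(self):
        self.c = 0

    def on_congestion(self, nbytes):
        # Credit is consumed by lost or (estimated) ECN-marked bytes,
        # but the counter cannot go negative.
        self.c = max(0, self.c - nbytes)

    def on_send(self, bytes_in_flight, packet_bytes):
        """Return True if the C flag should be set on this packet."""
        if bytes_in_flight > self.c:
            self.c += packet_bytes
            return True
        return False
```

The sender would call `on_send` for each ConEx-capable packet and `on_congestion` whenever LEG or CEG is increased.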
Recall that c will be decreased whenever congestion occurs, therefore | ||||
c will need to be replenished as soon as c drops below f. Also | ||||
recall that the sender can set the C flag on a ConEx-capable packet | ||||
whether or not the E or L flags are also set. | ||||
In TCP slow start, the congestion window might grow much larger than | ||||
during the rest of the transmission. Thus a sender could consider | ||||
sending fewer than f credits but risking being penalized by an audit | ||||
function. In any case the credits SHOULD at least cover the increase | ||||
in sending rate.[CREF13] Given the sending rate doubles every RTT in | ||||
Slow Start, a ConEx sender should at least cover half the number of | ||||
packets in flight by credits. Note that the number of losses or | ||||
markings within one RTT does not solely depend on the sender's | ||||
actions. In general, the behavior of the cross traffic, whether | ||||
active queue management (AQM) is used and how it is parameterized | ||||
influence how many packets might be dropped or marked. As long as | ||||
any AQM encountered is not overly aggressive with ECN marking, | ||||
sending half the flight size as credits should be sufficient whether | ||||
congestion is signaled by loss or ECN. Marking C on every second | ||||
packet in the initial window and every fourth packet in slow start | ||||
will introduce the correct amount of credit as can be seen in | ||||
Figure 1.[CREF14] This behaviour is most easily achieved by using the | ||||
following formula to update c as every packet is sent during slow | ||||
start: | ||||
c = (f+1)/2, using integer division. | ||||
f c=(f+1)/2 | ||||
RTT1 |------XC------>| 1 1 | ||||
|------X------->| 2 1 | ||||
|------XC------>| 3 2 | ||||
| | | | | | |||
RTT2 |------X------->| | RTT2 |------X------->| 3 2 | |||
|------XC------>| | |------X------->| 4 2 | |||
|------X------->| | |------X------->| 4 2 | |||
|------X------->| | |------XC------>| 5 3 | |||
|------X------->| | |------X------->| 5 3 | |||
|------XC------>| credit=3 in_flight=6 | |------X------->| 6 3 | |||
| | | | | | |||
RTT3 |------X------->| | RTT3 |------X------->| 6 3 | |||
|------X------->| | |------XC------>| 7 4 | |||
|------X------->| | |------X------->| 7 4 | |||
|------XC------>| | |------X------->| 8 4 | |||
|------X------->| | |------X------->| 8 4 | |||
|------X------->| | |------XC------>| 9 5 | |||
|------X------->| | |------X------->| 9 5 | |||
|------XC------>| | |------X------->| 10 5 | |||
|------X------->| | |------X------->| 10 5 | |||
|------X------->| | |------XC------>| 11 6 | |||
|------X------->| | |------X------->| 11 6 | |||
|------XC------>| credit=6 in_flight=12 | |------X------->| 12 6 | |||
| . | | | . | | |||
| : | | | : | | |||
Figure 1: Credits in Slow Start (with an initial window of 3) | Figure 1: Credits in Slow Start (with an initial window of 3) | |||
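The f and c columns of the revised Figure 1 can be reproduced with a short simulation. This sketch makes several simplifying assumptions that are not in the draft: f and c are counted in packets rather than bytes, every ACK acknowledges exactly one packet and releases two new ones, and no loss or marking occurs. The function name is hypothetical.

```python
def slow_start_credit_marks(initial_window=3, rtts=3):
    """Return, per RTT, a list of (f, c, c_flag) tuples for each packet sent."""
    f = 0  # packets in flight
    c = 0  # credits sent so far
    rounds = []

    def send():
        nonlocal f, c
        f += 1
        target = (f + 1) // 2      # c = (f+1)/2 with integer division
        c_flag = target > c        # set C only when c has to grow
        if c_flag:
            c = target
        return (f, c, c_flag)

    # First RTT: send the initial window back to back.
    rounds.append([send() for _ in range(initial_window)])

    # Later RTTs: each ACK removes one packet from flight and triggers two.
    for _ in range(rtts - 1):
        acks = f
        this_round = []
        for _ in range(acks):
            f -= 1
            this_round.append(send())
            this_round.append(send())
        rounds.append(this_round)
    return rounds
```

Under these assumptions the C flag lands on the first and third packet of the initial window and on every fourth packet afterwards, so c ends each round at half the flight size.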
It is possible that the audit looses state due to e.g. rerouting or | It is possible that a TCP flow will encounter an audit function | |||
memory limitations. Therefore, the sender needs to detect this case | without relevant flow state, due to e.g. rerouting or memory | |||
and resend credits. Thus a ConEx sender should reset the credit | limitations. Therefore, the sender needs to detect this case and | |||
count c to zero if losses occur in two subsequent RTTs (assuming that | resend credits. Thus a ConEx sender should reset the credit count c | |||
the sending rate was correctly reduced based on the received | to zero if losses occur in two subsequent RTTs (assuming that the | |||
congestion signal). | sending rate was correctly reduced based on the received congestion | |||
signal). [CREF15] | ||||
5. Loss of ConEx information | 5. Loss of ConEx information | |||
Packets carrying ConEx can also get lost. A ConEx sender must | Packets carrying ConEx signals could be discarded themselves. This | |||
remember which packet was marked with either the L, the E or the C | will be a second order problem (e.g. if the loss probability is 0.1%, | |||
bit. If one of these packets is detected to be lost, the should | the probability of losing a loss signal will be 0.1% of 0.1% = | |||
increase the respective gauge, LEG or CEG, by the number of lost | 0.0001%). Therefore, an implementer MAY choose to ignore this | |||
payload bytes. | problem, accepting instead the risk that an audit function might | |||
slightly increase the loss level (e.g. from 0.1000% to 0.1001%). | ||||
Nonetheless, a ConEx sender SHOULD remember which packet was marked | ||||
with either the L, the E or the C flag. If one of these packets is | ||||
detected as lost, the sender SHOULD increase the respective gauge(s), | ||||
LEG or CEG, by the number of lost payload bytes in addition to | ||||
increasing LEG for the loss. | ||||
6. Timeliness of the ConEx Signals | 6. Timeliness of the ConEx Signals | |||
ConEx signals can only be evaluated by a network node with a time | ConEx signals will only be useful to a network node within a time | |||
delay of about one RTT after the congestion occured. To avoid | delay of about one RTT after the congestion occurred. To avoid | |||
further delays, a ConEx sender SHOULD sent the ConEx signaling with | further delays, a ConEx sender SHOULD send the ConEx signaling on the | |||
the next available packet. In cases where it is preferable to | next available packet. | |||
slightly delay the ConEx signal, the sender MUST NOT delay the ConEx | ||||
signal more than one RTT. | ||||
Multiple ConEx bits may become available for signaling at the same | Any or all of the ConEx flags can be used in the same packet, which | |||
time, for example when an ACK is received by the sender, that | allows delay to be minimised when multiple signals are pending. | |||
indicates at the same time that at least one segment has been lost, | ||||
and that one or more ECN marks were received. This may happen during | If a flow becomes application-limited, there could be insufficient | |||
excessive congestion, where the queues overflow even though ECN was | bytes to send to reduce the gauges to zero or below. In such cases, | |||
used and currently all packets are marked, while others have to be | the sender cannot help but delay ConEx signals. Nonetheless, as long | |||
dropped nevertheless. Another possibility when this may happen are | as the sender is marking all outgoing packets, an audit function is | |||
lost ACKs, so that a subsequent ACK carries summary information not | unlikely to penalize ConEx-marked packets. Therefore, no matter how | |||
previously available to the sender. As ConEx-capable packet can | long a gauge has been positive, a sender MUST NOT reduce the gauge by | |||
carry different ConEx marks at the same time, these information do | more than the ConEx marked bytes it has sent. | |||
not need to be distributed over several packets and thus can be sent | ||||
without further delay. | If the CEG or LEG counter is negative, the respective counter SHOULD | |||
be reset to zero within one RTT after it was decreased the last time | ||||
or one RTT after recovery if no further congestion occurred. | ||||
[CREF16] | ||||
If SACK information is not available, spurious retransmissions are more | ||||
likely. In this case it might be valuable to slightly delay the | ||||
ConEx loss feedback until a spurious retransmission might be | ||||
detected. But the ConEx signal MUST NOT be delayed more than one RTT | ||||
as long as data packets are sent out.[CREF17] | ||||
7. Acknowledgements | 7. Acknowledgements | |||
The authors would like to thank Bob Briscoe who contributed his | The authors would like to thank Bob Briscoe who contributed his | |||
initial ideas and valuable feedback. Moreover, thanks to Jana | initial ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback. | |||
Iyengar who provided valuable feedback. | Moreover, thanks to Jana Iyengar who provided valuable feedback. | |||
8. IANA Considerations | 8. IANA Considerations | |||
This document does not have any requests to IANA. | This document does not have any requests to IANA. | |||
9. Security Considerations | 9. Security Considerations | |||
With some of the advanced ECN compatibility modes it is possible to | General ConEx security considerations are covered extensively in the | |||
miss congestion notifications. Thus a sender will not decrease its | ConEx abstract mechanism [draft-ietf-conex-abstract-mech]. This | |||
sending rate. If the congestion is persistent, the likelihood to | section covers TCP-specific concerns. | |||
receive a congestion notification increases. In the worst case the | ||||
sender will still react correctly to loss. This will prevent a | The ConEx modifications to TCP provide no mechanism for a receiver to | |||
congestion collapse. | force a sender not to use ConEx. A receiver can degrade the accuracy | |||
of ConEx by claiming that it does not support SACK, AccECN or ECN, | ||||
but the sender will never have to turn ConEx off. The receiver | ||||
cannot force the sender to have to mark ConEx more conservatively, in | ||||
order to cover the risk of any inaccuracy. Instead the sender can | ||||
choose to mark inaccurately, which will only increase the likelihood | ||||
of loss at an audit function. Thus the receiver will only harm | ||||
itself. | ||||
Assuming the sender is limited in some way by a congestion allowance | ||||
or quota, a receiver could spoof more loss or ECN congestion feedback | ||||
than it actually experiences, in an attempt to make the sender draw | ||||
down its allowance faster than necessary. However, over-declaring | ||||
congestion simply makes the sender slow down. If the receiver is | ||||
interested in the content it will not want to harm its own | ||||
performance. | ||||
However, if the receiver is solely interested in making the sender | ||||
draw down its allowance, the net effect will depend on the sender's | ||||
congestion control algorithm. With New Reno [RFC5681], doubling | ||||
congestion feedback causes the sender to consume sqrt(2) = 1.4 times | ||||
more congestion allowance. However, to improve scaling, congestion | ||||
control algorithms are tending towards less responsive algorithms | ||||
like Cubic or Compound TCP, and ultimately to linear algorithms like | ||||
DCTCP [DCTCP]. In each case, if the receiver doubles congestion | ||||
feedback, it causes the sender to respectively consume more allowance | ||||
by a factor of 1.2, 1.15 or 1, where 1 implies the attack has become | ||||
completely ineffective. | ||||
10. References | 10. References | |||
10.1. Normative References | 10.1. Normative References | |||
[RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP | [RFC2018] Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP | |||
Selective Acknowledgment Options", RFC 2018, October 1996. | Selective Acknowledgment Options", RFC 2018, October 1996. | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
skipping to change at page 13, line 29 | skipping to change at page 16, line 22 | |||
Destination Option for ConEx", draft-ietf-conex-destopt-04 | Destination Option for ConEx", draft-ietf-conex-destopt-04 | |||
(work in progress), March 2013. | (work in progress), March 2013. | |||
10.2. Informative References | 10.2. Informative References | |||
[DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, | [DCTCP] Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, | |||
P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP: | P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP: | |||
Efficient Packet Transport for the Commoditized Data | Efficient Packet Transport for the Commoditized Data | |||
Center", Jan 2010. | Center", Jan 2010. | |||
[I-D.briscoe-tsvwg-re-ecn-tcp] | [I-D.briscoe-conex-re-ecn-tcp] | |||
Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, | Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith, | |||
"Re-ECN: Adding Accountability for Causing Congestion to | "Re-ECN: Adding Accountability for Causing Congestion to | |||
TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-09 (work in | TCP/IP", draft-briscoe-conex-re-ecn-tcp-04 (work in | |||
progress), October 2010. | progress), July 2014. | |||
[RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm | [RFC3522] Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm | |||
for TCP", RFC 3522, April 2003. | for TCP", RFC 3522, April 2003. | |||
[RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective | [RFC3708] Blanton, E. and M. Allman, "Using TCP Duplicate Selective | |||
Acknowledgement (DSACKs) and Stream Control Transmission | Acknowledgement (DSACKs) and Stream Control Transmission | |||
Protocol (SCTP) Duplicate Transmission Sequence Numbers | Protocol (SCTP) Duplicate Transmission Sequence Numbers | |||
(TSNs) to Detect Spurious Retransmissions", RFC 3708, | (TSNs) to Detect Spurious Retransmissions", RFC 3708, | |||
February 2004. | February 2004. | |||
skipping to change at page 14, line 14 | skipping to change at page 17, line 5 | |||
[RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, | [RFC5682] Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata, | |||
"Forward RTO-Recovery (F-RTO): An Algorithm for Detecting | "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting | |||
Spurious Retransmission Timeouts with TCP", RFC 5682, | Spurious Retransmission Timeouts with TCP", RFC 5682, | |||
September 2009. | September 2009. | |||
[RFC6789] Briscoe, B., Woundy, R., and A. Cooper, "Congestion | [RFC6789] Briscoe, B., Woundy, R., and A. Cooper, "Congestion | |||
Exposure (ConEx) Concepts and Use Cases", RFC 6789, | Exposure (ConEx) Concepts and Use Cases", RFC 6789, | |||
December 2012. | December 2012. | |||
[draft-briscoe-tsvwg-byte-pkt-mark] | [RFC7141] Briscoe, B. and J. Manner, "Byte and Packet Congestion | |||
Briscoe, B. and J. Manner, "Byte and Packet Congestion | Notification", BCP 41, RFC 7141, February 2014. | |||
Notification", draft-briscoe-tsvwg-byte-pkt-mark-010 (work | ||||
in progress), May 2013. | ||||
[draft-kuehlewind-tcpm-accurate-ecn] | [draft-kuehlewind-tcpm-accurate-ecn] | |||
Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN | Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN | |||
Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02 | Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02 | |||
(work in progress), Jun 2013. | (work in progress), Jun 2013. | |||
Appendix A. Revision history | Appendix A. Revision history | |||
RFC Editior: This section is to be removed before RFC publication. | RFC Editor: This section is to be removed before RFC publication. | |||
00 ... initial draft, early submission to meet deadline. | 00 ... initial draft, early submission to meet deadline. | |||
01 ... refined draft, updated LEG "drain" from per-packet to RTT- | 01 ... refined draft, updated LEG "drain" from per-packet to RTT- | |||
based. | based. | |||
02 ... added Section 5 and expanded discussion about ECN interaction. | 02 ... added Section 5 and expanded discussion about ECN interaction. | |||
03 ... expanded the discussion around credit bits. | 03 ... expanded the discussion around credit bits. | |||
04 ... review comments of Jana addressed. (Change in full compliance | 04 ... review comments of Jana addressed. (Change in full compliance | |||
mode.) | mode.) | |||
05 ... changes on Loss Detection without SACK, support of classic ECN | 05 ... changes on Loss Detection without SACK, support of classic ECN | |||
and credit handling. | and credit handling. | |||
Editorial Comments | ||||
[CREF1] BB: 'finally" here would mean "At last (sigh), here's what | ||||
you've all been waiting for." :-) | ||||
[CREF2] BB: Avoid 'recommended', which could be confused with the | ||||
normative upper-cased word. The normative language later is | ||||
good and sufficient. | ||||
[CREF3] BB: I don't understand this last sentence. How does the sender | ||||
suddenly know something it didn't know before? | ||||
[CREF4] BB: I've added this sentence, but only to give you an excuse for | ||||
having devised all this mechanism. However, I really don't know | ||||
why you're going to all this trouble to be so accurate and | ||||
timely. TCP never retransmits less data than is lost. And over | ||||
the years TCP designers have been reducing the amount of | ||||
unnecessary retransmission, and reducing retransmission delay. | ||||
So I suggest we just mark retransmissions with the L flag. | ||||
Done! No need even for a loss exposure gauge. ...If the sender | ||||
is faced with insufficient information such that the universe of | ||||
TCP designers has been unable to minimise unnecessary or delayed | ||||
retransmissions, why try to do better than everyone has so far | ||||
managed? Just accept that you will be over-declaring or | ||||
sluggishly declaring ConEx. And assume that deployment of all | ||||
the techniques to reduce late or spurious losses is proceeding, | ||||
and we can walk on their shoulders. | ||||
[CREF5] BB: I suggest removing MUST, because we cannot mandate a | ||||
particular implementation technique. | ||||
[CREF6] BB: If these mechanisms are being used, surely they will be | ||||
used to /prevent/ spurious retransmissions (not just count | ||||
them but still retransmit anyway). So, if we increase LEG only | ||||
when a retransmission actually occurs, is that not sufficient? | ||||
[CREF7] BB: OK, I get that. But, as above, why worry about optimising a | ||||
case that is becoming rare, because everyone recognised late | ||||
retransmission was a problem, so SACK is pretty much universally | ||||
deployed. Would you be unhappy if all this was deleted? | ||||
Perhaps relegate to an appendix? But is it really so necessary? | ||||
[CREF8] BB: I think 3 has been used instead of num_dup in the LEC | ||||
algorithm earlier. | ||||
[CREF9] BB: I changed 'a' to 'the'. Did you mean a generally more | ||||
accurate scheme, or the AccECN scheme in particular? If the | ||||
latter, as it stands, the AccECN scheme doesn't give marked | ||||
bytes. | ||||
[CREF10] BB: Surely RFC5562 only adds ECT on the SYN/ACK. Is it really | ||||
necessary to even refer to it in this draft? Whatever, it | ||||
doesn't seem particularly relevant to this sentence. Or did | ||||
you mean RFC3168? | ||||
[CREF11] BB: I thought the result of the discussion about how to say | ||||
whether the X flag is set in conex-destopt was that X is set | ||||
irrespective of whether loss or ECN marking of the packet | ||||
itself can be detected. The relevant sentence in conex-destopt | ||||
is: "This [X=0] can be the case if no congestion feedback is | ||||
(currently) available e.g. in TCP if one endpoint has been | ||||
receiving data but sending nothing but pure ACKs (no user data) | ||||
for some time." | ||||
[CREF12] BB: I would prefer if this were stated as the maximum required, | ||||
not a recommended value. The idea is to hold as much credit as | ||||
the /likely/ worst-case congestion, not the /absolute/ worst | ||||
case (I did experiments to find the variance of congestion in | ||||
my PhD). | ||||
[CREF13] BB: Again, rather than a SHOULD, can we make this a | ||||
recommendation that is part of the reason for ConEx | ||||
experimentation? - especially if variants like hybrid SS are | ||||
enabled. | ||||
[CREF14] BB: Just marking every fourth packet doesn't work for a general | ||||
IW. During the IW, mark the first packet and every other | ||||
packet, then after IW mark every fourth packet (to determine | ||||
precisely which is the first packet to mark after the IW, | ||||
maintain a packet counter and double it when IW ends). | ||||
[CREF15] BB: Whoa! This is rather excessively conservative isn't it? | ||||
There will often be a loss in 2 consecutive RTTs due to normal | ||||
congestion. If there's a re-route, I think the new audit will | ||||
drop a whole window, so the sender will naturally send a whole | ||||
window's worth of credit with the retransmissions. Am I wrong? | ||||
[CREF16] BB: This adds complexity. I would suggest this is a MAY. It | ||||
depends on how audit is done whether it is necessary, so this | ||||
will depend on experiments. For instance, in the audit | ||||
function I designed, there was a long term and a short term | ||||
comparison, and the long term one became more relaxed the | ||||
longer the flow had been behaving. (Note I have also suggested | ||||
moving this and the next para from "Setting E/L" to | ||||
"Timeliness") | ||||
[CREF17] BB: As before, I disagree with the need for this para - this is | ||||
trying to optimise a case that is rare because it's known to be | ||||
sub-optimal, by compromising ConEx timeliness. SACK is nearly | ||||
universal. If SACK isn't available, things are bound to be non- | ||||
optimal. The solution is for the receiver to deploy SACK like | ||||
nearly every other receiver has done, not to add more | ||||
complexity to the sender and more delay to ConEx. | ||||
Authors' Addresses | Authors' Addresses | |||
Mirja Kuehlewind (editor) | Mirja Kuehlewind (editor) | |||
ETH Zurich | ETH Zurich | |||
Switzerland | Switzerland | |||
Email: mirja.kuehlewind@tik.ee.ethz.ch | Email: mirja.kuehlewind@tik.ee.ethz.ch | |||
Richard Scheffenegger | Richard Scheffenegger | |||
NetApp, Inc. | NetApp, Inc. | |||
End of changes. 74 change blocks. | ||||
314 lines changed or deleted | 547 lines changed or added | |||