Congestion Exposure (ConEx)                           M. Kuehlewind, Ed.
Internet-Draft                                                ETH Zurich
Intended status: Experimental                           R. Scheffenegger
Expires: September 9, 2015                                  NetApp, Inc.
                                                           March 8, 2015


               TCP modifications for Congestion Exposure
                 draft-ietf-conex-tcp-modifications-07

Abstract

   Congestion Exposure (ConEx) is a mechanism by which senders inform
   the network about expected congestion based on congestion feedback
   from previous packets in the same flow.  This document describes the
   necessary modifications to use ConEx with the Transmission Control
   Protocol (TCP).

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 9, 2015.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of


Kuehlewind & ScheffeneggExpires September 9, 2015               [Page 1]

Internet-Draft         TCP Modifications for ConEx            March 2015


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
     1.1.  Requirements Language . . . . . . . . . . . . . . . . . .   3
   2.  Sender-side Modifications . . . . . . . . . . . . . . . . . .   3
   3.  Counting congestion . . . . . . . . . . . . . . . . . . . . .   4
     3.1.  Loss Detection  . . . . . . . . . . . . . . . . . . . . .   5
       3.1.1.  General Approach  . . . . . . . . . . . . . . . . . .   6
       3.1.2.  Without SACK Support  . . . . . . . . . . . . . . . .   6
     3.2.  ECN . . . . . . . . . . . . . . . . . . . . . . . . . . .   7
       3.2.1.  Accurate ECN feedback . . . . . . . . . . . . . . . .   9
       3.2.2.  Classic ECN support . . . . . . . . . . . . . . . . .   9
   4.  Setting the ConEx Flags . . . . . . . . . . . . . . . . . . .  10
     4.1.  Setting the E or the L Flag . . . . . . . . . . . . . . .  10
     4.2.  Setting the Credit Flag . . . . . . . . . . . . . . . . .  11
   5.  Loss of ConEx information . . . . . . . . . . . . . . . . . .  13
   6.  Timeliness of the ConEx Signals . . . . . . . . . . . . . . .  14
   7.  Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  14
   8.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  14
   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  14
   10. References  . . . . . . . . . . . . . . . . . . . . . . . . .  15
     10.1.  Normative References . . . . . . . . . . . . . . . . . .  15
     10.2.  Informative References . . . . . . . . . . . . . . . . .  16
   Appendix A.  Revision history . . . . . . . . . . . . . . . . . .  17
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  20

1.  Introduction

   Congestion Exposure (ConEx) is a mechanism by which senders inform
   the network about expected congestion based on congestion feedback
   from previous packets in the same flow.  ConEx concepts and use cases
   are further explained in [RFC6789].  The abstract ConEx mechanism is
   explained in [draft-ietf-conex-abstract-mech].  This document
   describes the necessary modifications to use ConEx with the
   Transmission Control Protocol (TCP).

   The markings for ConEx signaling are defined in the ConEx Destination
   Option (CDO) for IPv6 [draft-ietf-conex-destopt].  Specifically,
   four flags are defined: X (ConEx-capable), L (loss experienced),
   E (ECN experienced), and C (credit).

   ConEx signaling is based on loss or Explicit Congestion Notification
   (ECN) marks [RFC3168] as congestion indications.  The sender collects
   this congestion information based on existing TCP feedback mechanisms
   from the receiver to the sender.  No changes are needed at the
   receiver to implement ConEx signaling.  Therefore no additional
   negotiation is needed to implement and use ConEx at the sender.  This
   document specifies the sender's actions that are needed to provide
   meaningful ConEx information to the network.

   Section 2 provides an overview of the modifications needed for TCP
   senders to implement ConEx.  First, congestion information has to be
   extracted from TCP's loss or ECN feedback, as described in Section 3.
   Section 4 details how to set the CDO marking based on this congestion
   information.  Section 5 discusses loss of packets carrying ConEx
   information.  Section 6 discusses timeliness of the ConEx feedback
   signal, given that congestion is a temporary state.

   This document describes congestion accounting for TCP with and
   without the Selective Acknowledgment (SACK) extension [RFC2018] (in
   Section 3.1).  However, ConEx benefits from the more accurate
   information that SACK provides about the number of bytes dropped in
   the network.  It is therefore preferable to use the SACK extension
   when using TCP with ConEx.  The detailed mechanism to set the L flag
   in response to a loss-based congestion feedback signal is given in
   Section 4.1.

   Whereas loss has to be minimized, ECN can provide more fine-grained
   feedback information.  ConEx-based traffic measurement or management
   mechanisms could benefit from this.  Unfortunately, the current ECN
   feedback mechanism does not reflect multiple congestion markings if
   they occur within the same Round-Trip Time (RTT).  A more accurate
   feedback extension to ECN (AccECN) is proposed in a separate document
   [draft-kuehlewind-tcpm-accurate-ecn], as this is also useful for
   other mechanisms.

   Congestion accounting for both classic ECN feedback and AccECN
   feedback is explained in detail in Section 3.2.  Setting the E flag
   in response to ECN-based congestion feedback is also detailed in
   Section 4.1.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

2.  Sender-side Modifications

   This section gives an overview of actions that need to be taken by a
   TCP sender modified to use ConEx signaling.


   In the TCP handshake, a ConEx sender MUST negotiate for SACK and ECN,
   preferably with AccECN feedback.  Therefore a ConEx sender MUST also
   implement SACK and ECN.  Depending on the capability of the receiver,
   the following operation modes exist:

                              +------+-----+
                              | SACK | ECN |
                              +------+-----+
                              | S    | A   |
                              | S    | C   |
                              | S    | -   |
                              | -    | A   |
                              | -    | C   |
                              | -    | -   |
                              +------+-----+

   S: SACK enabled; A: AccECN enabled; C: Classic ECN [RFC3168] enabled

                           Table 1: ConEx modes.

   A ConEx sender MUST expose all congestion information to the network
   according to the congestion information received by ECN or based on
   loss information provided by the TCP feedback loop.  A TCP sender
   SHOULD count congestion byte-wise (rather than packet-wise; see next
   paragraph).  After any congestion notification, a sender MUST mark
   subsequent packets with the appropriate ConEx flag in the IP header.
   Furthermore, a ConEx sender must send enough credit to cover all
   experienced congestion for the connection so far, as well as the risk
   of congestion for the current transmission (see Section 4.2).

   With SACK the number of lost payload bytes is known, but not the
   number of packets carrying these bytes.  With classic ECN only an
   indication is given that a marking occurred but not the exact number
   of payload bytes nor packets.  As network congestion is usually byte-
   congestion [RFC7141], the byte-size of a packet marked with a CDO
   flag is defined to represent that number of bytes of congestion
   signalling [draft-ietf-conex-destopt].  Therefore the exact number of
   bytes should be taken into account, if available, to make the ConEx
   signal as exact as possible.

   Detailed mechanisms for congestion accounting in each operation mode
   are described in the next section.

3.  Counting congestion

   A ConEx TCP sender maintains two counters: one that counts congestion
   based on the information retrieved by loss detection, and a second
   that accounts for ECN based congestion feedback.  These counters hold
   the number of outstanding bytes that should be ConEx marked with
   the L flag or the E flag, respectively, in subsequent packets.

   The outstanding bytes for congestion indications based on loss are
   added to the loss exposure gauge (LEG), as explained in Section 3.1.

   The outstanding bytes counted based on ECN feedback information are
   added to the congestion exposure gauge (CEG), as explained in
   Section 3.2.

   When the sender sends a ConEx capable packet with the E or L flag set
   it reduces the respective counter by the byte-size of the packet.
   This is explained for both counters in Section 4.1.

   Usually all bytes of an IP packet must be counted.  The sender SHOULD
   therefore take the payload and headers into account, up to and
   including the IP header.  That is, in addition to the TCP payload
   bytes, an appropriate number of header bytes SHOULD be added to the
   gauge for each packet of congestion feedback, and the sender SHOULD
   subtract header bytes from the gauge for each marked packet sent.

   If equal-sized packets, or at least equally distributed packet sizes
   can be assumed, the sender MAY only add and subtract TCP payload
   bytes.  In this case there should be about the same number of ConEx
   marked packets as the original packets that were causing the
   congestion.  Thus both contain about the same number of header bytes
   so they will cancel out.  This case is assumed for simplicity in the
   following sections.

   Otherwise, if a sender sends different sized packets (with unequally
   distributed packet sizes), the sender needs to memorize or estimate
   the number of lost or ECN-marked packets.  A sender might be able to
   reconstruct the number of packets and thus the header bytes if the
   packet sizes of all packets that were sent during the last RTT are
   known.  Otherwise, if no additional information is available, the
   conservative or even worst case number of packets and thus header
   bytes should be estimated, e.g. based on the minimum packet size (of
   all packets sent in the last RTT).  If the number of ConEx marked
   packets is smaller (or larger) than the estimated number of lost or
   ECN-marked packets, the additional header bytes should be added to
   (or can be subtracted from) the respective counter.
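
   The worst-case header estimate described above can be sketched as
   follows.  This is only an illustration under stated assumptions: the
   function name, its parameters, and the 60-byte header figure are
   hypothetical and not defined by this document.

```python
import math

def estimate_header_bytes(congested_payload_bytes, min_payload_size, header_size):
    """Worst-case header bytes for lost or CE-marked payload (hypothetical helper).

    Assumes every affected packet carried the smallest payload sent in the
    last RTT, which maximizes the packet count and thus the header bytes.
    """
    if congested_payload_bytes <= 0:
        return 0
    packets = math.ceil(congested_payload_bytes / min_payload_size)
    return packets * header_size

# Example: 3000 payload bytes lost, smallest payload 500 bytes, and an
# assumed 60 header bytes per packet (illustrative value only).
worst_case = estimate_header_bytes(3000, 500, 60)
```

   The result would be added to the respective gauge together with the
   payload bytes, and corrected later if the number of ConEx-marked
   packets turns out to differ from the estimate.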

3.1.  Loss Detection


3.1.1.  General Approach

   This section applies whether or not SACK support is available.  The
   following section deals with the case when SACK is not available.

   TCP feedback is designed so that the sender can detect losses in
   order to retransmit the lost data.  Therefore, it might be naively
   assumed that a TCP sender only needs to set the ConEx L flag on all
   retransmissions in order to signal the amount of bytes lost.
   However, this will not always be the case.  Therefore, the process of
   loss detection is described here, and the process of ConEx marking is
   described separately in Section 4.1.

   A ConEx sender needs to maintain a local signed counter that
   shall be called the loss exposure gauge (LEG), indicating the number
   of outstanding bytes to be sent with the ConEx L flag.  When a TCP
   sender decides that a data segment needs to be retransmitted, it will
   increase LEG by the size of the TCP payload bytes in the
   retransmission (assuming equal sized segments such that the
   retransmitted packet will have the same number of header bytes as the
   original ones).

   Any retransmission may be spurious.  To accommodate that, a ConEx
   sender SHOULD make use of heuristics to detect such spurious
   retransmissions (e.g., F-RTO [RFC5682], DSACK [RFC3708], and Eifel
   [RFC3522], [RFC4015]).  When such a heuristic has determined that a
   certain number of packets were retransmitted erroneously, the ConEx
   sender SHOULD subtract the payload size of these TCP packets from
   LEG.
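
   A minimal sketch of the LEG bookkeeping described in this section is
   shown below, assuming equal-sized segments so that only payload bytes
   are counted.  The class and method names are hypothetical; how a
   spurious retransmission is detected is left to the heuristics cited
   above.

```python
class LossExposureGauge:
    """Signed byte counter for outstanding ConEx L markings (LEG).

    Hypothetical illustration; names are not defined by this document.
    """

    def __init__(self):
        self.leg = 0

    def on_retransmit(self, payload_bytes):
        # A segment was judged lost and is being retransmitted.
        self.leg += payload_bytes

    def on_spurious_detected(self, payload_bytes):
        # A heuristic (e.g. DSACK) showed the retransmission was unnecessary.
        self.leg -= payload_bytes

    def on_l_marked_packet_sent(self, payload_bytes):
        # Section 4.1: each L-marked packet reduces the gauge by its payload.
        self.leg -= payload_bytes
```

   Because the counter is signed, a late spurious-retransmission
   correction can drive it below zero without special handling.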

3.1.2.  Without SACK Support

   If multiple losses occur within one RTT and SACK is not used, it may
   take several RTTs until all lost data is retransmitted.  With the
   scheme described above, the ConEx information will be delayed
   considerably, but timeliness is important for ConEx.

   For ConEx it is not important to know which data got lost but only
   how much.  During the first RTT after the initial loss
   detection, the amount of received data and thus also the amount of
   lost data can be estimated based on the number of received ACKs.
   Thus without SACK, the information needed for ConEx feedback can be
   available with an additional delay of one RTT by using the following
   estimation algorithm and an additional Loss Estimation Counter (LEC):


      flight_bytes:      current flight size in bytes
      retransmit_bytes:  payload size of the retransmission

      At the first retransmission in a congestion event LEC is set:

         LEC = flight_bytes - 3*SMSS

         (At this point of time in the transmission, in the worst case,
         all packets in flight minus the three that triggered the
         dupACKs could have been lost.)

      Then during the first RTT of the congestion event:

         For each retransmission:
            LEG += retransmit_bytes
            LEC -= retransmit_bytes

         For each ACK:
            LEC -= SMSS


      After one RTT:

         LEG += LEC

         (The LEC now estimates the number of outstanding bytes
         that should be ConEx L marked.)

      After the first RTT for each following retransmission:

         if (LEC > 0): LEC -= retransmit_bytes
         else if (LEC==0): LEG += retransmit_bytes

         if (LEC < 0): LEG += -LEC

         (The LEG is not increased for those bytes that were
         already counted.)
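
   The estimation steps above can be transcribed into runnable form as
   in the following sketch.  The class and method names are
   hypothetical, and the caller is assumed to signal when the first RTT
   of the congestion event ends.  Resetting LEC to zero after an
   overshoot is an added assumption, not part of the algorithm above; it
   keeps the same overshoot from being added to LEG more than once.

```python
class LossEstimator:
    """Non-SACK loss estimation with LEG and LEC (hypothetical sketch)."""

    def __init__(self, smss):
        self.smss = smss
        self.leg = 0            # loss exposure gauge
        self.lec = 0            # loss estimation counter
        self.in_first_rtt = False

    def start_congestion_event(self, flight_bytes):
        # Worst case: everything in flight was lost except the three
        # segments that triggered the duplicate ACKs.
        self.lec = flight_bytes - 3 * self.smss
        self.in_first_rtt = True

    def on_ack(self):
        if self.in_first_rtt:
            self.lec -= self.smss

    def end_first_rtt(self):
        # LEC now estimates the bytes still to be L-marked.
        self.leg += self.lec
        self.in_first_rtt = False

    def on_retransmission(self, retransmit_bytes):
        if self.in_first_rtt:
            self.leg += retransmit_bytes
            self.lec -= retransmit_bytes
            return
        if self.lec > 0:
            self.lec -= retransmit_bytes      # already counted via LEC
        elif self.lec == 0:
            self.leg += retransmit_bytes
        if self.lec < 0:
            self.leg += -self.lec
            self.lec = 0                      # assumption, see lead-in
```

   For example, with an SMSS of 1000 bytes and 10000 bytes in flight,
   one retransmission and five ACKs in the first RTT leave LEC at 1000,
   so LEG becomes 2000 after that RTT.  The next 1000-byte
   retransmission is absorbed by LEC, and only the one after that
   increases LEG again.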

3.2.  ECN

   ECN [RFC3168] is an IP/TCP mechanism that allows network nodes to
   mark packets with the Congestion Experienced (CE) mark instead of
   dropping them when congestion occurs.

   A receiver might support 'classic' ECN, the more accurate ECN
   feedback scheme (AccECN), or neither.  In the case that ECN is not
   supported for a connection, of course, no ECN marks will occur; thus
   the sender will never set the E flag.  Otherwise, a ConEx sender must
   maintain a signed counter, the congestion exposure gauge (CEG), for
   the number of outstanding bytes that have to be ConEx marked with the
   E flag.

   The CEG is increased when ECN information is received from an ECN-
   capable receiver supporting the 'classic' ECN scheme or the accurate
   ECN feedback scheme.  When the ConEx sender receives an ACK
   indicating one or more segments were received with a CE mark, CEG is
   increased by the appropriate number of bytes as described further
   below.

   Unfortunately in case of duplicate acknowledgements the number of
   newly acknowledged bytes will be zero even though (CE marked) data
   has been received.  Therefore, we increase the CEG by DeliveredData,
   as defined below:

   DeliveredData = acked_bytes + SACK_diff + (is_dup)*1SMSS -
   (is_after_dup)*num_dup*1SMSS

   DeliveredData covers the number of bytes that have been newly
   delivered to the receiver.  Therefore on each arrival of an ACK,
   DeliveredData will be increased by the newly acknowledged bytes
   (acked_bytes) as indicated by the current ACK, relative to all past
   ACKs.  The formula depends on whether SACK is available, as follows:

   With SACK:  DeliveredData is increased by the number of bytes
      provided by (new) SACK information (SACK_diff).  Note, if fewer
      unacknowledged bytes are announced in the new SACK information
      than in the previous ACK, SACK_diff can be negative.  In this
      case, data is newly acknowledged (in acked_bytes), that has
      previously already been accumulated into DeliveredData based on
      SACK information.

   Without SACK:  DeliveredData is estimated to be 1 SMSS on duplicate
      acknowledgements.  For the subsequent partial or full ACK,
      DeliveredData is estimated to be the newly acknowledged bytes,
      minus one SMSS for each preceding duplicate ACK.  Therefore is_dup
      is one if the current ACK is a duplicated ACK without SACK, and
      zero otherwise. is_after_dup is only one for the next full or
      partial ACK after a number of duplicated ACKs without SACK and
      num_dup counts the number of duplicated ACKs in a row.
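
   The DeliveredData formula can be written as a small function, as
   sketched below.  The function name and argument order are
   hypothetical; the terms follow the formula above, with is_dup and
   is_after_dup only ever true when SACK is not in use.

```python
def delivered_data(acked_bytes, sack_diff, is_dup, is_after_dup, num_dup, smss):
    """Bytes newly delivered to the receiver, as seen from one ACK.

    Hypothetical helper mirroring the DeliveredData formula.
    """
    return (acked_bytes
            + sack_diff
            + int(is_dup) * smss
            - int(is_after_dup) * num_dup * smss)

SMSS = 1460  # example segment size in bytes

# Without SACK: a duplicate ACK is estimated to deliver one SMSS.
dup = delivered_data(0, 0, True, False, 0, SMSS)

# Without SACK: the full ACK after three duplicate ACKs acknowledges four
# segments, three of which were already counted on the duplicates.
full = delivered_data(4 * SMSS, 0, False, True, 3, SMSS)

# With SACK: the newly SACKed bytes are counted directly.
sacked = delivered_data(0, SMSS, False, False, 0, SMSS)
```

   In each of the three cases one SMSS is counted, so the total over the
   non-SACK sequence equals the bytes finally acknowledged.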

   With classic ECN, as soon as a CE mark is seen at the receiver, it
   will feed this information back to the sender by setting the Echo
   Congestion Experienced (ECE) flag in the TCP header of subsequent
   ACKs.  Once the sender receives the first ECE of a congestion
   notification, it sets the CWR flag in the TCP header once.  When this
   packet with Congestion Window Reduced (CWR) flag in the TCP header
   arrives at the receiver, acknowledging its first ECE feedback, the
   receiver stops setting ECE.

   Thus, with classic ECN, one congestion marked packet causes
   continuous congestion feedback for a whole round trip, thus hiding
   the arrival of any further congestion marked packets during that
   round trip.  The more accurate ECN feedback scheme (AccECN) has been
   defined to ensure that feedback properly reflects the extent of
   congestion marking.  The two cases, with and without a receiver
   capable of AccECN, are discussed in the following sections.

3.2.1.  Accurate ECN feedback

   With the more accurate ECN feedback scheme (AccECN), either the
   number of marked packets or the number of marked bytes is known.  In
   the latter case, the CEG can directly be increased by the number of
   marked bytes.  Otherwise, if D is assumed to be the number of marks,
   the gauge (CEG) will be conservatively increased by one SMSS for each
   marking or at most the number of newly acknowledged bytes:

   CEG += min(SMSS*D, DeliveredData)
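
   As an illustration, the AccECN update can be sketched as follows.
   The function and parameter names are hypothetical, and how D or the
   marked byte count is extracted from the AccECN feedback is outside
   the scope of this sketch.

```python
def ceg_increase_accecn(smss, new_ce_marks, delivered_data, marked_bytes=None):
    """Bytes to add to CEG for one ACK carrying AccECN feedback.

    Hypothetical helper: uses the exact marked byte count when the
    feedback provides it, otherwise min(SMSS*D, DeliveredData).
    """
    if marked_bytes is not None:
        return marked_bytes
    return min(smss * new_ce_marks, delivered_data)

# Two marks reported on an ACK that delivered only one 1460-byte segment:
# the increase is capped by DeliveredData.
capped = ceg_increase_accecn(1460, 2, 1460)
```

   The cap ensures that the sender never exposes more congestion for an
   ACK than the bytes that ACK reports as delivered.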

3.2.2.  Classic ECN support

   If the ConEx sender fully conforms to the semantics of ECN signaling
   as defined by [RFC5562], it will receive one full RTT of ACKs
   with the ECE flag set whenever at least one CE mark was received by
   the receiver.  As the sender cannot estimate how many packets have
   actually been CE marked during this RTT, the most conservative
   assumption MAY be taken, namely assuming that all packets were
   marked.  This can be achieved by increasing the CEG by DeliveredData
   for each ACK with the ECE flag:

   CEG += DeliveredData
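
   The effect of this conservative assumption can be seen in a short
   sketch.  The function name and the representation of ACKs as
   (DeliveredData, ECE) pairs are hypothetical.

```python
def ceg_increase_classic(acks):
    """Total CEG increase for a sequence of ACKs under classic ECN.

    acks: iterable of (delivered_data_bytes, ece_flag) pairs
    (hypothetical representation).
    """
    return sum(delivered for delivered, ece in acks if ece)

# One RTT in which the receiver echoes ECE on every ACK, each ACK
# covering two 1460-byte segments, followed by one ACK without ECE.
rtt_acks = [(2920, True)] * 4 + [(2920, False)]
total = ceg_increase_classic(rtt_acks)
```

   Here CEG grows by 11680 bytes even if only a single packet was
   actually CE marked, which is the overestimate that AccECN and the
   advanced compatibility mode aim to reduce.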

   Optionally a ConEx sender could implement the following technique,
   called advanced compatibility mode, to considerably improve its
   estimate of the number of ECN-marked packets:

   To extract more than one ECE indication per RTT, a ConEx sender could
   set the CWR flag continuously to force the receiver to signal only
   one ECE per CE mark.  Unfortunately, the use of delayed ACKs
   [RFC5681] (which is common) will prevent feedback of every CE mark;
   if a CWR confirmation is received before the ECE can be sent out on
   the next ACK, ECN feedback information could get lost.  Thus a sender
   SHOULD set CWR only on those data segments that will actually trigger
   a (delayed) ACK.  The sender would need an additional control loop to
   estimate which data segments will trigger an ACK in order to extract
   more timely congestion notifications.  Still the CEG SHOULD be
   increased by DeliveredData, as one or more CE marked packets could be
   acknowledged by one delayed ACK.

   The repetition of ECE in classic ECN is intended to ensure reliable
   delivery of congestion feedback.  The following argument is intended
   to prove that suppressing repetitions of ECE is safe against possible
   congestion collapse due to lost congestion feedback.

   With advanced compatibility mode, if an ACK containing ECE is lost,
   the continual CWRs prevent it being repeated, so it will remain lost.
   Therefore, if congestion is light on the forward path and heavy on
   the reverse, most of the light congestion signals will be lost.  If
   loss of feedback exacerbates congestion on the forward path, more
   forward packets will be CE marked, increasing the likelihood that
   feedback from at least one CE will get through per RTT.  As long as
   one ECE reaches the sender per RTT, the sender's congestion response
   will be the same as if CWR were not continuous.  The only way that
   heavy congestion on the forward path could be completely hidden would
   be if all ACKs on the reverse path were lost.  If total ACK loss
   persisted, the sender would time out and do a congestion response
   anyway.  Therefore, the problem seems confined to potential
   suppression of a congestion response during light congestion.

   Anyway, even if loss of all ECN feedback led to no congestion
   response, the worst that could happen would be loss instead of ECN-
   signalled congestion on the forward path.  Given compatibility mode
   does not affect loss feedback, there would be no risk of congestion
   collapse.

4.  Setting the ConEx Flags

   By setting the X flag, a packet is marked as ConEx-capable.  All
   packets carrying payload MUST be marked with the X flag set,
   including retransmissions.  No congestion feedback information is
   available about control packets such as pure ACKs which are not
   carrying any payload.  Thus these packets should not be taken into
   account when determining ConEx information.  These packets MUST carry
   a ConEx Destination Option with the X flag unset.

4.1.  Setting the E or the L Flag

   As long as the LEG or CEG counter is positive, the sender MUST mark
   each ConEx-capable packet with L or E respectively, and decrease the
   LEG or CEG counter by the TCP payload bytes carried in the marked
   packet (assuming headers are not being counted because packet sizes
   are regular).  No matter how small the value of LEG or CEG, if it is
   positive, to ensure ConEx signals are timely, the sender MUST NOT
   defer packet marking.  Therefore the value of LEG and CEG will
   commonly be negative.

   Multiple ConEx flags may be required for signaling at the same time.
   This may happen, for example, during excessive congestion when an ACK
   is received by the sender that simultaneously indicates that at least
   one segment has been lost, and that one or more ECN marks were
   received.  Another case when this might happen is when ACKs are lost,
   so that a subsequent ACK carries summary information not previously
   available to the sender.

   Whenever both LEG and CEG are positive, the sender MUST mark each
   ConEx-capable packet with both L and E.  If a credit signal is also
   pending (see Section 4.2), the C flag can be set as well.
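
   The marking rule of this section can be sketched as follows for a
   packet carrying payload, assuming regular packet sizes so that only
   payload bytes are counted.  The function name and the dictionary
   layout are hypothetical; credit (C) marking is handled separately.

```python
def mark_packet(gauges, payload_bytes):
    """Choose ConEx flags for one payload-carrying packet.

    gauges: dict with signed byte counters under "LEG" and "CEG"; it is
    updated in place (hypothetical layout).
    """
    flags = {"X"}
    if gauges["LEG"] > 0:
        flags.add("L")
        gauges["LEG"] -= payload_bytes
    if gauges["CEG"] > 0:
        flags.add("E")
        gauges["CEG"] -= payload_bytes
    return flags

# A small positive LEG still triggers marking of a full-sized packet,
# which leaves the gauge negative afterwards.
state = {"LEG": 100, "CEG": 0}
first = mark_packet(state, 1460)
second = mark_packet(state, 1460)
```

   The first packet is marked with L and leaves LEG at -1360; the second
   packet carries only X because neither gauge is positive.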

4.2.  Setting the Credit Flag

   The ConEx abstract mechanism [draft-ietf-conex-abstract-mech]
   requires that sufficient credit must be signaled in advance to cover
   the expected congestion during the feedback delay of one RTT.

   This section proposes concrete algorithms for determining how much
   credit to signal during congestion avoidance and slow start.
   However, experimentation in better credit setting algorithms is
   expected and encouraged.  The wider goal of ConEx is to reflect the
   'cost' of the risk of causing congestion on those that contribute
   most to it.  Thus, experimentation is encouraged in better ways to
   improve or maintain performance while reducing the risk of causing
   congestion, and therefore reducing the need to signal so much credit.

   For a simple credit algorithm, a ConEx sender SHOULD maintain a
   counter of the sent credits c in bytes.  If congestion occurs,
   credits will be consumed and the c counter SHOULD be reduced by the
   number of bytes that were lost or estimated to be ECN-marked.  If
   the risk of congestion was estimated wrongly and thus too few credits
   were sent, the c counter becomes zero but cannot go negative.

   During TCP congestion avoidance, the amount of credit sent SHOULD
   exceed the amount of congestion experienced by at least the number of
   bytes in flight, as all packets could potentially get lost or
   congestion marked.  Thus a ConEx sender should monitor the
   number of bytes in flight f.  Whenever f becomes larger than c, the
   ConEx sender SHOULD set the C flag on each ConEx-capable packet and
   increase c by the size of each marked packet until it is no less than
   f again.

   Recall that c will be decreased whenever congestion occurs, therefore
   c will need to be replenished as soon as c drops below f.  Also
   recall that the sender can set the C flag on a ConEx-capable packet
   whether or not the E or L flags are also set.
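
   A minimal sketch of this simple credit algorithm during congestion
   avoidance is given below.  The class and method names are
   hypothetical, and the caller is assumed to supply the current flight
   size f in bytes.

```python
class CreditCounter:
    """Tracks sent credit c in bytes (hypothetical sketch)."""

    def __init__(self):
        self.c = 0

    def on_congestion(self, congested_bytes):
        # Credit is consumed by congestion; c cannot go negative.
        self.c = max(0, self.c - congested_bytes)

    def on_send(self, flight_bytes, packet_bytes):
        """Return True if the C flag should be set on this packet."""
        if flight_bytes > self.c:
            self.c += packet_bytes
            return True
        return False

# With 3000 bytes in flight and 1000-byte packets, three packets carry
# the C flag before c has caught up with f.
credit = CreditCounter()
marks = [credit.on_send(3000, 1000) for _ in range(4)]
```

   After a congestion event reduces c, the same check causes subsequent
   packets to be credit marked again until c is no less than f.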

   In TCP slow start, the congestion window might grow much larger than
   during the rest of the transmission.  Thus a sender could consider
   sending fewer than f credits, at the risk of being penalized by an
   audit function.  In any case the credits SHOULD at least cover the
   increase in sending rate.  Given the sending rate doubles every RTT in
   Slow Start, a ConEx sender should at least cover half the number of
   packets in flight by credits.  Note that the number of losses or
   markings within one RTT does not solely depend on the sender's
   actions.  In general, the behavior of the cross traffic, whether
   active queue management (AQM) is used and how it is parameterized
   influence how many packets might be dropped or marked.  As long as
   any AQM encountered is not overly aggressive with ECN marking,
   sending half the flight size as credits should be sufficient whether
   congestion is signaled by loss or ECN.  Marking C on every second
   packet in the initial window and every fourth packet in slow start
   will introduce the correct amount of credit as can be seen in
   Figure 1.  This behaviour is most easily achieved by using the
   following formula to update c as every packet is sent during slow
   start:

      c = (f+1)/2, using integer division.


                                            f   c=(f+1)/2
                  RTT1  |------XC------>|   1   1
                        |------X------->|   2   1
                        |------XC------>|   3   2
                        |               |
                  RTT2  |------X------->|   3   2
                        |------X------->|   4   2
                        |------X------->|   4   2
                        |------XC------>|   5   3
                        |------X------->|   5   3
                        |------X------->|   6   3
                        |               |
                  RTT3  |------X------->|   6   3
                        |------XC------>|   7   4
                        |------X------->|   7   4
                        |------X------->|   8   4
                        |------X------->|   8   4
                        |------XC------>|   9   5
                        |------X------->|   9   5
                        |------X------->|  10   5
                        |------X------->|  10   5
                        |------XC------>|  11   6
                        |------X------->|  11   6
                        |------X------->|  12   6
                        |      .        |
                        |      :        |

       Figure 1: Credits in Slow Start (with an initial window of 3)
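
   The marking pattern of Figure 1 can be reproduced with a few lines of
   code, as sketched below.  The function name is hypothetical, and f
   and c are counted in packets, as in the figure, rather than in bytes.

```python
def slow_start_credit_marks(flight_sizes):
    """Return, per packet sent, whether the C flag is set.

    flight_sizes: flight size f (in packets) at the time each packet is
    sent.  c follows c = (f+1)/2 with integer division; a packet is
    credit marked whenever that target exceeds the credit sent so far.
    Hypothetical helper for illustration only.
    """
    c = 0
    marks = []
    for f in flight_sizes:
        target = (f + 1) // 2
        marks.append(target > c)
        c = max(c, target)
    return marks

# Flight sizes for RTT1 and RTT2 of Figure 1 (initial window of 3).
figure_f = [1, 2, 3, 3, 4, 4, 5, 5, 6]
figure_marks = slow_start_credit_marks(figure_f)
```

   For these flight sizes the first, third, and seventh packets are
   credit marked, matching the XC rows of the figure.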

   It is possible that a TCP flow will encounter an audit function
   without relevant flow state, due to e.g. rerouting or memory
   limitations.  Therefore, the sender needs to detect this case and
   resend credits.  Thus a ConEx sender should reset the credit count c
   to zero if losses occur in two subsequent RTTs (assuming that the
   sending rate was correctly reduced based on the received congestion
   signal).

5.  Loss of ConEx information

   Packets carrying ConEx signals could be discarded themselves.  This
   will be a second order problem (e.g. if the loss probability is 0.1%,
   the probability of losing a loss signal will be 0.1% of 0.1% =
   0.0001%).  Therefore, an implementer MAY choose to ignore this
   problem, accepting instead the risk that an audit function might
   slightly increase the loss level (e.g. from 0.1000% to 0.1001%).

   Nonetheless, a ConEx sender SHOULD remember which packet was marked
   with either the L, the E or the C flag.  If one of these packets is
   detected as lost, the sender SHOULD increase the respective gauge(s),
   LEG or CEG, by the number of lost payload bytes in addition to
   increasing LEG for the loss.
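
   A minimal sketch of this bookkeeping is given below.  The types and
   names are hypothetical.  The paragraph above names only LEG and CEG,
   so the sketch merely records lost C-marked bytes separately; how
   lost credit is re-sent is an open assumption here.

```python
from dataclasses import dataclass


@dataclass
class SentPacket:
    payload_bytes: int
    flags: frozenset  # ConEx flags set on the packet: subset of L, E, C


class Gauges:
    def __init__(self):
        self.leg = 0                # loss exposure gauge (bytes)
        self.ceg = 0                # congestion exposure gauge (bytes)
        self.lost_credit_bytes = 0  # hypothetical record of lost credit

    def on_packet_lost(self, pkt):
        # Every lost packet counts as a loss in its own right.
        self.leg += pkt.payload_bytes
        # The signals it carried never arrived, so they are owed again.
        if "L" in pkt.flags:
            self.leg += pkt.payload_bytes
        if "E" in pkt.flags:
            self.ceg += pkt.payload_bytes
        if "C" in pkt.flags:
            self.lost_credit_bytes += pkt.payload_bytes
```

   For example, losing a 1000-byte packet that carried the L flag
   raises LEG by 2000 bytes: 1000 for the new loss and 1000 for the
   signal that has to be repeated.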

6.  Timeliness of the ConEx Signals

   ConEx signals will only be useful to a network node within a time
   delay of about one RTT after the congestion occurred.  To avoid
   further delays, a ConEx sender SHOULD send the ConEx signaling on the
   next available packet.

   Any or all of the ConEx flags can be used in the same packet, which
   allows delay to be minimised when multiple signals are pending.

   If a flow becomes application-limited, there could be insufficient
   bytes to send to reduce the gauges to zero or below.  In such cases,
   the sender cannot help but delay ConEx signals.  Nonetheless, as long
   as the sender is marking all outgoing packets, an audit function is
   unlikely to penalize ConEx-marked packets.  Therefore, no matter how
   long a gauge has been positive, a sender MUST NOT reduce the gauge by
   more than the ConEx marked bytes it has sent.
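
   The rule that a gauge is reduced only by ConEx-marked bytes actually
   sent can be illustrated as follows.  The function names are
   hypothetical, and the sketch assumes that a packet is marked
   whenever the gauge is positive.

```python
def send_packet(gauge, payload_bytes):
    """Return (conex_marked, new_gauge) for one outgoing packet."""
    if gauge > 0:
        # A signal is pending: mark the packet and drain the gauge by
        # the bytes actually sent, never by more.
        return True, gauge - payload_bytes
    return False, gauge


def drain(gauge, payload_sizes):
    """Send the given packets in order; return the remaining gauge."""
    for size in payload_sizes:
        _, gauge = send_packet(gauge, size)
    return gauge
```

   With a gauge of 3000 bytes and only 1400 bytes to send, 1600 bytes
   of signal remain pending until the application provides more data.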

   If the CEG or LEG counter is negative, the respective counter SHOULD
   be reset to zero within one RTT after it was last decreased, or
   one RTT after recovery if no further congestion occurred.
   [CREF16]
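
   One possible way to implement the reset of a negative gauge is
   sketched below.  The names are hypothetical, time is passed in
   explicitly, and only the first condition (one RTT after the last
   decrease) is modelled; the recovery case is omitted.

```python
class Gauge:
    def __init__(self):
        self.value = 0
        self.last_decrease = None  # time of the most recent decrease

    def decrease(self, nbytes, now):
        """Drain the gauge by ConEx-marked bytes sent at time `now`."""
        self.value -= nbytes
        self.last_decrease = now

    def expire_negative(self, now, rtt):
        """Reset a negative gauge once one RTT has passed since the
        last decrease."""
        if (self.value < 0 and self.last_decrease is not None
                and now - self.last_decrease >= rtt):
            self.value = 0
```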

   If SACK information is not available, spurious retransmissions are
   more likely.  In this case, it might be valuable to slightly delay
   the ConEx loss feedback until a spurious retransmission can be
   detected.  But the ConEx signal MUST NOT be delayed by more than one
   RTT as long as data packets are being sent.  [CREF17]
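
   The bounded delay can be sketched as follows.  All names are
   hypothetical; the sketch assumes that suspected losses are keyed by
   sequence number and that some other component reports spurious
   retransmissions.  It does not model the condition that data packets
   are actually being sent.

```python
class DelayedLossSignal:
    def __init__(self, rtt, delay):
        # Never hold a loss signal back for longer than one RTT.
        self.delay = min(delay, rtt)
        self.pending = {}  # seq -> (release_time, nbytes)
        self.leg = 0       # loss exposure gauge (bytes)

    def on_loss_suspected(self, seq, nbytes, now):
        self.pending[seq] = (now + self.delay, nbytes)

    def on_spurious(self, seq):
        # The retransmission was unnecessary: drop the pending signal.
        self.pending.pop(seq, None)

    def on_tick(self, now):
        # Move every signal whose delay has expired into the gauge.
        due = [seq for seq, (t, _) in self.pending.items() if t <= now]
        for seq in due:
            _, nbytes = self.pending.pop(seq)
            self.leg += nbytes
```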

7.  Acknowledgements

   The authors would like to thank Bob Briscoe, who contributed the
   initial ideas [I-D.briscoe-conex-re-ecn-tcp] and valuable feedback.
   Moreover, thanks to Jana Iyengar, who provided valuable feedback.

8.  IANA Considerations

   This document does not have any requests to IANA.

9.  Security Considerations

   General ConEx security considerations are covered extensively in the
   ConEx abstract mechanism [draft-ietf-conex-abstract-mech].  This
   section covers TCP-specific concerns.


   The ConEx modifications to TCP provide no mechanism for a receiver to
   force a sender not to use ConEx.  A receiver can degrade the accuracy
   of ConEx by claiming that it does not support SACK, AccECN or ECN,
   but the sender will never have to turn ConEx off.  The receiver
   cannot force the sender to have to mark ConEx more conservatively, in
   order to cover the risk of any inaccuracy.  Instead the sender can
   choose to mark inaccurately, which will only increase the likelihood
   of loss at an audit function.  Thus the receiver will only harm
   itself.

   Assuming the sender is limited in some way by a congestion allowance
   or quota, a receiver could spoof more loss or ECN congestion feedback
   than it actually experiences, in an attempt to make the sender draw
   down its allowance faster than necessary.  However, over-declaring
   congestion simply makes the sender slow down.  If the receiver is
   interested in the content it will not want to harm its own
   performance.

   However, if the receiver is solely interested in making the sender
   draw down its allowance, the net effect will depend on the sender's
   congestion control algorithm.  With New Reno [RFC5681], doubling
   congestion feedback causes the sender to consume sqrt(2), i.e. about
   1.4, times more congestion allowance.  However, to improve scaling,
   congestion control algorithms are tending towards less responsive
   algorithms like Cubic or Compound TCP, and ultimately to linear
   algorithms like DCTCP [DCTCP].  If the receiver doubles congestion
   feedback, it causes these senders to consume more allowance by a
   factor of about 1.2, 1.15 or 1, respectively, where 1 implies the
   attack has become completely ineffective.
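
   These factors can be reproduced under a simple model.  Assume the
   steady-state sending rate scales as p^(-b), where p is the fraction
   of congestion feedback, so that the allowance consumed per unit time
   scales as p * p^(-b) = p^(1-b).  The exponents b used below are not
   taken from this document; they are assumptions that reproduce the
   quoted factors.

```python
# Assumed response exponents b in rate ~ p**(-b); illustrative only.
RESPONSE_EXPONENT = {
    "New Reno": 0.5,
    "Cubic": 0.75,
    "Compound TCP": 0.8,
    "DCTCP": 1.0,
}


def allowance_factor(b, feedback_multiplier=2.0):
    """Factor by which allowance consumption grows when the congestion
    feedback is multiplied by `feedback_multiplier`."""
    # Consumption ~ p * rate ~ p**(1 - b)
    return feedback_multiplier ** (1.0 - b)


for name, b in RESPONSE_EXPONENT.items():
    print(f"{name}: {allowance_factor(b):.2f}")
```

   The closer b is to 1, the less a receiver gains by inflating its
   congestion feedback.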

10.  References

10.1.  Normative References

   [RFC2018]  Mathis, M., Mahdavi, J., Floyd, S., and A. Romanow, "TCP
              Selective Acknowledgment Options", RFC 2018, October 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP", RFC
              3168, September 2001.

   [RFC5681]  Allman, M., Paxson, V., and E. Blanton, "TCP Congestion
              Control", RFC 5681, September 2009.


   [draft-ietf-conex-abstract-mech]
              Mathis, M. and B. Briscoe, "Congestion Exposure (ConEx)
              Concepts and Abstract Mechanism", draft-ietf-conex-
              abstract-mech-06 (work in progress), October 2012.

   [draft-ietf-conex-destopt]
              Krishnan, S., Kuehlewind, M., and C. Ucendo, "IPv6
              Destination Option for ConEx", draft-ietf-conex-destopt-04
              (work in progress), March 2013.

10.2.  Informative References

   [DCTCP]    Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel,
              P., Prabhakar, B., Sengupta, S., and M. Sridharan, "DCTCP:
              Efficient Packet Transport for the Commoditized Data
              Center", Jan 2010.

   [I-D.briscoe-conex-re-ecn-tcp]
              Briscoe, B., Jacquet, A., Moncaster, T., and A. Smith,
              "Re-ECN: Adding Accountability for Causing Congestion to
              TCP/IP", draft-briscoe-conex-re-ecn-tcp-04 (work in
              progress), July 2014.

   [RFC3522]  Ludwig, R. and M. Meyer, "The Eifel Detection Algorithm
              for TCP", RFC 3522, April 2003.

   [RFC3708]  Blanton, E. and M. Allman, "Using TCP Duplicate Selective
              Acknowledgement (DSACKs) and Stream Control Transmission
              Protocol (SCTP) Duplicate Transmission Sequence Numbers
              (TSNs) to Detect Spurious Retransmissions", RFC 3708,
              February 2004.

   [RFC4015]  Ludwig, R. and A. Gurtov, "The Eifel Response Algorithm
              for TCP", RFC 4015, February 2005.

   [RFC5562]  Kuzmanovic, A., Mondal, A., Floyd, S., and K.
              Ramakrishnan, "Adding Explicit Congestion Notification
              (ECN) Capability to TCP's SYN/ACK Packets", RFC 5562, June
              2009.

   [RFC5682]  Sarolahti, P., Kojo, M., Yamamoto, K., and M. Hata,
              "Forward RTO-Recovery (F-RTO): An Algorithm for Detecting
              Spurious Retransmission Timeouts with TCP", RFC 5682,
              September 2009.

   [RFC6789]  Briscoe, B., Woundy, R., and A. Cooper, "Congestion
              Exposure (ConEx) Concepts and Use Cases", RFC 6789,
              December 2012.


   [RFC7141]  Briscoe, B. and J. Manner, "Byte and Packet Congestion
              Notification", BCP 41, RFC 7141, February 2014.

   [draft-kuehlewind-tcpm-accurate-ecn]
              Kuehlewind, M. and R. Scheffenegger, "More Accurate ECN
              Feedback in TCP", draft-kuehlewind-tcpm-accurate-ecn-02
              (work in progress), Jun 2013.

Appendix A.  Revision history

   RFC Editor: This section is to be removed before RFC publication.

   00 ... initial draft, early submission to meet deadline.

   01 ... refined draft, updated LEG "drain" from per-packet to RTT-
   based.

   02 ... added Section 5 and expanded discussion about ECN interaction.

   03 ... expanded the discussion around credit bits.

   04 ... review comments of Jana addressed.  (Change in full compliance
   mode.)

   05 ... changes on Loss Detection without SACK, support of classic ECN
   and credit handling.


Editorial Comments

[CREF1] BB: 'finally' here would mean "At last (sigh), here's what
        you've all been waiting for." :-)

[CREF2] BB: Avoid 'recommended', which could be confused with the
        normative upper-cased word.  The normative language later is
        good and sufficient.

[CREF3] BB: I don't understand this last sentence.  How does the sender
        suddenly know something it didn't know before?

[CREF4] BB: I've added this sentence, but only to give you an excuse for
        having devised all this mechanism.  However, I really don't know
        why you're going to all this trouble to be so accurate and
        timely.  TCP never retransmits less data than is lost.  And over
        the years TCP designers have been reducing the amount of
        unnecessary retransmission, and reducing retransmission delay.
        So I suggest we just mark retransmissions with the L flag.
        Done!  No need even for a loss exposure gauge. ...If the sender
        is faced with insufficient information such that the universe of
        TCP designers has been unable to minimise unnecessary or delayed
        retransmissions, why try to do better than everyone has so far
        managed?  Just accept that you will be over-declaring or
        sluggishly declaring ConEx.  And assume that deployment of all
        the techniques to reduce late or spurious losses is proceeding,
        and we can walk on their shoulders.

[CREF5] BB: I suggest removing MUST, because we cannot mandate a
        particular implementation technique.

[CREF6] BB: If these mechanisms are being used, surely they will be
        being used to /prevent/ spurious retransmissions (not just count
        them but still retransmit anyway).  So, if we increase LEG only
        when a retransmission actually occurs, is that not sufficient?

[CREF7] BB: OK, I get that.  But, as above, why worry about optimising a
        case that is becoming rare, because everyone recognised late
        retransmission was a problem, so SACK is pretty much universally
        deployed.  Would you be unhappy if all this was deleted?
        Perhaps relegate to an appendix?  But is it really so necessary?

[CREF8] BB: I think 3 has been used instead of num_dup in the LEC
        algorithm earlier.

[CREF9] BB: I changed 'a' to 'the'.  Did you mean a generally more
        accurate scheme, or the AccECN scheme in particular?  If the
        latter, as it stands, the AccECN scheme doesn't give marked
        bytes.

[CREF10] BB: Surely RFC5562 only adds ECT on the SYN/ACK.  Is it really
         necessary to even refer to it in this draft?  Whatever, it
         doesn't seem particularly relevant to this sentence.  Or did
         you mean RFC3168?

[CREF11] BB: I thought the result of the discussion about how to say
         whether the X flag is set in conex-destopt was that X is set
         irrespective of whether loss or ECN marking of the packet
         itself can be detected.  The relevant sentence in conex-destopt
         is: "This [X=0] can be the case if no congestion feedback is
         (currently) available e.g. in TCP if one endpoint has been
         receiving data but sending nothing but pure ACKs (no user data)
         for some time."

[CREF12] BB: I would prefer if this were stated at the maximum required,
         not a recommended value.  The idea is to hold as much credit as
         the /likely/ worst-case congestion, not the /absolute/ worst
         case (I did experiments to find the variance of congestion in
         my PhD).

[CREF13] BB: Again, rather than a SHOULD, can we make this a
         recommendation that is part of the reason for ConEx
         experimentation? - especially if variants like hybrid SS are
         enabled.

[CREF14] BB: Just marking every fourth packet doesn't work for a general
         IW.  During the IW, mark the first packet and every other
         packet, then after IW mark every fourth packet (to determine
         precisely which is the first packet to mark after the IW,
         maintain a packet counter and double it when IW ends).

[CREF15] BB: Whoa!  This is rather excessively conservative isn't it?
         There will often be a loss in 2 consecutive RTTs due to normal
         congestion.  If there's a re-route, I think the new audit will
         drop a whole window, so the sender will naturally send a whole
         window's worth of credit with the retransmissions.  Am I wrong?

[CREF16] BB: This adds complexity.  I would suggest this is a MAY.  It
         depends on how audit is done whether it is necessary, so this
         will depend on experiments.  For instance, in the audit
         function I designed, there was a long term and a short term
         comparison, and the long term one became more relaxed the
         longer the flow had been behaving.  (Note I have also suggested
         moving this and the next para from "Setting E/L" to
         "Timeliness")


[CREF17] BB: As before, I disagree with the need for this para - this is
         trying to optimise a case that is rare because it's known to be
         sub-optimal, by compromising ConEx timeliness.  SACK is nearly
         universal.  If SACK isn't available, things are bound to be
         non-optimal.  The solution is for the receiver to deploy SACK
         like nearly every other receiver has done, not to add more
         complexity to the sender and more delay to ConEx.

Authors' Addresses

   Mirja Kuehlewind (editor)
   ETH Zurich
   Switzerland

   Email: mirja.kuehlewind@tik.ee.ethz.ch


   Richard Scheffenegger
   NetApp, Inc.
   Am Euro Platz 2
   Vienna  1120
   Austria

   Phone: +43 1 3676811 3146
   Email: rs@netapp.com

