draft-briscoe-tsvwg-re-ecn-tcp-06.txt | draft-briscoe-tsvwg-re-ecn-tcp-07.txt | |||
---|---|---|---|---|
Transport Area Working Group B. Briscoe | Transport Area Working Group B. Briscoe | |||
Internet-Draft BT & UCL | Internet-Draft BT & UCL | |||
Intended status: Standards Track A. Jacquet | Intended status: Standards Track A. Jacquet | |||
Expires: January 15, 2009 T. Moncaster | Expires: September 4, 2009 T. Moncaster | |||
A. Smith | A. Smith | |||
BT | BT | |||
July 14, 2008 | March 3, 2009 | |||
Re-ECN: Adding Accountability for Causing Congestion to TCP/IP | Re-ECN: Adding Accountability for Causing Congestion to TCP/IP | |||
draft-briscoe-tsvwg-re-ecn-tcp-06 | draft-briscoe-tsvwg-re-ecn-tcp-07 | |||
Status of this Memo | Status of This Memo | |||
By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
other groups may also distribute working documents as Internet- | other groups may also distribute working documents as Internet- | |||
Drafts. | Drafts. | |||
skipping to change at page 1, line 37 | skipping to change at page 1, line 37 | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
This Internet-Draft will expire on January 15, 2009. | This Internet-Draft will expire on September 4, 2009. | |||
Copyright Notice | Copyright Notice | |||
Copyright (C) The IETF Trust (2008). | Copyright (c) 2009 IETF Trust and the persons identified as the | |||
document authors. All rights reserved. | ||||
This document is subject to BCP 78 and the IETF Trust's Legal | ||||
Provisions Relating to IETF Documents in effect on the date of | ||||
publication of this document (http://trustee.ietf.org/license-info). | ||||
Please review these documents carefully, as they describe your rights | ||||
and restrictions with respect to this document. | ||||
Abstract | Abstract | |||
This document introduces a new protocol for explicit congestion | This document introduces a new protocol for explicit congestion | |||
notification (ECN), termed re-ECN, which can be deployed | notification (ECN), termed re-ECN, which can be deployed | |||
incrementally around unmodified routers. It enables the upstream | incrementally around unmodified routers. The protocol works by | |||
party at any trust boundary in the internetwork to be held | arranging an extended ECN field in each packet so that, as it crosses | |||
responsible for the congestion they cause, or allow to be caused. | any interface in an internetwork, it will carry a truthful prediction | |||
of congestion on the remainder of its path. The purpose of this | ||||
So, networks can introduce straightforward accountability for | document is to specify the re-ECN protocol at the IP layer and to | |||
congestion and policing mechanisms for incoming traffic from end- | give guidelines on any consequent changes required to transport | |||
customers or from neighbouring network domains. The protocol works | protocols. It includes the changes required to TCP both as an | |||
by arranging an extended ECN field in each packet so that, as it | example and as a specification. It briefly gives examples of | |||
crosses any interface in an internetwork, it will carry a truthful | ||||
prediction of congestion on the remainder of its path. The purpose | ||||
of this document is to specify the re-ECN protocol at the IP layer | ||||
and to give guidelines on any consequent changes required to | ||||
transport protocols. It includes the changes required to TCP both as | ||||
an example and as a specification. It also gives examples of | ||||
mechanisms that can use the protocol to ensure data sources respond | mechanisms that can use the protocol to ensure data sources respond | |||
correctly to congestion. And it describes example mechanisms that | correctly to congestion, and these are described more fully in a | |||
ensure the dominant selfish strategy of both network domains and end- | companion document [re-ecn-motive]. | |||
points will be to set the extended ECN field honestly. | ||||
Authors' Statement: Status (to be removed by the RFC Editor) | Authors' Statement: Status (to be removed by the RFC Editor) | |||
Although the re-ECN protocol is intended to make a simple but far- | Although the re-ECN protocol is intended to make a simple but far- | |||
reaching change to the Internet architecture, the most immediate | reaching change to the Internet architecture, the most immediate | |||
priority for the authors is to delay any move of the ECN nonce to | priority for the authors is to delay any move of the ECN nonce to | |||
Proposed Standard status. The argument for this position is | Proposed Standard status. The argument for this position is | |||
developed in Appendix I. | developed in Appendix E. | |||
Changes from previous drafts (to be removed by the RFC Editor) | Changes from previous drafts (to be removed by the RFC Editor) | |||
Full diffs created using the rfcdiff tool are available at | Full diffs created using the rfcdiff tool are available at | |||
<http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#retcp> | <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#retcp> | |||
From -05 to -06 (current version): | From -06 to -07 (current version): | |||
Clarifications made to Section 1 and Section 3. | ||||
Minor editorial changes throughout. | ||||
From -04 to -05: | ||||
Completed justification for packet marking with FNE during slow- | ||||
start (Appendix D). | ||||
Minor editorial changes throughout. | ||||
From -03 to -04: | ||||
Clarified reasons for holding back ECN nonce (Section 3.3 & | ||||
Appendix I). | ||||
Clarified Figure 2. | ||||
Added Section 4.1.1.1 on equivalence of drops and ECN marks. | ||||
Improved precision of Section 5.6 on IP in IP tunnels. | ||||
Explained that RTT fairness is possible to enforce, but unlikely to | ||||
be required (Section 6.1.3 & Appendix F). | ||||
Explained that bulk per-user policing should be adequate but per- | ||||
flow policing is also possible if desired, though it is not likely | ||||
to be necessary (Section 6.1.5 & Appendix G). | ||||
Reinforced need for passive policing at inter-domain borders to | ||||
enable all-optical networking (Section 6.1.6). | ||||
Minor editorial changes throughout. | ||||
From -02 to -03: | Major changes made following splitting this protocol document from | |||
the related motivations document [re-ecn-motive]. | ||||
Started guidelines for re-ECN support in DCCP and SCTP. | Significant re-ordering of remaining text. | |||
Added annex on limitations of nonce mechanism. | New terminology introduced for clarity. | |||
Minor editorial changes throughout. | Minor editorial changes throughout. | |||
From -01 to -02: | ||||
Explanation on informal terminology in Section 3.5 clarified. | ||||
IPv6 wire protocol encoding added (Section 5.2). | ||||
Text on (non-)issues with tunnels, encryption and link layer | ||||
congestion notification added (Section 5.6 & Section 5.7). | ||||
Section added giving evolvability arguments against encouraging | ||||
bottleneck policing (Section 6.1.2). And text on re-ECN's | ||||
evolvability by design added to Section 6.1.3. | ||||
Text on inter-domain policing (Section 6.1.6) and inter-domain | ||||
fail-safes (Section 6.1.7) added. | ||||
From -00 to -01: | ||||
Encoding of re-ECN wire protocol changed for reasons given in | ||||
Appendix B and consequently draft substantially re-written. | ||||
Substantial text added in sections on applications, incremental | ||||
deployment, architectural rationale and security considerations. | ||||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 5 | |||
2. Requirements notation . . . . . . . . . . . . . . . . . . . . 8 | 2. Requirements notation . . . . . . . . . . . . . . . . . . . . 6 | |||
3. Protocol Overview . . . . . . . . . . . . . . . . . . . . . . 8 | 3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
3.1. Background and Applicability . . . . . . . . . . . . . . . 8 | 4. Protocol Overview . . . . . . . . . . . . . . . . . . . . . . 7 | |||
3.2. Simplified Re-ECN Protocol . . . . . . . . . . . . . . . . 10 | 4.1. Simplified Re-ECN Protocol . . . . . . . . . . . . . . . . 7 | |||
3.3. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or | 4.1.1. Congestion Control and Policing the Protocol . . . . . 7 | |||
v6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 | 4.1.2. Background and Applicability . . . . . . . . . . . . . 8 | |||
3.4. Re-ECN Protocol Operation . . . . . . . . . . . . . . . . 12 | 4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or | |||
3.5. Informal Terminology . . . . . . . . . . . . . . . . . . . 14 | v6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 | |||
4. Transport Layers . . . . . . . . . . . . . . . . . . . . . . . 15 | 4.3. Re-ECN Protocol Operation . . . . . . . . . . . . . . . . 10 | |||
4.1. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 | 4.4. Positive and Negative Flows . . . . . . . . . . . . . . . 12 | |||
4.1.1. RECN mode: Full Re-ECN capable transport . . . . . . . 17 | 5. Network Layer . . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
4.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 | 5.1. Re-ECN IPv4 Wire Protocol . . . . . . . . . . . . . . . . 13 | |||
compliant ECN Receiver . . . . . . . . . . . . . . . . 20 | 5.2. Re-ECN IPv6 Wire Protocol . . . . . . . . . . . . . . . . 15 | |||
4.1.3. Capability Negotiation . . . . . . . . . . . . . . . . 21 | 5.3. Router Forwarding Behaviour . . . . . . . . . . . . . . . 16 | |||
4.1.4. Extended ECN (EECN) Field Settings during Flow | 5.4. Justification for Setting the First SYN to FNE . . . . . . 17 | |||
Start or after Idle Periods . . . . . . . . . . . . . 23 | 5.5. Control and Management . . . . . . . . . . . . . . . . . . 18 | |||
4.1.5. Pure ACKS, Retransmissions, Window Probes and | 5.5.1. Negative Balance Warning . . . . . . . . . . . . . . . 18 | |||
Partial ACKs . . . . . . . . . . . . . . . . . . . . . 27 | 5.5.2. Rate Response Control . . . . . . . . . . . . . . . . 19 | |||
4.2. Other Transports . . . . . . . . . . . . . . . . . . . . . 27 | 5.6. IP in IP Tunnels . . . . . . . . . . . . . . . . . . . . . 19 | |||
4.2.1. General Guidelines for Adding Re-ECN to Other | 5.7. Non-Issues . . . . . . . . . . . . . . . . . . . . . . . . 20 | |||
Transports . . . . . . . . . . . . . . . . . . . . . . 27 | 6. Transport Layers . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
4.2.2. Guidelines for adding Re-ECN to RSVP or NSIS . . . . . 28 | 6.1. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
4.2.3. Guidelines for adding Re-ECN to DCCP . . . . . . . . . 28 | 6.1.1. RECN mode: Full Re-ECN capable transport . . . . . . . 22 | |||
4.2.4. Guidelines for adding Re-ECN to SCTP . . . . . . . . . 29 | 6.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 | |||
5. Network Layer . . . . . . . . . . . . . . . . . . . . . . . . 29 | compliant ECN Receiver . . . . . . . . . . . . . . . . 24 | |||
5.1. Re-ECN IPv4 Wire Protocol . . . . . . . . . . . . . . . . 29 | 6.1.3. Capability Negotiation . . . . . . . . . . . . . . . . 26 | |||
5.2. Re-ECN IPv6 Wire Protocol . . . . . . . . . . . . . . . . 30 | 6.1.4. Extended ECN (EECN) Field Settings during Flow | |||
5.3. Router Forwarding Behaviour . . . . . . . . . . . . . . . 31 | Start or after Idle Periods . . . . . . . . . . . . . 27 | |||
5.4. Justification for Setting the First SYN to FNE . . . . . . 33 | 6.1.5. Pure ACKS, Retransmissions, Window Probes and | |||
5.5. Control and Management . . . . . . . . . . . . . . . . . . 34 | Partial ACKs . . . . . . . . . . . . . . . . . . . . . 31 | |||
5.5.1. Negative Balance Warning . . . . . . . . . . . . . . . 34 | 6.2. Other Transports . . . . . . . . . . . . . . . . . . . . . 32 | |||
5.5.2. Rate Response Control . . . . . . . . . . . . . . . . 35 | 6.2.1. General Guidelines for Adding Re-ECN to Other | |||
5.6. IP in IP Tunnels . . . . . . . . . . . . . . . . . . . . . 35 | Transports . . . . . . . . . . . . . . . . . . . . . . 32 | |||
5.7. Non-Issues . . . . . . . . . . . . . . . . . . . . . . . . 36 | 6.2.2. Guidelines for adding Re-ECN to RSVP or NSIS . . . . . 32 | |||
6. Applications . . . . . . . . . . . . . . . . . . . . . . . . . 37 | 6.2.3. Guidelines for adding Re-ECN to DCCP . . . . . . . . . 33 | |||
6.1. Policing Congestion Response . . . . . . . . . . . . . . . 37 | 6.2.4. Guidelines for adding Re-ECN to SCTP . . . . . . . . . 33 | |||
6.1.1. The Policing Problem . . . . . . . . . . . . . . . . . 37 | 7. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 33 | |||
6.1.2. The Case Against Bottleneck Policing . . . . . . . . . 38 | 8. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 34 | |||
6.1.3. Re-ECN Incentive Framework . . . . . . . . . . . . . . 39 | 8.1. Congestion Notification Integrity . . . . . . . . . . . . 34 | |||
6.1.4. Egress Dropper . . . . . . . . . . . . . . . . . . . . 46 | 9. Security Considerations . . . . . . . . . . . . . . . . . . . 35 | |||
6.1.5. Policing . . . . . . . . . . . . . . . . . . . . . . . 47 | 10. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 37 | |||
6.1.6. Inter-domain Policing . . . . . . . . . . . . . . . . 49 | 11. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 37 | |||
6.1.7. Inter-domain Fail-safes . . . . . . . . . . . . . . . 52 | 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 37 | |||
6.1.8. Simulations . . . . . . . . . . . . . . . . . . . . . 53 | 13. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 38 | |||
6.2. Other Applications . . . . . . . . . . . . . . . . . . . . 53 | 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 38 | |||
6.2.1. DDoS Mitigation . . . . . . . . . . . . . . . . . . . 53 | 14.1. Normative References . . . . . . . . . . . . . . . . . . . 38 | |||
6.2.2. End-to-end QoS . . . . . . . . . . . . . . . . . . . . 54 | 14.2. Informative References . . . . . . . . . . . . . . . . . . 39 | |||
6.2.3. Traffic Engineering . . . . . . . . . . . . . . . . . 55 | Appendix A. Precise Re-ECN Protocol Operation . . . . . . . . . . 41 | |||
6.2.4. Inter-Provider Service Monitoring . . . . . . . . . . 55 | ||||
6.3. Limitations . . . . . . . . . . . . . . . . . . . . . . . 55 | ||||
7. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 56 | ||||
7.1. Incremental Deployment Features . . . . . . . . . . . . . 56 | ||||
7.2. Incremental Deployment Incentives . . . . . . . . . . . . 57 | ||||
8. Architectural Rationale . . . . . . . . . . . . . . . . . . . 62 | ||||
9. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 65 | ||||
9.1. Policing Rate Response to Congestion . . . . . . . . . . . 65 | ||||
9.2. Congestion Notification Integrity . . . . . . . . . . . . 66 | ||||
9.3. Identifying Upstream and Downstream Congestion . . . . . . 67 | ||||
10. Security Considerations . . . . . . . . . . . . . . . . . . . 67 | ||||
11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 68 | ||||
12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 69 | ||||
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 69 | ||||
14. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 69 | ||||
15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 70 | ||||
15.1. Normative References . . . . . . . . . . . . . . . . . . . 70 | ||||
15.2. Informative References . . . . . . . . . . . . . . . . . . 70 | ||||
Appendix A. Precise Re-ECN Protocol Operation . . . . . . . . . . 74 | ||||
Appendix B. Justification for Two Codepoints Signifying Zero | Appendix B. Justification for Two Codepoints Signifying Zero | |||
Worth Packets . . . . . . . . . . . . . . . . . . . . 75 | Worth Packets . . . . . . . . . . . . . . . . . . . . 43 | |||
Appendix C. ECN Compatibility . . . . . . . . . . . . . . . . . . 76 | Appendix C. ECN Compatibility . . . . . . . . . . . . . . . . . . 44 | |||
Appendix D. Packet Marking with FNE During Flow Start . . . . . . 78 | Appendix D. Packet Marking with FNE During Flow Start . . . . . . 45 | |||
Appendix E. Example Egress Dropper Algorithm . . . . . . . . . . 80 | Appendix E. Argument for holding back the ECN nonce . . . . . . . 47 | |||
Appendix F. Re-TTL . . . . . . . . . . . . . . . . . . . . . . . 80 | Appendix F. Alternative Terminology Used in Other Documents . . . 49 | |||
Appendix G. Policer Designs to ensure Congestion | ||||
Responsiveness . . . . . . . . . . . . . . . . . . . 80 | ||||
G.1. Per-user Policing . . . . . . . . . . . . . . . . . . . . 80 | ||||
G.2. Per-flow Rate Policing . . . . . . . . . . . . . . . . . . 82 | ||||
Appendix H. Downstream Congestion Metering Algorithms . . . . . . 84 | ||||
H.1. Bulk Downstream Congestion Metering Algorithm . . . . . . 84 | ||||
H.2. Inflation Factor for Persistently Negative Flows . . . . . 85 | ||||
Appendix I. Argument for holding back the ECN nonce . . . . . . . 86 | ||||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 88 | ||||
Intellectual Property and Copyright Statements . . . . . . . . . . 90 | ||||
1. Introduction | 1. Introduction | |||
This document aims: | This document aims to provide a complete specification of the | |||
addition of the re-ECN protocol to IP and guidelines on how to add it | ||||
o To provide a complete specification of the addition of the re-ECN | to transport layer protocols, including a complete specification of | |||
protocol to IP and guidelines on how to add it to transport layer | re-ECN in TCP as an example. The motivation behind this proposal is | |||
protocols, including a complete specification of re-ECN in TCP as | given in [re-ecn-motive], but we include a brief summary here. | |||
an example; | ||||
o To show how a number of hard problems become much easier to solve | ||||
once re-ECN is available in IP. | ||||
In ECN [RFC3168] congested queues probabilistically mark packets as | Re-ECN is intended to allow senders to inform the network of the | |||
they approach a congested state. The receiver informs the sender | level of congestion they expect their flows to see. This information | |||
that they have seen one or more marks. In re-ECN the sender must | is currently only visible at the transport layer. ECN [RFC3168] | |||
predict the level of congestion on the path by re-inserting feedback | reveals the upstream congestion state of any path by monitoring the | |||
according to the marking scheme described later in this draft. This | rate of CE marks. The receiver then informs the sender when they | |||
results in packets that carry a prediction of downstream congestion. | have seen a marked packet. Re-ECN builds on ECN by providing new | |||
codepoints that allow the sender to declare the level of congestion | ||||
they expect on the forward path. It is closely related to ECN and | ||||
indeed we define a compatibility mode to allow a re-ECN sender to | ||||
communicate with an ECN receiver [xref]. | ||||
If a sender understates expected congestion compared to actual | If a sender understates expected congestion compared to actual | |||
congestion then the network could discard packets or enact some other | congestion then the network could discard packets or enact some other | |||
sanction. A policer can also be introduced at the ingress of | sanction. A policer can also be introduced at the ingress of | |||
networks that can limit the congestion caused (or base penalties on | networks that can limit the level of congestion being caused. | |||
it). | ||||
It is important to add a few key points. | ||||
o It can be seen that it takes one round trip before any feedback is | ||||
received. For this reason a sender must make a conservative | ||||
prediction by transmitting IP packets with a special Feedback Not | ||||
Established (FNE) marking. | ||||
o It should be noted that the prediction is carried in-band in | ||||
normal data packets and for many transports feedback can be | ||||
carried in the normal acknowledgements or control packets. | ||||
o The re-ECN protocol is independent of the transport. In TCP, | ||||
acknowledgments are used to convey the feedback from receiver to | ||||
sender. This memo concentrates on TCP as an example transport | ||||
protocol; however, the re-ECN protocol is compatible with any | ||||
transport where feedback can be sent from receiver to sender. | ||||
A general statement of the problem solved by re-ECN is to provide | A general statement of the problem solved by re-ECN is to provide | |||
sufficient information in each IP datagram to be able to hold senders | sufficient information in each IP datagram to be able to hold senders | |||
and whole networks accountable for the congestion they cause | and whole networks accountable for the congestion they cause | |||
downstream, before they cause it. But the every-day problems that | downstream, before they cause it. But the every-day problems that | |||
re-ECN can solve are much more recognisable than this rather generic | re-ECN can solve are much more recognisable than this rather generic | |||
statement: mitigating distributed denial of service (DDoS); | statement: mitigating distributed denial of service (DDoS); | |||
simplifying differentiation of quality of service (QoS); policing | simplifying differentiation of quality of service (QoS); policing | |||
compliance to congestion control; and so on. | compliance to congestion control; and so on. | |||
Uniquely, re-ECN manages to enable solutions to these problems | It is important to add a few key points. | |||
without unduly stifling innovative new ways to use the Internet. | ||||
This was a hard balance to strike, given it could be argued that DDoS | ||||
is an innovative way to use the Internet. The most valuable insight | ||||
was to allow each network to choose the level of constraint it wishes | ||||
to impose. Also re-ECN has been carefully designed so that networks | ||||
that choose to use it conservatively can protect themselves against | ||||
the congestion caused in their network by users on other networks | ||||
with more liberal policies. | ||||
For instance, some network owners want to block applications like | ||||
voice and video unless their network is compensated for the extra | ||||
share of bottleneck bandwidth taken. These real-time applications | ||||
tend to be unresponsive when congestion arises. Whereas elastic TCP- | ||||
based applications back away quickly, ending up taking a much smaller | ||||
share of congested capacity for themselves. Other network owners | ||||
want to invest in large amounts of capacity and make their gains from | ||||
simplicity of operation and economies of scale. | ||||
re-ECN allows the more conservative networks to police out flows that | ||||
have not asked to be unresponsive to congestion---not because they | ||||
are voice or video---just because they don't respond to congestion. | ||||
But it also allows other networks to choose not to police. | ||||
Crucially, when flows from liberal networks cross into a conservative | ||||
network, re-ECN enables the conservative network to apply penalties | ||||
to its neighbouring networks for the congestion they allow to be | ||||
caused. And these penalties can be applied to bulk data, without | ||||
regard to flows. | ||||
Then, if unresponsive applications become so dominant that some of | o In any standard network it always takes one round trip before any | |||
the more liberal networks experience congestion collapse [RFC3714], | feedback is received. For this reason a sender must make a | |||
they can change their minds and use re-ECN to apply tighter controls | conservative prediction by transmitting IP packets with a special | |||
in order to bring congestion back under control. | Cautious marking. | |||
re-ECN works by arranging that each packet arrives at each network | o It should be noted that the prediction is carried in-band in | |||
element carrying a view of expected congestion on its own downstream | normal data packets and for many transports feedback can be | |||
path, albeit averaged over multiple packets. Most usefully, | carried in the normal acknowledgements or control packets. | |||
congestion on the remainder of the path becomes visible in the IP | ||||
header at the first ingress. Many of the applications of re-ECN | ||||
involve a policer at this ingress using the view of downstream | ||||
congestion arriving in packets to police or control the packet rate. | ||||
Importantly, the scheme is recursive: a whole network harbouring | o The re-ECN protocol is independent of the transport. In TCP, | |||
users causing congestion in downstream networks can be held | acknowledgments are used to convey the feedback from receiver to | |||
responsible or policed by its downstream neighbour. | sender. This memo concentrates on TCP as an example transport | |||
protocol; however, the re-ECN protocol is compatible with any | ||||
transport where feedback can be sent from receiver to sender. | ||||
This document is structured as follows. First an overview of the re- | This document is structured as follows. First an overview of the re- | |||
ECN protocol is given (Section 3), outlining its attributes and | ECN protocol is given (Section 4), outlining its attributes and | |||
explaining conceptually how it works as a whole. The two main parts | explaining conceptually how it works as a whole. The two main parts | |||
of the document follow. That is, the protocol specification divided | of the document follow. That is, the protocol specification divided | |||
into transport (Section 4) and network (Section 5) layers which | into network (Section 5) and transport (Section 6) layers. | |||
contain most of the standards compliance terminology, then the | ||||
applications re-ECN can be put to, such as policing DDoS, QoS and | ||||
congestion control (Section 6). Although these applications do not | ||||
require standardisation themselves, they are described in a fair | ||||
degree of detail in order to explain how re-ECN can be used. Given | ||||
re-ECN proposes to use the last undefined bit in the IPv4 header, we | ||||
felt it necessary to outline the potential that re-ECN could release | ||||
in return for being given that bit. | ||||
Deployment issues discussed throughout the document are brought | Deployment issues discussed throughout the document are brought | |||
together in Section 7, which is followed by a brief section | together in Section 7. Related work is discussed in Section 8. | |||
explaining the somewhat subtle rationale for the design from an | ||||
architectural perspective (Section 8). We end by describing related | ||||
work (Section 9), listing security considerations (Section 10) and | ||||
finally drawing conclusions (Section 12). | ||||
2. Requirements notation | 2. Requirements notation | |||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in [RFC2119]. | |||
This document first specifies a protocol, then describes a framework | 3. Terminology | |||
that creates the right incentives to ensure compliance to the | ||||
protocol. This could cause confusion because the second part of the | ||||
document considers many cases where malicious nodes may not comply | ||||
with the protocol. When such contingencies are described, if any of | ||||
the above keywords are not capitalised, that is deliberate. So, for | ||||
instance, the following two apparently contradictory sentences would | ||||
be perfectly consistent: i) x MUST do this; ii) x may not do this. | ||||
3. Protocol Overview | The following terminology is used throughout this memo. Some of this | |||
terminology is new and, to avoid confusion, Appendix F sets out all | ||||
the alternative terminology that has been used in other re-ECN | ||||
related documents. | ||||
3.1. Background and Applicability | o Neutral packet - a packet that is able to be congestion marked by | |||
an ECN or re-ECN queue. | ||||
o Negative packet - a Neutral packet that has been congestion marked | ||||
by an ECN or re-ECN queue. | ||||
o Positive packet - a packet that has been marked by the sender to | ||||
indicate the expected level of congestion along its path. In | ||||
general Positive packets should only be sent in response to | ||||
feedback received from the receiver.* | ||||
o Cancelled packet - a Positive packet that has been congestion | ||||
marked by an ECN or re-ECN queue. | ||||
o Cautious packet - a packet that has been marked by the sender to | ||||
indicate the expected level of congestion along its path. In | ||||
general Cautious packets should be used when there is insufficient | ||||
feedback to be confident about the congestion state of the | ||||
network.* | ||||
o * The difference between Positive and Cautious packets is | ||||
explained in detail later in the document along with guidelines on | ||||
the use of Cautious packets. | ||||
All the above terms have related IP codepoints as defined in | ||||
Section 5. | ||||
4. Protocol Overview | ||||
4.1. Simplified Re-ECN Protocol | ||||
We describe here the simplified re-ECN protocol. To simplify the | ||||
description we assume packets and segments are synonymous. | ||||
Packets are sent from a sender to a receiver. In Figure 1 the queues | ||||
(Q1 and Q2) are ECN enabled as per RFC 3168 [RFC3168]. If congestion | ||||
occurs then packets are marked with the congestion experienced (CE) | ||||
flag exactly as in the ECN protocol [RFC3168]; the routers do not | ||||
need to be modified and do not need to know the re-ECN protocol. The | ||||
receiver constantly informs the sender of the current count of | ||||
Positive packets it has seen. The sender uses this information | ||||
to determine how many Positive packets it must send into the network. | ||||
The sender's aim is to balance the number of bytes that have been | ||||
congestion marked with the number of Positive bytes it has sent. | ||||
  +----------- Feedback -----------+
  |                                |
  v                                |
+---+    +----+    +----+    +---+
|   |    |    |    |    |    |   |
| S |--->| Q1 |--->| Q2 |--->| R |
|   |    |    |    |    |    |   |
+---+    +----+    +----+    +---+
Figure 1: Simple Re-ECN | ||||
4.1.1. Congestion Control and Policing the Protocol | ||||
The arrangement of the protocol ensures that packets carry a | ||||
declaration of the amount of congestion that will be experienced on | ||||
the path. The re-ECN protocol is orthogonal to any congestion | ||||
control algorithms, but can be used to ensure that congestion control | ||||
is being applied by the sender. | ||||
In general we assume that there will be a policer at the network | ||||
ingress which can rate limit traffic based on the amount of | ||||
congestion declared. | ||||
At the network egress there is a dropper which can impose sanctions on | ||||
flows that incorrectly declare congestion. | ||||
Policers and droppers are explained in more detail in | ||||
[re-ecn-motive]. | ||||
4.1.2. Background and Applicability | ||||
The re-ECN protocol makes no changes and has no effect on the TCP | The re-ECN protocol makes no changes and has no effect on the TCP | |||
congestion control algorithm or on other rate responses to | congestion control algorithm or on other rate responses to | |||
congestion. re-ECN is not a new congestion control protocol, rather | congestion. re-ECN is not a new congestion control protocol, rather | |||
it is orthogonal to congestion control itself. Re-ECN is concerned | it is orthogonal to congestion control itself. Re-ECN is concerned | |||
with revealing information about congestion so that users and | with revealing information about congestion so that users and | |||
networks can be held accountable for the congestion they cause, or | networks can be held accountable for the congestion they cause, or | |||
allow to be caused. | allow to be caused. | |||
Re-ECN builds on ECN so we briefly recap the essentials of the ECN | Re-ECN builds on ECN so we briefly recap the essentials of the ECN | |||
protocol [RFC3168]. Two bits in the IP protocol (v4 or v6) are | protocol [RFC3168]. Two bits in the IP protocol (v4 or v6) are | |||
assigned to the ECN field. The sender clears the field to "00" (Not- | assigned to the ECN field. The sender clears the field to "00" (Not- | |||
ECT) if either end-point transport is not ECN-capable. Otherwise it | ECT) if either end-point transport is not ECN-capable. Otherwise it | |||
indicates an ECN-capable transport (ECT) using either of the two | indicates an ECN-capable transport (ECT) using either of the two | |||
code-points "10" or "01" (ECT(0) and ECT(1) resp.). | code-points "10" or "01" (ECT(0) and ECT(1) resp.). | |||
ECN-capable queues probabilistically set "11" if congestion is | ECN-capable queues probabilistically set this field to "11" if | |||
experienced (CE), the marking probability increasing with the length | congestion is experienced (CE). In general this marking probability | |||
of the queue at its egress link (typically using the RED | will increase with the length of the queue at its egress link | |||
algorithm [RFC2309]). However, they still drop rather than mark Not- | (typically using the RED algorithm [RFC2309]). However, they still | |||
ECT packets. With multiple ECN-capable queues on a path, a flow of | drop rather than mark Not-ECT packets. With multiple ECN-capable | |||
packets accumulates the fraction of CE marking that each queue adds. | queues on a path, a flow of packets accumulates the fraction of CE | |||
The combined effect of the packet marking of all the queues along the | marking that each queue adds. The combined effect of the packet | |||
path signals congestion of the whole path to the receiver. So, for | marking of all the queues along the path signals congestion of the | |||
example, if one queue early in a path is marking 1% of packets and | whole path to the receiver. So, for example, if one queue early in a | |||
another later in a path is marking 2%, flows that pass through both | path is marking 1% of packets and another later in a path is marking | |||
queues will experience approximately 3% marking (see Appendix A for a | 2%, flows that pass through both queues will experience approximately | |||
precise treatment). | 3% marking (see Appendix A for a precise treatment). | |||
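
The 1% plus 2% example can be checked numerically. The sketch below assumes each queue marks independently and that a packet already marked CE stays marked; the function name `combined_marking` is ours, not the draft's, and Appendix A remains the authoritative treatment.

```python
def combined_marking(*fractions):
    """Fraction of packets carrying CE after passing every queue in turn.

    Assumes independent marking at each queue (an illustrative assumption).
    """
    unmarked = 1.0
    for p in fractions:
        unmarked *= (1.0 - p)   # share of packets that escape marking at this queue
    return 1.0 - unmarked


# Queue marking 1% followed by a queue marking 2%
path_marking = combined_marking(0.01, 0.02)
print(f"{path_marking:.4%}")
```

Under these assumptions the result is slightly below the simple sum, which is why the text says "approximately 3%".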
The choice of two ECT code-points in the ECN field [RFC3168] | The choice of two ECT code-points in the ECN field [RFC3168] | |||
permitted future flexibility, optionally allowing the sender to | permitted future flexibility, optionally allowing the sender to | |||
encode the experimental ECN nonce [RFC3540] in the packet stream. | encode the experimental ECN nonce [RFC3540] in the packet stream. | |||
The nonce is designed to allow a sender to check the integrity of | The nonce is designed to allow a sender to check the integrity of | |||
congestion feedback. But Section 9.2 explains that it still gives no | congestion feedback. But Section 8.1 explains that it still gives no | |||
control over how fast the sender transmits as a result of the | control over how fast the sender transmits as a result of the | |||
feedback. On the other hand, re-ECN is designed both to ensure that | feedback. On the other hand, re-ECN is designed both to ensure that | |||
congestion is declared honestly and that the sender's rate responds | congestion is declared honestly and that the sender's rate responds | |||
appropriately. | appropriately. | |||
Re-ECN is based on a feedback arrangement called `re- | Re-ECN is based on a feedback arrangement called `re- | |||
feedback' [Re-fb]. The word is short for either receiver-aligned, | feedback' [Re-fb]. The word is short for either receiver-aligned, | |||
re-inserted or re-echoed feedback. But it actually works even when | re-inserted or re-echoed feedback. But it actually works even when | |||
no feedback is available. In fact it has been carefully designed to | no feedback is available. In fact it has been carefully designed to | |||
work for single datagram flows. It also encourages aggregation of | work for single datagram flows. It also encourages aggregation of | |||
single packet flows by congestion control proxies. Then, even if the | single packet flows by congestion control proxies. Then, even if the | |||
traffic mix of the Internet were to become dominated by short | traffic mix of the Internet were to become dominated by short | |||
messages, it would still be possible to control congestion | messages, it would still be possible to control congestion | |||
effectively and efficiently. | effectively and efficiently. | |||
Changing the Internet's feedback architecture seems to imply | Changing the Internet's feedback architecture seems to imply | |||
considerable upheaval. But re-ECN can be deployed incrementally at | considerable upheaval. But re-ECN can be deployed incrementally at | |||
the transport layer around unmodified queues using existing fields in | the transport layer around unmodified queues using existing fields in | |||
IP (v4 or v6). However it does also require the last undefined bit | IP (v4 or v6). However it does also require the last undefined bit | |||
in the IPv4 header, which it uses in combination with the 2-bit ECN | in the IPv4 header, which it uses in combination with the 2-bit ECN | |||
field to create four new codepoints. Nonetheless, we RECOMMENDED | field to create four new codepoints. Nonetheless, we RECOMMEND | |||
adding optional preferential drop to IP queues based on the re-ECN | adding optional preferential drop to IP queues based on the re-ECN | |||
fields in order to improve resilience against DoS attacks. | fields in order to improve resilience against DoS attacks. | |||
Similarly, re-ECN works best if both the sender and receiver | Similarly, re-ECN works best if both the sender and receiver | |||
transports are re-ECN-capable, but it can work with just sender | transports are re-ECN-capable, but it can work with just sender | |||
support. Section 7.1 summarises the incremental deployment strategy. | support (Section 6.1.2). | |||
Before re-ECN can be considered worthy of using up the last bit in | ||||
the IP header, we must be sure that all our claims are robust. We | ||||
have gradually been reducing the list of outstanding issues, but the | ||||
few that still remain are listed in Section 6.3. We expect new | ||||
attacks may still be found, but we offer the re-ECN protocol on the | ||||
basis that it is built on fairly solid theoretical foundations and, | ||||
so far, it has proved possible to keep it relatively robust. | ||||
3.2. Simplified Re-ECN Protocol | ||||
We describe here the simplified re-ECN protocol. In this first | ||||
description we assume packets and segments are synonymous. | ||||
Packets are sent from a sender to a receiver. In Figure 1 the queues | ||||
(Q1 and Q2) are ECN enabled as per RFC 3168 [ref]. If congestion | ||||
occurs then packets are marked with the congestion experienced (CE) | ||||
flag exactly as in the ECN protocol [RFC3168]; the routers do not | ||||
need to be modified and do not need to know the re-ECN protocol. On | ||||
reception of marked packets the receiver notifies the sender of the | ||||
current count of marked packets. Note that this is the number of | ||||
packets marked rather than the setting of the ECE flag in ECN. The | ||||
sender uses this information to re-echo mark packets in exact | ||||
correspondence to the number of CE marked bytes observed at the | ||||
receiver. | ||||
+--------- Feedback----------+ | ||||
| | | ||||
v | | ||||
+---+ +----+ +----+ +---+ | ||||
| | RE | | | | | | | ||||
| S |--->| Q1 |--->| Q2 |--->| R | | ||||
| | | | | | | | | ||||
+---+ +----+ +----+ +---+ | ||||
Figure 1: Simple Re-ECN | ||||
3.3. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6) | 4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6) | |||
The re-ECN wire protocol uses the two bit ECN field broadly as in | The re-ECN wire protocol uses the two bit ECN field broadly as in | |||
RFC3168 [RFC3168] as described above, but with five differences of | RFC3168 [RFC3168] as described above, but with five differences of | |||
detail (brought together in a list in Section 7.1). This | detail (brought together in a list in Section 7). This specification | |||
specification defines a new re-ECN extension (RE) flag. We will | defines a new re-ECN extension (RE) flag. We will defer the | |||
defer the definition of the actual position of the RE flag in the | definition of the actual position of the RE flag in the IPv4 & v6 | |||
IPv4 & v6 headers until Section 5. When we don't need to choose | headers until Section 5. When we don't need to choose between IPv4 | |||
between IPv4 and v6 wire protocols it will suffice to call it the RE | and v6 wire protocols it will suffice to call it the RE flag. | |||
flag. | ||||
Unlike the ECN field, the RE flag is intended to be set by the sender | Unlike the ECN field, the RE flag is intended to be set by the sender | |||
and remain unchanged along the path, although it can be read by | and SHOULD remain unchanged along the path, although it can be read | |||
network elements that understand the re-ECN protocol. It is feasible | by network elements that understand the re-ECN protocol. It is | |||
that a network element MAY change the setting of the RE flag, perhaps | feasible that a network element MAY change the setting of the RE | |||
acting as a proxy for an end-point, but such a protocol would have to | flag, perhaps acting as a proxy for an end-point, but such a protocol | |||
be defined in another specification (e.g. [Re-PCN]). | would have to be defined in another specification (e.g. [Re-PCN]). | |||
Although the RE flag is a separate, single bit field, it can be read | Although the RE flag is a separate, single bit field, it can be read | |||
as an extension to the two-bit ECN field; the three concatenated bits | as an extension to the two-bit ECN field; the three concatenated bits | |||
in what we will call the extended ECN field (EECN) giving eight | in what we will call the extended ECN field (EECN) giving eight | |||
codepoints. We will use the RFC3168 names of the ECN codepoints to | codepoints. We will use the RFC3168 names of the ECN codepoints to | |||
describe settings of the ECN field when the RE flag setting is "don't | describe settings of the ECN field when the RE flag setting is "don't | |||
care", but we also define the following six extended ECN codepoint | care", but we also define the following six extended ECN codepoint | |||
names for when we need to be more specific. | names for when we need to be more specific. | |||
One of re-ECN's codepoints is an alternative use of the codepoint set | One of re-ECN's codepoints is an alternative use of the codepoint set | |||
aside in RFC3168 for the ECN nonce (ECT(1)). Transports using re-ECN | aside in RFC3168 for the ECN nonce (ECT(1)). Transports using re-ECN | |||
do not need to use the ECN nonce as long as the sender is also | do not need to use the ECN nonce as long as the sender is also | |||
checking for transport protocol compliance | checking for transport protocol compliance | |||
[I-D.moncaster-tcpm-rcv-cheat]. The case for doing this is given in | [I-D.moncaster-tcpm-rcv-cheat]. The case for doing this is given in | |||
Appendix I. Two re-ECN codepoints are given compatible uses to those | Appendix E. Two re-ECN codepoints are given compatible uses to those | |||
defined in RFC3168 (Not-ECT and CE). The other codepoint used by | defined in RFC3168 (Not-ECT and CE). The other codepoint used by | |||
RFC3168 (ECT(0)) isn't used for re-ECN. Altogether this leaves one | RFC3168 (ECT(0)) isn't used for re-ECN. Altogether this leaves one | |||
codepoint of the eight unused by ECN or re-ECN and available for | codepoint of the eight unused by ECN or re-ECN and available for | |||
future use. | future use. | |||
+-------+------------+------+--------------+------------------------+ | +--------+-------------+-------+-----------+------------------------+ | |||
| ECN | RFC3168 | RE | Extended ECN | Re-ECN meaning | | | ECN | RFC3168 | RE | EECN | re-ECN meaning | | |||
| field | codepoint | flag | codepoint | | | | field | codepoint | flag | codepoint | | | |||
+-------+------------+------+--------------+------------------------+ | +--------+-------------+-------+-----------+------------------------+ | |||
| 00 | Not-ECT | 0 | Not-ECT | Not re-ECN-capable | | | 00 | Not-ECT | 0 | Not-ECT | Not re-ECN-capable | | |||
| | | | | transport | | | | | | | transport (Legacy) | | |||
| 00 | --- | 1 | FNE | Feedback not | | | 00 | --- | 1 | FNE | Feedback not | | |||
| | | | | established | | | | | | | established (Cautious) | | |||
| 01 | ECT(1) | 0 | Re-Echo | Re-echoed congestion | | | 01 | ECT(1) | 0 | Re-Echo | Re-echoed congestion | | |||
| | | | | and RECT | | | | | | | and RECT (Positive) | | |||
| 01 | --- | 1 | RECT | Re-ECN capable | | | 01 | --- | 1 | RECT | Re-ECN capable | | |||
| | | | | transport | | | | | | | transport (Neutral) | | |||
| 10 | ECT(0) | 0 | ECT(0) | RFC3168 ECN use only | | | 10 | ECT(0) | 0 | ECT(0) | RFC3168 ECN use only | | |||
| | | | | | | | | | | | | | |||
| 10 | --- | 1 | --CU-- | Currently unused | | | 10 | --- | 1 | --CU-- | Currently unused | | |||
| | | | | | | | | | | | | | |||
| 11 | CE | 0 | CE(0) | Re-Echo canceled by | | | 11 | CE | 0 | CE(0) | Re-Echo cancelled by | | |||
| | | | | congestion experienced | | | | | | | CE (Cancelled) | | |||
| 11 | --- | 1 | CE(-1) | Congestion experienced | | | 11 | --- | 1 | CE(-1) | Congestion Experienced | | |||
+-------+------------+------+--------------+------------------------+ | | | | | | (Negative) | | |||
+--------+-------------+-------+-----------+------------------------+ | ||||
Table 1: Extended ECN Codepoints | Table 1: Extended ECN Codepoints | |||
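
Table 1 can be read as a lookup keyed on the 2-bit ECN field and the RE flag. The following is a minimal sketch using the names from the newer column of the table; the identifiers `EECN`, `classify` and `congestion_mark` are hypothetical, and `congestion_mark` only models what an unmodified RFC 3168 queue does when it chooses to mark an ECN-capable packet.

```python
# (ECN field, RE flag) -> (extended ECN codepoint, informal term), from Table 1
EECN = {
    (0b00, 0): ("Not-ECT", "Legacy"),
    (0b00, 1): ("FNE", "Cautious"),
    (0b01, 0): ("Re-Echo", "Positive"),
    (0b01, 1): ("RECT", "Neutral"),
    (0b10, 0): ("ECT(0)", "RFC3168 ECN use only"),
    (0b10, 1): ("--CU--", "Currently unused"),
    (0b11, 0): ("CE(0)", "Cancelled"),
    (0b11, 1): ("CE(-1)", "Negative"),
}


def classify(ecn, re_flag):
    """Return the (codepoint, term) pair for a packet's ECN field and RE flag."""
    return EECN[(ecn & 0b11, re_flag & 1)]


def congestion_mark(ecn, re_flag):
    """Model a queue setting CE: the ECN field becomes 11, the RE flag is untouched."""
    if ecn == 0b00:
        # The text says queues drop rather than mark packets whose ECN field is 00.
        raise ValueError("ECN field 00 is dropped, not marked")
    return 0b11, re_flag


print(classify(*congestion_mark(0b01, 1)))  # a marked Neutral packet
print(classify(*congestion_mark(0b01, 0)))  # a marked Positive packet
```

Because the queue never touches the RE flag, a marked Neutral packet arrives as Negative and a marked Positive packet arrives as Cancelled, matching the terminology given earlier.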
3.4. Re-ECN Protocol Operation | 4.3. Re-ECN Protocol Operation | |||
In this section we will give an overview of the operation of the re- | In this section we will give an overview of the operation of the re- | |||
ECN protocol for TCP/IP, leaving a detailed specification to the | ECN protocol for TCP/IP, leaving a detailed specification to the | |||
following sections. Other transports will be discussed later. | following sections. Other transports will be discussed later. | |||
In summary, the protocol adds a third `re-echo' stage to the existing | In summary, the protocol adds a third `re-echo' stage to the existing | |||
TCP/IP ECN protocol. Whenever the network adds CE congestion | TCP/IP ECN protocol. Whenever the network adds CE congestion | |||
signalling to the IP header on the forward data path, the receiver | signalling to the IP header on the forward data path, the receiver | |||
feeds it back to the ingress using TCP, then the sender re-echoes it | feeds it back to the ingress using TCP, then the sender re-echoes it | |||
into the forward data path using the RE flag in the next packet. | into the forward data path using the RE flag in the next packet. | |||
Prior to receiving any feedback a sender will not know which setting | Prior to receiving any feedback a sender will not know which setting | |||
of the RE flag to use, so it sets the feedback not established (FNE) | of the RE flag to use, so it sends Cautious packets by setting the | |||
codepoint. The network reads the FNE codepoint conservatively as | FNE codepoint. The network reads the FNE codepoint conservatively as | |||
equivalent to re-echoed congestion. | equivalent to re-echoed congestion. | |||
Specifically, once feedback from a flow is established, a re-ECN | Specifically, once feedback from an ECN or re-ECN capable flow is | |||
sender always initialises the ECN field to ECT(1). And it usually | established, a re-ECN sender always initialises the ECN field to | |||
sets the RE flag to "1". Whenever a queue marks a packet to CE, the | ECT(1). And it usually sets the RE flag to "1" indicating a Neutral | |||
receiver feeds back this event to the sender. On receiving this | packet. Whenever a queue marks a packet to CE, the receiver feeds | |||
feedback, the re-ECN sender will clear the RE flag to "0" in the next | back this event to the sender. On receiving this feedback, the re- | |||
packet it sends. | ECN sender will clear the RE flag to "0" in the next packet it sends | |||
(indicating a Positive packet). | ||||
We chose to set and clear the RE flag this way round to ease | We chose to set and clear the RE flag this way round to ease | |||
incremental deployment (see Section 7.1). To avoid confusion we will | incremental deployment (see Section 7). To avoid confusion we will | |||
use the term `blanking' (rather than marking) when the RE flag is | use the term `blanking' (rather than marking) when the RE flag is | |||
cleared to "0". So, over a stream of packets, we will talk of the | cleared to "0". So, over a stream of packets, we will talk of the | |||
`RE blanking fraction' as the fraction of octets in packets with the | `RE blanking fraction' as the fraction of octets in packets with the | |||
RE flag cleared to "0". | RE flag cleared to "0". | |||
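
The sender behaviour described above can be sketched as a small state machine. This is an illustration only: the class name `ReEcnSender`, its methods and the octet-based debt counter are our assumptions, and the detailed TCP rules are left to the later sections of the draft.

```python
class ReEcnSender:
    """Toy model of how a re-ECN sender picks the ECN field and RE flag."""

    def __init__(self):
        self.feedback_established = False
        self.owed_octets = 0  # CE-marked octets reported but not yet re-echoed

    def on_feedback(self, ce_marked_octets):
        """Record feedback from the receiver (zero means 'no new congestion')."""
        self.feedback_established = True
        self.owed_octets += ce_marked_octets

    def next_packet(self, size):
        """Return (ecn, re_flag) for the next outgoing packet of `size` octets."""
        if not self.feedback_established:
            return (0b00, 1)           # FNE: Cautious packet
        if self.owed_octets > 0:
            self.owed_octets -= size   # blank RE to re-echo congestion
            return (0b01, 0)           # Re-Echo: Positive packet
        return (0b01, 1)               # RECT: Neutral packet


sender = ReEcnSender()
print(sender.next_packet(1500))   # before any feedback
sender.on_feedback(1500)          # receiver reports 1500 CE-marked octets
print(sender.next_packet(1500))   # RE blanked in the next packet
print(sender.next_packet(1500))   # back to the usual setting
```

Counting the debt in octets rather than packets follows the draft's later remark that the balance is kept in octets when packet sizes differ.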
+---+ +----+ +----+ +---+ | +---+ +----+ +----+ +---+ | |||
| S |--| Q1 |----------------| Q2 |--| R | | | S |--| Q1 |----------------| Q2 |--| R | | |||
+---+ +----+ +----+ +---+ | +---+ +----+ +----+ +---+ | |||
. . . . | . . . . | |||
^ . . . . | ^ . . . . | |||
skipping to change at page 14, line 5 | skipping to change at page 12, line 5 | |||
horizontal line at 3% in the figure. The CE marked fraction is shown | horizontal line at 3% in the figure. The CE marked fraction is shown | |||
by the stepped line which rises to meet the RE blanking fraction line | by the stepped line which rises to meet the RE blanking fraction line | |||
with steps at each queue where packets are marked. Two queues are | with steps at each queue where packets are marked. Two queues are | |||
shown (Q1 and Q2) that are currently congested. Each time packets | shown (Q1 and Q2) that are currently congested. Each time packets | |||
pass through a fraction are marked (1% at Q1 and 2% at Q2). The | pass through a fraction are marked (1% at Q1 and 2% at Q2). The | |||
approximate downstream congestion can be measured at the observation | approximate downstream congestion can be measured at the observation | |||
points shown along the path by subtracting the CE marking fraction | points shown along the path by subtracting the CE marking fraction | |||
from the RE blanking fraction, as shown in the table below | from the RE blanking fraction, as shown in the table below | |||
(Appendix A derives these approximations from a precise analysis). | (Appendix A derives these approximations from a precise analysis). | |||
+-------------------+------------------------------+ | NB due to the unary nature of ECN marking and the equivalent unary | |||
| Observation point | Approx downstream congestion | | nature of re-ECN blanking, the precise fraction of marked bytes must | |||
+-------------------+------------------------------+ | be calculated by maintaining a moving average of the number of | |||
| L | 3% - 0% = 3% | | packets that have been marked as a proportion of the total number of | |||
| M | 3% - 1% = 2% | | packets. | |||
| N | 3% - 3% = 0% | | ||||
+-------------------+------------------------------+ | ||||
Table 2: Downstream Congestion Measured at Example Observation Points | ||||
All along the path, whole-path congestion remains unchanged so it can | Along the path the fraction of packets that had their RE field | |||
be used as a reference against which to compare upstream congestion. | cleared remains unchanged so it can be used as a reference against | |||
The difference predicts downstream congestion for the rest of the | which to compare upstream congestion. The difference predicts | |||
path. Therefore, measuring the fractions of each codepoint at any | downstream congestion for the rest of the path. Therefore, measuring | |||
point in the Internet will reveal upstream, downstream and whole path | the fractions of each codepoint at any point in the Internet will | |||
congestion. | reveal upstream, downstream and whole path congestion. | |||
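
The subtraction performed at an observation point can be sketched as follows. The function name `downstream_congestion` and the packet tuples are hypothetical, and the sketch assumes every observed packet belongs to a re-ECN flow with feedback already established (no Not-ECT, ECT(0) or FNE packets).

```python
def downstream_congestion(packets):
    """Approximate downstream congestion seen at one observation point.

    `packets` is a list of (ecn, re_flag, size_in_octets) tuples.
    Returns the RE blanking fraction minus the CE marking fraction, by octets.
    """
    total = sum(size for _, _, size in packets)
    blanked = sum(size for _, re_flag, size in packets if re_flag == 0)
    marked = sum(size for ecn, _, size in packets if ecn == 0b11)
    return (blanked - marked) / total


# 100 equal-sized packets seen between Q1 and Q2:
# 3 carry a blanked RE flag and 1 has already been CE marked by Q1.
observed = (
    [(0b01, 0, 1000)] * 3      # Re-Echo
    + [(0b11, 1, 1000)] * 1    # CE(-1)
    + [(0b01, 1, 1000)] * 96   # RECT
)
print(downstream_congestion(observed))
```

With these inputs the metric corresponds to the "3% - 1%" case in the text. A real element would keep a moving average rather than a one-shot sum.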
Note that we have introduced discussion of marking and blanking | Note that we have introduced discussion of marking and blanking | |||
fractions solely for illustration. To be absolutely clear, for TCP | fractions solely for illustration. We are not saying any protocol | |||
these fractions are averages that would result from the behaviour of | handler will work with these average fractions directly. In fact the | |||
the protocol handler mechanically blanking outgoing packets in direct | protocol actually requires the number of marked and blanked bytes to | |||
response to incoming feedback---we are not saying any protocol | balance by the time the packet reaches the receiver. | |||
handler has to work with these average fractions directly. | ||||
3.5. Informal Terminology | ||||
In the rest of this memo we will loosely talk of positive or negative | 4.4. Positive and Negative Flows | |||
flows, meaning flows where the moving average of the downstream | ||||
congestion metric is persistently positive or negative. A negative | ||||
flow is one where more CE marked packets than re-ECN blanked packets | ||||
arrive. Likewise in positive flows more re-ECN blanked packets | ||||
arrive than CE marked packets. The notion of a negative metric | ||||
arises because it is derived by subtracting one metric from another. | ||||
Of course actual downstream congestion cannot be negative, only the | ||||
metric can (whether due to time lags or deliberate malice). | ||||
Just as we will loosely talk of positive and negative flows, we will | In Section 3 we introduced the terms Positive, Neutral, Negative, | |||
also talk of positive or negative packets, meaning packets that | Cautious and Cancelled. This terminology is based on the requirement | |||
contribute positively or negatively to the downstream congestion | to balance the proportion of bytes marked as CE with the proportion | |||
metric. | of bytes that are re-echo marked. In the rest of this memo we will | |||
loosely talk of positive or negative flows, meaning flows where the | ||||
moving average of the downstream congestion metric is persistently | ||||
positive or negative. A negative flow is one where more CE marked | ||||
packets than re-ECN blanked packets arrive. Likewise in positive | ||||
flows more re-ECN blanked packets arrive than CE marked packets. The | ||||
notion of a negative metric arises because it is derived by | ||||
subtracting one metric from another. Of course actual downstream | ||||
congestion cannot be negative, only the metric can (whether due to | ||||
time lags or deliberate malice). | ||||
Therefore we will talk of packets having `worth' of +1, 0 or -1, | Therefore we will talk of packets having `worth' of +1, 0 or -1, | |||
which, when multiplied by their size, indicates their contribution to | which, when multiplied by their size, indicates their contribution to | |||
the downstream congestion metric. | the downstream congestion metric. The worth of each type of packet | |||
is given below in Table 2. The idea is that most flows start with | ||||
The idea is that most packets start with zero worth. Every time the | zero worth. Every time the network decrements the worth of a packet, | |||
network decrements the worth of a packet, the sender increments the | the sender increments the worth of a later packet. Then, over time, | |||
worth of a later packet. Then, over time, as many positive octets | as many positive octets should arrive at the receiver as negative. | |||
should arrive at the receiver as negative. Note we have said octets | Note we have said octets not packets, so if packets are of different | |||
not packets, so if packets are of different sizes, the worth should | sizes, the worth should be incremented on enough octets to balance | |||
be incremented on enough octets to balance the octets in negative | the octets in negative packets arriving at the receiver. It is this | |||
packets arriving at the receiver. It is this balance that will allow | balance that will allow the network to hold the sender accountable | |||
the network to hold the sender accountable for the congestion it | for the congestion it causes. | |||
causes. | ||||
If a packet carrying re-echoed congestion happens to also be | If a packet carrying re-echoed congestion happens to also be | |||
congestion marked, the +1 worth added by the sender will be cancelled | congestion marked, the +1 worth added by the sender will be cancelled | |||
out by the -1 network congestion marking. Although the two worth | out by the -1 network congestion marking. Although the two worth | |||
values correctly cancel out, neither the congestion marking nor the | values correctly cancel out, neither the congestion marking nor the | |||
re-echoed congestion are lost, because the RE bit and the ECN field | re-echoed congestion are lost, because the RE bit and the ECN field | |||
are orthogonal. So, whenever this happens, the receiver will | are orthogonal. So, whenever this happens, the receiver will | |||
correctly detect and re-echo the new congestion event as well. | correctly detect and re-echo the new congestion event as well. | |||
The table below specifies unambiguously the worth of each extended | The table below specifies unambiguously the worth of each extended | |||
ECN codepoint. Note the order is different from the previous table | ECN codepoint. Note the order is different from the previous table | |||
to better show how the worth increments and decrements. The FNE | to better show how the worth increments and decrements. | |||
codepoint is used in the flow bootstrap process (explained later) and | ||||
has the same positive (+1) worth as a packet with the Re-Echo | ||||
codepoint. | ||||
+--------+------+----------------+-------+--------------------------+ | +---------+-------+---------------+-------+-------------------------+ | |||
| ECN | RE | Extended ECN | Worth | Re-ECN meaning | | | ECN | RE | Extended ECN | Worth | Re-ECN Term | | |||
| field | bit | codepoint | | | | | field | bit | codepoint | | | | |||
+--------+------+----------------+-------+--------------------------+ | +---------+-------+---------------+-------+-------------------------+ | |||
| 00 | 0 | Not-RECT | ... | Not re-ECN-capable | | | 00 | 0 | Not-RECT | ... | --- | | |||
| | | | | transport | | | 00 | 1 | FNE | +1 | Cautious | | |||
| 00 | 1 | FNE | +1 | Feedback not established | | | 01 | 0 | Re-Echo | +1 | Positive | | |||
| 01 | 0 | Re-Echo | +1 | Re-echoed congestion and | | | 10 | 0 | Legacy | ... | RFC3168 ECN use only | | |||
| | | | | RECT | | | | | | | | | |||
| 10 | 0 | --- | ... | RFC3168 ECN use only | | | 11 | 0 | CE(0) | 0 | Cancelled | |||
| 11 | 0 | CE(0) | 0 | Re-Echo canceled by | | | 01 | 1 | RECT | 0 | Neutral | | |||
| | | | | congestion experienced | | ||||
| 01 | 1 | RECT | 0 | Re-ECN capable transport | | ||||
| 10 | 1 | --CU-- | ... | Currently unused | | | 10 | 1 | --CU-- | ... | Currently unused | | |||
| | | | | | | | | | | | | | |||
| 11 | 1 | CE(-1) | -1 | Congestion experienced | | | 11 | 1 | CE(-1) | -1 | Negative | | |||
+--------+------+----------------+-------+--------------------------+ | +---------+-------+---------------+-------+-------------------------+ | |||
Table 3: 'Worth' of Extended ECN Codepoints | Table 2: 'Worth' of Extended ECN Codepoints | |||
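
The worth values in the table above can be used to tally a flow's balance in octets. This is a minimal sketch: `WORTH` and `flow_balance` are illustrative names, and codepoints whose worth is shown as "..." are deliberately left out, so passing one raises a `KeyError`.

```python
# Worth per extended ECN codepoint, taken from the table above
WORTH = {
    "FNE": +1,
    "Re-Echo": +1,
    "CE(0)": 0,
    "RECT": 0,
    "CE(-1)": -1,
}


def flow_balance(packets):
    """Sum of worth x size over `packets`, a list of (codepoint, size_in_octets)."""
    return sum(WORTH[name] * size for name, size in packets)


# One marked packet balanced by one re-echoed packet of the same size
balanced = [("RECT", 1500), ("CE(-1)", 1500), ("Re-Echo", 1500)]
# A 1500-octet mark answered by only 500 re-echoed octets
short = [("CE(-1)", 1500), ("Re-Echo", 500)]
print(flow_balance(balanced), flow_balance(short))
```

A persistently negative total is what the text calls a negative flow; a total near zero over time is the balance the sender is expected to maintain.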
4. Transport Layers | 5. Network Layer | |||
4.1. TCP | 5.1. Re-ECN IPv4 Wire Protocol | |||
The wire protocol of the ECN field in the IP header remains largely | ||||
unchanged from [RFC3168]. However, an extension to the ECN field we | ||||
call the RE (Re-ECN extension) flag (Section 4.2) is defined in this | ||||
document. It doubles the extended ECN codepoint space, giving 8 | ||||
potential codepoints. The semantics of the extra codepoints are | ||||
backward compatible with the semantics of the 4 original codepoints | ||||
[RFC3168] (Section 7 collects together and summarises all the changes | ||||
defined in this document). | ||||
For IPv4, this document proposes that the new RE control flag will be | ||||
positioned where the `reserved' control flag was at bit 48 of the | ||||
IPv4 header (counting from 0). Alternatively, some would call this | ||||
bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 | ||||
header (Figure 3). | ||||
0 1 2 | ||||
+---+---+---+ | ||||
| R | D | M | | ||||
| E | F | F | | ||||
+---+---+---+ | ||||
Figure 3: New Definition of the Re-ECN Extension (RE) Control Flag at | ||||
the Start of Byte 7 of the IPv4 Header | ||||
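
To make the bit position concrete, the sketch below reads the proposed RE flag and the ECN field from a raw IPv4 header. The function name `extended_ecn` is hypothetical. It relies on bit 48 (counting from 0) being the most significant bit of the octet at offset 6, as the text states, and on the ECN field occupying the two low-order bits of the octet at offset 1 as in RFC 3168.

```python
def extended_ecn(ipv4_header):
    """Return (ecn, re_flag) from the first 20 octets of an IPv4 header."""
    if len(ipv4_header) < 20 or ipv4_header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    ecn = ipv4_header[1] & 0b11          # low 2 bits of the former TOS octet
    re_flag = (ipv4_header[6] >> 7) & 1  # bit 48: the 'reserved' control flag
    return ecn, re_flag


# Hand-built header: version 4, IHL 5, ECN = 01 (ECT(1)), RE flag set
header = bytearray(20)
header[0] = 0x45
header[1] = 0b01
header[6] = 0x80
print(extended_ecn(bytes(header)))
```

The returned pair is the 3-bit extended ECN field split into its two parts, which is how the rest of this memo treats it.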
The semantics of the RE flag are described in outline in Section 4 | ||||
and specified fully in Section 6. The RE flag is always considered | ||||
in conjunction with the 2-bit ECN field, as if they were concatenated | ||||
together to form a 3-bit extended ECN field. If the ECN field is set | ||||
to either the ECT(1) or CE codepoint, when the RE flag is blanked | ||||
(cleared to "0") it represents a re-echo of congestion experienced by | ||||
an earlier packet. If the ECN field is set to the Not-ECT codepoint, | ||||
when the RE flag is set to "1" it represents the feedback not | ||||
established (FNE) codepoint, which signals that the packet was sent | ||||
without the benefit of congestion feedback. | ||||
It is believed that the FNE codepoint can simultaneously serve other | ||||
purposes, particularly where the start of a flow needs distinguishing | ||||
from packets later in the flow. For instance it would have been | ||||
useful to identify new flows for tag switching and might enable | ||||
similar developments in the future if it were adopted. It is similar | ||||
to the state set-up bit idea designed to protect against memory | ||||
exhaustion attacks. This idea was proposed informally by David Clark | ||||
and documented by Handley and Greenhalgh [Steps_DoS]. The FNE | ||||
codepoint can be thought of as a `soft-state set-up flag', because it | ||||
is idempotent (i.e. one occurrence of the flag is sufficient but | ||||
further occurrences achieve the same effect if previous ones were | ||||
lost). | ||||
We expect there will be other claims pending on the use of bit 48. | ||||
We know of at least two [ARI05], [RFC3514], but neither has been | ||||
pursued in the IETF so far, although the present proposal would | ||||
meet the needs of the latter. | ||||
The security flag proposal (commonly known as the evil bit) was | ||||
published on 1 April 2003 as Informational RFC 3514, but it was not | ||||
adopted due to confusion over whether evil-doers might set it | ||||
inappropriately. The present proposal is backward compatible with | ||||
RFC3514 because if re-ECN compliant senders were benign they would | ||||
correctly clear the evil bit to honestly declare that they had just | ||||
received congestion feedback. Whereas evil-doers would hide | ||||
congestion feedback by setting the evil bit continuously, or at least | ||||
more often than they should. So, evil senders can be identified, | ||||
because they declare that they are good less often than they should. | ||||
5.2. Re-ECN IPv6 Wire Protocol | ||||
For IPv6, this document proposes that the new RE control flag will be | ||||
positioned as the first bit of the option field of a new Congestion | ||||
hop by hop option header (Figure 4). | ||||
0 1 2 3 | ||||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| Next Header | Hdr ext Len | Option Type | Opt Length =4 | | ||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
|R| Reserved for future use | | ||||
|E| | | ||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
Figure 4: Definition of a New IPv6 Congestion Hop by Hop Option | ||||
Header containing the re-ECN Extension (RE) Control Flag | ||||
0 1 2 3 4 5 6 7 8 | ||||
+-+-+-+-+-+-+-+-+- | ||||
|AIU|C|Option ID| | ||||
+-+-+-+-+-+-+-+-+- | ||||
Figure 5: Congestion Hop by Hop Option Type Encoding | ||||
The Hop-by-Hop Options header enables packets to carry information to | ||||
be examined and processed by routers or nodes along the packet's | ||||
delivery path, including the source and destination nodes. For re- | ||||
ECN, the two bits of the Action If Unrecognized (AIU) flag of the | ||||
Congestion extension header MUST be set to "00" meaning if | ||||
unrecognized `skip over option and continue processing the header'. | ||||
Then, any routers or a receiver not upgraded with the optional re-ECN | ||||
features described in this memo will simply ignore this header. But | ||||
routers with these optional re-ECN features or a re-ECN policing | ||||
function, will process this Congestion extension header. | ||||
The `C' flag MUST be set to "1" to specify that the Option Data | ||||
(currently only the RE control flag) can change en-route to the | ||||
packet's final destination. This ensures that, when an | ||||
Authentication header (AH [RFC4302]) is present in the packet, for | ||||
any option whose data may change en-route, its entire Option Data | ||||
field will be treated as zero-valued octets when computing or | ||||
verifying the packet's authenticating value. | ||||
Although the RE control flag should not be changed along the path, we | ||||
expect that the rest of this option field that is currently `Reserved | ||||
for future use' could be used for a multi-bit congestion notification | ||||
field which we would expect to change en route. As the RE flag does | ||||
not need end-to-end authentication, we set the C flag to '1'. | ||||
{ToDo: A Congestion Hop by Hop Option ID will need to be registered | ||||
with IANA.} | ||||
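
As a rough sketch of the encoding in Figures 4 and 5, the option can
be assembled as follows. The Option ID has not been registered, so
the value passed in is purely a placeholder, and the function names
are hypothetical. The sketch assumes the Option Type layout shown in
Figure 5 (2-bit AIU, 1-bit C flag, 5-bit Option ID) with AIU set to
"00" and C set to "1" as required above.

```python
def congestion_option_type(option_id: int) -> int:
    """Build the Option Type octet: AIU (2 bits), C (1 bit), Option ID (5 bits)."""
    if not 0 <= option_id < 32:
        raise ValueError("Option ID must fit in 5 bits")
    aiu = 0b00   # skip over the option if it is unrecognized
    c_flag = 1   # Option Data may change en route
    return (aiu << 6) | (c_flag << 5) | option_id


def congestion_option(option_id: int, re_flag: int) -> bytes:
    """Return Option Type, Opt Length = 4, then 4 octets of Option Data.

    The RE flag is the first bit of the Option Data; the remaining
    31 bits are reserved and sent as zero in this sketch.
    """
    option_data = bytes([(re_flag & 1) << 7, 0, 0, 0])
    return bytes([congestion_option_type(option_id), 4]) + option_data


# PLACEHOLDER_ID is an assumption; no Option ID has been assigned.
PLACEHOLDER_ID = 0
print(congestion_option(PLACEHOLDER_ID, re_flag=1).hex())
```

The Next Header and Hdr Ext Len octets of the enclosing Hop-by-Hop
Options header are left out, as they do not depend on re-ECN.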
5.3. Router Forwarding Behaviour | ||||
Re-ECN works well without modifying the forwarding behaviour of any | ||||
routers. However, below, two OPTIONAL changes to forwarding | ||||
behaviour are defined which respectively enhance performance and | ||||
improve a router's discrimination against flooding attacks. They are | ||||
both OPTIONAL additions that we propose MAY apply by default to all | ||||
Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN | ||||
marking behaviours [RFC3168]. Specifications for PHBs MAY define | ||||
different forwarding behaviours from this default, but this is not | ||||
required. [Re-PCN] is one example. | ||||
FNE indicates ECT: | ||||
The FNE codepoint tells a router to assume that the packet was | ||||
sent by an ECN-capable transport (see Section 5.4). Therefore an | ||||
FNE packet MAY be marked rather than dropped. Note that the FNE | ||||
codepoint has been intentionally chosen so that, to RFC3168 | ||||
compliant routers (which do not inspect the RE flag) an FNE packet | ||||
appears to be Not-ECT so it will be dropped by legacy AQM | ||||
algorithms. | ||||
A network operator MUST NOT configure a queue to ECN mark rather | ||||
than drop FNE packets unless it can guarantee that FNE packets | ||||
will be rate limited, either locally or upstream. The ingress | ||||
policers discussed in [re-ecn-motive] would count as rate limiters | ||||
for this purpose. | ||||
Preferential Drop: If a re-ECN capable router queue experiences very | ||||
high load so that it has to drop arriving packets (e.g. a DoS | ||||
attack), it MAY preferentially drop packets within the same | ||||
Diffserv PHB using the preference order for extended ECN | ||||
codepoints given in Table 3. Preferential dropping can be | ||||
difficult to implement on some hardware, but if feasible it would | ||||
discriminate against attack traffic if done as part of the overall | ||||
policing framework of [re-ecn-motive]. If nowhere else, routers | ||||
at the egress of a network SHOULD implement preferential drop | ||||
(stronger than the MAY above). For simplicity, preferences 4 & 5 | ||||
MAY be merged into one preference level. | ||||
+-------+-----+------------+-------+------------+-------------------+ | ||||
| ECN | RE | Extended | Worth | Drop Pref | Re-ECN meaning | | ||||
| field | bit | ECN | | (1 = drop | | | ||||
| | | codepoint | | 1st) | | | ||||
+-------+-----+------------+-------+------------+-------------------+ | ||||
| 01 | 0 | Re-Echo | +1 | 5/4 | Re-echoed | | ||||
| | | | | | congestion and | | ||||
| | | | | | RECT | | ||||
| 00 | 1 | FNE | +1 | 4 | Feedback not | | ||||
| | | | | | established | | ||||
| 11 | 0 | CE(0) | 0 | 3 | Re-Echo canceled | | ||||
| | | | | | by congestion | | ||||
| | | | | | experienced | | ||||
| 01 | 1 | RECT | 0 | 3 | Re-ECN capable | | ||||
| | | | | | transport | | ||||
| 11 | 1 | CE(-1) | -1 | 3 | Congestion | | ||||
| | | | | | experienced | | ||||
| 10 | 1 | --CU-- | n/a | 2 | Currently Unused | | ||||
| 10 | 0 | --- | n/a | 2 | RFC3168 ECN use | | ||||
| | | | | | only | | ||||
| 00 | 0 | Not-RECT | n/a | 1 | Not | | ||||
| | | | | | Re-ECN-capable | | ||||
| | | | | | transport | | ||||
+-------+-----+------------+-------+------------+-------------------+ | ||||
Table 3: Drop Preference of EECN Codepoints (Sorted by `Worth') | ||||
The above drop preferences are arranged to preserve packets with | ||||
more positive worth (Section 4.4), given senders of positive | ||||
packets must have honestly declared downstream congestion. A full | ||||
treatment of this is provided in the companion document describing | ||||
the motivation and architecture for re-ECN [re-ecn-motive] | ||||
particularly when the application of re-ECN to protect against | ||||
DDoS attacks is described. | ||||
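
Table 3 can be read as a simple lookup, sketched below. The names
EECN_CODEPOINTS, classify and drop_order are hypothetical. The sketch
gives Re-Echo drop preference 5, although the text above allows
preferences 4 and 5 to be merged. A worth of None stands for the
"n/a" entries in the table.

```python
from typing import Dict, List, Optional, Tuple

# (ECN field, RE flag) -> (codepoint name, worth, drop preference),
# transcribed from Table 3. Drop preference 1 means "drop first".
EECN_CODEPOINTS: Dict[Tuple[int, int], Tuple[str, Optional[int], int]] = {
    (0b01, 0): ("Re-Echo", +1, 5),
    (0b00, 1): ("FNE", +1, 4),
    (0b11, 0): ("CE(0)", 0, 3),
    (0b01, 1): ("RECT", 0, 3),
    (0b11, 1): ("CE(-1)", -1, 3),
    (0b10, 1): ("CU", None, 2),            # currently unused
    (0b10, 0): ("RFC3168-only", None, 2),  # shown as "---" in Table 3
    (0b00, 0): ("Not-RECT", None, 1),
}


def classify(ecn_field: int, re_flag: int) -> Tuple[str, Optional[int], int]:
    """Look up the extended ECN codepoint for an ECN field and RE flag."""
    return EECN_CODEPOINTS[(ecn_field & 0b11, re_flag & 1)]


def drop_order(packets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Sort (ecn_field, re_flag) pairs so that packets to drop first come first."""
    return sorted(packets, key=lambda p: classify(*p)[2])


queue = [(0b01, 0), (0b00, 0), (0b11, 1)]
print([classify(*p)[0] for p in drop_order(queue)])
```

A real queue would apply this ordering only within one Diffserv PHB
and only under very high load, as the text above describes.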
5.4. Justification for Setting the First SYN to FNE | ||||
The initial SYN MUST be set to FNE by Re-ECT client A (Section 6.1.4), | ||||
and Section 5.3 says a queue MAY optionally treat an FNE packet as | ||||
ECN capable, so an initial SYN may be marked CE(-1) rather than | ||||
dropped. This seems dangerous, because the sender has not yet | ||||
established whether the receiver is a RFC3168 one that does not | ||||
understand congestion marking. It also seems to allow malicious | ||||
senders to take advantage of ECN marking to avoid so much drop when | ||||
launching SYN flooding attacks. Below we explain the features of the | ||||
protocol design that remove both these dangers. | ||||
ECN-capable initial SYN with a Not-ECT server: If the TCP server B | ||||
is re-ECN capable, provision is made for it to feed back a possible | ||||
congestion marked SYN in the SYN ACK (Section 6.1.4). But if the | ||||
TCP client A finds out from the SYN ACK that the server was not | ||||
ECN-capable, the TCP client MUST conservatively consider the first | ||||
SYN as congestion marked before setting itself into Not-ECT mode. | ||||
Section 6.1.4 mandates that such a TCP client MUST also set its | ||||
initial window to 1 segment. In this way we remove the need to | ||||
cautiously avoid setting the first SYN to Not-RECT. This will | ||||
give worse performance while deployment is patchy, but better | ||||
performance once deployment is widespread. | ||||
SYN flooding attacks can't exploit ECN-capability: Malicious hosts | ||||
may think they can use the advantage that ECN-marking gives over | ||||
drop in launching classic SYN-flood attacks. But Section 5.3 | ||||
mandates that a router MUST only be configured to treat packets | ||||
with the FNE codepoint as ECN-capable if FNE packets are rate | ||||
limited somewhere. Introduction of the FNE codepoint was a | ||||
deliberate move to enable transport-neutral handling of flow-start | ||||
and flow state set-up in the IP layer where it belongs. It then | ||||
becomes possible to protect against flooding attacks of all forms | ||||
(not just SYN flooding) without transport-specific inspection for | ||||
things like the SYN flag in TCP headers. Then, for instance, SYN | ||||
flooding attacks using IPSec ESP encryption can also be rate | ||||
limited at the IP layer. | ||||
It might seem pedantic going to all this trouble to enable ECN on the | ||||
initial packet of a flow, but it is motivated by a much wider concern | ||||
to ensure safe congestion control will still be possible even if the | ||||
application mix evolves to the point where the majority of flows | ||||
consist of a single window or even a single packet. It also allows | ||||
denial of service attacks to be more easily isolated and prevented. | ||||
5.5. Control and Management | ||||
5.5.1. Negative Balance Warning | ||||
A new ICMP message type is being considered so that a dropper can | ||||
warn the apparent sender of a flow that it has started to sanction | ||||
the flow. The message would have similar semantics to the `Time | ||||
exceeded' ICMP message type. To ensure the sender has to invest some | ||||
work before the network will generate such a message, a dropper | ||||
SHOULD only send such a message for flows that have demonstrated that | ||||
they have started correctly by establishing a positive record, but | ||||
have later gone negative. The threshold is up to the implementation. | ||||
The purpose of the message is to deconfuse the cause of drops from | ||||
other causes, such as congestion or transmission losses. The dropper | ||||
would send the message to the sender of the flow, not the receiver. | ||||
If we did define this message type, it would be REQUIRED for all re- | ||||
ECT senders to parse and understand it. Note that a sender MUST only | ||||
use this message to explain why losses are occurring. A sender MUST | ||||
NOT take this message to mean that losses have occurred that it was | ||||
not aware of. Otherwise, spoof messages could be sent by malicious | ||||
sources to slow down a sender (cf. ICMP source quench). | ||||
However, the need for this message type is not yet confirmed, as we | ||||
are considering how to prevent it being used by malicious senders to | ||||
scan for droppers and to test their threshold settings. {ToDo: | ||||
Complete this section.} | ||||
5.5.2. Rate Response Control | ||||
As discussed in [re-ecn-motive] the sender's access operator will be | ||||
expected to use bulk per-user policing, but they might choose to | ||||
introduce a per-flow policer. In cases where operators do introduce | ||||
per-flow policing, there may be a need for a sender to send a request | ||||
to the ingress policer asking for permission to apply a non-default | ||||
response to congestion (where TCP-friendly is assumed to be the | ||||
default). This would require the sender to know what message | ||||
format(s) to use and to be able to discover how to address the | ||||
policer. The required control protocol(s) are outside the scope of | ||||
this document, but will require definition elsewhere. | ||||
The policer is likely to be local to the sender and inline, probably | ||||
at the ingress interface to the internetwork. So, discovery should | ||||
not be hard. A variety of control protocols already exist for some | ||||
widely used rate-responses to congestion. For instance DCCP | ||||
congestion control identifiers (CCIDs [RFC4340]) fulfil this role and | ||||
so does QoS signalling (e.g. an RSVP request for controlled load | ||||
service is equivalent to a request for no rate response to | ||||
congestion, but with admission control). | ||||
5.6. IP in IP Tunnels | ||||
For re-ECN to work correctly through IP in IP tunnels, it needs | ||||
slightly different tunnel handling to regular ECN [RFC3168]. | ||||
Currently there is some inconsistency between how the handling of IP | ||||
in IP tunnels is defined in [RFC3168] and how it is defined in | ||||
[RFC4301], but re-ECN would work fine with the IPsec behaviour. This | ||||
inconsistency is addressed in a new Internet Draft [ECN-tunnel] that | ||||
proposes to update RFC3168 tunnel behaviour to bring it into line | ||||
with IPsec. Ideally, for re-ECN to work through a tunnel, the tunnel | ||||
entry should copy both the RE flag and the ECN field from the inner | ||||
to the outer IP header. Then at the tunnel exit, any congestion | ||||
marking of the outer ECN field should overwrite the inner ECN field | ||||
(unless the inner field is Not-ECT in which case an alarm should be | ||||
raised). The RE flag shouldn't change along a path, so the outer RE | ||||
flag should be the same as the inner. If it isn't, a management alarm | ||||
should be raised. This behaviour is the same as the full- | ||||
functionality variant of [RFC3168] at tunnel exit, but different at | ||||
tunnel entry. | ||||
If tunnels are left as they are specified in [RFC3168], whether the | ||||
limited or full-functionality variants are used, a problem arises | ||||
with re-ECN if a tunnel crosses an inter-domain boundary, because the | ||||
difference between positive and negative markings will not be | ||||
correctly accounted for. In a limited functionality ECN tunnel, the | ||||
flow will appear to be RFC3168 compliant traffic, and therefore may | ||||
be wrongly rate limited. In a full-functionality ECN tunnel, the | ||||
result will depend on whether the tunnel entry copies the inner RE flag | ||||
to the outer header or the RE flag in the outer header is always | ||||
cleared. If the former, the flow will tend to be too positive when | ||||
accounted for at borders. If the latter, it will be too negative. | ||||
If the rules set out in [ECN-tunnel] are followed then this will not | ||||
be an issue. | ||||
5.7. Non-Issues | ||||
The following issues might seem to cause unfavourable interactions | ||||
with re-ECN, but we will explain why they don't: | ||||
o Various link layers support explicit congestion notification, such | ||||
as Frame Relay and ATM. Explicit congestion notification is | ||||
proposed to be added to other link layers, such as Ethernet | ||||
(802.3ar Ethernet congestion management) and MPLS [RFC5129]; | ||||
o Encryption and IPSec. | ||||
In the case of congestion notification at the link layer, each | ||||
particular link layer scheme either manages congestion on the link | ||||
with its own link-level feedback (the usual arrangement in the cases | ||||
of ATM and Frame Relay), or congestion notification from the link | ||||
layer is merged into congestion notification at the IP level when the | ||||
frame headers are decapsulated at the end of the link (the | ||||
recommended arrangement in the Ethernet and MPLS cases). Given the | ||||
RE flag is not intended to change along the path, this means that | ||||
downstream congestion will still be measurable at any point where IP | ||||
is processed on the path by subtracting positive from negative | ||||
markings. | ||||
In the case of encryption, as long as the tunnel issues described in | ||||
Section 5.6 are dealt with, payload encryption itself will not be a | ||||
problem. The design goal of re-ECN is to include downstream | ||||
congestion in the IP header so that it is not necessary to bury into | ||||
inner headers. Obfuscation of flow identifiers is not a problem for | ||||
re-ECN policing elements. Re-ECN doesn't ever require flow | ||||
identifiers to be valid, it only requires them to be unique. So if | ||||
an IPSec encapsulating security payload (ESP [RFC4305]) or an | ||||
authentication header (AH [RFC4302]) is used, the security parameters | ||||
index (SPI) will be a sufficient flow identifier, as it is intended | ||||
to be unique to a flow without revealing actual port numbers. | ||||
In general, even if endpoints use some locally agreed scheme to hide | ||||
port numbers, re-ECN policing elements can just consider the pair of | ||||
source and destination IP addresses as the flow identifier. Re-ECN | ||||
encourages endpoints to at least tell the network layer that a | ||||
sequence of packets are all part of the same flow, if indeed they | ||||
are. The alternative would be for the sender to make each packet | ||||
appear to be a new flow, which would require them all to be marked | ||||
FNE in order to avoid being treated with the bulk of malicious flows | ||||
at the egress dropper. Given the FNE marking is worth +1 and | ||||
networks are likely to rate limit FNE packets, endpoints are given an | ||||
incentive not to set FNE on each packet. But if the sender really | ||||
does want to hide the flow relationship between packets it can choose | ||||
to pay the cost of multiple FNE packets, which in the long run will | ||||
compensate for the extra memory required on network policing elements | ||||
to process each flow. | ||||
6. Transport Layers | ||||
6.1. TCP | ||||
Re-ECN capability at the sender is essential. At the receiver it is | Re-ECN capability at the sender is essential. At the receiver it is | |||
optional, as long as the receiver has a basic RFC3168-compliant ECN- | optional, as long as the receiver has a basic RFC3168-compliant ECN- | |||
capable transport (ECT) [RFC3168]. Given re-ECN is not the first | capable transport (ECT) [RFC3168]. Given re-ECN is not the first | |||
attempt to define the semantics of the ECN field, we give a table | attempt to define the semantics of the ECN field, we give a table | |||
below summarising what happens for various combinations of | below summarising what happens for various combinations of | |||
capabilities of the sender S and receiver R, as indicated in the | capabilities of the sender S and receiver R, as indicated in the | |||
first four columns below. The last column gives the mode a half- | first four columns below. The last column gives the mode a half- | |||
connection should be in after the first two of the three TCP | connection should be in after the first two of the three TCP | |||
handshakes. | handshakes. | |||
skipping to change at page 17, line 5 | skipping to change at page 22, line 40 | |||
at least one of the transports does not understand even basic ECN | at least one of the transports does not understand even basic ECN | |||
marking. | marking. | |||
Note that we use the term Re-ECT for a host transport that is re-ECN- | Note that we use the term Re-ECT for a host transport that is re-ECN- | |||
capable but RECN for the modes of the half connections between hosts | capable but RECN for the modes of the half connections between hosts | |||
when they are both Re-ECT. If a host transport is Re-ECT, this fact | when they are both Re-ECT. If a host transport is Re-ECT, this fact | |||
alone does NOT imply either of its half connections will necessarily | alone does NOT imply either of its half connections will necessarily | |||
be in RECN mode, at least not until it has confirmed that the other | be in RECN mode, at least not until it has confirmed that the other | |||
host is Re-ECT. | host is Re-ECT. | |||
4.1.1. RECN mode: Full Re-ECN capable transport | 6.1.1. RECN mode: Full Re-ECN capable transport | |||
In full RECN mode, for each half connection, both the sender and the | In full RECN mode, for each half connection, both the sender and the | |||
receiver each maintain an unsigned integer counter we will call ECC | receiver each maintain an unsigned integer counter we will call ECC | |||
(echo congestion counter). The receiver maintains a count of how | (echo congestion counter). The receiver maintains a count of how | |||
many times a CE marked packet has arrived during the half-connection. | many times a CE marked packet has arrived during the half-connection. | |||
Once a RECN connection is established, the three TCP option flags | Once a RECN connection is established, the three TCP option flags | |||
(ECE, CWR & NS) used for ECN-related functions in other versions of | (ECE, CWR & NS) used for ECN-related functions in other versions of | |||
ECN are used as a 3-bit field for the receiver to repeatedly tell the | ECN are used as a 3-bit field for the receiver to repeatedly tell the | |||
sender the current value of ECC, modulo 8, whenever it sends a TCP | sender the current value of ECC, modulo 8, whenever it sends a TCP | |||
ACK. We will call this the echo congestion increment (ECI) field. | ACK. We will call this the echo congestion increment (ECI) field. | |||
This overloaded use of these 3 option flags as one 3-bit ECI field is | This overloaded use of these 3 option flags as one 3-bit ECI field is | |||
shown in Figure 4. The actual definition of the TCP header, | shown in Figure 7. The actual definition of the TCP header, | |||
including the addition of support for the ECN nonce, is shown for | including the addition of support for the ECN nonce, is shown for | |||
comparison in Figure 3. This specification does not redefine the | comparison in Figure 6. This specification does not redefine the | |||
names of these three TCP option flags, it merely overloads them with | names of these three TCP option flags, it merely overloads them with | |||
another definition once a flow is established. | another definition once a flow is established. | |||
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |||
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | |||
| | | N | C | E | U | A | P | R | S | F | | | | | N | C | E | U | A | P | R | S | F | | |||
| Header Length | Reserved | S | W | C | R | C | S | S | Y | I | | | Header Length | Reserved | S | W | C | R | C | S | S | Y | I | | |||
| | | | R | E | G | K | H | T | N | N | | | | | | R | E | G | K | H | T | N | N | | |||
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | |||
Figure 3: The (post-ECN Nonce) definition of bytes 13 and 14 of the | Figure 6: The (post-ECN Nonce) definition of bytes 13 and 14 of the | |||
TCP Header | TCP Header | |||
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |||
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | |||
| | | | U | A | P | R | S | F | | | | | | U | A | P | R | S | F | | |||
| Header Length | Reserved | ECI | R | C | S | S | Y | I | | | Header Length | Reserved | ECI | R | C | S | S | Y | I | | |||
| | | | G | K | H | T | N | N | | | | | | G | K | H | T | N | N | | |||
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | |||
Figure 4: Definition of the ECI field within bytes 13 and 14 of the | Figure 7: Definition of the ECI field within bytes 13 and 14 of the | |||
TCP Header, overloading the current definitions above for established | TCP Header, overloading the current definitions above for established | |||
RECN flows. | RECN flows. | |||
Receiver Action in RECN Mode | Receiver Action in RECN Mode | |||
Every time a CE marked packet arrives at a receiver in RECN mode, | Every time a CE marked packet arrives at a receiver in RECN mode, | |||
the receiver transport increments its local value of ECC and MUST | the receiver transport increments its local value of ECC and MUST | |||
echo its value, modulo 8, to the sender in the ECI field of the | echo its value, modulo 8, to the sender in the ECI field of the | |||
next ACK. It MUST repeat the same value of ECI in every | next ACK. It MUST repeat the same value of ECI in every | |||
subsequent ACK until the next CE event, when it increments ECI | subsequent ACK until the next CE event, when it increments ECI | |||
skipping to change at page 18, line 30 | skipping to change at page 24, line 22 | |||
below for the sender's safety strategy). Whenever the ECI field | below for the sender's safety strategy). Whenever the ECI field | |||
increments by D (and/or d drops are detected), the sender MUST | increments by D (and/or d drops are detected), the sender MUST | |||
clear the RE flag to "0" in the IP header of the next D' data | clear the RE flag to "0" in the IP header of the next D' data | |||
packets it sends (where D' = D + d), effectively re-echoing each | packets it sends (where D' = D + d), effectively re-echoing each | |||
single increment of ECI. Otherwise the data sender MUST send all | single increment of ECI. Otherwise the data sender MUST send all | |||
data packets with RE set to "1". | data packets with RE set to "1". | |||
As a general rule, once a flow is established, as well as setting | As a general rule, once a flow is established, as well as setting | |||
or clearing the RE flag as above, a data sender in RECN mode MUST | or clearing the RE flag as above, a data sender in RECN mode MUST | |||
always set the ECN field to ECT(1). However, the settings of the | always set the ECN field to ECT(1). However, the settings of the | |||
extended ECN field during flow start are defined in Section 4.1.4. | extended ECN field during flow start are defined in Section 6.1.4. | |||
As we have already emphasised, the re-ECN protocol makes no | As we have already emphasised, the re-ECN protocol makes no | |||
changes and has no effect on the TCP congestion control algorithm. | changes and has no effect on the TCP congestion control algorithm. | |||
So, the first increment of ECI (or detection of a drop) in a RTT | So, the first increment of ECI (or detection of a drop) in a RTT | |||
triggers the standard TCP congestion response, no more than one | triggers the standard TCP congestion response, no more than one | |||
congestion response per round trip, as usual. However, the sender | congestion response per round trip, as usual. However, the sender | |||
re-echoes every increment of ECI irrespective of RTTs. | re-echoes every increment of ECI irrespective of RTTs. | |||
A TCP sender also acts as the receiver for the other half- | A TCP sender also acts as the receiver for the other half- | |||
connection. The host will maintain two ECC values S.ECC and R.ECC | connection. The host will maintain two ECC values S.ECC and R.ECC | |||
as sender and receiver respectively. Every TCP header sent by a | as sender and receiver respectively. Every TCP header sent by a | |||
host in RECN mode will also repeat the prevailing value of R.ECC | host in RECN mode will also repeat the prevailing value of R.ECC | |||
in its ECI field. If a sender in RECN mode has to retransmit a | in its ECI field. If a sender in RECN mode has to retransmit a | |||
packet due to a suspected loss, the re-transmitted packet MUST | packet due to a suspected loss, the re-transmitted packet MUST | |||
carry the latest prevailing value of R.ECC when it is re- | carry the latest prevailing value of R.ECC when it is re- | |||
transmitted, which will not necessarily be the one it carried | transmitted, which will not necessarily be the one it carried | |||
originally. | originally. | |||
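
The sender rule above (clear RE on the next D' = D + d data packets,
otherwise set RE to "1") can be sketched as a small state machine.
The class and method names are hypothetical, and the starting ECI
value is an assumption of the sketch, since the flow-start behaviour
is defined elsewhere. The wrap-safety strategy for lost pure ACKs is
deliberately left out.

```python
class RecnSenderSketch:
    """Tracks how many outgoing data packets must carry RE = 0."""

    def __init__(self, initial_eci: int = 0) -> None:
        # initial_eci is an assumption of this sketch.
        self.last_eci = initial_eci % 8
        self.pending_re_echoes = 0

    def on_ack(self, eci: int, drops_detected: int = 0) -> None:
        """Record an arriving ACK's ECI field and any drops detected."""
        # D: apparent increment of the 3-bit ECI field, modulo 8.
        increment = (eci - self.last_eci) % 8
        self.last_eci = eci % 8
        # D' = D + d data packets need to be sent with RE cleared.
        self.pending_re_echoes += increment + drops_detected

    def next_re_flag(self) -> int:
        """Return the RE flag for the next data packet to be sent."""
        if self.pending_re_echoes > 0:
            self.pending_re_echoes -= 1
            return 0   # re-echo one congestion event
        return 1       # no congestion waiting to be re-echoed


sender = RecnSenderSketch()
sender.on_ack(eci=2)                       # ECI went from 0 to 2
print([sender.next_re_flag() for _ in range(4)])
```

In RECN mode each of these packets would also carry ECT(1) in the
ECN field, as stated above.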
4.1.1.1. Drops and Marks | 6.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 compliant ECN | |||
Re-ECN is based on the ECN protocol [RFC3168] . In turn the | ||||
congestion markings ECN uses are typically based on the RED | ||||
algorithm [RFC2309]. This algorithm marks packets as CE with a | ||||
probability that increases as the size of the router queue increases. | ||||
However, if the queue becomes too full then it will revert to | ||||
dropping packets. Because of this it is important that a re-ECN | ||||
sender treats each packet drop it detects as if it were actually a CE | ||||
mark. This ensures that it can continue to correctly echo congestion | ||||
even through a highly congested path. | ||||
In order to ensure that drops are correctly echoed the sender needs | ||||
to add the number of drops detected per RTT to the difference in ECI | ||||
value waiting to be echoed. Drop detection is defined as set out in | ||||
[RFC2581] -- if the connection is in slow start then a single | ||||
duplicate acknowledgement will be treated as an indication of a drop. | ||||
When the system is in the congestion avoidance stage then 3 duplicate | ||||
acknowledgements will be treated as a sign of a drop. In all cases, | ||||
if a re-transmission time-out occurs then that will be treated as a | ||||
drop. | ||||
4.1.1.2. Safety against Long Pure ACK Loss Sequences | ||||
The ECI method was chosen for echoing congestion marking because a | ||||
re-ECN sender needs to know about every CE mark arriving at the | ||||
receiver, not just whether at least one arrives within a round trip | ||||
time (which is all the ECE/CWR mechanism supported). And, as pure | ||||
ACKs are not protected by TCP reliable delivery, we repeat the same | ||||
ECI value in every ACK until it changes. Even if many ACKs in a row | ||||
are lost, as soon as one gets through, the ECI field it repeats from | ||||
previous ACKs that didn't get through will update the sender on how | ||||
many CE marks arrived since the last ACK got through. | ||||
The sender will only lose a record of the arrival of a CE mark if all | ||||
the ACKs are lost (and all of them were pure ACKs) for a stream of | ||||
data long enough to contain 8 or more CE marks. So, if the marking | ||||
fraction was p, at least 8/p pure ACKs would have to be lost. For | ||||
example, if p was 5%, a sequence of 160 pure ACKs would all have to | ||||
be lost. To protect against such extremely unlikely events, if a re- | ||||
ECN sender detects a sequence of pure ACKs has been lost it SHOULD | ||||
assume the ECI field wrapped as many times as possible within the | ||||
sequence. | ||||
Specifically, if a re-ECN sender receives an ACK with an | ||||
acknowledgement number that acknowledges L segments since the | ||||
previous ACK but with a sequence number unchanged from the previously | ||||
received ACK, it SHOULD conservatively assume that the ECI field | ||||
incremented by D' = L - ((L-D) mod 8), where D is the apparent | ||||
increase in the ECI field. For example, if the ACK arriving after 9 | ||||
pure ACK losses apparently increased ECI by 2, the assumed increment | ||||
of ECI would still be 2. But if ECI apparently increased by 2 after | ||||
11 pure ACK losses, ECI should be assumed to have increased by 10. | ||||
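The conservative rule D' = L - ((L-D) mod 8) can be sketched as follows. The function and parameter names are illustrative, not part of the specification, and the handling of L < D is an assumption, since the text does not cover that case.

```python
def assumed_eci_increment(segments_acked: int, apparent_increase: int) -> int:
    """Return D' = L - ((L - D) mod 8).

    segments_acked    -- L, segments newly acknowledged since the previous ACK
    apparent_increase -- D, apparent increase of the 3-bit ECI field (0..7)
    """
    L, D = segments_acked, apparent_increase
    if not 0 <= D <= 7:
        raise ValueError("D must fit in the 3-bit ECI field")
    if L < D:
        # Assumption (not in the text): with fewer segments than apparent
        # marks there is no room for a wrap, so take D at face value.
        return D
    # Largest value <= L that is congruent to D modulo 8, i.e. assume the
    # field wrapped as many times as L allows.
    return L - ((L - D) % 8)
```

The two worked examples in the text correspond to `assumed_eci_increment(9, 2)` and `assumed_eci_increment(11, 2)`, which give 2 and 10 respectively.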
A re-ECN sender MAY implement a heuristic algorithm to establish | ||||
beyond reasonable doubt that the ECI field did not wrap within a | ||||
sequence of lost pure ACKs. But such an algorithm is OPTIONAL. Such | ||||
an algorithm MUST NOT be used unless it is proven to work even in the | ||||
presence of correlation between high ACK loss rate on the back | ||||
channel and high CE marking rate on the forward channel. | ||||
Whatever assumption a re-ECN sender makes about potentially lost CE | ||||
marks, both its congestion control and its re-echoing behaviour | ||||
SHOULD be consistent with the assumption it makes. | ||||
4.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 compliant ECN | ||||
Receiver | Receiver | |||
If the half-connection is in RECN-Co mode, ECN feedback proceeds no | If the half-connection is in RECN-Co mode, ECN feedback proceeds no | |||
differently to that of RFC3168 compliant ECN. In other words, the | differently to that of RFC3168 compliant ECN. In other words, the | |||
receiver sets the ECE flag repeatedly in the TCP header and the | receiver sets the ECE flag repeatedly in the TCP header and the | |||
sender responds by setting the CWR flag. Although RECN-Co mode is | sender responds by setting the CWR flag. Although RECN-Co mode is | |||
used when the receiver has not implemented the re-ECN protocol, the | used when the receiver has not implemented the re-ECN protocol, the | |||
sender can infer enough from its RFC3168 compliant ECN feedback to | sender can infer enough from its RFC3168 compliant ECN feedback to | |||
set or clear the RE flag reasonably well. Specifically, every time | set or clear the RE flag reasonably well. Specifically, every time | |||
the receiver toggles the ECE field from "0" to "1" (or a loss is | the receiver toggles the ECE field from "0" to "1" (or a loss is | |||
skipping to change at page 20, line 45 | skipping to change at page 25, line 19 | |||
packets with RE set to "1". Once a flow is established, a re-ECN | packets with RE set to "1". Once a flow is established, a re-ECN | |||
data sender in RECN-Co mode MUST always set the ECN field to ECT(1). | data sender in RECN-Co mode MUST always set the ECN field to ECT(1). | |||
If a CE marked packet arrives at the receiver within a round trip | If a CE marked packet arrives at the receiver within a round trip | |||
time of a previous mark, the receiver will still be echoing ECE for | time of a previous mark, the receiver will still be echoing ECE for | |||
the last CE mark. Therefore, such a mark will be missed by the | the last CE mark. Therefore, such a mark will be missed by the | |||
sender. Of course, this isn't of concern for congestion control, but | sender. Of course, this isn't of concern for congestion control, but | |||
it does mean that very occasionally the RE blanking fraction will be | it does mean that very occasionally the RE blanking fraction will be | |||
understated. Therefore flows in RECN-Co mode may occasionally be | understated. Therefore flows in RECN-Co mode may occasionally be | |||
mistaken for very lightly cheating flows and consequently might | mistaken for very lightly cheating flows and consequently might | |||
suffer a small number of packet drops through an egress dropper | suffer a small number of packet drops through an egress dropper. We | |||
(Section 6.1.4). We expect re-ECN would be deployed for some time | expect re-ECN would be deployed for some time before policers and | |||
before policers and droppers start to enforce it. So, given there is | droppers start to enforce it. So, given there is not much ECN | |||
not much ECN deployment yet anyway, this minor problem may affect | deployment yet anyway, this minor problem may affect only a very | |||
only a very small proportion of flows, reducing to nothing over the | small proportion of flows, reducing to nothing over the years as | |||
years as RFC3168 compliant ECN hosts upgrade. The use of RECN-Co | RFC3168 compliant ECN hosts upgrade. The use of RECN-Co mode would | |||
mode would need to be reviewed in the light of experience at the time | need to be reviewed in the light of experience at the time of re-ECN | |||
of re-ECN deployment. | deployment. | |||
RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their | RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their | |||
code simple, MAY choose not to implement this mode. If they do not, | code simple, MAY choose not to implement this mode. If they do not, | |||
a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the | a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the | |||
presence of an ECN-capable receiver. It MAY choose to fall back to | presence of an ECN-capable receiver. It MAY choose to fall back to | |||
the ECT-Nonce mode, but if re-ECN implementers don't want to be | the ECT-Nonce mode, but if re-ECN implementers don't want to be | |||
bothered with RECN-Co mode, they probably won't want to add an ECT- | bothered with RECN-Co mode, they probably won't want to add an ECT- | |||
Nonce mode either. | Nonce mode either. | |||
4.1.2.1. Re-ECN support for the ECN Nonce | 6.1.2.1. Re-ECN support for the ECN Nonce | |||
A TCP half-connection in RECN-Co mode MUST NOT support the ECN | A TCP half-connection in RECN-Co mode MUST NOT support the ECN | |||
Nonce [RFC3540]. This means that the sending code of a re-ECN | Nonce [RFC3540]. This means that the sending code of a re-ECN | |||
implementation will never need to include ECN Nonce support. Re-ECN | implementation will never need to include ECN Nonce support. Re-ECN | |||
is intended to provide wider protection than the ECN nonce against | is intended to provide wider protection than the ECN nonce against | |||
congestion control misbehaviour, and re-ECN only requires support | congestion control misbehaviour, and re-ECN only requires support | |||
from the sender, therefore it is preferable to specifically rule out | from the sender, therefore it is preferable to specifically rule out | |||
the need for dual sender implementations. As a consequence, a re-ECN | the need for dual sender implementations. As a consequence, a re-ECN | |||
capable sender will never set ECT(0), so it will be easier for | capable sender will never set ECT(0), so it will be easier for | |||
network elements to discriminate re-ECN traffic flows from other ECN | network elements to discriminate re-ECN traffic flows from other ECN | |||
skipping to change at page 21, line 41 | skipping to change at page 26, line 15 | |||
RFC3540 allows an ECN nonce sender to choose whether to sanction a | RFC3540 allows an ECN nonce sender to choose whether to sanction a | |||
receiver that does not ever set the nonce sum. Given re-ECN is | receiver that does not ever set the nonce sum. Given re-ECN is | |||
intended to provide wider protection than the ECN nonce against | intended to provide wider protection than the ECN nonce against | |||
congestion control misbehaviour, implementers of re-ECN receivers MAY | congestion control misbehaviour, implementers of re-ECN receivers MAY | |||
choose not to implement backwards compatibility with the ECN nonce | choose not to implement backwards compatibility with the ECN nonce | |||
capability. This may be because they deem that the risk of sanctions | capability. This may be because they deem that the risk of sanctions | |||
is low, perhaps because significant deployment of the ECN nonce seems | is low, perhaps because significant deployment of the ECN nonce seems | |||
unlikely at implementation time. | unlikely at implementation time. | |||
4.1.3. Capability Negotiation | 6.1.3. Capability Negotiation | |||
During the TCP hand-shake at the start of a connection, an originator | During the TCP hand-shake at the start of a connection, an originator | |||
of the connection (host A) with a re-ECN-capable transport MUST | of the connection (host A) with a re-ECN-capable transport MUST | |||
indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1 | indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1 | |||
in the initial SYN. | in the initial SYN. | |||
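As an illustration of the flag combination on the initial SYN, the sketch below tests a raw TCP header for NS=1, CWR=1 and ECE=1. It assumes the usual TCP header layout, with NS as the low-order bit of byte 12 [RFC3540] and CWR and ECE in byte 13 [RFC3168]; the function name is hypothetical.

```python
NS_MASK = 0x01   # low-order bit of TCP header byte 12
CWR_MASK = 0x80  # TCP header byte 13
ECE_MASK = 0x40
ACK_MASK = 0x10
SYN_MASK = 0x02


def syn_indicates_re_ect(tcp_header: bytes) -> bool:
    """True if this is an initial SYN carrying NS=1, CWR=1 and ECE=1."""
    if len(tcp_header) < 14:
        raise ValueError("need at least the first 14 bytes of the TCP header")
    ns = bool(tcp_header[12] & NS_MASK)
    flags = tcp_header[13]
    # An initial SYN has SYN set and ACK clear.
    is_initial_syn = bool(flags & SYN_MASK) and not (flags & ACK_MASK)
    return is_initial_syn and ns and bool(flags & CWR_MASK) and bool(flags & ECE_MASK)
```

A SYN with only CWR=1 and ECE=1 (the RFC3168 ECN-setup SYN) returns False, because NS is clear.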
A responding Re-ECT host (host B) MUST return a SYN ACK with flags | A responding Re-ECT host (host B) MUST return a SYN ACK with flags | |||
CWR=1 and ECE=0. The responding host MUST NOT set this combination | CWR=1 and ECE=0. The responding host MUST NOT set this combination | |||
of flags unless the preceding SYN has already indicated Re-ECT | of flags unless the preceding SYN has already indicated Re-ECT | |||
support as above. Normally a Re-ECT server (B) will reply to a Re- | support as above. Normally a Re-ECT server (B) will reply to a Re- | |||
skipping to change at page 23, line 19 | skipping to change at page 27, line 42 | |||
preceding SYN (because there is a broken RFC3168 compliant | preceding SYN (because there is a broken RFC3168 compliant | |||
implementation that behaves this way), RFC3168 specifies that the | implementation that behaves this way), RFC3168 specifies that the | |||
whole connection MUST revert to Not-ECT. | whole connection MUST revert to Not-ECT. | |||
Also note that, whenever the SYN flag of a TCP segment is set | Also note that, whenever the SYN flag of a TCP segment is set | |||
(including when the ACK flag is also set), the NS, CWR and ECE flags | (including when the ACK flag is also set), the NS, CWR and ECE flags | |||
(i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the | (i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the | |||
3-bit ECI value, which is only set as a copy of the local ECC value | 3-bit ECI value, which is only set as a copy of the local ECC value | |||
in non-SYN packets. | in non-SYN packets. | |||
4.1.4. Extended ECN (EECN) Field Settings during Flow Start or after | 6.1.4. Extended ECN (EECN) Field Settings during Flow Start or after | |||
Idle Periods | Idle Periods | |||
If the originator (A) of a TCP connection supports re-ECN it MUST set | If the originator (A) of a TCP connection supports re-ECN it MUST set | |||
the extended ECN (EECN) field in the IP header of the initial SYN | the extended ECN (EECN) field in the IP header of the initial SYN | |||
packet to the feedback not established (FNE) codepoint. | packet to the feedback not established (FNE) codepoint. | |||
FNE is a new extended ECN codepoint defined by this specification | FNE is a new extended ECN codepoint defined by this specification | |||
(Section 3.3). The feedback not established (FNE) codepoint is used | (Section 4.2). The feedback not established (FNE) codepoint is used | |||
when the transport does not have the benefit of ECN feedback so it | when the transport does not have the benefit of ECN feedback so it | |||
cannot decide whether to set or clear the RE flag. | cannot decide whether to set or clear the RE flag. | |||
If after receiving a SYN the server B has set its sending half- | If after receiving a SYN the server B has set its sending half- | |||
connection into RECN mode or RECN-Co mode, it MUST set the extended | connection into RECN mode or RECN-Co mode, it MUST set the extended | |||
ECN field in the IP header of its SYN ACK to the feedback not | ECN field in the IP header of its SYN ACK to the feedback not | |||
established (FNE) codepoint. Note the careful wording here, which | established (FNE) codepoint. Note the careful wording here, which | |||
means that Re-ECT server B MUST set FNE on a SYN ACK whether it is | means that Re-ECT server B MUST set FNE on a SYN ACK whether it is | |||
responding to a SYN from a Re-ECT client or from a client that is | responding to a SYN from a Re-ECT client or from a client that is | |||
merely ECN-capable. This is because FNE indicates the transport is | merely ECN-capable. This is because FNE indicates the transport is | |||
skipping to change at page 27, line 5 | skipping to change at page 31, line 9 | |||
trip time. We use the lower bound of the retransmission timeout | trip time. We use the lower bound of the retransmission timeout | |||
(RTO) [RFC2988], which is commonly used as the idle period before TCP | (RTO) [RFC2988], which is commonly used as the idle period before TCP | |||
must reduce to the restart window [RFC2581]. Note our specification | must reduce to the restart window [RFC2581]. Note our specification | |||
of re-ECN's idle period is NOT intended to change the idle period for | of re-ECN's idle period is NOT intended to change the idle period for | |||
TCP's restart, nor indeed for any other purposes. | TCP's restart, nor indeed for any other purposes. | |||
{ToDo: Describe how the sender falls back to RFC3168 modes if packets | {ToDo: Describe how the sender falls back to RFC3168 modes if packets | |||
don't appear to be getting through (to work round firewalls | don't appear to be getting through (to work round firewalls | |||
discarding packets they consider unusual).} | discarding packets they consider unusual).} | |||
4.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs | 6.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs | |||
A re-ECN sender MUST clear the RE flag to "0" and set the ECN field | A re-ECN sender MUST clear the RE flag to "0" and set the ECN field | |||
to Not-ECT in pure ACKs, retransmissions and window probes, as | to Not-ECT in pure ACKs, retransmissions and window probes, as | |||
specified in [RFC3168]. Our eventual goal is for all packets to be | specified in [RFC3168]. Our eventual goal is for all packets to be | |||
sent with re-ECN enabled, and we believe the semantics of the ECI | sent with re-ECN enabled, and we believe the semantics of the ECI | |||
field go a long way towards being able to achieve this. However, we | field go a long way towards being able to achieve this. However, we | |||
have not completed a full security analysis for these cases, | have not completed a full security analysis for these cases, | |||
therefore, currently we merely re-state current practice. | therefore, currently we merely re-state current practice. | |||
We must also reconcile the facts that congestion marking is applied | We must also reconcile the facts that congestion marking is applied | |||
skipping to change at page 27, line 47 | skipping to change at page 32, line 5 | |||
through the variable R. | through the variable R. | |||
This does not ensure precisely the same number of octets have RE | This does not ensure precisely the same number of octets have RE | |||
blanked as were CE marked. But we believe positive errors will | blanked as were CE marked. But we believe positive errors will | |||
cancel negative over a long enough period. {ToDo: However, more | cancel negative over a long enough period. {ToDo: However, more | |||
research is needed to prove whether this is so. If it is not, it may | research is needed to prove whether this is so. If it is not, it may | |||
be necessary to increment and decrement R in octets rather than | be necessary to increment and decrement R in octets rather than | |||
packets, by incrementing R as the product of D and the size in octets | packets, by incrementing R as the product of D and the size in octets | |||
of packets being sent (typically the MSS).} | of packets being sent (typically the MSS).} | |||
4.2. Other Transports | 6.2. Other Transports | |||
4.2.1. General Guidelines for Adding Re-ECN to Other Transports | 6.2.1. General Guidelines for Adding Re-ECN to Other Transports | |||
As a general rule, Re-ECT sender transports that have established the | As a general rule, Re-ECT sender transports that have established the | |||
receiver transport is at least ECN-capable (not necessarily re-ECN | receiver transport is at least ECN-capable (not necessarily re-ECN | |||
capable) MUST blank the RE codepoint for at least as many octets as | capable) MUST blank the RE codepoint for at least as many octets as | |||
arrive at receiver with the CE codepoint set. Re-ECN-capable sender | arrive at receiver with the CE codepoint set. Re-ECN-capable sender | |||
transports should always initialise the ECN field to the ECT(1) | transports should always initialise the ECN field to the ECT(1) | |||
codepoint once a flow is established. | codepoint once a flow is established. | |||
If the sender transport does not have sufficient feedback to even | If the sender transport does not have sufficient feedback to even | |||
estimate the path's CE rate, it SHOULD set FNE continuously. If the | estimate the path's CE rate, it SHOULD set FNE continuously. If the | |||
skipping to change at page 28, line 32 | skipping to change at page 32, line 39 | |||
following: | following: | |||
o UDP fire and forget (e.g. DNS) | o UDP fire and forget (e.g. DNS) | |||
o UDP streaming with no feedback | o UDP streaming with no feedback | |||
o UDP streaming with feedback | o UDP streaming with feedback | |||
} | } | |||
4.2.2. Guidelines for adding Re-ECN to RSVP or NSIS | 6.2.2. Guidelines for adding Re-ECN to RSVP or NSIS | |||
A separate I-D has been submitted [Re-PCN] describing how re-ECN can | A separate I-D has been submitted [Re-PCN] describing how re-ECN can | |||
be used in an edge-to-edge rather than end-to-end scenario. It can | be used in an edge-to-edge rather than end-to-end scenario. It can | |||
then be used by downstream networks to police whether upstream | then be used by downstream networks to police whether upstream | |||
networks are blocking new flow reservations when downstream | networks are blocking new flow reservations when downstream | |||
congestion is too high, even though the congestion is in other | congestion is too high, even though the congestion is in other | |||
operators' downstream networks. This relates to current IETF work on | operators' downstream networks. This relates to current IETF work on | |||
Admission Control over Diffserv using Pre-Congestion Notification | Admission Control over Diffserv using Pre-Congestion Notification | |||
(PCN) [PCN-arch]. | (PCN) [PCN-arch]. | |||
4.2.3. Guidelines for adding Re-ECN to DCCP | 6.2.3. Guidelines for adding Re-ECN to DCCP | |||
Beside adjusting the initial features negotiation sequence, operating | Beside adjusting the initial features negotiation sequence, operating | |||
re-ECN in DCCP [RFC4340] could be achieved by defining a new option | re-ECN in DCCP [RFC4340] could be achieved by defining a new option | |||
to be added to acknowledgments, that would include a multibit field | to be added to acknowledgments, that would include a multibit field | |||
where the destination could copy its ECC. | where the destination could copy its ECC. | |||
4.2.4. Guidelines for adding Re-ECN to SCTP | 6.2.4. Guidelines for adding Re-ECN to SCTP | |||
Appendix A in [RFC4960] gives the specifications for SCTP to support | Appendix A in [RFC4960] gives the specifications for SCTP to support | |||
ECN. Similar steps should be taken to support re-ECN. Beside | ECN. Similar steps should be taken to support re-ECN. Beside | |||
adjusting the initial features negotiation sequence, operating re-ECN | adjusting the initial features negotiation sequence, operating re-ECN | |||
in SCTP could be achieved by defining a new control chunk, that would | in SCTP could be achieved by defining a new control chunk, that would | |||
include a multibit field where the destination could copy its ECC. | include a multibit field where the destination could copy its ECC. | |||
5. Network Layer | ||||
5.1. Re-ECN IPv4 Wire Protocol | ||||
The wire protocol of the ECN field in the IP header remains largely | ||||
unchanged from [RFC3168]. However, an extension to the ECN field we | ||||
call the RE (Re-ECN extension) flag (Section 3.3) is defined in this | ||||
document. It doubles the extended ECN codepoint space, giving 8 | ||||
potential codepoints. The semantics of the extra codepoints are | ||||
backward compatible with the semantics of the 4 original codepoints | ||||
[RFC3168] (Section 7.1 collects together and summarises all the | ||||
changes defined in this document). | ||||
For IPv4, this document proposes that the new RE control flag will be | ||||
positioned where the `reserved' control flag was at bit 48 of the | ||||
IPv4 header (counting from 0). Alternatively, some would call this | ||||
bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 | ||||
header (Figure 5). | ||||
0 1 2 | ||||
+---+---+---+ | ||||
| R | D | M | | ||||
| E | F | F | | ||||
+---+---+---+ | ||||
Figure 5: New Definition of the Re-ECN Extension (RE) Control Flag at | ||||
the Start of Byte 7 of the IPv4 Header | ||||
The semantics of the RE flag are described in outline in Section 3 | ||||
and specified fully in Section 4. The RE flag is always considered | ||||
in conjunction with the 2-bit ECN field, as if they were concatenated | ||||
together to form a 3-bit extended ECN field. If the ECN field is set | ||||
to either the ECT(1) or CE codepoint, when the RE flag is blanked | ||||
(cleared to "0") it represents a re-echo of congestion experienced by | ||||
an earlier packet. If the ECN field is set to the Not-ECT codepoint, | ||||
when the RE flag is set to "1" it represents the feedback not | ||||
established (FNE) codepoint, which signals that the packet was sent | ||||
without the benefit of congestion feedback. | ||||
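To make the 3-bit extended ECN field concrete, the following sketch reads the RE flag and the ECN field from a raw IPv4 header and returns the codepoint name used in Table 7. It assumes the proposed RE position (bit 48, the high-order bit of the byte at offset 6) and the RFC3168 ECN field in the two low-order bits of the byte at offset 1. The function and dictionary names are illustrative only.

```python
# (ECN field, RE flag) -> extended ECN codepoint name, as listed in Table 7.
EECN_NAMES = {
    (0b01, 0): "Re-Echo",
    (0b00, 1): "FNE",
    (0b11, 0): "CE(0)",
    (0b01, 1): "RECT",
    (0b11, 1): "CE(-1)",
    (0b10, 1): "CU",            # currently unused
    (0b10, 0): "RFC3168-only",  # shown as "---" in Table 7
    (0b00, 0): "Not-RECT",
}


def decode_eecn(ipv4_header: bytes) -> str:
    """Return the extended ECN codepoint name for a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("IPv4 header is at least 20 bytes")
    ecn = ipv4_header[1] & 0x03            # two low-order bits of the old ToS byte
    re_flag = (ipv4_header[6] >> 7) & 0x1  # bit 48 of the header, counting from 0
    return EECN_NAMES[(ecn, re_flag)]
```

For example, a header with the ECN field set to Not-ECT (00) and the RE flag set to "1" decodes as FNE.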
It is believed that the FNE codepoint can simultaneously serve other | ||||
purposes, particularly where the start of a flow needs distinguishing | ||||
from packets later in the flow. For instance it would have been | ||||
useful to identify new flows for tag switching and might enable | ||||
similar developments in the future if it were adopted. It is similar | ||||
to the state set-up bit idea designed to protect against memory | ||||
exhaustion attacks. This idea was proposed informally by David Clark | ||||
and documented by Handley and Greenhalgh [Steps_DoS]. The FNE | ||||
codepoint can be thought of as a `soft-state set-up flag', because it | ||||
is idempotent (i.e. one occurrence of the flag is sufficient but | ||||
further occurrences achieve the same effect if previous ones were | ||||
lost). | ||||
There are likely to be other claims pending on the use of bit 48. We | ||||
know of at least two [ARI05], [RFC3514], but neither has been pursued | ||||
in the IETF so far, although the present proposal would meet the | ||||
needs of the former. | ||||
The security flag proposal (commonly known as the evil bit) was | ||||
published on 1 April 2003 as Informational RFC 3514, but it was not | ||||
adopted due to confusion over whether evil-doers might set it | ||||
inappropriately. The present proposal is backward compatible with | ||||
RFC3514 because if re-ECN compliant senders were benign they would | ||||
correctly clear the evil bit to honestly declare that they had just | ||||
received congestion feedback. Whereas evil-doers would hide | ||||
congestion feedback by setting the evil bit continuously, or at least | ||||
more often than they should. So, evil senders can be identified, | ||||
because they declare that they are good less often than they should. | ||||
5.2. Re-ECN IPv6 Wire Protocol | ||||
For IPv6, this document proposes that the new RE control flag will be | ||||
positioned as the first bit of the option field of a new Congestion | ||||
hop by hop option header (Figure 6). | ||||
0 1 2 3 | ||||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | ||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
| Next Header | Hdr ext Len | Option Type | Opt Length =4 | | ||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
|R| Reserved for future use | | ||||
|E| | | ||||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | ||||
Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option | ||||
Header containing the re-ECN Extension (RE) Control Flag | ||||
0 1 2 3 4 5 6 7 8 | ||||
+-+-+-+-+-+-+-+-+- | ||||
|AIU|C|Option ID| | ||||
+-+-+-+-+-+-+-+-+- | ||||
Figure 7: Congestion Hop by Hop Option Type Encoding | ||||
The Hop-by-Hop Options header enables packets to carry information to | ||||
be examined and processed by routers or nodes along the packet's | ||||
delivery path, including the source and destination nodes. For re- | ||||
ECN, the two bits of the Action If Unrecognized (AIU) flag of the | ||||
Congestion extension header MUST be set to "00" meaning if | ||||
unrecognized `skip over option and continue processing the header'. | ||||
Then, any routers or a receiver not upgraded with the optional re-ECN | ||||
features described in this memo will simply ignore this header. But | ||||
routers with these optional re-ECN features or a re-ECN policing | ||||
function, will process this Congestion extension header. | ||||
The `C' flag MUST be set to "1" to specify that the Option Data | ||||
(currently only the RE control flag) can change en-route to the | ||||
packet's final destination. This ensures that, when an | ||||
Authentication header (AH [RFC4302]) is present in the packet, for | ||||
any option whose data may change en-route, its entire Option Data | ||||
field will be treated as zero-valued octets when computing or | ||||
verifying the packet's authenticating value. | ||||
Although the RE control flag should not be changed along the path, we | ||||
expect that the rest of this option field that is currently `Reserved | ||||
for future use' could be used for a multi-bit congestion notification | ||||
field which we would expect to change en route. As the RE flag does | ||||
not need end-to-end authentication, we set the C flag to "1". | ||||
{ToDo: A Congestion Hop by Hop Option ID will need to be registered | ||||
with IANA.} | ||||
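The layout in Figures 6 and 7 can be sketched as a small encoder. The Option ID below is a placeholder chosen purely for illustration, since no value has been registered. The sketch also assumes the standard IPv6 rule that Hdr Ext Len counts 8-octet units beyond the first 8 octets, so it is 0 for this 8-octet header.

```python
PLACEHOLDER_OPTION_ID = 0x1E  # hypothetical 5-bit Option ID, not an assigned value


def congestion_option_type(option_id: int = PLACEHOLDER_OPTION_ID) -> int:
    """Option Type octet: AIU = 00 (skip if unrecognized), C = 1, 5-bit ID."""
    aiu = 0b00
    c_flag = 1
    return (aiu << 6) | (c_flag << 5) | (option_id & 0x1F)


def build_congestion_hbh_header(next_header: int, re_flag: bool,
                                option_id: int = PLACEHOLDER_OPTION_ID) -> bytes:
    """Build the 8-octet Hop-by-Hop header carrying only the Congestion option."""
    first_data_octet = 0x80 if re_flag else 0x00  # RE is the first bit of Option Data
    return bytes([
        next_header & 0xFF,
        0,                                 # Hdr Ext Len: no octets beyond the first 8
        congestion_option_type(option_id),
        4,                                 # Opt Length = 4 octets of Option Data
        first_data_octet, 0, 0, 0,         # remaining bits reserved for future use
    ])
```

With AIU set to "00", a node that does not recognise the option skips over it, which is the behaviour the text relies on for incremental deployment.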
5.3. Router Forwarding Behaviour | ||||
Re-ECN works well without modifying the forwarding behaviour of any | ||||
routers. However, below, two OPTIONAL changes to forwarding | ||||
behaviour are defined which respectively enhance performance and | ||||
improve a router's discrimination against flooding attacks. They are | ||||
both OPTIONAL additions that we propose MAY apply by default to all | ||||
Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN | ||||
marking behaviours [RFC3168]. Specifications for PHBs MAY define | ||||
different forwarding behaviours from this default, but this is not | ||||
required. [Re-PCN] is one example. | ||||
FNE indicates ECT: | ||||
The FNE codepoint tells a router to assume that the packet was | ||||
sent by an ECN-capable transport (see Section 5.4). Therefore an | ||||
FNE packet MAY be marked rather than dropped. Note that the FNE | ||||
codepoint has been intentionally chosen so that, to RFC3168 | ||||
compliant routers (which do not inspect the RE flag) an FNE packet | ||||
appears to be Not-ECT so it will be dropped by legacy AQM | ||||
algorithms. | ||||
A network operator MUST NOT configure a queue to ECN mark rather | ||||
than drop FNE packets unless it can guarantee that FNE packets | ||||
will be rate limited, either locally or upstream. The ingress | ||||
policers discussed in Section 6.1.5 would count as rate limiters | ||||
for this purpose. | ||||
Preferential Drop: If a re-ECN capable router queue experiences very | ||||
high load so that it has to drop arriving packets (e.g. a DoS | ||||
attack), it MAY preferentially drop packets within the same | ||||
Diffserv PHB using the preference order for extended ECN | ||||
codepoints given in Table 7. Preferential dropping can be | ||||
difficult to implement on some hardware, but if feasible it would | ||||
discriminate against attack traffic if done as part of the overall | ||||
policing framework of Section 6.1.3. If nowhere else, routers at | ||||
the egress of a network SHOULD implement preferential drop | ||||
(stronger than the MAY above). For simplicity, preferences 4 & 5 | ||||
MAY be merged into one preference level. | ||||
+-------+-----+------------+-------+------------+-------------------+ | ||||
| ECN | RE | Extended | Worth | Drop Pref | Re-ECN meaning | | ||||
| field | bit | ECN | | (1 = drop | | | ||||
| | | codepoint | | 1st) | | | ||||
+-------+-----+------------+-------+------------+-------------------+ | ||||
| 01 | 0 | Re-Echo | +1 | 5/4 | Re-echoed | | ||||
| | | | | | congestion and | | ||||
| | | | | | RECT | | ||||
| 00 | 1 | FNE | +1 | 4 | Feedback not | | ||||
| | | | | | established | | ||||
| 11 | 0 | CE(0) | 0 | 3 | Re-Echo canceled | | ||||
| | | | | | by congestion | | ||||
| | | | | | experienced | | ||||
| 01 | 1 | RECT | 0 | 3 | Re-ECN capable | | ||||
| | | | | | transport | | ||||
| 11 | 1 | CE(-1) | -1 | 3 | Congestion | | ||||
| | | | | | experienced | | ||||
| 10 | 1 | --CU-- | n/a | 2 | Currently Unused | | ||||
| 10 | 0 | --- | n/a | 2 | RFC3168 ECN use | | ||||
| | | | | | only | | ||||
| 00 | 0 | Not-RECT | n/a | 1 | Not | | ||||
| | | | | | Re-ECN-capable | | ||||
| | | | | | transport | | ||||
+-------+-----+------------+-------+------------+-------------------+ | ||||
Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth') | ||||
The above drop preferences are arranged to preserve packets with | ||||
more positive worth (Section 3.5), given senders of positive | ||||
packets must have honestly declared downstream congestion. This | ||||
is explained fully in Section 6 on applications, particularly when | ||||
the application of re-ECN to protect against DDoS attacks is | ||||
described. | ||||
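A minimal sketch of preferential drop, assuming a queue represented as a list of (ECN field, RE flag) pairs for packets within one Diffserv PHB. The preference values are copied from Table 7, with Re-Echo taken as level 5 (i.e. levels 4 and 5 not merged). The function name and queue representation are illustrative assumptions, not part of the specification.

```python
from typing import Sequence, Tuple

# (ECN field, RE flag) -> drop preference from Table 7 (1 = drop first).
DROP_PREFERENCE = {
    (0b01, 0): 5,  # Re-Echo (5/4 in Table 7; 5 assumed here)
    (0b00, 1): 4,  # FNE
    (0b11, 0): 3,  # CE(0)
    (0b01, 1): 3,  # RECT
    (0b11, 1): 3,  # CE(-1)
    (0b10, 1): 2,  # currently unused
    (0b10, 0): 2,  # RFC3168 ECN use only
    (0b00, 0): 1,  # Not-RECT
}


def pick_packet_to_drop(queue: Sequence[Tuple[int, int]]) -> int:
    """Return the index of the queued packet that should be dropped first.

    Ties are broken in favour of the packet nearest the head of the queue.
    """
    if not queue:
        raise ValueError("queue is empty")
    return min(range(len(queue)), key=lambda i: DROP_PREFERENCE[queue[i]])
```

Under overload, a queue would call this repeatedly, so Not-RECT traffic is shed before packets that have declared congestion through Re-Echo or FNE.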
5.4. Justification for Setting the First SYN to FNE | ||||
The initial SYN MUST be set to FNE by Re-ECT client A (Section 4.1.4), | ||||
and Section 5.3 says a queue MAY optionally treat an FNE packet as | ||||
ECN-capable, so an initial SYN may be marked CE(-1) rather than | ||||
dropped. This seems dangerous, because the sender has not yet | ||||
established whether the receiver is an RFC3168 one that does not | ||||
understand congestion marking. It also seems to allow malicious | ||||
senders to take advantage of ECN marking to suffer less drop when | ||||
launching SYN flooding attacks. Below we explain the features of the | ||||
protocol design that remove both these dangers. | ||||
ECN-capable initial SYN with a Not-ECT server: If the TCP server B | ||||
is re-ECN capable, provision is made for it to feed back a possible | ||||
congestion-marked SYN in the SYN ACK (Section 4.1.4). But if the | ||||
TCP client A finds out from the SYN ACK that the server was not | ||||
ECN-capable, the TCP client MUST conservatively consider the first | ||||
SYN as congestion marked before setting itself into Not-ECT mode. | ||||
Section 4.1.4 mandates that such a TCP client MUST also set its | ||||
initial window to 1 segment. In this way we remove the need to | ||||
cautiously avoid setting the first SYN to Not-RECT. This will | ||||
give worse performance while deployment is patchy, but better | ||||
performance once deployment is widespread. | ||||
SYN flooding attacks can't exploit ECN-capability: Malicious hosts | ||||
may think they can use the advantage that ECN-marking gives over | ||||
drop in launching classic SYN-flood attacks. But Section 5.3 | ||||
mandates that a router MUST only be configured to treat packets | ||||
with the FNE codepoint as ECN-capable if FNE packets are rate | ||||
limited somewhere. Introduction of the FNE codepoint was a | ||||
deliberate move to enable transport-neutral handling of flow-start | ||||
and flow state set-up in the IP layer where it belongs. It then | ||||
becomes possible to protect against flooding attacks of all forms | ||||
(not just SYN flooding) without transport-specific inspection for | ||||
things like the SYN flag in TCP headers. Then, for instance, SYN | ||||
flooding attacks using IPsec ESP encryption can also be rate | ||||
limited at the IP layer. | ||||
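As an illustration only, the rate limiting of FNE packets that Section 5.3 requires could be done with a token bucket keyed on nothing but the FNE codepoint in the IP header. The class name, parameters and the choice of a token bucket are our assumptions for this sketch; the text above does not mandate any particular limiter.

```python
from dataclasses import dataclass, field


@dataclass
class FneRateLimiter:
    """Hypothetical token bucket that admits FNE-marked packets.

    It looks only at the IP-layer codepoint, never at transport headers,
    so it works equally for TCP SYNs and for encrypted flows.
    """
    rate: float                      # FNE packets replenished per second (assumed unit)
    burst: float                     # bucket depth in packets
    tokens: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self):
        self.tokens = self.burst     # start with a full bucket

    def allow(self, now: float) -> bool:
        """Return True if an FNE packet arriving at time `now` is within the limit."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A queue configured this way would treat an FNE packet as ECN-capable only when allow() returns True, and would fall back to its normal drop behaviour otherwise.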
It might seem pedantic going to all this trouble to enable ECN on the | ||||
initial packet of a flow, but it is motivated by a much wider concern | ||||
to ensure safe congestion control will still be possible even if the | ||||
application mix evolves to the point where the majority of flows | ||||
consist of a single window or even a single packet. It also allows | ||||
denial of service attacks to be more easily isolated and prevented. | ||||
5.5. Control and Management | ||||
5.5.1. Negative Balance Warning | ||||
A new ICMP message type is being considered so that a dropper can | ||||
warn the apparent sender of a flow that it has started to sanction | ||||
the flow. The message would have similar semantics to the `Time | ||||
exceeded' ICMP message type. To ensure the sender has to invest some | ||||
work before the network will generate such a message, a dropper | ||||
SHOULD only send such a message for flows that have demonstrated that | ||||
they have started correctly by establishing a positive record, but | ||||
have later gone negative. The threshold is up to the implementation. | ||||
The purpose of the message is to distinguish the cause of these drops | ||||
from other causes, such as congestion or transmission losses. The dropper | ||||
would send the message to the sender of the flow, not the receiver. | ||||
If we did define this message type, it would be REQUIRED for all re- | ||||
ECT senders to parse and understand it. Note that a sender MUST only | ||||
use this message to explain why losses are occurring. A sender MUST | ||||
NOT take this message to mean that losses have occurred that it was | ||||
not aware of. Otherwise, spoof messages could be sent by malicious | ||||
sources to slow down a sender (cf. ICMP source quench). | ||||
However, the need for this message type is not yet confirmed, as we | ||||
are considering how to prevent it being used by malicious senders to | ||||
scan for droppers and to test their threshold settings. {ToDo: | ||||
Complete this section.} | ||||
5.5.2. Rate Response Control | ||||
As discussed in Section 6.1.5 the sender's access operator will be | ||||
expected to use bulk per-user policing, but they might choose to | ||||
introduce a per-flow policer. In cases where operators do introduce | ||||
per-flow policing, there may be a need for a sender to send a request | ||||
to the ingress policer asking for permission to apply a non-default | ||||
response to congestion (where TCP-friendly is assumed to be the | ||||
default). This would require the sender to know what message | ||||
format(s) to use and to be able to discover how to address the | ||||
policer. The required control protocol(s) are outside the scope of | ||||
this document, but will require definition elsewhere. | ||||
The policer is likely to be local to the sender and inline, probably | ||||
at the ingress interface to the internetwork. So, discovery should | ||||
not be hard. A variety of control protocols already exist for some | ||||
widely used rate-responses to congestion. For instance DCCP | ||||
congestion control identifiers (CCIDs [RFC4340]) fulfil this role and | ||||
so does QoS signalling (e.g. an RSVP request for controlled load | ||||
service is equivalent to a request for no rate response to | ||||
congestion, but with admission control). | ||||
5.6. IP in IP Tunnels | ||||
For re-ECN to work correctly through IP in IP tunnels, it needs | ||||
slightly different tunnel handling to regular ECN [RFC3168]. | ||||
Currently there is some inconsistency between how the handling of IP | ||||
in IP tunnels is defined in [RFC3168] and how it is defined in | ||||
[RFC4301], but re-ECN would work fine with the IPsec behaviour. This | ||||
inconsistency is addressed in a new Internet Draft [ECN-tunnel] that | ||||
proposes to update RFC3168 tunnel behaviour to bring it into line | ||||
with IPsec. Ideally, for re-ECN to work through a tunnel, the tunnel | ||||
entry should copy both the RE flag and the ECN field from the inner | ||||
to the outer IP header. Then at the tunnel exit, any congestion | ||||
marking of the outer ECN field should overwrite the inner ECN field | ||||
(unless the inner field is Not-ECT in which case an alarm should be | ||||
raised). The RE flag shouldn't change along a path, so the outer RE | ||||
flag should be the same as the inner. If it isn't, a management alarm | ||||
should be raised. This behaviour is the same as the full- | ||||
functionality variant of [RFC3168] at tunnel exit, but different at | ||||
tunnel entry. | ||||
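The ideal tunnel behaviour described above can be sketched as follows. This is a minimal model under our own assumptions: the ECN field is reduced to three symbolic values, the Header type and function names are hypothetical, and where the text is silent (what to forward after an alarm) the sketch simply keeps the inner header unchanged.

```python
from dataclasses import dataclass

# Simplified ECN field values (assumption: ECT(0)/ECT(1) are not distinguished).
NOT_ECT, ECT, CE = "Not-ECT", "ECT", "CE"


@dataclass
class Header:
    ecn: str   # ECN field of an IP header
    re: bool   # RE flag of an IP header


def tunnel_entry(inner: Header) -> Header:
    """Build the outer header by copying both the ECN field and the RE flag."""
    return Header(ecn=inner.ecn, re=inner.re)


def tunnel_exit(outer: Header, inner: Header):
    """Return (forwarded header, list of management alarms) at decapsulation."""
    alarms = []
    ecn = inner.ecn
    if outer.ecn == CE:
        if inner.ecn == NOT_ECT:
            # Congestion marking on a packet that never claimed ECN capability.
            alarms.append("CE on outer header but inner header is Not-ECT")
        else:
            ecn = CE  # outer congestion marking overwrites the inner field
    if outer.re != inner.re:
        # The RE flag should not change along a path.
        alarms.append("outer RE flag differs from inner RE flag")
    return Header(ecn=ecn, re=inner.re), alarms
```

For example, a packet that enters with ECT and RE set, and is marked CE inside the tunnel, leaves the tunnel with CE and RE set and raises no alarm.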
If tunnels are left as they are specified in [RFC3168], whether the | ||||
limited or full-functionality variants are used, a problem arises | ||||
with re-ECN if a tunnel crosses an inter-domain boundary, because the | ||||
difference between positive and negative markings will not be | ||||
correctly accounted for. In a limited functionality ECN tunnel, the | ||||
flow will appear to be RFC3168 compliant traffic, and therefore may | ||||
be wrongly rate limited. In a full-functionality ECN tunnel, the | ||||
result will depend on whether the tunnel entry copies the inner RE flag | ||||
to the outer header or the RE flag in the outer header is always | ||||
cleared. If the former, the flow will tend to be too positive when | ||||
accounted for at borders. If the latter, it will be too negative. | ||||
If the rules set out in [ECN-tunnel] are followed then this will not | ||||
be an issue. | ||||
5.7. Non-Issues | ||||
The following issues might seem to cause unfavourable interactions | ||||
with re-ECN, but we will explain why they don't: | ||||
o Various link layers support explicit congestion notification, such | ||||
as Frame Relay and ATM. Explicit congestion notification is | ||||
proposed to be added to other link layers, such as Ethernet | ||||
(802.3ar Ethernet congestion management) and MPLS [RFC5129]; | ||||
o Encryption and IPsec. | ||||
In the case of congestion notification at the link layer, each | ||||
particular link layer scheme either manages congestion on the link | ||||
with its own link-level feedback (the usual arrangement in the cases | ||||
of ATM and Frame Relay), or congestion notification from the link | ||||
layer is merged into congestion notification at the IP level when the | ||||
frame headers are decapsulated at the end of the link (the | ||||
recommended arrangement in the Ethernet and MPLS cases). Given the | ||||
RE flag is not intended to change along the path, this means that | ||||
downstream congestion will still be measurable at any point where IP | ||||
is processed on the path by subtracting negative from positive | ||||
markings. | ||||
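As an illustration only, the measurement can be sketched as follows. The worth table is an assumption based on the codepoint definitions elsewhere in this document (this section itself only mentions FNE as worth +1 and the CE(-1) marking), and the representation of a packet as a (marking, size) pair is hypothetical.

```python
# Assumed worth of each extended ECN codepoint.
WORTH = {
    "FNE": +1,      # feedback not established (flow start)
    "Re-Echo": +1,  # positive
    "RECT": 0,      # neutral
    "CE(0)": 0,     # cancelled
    "CE(-1)": -1,   # negative (congestion marked)
}


def downstream_congestion(packets):
    """Fraction of downstream congestion seen at a measurement point.

    `packets` is an iterable of (marking, size_in_octets) pairs.
    The result is (positive octets - negative octets) / total octets.
    """
    balance = 0
    total = 0
    for marking, size in packets:
        balance += WORTH[marking] * size
        total += size
    return balance / total if total else 0.0
```

With 3 positive, 1 negative and 96 neutral packets of equal size, the sketch reports 2% downstream congestion: 3% declared for the whole path, less the 1% already experienced upstream.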
In the case of encryption, as long as the tunnel issues described in | ||||
Section 5.6 are dealt with, payload encryption itself will not be a | ||||
problem. The design goal of re-ECN is to include downstream | ||||
congestion in the IP header so that it is not necessary to burrow into | ||||
inner headers. Obfuscation of flow identifiers is not a problem for | ||||
re-ECN policing elements. Re-ECN doesn't ever require flow | ||||
identifiers to be valid; it only requires them to be unique. So if | ||||
an IPsec encapsulating security payload (ESP [RFC4305]) or an | ||||
authentication header (AH [RFC4302]) is used, the security parameters | ||||
index (SPI) will be a sufficient flow identifier, as it is intended | ||||
to be unique to a flow without revealing actual port numbers. | ||||
In general, even if endpoints use some locally agreed scheme to hide | ||||
port numbers, re-ECN policing elements can just consider the pair of | ||||
source and destination IP addresses as the flow identifier. Re-ECN | ||||
encourages endpoints to at least tell the network layer that a | ||||
sequence of packets are all part of the same flow, if indeed they | ||||
are. The alternative would be for the sender to make each packet | ||||
appear to be a new flow, which would require them all to be marked | ||||
FNE in order to avoid being treated with the bulk of malicious flows | ||||
at the egress dropper. Given the FNE marking is worth +1 and | ||||
networks are likely to rate limit FNE packets, endpoints are given an | ||||
incentive not to set FNE on each packet. But if the sender really | ||||
does want to hide the flow relationship between packets it can choose | ||||
to pay the cost of multiple FNE packets, which in the long run will | ||||
compensate for the extra memory required on network policing elements | ||||
to process each flow. | ||||
6. Applications | ||||
6.1. Policing Congestion Response | ||||
6.1.1. The Policing Problem | ||||
The current Internet architecture trusts hosts to respond voluntarily | ||||
to congestion. Limited evidence shows that the large majority of | ||||
end-points on the Internet comply with a TCP-friendly response to | ||||
congestion. But telephony (and increasingly video) services over the | ||||
best effort Internet are attracting the interest of major commercial | ||||
operations. Most of these applications do not respond to congestion | ||||
at all. Those that can switch to lower rate codecs still have a | ||||
lower bound below which they must become unresponsive to congestion. | ||||
Of course, the Internet is intended to support many different | ||||
application behaviours. But the problem is that this freedom can be | ||||
exercised irresponsibly. The greater problem is that we will never | ||||
be able to agree on where the boundary is between responsible and | ||||
irresponsible. Therefore re-ECN is designed to allow different | ||||
networks to set their own view of the limit to irresponsibility, and | ||||
to allow networks that choose a more conservative limit to push back | ||||
against congestion caused in more liberal networks. | ||||
As an example of the impossibility of setting a standard for | ||||
fairness, mandating TCP-friendliness would set the bar too high for | ||||
unresponsive streaming media, but still some would say the bar was | ||||
too low. Even though all known peer-to-peer filesharing applications | ||||
are TCP-compatible, they can cause a disproportionate amount of | ||||
congestion, simply by using multiple flows and by transferring data | ||||
continuously relative to other short-lived sessions. On the other | ||||
hand, if we swung the other way and set the bar low enough to allow | ||||
streaming media to be unresponsive, we would also allow denial of | ||||
service attacks, which are typically unresponsive to congestion and | ||||
consist of multiple continuous flows. | ||||
Applications that need (or choose) to be unresponsive to congestion | ||||
can effectively take (some would say steal) whatever share of | ||||
bottleneck resources they want from responsive flows. Whether or not | ||||
such free-riding is common, inability to prevent it increases the | ||||
risk of poor returns for investors in network infrastructure, leading | ||||
to under-investment. An increasing proportion of unresponsive or | ||||
free-riding demand coupled with persistent under-supply is a broken | ||||
economic cycle. Therefore, if the current, largely co-operative | ||||
consensus continues to erode, congestion collapse could become more | ||||
common in more areas of the Internet [RFC3714]. | ||||
While we have designed re-ECN so that networks can choose to deploy | ||||
stringent policing, this does not imply we advocate that every | ||||
network should introduce tight controls on those that cause | ||||
congestion. Re-ECN has been specifically designed to allow different | ||||
networks to choose how conservative or liberal they wish to be with | ||||
respect to policing congestion. But those that choose to be | ||||
conservative can protect themselves from the excesses that liberal | ||||
networks allow their users. | ||||
6.1.2. The Case Against Bottleneck Policing | ||||
The state of the art in rate policing is the bottleneck policer, | ||||
which is intended to be deployed at any forwarding resource that may | ||||
become congested. Its aim is to detect flows that cause | ||||
significantly more local congestion than others. Although operators | ||||
might solve their immediate problems by deploying bottleneck | ||||
policers, we are concerned that widespread deployment would make it | ||||
extremely hard to evolve new application behaviours. We believe the | ||||
IETF should offer re-ECN as the preferred protocol on which to base | ||||
solutions to the policing problems of operators, because it would not | ||||
harm evolvability and, frankly, it would be far more effective (see | ||||
later for why). | ||||
Schemes like [XCHOKe] and [pBox] are sound approaches to rate | ||||
policing traffic without the benefit of whole path information (such | ||||
as could be provided by re-ECN). But they must be deployed at | ||||
bottlenecks in order to work. Unfortunately, a large proportion of | ||||
traffic traverses at least two bottlenecks (in two access networks), | ||||
particularly with the current traffic mix where peer-to-peer file- | ||||
sharing is prevalent. If ECN were deployed, we believe it would be | ||||
likely that these bottleneck policers would be adapted to combine ECN | ||||
congestion marking from the upstream path with local congestion | ||||
knowledge. But then the only useful placement for such policers | ||||
would be close to the egress of the internetwork. | ||||
But then, if these bottleneck policers were widely deployed (which | ||||
would require them to be more effective than they are now), the | ||||
Internet would find itself with one universal rate adaptation policy | ||||
(probably TCP-friendliness) embedded throughout the network. Given | ||||
TCP's congestion control algorithm is already known to be hitting its | ||||
scalability limits and new algorithms are being developed for high- | ||||
speed congestion control, embedding TCP policing into the Internet | ||||
would make evolution to new algorithms extremely painful. If a | ||||
source wanted to use a different algorithm, it would have to first | ||||
discover then negotiate with all the policers on its path, | ||||
particularly those in the far access network. The IETF has already | ||||
traveled that path with the Intserv architecture and found it | ||||
constrains scalability [RFC2208]. | ||||
Anyway, if bottleneck policers were ever widely deployed, they would | ||||
be likely to be bypassed by determined attackers. They inherently | ||||
have to police fairness per flow or per source-destination pair. | ||||
Therefore they can easily be circumvented either by opening multiple | ||||
flows (by varying the end-point port number); or by spoofing the | ||||
source address but arranging with the receiver to hide the true | ||||
return address at a higher layer. | ||||
6.1.3. Re-ECN Incentive Framework | ||||
The aim is to create an incentive environment that ensures optimal | ||||
sharing of capacity despite everyone acting selfishly (including | ||||
lying and cheating). Of course, the mechanisms put in place for this | ||||
can lie dormant wherever co-operation is the norm. | ||||
Throughout this document we focus on path congestion. But some forms | ||||
of fairness, particularly TCP's, also depend on round trip time. If | ||||
TCP-fairness is required, we also propose to measure downstream path | ||||
delay using re-feedback. We give a simple outline of how this could | ||||
work in Appendix F. However, we do not expect this to be necessary, | ||||
as researchers tend to agree that only congestion control dynamics | ||||
need to depend on RTT, not the rate that the algorithm would converge | ||||
on after a period of stability. | ||||
Figure 8 sketches the incentive framework that we will describe piece | ||||
by piece throughout this section. We will do a first pass in | ||||
overview, then return to each piece in detail. We re-use the earlier | ||||
example of how downstream congestion is derived by subtracting | ||||
upstream congestion from path congestion (Figure 2) but depict | ||||
multiple trust boundaries to turn it into an internetwork. For | ||||
clarity, only downstream congestion is shown (the difference between | ||||
the two earlier plots). The graph displays downstream path | ||||
congestion seen in a typical flow as it traverses an example path | ||||
from sender S to receiver R, across networks N1, N2 & N3. Everyone | ||||
is shown using re-ECN correctly, but we intend to show why everyone | ||||
would /choose/ to use it correctly, and honestly. | ||||
Three main types of self-interest can be identified: | ||||
o Users want to transmit data across the network as fast as | ||||
possible, paying as little as possible for the privilege. In this | ||||
respect, there is no distinction between senders and receivers, | ||||
but we must be wary of potential malice by one on the other; | ||||
o Network operators want to maximise revenues from the resources | ||||
they invest in. They compete amongst themselves for the custom of | ||||
users. | ||||
o Attackers (whether users or networks) want to use any opportunity | ||||
to subvert the new re-ECN system for their own gain or to damage | ||||
the service of their victims, whether targeted or random. | ||||
       policer                          dropper | ||||
          |                                |    | ||||
          |                                |    | ||||
        S <-----N1----> <---N2---> <---N3--> R    domain | ||||
        |                                       | ||||
     3% |---------+                             | ||||
        |         |                             | ||||
     2% |         +-----------------------+     | ||||
        | downstream congestion           |     | ||||
     1% |                                 |     | ||||
        |                                 |     | ||||
     0% +---------------------------------+====== | ||||
        0                                 i     | ||||
Figure 8: Incentive Framework, showing creation of opposing pressures | ||||
to under-declare and over-declare downstream congestion, using a | ||||
policer and a dropper | ||||
Source congestion control: We want to ensure that the sender will | ||||
throttle its rate as downstream congestion increases. Whatever | ||||
the agreed congestion response (whether TCP-compatible or some | ||||
enhanced QoS), to some extent it will always be against the | ||||
sender's interest to comply. | ||||
Ingress policing: But it is in all the network operators' interests | ||||
to encourage fair congestion response, so that their investments | ||||
are employed to satisfy the most valuable demand. The re-ECN | ||||
protocol ensures packets carry the necessary information about | ||||
their own expected downstream congestion so that N1 can deploy a | ||||
policer at its ingress to check that S is complying with whatever | ||||
congestion control it should be using (Section 6.1.5). If N1 is | ||||
extremely conservative it could police each flow, but it is likely | ||||
to just police the bulk amount of congestion each customer causes | ||||
without regard to flows, or if it is extremely liberal it need not | ||||
police congestion control at all. In any case, it is always | ||||
preferable to police traffic at the very first ingress into an | ||||
internetwork, before non-compliant traffic can cause any damage. | ||||
Edge egress dropper: If the policer ensures the source has less | ||||
right to a high rate the higher it declares downstream congestion, | ||||
the source has a clear incentive to understate downstream | ||||
congestion. But, if flows of packets are understated when they | ||||
enter the internetwork, they will have become negative by the time | ||||
they leave. So, we introduce a dropper at the last network | ||||
egress, which drops packets in flows that persistently declare | ||||
negative downstream congestion (see Section 6.1.4 for details). | ||||
Inter-domain traffic policing: But next we must ask, if congestion | ||||
arises downstream (say in N3), what is the ingress network's | ||||
(N1's) incentive to police its customers' response? If N1 turns a | ||||
blind eye, its own customers benefit while other networks suffer. | ||||
This is why all inter-domain QoS architectures (e.g. Intserv, | ||||
Diffserv) police traffic each time it crosses a trust boundary. | ||||
We have already shown that re-ECN gives a trustworthy measure of | ||||
the expected downstream congestion that a flow will cause by | ||||
subtracting negative volume from positive at any intermediate | ||||
point on a path. N3 (say) can use this measure to police all the | ||||
responses to congestion of all the sources beyond its upstream | ||||
neighbour (N2), but in bulk with one very simple passive | ||||
mechanism, rather than per flow, as we will now explain. | ||||
Emulating policing with inter-domain congestion penalties: Between | ||||
high-speed networks, we would rather avoid per-flow policing, and | ||||
we would rather avoid holding back traffic while it is policed. | ||||
Instead, once re-ECN has arranged headers to carry downstream | ||||
congestion honestly, N2 can contract to pay N3 penalties in | ||||
proportion to a single bulk count of the congestion metrics | ||||
crossing their mutual trust boundary (Section 6.1.6). In this | ||||
way, N3 puts pressure on N2 to suppress downstream congestion, for | ||||
every flow passing through the border interface, even though they | ||||
will all start and end in different places, and even though they | ||||
may all be allowed different responses to congestion. The figure | ||||
depicts this downward pressure on N2 by the solid downward arrow | ||||
at the egress of N2. Then N2 has an incentive either to police | ||||
the congestion response of its own ingress traffic (from N1) or to | ||||
emulate policing by applying penalties to N1 in turn on the basis | ||||
of congestion counted at their mutual boundary. In this recursive | ||||
way, the incentives for each flow to respond correctly to | ||||
congestion trace back with each flow precisely to each source, | ||||
despite the mechanism not recognising flows (see Section 6.2.2). | ||||
Inter-domain congestion charging diversity: Any two networks are | ||||
free to agree any of a range of penalty regimes between themselves | ||||
but they would only provide the right incentives if they were | ||||
within the following reasonable constraints. N2 should expect to | ||||
have to pay penalties to N3 where penalties monotonically increase | ||||
with the volume of congestion and negative penalties are not | ||||
allowed. For instance, they may agree an SLA with tiered | ||||
congestion thresholds, where higher penalties apply the higher the | ||||
threshold that is broken. But the most obvious (and useful) form | ||||
of penalty is where N3 levies a charge on N2 proportional to the | ||||
volume of downstream congestion N2 dumps into N3. In the | ||||
explanation that follows, we assume this specific variant of | ||||
volume charging between networks - charging proportionate to the | ||||
volume of congestion. | ||||
We must make clear that we are not advocating that everyone should | ||||
use this form of contract. We are well aware that the IETF tries | ||||
to avoid standardising technology that depends on a particular | ||||
business model. And we strongly share this desire to encourage | ||||
diversity. But our aim is merely to show that border policing can | ||||
at least work with this one model, then we can assume that | ||||
operators might experiment with the metric in other models (see | ||||
Section 6.1.6 for examples). Of course, operators are free to | ||||
complement this usage element of their charges with traditional | ||||
capacity charging, and we expect they will as predicted by | ||||
economics. | ||||
No congestion charging to users: Bulk congestion penalties at trust | ||||
boundaries are passive and extremely simple, and lose none of | ||||
their per-packet precision from one boundary to the next (unlike | ||||
Diffserv all-address traffic conditioning agreements, which | ||||
dissipate their effectiveness across long topologies). But at any | ||||
trust boundary, there is no imperative to use congestion charging. | ||||
Traditional traffic policing can be used, if the complexity and | ||||
cost is preferred. In particular, at the boundary with end | ||||
customers (e.g. between S and N1), traffic policing will most | ||||
likely be more appropriate. Policer complexity is less of a | ||||
concern at the edge of the network. And end-customers are known | ||||
to be highly averse to the unpredictability of congestion | ||||
charging. | ||||
NOTE WELL: This document neither advocates nor requires congestion | ||||
charging for end customers; it advocates but does not require | ||||
inter-domain congestion charging. | ||||
Competitive discipline of inter-domain traffic engineering: With | ||||
inter-domain congestion charging, a domain seems to have a | ||||
perverse incentive to fake congestion; N2's profit depends on the | ||||
difference between congestion at its ingress (its revenue) and at | ||||
its egress (its cost). So, overstating internal congestion seems | ||||
to increase profit. However, smart border routing [Smart_rtg] by | ||||
N1 will bias its routing towards the least cost routes. So, N2 | ||||
risks losing all its revenue to competitive routes if it | ||||
overstates congestion (see Section 6.2.3). In other words, if N2 | ||||
is the least congested route, its ability to raise excess profits | ||||
is limited by the congestion on the next least congested route. | ||||
Closing the loop: All the above elements conspire to trap everyone | ||||
between two opposing pressures, ensuring the downstream congestion | ||||
metric arrives at the destination neither above nor below zero. | ||||
So, we have arrived back where we started in our argument. The | ||||
ingress edge network can rely on downstream congestion declared in | ||||
the packet headers presented by the sender. So it can police the | ||||
sender's congestion response accordingly. | ||||
Evolvability of congestion control: We have seen that re-ECN enables | ||||
policing at the very first ingress. We have also seen that, as | ||||
flows continue on their path through further networks downstream, | ||||
re-ECN removes the need for further per-domain ingress policing of | ||||
all the different congestion responses allowed to each different | ||||
flow. This is why the evolvability of re-ECN policing is so | ||||
superior to bottleneck policing or to any policing of different | ||||
QoS for different flows. Even if all access networks choose to | ||||
conservatively police congestion per flow, each will want to | ||||
compete with the others to allow new responses to congestion for | ||||
new types of application. With re-ECN, each can introduce new | ||||
controls independently, without coordinating with other networks | ||||
and without having to standardise anything. But, as we have just | ||||
seen, by making inter-domain penalties proportionate to bulk | ||||
downstream congestion, downstream networks can be agnostic to the | ||||
specific congestion response for each flow, but they can still | ||||
apply more penalty the more liberal the ingress access network has | ||||
been in the response to congestion it allowed for each flow. | ||||
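The bulk inter-domain accounting described in this framework can be sketched as below. It is a sketch under stated assumptions, not a proposed contract: the worth table is assumed from the codepoint definitions elsewhere in this document, the function names and the price are hypothetical, and only the proportional penalty variant is shown.

```python
# Assumed worth of each extended ECN codepoint.
WORTH = {"FNE": +1, "Re-Echo": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


def border_congestion_volume(packets):
    """Single bulk counter at a border interface, in octets.

    `packets` is an iterable of (marking, size_in_octets) pairs for ALL
    traffic crossing the border; no flow identifier is inspected.
    """
    return sum(WORTH[marking] * size for marking, size in packets)


def proportional_penalty(volume_octets, price_per_octet):
    """Penalty the upstream network owes the downstream one.

    Grows monotonically with congestion volume and is never negative.
    """
    return price_per_octet * max(0, volume_octets)
```

N3 would apply such a counter to everything arriving from N2 over an accounting period, and N2 could apply the same counter to traffic from N1, which is how the incentive is passed back towards each source without any network recognising flows.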
6.1.3.1. The Case against Classic Feedback | ||||
A system that produces an optimal outcome as a result of everyone's | ||||
selfish actions is extremely powerful, especially one that enables | ||||
evolvability of congestion control. But why do we have to change to | ||||
re-ECN to achieve it? Can't classic congestion feedback (as used | ||||
already by standard ECN) be arranged to provide similar incentives | ||||
and similar evolvability? Superficially it can. Kelly's seminal | ||||
work showed how we can allow everyone the freedom to evolve whatever | ||||
congestion control behaviour is in their application's best interest | ||||
but still optimise the whole system of networks and users by placing | ||||
a price on congestion to ensure responsible use of this | ||||
freedom [Evol_cc]. Kelly used ECN with its classic congestion | ||||
feedback model as the mechanism to convey congestion price | ||||
information. The mechanism could be thought of as volume charging; | ||||
except only the volume of packets marked with congestion experienced | ||||
(CE) was counted. | ||||
However, below we explain why relying on classic feedback /required/ | ||||
congestion charging to be used, while re-ECN achieves the same | ||||
powerful outcome (given it is built on Kelly's foundations), but does | ||||
not /require/ congestion charging. In brief, the problem with | ||||
classic feedback is that the incentives have to trace the indirect | ||||
path back to the sender---the long way round the feedback loop. For | ||||
example, if classic feedback were used in Figure 8, N2 would have had | ||||
to influence N1 via all of N3, R & S rather than directly. | ||||
Inability to agree what is happening downstream: In order to police | ||||
its upstream neighbour's congestion response, the neighbours | ||||
should be able to agree on the congestion to be responded to. | ||||
Whatever the feedback regime, as packets change hands at each | ||||
trust boundary, any path metrics they carry are verifiable by both | ||||
neighbours. But, with a classic path metric, they can only agree | ||||
on the /upstream/ path congestion. | ||||
Inaccessible back-channel: The network needs a whole-path congestion | ||||
metric if it wants to control the source. Classically, whole path | ||||
congestion emerges at the destination, to be fed back from | ||||
receiver to sender in a back-channel. But, in any data network, | ||||
back-channels need not be visible to relays, as they are | ||||
essentially communications between the end-points. They may be | ||||
encrypted, asymmetrically routed or simply omitted, so no network | ||||
element can reliably intercept them. The congestion charging | ||||
literature solves this problem by charging the receiver and | ||||
assuming this will cause the receiver to refer the charges to the | ||||
sender. But, of course, this creates unintended side-effects... | ||||
`Receiver pays' unacceptable: In connectionless datagram networks, | ||||
receivers and receiving networks cannot prevent reception from | ||||
malicious senders, so `receiver pays' opens them to `denial of | ||||
funds' attacks. | ||||
End-user congestion charging unacceptable: Even if `denial of funds' | ||||
were not a problem, we know that end-users are highly averse to | ||||
the unpredictability of congestion charging and anyway, we want to | ||||
avoid restricting network operators to just one retail tariff. | ||||
But with classic feedback only an upstream metric is available, so | ||||
we cannot avoid having to wrap the `receiver pays' money flow | ||||
around the feedback loop, necessarily forcing end-users to be | ||||
subjected to congestion charging. | ||||
To summarise so far, with classic feedback, policing congestion | ||||
response without losing evolvability /requires/ congestion charging | ||||
of end-users and a `receiver pays' model, whereas, with re-ECN, it is | ||||
still possible to influence incentives using congestion charging but | ||||
using the safer `sender pays' model. However, congestion charging is | ||||
only likely to be appropriate between domains. So, without losing | ||||
evolvability, re-ECN enables technical policing mechanisms that are | ||||
more appropriate for end users than congestion pricing. | ||||
We now take a second pass over the incentive framework, filling in | ||||
the detail. | ||||
6.1.4. Egress Dropper | ||||
As traffic leaves the last network before the receiver (domain N3 in | ||||
Figure 8), the fraction of positive octets in a flow should match the | ||||
fraction of negative octets introduced by congestion marking, leaving | ||||
a balance of zero. If it is less (a negative flow), it implies that | ||||
the source is understating path congestion (which will reduce the | ||||
penalties that N2 owes N3). | ||||
If flows are positive, N3 need take no action---this simply means its | ||||
upstream neighbour is paying more penalties than it needs to, and the | ||||
source is going slower than it needs to. But, to protect itself | ||||
against persistently negative flows, N3 will need to install a | ||||
dropper at its egress. Appendix E gives a suggested algorithm for | ||||
this dropper. The dropper algorithm is not intended to be | ||||
standardised; it is merely provided to show that an efficient, | ||||
robust algorithm is possible. But whatever algorithm is used must | ||||
meet the criteria below: | ||||
o It SHOULD introduce minimal false positives for honest flows; | ||||
o It SHOULD quickly detect and sanction dishonest flows (minimal | ||||
false negatives); | ||||
o It MUST be invulnerable to state exhaustion attacks from malicious | ||||
sources. For instance, if the dropper uses flow-state, it should | ||||
not be possible for a source to send numerous packets, each with a | ||||
different flow ID, to force the dropper to exhaust its memory | ||||
capacity; | ||||
o It MUST introduce sufficient loss in goodput so that malicious | ||||
sources cannot play off losses in the egress dropper against | ||||
higher allowed throughput. Salvatori [CLoop_pol] describes this | ||||
attack, which involves the source understating path congestion | ||||
then inserting forward error correction (FEC) packets to | ||||
compensate expected losses. | ||||
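
The balance test that any such dropper has to approximate can be
illustrated with a minimal sketch. It is not the Appendix E algorithm:
it keeps a full packet list per flow, which a real dropper could not
afford, and the name flow_balance and the packet representation are
assumptions made only for this example. The worth of each codepoint
follows the text above (FNE and Re-Echo positive, RECT and CE(0) zero,
CE(-1) negative).

```python
# Worth of each extended ECN codepoint, as described in the text.
WORTH = {"FNE": +1, "Re-Echo": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


def flow_balance(packets):
    """Return positive octets minus negative octets for one flow.

    packets is an iterable of (codepoint, size_in_octets) pairs
    (a hypothetical representation chosen for this sketch).
    """
    return sum(WORTH[codepoint] * octets for codepoint, octets in packets)


# An honest sender re-echoes the congestion mark it was told about.
honest = [("FNE", 60), ("RECT", 1500), ("CE(-1)", 1500), ("Re-Echo", 1500)]

# A dishonest sender sends neutral packets instead of re-echoing.
understating = [("FNE", 60), ("RECT", 1500), ("CE(-1)", 1500), ("RECT", 1500)]

print(flow_balance(honest))        # zero or above: no action needed
print(flow_balance(understating))  # below zero: candidate for sanction
```

A flow whose balance stays below zero is the `persistently negative'
case that the criteria above are meant to catch.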
Note that the dropper operates on flows but we would like it not to | ||||
require per-flow state. This is why we have been careful to ensure | ||||
that all flows MUST start with a packet marked with the FNE | ||||
codepoint. If a flow does not start with the FNE codepoint, a | ||||
dropper is likely to treat it unfavourably. This risk makes it worth | ||||
setting the FNE codepoint at the start of a flow, even though there | ||||
is a cost to the sender of setting FNE (positive `worth'). Indeed, | ||||
with the FNE codepoint, the rate at which a sender can generate new | ||||
flows can be limited (Appendix G). In this respect, the FNE | ||||
codepoint works like Handley's state set-up bit [Steps_DoS]. | ||||
Appendix E also gives an example dropper implementation that | ||||
aggregates flow state. Dropper algorithms will often maintain a | ||||
moving average across flows of the fraction of RE blanked packets. | ||||
When maintaining an average across flows, a dropper SHOULD only allow | ||||
flows into the average if they start with FNE, but it SHOULD NOT | ||||
include packets with the FNE codepoint set in the average. A sender | ||||
sets the FNE codepoint when it does not have the benefit of feedback | ||||
from the receiver. So, counting packets with FNE set would be | ||||
likely to make the average unnecessarily positive, providing headroom | ||||
(or should we say footroom?) for dishonest (negative) traffic. | ||||
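
The two rules for the cross-flow average can be sketched as follows.
This is an illustration only, not the Appendix E implementation: the
class name, the exponentially weighted average, the weight alpha, and
the use of a set of admitted flow IDs are all assumptions. A set of
flow IDs is itself per-flow state, so a deployable dropper would need
a bounded structure in its place. For simplicity, any FNE packet
admits its flow, not only one seen at the start.

```python
class AggregateAverage:
    """Moving average of the Re-Echo fraction across admitted flows."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha        # weight of each new sample (assumed)
        self.avg = 0.0            # current average Re-Echo fraction
        self.admitted = set()     # flows that have sent FNE

    def observe(self, flow_id, codepoint):
        if codepoint == "FNE":
            # FNE admits the flow but is NOT counted in the average.
            self.admitted.add(flow_id)
            return
        if flow_id not in self.admitted:
            # Flows that never sent FNE are kept out of the average.
            return
        sample = 1.0 if codepoint == "Re-Echo" else 0.0
        self.avg += self.alpha * (sample - self.avg)


meter = AggregateAverage(alpha=0.5)
meter.observe("f1", "FNE")       # admits f1, average unchanged
meter.observe("f1", "Re-Echo")   # counted
meter.observe("f2", "Re-Echo")   # ignored: f2 never sent FNE
print(meter.avg)
```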
If the dropper detects a persistently negative flow, it SHOULD drop | ||||
sufficient negative and neutral packets to force the flow to not be | ||||
negative. Drops SHOULD be focused on just sufficient packets in | ||||
misbehaving flows to remove the negative bias while doing minimal | ||||
extra harm. | ||||
6.1.5. Policing | ||||
Access operators who wish to limit the congestion that a sender is | ||||
able to cause can deploy policers at the very first ingress to the | ||||
internetwork. Re-ECN has been designed to avoid the need for | ||||
bottleneck policing so that we can avoid a future where a single rate | ||||
adaptation policy is embedded throughout the network. Instead, re- | ||||
ECN allows the particular rate adaptation policy to be solely agreed | ||||
bilaterally between the sender and its ingress access provider | ||||
(Section 5.5.2 discusses possible ways to signal between them), which | ||||
allows congestion control to be policed, but maintains its | ||||
evolvability, requiring only a single, local box to be updated. | ||||
Appendix G gives examples of per-user policing algorithms. But there | ||||
is no implication that these algorithms are to be standardised, or | ||||
that they are ideal. The ingress rate policer is the part of the re- | ||||
ECN incentive framework that is intended to be the most flexible. | ||||
Once endpoint protocol handlers for re-ECN and egress droppers are in | ||||
place, operators can choose exactly which congestion response they | ||||
want to police, and whether they want to do it per user, per flow or | ||||
not at all. | ||||
The re-ECN protocol allows these ingress policers to easily perform | ||||
bulk per-user policing (Appendix G.1). This is likely to provide | ||||
sufficient incentive to the user to correctly respond to congestion | ||||
without needing the policing function to be overly complex. If an | ||||
access operator chose to, it could instead use per-flow policing | ||||
according to the widely adopted TCP rate adaptation (Appendix G.2) or | ||||
other alternatives; however, this would introduce extra complexity to | ||||
the system. | ||||
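
One plausible shape for a bulk per-user policer is a token bucket that
is filled at a constant rate and drained by the octets a user sends in
positive or canceled packets. The sketch below is an assumption made
for illustration and is not taken from Appendix G.1: the class name,
the parameters fill_rate and depth, and the decision to merely report
when the allowance is exceeded (leaving the sanction to the operator)
are all hypothetical.

```python
class CongestionBucket:
    """Per-user allowance of congestion, measured in octets."""

    COUNTED = ("Re-Echo", "FNE", "CE(0)")   # positive and canceled

    def __init__(self, fill_rate, depth):
        self.fill_rate = fill_rate   # octets of congestion per second
        self.depth = depth           # maximum stored allowance, octets
        self.tokens = depth
        self.last = 0.0              # time of the previous update, s

    def within_allowance(self, now, codepoint, octets):
        # Refill for the time elapsed since the previous packet.
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.fill_rate)
        self.last = now
        if codepoint not in self.COUNTED:
            return True              # neutral packets cost nothing
        if self.tokens >= octets:
            self.tokens -= octets
            return True
        return False                 # user has exceeded its allowance


bucket = CongestionBucket(fill_rate=100, depth=3000)
print(bucket.within_allowance(0.0, "Re-Echo", 1500))
```

Because only one bucket is kept per user, the cost of this policer does
not grow with the number of flows the user opens.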
If a per-flow rate policer is used, it should use path (not | ||||
downstream) congestion as the relevant metric, which is represented | ||||
by the fraction of octets in packets with positive (Re-Echo and FNE) | ||||
and canceled (CE(0)) markings. Of course, re-ECN provides all the | ||||
information a policer needs directly in the packets being policed. | ||||
So, even policing TCP's AIMD algorithm is relatively straightforward | ||||
(Appendix G.2). | ||||
Note that we have included canceled packets in the measure of path | ||||
congestion. Canceled packets arise when the sender re-echoes earlier | ||||
congestion, but then this Re-Echo packet just happens to be | ||||
congestion marked itself. One would not normally expect many | ||||
canceled packets at the first ingress because one would not normally | ||||
expect much congestion marking to have been necessary that soon in | ||||
the path. However, a home network or campus network may well sit | ||||
between the sending endpoint and the ingress policer, so some | ||||
congestion may occur upstream of the policer. And if congestion does | ||||
occur upstream, some canceled packets should be visible, and should | ||||
be taken into account in the measure of path congestion. | ||||
But a much more important reason for including canceled packets in | ||||
the measure of path congestion at an ingress policer is that a sender | ||||
might otherwise subvert the protocol by sending canceled packets | ||||
instead of neutral (RECT) packets. Like neutral, canceled packets | ||||
are worth zero, so the sender knows they won't be counted against any | ||||
quota it might have been allowed. But unlike neutral packets, | ||||
canceled packets are immune to congestion marking, because they have | ||||
already been congestion marked. So, it is both correct and useful | ||||
that canceled packets should be included in a policer's measure of | ||||
path congestion, as this removes the incentive the sender would | ||||
otherwise have to mark more packets as canceled than it should. | ||||
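
The path congestion metric described above can be written down
directly. This is a minimal sketch in which the function name and the
(codepoint, octets) packet representation are assumptions. It also
shows why counting canceled packets removes the incentive to send them
in place of neutral ones.

```python
# Markings that count towards path congestion at an ingress policer.
PATH_CONGESTION_CODEPOINTS = {"Re-Echo", "FNE", "CE(0)"}


def path_congestion(packets):
    """Fraction of octets carried in positive or canceled packets."""
    total = sum(octets for _, octets in packets)
    if total == 0:
        return 0.0
    marked = sum(octets for codepoint, octets in packets
                 if codepoint in PATH_CONGESTION_CODEPOINTS)
    return marked / total


# 97 neutral packets, 2 Re-Echo and 1 canceled, all 1500 octets.
normal = [("RECT", 1500)] * 97 + [("Re-Echo", 1500)] * 2 + [("CE(0)", 1500)]
print(path_congestion(normal))

# A sender that marks everything as canceled gains nothing: the
# policer sees the highest possible path congestion.
cheat = [("CE(0)", 1500)] * 100
print(path_congestion(cheat))
```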
An ingress policer should also ensure that flows are not already | ||||
negative when they enter the access network. As with canceled | ||||
packets, the presence of negative packets will typically be unusual. | ||||
Therefore it will be easy to detect negative flows at the ingress by | ||||
just detecting negative packets then monitoring the flow they belong | ||||
to. | ||||
Of course, even if the sender does operate its own network, it may | ||||
arrange not to congestion mark traffic. Whether the sender does this | ||||
or not is of no concern to anyone else except the sender. Such a | ||||
sender will not be policed against its own network's contribution to | ||||
congestion, but the only resulting problem would be overload in the | ||||
sender's own network. | ||||
Finally, we must not forget that an easy way to circumvent re-ECN's | ||||
defences is for the source to turn off re-ECN support, by setting the | ||||
Not-RECT codepoint, implying RFC3168 compliant traffic. Therefore an | ||||
ingress policer should put a general rate-limit on Not-RECT traffic, | ||||
which SHOULD be lax during early, patchy deployment, but will have to | ||||
become stricter as deployment widens. Similarly, flows starting | ||||
without an FNE packet can be confined by a strict rate-limit used for | ||||
the remainder of flows that haven't proved they are well-behaved by | ||||
starting correctly (therefore they need not consume any flow state--- | ||||
they are just confined to the `misbehaving' bin if they carry an | ||||
unrecognised flow ID). | ||||
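
The binning just described might be sketched as below. The bin names,
the function name, and the use of a set of known flow IDs are
assumptions for illustration; the rate limits applied to each bin are
left to operator policy and are not shown.

```python
def classify(flow_id, codepoint, known_flows):
    """Return the name of the bin a packet should be policed in."""
    if codepoint == "Not-RECT":
        # Re-ECN is turned off: general rate limit for such traffic.
        return "not-rect"
    if codepoint == "FNE":
        # Correct start of a flow: remember it as well-behaved so far.
        known_flows.add(flow_id)
        return "re-ecn"
    if flow_id in known_flows:
        return "re-ecn"
    # No FNE seen for this flow ID: no state is created for it, it
    # simply shares the strictly limited bin.
    return "misbehaving"


known = set()
print(classify("a", "FNE", known))       # re-ecn
print(classify("a", "RECT", known))      # re-ecn
print(classify("b", "RECT", known))      # misbehaving
print(classify("c", "Not-RECT", known))  # not-rect
```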
6.1.6. Inter-domain Policing | ||||
One of the main design goals of re-ECN is for border security | ||||
mechanisms to be as simple as possible, otherwise they will become | ||||
the pinch-points that limit scalability of the whole internetwork. | ||||
We want to avoid per-flow processing at borders and to keep to | ||||
passive mechanisms that can monitor traffic in parallel to | ||||
forwarding, rather than having to filter traffic inline---in series | ||||
with forwarding. Such passive, off-line mechanisms are essential for | ||||
future high-speed all-optical border interconnection where packets | ||||
cannot be buffered while they are checked for policy compliance. | ||||
So far, we have been able to keep the border mechanisms simple, | ||||
despite having had to harden them against some subtle attacks on the | ||||
re-ECN design. The mechanisms are still passive and avoid per-flow | ||||
processing. | ||||
The basic accounting mechanism at each border interface simply | ||||
involves accumulating the volume of packets with positive worth (Re- | ||||
Echo and FNE), and subtracting the volume of those with negative | ||||
worth: CE(-1). Even though this mechanism takes no regard of flows, | ||||
over an accounting period (say a month) this subtraction will account | ||||
for the downstream congestion caused by all the flows traversing the | ||||
interface, wherever they come from, and wherever they go to. The two | ||||
networks can agree to use this metric however they wish to determine | ||||
some congestion-related penalty against the upstream network. | ||||
Although the algorithm could hardly be simpler, it is spelled out | ||||
using pseudo-code in Appendix H.1. | ||||
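
For readers without Appendix H.1 to hand, the accounting rule can be
illustrated as follows. This is a sketch written from the description
above rather than a copy of the appendix pseudo-code, and the class
and method names are assumptions.

```python
class BorderMeter:
    """Bulk downstream congestion meter for one border interface."""

    def __init__(self):
        self.balance = 0   # positive octets minus negative octets

    def count(self, codepoint, octets):
        if codepoint in ("Re-Echo", "FNE"):
            self.balance += octets       # positive worth
        elif codepoint == "CE(-1)":
            self.balance -= octets       # negative worth
        # Neutral and canceled packets leave the balance unchanged.


meter = BorderMeter()
for codepoint, octets in [("FNE", 60), ("RECT", 1500),
                          ("Re-Echo", 1500), ("CE(-1)", 1500),
                          ("Re-Echo", 1500)]:
    meter.count(codepoint, octets)
print(meter.balance)   # octets of downstream congestion so far
```

No flow identifier is consulted, so the meter needs a single counter
per interface regardless of how many flows cross it.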
Various attempts to subvert the re-ECN design have been made. In all | ||||
cases their root cause is persistently negative flows. But, after | ||||
describing these attacks we will show that we don't actually have to | ||||
get rid of all persistently negative flows in order to thwart the | ||||
attacks. | ||||
In honest flows, downstream congestion is measured as positive minus | ||||
negative volume. So if all flows are honest (i.e. not persistently | ||||
negative), adding all positive volume and all negative volume without | ||||
regard to flows will give an aggregate measure of downstream | ||||
congestion. But such simple aggregation is only possible if no flows | ||||
are persistently negative. Unless persistently negative flows are | ||||
completely removed, they will reduce the aggregate measure of | ||||
congestion. The aggregate may still be positive overall, but not as | ||||
positive as it would have been had the negative flows been removed. | ||||
In Section 6.1.4 we discussed how to sanction traffic to remove, or | ||||
at least to identify, persistently negative flows. But, even if the | ||||
sanction for negative traffic is to discard it, unless it is | ||||
discarded at the exact point it goes negative, it will wrongly | ||||
subtract from aggregate downstream congestion, at least at any | ||||
borders it crosses after it has gone negative but before it is | ||||
discarded. | ||||
We rely on sanctions to deter dishonest understatement of congestion. | ||||
But even the ultimate sanction of discard can only be effective if | ||||
the sender is bothered about the data getting through to its | ||||
destination. A number of attacks have been identified where a sender | ||||
gains from sending dummy traffic or it can attack someone or | ||||
something using dummy traffic even though it isn't communicating any | ||||
information to anyone: | ||||
o A host can send traffic with no positive markings towards its | ||||
intended destination, aiming to transmit as much traffic as any | ||||
dropper will allow [Bauer06]. It may add forward error correction | ||||
(FEC) to repair as much drop as it experiences. | ||||
o A host can send dummy traffic into the network with no positive | ||||
markings and with no intention of communicating with anyone, but | ||||
merely to cause higher levels of congestion for others who do want | ||||
to communicate (DoS). So, to ride over the extra congestion, | ||||
everyone else has to spend more of whatever rights to cause | ||||
congestion they have been allowed. | ||||
o A network can simply create its own dummy traffic to congest | ||||
another network, perhaps causing it to lose business at no cost to | ||||
the attacking network. This is a form of denial of service | ||||
perpetrated by one network on another. The preferential drop | ||||
measures in Section 5.3 provide crude protection against such | ||||
attacks, but we are not overly worried about more accurate | ||||
prevention measures, because it is already possible for networks | ||||
to DoS other networks on the general Internet, but they generally | ||||
don't because of the grave consequences of being found out. We | ||||
are only concerned if re-ECN increases the motivation for such an | ||||
attack, as in the next example. | ||||
o A network can just generate negative traffic and send it over its | ||||
border with a neighbour to reduce the overall penalties that it | ||||
should pay to that neighbour. It could even initialise the TTL so | ||||
it expired shortly after entering the neighbouring network, | ||||
reducing the chance of detection further downstream. This attack | ||||
need not be motivated by a desire to deny service and indeed need | ||||
not cause denial of service. A network's main motivator would | ||||
most likely be to reduce the penalties it pays to a neighbour. | ||||
But, the prospect of financial gain might tempt the network into | ||||
mounting a DoS attack on the other network as well, given the gain | ||||
would offset some of the risk of being detected. | ||||
The first step towards a solution to all these problems with negative | ||||
flows is to be able to estimate the contribution they make to | ||||
downstream congestion at a border and to correct the measure | ||||
accordingly. Although ideally we want to remove negative flows | ||||
themselves, perhaps surprisingly, the most effective first step is to | ||||
cancel out the polluting effect negative flows have on the measure of | ||||
downstream congestion at a border. It is more important to get an | ||||
unbiased estimate of their effect, than to try to remove them all. A | ||||
suggested algorithm to give an unbiased estimate of the contribution | ||||
from negative flows to the downstream congestion measure is given in | ||||
Appendix H.2. | ||||
Although making an accurate assessment of the contribution from | ||||
negative flows may not be easy, just the single step of neutralising | ||||
their polluting effect on congestion metrics removes all the gains | ||||
networks could otherwise make from mounting dummy traffic attacks on | ||||
each other. This puts all networks on the same side (only with | ||||
respect to negative flows of course), rather than being pitched | ||||
against each other. The network where this flow goes negative as | ||||
well as all the networks downstream lose out from not being | ||||
reimbursed for any congestion this flow causes. So they all have an | ||||
interest in getting rid of these negative flows. Networks forwarding | ||||
a flow before it goes negative aren't strictly on the same side, but | ||||
they are disinterested bystanders---they don't care that the flow | ||||
goes negative downstream, but at least they can't actively gain from | ||||
making it go negative. The problem becomes localised so that once a | ||||
flow goes negative, all the networks from where it happens and beyond | ||||
downstream each have a small problem, each can detect it has a | ||||
problem and each can get rid of the problem if it chooses to. But | ||||
negative flows can no longer be used for any new attacks. | ||||
Once an unbiased estimate of the effect of negative flows can be | ||||
made, the problem reduces to detecting and preferably removing flows | ||||
that have gone negative as soon as possible. But importantly, | ||||
complete eradication of negative flows is no longer critical---best | ||||
endeavours will be sufficient. | ||||
For instance, let us consider the case where a source sends traffic | ||||
with no positive markings at all, hoping to at least get as much | ||||
traffic delivered as network-based droppers will allow. The flow is | ||||
likely to go at least slightly negative in the first network on the | ||||
path (N1 if we use the example network layout in Figure 8). If all | ||||
networks use the algorithm in Appendix H.2 to inflate penalties at | ||||
their border with an upstream network, they will remove the effect of | ||||
negative flows. So, for instance, N2 will not be paying a penalty to | ||||
N1 for this flow. Further, because the flow contributes no positive | ||||
markings at all, a dropper at the egress will completely remove it. | ||||
The remaining problem is that every network is carrying a flow that | ||||
is causing congestion to others but not being held to account for the | ||||
congestion it is causing. Whenever the fail-safe border algorithm | ||||
(Section 6.1.7) or the border algorithm to compensate for negative | ||||
flows (Appendix H.2) detects a negative flow, it can instantiate a | ||||
focused dropper for that flow locally. It may be some time before | ||||
the flow is detected, but the more strongly negative the flow is, the | ||||
more quickly it will be detected by the fail-safe algorithm. But, in | ||||
the meantime, it will not be distorting border incentives. Until it | ||||
is detected, if it contributes to drop anywhere, its packets will | ||||
tend to be dropped before others if queues use the preferential drop | ||||
rules in Section 5.3, which discriminate against non-positive | ||||
packets. All networks below the point where a flow goes negative | ||||
(N1, N2 and N3 in this case) have an incentive to remove this flow, | ||||
but the queue where it first goes negative (in N1) can of course | ||||
remove the problem for everyone downstream. | ||||
In the case of DDoS attacks, Section 6.2.1 describes how re-ECN | ||||
mitigates their force. | ||||
6.1.7. Inter-domain Fail-safes | ||||
The mechanisms described so far create incentives for rational | ||||
network operators to behave. That is, one operator aims to make | ||||
another behave responsibly by applying penalties and expects a | ||||
rational response (i.e. one that trades off costs against benefits). | ||||
It is usually reasonable to assume that other network operators will | ||||
behave rationally (policy routing can avoid those that might not). | ||||
But this approach does not protect against the misconfigurations and | ||||
accidents of other operators. | ||||
Therefore, we propose the following two mechanisms at a network's | ||||
borders to provide "defence in depth". Both are similar: | ||||
Highly positive flows: A small sample of positive packets should be | ||||
picked randomly as they cross a border interface. Then subsequent | ||||
packets matching the same source and destination address and DSCP | ||||
should be monitored. If the fraction of positive marking is well | ||||
above a threshold (to be determined by operational practice), a | ||||
management alarm SHOULD be raised, and the flow MAY be | ||||
automatically subject to focused drop. | ||||
Persistently negative flows: A small sample of congestion marked | ||||
(negative) packets should be picked randomly as they cross a | ||||
border interface. Then subsequent packets matching the same | ||||
source and destination address and DSCP should be monitored. If | ||||
the balance of positive minus negative markings is persistently | ||||
negative, a management alarm SHOULD be raised, and the flow MAY be | ||||
automatically subject to focused drop. | ||||
Both these mechanisms rely on the fact that highly positive (or | ||||
negative) flows will appear more quickly in the sample by selecting | ||||
randomly solely from positive (or negative) packets. | ||||
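
The second fail-safe can be sketched as below; the first is its mirror
image, sampling positive packets. Everything beyond the steps in the
text is an assumption: the class name, the sampling probability, the
balance threshold that stands in for `persistently negative', and
counting in packets rather than octets. A real implementation would
also bound the size of the monitored table.

```python
import random

WORTH = {"FNE": +1, "Re-Echo": +1, "RECT": 0, "CE(0)": 0, "CE(-1)": -1}


class NegativeFlowFailSafe:
    def __init__(self, sample_prob=0.01, alarm_balance=-10, rng=None):
        self.sample_prob = sample_prob      # chance of picking a packet
        self.alarm_balance = alarm_balance  # assumed alarm threshold
        self.rng = rng or random.Random()
        self.monitored = {}                 # (src, dst, dscp) -> balance

    def observe(self, src, dst, dscp, codepoint):
        """Return True if a management alarm should be raised."""
        key = (src, dst, dscp)
        worth = WORTH.get(codepoint, 0)
        if key not in self.monitored:
            # Only negative packets are candidates for sampling.
            if worth < 0 and self.rng.random() < self.sample_prob:
                self.monitored[key] = 0
            else:
                return False
        self.monitored[key] += worth
        return self.monitored[key] <= self.alarm_balance


# With sample_prob=1.0 every negative packet starts monitoring.
failsafe = NegativeFlowFailSafe(sample_prob=1.0, alarm_balance=-3)
for _ in range(3):
    alarm = failsafe.observe("10.0.0.1", "10.0.0.2", 0, "CE(-1)")
print(alarm)
```

Since candidates are drawn only from negative packets, a flow that
sends many of them is picked up sooner than one that sends few.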
6.1.8. Simulations | ||||
Simulations of policer and dropper performance done for the multi-bit | ||||
version of re-feedback have been included in section 5 "Dropper | ||||
Performance" of [Re-fb]. Simulations of policer and dropper for the | ||||
re-ECN version described in this document are work in progress. | ||||
6.2. Other Applications | ||||
6.2.1. DDoS Mitigation | ||||
A flooding attack is inherently about congestion of a resource. | ||||
Because re-ECN ensures the sources causing network congestion | ||||
experience the cost of their own actions, it acts as a first line of | ||||
defence against DDoS. As load focuses on a victim, upstream queues | ||||
grow, requiring honest sources to pre-load packets with a higher | ||||
fraction of positive packets. Once downstream queues are so | ||||
congested that they are dropping traffic, they will be CE marking | ||||
100% of the traffic they do forward. Honest sources will therefore | ||||
be sending 100% Re-Echo (and therefore being severely rate-limited | ||||
at the ingress). | ||||
Senders under malicious control can either do the same as honest | ||||
sources, and be rate-limited at ingress, or they can understate | ||||
congestion by sending more neutral RECT packets than they should. If | ||||
sources understate congestion (i.e. do not re-echo sufficient | ||||
positive packets) and the preferential drop ranking is implemented on | ||||
queues (Section 5.3), these queues will preserve positive traffic | ||||
until last. So, the neutral traffic from malicious sources will all | ||||
be automatically dropped first. Either way, the malicious sources | ||||
cannot send more than honest sources. | ||||
Further, hosts under malicious control will tend to be re-used for | ||||
many different attacks. They will therefore build up a long term | ||||
history of causing congestion. Therefore, as long as the population | ||||
of potentially compromisable hosts around the Internet is limited, | ||||
the per-user policing algorithms in Appendix G.1 will gradually | ||||
throttle down zombies and other launchpads for attacks. Therefore, | ||||
widespread deployment of re-ECN could considerably dampen the force | ||||
of DDoS. Certainly, zombie armies could hold their fire for long | ||||
enough to be able to build up enough credit in the per-user policers | ||||
to launch an attack. But they would then still be limited to no more | ||||
throughput than other, honest users. | ||||
Inter-domain traffic policing (see Section 6.1.6) ensures that any | ||||
network that harbours compromised `zombie' hosts will have to bear | ||||
the cost of the congestion caused by traffic from zombies in | ||||
downstream networks. Such networks will be incentivised to deploy | ||||
per-user policers that rate-limit hosts that are unresponsive to | ||||
congestion so they can only send very slowly into congested paths. | ||||
As well as protecting other networks, the extremely poor performance | ||||
at any sign of congestion will incentivise the zombie's owner to | ||||
clean it up. However, the host should behave normally when using | ||||
uncongested paths. | ||||
Uniquely, re-ECN handles DDoS traffic without relying on the validity | ||||
of identifiers in packets. Certainly the egress dropper relies on | ||||
uniqueness of flow identifiers, but not their validity. So if a | ||||
source spoofs another address, re-ECN works just as well, as long as | ||||
the attacker cannot imitate all the flow identifiers of another | ||||
active flow passing through the same dropper (see Section 6.3). | ||||
Similarly, the ingress policer relies on uniqueness of flow IDs, not | ||||
their validity. This is because a new flow will only be allowed any rate at | ||||
all if it starts with FNE, and the more FNE packets there are | ||||
starting new flows, the more they will be limited. Essentially a re- | ||||
ECN policer limits the bulk of all congestion entering the network | ||||
through a physical interface; limiting the congestion caused by each | ||||
flow is merely an optional extra. | ||||
6.2.2. End-to-end QoS | ||||
{ToDo: (Section 3.3.2 of [Re-fb] entitled `Edge QoS' gives an outline | ||||
of the text that will be added here).} | ||||
6.2.3. Traffic Engineering | ||||
{ToDo: } | ||||
6.2.4. Inter-Provider Service Monitoring | ||||
{ToDo: } | ||||
6.3. Limitations | ||||
The known limitations of the re-ECN approach are: | ||||
o We still cannot defend against the attack described in Section 10 | ||||
where a malicious source sends negative traffic through the same | ||||
egress dropper as another flow and imitates its flow identifiers, | ||||
allowing a malicious source to cause an innocent flow to | ||||
experience heavy drop. | ||||
o Re-feedback for TTL (re-TTL) would also be desirable at the same | ||||
time as re-ECN. Unfortunately this requires a further standards | ||||
action for the mechanisms briefly described in Appendix F. | ||||
o Traffic must be ECN-capable for re-ECN to be effective. The only | ||||
defence against malicious users who turn off ECN capability is that | ||||
networks are expected to rate limit Not-ECT traffic and to apply | ||||
higher drop preference to it during congestion. Although these | ||||
are blunt instruments, they at least represent a feasible scenario | ||||
for the future Internet where Not-ECT traffic co-exists with re- | ||||
ECN traffic, but as a severely hobbled under-class. We recommend | ||||
(Section 7.1) that while accommodating a smooth initial transition | ||||
to re-ECN, policing policies should gradually be tightened to rate | ||||
limit Not-ECT traffic more strictly in the longer term. | ||||
o When checking whether a flow is balancing positive markings with | ||||
congestion marking, re-ECN can only account for congestion | ||||
marking, not drops. So, whenever a sender experiences drop, it | ||||
does not have to re-echo the congestion event. Nonetheless, it is | ||||
hardly any advantage to be able to send faster than other flows | ||||
only if your traffic is dropped and the other traffic isn't. | ||||
o We are considering the issue of whether it would be useful to | ||||
truncate rather than drop packets that appear to be malicious, so | ||||
that the feedback loop is not broken but useful data can be | ||||
removed. | ||||
7. Incremental Deployment | 7. Incremental Deployment | |||
7.1. Incremental Deployment Features | ||||
The design of the re-ECN protocol started from the fact that the | The design of the re-ECN protocol started from the fact that the | |||
current ECN marking behaviour of queues was sufficient and that re- | current ECN marking behaviour of queues was sufficient and that re- | |||
feedback could be introduced around these queues by changing the | feedback could be introduced around these queues by changing the | |||
sender behaviour but not the routers. Otherwise, if we had required | sender behaviour but not the routers. Otherwise, if we had required | |||
routers to be changed, the chance of encountering a path that had | routers to be changed, the chance of encountering a path that had | |||
every router upgraded would be vanishingly small during early | every router upgraded would be vanishingly small during early | |||
deployment, giving no incentive to start deployment. Also, as there | deployment, giving no incentive to start deployment. Also, as there | |||
is no new forwarding behaviour, routers and hosts do not have to | is no new forwarding behaviour, routers and hosts do not have to | |||
signal or negotiate anything. | signal or negotiate anything. | |||
skipping to change at page 57, line 6 | skipping to change at page 34, line 18 | |||
sources will gain by upgrading to re-ECN. Thus, towards the end of | sources will gain by upgrading to re-ECN. Thus, towards the end of | |||
the voluntary incremental deployment period, RFC3168 compliant | the voluntary incremental deployment period, RFC3168 compliant | |||
transports can be given progressively stronger encouragement to | transports can be given progressively stronger encouragement to | |||
upgrade. | upgrade. | |||
The following list of minor changes brings together all the points | The following list of minor changes brings together all the points | |||
where re-ECN semantics for use of the two-bit ECN field are different | where re-ECN semantics for use of the two-bit ECN field are different | |||
compared to RFC3168: | compared to RFC3168: | |||
o A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender | o A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender | |||
sets ECT(0) by default (Section 3.4); | sets ECT(0) by default (Section 4.3); | |||
o No provision is necessary for a re-ECN capable source transport to | o No provision is necessary for a re-ECN capable source transport to | |||
use the ECN nonce (Section 4.1.2.1); | use the ECN nonce (Section 6.1.2.1); | |||
o Routers MAY preferentially drop different extended ECN codepoints | o Routers MAY preferentially drop different extended ECN codepoints | |||
(Section 5.3); | (Section 5.3); | |||
o Packets carrying the feedback not established (FNE) codepoint MAY | o Packets carrying the feedback not established (FNE) codepoint MAY | |||
optionally be marked rather than dropped by routers, even though | optionally be marked rather than dropped by routers, even though | |||
their ECN field is Not-ECT (with the important caveat in | their ECN field is Not-ECT (with the important caveat in | |||
Section 5.3); | Section 5.3); | |||
o Packets may be dropped by policing nodes because of apparent | o Packets may be dropped by policing nodes because of apparent | |||
misbehaviour, not just because of congestion (Section 6); | misbehaviour, not just because of congestion ; | |||
o Tunnel entry behaviour is still to be defined, but may have to be | o Tunnel entry behaviour is still to be defined, but may have to be | |||
different from RFC3168 (Section 5.6). | different from RFC3168 (Section 5.6). | |||
None of these changes REQUIRE any modifications to routers. Also | None of these changes REQUIRE any modifications to routers. Also | |||
none of these changes affect anything about end to end congestion | none of these changes affect anything about end to end congestion | |||
control; they are all to do with allowing networks to police that end | control; they are all to do with allowing networks to police that end | |||
to end congestion control is well-behaved. | to end congestion control is well-behaved. | |||
7.2. Incremental Deployment Incentives | 8. Related Work | |||
It would only be worth standardising the re-ECN protocol if there | ||||
existed a coherent story for how it might be incrementally deployed. | ||||
In order for it to have a chance of deployment, everyone who needs to | ||||
act must have a strong incentive to act, and the incentives must | ||||
arise in the order that deployment would have to happen. Re-ECN | ||||
works around unmodified ECN routers, but we can't just discuss why | ||||
and how re-ECN deployment might build on ECN deployment, because | ||||
there is precious little to build on in the first place. Instead, we | ||||
aim to show that re-ECN deployment could carry ECN with it. We focus | ||||
on commercial deployment incentives, although some of the arguments | ||||
apply equally to academic or government sectors. | ||||
ECN deployment: | ||||
ECN is largely implemented in commercial routers, but generally | ||||
not as a supported feature, and it has largely not been deployed | ||||
by commercial network operators. It has been released in many | ||||
Unix-based operating systems, but not in proprietary OSs like | ||||
Windows or those in many mobile devices. For detailed deployment | ||||
status, see [ECN-Deploy]. We believe the reason ECN deployment | ||||
has not happened is twofold: | ||||
* ECN requires changes to both routers and hosts. If someone | ||||
wanted to sell the improvement that ECN offers, they would have | ||||
to co-ordinate deployment of their product with others. An ECN | ||||
server only gives any improvement on an ECN network. An ECN | ||||
network only gives any improvement if used by ECN devices. | ||||
Deployment that requires co-ordination adds cost and delay and | ||||
tends to dilute any competitive advantage that might be gained. | ||||
* ECN `only' gives a performance improvement. Making a product a | ||||
bit faster (whether the product is a device or a network), | ||||
isn't usually a sufficient selling point to be worth the cost | ||||
of co-ordinating across the industry to deploy it. Network | ||||
operators tend to avoid re-configuring a working network unless | ||||
launching a new product. | ||||
ECN and Re-ECN for Edge-to-edge Assured QoS: | ||||
We believe the proposal to provide assured QoS sessions using a | ||||
form of ECN called pre-congestion notification (PCN) [PCN-arch] is | ||||
most likely to break the deadlock in ECN deployment first. It | ||||
only requires edge-to-edge deployment so it does not require | ||||
endpoint support. It can be deployed in a single network, then | ||||
grow incrementally to interconnected networks. And it provides a | ||||
different `product' (internetworked assured QoS), rather than | ||||
merely making an existing product a bit faster. | ||||
Not only could this assured QoS application kick-start ECN | ||||
deployment, it could also carry re-ECN deployment with it; because | ||||
re-ECN can enable the assured QoS region to expand to a large | ||||
internetwork where neighbouring networks do not trust each other. | ||||
[Re-PCN] argues that re-ECN security should be built in to the QoS | ||||
system from the start, explaining why and how. | ||||
If ECN and re-ECN were deployed edge-to-edge for assured QoS, | ||||
operators would gain valuable experience. They would also clear | ||||
away many technical obstacles such as firewall configurations that | ||||
block all but the RFC3168 settings of the ECN field and the RE | ||||
flag. | ||||
ECN in Access Networks: | ||||
The next obstacle to ECN deployment would be extension to access | ||||
and backhaul networks, where considerable link layer differences | ||||
make implementation non-trivial, particularly on congested | ||||
wireless links. ECN and re-ECN work fine during partial | ||||
deployment, but they will not be very useful if the most congested | ||||
elements in networks are the last to support them. Access network | ||||
support is one of the weakest parts of this deployment story. All | ||||
we can hope is that, once the benefits of ECN are better | ||||
understood by operators, they will push for the necessary link | ||||
layer implementations as deployment proceeds. | ||||
Policing Unresponsive Flows: | ||||
Re-ECN allows a network to offer differentiated quality of service | ||||
as explained in Section 6.2.2. But we do not believe this will | ||||
motivate initial deployment of re-ECN, because the industry is | ||||
already set on alternative ways of doing QoS. Despite being much | ||||
more complicated and expensive, the alternative approaches are | ||||
here and now. | ||||
But re-ECN is critical to QoS deployment in another respect. It | ||||
can be used to prevent applications from taking whatever bandwidth | ||||
they choose without asking. | ||||
Currently, applications that remain resolute in their lack of | ||||
response to congestion are rewarded by other TCP applications. In | ||||
other words, TCP is naively friendly, in that it reduces its rate | ||||
in response to congestion whether it is competing with friends | ||||
(other TCPs) or with enemies (unresponsive applications). | ||||
Therefore, those network owners that want to sell QoS will be keen | ||||
to ensure that their users can't help themselves to QoS for free. | ||||
Given the very large revenues at stake, we believe effective | ||||
policing of congestion response will become highly sought after by | ||||
network owners. | ||||
But this does not necessarily argue for re-ECN deployment. | ||||
Network owners might choose to deploy bottleneck policers rather | ||||
than re-ECN-based policing. However, under Related Work | ||||
(Section 9) we argue that bottleneck policers are inherently | ||||
vulnerable to circumvention. | ||||
Therefore we believe there will be a strong demand from network | ||||
owners for re-ECN deployment so they can police flows that do not | ||||
ask to be unresponsive to congestion, in order to protect their | ||||
revenues from flows that do ask (QoS). In particular, we suspect | ||||
that the operators of cellular networks will want to prevent VoIP | ||||
and video applications being used freely on their networks as a | ||||
more open market develops in GPRS and 3G devices. | ||||
Initial deployments are likely to be isolated to single cellular | ||||
networks. Cellular operators would first place requirements on | ||||
device manufacturers to include re-ECN in the standards for mobile | ||||
devices. In parallel, they would put out tenders for ingress and | ||||
egress policers. Then, after a while they would start to tighten | ||||
rate limits on Not-ECT traffic from non-standard devices and they | ||||
would start policing whatever non-accredited applications people | ||||
might install on mobile devices with re-ECN support in the | ||||
operating system. This would force even independent mobile device | ||||
manufacturers to provide re-ECN support. Early standardisation | ||||
across the cellular operators is likely, including interconnection | ||||
agreements with penalties for excess downstream congestion. | ||||
We suspect some fixed broadband networks (whether cable or DSL) | ||||
would follow a similar path. However, we also believe that larger | ||||
parts of the fixed Internet would not choose to police on a per- | ||||
flow basis. Some might choose to police congestion on a per-user | ||||
basis in order to manage heavy peer-to-peer file-sharing, but it | ||||
seems likely that a sizeable majority would not deploy any form of | ||||
policing. | ||||
This hybrid situation raises the question, "How does re-ECN work for | ||||
networks that choose to use policing if they connect with others | ||||
that don't?" Traffic from non-ECN capable sources will arrive | ||||
from other networks and cause congestion within the policed, ECN- | ||||
capable networks. So networks that chose to police congestion | ||||
would rate-limit Not-ECT traffic throughout their network, | ||||
particularly at their borders. They would probably also set | ||||
higher usage prices in their interconnection contracts for | ||||
incoming Not-ECT and Not-RECT traffic. We assume that | ||||
interconnection contracts between networks in the same tier will | ||||
include congestion penalties before contracts with provider | ||||
backbones do. | ||||
A hybrid situation could remain for all time. As was explained in | ||||
the introduction, we believe in healthy competition between | ||||
policing and not policing, with no imperative to convert the whole | ||||
world to the religion of policing. Networks that chose not to | ||||
deploy egress droppers would leave themselves open to being | ||||
congested by senders in other networks. But that would be their | ||||
choice. | ||||
The important aspect of the egress dropper though is that it most | ||||
protects the network that deploys it. If a network does not | ||||
deploy an egress dropper, sources sending into it from other | ||||
networks will be able to understate the congestion they are | ||||
causing. Whereas, if a network deploys an egress dropper, it can | ||||
know how much congestion other networks are dumping into it, and | ||||
apply penalties or charges accordingly. So, whether or not a | ||||
network polices its own sources at ingress, it is in its interests | ||||
to deploy an egress dropper. | ||||
Host support: | ||||
In the above deployment scenario, host operating system support | ||||
for re-ECN came about through the cellular operators demanding it | ||||
in device standards (i.e. 3GPP). Of course, increasingly, mobile | ||||
devices are being built to support multiple wireless technologies. | ||||
So, if re-ECN were stipulated for cellular devices, it would | ||||
automatically appear in those devices connected to the wireless | ||||
fringes of fixed networks if they coupled cellular with WiFi or | ||||
Bluetooth technology, for instance. Also, once implemented in the | ||||
operating system of one mobile device, it would tend to be found | ||||
in other devices using the same family of operating system. | ||||
Therefore, whether or not a fixed network deployed ECN, or | ||||
deployed re-ECN policers and droppers, many of its hosts might | ||||
well be using re-ECN over it. Indeed, they would be at an | ||||
advantage when communicating with hosts across re-ECN policed | ||||
networks that rate limited Not-RECT traffic. | ||||
Other possible scenarios: | ||||
The above is thankfully not the only plausible scenario we can | ||||
think of. One of the many clubs of operators that meet regularly | ||||
around the world might decide to act together to persuade a major | ||||
operating system manufacturer to implement re-ECN. And they may | ||||
agree between them on an interconnection model that includes | ||||
congestion penalties. | ||||
Re-ECN provides an interesting opportunity for device | ||||
manufacturers as well as network operators. Policers can be | ||||
configured loosely when first deployed. Then as re-ECN take-up | ||||
increases, they can be tightened up, so that a network with re-ECN | ||||
deployed can gradually squeeze down the service provided to | ||||
RFC3168 compliant devices that have not upgraded to re-ECN. Many | ||||
device vendors rely on replacement sales. And operating system | ||||
companies rely heavily on new release sales. Also support | ||||
services would like to be able to force stragglers to upgrade. | ||||
So, the ability to throttle service to RFC3168 compliant operating | ||||
systems is quite valuable. | ||||
Also, policing unresponsive sources may not be the only or even | ||||
the first application that drives deployment. It may be policing | ||||
causes of heavy congestion (e.g. peer-to-peer file-sharing). Or | ||||
it may be mitigation of denial of service. Or we may be wrong in | ||||
thinking simpler QoS will not be the initial motivation for re-ECN | ||||
deployment. Indeed, the combined pressure for all these may be | ||||
the motivator, but it seems optimistic to expect such a level of | ||||
joined-up thinking from today's communications industry. We | ||||
believe a single application alone must be a sufficient motivator. | ||||
In short, everyone gains from adding accountability to TCP/IP, | ||||
except the selfish or malicious. So, deployment incentives tend | ||||
to be strong. | ||||
8. Architectural Rationale | ||||
In the Internet's technical community, the danger of not responding | ||||
to congestion is well-understood, as well as its attendant risk of | ||||
congestion collapse [RFC3714]. However, one side of the Internet's | ||||
commercial community considers that the very essence of IP is to | ||||
provide open access to the internetwork for all applications. They | ||||
see congestion as a symptom of over-conservative investment, and rely | ||||
on revising application designs to find novel ways to keep | ||||
applications working despite congestion. They argue that the | ||||
Internet was never intended to be solely for TCP-friendly | ||||
applications. Meanwhile, another side of the Internet's commercial | ||||
community believes that it is worthwhile providing a network for | ||||
novel applications only if it has sufficient capacity, which can | ||||
happen only if a greater share of application revenues can be | ||||
/assured/ for the infrastructure provider. Otherwise the major | ||||
investments required would carry too much risk and wouldn't happen. | ||||
The lesson articulated in [Tussle] is that we shouldn't embed our | ||||
view on these arguments into the Internet at design time. Instead we | ||||
should design the Internet so that the outcome of these arguments can | ||||
get decided at run-time. Re-ECN is designed in that spirit. Once | ||||
the protocol is available, different network operators can choose how | ||||
liberal they want to be in holding people accountable for the | ||||
congestion they cause. Some might boldly invest in capacity and not | ||||
police its use at all, hoping that novel applications will result. | ||||
Others might use re-ECN for fine-grained flow policing, expecting to | ||||
make money selling vertically integrated services. Yet others might | ||||
sit somewhere half-way, perhaps doing coarse, per-user policing. All | ||||
might change their minds later. But re-ECN always allows them to | ||||
interconnect so that the careful ones can protect themselves from the | ||||
liberal ones. | ||||
The incentive-based approach used for re-ECN is based on Gibbens and | ||||
Kelly's arguments [Evol_cc] on allowing endpoints the freedom to | ||||
evolve new congestion control algorithms for new applications. They | ||||
ensured responsible behaviour despite everyone's self-interest by | ||||
applying pricing to ECN marking, and Kelly had proved stability and | ||||
optimality in an earlier paper. | ||||
Re-ECN keeps all the underlying economic incentives, but rearranges | ||||
the feedback. The idea is to allow a network operator (if it | ||||
chooses) to deploy engineering mechanisms like policers at the front | ||||
of the network which can be designed to behave /as if/ they are | ||||
responding to congestion prices. Rather than having to subject users | ||||
to congestion pricing, networks can then use more traditional | ||||
charging regimes (or novel ones). But the engineering can constrain | ||||
the overall amount of congestion a user can cause. This provides a | ||||
buffer against completely outrageous congestion control, but still | ||||
makes it easy for novel applications to evolve if they need congestion | ||||
control that differs from the norm. It also allows novel charging | ||||
regimes to evolve. | ||||
Despite being achieved with a relatively minor protocol change, re- | ||||
ECN is an architectural change. Previously, Internet congestion | ||||
could only be controlled by the data sender, because it was the only | ||||
one both in a position to control the load and in a position to see | ||||
information on congestion. Re-ECN levels the playing field. It | ||||
recognises that the network also has a role to play in moderating | ||||
(policing) congestion control. But policing is only truly effective | ||||
at the first ingress into an internetwork, whereas path congestion | ||||
was previously only visible at the last egress. So, re-ECN | ||||
democratises congestion information. Then the choice over who | ||||
actually controls congestion can be made at run-time, not design | ||||
time---a bit like an aircraft with dual controls. And different | ||||
operators can make different choices. We believe non-architectural | ||||
approaches to this problem are unlikely to offer more than partial | ||||
solutions (see Section 9). | ||||
Importantly, re-ECN does not require assumptions about specific | ||||
congestion responses to be embedded in any network elements, except | ||||
at the first ingress to the internetwork if that level of control is | ||||
desired by the ingress operator. But such tight policing will be a | ||||
matter of agreement between the source and its access network | ||||
operator. The ingress operator need not police congestion response | ||||
at flow granularity; it can simply hold a source responsible for the | ||||
aggregate congestion it causes, perhaps keeping it within a monthly | ||||
congestion quota. Or if the ingress network trusts the source, it | ||||
can do nothing. | ||||
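As an illustration of the aggregate policing option described above, the following is a minimal sketch, not part of the protocol specification. It assumes a hypothetical per-source allowance measured in bytes and assumes that only packets the source declares as positive (worth > 0) count against it. The class name, field names and byte-based accounting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CongestionQuota:
    """Hypothetical per-source allowance of congestion volume, in bytes."""

    allowance_bytes: int
    used_bytes: int = 0

    def charge(self, packet_bytes: int, worth: int) -> bool:
        """Account for one packet sent by the source.

        Only packets declared positive (worth > 0) consume the quota
        (an assumption of this sketch). Returns True while the source
        is still within its allowance.
        """
        if worth > 0:
            self.used_bytes += packet_bytes * worth
        return self.used_bytes <= self.allowance_bytes


# Usage: a source with a 3000-byte congestion allowance.
quota = CongestionQuota(allowance_bytes=3000)
within = quota.charge(packet_bytes=1500, worth=0)  # neutral packet, no charge
```

Note that the policer in this sketch holds no per-flow state; it tracks only the aggregate for the source, which is the point of the coarse-grained option.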
Therefore, the aim of the re-ECN protocol is NOT solely to police | ||||
TCP-friendliness. Re-ECN preserves IP as a generic network layer for | ||||
all sorts of responses to congestion, for all sorts of transports. | ||||
Re-ECN merely ensures truthful downstream congestion information is | ||||
available in the network layer for all sorts of accountability | ||||
applications. | ||||
The end to end design principle does not say that all functions | ||||
should be moved out of the lower layers---only those functions that | ||||
are not generic to all higher layers. Re-ECN adds a function to the | ||||
network layer that is generic, but was omitted: accountability for | ||||
causing congestion. Accountability is not something that an end-user | ||||
can provide to themselves. We believe re-ECN adds no more than is | ||||
sufficient to hold each flow accountable, even if it consists of a | ||||
single datagram. | ||||
"Accountability" implies being able to identify who is responsible | ||||
for causing congestion. However, at the network layer it would NOT | ||||
be useful to identify the cause of congestion by adding individual or | ||||
organisational identity information, NOR by using source IP | ||||
addresses. Rather than bringing identity information to the point of | ||||
congestion, we bring downstream congestion information to the point | ||||
where the cause can be most easily identified and dealt with. That | ||||
is, at any trust boundary congestion can be associated with the | ||||
physically connected upstream neighbour that is directly responsible | ||||
for causing it (whether intentionally or not). A trust boundary | ||||
interface is exactly the place to police or throttle in order to | ||||
directly mitigate congestion, rather than having to trace the | ||||
(ir)responsible party in order to shut them down. | ||||
Some considered that ECN itself was a layering violation. The | ||||
reasoning went that the interface to a layer should provide a service | ||||
to the higher layer and hide how the lower layer does it. However, | ||||
ECN reveals the state of the network layer and below to the transport | ||||
layer. A more positive way to describe ECN is that it is like the | ||||
return value of a function call to the network layer. It explicitly | ||||
returns the status of the request to deliver a packet, by returning a | ||||
value representing the current risk that a packet will not be served. | ||||
Re-ECN has similar semantics, except the transport layer must try to | ||||
guess the return value, then it can use the actual return value from | ||||
the network layer to modify the next guess. | ||||
The guiding principle behind all the discussion in Section 6.1.6 on | ||||
Policing is that any gain from subverting the protocol should be | ||||
precisely neutralised, rather than punished. If a gain is punished | ||||
to a greater extent than is sufficient to neutralise it, it will most | ||||
likely open up a new vulnerability, where the amplifying effect of | ||||
the punishment mechanism can be turned on others. | ||||
For instance, if possible, flows should be removed as soon as they go | ||||
negative, but we do NOT RECOMMEND any attempts to discard such flows | ||||
further upstream while they are still positive. Such over-zealous | ||||
push-back is unnecessary and potentially dangerous. These flows have | ||||
paid their `fare' up to the point they go negative, so there is no | ||||
harm in delivering them that far. If someone downstream asks for a | ||||
flow to be dropped as near to the source as possible, because they | ||||
say it is going to become negative later, an upstream node cannot | ||||
test the truth of this assertion. Rather than have to authenticate | ||||
such messages, re-ECN has been designed so that flows can be dropped | ||||
solely based on locally measurable evidence. A message hinting that | ||||
a flow should be watched closely to test for negativity is fine, but | ||||
a message claiming that a positive flow will go negative later, and | ||||
so should be dropped, is not. | ||||
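The principle of dropping solely on locally measurable evidence can be sketched as follows. This is a deliberately simplified illustration, not the dropper design of this memo: it assumes each packet carries a worth of +1, 0 or -1, keeps an unbounded per-flow running balance, and ignores the state exhaustion and spoofing concerns raised under Security Considerations. The class and method names are hypothetical.

```python
from collections import defaultdict


class EgressDropper:
    """Toy dropper: forwards a flow's packets while its balance is not negative."""

    def __init__(self) -> None:
        # Running sum of packet worths seen locally, per flow identifier.
        self.balance = defaultdict(int)

    def forward(self, flow_id: str, worth: int) -> bool:
        """Add this packet's worth to the flow's balance.

        Returns True if the packet should be forwarded, or False if the
        flow has gone negative at this node. No upstream hint or claim
        about future behaviour is consulted.
        """
        self.balance[flow_id] += worth
        return self.balance[flow_id] >= 0


# Usage: a flow that has declared one unit of congestion and then
# receives one congestion mark is still at zero, so it is forwarded.
dropper = EgressDropper()
dropper.forward("flow-a", +1)
still_ok = dropper.forward("flow-a", -1)
```

The decision depends only on what this node has itself observed, so there is no message to authenticate and no way for a third party to have a positive flow discarded by assertion alone.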
9. Related Work | ||||
{Due to lack of time, this section is incomplete. The reader is | ||||
referred to the Related Work section of [Re-fb] for a brief selection | ||||
of related ideas.} | ||||
9.1. Policing Rate Response to Congestion | ||||
ATM network elements send congestion back-pressure | ||||
messages [ITU-T.I.371] along each connection, duplicating any end to | ||||
end feedback because they don't trust it. On the other hand, re-ECN | ||||
ensures information in forwarded packets can be used for congestion | ||||
management without requiring a connection-oriented architecture and | ||||
re-using the overhead of fields that are already set aside for end to | ||||
end congestion control (and routing loop detection in the case of re- | ||||
TTL in Appendix F). | ||||
We borrowed ideas from policers in the literature [pBox], [XCHOKe], | ||||
AFD etc. for our rate equation policer. However, without the benefit | ||||
of re-ECN they don't police the correct rate for the condition of | ||||
their path. They detect unusually high /absolute/ rates, but only | ||||
while the policer itself is congested, because they work by detecting | ||||
prevalent flows in the discards from the local RED queue. These | ||||
policers must sit at every potential bottleneck, whereas our policer | ||||
need only be located at each ingress to the internetwork. As Floyd & | ||||
Fall explain [pBox], the limitation of their approach is that a high | ||||
sending rate might be perfectly legitimate, if the rest of the path | ||||
is uncongested or the round trip time is short. Commercially | ||||
available rate policers cap the rate of any one flow. Or they | ||||
enforce monthly volume caps in an attempt to control high volume | ||||
file-sharing. They limit the value a customer derives. They might | ||||
also limit the congestion customers can cause, but only as an | ||||
accidental side-effect. They actually punish traffic that fills | ||||
troughs as much as traffic that causes peaks in utilisation. In | ||||
practice network operators need to be able to allocate service by | ||||
cost during congestion, and by value at other times. | ||||
9.2. Congestion Notification Integrity | 8.1. Congestion Notification Integrity | |||
The choice of two ECT code-points in the ECN field [RFC3168] | The choice of two ECT code-points in the ECN field [RFC3168] | |||
permitted future flexibility, optionally allowing the sender to | permitted future flexibility, optionally allowing the sender to | |||
encode the experimental ECN nonce [RFC3540] in the packet stream. | encode the experimental ECN nonce [RFC3540] in the packet stream. | |||
This mechanism has since been included in the specifications of DCCP | This mechanism has since been included in the specifications of DCCP | |||
[RFC4340]. | [RFC4340]. | |||
The ECN nonce is an elegant scheme that allows the sender to detect | The ECN nonce is an elegant scheme that allows the sender to detect | |||
if someone in the feedback loop - the receiver especially - tries to | if someone in the feedback loop - the receiver especially - tries to | |||
claim no congestion was experienced when in fact congestion led to | claim no congestion was experienced when in fact congestion led to | |||
skipping to change at page 67, line 5 | skipping to change at page 35, line 44 | |||
can police their upstream neighbours, to encourage them to police | can police their upstream neighbours, to encourage them to police | |||
their users in turn. But most importantly, it requires the sender to | their users in turn. But most importantly, it requires the sender to | |||
declare path congestion to the network and it can remove traffic at | declare path congestion to the network and it can remove traffic at | |||
the egress if this declaration is dishonest. So it can police | the egress if this declaration is dishonest. So it can police | |||
correctly, irrespective of whether the receiver tries to suppress | correctly, irrespective of whether the receiver tries to suppress | |||
congestion feedback or whether the sender ignores genuine congestion | congestion feedback or whether the sender ignores genuine congestion | |||
feedback. Therefore the re-ECN protocol addresses a much wider range | feedback. Therefore the re-ECN protocol addresses a much wider range | |||
of cheating problems, which includes the one addressed by the ECN | of cheating problems, which includes the one addressed by the ECN | |||
nonce. | nonce. | |||
9.3. Identifying Upstream and Downstream Congestion | 9. Security Considerations | |||
Purple [Purple] proposes that queues should use the CWR flag in the | ||||
TCP header of ECN-capable flows to work out path congestion and | ||||
therefore downstream congestion in a similar way to re-ECN. However, | ||||
because CWR is in the transport layer, it is not always visible to | ||||
network layer routers and policers. Purple's motivation was to | ||||
improve AQM, not policing. But, of course, nodes trying to avoid a | ||||
policer would not be expected to allow CWR to be visible. | ||||
10. Security Considerations | ||||
This whole memo concerns the deployment of a secure congestion | This whole memo concerns the deployment of a secure congestion | |||
control framework. However, below we list some specific security | control framework. However, below we list some specific security | |||
issues that we are still working on: | issues that we are still working on: | |||
o Malicious users have the ability to launch dynamically changing | o Malicious users have the ability to launch dynamically changing | |||
attacks, exploiting the time it takes to detect an attack, given | attacks, exploiting the time it takes to detect an attack, given | |||
ECN marking is binary. We are concentrating on subtle | ECN marking is binary. We are concentrating on subtle | |||
interactions between the ingress policer and the egress dropper in | interactions between the ingress policer and the egress dropper in | |||
an effort to make it impossible to game the system. | an effort to make it impossible to game the system. | |||
skipping to change at page 67, line 37 | skipping to change at page 36, line 17 | |||
o There is an inherent need for at least some flow state at the | o There is an inherent need for at least some flow state at the | |||
egress dropper given the binary marking environment, which leads | egress dropper given the binary marking environment, which leads | |||
to an apparent vulnerability to state exhaustion attacks. An | to an apparent vulnerability to state exhaustion attacks. An | |||
egress dropper design with bounded flow state is in write-up. | egress dropper design with bounded flow state is in write-up. | |||
o A malicious source can spoof another user's address and send | o A malicious source can spoof another user's address and send | |||
negative traffic to the same destination in order to fool the | negative traffic to the same destination in order to fool the | |||
dropper into sanctioning the other user's flow. To prevent or | dropper into sanctioning the other user's flow. To prevent or | |||
mitigate these two different kinds of DoS attack, against the | mitigate these two different kinds of DoS attack, against the | |||
dropper and against given flows, we are considering various | dropper and against given flows, we are considering various | |||
protection mechanisms. Section 5.5.1 discusses one of these. | protection mechanisms. | |||
o A malicious client can send requests using a spoofed source | o A malicious client can send requests using a spoofed source | |||
address to a server (such as a DNS server) that tends to respond | address to a server (such as a DNS server) that tends to respond | |||
with single packet responses. This server will then be tricked | with single packet responses. This server will then be tricked | |||
into having to set FNE on the first (and only) packet of all these | into having to set FNE on the first (and only) packet of all these | |||
wasted responses. Given packets marked FNE are worth +1, this | wasted responses. Given packets marked FNE are worth +1, this | |||
will cause such servers to consume more of their allowance to | will cause such servers to consume more of their allowance to | |||
cause congestion than they would wish to. In general, re-ECN is | cause congestion than they would wish to. In general, re-ECN is | |||
deliberately designed so that single packet flows have to bear the | deliberately designed so that single packet flows have to bear the | |||
cost of not discovering the congestion state of their path. One | cost of not discovering the congestion state of their path. One | |||
skipping to change at page 68, line 46 | skipping to change at page 37, line 26 | |||
was defined). But it would be sufficient for a pair of endpoints to | was defined). But it would be sufficient for a pair of endpoints to | |||
make random checks on whether the RE flag was the same when it | make random checks on whether the RE flag was the same when it | |||
reached the egress as when it left the ingress. Indeed, if IPsec AH | reached the egress as when it left the ingress. Indeed, if IPsec AH | |||
had covered the RE flag, any network intending to alter sufficient RE | had covered the RE flag, any network intending to alter sufficient RE | |||
flags to make a gain would have focused its alterations on packets | flags to make a gain would have focused its alterations on packets | |||
without authenticating headers (AHs). | without authenticating headers (AHs). | |||
The security of re-ECN has been deliberately designed to not rely on | The security of re-ECN has been deliberately designed to not rely on | |||
cryptography. | cryptography. | |||
11. IANA Considerations | 10. IANA Considerations | |||
This memo includes no request to IANA (yet). | This memo includes no request to IANA (yet). | |||
If this memo were to progress to standards track, it would list: | If this memo were to progress to standards track, it would list: | |||
o The new RE flag in IPv4 (Section 5.1) and its extension with the | o The new RE flag in IPv4 (Section 5.1) and its extension with the | |||
ECN field to create a new set of extended ECN (EECN) codepoints; | ECN field to create a new set of extended ECN (EECN) codepoints; | |||
o The definition of the EECN codepoints for default Diffserv PHBs | o The definition of the EECN codepoints for default Diffserv PHBs | |||
(Section 3.3) | (Section 4.2) | |||
o The new extension header for IPv6 (Section 5.2); | o The new extension header for IPv6 (Section 5.2); | |||
o The new combinations of flags in the TCP header for capability | o The new combinations of flags in the TCP header for capability | |||
negotiation (Section 4.1.3); | negotiation (Section 6.1.3); | |||
o The new ICMP message type (Section 5.5.1). | ||||
12. Conclusions | 11. Conclusions | |||
{ToDo:} | {ToDo:} | |||
13. Acknowledgements | 12. Acknowledgements | |||
Sebastien Cazalet and Andrea Soppera contributed to the idea of re- | Sebastien Cazalet and Andrea Soppera contributed to the idea of re- | |||
feedback. All the following have given helpful comments: Andrea | feedback. All the following have given helpful comments: Andrea | |||
Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, | Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, | |||
Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, | Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, | |||
John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru | John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru | |||
Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd | Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd | |||
(ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark | (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark | |||
Handley (who developed the attack with canceled packets), Adam | Handley (who developed the attack with canceled packets), Adam | |||
Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft | Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft | |||
(Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who | (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who | |||
complemented our own dummy traffic attacks with others), Liz Maida | complemented our own dummy traffic attacks with others), Liz Maida | |||
(MIT), and comments from participants in the CRN/CFP Broadband and | (MIT), and comments from participants in the CRN/CFP Broadband and | |||
DoS-resistant Internet working groups. A special thank you to | DoS-resistant Internet working groups. A special thank you to | |||
Alessandro Salvatori for coming up with fiendish attacks on re-ECN. | Alessandro Salvatori for coming up with fiendish attacks on re-ECN. | |||
14. Comments Solicited | 13. Comments Solicited | |||
Comments and questions are encouraged and very welcome. They can be | Comments and questions are encouraged and very welcome. They can be | |||
addressed to the IETF Transport Area working group's mailing list | addressed to the IETF Transport Area working group's mailing list | |||
<tsvwg@ietf.org>, and/or to the authors. | <tsvwg@ietf.org>, and/or to the authors. | |||
15. References | 14. References | |||
15.1. Normative References | 14.1. Normative References | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | RFCs to Indicate Requirement Levels", | |||
BCP 14, RFC 2119, March 1997. | ||||
[RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion | [RFC2581] Allman, M., Paxson, V., and W. | |||
Control", RFC 2581, April 1999. | Stevens, "TCP Congestion Control", | |||
RFC 2581, April 1999. | ||||
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | [RFC3168] Ramakrishnan, K., Floyd, S., and D. | |||
of Explicit Congestion Notification (ECN) to IP", | Black, "The Addition of Explicit | |||
Congestion Notification (ECN) to IP", | ||||
RFC 3168, September 2001. | RFC 3168, September 2001. | |||
[RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's | [RFC3390] Allman, M., Floyd, S., and C. | |||
Initial Window", RFC 3390, October 2002. | Partridge, "Increasing TCP's Initial | |||
Window", RFC 3390, October 2002. | ||||
[RFC4340] Kohler, E., Handley, M., and S. Floyd, "Datagram | [RFC4340] Kohler, E., Handley, M., and S. | |||
Congestion Control Protocol (DCCP)", RFC 4340, March 2006. | Floyd, "Datagram Congestion Control | |||
Protocol (DCCP)", RFC 4340, | ||||
March 2006. | ||||
[RFC4341] Floyd, S. and E. Kohler, "Profile for Datagram Congestion | [RFC4341] Floyd, S. and E. Kohler, "Profile for | |||
Control Protocol (DCCP) Congestion Control ID 2: TCP-like | Datagram Congestion Control Protocol | |||
Congestion Control", RFC 4341, March 2006. | (DCCP) Congestion Control ID 2: TCP- | |||
like Congestion Control", RFC 4341, | ||||
March 2006. | ||||
[RFC4342] Floyd, S., Kohler, E., and J. Padhye, "Profile for | [RFC4342] Floyd, S., Kohler, E., and J. Padhye, | |||
Datagram Congestion Control Protocol (DCCP) Congestion | "Profile for Datagram Congestion | |||
Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342, | Control Protocol (DCCP) Congestion | |||
Control ID 3: TCP-Friendly Rate | ||||
Control (TFRC)", RFC 4342, | ||||
March 2006. | March 2006. | |||
[RFC4960] Stewart, R., "Stream Control Transmission Protocol", | [RFC4960] Stewart, R., "Stream Control | |||
RFC 4960, September 2007. | Transmission Protocol", RFC 4960, | |||
September 2007. | ||||
15.2. Informative References | 14.2. Informative References | |||
[ARI05] Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the | [ARI05] Adams, J., Roberts, L., and A. | |||
Internet to Support Real-Time Content Supply from a Large | IJsselmuiden, "Changing the Internet | |||
Fraction of Broadband Residential Users", BT Technology | to Support Real-Time Content Supply | |||
from a Large Fraction of Broadband | ||||
Residential Users", BT Technology | ||||
Journal (BTTJ) 23(2), April 2005. | Journal (BTTJ) 23(2), April 2005. | |||
[Bauer06] Bauer, S., Faratin, P., and R. Beverly, "Assessing the | [ECN-tunnel] Briscoe, B., "Layered Encapsulation | |||
assumptions underlying mechanism design for the Internet", | of Congestion Notification", | |||
Proc. Workshop on the Economics of Networked Systems | draft-briscoe-tsvwg-ecn-tunnel-00 | |||
(NetEcon06) , June 2006, <http://www.cs.duke.edu/nicl/ | (work in progress), June 2007. | |||
netecon06/papers/ne06-assessing.pdf>. | ||||
[CLoop_pol] | ||||
Salvatori, A., "Closed Loop Traffic Policing", Politecnico | ||||
Torino and Institut Eurecom Masters Thesis , | ||||
September 2005. | ||||
[ECN-Deploy] | ||||
Floyd, S., "ECN (Explicit Congestion Notification) in | ||||
TCP/IP; Implementation and Deployment of ECN", Web-page , | ||||
May 2004, | ||||
<http://www.icir.org/floyd/ecn.html#implementations>. | ||||
[ECN-tunnel] | ||||
Briscoe, B., "Layered Encapsulation of Congestion | ||||
Notification", draft-briscoe-tsvwg-ecn-tunnel-00 (work in | ||||
progress), June 2007. | ||||
[Evol_cc] Gibbens, R. and F. Kelly, "Resource pricing and the | ||||
evolution of congestion control", Automatica 35(12)1969-- | ||||
1985, December 1999, | ||||
<http://www.statslab.cam.ac.uk/~frank/evol.html>. | ||||
[I-D.ietf-tcpm-ecnsyn] | ||||
Kuzmanovic, A., "Adding Explicit Congestion Notification | ||||
(ECN) Capability to TCP's SYN/ACK Packets", | ||||
draft-ietf-tcpm-ecnsyn-05 (work in progress), | ||||
February 2008. | ||||
[I-D.moncaster-tcpm-rcv-cheat] | ||||
Moncaster, T., "A TCP Test to Allow Senders to Identify | ||||
Receiver Non-Compliance", | ||||
draft-moncaster-tcpm-rcv-cheat-02 (work in progress), | ||||
November 2007. | ||||
[ITU-T.I.371] | ||||
ITU-T, "Traffic Control and Congestion Control in | ||||
{B-ISDN}", ITU-T Rec. I.371 (03/04), March 2004. | ||||
[Jiang02] Jiang, H. and D. Dovrolis, "The Macroscopic Behavior of | ||||
the TCP Congestion Avoidance Algorithm", ACM SIGCOMM | ||||
CCR 32(3)75-88, July 2002, | ||||
<http://doi.acm.org/10.1145/571697.571725>. | ||||
[Mathis97] | ||||
Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The | ||||
Macroscopic Behavior of the TCP Congestion Avoidance | ||||
Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997, | ||||
<http://doi.acm.org/10.1145/263932.264023>. | ||||
[PCN-arch] | [I-D.ietf-tcpm-ecnsyn] Kuzmanovic, A., "Adding Explicit | |||
Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R., | Congestion Notification (ECN) | |||
Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion | Capability to TCP's SYN/ACK | |||
Notification Architecture", draft-ietf-pcn-architecture-03 | Packets", draft-ietf-tcpm-ecnsyn-05 | |||
(work in progress), February 2008. | (work in progress), February 2008. | |||
[Purple] Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE: | [I-D.moncaster-tcpm-rcv-cheat] Moncaster, T., "A TCP Test to Allow | |||
Predictive Active Queue Management Utilizing Congestion | Senders to Identify Receiver Non- | |||
Information", Proc. Local Computer Networks (LCN 2003) , | Compliance", | |||
October 2003. | draft-moncaster-tcpm-rcv-cheat-02 | |||
(work in progress), November 2007. | ||||
[RFC2208] Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell, | [PCN-arch] Eardley, P., Babiarz, J., Chan, K., | |||
M., Romanow, A., Weinrib, A., and L. Zhang, "Resource | Charny, A., Geib, R., Karagiannis, | |||
ReSerVation Protocol (RSVP) Version 1 Applicability | G., Menth, M., and T. Tsou, "Pre- | |||
Statement Some Guidelines on Deployment", RFC 2208, | Congestion Notification | |||
September 1997. | Architecture", | |||
draft-ietf-pcn-architecture-03 (work | ||||
in progress), February 2008. | ||||
[RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, | [RFC2309] Braden, B., Clark, D., Crowcroft, J., | |||
S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., | Davie, B., Deering, S., Estrin, D., | |||
Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, | Floyd, S., Jacobson, V., Minshall, | |||
S., Wroclawski, J., and L. Zhang, "Recommendations on | G., Partridge, C., Peterson, L., | |||
Queue Management and Congestion Avoidance in the | Ramakrishnan, K., Shenker, S., | |||
Wroclawski, J., and L. Zhang, | ||||
"Recommendations on Queue Management | ||||
and Congestion Avoidance in the | ||||
Internet", RFC 2309, April 1998. | Internet", RFC 2309, April 1998. | |||
[RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., | [RFC2475] Blake, S., Black, D., Carlson, M., | |||
and W. Weiss, "An Architecture for Differentiated | Davies, E., Wang, Z., and W. Weiss, | |||
"An Architecture for Differentiated | ||||
Services", RFC 2475, December 1998. | Services", RFC 2475, December 1998. | |||
[RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | [RFC2988] Paxson, V. and M. Allman, "Computing | |||
Timer", RFC 2988, November 2000. | TCP's Retransmission Timer", | |||
RFC 2988, November 2000. | ||||
[RFC3124] Balakrishnan, H. and S. Seshan, "The Congestion Manager", | ||||
RFC 3124, June 2001. | ||||
[RFC3514] Bellovin, S., "The Security Flag in the IPv4 Header", | [RFC3124] Balakrishnan, H. and S. Seshan, "The | |||
RFC 3514, April 2003. | Congestion Manager", RFC 3124, | |||
June 2001. | ||||
[RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit | [RFC3514] Bellovin, S., "The Security Flag in | |||
Congestion Notification (ECN) Signaling with Nonces", | the IPv4 Header", RFC 3514, | |||
RFC 3540, June 2003. | April 2003. | |||
[RFC3714] Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion | [RFC3540] Spring, N., Wetherall, D., and D. | |||
Control for Voice Traffic in the Internet", RFC 3714, | Ely, "Robust Explicit Congestion | |||
March 2004. | Notification (ECN) Signaling with | |||
Nonces", RFC 3540, June 2003. | ||||
[RFC4301] Kent, S. and K. Seo, "Security Architecture for the | [RFC4301] Kent, S. and K. Seo, "Security | |||
Internet Protocol", RFC 4301, December 2005. | Architecture for the Internet | |||
Protocol", RFC 4301, December 2005. | ||||
[RFC4302] Kent, S., "IP Authentication Header", RFC 4302, | [RFC4302] Kent, S., "IP Authentication Header", | |||
December 2005. | RFC 4302, December 2005. | |||
[RFC4305] Eastlake, D., "Cryptographic Algorithm Implementation | [RFC4305] Eastlake, D., "Cryptographic | |||
Requirements for Encapsulating Security Payload (ESP) and | Algorithm Implementation Requirements | |||
Authentication Header (AH)", RFC 4305, December 2005. | for Encapsulating Security Payload | |||
(ESP) and Authentication Header | ||||
(AH)", RFC 4305, December 2005. | ||||
[RFC5129] Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion | [RFC5129] Davie, B., Briscoe, B., and J. Tay, | |||
Marking in MPLS", RFC 5129, January 2008. | "Explicit Congestion Marking in | |||
MPLS", RFC 5129, January 2008. | ||||
[Re-PCN] Briscoe, B., "Emulating Border Flow Policing using Re-ECN | [Re-PCN] Briscoe, B., "Emulating Border Flow | |||
on Bulk Data", draft-briscoe-re-pcn-border-cheat-01 (work | Policing using Re-ECN on Bulk Data", | |||
in progress), February 2008. | draft-briscoe-re-pcn-border-cheat-01 | |||
(work in progress), February 2008. | ||||
[Re-fb] Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., | [Re-fb] Briscoe, B., Jacquet, A., Di Cairano- | |||
Salvatori, A., Soppera, A., and M. Koyabe, "Policing | Gilfedder, C., Salvatori, A., | |||
Congestion Response in an Internetwork Using Re-Feedback", | Soppera, A., and M. Koyabe, "Policing | |||
ACM SIGCOMM CCR 35(4)277--288, August 2005, <http:// | Congestion Response in an | |||
www.acm.org/sigs/sigcomm/sigcomm2005/ | Internetwork Using Re-Feedback", ACM | |||
SIGCOMM CCR 35(4)277--288, | ||||
August 2005, <http://www.acm.org/ | ||||
sigs/sigcomm/sigcomm2005/ | ||||
techprog.html#session8>. | techprog.html#session8>. | |||
[Savage99] | [Savage99] Savage, S., Cardwell, N., Wetherall, | |||
Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | D., and T. Anderson, "TCP congestion | |||
"TCP congestion control with a misbehaving receiver", ACM | control with a misbehaving receiver", | |||
SIGCOMM CCR 29(5), October 1999, | ACM SIGCOMM CCR 29(5), October 1999, | |||
<http://citeseer.ist.psu.edu/savage99tcp.html>. | <http://citeseer.ist.psu.edu/ | |||
savage99tcp.html>. | ||||
[Smart_rtg] | ||||
Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang, | ||||
"Optimizing Cost and Performance for Multihoming", ACM | ||||
SIGCOMM CCR 34(4)79--92, October 2004, | ||||
<http://citeseer.ist.psu.edu/698472.html>. | ||||
[Steps_DoS] | ||||
Handley, M. and A. Greenhalgh, "Steps towards a DoS- | ||||
resistant Internet Architecture", Proc. ACM SIGCOMM | ||||
workshop on Future directions in network architecture | ||||
(FDNA'04) pp 49--56, August 2004. | ||||
[Tussle] Clark, D., Sollins, K., Wroclawski, J., and R. Braden, | ||||
"Tussle in Cyberspace: Defining Tomorrow's Internet", ACM | ||||
SIGCOMM CCR 32(4)347--356, October 2002, | ||||
<http://www.acm.org/sigcomm/sigcomm2002/papers/ | ||||
tussle.pdf>. | ||||
[XCHOKe] Chhabra, P., Chuig, S., Goel, A., John, A., Kumar, A., | [Steps_DoS] Handley, M. and A. Greenhalgh, "Steps | |||
Saran, H., and R. Shorey, "XCHOKe: Malicious Source | towards a DoS-resistant Internet | |||
Control for Congestion Avoidance at Internet Gateways", | Architecture", Proc. ACM SIGCOMM | |||
Proceedings of IEEE International Conference on Network | workshop on Future directions in | |||
Protocols (ICNP-02) , November 2002, | network architecture (FDNA'04) pp | |||
<http://www.cc.gatech.edu/~akumar/xchoke.pdf>. | 49--56, August 2004. | |||
[pBox] Floyd, S. and K. Fall, "Promoting the Use of End-to-End | [re-ecn-motive] Briscoe, B., "Re-ECN: The Motivation | |||
Congestion Control in the Internet", IEEE/ACM Transactions | for Adding Congestion Accountability | |||
on Networking 7(4) 458--472, August 1999, | to TCP/IP", draft-briscoe-tsvwg-re- | |||
<http://www.aciri.org/floyd/end2end-paper.html>. | ecn-tcp-motivation-00 (work in | |||
progress), February 2009. | ||||
Appendix A. Precise Re-ECN Protocol Operation | Appendix A. Precise Re-ECN Protocol Operation | |||
{ToDo: fix this} | {ToDo: fix this} | |||
The protocol operation in the middle described in Section 3.4 was an | The protocol operation in the middle described in Section 4.3 was an | |||
approximation. In fact, standard ECN router marking combines 1% and | approximation. In fact, standard ECN router marking combines 1% and | |||
2% marking into slightly less than 3% whole-path marking, because | 2% marking into slightly less than 3% whole-path marking, because | |||
routers deliberately mark CE whether or not it has already been | routers deliberately mark CE whether or not it has already been | |||
marked by another router upstream. So the combined marking fraction | marked by another router upstream. So the combined marking fraction | |||
would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%. | would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%. | |||
To generalise this we will need some notation. | To generalise this we will need some notation. | |||
o j represents the index of each resource (typically queues) along a | o j represents the index of each resource (typically queues) along a | |||
path, ranging from 0 at the first router to n-1 at the last. | path, ranging from 0 at the first router to n-1 at the last. | |||
skipping to change at page 75, line 12 | skipping to change at page 42, line 44 | |||
p_0 = u_n | p_0 = u_n | |||
= 1 - (1 - m_1)(1 - m_2)... | = 1 - (1 - m_1)(1 - m_2)... | |||
Similarly, at some point j in the middle of the network, if p = 1 - | Similarly, at some point j in the middle of the network, if p = 1 - | |||
(1 - u_j)(1 - v_j), then | (1 - u_j)(1 - v_j), then | |||
v_j = 1 - (1 - p)/(1 - u_j) | v_j = 1 - (1 - p)/(1 - u_j) | |||
~= p - u_j; if u_j << 100% | ~= p - u_j; if u_j << 100% | |||
So, between the two routers in the example in Section 3.4, congestion | So, between the two routers in the example in Section 4.3, congestion | |||
downstream is | downstream is | |||
v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%) | v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%) | |||
= 2.00%, | = 2.00%, | |||
or a useful approximation of downstream congestion is | or a useful approximation of downstream congestion is | |||
v_1 ~= 2.98% - 1.00% | v_1 ~= 2.98% - 1.00% | |||
~= 1.98%. | ~= 1.98%. | |||
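The arithmetic of this appendix can be checked with a minimal sketch. The function names below are illustrative only and do not appear in the draft; the inputs are the 1% and 2% marking fractions of the two-router example.

```python
from math import prod

def combined_marking(marking_fractions):
    """Whole-path marking fraction: 1 - (1 - m_1)(1 - m_2)..."""
    return 1.0 - prod(1.0 - m for m in marking_fractions)

def downstream_congestion(p, u_j):
    """Exact downstream congestion: v_j = 1 - (1 - p)/(1 - u_j)."""
    return 1.0 - (1.0 - p) / (1.0 - u_j)

# Two routers marking 1% and 2% of packets respectively.
p = combined_marking([0.01, 0.02])          # whole-path congestion
v_exact = downstream_congestion(p, 0.01)    # downstream of the first router
v_approx = p - 0.01                         # approximation, valid if u_j << 100%
print(f"p = {p:.2%}, v_1 = {v_exact:.2%}, v_1 ~= {v_approx:.2%}")
```

The exact form recovers the second router's 2% marking, while the subtraction slightly underestimates it, as stated above.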
Appendix B. Justification for Two Codepoints Signifying Zero Worth | Appendix B. Justification for Two Codepoints Signifying Zero Worth | |||
Packets | Packets | |||
It may seem a waste of a codepoint to set aside two codepoints of the | It may seem a waste of a codepoint to set aside two codepoints of the | |||
Extended ECN field to signify zero worth (RECT and CE(0) are both | Extended ECN field to signify zero worth (RECT and CE(0) are both | |||
worth zero). The justification is subtle, but worth recording. | worth zero). The justification is subtle, but worth recording. | |||
skipping to change at page 76, line 49 | skipping to change at page 44, line 33 | |||
the same as the proportion of RECT packets changed to CE(-1) and the | the same as the proportion of RECT packets changed to CE(-1) and the | |||
proportion of Re-Echo packets changed to CE(0). Double checking | proportion of Re-Echo packets changed to CE(0). Double checking | |||
using such redundant relationships can improve the security of a | using such redundant relationships can improve the security of a | |||
scheme (cf. double-entry book-keeping or the ECN Nonce). | scheme (cf. double-entry book-keeping or the ECN Nonce). | |||
Alternatively, it might be necessary to exploit the redundancy in the | Alternatively, it might be necessary to exploit the redundancy in the | |||
future to encode an extra information channel. | future to encode an extra information channel. | |||
Appendix C. ECN Compatibility | Appendix C. ECN Compatibility | |||
The rationale for choosing the particular combinations of SYN and SYN | The rationale for choosing the particular combinations of SYN and SYN | |||
ACK flags in Section 4.1.3 is as follows. | ACK flags in Section 6.1.3 is as follows. | |||
Choice of SYN flags: A Re-ECN sender can work with RFC3168 compliant | Choice of SYN flags: A Re-ECN sender can work with RFC3168 compliant | |||
ECN receivers so we wanted to use the same flags as would be used | ECN receivers so we wanted to use the same flags as would be used | |||
in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same | in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same | |||
time, we wanted a server (host B) that is Re-ECT to be able to | time, we wanted a server (host B) that is Re-ECT to be able to | |||
recognise that the client (A) is also Re-ECT. We believe also | recognise that the client (A) is also Re-ECT. We believe also | |||
setting NS=1 in the initial SYN achieves both these objectives, as | setting NS=1 in the initial SYN achieves both these objectives, as | |||
it should be ignored by RFC3168 compliant ECT receivers and by | it should be ignored by RFC3168 compliant ECT receivers and by | |||
ECT-Nonce receivers. But senders that are not Re-ECT should not | ECT-Nonce receivers. But senders that are not Re-ECT should not | |||
set NS=1. At the time ECN was defined, the NS flag was not | set NS=1. At the time ECN was defined, the NS flag was not | |||
skipping to change at page 80, line 5 | skipping to change at page 47, line 30 | |||
This behaviour happens to match TCP's congestion window control in | This behaviour happens to match TCP's congestion window control in | |||
slow start, which is why for TCP sources, only the first and third | slow start, which is why for TCP sources, only the first and third | |||
packet need be FNE packets. | packet need be FNE packets. | |||
A source that would open the congestion window any quicker would have | A source that would open the congestion window any quicker would have | |||
to insert more FNE packets. As another example a UDP source sending | to insert more FNE packets. As another example a UDP source sending | |||
VBR traffic might need to send several FNE packets ahead of the | VBR traffic might need to send several FNE packets ahead of the | |||
traffic peaks it generates. | traffic peaks it generates. | |||
Appendix E. Example Egress Dropper Algorithm | Appendix E. Argument for holding back the ECN nonce | |||
{ToDo: Write up the basic algorithm with flow state, then the | ||||
aggregated one.} | ||||
Appendix F. Re-TTL | ||||
This Appendix gives an overview of a proposal to be able to overload | ||||
the TTL field in the IP header to monitor downstream propagation | ||||
delay. This is included to show that it would be possible to take | ||||
account of RTT if it was deemed desirable. | ||||
Delay re-feedback can be achieved by overloading the TTL field, | ||||
without changing IP or router TTL processing. A target value for TTL | ||||
at the destination would need standardising, say 16. If the path hop | ||||
count increased by more than 16 during a routing change, it would | ||||
temporarily be mistaken for a routing loop, so this target would need | ||||
to be chosen to exceed typical hop count increases. The TCP wire | ||||
protocol and handlers would need modifying to feed back the | ||||
destination TTL and initialise it. It would be necessary to | ||||
standardise the unit of TTL in terms of real time (as was the | ||||
original intent in the early days of the Internet). | ||||
In the longer term, precision could be improved if routers | ||||
decremented TTL to represent exact propagation delay to the next | ||||
router. That is, for a router to decrement TTL by, say, 1.8 time | ||||
units it would alternate the decrement of every packet between 1 & 2 | ||||
at a ratio of 1:4. Although this might sometimes require a seemingly | ||||
dangerous null decrement, a packet in a loop would still decrement to | ||||
zero after 255 time units on average. As more routers were upgraded | ||||
to this more accurate TTL decrement, path delay estimates would | ||||
become increasingly accurate despite the presence of some RFC3168 | ||||
compliant routers that continued to always decrement the TTL by 1. | ||||
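One possible way to realise a fractional average decrement is sketched below. It is an assumption, not part of the proposal: the draft only says that a router would alternate between decrements, whereas this sketch uses a deterministic carry of the fractional remainder. The name ttl_decrements is hypothetical.

```python
from fractions import Fraction

def ttl_decrements(target, packets):
    """Yield one integer TTL decrement per packet, averaging `target`."""
    owed = Fraction(0)
    for _ in range(packets):
        owed += target
        step = int(owed)   # whole time units applied to this packet
        owed -= step       # carry the fractional remainder forward
        yield step

# Target of 1.8 time units per packet, expressed exactly as 9/5.
steps = list(ttl_decrements(Fraction(9, 5), 10))
print(steps, sum(steps) / len(steps))
```

For a target of 1.8 the sequence contains one decrement of 1 for every four decrements of 2. A target below 1 would produce the occasional null decrement mentioned above.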
Appendix G. Policer Designs to ensure Congestion Responsiveness | ||||
G.1. Per-user Policing | ||||
User policing requires a policer on the ingress interface of the | ||||
access router associated with the user. At that point, the traffic | ||||
of the user hasn't diverged on different routes yet; nor has it mixed | ||||
with traffic from other sources. | ||||
In order to ensure that a user doesn't generate more congestion in | ||||
the network than her due share, a modified bulk token-bucket is | ||||
maintained with the following parameters: | ||||
o b_0 the initial token level | ||||
o r the filling rate | ||||
o b_max the bucket depth | ||||
The same token bucket algorithm is used as in many areas of | ||||
networking, but how it is used is very different: | ||||
o all traffic from a user over the lifetime of their subscription is | ||||
policed in the same token bucket. | ||||
o only positive and canceled packets (Re-Echo, FNE and CE(0)) | ||||
consume tokens | ||||
Such a policer will allow network operators to throttle the | ||||
contribution of their users to network congestion. This will require | ||||
the appropriate contractual terms to be in place between operators | ||||
and users. For instance: a condition for a user to subscribe to a | ||||
given network service may be that she should not cause more than a | ||||
volume C_user of congestion over a reference period T_user, although | ||||
she may carry forward up to N_user times her allowance at the end of | ||||
each period. These terms directly set the parameters of the user | ||||
policer: | ||||
o b_0 = C_user | ||||
o r = C_user/T_user | ||||
o b_max = b_0 * (N_user +1) | ||||
Besides the congestion budget policer above, another user policer may | ||||
be necessary to further rate-limit FNE packets, if they are to be | ||||
marked rather than dropped (see discussion in Section 5.3). Rate- | ||||
limiting FNE packets will prevent high bursts of new flow arrivals, | ||||
which is a very useful feature in DoS prevention. A condition to | ||||
subscribe to a given network service would have to be that a user | ||||
should not generate more than C_FNE FNE packets, over a reference | ||||
period T_FNE, with no option to carry forward any of the allowance at | ||||
the end of each period. These terms directly set the parameters of | ||||
the FNE policer: | ||||
o b_0 = C_FNE | ||||
o r = C_FNE/T_FNE | ||||
o b_max = b_0 | ||||
T_FNE should be a much shorter period than T_user: for instance T_FNE | ||||
could be in the order of minutes while T_user could be in the order of | ||||
weeks. | ||||
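The congestion budget policer described above can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are counted in bytes of congestion volume, codepoints are represented as plain strings, and the class and method names are hypothetical. A packet that would take the bucket negative is refused, which is one reading of the sanction described later in this appendix.

```python
from dataclasses import dataclass

# Only positive and canceled packets consume tokens.
CONSUMING = {"Re-Echo", "FNE", "CE(0)"}

@dataclass
class CongestionPolicer:
    level: float        # current token level (starts at b_0)
    rate: float         # filling rate r
    depth: float        # bucket depth b_max
    last: float = 0.0   # time of the previous update

    @classmethod
    def for_user(cls, c_user, t_user, n_user):
        """b_0 = C_user, r = C_user/T_user, b_max = b_0 * (N_user + 1)."""
        return cls(level=c_user, rate=c_user / t_user,
                   depth=c_user * (n_user + 1))

    def admit(self, now, size, eecn):
        """Refill the bucket, then charge the packet if it consumes tokens."""
        self.level = min(self.depth,
                         self.level + self.rate * (now - self.last))
        self.last = now
        if eecn not in CONSUMING:
            return True            # neutral and negative packets are free
        if self.level < size:
            return False           # would drive the token level negative
        self.level -= size
        return True

policer = CongestionPolicer.for_user(c_user=1000, t_user=100, n_user=1)
```

The FNE rate limiter would use the same structure with depth equal to the initial level and with each FNE packet consuming one token, again as an assumption about units.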
G.2. Per-flow Rate Policing | ||||
Whilst we believe that simple per-user policing would be sufficient | ||||
to ensure senders comply with congestion control, some operators may | ||||
wish to police the rate response of each flow to congestion as well. | ||||
Although we do not believe this will be necessary, we include this | ||||
section to show how one could perform per-flow policing using | ||||
enforcement of TCP-fairness as an example. Per-flow policing aims to | ||||
enforce congestion responsiveness on the shortest information | ||||
timescale on a network path: packet roundtrips. | ||||
This again requires that the appropriate terms be agreed between a | ||||
network operator and its users, where a congestion responsiveness | ||||
policy might be required for the use of a given network service | ||||
(perhaps unless the user specifically requests otherwise). | ||||
As an example, we describe below how a rate adaptation policer can be | ||||
designed when the applicable rate adaptation policy is TCP- | ||||
compliance. In that context, the average throughput of a flow will | ||||
be expected to be bounded by the value of the TCP throughput during | ||||
congestion avoidance, given in Mathis' formula [Mathis97] | ||||
x_TCP = k * s / ( T * sqrt(m) ) | ||||
where: | ||||
o x_TCP is the throughput of the TCP flow in packets per second, | ||||
o k is a constant upper-bounded by sqrt(3/2), | ||||
o s is the average packet size of the flow, | ||||
o T is the roundtrip time of the flow, | ||||
o m is the congestion level experienced by the flow. | ||||
We define the marking period N=1/m which represents the average | ||||
number of packets between two positive or canceled packets. Mathis' | ||||
formula can be re-written as: | ||||
x_TCP = k*s*sqrt(N)/T | ||||
We can then get the average inter-mark time in a compliant TCP flow, | ||||
dt_TCP, by solving (x_TCP/s)*dt_TCP = N which gives | ||||
dt_TCP = sqrt(N)*T/k | ||||
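The derivation above can be checked numerically with a short sketch. The function names and the sample values (1500 byte packets, 100 ms roundtrip, 1% congestion) are illustrative assumptions, and k is taken at its upper bound sqrt(3/2).

```python
import math

K = math.sqrt(3 / 2)   # upper bound on the constant k

def tcp_throughput(s, T, m, k=K):
    """x_TCP = k * s / (T * sqrt(m))"""
    return k * s / (T * math.sqrt(m))

def inter_mark_time(T, m, k=K):
    """dt_TCP = sqrt(N) * T / k, with marking period N = 1/m"""
    return math.sqrt(1 / m) * T / k

s, T, m = 1500, 0.1, 0.01
x = tcp_throughput(s, T, m)
dt = inter_mark_time(T, m)
# A compliant flow sends N = 1/m packets between two marks.
print((x / s) * dt)
```

The product (x_TCP/s)*dt_TCP recovers N = 100, which confirms that the two expressions are consistent.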
We rely on this equation for the design of a rate-adaptation policer | ||||
as a variation of a token bucket. In that case a policer has to be | ||||
set up for each policed flow. This may be triggered by FNE packets, | ||||
with the remainder of flows being all rate limited together if they | ||||
do not start with an FNE packet. | ||||
Where maintaining per flow state is not a problem, for instance on | ||||
some access routers, systematic per-flow policing may be considered. | ||||
Should per-flow state be more constrained, rate adaptation policing | ||||
could be limited to a random sample of flows exhibiting positive or | ||||
canceled packets. | ||||
As in the case of user policing, only positive or canceled packets | ||||
will consume tokens, however the amount of tokens consumed will | ||||
depend on the congestion signal. | ||||
When a new rate adaptation policer is set up for flow j, the | ||||
following state is created: | ||||
o a token bucket b_j of depth b_max starting at level b_0 | ||||
o a timestamp t_j = timenow() | ||||
o a counter N_j = 0 | ||||
o a roundtrip estimate T_j | ||||
o a filling rate r | ||||
When the policing node forwards a packet of flow j with no Re-Echo: | ||||
o the counter is incremented: N_j += 1 | ||||
When the policing node forwards a packet of flow j carrying a | ||||
congestion mark (CE): | ||||
o the counter is incremented: N_j += 1 | ||||
o the token level is adjusted: b_j += r*(timenow()-t_j) - sqrt(N_j)* | ||||
T_j/k | ||||
o the counter is reset: N_j = 0 | ||||
o the timer is reset: t_j = timenow() | ||||
An implementation example will be given in a later draft that avoids | ||||
having to extract the square root. | ||||
Analysis: For a TCP flow, for r= 1 token/sec, on average, | ||||
r*(timenow()-t_j)-sqrt(N_j)* T_j/k = dt_TCP - sqrt(N)*T/k = 0 | ||||
This means that the token level will fluctuate around its initial | ||||
level. The depth b_max of the bucket sets the timescale on which the | ||||
rate adaptation policy is performed while the filling rate r sets the | ||||
trade-off between responsiveness and robustness: | ||||
o the higher b_max, the longer it will take to catch greedy flows | ||||
o the higher r, the fewer false positives (greedy verdict on | ||||
compliant flows) but the more false negatives (compliant verdict | ||||
on greedy flows) | ||||
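The per-flow state and update rules above can be sketched as follows. This is a hedged illustration, not the implementation promised for a later draft: it extracts the square root directly, caps the token level at b_max (an assumption, since the text does not say what happens at the top of the bucket), and uses hypothetical class and method names.

```python
import math

class RateAdaptationPolicer:
    """Per-flow state: bucket b_j, timestamp t_j, counter N_j, RTT T_j."""

    def __init__(self, now, rtt, b_0, b_max, r=1.0, k=math.sqrt(3 / 2)):
        self.b = b_0          # token level, starts at b_0
        self.b_max = b_max    # bucket depth
        self.t = now          # time of the last mark
        self.n = 0            # packets seen since the last mark
        self.rtt = rtt        # roundtrip estimate T_j
        self.r = r            # filling rate
        self.k = k

    def on_unmarked(self):
        self.n += 1

    def on_marked(self, now):
        """Update on a marked packet; return False once the level is negative."""
        self.n += 1
        self.b += self.r * (now - self.t) - math.sqrt(self.n) * self.rtt / self.k
        self.b = min(self.b, self.b_max)   # assumed cap at the bucket depth
        self.n = 0
        self.t = now
        return self.b >= 0

def run(interval, packets=100, rtt=0.1):
    """Feed one marking period of `packets` packets ending at time `interval`."""
    pol = RateAdaptationPolicer(now=0.0, rtt=rtt, b_0=5.0, b_max=10.0)
    for _ in range(packets - 1):
        pol.on_unmarked()
    pol.on_marked(interval)
    return pol.b

dt_tcp = math.sqrt(100) * 0.1 / math.sqrt(3 / 2)
compliant = run(dt_tcp)        # level stays at b_0
greedy = run(dt_tcp / 2)       # same packets in half the time: level falls
```

With r = 1 token/sec a flow that spaces its marks by dt_TCP leaves the level unchanged, while a flow sending the same number of packets in less time drains the bucket.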
This rate adaptation policer requires the availability of a roundtrip | ||||
estimate which may be obtained for instance from the application of | ||||
re-feedback to the downstream delay (Appendix F) or passive estimation | ||||
[Jiang02]. | ||||
When the bucket of a policer located at the access router (whether it | ||||
is a per-user policer or a per-flow policer) becomes empty, the | ||||
access router SHOULD drop at least all packets causing the token | ||||
level to become negative. The network operator MAY take further | ||||
sanctions if the token level of the per-flow policers associated with | ||||
a user becomes negative. | ||||
Appendix H. Downstream Congestion Metering Algorithms | ||||
H.1. Bulk Downstream Congestion Metering Algorithm | ||||
To meter the bulk amount of downstream congestion in traffic crossing | ||||
an inter-domain border an algorithm is needed that accumulates the | ||||
size of positive packets and subtracts the size of negative packets. | ||||
We maintain two counters: | ||||
V_b: accumulated congestion volume | ||||
B: total data volume (in case it is needed) | ||||
A suitable pseudo-code algorithm for a border router is as follows: | ||||
==================================================================== | ||||
V_b = 0 | ||||
B = 0 | ||||
for each Re-ECN-capable packet { | ||||
b = readLength(packet) /* set b to packet size */ | ||||
B += b /* accumulate total volume */ | ||||
if readEECN(packet) == Re-Echo || readEECN(packet) == FNE { | ||||
V_b += b /* increment... */ | ||||
} elseif readEECN(packet) == CE(-1) { | ||||
V_b -= b /* ...or decrement V_b... */ | ||||
} /*...depending on EECN field */ | ||||
} | ||||
==================================================================== | ||||
At the end of an accounting period this counter V_b represents the | ||||
congestion volume that penalties could be applied to, as described in | ||||
Section 6.1.6. | ||||
For instance, the accumulated volume of congestion through a border | ||||
interface over a month might be V_b = 5PB (petabyte = 10^15 byte). | ||||
This might have resulted from an average downstream congestion level | ||||
of 1% on an accumulated total data volume of B = 500PB. | ||||
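The pseudo-code above can also be sketched in Python. This is a
minimal sketch under stated assumptions: the class name BulkMeter, the
method names and the use of strings to label EECN codepoints are
illustrative only and are not part of this specification.

```python
from dataclasses import dataclass

# EECN codepoint names as used in this document. The strings are
# illustrative labels, not wire encodings.
POSITIVE = {"Re-Echo", "FNE"}
NEGATIVE = {"CE(-1)"}


@dataclass
class BulkMeter:
    v_b: int = 0  # accumulated congestion volume V_b, in bytes
    b: int = 0    # total data volume B, in bytes

    def on_packet(self, length: int, eecn: str) -> None:
        """Account for one Re-ECN-capable packet."""
        self.b += length
        if eecn in POSITIVE:
            self.v_b += length
        elif eecn in NEGATIVE:
            self.v_b -= length
        # Any other codepoint leaves V_b unchanged.

    def congestion_level(self) -> float:
        """Average downstream congestion level V_b / B."""
        return self.v_b / self.b if self.b else 0.0


meter = BulkMeter()
for length, eecn in [(1500, "RECT"), (1500, "Re-Echo"), (1500, "CE(-1)"),
                     (1500, "FNE"), (1000, "RECT")]:
    meter.on_packet(length, eecn)
print(meter.v_b, meter.b)

# The monthly example from the text: V_b = 5PB on B = 500PB.
monthly = BulkMeter(v_b=5 * 10**15, b=500 * 10**15)
print(monthly.congestion_level())
```

The last two lines reproduce the arithmetic of the example: 5PB of
congestion volume on 500PB of total volume is a congestion level of
1%.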
H.2. Inflation Factor for Persistently Negative Flows | ||||
The following process is suggested to complement the simple algorithm | ||||
above in order to protect against the various attacks from | ||||
persistently negative flows described in Section 6.1.6. As explained | ||||
in that section, the first and most important step is to estimate the | ||||
contribution of persistently negative flows to the bulk volume of | ||||
downstream pre-congestion and to inflate this bulk volume as if these | ||||
flows were not present. The process below has been designed to give an | ||||
unbiased estimate, but it may be possible to define other processes | ||||
that achieve similar ends. | ||||
While the above simple metering algorithm is counting the bulk of | ||||
traffic over an accounting period, the meter should also select a | ||||
subset of the whole flow ID space that is small enough to be able to | ||||
realistically measure but large enough to give a realistic sample. | ||||
Many different samples of different subsets of the ID space should be | ||||
taken at different times during the accounting period, preferably | ||||
covering the whole ID space. During each sample, the meter should | ||||
count the volume of positive packets and subtract the volume of | ||||
negative, maintaining a separate account for each flow in the sample. | ||||
Each sample should run much longer than the large majority of flows, | ||||
to avoid a bias from missing the starts and ends of flows, which tend | ||||
to be positive and negative respectively. | ||||
Once the accounting period finishes, the meter should calculate, for | ||||
the subset of flows I in each sample, the total of the accounts V_{bI} | ||||
and the total of the accounts V_{fI} excluding flows with a negative | ||||
account. Then the weighted mean over all these samples should be | ||||
taken: | ||||
a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI} | ||||
If V_b is the result of the bulk accounting algorithm over the | ||||
accounting period (Appendix H.1), it can be inflated by this factor | ||||
a_S to get an unbiased estimate, a_S * V_b, of the volume of | ||||
downstream congestion over the accounting period, without being | ||||
polluted by the effect of persistently negative flows. | ||||
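The calculation of a_S can be sketched as follows. This is a minimal
sketch, assuming each sample is held as a mapping from a flow ID to
that flow's account (positive volume minus negative volume, in bytes).
The function name, the flow IDs and the error raised when the sampled
bulk total is not positive are assumptions made for illustration.

```python
from typing import Dict, Iterable


def inflation_factor(samples: Iterable[Dict[str, int]]) -> float:
    """Return a_S = sum_I V_fI / sum_I V_bI over all samples I."""
    sum_f = 0  # totals excluding flows with a negative account
    sum_b = 0  # totals over all flows in each sample
    for accounts in samples:
        sum_b += sum(accounts.values())
        sum_f += sum(v for v in accounts.values() if v >= 0)
    if sum_b <= 0:
        # Assumed handling: no meaningful factor can be derived.
        raise ValueError("sampled bulk congestion volume is not positive")
    return sum_f / sum_b


# Two hypothetical samples, each with one persistently negative flow.
samples = [
    {"flow1": 300, "flow2": -100, "flow3": 200},  # V_bI = 400, V_fI = 500
    {"flow4": 500, "flow5": -100, "flow6": 0},    # V_bI = 400, V_fI = 500
]
a_s = inflation_factor(samples)
v_b = 8000                # bulk result from Appendix H.1, in bytes
print(a_s, a_s * v_b)     # inflation factor and inflated estimate
```

With these figures a_S = 1000 / 800 = 1.25, so a bulk result of 8000
bytes is inflated to 10000 bytes.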
Appendix I. Argument for holding back the ECN nonce | ||||
The ECN nonce is a mechanism that allows a /sending/ transport to | The ECN nonce is a mechanism that allows a /sending/ transport to | |||
detect if drop or ECN marking at a congested router has been | detect if drop or ECN marking at a congested router has been | |||
suppressed by a node somewhere in the feedback loop---another router | suppressed by a node somewhere in the feedback loop---another router | |||
or the receiver. | or the receiver. | |||
Space for the ECN nonce was set aside in [RFC3168] (currently | Space for the ECN nonce was set aside in [RFC3168] (currently | |||
proposed standard) while the full nonce mechanism is specified in | proposed standard) while the full nonce mechanism is specified in | |||
[RFC3540] (currently experimental). The specification for [RFC4340] | [RFC3540] (currently experimental). The specification for [RFC4340] | |||
(currently proposed standard) requires that "Each DCCP sender SHOULD | (currently proposed standard) requires that "Each DCCP sender SHOULD | |||
skipping to change at page 88, line 16 | skipping to change at page 49, line 29 | |||
because Re-ECN marking fractions at inter-domain borders would be | because Re-ECN marking fractions at inter-domain borders would be | |||
polluted by unknown levels of nonce traffic. | polluted by unknown levels of nonce traffic. | |||
The authors are aware that Re-ECN must prove it has the potential it | The authors are aware that Re-ECN must prove it has the potential it | |||
claims if it is to displace the nonce. Therefore, every effort has | claims if it is to displace the nonce. Therefore, every effort has | |||
been made to complete a comprehensive specification of Re-ECN so that | been made to complete a comprehensive specification of Re-ECN so that | |||
its potential can be assessed. We therefore seek the opinion of the | its potential can be assessed. We therefore seek the opinion of the | |||
Internet community on whether the Re-ECN protocol is sufficiently | Internet community on whether the Re-ECN protocol is sufficiently | |||
useful to warrant standards action. | useful to warrant standards action. | |||
Appendix F. Alternative Terminology Used in Other Documents | ||||
A number of alternative terms have been used in various documents | ||||
describing re-feedback and re-ECN. These are set out in the | ||||
following table. | ||||
+-------------------+---------------+-------------------------------+ | ||||
| Current | EECN | Colour | | ||||
| Terminology | codepoint | | | ||||
+-------------------+---------------+-------------------------------+ | ||||
| Cautious | FNE | Green | | ||||
| Positive | Re-Echo | Black | | ||||
| Neutral | RECT | Grey | | ||||
| Negative | CE(-1) | Red | | ||||
| Cancelled | CE(0) | Red-Black | | ||||
| Legacy ECN | ECT(0) | White | | ||||
| Currently Unused | --CU-- | Currently unused | | ||||
| | | | | ||||
| Legacy | Not-ECT | White | | ||||
+-------------------+---------------+-------------------------------+ | ||||
Table 7: Alternative re-ECN Terminology | ||||
Authors' Addresses | Authors' Addresses | |||
Bob Briscoe | Bob Briscoe | |||
BT & UCL | BT & UCL | |||
B54/77, Adastral Park | B54/77, Adastral Park | |||
Martlesham Heath | Martlesham Heath | |||
Ipswich IP5 3RE | Ipswich IP5 3RE | |||
UK | UK | |||
Phone: +44 1473 645196 | Phone: +44 1473 645196 | |||
Email: bob.briscoe@bt.com | EMail: bob.briscoe@bt.com | |||
URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/ | URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/ | |||
Arnaud Jacquet | Arnaud Jacquet | |||
BT | BT | |||
B54/70, Adastral Park | B54/70, Adastral Park | |||
Martlesham Heath | Martlesham Heath | |||
Ipswich IP5 3RE | Ipswich IP5 3RE | |||
UK | UK | |||
Phone: +44 1473 647284 | Phone: +44 1473 647284 | |||
Email: arnaud.jacquet@bt.com | EMail: arnaud.jacquet@bt.com | |||
URI: | URI: | |||
Toby Moncaster | Toby Moncaster | |||
BT | BT | |||
B54/70, Adastral Park | B54/70, Adastral Park | |||
Martlesham Heath | Martlesham Heath | |||
Ipswich IP5 3RE | Ipswich IP5 3RE | |||
UK | UK | |||
Phone: +44 1473 648734 | Phone: +44 1473 648734 | |||
Email: toby.moncaster@bt.com | EMail: toby.moncaster@bt.com | |||
Alan Smith | Alan Smith | |||
BT | BT | |||
B54/76, Adastral Park | B54/76, Adastral Park | |||
Martlesham Heath | Martlesham Heath | |||
Ipswich IP5 3RE | Ipswich IP5 3RE | |||
UK | UK | |||
Phone: +44 1473 640404 | Phone: +44 1473 640404 | |||
Email: alan.p.smith@bt.com | EMail: alan.p.smith@bt.com | |||
Full Copyright Statement | Full Copyright Statement | |||
Copyright (C) The IETF Trust (2008). | Copyright (C) The IETF Trust (2009). | |||
This document is subject to the rights, licenses and restrictions | This document is subject to the rights, licenses and restrictions | |||
contained in BCP 78, and except as set forth therein, the authors | contained in BCP 78, and except as set forth therein, the authors | |||
retain all their rights. | retain all their rights. | |||
This document and the information contained herein are provided on an | This document and the information contained herein are provided on an | |||
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | |||
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND | OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND | |||
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS | THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS | |||
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF | OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF | |||
skipping to change at page 90, line 45 | skipping to change at page 52, line 45 | |||
such proprietary rights by implementers or users of this | such proprietary rights by implementers or users of this | |||
specification can be obtained from the IETF on-line IPR repository at | specification can be obtained from the IETF on-line IPR repository at | |||
http://www.ietf.org/ipr. | http://www.ietf.org/ipr. | |||
The IETF invites any interested party to bring to its attention any | The IETF invites any interested party to bring to its attention any | |||
copyrights, patents or patent applications, or other proprietary | copyrights, patents or patent applications, or other proprietary | |||
rights that may cover technology that may be required to implement | rights that may cover technology that may be required to implement | |||
this standard. Please address the information to the IETF at | this standard. Please address the information to the IETF at | |||
ietf-ipr@ietf.org. | ietf-ipr@ietf.org. | |||
Acknowledgments | Acknowledgements | |||
Funding for the RFC Editor function is provided by the IETF | Funding for the RFC Editor function is provided by the IETF | |||
Administrative Support Activity (IASA). This document was produced | Administrative Support Activity (IASA). This document was produced | |||
using xml2rfc v1.32 (of http://xml.resource.org/) from a source in | using xml2rfc v1.32 (of http://xml.resource.org/) from a source in | |||
RFC-2629 XML format. | RFC-2629 XML format. | |||
End of changes. 137 change blocks. | ||||
2640 lines changed or deleted | 870 lines changed or added | |||