draft-briscoe-tsvwg-re-ecn-tcp-05.txt | draft-briscoe-tsvwg-re-ecn-tcp-06.txt | |||
---|---|---|---|---|
Transport Area Working Group B. Briscoe | Transport Area Working Group B. Briscoe | |||
Internet-Draft BT & UCL | Internet-Draft BT & UCL | |||
Intended status: Standards Track A. Jacquet | Intended status: Standards Track A. Jacquet | |||
Expires: July 13, 2008 T. Moncaster | Expires: January 15, 2009 T. Moncaster | |||
A. Smith | A. Smith | |||
BT | BT | |||
January 10, 2008 | July 14, 2008 | |||
Re-ECN: Adding Accountability for Causing Congestion to TCP/IP | Re-ECN: Adding Accountability for Causing Congestion to TCP/IP | |||
draft-briscoe-tsvwg-re-ecn-tcp-05 | draft-briscoe-tsvwg-re-ecn-tcp-06 | |||
Status of this Memo | Status of this Memo | |||
By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 37 | skipping to change at page 1, line 37 | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
This Internet-Draft will expire on July 13, 2008. | This Internet-Draft will expire on January 15, 2009. | |||
Copyright Notice | Copyright Notice | |||
Copyright (C) The IETF Trust (2008). | Copyright (C) The IETF Trust (2008). | |||
Abstract | Abstract | |||
This document introduces a new protocol for explicit congestion | This document introduces a new protocol for explicit congestion | |||
notification (ECN), termed re-ECN, which can be deployed | notification (ECN), termed re-ECN, which can be deployed | |||
incrementally around unmodified routers. The protocol arranges an | incrementally around unmodified routers. It enables the upstream | |||
extended ECN field in each packet so that, as it crosses any | party at any trust boundary in the internetwork to be held | |||
interface in an internetwork, it will carry a truthful prediction of | responsible for the congestion they cause, or allow to be caused. | |||
congestion on the remainder of its path. Then the upstream party at | ||||
any trust boundary in the internetwork can be held responsible for | So, networks can introduce straightforward accountability for | |||
the congestion they cause, or allow to be caused. So, networks can | congestion and policing mechanisms for incoming traffic from end- | |||
introduce straightforward accountability and policing mechanisms for | customers or from neighbouring network domains. The protocol works | |||
incoming traffic from end-customers or from neighbouring network | by arranging an extended ECN field in each packet so that, as it | |||
domains. The purpose of this document is to specify the re-ECN | crosses any interface in an internetwork, it will carry a truthful | |||
protocol at the IP layer and to give guidelines on any consequent | prediction of congestion on the remainder of its path. The purpose | |||
changes required to transport protocols. It includes the changes | of this document is to specify the re-ECN protocol at the IP layer | |||
required to TCP both as an example and as a specification. It also | and to give guidelines on any consequent changes required to | |||
gives examples of mechanisms that can use the protocol to ensure data | transport protocols. It includes the changes required to TCP both as | |||
sources respond correctly to congestion. And it describes example | an example and as a specification. It also gives examples of | |||
mechanisms that ensure the dominant selfish strategy of both network | mechanisms that can use the protocol to ensure data sources respond | |||
domains and end-points will be to set the extended ECN field | correctly to congestion. And it describes example mechanisms that | |||
honestly. | ensure the dominant selfish strategy of both network domains and end- | |||
points will be to set the extended ECN field honestly. | ||||
Authors' Statement: Status (to be removed by the RFC Editor) | Authors' Statement: Status (to be removed by the RFC Editor) | |||
Although the re-ECN protocol is intended to make a simple but far- | Although the re-ECN protocol is intended to make a simple but far- | |||
reaching change to the Internet architecture, the most immediate | reaching change to the Internet architecture, the most immediate | |||
priority for the authors is to delay any move of the ECN nonce to | priority for the authors is to delay any move of the ECN nonce to | |||
Proposed Standard status. The argument for this position is | Proposed Standard status. The argument for this position is | |||
developed in Appendix I. | developed in Appendix I. | |||
Changes from previous drafts (to be removed by the RFC Editor) | Changes from previous drafts (to be removed by the RFC Editor) | |||
Full diffs created using the rfcdiff tool are available at | Full diffs created using the rfcdiff tool are available at | |||
<http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#retcp> | <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#retcp> | |||
From -04 to -05 (current version): | From -05 to -06 (current version): | |||
Clarifications made to Section 1 and Section 3. | ||||
Minor editorial changes throughout. | ||||
From -04 to -05: | ||||
Completed justification for packet marking with FNE during slow- | Completed justification for packet marking with FNE during slow- | |||
start (Appendix D). | start (Appendix D). | |||
Minor editorial changes throughout. | Minor editorial changes throughout. | |||
From -03 to -04: | From -03 to -04: | |||
Clarified reasons for holding back ECN nonce (Section 3.2 & | Clarified reasons for holding back ECN nonce (Section 3.3 & | |||
Appendix I). | Appendix I). | |||
Clarified Figure 1. | Clarified Figure 2. | |||
Added Section 4.1.1.1 on equivalence of drops and ECN marks. | Added Section 4.1.1.1 on equivalence of drops and ECN marks. | |||
Improved precision of Section 5.6 on IP in IP tunnels. | Improved precision of Section 5.6 on IP in IP tunnels. | |||
Explained that RTT fairness is possible to enforce, but unlikely to | Explained that RTT fairness is possible to enforce, but unlikely to | |||
be required (Section 6.1.3 & Appendix F). | be required (Section 6.1.3 & Appendix F). | |||
Explained that bulk per-user policing should be adequate but per- | Explained that bulk per-user policing should be adequate but per- | |||
flow policing is also possible if desired, though it is not likely | flow policing is also possible if desired, though it is not likely | |||
skipping to change at page 3, line 27 | skipping to change at page 3, line 33 | |||
From -02 to -03: | From -02 to -03: | |||
Started guidelines for re-ECN support in DCCP and SCTP. | Started guidelines for re-ECN support in DCCP and SCTP. | |||
Added annex on limitations of nonce mechanism. | Added annex on limitations of nonce mechanism. | |||
Minor editorial changes throughout. | Minor editorial changes throughout. | |||
From -01 to -02: | From -01 to -02: | |||
Explanation on informal terminology in Section 3.4 clarified. | Explanation on informal terminology in Section 3.5 clarified. | |||
IPv6 wire protocol encoding added (Section 5.2). | IPv6 wire protocol encoding added (Section 5.2). | |||
Text on (non-)issues with tunnels, encryption and link layer | Text on (non-)issues with tunnels, encryption and link layer | |||
congestion notification added (Section 5.6 & Section 5.7). | congestion notification added (Section 5.6 & Section 5.7). | |||
Section added giving evolvability arguments against encouraging | Section added giving evolvability arguments against encouraging | |||
bottleneck policing (Section 6.1.2). And text on re-ECN's | bottleneck policing (Section 6.1.2). And text on re-ECN's | |||
evolvability by design added to Section 6.1.3 | evolvability by design added to Section 6.1.3 | |||
skipping to change at page 4, line 8 | skipping to change at page 4, line 11 | |||
Encoding of re-ECN wire protocol changed for reasons given in | Encoding of re-ECN wire protocol changed for reasons given in | |||
Appendix B and consequently draft substantially re-written. | Appendix B and consequently draft substantially re-written. | |||
Substantial text added in sections on applications, incremental | Substantial text added in sections on applications, incremental | |||
deployment, architectural rationale and security considerations. | deployment, architectural rationale and security considerations. | |||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
2. Requirements notation . . . . . . . . . . . . . . . . . . . . 7 | 2. Requirements notation . . . . . . . . . . . . . . . . . . . . 8 | |||
3. Protocol Overview . . . . . . . . . . . . . . . . . . . . . . 8 | 3. Protocol Overview . . . . . . . . . . . . . . . . . . . . . . 8 | |||
3.1. Background and Applicability . . . . . . . . . . . . . . . 8 | 3.1. Background and Applicability . . . . . . . . . . . . . . . 8 | |||
3.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or | 3.2. Simplified Re-ECN Protocol . . . . . . . . . . . . . . . . 10 | |||
v6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 | 3.3. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or | |||
3.3. Re-ECN Protocol Operation . . . . . . . . . . . . . . . . 11 | v6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 | |||
3.4. Informal Terminology . . . . . . . . . . . . . . . . . . . 13 | 3.4. Re-ECN Protocol Operation . . . . . . . . . . . . . . . . 12 | |||
3.5. Informal Terminology . . . . . . . . . . . . . . . . . . . 14 | ||||
4. Transport Layers . . . . . . . . . . . . . . . . . . . . . . . 15 | 4. Transport Layers . . . . . . . . . . . . . . . . . . . . . . . 15 | |||
4.1. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | 4.1. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 | |||
4.1.1. RECN mode: Full re-ECN capable transport . . . . . . . 16 | 4.1.1. RECN mode: Full Re-ECN capable transport . . . . . . . 17 | |||
4.1.2. RECN-Co mode: Re-ECT Sender with a Vanilla or | 4.1.2. RECN-Co mode: Re-ECT Sender with a RFC3168 | |||
Nonce ECT Receiver . . . . . . . . . . . . . . . . . . 20 | compliant ECN Receiver . . . . . . . . . . . . . . . . 20 | |||
4.1.3. Capability Negotiation . . . . . . . . . . . . . . . . 21 | 4.1.3. Capability Negotiation . . . . . . . . . . . . . . . . 21 | |||
4.1.4. Extended ECN (EECN) Field Settings during Flow | 4.1.4. Extended ECN (EECN) Field Settings during Flow | |||
Start or after Idle Periods . . . . . . . . . . . . . 23 | Start or after Idle Periods . . . . . . . . . . . . . 23 | |||
4.1.5. Pure ACKS, Retransmissions, Window Probes and | 4.1.5. Pure ACKS, Retransmissions, Window Probes and | |||
Partial ACKs . . . . . . . . . . . . . . . . . . . . . 26 | Partial ACKs . . . . . . . . . . . . . . . . . . . . . 27 | |||
4.2. Other Transports . . . . . . . . . . . . . . . . . . . . . 27 | 4.2. Other Transports . . . . . . . . . . . . . . . . . . . . . 27 | |||
4.2.1. General Guidelines for Adding Re-ECN to Other | 4.2.1. General Guidelines for Adding Re-ECN to Other | |||
Transports . . . . . . . . . . . . . . . . . . . . . . 27 | Transports . . . . . . . . . . . . . . . . . . . . . . 27 | |||
4.2.2. Guidelines for adding Re-ECN to RSVP or NSIS . . . . . 28 | 4.2.2. Guidelines for adding Re-ECN to RSVP or NSIS . . . . . 28 | |||
4.2.3. Guidelines for adding Re-ECN to DCCP . . . . . . . . . 28 | 4.2.3. Guidelines for adding Re-ECN to DCCP . . . . . . . . . 28 | |||
4.2.4. Guidelines for adding Re-ECN to SCTP . . . . . . . . . 28 | 4.2.4. Guidelines for adding Re-ECN to SCTP . . . . . . . . . 29 | |||
5. Network Layer . . . . . . . . . . . . . . . . . . . . . . . . 28 | 5. Network Layer . . . . . . . . . . . . . . . . . . . . . . . . 29 | |||
5.1. Re-ECN IPv4 Wire Protocol . . . . . . . . . . . . . . . . 28 | 5.1. Re-ECN IPv4 Wire Protocol . . . . . . . . . . . . . . . . 29 | |||
5.2. Re-ECN IPv6 Wire Protocol . . . . . . . . . . . . . . . . 30 | 5.2. Re-ECN IPv6 Wire Protocol . . . . . . . . . . . . . . . . 30 | |||
5.3. Router Forwarding Behaviour . . . . . . . . . . . . . . . 31 | 5.3. Router Forwarding Behaviour . . . . . . . . . . . . . . . 31 | |||
5.4. Justification for Setting the First SYN to FNE . . . . . . 32 | 5.4. Justification for Setting the First SYN to FNE . . . . . . 33 | |||
5.5. Control and Management . . . . . . . . . . . . . . . . . . 33 | 5.5. Control and Management . . . . . . . . . . . . . . . . . . 34 | |||
5.5.1. Negative Balance Warning . . . . . . . . . . . . . . . 33 | 5.5.1. Negative Balance Warning . . . . . . . . . . . . . . . 34 | |||
5.5.2. Rate Response Control . . . . . . . . . . . . . . . . 34 | 5.5.2. Rate Response Control . . . . . . . . . . . . . . . . 35 | |||
5.6. IP in IP Tunnels . . . . . . . . . . . . . . . . . . . . . 34 | 5.6. IP in IP Tunnels . . . . . . . . . . . . . . . . . . . . . 35 | |||
5.7. Non-Issues . . . . . . . . . . . . . . . . . . . . . . . . 35 | 5.7. Non-Issues . . . . . . . . . . . . . . . . . . . . . . . . 36 | |||
6. Applications . . . . . . . . . . . . . . . . . . . . . . . . . 36 | 6. Applications . . . . . . . . . . . . . . . . . . . . . . . . . 37 | |||
6.1. Policing Congestion Response . . . . . . . . . . . . . . . 36 | 6.1. Policing Congestion Response . . . . . . . . . . . . . . . 37 | |||
6.1.1. The Policing Problem . . . . . . . . . . . . . . . . . 36 | 6.1.1. The Policing Problem . . . . . . . . . . . . . . . . . 37 | |||
6.1.2. The Case Against Bottleneck Policing . . . . . . . . . 37 | 6.1.2. The Case Against Bottleneck Policing . . . . . . . . . 38 | |||
6.1.3. Re-ECN Incentive Framework . . . . . . . . . . . . . . 38 | 6.1.3. Re-ECN Incentive Framework . . . . . . . . . . . . . . 39 | |||
6.1.4. Egress Dropper . . . . . . . . . . . . . . . . . . . . 45 | 6.1.4. Egress Dropper . . . . . . . . . . . . . . . . . . . . 46 | |||
6.1.5. Policing . . . . . . . . . . . . . . . . . . . . . . . 47 | 6.1.5. Policing . . . . . . . . . . . . . . . . . . . . . . . 47 | |||
6.1.6. Inter-domain Policing . . . . . . . . . . . . . . . . 48 | 6.1.6. Inter-domain Policing . . . . . . . . . . . . . . . . 49 | |||
6.1.7. Inter-domain Fail-safes . . . . . . . . . . . . . . . 52 | 6.1.7. Inter-domain Fail-safes . . . . . . . . . . . . . . . 52 | |||
6.1.8. Simulations . . . . . . . . . . . . . . . . . . . . . 53 | 6.1.8. Simulations . . . . . . . . . . . . . . . . . . . . . 53 | |||
6.2. Other Applications . . . . . . . . . . . . . . . . . . . . 53 | 6.2. Other Applications . . . . . . . . . . . . . . . . . . . . 53 | |||
6.2.1. DDoS Mitigation . . . . . . . . . . . . . . . . . . . 53 | 6.2.1. DDoS Mitigation . . . . . . . . . . . . . . . . . . . 53 | |||
6.2.2. End-to-end QoS . . . . . . . . . . . . . . . . . . . . 54 | 6.2.2. End-to-end QoS . . . . . . . . . . . . . . . . . . . . 54 | |||
6.2.3. Traffic Engineering . . . . . . . . . . . . . . . . . 54 | 6.2.3. Traffic Engineering . . . . . . . . . . . . . . . . . 55 | |||
6.2.4. Inter-Provider Service Monitoring . . . . . . . . . . 54 | 6.2.4. Inter-Provider Service Monitoring . . . . . . . . . . 55 | |||
6.3. Limitations . . . . . . . . . . . . . . . . . . . . . . . 54 | 6.3. Limitations . . . . . . . . . . . . . . . . . . . . . . . 55 | |||
7. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 55 | 7. Incremental Deployment . . . . . . . . . . . . . . . . . . . . 56 | |||
7.1. Incremental Deployment Features . . . . . . . . . . . . . 55 | 7.1. Incremental Deployment Features . . . . . . . . . . . . . 56 | |||
7.2. Incremental Deployment Incentives . . . . . . . . . . . . 57 | 7.2. Incremental Deployment Incentives . . . . . . . . . . . . 57 | |||
8. Architectural Rationale . . . . . . . . . . . . . . . . . . . 61 | 8. Architectural Rationale . . . . . . . . . . . . . . . . . . . 62 | |||
9. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 64 | 9. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 65 | |||
9.1. Policing Rate Response to Congestion . . . . . . . . . . . 64 | 9.1. Policing Rate Response to Congestion . . . . . . . . . . . 65 | |||
9.2. Congestion Notification Integrity . . . . . . . . . . . . 65 | 9.2. Congestion Notification Integrity . . . . . . . . . . . . 66 | |||
9.3. Identifying Upstream and Downstream Congestion . . . . . . 66 | 9.3. Identifying Upstream and Downstream Congestion . . . . . . 67 | |||
10. Security Considerations . . . . . . . . . . . . . . . . . . . 66 | 10. Security Considerations . . . . . . . . . . . . . . . . . . . 67 | |||
11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 68 | 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 68 | |||
12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 68 | 12. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 69 | |||
13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 68 | 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 69 | |||
14. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 69 | 14. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 69 | |||
15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 69 | 15. References . . . . . . . . . . . . . . . . . . . . . . . . . . 70 | |||
15.1. Normative References . . . . . . . . . . . . . . . . . . . 69 | 15.1. Normative References . . . . . . . . . . . . . . . . . . . 70 | |||
15.2. Informative References . . . . . . . . . . . . . . . . . . 70 | 15.2. Informative References . . . . . . . . . . . . . . . . . . 70 | |||
Appendix A. Precise Re-ECN Protocol Operation . . . . . . . . . . 73 | Appendix A. Precise Re-ECN Protocol Operation . . . . . . . . . . 74 | |||
Appendix B. Justification for Two Codepoints Signifying Zero | Appendix B. Justification for Two Codepoints Signifying Zero | |||
Worth Packets . . . . . . . . . . . . . . . . . . . . 74 | Worth Packets . . . . . . . . . . . . . . . . . . . . 75 | |||
Appendix C. ECN Compatibility . . . . . . . . . . . . . . . . . . 76 | Appendix C. ECN Compatibility . . . . . . . . . . . . . . . . . . 76 | |||
Appendix D. Packet Marking with FNE During Flow Start . . . . . . 77 | Appendix D. Packet Marking with FNE During Flow Start . . . . . . 78 | |||
Appendix E. Example Egress Dropper Algorithm . . . . . . . . . . 79 | Appendix E. Example Egress Dropper Algorithm . . . . . . . . . . 80 | |||
Appendix F. Re-TTL . . . . . . . . . . . . . . . . . . . . . . . 79 | Appendix F. Re-TTL . . . . . . . . . . . . . . . . . . . . . . . 80 | |||
Appendix G. Policer Designs to ensure Congestion | Appendix G. Policer Designs to ensure Congestion | |||
Responsiveness . . . . . . . . . . . . . . . . . . . 80 | Responsiveness . . . . . . . . . . . . . . . . . . . 80 | |||
G.1. Per-user Policing . . . . . . . . . . . . . . . . . . . . 80 | G.1. Per-user Policing . . . . . . . . . . . . . . . . . . . . 80 | |||
G.2. Per-flow Rate Policing . . . . . . . . . . . . . . . . . . 81 | G.2. Per-flow Rate Policing . . . . . . . . . . . . . . . . . . 82 | |||
Appendix H. Downstream Congestion Metering Algorithms . . . . . . 84 | Appendix H. Downstream Congestion Metering Algorithms . . . . . . 84 | |||
H.1. Bulk Downstream Congestion Metering Algorithm . . . . . . 84 | H.1. Bulk Downstream Congestion Metering Algorithm . . . . . . 84 | |||
H.2. Inflation Factor for Persistently Negative Flows . . . . . 85 | H.2. Inflation Factor for Persistently Negative Flows . . . . . 85 | |||
Appendix I. Argument for holding back the ECN nonce . . . . . . . 85 | Appendix I. Argument for holding back the ECN nonce . . . . . . . 86 | |||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 87 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 88 | |||
Intellectual Property and Copyright Statements . . . . . . . . . . 89 | Intellectual Property and Copyright Statements . . . . . . . . . . 90 | |||
1. Introduction | 1. Introduction | |||
This document aims: | This document aims: | |||
o To provide a complete specification of the addition of the re-ECN | o To provide a complete specification of the addition of the re-ECN | |||
protocol to IP and guidelines on how to add it to transport layer | protocol to IP and guidelines on how to add it to transport layer | |||
protocols, including a complete specification of re-ECN in TCP as | protocols, including a complete specification of re-ECN in TCP as | |||
an example; | an example; | |||
o To show how a number of hard problems become much easier to solve | o To show how a number of hard problems become much easier to solve | |||
once re-ECN is available in IP. | once re-ECN is available in IP. | |||
In ECN [RFC3168] congested queues probabilistically mark packets as | ||||
they approach a congested state. The receiver informs the sender | ||||
that they have seen one or more marks. In re-ECN the sender must | ||||
predict the level of congestion on the path by re-inserting feedback | ||||
according to the marking scheme described later in this draft. This | ||||
results in packets that carry a prediction of downstream congestion. | ||||
If a sender understates expected congestion compared to actual | ||||
congestion then the network could discard packets or enact some other | ||||
sanction. A policer can also be introduced at the ingress of | ||||
networks that can limit the congestion caused (or base penalties on | ||||
it). | ||||
It is important to add a few key points. | ||||
o It can be seen that it takes one round trip before any feedback is | ||||
received. For this reason a sender must make a conservative | ||||
prediction by transmitting IP packets with a special Feedback Not | ||||
Established (FNE) marking. | ||||
o It should be noted that the prediction is carried in-band in | ||||
normal data packets and for many transports feedback can be | ||||
carried in the normal acknowledgements or control packets. | ||||
o The re-ECN protocol is independent of the transport. In TCP, | ||||
acknowledgments are used to convey the feedback from receiver to | ||||
sender. This memo concentrates on TCP as an example transport | ||||
protocol, however the re-ECN protocol is compatible with any | ||||
transport where feedback can be sent from receiver to sender. | ||||
A general statement of the problem solved by re-ECN is to provide | A general statement of the problem solved by re-ECN is to provide | |||
sufficient information in each IP datagram to be able to hold senders | sufficient information in each IP datagram to be able to hold senders | |||
and whole networks accountable for the congestion they cause | and whole networks accountable for the congestion they cause | |||
downstream, before they cause it. But the every-day problems that | downstream, before they cause it. But the every-day problems that | |||
re-ECN can solve are much more recognisable than this rather generic | re-ECN can solve are much more recognisable than this rather generic | |||
statement: mitigating distributed denial of service (DDoS); | statement: mitigating distributed denial of service (DDoS); | |||
simplifying differentiation of quality of service (QoS); policing | simplifying differentiation of quality of service (QoS); policing | |||
compliance to congestion control; and so on. | compliance to congestion control; and so on. | |||
Uniquely, re-ECN manages to enable solutions to these problems | Uniquely, re-ECN manages to enable solutions to these problems | |||
skipping to change at page 6, line 45 | skipping to change at page 7, line 26 | |||
For instance, some network owners want to block applications like | For instance, some network owners want to block applications like | |||
voice and video unless their network is compensated for the extra | voice and video unless their network is compensated for the extra | |||
share of bottleneck bandwidth taken. These real-time applications | share of bottleneck bandwidth taken. These real-time applications | |||
tend to be unresponsive when congestion arises. Whereas elastic TCP- | tend to be unresponsive when congestion arises. Whereas elastic TCP- | |||
based applications back away quickly, ending up taking a much smaller | based applications back away quickly, ending up taking a much smaller | |||
share of congested capacity for themselves. Other network owners | share of congested capacity for themselves. Other network owners | |||
want to invest in large amounts of capacity and make their gains from | want to invest in large amounts of capacity and make their gains from | |||
simplicity of operation and economies of scale. | simplicity of operation and economies of scale. | |||
Re-ECN allows the more conservative networks to police out flows that | re-ECN allows the more conservative networks to police out flows that | |||
have not asked to be unresponsive to congestion---not because they | have not asked to be unresponsive to congestion---not because they | |||
are voice or video---just because they don't respond to congestion. | are voice or video---just because they don't respond to congestion. | |||
But it also allows other networks to choose not to police. | But it also allows other networks to choose not to police. | |||
Crucially, when flows from liberal networks cross into a conservative | Crucially, when flows from liberal networks cross into a conservative | |||
network, re-ECN enables the conservative network to apply penalties | network, re-ECN enables the conservative network to apply penalties | |||
to its neighbouring networks for the congestion they allow to be | to its neighbouring networks for the congestion they allow to be | |||
caused. And these penalties can be applied to bulk data, without | caused. And these penalties can be applied to bulk data, without | |||
regard to flows. | regard to flows. | |||
Then, if unresponsive applications become so dominant that some of | Then, if unresponsive applications become so dominant that some of | |||
the more liberal networks experience congestion collapse [RFC3714], | the more liberal networks experience congestion collapse [RFC3714], | |||
they can change their minds and use re-ECN to apply tighter controls | they can change their minds and use re-ECN to apply tighter controls | |||
in order to bring congestion back under control. | in order to bring congestion back under control. | |||
Re-ECN works by arranging that each packet arrives at each network | re-ECN works by arranging that each packet arrives at each network | |||
element carrying a view of expected congestion on its own downstream | element carrying a view of expected congestion on its own downstream | |||
path, albeit averaged over multiple packets. Most usefully, | path, albeit averaged over multiple packets. Most usefully, | |||
congestion on the remainder of the path becomes visible in the IP | congestion on the remainder of the path becomes visible in the IP | |||
header at the first ingress. Many of the applications of re-ECN | header at the first ingress. Many of the applications of re-ECN | |||
involve a policer at this ingress using the view of downstream | involve a policer at this ingress using the view of downstream | |||
congestion arriving in packets to police or control the packet rate. | congestion arriving in packets to police or control the packet rate. | |||
Importantly, the scheme is recursive: a whole network harbouring | Importantly, the scheme is recursive: a whole network harbouring | |||
users causing congestion in downstream networks can be held | users causing congestion in downstream networks can be held | |||
responsible or policed by its downstream neighbour. | responsible or policed by its downstream neighbour. | |||
This document is structured as follows. First an overview of the re- | This document is structured as follows. First an overview of the re- | |||
ECN protocol is given (Section 3), outlining its attributes and | ECN protocol is given (Section 3), outlining its attributes and | |||
explaining conceptually how it works as a whole. The two main parts | explaining conceptually how it works as a whole. The two main parts | |||
of the document follow, as described above. That is, the protocol | of the document follow. That is, the protocol specification divided | |||
specification divided into transport (Section 4) and network | into transport (Section 4) and network (Section 5) layers which | |||
(Section 5) layers, then the applications it can be put to, such as | contain most of the standards compliance terminology, then the | |||
policing DDoS, QoS and congestion control (Section 6). Although | applications re-ECN can be put to, such as policing DDoS, QoS and | |||
these applications do not require standardisation themselves, they | congestion control (Section 6). Although these applications do not | |||
are described in a fair degree of detail in order to explain how re- | require standardisation themselves, they are described in a fair | |||
ECN can be used. Given re-ECN proposes to use the last undefined bit | degree of detail in order to explain how re-ECN can be used. Given | |||
in the IPv4 header, we felt it necessary to outline the potential | re-ECN proposes to use the last undefined bit in the IPv4 header, we | |||
that re-ECN could release in return for being given that bit. | felt it necessary to outline the potential that re-ECN could release | |||
in return for being given that bit. | ||||
Deployment issues discussed throughout the document are brought | Deployment issues discussed throughout the document are brought | |||
together in Section 7, which is followed by a brief section | together in Section 7, which is followed by a brief section | |||
explaining the somewhat subtle rationale for the design from an | explaining the somewhat subtle rationale for the design from an | |||
architectural perspective (Section 8). We end by describing related | architectural perspective (Section 8). We end by describing related | |||
work (Section 9), listing security considerations (Section 10) and | work (Section 9), listing security considerations (Section 10) and | |||
finally drawing conclusions (Section 12). | finally drawing conclusions (Section 12). | |||
2. Requirements notation | 2. Requirements notation | |||
skipping to change at page 8, line 15 | skipping to change at page 8, line 45 | |||
document considers many cases where malicious nodes may not comply | document considers many cases where malicious nodes may not comply | |||
with the protocol. When such contingencies are described, if any of | with the protocol. When such contingencies are described, if any of | |||
the above keywords are not capitalised, that is deliberate. So, for | the above keywords are not capitalised, that is deliberate. So, for | |||
instance, the following two apparently contradictory sentences would | instance, the following two apparently contradictory sentences would | |||
be perfectly consistent: i) x MUST do this; ii) x may not do this. | be perfectly consistent: i) x MUST do this; ii) x may not do this. | |||
3. Protocol Overview | 3. Protocol Overview | |||
3.1. Background and Applicability | 3.1. Background and Applicability | |||
First we briefly recap the essentials of the ECN protocol [RFC3168]. | The re-ECN protocol makes no changes and has no effect on the TCP | |||
Two bits in the IP protocol (v4 or v6) are assigned to the ECN field. | congestion control algorithm or on other rate responses to | |||
The sender clears the field to "00" (Not-ECT) if either end-point | congestion. Re-ECN is not a new congestion control protocol; rather, | |||
transport is not ECN-capable. Otherwise it indicates an ECN-capable | it is orthogonal to congestion control itself. Re-ECN is concerned | |||
transport (ECT) using either of the two code-points "10" or "01" | with revealing information about congestion so that users and | |||
(ECT(0) and ECT(1) resp.). | networks can be held accountable for the congestion they cause, or | |||
allow to be caused. | ||||
ECN-capable routers probabilistically set "11" if congestion is | Re-ECN builds on ECN so we briefly recap the essentials of the ECN | |||
protocol [RFC3168]. Two bits in the IP protocol (v4 or v6) are | ||||
assigned to the ECN field. The sender clears the field to "00" (Not- | ||||
ECT) if either end-point transport is not ECN-capable. Otherwise it | ||||
indicates an ECN-capable transport (ECT) using either of the two | ||||
code-points "10" or "01" (ECT(0) and ECT(1) resp.). | ||||
ECN-capable queues probabilistically set "11" if congestion is | ||||
experienced (CE), the marking probability increasing with the length | experienced (CE), the marking probability increasing with the length | |||
of the queue at its egress link (typically using the RED | of the queue at its egress link (typically using the RED | |||
algorithm [RFC2309]). However, they still drop rather than mark Not- | algorithm [RFC2309]). However, they still drop rather than mark Not- | |||
ECT packets. With multiple ECN-capable routers on a path, a flow of | ECT packets. With multiple ECN-capable queues on a path, a flow of | |||
packets accumulates the fraction of CE marking that each router adds. | packets accumulates the fraction of CE marking that each queue adds. | |||
The combined effect of the packet marking of all the routers along | The combined effect of the packet marking of all the queues along the | |||
the path signals congestion of the whole path to the receiver. So, | path signals congestion of the whole path to the receiver. So, for | |||
for example, if one router early in a path is marking 1% of packets | example, if one queue early in a path is marking 1% of packets and | |||
and another later in a path is marking 2%, flows that pass through | another later in a path is marking 2%, flows that pass through both | |||
both routers will experience approximately 3% marking (see Appendix A | queues will experience approximately 3% marking (see Appendix A for a | |||
for a precise treatment). | precise treatment). | |||
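The additive figure of about 3% can be checked numerically. The Python sketch below is only an illustration and assumes each queue marks packets independently of the others. Under that assumption, the fraction marked by at least one queue is 1 - (1 - p1)(1 - p2). The function names are hypothetical, and Appendix A remains the precise treatment.

```python
import math


def combined_marking(fractions):
    """Fraction of packets CE-marked by at least one queue on the path,
    assuming (for this sketch) that queues mark independently."""
    return 1.0 - math.prod(1.0 - p for p in fractions)


def additive_marking(fractions):
    """The approximation used in the text: simply sum the fractions."""
    return sum(fractions)


if __name__ == "__main__":
    path = [0.01, 0.02]  # 1% at the earlier queue, 2% at the later one
    print(f"independent: {combined_marking(path):.4f}")  # 0.0298
    print(f"additive:    {additive_marking(path):.4f}")  # 0.0300
```

With 1% and 2%, the independent-marking result is 2.98%, which is why the text calls 3% approximate. The gap, p1*p2, grows as the marking fractions rise.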
The choice of two ECT code-points in the ECN field [RFC3168] | The choice of two ECT code-points in the ECN field [RFC3168] | |||
permitted future flexibility, optionally allowing the sender to | permitted future flexibility, optionally allowing the sender to | |||
encode the experimental ECN nonce [RFC3540] in the packet stream. | encode the experimental ECN nonce [RFC3540] in the packet stream. | |||
The nonce is designed to allow a sender to check the integrity of | The nonce is designed to allow a sender to check the integrity of | |||
congestion feedback. But Section 9.2 explains that it still gives no | congestion feedback. But Section 9.2 explains that it still gives no | |||
control over how fast the sender transmits as a result of the | control over how fast the sender transmits as a result of the | |||
feedback. On the other hand, re-ECN is designed both to ensure that | feedback. On the other hand, re-ECN is designed both to ensure that | |||
congestion is declared honestly and that the sender's rate responds | congestion is declared honestly and that the sender's rate responds | |||
appropriately. | appropriately. | |||
skipping to change at page 9, line 10 | skipping to change at page 9, line 48 | |||
re-inserted or re-echoed feedback. But it actually works even when | re-inserted or re-echoed feedback. But it actually works even when | |||
no feedback is available. In fact it has been carefully designed to | no feedback is available. In fact it has been carefully designed to | |||
work for single datagram flows. It also encourages aggregation of | work for single datagram flows. It also encourages aggregation of | |||
single packet flows by congestion control proxies. Then, even if the | single packet flows by congestion control proxies. Then, even if the | |||
traffic mix of the Internet were to become dominated by short | traffic mix of the Internet were to become dominated by short | |||
messages, it would still be possible to control congestion | messages, it would still be possible to control congestion | |||
effectively and efficiently. | effectively and efficiently. | |||
Changing the Internet's feedback architecture seems to imply | Changing the Internet's feedback architecture seems to imply | |||
considerable upheaval. But re-ECN can be deployed incrementally at | considerable upheaval. But re-ECN can be deployed incrementally at | |||
the transport layer around unmodified routers using existing fields | the transport layer around unmodified queues using existing fields in | |||
in IP (v4 or v6). However it does also require the last undefined | IP (v4 or v6). However it does also require the last undefined bit | |||
bit in the IPv4 header, which it uses in combination with the 2-bit | in the IPv4 header, which it uses in combination with the 2-bit ECN | |||
ECN field to create four new codepoints. Nonetheless, changes to IP | field to create four new codepoints. Nonetheless, we RECOMMEND | |||
routers are RECOMMENDED in order to improve resilience against DoS | adding optional preferential drop to IP queues based on the re-ECN | |||
attacks. Similarly, re-ECN works best if both the sender and | fields in order to improve resilience against DoS attacks. | |||
receiver transports are re-ECN-capable, but it can work with just | Similarly, re-ECN works best if both the sender and receiver | |||
sender support. Section 7.1 summarises the incremental deployment | transports are re-ECN-capable, but it can work with just sender | |||
strategy. | support. Section 7.1 summarises the incremental deployment strategy. | |||
The re-ECN protocol makes no changes and has no effect on the TCP | ||||
congestion control algorithm or on other rate responses to | ||||
congestion. Re-ECN is only concerned with enabling the ingress | ||||
network to police that a source is complying with a congestion | ||||
control algorithm, which is orthogonal to congestion control itself. | ||||
Before re-ECN can be considered worthy of using up the last bit in | Before re-ECN can be considered worthy of using up the last bit in | |||
the IP header, we must be sure that all our claims are robust. We | the IP header, we must be sure that all our claims are robust. We | |||
have gradually been reducing the list of outstanding issues, but the | have gradually been reducing the list of outstanding issues, but the | |||
few that still remain are listed in Section 6.3. We expect new | few that still remain are listed in Section 6.3. We expect new | |||
attacks may still be found, but we offer the re-ECN protocol on the | attacks may still be found, but we offer the re-ECN protocol on the | |||
basis that it is built on fairly solid theoretical foundations and, | basis that it is built on fairly solid theoretical foundations and, | |||
so far, it has proved possible to keep it relatively robust. | so far, it has proved possible to keep it relatively robust. | |||
3.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6) | 3.2. Simplified Re-ECN Protocol | |||
We describe here the simplified re-ECN protocol. In this first | ||||
description we assume packets and segments are synonymous. | ||||
Packets are sent from a sender to a receiver. In Figure 1 the queues | ||||
(Q1 and Q2) are ECN enabled as per RFC 3168 [RFC3168]. If congestion | ||||
occurs then packets are marked with the congestion experienced (CE) | ||||
flag exactly as in the ECN protocol [RFC3168]; the routers do not | ||||
need to be modified and do not need to know the re-ECN protocol. On | ||||
reception of marked packets the receiver notifies the sender of the | ||||
current count of marked packets. Note that this is the number of | ||||
packets marked rather than the setting of the ECE flag in ECN. The | ||||
sender uses this information to re-echo mark packets in exact | ||||
correspondence to the number of CE marked bytes observed at the | ||||
receiver. | ||||
+--------- Feedback----------+ | ||||
| | | ||||
v | | ||||
+---+ +----+ +----+ +---+ | ||||
| | RE | | | | | | | ||||
| S |--->| Q1 |--->| Q2 |--->| R | | ||||
| | | | | | | | | ||||
+---+ +----+ +----+ +---+ | ||||
Figure 1: Simple Re-ECN | ||||
3.3. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6) | ||||
The re-ECN wire protocol uses the two bit ECN field broadly as in | The re-ECN wire protocol uses the two bit ECN field broadly as in | |||
RFC3168 [RFC3168] as described above, but with five differences of | RFC3168 [RFC3168] as described above, but with five differences of | |||
detail (brought together in a list in Section 7.1). This | detail (brought together in a list in Section 7.1). This | |||
specification defines a new re-ECN extension (RE) flag. We will | specification defines a new re-ECN extension (RE) flag. We will | |||
defer the definition of the actual position of the RE flag in the | defer the definition of the actual position of the RE flag in the | |||
IPv4 & v6 headers until Section 5. Until then it will suffice to use | IPv4 & v6 headers until Section 5. When we don't need to choose | |||
an abstraction of the IPv4 and v6 wire protocols by just calling it | between IPv4 and v6 wire protocols, it will suffice to call it the RE | |||
the RE flag. | flag. | |||
Unlike the ECN field, the RE flag is intended to be set by the sender | Unlike the ECN field, the RE flag is intended to be set by the sender | |||
and remain unchanged along the path, although it can be read by | and remain unchanged along the path, although it can be read by | |||
network elements that understand the re-ECN protocol. It is feasible | network elements that understand the re-ECN protocol. It is feasible | |||
that a network element MAY change the setting of the RE flag, perhaps | that a network element MAY change the setting of the RE flag, perhaps | |||
acting as a proxy for an end-point, but such a protocol would have to | acting as a proxy for an end-point, but such a protocol would have to | |||
be defined in another specification (e.g. [Re-PCN]). | be defined in another specification (e.g. [Re-PCN]). | |||
Although the RE flag is a separate, single bit field, it can be read | Although the RE flag is a separate, single bit field, it can be read | |||
as an extension to the two-bit ECN field; the three concatenated bits | as an extension to the two-bit ECN field; the three concatenated bits | |||
in what we will call the extended ECN field (EECN) making eight | in what we will call the extended ECN field (EECN) giving eight | |||
codepoints. We will use the RFC3168 names of the ECN codepoints to | codepoints. We will use the RFC3168 names of the ECN codepoints to | |||
describe settings of the ECN field when the RE flag setting is "don't | describe settings of the ECN field when the RE flag setting is "don't | |||
care", but we also define the following six extended ECN codepoint | care", but we also define the following six extended ECN codepoint | |||
names for when we need to be more specific. | names for when we need to be more specific. | |||
RFC3168 ECN defines uses for all four codepoints of the two-bit ECN | One of re-ECN's codepoints is an alternative use of the codepoint set | |||
field. This memo widens the codepoint space to eight, and uses six | aside in RFC3168 for the ECN nonce (ECT(1)). Transports using re-ECN | |||
codepoints. One of re-ECN's codepoints is an alternative use of the | do not need to use the ECN nonce as long as the sender is also | |||
codepoint set aside in RFC3168 for the ECN nonce (ECT(1)). | checking for transport protocol compliance | |||
Transports not using re-ECN can still use the ECN nonce, while those | [I-D.moncaster-tcpm-rcv-cheat]. The case for doing this is given in | |||
using re-ECN do not need to as long as the sender is also checking | Appendix I. Two re-ECN codepoints are given compatible uses to those | |||
for transport protocol compliance [I-D.moncaster-tcpm-rcv-cheat]. | defined in RFC3168 (Not-ECT and CE). The other codepoint used by | |||
The case for doing this is given in Appendix I. Two re-ECN | RFC3168 (ECT(0)) isn't used for re-ECN. Altogether this leaves one | |||
codepoints are given compatible uses to those defined in RFC3168 | codepoint of the eight unused by ECN or re-ECN and available for | |||
(Not-ECT and CE). The other codepoint used by RFC3168 (ECT(0)) isn't | future use. | |||
used for re-ECN. Altogether this leaves one | ||||
unused and available for future use. | ||||
+-------+------------+------+--------------+------------------------+ | +-------+------------+------+--------------+------------------------+ | |||
| ECN | RFC3168 | RE | Extended ECN | Re-ECN meaning | | | ECN | RFC3168 | RE | Extended ECN | Re-ECN meaning | | |||
| field | codepoint | flag | codepoint | | | | field | codepoint | flag | codepoint | | | |||
+-------+------------+------+--------------+------------------------+ | +-------+------------+------+--------------+------------------------+ | |||
| 00 | Not-ECT | 0 | Not-RECT | Not re-ECN-capable | | | 00 | Not-ECT | 0 | Not-ECT | Not re-ECN-capable | | |||
| | | | | transport | | | | | | | transport | | |||
| 00 | Not-ECT | 1 | FNE | Feedback not | | | 00 | --- | 1 | FNE | Feedback not | | |||
| | | | | established | | | | | | | established | | |||
| 01 | ECT(1) | 0 | Re-Echo | Re-echoed congestion | | | 01 | ECT(1) | 0 | Re-Echo | Re-echoed congestion | | |||
| | | | | and RECT | | | | | | | and RECT | | |||
| 01 | ECT(1) | 1 | RECT | Re-ECN capable | | | 01 | --- | 1 | RECT | Re-ECN capable | | |||
| | | | | transport | | | | | | | transport | | |||
| 10 | ECT(0) | 0 | --- | Legacy ECN use only | | | 10 | ECT(0) | 0 | ECT(0) | RFC3168 ECN use only | | |||
| | | | | | | | | | | | | | |||
| 10 | ECT(0) | 1 | --CU-- | Currently unused | | | 10 | --- | 1 | --CU-- | Currently unused | | |||
| | | | | | | | | | | | | | |||
| 11 | CE | 0 | CE(0) | Re-Echo canceled by | | | 11 | CE | 0 | CE(0) | Re-Echo canceled by | | |||
| | | | | congestion experienced | | | | | | | congestion experienced | | |||
| 11 | CE | 1 | CE(-1) | Congestion experienced | | | 11 | --- | 1 | CE(-1) | Congestion experienced | | |||
+-------+------------+------+--------------+------------------------+ | +-------+------------+------+--------------+------------------------+ | |||
Table 1: Extended ECN Codepoints | Table 1: Extended ECN Codepoints | |||
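Table 1 can be read as a lookup keyed on the two-bit ECN field and the RE flag. The Python sketch below assumes those two values have already been extracted from the header. Where the RE flag sits in IPv4 and IPv6 is deferred to Section 5. The dictionary and function names are hypothetical, and the codepoint names follow the -06 column.

```python
# Extended ECN codepoints from Table 1 (-06 names), keyed by
# (two-bit ECN field, RE flag).
EECN_NAMES = {
    (0b00, 0): "Not-ECT",  # not re-ECN-capable transport
    (0b00, 1): "FNE",      # feedback not established
    (0b01, 0): "Re-Echo",  # re-echoed congestion and RECT
    (0b01, 1): "RECT",     # re-ECN capable transport
    (0b10, 0): "ECT(0)",   # RFC3168 ECN use only
    (0b10, 1): "--CU--",   # currently unused
    (0b11, 0): "CE(0)",    # Re-Echo canceled by congestion experienced
    (0b11, 1): "CE(-1)",   # congestion experienced
}


def eecn_name(ecn_field: int, re_flag: int) -> str:
    """Return the extended ECN codepoint name for an ECN field and RE flag."""
    if not 0 <= ecn_field <= 0b11:
        raise ValueError("ECN field must be a two-bit value")
    if re_flag not in (0, 1):
        raise ValueError("RE flag must be 0 or 1")
    return EECN_NAMES[(ecn_field, re_flag)]


if __name__ == "__main__":
    # A congested queue sets the ECN field to 11 but leaves the RE flag alone.
    for ecn_field, re_flag in [(0b01, 1), (0b01, 0)]:
        print(eecn_name(ecn_field, re_flag), "->", eecn_name(0b11, re_flag))
```

A congested queue rewrites only the ECN field. Setting CE on a RECT packet therefore yields CE(-1), and setting it on a Re-Echo packet yields CE(0).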
3.3. Re-ECN Protocol Operation | 3.4. Re-ECN Protocol Operation | |||
In this section we will give an overview of the operation of the re- | In this section we will give an overview of the operation of the re- | |||
ECN protocol for TCP/IP, leaving a detailed specification to the | ECN protocol for TCP/IP, leaving a detailed specification to the | |||
following sections. Other transports will be discussed later. | following sections. Other transports will be discussed later. | |||
In summary, the protocol adds a third `re-echo' stage to the existing | In summary, the protocol adds a third `re-echo' stage to the existing | |||
TCP/IP ECN protocol. Whenever the network adds CE congestion | TCP/IP ECN protocol. Whenever the network adds CE congestion | |||
signalling to the IP header on the forward data path, the receiver | signalling to the IP header on the forward data path, the receiver | |||
feeds it back to the ingress using TCP, then the sender re-echoes it | feeds it back to the ingress using TCP, then the sender re-echoes it | |||
into the forward data path using the RE flag in the next packet. | into the forward data path using the RE flag in the next packet. | |||
Prior to receiving any feedback a sender will not know which setting | Prior to receiving any feedback a sender will not know which setting | |||
of the RE flag to use, so it sets the feedback not established (FNE) | of the RE flag to use, so it sets the feedback not established (FNE) | |||
codepoint. The network reads the FNE codepoint conservatively as | codepoint. The network reads the FNE codepoint conservatively as | |||
equivalent to re-echoed congestion. | equivalent to re-echoed congestion. | |||
Specifically, once a flow is established, a re-ECN sender always | Specifically, once feedback from a flow is established, a re-ECN | |||
initialises the ECN field to ECT(1). And it usually sets the RE flag | sender always initialises the ECN field to ECT(1). And it usually | |||
to "1". Whenever a router re-marks a packet to CE, the receiver | sets the RE flag to "1". Whenever a queue marks a packet to CE, the | |||
feeds back this event to the sender. On receiving this feedback, the | receiver feeds back this event to the sender. On receiving this | |||
re-ECN sender will clear the RE flag to "0" in the next packet it | feedback, the re-ECN sender will clear the RE flag to "0" in the next | |||
sends. | packet it sends. | |||
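The sender behaviour above can be sketched as a small piece of per-flow state. This is not the TCP specification given later in the document, and the class, method and attribute names are hypothetical. The sketch assumes feedback arrives as a cumulative count of CE-marked octets. The -06 overview of the simplified protocol speaks of a count of marked packets, but also of re-echoing in correspondence with marked bytes, so octets are an assumption here.

```python
from __future__ import annotations

ECT1 = 0b01      # ECN field a re-ECN sender uses once feedback is established
NOT_ECT = 0b00   # ECN field of the FNE codepoint


class ReEcnSender:
    """Hypothetical per-flow sender state for choosing ECN field and RE flag."""

    def __init__(self) -> None:
        self.feedback_established = False
        self.reported_ce_octets = 0   # highest cumulative CE count seen
        self.owed_octets = 0          # CE-marked octets not yet re-echoed

    def on_feedback(self, cumulative_ce_octets: int) -> None:
        """Record receiver feedback; any increase in the count is owed."""
        self.feedback_established = True
        increase = cumulative_ce_octets - self.reported_ce_octets
        if increase > 0:
            self.owed_octets += increase
            self.reported_ce_octets = cumulative_ce_octets

    def next_packet_bits(self, size: int) -> tuple[int, int]:
        """Return (ecn_field, re_flag) for the next packet of `size` octets."""
        if not self.feedback_established:
            return (NOT_ECT, 1)       # FNE: read by the network as +1 worth
        if self.owed_octets > 0:
            # Blank the RE flag to re-echo congestion (Re-Echo, +1 worth).
            self.owed_octets = max(0, self.owed_octets - size)
            return (ECT1, 0)
        return (ECT1, 1)              # RECT: neutral, 0 worth


if __name__ == "__main__":
    sender = ReEcnSender()
    print(sender.next_packet_bits(1500))  # (0, 1): FNE before any feedback
    sender.on_feedback(0)
    print(sender.next_packet_bits(1500))  # (1, 1): RECT
    sender.on_feedback(1500)              # receiver reports 1500 CE octets
    print(sender.next_packet_bits(1500))  # (1, 0): Re-Echo
    print(sender.next_packet_bits(1500))  # (1, 1): back to RECT
```

In this sketch, a Re-Echo packet larger than the outstanding debt simply clears it. An implementation might instead carry the excess forward so that positive and negative octets balance exactly.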
We chose to set and clear the RE flag this way round to ease | We chose to set and clear the RE flag this way round to ease | |||
incremental deployment (see Section 7.1). To avoid confusion we will | incremental deployment (see Section 7.1). To avoid confusion we will | |||
use the term `blanking' (rather than marking) when the RE flag is | use the term `blanking' (rather than marking) when the RE flag is | |||
cleared to "0". So, over a stream of packets, we will talk of the | cleared to "0". So, over a stream of packets, we will talk of the | |||
`RE blanking fraction' as the fraction of octets in packets with the | `RE blanking fraction' as the fraction of octets in packets with the | |||
RE flag cleared to "0". | RE flag cleared to "0". | |||
_ _ _ _ | +---+ +----+ +----+ +---+ | |||
/ \ / \ / \ / \ | | S |--| Q1 |----------------| Q2 |--| R | | |||
| S |--| 0 | - - - - - - - - | i |--| D | | +---+ +----+ +----+ +---+ | |||
\ _ / \ _ / \ _ / \ _ / | ||||
. . . . | . . . . | |||
^ . . . . | ^ . . . . | |||
| . . . . | | . . . . | |||
| . RE blanking fraction . . | | . RE blanking fraction . . | |||
3% |-------------------------------+======= | 3% |-------------------------------+======= | |||
| . . | . | | . . | . | |||
2% | . . | . | 2% | . . | . | |||
| . . CE marking fraction | . | | . . CE marking fraction | . | |||
1% | . +----------------------+ . | 1% | . +----------------------+ . | |||
| . | . . | | . | . . | |||
0% +---------------------------------------> | 0% +---------------------------------------> | |||
^ 0 ^ i ^ resource index | ^ ^ ^ | |||
0 ^ 1 ^ 2 observation points | L M N Observation points | |||
| | | ||||
1.00% 2.00% marking fraction | ||||
Figure 1: A 2-Router Example (Imprecise) | Figure 2: A 2-Queue Example (Imprecise) | |||
Figure 1 uses a simple network to illustrate how re-ECN allows | Figure 2 uses a simple network to illustrate how re-ECN allows queues | |||
routers to measure downstream congestion. The horizontal axis | to measure downstream congestion. The receiver views a CE marking | |||
represents the index of each congestible resource (typically queues) | fraction of 3% which is fed back to the sender. The sender sets an | |||
along a path through the Internet. There may be many routers on the | RE blanking fraction of 3% to match this. This RE blanking fraction | |||
path, but we assume only two are currently congested (those with | can be observed along the path as the RE flag is not changed by | |||
resource index 0 and i). The two superimposed plots show the | network nodes once set by the sender. This is shown by the | |||
fraction of each extended ECN codepoint in a flow observed along this | horizontal line at 3% in the figure. The CE marked fraction is shown | |||
path. Given about 3% of packets reaching the destination are marked | by the stepped line which rises to meet the RE blanking fraction line | |||
CE, in response to feedback the sender will blank the RE flag in | with steps at each queue where packets are marked. Two queues are | |||
about 3% of packets it sends. Then approximate downstream congestion | shown (Q1 and Q2) that are currently congested. Each time packets | |||
can be measured at the observation points shown along the path by | pass through one of these queues, a fraction are marked (1% at Q1 and 2% at Q2). The | |||
subtracting the CE marking fraction from the RE blanking fraction, as | approximate downstream congestion can be measured at the observation | |||
shown in the table below (Appendix A derives these approximations | points shown along the path by subtracting the CE marking fraction | |||
from a precise analysis). | from the RE blanking fraction, as shown in the table below | |||
(Appendix A derives these approximations from a precise analysis). | ||||
+-------------------+------------------------------+ | +-------------------+------------------------------+ | |||
| Observation point | Approx downstream congestion | | | Observation point | Approx downstream congestion | | |||
+-------------------+------------------------------+ | +-------------------+------------------------------+ | |||
| 0 | 3% - 0% = 3% | | | L | 3% - 0% = 3% | | |||
| 1 | 3% - 1% = 2% | | | M | 3% - 1% = 2% | | |||
| 2 | 3% - 3% = 0% | | | N | 3% - 3% = 0% | | |||
+-------------------+------------------------------+ | +-------------------+------------------------------+ | |||
Table 2: Downstream Congestion Measured at Example Observation Points | Table 2: Downstream Congestion Measured at Example Observation Points | |||
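The subtraction in Table 2 is simple enough to state as code. The sketch below hard-codes the fractions from the 2-queue example, using the observation point names L, M and N from the -06 revision. It is illustrative only, and the function name is hypothetical.

```python
def downstream_congestion(re_blanking, upstream_ce):
    """Approximate downstream congestion at an observation point: the RE
    blanking fraction (set by the sender and unchanged along the path)
    minus the CE marking fraction accumulated upstream of the point."""
    return re_blanking - upstream_ce


if __name__ == "__main__":
    re_blanking = 0.03  # sender matches the 3% CE fraction fed back to it
    upstream_ce = {"L": 0.00, "M": 0.01, "N": 0.03}  # CE seen so far
    for point, ce in upstream_ce.items():
        print(point, f"{downstream_congestion(re_blanking, ce):.0%}")
```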
All along the path, whole-path congestion remains unchanged so it can | All along the path, whole-path congestion remains unchanged so it can | |||
be used as a reference against which to compare upstream congestion. | be used as a reference against which to compare upstream congestion. | |||
The difference predicts downstream congestion for the rest of the | The difference predicts downstream congestion for the rest of the | |||
path. Therefore, measuring the fractions of each codepoint at any | path. Therefore, measuring the fractions of each codepoint at any | |||
point in the Internet will reveal upstream, downstream and whole path | point in the Internet will reveal upstream, downstream and whole path | |||
congestion. | congestion. | |||
Note that we have introduced discussion of marking and blanking | Note that we have introduced discussion of marking and blanking | |||
fractions solely for illustration. To be absolutely clear, these | fractions solely for illustration. To be absolutely clear, for TCP | |||
fractions are averages that would result from the behaviour of a TCP | these fractions are averages that would result from the behaviour of | |||
protocol handler mechanically blanking outgoing packets in direct | the protocol handler mechanically blanking outgoing packets in direct | |||
response to incoming feedback---we are not saying any protocol | response to incoming feedback---we are not saying any protocol | |||
handler works with these average fractions directly. | handler has to work with these average fractions directly. | |||
3.4. Informal Terminology | 3.5. Informal Terminology | |||
In the rest of this memo we will loosely talk of positive or negative | In the rest of this memo we will loosely talk of positive or negative | |||
flows, meaning flows where the moving average of the downstream | flows, meaning flows where the moving average of the downstream | |||
congestion metric is persistently positive or negative. The notion | congestion metric is persistently positive or negative. A negative | |||
of a negative metric arises because it is derived by subtracting one | flow is one where more CE marked packets than re-ECN blanked packets | |||
metric from another. Of course actual downstream congestion cannot | arrive. Likewise in positive flows more re-ECN blanked packets | |||
be negative, only the metric can (whether due to time lags or | arrive than CE marked packets. The notion of a negative metric | |||
deliberate malice). | arises because it is derived by subtracting one metric from another. | |||
Of course actual downstream congestion cannot be negative, only the | ||||
metric can (whether due to time lags or deliberate malice). | ||||
Just as we will loosely talk of positive and negative flows, we will | Just as we will loosely talk of positive and negative flows, we will | |||
also talk of positive or negative packets, meaning packets that | also talk of positive or negative packets, meaning packets that | |||
contribute positively or negatively to the downstream congestion | contribute positively or negatively to the downstream congestion | |||
metric. | metric. | |||
Therefore we will talk of packets having `worth' of +1, 0 or -1, | Therefore we will talk of packets having `worth' of +1, 0 or -1, | |||
which, when multiplied by their size, indicates their contribution to | which, when multiplied by their size, indicates their contribution to | |||
the downstream congestion metric. | the downstream congestion metric. | |||
Figure 2 shows the main state transitions of the system once a flow | The idea is that most packets start with zero worth. Every time the | |||
is established, showing the worth of packets in each state. When the | network decrements the worth of a packet, the sender increments the | |||
network congestion marks a packet it decrements its worth (moving | worth of a later packet. Then, over time, as many positive octets | |||
from the left of the main square to the right). When the sender | should arrive at the receiver as negative. Note we have said octets | |||
blanks the RE flag in order to re-echo congestion it increments the | not packets, so if packets are of different sizes, the worth should | |||
worth of a packet (moving from the bottom of the main square to the | be incremented on enough octets to balance the octets in negative | |||
top). | packets arriving at the receiver. It is this balance that will allow | |||
the network to hold the sender accountable for the congestion it | ||||
Sender state Sent Worth Received Worth | causes. | |||
packet packet | ||||
+----------------------------------------------------+ | ||||
| ^ | ||||
V | | ||||
Congestion echoed -->Re-Echo +1 --+---> CE(0) 0 --+ | ||||
(positive) | (canceled) | | ||||
V network | | ||||
| congestion | | ||||
| | | ||||
Flow established --> RECT 0 ----+-> CE(-1) -1 --+ | ||||
^ (neutral) | | (negative) | ||||
| | | | ||||
| no V V | ||||
| congestion | | | ||||
+-----------<--------------+-+ | ||||
Figure 2: Re-ECN System State Diagram (bootstrap not shown) | ||||
The idea is that every time the network decrements the worth of a | ||||
packet, the sender increments the worth of a later packet. Then, | ||||
over time, as many positive octets should arrive at the receiver as | ||||
negative. Note we have said octets not packets, so if packets are of | ||||
different sizes, the worth should be incremented on enough octets to | ||||
balance the octets in negative packets arriving at the receiver. It | ||||
is this balance that will allow the network to hold the sender | ||||
accountable for the congestion it causes, as we shall see. The | ||||
informal outline below uses TCP as an example transport, but the idea | ||||
would be broadly similar for any transport that adapts its rate to | ||||
congestion. | ||||
We will start with the sender in `flow established' state. Normally, | ||||
as acknowledgements of earlier packets arrive that don't feedback any | ||||
congestion, the congestion window can be opened, so the sender goes | ||||
round the smaller sub-loop, sending RECT packets (worth 0) and | ||||
returning to the flow established state to send another one. If a | ||||
router congestion marks one of the packets, it decrements the | ||||
packet's worth. The sender will have been continuing to traverse | ||||
round the smaller feedback loop every time acknowledgements arrive. | ||||
But when congestion feedback returns from this packet that was marked | ||||
with -1 worth (the largest loop in the figure) the sender jumps to | ||||
the congestion echoed state in order to re-echo the congestion, | ||||
incrementing the worth of the next packet to +1 by blanking its RE | ||||
flag. The sender then returns to the flow established state and | ||||
continues round the smaller loop, sending packets worth 0. Note that | ||||
the size of the loops is just an artefact of the figure; it is not | ||||
meant to imply that one loop is slower than the other - they are both | ||||
the same end to end feedback loop. | ||||
If a packet carrying re-echoed congestion happens to also be | If a packet carrying re-echoed congestion happens to also be | |||
congestion marked, the +1 worth added by the sender will be cancelled | congestion marked, the +1 worth added by the sender will be cancelled | |||
out by the -1 network congestion marking. Although the two worth | out by the -1 network congestion marking. Although the two worth | |||
values correctly cancel out, neither the congestion marking nor the | values correctly cancel out, neither the congestion marking nor the | |||
re-echoed congestion are lost, because the RE bit and the ECN field | re-echoed congestion are lost, because the RE bit and the ECN field | |||
are orthogonal. So, whenever this happens, the receiver will | are orthogonal. So, whenever this happens, the receiver will | |||
correctly detect and re-echo the new congestion event as well (the | correctly detect and re-echo the new congestion event as well. | |||
top sub-loop). When we need to distinguish, we will sometimes call a | ||||
packet marked RECT 'neutral' (0 worth), while we will call the CE(0) | ||||
marking 'canceled' (also 0 worth). If a re-echoed packet isn't | ||||
unlucky enough to be further congestion marked, the sender will | ||||
return to the flow established state and continue to send RECT | ||||
packets (worth 0). | ||||
The table below specifies unambiguously the worth of each extended | The table below specifies unambiguously the worth of each extended | |||
ECN codepoint. Note the order is different from the previous table | ECN codepoint. Note the order is different from the previous table | |||
to better show how the worth increments and decrements. The FNE | to better show how the worth increments and decrements. The FNE | |||
codepoint is an exception. It is used in the flow bootstrap process | codepoint is used in the flow bootstrap process (explained later) and | |||
(explained later) and has the same positive (+1) worth as a packet | has the same positive (+1) worth as a packet with the Re-Echo | |||
with the Re-Echo codepoint. | codepoint. | |||
+--------+------+----------------+-------+--------------------------+ | +--------+------+----------------+-------+--------------------------+ | |||
| ECN | RE | Extended ECN | Worth | Re-ECN meaning | | | ECN | RE | Extended ECN | Worth | Re-ECN meaning | | |||
| field | bit | codepoint | | | | | field | bit | codepoint | | | | |||
+--------+------+----------------+-------+--------------------------+ | +--------+------+----------------+-------+--------------------------+ | |||
| 00 | 0 | Not-RECT | ... | Not re-ECN-capable | | | 00 | 0 | Not-RECT | ... | Not re-ECN-capable | | |||
| | | | | transport | | | | | | | transport | | |||
| 00 | 1 | FNE | +1 | Feedback not established | | ||||
| 01 | 0 | Re-Echo | +1 | Re-echoed congestion and | | | 01 | 0 | Re-Echo | +1 | Re-echoed congestion and | | |||
| | | | | RECT | | | | | | | RECT | | |||
| 10 | 0 | --- | ... | Legacy ECN use only | | | 10 | 0 | --- | ... | RFC3168 ECN use only | | |||
| 11 | 0 | CE(0) | 0 | Re-Echo canceled by | | | 11 | 0 | CE(0) | 0 | Re-Echo canceled by | | |||
| | | | | congestion experienced | | | | | | | congestion experienced | | |||
| 00 | 1 | FNE | +1 | Feedback not established | | ||||
| 01 | 1 | RECT | 0 | Re-ECN capable transport | | | 01 | 1 | RECT | 0 | Re-ECN capable transport | | |||
| 10 | 1 | --CU-- | ... | Currently unused | | | 10 | 1 | --CU-- | ... | Currently unused | | |||
| | | | | | | | | | | | | | |||
| 11 | 1 | CE(-1) | -1 | Congestion experienced | | | 11 | 1 | CE(-1) | -1 | Congestion experienced | | |||
+--------+------+----------------+-------+--------------------------+ | +--------+------+----------------+-------+--------------------------+ | |||
Table 3: 'Worth' of Extended ECN Codepoints | Table 3: 'Worth' of Extended ECN Codepoints | |||
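As an illustration, Table 3 can be expressed as a lookup keyed by (ECN field, RE bit). This is a hypothetical encoding: the names EECN, worth and ce_mark are ours, not the draft's. It assumes a congestion-marking router rewrites only the ECN field to 11 and leaves the RE bit untouched. Because the RE bit and the ECN field are orthogonal, a Re-Echo packet that is CE marked becomes CE(0) rather than losing either signal.

```python
# Extended ECN codepoints from Table 3, keyed by (ECN field, RE bit).
# Worth is None where the table gives "..." (no defined worth).
EECN = {
    ("00", 0): ("Not-RECT", None),
    ("00", 1): ("FNE", +1),
    ("01", 0): ("Re-Echo", +1),
    ("10", 0): ("---", None),      # RFC3168 ECN use only
    ("11", 0): ("CE(0)", 0),
    ("01", 1): ("RECT", 0),
    ("10", 1): ("--CU--", None),   # currently unused
    ("11", 1): ("CE(-1)", -1),
}


def worth(ecn: str, re_bit: int):
    """Return the worth of an extended ECN codepoint, or None if undefined."""
    return EECN[(ecn, re_bit)][1]


def ce_mark(ecn: str, re_bit: int):
    """Sketch of congestion marking: ECN field set to 11, RE bit left alone."""
    return "11", re_bit
```

For example, ce_mark("01", 0) turns a Re-Echo packet (+1) into CE(0) (0), so the -1 of the mark cancels the +1 of the re-echo. Marking a RECT packet likewise moves it from 0 to CE(-1).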
4. Transport Layers | 4. Transport Layers | |||
4.1. TCP | 4.1. TCP | |||
Re-ECN capability at the sender is essential. At the receiver it is | Re-ECN capability at the sender is essential. At the receiver it is | |||
optional, as long as the receiver has a basic (`vanilla flavour') | optional, as long as the receiver has a basic RFC3168-compliant ECN- | |||
RFC3168-compliant ECN-capable transport (ECT) [RFC3168]. Given re- | capable transport (ECT) [RFC3168]. Given re-ECN is not the first | |||
ECN is not the first attempt to define the semantics of the ECN | attempt to define the semantics of the ECN field, we give a table | |||
field, we give a table below summarising what happens for various | below summarising what happens for various combinations of | |||
combinations of capabilities of the sender S and receiver R, as | capabilities of the sender S and receiver R, as indicated in the | |||
indicated in the first four columns below. The last column gives the | first four columns below. The last column gives the mode a half- | |||
mode a half-connection should be in after the first two of the three | connection should be in after the first two of the three TCP | |||
TCP handshakes. | handshakes. | |||
+--------+--------------+------------+---------+--------------------+ | +--------+--------------+------------+---------+--------------------+ | |||
| Re-ECT | ECT-Nonce | ECT | Not-ECT | S-R | | | Re-ECT | ECT-Nonce | ECT | Not-ECT | S-R | | |||
| | (RFC3540) | (RFC3168) | | Half-connection | | | | (RFC3540) | (RFC3168) | | Half-connection | | |||
| | | | | Mode | | | | | | | Mode | | |||
+--------+--------------+------------+---------+--------------------+ | +--------+--------------+------------+---------+--------------------+ | |||
| SR | | | | RECN | | | SR | | | | RECN | | |||
| S | R | | | RECN-Co | | | S | R | | | RECN-Co | | |||
| S | | R | | RECN-Co | | | S | | R | | RECN-Co | | |||
| S | | | R | Not-ECT | | | S | | | R | Not-ECT | | |||
skipping to change at page 16, line 32 | skipping to change at page 16, line 37 | |||
Table 4: Modes of TCP Half-connection for Combinations of ECN | Table 4: Modes of TCP Half-connection for Combinations of ECN | |||
Capabilities of Sender S and Receiver R | Capabilities of Sender S and Receiver R | |||
We will describe what happens in each mode, then describe how they | We will describe what happens in each mode, then describe how they | |||
are negotiated. The abbreviations for the modes in the above table | are negotiated. The abbreviations for the modes in the above table | |||
mean: | mean: | |||
RECN: Full re-ECN capable transport | RECN: Full re-ECN capable transport | |||
RECN-Co: Re-ECN sender in compatibility mode with a | RECN-Co: Re-ECN sender in compatibility mode with an RFC3168 | |||
vanilla [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable | compliant [RFC3168] ECN receiver or an [RFC3540] ECN nonce-capable | |||
receiver. Implementation of this mode is OPTIONAL. | receiver. Implementation of this mode is OPTIONAL. | |||
Not-ECT: Not ECN-capable transport, as defined in [RFC3168] for when | Not-ECT: Not ECN-capable transport, as defined in [RFC3168] for when | |||
at least one of the transports does not understand even basic ECN | at least one of the transports does not understand even basic ECN | |||
marking. | marking. | |||
Note that we use the term Re-ECT for a host transport that is re-ECN- | Note that we use the term Re-ECT for a host transport that is re-ECN- | |||
capable but RECN for the modes of the half connections between hosts | capable but RECN for the modes of the half connections between hosts | |||
when they are both Re-ECT. If a host transport is Re-ECT, this fact | when they are both Re-ECT. If a host transport is Re-ECT, this fact | |||
alone does NOT imply either of its half connections will necessarily | alone does NOT imply either of its half connections will necessarily | |||
be in RECN mode, at least not until it has confirmed that the other | be in RECN mode, at least not until it has confirmed that the other | |||
host is Re-ECT. | host is Re-ECT. | |||
4.1.1. RECN mode: Full re-ECN capable transport | 4.1.1. RECN mode: Full Re-ECN capable transport | |||
In full RECN mode, for each half connection, both the sender and the | In full RECN mode, for each half connection, both the sender and the | |||
receiver each maintain an unsigned integer counter we will call ECC | receiver each maintain an unsigned integer counter we will call ECC | |||
(echo congestion counter). The receiver maintains a count, modulo 8, | (echo congestion counter). The receiver maintains a count of how | |||
of how many times a CE marked packet has arrived during the half- | many times a CE marked packet has arrived during the half-connection. | |||
connection. Once a RECN connection is established, the three TCP | Once a RECN connection is established, the three TCP option flags | |||
option flags (ECE, CWR & NS) used for ECN-related functions in other | (ECE, CWR & NS) used for ECN-related functions in other versions of | |||
versions of ECN are used as a 3-bit field for the receiver to | ECN are used as a 3-bit field for the receiver to repeatedly tell the | |||
repeatedly tell the sender the current value of ECC whenever it sends | sender the current value of ECC, modulo 8, whenever it sends a TCP | |||
a TCP ACK. We will call this the echo congestion increment (ECI) | ACK. We will call this the echo congestion increment (ECI) field. | |||
field. This overloaded use of these 3 option flags as one 3-bit ECI | This overloaded use of these 3 option flags as one 3-bit ECI field is | |||
field is shown in Figure 4. The actual definition of the TCP header, | shown in Figure 4. The actual definition of the TCP header, | |||
including the addition of support for the ECN nonce, is shown for | including the addition of support for the ECN nonce, is shown for | |||
comparison in Figure 3. This specification does not redefine the | comparison in Figure 3. This specification does not redefine the | |||
names of these three TCP option flags, it merely overloads them with | names of these three TCP option flags, it merely overloads them with | |||
another definition once a flow is established. | another definition once a flow is established. | |||
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | |||
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | |||
| | | N | C | E | U | A | P | R | S | F | | | | | N | C | E | U | A | P | R | S | F | | |||
| Header Length | Reserved | S | W | C | R | C | S | S | Y | I | | | Header Length | Reserved | S | W | C | R | C | S | S | Y | I | | |||
| | | | R | E | G | K | H | T | N | N | | | | | | R | E | G | K | H | T | N | N | | |||
skipping to change at page 17, line 41 | skipping to change at page 17, line 47 | |||
| | | | G | K | H | T | N | N | | | | | | G | K | H | T | N | N | | |||
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | |||
Figure 4: Definition of the ECI field within bytes 13 and 14 of the | Figure 4: Definition of the ECI field within bytes 13 and 14 of the | |||
TCP Header, overloading the current definitions above for established | TCP Header, overloading the current definitions above for established | |||
RECN flows. | RECN flows. | |||
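A minimal sketch of how the overloaded flags might be read and written as one 3-bit ECI field. It assumes bytes 13 and 14 of the TCP header are handled as a single big-endian 16-bit word, with bits numbered 0 (leftmost) to 15 as in Figures 3 and 4. NS is bit 7, CWR bit 8 and ECE bit 9, and bit 9 is the least significant bit of ECI. The constant and function names are hypothetical.

```python
# Bit 9 (counting 0..15 from the left) has weight 2**6 in a 16-bit word,
# so the NS, CWR and ECE flags occupy the three bits starting at shift 6.
ECI_SHIFT = 6
ECI_MASK = 0x7 << ECI_SHIFT


def get_eci(word: int) -> int:
    """Extract the 3-bit ECI value (NS, CWR, ECE) from bytes 13-14."""
    return (word & ECI_MASK) >> ECI_SHIFT


def set_eci(word: int, eci: int) -> int:
    """Overwrite NS, CWR and ECE with a 3-bit ECI value; other bits unchanged."""
    if not 0 <= eci <= 7:
        raise ValueError("ECI is a 3-bit field")
    return (word & ~ECI_MASK & 0xFFFF) | (eci << ECI_SHIFT)
```

For example, a header word with a header length of 5 and the ACK flag set (0x5010) becomes 0x5150 when it carries ECI = 5.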
Receiver Action in RECN Mode | Receiver Action in RECN Mode | |||
Every time a CE marked packet arrives at a receiver in RECN mode, | Every time a CE marked packet arrives at a receiver in RECN mode, | |||
the receiver transport increments its local value of ECC modulo 8 | the receiver transport increments its local value of ECC and MUST | |||
and MUST echo its value to the sender in the ECI field of the next | echo its value, modulo 8, to the sender in the ECI field of the | |||
ACK. It MUST repeat the same value of ECI in every subsequent ACK | next ACK. It MUST repeat the same value of ECI in every | |||
until the next CE event, when it increments ECI again. | subsequent ACK until the next CE event, when it increments ECI | |||
again. | ||||
The increment of the local ECC values is modulo 8 so the field | The increment of the local ECC values is modulo 8 so the field | |||
value simply wraps round back to zero when it overflows. The | value simply wraps round back to zero when it overflows. The | |||
least significant bit is to the right (labelled bit 9). | least significant bit is to the right (labelled bit 9). | |||
A receiver in RECN mode MAY delay the echo of a CE to the next | A receiver in RECN mode MAY delay the echo of a CE to the next | |||
delayed-ACK, which would be necessary if ACK-withholding were | delayed-ACK, which would be necessary if ACK-withholding were | |||
implemented. | implemented. | |||
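The receiver behaviour above can be sketched as follows. This follows the -06 wording: ECC is kept as a plain count and only the echoed ECI value is reduced modulo 8. The class and method names are illustrative, and delayed or withheld ACKs are not modelled.

```python
class RecnReceiver:
    """Receiver side of one RECN-mode half-connection (illustrative sketch)."""

    def __init__(self) -> None:
        self.ecc = 0  # count of CE marked packets received so far

    def on_data_packet(self, ce_marked: bool) -> None:
        if ce_marked:
            self.ecc += 1

    def eci_for_ack(self) -> int:
        # The same value is repeated on every ACK until the next CE event;
        # reducing modulo 8 makes the 3-bit field wrap round to zero.
        return self.ecc % 8
```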
Sender Action in RECN Mode | Sender Action in RECN Mode | |||
On the arrival of every ACK, the sender compares the ECI field | On the arrival of every ACK, the sender compares the ECI field | |||
with its own ECC value, then replaces its local value with that | with its own ECC value, then replaces its local value with that | |||
from the ACK. The difference D is assumed to be the number of CE | from the ACK. The difference D (D = (ECI + 8 - (ECC mod 8)) mod 8) | |||
marked packets that arrived at the receiver since it sent the | is assumed to be the number of CE marked packets that arrived at | |||
previously received ACK (but see below for the sender's safety | the receiver since it sent the previously received ACK (but see | |||
strategy). Whenever the ECI field increments by D (and/or d drops | below for the sender's safety strategy). Whenever the ECI field | |||
are detected), the sender MUST clear the RE flag to "0" in the IP | increments by D (and/or d drops are detected), the sender MUST | |||
header of the next D' data packets it sends (where D' = D + d), | clear the RE flag to "0" in the IP header of the next D' data | |||
effectively re-echoing each single increment of ECI. Otherwise | packets it sends (where D' = D + d), effectively re-echoing each | |||
the data sender MUST send all data packets with RE set to "1". | single increment of ECI. Otherwise the data sender MUST send all | |||
data packets with RE set to "1". | ||||
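A sketch of the sender side, computing D from successive ECI values and queuing D' = D + d packets with RE cleared. Two points are assumptions not stated in the text: re-echoes owed from consecutive ACKs simply accumulate, and the caller supplies the number of newly detected drops. The safety strategy of Section 4.1.1.2 and the ECT(1) setting of the ECN field are left out. The names are hypothetical.

```python
class RecnSender:
    """Sender side of one RECN-mode half-connection (illustrative sketch)."""

    def __init__(self) -> None:
        self.ecc = 0          # last ECI value received from the receiver (0..7)
        self.pending_re0 = 0  # data packets still to be sent with RE = 0

    def on_ack(self, eci: int, drops_detected: int = 0) -> int:
        """Process one ACK and return D' = D + d, the re-echoes it adds."""
        d_marks = (eci + 8 - (self.ecc % 8)) % 8
        self.ecc = eci
        d_prime = d_marks + drops_detected
        self.pending_re0 += d_prime
        return d_prime

    def next_re_bit(self) -> int:
        """RE flag for the next data packet: 0 to re-echo, otherwise 1."""
        if self.pending_re0 > 0:
            self.pending_re0 -= 1
            return 0
        return 1
```

If ECI moves from 3 to 1, the sender infers D = (1 + 8 - 3) mod 8 = 6 new CE marks, because the field is assumed to have wrapped.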
As a general rule, once a flow is established, as well as setting | As a general rule, once a flow is established, as well as setting | |||
or clearing the RE flag as above, a data sender in RECN mode MUST | or clearing the RE flag as above, a data sender in RECN mode MUST | |||
always set the ECN field to ECT(1). However, the settings of the | always set the ECN field to ECT(1). However, the settings of the | |||
extended ECN field during flow start are defined in Section 4.1.4. | extended ECN field during flow start are defined in Section 4.1.4. | |||
As we have already emphasised, the re-ECN protocol makes no | As we have already emphasised, the re-ECN protocol makes no | |||
changes and has no effect on the TCP congestion control algorithm. | changes and has no effect on the TCP congestion control algorithm. | |||
So, each increment of ECI (or detection of a drop) also triggers | So, the first increment of ECI (or detection of a drop) in an RTT | |||
the standard TCP congestion response, but with no more than one | triggers the standard TCP congestion response, with no more than one | |||
congestion response per round trip, as usual. | congestion response per round trip, as usual. However, the sender | |||
re-echoes every increment of ECI irrespective of RTTs. | ||||
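The distinction between the once-per-round-trip congestion response and the per-increment re-echo can be sketched as below. The time-based RTT check is a simplifying assumption. A real TCP stack would use its existing recovery-point logic rather than a timer. The class and attribute names are hypothetical.

```python
from typing import Optional


class CongestionSignalHandler:
    """Separates the per-RTT congestion response from per-increment re-echo."""

    def __init__(self, rtt: float) -> None:
        self.rtt = rtt
        self.last_response: Optional[float] = None  # time of last response
        self.responses = 0       # congestion control reactions taken
        self.re_echoes_owed = 0  # packets still to be sent with RE = 0

    def on_congestion(self, now: float, increments: int) -> None:
        # Every ECI increment (or drop) is re-echoed, irrespective of RTTs.
        self.re_echoes_owed += increments
        # Congestion control reacts at most once per round trip.
        if self.last_response is None or now - self.last_response >= self.rtt:
            self.responses += 1
            self.last_response = now
```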
A TCP sender also acts as the receiver for the other half- | A TCP sender also acts as the receiver for the other half- | |||
connection. The host will maintain two ECC values S.ECC and R.ECC | connection. The host will maintain two ECC values S.ECC and R.ECC | |||
as sender and receiver respectively. Every TCP header sent by a | as sender and receiver respectively. Every TCP header sent by a | |||
host in RECN mode will also repeat the prevailing value of R.ECC | host in RECN mode will also repeat the prevailing value of R.ECC | |||
in its ECI field. If a sender in RECN mode has to retransmit a | in its ECI field. If a sender in RECN mode has to retransmit a | |||
packet due to a suspected loss, the re-transmitted packet MUST | packet due to a suspected loss, the re-transmitted packet MUST | |||
carry the latest prevailing value of R.ECC when it is re- | carry the latest prevailing value of R.ECC when it is re- | |||
transmitted, which will not necessarily be the one it carried | transmitted, which will not necessarily be the one it carried | |||
originally. | originally. | |||
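A minimal sketch of a host keeping S.ECC and R.ECC. Here the ECI value is stamped at each (re)transmission, so a retransmitted segment carries the prevailing R.ECC and not the value it was first sent with. The Segment and RecnHost types are hypothetical, and S.ECC is shown only for completeness.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int
    payload: bytes
    eci: int = 0  # filled in each time the segment is (re)transmitted


@dataclass
class RecnHost:
    s_ecc: int = 0  # ECC for the half-connection this host sends on
    r_ecc: int = 0  # ECC for the half-connection this host receives on

    def on_ce_received(self) -> None:
        self.r_ecc += 1

    def transmit(self, seg: Segment) -> Segment:
        # Every header carries the prevailing R.ECC (mod 8), so a
        # retransmission is re-stamped rather than reusing its old ECI.
        seg.eci = self.r_ecc % 8
        return seg
```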
4.1.1.1. Drops and Marks | 4.1.1.1. Drops and Marks | |||
Re-ECN is based on the ECN protocol [RFC3168] which in turn is | Re-ECN is based on the ECN protocol [RFC3168]. In turn, the | |||
typically based on the RED algorithm [RFC2309]. This algorithm marks | congestion markings ECN uses are typically based on the RED | |||
packets as CE with a probability that increases as the size of the | algorithm [RFC2309]. This algorithm marks packets as CE with a | |||
router queue increases. Howeverif the queue becomes too full then it | probability that increases as the size of the router queue increases. | |||
will revert to dropping packets. Because of this it is important | However, if the queue becomes too full then it will revert to | |||
that re-ECN treats each packet drop it detects as if it were actually | dropping packets. Because of this it is important that a re-ECN | |||
a CE mark. This ensures that it can continue to correctly echo | sender treats each packet drop it detects as if it were actually a CE | |||
congestion even through a highly congested path. | mark. This ensures that it can continue to correctly echo congestion | |||
even through a highly congested path. | ||||
In order to ensure that drops are correctly echoed the sender needs | In order to ensure that drops are correctly echoed the sender needs | |||
to add the number of drops detected per RTT to the difference in ECI | to add the number of drops detected per RTT to the difference in ECI | |||
value waiting to be echoed. A drop is defined as set out in | value waiting to be echoed. Drop detection is defined as set out in | |||
[RFC2581] -- if the connection is in slow start then a single | [RFC2581] -- if the connection is in slow start then a single | |||
duplicate acknowledgement will be treated as an indication of a drop. | duplicate acknowledgement will be treated as an indication of a drop. | |||
When the system is in the congestion avoidance stage then 3 duplicate | When the system is in the congestion avoidance stage then 3 duplicate | |||
acknowledgements will be treated as a sign of a drop. In all cases, | acknowledgements will be treated as a sign of a drop. In all cases, | |||
if a re-transmission time-out occurs then that will be treated as a | if a re-transmission time-out occurs then that will be treated as a | |||
drop. | drop. | |||
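The drop-handling rule can be sketched as two small helpers. They follow the rule exactly as worded in this section: one duplicate ACK in slow start, three in congestion avoidance, and any retransmission time-out. A real stack would take these signals from its existing loss-recovery code. The function names are hypothetical.

```python
def is_drop(dup_acks: int, in_slow_start: bool, rto_expired: bool) -> bool:
    """Apply this section's drop-detection rule, as worded above (sketch)."""
    if rto_expired:
        return True
    threshold = 1 if in_slow_start else 3
    return dup_acks >= threshold


def increments_to_echo(eci_increase: int, drops_this_rtt: int) -> int:
    """Drops are treated as CE marks, so they add to the ECI difference."""
    return eci_increase + drops_this_rtt
```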
4.1.1.2. Safety against Long Pure ACK Loss Sequences | 4.1.1.2. Safety against Long Pure ACK Loss Sequences | |||
The ECI method was chosen for echoing congestion marking because a | The ECI method was chosen for echoing congestion marking because a | |||
skipping to change at page 20, line 4 | skipping to change at page 20, line 12 | |||
previous ACK but with a sequence number unchanged from the previously | previous ACK but with a sequence number unchanged from the previously | |||
received ACK, it SHOULD conservatively assume that the ECI field | received ACK, it SHOULD conservatively assume that the ECI field | |||
incremented by D' = L - ((L-D) mod 8), where D is the apparent | incremented by D' = L - ((L-D) mod 8), where D is the apparent | |||
increase in the ECI field. For example if the ACK arriving after 9 | increase in the ECI field. For example if the ACK arriving after 9 | |||
pure ACK losses apparently increased ECI by 2, the assumed increment | pure ACK losses apparently increased ECI by 2, the assumed increment | |||
of ECI would still be 2. But if ECI apparently increased by 2 after | of ECI would still be 2. But if ECI apparently increased by 2 after | |||
11 pure ACK losses, ECI should be assumed to have increased by 10. | 11 pure ACK losses, ECI should be assumed to have increased by 10. | |||
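The conservative assumption D' = L - ((L-D) mod 8) can be checked against the worked examples above. The definition of L falls in text this comparison skips. Here L is taken as the number of consecutive lost pure ACKs, which matches the examples. The fallback for L < D is our assumption, because the formula would otherwise go negative. The function name is hypothetical.

```python
def assumed_eci_increment(lost_pure_acks: int, apparent_increase: int) -> int:
    """Conservative ECI increment D' = L - ((L - D) mod 8) (sketch)."""
    L, D = lost_pure_acks, apparent_increase
    if not 0 <= D <= 7:
        raise ValueError("apparent ECI increase must be in 0..7")
    if L < D:
        # Assumption: the formula would go negative here, so fall back to D.
        return D
    return L - ((L - D) % 8)
```

With L = 9 and D = 2 this gives 9 - 7 = 2. With L = 11 and D = 2 it gives 11 - 1 = 10, matching the text.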
A re-ECN sender MAY implement a heuristic algorithm to predict beyond | A re-ECN sender MAY implement a heuristic algorithm to predict beyond | |||
reasonable doubt that the ECI field probably did not wrap within a | reasonable doubt that the ECI field probably did not wrap within a | |||
sequence of lost pure ACKs. But such an algorithm is NOT REQUIRED. | sequence of lost pure ACKs. But such an algorithm is OPTIONAL. Such | |||
Such an algorithm MUST NOT be used unless it is proven to work even | an algorithm MUST NOT be used unless it is proven to work even in the | |||
in the presence of correlation between high ACK loss rate on the back | presence of correlation between high ACK loss rate on the back | |||
channel and high CE marking rate on the forward channel. | channel and high CE marking rate on the forward channel. | |||
Whatever assumption a re-ECN sender makes about potentially lost CE | Whatever assumption a re-ECN sender makes about potentially lost CE | |||
marks, both its congestion control and its re-echoing behaviour | marks, both its congestion control and its re-echoing behaviour | |||
SHOULD be consistent with the assumption it makes. | SHOULD be consistent with the assumption it makes. | |||
4.1.2. RECN-Co mode: Re-ECT Sender with a Vanilla or Nonce ECT Receiver | 4.1.2. RECN-Co mode: Re-ECT Sender with an RFC3168 compliant ECN | |||
Receiver | ||||
If the half-connection is in RECN-Co mode, ECN feedback proceeds no | If the half-connection is in RECN-Co mode, ECN feedback proceeds no | |||
differently to that of vanilla ECN. In other words, the receiver | differently to that of RFC3168 compliant ECN. In other words, the | |||
sets the ECE flag repeatedly in the TCP header and the sender | receiver sets the ECE flag repeatedly in the TCP header and the | |||
responds by setting the CWR flag. Although RECN-Co mode is used when | sender responds by setting the CWR flag. Although RECN-Co mode is | |||
the receiver has not implemented the re-ECN protocol, the sender can | used when the receiver has not implemented the re-ECN protocol, the | |||
infer enough from its vanilla ECN feedback to set or clear the RE | sender can infer enough from its RFC3168 compliant ECN feedback to | |||
flag reasonably well. Specifically, every time the receiver toggles | set or clear the RE flag reasonably well. Specifically, every time | |||
the ECE field from "0" to "1" (or a loss is detected), as well as | the receiver toggles the ECE field from "0" to "1" (or a loss is | |||
setting CWR in the TCP flags, the re-ECN sender MUST blank the RE | detected), as well as setting CWR in the TCP flags, the re-ECN sender | |||
flag of the next packet to "0" as it would do in full RECN mode. | MUST blank the RE flag of the next packet to "0" as it would do in | |||
Otherwise, the data sender SHOULD send all other packets with RE set | full RECN mode. Otherwise, the data sender SHOULD send all other | |||
to "1". Once a flow is established, a re-ECN data sender in RECN-Co | packets with RE set to "1". Once a flow is established, a re-ECN | |||
mode MUST always set the ECN field to ECT(1). | data sender in RECN-Co mode MUST always set the ECN field to ECT(1). | |||
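A sketch of how a RECN-Co sender might infer re-echoes from RFC3168 feedback. Each 0-to-1 transition of ECE, or each detected loss, blanks RE on one subsequent packet, and the ECN field is always ECT(1). Counting a transition and a loss on the same ACK as two events is our assumption. Setting CWR is not shown, and the names are hypothetical.

```python
from typing import Tuple

ECT1 = "01"  # ECN field value for ECT(1)


class RecnCoSender:
    """Re-ECN sender in RECN-Co mode with an RFC3168 receiver (sketch)."""

    def __init__(self) -> None:
        self.last_ece = 0     # ECE value seen on the previous ACK
        self.pending_re0 = 0  # packets still to be sent with RE blanked

    def on_ack(self, ece: int, loss_detected: bool = False) -> None:
        rising_edge = ece == 1 and self.last_ece == 0
        # Assumption: an ECE rising edge and a loss on one ACK count twice.
        self.pending_re0 += int(rising_edge) + int(loss_detected)
        self.last_ece = ece

    def next_packet_fields(self) -> Tuple[str, int]:
        """Return (ECN field, RE flag) for the next data packet."""
        if self.pending_re0 > 0:
            self.pending_re0 -= 1
            return ECT1, 0
        return ECT1, 1
```

ECE that stays at 1 across several ACKs produces only one blanked packet. This is why a second CE mark within a round trip can be missed.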
If a CE marked packet arrives at the receiver within a round trip | If a CE marked packet arrives at the receiver within a round trip | |||
time of a previous mark, the receiver will still be echoing ECE for | time of a previous mark, the receiver will still be echoing ECE for | |||
the last CE mark. Therefore, such a mark will be missed by the | the last CE mark. Therefore, such a mark will be missed by the | |||
sender. Of course, this isn't of concern for congestion control, but | sender. Of course, this isn't of concern for congestion control, but | |||
it does mean that very occasionally the RE blanking fraction will be | it does mean that very occasionally the RE blanking fraction will be | |||
understated. Therefore flows in RECN-Co mode may occasionally be | understated. Therefore flows in RECN-Co mode may occasionally be | |||
mistaken for very lightly cheating flows and consequently might | mistaken for very lightly cheating flows and consequently might | |||
suffer a small number of packet drops through an egress dropper | suffer a small number of packet drops through an egress dropper | |||
(Section 6.1.4). We expect re-ECN would be deployed for some time | (Section 6.1.4). We expect re-ECN would be deployed for some time | |||
before policers and droppers start to enforce it. So, given there is | before policers and droppers start to enforce it. So, given there is | |||
not much ECN deployment yet anyway, this minor problem may affect | not much ECN deployment yet anyway, this minor problem may affect | |||
only a very small proportion of flows, reducing to nothing over the | only a very small proportion of flows, reducing to nothing over the | |||
years as vanilla ECN hosts upgrade. The use of RECN-Co mode would | years as RFC3168 compliant ECN hosts upgrade. The use of RECN-Co | |||
need to be reviewed in the light of experience at the time of re-ECN | mode would need to be reviewed in the light of experience at the time | |||
deployment. | of re-ECN deployment. | |||
RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their | RECN-Co mode is OPTIONAL. Re-ECN implementers who want to keep their | |||
code simple MAY choose not to implement this mode. If they do not, | code simple MAY choose not to implement this mode. If they do not, | |||
a re-ECN sender SHOULD fall back to vanilla ECT mode in the presence | a re-ECN sender SHOULD fall back to RFC3168 compliant ECT mode in the | |||
of an ECN-capable receiver. It MAY choose to fall back to the ECT- | presence of an ECN-capable receiver. It MAY choose to fall back to | |||
Nonce mode, but if re-ECN implementers don't want to be bothered with | the ECT-Nonce mode, but if re-ECN implementers don't want to be | |||
RECN-Co mode, they probably won't want to add an ECT-Nonce mode | bothered with RECN-Co mode, they probably won't want to add an ECT- | |||
either. | Nonce mode either. | |||
4.1.2.1. Re-ECN support for the ECN Nonce | 4.1.2.1. Re-ECN support for the ECN Nonce | |||
A TCP half-connection in RECN-Co mode MUST NOT support the ECN | A TCP half-connection in RECN-Co mode MUST NOT support the ECN | |||
Nonce [RFC3540]. This means that the sending code of a re-ECN | Nonce [RFC3540]. This means that the sending code of a re-ECN | |||
implementation will never need to include ECN Nonce support. Re-ECN | implementation will never need to include ECN Nonce support. Re-ECN | |||
is intended to provide wider protection than the ECN nonce against | is intended to provide wider protection than the ECN nonce against | |||
congestion control misbehaviour, and re-ECN only requires support | congestion control misbehaviour, and re-ECN only requires support | |||
from the sender, therefore it is preferable to specifically rule out | from the sender, therefore it is preferable to specifically rule out | |||
the need for dual sender implementations. As a consequence, a re-ECN | the need for dual sender implementations. As a consequence, a re-ECN | |||
capable sender will never set ECT(0), so it will be easier for | capable sender will never set ECT(0), so it will be easier for | |||
network elements to discriminate re-ECN traffic flows from other ECN | network elements to discriminate re-ECN traffic flows from other ECN | |||
traffic, which will always contain some ECT(0) packets. | traffic, which will always contain some ECT(0) packets. | |||
However, a re-ECN implementation MAY OPTIONALLY include receiving | However, a re-ECN implementation MAY OPTIONALLY include receiving | |||
code that complies with the ECN Nonce protocol when interacting with | code that complies with the ECN Nonce protocol when interacting with | |||
a sender that supports the ECN nonce (rather than re-ECN), but this | a sender that supports the ECN nonce (rather than re-ECN), but this | |||
support is NOT REQUIRED. | support is not required. | |||
RFC3540 allows an ECN nonce sender to choose whether to sanction a | RFC3540 allows an ECN nonce sender to choose whether to sanction a | |||
receiver that does not ever set the nonce sum. Given re-ECN is | receiver that does not ever set the nonce sum. Given re-ECN is | |||
intended to provide wider protection than the ECN nonce against | intended to provide wider protection than the ECN nonce against | |||
congestion control misbehaviour, implementers of re-ECN receivers MAY | congestion control misbehaviour, implementers of re-ECN receivers MAY | |||
choose not to implement backwards compatibility with the ECN nonce | choose not to implement backwards compatibility with the ECN nonce | |||
capability. This may be because they deem that the risk of sanctions | capability. This may be because they deem that the risk of sanctions | |||
is low, perhaps because significant deployment of the ECN nonce seems | is low, perhaps because significant deployment of the ECN nonce seems | |||
unlikely at implementation time. | unlikely at implementation time. | |||
4.1.3. Capability Negotiation | 4.1.3. Capability Negotiation | |||
During the TCP hand-shake at the start of a connection, an originator | During the TCP hand-shake at the start of a connection, an originator | |||
of the connection (host A) with a re-ECN-capable transport MUST | of the connection (host A) with a re-ECN-capable transport MUST | |||
indicate it is Re-ECT by setting the TCP options NS=1, CWR=1 and | indicate it is Re-ECT by setting the TCP flags NS=1, CWR=1 and ECE=1 | |||
ECE=1 in the initial SYN. | in the initial SYN. | |||
A responding Re-ECT host (host B) MUST return a SYN ACK with flags | A responding Re-ECT host (host B) MUST return a SYN ACK with flags | |||
CWR=1 and ECE=0. The responding host MUST NOT set this combination | CWR=1 and ECE=0. The responding host MUST NOT set this combination | |||
of flags unless the preceding SYN has already indicated Re-ECT | of flags unless the preceding SYN has already indicated Re-ECT | |||
support as above. A Re-ECT server (B) can use either setting of the | support as above. Normally a Re-ECT server (B) will reply to a Re- | |||
NS flag combined with this type of SYN ACK in response to a SYN from | ECT client with NS=0, but if the initial SYN from Re-ECT client A is | |||
a Re-ECT client (A). Normally a Re-ECT server will reply to a Re-ECT | marked CE(-1), a Re-ECT server B MUST increment its local value of | |||
client with NS=0, but in the special circumstance below it can return | ECC. But B cannot reflect the value of ECC in the SYN ACK, because | |||
a SYN ACK with NS=1. | it is still using the 3 bits to negotiate connection capabilities. | |||
So, server B MUST set the alternative TCP header flags in its SYN | ||||
If the initial SYN from Re-ECT client A is marked CE(-1), a Re-ECT | ACK: NS=1, CWR=1 and ECE=0. | |||
server B MUST increment its local value of ECC. But B cannot reflect | ||||
the value of ECC in the SYN ACK, because it is still using the 3 bits | ||||
to negotiate connection capabilities. So, server B MUST set the | ||||
alternative TCP header flags in its SYN ACK: NS=1, CWR=1 and ECE=0. | ||||
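The responder's side of the negotiation can be sketched as a function that returns the SYN ACK flags and the updated ECC. The sketch only covers a SYN that advertises Re-ECT. The handshakes for the other ECN flavours (Table 5) are not modelled, and the Flags type and function name are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Flags(NamedTuple):
    ns: int
    cwr: int
    ece: int


RE_ECT_SYN = Flags(ns=1, cwr=1, ece=1)  # host A's SYN advertising Re-ECT


def re_ect_syn_ack(syn: Flags, syn_marked_ce_minus1: bool,
                   ecc: int) -> Tuple[Optional[Flags], int]:
    """Return (SYN ACK flags, new ECC) for a Re-ECT responder B (sketch)."""
    if syn != RE_ECT_SYN:
        # B MUST NOT send the Re-ECT SYN ACK combination in this case.
        return None, ecc
    if syn_marked_ce_minus1:
        # ECC cannot be echoed yet, so NS=1 signals the CE(-1) mark instead.
        return Flags(ns=1, cwr=1, ece=0), ecc + 1
    return Flags(ns=0, cwr=1, ece=0), ecc
```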
These handshakes are summarised in Table 5 below, with X meaning | These handshakes are summarised in Table 5 below, with X indicating | |||
`don't care'. The handshakes used for the other flavours of ECN are | NS can be either 0 or 1 depending on whether congestion had been | |||
experienced. The handshakes used for the other flavours of ECN are | ||||
also shown for comparison. To compress the width of the table, the | also shown for comparison. To compress the width of the table, the | |||
headings of the first four columns have been severely abbreviated, as | headings of the first four columns have been severely abbreviated, as | |||
follows: | follows: | |||
R: *R*e-ECT | R: *R*e-ECT | |||
N: ECT-*N*once (RFC3540) | N: ECT-*N*once (RFC3540) | |||
E: *E*CT (RFC3168) | E: *E*CT (RFC3168) | |||
skipping to change at page 22, line 47 | skipping to change at page 23, line 6 | |||
Responder (B) | Responder (B) | |||
As soon as a re-ECN capable TCP server receives a SYN, it MUST set | As soon as a re-ECN capable TCP server receives a SYN, it MUST set | |||
its two half-connections into the modes given in Table 5. As soon as | its two half-connections into the modes given in Table 5. As soon as | |||
a re-ECN capable TCP client receives a SYN ACK, it MUST set its two | a re-ECN capable TCP client receives a SYN ACK, it MUST set its two | |||
half-connections into the modes given in Table 5. The half- | half-connections into the modes given in Table 5. The half- | |||
connections will remain in these modes for the rest of the | connections will remain in these modes for the rest of the | |||
connection, including for the third segment of TCP's three-way hand- | connection, including for the third segment of TCP's three-way hand- | |||
shake (the ACK). | shake (the ACK). | |||
{ToDo: Consider SYNs within a connection.} | {ToDo: Consider RSTs within a connection.} | |||
Recall that, if the SYN ACK reflects the same flag settings as the | Recall that, if the SYN ACK reflects the same flag settings as the | |||
preceding SYN (because there is a broken legacy implementation that | preceding SYN (because there is a broken RFC3168 compliant | |||
behaves this way), RFC3168 specifies that the whole connection MUST | implementation that behaves this way), RFC3168 specifies that the | |||
revert to Not-ECT. | whole connection MUST revert to Not-ECT. | |||
Also note that, whenever the SYN flag of a TCP segment is set | Also note that, whenever the SYN flag of a TCP segment is set | |||
(including when the ACK flag is also set), the NS, CWR and ECE flags | (including when the ACK flag is also set), the NS, CWR and ECE flags | |||
MUST NOT be interpreted as the 3-bit ECI value, which is only set as | (i.e. the ECI field of the SYN ACK) MUST NOT be interpreted as the | |||
a copy of the local ECC value in non-SYN packets. | 3-bit ECI value, which is only set as a copy of the local ECC value | |||
in non-SYN packets. | ||||
4.1.4. Extended ECN (EECN) Field Settings during Flow Start or after | 4.1.4. Extended ECN (EECN) Field Settings during Flow Start or after | |||
Idle Periods | Idle Periods | |||
If the originator (A) of a TCP connection supports re-ECN it MUST set | If the originator (A) of a TCP connection supports re-ECN it MUST set | |||
the extended ECN (EECN) field in the IP header of the initial SYN | the extended ECN (EECN) field in the IP header of the initial SYN | |||
packet to the feedback not established (FNE) codepoint. | packet to the feedback not established (FNE) codepoint. | |||
FNE is a new extended ECN codepoint defined by this specification | FNE is a new extended ECN codepoint defined by this specification | |||
(Section 3.2). The feedback not established (FNE) codepoint is used | (Section 3.3). The feedback not established (FNE) codepoint is used | |||
when the transport does not have the benefit of ECN feedback so it | when the transport does not have the benefit of ECN feedback so it | |||
cannot decide whether to set or clear the RE flag. | cannot decide whether to set or clear the RE flag. | |||
If after receiving a SYN the server B has set its sending half- | If after receiving a SYN the server B has set its sending half- | |||
connection into RECN mode or RECN-Co mode, it MUST set the extended | connection into RECN mode or RECN-Co mode, it MUST set the extended | |||
ECN field in the IP header of its SYN ACK to the feedback not | ECN field in the IP header of its SYN ACK to the feedback not | |||
established (FNE) codepoint. Note the careful wording here, which | established (FNE) codepoint. Note the careful wording here, which | |||
means that Re-ECT server B MUST set FNE on a SYN ACK whether it is | means that Re-ECT server B MUST set FNE on a SYN ACK whether it is | |||
responding to a SYN from a Re-ECT client or from a client that is | responding to a SYN from a Re-ECT client or from a client that is | |||
merely ECN-capable. | merely ECN-capable. This is because FNE indicates the transport is | |||
ECN capable. | ||||
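The handshake rules above, together with the rule that flags on a SYN are not an ECI value, can be sketched as follows. This is a minimal illustration rather than the draft's reference implementation. The EECN value is modelled as a hypothetical (ECN field, RE flag) tuple, the mode strings are illustrative, and the order in which NS, CWR and ECE are packed into the 3-bit ECI value is an assumption made only for this sketch.

```python
from typing import Optional, Tuple

# Hypothetical representation: extended ECN (EECN) = (2-bit ECN field, RE flag)
EECN = Tuple[int, int]
FNE: EECN = (0b00, 1)       # feedback not established
NOT_RECT: EECN = (0b00, 0)  # Not-ECT with RE cleared

RE_ECN_MODES = {"RECN", "RECN-Co"}  # illustrative mode names


def initial_syn_eecn(client_is_re_ecn_capable: bool) -> EECN:
    """A re-ECN client MUST send its initial SYN as FNE."""
    return FNE if client_is_re_ecn_capable else NOT_RECT


def syn_ack_eecn(server_sending_mode: str) -> EECN:
    """A server whose sending half-connection is in RECN or RECN-Co mode MUST
    send FNE, whether the client is re-ECN capable or merely ECN-capable."""
    return FNE if server_sending_mode in RE_ECN_MODES else NOT_RECT


def eci_value(syn: bool, ns: bool, cwr: bool, ece: bool) -> Optional[int]:
    """Return the 3-bit ECI value, or None on any segment with SYN set.

    Assumption (not from the draft): NS is the most significant bit, ECE the least.
    """
    if syn:  # includes the SYN ACK: these flags are not an ECI value
        return None
    return (int(ns) << 2) | (int(cwr) << 1) | int(ece)
```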
The original ECN specification [RFC3168] required SYNs and SYN ACKs | The original ECN specification [RFC3168] required SYNs and SYN ACKs | |||
to use the Not-ECT codepoint of the ECN field. The aim was to | to use the Not-ECT codepoint of the ECN field. The aim was to | |||
prevent well-known DoS attacks such as SYN flooding being able to | prevent well-known DoS attacks such as SYN flooding being able to | |||
gain from the advantage that ECN capability afforded over drop at | gain from the advantage that ECN capability afforded over drop at | |||
ECN-capable routers. | ECN-capable routers. | |||
For a SYN ACK, Kuzmanovic [I-D.ietf-tcpm-ecnsyn] has shown that this | For a SYN ACK, Kuzmanovic [I-D.ietf-tcpm-ecnsyn] has shown that this | |||
caution was unnecessary, and proposes to allow a SYN ACK to be ECN- | caution was unnecessary, and proposes to allow a SYN ACK to be ECN- | |||
capable to improve performance. We have gone further by proposing to | capable to improve performance. By stipulating the FNE codepoint for | |||
make the initial SYN ECN-capable too. By stipulating the FNE | the initial SYN, we comply with RFC3168 in word but not in spirit, | |||
codepoint for the initial SYN, we comply with RFC3168 in word but not | because we have indeed set the ECN field to Not-ECT, but we have | |||
in spirit, because we have indeed set the ECN field to Not-ECT, but | extended the ECN field with another bit. And it will be seen | |||
we have extended the ECN field with another bit. And it will be seen | ||||
(Section 5.3) that we have defined one setting of that bit to mean an | (Section 5.3) that we have defined one setting of that bit to mean an | |||
ECN-capable transport. Therefore, by proposing that the FNE | ECN-capable transport. Therefore, by proposing that the FNE | |||
codepoint MUST be used on the initial SYN of a connection, we have | codepoint MUST be used on the initial SYN of a connection, we have | |||
(deliberately) made the initial SYN ECN-capable. Section 5.4 | gone further by proposing to make the initial SYN ECN-capable too. | |||
justifies deciding to make the initial SYN ECN-capable. | Section 5.4 justifies deciding to make the initial SYN ECN-capable. | |||
Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will | Once a TCP half connection is in RECN mode or RECN-Co mode, FNE will | |||
have already been set on the initial SYN and possibly the SYN ACK as | have already been set on the initial SYN and possibly the SYN ACK as | |||
above. But each re-ECN sender will have to set FNE cautiously on a | above. But each re-ECN sender will have to set FNE cautiously on a | |||
few data packets as well, given a number of packets will usually have | few data packets as well, given a number of packets will usually have | |||
to be sent before sufficient congestion feedback is received. The | to be sent before sufficient congestion feedback is received. The | |||
behaviour will be different depending on the mode of the half- | behaviour will be different depending on the mode of the half- | |||
connection: | connection: | |||
RECN mode: Given the constraints on TCP's initial window [RFC3390] | RECN mode: Given the constraints on TCP's initial window [RFC3390] | |||
and its exponential window increase during slow start | and its exponential window increase during slow start | |||
phase [RFC2581], it turns out that the sender SHOULD set FNE on | phase [RFC2581], it turns out that the sender SHOULD set FNE on | |||
the first and third data packets in its flow, assuming equal sized | the first and third data packets in its flow after the initial | |||
data packets once a flow is established. Appendix D presents the | 3-way handshake, assuming equal sized data packets once a flow is | |||
calculation that led to this conclusion. Below, after running | established. Appendix D presents the calculation that led to this | |||
through the start of an example TCP session, we give the intuition | conclusion. Below, after running through the start of an example | |||
learned from that calculation. | TCP session, we give the intuition learned from that calculation. | |||
RECN-Co mode: A re-ECT sender that switches into re-ECN | RECN-Co mode: A re-ECT sender that switches into re-ECN | |||
compatibility mode or into Not-ECT mode (because it has detected | compatibility mode or into Not-ECT mode (because it has detected | |||
the corresponding host is not re-ECN capable) MUST limit its | the corresponding host is not re-ECN capable) MUST limit its | |||
initial window to 1 segment. The reasoning behind this constraint | initial window to 1 segment. The reasoning behind this constraint | |||
is given in Section 5.4. Having set this initial window, a re-ECN | is given in Section 5.4. Having set this initial window, a re-ECN | |||
sender in RECN-Co mode SHOULD set FNE on the first and third data | sender in RECN-Co mode SHOULD set FNE on the first and third data | |||
packets in a flow, as for RECN mode. | packets in a flow, as for RECN mode. | |||
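As a rough sketch of the two modes above, assuming equal-sized data packets and using hypothetical names throughout, a sender's initial window and its choice of which data packets carry FNE might look like this. The RFC 3390 formula is the standard upper bound on TCP's initial window; whether an implementation uses that bound or something smaller is outside this sketch.

```python
from enum import Enum


class Mode(Enum):
    RECN = "RECN"
    RECN_CO = "RECN-Co"


# Data packets are numbered from 1, counting only data segments sent
# after the three-way handshake (illustrative numbering).
FNE_DATA_PACKETS = frozenset({1, 3})


def rfc3390_initial_window(mss: int) -> int:
    """RFC 3390 upper bound on TCP's initial window, in bytes."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window(mode: Mode, mss: int) -> int:
    """A sender in RECN-Co mode MUST limit its initial window to 1 segment."""
    if mode is Mode.RECN_CO:
        return mss
    return rfc3390_initial_window(mss)


def should_set_fne(data_packet_number: int) -> bool:
    """In both modes the sender SHOULD set FNE on the 1st and 3rd data packets."""
    return data_packet_number in FNE_DATA_PACKETS
```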
+----+------+----------------+-------+-------+---------------+------+ | +----+------+----------------+-------+-------+---------------+------+ | |||
skipping to change at page 25, line 21 | skipping to change at page 25, line 49 | |||
(EECN) field. | (EECN) field. | |||
Also shown on the receiving side of the table is the value of the | Also shown on the receiving side of the table is the value of the | |||
receiver's echo congestion counter (R.ECC) after processing the | receiver's echo congestion counter (R.ECC) after processing the | |||
incoming EECN header. Note that, once a host sets a half-connection | incoming EECN header. Note that, once a host sets a half-connection | |||
into RECN mode, it MUST initialise its local value of ECC to zero. | into RECN mode, it MUST initialise its local value of ECC to zero. | |||
The intuition that Appendix D gives for why a sender should set FNE | The intuition that Appendix D gives for why a sender should set FNE | |||
on the first and third data packets is as follows. At line 13, a | on the first and third data packets is as follows. At line 13, a | |||
packet sent by B is shown with an '*', which means it has been | packet sent by B is shown with an '*', which means it has been | |||
congestion marked by an intermediate router from RECT to CE(-1). On | congestion marked by an intermediate queue from RECT to CE(-1). On | |||
receiving this CE marked packet, client A increments its ECC counter | receiving this CE marked packet, client A increments its ECC counter | |||
to 1 as shown. This was the 7th data packet B sent, but before | to 1 as shown. This was the 7th data packet B sent, but before | |||
feedback about this event returns to B, it might well have sent many | feedback about this event returns to B, it might well have sent many | |||
more packets. Indeed, during exponential slow start, about as many | more packets. Indeed, during exponential slow start, about as many | |||
packets will be in flight (unacknowledged) as have been acknowledged. | packets will be in flight (unacknowledged) as have been acknowledged. | |||
So, when the feedback from the congestion event on B's 7th segment | So, when the feedback from the congestion event on B's 7th segment | |||
returns, B will have sent about 7 further packets that will still be | returns, B will have sent about 7 further packets that will still be | |||
in flight. At that stage, B's best estimate of the network's packet | in flight. At that stage, B's best estimate of the network's packet | |||
marking fraction will be 1/7. So, as B will have sent about 14 | marking fraction will be 1/7. So, as B will have sent about 14 | |||
packets, it should have already marked 2 of them as FNE in order to | packets, it should have already marked 2 of them as FNE in order to | |||
skipping to change at page 26, line 19 | skipping to change at page 26, line 46 | |||
that the design of network policers can be deterministic, this | that the design of network policers can be deterministic, this | |||
specification deliberately puts an absolute lower limit on how long a | specification deliberately puts an absolute lower limit on how long a | |||
connection can be idle before the packet that resumes the connection | connection can be idle before the packet that resumes the connection | |||
must be set to FNE, rather than relating it to the connection round | must be set to FNE, rather than relating it to the connection round | |||
trip time. We use the lower bound of the retransmission timeout | trip time. We use the lower bound of the retransmission timeout | |||
(RTO) [RFC2988], which is commonly used as the idle period before TCP | (RTO) [RFC2988], which is commonly used as the idle period before TCP | |||
must reduce to the restart window [RFC2581]. Note our specification | must reduce to the restart window [RFC2581]. Note our specification | |||
of re-ECN's idle period is NOT intended to change the idle period for | of re-ECN's idle period is NOT intended to change the idle period for | |||
TCP's restart, nor indeed for any other purposes. | TCP's restart, nor indeed for any other purposes. | |||
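A minimal sketch of this idle rule follows, assuming the fixed threshold is exactly RFC 2988's 1-second lower bound on the RTO, and assuming that an idle period equal to the bound counts (the comparison at the boundary is a choice made here). It deliberately ignores the connection's measured round trip time.

```python
# RFC 2988: if a computed RTO is below 1 second it SHOULD be rounded up to 1 s.
RTO_LOWER_BOUND_S = 1.0


def resuming_packet_needs_fne(last_send_time_s: float, now_s: float) -> bool:
    """True if the packet that resumes the connection must be set to FNE.

    The threshold is an absolute bound rather than a multiple of the RTT,
    so that the design of network policers can be deterministic.
    """
    idle_s = now_s - last_send_time_s
    return idle_s >= RTO_LOWER_BOUND_S
```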
{ToDo: Describe how the sender falls back to legacy modes if packets | {ToDo: Describe how the sender falls back to RFC3168 modes if packets | |||
don't appear to be getting through (to work round firewalls | don't appear to be getting through (to work round firewalls | |||
discarding packets they consider unusual).} | discarding packets they consider unusual).} | |||
4.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs | 4.1.5. Pure ACKS, Retransmissions, Window Probes and Partial ACKs | |||
A re-ECN sender MUST clear the RE flag to "0" and set the ECN field | A re-ECN sender MUST clear the RE flag to "0" and set the ECN field | |||
to Not-ECT in pure ACKs, retransmissions and window probes, as | to Not-ECT in pure ACKs, retransmissions and window probes, as | |||
specified in [RFC3168]. Our eventual goal is for all packets to be | specified in [RFC3168]. Our eventual goal is for all packets to be | |||
sent with re-ECN enabled, and we believe the semantics of the ECI | sent with re-ECN enabled, and we believe the semantics of the ECI | |||
field go a long way towards being able to achieve this. However, we | field go a long way towards being able to achieve this. However, we | |||
skipping to change at page 26, line 46 | skipping to change at page 27, line 28 | |||
general principle we work to is to remain compatible with TCP's | general principle we work to is to remain compatible with TCP's | |||
congestion control which is driven by congestion events at packet | congestion control which is driven by congestion events at packet | |||
granularity while at the same time aiming to blank the RE flag on at | granularity while at the same time aiming to blank the RE flag on at | |||
least as many octets in a flow as have been marked CE. | least as many octets in a flow as have been marked CE. | |||
Therefore, a re-ECN TCP receiver MUST increment its ECC value as many | Therefore, a re-ECN TCP receiver MUST increment its ECC value as many | |||
times as CE marked packets have been received. And that value MUST | times as CE marked packets have been received. And that value MUST | |||
be echoed to the sender in the first available ACK using the ECI | be echoed to the sender in the first available ACK using the ECI | |||
field. This ensures the TCP sender's congestion control receives | field. This ensures the TCP sender's congestion control receives | |||
timely feedback on congestion events at the same packet granularity | timely feedback on congestion events at the same packet granularity | |||
that they were generated on congested routers. | that they were generated on congested queues. | |||
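The receiver's side of this rule can be sketched as below. The class and method names are hypothetical, and because the ECI field is only 3 bits wide the sketch assumes the counter is echoed modulo 8.

```python
class ReEcnReceiver:
    """Hypothetical receiver state for one half-connection in RECN mode."""

    ECI_MODULUS = 8  # ECI is a 3-bit field; wrap-around is an assumption

    def __init__(self) -> None:
        self.ecc = 0  # MUST be initialised to zero on entering RECN mode

    def on_data_packet(self, ce_marked: bool) -> None:
        # One increment per CE-marked packet received.
        if ce_marked:
            self.ecc += 1

    def eci_for_next_ack(self) -> int:
        # Copied into the ECI field of the first available ACK.
        return self.ecc % self.ECI_MODULUS
```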
Then, a re-ECN sender stores the difference D between its own ECC | Then, a re-ECN sender stores the difference D between its own ECC | |||
value and the incoming ECI field by incrementing a counter R. Then, R | value and the incoming ECI field by incrementing a counter R. Then, R | |||
is decremented by 1 each subsequent packet that is sent with the RE | is decremented by 1 each subsequent packet that is sent with the RE | |||
flag blanked, until R is no longer positive. Using this technique, | flag blanked, until R is no longer positive. Using this technique, | |||
whenever a re-ECN transport sends a not re-ECN capable (NRECN) packet | whenever a re-ECN transport sends a not re-ECN capable packet (e.g. a | |||
(e.g. a retransmission), the remaining packets required to have the | retransmission), the remaining packets required to have the RE flag | |||
RE flag blanked will be automatically carried over to subsequent | blanked will be automatically carried over to subsequent packets, | |||
packets, through the variable R. | through the variable R. | |||
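The sender-side bookkeeping can be sketched as follows. The names are hypothetical. The sketch assumes the sender keeps a copy of the last ECI value it received and computes D modulo 8 because ECI is 3 bits wide, which only counts correctly if fewer than 8 CE marks are reported between successive ACKs. Blanking RE on an ECT(1) packet yields the Re-Echo codepoint.

```python
class ReEcnSender:
    """Hypothetical sender state tracking how many packets still need RE blanked."""

    ECI_MODULUS = 8  # assumption: 3-bit ECI, so differences wrap modulo 8

    def __init__(self) -> None:
        self.ecc = 0  # sender's copy of the last ECI value received
        self.r = 0    # packets still owed with the RE flag blanked

    def on_ack(self, eci: int) -> None:
        d = (eci - self.ecc) % self.ECI_MODULUS  # new CE marks since last ACK
        self.ecc = eci
        self.r += d

    def blank_re_on_next_packet(self, re_ecn_capable: bool) -> bool:
        """Decide whether the next packet is sent with RE blanked.

        Packets that are not re-ECN capable (e.g. retransmissions, sent as
        Not-ECT) leave R untouched, so the debt carries over automatically.
        """
        if re_ecn_capable and self.r > 0:
            self.r -= 1
            return True
        return False
```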
This does not ensure precisely the same number of octets have RE | This does not ensure precisely the same number of octets have RE | |||
blanked as were CE marked. But we believe positive errors will | blanked as were CE marked. But we believe positive errors will | |||
cancel negative over a long enough period. {ToDo: However, more | cancel negative over a long enough period. {ToDo: However, more | |||
research is needed to prove whether this is so. If it is not, it may | research is needed to prove whether this is so. If it is not, it may | |||
be necessary to increment and decrement R in octets rather than | be necessary to increment and decrement R in octets rather than | |||
packets, by incrementing R as the product of D and the size in octets | packets, by incrementing R as the product of D and the size in octets | |||
of packets being sent (typically the MSS).} | of packets being sent (typically the MSS).} | |||
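The octet-based alternative floated in the ToDo above might look like the following. It is only a sketch of that suggestion: the class name is hypothetical, and letting R go negative (so an over-blanked packet is credited against later ones) is a design choice made here so that positive and negative errors cancel, not something the draft specifies.

```python
class OctetCountingR:
    """Hypothetical octet-based variant of the sender's R counter."""

    def __init__(self, mss: int) -> None:
        self.mss = mss
        self.r_octets = 0

    def on_feedback(self, d: int) -> None:
        # Increment R by D times the size of packets being sent (the MSS).
        self.r_octets += d * self.mss

    def blank_re(self, packet_len: int) -> bool:
        """Blank RE while R is positive; any overshoot is kept as negative credit."""
        if self.r_octets > 0:
            self.r_octets -= packet_len
            return True
        return False
```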
4.2. Other Transports | 4.2. Other Transports | |||
4.2.1. General Guidelines for Adding Re-ECN to Other Transports | 4.2.1. General Guidelines for Adding Re-ECN to Other Transports | |||
Re-ECT sender transports that have established the receiver transport | As a general rule, Re-ECT sender transports that have established that the | |||
is at least ECN-capable (not necessarily re-ECN capable) MUST blank | receiver transport is at least ECN-capable (not necessarily re-ECN | |||
the RE codepoint in packets carrying at least as many octets as | capable) MUST blank the RE codepoint for at least as many octets as | |||
arrive at receiver with the CE codepoint set. Re-ECN-capable sender | arrive at the receiver with the CE codepoint set. Re-ECN-capable sender | |||
transports should always initialise the ECN field to the ECT(1) | transports should always initialise the ECN field to the ECT(1) | |||
codepoint once a flow is established. | codepoint once a flow is established. | |||
If the sender transport does not have sufficient feedback to even | If the sender transport does not have sufficient feedback to even | |||
estimate the path's CE rate, it SHOULD set FNE continuously. If the | estimate the path's CE rate, it SHOULD set FNE continuously. If the | |||
sender transport has some, perhaps stale, feedback to estimate that | sender transport has some, perhaps stale, feedback to estimate that | |||
the path's CE rate is nearly definitely less than E%, the transport | the path's CE rate is nearly definitely less than E%, the transport | |||
MAY blank RE in packets for E% of sent octets, and set the RECT | MAY blank RE in packets for E% of sent octets, and set the RECT | |||
codepoint for the remainder. | codepoint for the remainder. | |||
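One way to meet these guidelines is sketched below, assuming a deterministic credit scheme (the draft does not say how the E% should be spread across packets). A sender with no estimate at all sends FNE; otherwise it blanks RE (giving Re-Echo on an ECT(1) packet) on E% of sent octets and sends RECT on the rest. All names are hypothetical.

```python
from typing import Optional


class GenericReEcnSender:
    """Hypothetical codepoint chooser for a non-TCP re-ECN transport."""

    def __init__(self, ce_rate_bound: Optional[float]) -> None:
        # ce_rate_bound: E as a fraction (0.0 to 1.0), or None if there is not
        # enough feedback to even estimate the path's CE rate.
        self.e = ce_rate_bound
        self.credit = 0.0  # octets of RE blanking earned but not yet spent

    def codepoint(self, packet_len: int) -> str:
        if self.e is None:
            return "FNE"  # SHOULD set FNE continuously
        self.credit += self.e * packet_len
        if self.credit >= packet_len:
            self.credit -= packet_len
            return "Re-Echo"  # ECT(1) with RE blanked
        return "RECT"
```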
skipping to change at page 28, line 25 | skipping to change at page 29, line 7 | |||
4.2.3. Guidelines for adding Re-ECN to DCCP | 4.2.3. Guidelines for adding Re-ECN to DCCP | |||
Beside adjusting the initial features negotiation sequence, operating | Besides adjusting the initial feature negotiation sequence, operating | |||
re-ECN in DCCP [RFC4340] could be achieved by defining a new option | re-ECN in DCCP [RFC4340] could be achieved by defining a new option | |||
to be added to acknowledgments, that would include a multibit field | to be added to acknowledgments, that would include a multibit field | |||
where the destination could copy its ECC. | where the destination could copy its ECC. | |||
4.2.4. Guidelines for adding Re-ECN to SCTP | 4.2.4. Guidelines for adding Re-ECN to SCTP | |||
Annex 1 in [RFC2960] gives the specifications for SCTP to support | Appendix A in [RFC4960] gives the specifications for SCTP to support | |||
ECN. Similar steps should be taken to support re-ECN. Beside | ECN. Similar steps should be taken to support re-ECN. Besides | |||
adjusting the initial features negotiation sequence, operating re-ECN | adjusting the initial feature negotiation sequence, operating re-ECN | |||
in SCTP could be achieved by defining a new control chunk, that would | in SCTP could be achieved by defining a new control chunk that would | |||
include a multibit field where the destination could copy its ECC | include a multibit field where the destination could copy its ECC. | |||
5. Network Layer | 5. Network Layer | |||
5.1. Re-ECN IPv4 Wire Protocol | 5.1. Re-ECN IPv4 Wire Protocol | |||
The wire protocol of the ECN field in the IP header remains largely | The wire protocol of the ECN field in the IP header remains largely | |||
unchanged from [RFC3168]. However, an extension to the ECN field we | unchanged from [RFC3168]. However, an extension to the ECN field we | |||
call the RE (re-ECN extension) flag (Section 3.2) is defined in this | call the RE (Re-ECN extension) flag (Section 3.3) is defined in this | |||
document. It doubles the extended ECN codepoint space, giving 8 | document. It doubles the extended ECN codepoint space, giving 8 | |||
potential codepoints. The semantics of the extra codepoints are | potential codepoints. The semantics of the extra codepoints are | |||
backward compatible with the semantics of the 4 original codepoints | backward compatible with the semantics of the 4 original codepoints | |||
[RFC3168] (Section 7.1 collects together and summarises all the | [RFC3168] (Section 7.1 collects together and summarises all the | |||
changes defined in this document). | changes defined in this document). | |||
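To make the 8 codepoints concrete, the sketch below reads the extended ECN field out of a raw IPv4 header. It takes the ECN field from the two least significant bits of the second octet (as in RFC 3168) and the RE flag from bit 48, i.e. the most significant bit of the seventh octet, as proposed in the next paragraph. Codepoint names follow Table 7; the function and dictionary names are hypothetical.

```python
from typing import Tuple

# (ECN field, RE flag) -> extended ECN codepoint name, per Table 7
EECN_NAMES = {
    (0b00, 0): "Not-RECT",
    (0b00, 1): "FNE",
    (0b01, 0): "Re-Echo",
    (0b01, 1): "RECT",
    (0b10, 0): "ECT(0), RFC3168 use only",
    (0b10, 1): "CU (currently unused)",
    (0b11, 0): "CE(0)",
    (0b11, 1): "CE(-1)",
}


def extended_ecn(ipv4_header: bytes) -> Tuple[int, int]:
    """Return (ECN field, RE flag) from a raw IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets")
    ecn = ipv4_header[1] & 0b11       # low 2 bits of the TOS octet
    re = (ipv4_header[6] >> 7) & 0b1  # bit 48: the former reserved flag
    return ecn, re
```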
For IPv4, this document proposes that the new RE control flag will be | For IPv4, this document proposes that the new RE control flag will be | |||
positioned where the `reserved' control flag was at bit 48 of the | positioned where the `reserved' control flag was at bit 48 of the | |||
IPv4 header (counting from 0). Alternatively, some would call this | IPv4 header (counting from 0). Alternatively, some would call this | |||
bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 | bit 0 (counting from 0) of byte 7 (counting from 1) of the IPv4 | |||
skipping to change at page 30, line 21 | skipping to change at page 30, line 50 | |||
0 1 2 3 | 0 1 2 3 | |||
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
| Next Header | Hdr ext Len | Option Type | Opt Length =4 | | | Next Header | Hdr ext Len | Option Type | Opt Length =4 | | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
|R| Reserved for future use | | |R| Reserved for future use | | |||
|E| | | |E| | | |||
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |||
Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option | Figure 6: Definition of a New IPv6 Congestion Hop by Hop Option | |||
Header containing the Re-ECN Extension (RE) Control Flag | Header containing the re-ECN Extension (RE) Control Flag | |||
0 1 2 3 4 5 6 7 8 | 0 1 2 3 4 5 6 7 8 | |||
+-+-+-+-+-+-+-+-+- | +-+-+-+-+-+-+-+-+- | |||
|AIU|C|Option ID| | |AIU|C|Option ID| | |||
+-+-+-+-+-+-+-+-+- | +-+-+-+-+-+-+-+-+- | |||
Figure 7: Congestion Hop by Hop Option Type Encoding | Figure 7: Congestion Hop by Hop Option Type Encoding | |||
The Hop-by-Hop Options header enables packets to carry information to | The Hop-by-Hop Options header enables packets to carry information to | |||
be examined and processed by routers or nodes along the packet's | be examined and processed by routers or nodes along the packet's | |||
delivery path, including the source and destination nodes. For re- | delivery path, including the source and destination nodes. For re- | |||
skipping to change at page 30, line 44 | skipping to change at page 31, line 25 | |||
Congestion extension header MUST be set to "00" meaning if | Congestion extension header MUST be set to "00" meaning if | |||
unrecognized `skip over option and continue processing the header'. | unrecognized `skip over option and continue processing the header'. | |||
Then, any routers or a receiver not upgraded with the optional re-ECN | Then, any routers or a receiver not upgraded with the optional re-ECN | |||
features described in this memo will simply ignore this header. But | features described in this memo will simply ignore this header. But | |||
routers with these optional re-ECN features or a re-ECN policing | routers with these optional re-ECN features or a re-ECN policing | |||
function, will process this Congestion extension header. | function, will process this Congestion extension header. | |||
The `C' flag MUST be set to "1" to specify that the Option Data | The `C' flag MUST be set to "1" to specify that the Option Data | |||
(currently only the RE control flag) can change en-route to the | (currently only the RE control flag) can change en-route to the | |||
packet's final destination. This ensures that, when an | packet's final destination. This ensures that, when an | |||
Authentication header (AH [RFC2402]) is present in the packet, for | Authentication header (AH [RFC4302]) is present in the packet, for | |||
any option whose data may change en-route, its entire Option Data | any option whose data may change en-route, its entire Option Data | |||
field will be treated as zero-valued octets when computing or | field will be treated as zero-valued octets when computing or | |||
verifying the packet's authenticating value. | verifying the packet's authenticating value. | |||
Although the RE control flag should not be changed along the path, we | Although the RE control flag should not be changed along the path, we | |||
expect that the rest of this option field that is currently `Reserved | expect that the rest of this option field that is currently `Reserved | |||
for future use' could be used for a multi-bit congestion notification | for future use' could be used for a multi-bit congestion notification | |||
field which we would expect to change en route. As the RE flag does | field which we would expect to change en route. As the RE flag does | |||
not need end-to-end authentication, we set the C flag to '1'. | not need end-to-end authentication, we set the C flag to '1'. | |||
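The option layout of Figures 6 and 7 can be sketched as a byte builder. It assumes the standard IPv6 option type layout (2 action bits, then the C bit, then a 5-bit identifier) and a minimal 8-octet Hop-by-Hop Options header carrying only this option, so Hdr Ext Len is 0. The option identifier value is not given in this section, so `option_id` is a placeholder the caller must supply.

```python
ACTION_SKIP = 0b00  # "skip over option and continue processing the header"
C_MAY_CHANGE = 0b1  # option data may change en route (zeroed for AH computation)


def congestion_option_type(option_id: int) -> int:
    """Encode the Option Type octet: |AIU|C|Option ID|."""
    if not 0 <= option_id < 32:
        raise ValueError("option ID is 5 bits")
    return (ACTION_SKIP << 6) | (C_MAY_CHANGE << 5) | option_id


def congestion_hop_by_hop_header(next_header: int, option_id: int, re: int) -> bytes:
    """Build an 8-octet Hop-by-Hop Options header holding only this option."""
    option_data = bytearray(4)        # RE flag + 31 reserved bits
    option_data[0] = (re & 0b1) << 7  # RE is the first bit of the option data
    hdr_ext_len = 0                   # in 8-octet units, not counting the first 8
    return bytes([next_header, hdr_ext_len,
                  congestion_option_type(option_id),
                  len(option_data)]) + bytes(option_data)
```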
skipping to change at page 31, line 19 | skipping to change at page 31, line 48 | |||
5.3. Router Forwarding Behaviour | 5.3. Router Forwarding Behaviour | |||
Re-ECN works well without modifying the forwarding behaviour of any | Re-ECN works well without modifying the forwarding behaviour of any | |||
routers. However, below, two OPTIONAL changes to forwarding | routers. However, below, two OPTIONAL changes to forwarding | |||
behaviour are defined which respectively enhance performance and | behaviour are defined which respectively enhance performance and | |||
improve a router's discrimination against flooding attacks. They are | improve a router's discrimination against flooding attacks. They are | |||
both OPTIONAL additions that we propose MAY apply by default to all | both OPTIONAL additions that we propose MAY apply by default to all | |||
Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN | Diffserv per-hop scheduling behaviours (PHBs) [RFC2475] and ECN | |||
marking behaviours [RFC3168]. Specifications for PHBs MAY define | marking behaviours [RFC3168]. Specifications for PHBs MAY define | |||
different forwarding behaviours from this default, but this is NOT | different forwarding behaviours from this default, but this is not | |||
REQUIRED. [Re-PCN] is one example. | required. [Re-PCN] is one example. | |||
FNE indicates ECT: | FNE indicates ECT: | |||
The FNE codepoint tells a router to assume that the packet was | The FNE codepoint tells a router to assume that the packet was | |||
sent by an ECN-capable transport (see Section 5.4). Therefore an | sent by an ECN-capable transport (see Section 5.4). Therefore an | |||
FNE packet MAY be marked rather than dropped. Note that the FNE | FNE packet MAY be marked rather than dropped. Note that the FNE | |||
codepoint has been intentionally chosen so that, to legacy routers | codepoint has been intentionally chosen so that, to RFC3168 | |||
(which do not inspect the RE flag) an FNE packet appears to be | compliant routers (which do not inspect the RE flag), an FNE packet | |||
Not-ECT so it will be dropped by legacy AQM algorithms. | appears to be Not-ECT, so it will be dropped by legacy AQM | |||
algorithms. | ||||
A network operator MUST NOT configure a router to ECN mark rather | A network operator MUST NOT configure a queue to ECN mark rather | |||
than drop FNE packets unless it can guarantee that FNE packets | than drop FNE packets unless it can guarantee that FNE packets | |||
will be rate limited, either locally or upstream. The ingress | will be rate limited, either locally or upstream. The ingress | |||
policers discussed in Section 6.1.5 would count as rate limiters | policers discussed in Section 6.1.5 would count as rate limiters | |||
for this purpose. | for this purpose. | |||
Preferential Drop: If a re-ECN capable router experiences very high | Preferential Drop: If a re-ECN capable router queue experiences very | |||
load so that it has to drop arriving packets (e.g. a DoS attack), | high load so that it has to drop arriving packets (e.g. a DoS | |||
it MAY preferentially drop packets within the same Diffserv PHB | attack), it MAY preferentially drop packets within the same | |||
using the preference order for extended ECN codepoints given in | Diffserv PHB using the preference order for extended ECN | |||
Table 7. Preferential dropping can be difficult to implement on | codepoints given in Table 7. Preferential dropping can be | |||
some hardware, but if feasible it would discriminate against | difficult to implement on some hardware, but if feasible it would | |||
attack traffic if done as part of the overall policing framework | discriminate against attack traffic if done as part of the overall | |||
of Section 6.1.3. If nowhere else, routers at the egress of a | policing framework of Section 6.1.3. If nowhere else, routers at | |||
network SHOULD implement preferential drop (stronger than the MAY | the egress of a network SHOULD implement preferential drop | |||
above). For simplicity, preferences 4 & 5 MAY be merged into one | (stronger than the MAY above). For simplicity, preferences 4 & 5 | |||
preference level. | MAY be merged into one preference level. | |||
+-------+-----+------------+-------+------------+-------------------+ | +-------+-----+------------+-------+------------+-------------------+ | |||
| ECN | RE | Extended | Worth | Drop Pref | Re-ECN meaning | | | ECN | RE | Extended | Worth | Drop Pref | Re-ECN meaning | | |||
| field | bit | ECN | | (1 = drop | | | | field | bit | ECN | | (1 = drop | | | |||
| | | codepoint | | 1st) | | | | | | codepoint | | 1st) | | | |||
+-------+-----+------------+-------+------------+-------------------+ | +-------+-----+------------+-------+------------+-------------------+ | |||
| 01 | 0 | Re-Echo | +1 | 5/4 | Re-echoed | | | 01 | 0 | Re-Echo | +1 | 5/4 | Re-echoed | | |||
| | | | | | congestion and | | | | | | | | congestion and | | |||
| | | | | | RECT | | | | | | | | RECT | | |||
| 00 | 1 | FNE | +1 | 4 | Feedback not | | | 00 | 1 | FNE | +1 | 4 | Feedback not | | |||
| | | | | | established | | | | | | | | established | | |||
| 11 | 0 | CE(0) | 0 | 3 | Re-Echo canceled | | | 11 | 0 | CE(0) | 0 | 3 | Re-Echo canceled | | |||
| | | | | | by congestion | | | | | | | | by congestion | | |||
| | | | | | experienced | | | | | | | | experienced | | |||
| 01 | 1 | RECT | 0 | 3 | Re-ECN capable | | | 01 | 1 | RECT | 0 | 3 | Re-ECN capable | | |||
| | | | | | transport | | | | | | | | transport | | |||
| 11 | 1 | CE(-1) | -1 | 3 | Congestion | | | 11 | 1 | CE(-1) | -1 | 3 | Congestion | | |||
| | | | | | experienced | | | | | | | | experienced | | |||
| 10 | 1 | --CU-- | n/a | 2 | Currently Unused | | | 10 | 1 | --CU-- | n/a | 2 | Currently Unused | | |||
| 10 | 0 | --- | n/a | 2 | Legacy ECN use | | | 10 | 0 | --- | n/a | 2 | RFC3168 ECN use | | |||
| | | | | | only | | | | | | | | only | | |||
| 00 | 0 | Not-RECT | n/a | 1 | Not | | | 00 | 0 | Not-RECT | n/a | 1 | Not | | |||
| | | | | | re-ECN-capable | | | | | | | | Re-ECN-capable | | |||
| | | | | | transport | | | | | | | | transport | | |||
+-------+-----+------------+-------+------------+-------------------+ | +-------+-----+------------+-------+------------+-------------------+ | |||
Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth') | Table 7: Drop Preference of EECN Codepoints (Sorted by `Worth') | |||
The above drop preferences are arranged to preserve packets with | The above drop preferences are arranged to preserve packets with | |||
more positive worth (Section 3.4), given senders of positive | more positive worth (Section 3.5), given senders of positive | |||
packets must have honestly declared downstream congestion. This | packets must have honestly declared downstream congestion. This | |||
is explained fully in Section 6 on applications, particularly when | is explained fully in Section 6 on applications, particularly when | |||
the application of re-ECN to protect against DDoS attacks is | the application of re-ECN to protect against DDoS attacks is | |||
described. | described. | |||
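Table 7 is effectively a lookup from the 2-bit ECN field and the RE bit to an extended ECN codepoint, its worth and its drop preference. The Python sketch below transcribes the table and shows how a congested queue could pick the packet to discard first, namely the one with the lowest drop preference. The names `TABLE7`, `classify` and `choose_victim` are hypothetical. Reading Re-Echo's "5/4" entry as drop preference 5 is an assumption of this sketch; the table itself does not say which value applies.

```python
from typing import List, NamedTuple, Optional, Tuple

class Codepoint(NamedTuple):
    name: str
    worth: Optional[int]  # None where Table 7 gives "n/a"
    drop_pref: int        # 1 = dropped first

# Table 7, keyed by (ECN field, RE bit).
# ASSUMPTION: Re-Echo's "5/4" drop preference is read here as 5.
TABLE7 = {
    (0b01, 0): Codepoint("Re-Echo", +1, 5),
    (0b00, 1): Codepoint("FNE", +1, 4),
    (0b11, 0): Codepoint("CE(0)", 0, 3),
    (0b01, 1): Codepoint("RECT", 0, 3),
    (0b11, 1): Codepoint("CE(-1)", -1, 3),
    (0b10, 1): Codepoint("--CU--", None, 2),
    (0b10, 0): Codepoint("RFC3168 ECN use only", None, 2),
    (0b00, 0): Codepoint("Not-RECT", None, 1),
}

def classify(ecn: int, re_bit: int) -> Codepoint:
    """Return the extended ECN codepoint for a packet's header bits."""
    return TABLE7[(ecn & 0b11, re_bit & 0b1)]

def choose_victim(queue: List[Tuple[int, int]]) -> int:
    """Index of the queued (ECN field, RE bit) packet to drop first.

    Picks the lowest drop preference; min() keeps the earliest packet on ties.
    """
    return min(range(len(queue)), key=lambda i: classify(*queue[i]).drop_pref)
```

For example, with a RECT, a Not-RECT and an FNE packet queued in that order, `choose_victim` returns 1, the Not-RECT packet.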
5.4. Justification for Setting the First SYN to FNE | 5.4. Justification for Setting the First SYN to FNE | |||
Congested routers may mark an FNE packet to CE(-1) (Section 5.3), and | The initial SYN MUST be set to FNE by Re-ECT client A (Section 4.1.4), | |||
the initial SYN MUST be set to FNE by Re-ECT client A | and Section 5.3 says a queue MAY optionally treat an FNE packet as | |||
(Section 4.1.4). So an initial SYN may be marked CE(-1) rather than | ECN-capable, so an initial SYN may be marked CE(-1) rather than | |||
dropped. This seems dangerous, because the sender has not yet | dropped. This seems dangerous, because the sender has not yet | |||
established whether the receiver is a legacy one that does not | established whether the receiver is an RFC3168 one that does not | |||
understand congestion marking. It also seems to allow malicious | understand congestion marking. It also seems to allow malicious | |||
senders to take advantage of ECN marking to suffer less drop when | senders to take advantage of ECN marking to suffer less drop when | |||
launching SYN flooding attacks. Below we explain the features of the | launching SYN flooding attacks. Below we explain the features of the | |||
protocol design that remove both these dangers. | protocol design that remove both these dangers. | |||
ECN-capable initial SYN with a Not-ECT server: If the TCP server B | ECN-capable initial SYN with a Not-ECT server: If the TCP server B | |||
is re-ECN capable, provision is made for it to feed back a possible | is re-ECN capable, provision is made for it to feed back a possible | |||
congestion marked SYN in the SYN ACK (Section 4.1.4). But if the | congestion marked SYN in the SYN ACK (Section 4.1.4). But if the | |||
TCP client A finds out from the SYN ACK that the server was not | TCP client A finds out from the SYN ACK that the server was not | |||
ECN-capable, the TCP client MUST consider the first SYN as | ECN-capable, the TCP client MUST conservatively consider the first | |||
congestion marked before setting itself into Not-ECT mode. | SYN as congestion marked before setting itself into Not-ECT mode. | |||
Section 4.1.4 mandates that such a TCP client MUST also set its | Section 4.1.4 mandates that such a TCP client MUST also set its | |||
initial window to 1 segment. In this way we remove the need to | initial window to 1 segment. In this way we remove the need to | |||
cautiously avoid setting the first SYN to Not-RECT. This will | cautiously avoid setting the first SYN to Not-RECT. This will | |||
give worse performance while deployment is patchy, but better | give worse performance while deployment is patchy, but better | |||
performance once deployment is widespread. | performance once deployment is widespread. | |||
SYN flooding attacks can't exploit ECN-capability: Malicious hosts | SYN flooding attacks can't exploit ECN-capability: Malicious hosts | |||
may think they can use the advantage that ECN-marking gives over | may think they can use the advantage that ECN-marking gives over | |||
drop in launching classic SYN-flood attacks. But Section 5.3 | drop in launching classic SYN-flood attacks. But Section 5.3 | |||
mandates that a router MUST only be configured to treat packets | mandates that a router MUST only be configured to treat packets | |||
with the FNE codepoint as ECN-capable if FNE packets are rate | with the FNE codepoint as ECN-capable if FNE packets are rate | |||
limited. Introduction of the FNE codepoint was a deliberate move | limited somewhere. Introduction of the FNE codepoint was a | |||
to enable transport-neutral handling of flow-start and flow state | deliberate move to enable transport-neutral handling of flow-start | |||
set-up in the IP layer where it belongs. It then becomes possible | and flow state set-up in the IP layer where it belongs. It then | |||
to protect against flooding attacks of all forms (not just SYN | becomes possible to protect against flooding attacks of all forms | |||
flooding) without transport-specific inspection for things like | (not just SYN flooding) without transport-specific inspection for | |||
the SYN flag in TCP headers. Then, for instance, SYN flooding | things like the SYN flag in TCP headers. Then, for instance, SYN | |||
attacks using IPSec ESP encryption can also be rate limited at the | flooding attacks using IPSec ESP encryption can also be rate | |||
IP layer. | limited at the IP layer. | |||
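The first item above can be sketched as client-side logic that runs when the SYN ACK arrives. This is a minimal Python illustration under stated assumptions: `ClientState` and `on_syn_ack` are hypothetical names, the server's capability is reduced to one boolean (True is assumed to mean a re-ECN capable server that feeds back SYN marking as described in Section 4.1.4), and what a client does with its window after fed-back SYN congestion is left to Section 4.1.4 and not modelled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClientState:
    mode: str = "Re-ECT"                  # hypothetical label for the client's mode
    initial_window: Optional[int] = None  # None: not constrained by this rule
    syn_congested: bool = False           # treat the first SYN as congestion marked?

def on_syn_ack(state: ClientState, server_ecn_capable: bool,
               syn_mark_fed_back: bool = False) -> ClientState:
    """Update client A's state when the SYN ACK from server B arrives."""
    if not server_ecn_capable:
        # B cannot say whether the FNE SYN was marked CE(-1), so
        # conservatively count the SYN as congestion marked ...
        state.syn_congested = True
        # ... then fall back to Not-ECT with a 1-segment initial window.
        state.mode = "Not-ECT"
        state.initial_window = 1
    else:
        # A re-ECN capable B reports in the SYN ACK whether the SYN was marked.
        state.syn_congested = syn_mark_fed_back
    return state
```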
It might seem pedantic going to all this trouble to enable ECN on the | It might seem pedantic going to all this trouble to enable ECN on the | |||
initial packet of a flow, but it is motivated by a much wider concern | initial packet of a flow, but it is motivated by a much wider concern | |||
to ensure safe congestion control will still be possible even if the | to ensure safe congestion control will still be possible even if the | |||
application mix evolves to the point where the majority of flows | application mix evolves to the point where the majority of flows | |||
consist of a single window or even a single packet. It also allows | consist of a single window or even a single packet. It also allows | |||
denial of service attacks to be more easily isolated and prevented. | denial of service attacks to be more easily isolated and prevented. | |||
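The rule restated in the second item above (a queue may treat FNE packets as ECN-capable only if FNE packets are rate limited somewhere) could look like this in a router. The sketch assumes RFC 3168-style marking, where CE sets the ECN field to 11 and leaves the RE bit alone, so an FNE packet becomes CE(-1) as Table 7 implies. The token bucket is only one possible rate limiter; the text here does not say which limiter to use, and `TokenBucket` and `signal_congestion` are hypothetical names.

```python
from typing import Optional, Tuple

FNE = (0b00, 1)  # (ECN field, RE bit) of a Feedback-Not-Established packet

class TokenBucket:
    """One possible FNE rate limiter: `rate` packets/s, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now: float) -> bool:
        # Refill for the time since the previous packet, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # excess FNE packet: drop it

def signal_congestion(ecn: int, re_bit: int,
                      fne_rate_limited: bool) -> Optional[Tuple[int, int]]:
    """Action of a congested queue on a packet it has selected.

    Returns the CE-marked (ECN field, RE bit), or None if it must be dropped.
    `fne_rate_limited` means FNE rate limiting is configured somewhere on the
    path, not necessarily at this router.
    """
    # As in RFC 3168, any non-00 ECN field is treated as ECN-capable here.
    ecn_capable = ecn != 0b00 or ((ecn, re_bit) == FNE and fne_rate_limited)
    if not ecn_capable:
        return None
    # CE sets the ECN field to 11 and keeps the RE bit:
    # FNE and RECT become CE(-1), Re-Echo becomes CE(0).
    return (0b11, re_bit)
```

Without a rate limiter configured, a flood of FNE packets is simply dropped under congestion, so it gains nothing over a flood of Not-RECT packets.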
5.5. Control and Management | 5.5. Control and Management | |||
skipping to change at page 35, line 15 | skipping to change at page 36, line 15 | |||
flag should be the same as the inner. If it isn't, a management alarm | flag should be the same as the inner. If it isn't, a management alarm | |||
should be raised. This behaviour is the same as the full- | should be raised. This behaviour is the same as the full- | |||
functionality variant of [RFC3168] at tunnel exit, but different at | functionality variant of [RFC3168] at tunnel exit, but different at | |||
tunnel entry. | tunnel entry. | |||
If tunnels are left as they are specified in [RFC3168], whether the | If tunnels are left as they are specified in [RFC3168], whether the | |||
limited or full-functionality variants are used, a problem arises | limited or full-functionality variants are used, a problem arises | |||
with re-ECN if a tunnel crosses an inter-domain boundary, because the | with re-ECN if a tunnel crosses an inter-domain boundary, because the | |||
difference between positive and negative markings will not be | difference between positive and negative markings will not be | |||
correctly accounted for. In a limited functionality ECN tunnel, the | correctly accounted for. In a limited functionality ECN tunnel, the | |||
flow will appear to be legacy traffic, and therefore may be wrongly | flow will appear to be RFC3168 compliant traffic, and therefore may | |||
rate limited. In a full-functionality ECN tunnel, the result will | be wrongly rate limited. In a full-functionality ECN tunnel, the | |||
depend on whether the tunnel entry copies the inner RE flag to the outer | result will depend on whether the tunnel entry copies the inner RE flag | |||
header or the RE flag in the outer header is always cleared. If the | to the outer header or the RE flag in the outer header is always | |||
former, the flow will tend to be too positive when accounted for at | cleared. If the former, the flow will tend to be too positive when | |||
borders. If the latter, it will be too negative. If the rules set | accounted for at borders. If the latter, it will be too negative. | |||
out in [ECN-tunnel] are followed then this will not be an issue. | If the rules set out in [ECN-tunnel] are followed then this will not | |||
be an issue. | ||||
5.7. Non-Issues | 5.7. Non-Issues | |||
The following issues might seem to cause unfavourable interactions | The following issues might seem to cause unfavourable interactions | |||
with re-ECN, but we will explain why they don't: | with re-ECN, but we will explain why they don't: | |||
o Various link layers support explicit congestion notification, such | o Various link layers support explicit congestion notification, such | |||
as Frame Relay and ATM. Explicit congestion notification is | as Frame Relay and ATM. Explicit congestion notification is | |||
proposed to be added to other link layers, such as Ethernet | proposed to be added to other link layers, such as Ethernet | |||
(802.3ar Ethernet congestion management) and MPLS [ECN-MPLS]; | (802.3ar Ethernet congestion management) and MPLS [RFC5129]; | |||
o Encryption and IPSec. | o Encryption and IPSec. | |||
In the case of congestion notification at the link layer, each | In the case of congestion notification at the link layer, each | |||
particular link layer scheme either manages congestion on the link | particular link layer scheme either manages congestion on the link | |||
with its own link-level feedback (the usual arrangement in the cases | with its own link-level feedback (the usual arrangement in the cases | |||
of ATM and Frame Relay), or congestion notification from the link | of ATM and Frame Relay), or congestion notification from the link | |||
layer is merged into congestion notification at the IP level when the | layer is merged into congestion notification at the IP level when the | |||
frame headers are decapsulated at the end of the link (the | frame headers are decapsulated at the end of the link (the | |||
recommended arrangement in the Ethernet and MPLS cases). Given the | recommended arrangement in the Ethernet and MPLS cases). Given the | |||
skipping to change at page 36, line 6 | skipping to change at page 37, line 7 | |||
is processed on the path by subtracting negative from positive | is processed on the path by subtracting negative from positive | |||
markings. | markings. | |||
In the case of encryption, as long as the tunnel issues described in | In the case of encryption, as long as the tunnel issues described in | |||
Section 5.6 are dealt with, payload encryption itself will not be a | Section 5.6 are dealt with, payload encryption itself will not be a | |||
problem. The design goal of re-ECN is to include downstream | problem. The design goal of re-ECN is to include downstream | |||
congestion in the IP header so that it is not necessary to burrow into | congestion in the IP header so that it is not necessary to burrow into | |||
inner headers. Obfuscation of flow identifiers is not a problem for | inner headers. Obfuscation of flow identifiers is not a problem for | |||
re-ECN policing elements. Re-ECN doesn't ever require flow | re-ECN policing elements. Re-ECN doesn't ever require flow | |||
identifiers to be valid; it only requires them to be unique. So if | identifiers to be valid; it only requires them to be unique. So if | |||
an IPSec encapsulating security payload (ESP [RFC2406]) or an | an IPSec encapsulating security payload (ESP [RFC4303]) or an | |||
authentication header (AH [RFC2402]) is used, the security parameters | authentication header (AH [RFC4302]) is used, the security parameters | |||
index (SPI) will be a sufficient flow identifier, as it is intended | index (SPI) will be a sufficient flow identifier, as it is intended | |||
to be unique to a flow without revealing actual port numbers. | to be unique to a flow without revealing actual port numbers. | |||
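A policing element only needs a key that is unique per flow. The sketch below, with the hypothetical names `PacketInfo` and `flow_key`, prefers the IPSec SPI when ESP (IP protocol 50) or AH (IP protocol 51) is in use, uses port numbers when they are visible, and otherwise falls back to the source and destination address pair, which the next paragraph notes is sufficient.

```python
from typing import NamedTuple, Optional, Tuple

ESP, AH = 50, 51  # IP protocol numbers for IPsec ESP and AH

class PacketInfo(NamedTuple):
    src: str
    dst: str
    proto: int
    sport: Optional[int] = None  # None when ports are hidden or absent
    dport: Optional[int] = None
    spi: Optional[int] = None    # security parameters index, if ESP/AH

def flow_key(p: PacketInfo) -> Tuple:
    """A flow identifier that only needs to be unique, not valid."""
    if p.proto in (ESP, AH) and p.spi is not None:
        # The SPI is meant to be unique to a flow without revealing ports.
        return (p.src, p.dst, p.proto, p.spi)
    if p.sport is not None and p.dport is not None:
        return (p.src, p.dst, p.proto, p.sport, p.dport)
    # Last resort: the source and destination address pair.
    return (p.src, p.dst)
```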
In general, even if endpoints use some locally agreed scheme to hide | In general, even if endpoints use some locally agreed scheme to hide | |||
port numbers, re-ECN policing elements can just consider the pair of | port numbers, re-ECN policing elements can just consider the pair of | |||
source and destination IP addresses as the flow identifier. Re-ECN | source and destination IP addresses as the flow identifier. Re-ECN | |||
encourages endpoints to at least tell the network layer that a | encourages endpoints to at least tell the network layer that a | |||
sequence of packets are all part of the same flow, if indeed they | sequence of packets are all part of the same flow, if indeed they | |||
are. The alternative would be for the sender to make each packet | are. The alternative would be for the sender to make each packet | |||
appear to be a new flow, which would require them all to be marked | appear to be a new flow, which would require them all to be marked | |||
skipping to change at page 39, line 9 | skipping to change at page 40, line 9 | |||
delay using re-feedback. We give a simple outline of how this could | delay using re-feedback. We give a simple outline of how this could | |||
work in Appendix F. However, we do not expect this to be necessary, | work in Appendix F. However, we do not expect this to be necessary, | |||
as researchers tend to agree that only congestion control dynamics | as researchers tend to agree that only congestion control dynamics | |||
need to depend on RTT, not the rate that the algorithm would converge | need to depend on RTT, not the rate that the algorithm would converge | |||
on after a period of stability. | on after a period of stability. | |||
Figure 8 sketches the incentive framework that we will describe piece | Figure 8 sketches the incentive framework that we will describe piece | |||
by piece throughout this section. We will do a first pass in | by piece throughout this section. We will do a first pass in | |||
overview, then return to each piece in detail. We re-use the earlier | overview, then return to each piece in detail. We re-use the earlier | |||
example of how downstream congestion is derived by subtracting | example of how downstream congestion is derived by subtracting | |||
upstream congestion from path congestion (Figure 1) but depict | upstream congestion from path congestion (Figure 2) but depict | |||
multiple trust boundaries to turn it into an internetwork. For | multiple trust boundaries to turn it into an internetwork. For | |||
clarity, only downstream congestion is shown (the difference between | clarity, only downstream congestion is shown (the difference between | |||
the two earlier plots). The graph displays downstream path | the two earlier plots). The graph displays downstream path | |||
congestion seen in a typical flow as it traverses an example path | congestion seen in a typical flow as it traverses an example path | |||
from sender S to receiver R, across networks N1, N2 & N4. Everyone | from sender S to receiver R, across networks N1, N2 & N3. Everyone | |||
is shown using re-ECN correctly, but we intend to show why everyone | is shown using re-ECN correctly, but we intend to show why everyone | |||
would /choose/ to use it correctly, and honestly. | would /choose/ to use it correctly, and honestly. | |||
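The derivation behind Figure 8 can be reproduced numerically. In this sketch the sender declares whole-path congestion with positive marks, each domain adds negative (CE) marks in proportion to its own congestion, and downstream congestion at any point is the positive fraction minus the negative fraction seen so far. Congestion levels are assumed small enough to add linearly, and the per-domain figures (1% in N1, none in N2, 2% in N3) are hypothetical values chosen to resemble the steps in the figure. `downstream_profile` is a hypothetical name.

```python
from itertools import accumulate
from typing import List

def downstream_profile(congestion_per_domain: List[float]) -> List[float]:
    """Downstream congestion at the ingress of each domain, then at the receiver.

    ASSUMPTION: congestion levels are small, so they add approximately.
    """
    # Positive (Re-Echo) fraction the sender declares: whole-path congestion.
    declared = sum(congestion_per_domain)
    # Negative (CE) fraction accumulated upstream of each observation point.
    upstream = [0.0] + list(accumulate(congestion_per_domain))
    return [declared - u for u in upstream]

# Hypothetical path S -> N1 -> N2 -> N3 -> R.
profile = downstream_profile([0.01, 0.0, 0.02])
```

The profile (about 3%, 2%, 2%, then 0) reaches zero at the receiver only when the sender's declaration is honest; an understated declaration would end below zero.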
Three main types of self-interest can be identified: | Three main types of self-interest can be identified: | |||
o Users want to transmit data across the network as fast as | o Users want to transmit data across the network as fast as | |||
possible, paying as little as possible for the privilege. In this | possible, paying as little as possible for the privilege. In this | |||
respect, there is no distinction between senders and receivers, | respect, there is no distinction between senders and receivers, | |||
but we must be wary of potential malice by one on the other; | but we must be wary of potential malice by one on the other; | |||
o Network operators want to maximise revenues from the resources | o Network operators want to maximise revenues from the resources | |||
they invest in. They compete amongst themselves for the custom of | they invest in. They compete amongst themselves for the custom of | |||
users. | users. | |||
o Attackers (whether users or networks) want to use any opportunity | o Attackers (whether users or networks) want to use any opportunity | |||
to subvert the new re-ECN system for their own gain or to damage | to subvert the new re-ECN system for their own gain or to damage | |||
the service of their victims, whether targeted or random. | the service of their victims, whether targeted or random. | |||
policer | policer dropper | |||
| | | | | |||
| | | | | |||
S <-----N1----> <---N2---> <---N4--> R domain | S <-----N1----> <---N2---> <---N3--> R domain | |||
| : : | ||||
A\|/: : | ||||
| V : : | ||||
3% |---------+ : | ||||
| : | : | ||||
2% | : +-----------------------+ : | ||||
| : downstream congestion | : | ||||
1% | : | : | ||||
| : | : | ||||
0% +---------------------------------+=====--> | ||||
0 i ^ resource index | ||||
| | /|\ | ||||
1.00% 2.00% | marking fraction | ||||
| | | | |||
dropper | 3% |---------+ | |||
| | | ||||
2% | +-----------------------+ | ||||
| downstream congestion | | ||||
1% | | | ||||
| | | ||||
0% +---------------------------------+====== | ||||
0 i | ||||
Figure 8: Incentive Framework, showing creation of opposing pressures | Figure 8: Incentive Framework, showing creation of opposing pressures | |||
to under-declare and over-declare downstream congestion, using a | to under-declare and over-declare downstream congestion, using a | |||
policer and a dropper | policer and a dropper | |||
Source congestion control: We want to ensure that the sender will | Source congestion control: We want to ensure that the sender will | |||
throttle its rate as downstream congestion increases. Whatever | throttle its rate as downstream congestion increases. Whatever | |||
the agreed congestion response (whether TCP-compatible or some | the agreed congestion response (whether TCP-compatible or some | |||
enhanced QoS), to some extent it will always be against the | enhanced QoS), to some extent it will always be against the | |||
sender's interest to comply. | sender's interest to comply. | |||
skipping to change at page 41, line 9 | skipping to change at page 42, line 6 | |||
Edge egress dropper: If the policer ensures the source has less | Edge egress dropper: If the policer ensures the source has less | |||
right to a high rate the higher it declares downstream congestion, | right to a high rate the higher it declares downstream congestion, | |||
the source has a clear incentive to understate downstream | the source has a clear incentive to understate downstream | |||
congestion. But, if flows of packets are understated when they | congestion. But, if flows of packets are understated when they | |||
enter the internetwork, they will have become negative by the time | enter the internetwork, they will have become negative by the time | |||
they leave. So, we introduce a dropper at the last network | they leave. So, we introduce a dropper at the last network | |||
egress, which drops packets in flows that persistently declare | egress, which drops packets in flows that persistently declare | |||
negative downstream congestion (see Section 6.1.4 for details). | negative downstream congestion (see Section 6.1.4 for details). | |||
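A toy version of the egress dropper keeps a running balance of worth per flow and drops packets once a flow's balance falls below a tolerance. This only illustrates the idea; the actual dropper design is in Section 6.1.4, and `EgressDropper` and its `tolerance` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable

class EgressDropper:
    """Toy per-flow dropper at the last egress (hypothetical sketch)."""

    def __init__(self, tolerance: int = 3) -> None:
        self.tolerance = tolerance
        # Sum of packet worth (+1, 0 or -1, as in Table 7) seen per flow.
        self.balance: Dict[Hashable, int] = defaultdict(int)

    def admit(self, flow: Hashable, worth: int) -> bool:
        """Return False if the packet should be dropped."""
        self.balance[flow] += worth
        # Honest flows hover around zero at the egress; a balance well
        # below zero means downstream congestion was understated.
        return self.balance[flow] >= -self.tolerance
```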
..competitive routing | ||||
.' : '. | ||||
.' p e n a l:t i e s '. | ||||
: | : \ : | ||||
A : | : | : | ||||
|S <-----N1----> <---N2---> <---N4--> R domain | ||||
| : | : | : | ||||
| V | : | : | ||||
3% |--------+ | : | : | ||||
| | V V V V | ||||
2% | +-----------------------+ | ||||
| downstream congestion | | ||||
1% | : | | ||||
| : | | ||||
0% +--------------------------------+=====--> | ||||
0 ^ i resource index | ||||
| /|\ | | ||||
1.00% | 2.00% marking fraction | ||||
| | ||||
sanctions | ||||
Figure 9: Incentives at Inter-domain Borders | ||||
Inter-domain traffic policing: But next we must ask, if congestion | Inter-domain traffic policing: But next we must ask, if congestion | |||
arises downstream (say in N4), what is the ingress network's | arises downstream (say in N3), what is the ingress network's | |||
(N1's) incentive to police its customers' response? If N1 turns a | (N1's) incentive to police its customers' response? If N1 turns a | |||
blind eye, its own customers benefit while other networks suffer. | blind eye, its own customers benefit while other networks suffer. | |||
This is why all inter-domain QoS architectures (e.g. Intserv, | This is why all inter-domain QoS architectures (e.g. Intserv, | |||
Diffserv) police traffic each time it crosses a trust boundary. | Diffserv) police traffic each time it crosses a trust boundary. | |||
We have already shown that re-ECN gives a trustworthy measure of | We have already shown that re-ECN gives a trustworthy measure of | |||
the expected downstream congestion that a flow will cause by | the expected downstream congestion that a flow will cause by | |||
subtracting negative volume from positive at any intermediate | subtracting negative volume from positive at any intermediate | |||
point on a path. N4 (say) can use this measure to police all the | point on a path. N3 (say) can use this measure to police all the | |||
responses to congestion of all the sources beyond its upstream | responses to congestion of all the sources beyond its upstream | |||
neighbour (N2), but in bulk with one very simple passive | neighbour (N2), but in bulk with one very simple passive | |||
mechanism, rather than per flow, as we will now explain using | mechanism, rather than per flow, as we will now explain. | |||
Figure 9. | ||||
Emulating policing with inter-domain congestion penalties: Between | Emulating policing with inter-domain congestion penalties: Between | |||
high-speed networks, we would rather avoid per-flow policing, and | high-speed networks, we would rather avoid per-flow policing, and | |||
we would rather avoid holding back traffic while it is policed. | we would rather avoid holding back traffic while it is policed. | |||
Instead, once re-ECN has arranged headers to carry downstream | Instead, once re-ECN has arranged headers to carry downstream | |||
congestion honestly, N2 can contract to pay N4 penalties in | congestion honestly, N2 can contract to pay N3 penalties in | |||
proportion to a single bulk count of the congestion metrics | proportion to a single bulk count of the congestion metrics | |||
crossing their mutual trust boundary (Section 6.1.6). In this | crossing their mutual trust boundary (Section 6.1.6). In this | |||
way, N4 puts pressure on N2 to suppress downstream congestion, for | way, N3 puts pressure on N2 to suppress downstream congestion, for | |||
every flow passing through the border interface, even though they | every flow passing through the border interface, even though they | |||
will all start and end in different places, and even though they | will all start and end in different places, and even though they | |||
may all be allowed different responses to congestion. The figure | may all be allowed different responses to congestion. The figure | |||
depicts this downward pressure on N2 by the solid downward arrow | depicts this downward pressure on N2 by the solid downward arrow | |||
at the egress of N2. Then N2 has an incentive either to police | at the egress of N2. Then N2 has an incentive either to police | |||
the congestion response of its own ingress traffic (from N1) or to | the congestion response of its own ingress traffic (from N1) or to | |||
emulate policing by applying penalties to N1 in turn on the basis | emulate policing by applying penalties to N1 in turn on the basis | |||
of congestion counted at their mutual boundary. In this recursive | of congestion counted at their mutual boundary. In this recursive | |||
way, the incentives for each flow to respond correctly to | way, the incentives for each flow to respond correctly to | |||
congestion trace back with each flow precisely to each source, | congestion trace back with each flow precisely to each source, | |||
despite the mechanism not recognising flows (see Section 6.2.2). | despite the mechanism not recognising flows (see Section 6.2.2). | |||
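The single bulk count at a border needs no per-flow state: two byte counters per interface are enough. The sketch below, using the hypothetical class `BorderMeter`, counts positive and negative bytes crossing from N2 into N3 and reports their difference as the downstream congestion volume on which a penalty could be based.

```python
class BorderMeter:
    """Bulk meter for one inter-domain interface (hypothetical sketch)."""

    def __init__(self) -> None:
        self.positive_bytes = 0  # bytes in packets of worth +1 (Re-Echo, FNE)
        self.negative_bytes = 0  # bytes in packets of worth -1 (CE(-1))

    def observe(self, size: int, worth: int) -> None:
        if worth > 0:
            self.positive_bytes += size
        elif worth < 0:
            self.negative_bytes += size
        # Neutral packets (worth 0) do not affect the metric; callers are
        # assumed to pass 0 for codepoints whose worth is n/a.

    def downstream_congestion_volume(self) -> int:
        # Positive minus negative volume: congestion expected beyond the border.
        return self.positive_bytes - self.negative_bytes
```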
Inter-domain congestion charging diversity: Any two networks are | Inter-domain congestion charging diversity: Any two networks are | |||
free to agree any of a range of penalty regimes between themselves | free to agree any of a range of penalty regimes between themselves | |||
but they would only provide the right incentives if they were | but they would only provide the right incentives if they were | |||
within the following reasonable constraints. N2 should expect to | within the following reasonable constraints. N2 should expect to | |||
have to pay penalties to N4 where penalties monotonically increase | have to pay penalties to N3 where penalties monotonically increase | |||
with the volume of congestion and negative penalties are not | with the volume of congestion and negative penalties are not | |||
allowed. For instance, they may agree an SLA with tiered | allowed. For instance, they may agree an SLA with tiered | |||
congestion thresholds, where higher penalties apply the higher the | congestion thresholds, where higher penalties apply the higher the | |||
threshold that is broken. But the most obvious (and useful) form | threshold that is broken. But the most obvious (and useful) form | |||
of penalty is where N4 levies a charge on N2 proportional to the | of penalty is where N3 levies a charge on N2 proportional to the | |||
volume of downstream congestion N2 dumps into N4. In the | volume of downstream congestion N2 dumps into N3. In the | |||
explanation that follows, we assume this specific variant of | explanation that follows, we assume this specific variant of | |||
volume charging between networks - charging proportionate to the | volume charging between networks - charging proportionate to the | |||
volume of congestion. | volume of congestion. | |||
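The two constraints (penalties increase monotonically with congestion volume, and are never negative) can be checked against the two kinds of contract mentioned above. The functions below are hypothetical sketches of a proportional charge and a tiered SLA; prices, thresholds and units are placeholders.

```python
import bisect
from typing import Sequence

def proportional_penalty(congestion_volume: int, price_per_byte: float) -> float:
    """N3 charges N2 in proportion to the congestion volume N2 sends it."""
    # Clamp at zero: negative penalties are not allowed.
    return price_per_byte * max(0, congestion_volume)

def tiered_penalty(congestion_volume: int, thresholds: Sequence[int],
                   penalties: Sequence[float]) -> float:
    """Tiered SLA: penalties[i] applies once thresholds[i] is exceeded.

    Ascending thresholds and non-decreasing, non-negative penalties keep
    the result monotonic and never negative.
    """
    if len(thresholds) != len(penalties):
        raise ValueError("need one penalty per threshold")
    broken = bisect.bisect_left(thresholds, congestion_volume)  # thresholds < volume
    return 0.0 if broken == 0 else penalties[broken - 1]
```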
We must make clear that we are not advocating that everyone should | We must make clear that we are not advocating that everyone should | |||
use this form of contract. We are well aware that the IETF tries | use this form of contract. We are well aware that the IETF tries | |||
to avoid standardising technology that depends on a particular | to avoid standardising technology that depends on a particular | |||
business model. And we strongly share this desire to encourage | business model. And we strongly share this desire to encourage | |||
diversity. But our aim is merely to show that border policing can | diversity. But our aim is merely to show that border policing can | |||
at least work with this one model, then we can assume that | at least work with this one model, then we can assume that | |||
skipping to change at page 43, line 28 | skipping to change at page 44, line 4 | |||
inter-domain congestion charging, a domain seems to have a | inter-domain congestion charging, a domain seems to have a | |||
perverse incentive to fake congestion; N2's profit depends on the | perverse incentive to fake congestion; N2's profit depends on the | |||
difference between congestion at its ingress (its revenue) and at | difference between congestion at its ingress (its revenue) and at | |||
its egress (its cost). So, overstating internal congestion seems | its egress (its cost). So, overstating internal congestion seems | |||
to increase profit. However, smart border routing [Smart_rtg] by | to increase profit. However, smart border routing [Smart_rtg] by | |||
N1 will bias its routing towards the least cost routes. So, N2 | N1 will bias its routing towards the least cost routes. So, N2 | |||
risks losing all its revenue to competitive routes if it | risks losing all its revenue to competitive routes if it | |||
overstates congestion (see Section 6.2.3). In other words, if N2 | overstates congestion (see Section 6.2.3). In other words, if N2 | |||
is the least congested route, its ability to raise excess profits | is the least congested route, its ability to raise excess profits | |||
is limited by the congestion on the next least congested route. | is limited by the congestion on the next least congested route. | |||
This pressure on N2 to remain competitive is represented by the | ||||
dotted downward arrow at the ingress to N2 in Figure 9. | ||||
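The argument against faking congestion can be checked with a little arithmetic. Under the proportional charging assumed above, N2's margin per unit of traffic is the price times the difference between downstream congestion at its ingress and at its egress, while N1's smart border routing picks whichever route shows the least downstream congestion. The rival route `M`, the function names and all numbers below are hypothetical.

```python
from typing import Dict

def n2_margin(ingress_downstream: float, egress_downstream: float,
              price: float) -> float:
    """Revenue from N1 minus cost paid to N3, per unit volume of traffic."""
    return price * (ingress_downstream - egress_downstream)

def smart_route(routes: Dict[str, float]) -> str:
    """N1 biases routing to the route with least downstream congestion."""
    return min(routes, key=routes.get)

# Beyond N2 there is 2% congestion; a hypothetical rival route M shows 3%.
honest = {"N2": 0.025, "M": 0.03}    # N2 truthfully adds 0.5%
inflated = {"N2": 0.035, "M": 0.03}  # N2 fakes 1.5% internal congestion
```

So N2 can raise its apparent internal congestion only to just under 1% (the 3% on route M minus the 2% beyond N2) before N1 routes around it.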
Closing the loop: All the above elements conspire to trap everyone | Closing the loop: All the above elements conspire to trap everyone | |||
between two opposing pressures (the downward and upward arrows in | between two opposing pressures, ensuring the downstream congestion | |||
Figure 8 & Figure 9), ensuring the downstream congestion metric | metric arrives at the destination neither above nor below zero. | |||
arrives at the destination neither above nor below zero. So, we | So, we have arrived back where we started in our argument. The | |||
have arrived back where we started in our argument. The ingress | ingress edge network can rely on downstream congestion declared in | |||
edge network can rely on downstream congestion declared in the | the packet headers presented by the sender. So it can police the | |||
packet headers presented by the sender. So it can police the | ||||
sender's congestion response accordingly. | sender's congestion response accordingly. | |||
Evolvability of congestion control: We have seen that re-ECN enables | Evolvability of congestion control: We have seen that re-ECN enables | |||
policing at the very first ingress. We have also seen that, as | policing at the very first ingress. We have also seen that, as | |||
flows continue on their path through further networks downstream, | flows continue on their path through further networks downstream, | |||
re-ECN removes the need for further per-domain ingress policing of | re-ECN removes the need for further per-domain ingress policing of | |||
all the different congestion responses allowed to each different | all the different congestion responses allowed to each different | |||
flow. This is why the evolvability of re-ECN policing is so | flow. This is why the evolvability of re-ECN policing is so | |||
superior to bottleneck policing or to any policing of different | superior to bottleneck policing or to any policing of different | |||
QoS for different flows. Even if all access networks choose to | QoS for different flows. Even if all access networks choose to | |||
skipping to change at page 44, line 35 | skipping to change at page 45, line 8 | |||
except only the volume of packets marked with congestion experienced | except only the volume of packets marked with congestion experienced | |||
(CE) was counted. | (CE) was counted. | |||
However, below we explain why relying on classic feedback /required/ | However, below we explain why relying on classic feedback /required/ | |||
congestion charging to be used, while re-ECN achieves the same | congestion charging to be used, while re-ECN achieves the same | |||
powerful outcome (given it is built on Kelly's foundations), but does | powerful outcome (given it is built on Kelly's foundations), but does | |||
not /require/ congestion charging. In brief, the problem with | not /require/ congestion charging. In brief, the problem with | |||
classic feedback is that the incentives have to trace the indirect | classic feedback is that the incentives have to trace the indirect | |||
path back to the sender---the long way round the feedback loop. For | path back to the sender---the long way round the feedback loop. For | |||
example, if classic feedback were used in Figure 8, N2 would have had | example, if classic feedback were used in Figure 8, N2 would have had | |||
to influence N1 via all of N4, R & S rather than directly. | to influence N1 via all of N3, R & S rather than directly. | |||
Inability to agree what is happening downstream: In order to police | Inability to agree what is happening downstream: In order to police | |||
its upstream neighbour's congestion response, the neighbours | its upstream neighbour's congestion response, the neighbours | |||
should be able to agree on the congestion to be responded to. | should be able to agree on the congestion to be responded to. | |||
Whatever the feedback regime, as packets change hands at each | Whatever the feedback regime, as packets change hands at each | |||
trust boundary, any path metrics they carry are verifiable by both | trust boundary, any path metrics they carry are verifiable by both | |||
neighbours. But, with a classic path metric, they can only agree | neighbours. But, with a classic path metric, they can only agree | |||
on the /upstream/ path congestion. | on the /upstream/ path congestion. | |||
Inaccessible back-channel: The network needs a whole-path congestion | Inaccessible back-channel: The network needs a whole-path congestion | |||
skipping to change at page 45, line 37 | skipping to change at page 46, line 10 | |||
using the safer `sender pays' model. However, congestion charging is | using the safer `sender pays' model. However, congestion charging is | |||
only likely to be appropriate between domains. So, without losing | only likely to be appropriate between domains. So, without losing | |||
evolvability, re-ECN enables technical policing mechanisms that are | evolvability, re-ECN enables technical policing mechanisms that are | |||
more appropriate for end users than congestion pricing. | more appropriate for end users than congestion pricing. | |||
We now take a second pass over the incentive framework, filling in | We now take a second pass over the incentive framework, filling in | |||
the detail. | the detail. | |||
6.1.4. Egress Dropper | 6.1.4. Egress Dropper | |||
As traffic leaves the last network before the receiver (domain N4 in | As traffic leaves the last network before the receiver (domain N3 in | |||
Figure 8), the fraction of positive octets in a flow should match the | Figure 8), the fraction of positive octets in a flow should match the | |||
fraction of negative octets introduced by congestion marking, leaving | fraction of negative octets introduced by congestion marking, leaving | |||
a balance of zero. If it is less (a negative flow), it implies that | a balance of zero. If it is less (a negative flow), it implies that | |||
the source is understating path congestion (which will reduce the | the source is understating path congestion (which will reduce the | |||
penalties that N2 owes N4). | penalties that N2 owes N3). | |||
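To illustrate the egress balance test above, here is a minimal Python sketch. It assumes each packet has already been classified as positive (+1), neutral (0) or negative (-1) from its extended ECN codepoint; that classification step, the tuple layout and the function names are hypothetical, and this is not the Appendix E dropper. It only shows the accounting: a flow whose negative octets exceed its positive octets is flagged as negative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical per-packet "worth" after classifying its extended ECN codepoint.
POSITIVE, NEUTRAL, NEGATIVE = 1, 0, -1

Packet = Tuple[str, int, int]  # (flow_id, size_in_octets, worth)


def flow_balances(packets: Iterable[Packet]) -> Dict[str, int]:
    """Return positive octets minus negative octets for each flow."""
    balance: Dict[str, int] = defaultdict(int)
    for flow_id, size, worth in packets:
        balance[flow_id] += worth * size  # neutral packets add nothing
    return dict(balance)


def negative_flows(packets: Iterable[Packet]) -> Set[str]:
    """Flows whose source appears to be understating path congestion."""
    return {flow for flow, b in flow_balances(packets).items() if b < 0}
```

For example, a flow carrying one positive, one neutral and one negative 1500-octet packet balances to zero at the egress, whereas a flow with only a neutral and a negative packet ends at -1500 and is flagged. A practical dropper would also need to tolerate transient imbalance to meet the false-positive criterion; this cumulative count does not attempt to.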
If flows are positive, N4 need take no action---this simply means its | If flows are positive, N3 need take no action---this simply means its | |||
upstream neighbour is paying more penalties than it needs to, and the | upstream neighbour is paying more penalties than it needs to, and the | |||
source is going slower than it needs to. But, to protect itself | source is going slower than it needs to. But, to protect itself | |||
against persistently negative flows, N4 will need to install a | against persistently negative flows, N3 will need to install a | |||
dropper at its egress. Appendix E gives a suggested algorithm for | dropper at its egress. Appendix E gives a suggested algorithm for | |||
this dropper. There is no intention that the dropper algorithm needs | this dropper. There is no intention that the dropper algorithm needs | |||
to be standardised; it is merely provided to show that an efficient, | to be standardised; it is merely provided to show that an efficient, | |||
robust algorithm is possible. But whatever algorithm is used must | robust algorithm is possible. But whatever algorithm is used must | |||
meet the criteria below: | meet the criteria below: | |||
o It SHOULD introduce minimal false positives for honest flows; | o It SHOULD introduce minimal false positives for honest flows; | |||
o It SHOULD quickly detect and sanction dishonest flows (minimal | o It SHOULD quickly detect and sanction dishonest flows (minimal | |||
false negatives); | false negatives); | |||
skipping to change at page 48, line 35 | skipping to change at page 49, line 7 | |||
Of course, even if the sender does operate its own network, it may | Of course, even if the sender does operate its own network, it may | |||
arrange not to congestion mark traffic. Whether the sender does this | arrange not to congestion mark traffic. Whether the sender does this | |||
or not is of no concern to anyone else except the sender. Such a | or not is of no concern to anyone else except the sender. Such a | |||
sender will not be policed against its own network's contribution to | sender will not be policed against its own network's contribution to | |||
congestion, but the only resulting problem would be overload in the | congestion, but the only resulting problem would be overload in the | |||
sender's own network. | sender's own network. | |||
Finally, we must not forget that an easy way to circumvent re-ECN's | Finally, we must not forget that an easy way to circumvent re-ECN's | |||
defences is for the source to turn off re-ECN support, by setting the | defences is for the source to turn off re-ECN support, by setting the | |||
Not-RECT codepoint, implying legacy traffic. Therefore an ingress | Not-RECT codepoint, implying RFC3168 compliant traffic. Therefore an | |||
policer should put a general rate-limit on Not-RECT traffic, which | ingress policer should put a general rate-limit on Not-RECT traffic, | |||
SHOULD be lax during early, patchy deployment, but will have to | which SHOULD be lax during early, patchy deployment, but will have to | |||
become stricter as deployment widens. Similarly, flows starting | become stricter as deployment widens. Similarly, flows starting | |||
without an FNE packet can be confined by a strict rate-limit used for | without an FNE packet can be confined by a strict rate-limit used for | |||
the remainder of flows that haven't proved they are well-behaved by | the remainder of flows that haven't proved they are well-behaved by | |||
starting correctly (therefore they need not consume any flow state--- | starting correctly (therefore they need not consume any flow state--- | |||
they are just confined to the `misbehaving' bin if they carry an | they are just confined to the `misbehaving' bin if they carry an | |||
unrecognised flow ID). | unrecognised flow ID). | |||
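The two ingress rules above could be sketched as follows. This is an illustration under assumptions, not a specified policer: the token-bucket parameters, the class and method names, and the boolean inputs (whether a packet carries the Not-RECT codepoint, and whether it is an FNE packet) are hypothetical. All Not-RECT traffic shares one aggregate rate-limit, flows that started with an FNE packet are remembered, and any other flow is confined to a shared `misbehaving' bucket without creating any flow state.

```python
from typing import Set


class TokenBucket:
    """Simple token bucket; rate in octets per second, depth in octets."""

    def __init__(self, rate: float, depth: float) -> None:
        self.rate = rate
        self.depth = depth
        self.tokens = depth
        self.last = 0.0

    def allow(self, size: int, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False


class IngressPolicer:
    """Hypothetical ingress policer applying the two rules in the text."""

    def __init__(self, not_rect_rate: float, misbehaving_rate: float,
                 depth: float) -> None:
        # One aggregate limit for all Not-RECT traffic (lax early, stricter later).
        self.not_rect_bucket = TokenBucket(not_rect_rate, depth)
        # One shared limit for flows that did not start with an FNE packet.
        self.misbehaving_bucket = TokenBucket(misbehaving_rate, depth)
        # Only flows that started correctly consume per-flow state.
        self.known_flows: Set[str] = set()

    def admit(self, flow_id: str, size: int, now: float,
              not_rect: bool, fne: bool) -> bool:
        if not_rect:
            return self.not_rect_bucket.allow(size, now)
        if fne:
            self.known_flows.add(flow_id)
        if flow_id in self.known_flows:
            return True  # would go on to per-user congestion policing (not shown)
        return self.misbehaving_bucket.allow(size, now)
```

Using shared buckets for both restricted classes mirrors the point made above: a flow that has not proved it is well-behaved costs the policer no per-flow memory.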
6.1.6. Inter-domain Policing | 6.1.6. Inter-domain Policing | |||
One of the main design goals of re-ECN is for border security | One of the main design goals of re-ECN is for border security | |||
skipping to change at page 51, line 39 | skipping to change at page 52, line 9 | |||
Once an unbiased estimate of the effect of negative flows can be | Once an unbiased estimate of the effect of negative flows can be | |||
made, the problem reduces to detecting and preferably removing flows | made, the problem reduces to detecting and preferably removing flows | |||
that have gone negative as soon as possible. But importantly, | that have gone negative as soon as possible. But importantly, | |||
complete eradication of negative flows is no longer critical---best | complete eradication of negative flows is no longer critical---best | |||
endeavours will be sufficient. | endeavours will be sufficient. | |||
For instance, let us consider the case where a source sends traffic | For instance, let us consider the case where a source sends traffic | |||
with no positive markings at all, hoping to at least get as much | with no positive markings at all, hoping to at least get as much | |||
traffic delivered as network-based droppers will allow. The flow is | traffic delivered as network-based droppers will allow. The flow is | |||
likely to go at least slightly negative in the first network on the | likely to go at least slightly negative in the first network on the | |||
path (N1 if we use the example network layout in Figure 9). If all | path (N1 if we use the example network layout in Figure 8). If all | |||
networks use the algorithm in Appendix H.2 to inflate penalties at | networks use the algorithm in Appendix H.2 to inflate penalties at | |||
their border with an upstream network, they will remove the effect of | their border with an upstream network, they will remove the effect of | |||
negative flows. So, for instance, N2 will not be paying a penalty to | negative flows. So, for instance, N2 will not be paying a penalty to | |||
N1 for this flow. Further, because the flow contributes no positive | N1 for this flow. Further, because the flow contributes no positive | |||
markings at all, a dropper at the egress will completely remove it. | markings at all, a dropper at the egress will completely remove it. | |||
The remaining problem is that every network is carrying a flow that | The remaining problem is that every network is carrying a flow that | |||
is causing congestion to others but not being held to account for the | is causing congestion to others but not being held to account for the | |||
congestion it is causing. Whenever the fail-safe border algorithm | congestion it is causing. Whenever the fail-safe border algorithm | |||
(Section 6.1.7) or the border algorithm to compensate for negative | (Section 6.1.7) or the border algorithm to compensate for negative | |||
flows (Appendix H.2) detects a negative flow, it can instantiate a | flows (Appendix H.2) detects a negative flow, it can instantiate a | |||
focused dropper for that flow locally. It may be some time before | focused dropper for that flow locally. It may be some time before | |||
the flow is detected, but the more strongly negative the flow is, the | the flow is detected, but the more strongly negative the flow is, the | |||
more quickly it will be detected by the fail-safe algorithm. But, in | more quickly it will be detected by the fail-safe algorithm. But, in | |||
the meantime, it will not be distorting border incentives. Until it | the meantime, it will not be distorting border incentives. Until it | |||
is detected, if it contributes to drop anywhere, its packets will | is detected, if it contributes to drop anywhere, its packets will | |||
tend to be dropped before others if routers use the preferential drop | tend to be dropped before others if queues use the preferential drop | |||
rules in Section 5.3, which discriminate against non-positive | rules in Section 5.3, which discriminate against non-positive | |||
packets. All networks below the point where a flow goes negative | packets. All networks below the point where a flow goes negative | |||
(N1, N2 and N4 in this case) have an incentive to remove this flow, | (N1, N2 and N3 in this case) have an incentive to remove this flow, | |||
but the router where it first goes negative (in N1) can of course | but the queue where it first goes negative (in N1) can of course | |||
remove the problem for everyone downstream. | remove the problem for everyone downstream. | |||
In the case of DDoS attacks, Section 6.2.1 describes how re-ECN | In the case of DDoS attacks, Section 6.2.1 describes how re-ECN | |||
mitigates their force. | mitigates their force. | |||
6.1.7. Inter-domain Fail-safes | 6.1.7. Inter-domain Fail-safes | |||
The mechanisms described so far create incentives for rational | The mechanisms described so far create incentives for rational | |||
network operators to behave. That is, one operator aims to make | network operators to behave. That is, one operator aims to make | |||
another behave responsibly by applying penalties and expects a | another behave responsibly by applying penalties and expects a | |||
skipping to change at page 53, line 21 | skipping to change at page 53, line 41 | |||
6.2. Other Applications | 6.2. Other Applications | |||
6.2.1. DDoS Mitigation | 6.2.1. DDoS Mitigation | |||
A flooding attack is inherently about congestion of a resource. | A flooding attack is inherently about congestion of a resource. | |||
Because re-ECN ensures the sources causing network congestion | Because re-ECN ensures the sources causing network congestion | |||
experience the cost of their own actions, it acts as a first line of | experience the cost of their own actions, it acts as a first line of | |||
defence against DDoS. As load focuses on a victim, upstream queues | defence against DDoS. As load focuses on a victim, upstream queues | |||
grow, requiring honest sources to pre-load packets with a higher | grow, requiring honest sources to pre-load packets with a higher | |||
fraction of positive packets. Once downstream routers are so | fraction of positive packets. Once downstream queues are so | |||
congested that they are dropping traffic, they will be CE marking the | congested that they are dropping traffic, they will be CE marking the | |||
traffic they do forward 100%. Honest sources will therefore be | traffic they do forward 100%. Honest sources will therefore be | |||
sending Re-Echo 100% (and therefore being severely rate-limited at | sending Re-Echo 100% (and therefore being severely rate-limited at | |||
the ingress). | the ingress). | |||
Senders under malicious control can either do the same as honest | Senders under malicious control can either do the same as honest | |||
sources, and be rate-limited at ingress, or they can understate | sources, and be rate-limited at ingress, or they can understate | |||
congestion by sending more neutral RECT packets than they should. If | congestion by sending more neutral RECT packets than they should. If | |||
sources understate congestion (i.e. do not re-echo sufficient | sources understate congestion (i.e. do not re-echo sufficient | |||
positive packets) and the preferential drop ranking is implemented on | positive packets) and the preferential drop ranking is implemented on | |||
routers (Section 5.3), these routers will preserve positive traffic | queues (Section 5.3), these queues will preserve positive traffic | |||
until last. So, the neutral traffic from malicious sources will all | until last. So, the neutral traffic from malicious sources will all | |||
be automatically dropped first. Either way, the malicious sources | be automatically dropped first. Either way, the malicious sources | |||
cannot send more than honest sources. | cannot send more than honest sources. | |||
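The effect of preferential dropping during a flood can be sketched with a bounded queue. The full ranking of Section 5.3 is not reproduced here; as a simplifying assumption, this Python sketch only distinguishes positive from non-positive packets, and the class and method names are hypothetical. When the queue is full, a non-positive packet is dropped in preference to a positive one, so neutral packets from sources understating congestion are the first to go.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class PreferentialDropQueue:
    """Bounded FIFO that drops non-positive packets before positive ones."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.queue: Deque[Tuple[str, bool]] = deque()  # (packet_id, positive)

    def enqueue(self, packet_id: str, positive: bool) -> Optional[str]:
        """Queue a packet; return the id of any packet dropped, else None."""
        if len(self.queue) < self.capacity:
            self.queue.append((packet_id, positive))
            return None
        if not positive:
            return packet_id  # full queue: drop the arriving non-positive packet
        # Arrival is positive: evict the most recently queued non-positive packet.
        for i in range(len(self.queue) - 1, -1, -1):
            if not self.queue[i][1]:
                victim = self.queue[i][0]
                del self.queue[i]
                self.queue.append((packet_id, positive))
                return victim
        return packet_id  # queue holds only positive packets: drop the arrival
```

Evicting the most recently queued non-positive packet is an arbitrary choice made for this sketch; any policy that spares positive packets while non-positive ones remain would show the same effect.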
Further, hosts under malicious control will tend to be re-used for | Further, hosts under malicious control will tend to be re-used for | |||
many different attacks. They will therefore build up a long term | many different attacks. They will therefore build up a long term | |||
history of causing congestion. Therefore, as long as the population | history of causing congestion. Therefore, as long as the population | |||
of potentially compromisable hosts around the Internet is limited, | of potentially compromisable hosts around the Internet is limited, | |||
the per-user policing algorithms in Appendix G.1 will gradually | the per-user policing algorithms in Appendix G.1 will gradually | |||
throttle down zombies and other launchpads for attacks. Therefore, | throttle down zombies and other launchpads for attacks. Therefore, | |||
skipping to change at page 55, line 32 | skipping to change at page 56, line 10 | |||
o We are considering the issue of whether it would be useful to | o We are considering the issue of whether it would be useful to | |||
truncate rather than drop packets that appear to be malicious, so | truncate rather than drop packets that appear to be malicious, so | |||
that the feedback loop is not broken but useful data can be | that the feedback loop is not broken but useful data can be | |||
removed. | removed. | |||
7. Incremental Deployment | 7. Incremental Deployment | |||
7.1. Incremental Deployment Features | 7.1. Incremental Deployment Features | |||
The design of the re-ECN protocol started from the fact that the | The design of the re-ECN protocol started from the fact that the | |||
current ECN marking behaviour of routers was sufficient and that re- | current ECN marking behaviour of queues was sufficient and that re- | |||
feedback could be introduced around these routers by changing the | feedback could be introduced around these queues by changing the | |||
sender behaviour but not the routers. Otherwise, if we had required | sender behaviour but not the routers. Otherwise, if we had required | |||
routers to be changed, the chance of encountering a path that had | routers to be changed, the chance of encountering a path that had | |||
every router upgraded would be vanishingly small during early | every router upgraded would be vanishingly small during early | |||
deployment, giving no incentive to start deployment. Also, as there | deployment, giving no incentive to start deployment. Also, as there | |||
is no new forwarding behaviour, routers and hosts do not have to | is no new forwarding behaviour, routers and hosts do not have to | |||
signal or negotiate anything. | signal or negotiate anything. | |||
However, networks that choose to protect themselves using re-ECN do | However, networks that choose to protect themselves using re-ECN do | |||
have to add new security functions at their trust boundaries with | have to add new security functions at their trust boundaries with | |||
others. They distinguish legacy traffic by its ECN field. Traffic | others. They distinguish legacy traffic by its ECN field. Traffic | |||
from Not-ECT transports is distinguishable by its Not-RECT marking. | from Not-ECT transports is distinguishable by its Not-ECT marking. | |||
Traffic from legacy ECN transports is distinguished from re-ECN by | Traffic from RFC3168 compliant ECN transports is distinguished from | |||
which of ECT(0) or ECT(1) is used. We chose to use ECT(1) for re-ECN | re-ECN by which of ECT(0) or ECT(1) is used. We chose to use ECT(1) | |||
traffic deliberately. Existing ECN sources set ECT(0) on either 50% | for re-ECN traffic deliberately. Existing ECN sources set ECT(0) on | |||
(the nonce) or 100% (the default) of packets, whereas re-ECN does not | either 50% (the nonce) or 100% (the default) of packets, whereas re- | |||
use ECT(0) at all. We can use this distinguishing feature of legacy | ECN does not use ECT(0) at all. We can use this distinguishing | |||
ECN traffic to separate it out for different treatment at the various | feature of RFC3168 compliant ECN traffic to separate it out for | |||
border security functions: egress dropping, ingress policing and | different treatment at the various border security functions: egress | |||
border policing. | dropping, ingress policing and border policing. | |||
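The classification described above can be sketched as follows, using the RFC3168 ECN codepoints (Not-ECT 00, ECT(1) 01, ECT(0) 10, CE 11) carried in the two least-significant bits of the IPv4 TOS or IPv6 Traffic Class octet. The per-flow rule and the class name are hypothetical. Because re-ECN never sets ECT(0), a single ECT(0) packet identifies an RFC3168 flow; a flow seen using only ECT(1) so far is provisionally treated as re-ECN, since a nonce-using RFC3168 source may not yet have sent an ECT(0) packet. The sketch ignores the RE flag, so it cannot tell an FNE packet (whose ECN field is Not-ECT) from legacy Not-ECT traffic.

```python
from typing import Dict, Set

# ECN field codepoints as defined in RFC 3168.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def ecn_field(tos_byte: int) -> int:
    """Extract the ECN field: the two least-significant bits of the octet."""
    return tos_byte & 0b11


class FlowClassifier:
    """Hypothetical per-flow classifier based only on the ECN field."""

    def __init__(self) -> None:
        self.seen: Dict[str, Set[int]] = {}  # flow_id -> codepoints observed

    def observe(self, flow_id: str, tos_byte: int) -> None:
        self.seen.setdefault(flow_id, set()).add(ecn_field(tos_byte))

    def classify(self, flow_id: str) -> str:
        codepoints = self.seen.get(flow_id, set())
        if ECT_0 in codepoints:
            return "rfc3168"  # re-ECN never sets ECT(0)
        if ECT_1 in codepoints:
            return "re-ecn"   # provisional until an ECT(0) packet appears
        if codepoints == {NOT_ECT}:
            return "not-ect"
        return "unknown"      # e.g. nothing seen yet, or only CE so far
```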
The general principle we adopt is that an egress dropper will not | The general principle we adopt is that an egress dropper will not | |||
drop any legacy traffic, but ingress and border policers will limit | drop any legacy traffic, but ingress and border policers will limit | |||
the bulk rate of legacy traffic that can enter each network. Then, | the bulk rate of legacy traffic (Not-ECT, ECT(0) and those marked | |||
during early re-ECN deployment, operators can set very permissive (or | with the unused codepoint) that can enter each network. Then, during | |||
non-existent) rate-limits on legacy traffic, but once re-ECN | early re-ECN deployment, operators can set very permissive (or non- | |||
existent) rate-limits on legacy traffic, but once re-ECN | ||||
implementations are generally available, legacy traffic can be rate- | implementations are generally available, legacy traffic can be rate- | |||
limited increasingly harshly. Ultimately, an operator might choose | limited increasingly harshly. Ultimately, an operator might choose | |||
to block all legacy traffic entering its network, or at least only | to block all legacy traffic entering its network, or at least only | |||
allow through a trickle. | allow through a trickle. | |||
Then, as the limits are set more strictly, the more legacy ECN | Then, as the limits are set more strictly, the more RFC3168 ECN | |||
sources will gain by upgrading to re-ECN. Thus, towards the end of | sources will gain by upgrading to re-ECN. Thus, towards the end of | |||
the voluntary incremental deployment period, legacy transports can be | the voluntary incremental deployment period, RFC3168 compliant | |||
given progressively stronger encouragement to upgrade. | transports can be given progressively stronger encouragement to | |||
upgrade. | ||||
The following list of minor changes brings together all the points | The following list of minor changes brings together all the points | |||
where Re-ECN semantics for use of the two-bit ECN field are different | where re-ECN semantics for use of the two-bit ECN field are different | |||
compared to RFC3168: | compared to RFC3168: | |||
o A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender | o A re-ECN sender sets ECT(1) by default, whereas an RFC3168 sender | |||
sets ECT(0) by default (Section 3.3); | sets ECT(0) by default (Section 3.4); | |||
o No provision is necessary for a re-ECN capable source transport to | o No provision is necessary for a re-ECN capable source transport to | |||
use the ECN nonce (Section 4.1.2.1); | use the ECN nonce (Section 4.1.2.1); | |||
o Routers MAY preferentially drop different extended ECN codepoints | o Routers MAY preferentially drop different extended ECN codepoints | |||
(Section 5.3); | (Section 5.3); | |||
o Packets carrying the feedback not established (FNE) codepoint MAY | o Packets carrying the feedback not established (FNE) codepoint MAY | |||
optionally be marked rather than dropped by routers, even though | optionally be marked rather than dropped by routers, even though | |||
their ECN field is Not-ECT (with the important caveat in | their ECN field is Not-ECT (with the important caveat in | |||
skipping to change at page 57, line 44 | skipping to change at page 58, line 20 | |||
Deployment that requires co-ordination adds cost and delay and | Deployment that requires co-ordination adds cost and delay and | |||
tends to dilute any competitive advantage that might be gained. | tends to dilute any competitive advantage that might be gained. | |||
* ECN `only' gives a performance improvement. Making a product a | * ECN `only' gives a performance improvement. Making a product a | |||
bit faster (whether the product is a device or a network), | bit faster (whether the product is a device or a network), | |||
isn't usually a sufficient selling point to be worth the cost | isn't usually a sufficient selling point to be worth the cost | |||
of co-ordinating across the industry to deploy it. Network | of co-ordinating across the industry to deploy it. Network | |||
operators tend to avoid re-configuring a working network unless | operators tend to avoid re-configuring a working network unless | |||
launching a new product. | launching a new product. | |||
ECN and re-ECN for Edge-to-edge Assured QoS: | ECN and Re-ECN for Edge-to-edge Assured QoS: | |||
We believe the proposal to provide assured QoS sessions using a | We believe the proposal to provide assured QoS sessions using a | |||
form of ECN called pre-congestion notification (PCN) [PCN-arch] is | form of ECN called pre-congestion notification (PCN) [PCN-arch] is | |||
most likely to break the deadlock in ECN deployment first. It | most likely to break the deadlock in ECN deployment first. It | |||
only requires edge-to-edge deployment so it does not require | only requires edge-to-edge deployment so it does not require | |||
endpoint support. It can be deployed in a single network, then | endpoint support. It can be deployed in a single network, then | |||
grow incrementally to interconnected networks. And it provides a | grow incrementally to interconnected networks. And it provides a | |||
different `product' (internetworked assured QoS), rather than | different `product' (internetworked assured QoS), rather than | |||
merely making an existing product a bit faster. | merely making an existing product a bit faster. | |||
Not only could this assured QoS application kick-start ECN | Not only could this assured QoS application kick-start ECN | |||
deployment, it could also carry re-ECN deployment with it; because | deployment, it could also carry re-ECN deployment with it; because | |||
re-ECN can enable the assured QoS region to expand to a large | re-ECN can enable the assured QoS region to expand to a large | |||
internetwork where neighbouring networks do not trust each other. | internetwork where neighbouring networks do not trust each other. | |||
[Re-PCN] argues that re-ECN security should be built in to the QoS | [Re-PCN] argues that re-ECN security should be built in to the QoS | |||
system from the start, explaining why and how. | system from the start, explaining why and how. | |||
If ECN and re-ECN were deployed edge-to-edge for assured QoS, | If ECN and re-ECN were deployed edge-to-edge for assured QoS, | |||
operators would gain valuable experience. They would also clear | operators would gain valuable experience. They would also clear | |||
away many technical obstacles such as firewall configurations that | away many technical obstacles such as firewall configurations that | |||
block all but the legacy settings of the ECN field and the RE | block all but the RFC3168 settings of the ECN field and the RE | |||
flag. | flag. | |||
ECN in Access Networks: | ECN in Access Networks: | |||
The next obstacle to ECN deployment would be extension to access | The next obstacle to ECN deployment would be extension to access | |||
and backhaul networks, where considerable link layer differences | and backhaul networks, where considerable link layer differences | |||
make implementation non-trivial, particularly on congested | make implementation non-trivial, particularly on congested | |||
wireless links. ECN and re-ECN work fine during partial | wireless links. ECN and re-ECN work fine during partial | |||
deployment, but they will not be very useful if the most congested | deployment, but they will not be very useful if the most congested | |||
elements in networks are the last to support them. Access network | elements in networks are the last to support them. Access network | |||
skipping to change at page 60, line 44 | skipping to change at page 61, line 21 | |||
So, if re-ECN were stipulated for cellular devices, it would | So, if re-ECN were stipulated for cellular devices, it would | |||
automatically appear in those devices connected to the wireless | automatically appear in those devices connected to the wireless | |||
fringes of fixed networks if they coupled cellular with WiFi or | fringes of fixed networks if they coupled cellular with WiFi or | |||
Bluetooth technology, for instance. Also, once implemented in the | Bluetooth technology, for instance. Also, once implemented in the | |||
operating system of one mobile device, it would tend to be found | operating system of one mobile device, it would tend to be found | |||
in other devices using the same family of operating system. | in other devices using the same family of operating system. | |||
Therefore, whether or not a fixed network deployed ECN, or | Therefore, whether or not a fixed network deployed ECN, or | |||
deployed re-ECN policers and droppers, many of its hosts might | deployed re-ECN policers and droppers, many of its hosts might | |||
well be using re-ECN over it. Indeed, they would be at an | well be using re-ECN over it. Indeed, they would be at an | |||
advantage when communicating with hosts across Re-ECN policed | advantage when communicating with hosts across re-ECN policed | |||
networks that rate limited Not-RECT traffic. | networks that rate limited Not-RECT traffic. | |||
Other possible scenarios: | Other possible scenarios: | |||
The above is thankfully not the only plausible scenario we can | The above is thankfully not the only plausible scenario we can | |||
think of. One of the many clubs of operators that meet regularly | think of. One of the many clubs of operators that meet regularly | |||
around the world might decide to act together to persuade a major | around the world might decide to act together to persuade a major | |||
operating system manufacturer to implement re-ECN. And they may | operating system manufacturer to implement re-ECN. And they may | |||
agree between them on an interconnection model that includes | agree between them on an interconnection model that includes | |||
congestion penalties. | congestion penalties. | |||
Re-ECN provides an interesting opportunity for device | Re-ECN provides an interesting opportunity for device | |||
manufacturers as well as network operators. Policers can be | manufacturers as well as network operators. Policers can be | |||
configured loosely when first deployed. Then as re-ECN take-up | configured loosely when first deployed. Then as re-ECN take-up | |||
increases, they can be tightened up, so that a network with re-ECN | increases, they can be tightened up, so that a network with re-ECN | |||
deployed can gradually squeeze down the service provided to legacy | deployed can gradually squeeze down the service provided to | |||
devices that have not upgraded to re-ECN. Many device vendors | RFC3168 compliant devices that have not upgraded to re-ECN. Many | |||
rely on replacement sales. And operating system companies rely | device vendors rely on replacement sales. And operating system | |||
heavily on new release sales. Also support services would like to | companies rely heavily on new release sales. Also support | |||
be able to force stragglers to upgrade. So, the ability to | services would like to be able to force stragglers to upgrade. | |||
throttle service to legacy operating systems is quite valuable. | So, the ability to throttle service to RFC3168 compliant operating | |||
systems is quite valuable. | ||||
Also, policing unresponsive sources may not be the only or even | Also, policing unresponsive sources may not be the only or even | |||
the first application that drives deployment. It may be policing | the first application that drives deployment. It may be policing | |||
causes of heavy congestion (e.g. peer-to-peer file-sharing). Or | causes of heavy congestion (e.g. peer-to-peer file-sharing). Or | |||
it may be mitigation of denial of service. Or we may be wrong in | it may be mitigation of denial of service. Or we may be wrong in | |||
thinking simpler QoS will not be the initial motivation for re-ECN | thinking simpler QoS will not be the initial motivation for re-ECN | |||
deployment. Indeed, the combined pressure for all these may be | deployment. Indeed, the combined pressure for all these may be | |||
the motivator, but it seems optimistic to expect such a level of | the motivator, but it seems optimistic to expect such a level of | |||
joined-up thinking from today's communications industry. We | joined-up thinking from today's communications industry. We | |||
believe a single application alone must be a sufficient motivator. | believe a single application alone must be a sufficient motivator. | |||
skipping to change at page 63, line 10 | skipping to change at page 63, line 32 | |||
(policing) congestion control. But policing is only truly effective | (policing) congestion control. But policing is only truly effective | |||
at the first ingress into an internetwork, whereas path congestion | at the first ingress into an internetwork, whereas path congestion | |||
was previously only visible at the last egress. So, re-ECN | was previously only visible at the last egress. So, re-ECN | |||
democratises congestion information. Then the choice over who | democratises congestion information. Then the choice over who | |||
actually controls congestion can be made at run-time, not design | actually controls congestion can be made at run-time, not design | |||
time---a bit like an aircraft with dual controls. And different | time---a bit like an aircraft with dual controls. And different | |||
operators can make different choices. We believe non-architectural | operators can make different choices. We believe non-architectural | |||
approaches to this problem are unlikely to offer more than partial | approaches to this problem are unlikely to offer more than partial | |||
solutions (see Section 9). | solutions (see Section 9). | |||
Importantly, re-ECN does NOT REQUIRE assumptions about specific | Importantly, re-ECN does not require assumptions about specific | |||
congestion responses to be embedded in any network elements, except | congestion responses to be embedded in any network elements, except | |||
at the first ingress to the internetwork if that level of control is | at the first ingress to the internetwork if that level of control is | |||
desired by the ingress operator. But such tight policing will be a | desired by the ingress operator. But such tight policing will be a | |||
matter of agreement between the source and its access network | matter of agreement between the source and its access network | |||
operator. The ingress operator need not police congestion response | operator. The ingress operator need not police congestion response | |||
at flow granularity; it can simply hold a source responsible for the | at flow granularity; it can simply hold a source responsible for the | |||
aggregate congestion it causes, perhaps keeping it within a monthly | aggregate congestion it causes, perhaps keeping it within a monthly | |||
congestion quota. Or if the ingress network trusts the source, it | congestion quota. Or if the ingress network trusts the source, it | |||
can do nothing. | can do nothing. | |||
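The text suggests that an ingress operator might hold a source responsible for its aggregate congestion, for example by keeping it within a monthly congestion quota, rather than policing each flow. The sketch below is a hypothetical illustration of that idea and is not part of the re-ECN specification. The class name, the accounting rule, and the reset behaviour are all assumptions. The accounting rule charges the bytes of each packet the source sends as re-echoing congestion (RE blanked) against the allowance. The reset behaviour starts a fresh allowance whenever the month label changes.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCongestionQuota:
    """Hypothetical per-source congestion accounting at the first ingress.

    Assumption: the bytes of packets that re-echo congestion are used as
    the measure of congestion the source has declared it is causing.
    """
    quota_bytes: int      # allowance of re-echoed bytes per month
    used_bytes: int = 0   # re-echoed bytes charged so far this month
    month: str = ""       # label of the current accounting period

    def on_packet(self, month: str, size: int, re_echo: bool) -> bool:
        """Account for one packet and report whether the source is
        still within its quota. What to do when it is not (drop,
        down-prioritise, bill) is left to the operator."""
        if month != self.month:
            # New accounting period: start a fresh allowance.
            self.month, self.used_bytes = month, 0
        if re_echo:
            self.used_bytes += size
        return self.used_bytes <= self.quota_bytes


q = MonthlyCongestionQuota(quota_bytes=3000)
print(q.on_packet("2008-07", 1500, re_echo=True))   # True
print(q.on_packet("2008-07", 1500, re_echo=False))  # True
print(q.on_packet("2008-07", 1500, re_echo=True))   # True
print(q.on_packet("2008-07", 1500, re_echo=True))   # False
print(q.on_packet("2008-08", 1500, re_echo=True))   # True
```

With this design the policer holds no per-flow state. It keeps one counter per source, which matches the point that the ingress need not police congestion response at flow granularity.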
skipping to change at page 66, line 28 | skipping to change at page 67, line 7 | |||
declare path congestion to the network and it can remove traffic at | declare path congestion to the network and it can remove traffic at | |||
the egress if this declaration is dishonest. So it can police | the egress if this declaration is dishonest. So it can police | |||
correctly, irrespective of whether the receiver tries to suppress | correctly, irrespective of whether the receiver tries to suppress | |||
congestion feedback or whether the sender ignores genuine congestion | congestion feedback or whether the sender ignores genuine congestion | |||
feedback. Therefore the re-ECN protocol addresses a much wider range | feedback. Therefore the re-ECN protocol addresses a much wider range | |||
of cheating problems, which includes the one addressed by the ECN | of cheating problems, which includes the one addressed by the ECN | |||
nonce. | nonce. | |||
9.3. Identifying Upstream and Downstream Congestion | 9.3. Identifying Upstream and Downstream Congestion | |||
Purple [Purple] proposes that routers should use the CWR flag in the | Purple [Purple] proposes that queues should use the CWR flag in the | |||
TCP header of ECN-capable flows to work out path congestion and | TCP header of ECN-capable flows to work out path congestion and | |||
therefore downstream congestion in a similar way to re-ECN. However, | therefore downstream congestion in a similar way to re-ECN. However, | |||
because CWR is in the transport layer, it is not always visible to | because CWR is in the transport layer, it is not always visible to | |||
network layer routers and policers. Purple's motivation was to | network layer routers and policers. Purple's motivation was to | |||
improve AQM, not policing. But, of course, nodes trying to avoid a | improve AQM, not policing. But, of course, nodes trying to avoid a | |||
policer would not be expected to allow CWR to be visible. | policer would not be expected to allow CWR to be visible. | |||
10. Security Considerations | 10. Security Considerations | |||
This whole memo concerns the deployment of a secure congestion | This whole memo concerns the deployment of a secure congestion | |||
skipping to change at page 68, line 31 | skipping to change at page 69, line 9 | |||
11. IANA Considerations | 11. IANA Considerations | |||
This memo includes no request to IANA (yet). | This memo includes no request to IANA (yet). | |||
If this memo was to progress to standards track, it would list: | If this memo was to progress to standards track, it would list: | |||
o The new RE flag in IPv4 (Section 5.1) and its extension with the | o The new RE flag in IPv4 (Section 5.1) and its extension with the | |||
ECN field to create a new set of extended ECN (EECN) codepoints; | ECN field to create a new set of extended ECN (EECN) codepoints; | |||
o The definition of the EECN codepoints for default Diffserv PHBs | o The definition of the EECN codepoints for default Diffserv PHBs | |||
(Section 3.2) | (Section 3.3) | |||
o The new extension header for IPv6 (Section 5.2); | o The new extension header for IPv6 (Section 5.2); | |||
o The new combinations of flags in the TCP header for capability | o The new combinations of flags in the TCP header for capability | |||
negotiation (Section 4.1.3); | negotiation (Section 4.1.3); | |||
o The new ICMP message type (Section 5.5.1). | o The new ICMP message type (Section 5.5.1). | |||
12. Conclusions | 12. Conclusions | |||
skipping to change at page 69, line 12 | skipping to change at page 69, line 36 | |||
Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, | Soppera, David Songhurst, Peter Hovell, Louise Burness, Phil Eardley, | |||
Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, | Steve Rudkin, Marc Wennink, Fabrice Saffre, Cefn Hoile, Steve Wright, | |||
John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru | John Davey, Martin Koyabe, Carla Di Cairano-Gilfedder, Alexandru | |||
Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd | Murgu, Nigel Geffen, Pete Willis, John Adams (BT), Sally Floyd | |||
(ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark | (ICIR), Joe Babiarz, Kwok Ho-Chan (Nortel), Stephen Hailes, Mark | |||
Handley (who developed the attack with canceled packets), Adam | Handley (who developed the attack with canceled packets), Adam | |||
Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft | Greenhalgh (who developed the attack on DNS) (UCL), Jon Crowcroft | |||
(Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who | (Uni Cam), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (who | |||
complemented our own dummy traffic attacks with others), Liz Maida | complemented our own dummy traffic attacks with others), Liz Maida | |||
(MIT), and comments from participants in the CRN/CFP Broadband and | (MIT), and comments from participants in the CRN/CFP Broadband and | |||
DoS-resistant Internet working groups. | DoS-resistant Internet working groups. A special thank you to | |||
Alessandro Salvatori for coming up with fiendish attacks on re-ECN. | ||||
14. Comments Solicited | 14. Comments Solicited | |||
Comments and questions are encouraged and very welcome. They can be | Comments and questions are encouraged and very welcome. They can be | |||
addressed to the IETF Transport Area working group's mailing list | addressed to the IETF Transport Area working group's mailing list | |||
<tsvwg@ietf.org>, and/or to the authors. | <tsvwg@ietf.org>, and/or to the authors. | |||
15. References | 15. References | |||
15.1. Normative References | 15.1. Normative References | |||
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate | |||
Requirement Levels", BCP 14, RFC 2119, March 1997. | Requirement Levels", BCP 14, RFC 2119, March 1997. | |||
[RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, | ||||
S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., | ||||
Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, | ||||
S., Wroclawski, J., and L. Zhang, "Recommendations on | ||||
Queue Management and Congestion Avoidance in the | ||||
Internet", RFC 2309, April 1998. | ||||
[RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion | [RFC2581] Allman, M., Paxson, V., and W. Stevens, "TCP Congestion | |||
Control", RFC 2581, April 1999. | Control", RFC 2581, April 1999. | |||
[RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C., | ||||
Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., | ||||
Zhang, L., and V. Paxson, "Stream Control Transmission | ||||
Protocol", RFC 2960, October 2000. | ||||
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | |||
of Explicit Congestion Notification (ECN) to IP", | of Explicit Congestion Notification (ECN) to IP", | |||
RFC 3168, September 2001. | RFC 3168, September 2001. | |||
[RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's | [RFC3390] Allman, M., Floyd, S., and C. Partridge, "Increasing TCP's | |||
Initial Window", RFC 3390, October 2002. | Initial Window", RFC 3390, October 2002. | |||
[RFC4340] Kohler, E., Handley, M., and S. Floyd, "Datagram | [RFC4340] Kohler, E., Handley, M., and S. Floyd, "Datagram | |||
Congestion Control Protocol (DCCP)", RFC 4340, March 2006. | Congestion Control Protocol (DCCP)", RFC 4340, March 2006. | |||
[RFC4341] Floyd, S. and E. Kohler, "Profile for Datagram Congestion | [RFC4341] Floyd, S. and E. Kohler, "Profile for Datagram Congestion | |||
Control Protocol (DCCP) Congestion Control ID 2: TCP-like | Control Protocol (DCCP) Congestion Control ID 2: TCP-like | |||
Congestion Control", RFC 4341, March 2006. | Congestion Control", RFC 4341, March 2006. | |||
[RFC4342] Floyd, S., Kohler, E., and J. Padhye, "Profile for | [RFC4342] Floyd, S., Kohler, E., and J. Padhye, "Profile for | |||
Datagram Congestion Control Protocol (DCCP) Congestion | Datagram Congestion Control Protocol (DCCP) Congestion | |||
Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342, | Control ID 3: TCP-Friendly Rate Control (TFRC)", RFC 4342, | |||
March 2006. | March 2006. | |||
[RFC4960] Stewart, R., "Stream Control Transmission Protocol", | ||||
RFC 4960, September 2007. | ||||
15.2. Informative References | 15.2. Informative References | |||
[ARI05] Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the | [ARI05] Adams, J., Roberts, L., and A. IJsselmuiden, "Changing the | |||
Internet to Support Real-Time Content Supply from a Large | Internet to Support Real-Time Content Supply from a Large | |||
Fraction of Broadband Residential Users", BT Technology | Fraction of Broadband Residential Users", BT Technology | |||
Journal (BTTJ) 23(2), April 2005. | Journal (BTTJ) 23(2), April 2005. | |||
[Bauer06] Bauer, S., Faratin, P., and R. Beverly, "Assessing the | [Bauer06] Bauer, S., Faratin, P., and R. Beverly, "Assessing the | |||
assumptions underlying mechanism design for the Internet", | assumptions underlying mechanism design for the Internet", | |||
Proc. Workshop on the Economics of Networked Systems | Proc. Workshop on the Economics of Networked Systems | |||
skipping to change at page 70, line 39 | skipping to change at page 71, line 11 | |||
Salvatori, A., "Closed Loop Traffic Policing", Politecnico | Salvatori, A., "Closed Loop Traffic Policing", Politecnico | |||
Torino and Institut Eurecom Masters Thesis , | Torino and Institut Eurecom Masters Thesis , | |||
September 2005. | September 2005. | |||
[ECN-Deploy] | [ECN-Deploy] | |||
Floyd, S., "ECN (Explicit Congestion Notification) in | Floyd, S., "ECN (Explicit Congestion Notification) in | |||
TCP/IP; Implementation and Deployment of ECN", Web-page , | TCP/IP; Implementation and Deployment of ECN", Web-page , | |||
May 2004, | May 2004, | |||
<http://www.icir.org/floyd/ecn.html#implementations>. | <http://www.icir.org/floyd/ecn.html#implementations>. | |||
[ECN-MPLS] | ||||
Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion | ||||
Marking in MPLS", draft-ietf-tsvwg-ecn-mpls-01 (work in | ||||
progress), June 2007. | ||||
[ECN-tunnel] | [ECN-tunnel] | |||
Briscoe, B., "Layered Encapsulation of Congestion | Briscoe, B., "Layered Encapsulation of Congestion | |||
Notification", draft-briscoe-tsvwg-ecn-tunnel-00 (work in | Notification", draft-briscoe-tsvwg-ecn-tunnel-00 (work in | |||
progress), June 2007. | progress), June 2007. | |||
[Evol_cc] Gibbens, R. and F. Kelly, "Resource pricing and the | [Evol_cc] Gibbens, R. and F. Kelly, "Resource pricing and the | |||
evolution of congestion control", Automatica 35(12)1969-- | evolution of congestion control", Automatica 35(12)1969-- | |||
1985, December 1999, | 1985, December 1999, | |||
<http://www.statslab.cam.ac.uk/~frank/evol.html>. | <http://www.statslab.cam.ac.uk/~frank/evol.html>. | |||
[I-D.ietf-tcpm-ecnsyn] | [I-D.ietf-tcpm-ecnsyn] | |||
Kuzmanovic, A., "Adding Explicit Congestion Notification | Kuzmanovic, A., "Adding Explicit Congestion Notification | |||
(ECN) Capability to TCP's SYN/ACK Packets", | (ECN) Capability to TCP's SYN/ACK Packets", | |||
draft-ietf-tcpm-ecnsyn-03 (work in progress), | draft-ietf-tcpm-ecnsyn-05 (work in progress), | |||
November 2007. | February 2008. | |||
[I-D.moncaster-tcpm-rcv-cheat] | [I-D.moncaster-tcpm-rcv-cheat] | |||
Moncaster, T., "A TCP Test to Allow Senders to Identify | Moncaster, T., "A TCP Test to Allow Senders to Identify | |||
Receiver Non-Compliance", | Receiver Non-Compliance", | |||
draft-moncaster-tcpm-rcv-cheat-02 (work in progress), | draft-moncaster-tcpm-rcv-cheat-02 (work in progress), | |||
November 2007. | November 2007. | |||
[ITU-T.I.371] | [ITU-T.I.371] | |||
ITU-T, "Traffic Control and Congestion Control in | ITU-T, "Traffic Control and Congestion Control in | |||
{B-ISDN}", ITU-T Rec. I.371 (03/04), March 2004. | {B-ISDN}", ITU-T Rec. I.371 (03/04), March 2004. | |||
skipping to change at page 71, line 36 | skipping to change at page 71, line 51 | |||
[Mathis97] | [Mathis97] | |||
Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The | Mathis, M., Semke, J., Mahdavi, J., and T. Ott, "The | |||
Macroscopic Behavior of the TCP Congestion Avoidance | Macroscopic Behavior of the TCP Congestion Avoidance | |||
Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997, | Algorithm", ACM SIGCOMM CCR 27(3)67--82, July 1997, | |||
<http://doi.acm.org/10.1145/263932.264023>. | <http://doi.acm.org/10.1145/263932.264023>. | |||
[PCN-arch] | [PCN-arch] | |||
Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R., | Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R., | |||
Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion | Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion | |||
Notification Architecture", | Notification Architecture", draft-ietf-pcn-architecture-03 | |||
draft-eardley-pcn-architecture-00 (work in progress), | (work in progress), February 2008. | |||
June 2007. | ||||
[Purple] Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE: | [Purple] Pletka, R., Waldvogel, M., and S. Mannal, "PURPLE: | |||
Predictive Active Queue Management Utilizing Congestion | Predictive Active Queue Management Utilizing Congestion | |||
Information", Proc. Local Computer Networks (LCN 2003) , | Information", Proc. Local Computer Networks (LCN 2003) , | |||
October 2003. | October 2003. | |||
[RFC2208] Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell, | [RFC2208] Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell, | |||
M., Romanow, A., Weinrib, A., and L. Zhang, "Resource | M., Romanow, A., Weinrib, A., and L. Zhang, "Resource | |||
ReSerVation Protocol (RSVP) Version 1 Applicability | ReSerVation Protocol (RSVP) Version 1 Applicability | |||
Statement Some Guidelines on Deployment", RFC 2208, | Statement Some Guidelines on Deployment", RFC 2208, | |||
September 1997. | September 1997. | |||
[RFC2402] Kent, S. and R. Atkinson, "IP Authentication Header", | [RFC2309] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, | |||
RFC 2402, November 1998. | S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., | |||
Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, | ||||
[RFC2406] Kent, S. and R. Atkinson, "IP Encapsulating Security | S., Wroclawski, J., and L. Zhang, "Recommendations on | |||
Payload (ESP)", RFC 2406, November 1998. | Queue Management and Congestion Avoidance in the | |||
Internet", RFC 2309, April 1998. | ||||
[RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., | [RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z., | |||
and W. Weiss, "An Architecture for Differentiated | and W. Weiss, "An Architecture for Differentiated | |||
Services", RFC 2475, December 1998. | Services", RFC 2475, December 1998. | |||
[RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | [RFC2988] Paxson, V. and M. Allman, "Computing TCP's Retransmission | |||
Timer", RFC 2988, November 2000. | Timer", RFC 2988, November 2000. | |||
[RFC3124] Balakrishnan, H. and S. Seshan, "The Congestion Manager", | [RFC3124] Balakrishnan, H. and S. Seshan, "The Congestion Manager", | |||
RFC 3124, June 2001. | RFC 3124, June 2001. | |||
skipping to change at page 72, line 33 | skipping to change at page 72, line 47 | |||
Congestion Notification (ECN) Signaling with Nonces", | Congestion Notification (ECN) Signaling with Nonces", | |||
RFC 3540, June 2003. | RFC 3540, June 2003. | |||
[RFC3714] Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion | [RFC3714] Floyd, S. and J. Kempf, "IAB Concerns Regarding Congestion | |||
Control for Voice Traffic in the Internet", RFC 3714, | Control for Voice Traffic in the Internet", RFC 3714, | |||
March 2004. | March 2004. | |||
[RFC4301] Kent, S. and K. Seo, "Security Architecture for the | [RFC4301] Kent, S. and K. Seo, "Security Architecture for the | |||
Internet Protocol", RFC 4301, December 2005. | Internet Protocol", RFC 4301, December 2005. | |||
[RFC4302] Kent, S., "IP Authentication Header", RFC 4302, | ||||
December 2005. | ||||
[RFC4305] Eastlake, D., "Cryptographic Algorithm Implementation | ||||
Requirements for Encapsulating Security Payload (ESP) and | ||||
Authentication Header (AH)", RFC 4305, December 2005. | ||||
[RFC5129] Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion | ||||
Marking in MPLS", RFC 5129, January 2008. | ||||
[Re-PCN] Briscoe, B., "Emulating Border Flow Policing using Re-ECN | [Re-PCN] Briscoe, B., "Emulating Border Flow Policing using Re-ECN | |||
on Bulk Data", draft-briscoe-re-pcn-border-cheat-00 (work | on Bulk Data", draft-briscoe-re-pcn-border-cheat-01 (work | |||
in progress), July 2007. | in progress), February 2008. | |||
[Re-fb] Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., | [Re-fb] Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., | |||
Salvatori, A., Soppera, A., and M. Koyabe, "Policing | Salvatori, A., Soppera, A., and M. Koyabe, "Policing | |||
Congestion Response in an Internetwork Using Re-Feedback", | Congestion Response in an Internetwork Using Re-Feedback", | |||
ACM SIGCOMM CCR 35(4)277--288, August 2005, <http:// | ACM SIGCOMM CCR 35(4)277--288, August 2005, <http:// | |||
www.acm.org/sigs/sigcomm/sigcomm2005/ | www.acm.org/sigs/sigcomm/sigcomm2005/ | |||
techprog.html#session8>. | techprog.html#session8>. | |||
[Savage99] | [Savage99] | |||
Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | Savage, S., Cardwell, N., Wetherall, D., and T. Anderson, | |||
skipping to change at page 73, line 36 | skipping to change at page 74, line 10 | |||
[pBox] Floyd, S. and K. Fall, "Promoting the Use of End-to-End | [pBox] Floyd, S. and K. Fall, "Promoting the Use of End-to-End | |||
Congestion Control in the Internet", IEEE/ACM Transactions | Congestion Control in the Internet", IEEE/ACM Transactions | |||
on Networking 7(4) 458--472, August 1999, | on Networking 7(4) 458--472, August 1999, | |||
<http://www.aciri.org/floyd/end2end-paper.html>. | <http://www.aciri.org/floyd/end2end-paper.html>. | |||
Appendix A. Precise Re-ECN Protocol Operation | Appendix A. Precise Re-ECN Protocol Operation | |||
{ToDo: fix this} | {ToDo: fix this} | |||
The protocol operation in the middle described in Section 3.3 was an | The protocol operation in the middle described in Section 3.4 was an | |||
approximation. In fact, standard ECN router marking combines 1% and | approximation. In fact, standard ECN router marking combines 1% and | |||
2% marking into slightly less than 3% whole-path marking, because | 2% marking into slightly less than 3% whole-path marking, because | |||
routers deliberately mark CE whether or not it has already been | routers deliberately mark CE whether or not it has already been | |||
marked by another router upstream. So the combined marking fraction | marked by another router upstream. So the combined marking fraction | |||
would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%. | would actually be 100% - (100% - 1%)(100% - 2%) = 2.98%. | |||
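The 2.98% figure follows from treating each router's marking as independent. A packet leaves the path unmarked only if every router leaves it unmarked, so the whole-path marking fraction is one minus the product of the per-router "not marked" probabilities. A minimal sketch of that arithmetic, assuming marking probabilities are given as fractions rather than percentages (the function name is illustrative):

```python
from math import prod


def whole_path_marking(marks):
    """Combined marking fraction for a path whose resources each mark
    independently with the given probabilities (fractions, not %).

    A packet escapes marking only if no resource marks it, so the
    combined fraction is 1 - prod(1 - m) over all resources.
    """
    return 1.0 - prod(1.0 - m for m in marks)


# The example from the text: 1% marking followed by 2% marking.
print(f"{whole_path_marking([0.01, 0.02]):.2%}")  # 2.98%
print(f"{0.01 + 0.02:.2%}")                       # naive sum: 3.00%
```

The gap between the product form and the naive sum is the share of packets marked by both routers, which is only counted once.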
To generalise this we will need some notation. | To generalise this we will need some notation. | |||
o j represents the index of each resource (typically queues) along a | o j represents the index of each resource (typically queues) along a | |||
path, ranging from 0 at the first router to n-1 at the last. | path, ranging from 0 at the first router to n-1 at the last. | |||
skipping to change at page 74, line 37 | skipping to change at page 75, line 12 | |||
p_0 = u_n | p_0 = u_n | |||
= 1 - (1 - m_1)(1 - m_2)... | = 1 - (1 - m_1)(1 - m_2)... | |||
Similarly, at some point j in the middle of the network, if p = 1 - | Similarly, at some point j in the middle of the network, if p = 1 - | |||
(1 - u_j)(1 - v_j), then | (1 - u_j)(1 - v_j), then | |||
v_j = 1 - (1 - p)/(1 - u_j) | v_j = 1 - (1 - p)/(1 - u_j) | |||
~= p - u_j; if u_j << 100% | ~= p - u_j; if u_j << 100% | |||
So, between the two routers in the example in Section 3.3, congestion | So, between the two routers in the example in Section 3.4, congestion | |||
downstream is | downstream is | |||
v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%) | v_1 = 100.00% - (100% - 2.98%) / (100% - 1.00%) | |||
= 2.00%, | = 2.00%, | |||
or a useful approximation of downstream congestion is | or a useful approximation of downstream congestion is | |||
v_1 ~= 2.98% - 1.00% | v_1 ~= 2.98% - 1.00% | |||
~= 1.98%. | ~= 1.98%. | |||
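The exact and approximate expressions for downstream congestion above can be compared numerically. The sketch below rearranges p = 1 - (1 - u_j)(1 - v_j) for v_j and evaluates both forms with the example values from the text (p = 2.98%, u_1 = 1.00%). Inputs are fractions rather than percentages, and the function names are illustrative only.

```python
def downstream_congestion(p, u_j):
    """Exact downstream congestion v_j from whole-path congestion p and
    upstream congestion u_j, using p = 1 - (1 - u_j)(1 - v_j)."""
    if u_j >= 1.0:
        raise ValueError("upstream congestion must be below 100%")
    return 1.0 - (1.0 - p) / (1.0 - u_j)


def downstream_congestion_approx(p, u_j):
    """Approximation v_j ~= p - u_j, reasonable only when u_j << 100%."""
    return p - u_j


p, u_1 = 0.0298, 0.01
print(f"exact:  {downstream_congestion(p, u_1):.2%}")         # exact:  2.00%
print(f"approx: {downstream_congestion_approx(p, u_1):.2%}")  # approx: 1.98%
```

The approximation drops the cross term u_j * v_j, so its error grows as upstream congestion rises.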
Appendix B. Justification for Two Codepoints Signifying Zero Worth | Appendix B. Justification for Two Codepoints Signifying Zero Worth | |||
Packets | Packets | |||
It may seem a waste of a codepoint to set aside two codepoints of the | It may seem a waste of a codepoint to set aside two codepoints of the | |||
Extended ECN field to signify zero worth (RECT and CE(0) are both | Extended ECN field to signify zero worth (RECT and CE(0) are both | |||
worth zero). The justification is subtle, but worth recording. | worth zero). The justification is subtle, but worth recording. | |||
The original version of re-ECN ([Re-fb] and draft-00 of this memo) | The original version of Re-ECN ([Re-fb] and draft-00 of this memo) | |||
used three codepoints for neutral (ECT(1)), positive (ECT(0)) and | used three codepoints for neutral (ECT(1)), positive (ECT(0)) and | |||
negative (CE) packets. The sender set packets to neutral unless re- | negative (CE) packets. The sender set packets to neutral unless re- | |||
echoing congestion, when it set them positive, in much the same way | echoing congestion, when it set them positive, in much the same way | |||
that it blanks the RE flag in the current protocol. However, routers | that it blanks the RE flag in the current protocol. However, routers | |||
were meant to mark congestion by setting packets negative (CE) | were meant to mark congestion by setting packets negative (CE) | |||
irrespective of whether they had previously been neutral or positive. | irrespective of whether they had previously been neutral or positive. | |||
However, we did not arrange for senders to remember which packet had | However, we did not arrange for senders to remember which packet had | |||
been sent with which codepoint, or for feedback to say exactly which | been sent with which codepoint, or for feedback to say exactly which | |||
packets arrived with which codepoints. The transport was meant to | packets arrived with which codepoints. The transport was meant to | |||
inflate the number of positive packets it sent to allow for a few | inflate the number of positive packets it sent to allow for a few | |||
being wiped out by congestion marking. We (wrongly) assumed that | being wiped out by congestion marking. We (wrongly) assumed that | |||
routers would congestion mark packets indiscriminately, so the | routers would congestion mark packets indiscriminately, so the | |||
transport could infer how many positive packets had been marked and | transport could infer how many positive packets had been marked and | |||
compensate accordingly by re-echoing. But this created a perverse | compensate accordingly by re-echoing. But this created a perverse | |||
incentive for routers to preferentially congestion mark positive | incentive for routers to preferentially congestion mark positive | |||
packets rather than neutral ones. | packets rather than neutral ones. | |||
We could have removed this perverse incentive by requiring re-ECN | We could have removed this perverse incentive by requiring Re-ECN | |||
senders to remember which packets they had sent with which codepoint. | senders to remember which packets they had sent with which codepoint. | |||
And for feedback from the receiver to identify which packets arrived | And for feedback from the receiver to identify which packets arrived | |||
as which. Then, if a positive packet was congestion marked to | as which. Then, if a positive packet was congestion marked to | |||
negative, the sender could have re-echoed twice to maintain the | negative, the sender could have re-echoed twice to maintain the | |||
balance between positive and negative at the receiver. | balance between positive and negative at the receiver. | |||
Instead, we chose to make re-echoing congestion (blanking RE) | Instead, we chose to make re-echoing congestion (blanking RE) | |||
orthogonal to congestion notification (marking CE), which required a | orthogonal to congestion notification (marking CE), which required a | |||
second neutral codepoint (the orthogonal scheme forms the main square | second neutral codepoint. Then the receiver would be able to detect | |||
of four codepoints in Figure 2). Then the receiver would be able to | and echo a congestion event even if it arrived on a packet that had | |||
detect and echo a congestion event even if it arrived on a packet | originally been positive. | |||
that had originally been positive. | ||||
If we had added extra complexity to the sender and receiver | If we had added extra complexity to the sender and receiver | |||
transports to track changes to individual packets, we could have made | transports to track changes to individual packets, we could have made | |||
it work, but then routers would have had an incentive to mark | it work, but then routers would have had an incentive to mark | |||
positive packets with half the probability of neutral packets. That | positive packets with half the probability of neutral packets. That | |||
in turn would have led router algorithms to become more complex. | in turn would have led router algorithms to become more complex. | |||
Then senders wouldn't know whether a mark had been introduced by a | Then senders wouldn't know whether a mark had been introduced by a | |||
simple or a complex router algorithm. That in turn would have | simple or a complex router algorithm. That in turn would have | |||
required another codepoint to distinguish between legacy ECN and new | required another codepoint to distinguish between RFC3168 ECN and new | |||
re-ECN router marking. | Re-ECN router marking. | |||
Once the cost of IP header codepoint real-estate was the same for | Once the cost of IP header codepoint real-estate was the same for | |||
both schemes, there was no doubt that the simpler option for | both schemes, there was no doubt that the simpler option for | |||
endpoints and for routers should be chosen. The resulting protocol | endpoints and for routers should be chosen. The resulting protocol | |||
also no longer needed the tricky inflation/deflation complexity of | also no longer needed the tricky inflation/deflation complexity of | |||
the original (broken) scheme. It was also much simpler to understand | the original (broken) scheme. It was also much simpler to understand | |||
conceptually. | conceptually. | |||
A further advantage of the new orthogonal four-codepoint scheme was | A further advantage of the new orthogonal four-codepoint scheme was | |||
that senders owned sole rights to change the RE flag and routers | that senders owned sole rights to change the RE flag and routers | |||
skipping to change at page 76, line 29 | skipping to change at page 77, line 5 | |||
using such redundant relationships can improve the security of a | using such redundant relationships can improve the security of a | |||
scheme (cf. double-entry book-keeping or the ECN Nonce). | scheme (cf. double-entry book-keeping or the ECN Nonce). | |||
Alternatively, it might be necessary to exploit the redundancy in the | Alternatively, it might be necessary to exploit the redundancy in the | |||
future to encode an extra information channel. | future to encode an extra information channel. | |||
Appendix C. ECN Compatibility | Appendix C. ECN Compatibility | |||
The rationale for choosing the particular combinations of SYN and SYN | The rationale for choosing the particular combinations of SYN and SYN | |||
ACK flags in Section 4.1.3 is as follows. | ACK flags in Section 4.1.3 is as follows. | |||
Choice of SYN flags: A re-ECN sender can work with vanilla ECN | Choice of SYN flags: A Re-ECN sender can work with RFC3168 compliant | |||
receivers so we wanted to use the same flags as would be used in | ECN receivers so we wanted to use the same flags as would be used | |||
an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same time, | in an ECN-setup SYN [RFC3168] (CWR=1, ECE=1). But at the same | |||
we wanted a server (host B) that is Re-ECT to be able to recognise | time, we wanted a server (host B) that is Re-ECT to be able to | |||
that the client (A) is also Re-ECT. We believe also setting NS=1 | recognise that the client (A) is also Re-ECT. We believe also | |||
in the initial SYN achieves both these objectives, as it should be | setting NS=1 in the initial SYN achieves both these objectives, as | |||
ignored by vanilla ECT receivers and by ECT-Nonce receivers. But | it should be ignored by RFC3168 compliant ECT receivers and by | |||
senders that are not Re-ECT should not set NS=1. At the time ECN | ECT-Nonce receivers. But senders that are not Re-ECT should not | |||
was defined, the NS flag was not defined, so setting NS=1 should | set NS=1. At the time ECN was defined, the NS flag was not | |||
be ignored by existing ECT receivers (but testing against | defined, so setting NS=1 should be ignored by existing ECT | |||
implementations may yet prove otherwise). The ECN Nonce | receivers (but testing against implementations may yet prove | |||
RFC [RFC3540] is silent on what the NS field might be set to in | otherwise). The ECN Nonce RFC [RFC3540] is silent on what the NS | |||
the TCP SYN, but we believe the intent was for a nonce client to | field might be set to in the TCP SYN, but we believe the intent | |||
set NS=0 in the initial SYN (again only testing will tell). | was for a nonce client to set NS=0 in the initial SYN (again only | |||
Therefore we define a Re-ECN-setup SYN as one with NS=1, CWR=1 & | testing will tell). Therefore we define a Re-ECN-setup SYN as one | |||
ECE=1. | with NS=1, CWR=1 & ECE=1. | |||
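The SYN rule above can be sketched as a small classifier that a Re-ECT server might apply to an incoming initial SYN. This is a minimal illustration only. The function name and the string labels it returns are hypothetical, and real TCP stacks read these bits from the header rather than taking booleans.

```python
def classify_syn(ns: bool, cwr: bool, ece: bool) -> str:
    """Classify an initial SYN by its NS, CWR and ECE flags (hypothetical helper)."""
    if cwr and ece:
        # CWR=1 & ECE=1 is an RFC3168 ECN-setup SYN. A Re-ECT client also
        # sets NS=1, which RFC3168 and ECN-Nonce receivers are expected to ignore.
        return "re-ecn-setup" if ns else "ecn-setup"
    # Any other combination is not an ECN-setup SYN of either kind.
    return "non-ecn-setup"
```

Because an RFC3168 receiver never looks at NS, it sees the same CWR=1 & ECE=1 pattern whichever kind of client sent the SYN.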
Choice of SYN ACK flags: The client (A) needs to | Choice of SYN ACK flags: The client (A) needs to | |||
be able to determine whether the server (B) is Re-ECT. The | be able to determine whether the server (B) is Re-ECT. The | |||
original ECN specification required an ECT server to respond to an | original ECN specification required an ECT server to respond to an | |||
ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There | ECN-setup SYN with an ECN-setup SYN ACK of CWR=0 and ECE=1. There | |||
is no room to modify this by setting the NS flag, as that is | is no room to modify this by setting the NS flag, as that is | |||
already set in the SYN ACK of an ECT-Nonce server. So we used the | already set in the SYN ACK of an ECT-Nonce server. So we used the | |||
only combination of CWR and ECE that would not be used by existing | only combination of CWR and ECE that would not be used by existing | |||
TCP receivers: CWR=1 and ECE=0. The original ECN specification | TCP receivers: CWR=1 and ECE=0. The original ECN specification | |||
defines this combination as a non-ECN-setup SYN ACK, which remains | defines this combination as a non-ECN-setup SYN ACK, which remains | |||
true for vanilla and Nonce ECTs. But for re-ECN we define it as a | true for RFC3168 compliant and Nonce ECTs. But for Re-ECN we | |||
Re-ECN-setup SYN ACK. We didn't use a SYN ACK with both CWR and | define it as a Re-ECN-setup SYN ACK. We didn't use a SYN ACK with | |||
ECE cleared to 0 because that would be the likely response from | both CWR and ECE cleared to 0 because that would be the likely | |||
most Not-ECT receivers. And we didn't use a SYN ACK with both CWR | response from most Not-ECT receivers. And we didn't use a SYN ACK | |||
and ECE set to 1 either, as at least one broken receiver | with both CWR and ECE set to 1 either, as at least one broken | |||
implementation echoes whatever flags were in the SYN into its SYN | receiver implementation echoes whatever flags were in the SYN into | |||
ACK. Therefore we define a Re-ECN-setup SYN ACK as one with CWR=1 | its SYN ACK. Therefore we define a Re-ECN-setup SYN ACK as one | |||
& ECE=0. | with CWR=1 & ECE=0. | |||
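Seen from the client (A) that sent a Re-ECN-setup SYN, the four possible CWR/ECE combinations in the SYN ACK can be mapped as in the sketch below. The helper name and labels are hypothetical, and the NS flag is deliberately left out because its meaning in the SYN ACK depends on which setup was negotiated.

```python
def classify_syn_ack(cwr: bool, ece: bool) -> str:
    """Interpret a SYN ACK's CWR/ECE flags from the client's side (hypothetical helper)."""
    if cwr and not ece:
        # Re-ECN-setup SYN ACK. RFC3168 and Nonce ECTs would treat this
        # combination as non-ECN-setup, so no existing receiver sends it.
        return "re-ecn-setup"
    if ece and not cwr:
        # RFC3168 ECN-setup SYN ACK (an ECT-Nonce server also sets NS=1).
        return "ecn-setup"
    # CWR=0 & ECE=0: the usual reply from a Not-ECT receiver.
    # CWR=1 & ECE=1: possibly a broken receiver echoing the SYN's flags.
    return "non-ecn-setup"
```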
Choice of two alternative SYN ACKs: the NS flag may take either | Choice of two alternative SYN ACKs: the NS flag may take either | |||
value in a Re-ECN-setup SYN ACK. Section 5.4 REQUIRES that a Re- | value in a Re-ECN-setup SYN ACK. Section 5.4 REQUIRES that a Re- | |||
ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK to | ECT server MUST set the NS flag to 1 in a Re-ECN-setup SYN ACK to | |||
echo congestion experienced (CE) on the initial SYN. Otherwise a | echo congestion experienced (CE) on the initial SYN. Otherwise a | |||
Re-ECN-setup SYN ACK MUST be returned with NS=0. The only current | Re-ECN-setup SYN ACK MUST be returned with NS=0. The only current | |||
known use of the NS flag in a SYN ACK is to indicate support for | known use of the NS flag in a SYN ACK is to indicate support for | |||
the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1. | the ECN nonce, which will be negotiated by setting CWR=0 & ECE=1. | |||
Given the ECN nonce MUST NOT be used for a RECN mode connection, a | Given the ECN nonce MUST NOT be used for a RECN mode connection, a | |||
Re-ECN-setup SYN ACK can use either setting of the NS flag without | Re-ECN-setup SYN ACK can use either setting of the NS flag without | |||
skipping to change at page 80, line 13 | skipping to change at page 80, line 36 | |||
original intent in the early days of the Internet). | original intent in the early days of the Internet). | |||
In the longer term, precision could be improved if routers | In the longer term, precision could be improved if routers | |||
decremented TTL to represent exact propagation delay to the next | decremented TTL to represent exact propagation delay to the next | |||
router. That is, for a router to decrement TTL by, say, 1.8 time | router. That is, for a router to decrement TTL by, say, 1.8 time | |||
units it would alternate the decrement of every packet between 1 & 2 | units it would alternate the decrement of every packet between 1 & 2 | |||
at a ratio of 1:4. Although this might sometimes require a seemingly | at a ratio of 1:4. Although this might sometimes require a seemingly | |||
dangerous null decrement, a packet in a loop would still decrement to | dangerous null decrement, a packet in a loop would still decrement to | |||
zero after 255 time units on average. As more routers were upgraded | zero after 255 time units on average. As more routers were upgraded | |||
to this more accurate TTL decrement, path delay estimates would | to this more accurate TTL decrement, path delay estimates would | |||
become increasingly accurate despite the presence of some legacy | become increasingly accurate despite the presence of some RFC3168 | |||
routers that continued to always decrement the TTL by 1. | compliant routers that continued to always decrement the TTL by 1. | |||
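The text does not say how a router would interleave the two decrement values. One simple way, assumed here, is to carry the fractional remainder from packet to packet (error diffusion), so that the long-run mean decrement equals the configured delay. The class and its names are hypothetical.

```python
import math
from fractions import Fraction


class FractionalTtlDecrementer:
    """Hypothetical per-interface helper: integer TTL decrements whose
    long-run average equals a fractional delay in TTL time units."""

    def __init__(self, delay: Fraction):
        self.delay = delay
        self.residue = Fraction(0)  # delay accrued but not yet charged to a packet

    def next_decrement(self) -> int:
        self.residue += self.delay
        whole = math.floor(self.residue)  # charge only whole time units
        self.residue -= whole             # carry the remainder to the next packet
        return whole


d = FractionalTtlDecrementer(Fraction(9, 5))  # 1.8 time units
print([d.next_decrement() for _ in range(5)])
```

For 1.8 units this yields 1, 2, 2, 2, 2, i.e. decrements of 1 and 2 at the 1:4 ratio described above. With a delay of 0.4 units the same counter yields 0, 0, 1, 0, 1, which includes the null decrements mentioned above.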
Appendix G. Policer Designs to ensure Congestion Responsiveness | Appendix G. Policer Designs to ensure Congestion Responsiveness | |||
G.1. Per-user Policing | G.1. Per-user Policing | |||
User policing requires a policer on the ingress interface of the | User policing requires a policer on the ingress interface of the | |||
access router associated with the user. At that point, the traffic | access router associated with the user. At that point, the traffic | |||
of the user hasn't diverged on different routes yet; nor has it mixed | of the user hasn't diverged on different routes yet; nor has it mixed | |||
with traffic from other sources. | with traffic from other sources. | |||
skipping to change at page 84, line 30 | skipping to change at page 85, line 8 | |||
V_b: accumulated congestion volume | V_b: accumulated congestion volume | |||
B: total data volume (in case it is needed) | B: total data volume (in case it is needed) | |||
A suitable pseudo-code algorithm for a border router is as follows: | A suitable pseudo-code algorithm for a border router is as follows: | |||
==================================================================== | ==================================================================== | |||
V_b = 0 | V_b = 0 | |||
B = 0 | B = 0 | |||
for each re-ECN-capable packet { | for each Re-ECN-capable packet { | |||
b = readLength(packet) /* set b to packet size */ | b = readLength(packet) /* set b to packet size */ | |||
B += b /* accumulate total volume */ | B += b /* accumulate total volume */ | |||
if readEECN(packet) == (Re-Echo || FNE) { | if readEECN(packet) == (Re-Echo || FNE) { | |||
V_b += b /* increment... */ | V_b += b /* increment... */ | |||
} elseif readEECN(packet) == CE(-1) { | } elseif readEECN(packet) == CE(-1) { | |||
V_b -= b /* ...or decrement V_b... */ | V_b -= b /* ...or decrement V_b... */ | |||
} /*...depending on EECN field */ | } /*...depending on EECN field */ | |||
} | } | |||
==================================================================== | ==================================================================== | |||
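A runnable rendering of the border-router pseudo-code above follows. It reads the test `readEECN(packet) == (Re-Echo || FNE)` as "the codepoint is either Re-Echo or FNE". The `EECN` labels are hypothetical stand-ins for the extended ECN codepoints defined elsewhere in this specification, and packets are assumed to arrive as (length, codepoint) pairs that have already been filtered to Re-ECN-capable traffic.

```python
from enum import Enum, auto
from typing import Iterable, Tuple


class EECN(Enum):
    # Hypothetical labels; the real bit encodings are defined elsewhere.
    RE_ECHO = auto()
    FNE = auto()
    CE_MINUS_1 = auto()  # "CE(-1)" in the pseudo-code
    OTHER = auto()


def meter_border(packets: Iterable[Tuple[int, EECN]]) -> Tuple[int, int]:
    """Return (V_b, B) over Re-ECN-capable packets given as (length, codepoint)."""
    v_b = 0    # accumulated congestion volume
    total = 0  # total data volume, B
    for length, codepoint in packets:
        total += length
        if codepoint in (EECN.RE_ECHO, EECN.FNE):
            v_b += length  # positive marking increments V_b
        elif codepoint is EECN.CE_MINUS_1:
            v_b -= length  # negative marking decrements V_b
    return v_b, total


print(meter_border([(1500, EECN.RE_ECHO), (1500, EECN.CE_MINUS_1),
                    (40, EECN.FNE), (1000, EECN.OTHER)]))  # (40, 4040)
```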
skipping to change at page 87, line 13 | skipping to change at page 87, line 36 | |||
sending transports (e.g. large servers) want to allocate their /own/ | sending transports (e.g. large servers) want to allocate their /own/ | |||
resources in proportion to the rates that each network path can | resources in proportion to the rates that each network path can | |||
sustain, based on congestion control. In that case, the nonce allows | sustain, based on congestion control. In that case, the nonce allows | |||
senders to be assured that they aren't being duped into giving more | senders to be assured that they aren't being duped into giving more | |||
of their own resources to a particular flow. And if congestion | of their own resources to a particular flow. And if congestion | |||
suppression is detected, the sending transport can rate limit the | suppression is detected, the sending transport can rate limit the | |||
offending connection to protect its own resources. Certainly, this | offending connection to protect its own resources. Certainly, this | |||
is a useful function, but the IETF should carefully decide whether | is a useful function, but the IETF should carefully decide whether | |||
such a single, very specific case warrants IP header space. | such a single, very specific case warrants IP header space. | |||
In contrast, re-ECN allows all routers to fully protect themselves | In contrast, Re-ECN allows all routers to fully protect themselves | |||
from such attacks, without having to trust anyone - senders, | from such attacks, without having to trust anyone - senders, | |||
receivers, neighbouring networks. Re-ECN is therefore proposed in | receivers, neighbouring networks. Re-ECN is therefore proposed in | |||
preference to the ECN nonce on the basis that it addresses the | preference to the ECN nonce on the basis that it addresses the | |||
generic problem of accountability for congestion of a network's | generic problem of accountability for congestion of a network's | |||
resources at the IP layer. | resources at the IP layer. | |||
Delaying the ECN nonce is justified because the applicability of the | Delaying the ECN nonce is justified because the applicability of the | |||
ECN nonce seems too limited for it to consume a two-bit codepoint in | ECN nonce seems too limited for it to consume a two-bit codepoint in | |||
the IP header. It therefore seems prudent to give time for an | the IP header. It therefore seems prudent to give time for an | |||
alternative way to be found to do the one function the nonce is | alternative way to be found to do the one function the nonce is | |||
essential for. | essential for. | |||
Moreover, while we have re-designed the re-ECN codepoints so that | Moreover, while we have re-designed the Re-ECN codepoints so that | |||
they do not prevent the ECN nonce progressing, the same is not true | they do not prevent the ECN nonce progressing, the same is not true | |||
the other way round. If the ECN nonce started to see some deployment | the other way round. If the ECN nonce started to see some deployment | |||
(perhaps because it was blessed with proposed standard status), | (perhaps because it was blessed with proposed standard status), | |||
incremental deployment of re-ECN would effectively be impossible, | incremental deployment of Re-ECN would effectively be impossible, | |||
because re-ECN marking fractions at inter-domain borders would be | because Re-ECN marking fractions at inter-domain borders would be | |||
polluted by unknown levels of nonce traffic. | polluted by unknown levels of nonce traffic. | |||
The authors are aware that re-ECN must prove it has the potential it | The authors are aware that Re-ECN must prove it has the potential it | |||
claims if it is to displace the nonce. Therefore, every effort has | claims if it is to displace the nonce. Therefore, every effort has | |||
been made to complete a comprehensive specification of re-ECN so that | been made to complete a comprehensive specification of Re-ECN so that | |||
its potential can be assessed. We therefore seek the opinion of the | its potential can be assessed. We therefore seek the opinion of the | |||
Internet community on whether the re-ECN protocol is sufficiently | Internet community on whether the Re-ECN protocol is sufficiently | |||
useful to warrant standards action. | useful to warrant standards action. | |||
Authors' Addresses | Authors' Addresses | |||
Bob Briscoe | Bob Briscoe | |||
BT & UCL | BT & UCL | |||
B54/77, Adastral Park | B54/77, Adastral Park | |||
Martlesham Heath | Martlesham Heath | |||
Ipswich IP5 3RE | Ipswich IP5 3RE | |||
UK | UK | |||
End of changes. 174 change blocks. | ||||
592 lines changed or deleted | 577 lines changed or added | |||
This html diff was produced by rfcdiff 1.35. The latest version is available from http://tools.ietf.org/tools/rfcdiff/ |