draft-briscoe-tsvwg-ecn-tunnel-00.txt | draft-briscoe-tsvwg-ecn-tunnel-01.txt | |||
---|---|---|---|---|
Transport Area Working Group B. Briscoe | Transport Area Working Group B. Briscoe | |||
Internet-Draft BT | Internet-Draft BT | |||
Intended status: Standards Track June 30, 2007 | Intended status: Standards Track July 14, 2008 | |||
Expires: January 1, 2008 | Expires: January 15, 2009 | |||
Layered Encapsulation of Congestion Notification | Layered Encapsulation of Congestion Notification | |||
draft-briscoe-tsvwg-ecn-tunnel-00 | draft-briscoe-tsvwg-ecn-tunnel-01 | |||
Status of this Memo | Status of this Memo | |||
By submitting this Internet-Draft, each author represents that any | By submitting this Internet-Draft, each author represents that any | |||
applicable patent or other IPR claims of which he or she is aware | applicable patent or other IPR claims of which he or she is aware | |||
have been or will be disclosed, and any of which he or she becomes | have been or will be disclosed, and any of which he or she becomes | |||
aware will be disclosed, in accordance with Section 6 of BCP 79. | aware will be disclosed, in accordance with Section 6 of BCP 79. | |||
Internet-Drafts are working documents of the Internet Engineering | Internet-Drafts are working documents of the Internet Engineering | |||
Task Force (IETF), its areas, and its working groups. Note that | Task Force (IETF), its areas, and its working groups. Note that | |||
skipping to change at page 1, line 34 | skipping to change at page 1, line 34 | |||
and may be updated, replaced, or obsoleted by other documents at any | and may be updated, replaced, or obsoleted by other documents at any | |||
time. It is inappropriate to use Internet-Drafts as reference | time. It is inappropriate to use Internet-Drafts as reference | |||
material or to cite them other than as "work in progress." | material or to cite them other than as "work in progress." | |||
The list of current Internet-Drafts can be accessed at | The list of current Internet-Drafts can be accessed at | |||
http://www.ietf.org/ietf/1id-abstracts.txt. | http://www.ietf.org/ietf/1id-abstracts.txt. | |||
The list of Internet-Draft Shadow Directories can be accessed at | The list of Internet-Draft Shadow Directories can be accessed at | |||
http://www.ietf.org/shadow.html. | http://www.ietf.org/shadow.html. | |||
This Internet-Draft will expire on January 1, 2008. | This Internet-Draft will expire on January 15, 2009. | |||
Copyright Notice | ||||
Copyright (C) The IETF Trust (2007). | ||||
Abstract | Abstract | |||
This document redefines how the explicit congestion notification | This document redefines how the explicit congestion notification | |||
(ECN) field of the outer IP header of a tunnel should be constructed. | (ECN) field of the outer IP header of a tunnel should be constructed. | |||
It brings all IP in IP tunnels (v4 or v6) into line with the way | It brings all IP in IP tunnels (v4 or v6) into line with the way | |||
IPsec tunnels now construct the ECN field, ensuring that the outer | IPsec tunnels now construct the ECN field. It includes a thorough | |||
header reveals any congestion experienced so far on the path. It | analysis of the reasoning for this change and the implications. It | |||
specifies the default ECN tunneling behaviour for any Diffserv per- | also gives guidelines on the encapsulation of IP congestion | |||
hop behaviour (PHB), but also gives general principles to guide the | notification by any outer header, whether encapsulated in an IP | |||
design of alternate congestion marking behaviours for specific PHBs | tunnel or in a lower layer header. Following these guidelines should | |||
and for lower layer congestion notification schemes. | help interworking, if the IETF or other standards bodies specify any | |||
new encapsulation of congestion notification. | ||||
Table of Contents | Table of Contents | |||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 | |||
2. Requirements notation . . . . . . . . . . . . . . . . . . . . 5 | 1.1. The Need for Rationalisation . . . . . . . . . . . . . . . 4 | |||
3. Design Constraints . . . . . . . . . . . . . . . . . . . . . . 6 | 1.2. Document Roadmap . . . . . . . . . . . . . . . . . . . . . 5 | |||
3.1. Security Constraints . . . . . . . . . . . . . . . . . . . 6 | 1.3. Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 6 | |||
3.2. Control Constraints . . . . . . . . . . . . . . . . . . . 7 | 2. Requirements Language . . . . . . . . . . . . . . . . . . . . 8 | |||
3.3. Management Constraints . . . . . . . . . . . . . . . . . . 8 | 3. Design Constraints . . . . . . . . . . . . . . . . . . . . . . 8 | |||
4. Design Principles . . . . . . . . . . . . . . . . . . . . . . 9 | 3.1. Security Constraints . . . . . . . . . . . . . . . . . . . 8 | |||
5. Default ECN Tunnelling Rules . . . . . . . . . . . . . . . . . 11 | 3.2. Control Constraints . . . . . . . . . . . . . . . . . . . 10 | |||
6. Backward Compatibility . . . . . . . . . . . . . . . . . . . . 12 | 3.3. Management Constraints . . . . . . . . . . . . . . . . . . 11 | |||
7. Changes from Earlier RFCs . . . . . . . . . . . . . . . . . . 13 | 4. Design Principles . . . . . . . . . . . . . . . . . . . . . . 12 | |||
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 14 | 4.1. Design Guidelines for New Encapsulations of Congestion | |||
9. Security Considerations . . . . . . . . . . . . . . . . . . . 14 | Notification . . . . . . . . . . . . . . . . . . . . . . . 13 | |||
10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 14 | 5. Default ECN Tunnelling Rules . . . . . . . . . . . . . . . . . 15 | |||
11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 15 | 6. Backward Compatibility . . . . . . . . . . . . . . . . . . . . 16 | |||
12. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 15 | 7. Changes from Earlier RFCs . . . . . . . . . . . . . . . . . . 18 | |||
13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 15 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 19 | |||
13.1. Normative References . . . . . . . . . . . . . . . . . . . 15 | 9. Security Considerations . . . . . . . . . . . . . . . . . . . 19 | |||
13.2. Informative References . . . . . . . . . . . . . . . . . . 16 | 10. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 21 | |||
Appendix A. In-path Load Regulation . . . . . . . . . . . . . . . 17 | 11. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 20 | 12. Comments Solicited . . . . . . . . . . . . . . . . . . . . . . 22 | |||
Intellectual Property and Copyright Statements . . . . . . . . . . 21 | 13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |||
13.1. Normative References . . . . . . . . . . . . . . . . . . . 22 | ||||
13.2. Informative References . . . . . . . . . . . . . . . . . . 23 | ||||
Appendix A. Why resetting CE on encapsulation harms PCN . . . . . 25 | ||||
Appendix B. Contribution to Congestion across a Tunnel . . . . . 25 | ||||
Appendix C. Ideal Decapsulation Rules . . . . . . . . . . . . . . 27 | ||||
Appendix D. Non-Dependence of Tunnelling on In-path Load | ||||
Regulation . . . . . . . . . . . . . . . . . . . . . 28 | ||||
D.1. Dependence of In-Path Load Regulation on Tunnelling . . . 29 | ||||
Author's Address . . . . . . . . . . . . . . . . . . . . . . . . . 32 | ||||
Intellectual Property and Copyright Statements . . . . . . . . . . 34 | ||||
Changes from previous drafts (to be removed by the RFC Editor) | ||||
From -00 to -01: | ||||
* Related everything conceptually to the uniform and pipe models | ||||
of RFC2983 on Diffserv Tunnels, and completely removed the | ||||
dependence of tunnelling behaviour on the presence of any in- | ||||
path load regulation by using the [1 - Before] [2 - Outer] | ||||
function placement concepts from RFC2983. | ||||
* Added specific cases where the existing standards limit new | ||||
proposals. | ||||
* Added sub-structure to Introduction (Need for Rationalisation, | ||||
Roadmap), added new Introductory subsection on "Scope" and | ||||
improved clarity | ||||
* Added Design Guidelines for New Encapsulations of Congestion | ||||
Notification | ||||
* Considerably clarified the Backward Compatibility section | ||||
* Considerably extended the Security Considerations section | ||||
* Summarised the primary rationale much better in the conclusions | ||||
* Added numerous extra acknowledgements | ||||
* Added Appendix A. "Why resetting CE on encapsulation harms | ||||
PCN", Appendix B. "Contribution to Congestion across a Tunnel" | ||||
and Appendix C. "Ideal Decapsulation Rules" | ||||
* Changed Appendix A "In-path Load Regulation" to "Non-Dependence | ||||
of Tunnelling on In-path Load Regulation" and added sub-section | ||||
on "Dependence of In-Path Load Regulation on Tunnelling" | ||||
1. Introduction | 1. Introduction | |||
This document redefines how the explicit congestion notification | This document redefines how the explicit congestion notification | |||
(ECN) field [RFC3168] of the outer IP header of a tunnel should be | (ECN) field [RFC3168] of the outer IP header of a tunnel should be | |||
constructed. It brings all IP in IP tunnels (v4 or v6) into line | constructed. It brings all IP in IP tunnels (v4 or v6) into line | |||
with the way IPsec tunnels [RFC4301] now construct the ECN field, | with the way IPsec tunnels [RFC4301] now construct the ECN field, | |||
ensuring that the outer header reveals any congestion experienced so | ensuring that the outer header reveals any congestion experienced so | |||
far on the path. Although this memo focuses on IP in IP tunnelling | far on the whole path, not just since the last tunnel ingress. | |||
it also gives generalised advice for any encapsulation by lower layer | ||||
headers. | ||||
ECN allows a congested resource to notify the onset of congestion | ECN allows a congested resource to notify the onset of congestion | |||
without having to drop packets, by explicitly marking a proportion of | without having to drop packets, by explicitly marking a proportion of | |||
packets with the congestion experienced (CE) codepoint. Congestion | packets with the congestion experienced (CE) codepoint. Because | |||
notification is unusual in that it propagates from the physical layer | congestion is exhaustion of a physical resource, if the transport | |||
upwards to the transport layer, because congestion is exhaustion of a | layer is to deal with congestion, congestion notification must | |||
physical resource. The transport layer can directly detect loss of a | propagate upwards, from the physical layer to the transport layer. | |||
packet (or frame) by a lower layer. But if a lower layer marks a | The transport layer can directly detect loss of a packet (or frame) | |||
packet (or frame) to notify incipient congestion, this marking has to | by a lower layer. But if a lower layer marks rather than drops a | |||
be explicitly copied up the layers at every header decapsulation. | forward-travelling data packet (or frame) in order to notify | |||
So, at each decapsulation of an outer (lower layer) header a | incipient congestion, this marking has to be explicitly copied up the | |||
congestion marking has to be arranged to propagate into the forwarded | layers at every header decapsulation. So, at each decapsulation of | |||
(upper layer) header. It must continue upwards until it reaches the | an outer (lower layer) header a congestion marking has to be arranged | |||
destination transport, which should feed congestion notification back | to propagate into the forwarded (upper layer) header. It must | |||
to the source transport. | continue upwards until it reaches the destination transport. Then | |||
typically the destination feeds this congestion notification back to | ||||
the source transport. Given encapsulation by lower layer headers is | ||||
functionally similar to tunnelling, it is necessary to arrange | ||||
similar propagation of congestion notification up the layers. For | ||||
instance, ECN and its propagation up the layers has recently been | ||||
specified for MPLS [RFC5129]. | ||||
Note that often lower layer resources are arranged to be protected by | As packets pass up the layers, current specifications of | |||
higher layer buffers, so instead of blocking occurring at the lower | decapsulation behaviours are largely consistent and correct. | |||
layer, it occurs when the higher layer queue overflows. Thus, non- | However, as packets pass down the layers, specifications of | |||
blocking link and physical layer technologies do not have to | encapsulation behaviours are not consistent. This document is | |||
implement congestion notification, which can be introduced solely in | primarily aimed at rationalising encapsulation. (Nevertheless, | |||
IP layer active queue management (AQM). However, if we want to use | Appendix C explains why the consistency of decapsulation solutions | |||
congestion notification, we have to arrange for it to be explicitly | will not last for long and proposes a fix to decapsulation rules as | |||
copied up the layers when IP is tunnelled in IP (and if a particular | well. The IETF can then discuss whether to rationalise decapsulation | |||
link layer technology isn't protected from blocking by network layer | at the same time as encapsulation.) | |||
queues). | ||||
1.1. The Need for Rationalisation | ||||
IPsec tunnel mode is a specific form of tunnelling that can hide the | IPsec tunnel mode is a specific form of tunnelling that can hide the | |||
inner headers. Because the ECN field has to be mutable, it cannot be | inner headers. Because the ECN field has to be mutable, it cannot be | |||
covered by IPsec encryption or authentication calculations. | covered by IPsec encryption or authentication calculations. | |||
Therefore concern has been raised in the past that the ECN field | Therefore concern has been raised in the past that the ECN field | |||
could be used as a low bandwidth covert channel to communicate with | could be used as a low bandwidth covert channel to communicate with | |||
someone on the unprotected public Internet even if an end-host is | someone on the unprotected public Internet even if an end-host is | |||
restricted to only communicate with the public Internet through an | restricted to only communicate with the public Internet through an | |||
IPsec gateway. However, the recently updated version of IPsec | IPsec gateway. However, the updated version of IPsec [RFC4301] chose | |||
[RFC4301] chose not to block this covert channel, deciding that the | not to block this covert channel, deciding that the threat could be | |||
threat could be managed given the channel bandwidth is so limited | managed given the channel bandwidth is so limited (ECN is a 2-bit | |||
(ECN is a 2-bit field). | field). | |||
An unfortunate sequence of standards actions leading up to this | An unfortunate sequence of standards actions leading up to this | |||
latest change in IPsec has left us with nearly the worst of all | latest change in IPsec has left us with nearly the worst of all | |||
possible combinations of outcomes, despite the best endeavours of | possible combinations of outcomes, despite the best endeavours of | |||
everyone concerned. Even though information about congestion | everyone concerned. The controversy has been over whether to reveal | |||
experienced on the upstream path has various uses if it is revealed | information about congestion experienced on the path upstream of the | |||
tunnel ingress. Even though this has various uses if it is revealed | ||||
in the outer header of a tunnel, when ECN was standardised [RFC3168] | in the outer header of a tunnel, when ECN was standardised [RFC3168] | |||
it was decided that all IP in IP tunnels should hide upstream | it was decided that all IP in IP tunnels should hide this upstream | |||
congestion information simply to avoid the extra complexity of two | congestion simply to avoid the extra complexity of two different | |||
different mechanisms for IPsec and non-IPsec tunnels. However, now | mechanisms for IPsec and non-IPsec tunnels. However, now that | |||
that [RFC4301] IPsec tunnels deliberately no longer hide this | [RFC4301] IPsec tunnels deliberately no longer hide this information, | |||
information, we are left in the perverse position where non-IPsec | we are left in the perverse position where non-IPsec tunnels still | |||
tunnels still hide congestion information unnecessarily. This | hide congestion information unnecessarily. This document is designed | |||
document is designed to correct that anomaly. | to correct that anomaly. | |||
Specifically, RFC3168 says that, if a tunnel supports ECN (termed a | Specifically, RFC3168 says that, if a tunnel fully supports ECN | |||
'full-functionality' ECN tunnel), the tunnel ingress must not copy a | (termed a 'full-functionality' ECN tunnel in [RFC3168]), the tunnel | |||
CE marking from the inner header into the outer header that it | ingress must not copy a CE marking from the inner header into the | |||
creates. Instead the tunnel ingress has to set the ECN field of the | outer header that it creates. Instead the tunnel ingress has to set | |||
outer header to ECT(0) (i.e. codepoint 10). We term this 'resetting' | the ECN field of the outer header to ECT(0) (i.e. codepoint 10). We | |||
a CE codepoint. However, RFC4301 reverses this, stating that the | term this 'resetting' a CE codepoint. However, RFC4301 reverses | |||
tunnel ingress must simply copy the ECN field from the inner to the | this, stating that the tunnel ingress must simply copy the ECN field | |||
outer header. The main purpose of this document is to carry over | from the inner to the outer header. The main purpose of this | |||
this new relaxed attitude to covert channels from IPsec to all IP in | document is to carry the new behaviour of IPsec over to all IP in IP | |||
IP tunnels, so all tunnel ingress nodes consistently copy the ECN | tunnels, so all tunnel ingress nodes consistently copy the ECN field. | |||
field. | ||||
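The contrast between the two ingress behaviours described above can be sketched in a few lines. This is only an illustrative sketch, not normative text: the function names are hypothetical, the codepoint values other than ECT(0) are taken from RFC3168 rather than from this document, and the assumption that the RFC3168 full-functionality ingress copies every codepoint except CE is our reading of RFC3168, which this section does not spell out.

```python
# ECN codepoints as defined in RFC3168 (the two low-order bits of the
# IPv4 TOS octet / IPv6 Traffic Class octet).
NOT_ECT = 0b00
ECT_1 = 0b01
ECT_0 = 0b10   # "codepoint 10" in the text above
CE = 0b11      # congestion experienced


def ecn_of(tos_octet: int) -> int:
    """Extract the 2-bit ECN field from a TOS / Traffic Class octet."""
    return tos_octet & 0b11


def outer_ecn_rfc3168_full(inner_ecn: int) -> int:
    """Hypothetical helper: RFC3168 'full-functionality' ingress.

    A CE marking on the inner header is 'reset' to ECT(0) in the outer
    header. Other codepoints are assumed to be copied unchanged.
    """
    return ECT_0 if inner_ecn == CE else inner_ecn


def outer_ecn_rfc4301(inner_ecn: int) -> int:
    """Hypothetical helper: RFC4301 ingress simply copies the ECN field."""
    return inner_ecn


if __name__ == "__main__":
    names = {NOT_ECT: "Not-ECT", ECT_1: "ECT(1)", ECT_0: "ECT(0)", CE: "CE"}
    for inner in (NOT_ECT, ECT_1, ECT_0, CE):
        print(names[inner],
              "->", names[outer_ecn_rfc3168_full(inner)], "(reset)",
              "/", names[outer_ecn_rfc4301(inner)], "(copy)")
```

Under these assumptions the two behaviours differ only for an inner CE marking: resetting hides upstream congestion from nodes within the tunnel, whereas copying reveals it in the outer header.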
The rest of the document deals with the knock-on effects of this | Why does it matter if we have different ECN encapsulation behaviours | |||
apparently minor change. It is organised as follows: | for IPsec and non-IPsec tunnels? The general argument is that | |||
gratuitous inconsistency constrains the available design space and | ||||
makes it harder to design networks and new protocols that work | ||||
predictably. | ||||
Already complicated constraints have had to be added to a standards | ||||
track congestion marking proposal. The section of the pre-congestion | ||||
notification (PCN) architecture [I-D.ietf-pcn-architecture] on | ||||
tunnelling says PCN works correctly in the presence of RFC4301 IPsec | ||||
encapsulation (and RFC5129 MPLS encapsulation). However it doesn't | ||||
work with RFC3168 IP in IP encapsulation (Appendix A explains why). | ||||
Section 3 assesses further security, control and management functions | ||||
that cannot be achieved in each case (resetting vs copying CE | ||||
markings). It finds that resetting CE makes life difficult in a | ||||
number of directions, while copying CE harms nothing (other than | ||||
opening a low bit-rate covert channel vulnerability which the | ||||
Security Area deems is manageable). | ||||
1.2. Document Roadmap | ||||
Most of the document gives a thorough analysis of the knock-on | ||||
effects of the apparently minor change to tunnel encapsulation. The | ||||
reader may jump to Section 5 if only interested in standards actions | ||||
impacting implementation. The whole document is organised as | ||||
follows: | ||||
o S.5 of RFC3168 permits the Diffserv codepoint (DSCP) [RFC2474] to | o S.5 of RFC3168 permits the Diffserv codepoint (DSCP) [RFC2474] to | |||
'switch in' different behaviours for marking the ECN field, just | 'switch in' different behaviours for marking the ECN field, just | |||
as it switches in different per-hop behaviours (PHBs) for | as it switches in different per-hop behaviours (PHBs) for | |||
scheduling. Therefore we cannot only discuss the ECN protocol | scheduling. Therefore we cannot only discuss the ECN protocol | |||
that RFC3168 gives as a default. We need to also give guidance | that RFC3168 gives as a default. We need to also give guidance | |||
for possible different marking schemes. Therefore in Section 3 we | for possible different marking schemes. Therefore in Section 3 we | |||
lay out the design constraints when tunneling congestion | lay out the design constraints when tunnelling congestion | |||
notification. | notification. | |||
o Then in Section 4 we resolve the tensions between these | o Then in Section 4 we resolve the tensions between these | |||
constraints to give general design principles on how a tunnel | constraints to give general design principles and guidelines on | |||
should process congestion notification; principles that could | how a tunnel should process congestion notification; principles | |||
apply to any marking behaviour for any PHB, not just the default | that could apply to any marking behaviour for any PHB, not just | |||
in RFC3168. In particular, we examine the underlying principles | the default in RFC3168. In particular, we examine the underlying | |||
behind whether CE should be reset or copied into the outer header | principles behind whether CE should be reset or copied into the | |||
at the ingress to a tunnel--or indeed at the ingress of any | outer header at the ingress to a tunnel--or indeed at the ingress | |||
layered encapsulation of headers with congestion notification | of any layered encapsulation of headers with congestion | |||
fields. | notification fields. We end this section with a bulleted list of | |||
more design guidelines for new encapsulations of congestion | ||||
notification. | ||||
o Section 5 then confirms the precise rules for the default ECN | o Section 5 then uses precise standards terminology to confirm the | |||
tunnelling behaviour based on the above design principles. These | rules for the default ECN tunnelling behaviour based on the above | |||
rules apply to all PHBs, unless stated otherwise in the | design principles. | |||
specification of a PHB. There is no requirement for a PHB to | ||||
state anything about ECN behaviour if the default behaviour is | ||||
sufficient. | ||||
o Extending the new IPsec tunnel ingress behaviour to all IP in IP | o Extending the new IPsec tunnel ingress behaviour to all IP in IP | |||
tunnels causes one further knock-on effect that is dealt with in | tunnels requires consideration of backwards compatibility, which | |||
Section 6 on Backward Compatibility. If one end of an IPsec | is covered in Section 6 and changes from earlier RFCs are brought | |||
tunnel is compliant with [RFC4301], assuming IKEv2 key management | together in Section 7. | |||
is used, the other end can be guaranteed to also be [RFC4301] | ||||
compliant. So there is no backward compatibility problem with | ||||
IKEv2 RFC4301 IPsec tunnels. But once we extend our scope to any | ||||
IP in IP tunnel, we have to cater for the possibility that a | ||||
tunnel ingress compliant with this specification is sending to an | ||||
egress that doesn't even understand ECN (e.g. a legacy [RFC2003] | ||||
tunnel egress). If a tunnel ingress copied incoming ECN-capable | ||||
headers into outer headers, then a legacy tunnel egress would | ||||
discard any congestion markings added to the outer header within | ||||
the tunnel. ECN-capable traffic sources would not see any | ||||
congestion feedback and instead continually ratchet up their share | ||||
of the bandwidth without realising that cross-flows from other ECN | ||||
sources were continually having to ratchet down. | ||||
The scope of this document is all IP in IP tunnelling, irrespective | o Finally, a number of security considerations are discussed and | |||
of whether IPv4 or IPv6 is used for either of the inner and outer | conclusions are drawn. | |||
headers. The document only concerns wire protocol processing at | ||||
tunnel endpoints and makes no changes or recommendations concerning | ||||
algorithms for congestion marking or congestion response. The | ||||
general design principles of Section 4 may also be useful when any | ||||
datagram/packet/frame with a congestion notification capability is | ||||
encapsulated by a connectionless outer header [BBnet] that might also | ||||
support a congestion notification capability in the future as | ||||
discussed in S.9.3 of [RFC3168] (e.g. IP encapsulated in L2TP | ||||
[RFC2661], GRE [RFC1701] or PPTP [RFC2637]). However, of course, the | ||||
IETF does not have standards authority over every link or tunnel | ||||
protocol, so this document focuses only on IP in IP. | ||||
[I-D.ietf-tsvwg-ecn-mpls] applies these principles to IP in MPLS and | ||||
to MPLS in MPLS. | ||||
2. Requirements notation | 1.3. Scope | |||
This document only concerns wire protocol processing at tunnel | ||||
endpoints and makes no changes or recommendations concerning | ||||
algorithms for congestion marking or congestion response. | ||||
This document specifies a common, default congestion encapsulation | ||||
for any IP in IP tunnelling, based on that now specified for IPsec. | ||||
It applies irrespective of whether IPv4 or IPv6 is used for either of | ||||
the inner and outer headers. It applies to all PHBs, unless stated | ||||
otherwise in the specification of a PHB. It is intended to be a good | ||||
trade off between somewhat conflicting security, control and | ||||
management requirements. | ||||
Nonetheless, if necessary, an alternate congestion encapsulation | ||||
behaviour can be introduced as part of the definition of an alternate | ||||
congestion marking scheme used by a specific Diffserv PHB (see S.5 of | ||||
[RFC3168] and [RFC4774]). When designing such new encapsulation | ||||
schemes, the principles in Section 4 should be followed as closely as | ||||
possible. There is no requirement for a PHB to state anything about | ||||
ECN tunnelling behaviour if the default behaviour is sufficient. | ||||
Often lower layer resources (e.g. a point-to-point Ethernet link) are | ||||
arranged to be protected by higher layer buffers, so instead of | ||||
congestion occurring at the lower layer, it merely causes the queue | ||||
from the higher layer to overflow. Such non-blocking link and | ||||
physical layer technologies do not have to implement congestion | ||||
notification, which can be introduced solely in the active queue | ||||
management (AQM) from the IP layer. However, not all link layer | ||||
technologies are always protected from congestion by buffers at | ||||
higher layers (e.g. a subnetwork of Ethernet links and switches can | ||||
congest internally). In these cases, when adding congestion | ||||
notification to lower layers, we have to arrange for it to be | ||||
explicitly copied up the layers, just as when IP is tunnelled in IP. | ||||
As well as guiding alternate IP in IP tunnelling schemes, the design | ||||
guidelines of Section 4 are intended to be followed when IP packets | ||||
are encapsulated by any connectionless datagram/packet/frame where | ||||
the outer header is designed to support a congestion notification | ||||
capability. [RFC5129] already deals with handling ECN for IP in MPLS | ||||
and MPLS in MPLS, and S.9.3 of [RFC3168] lists IP encapsulated in | ||||
L2TP [RFC2661], GRE [RFC1701] or PPTP [RFC2637] as possible examples | ||||
where ECN may be added in future. | ||||
Of course, the IETF does not have standards authority over every link | ||||
or tunnel protocol, so this document merely aims to define the | ||||
interface between IP ECN and lower layer congestion notification. | ||||
Then the IETF or the relevant standards body can be free to define | ||||
the specifics of each lower layer scheme, but a common interface | ||||
should ensure interworking across all technologies. | ||||
Note that forward congestion notification in a lower layer protocol | ||||
does not need to be propagated up the layers if the lower layer has | ||||
its own feedback and load regulation. For | ||||
instance, FECN (forward ECN) has been present in Frame Relay and EFCI | ||||
(explicit forward congestion indication) in ATM [ITU-T.I.371] for a | ||||
long time, but they have been used for internal management rather | ||||
than being propagated to endpoint transports for them to control end- | ||||
to-end congestion. | ||||
[RFC2983] is a comprehensive primer on differentiated services and | ||||
tunnels. Given ECN raises similar issues to differentiated services | ||||
when interacting with tunnels, useful concepts introduced in RFC2983 | ||||
are used throughout, with brief recaps of the explanations where | ||||
necessary. | ||||
2. Requirements Language | ||||
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", | |||
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |||
document are to be interpreted as described in [RFC2119]. | document are to be interpreted as described in RFC 2119 [RFC2119]. | |||
3. Design Constraints | 3. Design Constraints | |||
Tunnel processing of a congestion notification field has to meet | Tunnel processing of a congestion notification field has to meet | |||
congestion control needs without creating new information security | congestion control needs without creating new information security | |||
vulnerabilities (if information security is required). | vulnerabilities (if information security is required). | |||
3.1. Security Constraints | 3.1. Security Constraints | |||
Information security can be assured by using various end to end | Information security can be assured by using various end to end | |||
skipping to change at page 6, line 48 | skipping to change at page 9, line 10 | |||
IPsec encryption is typically used to prevent 'M' seeing messages | IPsec encryption is typically used to prevent 'M' seeing messages | |||
from 'A' to 'B'. IPsec authentication is used to prevent 'M' | from 'A' to 'B'. IPsec authentication is used to prevent 'M' | |||
masquerading as the sender of messages from 'A' to 'B' or altering | masquerading as the sender of messages from 'A' to 'B' or altering | |||
their contents. But 'I' can also use IPsec tunnel mode to allow 'A' | their contents. But 'I' can also use IPsec tunnel mode to allow 'A' | |||
to communicate with 'B', but impose encryption to prevent 'A' leaking | to communicate with 'B', but impose encryption to prevent 'A' leaking | |||
information to 'M'. Or 'E' can insist that 'I' uses tunnel mode | information to 'M'. Or 'E' can insist that 'I' uses tunnel mode | |||
authentication to prevent 'M' communicating information to 'B'. | authentication to prevent 'M' communicating information to 'B'. | |||
Mutable IP header fields such as the ECN field (as well as the TTL/ | Mutable IP header fields such as the ECN field (as well as the TTL/ | |||
Hop Limit and DS fields) cannot be included in the cryptographic | Hop Limit and DS fields) cannot be included in the cryptographic | |||
calculations of IPsec. Therefore, if 'I' encrypts but copies these | calculations of IPsec. Therefore, if 'I' copies these mutable fields | |||
mutable fields into the outer header that is exposed across the | into the outer header that is exposed across the tunnel it will have | |||
tunnel it will have allowed a covert channel from 'A' to 'M'. And if | allowed a covert channel from 'A' to 'M' that bypasses its encryption | |||
'E' copies these fields from the outer header to the inner, even if | of the inner header. And if 'E' copies these fields from the outer | |||
it validates authentication from 'I', it will have allowed a covert | header to the inner, even if it validates authentication from 'I', it | |||
channel from 'M' to 'B'. | will have allowed a covert channel from 'M' to 'B'. | |||
ECN at the IP layer is designed to carry information about congestion | ECN at the IP layer is designed to carry information about congestion | |||
from a congested resource to some downstream node that will feed the | from a congested resource towards downstream nodes. Typically a | |||
information back somehow to the point upstream of the congestion that | downstream transport might feed the information back somehow to the | |||
can regulate the load on the congested resource. In terms of the | point upstream of the congestion that can regulate the load on the | |||
above scenario, ECN is effectively intended to create an information | congested resource, but other actions are possible (see [RFC3168] | |||
channel from 'M' to 'B', for 'B' to forward to 'A'. Therefore the | S.6). In terms of the above unicast scenario, ECN is typically | |||
goals of IPsec and ECN are mutually incompatible. | intended to create an information channel from 'M' to 'B', for 'B' to | |||
forward to 'A'. Therefore the goals of IPsec and ECN are mutually | ||||
incompatible. | ||||
With respect to the DS or ECN fields, S.5.1.2 of RFC4301 says, | With respect to the DS or ECN fields, S.5.1.2 of RFC4301 says, | |||
"controls are provided to manage the bandwidth of this [covert] | "controls are provided to manage the bandwidth of this [covert] | |||
channel". Using the ECN processing rules of RFC4301, the channel | channel". Using the ECN processing rules of RFC4301, the channel | |||
bandwidth is two bits per datagram from 'A' to 'M' and one bit per | bandwidth is two bits per datagram from 'A' to 'M' and one bit per | |||
datagram from 'M' to 'A' because 'E' limits the combinations it will | datagram from 'M' to 'A' (because 'E' limits the combinations of the | |||
copy. In both cases the covert channel bandwidth is further reduced | 2-bit ECN field that it will copy). In both cases the covert channel | |||
by noise from any real congestion marking. RFC4301 therefore implies | bandwidth is further reduced by noise from any real congestion | |||
that these covert channels are sufficiently limited to be considered | marking. RFC4301 therefore implies that these covert channels are | |||
a manageable threat. However, with respect to the larger (6b) DS | sufficiently limited to be considered a manageable threat. However, | |||
field, the same section of RFC4301 says not copying is the default, | with respect to the larger (6b) DS field, the same section of RFC4301 | |||
but a configuration option can allow copying "to allow a local | says not copying is the default, but a configuration option can allow | |||
administrator to decide whether the covert channel provided by | copying "to allow a local administrator to decide whether the covert | |||
copying these bits outweighs the benefits of copying". Of course, an | channel provided by copying these bits outweighs the benefits of | |||
administrator considering copying of the DS field has to take into | copying". Of course, an administrator considering copying of the DS | |||
account that it could be concatenated with the ECN field giving an 8b | field has to take into account that it could be concatenated with the | |||
per datagram channel. | ECN field giving an 8b per datagram covert channel. | |||
Thus, for tunnelling the 6b Diffserv field two conceptual models have | ||||
had to be defined so that administrators can trade off security | ||||
against the needs of traffic conditioning [RFC2983]: | ||||
The uniform model: where the Diffserv field is preserved end-to-end | ||||
by copying into the outer header on encapsulation and copying from | ||||
the outer header on decapsulation. | ||||
The pipe model: where the outer header is independent of that in the | ||||
inner header so it hides the Diffserv field of the inner header | ||||
from any interaction with nodes along the tunnel. | ||||
However, for ECN, the new IPsec security architecture in RFC4301 only | ||||
standardised one tunnelling model equivalent to the uniform model. | ||||
It deemed that simplicity was more important than allowing | ||||
administrators the option of a tiny increment in security especially | ||||
given not copying congestion indications could seriously harm | ||||
everyone's network service. | ||||
3.2. Control Constraints | 3.2. Control Constraints | |||
Congestion control requires that any congestion notification marked | Congestion control requires that any congestion notification marked | |||
into packets by a resource will be able to traverse a feedback loop | into packets by a resource will be able to traverse a feedback loop | |||
back to a node capable of controlling the load on that resource. To | back to a node capable of controlling the load on that resource. To | |||
avoid ambiguity later rather than calling this node the data source | be precise, rather than calling this node the data source, we will | |||
we will call it the Load Regulator. This will allow us to deal with | call it the Load Regulator. This will allow us to deal with | |||
exceptional cases where load is not regulated by the data source, but | exceptional cases where load is not regulated by the data source, but | |||
usually the two will be synonymous. Note the term "a node _capable | usually the two terms will be synonymous. Note the term "a node | |||
of_ controlling the load" deliberately includes a source application | _capable of_ controlling the load" deliberately includes a source | |||
that doesn't actually control the load but ought to (e.g. an | application that doesn't actually control the load but ought to (e.g. | |||
application without congestion control that uses UDP). | an application without congestion control that uses UDP). | |||
A--->R--->I=========>M=========>E-------->B | A--->R--->I=========>M=========>E-------->B | |||
Figure 2: Simple Tunnel Scenario | Figure 2: Simple Tunnel Scenario | |||
We now consider a similar tunneling scenario to the IPsec one just | We now consider a similar tunnelling scenario to the IPsec one just | |||
described, but without the different security domains so we can just | described, but without the different security domains so we can just | |||
focus on ensuring the control loop and management monitoring can work | focus on ensuring the control loop and management monitoring can work | |||
(Figure 2). If we want resources in the tunnel to be able to | (Figure 2). If we want resources in the tunnel to be able to | |||
explicitly notify congestion and the feedback loop is from 'B' to | explicitly notify congestion and the feedback path is from 'B' to | |||
'A', it will certainly be necessary for 'E' to copy any CE marking | 'A', it will certainly be necessary for 'E' to copy any CE marking | |||
from the outer header to the inner header for onward transmission to | from the outer header to the inner header for onward transmission to | |||
'B', otherwise congestion notification from resources like 'M' cannot | 'B', otherwise congestion notification from resources like 'M' cannot | |||
be fed back to the Load Regulator ('A'). But it doesn't seem | be fed back to the Load Regulator ('A'). But it doesn't seem | |||
necessary for 'I' to copy CE markings from the inner to the outer | necessary for 'I' to copy CE markings from the inner to the outer | |||
header. For instance, if resource 'R' is congested, it can send | header. For instance, if resource 'R' is congested, it can send | |||
congestion information to 'B' using the congestion field in the inner | congestion information to 'B' using the congestion field in the inner | |||
header without 'I' copying the congestion field into the outer header | header without 'I' copying the congestion field into the outer header | |||
and 'E' copying it back to the inner header. 'E' can then write any | and 'E' copying it back to the inner header. 'E' can still write any | |||
additional congestion marking introduced across the tunnel into the | additional congestion marking introduced across the tunnel into the | |||
congestion field of the inner header. | congestion field of the inner header. | |||
Indeed, this arrangement can be extended to multi-level congestion | ||||
marking (such as that proposed for PCN [PCN-arch]) as long as all the | ||||
marks have unambiguously ranked values. For instance, if a | ||||
hypothetical multi-level marking scheme for PCN had PCN-capable | ||||
codepoints ranked 1, 2 and 3, then, if 'I' reset the outer congestion | ||||
field to the lowest ranked value that is PCN-capable (1), 'E' would | ||||
simply write the highest ranked of the inner and outer congestion | ||||
markings into the forwarded header. For instance, if the inner | ||||
marking on arrival at 'I' was 3 and 'I' reset the outer to 1, but 'M' | ||||
subsequently set it to 2, then the header forwarded by 'E' would be | ||||
max(3,2) = 3. | ||||
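As an illustration only (it is not part of either draft), the ranking rule in the paragraph above can be sketched in Python. The function names and the integer ranks 1 to 3 are hypothetical, borrowed from the hypothetical multi-level PCN scheme in the text. The assumption that an interior node only ever raises a marking is also ours.

```python
# Hypothetical lowest-ranked PCN-capable codepoint from the example in the text.
LOWEST_CAPABLE_RANK = 1


def reset_outer_at_ingress() -> int:
    """'I' resets the outer congestion field to the lowest capable rank."""
    return LOWEST_CAPABLE_RANK


def mark_in_tunnel(outer: int, rank: int) -> int:
    """A congested node such as 'M' raises the outer marking, never lowers it
    (an assumption of this sketch)."""
    return max(outer, rank)


def forward_at_egress(inner: int, outer: int) -> int:
    """'E' writes the highest ranked of the inner and outer markings."""
    return max(inner, outer)


# Worked example from the text: inner arrives at 'I' as 3, the outer is
# reset to 1, 'M' sets the outer to 2, and 'E' forwards max(3, 2).
inner = 3
outer = reset_outer_at_ingress()
outer = mark_in_tunnel(outer, 2)
forwarded = forward_at_egress(inner, outer)
print(forwarded)  # 3
```

Because 'E' takes the maximum, the marking applied upstream of the tunnel survives even though 'I' did not copy it into the outer header.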
It might be useful for the tunnel egress to be able to tell whether | It might be useful for the tunnel egress to be able to tell whether | |||
congestion occurred across a tunnel or upstream of it. If outer | congestion occurred across a tunnel or upstream of it. If outer | |||
header congestion marking was reset at the tunnel ingress ('I'), by | header congestion marking was reset by the tunnel ingress ('I'), at | |||
the end of a tunnel ('E') the outer headers would indicate congestion | the end of a tunnel ('E') the outer headers would indicate congestion | |||
experienced across the tunnel ('I' to 'E'), while the inner header | experienced across the tunnel ('I' to 'E'), while the inner header | |||
would indicate congestion upstream of 'I'. But the same information | would indicate congestion upstream of 'I'. But similar information | |||
could be gleaned even if the tunnel ingress copied the inner to the | can be gleaned even if the tunnel ingress copies the inner to the | |||
outer headers. By the end of the tunnel ('E'), any packet with an | outer headers. At the end of the tunnel ('E'), any packet with an | |||
_extra_ mark in the outer header relative to the inner header would | _extra_ mark in the outer header relative to the inner header | |||
indicate congestion across the tunnel ('I' to 'E'), while the inner | indicates congestion across the tunnel ('I' to 'E'), while the inner | |||
header would still indicate congestion upstream of 'I'. | header would still indicate congestion upstream of 'I'. Appendix B | |||
gives a more precise method for inferring the congestion level | ||||
introduced across a tunnel. | ||||
All this shows that 'E' can preserve the control loop irrespective of | All this shows that 'E' can preserve the control loop irrespective of | |||
whether 'I' copies congestion notification into the outer header or | whether 'I' copies congestion notification into the outer header or | |||
resets it. | resets it. | |||
That is the situation for existing control arrangements but, because | ||||
copying reveals more information, it would open up possibilities for | ||||
better control system designs. For instance, Appendix A describes | ||||
how resetting CE marking at a tunnel ingress confuses a proposed | ||||
congestion marking scheme on the standards track. It ends up | ||||
removing excessive amounts of traffic unnecessarily. Whereas copying | ||||
CE markings at ingress leads to the correct control behaviour. | ||||
3.3. Management Constraints | 3.3. Management Constraints | |||
As well as control, there are also management constraints. | As well as control, there are also management constraints. | |||
Specifically, a management system may monitor congestion markings in | Specifically, a management system may monitor congestion markings in | |||
passing packets, perhaps at the border between networks as part of a | passing packets, perhaps at the border between networks as part of a | |||
service level agreement. For instance, monitors at the borders of | service level agreement. For instance, monitors at the borders of | |||
autonomous systems may need to measure how much congestion has | autonomous systems may need to measure how much congestion has | |||
accumulated since the original source to determine between them how | accumulated since the original source to determine between them how | |||
much of the congestion is contributed by each domain. | much of the congestion is contributed by each domain. | |||
Therefore it should be clear how far back in the path the congestion | Therefore, when monitoring the middle of a path, it should be | |||
markings have accumulated from. In this document we term this the | possible to establish how far back in the path congestion markings | |||
baseline of the congestion marking, i.e. the source of the layer that | have accumulated from. In this document we term this the baseline of | |||
last reset rather than copied the congestion notification field when | congestion marking (or the Congestion Baseline), i.e. the source of | |||
creating an outer header. Given some tunnels cross domain borders | the layer that last reset (or created) the congestion notification | |||
(e.g. consider 'M' in Figure 2 is monitoring a border), it is therefore | field. Given some tunnels cross domain borders (e.g. consider 'M' in | |||
desirable for 'I' to copy congestion accumulated so far into the | Figure 2 is monitoring a border), it would therefore be desirable for | |||
outer headers exposed across the tunnel. | 'I' to copy congestion accumulated so far into the outer headers | |||
exposed across the tunnel. | ||||
Appendix A discusses various scenarios where the Load Regulator lies | Appendix D discusses various scenarios where the Load Regulator lies | |||
in-path, not at the source host as we would typically expect. It | in-path, not at the source host as we would typically expect. It | |||
concludes that the baseline for congestion notification should be | concludes that a Congestion Baseline is determined by where the Load | |||
determined by where the Load Regulator function is, whether it is at | Regulator function is, which should be identified in the transport | |||
the source host or within the path. Therefore every tunnel ingress | layer, not by addresses in network layer headers. This applies | |||
should copy the ECN field into the outer header it creates unless it | whether the Load Regulator is at the source host or within the path. | |||
is also a Load Regulator, in which case it should reset any CE | The appendix also discusses where a Load Regulator function should be | |||
markings, which is an exception to the normal copying rule for a | located relative to a local encapsulation function. | |||
tunnel ingress. | ||||
4. Design Principles | 4. Design Principles | |||
The constraints from the three perspectives of security, control and | The constraints from the three perspectives of security, control and | |||
management in Section 3 are somewhat in tension as to whether a | management in Section 3 are somewhat in tension as to whether a | |||
tunnel ingress should copy congestion markings into the outer header | tunnel ingress should copy congestion markings into the outer header | |||
it creates or reset them. From the control perspective either | it creates or reset them. From the control perspective either | |||
copying or resetting works. From the management perspective copying | copying or resetting works for existing arrangements, but copying has | |||
is preferable (with the exception of an in-path load regulator). | more potential for simplifying control. From the management | |||
From the security perspective resetting is preferable but copying is | perspective copying is preferable. From the security perspective | |||
now considered acceptable given the bandwidth of a 2-bit covert | resetting is preferable but copying is now considered acceptable | |||
channel can be managed. | given the bandwidth of a 2-bit covert channel can be managed. | |||
Therefore an outer encapsulating header capable of carrying | Therefore an outer encapsulating header capable of carrying | |||
congestion markings SHOULD reflect accumulated congestion since the | congestion markings SHOULD reflect accumulated congestion since the | |||
last interface designed to regulate load (the Load Regulator). This | last interface designed to regulate load (the Load Regulator). This | |||
implies congestion notification SHOULD be copied into the outer | implies congestion notification SHOULD be copied into the outer | |||
header of each new encapsulating header that supports it--except at | header of each new encapsulating header that supports it. | |||
an in-path Load Regulator. An in-path Load Regulator knows its | ||||
function is to regulate load, so if it also acts as the ingress to a | We have said that a tunnel ingress SHOULD (as opposed to MUST) copy | |||
tunnel, in every new outer header it creates it MUST reset any | incoming congestion notification into an outer encapsulating header | |||
congestion marking. | that supports it. In the case of 2-bit ECN, the IETF security area | |||
has deemed the benefit always outweighs the risk. Therefore for | ||||
2-bit ECN we can and we will say 'MUST' (Section 5). But in this | ||||
section where we are setting down general design principles, we leave | ||||
it as a 'SHOULD'. This allows for future multi-bit congestion | ||||
notification fields where the risk from the covert channel created by | ||||
copying congestion notification might outweigh the congestion control | ||||
benefit of copying. | ||||
The Load Regulator is the node to which congestion feedback should be | The Load Regulator is the node to which congestion feedback should be | |||
returned by the next downstream node with a transport layer function | returned by the next downstream node with a transport layer feedback | |||
(typically but not always the data receiver). The Load Regulator is | function (typically but not always the data receiver). The Load | |||
not always (or even typically) the same thing as the node identified | Regulator is not always (or even typically) the same thing as the | |||
by the source address of the outermost exposed header. In general | node identified by the source address of the outermost exposed | |||
the addressing of the outermost encapsulation header says nothing | header. In general the addressing of the outermost encapsulation | |||
about the identifiers of either the upstream or the downstream | header says nothing about the identifiers of either the upstream or | |||
transport layer functions. As long as the transport functions know | the downstream transport layer functions. As long as the transport | |||
each other's addresses, they don't have to be identified in the | functions know each other's addresses, they don't have to be | |||
network layer or in any link layer. It was only a convenience that a | identified in the network layer or in any link layer. It was only a | |||
TCP receiver assumed that the address of the source transport is the | convenience that a TCP receiver assumed that the address of the | |||
same as the network layer source address of a packet it receives. | source transport is the same as the network layer source address of | |||
an IP packet it receives. | ||||
More generally, the return transport address could be identified | More generally, the return transport address for feedback could be | |||
solely in the transport layer protocol. For instance, a signalling | identified solely in the transport layer protocol. For instance, a | |||
protocol like RSVP [RFC2205] breaks up a path into transport layer | signalling protocol like RSVP [RFC2205] breaks up a path into | |||
hops and informs each hop of the address of its transport layer | transport layer hops and informs each hop of the address of its | |||
neighbour without any need to identify these hops in the network | transport layer neighbour without any need to identify these hops in | |||
layer. RSVP can be arranged so that these transport layer hops are | the network layer. RSVP can be arranged so that these transport | |||
bigger than the underlying network layer hops. The host identity | layer hops are bigger than the underlying network layer hops. The | |||
protocol (HIP) architecture [RFC4423] also supports the same | host identity protocol (HIP) architecture [RFC4423] also supports the | |||
principled separation (for mobility amongst other things), where the | same principled separation (for mobility amongst other things), where | |||
transport layer receiver identifies the transport layer sender using | the transport layer sender identifies its transport address for | |||
an identifier provided by the transport layer, which gets mapped to a | feedback to be sent to, using an identifier provided by a shim below | |||
network layer address below the transport layer. | the transport layer. | |||
Note that this principle deliberately doesn't require a packet header | Keeping to this layering principle deliberately doesn't require a | |||
to reveal the origin address of the baseline that congestion | network layer packet header to reveal the origin address from where | |||
notification has accumulated from. It is not necessary for the | congestion notification accumulates (its Congestion Baseline). It is | |||
network and lower layers to know the address of the Load Regulator. | not necessary for the network and lower layers to know the address of | |||
Only the destination transport needs to know that. With congestion | the Load Regulator. Only the destination transport needs to know | |||
notification, the network and link layers only notify congestion | that. With forward congestion notification, the network and link | |||
forwards, they aren't involved in feeding it backwards. If they are, | layers only notify congestion forwards; they aren't involved in | |||
e.g. backward congestion notification (BCN) in Ethernet [802.1au], | feeding it backwards. If they are (e.g. backward congestion | |||
that should be considered as a transport function added to the lower | notification (BCN) in Ethernet [IEEE802.1au] or EFCI in ATM | |||
layer, which must sort out its own addressing. Indeed, this is one | [ITU-T.I.371]), that should be considered as a transport function | |||
reason why ICMP source quench is now deprecated [RFC1254]; when | added to the lower layer, which must sort out its own addressing. | |||
congestion occurs within a tunnel it is complex (particularly in the | Indeed, this is one reason why ICMP source quench is now deprecated | |||
case of IPsec tunnels) to return the ICMP messages beyond the tunnel | [RFC1254]; when congestion occurs within a tunnel it is complex | |||
ingress back to the Load Regulator . | (particularly in the case of IPsec tunnels) to return the ICMP | |||
messages beyond the tunnel ingress back to the Load Regulator. | ||||
Similarly, if a management system is monitoring congestion and needs | Similarly, if a management system is monitoring congestion and needs | |||
to know the baseline of congestion notification, the management | to know the Congestion Baseline, the management system has to find | |||
system has to find this out from the transport; in general it cannot | this out from the transport; in general it cannot tell solely by | |||
tell solely by looking at the network or link layer headers. | looking at the network or link layer headers. | |||
We have said that a tunnel ingress that is not a Load Regulator | 4.1. Design Guidelines for New Encapsulations of Congestion | |||
SHOULD (as opposed to MUST) copy incoming congestion notification | Notification | |||
into an outer encapsulating header that supports it. In the case of | ||||
2-bit ECN, the IETF security area have deemed the benefit always | The following guidelines are for specifications of new schemes for | |||
outweighs the risk. Therefore for 2-bit ECN we can and we will say | encapsulating congestion notification (e.g. for specialised Diffserv | |||
'MUST' (Section 5). But in this section where we are setting down | PHBs in IP, or for lower layer technologies): | |||
general design principles, we leave it as a 'SHOULD'. This allows | ||||
for future multi-bit congestion notification fields where the risk | 1. Congestion notification in outer headers SHOULD be relative to a | |||
from the covert channel created by copying congestion notification | Congestion Baseline at the node expected to regulate the load on | |||
might outweigh the congestion control benefit of copying. | the link in question (the Load Regulator). This implies incoming | |||
congestion notifications from the higher layer SHOULD be copied | ||||
into encapsulating headers. This guideline is particularly | ||||
important where outer headers might cross trust boundaries, but | ||||
less important otherwise. | ||||
2. Congestion notification MUST NOT simply be copied from outer | ||||
headers to the forwarded header on decapsulation. The forwarded | ||||
congestion notification field SHOULD be calculated from the inner | ||||
and outer headers, taking account of the following, in the order | ||||
given: | ||||
1. If the inner header does not support congestion notification, | ||||
or indicates that the transport does not support congestion | ||||
notification, any explicit congestion notifications in the | ||||
outer header will not be understood if propagated further, so | ||||
if the only way to indicate congestion to onward nodes is to | ||||
drop the packet, it MUST be dropped. | ||||
2. If the outer header does not support explicit congestion | ||||
notification, but the inner header does, the inner header | ||||
SHOULD be forwarded unchanged. | ||||
3. Congestion indications may be ranked by strength. For | ||||
instance no congestion would be the weakest indication, with | ||||
possibly increasing levels of congestion given increasingly | ||||
stronger indications. | ||||
4. Where the inner and outer headers carry indications of | ||||
congestion of different strengths, the stronger indication | ||||
SHOULD be forwarded in preference to the weaker. Obviously, | ||||
if the strengths in both inner and outer are the same, the | ||||
same strength should be forwarded. | ||||
5. If the outer header carries a weaker indication of congestion | ||||
than the inner, it MAY be appropriate to raise a warning, as | ||||
this would be an illegal combination if Guideline Paragraph 1 | ||||
had been followed. | ||||
3. Where framing boundaries are different between the two layers, | ||||
congestion indications SHOULD be propagated on the basis that a | ||||
congestion indication in a packet or frame applies to all the | ||||
octets in the frame/packet. On average, a tunnel endpoint SHOULD | ||||
approximately preserve the number of marked octets arriving and | ||||
leaving. An algorithm for spreading congestion indications over | ||||
multiple smaller `fragments' SHOULD propagate congestion | ||||
indications as soon as they arrive, and SHOULD NOT hold them back | ||||
for later frames. | ||||
4. Assumptions on incremental deployment MUST be stated. | ||||
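The ordered checks of Guideline 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not a normative algorithm. Congestion strength is modelled as an integer rank (0 meaning no congestion), `None` stands for a header or transport that does not support congestion notification, and dropping is assumed to be the only way to signal congestion to a transport that does not understand explicit notification. The names `Decision` and `decapsulate` are hypothetical.

```python
from typing import NamedTuple, Optional

NO_CONGESTION = 0  # weakest indication in this sketch's ranking


class Decision(NamedTuple):
    drop: bool            # True if the packet has to be dropped
    rank: Optional[int]   # congestion rank to forward (None if unsupported)
    warn: bool            # True if a warning MAY be raised (Guideline 2.5)


def decapsulate(inner: Optional[int], outer: Optional[int]) -> Decision:
    # 2.1: inner (or its transport) does not support congestion notification.
    # Assumption: drop is the only onward congestion signal available.
    if inner is None:
        congested = outer is not None and outer > NO_CONGESTION
        return Decision(drop=congested, rank=None, warn=False)
    # 2.2: outer does not support it, so forward the inner unchanged.
    if outer is None:
        return Decision(drop=False, rank=inner, warn=False)
    # 2.3 and 2.4: forward the stronger of the two indications.
    # 2.5: an outer weaker than the inner should not occur if the
    # ingress copied the inner marking, so flag it.
    return Decision(drop=False, rank=max(inner, outer), warn=outer < inner)
```

For example, `decapsulate(1, 3)` forwards rank 3, while `decapsulate(3, 1)` also forwards rank 3 but sets the warning flag.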
Regarding incremental deployment, the Per-Domain ECT Checking | ||||
of [RFC5129] is a good example to follow. In this example, header | ||||
space in the lower layer protocol (MPLS) was extremely limited, so no | ||||
ECN-capable transport codepoint was added to the MPLS header. | ||||
Interior nodes in a domain were allowed to set explicit congestion | ||||
indications without checking whether the frame was destined for a | ||||
transport that would understand them. This was made safe by | ||||
emphasising repeatedly that all the decapsulating edges of a whole | ||||
domain had to be upgraded at once, so there would always be a check | ||||
that the higher layer transport was ECN-capable on decapsulation. If | ||||
the decapsulator discovered that the higher layer showed the | ||||
transport would not understand ECN, it dropped the packet on behalf | ||||
of the earlier congestion node (see Guideline Paragraph 2.1). | ||||
Note that such a deployment strategy that assumes a savvy operator | ||||
was only appropriate because MPLS is targeted solely at professional | ||||
operators. This strategy would not be appropriate for other link | ||||
technologies (e.g. Ethernet) targeted at deployment by the general | ||||
public. | ||||
5. Default ECN Tunnelling Rules | 5. Default ECN Tunnelling Rules | |||
The following ECN tunnel processing rules are the default for a | The following ECN tunnel processing rules are the default for a | |||
packet with any DSCP. If required, different ECN processing rules | packet with any DSCP. If required, different ECN encapsulation rules | |||
MAY be defined for the appropriate Diffserv PHB using the guidelines | MAY be defined as part of the definition of an appropriate Diffserv | |||
in Section 4. | PHB using the guidelines in Section 4. However, the burden of | |||
handling exceptional PHBs in implementations of all affected tunnels | ||||
and lower layer link protocols should not be underestimated. | ||||
When a tunnel ingress creates an encapsulating IP header, the 2-bit | A tunnel ingress compliant with this specification MUST copy the | |||
ECN field of the inner IP header MUST be copied into the outer IP | 2-bit ECN field of the arriving IP header into the outer | |||
header, for all types of IP in IP tunnel (except if the tunnel | encapsulating IP header, for all types of IP in IP tunnel. This | |||
ingress is in compatibility mode--see Section 6). If the tunnel | encapsulation behaviour MUST only be used if the tunnel ingress is in | |||
ingress is also a Load Regulator, it MUST instead reset the outer | `normal state'. A `compatibility state' with a different | |||
header to ECT(0). | encapsulation behaviour is also specified in Section 6 for backward | |||
compatibility with legacy tunnel egresses that do not understand ECN. | ||||
To decapsulate the inner header at the tunnel egress, the outgoing | To decapsulate the inner header at the tunnel egress, a compliant | |||
inner header MUST be calculated from the combination of the incoming | tunnel egress MUST set the outgoing ECN field to the codepoint at the | |||
inner and outer headers setting the outgoing ECN field to the | intersection of the appropriate incoming inner header (row) and outer | |||
codepoints displayed in the body of Table 1. | header (column) in Table 1. | |||
+--Incoming Outer Header--- | +--Incoming Outer Header--- | |||
+--------------------+---------+------------+-----------+-----------+ | +---------------------+---------+-----------+-----------+-----------+ | |||
| Incoming Inner | Not-ECT | ECT(0) | ECT(1) | CE | | | Incoming Inner | Not-ECT | ECT(0) | ECT(1) | CE | | |||
| Header | | | | | | | Header | | | | | | |||
+--------------------+---------+------------+-----------+-----------+ | +---------------------+---------+-----------+-----------+-----------+ | |||
| Not-ECT | Not-ECT | drop (!!!) | drop(!!!) | drop(!!!) | | | Not-ECT | Not-ECT | drop (!!!) | drop(!!!) | drop(!!!) | | |||
| ECT(0) | ECT(0) | ECT(0) | ECT(0) | CE | | | ECT(0) | ECT(0) | ECT(0) | ECT(0) | CE | | |||
| ECT(1) | ECT(1) | ECT(1) | ECT(1) | CE | | | ECT(1) | ECT(1) | ECT(1) | ECT(1) | CE | | |||
| CE | CE | CE (!!!) | CE (!!!) | CE | | | CE | CE | CE | CE (!!!) | CE | | |||
+--------------------+---------+------------+-----------+-----------+ | +---------------------+---------+-----------+-----------+-----------+ | |||
+-----Outgoing Header------ | +-----Outgoing Header------ | |||
Table 1: IP in IP Decapsulation | Table 1: IP in IP Decapsulation | |||
The exclamation marks '(!!!)' in Table 1 indicate that this | The exclamation marks '(!!!)' in Table 1 indicate that this | |||
combination of inner and outer headers should not be possible if only | combination of inner and outer headers should not be possible if only | |||
legal transitions have taken place. So, the decapsulator should drop | legal transitions have taken place. So, the decapsulator should drop | |||
or mark the ECN field as the table specifies, but it MAY also raise | or mark the ECN field as the table specifies, but it MAY also raise | |||
an appropriate alarm. It MUST NOT raise an alarm so often that the | an appropriate alarm. It MUST NOT raise an alarm so often that the | |||
illegal combinations would amplify into a flood of alarm messages. | illegal combinations would amplify into a flood of alarm messages. | |||
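The lookup in Table 1 can be sketched in a few lines of code. The following is only an illustration, not part of either draft: the names `ECN` and `decapsulate` are hypothetical, and the `illegal` flag follows the '(!!!)' markings in the -01 column (the -00 column additionally flags an inner CE with an outer ECT(0)). The outgoing codepoints are the same in both versions.

```python
from enum import Enum
from typing import Optional, Tuple


class ECN(Enum):
    """ECN codepoints with their two-bit values from RFC 3168."""
    NOT_ECT = 0b00
    ECT_1 = 0b01
    ECT_0 = 0b10
    CE = 0b11


def decapsulate(inner: ECN, outer: ECN) -> Tuple[Optional[ECN], bool]:
    """Return (outgoing codepoint, illegal flag) for one packet.

    An outgoing value of None means the packet is dropped. The flag is
    True for combinations marked '(!!!)' in Table 1 (as in the -01 text).
    """
    if inner is ECN.NOT_ECT:
        if outer is ECN.NOT_ECT:
            return ECN.NOT_ECT, False
        # Not-ECT inner with any ECN-capable or CE outer: drop (!!!)
        return None, True
    if outer is ECN.CE:
        # Congestion experienced in the tunnel propagates to the inner header.
        return ECN.CE, False
    if inner is ECN.CE:
        # CE arrived at the ingress; only an outer ECT(1) is flagged in -01.
        return ECN.CE, outer is ECN.ECT_1
    # Inner ECT(0) or ECT(1) is preserved whatever the outer ECT value.
    return inner, False
```

A caller that chooses to raise an alarm when the flag is True would need to rate-limit it, since the text above requires that illegal combinations do not amplify into a flood of alarm messages.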
6. Backward Compatibility | 6. Backward Compatibility | |||
A legacy tunnel egress may not know how to process an ECN field, so | Note: in RFC3168, a tunnel was in one of two modes: limited | |||
it will most likely simply disregard all outer headers. Therefore, | functionality or full functionality. Rather than working with modes | |||
unless a compliant tunnel ingress has established that the tunnel | of the tunnel as a whole, this specification uses the term `state' to | |||
egress understands ECN processing, it MUST only send packets with the | refer separately to the state of each tunnel end point, which is how | |||
ECN field set to Not-ECT in the outer header. Otherwise, if ECN | implementations have to work. | |||
capable outer headers were sent towards a legacy egress, it would | ||||
dangerously remove information about congestion experienced within | ||||
the tunnel. | ||||
A tunnel ingress may establish whether its tunnel egress will | If one end of an IPsec tunnel is compliant with [RFC4301], the other | |||
understand ECN processing by configuration or by negotiation. Note | end can be guaranteed to also be [RFC4301] compliant (there could be | |||
that a [RFC4301] tunnel ingress that has used IKEv2 key management | corner cases where manual keying is used, but they will be ignored | |||
[RFC4306] can guarantee that the tunnel egress is also RFC4301- | here). So there is no backward compatibility problem with IKEv2 | |||
compliant and therefore need not negotiate ECN capabilities. | RFC4301 IPsec tunnels. But once we extend our scope to any IP in IP | |||
tunnel, we have to cater for the possibility that a legacy tunnel | ||||
egress may not know how to process an ECN field, so if ECN capable | ||||
outer headers were sent towards a legacy (e.g. [RFC2003]) egress, it | ||||
would most likely simply disregard the outer headers, dangerously | ||||
discarding information about congestion experienced within the | ||||
tunnel. ECN-capable traffic sources would not see any congestion | ||||
feedback and instead continually ratchet up their share of the | ||||
bandwidth without realising that cross-flows from other ECN sources | ||||
were continually having to ratchet down. | ||||
To be compliant with this specification a tunnel ingress that does | To be compliant with this specification a tunnel ingress that does | |||
not know the egress ECN capability (e.g. by configuration) MUST | not always know the ECN capability of its tunnel egress MUST | |||
implement a 'normal' mode and a 'compatibility' mode, and it MUST | implement a 'normal' state and a 'compatibility' state, and it MUST | |||
initiate each negotiated tunnel in compatibility mode. On the other | initiate each negotiated tunnel in the compatibility state. | |||
hand, a compliant tunnel egress MUST merely implement the one | ||||
behaviour in Section 5, which we term 'full-functionality' mode. | ||||
Before switching to normal mode, a compliant tunnel ingress that does | However, a tunnel ingress can be compliant even if it only implements | |||
not know the egress ECN capability (e.g. by configuration) MUST | the 'normal state' of encapsulation behaviour, but only as long as it | |||
negotiate with the tunnel egress to establish whether the egress is | is designed or configured so that all possible tunnel egress nodes it | |||
in full functionality mode. If the egress is in full functionality | will ever talk to will have full ECN functionality (RFC3168 full | |||
mode, the ingress puts itself into normal mode. In normal mode the | functionality mode, RFC4301 and this present specification). The | |||
ingress follows the encapsulation rule in Section 5 (i.e. it copies | `normal state' is that defined in Section 5 (i.e. header copying). | |||
the inner ECN field into the outer header). If the egress is not in | Note that a [RFC4301] tunnel ingress that has used IKEv2 key | |||
full-functionality mode or doesn't understand the question, the | management [RFC4306] can guarantee that its tunnel egress is also | |||
tunnel ingress MUST remain in compatibility mode. | RFC4301-compliant and therefore need not further negotiate ECN | |||
capabilities. | ||||
A tunnel ingress in compatibility mode MUST set all outer headers to | Before switching to normal state, a compliant tunnel ingress that | |||
Not-ECT. | does not know the egress ECN capability MUST negotiate with the | |||
tunnel egress. If the egress says it is in full functionality state | ||||
(or mode), the ingress puts itself into normal state. In normal | ||||
state the ingress follows the encapsulation rule in Section 5 (i.e. | ||||
header copying). If the egress says it is not in full-functionality | ||||
state/mode or doesn't understand the question, the tunnel ingress | ||||
MUST remain in compatibility state. | ||||
A tunnel ingress in compatibility state MUST set all outer headers to | ||||
Not-ECT. This is the same per packet behaviour as the ingress end of | ||||
RFC3168's limited functionality mode. | ||||
A tunnel ingress that only implements compatibility state is at least | ||||
safe with the ECN behaviour of any egress it may encounter (any of | ||||
RFC2003, RFC2401, either mode of RFC2481 and RFC3168's limited | ||||
functionality mode). But an ingress cannot claim compliance with | ||||
this specification simply by disabling ECN processing across the | ||||
tunnel. A compliant tunnel ingress MUST at least implement `normal | ||||
state' and, if it might be used with arbitrary tunnel egress nodes, | ||||
it MUST also implement `compatibility state'. | ||||
A compliant tunnel egress on the other hand merely needs to implement | ||||
the one behaviour in Section 5, which we term 'full-functionality' | ||||
state, as it is the same as the egress end of the full-functionality | ||||
mode of [RFC3168]. It is also the same as the [RFC4301] egress | ||||
behaviour. | ||||
The decapsulation rules for the egress of the tunnel in Section 5 | The decapsulation rules for the egress of the tunnel in Section 5 | |||
have been defined in such a way that congestion control will still | have been defined in such a way that congestion control will still | |||
work safely if any of the earlier versions of ECN processing are used | work safely if any of the earlier versions of ECN processing are used | |||
unilaterally at the encapsulating ingress of the tunnel. If a tunnel | unilaterally at the encapsulating ingress of the tunnel (any of | |||
ingress tries to negotiate to use limited functionality mode or full | RFC2003, RFC2401, either mode of RFC2481, either mode of RFC3168, | |||
functionality mode, a decapsulating tunnel egress compliant with this | RFC4301 and this present specification). If a tunnel ingress tries | |||
specification MUST agree to the request, even though its behaviour | to negotiate to use limited functionality mode or full functionality | |||
will be the same in both cases. For 'forward compatibility', a | mode [RFC3168], a decapsulating tunnel egress compliant with this | |||
compliant tunnel egress MUST raise a warning about any requests to | specification MUST agree to either request, as its behaviour will be | |||
enter modes it doesn't recognise, but it can continue operating. If | the same in both cases. | |||
no ECN-related mode is requested, no error or warning need be raised | ||||
as the egress behaviour is compatible with all the legacy ingress | ||||
behaviours that don't negotiate capabilities. | ||||
Note that if a compliant node is the ingress for multiple tunnels, a | For 'forward compatibility', a compliant tunnel egress SHOULD raise a | |||
mode setting will need to be stored for each tunnel ingress. | warning about any requests to enter states or modes it doesn't | |||
However, if a node is the egress for multiple tunnels, none of the | recognise, but it can continue operating. If no ECN-related state or | |||
tunnels will need to store a mode setting, because a compliant egress | mode is requested, a compliant tunnel egress need not raise an error | |||
can only be in one mode. | or warning as its egress behaviour is compatible with all the legacy | |||
ingress behaviours that don't negotiate capabilities. | ||||
Implementation note: if a compliant node is the ingress for multiple | ||||
tunnels, a state setting will need to be stored for each tunnel | ||||
ingress. However, if a node is the egress for multiple tunnels, none | ||||
of the tunnels will need to store a state setting, because a | ||||
compliant egress can only be in one state. | ||||
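The ingress behaviour described in this section can be sketched as a small per-tunnel state machine. This is a minimal illustration under stated assumptions: the class and method names (`TunnelIngress`, `on_negotiation_reply`, `outer_ecn`) are hypothetical, the negotiation protocol itself is not modelled, and a reply of `None` stands for an egress that did not understand the question.

```python
from enum import Enum
from typing import Optional


class IngressState(Enum):
    COMPATIBILITY = "compatibility"
    NORMAL = "normal"


NOT_ECT = 0b00  # two-bit Not-ECT codepoint (RFC 3168)


class TunnelIngress:
    """ECN state kept by the ingress for one tunnel."""

    def __init__(self) -> None:
        # Each negotiated tunnel starts in the compatibility state.
        self.state = IngressState.COMPATIBILITY

    def on_negotiation_reply(self, egress_full_functionality: Optional[bool]) -> None:
        """Record the egress answer; None means it did not understand."""
        if egress_full_functionality is True:
            self.state = IngressState.NORMAL
        else:
            self.state = IngressState.COMPATIBILITY

    def outer_ecn(self, inner_ecn: int) -> int:
        """Return the ECN field to write into the outer header."""
        if self.state is IngressState.NORMAL:
            return inner_ecn & 0b11  # normal state: copy the inner field
        return NOT_ECT  # compatibility state: always Not-ECT


# One state setting per tunnel for which this node is the ingress.
tunnels = {"tunnel-a": TunnelIngress(), "tunnel-b": TunnelIngress()}
tunnels["tunnel-a"].on_negotiation_reply(True)
```

In this sketch only the ingress side stores anything per tunnel, which mirrors the implementation note: a compliant egress has a single behaviour and so needs no equivalent table.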
7. Changes from Earlier RFCs | 7. Changes from Earlier RFCs | |||
The rule that a tunnel ingress MUST copy any ECN field into the outer | The rule that a normal state tunnel ingress MUST copy any ECN field | |||
header is a change to RFC3168 (unless it is a Load Regulator as well, | into the outer header is a change to the ingress behaviour of | |||
in which case there is no change). | RFC3168, but it is the same as the rules for IPsec tunnels in | |||
RFC4301. | ||||
The rules for calculating the outgoing ECN field on decapsulation at | The rules for calculating the outgoing ECN field on decapsulation at | |||
a tunnel egress are in line with the full functionality mode of ECN | a tunnel egress are in line with the full functionality mode of ECN | |||
in RFC3168 and with RFC4301, except that neither identified the need | in RFC3168 and with RFC4301, except that neither identified that an | |||
to raise an alarm if the inner header was CE but the outer header was | outer header of ECT(1) combined with an inner header of CE was an | |||
ECT. | illegal combination. | |||
The rules for how a tunnel establishes whether the egress has full | The rules for how a tunnel establishes whether the egress has full | |||
functionality ECN capabilities are an update to RFC3168. For all the | functionality ECN capabilities are an update to RFC3168. For all the | |||
typical cases, RFC4301 is not updated by the ECN capability check in | typical cases, RFC4301 is not updated by the ECN capability check in | |||
this specification, because a typical RFC4301 tunnel ingress will | this specification, because a typical RFC4301 tunnel ingress will | |||
have already established that it is talking to an RFC4301 tunnel | have already established that it is talking to an RFC4301 tunnel | |||
egress (e.g. if it uses IKEv2). However, there may be some corner | egress (e.g. if it uses IKEv2). However, there may be some corner | |||
cases (e.g. manual keying) where an RFC4301 tunnel ingress talks with | cases (e.g. manual keying) where an RFC4301 tunnel ingress talks with | |||
an egress with limited functionality ECN handling. For such corner | an egress with limited functionality ECN handling. Strictly, for | |||
cases, the requirement to use compatibility mode in this | such corner cases, the requirement to use compatibility mode in this | |||
specification updates RFC4301. | specification updates RFC4301. | |||
The optional ECN Tunnel field in the IPsec security association | The optional ECN Tunnel field in the IPsec security association | |||
database (SAD) and the optional ECN Tunnel Security Association | database (SAD) and the optional ECN Tunnel Security Association | |||
Attribute defined in RFC3168 are no longer needed. The security | Attribute defined in RFC3168 are no longer needed. The security | |||
association (SA) has no policy on ECN usage, because all RFC4301 | association (SA) has no policy on ECN usage, because all RFC4301 | |||
tunnels now support ECN without any policy choice. | tunnels now support ECN without any policy choice. | |||
RFC3168 defines a (required) limited functionality mode and an | RFC3168 defines a (required) limited functionality mode and an | |||
(optional) full functionality mode for a tunnel, but RFC4301 doesn't | (optional) full functionality mode for a tunnel, but RFC4301 doesn't | |||
need modes. In this specification only the ingress might need two | need modes. In this specification only the ingress might need two | |||
modes, unlike the modes of RFC3168 that were properties of the pair | states: a normal state (required) and a compatibility state (required | |||
of tunnel endpoints after negotiation. | in some scenarios, optional in others). The egress needs only full- | |||
functionality state which handles ECN the same as either mode of | ||||
RFC3168 or RFC4301. | ||||
All these ECN processing rules update RFC2003 on IP in IP tunnelling. | Additional changes to the RFC Index (to be removed by the RFC Editor): | |||
In the RFC index, RFC3168 should be identified as an update to | ||||
RFC2003 and RFC4301 should be identified as an update to RFC3168. | ||||
This specification updates RFC3168. It also suggests a minor | ||||
optional warning and a corner-case change to RFC4301, but these don't | ||||
really count as an update. | ||||
8. IANA Considerations | 8. IANA Considerations | |||
This memo includes no request to IANA. | This memo includes no request to IANA. | |||
9. Security Considerations | 9. Security Considerations | |||
Section 3.1 discusses the security constraints imposed on ECN tunnel | Section 3.1 discusses the security constraints imposed on ECN tunnel | |||
processing. The Design Principles of Section 4 trade off between | processing. The Design Principles of Section 4 trade off between | |||
security (covert channels) and congestion monitoring & control. In | security (covert channels) and congestion monitoring & control. In | |||
fact, ensuring congestion markings are not lost is itself another | fact, ensuring congestion markings are not lost is itself another | |||
aspect of security, because if we allowed congestion notification to | aspect of security, because if we allowed congestion notification to | |||
be lost, any attempt to enforce a response to congestion would be | be lost, any attempt to enforce a response to congestion would be | |||
much harder. | much harder. | |||
We keep the behaviour defined in both RFC3168 and RFC4301 where, if | If alternate congestion notification semantics are defined for a | |||
the inner and outer headers carry contradictory ECT values the inner | certain PHB (e.g. the pre-congestion notification architecture | |||
header is preserved for onward forwarding. However, in writing this | [I-D.ietf-pcn-architecture]), the scope of the alternate semantics | |||
document we noticed this behaviour would hide illegal suppression of | might typically be bounded by the limits of a Diffserv region or | |||
congestion notification from the detection mechanism designed for | regions, as envisaged in [RFC4774]. The inner headers in tunnels | |||
this attack. One reason two ECT codepoints were defined was to | crossing the boundary of such a Diffserv region but ending within the | |||
enable the source to detect if a CE marking had been applied then | region can potentially leak the external congestion notification | |||
subsequently removed. The source could detect this by weaving a | semantics into the region, or leak the internal semantics out of the | |||
pseudo-random sequence of ECT(0) and ECT(1) values into a stream of | region. [RFC2983] discusses the need for Diffserv traffic | |||
packets [RFC3540]. With the rules as they stand in RFC3168 and | conditioning to be applied at these tunnel endpoints as if they are | |||
RFC4301, within a tunnel a CE marking could be added and subsequently | at the edge of the Diffserv region. Similar concerns apply to any | |||
removed by a non-compliant node without detection, because the | processing or propagation of the ECN field at the edges of a Diffserv | |||
evidence of such misbehaviour is removed by the decapsulator. | region with alternate ECN semantics. Such edge processing must also | |||
be applied at the endpoints of tunnels with ends both inside and | ||||
outside the domain. [I-D.ietf-pcn-architecture] gives specific | ||||
advice on this for the PCN case, but other definitions of alternate | ||||
semantics will need to discuss the specific security implications in | ||||
their case. | ||||
We could have specified that an outer header value of ECT should | With the rules as they stand in RFC3168 and RFC4301, a small part of | |||
overwrite a contradictory ECT value in the inner header to close this | the protection of the ECN nonce [RFC3540] is compromised. One reason | |||
loophole. But we chose not to for two reasons: i) we wanted to avoid | two ECT codepoints were defined was to enable the data source to | |||
any changes to IPsec tunnelling behaviour; ii) allowing ECT values in | detect if a CE marking had been applied then subsequently removed. | |||
the outer header to override the inner header would have increased | The source could detect this by weaving a pseudo-random sequence of | |||
the bandwidth of the covert channel through the egress gateway from 1 | ECT(0) and ECT(1) values into a stream of packets, which is termed an | |||
to 1.5 bit per datagram, potentially threatening to upset the | ECN nonce. By the decapsulation rules in RFC3168 and RFC4301, if the | |||
consensus established in the security area that says that the | inner and outer headers carry contradictory ECT values only the inner | |||
bandwidth of this covert channel can now be safely managed. | header is preserved for onward forwarding. So if a CE marking added | |||
to the outer ECN field has been illegally (or accidentally) | ||||
suppressed by a subsequent node in the tunnel, the decapsulator will | ||||
revert the ECN field to its value before tampering, hiding all | ||||
evidence of the crime from the onward feedback loop. To close this | ||||
loophole, we could have specified that an outer header value of ECT | ||||
should overwrite a contradictory ECT value in the inner header (for | ||||
how, see the ideal decapsulation rules proposed in Appendix C). But | ||||
currently we choose to keep the 'broken' behaviour defined in RFC3168 | ||||
& RFC4301 for all the following reasons: | ||||
1. We wanted to avoid any changes to IPsec tunnelling behaviour; | ||||
2. Allowing ECT values in the outer header to override the inner | ||||
header would have increased the bandwidth of the covert channel | ||||
through the egress gateway from 1 to 1.5 bit per datagram, | ||||
potentially threatening to upset the consensus established in the | ||||
security area that says that the bandwidth of this covert channel | ||||
can now be safely managed; | ||||
3. This loophole is only applicable in the corner case where the | ||||
attacker is a network node downstream of a congested node in the | ||||
same tunnel; | ||||
4. In tunnelling scenarios, the ECN nonce is already vulnerable to | ||||
suppression by nodes downstream of a congested node in the same | ||||
tunnel, if they can copy the ECT value in the inner header to the | ||||
outer header (any node in the tunnel can do this if the inner | ||||
header is not encrypted, and an IPsec tunnel egress can do it | ||||
whether or not the tunnel is encrypted); | ||||
5. Although the 'broken' decapsulation behaviour removes evidence of | ||||
congestion suppression from the onward feedback loop, the | ||||
decapsulator itself can at least detect that congestion within | ||||
the tunnel has been suppressed; | ||||
6. The ECN nonce [RFC3540] currently has experimental status and | ||||
there has been no evidence that anyone has implemented it beyond | ||||
the author's prototype. | ||||
If a legacy security policy configures a legacy tunnel ingress to | ||||
negotiate to turn off ECN processing, a compliant tunnel egress will | ||||
agree to a request to turn off ECN processing but it will actually | ||||
still copy CE markings from the outer to the forwarded header. | ||||
Although the tunnel ingress 'I' in Figure 1 will set all ECN fields | ||||
in outer headers to Not-ECT, 'M' could still toggle CE on and off to | ||||
communicate covertly with 'B', because we have specified that 'E' | ||||
only has one mode regardless of what mode it says it has negotiated. | ||||
We could have specified that 'E' should have a limited functionality | ||||
mode and check for such behaviour. But we decided not to add the | ||||
extra complexity of two modes on a compliant tunnel egress merely to | ||||
cater for a legacy security concern that is now considered | ||||
manageable. | ||||
10. Conclusions | 10. Conclusions | |||
This document updates the tunnelling treatment of RFC3168 ECN for all | This document updates the ingress tunnelling encapsulation of RFC3168 | |||
IP in IP tunnels to bring it into line with the new behaviour in the | ECN for all IP in IP tunnels to bring it into line with the new | |||
IPsec architecture of RFC4301. | behaviour in the IPsec architecture of RFC4301. | |||
At the tunnel egress, header decapsulation for the default ECN | At a tunnel egress, header decapsulation for the default ECN marking | |||
marking behaviour is broadly unchanged except that one exceptional | behaviour is broadly unchanged except that one exceptional case has | |||
case has been catered for. At the ingress, for all forms of IP in IP | been catered for. At the ingress, for all forms of IP in IP tunnel, | |||
tunnel, encapsulation has been brought into line with the new IPsec | encapsulation has been brought into line with the new IPsec rules in | |||
rules in RFC4301 which copy rather than reset CE markings when | RFC4301 which copy rather than reset CE markings when creating outer | |||
creating outer headers. Previously, upstream congestion information | headers. | |||
was not revealed in the outer header, which limited the scope of some | ||||
management monitoring techniques and prevented certain active queue | This change to encapsulation has been motivated by analysis from the | |||
management algorithms from taking account of upstream congestion | three perspectives of security, control and management. They are | |||
markings. The change ensures all IP in IP tunnels reflect the more | somewhat in tension as to whether a tunnel ingress should copy | |||
relaxed attitude to revealing congestion information in the new IPsec | congestion markings into the outer header it creates or reset them. | |||
architecture, which now deems that the threat from 2-bit covert | From the control perspective either copying or resetting works for | |||
channels can be managed without disabling ECN. | existing arrangements, but copying has more potential for simplifying | |||
control and resetting breaks at least one proposal already on the | ||||
standards track. From the management and monitoring perspective | ||||
copying is preferable. From the network security perspective (theft | ||||
of service etc) copying is preferable. From the information security | ||||
perspective resetting is preferable, but the IETF Security Area now | ||||
considers copying acceptable given the bandwidth of a 2-bit covert | ||||
channel can be managed. Therefore there are no points against | ||||
copying and a number against resetting CE on ingress. | ||||
The change ensures ECN processing in all IP in IP tunnels reflects | ||||
this slightly more permissive attitude to revealing congestion | ||||
information in the new IPsec architecture. Once all tunnelling of | ||||
ECN works the same, ECN markings will have a defined meaning when | ||||
measured at any point in a network. This new certainty will enable | ||||
new uses of the ECN field that would otherwise be confounded by | ||||
ambiguity. | ||||
Also, this document defines more generic principles to guide the | Also, this document defines more generic principles to guide the | |||
design of alternate forms of tunnel processing of congestion | design of alternate forms of tunnel processing of congestion | |||
notification, if required for specific Diffserv PHBs (such as will be | notification, if required for specific Diffserv PHBs or for other | |||
required for the PCN working group) or for other lower layer | lower layer encapsulating protocols that might support congestion | |||
encapsulating protocols that might support congestion notification in | notification in the future. | |||
the future (e.g. MPLS). | ||||
11. Acknowledgements | 11. Acknowledgements | |||
Thanks to David Black, Bruce Davie, Toby Moncaster and Gabriele | Thanks to David Black for explaining a better way to think about | |||
Corliano for their careful review comments. | function placement and to Louise Burness for a better way to think | |||
about multilayer transports and networks, having read | ||||
[Patterns_Arch]. Also thanks to Arnaud Jacquet for ideas behind the | ||||
algorithms in Appendix B. Thanks to Bruce Davie, Toby Moncaster, | ||||
Gorry Fairhurst, Sally Floyd, Alfred Hoenes and Gabriele Corliano for | ||||
their thoughts and careful review comments. | ||||
12. Comments Solicited | 12. Comments Solicited | |||
Comments and questions are encouraged and very welcome. They can be | Comments and questions are encouraged and very welcome. They can be | |||
addressed to the IETF Transport Area working group mailing list | addressed to the IETF Transport Area working group mailing list | |||
<tsvwg@ietf.org>, and/or to the authors. | <tsvwg@ietf.org>, and/or to the authors. | |||
13. References | 13. References | |||
13.1. Normative References | 13.1. Normative References | |||
skipping to change at page 16, line 14 | skipping to change at page 23, line 14 | |||
[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | [RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition | |||
of Explicit Congestion Notification (ECN) to IP", | of Explicit Congestion Notification (ECN) to IP", | |||
RFC 3168, September 2001. | RFC 3168, September 2001. | |||
[RFC4301] Kent, S. and K. Seo, "Security Architecture for the | [RFC4301] Kent, S. and K. Seo, "Security Architecture for the | |||
Internet Protocol", RFC 4301, December 2005. | Internet Protocol", RFC 4301, December 2005. | |||
13.2. Informative References | 13.2. Informative References | |||
[802.1au] "IEEE Standard for Local and Metropolitan Area Networks-- | [I-D.eardley-pcn-marking-behaviour] | |||
Virtual Bridged Local Area Networks - Amendment 10: | Eardley, P., "Marking behaviour of PCN-nodes", | |||
Congestion Notification", 2006, | draft-eardley-pcn-marking-behaviour-01 (work in progress), | |||
<http://www.ieee802.org/1/pages/802.1au.html>. | June 2008. | |||
(Work in Progress; Access Controlled link within page) | [I-D.ietf-pcn-architecture] | |||
Eardley, P., "Pre-Congestion Notification Architecture", | ||||
draft-ietf-pcn-architecture-03 (work in progress), | ||||
February 2008. | ||||
[BBnet] Sexton, M. and A. Reid, "Broadband Networking: ATM, | [I-D.ietf-pwe3-congestion-frmwk] | |||
SDH and SONET", Artech House telecommunications | Bryant, S., Davie, B., Martini, L., and E. Rosen, | |||
library ISBN: 0-89006-578-0, 1997. | "Pseudowire Congestion Control Framework", | |||
draft-ietf-pwe3-congestion-frmwk-01 (work in progress), | ||||
May 2008. | ||||
[I-D.ietf-tsvwg-ecn-mpls] | [I-D.moncaster-pcn-3-state-encoding] | |||
Davie, B., "Explicit Congestion Marking in MPLS", | Moncaster, T., Briscoe, B., and M. Menth, "A three state | |||
draft-ietf-tsvwg-ecn-mpls-00 (work in progress), | extended PCN encoding scheme", | |||
March 2007. | draft-moncaster-pcn-3-state-encoding-00 (work in | |||
progress), June 2008. | ||||
[I-D.rosen-pwe3-congestion] | [IEEE802.1au] | |||
Rosen, E., "Pseudowire Congestion Control Framework", | IEEE, "IEEE Standard for Local and Metropolitan Area | |||
draft-rosen-pwe3-congestion-04 (work in progress), | Networks--Virtual Bridged Local Area Networks - Amendment | |||
October 2006. | 10: Congestion Notification", 2008, | |||
<http://www.ieee802.org/1/pages/802.1au.html>. | ||||
[PCN-arch] | (Work in Progress; Access Controlled link within page) | |||
Eardley, P., Babiarz, J., Chan, K., Charny, A., Geib, R., | ||||
Karagiannis, G., Menth, M., and T. Tsou, "Pre-Congestion | [ITU-T.I.371] | |||
Notification Architecture", | ITU-T, "Traffic Control and Congestion Control in | |||
draft-eardley-pcn-architecture-00 (work in progress), | B-ISDN", ITU-T Rec. I.371 (03/04), March 2004. | |||
June 2007. | ||||
[PCNcharter] | [PCNcharter] | |||
IETF, "Congestion and Pre-Congestion Notification (pcn)", | IETF, "Congestion and Pre-Congestion Notification (pcn)", | |||
IETF w-g charter, Feb 2007, | IETF w-g charter, Feb 2007, | |||
<http://www.ietf.org/html.charters/pcn-charter.html>. | <http://www.ietf.org/html.charters/pcn-charter.html>. | |||
[Patterns_Arch] | ||||
Day, J., "Patterns in Network Architecture: A Return to | ||||
Fundamentals", Pub: Prentice Hall ISBN-13: 9780132252423, | ||||
Jan 2008. | ||||
[RFC1254] Mankin, A. and K. Ramakrishnan, "Gateway Congestion | [RFC1254] Mankin, A. and K. Ramakrishnan, "Gateway Congestion | |||
Control Survey", RFC 1254, August 1991. | Control Survey", RFC 1254, August 1991. | |||
[RFC1701] Hanks, S., Li, T., Farinacci, D., and P. Traina, "Generic | [RFC1701] Hanks, S., Li, T., Farinacci, D., and P. Traina, "Generic | |||
Routing Encapsulation (GRE)", RFC 1701, October 1994. | Routing Encapsulation (GRE)", RFC 1701, October 1994. | |||
[RFC2205] Braden, B., Zhang, L., Berson, S., Herzog, S., and S. | [RFC2205] Braden, B., Zhang, L., Berson, S., Herzog, S., and S. | |||
Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 | Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 | |||
Functional Specification", RFC 2205, September 1997. | Functional Specification", RFC 2205, September 1997. | |||
[RFC2637] Hamzeh, K., Pall, G., Verthein, W., Taarud, J., Little, | [RFC2637] Hamzeh, K., Pall, G., Verthein, W., Taarud, J., Little, | |||
W., and G. Zorn, "Point-to-Point Tunneling Protocol", | W., and G. Zorn, "Point-to-Point Tunneling Protocol", | |||
RFC 2637, July 1999. | RFC 2637, July 1999. | |||
[RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, | [RFC2661] Townsley, W., Valencia, A., Rubens, A., Pall, G., Zorn, | |||
G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", | G., and B. Palter, "Layer Two Tunneling Protocol "L2TP"", | |||
RFC 2661, August 1999. | RFC 2661, August 1999. | |||
[RFC2983] Black, D., "Differentiated Services and Tunnels", | ||||
RFC 2983, October 2000. | ||||
[RFC3426] Floyd, S., "General Architectural and Policy | [RFC3426] Floyd, S., "General Architectural and Policy | |||
Considerations", RFC 3426, November 2002. | Considerations", RFC 3426, November 2002. | |||
[RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit | [RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit | |||
Congestion Notification (ECN) Signaling with Nonces", | Congestion Notification (ECN) Signaling with Nonces", | |||
RFC 3540, June 2003. | RFC 3540, June 2003. | |||
[RFC4306] Kaufman, C., "Internet Key Exchange (IKEv2) Protocol", | [RFC4306] Kaufman, C., "Internet Key Exchange (IKEv2) Protocol", | |||
RFC 4306, December 2005. | RFC 4306, December 2005. | |||
[RFC4423] Moskowitz, R. and P. Nikander, "Host Identity Protocol | [RFC4423] Moskowitz, R. and P. Nikander, "Host Identity Protocol | |||
(HIP) Architecture", RFC 4423, May 2006. | (HIP) Architecture", RFC 4423, May 2006. | |||
[RFC4774] Floyd, S., "Specifying Alternate Semantics for the | ||||
Explicit Congestion Notification (ECN) Field", BCP 124, | ||||
RFC 4774, November 2006. | ||||
[RFC5129] Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion | ||||
Marking in MPLS", RFC 5129, January 2008. | ||||
[Shayman] "Using ECN to Signal Congestion Within an MPLS Domain", | [Shayman] "Using ECN to Signal Congestion Within an MPLS Domain", | |||
2000, <http://www.ee.umd.edu/~shayman/papers.d/ | 2000, <http://www.ee.umd.edu/~shayman/papers.d/ | |||
draft-shayman-mpls-ecn-00.txt>. | draft-shayman-mpls-ecn-00.txt>. | |||
(Expired) | (Expired) | |||
Appendix A. In-path Load Regulation | Appendix A. Why resetting CE on encapsulation harms PCN | |||
Regarding encapsulation, the section of the PCN architecture | ||||
[I-D.ietf-pcn-architecture] on tunnelling says that header copying | ||||
(RFC4301) allows PCN to work correctly. However, resetting CE | ||||
markings confuses PCN marking. | ||||
The specific issue here concerns PCN excess rate marking | ||||
[I-D.eardley-pcn-marking-behaviour], i.e. the bulk marking of traffic | ||||
that exceeds a configured threshold rate. One of the goals of excess | ||||
rate marking is to enable the speedy removal of excess admission | ||||
controlled traffic following re-routes caused by link failures or | ||||
other disasters. This maintains a share of the capacity for | ||||
competing admission controlled traffic and for traffic in lower | ||||
priority classes. After failures, traffic re-routed onto remaining | ||||
links can often stress multiple links along a path. Therefore, | ||||
traffic can arrive at a link under stress with some proportion | ||||
already marked for removal by a previous link. By design, marked | ||||
traffic will be removed by the overall system in subsequent round | ||||
trips. So when the excess rate marking algorithm decides how much | ||||
traffic to mark for removal, it doesn't include traffic already | ||||
marked for removal by another node upstream (the `Excess traffic | ||||
meter function' of [I-D.eardley-pcn-marking-behaviour]). | ||||
However, if an RFC3168 tunnel ingress intervenes, it resets the ECN | ||||
field in all the outer headers, hiding all the evidence of problems | ||||
upstream. Thus, although excess rate marking works fine with RFC4301 | ||||
IPsec tunnels, with RFC3168 tunnels it typically removes large | ||||
volumes of traffic that it didn't need to remove at all. | ||||
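The effect can be illustrated with simple rate arithmetic. The following is a minimal sketch only, not the metering algorithm of [I-D.eardley-pcn-marking-behaviour]: the function name, its parameters and the example rates are hypothetical, and rates are treated as fixed numbers rather than measured packet streams. It assumes the meter ignores traffic that is visibly marked already and marks whatever remains above the configured threshold.

```python
def new_excess_marking(arriving_rate: float,
                       visibly_marked_rate: float,
                       threshold_rate: float) -> float:
    """Rate of traffic a node would newly mark for removal.

    Hypothetical simplification: traffic already marked upstream is
    excluded from the meter, and only the unmarked traffic above the
    configured threshold is marked.
    """
    unmarked_rate = arriving_rate - visibly_marked_rate
    return max(0.0, unmarked_rate - threshold_rate)


# Assumed example: 150 units arrive, 50 of them already marked upstream,
# and the link's configured threshold is 100 units.
with_copying = new_excess_marking(150.0, 50.0, 100.0)    # markings visible
with_resetting = new_excess_marking(150.0, 0.0, 100.0)   # markings hidden
```

With the upstream markings visible in the outer header, the node marks nothing further; with the markings hidden by a reset, it marks another 50 units that the upstream node had already dealt with.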
Appendix B. Contribution to Congestion across a Tunnel | ||||
This specification mandates that a tunnel ingress determines the ECN | ||||
field of each new outer tunnel header by copying the arriving header. | ||||
If instead the outer ECN field were reset at a tunnel ingress (as it | ||||
was for the full functionality mode of RFC3168), it would be possible | ||||
for the tunnel egress to measure: | ||||
o congestion marking before the tunnel ingress (fraction of inner | ||||
header markings, p_i); | ||||
o congestion marking across the tunnel (fraction of outer header | ||||
markings, p_t); | ||||
o congestion marking after the tunnel egress (fraction of departing | ||||
header markings, p_o). | ||||
Although the newly mandated copying behaviour at ingress gains the | ||||
advantages described in the body of this specification, this one | ||||
advantage of the resetting behaviour of RFC3168 seems to have been | ||||
lost: on first impressions, it seems that the egress can no longer | ||||
accurately measure congestion contributed along the tunnel (p_t). | ||||
The egress could _estimate_ the contribution along the tunnel by | ||||
measuring which packets carry only a mark in the outer header (not the | ||||
inner). But this is not precisely the same as the congestion | ||||
contributed along the tunnel; tunnel nodes may have tried to mark | ||||
some packets that already had a marking in both the inner and outer | ||||
header. Measuring only additional outer markings will miss these. | ||||
Nonetheless, with the newly proposed scheme, a tunnel egress can | ||||
derive a precise estimate of marking introduced across a tunnel (p_t) | ||||
as follows. | ||||
The combined fraction of markings at the tunnel egress will be p_o = | ||||
1 - (1 - p_i)(1 - p_t). Explanation: this is (1 - the probability a | ||||
departing packet is not marked), which is (1 - (prob not marked | ||||
before tunnel)(prob not marked along tunnel)). Therefore, | ||||
rearranging, the egress can infer the fraction of marks introduced | ||||
across the tunnel as p_t = (p_o - p_i)/(1 - p_i). If arriving | ||||
congestion is low (p_i << 1), then the approximation p_t ~ (p_o - p_i) | ||||
should be good enough. This is the estimate we advised originally; | ||||
i.e. measuring only the extra markings in the outer header that are | ||||
not present in the inner header. If a better approximation is needed, | ||||
p_t ~ (p_o - p_i)(1 + p_i) removes the division, but still | ||||
assumes p_i << 1. | ||||
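As a worked illustration of the three formulae, the sketch below computes the exact value of p_t and both approximations from measured fractions p_i and p_o. The function name, the returned dictionary keys and the example figures are assumptions made for illustration; only the formulae themselves come from the text above.

```python
def tunnel_marking_fraction(p_i: float, p_o: float) -> dict:
    """Estimate the marking fraction p_t introduced across a tunnel.

    p_i: fraction of packets marked in the inner header (before tunnel).
    p_o: combined fraction of marked packets at the tunnel egress.
    """
    if not 0.0 <= p_i < 1.0:
        raise ValueError("p_i must satisfy 0 <= p_i < 1")
    if not p_i <= p_o <= 1.0:
        raise ValueError("p_o must satisfy p_i <= p_o <= 1")
    extra = p_o - p_i
    return {
        "exact": extra / (1.0 - p_i),          # p_t = (p_o - p_i)/(1 - p_i)
        "first_order": extra,                  # p_t ~ (p_o - p_i)
        "second_order": extra * (1.0 + p_i),   # p_t ~ (p_o - p_i)(1 + p_i)
    }


# Assumed example: 2% marked before the tunnel, 5% marked across it,
# so the egress sees p_o = 1 - (1 - 0.02)(1 - 0.05) = 0.069.
estimates = tunnel_marking_fraction(0.02, 0.069)
```

For these assumed figures the exact formula recovers 0.05, the first-order approximation gives 0.049 and the division-free second-order form gives about 0.04998, which shows how little is lost while p_i remains small.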
Using any of these formulae (including the precise one), it would be | ||||
possible for a tunnel egress to calculate a moving average of the | ||||
fraction of packets being marked by tunnel nodes, including those | ||||
already marked in the inner header. Alternatively, it should even be | ||||
possible for a tunnel egress to reverse engineer which packets would | ||||
have been marked across the tunnel if CE was reset on ingress even if | ||||
CE was actually copied on ingress. [[anchor3: Note from Bob: I've | ||||
worked out an algorithm so the tunnel egress can reverse engineer | ||||
marking as if CE was reset at the ingress even though CE was copied | ||||
at the ingress. It typically consumes 2 cycles / pkt, occasionally 4 | ||||
and very occasionally 8. {ToDo: On testing an implementation just now | ||||
it still has a wrinkle in it, but with a little more development I | ||||
believe it would work well. I'll write it into the next revision if | ||||
I get it working.}]] | ||||
Appendix C. Ideal Decapsulation Rules | ||||
Compliance with this appendix is NOT REQUIRED for compliance with the | ||||
present specification. | ||||
If the default ECN encapsulation behaviour does not offer suitable | ||||
trade offs, procedures exist for associating a new behaviour with a | ||||
new Diffserv PHB. However, it is unrealistic to expect vendors of | ||||
all IPsec and all IP in IP tunnel endpoints to cater for the | ||||
exceptional behaviour of PHB XXX. If all tunnels did require XXX- | ||||
specific behaviour, the resulting patchy and error-prone deployment | ||||
would probably cause XXX to suffer byzantine feature interactions | ||||
with poorly implemented tunnels. The default rules for tunnel | ||||
endpoints to handle both the Diffserv field and the ECN field should | ||||
'just work' when handling packets with an XXX Diffserv codepoint. | ||||
Given this specification requests a standards action to update the | ||||
RFC3168 encapsulation behaviour, this appendix explores a further | ||||
change to decapsulation that we ought to specify at the same time. | ||||
If instead this further change is added later, it will add another | ||||
set of backward compatibility combinations to the already complicated | ||||
change history of ECN tunnelling. | ||||
Multi-level congestion notification is currently on the IETF's | ||||
standards track agenda in the Congestion and Pre-Congestion | ||||
Notification (PCN) working group. The PCN working group requires | ||||
three congestion states (not marked and two levels of congestion | ||||
marking) [I-D.ietf-pcn-architecture]. The aim is for the first level | ||||
of marking to stop admitting new traffic and the second level to | ||||
terminate sufficient existing flows to bring a network back to its | ||||
operating point after a serious failure. | ||||
Although the ECN field gives sufficient codepoints for these three | ||||
states, the PCN working group cannot use them in case any tunnel | ||||
decapsulations occur within a PCN region. If a node in a tunnel sets | ||||
the ECN field to ECT(0) or ECT(1), this change will be discarded by a | ||||
tunnel egress compliant with RFC4301 and RFC3168. This can be seen | ||||
in Table 1, where the ECT values in the outer header are ignored | ||||
unless the inner header is the same. Effectively the ECT(0) and | ||||
ECT(1) codepoints have to be treated as just one codepoint when they | ||||
could otherwise have been used for their intended purpose of | ||||
congestion notification. Instead, the PCN working group has had to propose | ||||
using extra Diffserv codepoint(s) to encode the extra states | ||||
[I-D.moncaster-pcn-3-state-encoding], using up the rapidly exhausting | ||||
DSCP space while leaving ECN codepoints unused. | ||||
Although this is currently most pressing for the PCN working group, | ||||
the issue is more general. Under Security Considerations (Section 9) | ||||
it has already been explained that a data sender cannot use the | ||||
experimental ECN nonce [RFC3540] to detect suppression of congestion | ||||
notification along a tunnel. | ||||
More generally, the currently standardised tunnel decapsulation | ||||
behaviour unnecessarily wastes a quarter of two bits (i.e. half a | ||||
bit) in the IP (v4 & v6) header. As explained in Section 3.1, the | ||||
original reason for not copying down outer ECT codepoints for onward | ||||
forwarding was to limit the covert channel across a decapsulator to 1 | ||||
bit per packet. However, now that the IETF Security Area has deemed | ||||
that a 2-bit covert channel through an encapsulator is a manageable | ||||
risk, the same should be true for a decapsulator. | ||||
Table 2 proposes a more ideal layered decapsulation behaviour. Note: | ||||
this table is only to support discussion. It is not currently | ||||
proposed for standards action. The only difference from Table 1 | ||||
(that is proposed for standards action), is the swapping of the cells | ||||
highlighted as *ECT(X)*. | ||||
+--Incoming Outer Header--- | ||||
+---------------------+---------+-----------+-----------+-----------+ | ||||
| Incoming Inner | Not-ECT | ECT(0) | ECT(1) | CE | | ||||
| Header | | | | | | ||||
+---------------------+---------+-----------+-----------+-----------+ | ||||
| Not-ECT | Not-ECT | drop(!!!) | drop(!!!) | drop(!!!) | | ||||
| ECT(0) | ECT(0) | ECT(0) | *ECT(1)* | CE | | ||||
| ECT(1) | ECT(1) | *ECT(0)* | ECT(1) | CE | | ||||
| CE | CE | CE | CE (!!!) | CE | | ||||
+---------------------+---------+-----------+-----------+-----------+ | ||||
+-----Outgoing Header------ | ||||
Table 2: Ideal IP in IP Decapsulation (currently NOT REQUIRED) | ||||
Note that, if this ideal proposal were taken up, extra backwards | ||||
compatibility issues would have to be resolved. | ||||
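Table 2 can be transcribed directly as a lookup, which may help when comparing it against Table 1. This is a sketch for discussion only, in keeping with the status of the table itself. The names below are hypothetical, a dropped packet is represented as None, and the (!!!) annotations are not modelled beyond the drop outcome.

```python
from typing import Optional

NOT_ECT, ECT0, ECT1, CE = "Not-ECT", "ECT(0)", "ECT(1)", "CE"
DROP = None  # assumption: a dropped packet is represented as None

# Outgoing header indexed as IDEAL_DECAP[inner][outer], row by row from
# Table 2. The two cells that differ from Table 1 are those where the
# inner and outer headers carry different ECT codepoints.
IDEAL_DECAP = {
    NOT_ECT: {NOT_ECT: NOT_ECT, ECT0: DROP, ECT1: DROP, CE: DROP},
    ECT0:    {NOT_ECT: ECT0,    ECT0: ECT0, ECT1: ECT1, CE: CE},
    ECT1:    {NOT_ECT: ECT1,    ECT0: ECT0, ECT1: ECT1, CE: CE},
    CE:      {NOT_ECT: CE,      ECT0: CE,   ECT1: CE,   CE: CE},
}


def ideal_decapsulate(inner: str, outer: str) -> Optional[str]:
    """Return the outgoing ECN codepoint, or None if the packet is dropped."""
    return IDEAL_DECAP[inner][outer]
```

In this transcription a change between ECT(0) and ECT(1) made inside the tunnel survives decapsulation, which is the property that the currently standardised behaviour lacks.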
Appendix D. Non-Dependence of Tunnelling on In-path Load Regulation | ||||
We have said that at any point in a network, the Congestion Baseline | ||||
(where congestion notification starts from zero) should be the | ||||
previous upstream Load Regulator. We have also said that the ingress | ||||
of an IP in IP tunnel must copy congestion indications to the | ||||
encapsulating outer headers it creates. If the Load Regulator is in- | ||||
path rather than at the source, and also a tunnel ingress, these two | ||||
requirements seem to be contradictory. A tunnel ingress must not | ||||
reset incoming congestion, but a Load Regulator must be the | ||||
Congestion Baseline, implying it needs to reset incoming congestion. | ||||
In fact, the two requirements are not contradictory, because a Load | ||||
Regulator and a tunnel ingress are functions within a node that occur | ||||
in sequence on a stream of packets, not at the same point. Figure 3 | ||||
is borrowed from [RFC2983] (which was making a similar point about | ||||
the location of Diffserv traffic conditioning relative to the | ||||
encapsulation function of a tunnel). An in-path Load Regulator can | ||||
act on packets either at [1 - Before] encapsulation or at [2 - Outer] | ||||
after encapsulation. Load Regulation does not ever need to be | ||||
integrated with the [Encapsulate] function (but it can be for | ||||
efficiency). Therefore we can still maintain that the [Encapsulate] | ||||
function always copies CE into the outer header. | ||||
>>-----[1 - Before]--------[Encapsulate]----[3 - Inner]------------>> | ||||
\ | ||||
\ | ||||
+--------[2 - Outer]--------->> | ||||
Figure 3: Placement of In-Path Load Regulator Relative to Tunnel | ||||
Ingress | ||||
Then separately, if there is a Load Regulator at location [2 - | ||||
Outer], it might reset CE to ECT(0), say. Then the Congestion | ||||
Baseline for the lower layer (outer) will be [2 - Outer], while the | ||||
Congestion Baseline of the inner layer will be unchanged. But how | ||||
encapsulation works has nothing to do with whether a Load Regulator | ||||
is present or where it is. | ||||
If on the other hand a Load Regulator resets CE at [1 - Before], the | ||||
Congestion Baseline of both the inner and outer headers will be [1 - | ||||
Before]. But again, encapsulation is independent of load regulation. | ||||
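The independence of the two functions can be sketched as two separate steps applied in sequence. The Packet type and the function names below are hypothetical, and the Load Regulator is reduced to the single act of resetting CE to ECT(0); the sketch is not meant to describe any real implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    ecn: str                           # ECN codepoint of this header
    inner: Optional["Packet"] = None   # encapsulated packet, if any


def encapsulate(pkt: Packet) -> Packet:
    """Tunnel ingress: always copy the arriving ECN field to the outer."""
    return Packet(ecn=pkt.ecn, inner=pkt)


def reset_ce(pkt: Packet) -> Packet:
    """Load Regulator acting as Congestion Baseline: reset CE to ECT(0)
    in the outermost header it is handed, leaving any inner header alone."""
    return replace(pkt, ecn="ECT(0)") if pkt.ecn == "CE" else pkt


arriving = Packet(ecn="CE")

# Regulator at [1 - Before]: both inner and outer headers are reset.
at_before = encapsulate(reset_ce(arriving))

# Regulator at [2 - Outer]: only the outer header is reset.
at_outer = reset_ce(encapsulate(arriving))
```

In both placements encapsulate() is identical and simply copies; only the position of reset_ce() in the sequence determines which headers take the regulator as their Congestion Baseline.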
D.1. Dependence of In-Path Load Regulation on Tunnelling | ||||
Although encapsulation doesn't need to depend on in-path load | ||||
regulation, the reverse is not true. The placement of an in-path | ||||
Load Regulator must be carefully considered relative to | ||||
encapsulation. Some examples are given below for | ||||
guidance. | ||||
In the traditional Internet architecture one tends to think of the | In the traditional Internet architecture one tends to think of the | |||
source host as the Load Regulator for a path. It is generally not | source host as the Load Regulator for a path. It is generally not | |||
desirable or practical for a node part way along the path to regulate | desirable or practical for a node part way along the path to regulate | |||
the load. However, various reasonable proposals for in-path load | the load. However, various reasonable proposals for in-path load | |||
regulation have been made from time to time (e.g. fair queuing, | regulation have been made from time to time (e.g. fair queuing, | |||
traffic engineering). Also the IETF has recently chartered a working | traffic engineering, flow admission control). The IETF has recently | |||
group to standardise admission control across a part of a path using | chartered a working group to standardise admission control across a | |||
pre-congestion notification (PCN) [PCNcharter], which involves in- | part of a path using pre-congestion notification (PCN) [PCNcharter]. | |||
path load regulation. This is of particular relevance here because | This is of particular relevance here because it involves congestion | |||
it involves congestion notification with an in-path Load Regulator | notification with an in-path Load Regulator, it can involve | |||
and it can involve tunnelling. | tunnelling and it certainly involves encapsulation more generally. | |||
We will use the more complex scenario in Figure 3 to tease out all | We will use the more complex scenario in Figure 4 to tease out all | |||
the issues that arise when combining congestion notification and | the issues that arise when combining congestion notification and | |||
tunnelling with various possible in-path load regulation schemes. In | tunnelling with various possible in-path load regulation schemes. In | |||
this case 'I1' and 'E2' break up the path into three separate | this case 'I1' and 'E2' break up the path into three separate | |||
congestion control loops. The feedback for these loops is shown | congestion control loops. The feedback for these loops is shown | |||
going right to left across the top of the figure. The 'V's are arrow | going right to left across the top of the figure. The 'V's are arrow | |||
heads representing the direction of feedback, not letters. But there | heads representing the direction of feedback, not letters. But there | |||
are also two tunnels within the middle control loop: 'I1' to 'E1' and | are also two tunnels within the middle control loop: 'I1' to 'E1' and | |||
'I2' to 'E2'. The two tunnels might be VPNs, perhaps over two MPLS | 'I2' to 'E2'. The two tunnels might be VPNs, perhaps over two MPLS | |||
core networks. M is a congestion monitoring point, perhaps between | core networks. M is a congestion monitoring point, perhaps between | |||
two border routers where the same tunnel continues unbroken across | two border routers where the same tunnel continues unbroken across | |||
the border. | the border. | |||
______ _______________________________________ _____ | ______ _______________________________________ _____ | |||
/ \ / \ / \ | / \ / \ / \ | |||
V \ V M \ V \ | V \ V M \ V \ | |||
A--->R--->I1===========>E1----->I2=========>==========>E2------->B | A--->R--->I1===========>E1----->I2=========>==========>E2------->B | |||
Figure 3: complex Tunnel Scenario | Figure 4: complex Tunnel Scenario | |||
The question is, should the congestion markings in the outer exposed | The question is, should the congestion markings in the outer exposed | |||
headers of a tunnel represent congestion only since the tunnel | headers of a tunnel represent congestion only since the tunnel | |||
ingress or over the whole upstream path from the source of the inner | ingress or over the whole upstream path from the source of the inner | |||
header (whatever that may mean)? Or put another way, should 'I1' and | header (whatever that may mean)? Or put another way, should 'I1' and | |||
'I2' copy or reset CE markings? | 'I2' copy or reset CE markings? | |||
The answer is that the baseline of congestion marking should be the | Based on the design principles in Section 4, the answer is that the | |||
nearest upstream interface designed to regulate traffic load--the | Congestion Baseline should be the nearest upstream interface designed | |||
Load Regulator. In Figure 3 'A', 'I1' or 'E2' are all Load | to regulate traffic load--the Load Regulator. In Figure 4 'A', 'I1' | |||
Regulators. We have shown the feedback loops returning to each of | or 'E2' are all Load Regulators. We have shown the feedback loops | |||
these nodes so that they can regulate the load causing the congestion | returning to each of these nodes so that they can regulate the load | |||
notification. So the baseline for congestion markings exposed to M | causing the congestion notification. So the Congestion Baseline | |||
should be 'I1' (the Load Regulator), not 'I2'. That is, 'I2' SHOULD | exposed to M should be 'I1' (the Load Regulator), not 'I2'. | |||
copy any CE marking into the outer header it creates, while 'I1' is | Therefore I1 should reset any arriving CE markings. In this case, | |||
an exception because it is an in-path load regulator, so it should | 'I1' knows the tunnel to 'E1' is unrelated to its load regulation | |||
reset the ECN field in the outer header it creates. | function. So the load regulation function within 'I1' should be | |||
placed at [1 - Before] tunnel encapsulation within 'I1' (using the | ||||
terminology of Figure 3). Then the Congestion Baseline all across | ||||
the networks from 'I1' to 'E2' in both inner and outer headers will | ||||
be 'I1'. | ||||
The following further examples illustrate how this answer might be | The following further examples illustrate how this answer might be | |||
applied: | applied: | |||
o Preemption marking is currently defined for PCN [PCN-arch] so that | o We argued in Appendix A that resetting CE on encapsulation could | |||
the rate of unmarked packets at the end of a path of multiple | harm PCN excess rate marking, which marks excess traffic for | |||
bottlenecks determines the maximum sustainable aggregate bit rate | removal in subsequent round trips. This marking relies on not | |||
over that path. To produce the correct marking by the end, each | marking packets if another node upstream has already marked them | |||
congested node must only consider packets to be eligible for | for removal. If there were a tunnel ingress between the two which | |||
marking if they have not already been marked by any previous | reset CE markings, it would confuse the downstream node into | |||
bottleneck along a path that may span multiple tunnels (including | marking far too much traffic for removal. So why do we say that | |||
MPLS encapsulations etc.). This scheme only results in the | 'I1' should reset CE, while a tunnel ingress shouldn't? The | |||
correct marking rate if the markings accumulated so far along the | answer is that it is the Load Regulator function at 'I1' that is | |||
path are copied into the outer exposed header of each tunnel or | resetting CE, not the tunnel encapsulator. The Load Regulator | |||
encapsulation. Consider that 'I1' and 'E2' in the complex | needs to set itself as the Congestion Baseline, so the feedback it | |||
scenario of Figure 3 are edge gateways of a PCN region. Admission | gets will only be about congestion on links it can relieve itself | |||
control based on PCN measurements is a form of load regulation, so | by regulating the load into them. When it resets CE markings, it | |||
'I1' regulates the load on the PCN region. Therefore 'I1' should | knows that something else upstream will have dealt with the | |||
be the baseline of congestion marking for _both_ tunnels within | congestion notifications it removes, given it is part of an end- | |||
the scope of its feedback loop. Therefore 'I2' should follow the | to-end admission control signalling loop. It therefore knows that | |||
normal rules and copy congestion marking into the outer tunnel | previous hops will be covered by other Load Regulators. | |||
header, while 'I1' is an exception because it is also a load | Meanwhile, the tunnel ingresses at both 'I1' and 'I2' should | |||
regulator, so it should reset CE markings in the outer header. | follow the new rule for any tunnel ingress and copy congestion | |||
marking into the outer tunnel header. The ingress at 'I1' will | ||||
happen to copy headers that have already been reset just | ||||
beforehand. But it doesn't need to know that. | ||||
o [Shayman] suggested feedback of ECN accumulated across an MPLS | o [Shayman] suggested feedback of ECN accumulated across an MPLS | |||
domain could cause the ingress to trigger re-routing to mitigate | domain could cause the ingress to trigger re-routing to mitigate | |||
congestion. This case is more like the simple scenario of | congestion. This case is more like the simple scenario of | |||
Figure 2, with a feedback loop across the MPLS domain ('E' back to | Figure 2, with a feedback loop across the MPLS domain ('E' back to | |||
'I'). The baseline for congestion exposed in outer headers in | 'I'). I is a Load Regulator because re-routing around congestion | |||
this case will be the tunnel ingress, which should therefore reset | is a load regulation function. But in this case 'I' should only | |||
the ECN field in the outer headers it creates. But the reason it | reset itself as the Congestion Baseline in outer headers, as it is | |||
should act as the baseline is because it is an in-path load | not handling congestion outside its domain, so it must preserve | |||
regulator (re-routing around congestion is a load regulation | the end-to-end congestion feedback loop for something else to | |||
function), not just because it is a tunnel ingress. | handle (probably the data source). Therefore the Load Regulator | |||
within 'I' should be placed at [2 - Outer] to reset CE markings | ||||
just after the tunnel ingress has copied them from arriving | ||||
headers. Again, the tunnel encapsulation function at 'I' simply | ||||
copies incoming headers, unaware that the load regulator will | ||||
subsequently reset its outer headers. | ||||
o The PWE3 working group of the IETF is considering the problem of | o The PWE3 working group of the IETF is considering the problem of | |||
how and whether an aggregate private wire emulation should respond | how and whether an aggregate edge-to-edge pseudo-wire emulation | |||
to congestion [I-D.rosen-pwe3-congestion]. Although the study is | should respond to congestion [I-D.ietf-pwe3-congestion-frmwk]. | |||
still at the requirements stage, some (controversial) solution | Although the study is still at the requirements stage, some | |||
proposals include in-path load regulation at the ingress to the | (controversial) solution proposals include in-path load regulation | |||
tunnel that could lead to tunnel arrangements with similar | at the ingress to the tunnel that could lead to tunnel | |||
complexity to that of Figure 3. | arrangements with similar complexity to that of Figure 4. | |||
These are not contrived scenarios--they could be a lot worse. For | These are not contrived scenarios--they could be a lot worse. For | |||
instance, a host may create a tunnel for IPsec which is placed inside | instance, a host may create a tunnel for IPsec which is placed inside | |||
a tunnel for Mobile IP over a remote part of its path. And around | a tunnel for Mobile IP over a remote part of its path. And around | |||
this all we may have MPLS labels being pushed and popped as packets | this all we may have MPLS labels being pushed and popped as packets | |||
pass across different core networks. Similarly, it is possible that | pass across different core networks. Similarly, it is possible that | |||
subnets could be built from link technology (e.g. ethernet switches) | subnets could be built from link technology (e.g. future Ethernet | |||
so that link headers being added and removed could involve congestion | switches) so that link headers being added and removed could involve | |||
notification in future link headers with all the same issues as with | congestion notification in future Ethernet link headers with all the | |||
IP in IP tunnels. | same issues as with IP in IP tunnels. | |||
The reason we introduced the concept of a Load Regulator was to allow | One reason we introduced the concept of a Load Regulator was to allow | |||
for in-path load regulation. In the traditional Internet | for in-path load regulation. In the traditional Internet | |||
architecture one tends to think of a host and a Load Regulator as | architecture one tends to think of a host and a Load Regulator as | |||
synonymous, but when considering tunnelling, even the definition of a | synonymous, but when considering tunnelling, even the definition of a | |||
host is too fuzzy, whereas a Load Regulator is a clearly defined | host is too fuzzy, whereas a Load Regulator is a clearly defined | |||
function. Similarly, the concept of innermost header is too fuzzy to | function. Similarly, the concept of innermost header is too fuzzy to | |||
be able to (wrongly) say that the source address of the innermost | be able to (wrongly) say that the source address of the innermost | |||
header should be the baseline. Which is the innermost header when | header should be the Congestion Baseline. Which is the innermost | |||
multiple encapsulations may be in use? Where do we stop? If we say | header when multiple encapsulations may be in use? Where do we stop? | |||
the original source in the above IPsec-Mobile IP case is the host, | If we say the original source in the above IPsec-Mobile IP case is | |||
how do we know it isn't tunnelling an encrypted packet stream on | the host, how do we know it isn't tunnelling an encrypted packet | |||
behalf of another host in a p2p network? | stream on behalf of another host in a p2p network? | |||
The reason there has been so much confusion over the question of | We have become used to thinking that only hosts regulate load. The | |||
whether a tunnel ingress should copy or reset CE markings is that we | end to end design principle advises that this is a good idea | |||
have become used to thinking that only hosts regulate load. The end | [RFC3426], but it also advises that it is solely a guiding principle | |||
to end design principle advises that this is a good idea [RFC3426], | intended to make the designer think very carefully before breaking | |||
but it also advises that it is only a guiding principle intended to | it. We do have proposals where load regulation functions sit within | |||
make the designer think very carefully before breaking it. We do | a network path for good, if sometimes controversial, reasons, e.g. | |||
have proposals where load regulation functions sit within a network | PCN edge admission control gateways [I-D.ietf-pcn-architecture] or | |||
path for good, if sometimes controversial, reasons, e.g. PCN edge | traffic engineering functions at domain borders to re-route around | |||
admission control gateways [PCN-arch] or traffic engineering | congestion [Shayman]. Whether or not we want in-path load | |||
functions at domain borders to re-route around congestion [Shayman]. | regulation, we have to work round the fact that it will not go away. | |||
Author's Address | Author's Address | |||
Bob Briscoe | Bob Briscoe | |||
BT | BT | |||
B54/77, Adastral Park | B54/77, Adastral Park | |||
Martlesham Heath | Martlesham Heath | |||
Ipswich IP5 3RE | Ipswich IP5 3RE | |||
UK | UK | |||
Phone: +44 1473 645196 | Phone: +44 1473 645196 | |||
Email: bob.briscoe@bt.com | Email: bob.briscoe@bt.com | |||
URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/ | URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/ | |||
Full Copyright Statement | Full Copyright Statement | |||
Copyright (C) The IETF Trust (2007). | Copyright (C) The IETF Trust (2008). | |||
This document is subject to the rights, licenses and restrictions | This document is subject to the rights, licenses and restrictions | |||
contained in BCP 78, and except as set forth therein, the authors | contained in BCP 78, and except as set forth therein, the authors | |||
retain all their rights. | retain all their rights. | |||
This document and the information contained herein are provided on an | This document and the information contained herein are provided on an | |||
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | |||
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND | OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND | |||
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS | THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS | |||
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF | OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF | |||
skipping to change at page 21, line 45 | skipping to change at page 34, line 45 | |||
such proprietary rights by implementers or users of this | such proprietary rights by implementers or users of this | |||
specification can be obtained from the IETF on-line IPR repository at | specification can be obtained from the IETF on-line IPR repository at | |||
http://www.ietf.org/ipr. | http://www.ietf.org/ipr. | |||
The IETF invites any interested party to bring to its attention any | The IETF invites any interested party to bring to its attention any | |||
copyrights, patents or patent applications, or other proprietary | copyrights, patents or patent applications, or other proprietary | |||
rights that may cover technology that may be required to implement | rights that may cover technology that may be required to implement | |||
this standard. Please address the information to the IETF at | this standard. Please address the information to the IETF at | |||
ietf-ipr@ietf.org. | ietf-ipr@ietf.org. | |||
Acknowledgments | Acknowledgment | |||
Funding for the RFC Editor function is provided by the IETF | This document was produced using xml2rfc v1.33 (of | |||
Administrative Support Activity (IASA). This document was produced | http://xml.resource.org/) from a source in RFC-2629 XML format. | |||
using xml2rfc v1.32 (of http://xml.resource.org/) from a source in | ||||
RFC-2629 XML format. | ||||
End of changes. 90 change blocks. | ||||
471 lines changed or deleted | 1061 lines changed or added | |||