Encoding Explicit Congestion Notification (ECN) in IEEE802 Wire Protocols

Abstract

This document compares a number of ways to encode explicit congestion notification (ECN) in IEEE802 Wire Protocols. It mainly focuses on incremental deployment, ensuring that ECN-marked packets will not be propagated to higher layers that would not understand them, even in a network where some of the boxes that decapsulate higher-layer payloads might not understand ECN in either layer. At present, it is a very rough draft offering a number of fairly immature ideas designed to provoke criticism. If any one of these ideas were worth pursuing, an IEEE802 work item would have to be found or created to pursue it.

1. Introduction
--------------------

This document compares a number of ways to encode explicit congestion notification (ECN) in Ethernet Wire Protocols. The aim is for Ethernet switches to be able to mark packets as they approach congestion so that end systems can respond, often before any packets are dropped. A related aim is to make congestion visible to other network layer devices so that they can monitor and possibly limit the congestion caused by different users.

At least three encoding states are needed (similar to ECN [RFC3168]), but they may not necessarily require three codepoints in each frame:

 * Not-ECT (not ECN-capable transport)
 * ECT (ECN-capable transport)
 * CE (congestion experienced)

There is possibly a need for a fourth state, to support two congestion severity levels like pre-congestion notification (PCN [RFC5559]).

Some of the more recent outputs of the IEEE802 committees (usually those with 'provider' in their title) require configuration and management in order to work. Ideally, a solution is needed that will work correctly in an unmanaged (zero-config) deployment, even when newly upgraded equipment is mixed with legacy equipment in the same subnet. This document offers some zero-config ideas and some that would only work in managed environments.
Support for the ECN nonce [RFC3540] or re-ECN [ID.briscoe-tsvwg-re-ecn-tcp] does not require support at each link, as end-to-end support in IP is sufficient.

2. Available Alternatives
--------------------

2.1 Modifications to the 802.1Q header
--------------------

As [802.1Q] is only used in managed deployments, the solutions in this subsection are consequently also limited to managed deployments.

The 802.1Q header looks like this:

    3bits      1bit   12bits
    User       CFI    VLAN ID
    Priority

I understand (but I DON'T KNOW) that there are few if any Ethernet boxes that include any capability to process frames with a canonical format indicator (CFI) = 1. If CFI=1 I believe the frame is nearly always dropped.

There are only two ways I know of to get 3 states into 1 bit:

#1) Define:
     * CFI=1 as ECT
     * incoming CFI=0 as Not-ECT
     * outgoing CFI=0 as CE.

    In other words, a congested switch can mark a CFI=1 frame to CFI=0, but once one congested switch has marked a frame, if it arrives at another congested switch on the same link it will look as if it is Not-ECT and will have to be dropped. This alternative is documented in [RFC3168] as a 1-bit idea for IP that was rejected in favour of 2 bits.

#2) Define:
     * CFI=0 as unmarked
     * CFI=1 as marked

    And also hold the ECT state in the (IP) payload of the frame. At the next IP hop, if the IP header = Not-ECT and CFI=1, drop the frame. This is like the approach for doing ECT in MPLS [RFC5129].

Two-bit schemes

#3) Define:
     * CFI=0 as Not-ECT
     * CFI=1 as ECT

    Redefine the last bit of the [802.1p] 'user priority' field as CE, either for all 802.1p (unlikely to be accepted) or locally for each operator (not zero config). The 802.1p header then becomes:

    2bits      1bit   1bit   12bits
    User       CE     ECT    VLAN ID
    Priority

#4) Define:
     * CFI=0 as Not-ECT
     * CFI=1 as ECT

    Redefine the 802.1p 'user priority' codepoints, either for all 802.1p (unlikely to be accepted) or locally for each operator (not zero config).
    Possible examples:

     0 = Priority 0 not marked
     1 = Priority 0 marked
     2 = Priority 1 not marked
     3 = Priority 1 marked
     5 = Priority 2 not marked
     6 = Priority 2 threshold marked
     7 = Priority 2 excess rate marked
     4 = Priority 3 not marked (and unmarkable)

2.2 New ECN Ethertype
----------------------

This idea has been designed to work in completely unmanaged (zero-config) environments, with a possible optimisation for a managed environment.

The idea:

Define a new Ethertype for all 802 protocols, let's say 802.1C where 'C' stands for congestion. Initially a prototype or vendor-specific Ethertype could be used [802a]. In the diagrams, Etype=C means the Ethertype is set to this new value.

Define the 802.1C header in a very similar way to 802.1Q. That is, a shim that is inserted into the 802 header just before the Ethertype. If there is already an 802.1Q header, the 802.1C header would be inserted straight after it. If there are multiple 802.1Q headers (Q-in-Q [802.1ad]) or multiple MAC headers (MAC in MAC [802.1ah]), there only *needs* to be one 802.1C shim inside the first 802.1Q shim header.

+--------+--------++----------+--------+--------+
|Dst MAC |Src MAC ||Ethertype |Payload |CRC/FCS |
|        |        ||/Length   |        |        |
|6octets |6octets ||2octets   |...     |4octets |
+--------+--------++----------+--------+--------+
                  | \
                  +---------------+
                  |802.1C         |
                  |3 octets       |
                  +---------------+
                  |Etype=C |ECN   |
                  |2octets |1octet|
                  +---------------+

+--------+--------+----------++----------+--------+--------+
|Dst MAC |Src MAC |802.1Q    ||Ethertype |Payload |CRC/FCS |
|        |        |          ||/Length   |        |        |
|6octets |6octets |4octets   ||2octets   |...     |4octets |
+--------+--------+----------++----------+--------+--------+
                             | \
                             802.1C

+--------+--------+----------++----------++----------+--------+--------+
|Dst MAC |Src MAC |802.1Q    ||802.1Q    ||Ethertype |Payload |CRC/FCS |
|        |        |          ||          ||/Length   |        |        |
|6octets |6octets |4octets   ||4octets   ||2octets   |...     |4octets |
+--------+--------+----------++----------++----------+--------+--------+
                             | \ and/or  | \
                             802.1C      802.1C

As an efficiency optimisation, encapsulators that understand 802.1C can remove any deeper 802.1C shims, folding any ECN marking they contain into a single shim just inside the outermost 802.1Q shim. Encapsulators that don't understand 802.1C won't do this. But it doesn't matter if there are more shims than necessary, because when they are decapsulated, any ECN in them can be passed up the layers recursively (which is what the optimised encapsulator does to save sending too many shims).

The proposed 802.1C shim starts with the newly defined Ethertype, then has n bits for ECN. Probably n=8, but we only need n=2.

{ToDo: Rather than keep stuffing in little additions like this, there may be some way to add a general purpose shim.}

There are three types of 802 node: encapsulating at a subnet ingress, forwarding in the subnet interior or decapsulating at a subnet egress. A physical node may take on more than one of these roles in sequence. An ingress node is always also a forwarding node.

2.2.1 Unmanaged subnet
----------------------

In an unmanaged subnet, the encapsulator does NOT insert an 802.1C shim into the frame, even if the ingress has been upgraded to understand 802.1C. We will describe a managed subnet later.

Any forwarding nodes that have been upgraded to 802.1C run an AQM algorithm (on high speed switches this may just be a threshold at a small queue size, above which frames are marked). They need to be able to mark a frame, rather than drop it, when the AQM algorithm decides to signal congestion. To mark a frame, the forwarding node inserts an 802.1C shim if one is not already present, then it sets the two ECN bits to 11 (say).

A congested interior forwarding node that has not been upgraded to 802.1C will just drop frames as normal.
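
The marking step at an upgraded forwarding node, and the fallback to drop at a legacy one, might be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a definition of the scheme: no 802.1C Ethertype has been assigned, so the value 0x88B5 (an IEEE local experimental Ethertype) is used as a stand-in; the ECN field is taken to be the low two bits of the shim's single octet; only the 0x8100 tag is recognised as 802.1Q; frames are handled without their FCS, which a real switch would recompute; and the function names are hypothetical.

```python
import struct

ETYPE_Q = 0x8100    # 802.1Q tag protocol identifier
ETYPE_C = 0x88B5    # ASSUMPTION: stand-in for the (unassigned) 802.1C Ethertype
ECN_MARKED = 0b11   # "11 (say)", as in the text; low two bits of the ECN octet


def shim_offset(frame: bytes) -> int:
    """Offset where the 802.1C shim lives: after the two MAC addresses
    and, if present, straight after the first 802.1Q shim."""
    offset = 12  # 6-octet Dst MAC + 6-octet Src MAC
    (etype,) = struct.unpack_from("!H", frame, offset)
    if etype == ETYPE_Q:
        offset += 4  # skip the 4-octet 802.1Q shim
    return offset


def mark(frame: bytes) -> bytes:
    """Set the ECN bits, inserting a 3-octet 802.1C shim if none is present."""
    offset = shim_offset(frame)
    (etype,) = struct.unpack_from("!H", frame, offset)
    if etype == ETYPE_C:
        # Shim already present: just set the ECN bits in its third octet.
        ecn = frame[offset + 2] | ECN_MARKED
        return frame[:offset + 2] + bytes([ecn]) + frame[offset + 3:]
    shim = struct.pack("!HB", ETYPE_C, ECN_MARKED)
    return frame[:offset] + shim + frame[offset:]


def forward(frame: bytes, congested: bool, upgraded: bool):
    """Return the frame to send on, or None if it is dropped."""
    if not congested:
        return frame
    # An upgraded node marks; a legacy node can only drop.
    return mark(frame) if upgraded else None
```

Note that a frame marked for the first time leaves the node three octets longer than it arrived, whereas marking a frame that already carries a shim does not change its length.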
Interior Ethernet switches that are just forwarding frames don't need to understand the Ethertype - usually only the egress of a subnet that decapsulates the frame checks the Ethertype. So, if an 802.1C shim has been added by an earlier switch, it should be forwarded at least as far as the egress of the subnet, even by switches that don't understand 802.1C.

On decapsulation at the subnet egress, if the decapsulator doesn't understand the 802.1C Ethertype, it will drop the frame. Given the frame should only contain an 802.1C shim if it carries an ECN marking, this is the desired behaviour - the ECN marking cannot be propagated up the layers, so drop is the only alternative signal of congestion that is appropriate.

If the decapsulator does understand 802.1C, it will remove the shim, remembering any ECN marking within, then continue decapsulating the next outer header. If the inner header supports ECN, it will mark the appropriate ECN field if there was an ECN marking in any of the shim(s) it removed. Examples of inner headers that support ECN would be:

 o an IP header with the ECT setting in its ECN field;
 o a DECNET, ATM or frame relay header;
 o an inner 802 frame, with or without an 802.1Q shim.

If the inner header does not support ECN and there was any ECN marking in any of the stripped-out shim(s), the decapsulator must drop the frame - again, drop is the only appropriate way to signal congestion to a higher layer that shows it would not understand ECN.

If the higher layer inner payload is the logical link control (LLC) sub-layer [802.2], it is feasible that a bridge could preserve the 802.1C shim when it removes one outer 802 header and adds another. {ToDo: I don't know enough to be sure about this statement.}

Disadvantage of the unmanaged scheme: frames that have been congestion marked become larger than frames that have not been marked. If this makes the frame larger than the max transmission unit (MTU) of the link, it will be dropped (which is safe congestion behaviour).
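
To make the size effect concrete, here is a small Python sketch of the arithmetic. The 3-octet shim size comes from the 802.1C diagram (2-octet Ethertype plus 1-octet ECN field). The link limit of 1518 octets is only an illustrative assumption (6+6+2 header octets, a 1500-octet payload and a 4-octet FCS), and the function names are hypothetical.

```python
SHIM_OCTETS = 3  # 2-octet Ethertype + 1-octet ECN field, per the 802.1C diagram


def size_after_marking(frame_octets: int, has_shim: bool = False) -> int:
    """Frame size once marked; only the first marking adds a shim."""
    return frame_octets if has_shim else frame_octets + SHIM_OCTETS


def survives_marking(frame_octets: int, link_max_octets: int,
                     has_shim: bool = False) -> bool:
    """True if the marked frame still fits within the link's largest frame."""
    return size_after_marking(frame_octets, has_shim) <= link_max_octets


LINK_MAX = 1518  # ASSUMPTION: illustrative largest frame on the link, in octets

for size in (1515, 1516, 1518):
    print(size, size_after_marking(size), survives_marking(size, LINK_MAX))
```

Under these assumed numbers, any unmarked frame within three octets of the link limit is lost if it is marked, while smaller frames are marked and forwarded.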
However, this will complicate path MTU discovery by higher layers. Possible work-round: An encapsulator that has been upgraded to 802.1C could stuff all frames with a shim or padding of the same size as 802.1C but without using a new Ethertype.

2.2.2 Managed subnet (optional variant)
--------------------

If the encapsulator is configured by a network manager to say that every egress in the subnet is fully 802.1C compatible, then the ingress can routinely insert an 802.1C shim whenever it encapsulates a higher layer (inner) header that supports ECN (see the above list of examples). Otherwise the rest of the managed scheme works exactly like the unmanaged variant above.

Ingress nodes in an unmanaged subnet cannot routinely insert the 802.1C shim, because then *all* frames through an egress that has not been upgraded to 802.1C will be dropped. But if it is known that there are no egress nodes that do not understand 802.1C, then all frames can have the shim added. Then, all frames that are the same size when they arrive will remain the same size across the subnet.

2.2.3 Discussion of new Ethertype idea
--------------------------------------

The new Ethertype idea is based on the insight that the ECN-capable transport (ECT) facility of the ECN protocol in IP is only necessary at the egress of an IEEE 802 subnet. Then we don't need bits for the ECT function in every frame; we only check for ECN-capability on decapsulation.

IEEE 802 frame headers only get stripped off by some function that must understand the header encapsulated inside. Therefore, if we define a new Ethertype to describe a new type of payload encapsulated within the 802 frame header, those boxes that have not been upgraded to understand this new Ethertype will have to drop the frame.

3. Pros :) & Cons :(
--------------------

#1 :( In environments where ECN or PCN is deployed to get near-zero drop, the extra drop one might get with multiple bottlenecks is possibly worrying, but not too bad.
#2 :( Assumes a well-designed & configured network. Although this is a reasonable assumption for MPLS and the provider-based forms of Ethernet, it is not reasonable for Ethernet in the home etc. We should define a frame format for all 802.1 frames without assuming their control configuration.

#1 :( Can't be used for two congestion severity levels (e.g. PCN). CFI=1 sometimes dropped.

{ToDo: Finish writing up the pros and cons of the rest of the schemes}

4. Conclusions, Acknowledgements etc {ToDo:}

References {ToDo: add full citations}
----------

[802a] IEEE
[802.1Q] IEEE
[802.1ad] IEEE
[802.1ah] IEEE
[802.1p] IEEE
[RFC3168] IETF
[RFC3540] IETF
[RFC5129] IETF
[RFC5559] IETF
[ID.briscoe-tsvwg-re-ecn-tcp] IETF

Document Control
----------------

Version  Date         Author       Comments
________________________________________________________________
A        10 Aug 2009  Bob Briscoe  Initial Draft
B        19 Feb 2010  Bob Briscoe  Added new Ethertype idea.
________________________________________________________________