Encoding Explicit Congestion Notification (ECN) in IEEE802 Wire
Protocols

Abstract
This document compares a number of ways to encode explicit congestion
notification (ECN) in IEEE802 Wire Protocols.

It mainly focuses on incremental deployment, ensuring that ECN-marked
packets will not be propagated to higher layers that would not
understand them, even in a network where some of the boxes that
decapsulate higher-layer payloads might not understand ECN in either
layer.

At present, it is a very rough draft offering a number of fairly
immature ideas designed to provoke criticism.

If any one of these ideas were worth pursuing, an IEEE802 work item
would have to be found or created to pursue it.

1. Introduction
--------------------

This document compares a number of ways to encode explicit congestion
notification (ECN) in Ethernet Wire Protocols. The aim is for Ethernet
switches to be able to mark packets as they approach congestion so that
end systems can respond, often before any packets are dropped. A related
aim is to make congestion visible to other network layer devices so that
they can monitor and possibly limit the congestion caused by different
users.

At least three encoding states are needed (similar to ECN [RFC3168]),
but they may not necessarily require three codepoints in each frame:
* Not-ECT
* ECT
* CE

There is possibly a need for a fourth state, to support two congestion
severity levels like pre-congestion notification (PCN [RFC5559]).
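
As a rough illustration only, the three states and the way a congested
node would treat each of them can be sketched in Python. The names
EcnState and on_congestion are invented for this sketch, and the
drop-if-Not-ECT rule follows the general ECN model of [RFC3168] rather
than any agreed 802 behaviour.

```python
from enum import Enum


class EcnState(Enum):
    NOT_ECT = "Not-ECT"  # endpoints do not understand ECN
    ECT = "ECT"          # ECN-capable, not yet congestion marked
    CE = "CE"            # congestion experienced


def on_congestion(state):
    """Return (new_state, dropped) when a congested node handles a frame."""
    if state is EcnState.NOT_ECT:
        return state, True       # cannot be marked, so drop instead
    return EcnState.CE, False    # ECT or CE leaves the node as CE
```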

Some of the more recent outputs of the IEEE802 committees (usually those
with 'provider' in their title) require configuration and management in
order to work. Ideally, a solution is needed that will work correctly in
an unmanaged (zero-config) deployment, even when newly upgraded
equipment is mixed with legacy equipment in the same subnet. This
document offers some zero-config ideas and some that would only work in
managed environments.

Support for the ECN nonce [RFC3540] or re-ECN
[ID.briscoe-tsvwg-re-ecn-tcp] does not require support at each link, as
end-to-end support in IP is sufficient.

2. Available Alternatives
--------------------

2.1 Modifications to the 802.1Q header
--------------------

As [802.1Q] is only used in managed deployments, the solutions in this
subsection are consequently also limited to managed deployments.

The 802.1Q header looks like this:
3bits           1bit    12bits
User            CFI     VLAN ID
Priority
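
For readers who prefer code to bit diagrams, here is a minimal sketch
of splitting those 16 bits into the three fields. It assumes the bits
are taken most-significant-first in the order drawn above; the function
names unpack_tci and pack_tci are made up for this example.

```python
def unpack_tci(tci):
    """Split the 16 bits shown above into (priority, cfi, vlan_id)."""
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("value must fit in 16 bits")
    priority = (tci >> 13) & 0x7   # top 3 bits: user priority
    cfi = (tci >> 12) & 0x1        # next bit: canonical format indicator
    vlan_id = tci & 0xFFF          # low 12 bits: VLAN ID
    return priority, cfi, vlan_id


def pack_tci(priority, cfi, vlan_id):
    """Inverse of unpack_tci."""
    return ((priority & 0x7) << 13) | ((cfi & 0x1) << 12) | (vlan_id & 0xFFF)
```

For example, unpack_tci(0xB064) gives priority 5, CFI 1 and VLAN ID 100.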

I understand (but I DON'T KNOW) that there are few if any Ethernet boxes
that include any capability to process frames with a canonical format
indicator (CFI) = 1. If CFI=1, I believe the frame is nearly always
dropped.

There are only two ways I know of to get 3 states into 1 bit:
#1) Define:
  * CFI=1 as ECT
  * incoming CFI=0 as Not-ECT
  * outgoing CFI=0 as CE.
In other words, a congested switch can mark a CFI=1 frame to CFI=0, but
once one congested switch has marked a frame, if it arrives at another
congested switch on the same link it will look as if it is Not-ECT and
have to be dropped. This alternative is documented in [RFC3168] as a
1-bit idea for IP that was rejected in favour of 2 bits.
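
The weakness of scheme #1 with more than one bottleneck can be seen in
a small sketch. The function names are hypothetical and None is used
here simply to stand for "frame dropped".

```python
def scheme1_congested_switch(cfi):
    """Return the outgoing CFI bit, or None if the frame must be dropped."""
    if cfi == 1:
        return 0       # ECT: mark by clearing the bit (outgoing CFI=0 is CE)
    return None        # CFI=0 looks like Not-ECT, so the only signal is drop


def scheme1_path(cfi, congested_hops):
    """Pass a frame through a number of congested switches in turn."""
    for _ in range(congested_hops):
        cfi = scheme1_congested_switch(cfi)
        if cfi is None:
            return None
    return cfi
```

With one congested switch an ECT frame comes out marked
(scheme1_path(1, 1) gives 0), but with two it is lost
(scheme1_path(1, 2) gives None).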

#2) Define:
  * CFI=0 as unmarked
  * CFI=1 as marked
In addition, hold the ECT state in the (IP) payload of the frame. At the
next IP hop, if the IP header = Not-ECT and CFI=1, drop the frame. This
is like the approach for doing ECT in MPLS [RFC5129].
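
A sketch of the check at the next IP hop under scheme #2 follows. The
drop rule is the one stated above; copying the mark into the IP header
as CE is an assumption of this sketch (by analogy with [RFC5129]), and
the function name is invented. The 2-bit values are the IP ECN
codepoints from [RFC3168].

```python
# IP ECN field codepoints as defined in RFC 3168
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def scheme2_ip_hop(cfi, ip_ecn):
    """Return the IP ECN field to forward with, or None to drop."""
    if cfi == 0:
        return ip_ecn      # unmarked at layer 2: IP header is unchanged
    if ip_ecn == NOT_ECT:
        return None        # marked, but the transport cannot understand it
    return CE              # ASSUMPTION: propagate the mark up as CE
```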

Two-bit schemes
#3) Define
  * CFI=0 as Not-ECT
  * CFI=1 as ECT
Redefine the last bit of the [802.1p] 'user priority' field as CE,
either for all 802.1p (unlikely to be accepted) or locally for each
operator (not zero config). The 802.1p header then becomes:
2bits           1bit    1bit    12bits
User            CE      ECT     VLAN ID
Priority
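
To make the redefined layout of scheme #3 concrete, here is a minimal
sketch that assumes the fields are packed most-significant-first in the
order drawn above. The function names are hypothetical, and dropping a
frame that is not ECT is this sketch's reading of the scheme.

```python
def unpack_scheme3(tci):
    """Split 16 bits into (priority, ce, ect, vlan_id) per scheme #3."""
    priority = (tci >> 14) & 0x3   # only 2 bits of user priority remain
    ce = (tci >> 13) & 0x1         # last bit of the old priority field
    ect = (tci >> 12) & 0x1        # the old CFI bit
    vlan_id = tci & 0xFFF
    return priority, ce, ect, vlan_id


def mark_scheme3(tci):
    """Return the value with CE set, or None (drop) if the frame is not ECT."""
    if not (tci >> 12) & 0x1:
        return None
    return tci | (1 << 13)
```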

#4) Define
  * CFI=0 as Not-ECT
  * CFI=1 as ECT
Redefine the 802.1p 'user priority' codepoints, either for all 802.1p
(unlikely to be accepted) or locally for each operator (not zero
config). Possible examples:
0 = Priority 0 not marked
1 = Priority 0 marked
2 = Priority 1 not marked
3 = Priority 1 marked
5 = Priority 2 not marked
6 = Priority 2 threshold marked
7 = Priority 2 excess rate marked
4 = Priority 3 not marked (and unmarkable)
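
The example codepoints above amount to a small lookup table. The sketch
below just restates that table and adds a hypothetical helper, remark,
that finds the codepoint with the same priority and the wanted marking.
It is only as valid as the example mapping itself.

```python
# codepoint -> (priority, marking), copied from the example list above
SCHEME4 = {
    0: (0, "not marked"),
    1: (0, "marked"),
    2: (1, "not marked"),
    3: (1, "marked"),
    5: (2, "not marked"),
    6: (2, "threshold marked"),
    7: (2, "excess rate marked"),
    4: (3, "not marked"),          # priority 3 has no marked codepoint
}


def remark(codepoint, marking):
    """Return the codepoint of the same priority carrying `marking`,
    or None if that priority cannot carry it (so drop is the fallback)."""
    priority, _ = SCHEME4[codepoint]
    for candidate, (prio, mark) in SCHEME4.items():
        if prio == priority and mark == marking:
            return candidate
    return None
```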

2.2 New ECN Ethertype
---------------------

This idea has been designed to work in completely unmanaged
(zero-config) environments, with a possible optimisation for a managed
environment.

The idea: Define a new Ethertype for all 802 protocols, let's say 802.1C
where 'C' stands for congestion. Initially a prototype or
vendor-specific Ethertype could be used [802a].  In the diagrams,
Etype=C means the Ethertype is set to this new value.

Define the 802.1C header in a very similar way to 802.1Q. That is, a
shim that is inserted into the 802 header just before the Ethertype. If
there is already an 802.1Q header, the 802.1C header would be inserted
straight after it. If there are multiple 802.1Q headers (Q-in-Q
[802.1ad]) or multiple MAC headers (MAC in MAC [802.1ah]), there only
*needs* to be one 802.1C shim inside the first 802.1Q shim header.

+--------+--------++----------+--------+--------+
|Dst MAC |Src MAC ||Ethertype |Payload |CRC/FCS |
|        |        ||/Length   |        |        |
|6octets |6octets ||2octets   |...     |4octets |
+--------+--------++----------+--------+--------+
                  | \
                  +---------------+
                  |802.1C         |
                  |3 octets       |
                  +---------------+
                  |Etype=C |ECN   |
                  |2octets |1octet|
                  +---------------+

+--------+--------+----------++----------+--------+--------+
|Dst MAC |Src MAC |802.1Q    ||Ethertype |Payload |CRC/FCS |
|        |        |          ||/Length   |        |        |
|6octets |6octets |4octets   ||2octets   |...     |4octets |
+--------+--------+----------++----------+--------+--------+
                             | \
                             802.1C

+--------+--------+----------++----------++----------+--------+--------+
|Dst MAC |Src MAC |802.1Q    ||802.1Q    ||Ethertype |Payload |CRC/FCS |
|        |        |          ||          ||/Length   |        |        |
|6octets |6octets |4octets   ||4octets   ||2octets   |...     |4octets |
+--------+--------+----------++----------++----------+--------+--------+
                             | \  and/or | \
                             802.1C      802.1C

As an efficiency optimisation, encapsulators that understand 802.1C can
combine any deeper 802.1C shims, including any ECN marking they contain,
into a single shim just inside the outermost 802.1Q shim. Encapsulators
that don't understand 802.1C won't do this. But it doesn't matter if
there are more shims than necessary, because when they are decapsulated,
any ECN in them can be passed up the layers recursively (which is what
the optimised encapsulator does to save sending too many shims).
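
One plausible way to fold several shims into one is sketched below.
Combining the ECN fields with a bitwise OR is an assumption of this
sketch (the text does not say how markings are combined), and the name
merge_shims is invented.

```python
def merge_shims(ecn_fields):
    """Fold the ECN fields of several 802.1C shims into one value.

    ASSUMPTION: a marking in any shim must survive, so the fields are
    OR-ed together; an empty list yields 0 (no marking).
    """
    merged = 0
    for ecn in ecn_fields:
        merged |= ecn
    return merged
```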

The proposed 802.1C shim starts with the newly defined Ethertype, then
has n bits for ECN. Probably n=8, but we only need n=2. {ToDo: Rather
than keep stuffing in little additions like this, there may be some way
to add a general purpose shim.}
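
A minimal sketch of inserting such a shim into a raw frame follows,
under several assumptions: n=8 (a one-octet ECN field), the IEEE local
experimental Ethertype 0x88B5 standing in for the not-yet-assigned
802.1C value, at most one 802.1Q tag (0x8100) ahead of the shim, and no
recalculation of the CRC/FCS. All function names are hypothetical.

```python
import struct

ETYPE_8021Q = 0x8100   # 802.1Q tag
ETYPE_C = 0x88B5       # ASSUMPTION: experimental stand-in for Etype=C
MAC_HDR_LEN = 12       # destination MAC + source MAC


def shim_offset(frame):
    """Offset where the shim belongs: after the MACs and a first Q tag."""
    offset = MAC_HDR_LEN
    if struct.unpack_from("!H", frame, offset)[0] == ETYPE_8021Q:
        offset += 4                      # skip the 4-octet 802.1Q shim
    return offset


def has_shim(frame):
    """True if an 802.1C shim already sits at the expected position."""
    return struct.unpack_from("!H", frame, shim_offset(frame))[0] == ETYPE_C


def insert_shim(frame, ecn=0):
    """Return the frame with a 3-octet shim added, if not already present."""
    if has_shim(frame):
        return frame
    offset = shim_offset(frame)
    shim = struct.pack("!HB", ETYPE_C, ecn)   # 2-octet Ethertype + 1 octet
    return frame[:offset] + shim + frame[offset:]
```

The original Ethertype/Length field is left untouched and simply ends
up after the shim, in the same way as it does after an 802.1Q shim.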

There are three types of 802 node: encapsulating at a subnet ingress,
forwarding in the subnet interior or decapsulating at a subnet egress. A
physical node may take on more than one of these roles in sequence. An
ingress node is always also a forwarding node.

2.2.1 Unmanaged subnet
----------------------
In an unmanaged subnet, the encapsulator does NOT insert an 802.1C shim
into the frame, even if the ingress has been upgraded to understand
802.1C. We will describe a managed subnet later.

Any forwarding nodes that have been upgraded to 802.1C run an AQM
algorithm (on high speed switches this may just be a threshold at a
small queue size, above which frames are marked). They need to be able
to mark a frame, rather than drop it when the AQM algorithm decides to
signal congestion. To mark a packet, the forwarding node inserts an
802.1C shim if one is not already present, then it sets the two ECN bits
to 11 (say).
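
The marking step might look roughly like this. The Frame class is a toy
model (shim_ecn is None when no shim is present), the simple threshold
stands in for whatever AQM algorithm is really used, and 0b11 is the
"say" value from the text. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

CE_MARK = 0b11   # the two ECN bits set to 11, as suggested in the text


@dataclass
class Frame:
    payload: bytes
    shim_ecn: Optional[int] = None   # None means no 802.1C shim present


def forward_upgraded(frame, queue_len, threshold):
    """Forwarding by an upgraded node: mark rather than drop."""
    if queue_len > threshold:                # AQM decides to signal
        current = frame.shim_ecn or 0        # insert a shim if absent
        frame.shim_ecn = current | CE_MARK   # then set the ECN bits
    return frame
```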

A congested interior forwarding node that has not been upgraded to
802.1C will just drop frames as normal. Interior Ethernet switches that
are just forwarding frames don't need to understand the Ethertype -
usually only the egress of a subnet that decapsulates the frame checks
the Ethertype. So, if an 802.1C shim has been added by an earlier
switch, it should be forwarded at least as far as the egress of the
subnet, even by switches that don't understand 802.1C.

On decapsulation at the subnet egress, if the decapsulator doesn't
understand the 802.1C Ethertype, it will drop the frame. Given the frame
should only contain an 802.1C shim if it carries an ECN marking, this is
the desired behaviour - the ECN marking cannot be propagated up the
layers, so drop is the only alternative signal of congestion that is
appropriate.

If the decapsulator does understand 802.1C, it will remove the shim
remembering any ECN-marking within, then continue decapsulating the next
outer header. If the inner header supports ECN, it will mark the
appropriate ECN field if there was an ECN marking in any of the shim(s)
it removed. Examples of inner headers that support ECN would be
   o an IP header with the ECT setting in its ECN field;
   o a DECNET, ATM or frame relay header;
   o an inner 802 frame, with or without an 802.1Q shim.

If the inner header does not support ECN and there was an ECN marking in
any of the stripped out shim(s), the decapsulator must drop the frame.
Again, drop is the only appropriate way to signal congestion to a higher
layer that shows it would not understand ECN.
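
The egress cases described above can be collected into one decision
function. This is only a sketch: the function name, its arguments and
the three string results are invented, and 0b11 is assumed to be the
marked value of the ECN field.

```python
def egress_decapsulate(shim_ecn_fields, understands_c, inner_supports_ecn):
    """Decide what an egress does with a frame.

    shim_ecn_fields: ECN values of the 802.1C shims found in the frame
                     (empty list if the frame carries no shim).
    Returns "drop", "deliver" or "deliver-marked".
    """
    if shim_ecn_fields and not understands_c:
        return "drop"                 # unknown Ethertype: frame is dropped
    marked = any(ecn == 0b11 for ecn in shim_ecn_fields)
    if not marked:
        return "deliver"              # nothing to propagate upwards
    if inner_supports_ecn:
        return "deliver-marked"       # set the inner header's ECN field
    return "drop"                     # drop is the only remaining signal
```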

If the higher layer inner payload is the logical link control (LLC)
sub-layer [802.2], it is feasible that a bridge could preserve the
802.1C shim when it removes one outer 802 header and adds another.
{ToDo: I don't know enough to be sure about this statement.}

Disadvantage of the unmanaged scheme: frames that have been congestion
marked become larger than frames that have not been marked. If this
makes the frame larger than the max transmission unit (MTU) of the link,
it will be dropped (which is safe congestion behaviour). However, this
will complicate path MTU discovery by higher layers.
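
The size problem is simple arithmetic, sketched below with a 3-octet
shim. The 1518-octet limit in the usage note is the classic untagged
Ethernet maximum frame size and is used purely as an example; the
function name is hypothetical.

```python
SHIM_LEN = 3   # 2-octet Ethertype + 1-octet ECN field


def survives_marking(frame_len, max_frame_len, shim_len=SHIM_LEN):
    """True if a frame still fits on the link after a shim is inserted."""
    return frame_len + shim_len <= max_frame_len
```

So survives_marking(1515, 1518) holds, but a frame that is already at
the limit, survives_marking(1518, 1518), does not.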

Possible work-round: An encapsulator that has been upgraded to 802.1C
could stuff all frames with a shim or padding of the same size as 802.1C
but without using a new Ethertype.

2.2.2 Managed subnet (optional variant)
--------------------
If the encapsulator is configured by a network manager to say that every
egress in the subnet is fully 802.1C compatible, then the ingress can
routinely insert an 802.1C shim whenever it encapsulates a higher layer
(inner) header that supports ECN (see the above list of examples).
Otherwise the rest of the managed scheme works exactly like the
unmanaged variant above.

Ingress nodes in an unmanaged subnet cannot routinely insert the 802.1C
shim, because then *all* frames through an egress that has not been
upgraded to 802.1C will be dropped. But if it is known that there are no
egress nodes that do not understand 802.1C, then all frames can have the
shim added. Then, all frames that are the same size when they arrive
will remain the same size across the subnet.


2.2.3 Discussion of new Ethertype idea
--------------------------------------
The new Ethertype idea is based on the insight that the ECN-capable
transport (ECT) facility of the ECN protocol in IP is only necessary at
the egress of an IEEE 802 subnet. Then, we don't need bits for the ECT
function in every frame, we only check for ECN-capability on
decapsulation.

IEEE 802 frame headers only get stripped off by some function that must
understand the header encapsulated inside. Therefore, if we define a new
Ethertype to describe a new type of payload encapsulated within the 802
frame header, those boxes that have not been upgraded to understand this
new Ethertype will have to drop the frame.


3. Pros :) & Cons :(
--------------------
#1 :( In environments where ECN or PCN is deployed to get near-zero
drop, the extra drop one might get with multiple bottlenecks is possibly
worrying, but not too bad.

#2 :( Assumes a well-designed & configured network. Although this is a
reasonable assumption for MPLS and the provider-based forms of Ethernet,
it is not reasonable for Ethernet in the home etc. We should define a
frame format for all 802.1 frames without assuming their control
configuration.

#1 :( Can't be used for two congestion severity levels (e.g. PCN).

CFI=1 sometimes dropped.

{ToDo: Finish writing up the pros and cons of the rest of the schemes}

4. Conclusions, Acknowledgements etc

{ToDo:}

References {ToDo: add full citations}
----------
[802a] IEEE
[802.1Q] IEEE
[802.1ad] IEEE
[802.1ah] IEEE
[802.1p] IEEE
[802.2] IEEE
[RFC3168] IETF
[RFC3540] IETF
[RFC5129] IETF
[RFC5559] IETF
[ID.briscoe-tsvwg-re-ecn-tcp] IETF


Document Control
----------------
Version Date         Author       Comments
________________________________________________________________
A       10 Aug 2009  Bob Briscoe  Initial Draft
B       19 Feb 2010  Bob Briscoe  Added new Ethertype idea.
________________________________________________________________