TSVWG                                                         B. Briscoe
Internet Draft                                               G. Corliano
draft-briscoe-tsvwg-cl-architecture-00.txt                    P. Eardley
Expires: January 2006                                          P. Hovell
                                                              A. Jacquet
                                                            D. Songhurst
                                                                      BT
                                                           July 11, 2005

An architecture for edge-to-edge controlled load service using
distributed measurement-based admission control
draft-briscoe-tsvwg-cl-architecture-00.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

This Internet-Draft will expire on January 11, 2006.

Copyright Notice

Copyright (C) The Internet Society (2005). All Rights Reserved.

Briscoe                 Expires January 11, 2006                [Page 1]
Internet-Draft        Controlled Load architecture             July 2005

Abstract

This document describes an architecture to achieve a Controlled Load (CL) service edge-to-edge, i.e. within a particular region of the Internet, by using distributed measurement-based admission control. The measurement made is of CL packets that have their Congestion Experienced (CE) codepoint set as they travel across the edge-to-edge region.
Setting the CE codepoint, which is under the control of a new Per Hop Behaviour (CL-ramp-PHB, defined in draft-briscoe-tsvwg-cl-phb-00.txt), provides an "early warning" of potential congestion. This information is used by the ingress node of the edge-to-edge region to decide whether to admit a new CL microflow.

A use case is described which shows how the PHB is a fundamental building block in the edge-to-edge architecture, and in turn how this is a building block within a broader QoS architecture achieving an end-to-end CL service.

Table of Contents

   1. Introduction
      1.1. Summary
      1.2. Key features
      1.3. Benefits
      1.4. Standardisation requirements
      1.5. Terminology
      1.6. Structure of rest of document
   2. Use case
      2.1. Configured bandwidth allocation to the CL behaviour aggregate
      2.2. Flexible bandwidth allocation to CL behaviour aggregate
   3. Details
      3.1. Packet processing
         3.1.1. Ingress nodes
         3.1.2. Interior nodes
         3.1.3. Egress nodes
      3.2. Signalling
   4. Extensions
      4.1. Multi-domain and multi-operator usage
      4.2. Variable bit rate sources
      4.3. Starvation prevention
   5. Relationship to other QoS mechanisms
      5.1. Standardisation requirements
      5.2. Controlled Load
      5.3. Integrated services operation over Diffserv
      5.4. Differentiated Services
      5.5. ECN
      5.6. RTECN
      5.7. RMD
      5.8. MPLS-TE
   6. Security Considerations
   7. Acknowledgements
   8. Comments solicited
   9. References
   Authors' Addresses
   Intellectual Property Statement
   Disclaimer of Validity
   Copyright Statement

1. Introduction

1.1. Summary

This document describes an architecture to achieve a controlled load service edge-to-edge, i.e. within a particular region of the Internet, using distributed measurement-based admission control. Controlled load service is a quality of service (QoS) closely approximating the QoS that the same flow would receive from a lightly loaded network element [RFC2211]. Controlled Load (CL) is useful for inelastic flows such as those for streaming real-time media.

The architecture described in this document achieves edge-to-edge controlled load service using a new Per Hop Behaviour (PHB) as a fundamental building block. In turn, an end-to-end CL service would use this architecture as a building block within a broader QoS architecture. The PHB, edge-to-edge and end-to-end aspects are now briefly introduced in turn.
The new PHB, called CL-ramp-PHB, is defined in [CL-PHB]. Network nodes that implement the differentiated services (DS) enhancements to IP use a codepoint in the IP header to select a PHB as the specific forwarding treatment for that packet [RFC2474, RFC2475]. The CL-ramp-PHB is different from PHBs defined so far, in that it defines Explicit Congestion Notification (ECN) marking semantics as part of the PHB. A node in the CL-region sets the Congestion Experienced (CE) codepoint in the IP header as an "early warning" of potential congestion, and aims to do so before there is any significant build-up of CL packets in the queue.

To achieve the CL service edge-to-edge, ie within a region of the Internet - which we call CL-region (defined below) - distributed measurement-based admission control is used. All nodes within the CL-region run the CL-ramp-PHB. The measurement is of the CL packets that have had their CE codepoint set as they travel across the CL-region. Since any node in the CL-region may set the CE codepoint, the measurement is distributed. The measurement is recorded by the egress node of the CL-region. The egress node calculates the bits in these CE packets as a fraction of the bits in all the CL packets, as an exponentially weighted moving average (which we term Congestion-Level-Estimate). Depending on the value of Congestion-Level-Estimate, the ingress node of the CL-region decides whether to admit a new CL microflow. Since setting the CE codepoint is an "early warning" of potential congestion (ie before there is any significant build-up of CL packets in the queue), the admission control procedure means that previously accepted CL microflows will suffer minimal queuing delay, jitter and loss - exactly the requirements of real time traffic.

In turn, the edge-to-edge architecture is a building block in delivering an end-to-end CL service.
The approach is similar to that described in [RFC2998] for Integrated services operation over Diffserv networks. Like [RFC2998], an IntServ class (CL in our case) is achieved end-to-end, with a CL-region viewed as a single reservation hop in the total end-to-end path. Interior routers of the CL-region do not process flow signalling nor do they hold state. Unlike [RFC2998] we do not require the end-to-end signalling mechanism to be RSVP, although it can be - as indeed we assume in Sections 2 and 3. [RFC2998] and our approach are compared further in Section 5.

1.2. Key features

In this section we discuss some of the key aspects of the edge-to-edge architecture.

One key feature of our approach revolves around the use of Explicit Congestion Notification (ECN) [RFC3168] to indicate that the amount of packets flowing is getting close to the engineered capacity. Note that ECN operates across the CL-region, ie edge-to-edge, and not host-to-host as in [RFC3168].

The new PHB, CL-ramp-PHB, is designed to provide an "early warning" of potential congestion. It assumes that a new microflow won't move the CL-region directly from no congestion to overload; there will always be an intermediate stage where a new CL microflow causes CL packets to have their CE codepoint set but still be delivered without significant delay. This assumption is valid for core and backbone networks but is unlikely to be valid in access networks where the granularity of an individual call becomes significant.

Note that the CL-region can potentially span multiple domains. Indeed, over time CL-regions may incrementally grow and merge, and could eventually become a single CL-region encompassing all core and backbone networks, providing Internet-wide controlled load service in concert with stateful admission control mechanisms at the very edges of the Internet.
It is also possible for a CL-region to include domains run by different operators. The border routers between operators within the CL-region only have to do bulk accounting - per microflow metering and policing is not needed. Section 4.1 discusses further. CL-packets are marked with a Differentiated Services Codepoint (DSCP), so that nodes in the CL-region can distinguish the CL packets from non-CL ones [RFC2474] and know that the CL-ramp-PHB is required. However, note that we do not use the traffic conditioning agreements (TCAs) of the (informational) Diffserv architecture [RFC2475], in which operators in practice rely on subscription-time Service Level Agreements (SLAs) that statically define the parameters of the traffic that will be accepted from a customer. Operators deploying our mechanism do not need to make a fixed assignment of capacity because the division of bandwidth between CL and non-CL traffic can be flexible. Our edge-to-edge architecture uses dynamic admission control: the closed feedback loop between the ingress and egress nodes of the CL- region. The key advantage of controlling the load dynamically rather than with TCAs is that the latter can fail catastrophically. The problem arises because the TCA at the ingress must allow any destination address, if it is to remain scalable. But for longer topologies, the chances increase that traffic will focus on a resource near the egress, even though it is within contract at the ingress [Reid]. Even though networks can be engineered to make such failures rare, when they occur all inelastic flows through the congested resource fail catastrophically. This is also why in our approach the egress node of the CL-region calculates the Congestion- Level-Estimate separately for CL packets from each ingress node. 
Finally, it is assumed that the end systems react properly to non-CL packets that are dropped or have their CE codepoint set, otherwise new CL microflows may get unfairly blocked. How to police this is out of scope of this document.

1.3. Benefits

We believe that the mechanism described in this document has several advantages, which we briefly explain with reference to the key features described above:

o  It achieves statistical guarantees of quality of service for microflows, delivering a very low delay, jitter and packet loss service suitable for applications like voice and video calls that generate real time inelastic traffic. This is because of its per microflow admission control scheme, combined with its "early warning" of potential congestion. The guarantee is at least as strong as with Intserv Controlled Load (Section 5 mentions why the guarantee may be somewhat better), but without its scalability problems [RFC2208].

o  It scales well, because there is no signal processing or path state held by the interior nodes of the CL-region.

o  It is resilient, again because no state is held by the interior nodes of the CL-region.

o  It requires minimal new standardisation, because it reuses existing QoS protocols.

o  It can be deployed incrementally, network by network. Not all the networks on the end-to-end path need to have it deployed. Two CL-regions can be separated by a network that uses another QoS mechanism (eg MPLS), or where they are adjacent can merge to become a single CL-region.

o  It can work between operators, ie the CL-region can include domains run by different operators. This is scalable because there is only bulk metering at the inter-operator interface; there is no need for per microflow accounting or policing.

1.4. Standardisation requirements

The architecture described in this document has two new standardisation requirements: for a new PHB, as described in [CL-PHB], and for the end-to-end signalling protocol to carry the Congestion-Level-Estimate report (eg with RSVP, the RESV message must carry a new opaque object across the CL-region). Other than these two things, the arrangement uses existing standards throughout although, as mentioned above, not in their usual architecture. Section 5 discusses standardisation issues further.

This document is INFORMATIONAL.

1.5. Terminology

o  Ingress node: a node which is an ingress gateway to the CL-region. A CL-region may have several ingress nodes.

o  Egress node: a node which is an egress gateway from the CL-region. A CL-region may have several egress nodes.

o  Interior node: a node which is part of the CL-region, but isn't an ingress or egress node.

o  CL-region: A region of the Internet in which all nodes run the CL-ramp-PHB and all traffic enters/leaves through an ingress/egress node. A CL-region is a DS region (a DS region is either a single DS domain or set of contiguous DS domains), but note that the CL-region does not use the traffic conditioning agreements (TCAs) of the (informational) Diffserv architecture.

o  CL-ramp-PHB: A new Per Hop Behaviour, described in [CL-PHB].

o  Congestion-Level-Estimate: the bits in CL packets that have the CE codepoint set, divided by the bits in all CL packets. It is calculated as an exponentially weighted moving average. It is calculated by an egress node for CL packets from a particular ingress node.
   ______________________________
  /                              \
 /                                \
|-------|    |--------|    |-------|
|Ingress|----|Interior|----|Egress |
| node  |    |  node  |    | node  |
|-------|    |--------|    |-------|
 \                                /
  \______________________________/

< ---------- CL-region ----------- >

Figure 1: Sample edge-to-edge configuration and terminology

1.6. Structure of rest of document

Section 2 describes a use case, with further details in Section 3 and extensions in Section 4. Section 5 discusses standardisation aspects.

2. Use case

In this section we outline a usage scenario to illustrate how our mechanism works. It is intended to show how the main features fit together to deliver QoS, with further details in Section 3.

Our QoS mechanism operates over a CL-region. For now we assume that it consists of one domain whilst in Section 4.1 we extend it to the multi-domain case, including where different operators run the domains. So our scenario consists of two end hosts, each connected to their own access networks, which are linked by the CL-region. We require some other method, for instance IntServ, to be used outside the CL-region to provide QoS.

For now we assume that the end-to-end signalling protocol is RSVP; other protocols are considered in Section 3.2. From the perspective of RSVP the CL-region is a single hop, so the RSVP PATH and RESV messages are processed by the ingress and egress nodes but are carried transparently across all the interior nodes. Hence, the ingress and egress nodes hold per microflow state, whilst no state is kept by the interior nodes.

Section 2.1 describes a restricted scenario where the CL behaviour aggregate is assigned a fixed amount of bandwidth.
This is equivalent to the case today with the DS architecture: a subscription-time Service Level Agreement (SLA) statically defines the amount of bandwidth reserved for a particular behaviour aggregate. Section 2.2 describes the more general case where there is no fixed allocation to CL traffic.

Each node in the CL-region runs an algorithm to determine whether to set the CE codepoint of a particular CL packet. In our description we assume that a bulk token bucket is used (other implementations are possible), and that tokens are added when packets are queued and are consumed at a fixed rate. The idea is that an excess of tokens is seen before the queue of CL packets has got long enough to cause the CL packets to suffer a significant delay - the algorithms are explained more fully below and are slightly different in Sections 2.1 and 2.2. Note that the same token bucket is used for all the CL packets, ie it operates in bulk on the CL behaviour aggregate and not per microflow.

 ___    ____    ________________________________________    ____    ___
|   |  |    |  |                                        |  |    |  |   |
|   |  |    |  | Ingress   Interior  Interior   Egress  |  |    |  |   |
|   |  |    |  |   node      node      node      node   |  |    |  |   |
|   |  |    |  | |------|  |------|  |------|  |------| |  |    |  |   |
|   |  |    |  | | CL-  |  | CL-  |  | CL-  |  |      | |  |    |  |   |
|   |..|    |..|.| PHB  |..| PHB  |..| PHB  |..| Meter|.|..|    |..|   |
|   |  |    |  | |------|  |------|  |------|  |------| |  |    |  |   |
|   |  |    |  |    \                              /    |  |    |  |   |
|   |  |    |  |     \                            /     |  |    |  |   |
|   |  |    |  |      --<----------<-----------<--      |  |    |  |   |
|   |  |    |  |                                        |  |    |  |   |
|___|  |____|  |________________________________________|  |____|  |___|

 Sx    Access                  CL-region                   Access   Rx
 End   Network                                            Network  End
 Host                                                              Host

                <------ edge-to-edge signalling ------>
                          (admission control)

<-------------------end-to-end QoS signalling protocol---------------->

Figure 2: Overall QoS architecture

2.1. Configured bandwidth allocation to the CL behaviour aggregate

Each node in the CL-region has a fixed rate (bandwidth) allocated to CL traffic, under the control of management configuration.

Tokens are consumed at a fixed rate that is slightly slower than the configured rate, and added when packets are queued. This means that the amount of tokens starts to increase before the actual queue builds up but when it is in danger of doing so soon; hence it can be used as an "early warning" of potential congestion. The probability that a node sets the CE codepoint of a CL packet depends on the number of tokens in the bucket. Below one threshold value of the number of tokens no packets have their CE codepoint set and above the second they all do; in between, the probability increases linearly.

We now describe how setting the CE codepoint influences admission control by the ingress node. For ease of description we imagine that packets are already flowing. Each egress meters whether a CL packet has its CE codepoint set. We assume that initially the traffic load is such that there are no CE packets.

Next a source tries to set up a new CL microflow. The RSVP PATH message is processed by the ingress and egress nodes and PATH state is installed in these two routers. When the RSVP RESV message travels back from the receiving end host, the egress node adds on an RSVP object which states that currently no CL packets have their CE codepoint set. Hence the ingress node admits the new CL microflow, and the RESV message continues on to the source.

We imagine that this new microflow results in one (or more) of the interior nodes starting to set the CE codepoint of CL packets because their arrival rate is nearing the configured rate. The egress calculates - as an exponentially weighted moving average - the fraction of CL packets from a particular ingress node that have their CE codepoint set (or rather the calculation is done according to the bits in those packets).
This Congestion-Level-Estimate provides an estimate of how near the CL-region is getting to a load where the CL traffic will start suffering significant delays. Note that the metering is done separately per ingress node, because (as discussed in Section 1.2) there may be sufficient capacity on all the nodes on the path between one ingress node and a particular egress, but not from a second ingress.

The next time a source tries to set up a CL microflow, the egress informs the ingress node about the relevant Congestion-Level-Estimate; this is included as an opaque object within the RSVP RESV message. If it is greater than some threshold value then the ingress refuses the request, otherwise it is accepted and the RSVP RESV continues to the source end host.

It is also possible for an egress node to get an RSVP RESV message and not know the Congestion-Level-Estimate, for example if there are no CL microflows at present between the relevant ingress and egress nodes. In this case the egress requests the ingress to send probe packets, from which it can initialise its meter.

Having explained how the admission control decision is reached we now look at an on-going data microflow. The source sends CL packets, which arrive at the ingress node. The ingress uses a normal five-tuple filter to identify that the packets are part of a previously admitted CL microflow, and it also polices the microflow to ensure it remains within its traffic profile. (The ingress has learnt the required information from the RSVP PATH message.) The ingress sets the DSCP appropriately and the ECN field to ECT (ECN-Capable Transport). The CL packets now travel across the CL-region, with the CE codepoint getting set if necessary.

Also, appropriate queue scheduling is needed in each node to ensure that CL traffic gets its configured bandwidth. For instance, a Weighted Round Robin scheduler could be used.
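The admission decision described in this sub-section can be sketched in a few lines of Python. This is only an illustrative sketch: the names CongestionReport, ingress_decision and admit_threshold are hypothetical, the opaque RESV object is modelled as a plain record, and a value of None stands in for the case where the egress does not yet know the Congestion-Level-Estimate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CongestionReport:
    """Hypothetical stand-in for the opaque object carried in the RSVP RESV."""
    # None models the case where the egress has no current estimate.
    congestion_level_estimate: Optional[float]


def ingress_decision(report: CongestionReport, admit_threshold: float) -> str:
    """Return "probe", "reject" or "admit" for a new CL microflow request."""
    estimate = report.congestion_level_estimate
    if estimate is None:
        # No estimate for this ingress-egress pair: probe packets are
        # needed so that the egress can initialise its meter.
        return "probe"
    if estimate > admit_threshold:
        # Estimate greater than the threshold: the request is refused.
        return "reject"
    # Otherwise the request is accepted and the RESV continues upstream.
    return "admit"
```

For example, with an assumed threshold of 0.05, a report of 0.10 would be refused, a report of 0.01 accepted, and an unknown value would lead to probing. The threshold value itself is a configuration choice that this document does not fix.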
2.2. Flexible bandwidth allocation to CL behaviour aggregate

The set-up is similar to the previous sub-section, except that nodes in the CL-region do not allocate a fixed bandwidth to CL flows. As a consequence, the algorithm for setting the CE codepoint is slightly altered.

Tokens are consumed at a fixed rate that is slightly slower than the (total) outgoing service rate, and added when packets are queued. The probability that a node sets the CE codepoint of a CL packet depends on the number of tokens in the bucket *plus* the number of queued non-CL packets. Below one threshold value of this sum no packets have their CE codepoint set and above the second they all do; in between, the probability increases linearly.

Note that the probability reflects the load of both CL and non-CL traffic. The reason is to ensure a 'fair balance' between the two classes, by rejecting CL session requests if non-CL demand is very high. Alternatively, if the number of queued non-CL packets is not included, then the admission of a CL microflow is independent of the amount of non-CL traffic.

The admission control procedure is as in the previous sub-section.

As regards queue scheduling, CL packets are always scheduled ahead of non-CL ones, in order to minimise their delay and jitter, and FIFO (First In First Out) queuing is used to prevent reordering within a CL microflow. This is more restrictive than in the previous sub-section, which we believe is necessary now that the arrival rate of CL packets is unknown.

3. Details

In this section we first concentrate on the details about packet processing in nodes in the CL-region, before looking more briefly at issues associated with the signalling for admission control.

3.1. Packet processing

A network operator upgrades normal IP routers by:

o  Adding functionality related to admission control to all its ingress and egress nodes

o  Adding appropriate queuing and scheduling behaviour to its nodes, including the ability to set the CE codepoint "early".

We consider the detailed actions required for each of the types of node in turn.

3.1.1. Ingress nodes

Ingress nodes perform the following tasks:

o  Classify incoming packets - decide whether they are CL or non-CL packets. This is done using a normal filter spec (source and destination addresses and port numbers), whose details have been gathered from the RSVP PATH message

o  Police - check that the microflow is conformant with what has been agreed (ie the flow keeps to its agreed data rate). If necessary, the suggested action is that packets are marked to Best Effort.

o  Packet colouring - for CL microflows, set the DSCP appropriately and set the ECN field to ECT(0) or ECT(1)

o  Perform standard 'interior node' functions (see next sub-section)

3.1.2. Interior nodes

Interior nodes do the following tasks:

o  Examine the DSCP - to see if it's a CL packet

o  Enqueue - CL and non-CL packets are put into logically separate queues; if required, a CL packet can pre-empt non-CL packet(s) in the total buffer (see below).

o  Non-CL packets are handled as usual. A RED algorithm [RFC2309] is used to decide whether to drop packets or, if they are ECN-capable, set their CE codepoint.

o  CL packets have their CE codepoint set according to what is essentially a token bucket algorithm (see below).

o  Dequeue - any CL packet is always dequeued before a non-CL packet. Within the CL class scheduling is FIFO. There may be a hierarchy of non-CL classes; this is out of scope.

Queuing:

Although CL and non-CL packets are put into logically separate queues, implementations in practice share the same buffer space.
If the buffer is full then an incoming non-CL packet is dropped, whilst an incoming CL packet is queued and enough of the newest non-CL packet(s) are dropped to make room for it. In the unlikely event that the buffer is full of CL packets, then the newest CL packet is discarded (ie tail drop). Because of the admission procedure this should be rare, but it is needed to protect the network in case of misconfiguration for instance.

Setting the CE codepoint:

Tokens are added when CL packets are queued and are consumed at a fixed rate related to the outgoing service rate. When a CL packet arrives the token bucket is updated as follows:

   [CL-bucket-level]n+1 = [CL-bucket-level]n + CL-packet-size
                          - (service-bit-rate * time * safety-factor)

Where

   CL-bucket-level is the amount of tokens in the token bucket. It is constrained to lie between 0 and a fixed upper limit

   time is the time elapsed since CL-bucket-level was last updated

   safety-factor is < 1, so that tokens are consumed slightly slower than the service rate, and gives the "early warning" of potential congestion

   service-bit-rate is either the configured bit rate for CL traffic - for the fixed bandwidth case (ie Section 2.1), or the outgoing service rate for all traffic - for the flexible bandwidth case (ie Section 2.2).

CL packets have their CE codepoint set with a probability that depends on the number of non-CL packets in the queue, as well as the number of tokens in a token bucket.
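As an illustration only, the bucket update above and the linear marking ramp described in Sections 2.1 and 2.2 might be sketched as follows. The class and parameter names are hypothetical. The sketch assumes a safety factor below 1, so that tokens drain slightly slower than the service rate as Section 2.1 describes, and it applies the ramp to the sum of the bucket level and the smoothed non-CL queue length, as in the prose of Section 2.2.

```python
class ClTokenBucket:
    """Illustrative bulk token bucket for the CL behaviour aggregate."""

    def __init__(self, service_bit_rate, safety_factor, upper_limit,
                 min_threshold, max_threshold):
        self.service_bit_rate = service_bit_rate  # bits per second
        self.safety_factor = safety_factor        # assumed to be below 1
        self.upper_limit = upper_limit            # cap on the bucket level
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.level = 0.0                          # CL-bucket-level, in bits
        self.last_update = 0.0                    # time of last update

    def on_cl_packet(self, packet_bits, now):
        """Add the packet's bits and drain tokens for the elapsed time."""
        elapsed = now - self.last_update
        drained = self.service_bit_rate * elapsed * self.safety_factor
        new_level = self.level + packet_bits - drained
        # The level is constrained to lie between 0 and the upper limit.
        self.level = min(self.upper_limit, max(0.0, new_level))
        self.last_update = now
        return self.level

    def marking_probability(self, smoothed_non_cl_queue_bits=0.0, a=0):
        """Probability of setting CE: 0 below min, 1 above max, linear between.

        a = 0 ignores non-CL traffic (fixed bandwidth case);
        a = 1 includes it (flexible bandwidth case).
        """
        load = self.level + a * smoothed_non_cl_queue_bits
        if load < self.min_threshold:
            return 0.0
        if load > self.max_threshold:
            return 1.0
        return (load - self.min_threshold) / (
            self.max_threshold - self.min_threshold)
```

A node would then set the CE codepoint of the arriving CL packet with this probability, for instance by comparing it against a uniform random number. The sketch does not choose the thresholds or the upper limit; those remain configuration parameters.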
When a CL packet arrives, the probability that the node sets its CE codepoint is determined as follows:

   if [CL-bucket-level]n+1 + (A * smoothed-non-CL-queue-length)
         < min-threshold

      Probability-CE-codepoint-set = 0

   if [CL-bucket-level]n+1 + (A * smoothed-non-CL-queue-length)
         > max-threshold

      Probability-CE-codepoint-set = 1

   otherwise

      Probability-CE-codepoint-set =
         ([CL-bucket-level]n+1 + (A * smoothed-non-CL-queue-length)
            - min-threshold) / (max-threshold - min-threshold)

Where

   max-threshold > min-threshold

   max-threshold <= the fixed upper limit of CL-bucket-level

   smoothed-non-CL-queue-length is the number of bits in packets in the non-CL queue, smoothed as an exponentially weighted moving average (EWMA)

   A is either 0 or 1:

      A = 0 for the fixed bandwidth case (ie Section 2.1),

      A = 1 for the flexible bandwidth case (ie Section 2.2).

3.1.3. Egress nodes

Egress nodes do the following tasks:

o  Metering - for CL packets, calculating the fraction of the total bits which are in CE packets. The calculation is done as an exponentially weighted moving average. A separate calculation is made for CL packets from each ingress router.

o  Packet colouring - for CL packets, set the DSCP and the ECN field to whatever has been agreed as appropriate for the next domain.

An egress node getting a CL packet first determines which ingress node that packet has come from. The necessary details are gathered from the RSVP PATH message (previous RSVP hop, ie ingress node, vs. filter spec). It then updates the two meters associated with that ingress node. The meters work on an aggregate basis, and not per microflow.
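The two per-ingress meters can be pictured with the following Python sketch, which applies the exponentially weighted moving average update that this section gives for every CL packet arrival. The names EgressMeter, on_cl_packet and estimate are hypothetical, and starting both averages at zero is an assumption made for the sketch; the draft itself relies on probe packets to initialise a meter.

```python
from collections import defaultdict


class EgressMeter:
    """Illustrative per-ingress Congestion-Level-Estimate meters."""

    def __init__(self, weight):
        self.weight = weight  # exponential weighting factor w, 0 < w <= 1
        # ingress id -> [EWMA-total-bits, EWMA-CE-bits]
        self._meters = defaultdict(lambda: [0.0, 0.0])

    def on_cl_packet(self, ingress_id, packet_bits, ce_set):
        """Update both meters for one CL packet and return the estimate."""
        w = self.weight
        b = 1 if ce_set else 0  # B = 1 only if the CE codepoint is set
        meter = self._meters[ingress_id]
        meter[0] = w * packet_bits + (1 - w) * meter[0]
        meter[1] = b * w * packet_bits + (1 - w) * meter[1]
        return meter[1] / meter[0] if meter[0] > 0 else 0.0

    def estimate(self, ingress_id):
        """Return the current estimate, or None if nothing has been metered."""
        if ingress_id not in self._meters:
            return None
        total_bits, ce_bits = self._meters[ingress_id]
        return ce_bits / total_bits if total_bits > 0 else None
```

Keeping one pair of averages per ingress node, rather than per microflow, mirrors the aggregate metering described above. A None result corresponds to the case where the egress would ask the ingress to send probe packets.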
For every CL packet arrival:

   [EWMA-total-bits]n+1 = (w * bits-in-packet)
                          + ((1-w) * [EWMA-total-bits]n)

   [EWMA-CE-bits]n+1 = (B * w * bits-in-packet)
                       + ((1-w) * [EWMA-CE-bits]n)

   [Congestion-Level-Estimate]n+1 = [EWMA-CE-bits]n+1
                                    / [EWMA-total-bits]n+1

where

   EWMA-total-bits is the total number of bits in CL packets, calculated as an exponentially weighted moving average (EWMA)

   EWMA-CE-bits is the total number of bits in CL packets where the packet has its CE codepoint set, again calculated as an EWMA.

   B is either 0 or 1:

      B = 0 if the CL packet does not have its CE codepoint set

      B = 1 if the CL packet has its CE codepoint set

   w is the exponential weighting factor.

Varying the value of the weight trades off between the smoothness and responsiveness of the estimate of the percentage of CE packets. There will be a threshold inter-arrival time between packets of the same aggregate above which the egress will consider the Congestion-Level-Estimate too stale, and it will then trigger probing by the ingress.

For packet colouring, by default the ECN field is set to the Not-ECT codepoint. Note that this results in the loss of the end-to-end meaning of the ECN field. It can usually be assumed that end-to-end congestion control is unnecessary within an end-to-end reservation. But if a genuine need is identified for end-to-end ECN semantics within a reservation, then an alternative is to tunnel CL packets across the CL-region, or to agree an extension to end-to-end signalling to indicate that the microflow uses an ECN-capable transport. We do not recommend such apparently unnecessary complexity.

3.2. Signalling

The admission control procedure involves signalling between the ingress and egress nodes.
The following new messages are needed:

o  Egress to ingress: piggy-backed on reservation reply: this is the current value of Congestion-Level-Estimate. An egress node is configured to know it is an egress node, so it always appends this to the reservation response. A flag in this message can indicate the value is unknown, in order to trigger probing by the ingress.

o  Ingress to egress: probe: this is a probe packet

The description in the earlier sections has assumed that RSVP signalling is used. In this case, the first bullet requires standardisation so that the RSVP RESV message can carry a new opaque object with the load report.

However, there are several other possible signalling protocols, for instance using NSIS. It would therefore be sensible to ensure that the new signalling messages do not constrain the choice of end-to-end QoS mechanism nor how the end-to-end and edge-to-edge (ie ingress-to-egress) mechanisms interact. As an example on the latter point, with RSVP the PATH message is forwarded immediately to the next domain, with the Congestion-Level-Estimate report only being calculated when the RESV returns, at which point it can be piggy-backed on to the RESV and sent to the ingress. In other cases, it may be that admission control is performed before the signalling message is forwarded to the next domain.

4. Extensions

4.1. Multi-domain and multi-operator usage

The CL-region can consist of multiple domains. Then only the ingress and egress nodes of the CL-region take part in the admission control procedure, ie at the ingress to the first domain and the egress from the final domain. Note that domain border nodes within the CL-region do not take part in signal processing or hold path state.

The multiple domains can even be run by different operators.
The border routers between operators within the CL-region only have to do bulk accounting; per-microflow metering and policing is not needed [Briscoe]. This is possible even when the operators do not trust each other. In a later version of the draft we will explain how a downstream domain can check, by policing, that its upstream domain does not 'cheat' by admitting traffic when the downstream path is over-congested [Re-feedback].

4.2. Variable bit rate sources

So far we have assumed that the real-time inelastic sources operate at a constant bit rate. We have determined under what conditions it is possible to handle variable bit rate (VBR) sources. The simplest approach is an algorithm that decides whether to set the CE codepoint using a service rate much less than the real service rate (i.e. allowing an extra safety margin); the network can still operate efficiently when resources are shared between CL and non-CL flows. This approach assumes that the sources are statistically independent.

4.3. Starvation prevention

Depending on the particular traffic levels, it may sometimes be possible for either the non-CL or the CL traffic to be starved. An algorithm to prevent starvation will be documented in a future draft.

5. Relationship to other QoS mechanisms

5.1. Standardisation requirements

Standardisation of two functions is needed:

o  First, a new per hop behaviour is required (CL-ramp-PHB), which is described in [CL-PHB]. The corresponding DSCP needs to be RECOMMENDED rather than EXP/LU (experimental / local use), to enable multi-domain operation and vendor interoperability. This document is a use case of CL-ramp-PHB.

o  Second, signalling between the ingress and egress nodes and its interaction with the end-to-end QoS mechanism, for instance RSVP or NSIS. For example, given RSVP's capability to carry opaque objects, an object needs to be defined to carry the Congestion-Level-Estimate report.
Probe packets are simply data addressed to the egress gateway and require no protocol standardisation, although best-practice guidance is needed on their number, size and rate.

5.2. Controlled Load

The CL mechanism delivers QoS similar to Integrated Services controlled load, but somewhat better, because queues are kept empty: admission control is driven from a bulk token bucket on each interface, which can detect a rise in load before queues build. Such a token bucket is sometimes termed a virtual queue [AVQ, vq]. The CL mechanism is also more robust to route changes.

5.3. Integrated services operation over Diffserv

Our approach to end-to-end QoS is similar to that described in [RFC2998] for Integrated services operation over Diffserv networks. Like [RFC2998], an IntServ class (CL in our case) is achieved end-to-end, with a CL-region viewed as a single reservation hop in the total end-to-end path. Interior routers of the CL-region do not process flow signalling nor do they hold state. Unlike [RFC2998] we do not require the end-to-end signalling mechanism to be RSVP, although it can be. Also, we do not use the DS architecture (see Section 5.4).

Bearing in mind these differences, we can describe our architecture in the terms of the options in [RFC2998]. The Diffserv network region is RSVP-aware, but awareness is confined to (what [RFC2998] calls) the "border routers" of the Diffserv region. We use explicit admission control into this region, with either static provisioning or explicit signalling (corresponding to the configured and flexible bandwidth cases of Sections 2.1 and 2.2 respectively). The ingress "border router" does per-microflow policing and sets the correct DSCP (i.e. we use router marking rather than host marking).

5.4. Differentiated Services

The DS architecture does not specify any way for devices outside the domain to dynamically reserve resources or receive indications of network resource availability.
In practice, service providers rely on subscription-time Service Level Agreements (SLAs) that statically define the parameters of the traffic that will be accepted from a customer. The CL mechanism allows dynamic reservation of resources and, unlike Diffserv, it can span multiple domains without active mechanisms at the borders. Therefore we do not use the traffic conditioning agreements (TCAs) of the (informational) Diffserv architecture [RFC2475].

[Johnson] compares admission control with a 'generously dimensioned' Diffserv network as ways to achieve QoS. The former is recommended.

5.5. ECN

CL complies with the ECN aspects of the IP wire protocol [RFC3168], but provides its own edge-to-edge feedback instead of the TCP aspects of ECN. All nodes within a particular CL-region are upgraded with the CL mechanism, so the requirements of [Floyd] are met. The operator prevents traffic arriving at a node that does not understand CL by administrative configuration of the ring of gateways around the region. Where a region of nodes that understand CL spans multiple domains, the operators contract with each other to surround the region by gateways to prevent CL traffic being handled by nodes that do not understand it.

5.6. RTECN

Real-time ECN (RTECN) [RTECN, RTECN-usage] has a similar aim to this document (to achieve a low delay, jitter and loss service suitable for real-time traffic) and a similar approach (per-microflow admission control combined with an "early warning" of potential congestion through setting the CE codepoint). But it has a different architecture: host-to-host (rather than edge-to-edge). [CL-PHB] defines a new PHB, CL-step-PHB, that should be suitable; its algorithm is similar to CL-ramp-PHB, but setting the CE codepoint is either 'on' or 'off'. Only probe packets use the CL-step-PHB, whilst data uses the Expedited Forwarding PHB [RFC3246].

5.7. RMD

Resource Management in Diffserv (RMD) [RMD] is similar to this work, in that it pushes complex classification, traffic conditioning and admission control functions to the edge of a DS domain and simplifies the operation of the interior nodes. One of the RMD modes uses measurement-based admission control; however, it works differently: each interior node measures the user traffic load in the PHB traffic aggregate, and each interior node processes a local RESERVE message and compares the requested resources with the available resources (maximum allowed load minus current load).

Hence a difference is that the CL architecture described in this document has been designed not to require interaction between interior nodes and signalling, whereas in RMD all interior nodes are QoS-NSLP aware. So our architecture is more agnostic to signalling, requires fewer changes to existing standards and therefore works with existing RSVP as well as having the potential to work with future signalling protocols like NSIS.

5.8. MPLS-TE

Multi-protocol label switching traffic engineering (MPLS-TE) allows reservation of resources for an aggregate of many flows. However, it still requires admission control and policing (using a bandwidth manager) of microflows into the aggregate. This must be repeated at each trust boundary.

The present technique could be used for admission control of microflows into a set of MPLS-TE aggregates. They may span multiple domains without requiring per-microflow processing at the trust boundaries. However, this would require the MPLS header to be able to carry the ECN field.

6. Security Considerations

To protect against denial of service attacks, the ingress node of the CL-region needs to police all CL packets and drop packets in excess of the reservation.

Further security aspects will be considered later.

7. Acknowledgements

We thank Joe Babiarz for very helpful discussion about this document and [RTECN].

This work evolved from the Guaranteed Stream Provider developed in the M3I project [GSPa, GSP-TR], which in turn was based on the theoretical work of Gibbens and Kelly [DCAC].

8. Comments solicited

Comments and questions are encouraged and very welcome. They can be sent to the Transport Area Working Group's mailing list, tsvwg@ietf.org, and/or to the authors (either individually or collectively at gqs@jungle.bt.co.uk).

9. References

A later version will distinguish normative and informative references.

[AVQ] S. Kunniyur and R. Srikant, "Analysis and Design of an Adaptive Virtual Queue (AVQ) Algorithm for Active Queue Management", In: Proc. ACM SIGCOMM'01, Computer Communication Review 31 (4) (October 2001).

[Briscoe] Bob Briscoe and Steve Rudkin, "Commercial Models for IP Quality of Service Interconnect", BT Technology Journal, Vol 23 No 2, April 2005.

[CL-PHB] B. Briscoe, G. Corliano, P. Eardley, P. Hovell, A. Jacquet, D. Songhurst, "The Controlled Load per hop behaviour", draft-briscoe-tsvwg-cl-phb-00.txt (work in progress), July 2005.

[DCAC] Richard J. Gibbens and Frank P. Kelly, "Distributed connection acceptance control for a connectionless network", In: Proc. International Teletraffic Congress (ITC16), Edinburgh, pp. 941-952 (1999).

[Floyd] S.
Floyd, "Specifying Alternate Semantics for the Explicit Congestion Notification (ECN) Field", draft-floyd-ecn-alternates-00.txt (work in progress), April 2005.

[GSPa] Martin Karsten (Ed.), "GSP/ECN Technology & Experiments", Deliverable 15.3 PtIII, M3I EU Vth Framework Project IST-1999-11429, URL: http://www.m3i.org/ (February 2002) (superseded by [GSP-TR]).

[GSP-TR] Martin Karsten and Jens Schmitt, "Admission Control Based on Packet Marking and Feedback Signalling - Mechanisms, Implementation and Experiments", TU-Darmstadt Technical Report TR-KOM-2002-03, URL: http://www.kom.e-technik.tu-darmstadt.de/publications/abstracts/KS02-5.html (May 2002).

[Johnson] DM Johnson, "QoS control versus generous dimensioning", BT Technology Journal, Vol 23 No 2, April 2005.

[Re-feedback] Bob Briscoe, Arnaud Jacquet, Carla Di Cairano-Gilfedder, Andrea Soppera, "Re-feedback for Policing Congestion Response in an Inter-network", ACM SIGCOMM 2005, August 2005.

[Reid] ABD Reid, "Economics and scalability of QoS solutions", BT Technology Journal, Vol 23 No 2, April 2005.

[RFC2208] F. Baker et al, "Resource ReSerVation Protocol (RSVP) - Version 1 Applicability Statement; Some Guidelines on Deployment", RFC 2208 (January, 1997).

[RFC2211] J. Wroclawski, "Specification of the Controlled-Load Network Element Service", RFC 2211, September 1997.

[RFC2309] Braden, B., et al., "Recommendations on Queue Management and Congestion Avoidance in the Internet", RFC 2309, April 1998.

[RFC2474] Nichols, K., Blake, S., Baker, F. and D. Black, "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers", RFC 2474, December 1998.

[RFC2475] Blake, S., Black, D., Carlson, M., Davies, E., Wang, Z. and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, December 1998.

[RFC2597] Heinanen, J., Baker, F., Weiss, W. and J.
Wroclawski, "Assured Forwarding PHB Group", RFC 2597, June 1999.

[RFC2998] Bernet, Y., Yavatkar, R., Ford, P., Baker, F., Zhang, L., Speer, M., Braden, R., Davie, B., Wroclawski, J. and E. Felstaine, "A Framework for Integrated Services Operation Over DiffServ Networks", RFC 2998, November 2000.

[RFC3168] Ramakrishnan, K., Floyd, S. and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3246] B. Davie, A. Charny, J.C.R. Bennett, K. Benson, J.Y. Le Boudec, W. Courtney, S. Davari, V. Firoiu, D. Stiliadis, "An Expedited Forwarding PHB (Per-Hop Behavior)", RFC 3246, March 2002.

[RMD] Attila Bader, Lars Westberg, Georgios Karagiannis, Cornelia Kappler, Tom Phelan, "RMD-QOSM - The Resource Management in Diffserv QoS model", draft-ietf-nsis-rmd-03 (work in progress), June 2005.

[RTECN] Babiarz, J., Chan, K. and V. Firoiu, "Congestion Notification Process for Real-Time Traffic", draft-babiarz-tsvwg-rtecn-03 (work in progress), February 2005.

[RTECN-usage] Alexander, C., Ed., Babiarz, J. and J. Matthews, "Admission Control Use Case for Real-time ECN", draft-alexander-rtecn-admission-control-use-case-00 (work in progress), February 2005.

[vq] Costas Courcoubetis and Richard Weber, "Buffer Overflow Asymptotics for a Switch Handling Many Traffic Sources", In: Journal Applied Probability 33, pp. 886-903 (1996).
Authors' Addresses

   Bob Briscoe
   BT Research
   B54/77, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: bob.briscoe@bt.com

   Dave Songhurst
   BT Research
   B54/69, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: dsonghurst@jungle.bt.co.uk

   Philip Eardley
   BT Research
   B54/77, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: philip.eardley@bt.com

   Peter Hovell
   BT Research
   B54/69, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: peter.hovell@bt.com

   Gabriele Corliano
   BT Research
   B54/70, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: gabriele.2.corliano@bt.com

   Arnaud Jacquet
   BT Research
   B54/70, Sirius House
   Adastral Park
   Martlesham Heath
   Ipswich, Suffolk
   IP5 3RE
   United Kingdom
   Email: arnaud.jacquet@bt.com

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2005). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.