Transport Area Working Group                                  B. Briscoe
Internet-Draft                                                  BT & UCL
Expires: December 28, 2006                                 June 26, 2006

        Emulating Border Flow Policing using Re-ECN on Bulk Data
               draft-briscoe-tsvwg-re-ecn-border-cheat-01

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 28, 2006.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

Scaling per-flow admission control to the Internet is a hard problem. A recently proposed approach combines Diffserv and pre-congestion notification (PCN) to provide a service slightly better than Intserv controlled load. It scales to networks of any size, but only if domains trust each other to comply with admission control and rate policing. This memo claims to solve this trust problem without losing scalability. It describes bulk border policing that provides a sufficient emulation of per-flow policing with the help of another recently proposed extension to ECN, involving re-echoing ECN feedback (re-ECN).
With only passive bulk measurements at borders, sanctions can be applied against cheating networks.

Status (to be removed by the RFC Editor)

This memo is posted as an Internet-Draft with the intent to eventually progress to informational status. It is envisaged that the necessary standards actions to realise the system described would sit in three other documents currently being discussed (but not on the standards track) in the IETF Transport Area [Re-TCP], [RSVP-ECN] & [PCN].

The authors seek comments from the Internet community on whether combining PCN and re-ECN is a sufficient solution to the admission control problem.

Changes from previous drafts (to be removed by the RFC Editor)

From -00 to -01:

Added subsection on Border Accounting Mechanisms (Section 5.6.1).

Section 4.2 on the re-ECN wire protocol clarified and re-organised to separately discuss re-ECN for default ECN marking and for pre-congestion marking (PCN).

Router Forwarding Behaviour subsection added to re-organised section on Protocol Operation (Section 4.3). Extensions section moved within Protocol Operations.

Emulating Border Policing (Section 5) reorganised, starting with a new Terminology subsection heading, and a simplified overview section. Added a large new subsection on Border Accounting Mechanisms within a new section bringing together other subsections on Border Mechanisms generally (Section 5.6). Some text moved from old subsections into these new ones.

Added section on Incremental Deployment (Section 7), drawing together relevant points about deployment made throughout.

Sections on Design Rationale (Section 8) and Security Considerations (Section 9) expanded with some new material, including new attacks and their defences.

Suggested Border Metering Algorithms improved (Appendix A.2) for resilience to newly identified attacks.

Table of Contents

   1.  Introduction
   2.  Requirements Notation
   3.  The Problem
     3.1.  The Traditional Per-flow Policing Problem
     3.2.  Generic Scenario
   4.  Re-ECN Protocol for an RSVP (or similar) Transport
     4.1.  Protocol Overview
     4.2.  Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)
       4.2.1.  Re-ECN Recap
       4.2.2.  Re-ECN Combined with Pre-Congestion Notification (re-PCN)
     4.3.  Protocol Operation
       4.3.1.  Protocol Operation for an Established Flow
       4.3.2.  Aggregate Bootstrap
       4.3.3.  Flow Bootstrap
       4.3.4.  Router Forwarding Behaviour
       4.3.5.  Extensions
   5.  Emulating Border Policing with Re-ECN
     5.1.  Informal Terminology
     5.2.  Policing Overview
     5.3.  Pre-requisite Contractual Arrangements
     5.4.  Emulation of Per-Flow Rate Policing: Rationale and Limits
     5.5.  Sanctioning Dishonest Marking
     5.6.  Border Mechanisms
       5.6.1.  Border Accounting Mechanisms
       5.6.2.  Competitive Routing
       5.6.3.  Fail-safes
   6.  Analysis
   7.  Incremental Deployment
   8.  Design Choices and Rationale
   9.  Security Considerations
   10. IANA Considerations
   11. Conclusions
   12. Acknowledgements
   13. Comments Solicited
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Implementation
     A.1.  Ingress Gateway Algorithm for Blanking the RE flag
     A.2.  Downstream Congestion Metering Algorithms
       A.2.1.  Bulk Downstream Congestion Metering Algorithm
       A.2.2.  Inflation Factor for Persistently Negative Flows
     A.3.  Algorithm for Sanctioning Negative Traffic
   Author's Address
   Intellectual Property and Copyright Statements

1. Introduction

The Internet community largely lost interest in the Intserv architecture after it was clarified that it would be unlikely to scale to the whole Internet [RFC2208]. Although Intserv mechanisms proved impractical, the bandwidth reservation service it aimed to offer is still very much required. A recently proposed approach [CL-deploy] combines Diffserv and pre-congestion notification (PCN) to provide a service slightly better than Intserv controlled load [RFC2211].
It scales to any size network, but only if domains trust their neighbours to have checked that upstream customers aren't taking more bandwidth than they reserved, either accidentally or deliberately. This memo describes border policing measures so that one network can protect its interests, even if networks around it are deliberately trying to cheat. The approach provides a sufficient emulation of flow rate policing at trust boundaries but without per-flow processing. The emulation is not perfect, but it is sufficient to ensure that the punishment is at least proportionate to the severity of the cheat.

The aim is to be able to scale controlled load service to any number of endpoints, even though such scaling must take account of the increasing numbers of networks and users who may all have conflicting interests. To achieve such scaling, this memo combines two recent proposals, both of which it briefly recaps:

o  A deployment model for admission control over Diffserv using pre-congestion notification [CL-deploy] describes how bulk pre-congestion notification on routers within an edge-to-edge Diffserv region can emulate the precision of per-flow admission control to provide controlled load service without unscalable per-flow processing;

o  Re-ECN: Adding Accountability to TCP/IP [Re-TCP].

The trick that addresses cheating at borders is to recognise that border policing is mainly necessary because cheating upstream networks will admit traffic when they shouldn't only as long as they don't directly experience the downstream congestion their misbehaviour can cause. The re-ECN protocol requires upstream nodes to declare expected downstream congestion in all forwarded packets and it makes it in their interests to declare it honestly.
Operators can then monitor downstream congestion in bulk at borders to emulate policing.

Rather than the end-to-end arrangement used when re-ECN was specified for the TCP transport [Re-TCP], this memo specifies re-ECN in an edge-to-edge arrangement, making it applicable to the above deployment model for admission control over Diffserv. Also, rather than using a TCP transport for regular congestion feedback, this memo specifies re-ECN using RSVP as the transport for feedback [RSVP-ECN]. A similar deployment model, but with a different transport for signalling congestion feedback, could be used (e.g. RMD [NSIS-RMD] uses NSIS).

This memo aims to do two things: i) define how to apply the re-ECN protocol to the admission control over Diffserv scenario; and ii) explain why re-ECN sufficiently emulates border policing in that scenario. Most of the memo is taken up with the second aim: explaining why it works. Applying re-ECN to the scenario actually involves quite a trivial modification to the ingress gateway. Our immediate goal is to convince everyone to build that modification into ingress gateways from the start, whether first deployments require policing or not. Otherwise, when we want to add policing, we will have built ourselves a legacy problem. In other words, we aim to convince people to "Build in security from the start."

The body of this memo is structured as follows:

Section 3 describes the border policing problem.
We recap the traditional, unscalable view of how to solve the problem, and we recap the admission control solution which has the scalability we do not want to lose when we add border policing;

Section 4 specifies the re-ECN protocol solution in detail;

Section 5 explains how to use the protocol to emulate border policing, and why it works;

Section 6 analyses the security of the proposed solution;

Section 8 explains the sometimes subtle rationale behind our design decisions;

Section 9 comments on the overall robustness of the security assumptions and lists specific security issues.

It must be emphasised that we are not evangelical about removing per-flow processing from borders. Network operators may choose to do per-flow processing at their borders for their own reasons, such as to support business models that require per-flow accounting. Our aim is to show that per-flow processing at borders is no longer /necessary/ in order to provide end-to-end QoS using flow admission control. Indeed, we are absolutely opposed to standardisation of technology that embeds particular business models into the Internet. Our aim is merely to provide a new useful metric (downstream congestion) at trust boundaries. Given the well-known significance of congestion in economics, operators can then use this new metric in their interconnection contracts if they choose. This will enable competitive evolution of new business models (for examples see [IXQoS]), alongside more traditional models that depend on more costly per-flow processing at borders.

2. Requirements Notation

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

3. The Problem

3.1. The Traditional Per-flow Policing Problem

If we claim to be able to emulate per-flow policing with bulk policing at trust boundaries, we need to know exactly what we are emulating. So, even though we expect it to become a historic practice, we will start from the traditional scenario with per-flow policing at trust boundaries to explain why it has always been considered necessary.

To be able to take advantage of a reservation-based service such as controlled load, a source must reserve resources using a signalling protocol such as RSVP [RFC2205]. An RSVP signalling request refers to a flow of packets by its flow ID tuple (filter spec [RFC2205]) (or its security parameter index (SPI) [RFC2207] if port numbers are hidden by IPsec encryption). Other signalling protocols use similar flow identifiers. But it is insufficient to merely authorise and admit a flow based on its identifiers, for instance merely opening a pin-hole for packets with identifiers that match an admitted flow ID. Once a flow is admitted, it cannot necessarily be trusted to send packets within the rate profile it requested.

The packet rate must also be policed to keep the flow within the requested flow spec [RFC2205]. For instance, without data rate policing, a source could reserve resources for an 8 kbps audio flow but transmit a 6 Mbps video (theft of service). More subtly, the sender could generate bursts that were outside the profile it had requested.

In traditional architectures, per-flow packet rate-policing is expensive and unscalable but, without it, a network is vulnerable to such theft of service (whether malicious or accidental).
Perhaps more importantly, if flows are allowed to send more data than they were permitted, the ability of admission control to give assurances to other flows will break.

Just as sources cannot necessarily be trusted to keep within their requested flow spec, whole networks might also try to cheat. We will now set up a concrete scenario to illustrate such cheats. Imagine reservations for unidirectional flows from senders, through at least two networks, an edge network and its downstream transit provider. Imagine the edge network charges its retail customers per reservation but also has to pay its transit provider a charge per reservation. Typically, both its selling and buying charges might depend on the duration and rate of each reservation. The levels of the actual selling and buying prices are irrelevant to our discussion (most likely the network will sell at a higher price than it buys, of course).

A cheating ingress network could systematically reduce the size of its retail customers' reservation signalling requests before forwarding them to its transit provider (and systematically reinstate the responses on the way back). It would then receive an honest income from its upstream retail customer but only pay for fraudulently smaller reservations downstream. Equivalently, a cheating ingress network may feed the traffic from a number of flows into an aggregate reservation over the transit that is smaller than the total of all the flows.

Because of these fraud possibilities, in traditional QoS reservation architectures the downstream network polices at each border.
The policer checks that the actual sent data rate of each flow is within the signalled reservation.

Reservation signalling could be authenticated end to end, but this wouldn't prevent the aggregation cheat just described. For this reason, and to avoid the need for a global PKI, signalling integrity is typically only protected on a hop-by-hop basis [RFC2747].

A variant of the above cheat is where a router in an honest downstream network denies admission to a new reservation, but a cheating upstream network still admits the flow. For instance, the networks may be using Diffserv internally, but Intserv admission control at their borders [RFC2998]. The cheat would only work if they were using bulk Diffserv traffic policing at their borders, perhaps to avoid the cost/complexity of Intserv border policing. As far as the cheating upstream network is concerned, it gets the revenue from the reservation, but it doesn't have to pay any downstream wholesale charges and the congestion is in someone else's network. The cheating network may calculate that most of the flows affected by congestion in the downstream network aren't likely to be its own. It may also calculate that the downstream router has been configured to deny admission to new flows in order to protect bandwidth assigned to other network services (e.g. enterprise VPNs). So the cheating network can steal capacity from the downstream operator's VPNs that are probably not actually congested.

To summarise, in traditional reservation signalling architectures, if a network cannot trust a neighbouring upstream network to rate-police each reservation, it has to check for itself that the data rate fits within each of the reservations it has admitted.

3.2. Generic Scenario

We will now describe a generic internetworking scenario that we will use to describe and to test our bulk policing proposal.
It consists of a number of networks and endpoints that do not fully trust each other to behave. In Section 6 we will tie down exactly what we mean by partial trust, and we will consider the various combinations where some networks do not trust each other and others are colluding together.

   _    ___      _____________________________________       ___    _
  | |  |   |   _|__    ______    ______    ______    _|__   |   |  | |
  | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
  | |  |   |  |    |  |Inter-|  |Inter-|  |Inter-|  |    |  |   |  | |
  | |  |   |  |    |  | ior  |  | ior  |  | ior  |  |    |  |   |  | |
  | |  |   |  |    |  |Domain|  |Domain|  |Domain|  |    |  |   |  | |
  | |  |   |  |    |  |  A   |  |  B   |  |  C   |  |    |  |   |  | |
  | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
  | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
  | |  |   |  |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |  |   |  | |
  | |==|   |==|Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |==|   |=>| |
  | |  |   |  |G/W |  | |  | |  | |  | |  | |  | |  |G/W |  |   |  | |
  | |  |   |  +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+  |   |  | |
  | |  |   |  |    |  |      |  |      |  |      |  |    |  |   |  | |
  | |  |   |  |____|  |______|  |______|  |______|  |____|  |   |  | |
  |_|  |___|    |_____________________________________|     |___|  |_|

  Sx   Ingress              Diffserv region               Egress   Rx
  End  Access                                             Access   End
  Host Network                                            Network Host

               <-------- edge-to-edge signalling ------->
                        (for admission control)

  <-------------------end-to-end QoS signalling protocol------------->

      Figure 1: Generic Scenario (see text for explanation of terms)

An ingress and egress gateway (Ingr G/W and Egr G/W in Figure 1) connect the interior Diffserv region to the edge access networks where routers (not shown) use per-flow reservation processing. Within the Diffserv region are three interior domains, A, B and C, as well as the inward facing interfaces of the ingress and egress gateways. An ingress and egress border router (BR) is shown interconnecting each interior domain with the next. There may be other interior routers (not shown) within each interior domain.

In two paragraphs we now briefly recap how pre-congestion notification is intended to be used to control flow admission to a large Diffserv region. The first paragraph describes data plane functions and the second describes signalling in the control plane. We omit many details from [CL-deploy] including behaviour during routing changes. For brevity here we assume other flows are already in progress across a path through the Diffserv region before a new one arrives, but how bootstrap works is described in Section 4.3.2.

Figure 1 shows a single simplex reserved flow from the sending (Sx) end host to the receiving (Rx) end host. The ingress gateway polices incoming traffic within its admitted reservation and remarks it to turn on an ECN-capable codepoint [RFC3168] and the controlled load (CL) Diffserv codepoint. Together, these codepoints define which traffic is entitled to the enhanced scheduling of the CL behaviour aggregate on routers within the Diffserv region. The CL PHB of interior routers consists of a scheduling behaviour and a new ECN marking behaviour that we call 'pre-congestion notification' [PCN]. The CL PHB simply re-uses the definition of expedited forwarding (EF) [RFC3246] for its scheduling behaviour. But it incorporates a new ECN marking behaviour, which sets the ECN field of an increasing number of CL packets to the admission marked (AM) codepoint as they approach a threshold rate that is lower than the line rate. The use of virtual queues ensures real queues have hardly built up any congestion delay. The level of marking detected at the egress of the Diffserv region is then used by the signalling system in order to determine admission control as follows.

The end-to-end QoS signalling (e.g. RSVP) for a new reservation takes one giant hop from ingress to egress gateway, because interior routers within the Diffserv region are configured to ignore RSVP.
The egress gateway holds flow state because it takes part in the end-to-end reservation. So it can classify all packets by flow and it can identify all flows that have the same previous RSVP hop (a CL-region-aggregate). For each CL-region-aggregate of flows in progress, the egress gateway maintains a per-packet moving average of the fraction of pre-congestion-marked traffic. Once an RSVP PATH message for a new reservation has hopped across the Diffserv region and reached the destination, an RSVP RESV message is returned. As the RESV message passes, the egress gateway piggy-backs the relevant pre-congestion level onto it [RSVP-ECN]. Again, interior routers ignore the RSVP message, but the ingress gateway strips off the pre-congestion level. If the pre-congestion level is above a threshold, the ingress gateway denies admission to the new reservation, otherwise it returns the original RESV signal back towards the data sender.

Once a reservation is admitted, its traffic will always receive low delay service for the duration of the reservation. This is because ingress gateways ensure that traffic not under a reservation cannot pass into the Diffserv region with the CL DSCP set. So non-reserved traffic will always be treated with a lower priority PHB at each interior router. And even if some disaster re-routes traffic after it has been admitted, if the traffic through any resource tips over a fail-safe threshold, pre-congestion notification will trigger flow pre-emption to very quickly bring every router within the whole Diffserv region back below its operating point.

The whole admission control system just described deliberately confines per-flow processing to the access edges of the network, where it will not limit the system's scalability. But ideally we want to extend this approach to multiple networks, to take even more advantage of its scaling potential.
We would still need per-flow processing at the access edges of each network, but not at the high speed interfaces where they interconnect. Even though such an admission control system would work technically, it would gain us no scaling advantage if each network also wanted to police the rate of each admitted flow for itself: border routers would still have to do complex packet operations per-flow anyway, given they don't trust upstream networks to do their policing for them.

This memo describes how to emulate per-flow rate policing using bulk mechanisms at border routers, so the full scalability potential of pre-congestion notification is not limited by the need for per-flow policing mechanisms at borders, which would make borders the most cost-critical pinch-points. Then we can achieve the long sought-for vision of secure Internet-wide bandwidth reservations without needing per-flow processing at all in core and border routers, where scalability is most critical.

4. Re-ECN Protocol for an RSVP (or similar) Transport

4.1. Protocol Overview

First we need to recap the way routers accumulate congestion marking along a path. Each ECN-capable router marks some packets with CE, the marking probability increasing with the length of the queue at its egress link. The only difference with pre-congestion marking [PCN] is that marking is based on the length of a virtual queue, so that the real queue occupancy can remain very low. We will use the terms congestion and pre-congestion interchangeably in the following unless it is important to distinguish between them.

With multiple ECN-capable routers on a path, the ECN field accumulates the fraction of CE marking that each router adds. The combined effect of the packet marking of all the routers along the path signals congestion of the whole path to the receiver.
So, for example, if one router early in a path is marking 1% of packets and another later in a path is marking 2%, flows that pass through both routers will experience approximately 3% marking.

The packets crossing an inter-domain trust boundary within the Diffserv region will all have come from different ingress gateways and will all be destined for different egress gateways. We will show that the key to policing against theft of service is for a border router to be able to directly measure the congestion that is about to be caused by the traffic it forwards. That is, it can measure locally the congestion on each of the downstream paths between itself and the egress gateways that its traffic is destined for.

With the original ECN protocol, if CE markings crossing the border had been counted over a period, they would have represented the accumulated upstream congestion that had already been experienced by those packets. The general idea of re-ECN is for the ingress gateway to continuously encode path congestion into the IP header where, in this case, 'path' means from ingress to egress gateway. Then at any point on that path (e.g. between domains A & B in Figure 2 below), IP headers can be monitored to subtract upstream congestion from expected path congestion in order to give the expected downstream congestion still to be experienced until the egress gateway.

Importantly, it turns out that there is no need to monitor downstream congestion on a per-flow basis. We will show that accounting for it in bulk across all flows will be sufficient.
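
As a rough numerical illustration of the two ideas above (marking accumulating along a path, and downstream congestion obtained by subtraction), the following Python sketch reworks the 1% and 2% example. It is only a sketch: the function names are hypothetical, and it assumes routers mark packets independently of one another, which the memo does not state.

```python
def combined_marking(per_router_fractions):
    """Fraction of packets marked by at least one router on the path.

    Assumption for illustration only: routers mark independently.
    """
    unmarked = 1.0
    for fraction in per_router_fractions:
        unmarked *= 1.0 - fraction
    return 1.0 - unmarked


def expected_downstream(path_congestion, upstream_congestion):
    """Approximate downstream congestion at a border: v ~= p - u."""
    return path_congestion - upstream_congestion


# Routers marking 1% and 2%, with the border placed between them.
p = combined_marking([0.01, 0.02])  # expected path congestion
u = combined_marking([0.01])        # congestion already experienced upstream
v = expected_downstream(p, u)       # congestion still to be experienced

print(f"p = {p:.4f}, u = {u:.4f}, v ~= {v:.4f}")
# prints: p = 0.0298, u = 0.0100, v ~= 0.0198
```

Under the independence assumption the path figure is 2.98% rather than exactly 3%, and the subtraction gives 1.98% rather than the downstream router's 2%, which is why both results are described as approximate.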

        _____________________________________
      _|__    ______    ______    ______    _|__
     |    |  |  A   |  |  B   |  |  C   |  |    |
     +----+  +-+  +-+  +-+  +-+  +-+  +-+  +----+
     |    |  |B|  |B|  |B|  |B|  |B|  |B|  |    |
     |Ingr|==|R|  |R|==|R|  |R|==|R|  |R|==|Egr |
     |G/W |  | |  | |: | |  | |  | |  | |  |G/W |
     +----+  +-+  +-+: +-+  +-+  +-+  +-+  +----+
     |    |  |      |: |      |  |      |  |    |
     |____|  |______|: |______|  |______|  |____|
       |_____________:_______________________|
                     :
       |             :                       |
       |<-upstream-->:<-expected downstream->|
       | congestion  :      congestion       |
       |      u             v ~= p - u       |
       |                                     |
       |<--- expected path congestion, p --->|

                    Figure 2: Re-ECN concept

4.2. Re-ECN Abstracted Network Layer Wire Protocol (IPv4 or v6)

In this section we define the names of the various codepoints of the re-ECN protocol when used with pre-congestion notification, deferring description of their semantics to the following sections. But first we recap the re-ECN wire protocol proposed in [Re-TCP].

4.2.1. Re-ECN Recap

Re-ECN uses the two bit ECN field broadly as in RFC3168 [RFC3168]. It also uses a new re-ECN extension (RE) flag. The actual position of the RE flag is different between IPv4 & v6 headers so we will use an abstraction of the IPv4 and v6 wire protocols by just calling it the RE flag. [Re-TCP] proposes using bit 48 (currently unused) in the IPv4 header for the RE flag, while for IPv6 it proposes an ECN extension header.

Unlike the ECN field, the RE flag is intended to be set by the sender and remain unchanged along the path, although it can be read by network elements that understand the re-ECN protocol. In the scenario used in this memo, the ingress gateway acts as a proxy for the sender, setting the RE flag as permitted in the specification of re-ECN.

Note that general-purpose routers do not have to read the RE flag, only special policing elements at borders do.
And no general-purpose routers have to change the RE flag, although the ingress and egress gateways do because in the edge-to-edge deployment model we are using, they act as proxies for the endpoints. Therefore the RE flag does not even have to be visible to interior routers. So the RE flag has no implications for protocols like MPLS. Congested label switching routers (LSRs) would have to be able to notify their congestion with an ECN/PCN codepoint in the MPLS shim [ECN-MPLS], but like any interior IP router, they can be oblivious to the RE flag, which need only be read by border policing functions.

Although the RE flag is a separate, single bit field, it can be read as an extension to the two-bit ECN field; the three concatenated bits in what we will call the extended ECN field (EECN) make eight codepoints available.
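
The concatenation just described can be sketched in a few lines of Python. This is an illustration only: the helper name `eecn` is hypothetical, and placing the RE flag as the least significant of the three bits is an assumption made here for readability, not something the wire protocol mandates (the RE flag sits elsewhere in the real IPv4 and IPv6 headers).

```python
# RFC3168 names for the two-bit ECN field.
RFC3168_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def eecn(ecn_field, re_flag):
    """Abstract 3-bit extended ECN value: ECN field followed by RE flag.

    Assumption: the RE flag is appended as the right-most bit purely as
    an abstraction for this sketch.
    """
    if ecn_field not in RFC3168_NAMES:
        raise ValueError("ECN field must be a 2-bit value")
    if re_flag not in (0, 1):
        raise ValueError("RE flag must be 0 or 1")
    return (ecn_field << 1) | re_flag


# Enumerate every combination: four ECN values times two RE settings.
ALL_EECN = sorted(eecn(e, r) for e in RFC3168_NAMES for r in (0, 1))

for value in ALL_EECN:
    ecn_field, re_flag = value >> 1, value & 1
    print(f"{value:03b}: ECN={RFC3168_NAMES[ecn_field]}, RE={re_flag}")
```

The loop lists eight distinct values, one per combination of ECN field and RE flag.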
When the RE flag setting is "don't care", we use the RFC3168 names of the ECN codepoints, but [Re-TCP] proposes the following six codepoint names for when there is a need to be more specific.

+-------+------------+------+---------------+-----------------------+
|  ECN  | RFC3168    |  RE  | Extended ECN  | Re-ECN meaning        |
| field | codepoint  | flag | codepoint     |                       |
+-------+------------+------+---------------+-----------------------+
|  00   | Not-ECT    |  0   | Not-RECT      | Not re-ECN-capable    |
|       |            |      |               | transport             |
|  00   | Not-ECT    |  1   | FNE           | Feedback not          |
|       |            |      |               | established           |
|  01   | ECT(1)     |  0   | Re-Echo       | Re-echoed congestion  |
|       |            |      |               | and RECT              |
|  01   | ECT(1)     |  1   | RECT          | Re-ECN capable        |
|       |            |      |               | transport             |
|  10   | ECT(0)     |  0   | ---           | Legacy ECN use only   |
|  10   | ECT(0)     |  1   | --CU--        | Currently unused      |
|  11   | CE         |  0   | CE(0)         | Congestion            |
|       |            |      |               | experienced with      |
|       |            |      |               | Re-Echo               |
|  11   | CE         |  1   | CE(-1)        | Congestion            |
|       |            |      |               | experienced           |
+-------+------------+------+---------------+-----------------------+

   Table 1: Recap of Default Extended ECN Codepoints Proposed for Re-ECN

4.2.2. Re-ECN Combined with Pre-Congestion Notification (re-PCN)

As permitted by the ECN specification [RFC3168], a proposal is currently being advanced in the IETF to define different semantics for how routers might mark the ECN field of certain packets.
The idea is to be able to notify congestion when the router's load approaches a logical limit, rather than the physical limit of the line. This new marking is called pre-congestion notification [PCN] and we will use the term PCN-enabled router for a router that can apply pre-congestion notification marking to the ECN fields of packets.

[RFC3168] recommends that a packet's Diffserv codepoint should determine which type of ECN marking it receives. A Diffserv per-hop behaviour (PHB) can specify that routers should apply pre-congestion notification marking to PCN-capable packets. We will call this a PCN-enhanced PHB. A PCN-capable packet must meet two conditions: it must carry a DSCP that maps to a PCN-enhanced PHB and it must carry an ECN field that turns on PCN marking. As an example, the controlled load (CL) PHB might specify expedited forwarding as its scheduling behaviour and PCN marking as its congestion marking behaviour. Then we would say the CL PHB is a PCN-enhanced PHB, and that packets with a DSCP that maps to the CL PHB and with ECN turned on are PCN-capable packets.

[PCN] actually proposes that two logical limits should be used for pre-congestion notification, with the higher limit as a back-stop for dealing with anomalous events. It envisages PCN will be used for admission control of inelastic real-time traffic, so marking at the lower limit will trigger admission control, while at the higher limit it will trigger flow pre-emption.

Because it needs two types of congestion marking, PCN seems to need five states: Not-ECT, ECT (ECN-capable transport), the ECN Nonce, Admission Marking (AM) and Flow Pre-emption Marking (PM).
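
The two conditions for a PCN-capable packet, and the role of the two logical limits, can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: the DSCP value 46 is a placeholder chosen only for the example, "ECN turned on" is taken to mean any ECN field other than Not-ECT (00), and the marking decision is shown as a plain comparison of load against two limits, whereas [PCN] describes marking an increasing proportion of packets as load approaches each limit. All names are hypothetical.

```python
NOT_ECT = 0b00

# Placeholder: the set of DSCPs an operator has mapped to a PCN-enhanced
# PHB. The value 46 is illustrative only, not taken from the memo.
PCN_ENHANCED_DSCPS = {46}


def is_pcn_capable(dscp, ecn_field, pcn_dscps=PCN_ENHANCED_DSCPS):
    """Both conditions must hold: PCN-enhanced DSCP and ECN turned on."""
    return dscp in pcn_dscps and ecn_field != NOT_ECT


def marking_for_load(load, admission_limit, preemption_limit):
    """Simplified choice of marking type for a given load.

    Assumption: deterministic thresholds stand in for the probabilistic,
    virtual-queue based marking that a real PCN-enabled router would use.
    """
    if admission_limit > preemption_limit:
        raise ValueError("admission limit must not exceed pre-emption limit")
    if load >= preemption_limit:
        return "PM"   # flow pre-emption marking (higher, back-stop limit)
    if load >= admission_limit:
        return "AM"   # admission marking (lower limit)
    return None       # below both logical limits: no marking


print(is_pcn_capable(46, 0b01))           # True
print(is_pcn_capable(46, NOT_ECT))        # False
print(marking_for_load(0.70, 0.60, 0.80)) # AM
```

Keeping the capability check separate from the marking decision mirrors the text: the DSCP and ECN field decide whether PCN marking applies at all, and only then does the load relative to the two limits matter.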
[PCN] proposes various alternative encodings of the ECN field, attempting various compromises to fit these five states into the four available ECN codepoints. One of the five states to make room for is the ECN Nonce [RFC3540], but the capability we describe in this memo supersedes any need for the Nonce. The ECN Nonce is an elegant scheme, but it only allows a sending node (or its proxy) to detect suppression of congestion marking in the feedback loop. Thus the Nonce requires the sender or its proxy to be trusted to respond correctly to congestion. But this is precisely the main cheat we want to protect against (as well as many others).

One of the compromise protocol encodings that [PCN] explores ("Alternative 5") leaves out support for the ECN Nonce. Therefore we use that one. This encoding of PCN markings is shown on the left of Table 2. Note that these codepoints of the ECN field only take on the semantics of pre-congestion notification if they are combined with a Diffserv codepoint that the operator has configured to cause PCN marking, by mapping it to a PCN-enhanced PHB.

For the rest of this memo, we will not distinguish between Admission Marking and Pre-emption Marking unless we need to be specific. We will call both "congestion marking". With the above encoding, congestion marking can be read to mean any packet with the left-most bit of the ECN field set.

The re-ECN protocol can be used to control misbehaving sources whether congestion is with respect to a logical threshold (PCN) or the physical line rate (ECN). In either case the RE flag can be used to create an extended ECN field. For PCN-capable packets, the 8 possible encodings of this 3-bit extended ECN (EECN) field are defined on the right of Table 2 below. The purposes of these different codepoints will be introduced in subsequent sections.

+-------+-----------------+------+-------------+--------------------+
| ECN   | PCN codepoint   | RE   | Extended    | Re-ECN meaning     |
| field | (Alternative 5) | flag | ECN         |                    |
|       |                 |      | codepoint   |                    |
+-------+-----------------+------+-------------+--------------------+
| 00    | Not-ECT         | 0    | Not-RECT    | Not re-ECN-capable |
|       |                 |      |             | transport          |
| 00    | Not-ECT         | 1    | FNE         | Feedback not       |
|       |                 |      |             | established        |
| 01    | ECT(1)          | 0    | Re-Echo     | Re-echoed          |
|       |                 |      |             | congestion and     |
|       |                 |      |             | RECT               |
| 01    | ECT(1)          | 1    | RECT        | Re-ECN capable     |
|       |                 |      |             | transport          |
| 10    | AM              | 0    | AM(0)       | Admission Marking  |
|       |                 |      |             | with Re-Echo       |
| 10    | AM              | 1    | AM(-1)      | Admission Marking  |
| 11    | PM              | 0    | PM(0)       | Pre-emption        |
|       |                 |      |             | Marking with       |
|       |                 |      |             | Re-Echo            |
| 11    | PM              | 1    | PM(-1)      | Pre-emption        |
|       |                 |      |             | Marking            |
+-------+-----------------+------+-------------+--------------------+

   Table 2: Extended ECN Codepoints if the Diffserv codepoint uses Pre-congestion Notification (PCN)

4.3.  Protocol Operation

4.3.1.  Protocol Operation for an Established Flow

The re-ECN protocol involves a simple tweak to the action of the gateway at the ingress edge of the CL region. In the deployment model just described [CL-deploy], for each active traffic aggregate across the CL region (CL-region-aggregate) the ingress gateway will hold a fairly recent Congestion-Level-Estimate that the egress gateway will have fed back to it, piggybacked on the signalling that sets up each flow. For instance, one aggregate might have been experiencing 3% pre-congestion (that is, congestion marked octets whether Admission Marked or Pre-emption Marked). In this case, the ingress gateway MUST clear the RE flag to "0" for the same percentage of octets of CL-packets (3%) and set it to "1" in the rest (97%). Appendix A.1 gives a simple pseudo-code algorithm that the ingress gateway may use to do this.

The RE flag is set and cleared this way round for incremental deployment reasons (see [Re-TCP]). To avoid confusion we will use the term `blanking' (rather than marking) when the RE flag is cleared to "0", so we will talk of the `RE blanking fraction' as the fraction of octets with the RE flag cleared to "0".

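
The blanking rule for an established flow can be illustrated with a short sketch. It is not the Appendix A.1 pseudo-code; the class ReBlanker, its method names and the running-balance technique are assumptions made purely for illustration. The sketch blanks the RE flag on enough packets that the blanked share of octets tracks the Congestion-Level-Estimate fed back by the egress.

```python
class ReBlanker:
    """Chooses the RE flag for each CL-packet leaving the ingress gateway.

    Hypothetical helper, not the Appendix A.1 algorithm: it keeps a running
    balance of octets that are still owed to the blanked share.
    """

    def __init__(self, congestion_level_estimate: float) -> None:
        self.owed_octets = 0.0
        self.set_estimate(congestion_level_estimate)

    def set_estimate(self, congestion_level_estimate: float) -> None:
        """Record the latest Congestion-Level-Estimate from the egress."""
        if not 0.0 <= congestion_level_estimate <= 1.0:
            raise ValueError("estimate must be a fraction between 0 and 1")
        self.estimate = congestion_level_estimate

    def re_flag(self, size_octets: int) -> int:
        """Return 0 (blanked) or 1 (set) for a packet of the given size."""
        self.owed_octets += self.estimate * size_octets
        if self.owed_octets >= size_octets:
            # Enough octets are owed to cover this packet, so blank it.
            self.owed_octets -= size_octets
            return 0
        return 1


# The 3% example from the text, with equal-sized packets for simplicity.
blanker = ReBlanker(0.03)
flags = [blanker.re_flag(100) for _ in range(10_000)]
blanked_fraction = flags.count(0) / len(flags)
```

With equal-sized packets the blanked fraction of packets equals the blanked fraction of octets, so blanked_fraction settles at about 3%. A deterministic balance was chosen over random selection only to make the sketch easy to check; either would meet the requirement as stated.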
    ^
    |
    |    RE blanking fraction
 3% |    +----------------------------+====+
    |    |                            |    |
 2% |    |                            |    |
    |    | congestion marking fraction|    |
 1% |    |     +----------------------+    |
    |    |     |                           |
 0% +----+=====+---------------------------+------>
         ^ <--A---> <---B---> <---C--->    ^    domain
         |     ^                      ^    |
      ingress  |                      |  egress
             1.00%                  2.00%  marking fraction

   Figure 3: Example Extended ECN codepoint Marking fractions (Imprecise)

Figure 3 illustrates our example. The horizontal axis represents the index of each congestible resource (typically queues) along a path through the Internet. The two superimposed plots show the fraction of each ECN codepoint observed along this path, assuming there are two congested routers somewhere within domains A and C. And Table 3 below shows the downstream pre-congestion measured at various border observation points along the path. Figure 4 (later) shows the same results of these subtractions, but in graphical form like the above figure. The tabulated figures are actually reasonable approximations derived from more precise formulae given in Appendix A of [Re-TCP]. The RE flag is not changed by interior routers, so it can be seen that it acts as a reference against which the congestion marking fraction can be compared along the path.

+--------------------------+---------------------------------------+
| Border observation point | Approximate Downstream pre-congestion |
+--------------------------+---------------------------------------+
| ingress -- A             | 3% - 0% = 3%                          |
| A -- B                   | 3% - 1% = 2%                          |
| B -- C                   | 3% - 1% = 2%                          |
| C -- egress              | 3% - 3% = 0%                          |
+--------------------------+---------------------------------------+

   Table 3: Downstream Congestion Measured at Example Observation Points

Note that the ingress determines the RE blanking fraction for each aggregate using the most recent feedback from the relevant egress, arriving with each new reservation, or each refresh.
These updates arrive relatively infrequently compared to the speed with which congestion changes. Although this feedback will always be out of date, on average positive errors should cancel out negative over a sufficiently long duration.

In summary, the network adds pre-congestion marking in the forward data path, the egress feeds its level back to the ingress in RSVP (or similar signalling), then the ingress gateway re-echoes it into the forward data path by blanking the RE flag. Hence the name re-ECN. Then at any border within the Diffserv region, the pre-congestion marking that every passing packet will be expected to experience downstream can be measured to be the RE blanking fraction minus the congestion marking fraction.

4.3.2.  Aggregate Bootstrap

When a new reservation PATH message arrives at the egress, if there are currently no flows in progress from the same ingress, there will be no state maintaining the current level of pre-congestion marking for the aggregate. While the reservation signalling continues onward towards the receiving host, the egress gateway returns an RSVP message to the ingress with a flag [RSVP-ECN] asking the ingress to send a specified number of data probes between them. This bootstrap behaviour is all described in the deployment model [CL-deploy].

However, with our new re-ECN scheme, the ingress does not know what proportion of the data probes should have the RE flag blanked, because it has no estimate yet of pre-congestion for the path across the Diffserv region.

To be conservative, following the guidance for specifying other re-ECN transports in [Re-TCP], the ingress SHOULD set the FNE codepoint of the extended ECN header in all probe packets (Table 2).
As per the deployment model, the egress gateway measures the fraction of congestion-marked probe octets and feeds back the resulting pre-congestion level to the ingress, piggy-backed on the returning reservation response (RESV) for the new flow. Probe packets are identifiable by the egress because they have the ingress as the source and the egress as the destination in the IP header.

It may seem inadvisable to expect the FNE codepoint to be set on probes, given legacy firewalls etc. might discard such packets (because this flag had no previous legitimate use). However, in the deployment scenarios envisaged, each domain in the Diffserv region has to be explicitly configured to support the controlled load service. So, before deploying the service, the operator MUST reconfigure such a misbehaving middlebox to allow through packets with the RE flag set.

Note that we have said SHOULD rather than MUST for the FNE setting behaviour of the ingress for probe packets. This entertains the possibility of an ingress implementation having the benefit of other knowledge of the path, which it re-uses for a newly starting aggregate. For instance, it may hold cached information from a recent use of the aggregate that is still sufficiently current to be useful.

It might seem pedantic worrying about these few probe packets, but this behaviour ensures the system is safe, even if the proportion of probe packets becomes large.

4.3.3.  Flow Bootstrap

It might be expected that a new flow within an active aggregate would need no special bootstrap behaviour. If there was an aggregate already in progress between the gateways the new flow was about to use, it would inherit the prevailing RE blanking fraction. And if there were no active aggregate, the bootstrap behaviour for an aggregate would be appropriate and sufficient for the new flow.
However, for a number of reasons, at least the first packet of each new flow SHOULD be set to the FNE codepoint, irrespective of whether it is joining an active aggregate or not. If the first packet is unlikely to be reliably delivered, a number of FNE packets MAY be sent to increase the probability that at least one is delivered to the egress gateway.

If each flow does not start with an FNE packet, it will be seen later that sanctions may be too strict at the interface before the egress gateway. It will often be possible to apply sanctions at the granularity of aggregates rather than flows, but in an internetworked environment it cannot be guaranteed that aggregates will be identifiable in remote networks. So setting FNE at the start of each flow is a safe strategy. For instance, a remote network may have equal cost multi-path (ECMP) routing enabled, causing different flows between the same gateways to traverse different paths.

After an idle period of more than 1 second, the ingress gateway SHOULD set the EECN field of the next packet it sends to FNE. This allows the design of network policers to be deterministic (see [Re-TCP]).

However, if the ingress gateway can guarantee that the network(s) that will carry the flow to its egress gateway all use a common identifier for the aggregate (e.g. a single MPLS network without ECMP routing), it MAY omit setting FNE when it adds a new flow to an active aggregate. And an FNE packet need only be sent if a whole aggregate has been idle for more than 1 second.

4.3.4.  Router Forwarding Behaviour

Adding re-ECN works well without modifying the forwarding behaviour of any routers. However, below, two changes are proposed when forwarding packets with a per-hop-behaviour that requires pre-congestion notification:

Preferential drop:  When a router cannot avoid dropping ECN-capable packets, preferential dropping of packets with different extended ECN codepoints SHOULD be implemented between packets within a PHB that uses PCN marking. The drop preference order to use is defined in Table 4. Note that to reduce configuration complexity, Re-Echo and FNE MAY be given the same drop preference, but if feasible, FNE should be dropped in preference to Re-Echo.

+--------+------+----------------+---------+------------------------+
| ECN    | RE   | Extended ECN   | Drop    | Re-ECN meaning         |
| field  | flag | codepoint      | Pref    |                        |
+--------+------+----------------+---------+------------------------+
| 01     | 0    | Re-Echo        | 5/4     | Re-echoed congestion   |
|        |      |                |         | and RECT               |
| 00     | 1    | FNE            | 4       | Feedback not           |
|        |      |                |         | established            |
| 01     | 1    | RECT           | 3       | Re-ECN capable         |
|        |      |                |         | transport              |
| 10     | 0    | AM(0)          | 3       | Admission Marking with |
|        |      |                |         | Re-Echo                |
| 10     | 1    | AM(-1)         | 3       | Admission Marking      |
| 11     | 0    | PM(0)          | 2       | Pre-emption Marking    |
|        |      |                |         | with Re-Echo           |
| 11     | 1    | PM(-1)         | 2       | Pre-emption Marking    |
| 00     | 0    | Not-RECT       | 1       | Not re-ECN-capable     |
|        |      |                |         | transport              |
+--------+------+----------------+---------+------------------------+

   Table 4: Drop Preference of Extended ECN Codepoints (1 = drop 1st)

Given this proposal is being advanced at the same time as PCN itself, we strongly RECOMMEND that preferential drop based on extended ECN codepoint is added to router forwarding at the same time as PCN marking. Preferential dropping can be difficult to implement, but we strongly RECOMMEND this security-related re-ECN improvement where feasible as it is an effective defence against flooding attacks.

Marking vs. Drop:  We propose that PCN-routers SHOULD inspect the RE flag as well as the ECN field to decide whether to drop or mark PCN DSCPs. They MUST choose drop if the codepoint of this extended ECN field is Not-RECT.
Otherwise they SHOULD mark (unless, of course, buffer space is exhausted).

A PCN-capable router MUST NOT ever congestion mark a packet carrying the Not-RECT codepoint because the transport will only understand drop, not congestion marking. But a PCN-capable router can mark rather than drop an FNE packet, even though its ECN field when looked at in isolation is '00' which appears to be a legacy Not-ECT packet. Therefore, if a packet's RE flag is '1', even if its ECN field is '00', a PCN-enabled router SHOULD use congestion marking. This allows the `feedback not established' (FNE) codepoint to be used for probe packets, in order to pick up PCN marking when bootstrapping an aggregate.

ECN marking rather than dropping of FNE packets MUST only be deployed in controlled environments, such as that in [CL-deploy], where the presence of an egress node that understands ECN marking is assured. Congestion events might otherwise be ignored if the receiver only understands drop, rather than ECN marking. This is because there is no guarantee that ECN capability has been negotiated if feedback is not established (FNE).
Also, [Re-TCP] places the strong condition that a router MUST apply drop rather than marking to FNE packets unless it can guarantee that FNE packets are rate limited either locally or upstream.

4.3.5.  Extensions

If a different signalling system, such as NSIS, were used, but it provided admission control in a similar way, using pre-congestion notification (e.g. with RMD [NSIS-RMD]) we believe re-ECN could be used to protect against misbehaving networks in the same way as proposed above.

5.  Emulating Border Policing with Re-ECN

5.1.  Informal Terminology

In the rest of this memo, where the context makes it clear, we will sometimes loosely use the term `congestion' rather than using the stricter `downstream pre-congestion'. Also we will loosely talk of positive or negative flows, meaning flows where the moving average of the downstream pre-congestion metric is persistently positive or negative. The notion of a negative metric arises because it is derived by subtracting one metric from another. Of course actual downstream congestion cannot be negative, only the metric can (whether due to time lags or deliberate malice).

Just as we will loosely talk of positive and negative flows, we will also talk of positive or negative packets, meaning packets that contribute positively or negatively to downstream pre-congestion.
Therefore packets can be considered to have a `worth' of +1, 0 or -1, which, when multiplied by their size, indicates their contribution to downstream congestion. Packets will usually be sent with a worth of 0. Blanking the RE flag increments the worth of a packet to +1. Congestion marking a packet decrements its worth (whether admission marking or pre-emption marking). Congestion marking a previously blanked packet cancels out the positive and negative worth of each marking (a worth of 0). The FNE codepoint is an exception. It has the same positive worth as a packet with the Re-Echo codepoint. The table below specifies unambiguously the worth of each extended ECN codepoint. Note the order is different from the previous table to emphasise how congestion marking processes decrement the worth.

+--------+------+------------------+-------+------------------------+
| ECN    | RE   | Extended ECN     | Worth | Re-ECN meaning         |
| field  | flag | codepoint        |       |                        |
+--------+------+------------------+-------+------------------------+
| 00     | 0    | Not-RECT         | n/a   | Not re-ECN-capable     |
|        |      |                  |       | transport              |
| 01     | 0    | Re-Echo          | +1    | Re-echoed congestion   |
|        |      |                  |       | and RECT               |
| 10     | 0    | AM(0)            | 0     | Admission Marking with |
|        |      |                  |       | Re-Echo                |
| 11     | 0    | PM(0)            | 0     | Pre-emption Marking    |
|        |      |                  |       | with Re-Echo           |
| 00     | 1    | FNE              | +1    | Feedback not           |
|        |      |                  |       | established            |
| 01     | 1    | RECT             | 0     | Re-ECN capable         |
|        |      |                  |       | transport              |
| 10     | 1    | AM(-1)           | -1    | Admission Marking      |
| 11     | 1    | PM(-1)           | -1    | Pre-emption Marking    |
+--------+------+------------------+-------+------------------------+

   Table 5: 'Worth' of Extended ECN Codepoints

5.2.  Policing Overview

It will be recalled that downstream congestion can be found by subtracting upstream congestion from path congestion. Figure 4 displays the difference between the two plots in Figure 3 to show downstream pre-congestion across the same path through the Internet.

To emulate border policing, the general idea is for each domain to apply penalties to its upstream neighbour in proportion to the amount of downstream pre-congestion that the upstream network sends across the border. That is, the penalties should be in proportion to the height of the plot. Downward arrows in the figure show the resulting pressure for each domain to under-declare downstream pre-congestion in traffic they pass to the next domain, because of the penalties.

           p e n a l t i e s
           /       |       \
    A    :         :         :
    |    | <--A---> <---B---> <---C--->    domain
    |    V         :         :             :
 3% |    +-----+   |         |             :
    |    |     |   V         V             :
 2% |    |     +----------------------+    :
    |    | downstream pre-congestion  |    :
 1% |    |     :                      |    :
    |    |     :                      |    :
 0% +----+----------------------------+====+------>
         :     :                      :    A
         :     :                      :    |
      ingress  :                      :  egress
             1.00%                  2.00%  pre-congestion
                                           |
                                        sanctions

   Figure 4: Policing Framework, showing creation of opposing pressures to under-declare and over-declare downstream pre-congestion, using penalties and sanctions

These penalties seem to encourage everyone to understate downstream congestion in order to reduce the penalties they incur. But a balancing pressure is introduced by the last domain, which applies sanctions to flows if downstream congestion goes negative before the egress gateway. The upward arrow at Domain C's border with the egress gateway represents the incentive the sanctions would create to prevent negative traffic. The same upward pressure can be applied at any domain border (arrows not shown).

Any flow that persistently goes negative by the time it leaves a domain must not have been marked correctly in the first place. A domain that discovers such a flow can adopt a range of strategies to protect itself. Which strategy it uses will depend on policy, because it cannot immediately assume malice; there may be an innocent configuration error somewhere in the system.

This memo does not propose to standardise any particular mechanism to detect persistently negative flows, but Section 5.5 does give examples. Note that we have used the term flow, but there will be no need to dig into the transport layer for port numbers; identifiers visible in the network layer will be sufficient (IP address pair, DSCP, protocol ID). The appendix also gives a mechanism to bound the required flow state, preventing state exhaustion attacks.

Of course, some domains may trust other domains to comply with admission control without applying sanctions or penalties. In these cases, the protocol should still be used but no penalties need be applied. The re-ECN protocol ensures downstream pre-congestion marking is passed on correctly whether or not penalties are applied to it, so the system works just as well with a mixture of some domains trusting each other and others not.

Providers should be free to agree the contractual terms they wish between themselves, so this memo does not propose to standardise how these penalties would be applied.
It is sufficient to standardise the re-ECN protocol so the downstream pre-congestion metric is available if providers choose to use it. However, the next section (Section 5.3) gives some examples of how these penalties might be implemented.

5.3.  Pre-requisite Contractual Arrangements

The re-ECN protocol has been chosen to solve the policing problem because it embeds a downstream pre-congestion metric in passing CL traffic that is difficult to lie about and can be measured in bulk. The ability to emulate border policing depends on network operators choosing to use this metric as one of the elements in their contracts with each other.

Already many inter-domain agreements involve a capacity and a usage element. The usage element may be based on volume or various measures of peak demand. We expect that those network operators who choose to use pre-congestion notification for admission control would also be willing to consider using this downstream pre-congestion metric as a usage element in their interconnection contracts for admission controlled (CL) traffic.

Congestion (or pre-congestion) has the dimension of [octet], being the product of volume transferred [octet] and the congestion fraction [dimensionless], which is the fraction of the offered load that the network isn't able to serve (or would rather not serve in the case of pre-congestion). Measuring downstream congestion gives a measure of the volume transferred but modulated by congestion expected downstream. So volume transferred during off-peak periods counts as nearly nothing, while volume transferred at peak times counts very highly. The re-ECN protocol allows one network to measure how much pre-congestion has been `dumped' into it by another network, and then in turn how much of that pre-congestion it dumped into the next downstream network.

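
As a rough illustration of how such a volume of downstream congestion might be accumulated at a border, the sketch below weights the size of each passing packet by the worth of its extended ECN codepoint (Table 5). It is not the Appendix A.2 algorithm; the BorderMeter class, its field names and the string encoding of the ECN field are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Worth of each extended ECN codepoint, keyed by (ECN field, RE flag),
# copied from Table 5. Not-RECT ("00", 0) has no worth and is left out.
WORTH = {
    ("01", 0): +1,  # Re-Echo
    ("10", 0): 0,   # AM(0)
    ("11", 0): 0,   # PM(0)
    ("00", 1): +1,  # FNE
    ("01", 1): 0,   # RECT
    ("10", 1): -1,  # AM(-1)
    ("11", 1): -1,  # PM(-1)
}


@dataclass
class BorderMeter:
    """Hypothetical bulk meter for PCN-capable traffic crossing a border."""

    total_octets: int = 0
    downstream_congestion_octets: int = 0

    def observe(self, ecn_field: str, re_flag: int, size_octets: int) -> None:
        worth = WORTH.get((ecn_field, re_flag))
        if worth is None:
            return  # Not-RECT or unknown: not part of this measurement
        self.total_octets += size_octets
        self.downstream_congestion_octets += worth * size_octets

    def downstream_fraction(self) -> float:
        """Downstream congestion as a fraction of the volume observed."""
        if self.total_octets == 0:
            return 0.0
        return self.downstream_congestion_octets / self.total_octets


# Traffic resembling the A -- B border of Table 3: 3% of octets blanked and
# 1% congestion marked, assuming no packet is both blanked and marked.
meter = BorderMeter()
for _ in range(300):
    meter.observe("01", 0, 100)   # Re-Echo
for _ in range(100):
    meter.observe("10", 1, 100)   # AM(-1)
for _ in range(9_600):
    meter.observe("01", 1, 100)   # RECT
```

For this traffic the meter accumulates 20,000 octets of downstream congestion out of 1,000,000 octets transferred, that is 2%, in line with the A -- B row of Table 3. The meter needs no per-flow state, which is the sense in which the metric can be measured in bulk.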
Section 5.6 describes mechanisms for calculating border penalties referring to Appendix A.2 for suggested metering algorithms for downstream congestion at a border router. Conceptually, it could hardly be simpler. It broadly involves accumulating the volume of packets with the RE flag blanked and the volume of those with congestion marking then subtracting the two.

Once this downstream pre-congestion metric is available, operators are free to choose how they incorporate it into their interconnection contracts [IXQoS]. Some may include a threshold volume of pre-congestion as a quality measure in their service level agreement, perhaps with a penalty clause if the upstream network exceeds this threshold over, say, a month.
Others may agree a set of tiered monthly thresholds, with increasing penalties as each threshold is exceeded. But, it would be just as easy, and more resistant to gaming, to do away with discrete thresholds, and instead make the penalty rise smoothly with the volume of pre-congestion by applying a price to pre-congestion itself. Then the usage element of the interconnection contract would directly relate to the volume of pre-congestion caused by the upstream network.

The direction of penalties and charges relative to the direction of traffic flow is a constant source of confusion. Typically, where capacity charges are concerned, lower tier customer networks pay higher tier provider networks. So money flows from the edges to the middle of the internetwork, towards greater connectivity, irrespective of the flow of data. But we advise that penalties or charges for usage should follow the same direction as the data flow---the direction of control at the network layer. Otherwise a network lays itself open to `denial of funds' attacks. So, where a tier 2 provider sends data into a tier 3 customer network, we would expect the penalty clauses for sending too much pre-congestion to be against the tier 2 network, even though it is the provider.

It may help to remember that data will be flowing in the other direction too. So the provider network has as much opportunity to levy usage penalties as its customer, and it can set the price or strength of its own penalties higher if it chooses.
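The difference between tiered thresholds and a smooth price can be sketched as follows. Both functions, the tier values and the price are hypothetical; in particular, the rule that the penalty of the highest exceeded tier applies is only one assumed reading of a tiered contract.

```python
def tiered_penalty(congestion_octets, tiers):
    """tiers: (threshold_octets, penalty) pairs in ascending order.

    Assumed rule: the penalty of the highest exceeded threshold applies.
    """
    penalty = 0.0
    for threshold, tier_penalty in tiers:
        if congestion_octets > threshold:
            penalty = tier_penalty
    return penalty


def linear_penalty(congestion_octets, price_per_octet):
    """Penalty rising smoothly with the volume of pre-congestion."""
    return price_per_octet * congestion_octets


# Hypothetical monthly tiers.
TIERS = [(1_000_000, 100.0), (2_000_000, 250.0)]
```

Under the tiered rule an upstream network sitting just below a threshold pays the same as one far below it, which is the kind of gaming a smooth price avoids.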
Usage charges in both directions tend to cancel each other out, which confirms that usage-charging is less to do with revenue raising and more to do with encouraging load control discipline in order to smooth peaks and troughs, improving utilisation and quality.

Further, when operators agree penalties in their interconnection contracts for sending downstream congestion, they should make sure that any level of negative marking only equates to zero penalty. In other words, penalties are always paid in the same direction as the data, and never against the data flow, even if downstream congestion seems to be negative. This is consistent with the definition of physical congestion; when a resource is underutilised, it is not negatively congested. Its congestion is just zero. So, although short periods of negative marking can be tolerated to correct temporary over-declarations due to lags in the feedback system, persistent downstream negative congestion can have no physical meaning and therefore must signify a problem. The incentive for domains not to tolerate persistently negative traffic depends on this principle that penalties must never be paid against the data flow.
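A minimal sketch of these two points, under assumed names and figures: each direction of traffic is penalised separately at the receiving network's price, a negative balance in either direction equates to zero penalty, and the two penalties are then netted.

```python
def usage_penalty(downstream_congestion_octets, price_per_octet):
    """Penalty for one direction of traffic; negative counts as zero."""
    return price_per_octet * max(0, downstream_congestion_octets)


def net_payment_a_to_b(v_a_to_b, v_b_to_a, price_b, price_a):
    """Net amount network A pays network B (hypothetical settlement).

    v_a_to_b: downstream congestion in traffic A sends into B (octets)
    v_b_to_a: downstream congestion in traffic B sends into A (octets)
    price_b, price_a: per-octet prices set by B and by A respectively
    """
    return usage_penalty(v_a_to_b, price_b) - usage_penalty(v_b_to_a, price_a)
```

A negative result means the net payment flows from B to A. It arises only from netting two penalties that each follow their own data direction, never from a negative congestion balance.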
Also note that at the last egress of the Diffserv region, domain C should not agree to pay any penalties to the egress gateway for pre-congestion passed to the egress gateway. Downstream pre-congestion to the egress gateway should have reached zero here. If domain C were to agree to pay for any remaining downstream pre-congestion, it would give the egress gateway an incentive to over-declare pre-congestion feedback and take the resulting profit from domain C.

To focus the discussion, from now on, unless otherwise stated, we will assume a downstream network charges its upstream neighbour in proportion to the pre-congestion it sends (V_b in the notation of Appendix A.2). Effectively, tiered thresholds would be just more coarse-grained approximations of the fine-grained case we choose to examine.
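The proportional charging assumed here can be sketched as a one-line bill calculation. The function name, the fixed-charge parameter and all figures are assumptions for illustration; V_b is taken to be the non-negative volume of pre-congestion metered at the border over the accounting period.

```python
def monthly_bill(price_per_octet, v_b_octets, fixed_charges=0.0):
    """Usage element proportional to V_b, plus any agreed fixed charges."""
    if v_b_octets < 0:
        raise ValueError("V_b is assumed non-negative in this sketch")
    return price_per_octet * v_b_octets + fixed_charges


# Hypothetical month: 5 MB of pre-congestion at 0.002 per octet,
# plus a capacity charge of 20000.
bill = monthly_bill(0.002, 5_000_000, fixed_charges=20_000.0)
```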
If these neighbours had previously agreed that the (fixed) price per octet of pre-congestion would be L, then the bill at the end of the month would simply be the product L*V_b, plus any fixed charges they may also have agreed.

We are well aware that the IETF tries to avoid standardising technology that depends on a particular business model. Indeed, this principle is at the heart of all our own work. Our aim here is to make a new metric available that we believe is superior to all existing metrics. Then, our aim is to show that border policing can at least work with the one model we have just outlined. We assume that operators might then experiment with the metric in other models. Of course, operators are free to complement this pre-congestion-based usage element of their charges with traditional capacity charging, and we expect they will.

Also note well that everything we discuss in this memo only concerns interconnection within the Diffserv region. ISPs are free to sell or give away reservations however they want on the retail market. But of course, interconnection charges will have a bearing on that. Indeed, in the present scenario, the ingress gateway effectively sells reservations on one side and buys congestion penalties on the other. As congestion rises, one can imagine the gateway discovering that congestion penalties have risen higher than the (probably fixed) revenue it will earn from selling the next flow reservation. This encourages the gateway to cut its losses by blocking new calls, which is why we believe downstream congestion penalties can emulate per-flow rate policing at borders, as the next section explains.

5.4. Emulation of Per-Flow Rate Policing: Rationale and Limits

The important feature of charging in proportion to congestion volume is that the penalty aggregates and disaggregates correctly along with packet flows.
This is because the penalty rises linearly with bit rate (unless congestion is absolutely zero) and linearly with congestion, because it is the product of them both. So if the packets crossing a border belong to a thousand flows, and one of those flows doubles its rate, the ingress gateway forwarding that flow will have to put twice as much congestion marking into the packets of that flow. And this extra congestion marking will add proportionately to the penalties levied at every border the flow crosses in proportion to the amount of pre-congestion remaining on the path.

Effectively, usage charges will continuously flow from ingress gateways to the places generating pre-congestion marking, in proportion to the pre-congestion marking introduced and to the data rates from those gateways.

As importantly, pre-congestion itself rises super-linearly with utilisation of a particular resource. So if someone tries to push another flow into a path that is already signalling enough pre-congestion to warrant admission control, the penalty will be a lot greater than it would have been to add the same flow to a less congested path.
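The aggregation property can be checked numerically. The sketch below assumes a single border, a common congestion fraction for all flows and invented rates and prices; it shows only that a penalty linear in rate gives the same total whether computed per flow or on the aggregate.

```python
def flow_penalty(rate_octets_per_s, duration_s, congestion_fraction, price):
    """Penalty = price * volume * congestion fraction (linear in rate)."""
    return price * rate_octets_per_s * duration_s * congestion_fraction


# Hypothetical border carrying a thousand equal flows for one minute.
rates = [1000.0] * 1000
per_flow_total = sum(flow_penalty(r, 60, 0.001, 1.0) for r in rates)
aggregate = flow_penalty(sum(rates), 60, 0.001, 1.0)

# One flow doubles its rate: the total rises by exactly that flow's
# original penalty, with no per-flow state needed at the border.
rates_doubled = [2000.0] + rates[1:]
doubled_total = flow_penalty(sum(rates_doubled), 60, 0.001, 1.0)
```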
This makes the incentive system fairly insensitive to the actual level of pre-congestion for triggering admission control that each ingress chooses. The deterrent against exceeding whatever threshold is chosen rises very quickly with a small amount of cheating.

These are the properties that allow re-ECN to emulate per-flow border policing of both rate and admission control. It is not a perfect emulation of per-flow border policing, but we claim it is sufficient to at least ensure the cost to others of a cheat is borne by the cheater, because the penalties are at least proportionate to the level of the cheat. If an edge network operator is selling reservations at a large profit over the congestion cost, these pre-congestion penalties will not be sufficient to ensure networks in the middle get a share of those profits, but at least they can cover their costs. We will now explain with an example.
When a whole inter-network is operating at normal (typically very low) congestion, the pre-congestion marking from virtual queues will be a little higher than if the real queues had been used---still low, but more noticeable. But low congestion levels do not imply that usage /charges/ must also be low. Usage charges will depend on the /price/ L as well. If the metric of the usage element of an interconnection agreement were changed from pure volume to pre-congested volume, one would expect the price of pre-congestion to be arranged so that the total usage charge remained about the same. So, if an average pre-congestion fraction turned out to be 1/1000, one would expect that the price L (per octet) of pre-congestion would be about 1000 times the previously used (per octet) price for volume. We should add that a switch to pre-congestion is unlikely to exactly maintain the same overall level of usage charges, but this argument will be approximately true, because usage charge will rise to at least the level the market finds necessary to push back against usage.

From the above example it can be seen why a 1000x higher price will make operators become acutely sensitive to the congestion they cause in other networks, which is of course the desired effect: to encourage networks to /control/ the congestion they allow their users to cause to others. If any network sends even one flow at a higher rate, they will immediately have to pay proportionately more usage charges.

Because there is no knowledge of reservations within the Diffserv region, no interior router can police whether the rate of each flow is greater than each reservation. So the system doesn't truly emulate rate-policing of each flow. But there is no incentive to pack a higher rate into a reservation, because the charges are directly proportional to rate, irrespective of the reservations.

However, if virtual queues start to fill on any path, even though real queues will still be able to provide low latency service, pre-congestion marking will rise fairly quickly. It may eventually reach the threshold where the ingress gateway would deny admission to new flows. If the ingress gateway cheats and continues to admit new flows, the affected virtual queues will rapidly fill, even though the real queues will still be little worse than they were when admission control should have been invoked. The ingress gateway will have to pay the penalty for such an extremely high pre-congestion level, so the pressure to invoke admission control should become unbearable.

The above mechanisms protect against rational operators. In Section 5.6.3 we discuss how networks can protect themselves from accidental or deliberate misconfiguration in neighbouring networks.

5.5. Sanctioning Dishonest Marking

As CL traffic leaves the last network before the egress gateway (domain C) the RE blanking fraction should match the congestion marking fraction, when averaged over a sufficiently long duration (perhaps ~10s to allow a few rounds of feedback through regular signalling of new and refreshed reservations).

To protect itself, domain C should install a monitor at its egress. It aims to detect flows of CL packets that are persistently negative. If flows are positive, domain C need take no action---this simply means an upstream network must be paying more penalties than it needs to. Appendix A.3 gives a suggested algorithm for the monitor, meeting the criteria below.

o It SHOULD introduce minimal false positives for honest flows;

o It SHOULD quickly detect and sanction dishonest flows (minimal false negatives);

o It MUST be invulnerable to state exhaustion attacks from malicious sources. For instance, if the dropper uses flow-state, it should not be possible for a source to send numerous packets, each with a different flow ID, to force the dropper to exhaust its memory capacity;

o It MUST introduce sufficient loss in goodput so that malicious sources cannot play off losses in the egress dropper against higher allowed throughput. Salvatori [CLoop_pol] describes this attack, which involves the source understating path congestion then inserting forward error correction (FEC) packets to compensate expected losses.

Note that the monitor operates on flows but with careful design we can avoid per-flow state. This is why we have been careful to ensure that all flows MUST start with a packet marked with the FNE codepoint. If a flow does not start with the FNE codepoint, a monitor is likely to treat it unfavourably. This risk makes it worth setting the FNE codepoint at the start of a flow, even though there is a cost to setting FNE (positive `worth').
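The role of the first packet of a flow in resisting state exhaustion can be sketched as below. This is a hypothetical illustration, not the monitor of Appendix A.3: the class name, the flow identifier and the per-flow packet counter are all assumptions, and a real monitor would also need to bound and expire whatever state it does create.

```python
class FlowStateTable:
    """Creates flow state only when a flow starts with an FNE packet."""

    def __init__(self):
        self.flows = {}  # flow_id -> packets seen (illustrative state)

    def on_packet(self, flow_id, is_fne):
        """Return True if the packet belongs to a tracked flow."""
        if flow_id in self.flows:
            self.flows[flow_id] += 1
            return True
        if is_fne:
            # FNE counts positive, so creating state costs the sender.
            self.flows[flow_id] = 1
            return True
        # No FNE at flow start: no state is created, and the packet
        # can be treated unfavourably by the caller.
        return False
```

A source that sprays packets with many different flow IDs but without FNE therefore consumes no memory in this table.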
Starting flows with an FNE packet also means that a monitor will be resistant to state exhaustion attacks from other networks, as the monitor can then be designed to never create state unless an FNE packet arrives. And an FNE packet counts positive, so it will cost a lot for a network to send many of them.

Monitor algorithms will often maintain a moving average across flows of the fraction of RE blanked packets. When maintaining an average across flows, a monitor MUST ignore packets with the FNE codepoint set. An ingress gateway sets the FNE codepoint when it does not have the benefit of feedback from the egress. So counting packets with FNE set would be likely to make the average unnecessarily positive, providing headroom (or should we say footroom?) for dishonest (negative) traffic.

If the monitor detects a persistently negative flow, it could drop sufficient negative and neutral packets to force the flow to not be negative. This is the approach taken for the `egress dropper' in [Re-TCP], but for the scenario in this memo, where everyone would expect everyone else to keep to the protocol, a management alarm SHOULD be raised on detecting persistently negative traffic and any automatic sanctions taken SHOULD be logged. Even if the chosen policy is to take no automatic action, the cause can then be investigated manually. Then no ingress can understate downstream pre-congestion without its action being logged. So network operators can deal with offending networks at the human level, out of band. As a last resort, perhaps where the ingress gateway address seems to have been spoofed in the signalling, packets can be dropped. Drops could be focused on just sufficient packets in misbehaving flows to remove the negative bias while doing minimal harm.

A future version of this memo may define a control message that could be used to notify an offending ingress gateway (possibly via the egress gateway) that it is sending persistently negative flows.
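The two sanction policies, alarm only or focused drop, can be sketched for a single flow as follows. This is a deliberately simplified assumption, not the algorithm of Appendix A.3 or of [Re-TCP]: it keeps a running balance rather than a moving average, represents each packet as a (size, worth) pair with worth +1, 0 or -1, and in drop mode discards only negative packets.

```python
def sanction(packets, tolerance_octets=0, drop=False):
    """Return (forwarded_packets, alarm) for one flow.

    packets: iterable of (size_octets, worth) with worth in {+1, 0, -1}
    tolerance_octets: how far below zero the balance may dip before the
        flow is considered negative (an assumed tuning parameter)
    drop: if True, discard just those negative packets that would take
        the balance below the tolerance
    """
    balance = 0
    forwarded = []
    alarm = False
    for size_octets, worth in packets:
        new_balance = balance + worth * size_octets
        if new_balance < -tolerance_octets:
            alarm = True  # would be raised and logged by management
            if drop and worth < 0:
                continue  # focused drop; balance is left unchanged
        balance = new_balance
        forwarded.append((size_octets, worth))
    return forwarded, alarm
```

With `drop=False` the flow is forwarded untouched and only the alarm records the misbehaviour, which matches the preference for human-level resolution described above.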
However, we are aware that such messages could be used to test the sensitivity of the detection system, so currently we prefer silent sanctions.

An extreme scenario would be where an ingress gateway (or set of gateways) mounted a DoS attack against another network. If their traffic caused sufficient congestion to lead to drop but they understated path congestion to avoid penalties for causing high congestion, the preferential drop recommendations in Section 4.3.4 would at least ensure that these flows would always be dropped before honest flows.

5.6. Border Mechanisms

5.6.1. Border Accounting Mechanisms

One of the main design goals of re-ECN was for border security mechanisms to be as simple as possible, otherwise they would become the pinch-points that limit scalability of the whole internetwork. As the title of this memo suggests, we want to avoid per-flow processing at borders. We also want to keep to passive mechanisms that can monitor traffic in parallel to forwarding, rather than having to filter traffic inline---in series with forwarding. As data rates continue to rise, we suspect that all-optical interconnection between networks will soon be a requirement. So we want to avoid any new need for buffering (even though border filtering is current practice for other reasons, we don't want to make it even less likely that we will ever get rid of it).

So far, we have been able to keep the border mechanisms simple, despite having had to harden them against some subtle attacks on the re-ECN design. The mechanisms are still passive and avoid per-flow processing, although we do use filtering as a fail-safe to temporarily shield against extreme events in other networks, such as accidental misconfigurations (Section 5.6.3).

The basic accounting mechanism at each border interface simply involves accumulating the volume of packets with positive worth (Re-Echo and FNE), and subtracting the volume of those with negative worth: AM(-1) and PM(-1). Even though this mechanism takes no account of flows, over an accounting period (say a month) this subtraction will account for the downstream congestion caused by all the flows traversing the interface, wherever they come from, and wherever they go to. The two networks can agree to use this metric however they wish to determine some congestion-related penalty against the upstream network (see Section 5.3 for examples). Although the algorithm could hardly be simpler, it is spelled out using pseudo-code in Appendix A.2.1.

Various attempts to subvert the re-ECN design have been made. In all cases their root cause is persistently negative flows. But, after describing these attacks we will show that we don't actually have to get rid of all persistently negative flows in order to thwart the attacks.

In honest flows, downstream congestion is measured as positive minus negative volume. So if all flows are honest (i.e. not persistently negative), adding all positive volume and all negative volume without regard to flows will give an aggregate measure of downstream congestion. But such simple aggregation is only possible if no flows are persistently negative. Unless persistently negative flows are completely removed, they will reduce the aggregate measure of congestion. The aggregate may still be positive overall, but not as positive as it would have been had the negative flows been removed.

In Section 5.5 we discussed how to sanction traffic to remove, or at least to identify, persistently negative flows. But, even if the sanction for negative traffic is to discard it, unless it is discarded at the exact point it goes negative, it will wrongly subtract from aggregate downstream congestion, at least at any borders it crosses after it has gone negative but before it is discarded.

We rely on sanctions to deter dishonest understatement of congestion. But even the ultimate sanction of discard can only be effective if the sender is bothered about the data getting through to its destination. A number of attacks have been identified where a sender gains from sending dummy traffic or it can attack someone or something using dummy traffic even though it isn't communicating any information to anyone:

o A network can simply create its own dummy traffic to congest another network, perhaps causing it to lose business at no cost to the attacking network. This is a form of denial of service perpetrated by one network on another. The preferential drop measures in Section 4.3.4 provide crude protection against such attacks, but we are not overly worried about more accurate prevention measures, because it is already possible for networks to DoS other networks on the general Internet, but they generally don't because of the grave consequences of being found out. We are only concerned if re-ECN increases the motivation for such an attack, as in the next example.

o A network can just generate negative traffic and send it over its border with a neighbour to reduce the overall penalties that it should pay to that neighbour. It could even initialise the TTL so it expired shortly after entering the neighbouring network, reducing the chance of detection further downstream. This attack need not be motivated by a desire to deny service and indeed need not cause denial of service. A network's main motivator would most likely be to reduce the penalties it pays to a
Below some specific security issues are mentioned that did not fit elsewhere in the memo or which comment onneighbour. But, therobustnessprospect of financial gain might tempt thesecurity provided bynetwork into mounting a DoS attack on thedesign. Firstly, we must repeatother network as well, given thestatementgain would offset some ofapplicability intheanalysis:risk of being detected. Note that weonly consider new opportunities for /gainful/ attack that our proposal introduces. Despite only involving a few bits, there is sufficient complexityhave not included DoS by Internet hosts in thewhole system that there are numerous possibilities for attacks not catered for. But as far asabove list of attacks, because weare aware, none reap any benefithave restricted ourselves to a scenario with edge-to-edge admission control across a Diffserv region. In this case, theattacker. It will always be possible for one network to cause damage to another neighbouring network's traffic by dropping or corrupting it as it forwards it. Therefore we do not believe networks would set their routing policies to interconnect inedge ingress gateways insulate the Diffserv region from DoS by Internet hosts. Re-ECN resists more general DoS attacks, but this is discussed in [Re-TCP]. The firstplace if they didn't trust the other networks notstep towards a solution todamage their traffic without any /direct/ gainall these problems with negative flows is tothemselves. Having said this, we do wantbe able tohighlight some ofestimate theweaker parts of our argument. We have argued that networks will be dissuaded from fakingcontribution they make to downstream congestionmarking byat a border and to correct thepossibility that upstream networks will route round them. Asmeasure accordingly. 
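As an illustration only, the basic accounting mechanism described earlier in this section (accumulate the volume of positive packets, subtract the volume of negative packets, take no regard of flows) might be sketched in Python as follows. This is not the pseudo-code of Appendix A.2.1 and it makes no attempt to correct for persistently negative flows. The class name BorderMeter, the marking label "Neutral" and the treatment of every other marking as having zero worth are assumptions of this sketch.

```python
from dataclasses import dataclass

# Worth of each marking as described in the text: Re-Echo and FNE are
# positive, AM(-1) and PM(-1) are negative. Any other marking is treated
# as neutral (zero worth), which is an assumption of this sketch.
WORTH = {"Re-Echo": +1, "FNE": +1, "AM(-1)": -1, "PM(-1)": -1}


@dataclass
class BorderMeter:
    """Bulk, flow-agnostic meter for one border interface (hypothetical name)."""
    positive_bytes: int = 0
    negative_bytes: int = 0

    def observe(self, marking: str, size_bytes: int) -> None:
        """Account for one packet crossing the border; no per-flow state is kept."""
        worth = WORTH.get(marking, 0)
        if worth > 0:
            self.positive_bytes += size_bytes
        elif worth < 0:
            self.negative_bytes += size_bytes

    def downstream_congestion(self) -> int:
        """Positive minus negative volume over the accounting period so far."""
        return self.positive_bytes - self.negative_bytes


# Usage: five packets crossing the interface during an accounting period.
meter = BorderMeter()
for marking, size in [("Re-Echo", 1500), ("Neutral", 1500), ("AM(-1)", 1500),
                      ("FNE", 500), ("Re-Echo", 1500)]:
    meter.observe(marking, size)
print(meter.downstream_congestion())  # 3500 positive - 1500 negative = 2000
```

Because the meter holds only two counters per interface, it stays passive and needs no per-flow processing. How the two networks turn the resulting figure into a penalty is left to their agreement.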
Although ideally we want to remove negative flows themselves, perhaps surprisingly, the most effective first step is to cancel out the polluting effect negative flows have on the measure of downstream congestion at a border. It is more important to get an unbiased estimate of their effect, than to try to remove them all. A suggested algorithm to give an unbiased estimate of the contribution from negative flows to the downstream congestion measure is given in Appendix A.2.2.

Although making an accurate assessment of the contribution from negative flows may not be easy, just the single step of neutralising their polluting effect on congestion metrics removes all the gains networks could otherwise make from mounting dummy traffic attacks on each other. This puts all networks on the same side (only with respect to negative flows of course), rather than being pitched against each other. The network where this flow goes negative as well as all the networks downstream lose out from not being reimbursed for any congestion this flow causes. So they all have an interest in getting rid of these negative flows. Networks forwarding a flow before it goes negative aren't strictly on the same side, but they are disinterested bystanders: they don't care that the flow goes negative downstream, but at least they can't actively gain from making it go negative.

The problem becomes localised so that once a flow goes negative, all the networks from where it happens and beyond downstream each have a small problem, each can detect it has a problem and each can get rid of the problem if it chooses to. But negative flows can no longer be used for any new attacks. Once an unbiased estimate of the effect of negative flows can be made, the problem reduces to detecting and preferably removing flows that have gone negative as soon as possible. But importantly, complete eradication of negative flows is no longer critical; best endeavours will be sufficient.

Note that the guiding principle behind all the above discussion is that any gain from subverting the protocol should be precisely neutralised, rather than punished. If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others. For instance, if possible, flows should be removed as soon as they go negative, but we do NOT RECOMMEND any attempts to discard such flows further upstream while they are still positive. Such over-zealous push-back is unnecessary and potentially dangerous.
These flows have paid their `fare' up to the point they go negative, so there is no harm in delivering them that far. If someone downstream asks for a flow to be dropped as near to the source as possible, because they say it is going to become negative later, an upstream node cannot test the truth of this assertion. Rather than have to authenticate such messages, re-ECN has been designed so that flows can be dropped solely based on locally measurable evidence. A message hinting that a flow should be watched closely to test for negativity is fine. But not a message that claims that a positive flow will go negative later, so it should be dropped.

5.6.2. Competitive Routing

With the above penalty system, each domain seems to have a perverse incentive to fake pre-congestion. For instance domain B profits from the difference between penalties it receives at its ingress (its revenue) and those it pays at its egress (its cost). So if B overstates internal pre-congestion it seems to increase its profit. However, we can assume that domain A could bypass B, routing through other domains to reach the egress. So the competitive discipline of least-cost routing can ensure that any domain tempted to fake pre-congestion for profit risks losing /all/ its incoming traffic. The least congested route would eventually be able to win this competitive game, only as long as it didn't declare more fake pre-congestion than the next most competitive route.

This memo does not need to standardise any particular mechanism for routing based on re-ECN. Goldenberg et al [Smart_rtg] refer to various commercial products and present their own algorithms for moving traffic between multi-homed routes based on usage charges. None of these systems require any changes to standard protocols because the choice between the available border gateway protocol (BGP) routes is based on a combination of local knowledge of the charging regime and local measurement of traffic levels.
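A minimal sketch of such a purely local choice is given below in Python, under stated assumptions. The function names, the route labels "via_B" and "via_D" (D being a hypothetical alternative domain), and the penalty model (locally measured congestion-marked volume multiplied by a locally known price) are all invented for illustration. The sketch also omits the damping that any real congestion-based traffic engineering would need.

```python
from typing import Dict, Tuple

# Each route maps to (marked_bytes, price_per_marked_byte). Both values are
# hypothetical: the first stands for a local measurement of downstream
# congestion on the route, the second for local knowledge of the charging
# regime agreed with that neighbour.
Routes = Dict[str, Tuple[int, float]]


def expected_penalty(marked_bytes: int, price_per_marked_byte: float) -> float:
    """Penalty domain A expects to pay if it sends its traffic over a route."""
    return marked_bytes * price_per_marked_byte


def cheapest_route(routes: Routes) -> str:
    """Pick the route with the lowest expected penalty (least-cost routing)."""
    return min(routes, key=lambda name: expected_penalty(*routes[name]))


# While B declares its pre-congestion honestly, it is the cheaper route.
honest = {"via_B": (2_000, 0.01), "via_D": (3_000, 0.01)}
# If B overstates pre-congestion, A's least-cost choice moves away from B.
faked = {"via_B": (4_000, 0.01), "via_D": (3_000, 0.01)}

print(cheapest_route(honest))  # via_B
print(cheapest_route(faked))   # via_D
```

The second case shows the competitive discipline at work: by overstating its pre-congestion beyond that of the next most competitive route, B loses the traffic altogether.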
If, as we propose, charges or penalties were based on the level of re-ECN measured in passing traffic, a similar optimisation could be achieved without requiring any changes to standard routing protocols. We must be clear that applying pre-congestion-based routing to this admission control system remains an open research issue. Traffic engineering based on congestion requires careful damping to avoid oscillations, and should not be attempted without adult supervision :) Mortier & Pratt [ECN-BGP] have analysed traffic engineering based on congestion. But without the benefit of re-ECN, they had to add a path attribute to BGP to advertise a route's downstream congestion (actually they proposed that BGP should advertise the charge for congestion, which we believe wrongly embeds an assumption into BGP that the only thing to do with congestion is charge for it). 5.6.3. Fail-safes The mechanisms described so far create incentives for rational operators to behave. That is, one operator aims to make another behave responsibly by applying penalties and expects a rational response (i.e. one that trades off costs against benefits). It is usually reasonable to assume that other network operators will behave rationally (policy routing can avoid those that might not). But this approach does not protect against the misconfigurations and accidents of other operators. Therefore, we propose the following two mechanisms at a network's borders to provide "defence in depth". Both are similar: Highly positive flows: A small sample of positive packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the fraction of positive marking is well above a threshold (to be determined by operational practice), a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop. 
Persistently negative flows: A small sample of congestion marked packets should be picked randomly as they cross a border interface. Then subsequent packets matching the same source and destination address and DSCP should be monitored. If the RE blanking fraction minus the congestion marking fraction is persistently negative, a management alarm SHOULD be raised, and the flow MAY be automatically subject to focused drop.

Both these mechanisms rely on the fact that highly positive (or negative) flows will appear more quickly in the sample by selecting randomly solely from positive (or negative) packets.

Note that there is no assumption that /users/ behave rationally. The system is protected from the vagaries of irrational user behaviour by the ingress gateways, which transform internal penalties into a deterministic, admission control mechanism that prevents users from misbehaving, by directly engineered means.

6. Analysis

The domains in Figure 1 are not expected to be completely malicious towards each other. After all, we can assume that they are all co-operating to provide an internetworking service to the benefit of each of them and their customers. Otherwise their routing policies would not interconnect them in the first place. However, we assume that they are also competitors of each other. So a network may try to contravene our proposed protocol if it would gain or make a competitor lose, or both, but only if it can do so without being caught. Therefore we do not have to consider every possible random attack one network could launch on the traffic of another, given that one network can always drop or corrupt packets that it forwards on behalf of another anyway. Therefore, we only consider new opportunities for /gainful/ attack that our proposal introduces.
But to a certain extent we can also rely on the in depth defences we have described (Section 5.6.3), which are intended to mitigate the potential impact of one network accidentally misconfiguring the workings of this protocol.

The ingress and egress gateways are shown in the most generic arrangement possible in Figure 1, without any surrounding network. This allows us to consider more specific cases where these gateways and a neighbouring network are operated by the same player. As well as cases where the same player operates neighbouring networks, we will also consider cases where the two gateways collude as one player and where the sender and receiver collude as one. Collusion of other sets of domains is less likely, but we will consider such cases. In the general case, we will assume none of the nine trust domains across the figure fully trust any of the others.

As we only propose to change routers within the Diffserv region, we assume the operators of networks outside the region will be doing per-flow policing. That is, we assume the networks outside the Diffserv region and the gateways around its edges can protect themselves. So given we are proposing to remove flow policing from some networks, our primary concern must be to protect networks that don't do per-flow policing (the potential `victims') from those that do (the `enemy'). The ingress and egress gateways are the only way the outer enemy can get at the middle victim, so we can consider the gateways as the representatives of the enemy as far as domains A, B and C are concerned. We will call this trust scenario `edges against middles'.

Earlier in this memo, we outlined the classic border rate policing problem (Section 3). It will now be useful to reiterate the motivations that are the root cause of the problem. The more reservations a gateway can allow, the more revenue it receives.
The middle networks want the edges to comply with the admission control protocol when they become so congested that their service to others might suffer. The middle networks also want to ensure the edges cannot steal more service from them than they are entitled to.

In the context of this `edges against middles' scenario, the re-ECN protocol has two main effects:

o The more pre-congestion there is on a path across the Diffserv region, the higher the ingress gateway must declare downstream pre-congestion.

o If the ingress gateway does not declare downstream pre-congestion high enough on average, it will `hit the ground before the runway', going negative and triggering sanctions, either directly against the traffic or against the ingress gateway at a management level.

An executive summary of our security analysis can be stated in three parts, distinguished by the type of collusion considered.

Neighbour-only Middle-Middle Collusion: Here there is no collusion or collusion is limited to neighbours in the feedback loop. In other words, two neighbouring networks can be assumed to act as one. Or the egress gateway might collude with domain C. Or the ingress gateway might collude with domain A. Or ingress and egress gateways might collude with each other. In these cases where only neighbours in the feedback loop collude, we conclude that all parties have a positive incentive to declare downstream pre-congestion truthfully, and the ingress gateway has a positive incentive to invoke admission control when congestion rises above the admission threshold in any network in the region (including its own). No party has an incentive to send more traffic than declared in reservation signalling (even though only the gateways read this signalling). In short, no party can gain at the expense of another.

Non-neighbour Middle-Middle Collusion: In the case of other forms of collusion between middle networks (e.g.
between domain A and C) it would be possible for say A & C to create a tunnel between themselves so that A would gain at the expense of B. But C would then lose the gain that A had made. Therefore the value to A & C of colluding to mount this attack seems questionable. It is made more questionable, because the attack can be statistically detected by B using the second `defence in depth' mechanism mentioned already. Note that C can defend itself from being attacked through a tunnel by treating the tunnel end point as a direct link to a neighbouring network (e.g. as if A were a neighbour of C, via the tunnel), which falls back to the safety of the neighbour-only scenario. Middle-Edge Collusion: Collusion between networks or gateways within the Diffserv region and networks or users outside the region has not yet been fully analysed. The presence of full per-flow policing at the ingress gateway seems to make this a less likely source of a successful attack. {ToDo: Due to lack of time, the full write up of the security analysis is deferred to the next version of this memo.} Finally, it is well known that the best person to analyse the security of a system is not the designer. Therefore, our confident claims must be hedged with doubt until others with perhaps a greater incentive to break it have mounted a full analysis. 7. Incremental Deployment We believe ECN has so far not been widely deployed because it requires widespread end system and network deployment just to achieve a marginal improvement in performance. The ability to offer a new service (admission control) would be a much stronger driver for ECN deployment. As stated in the introduction, the aim of this memo is to "build in security from the start" when admission control is based on pre- congestion notification. However, the proposal has been designed so that security can be added some time after first deployment. 
Given admission control based on pre-congestion notification requires few changes to standards, it should be deployable fairly soon. However, re-ECN requires a change to IP, which may take a little longer. We expect that initial deployments of PCN-based admission control will be confined to single networks, or to clubs of networks that trust each other. The proposal in this memo will only become relevant once networks with conflicting interests wish to interconnect their admission controlled services, but without the scalability constraints of per-flow border policing.

It will not be possible to use re-ECN, even in a controlled environment between consenting operators, unless it is standardised into IP. Given the IPv4 header has limited space for further changes, current IESG policy [{ToDo: ref?}] is not to allow experimental use of codepoints in the IPv4 header, as whenever an experiment isn't taken up, the space it used tends to be impossible to reclaim.

If PCN-based admission control is deployed before re-ECN is standardised into IP, wherever a network (or club of networks) connects to another network (or club of networks) with conflicting interests, they will place a gateway between the two regions that does per-flow rate policing and admission control. If re-ECN is eventually standardised into IP, it will be possible for these separate regions to upgrade all their gateways to use re-ECN before removing the per-flow policing gateways between them. Given the edge-to-edge deployment model of PCN-based admission control, it is reasonable to imagine this incremental deployment model without needing to cater for partial deployment of re-ECN in just some of the gateways around one Diffserv region.

Only the edge gateways around a Diffserv region have to be upgraded to add re-ECN support, not interior routers. It is also necessary to add the mechanisms that use re-ECN to secure a network against misbehaving gateways and networks.
Specifically, these are the border mechanisms (Section 5.6) and the mechanisms to sanction dishonest marking (Section 5.5). We also RECOMMEND adding improvements to forwarding on interior routers (Section 4.3.4). But the system works whether all, some or none are upgraded, so interior routers may be upgraded in a piecemeal fashion at any time. 8. Design Choices and Rationale The primary insight of this work is that downstream congestion is the metric that would be most useful to control an internetwork, and particularly to police how one network responds to the congestion it causes in a remote network. This is the problem that has previously made it so hard to provide scalable admission control. The case for using re-feedback (a generalisation of re-ECN) to police congestion response and provide QoS is made in [Re-fb]. Essentially, the insight is that congestion is a factor that crosses layers from the physical upwards. Therefore re-feedback polices congestion where it emerges from a physical interface between networks. This is achieved by bringing the congestion information to the interface, rather than examining packet addressing where there is congestion. Then congestion crossing the physical interface at a border can be policed at the interface, rather than policing the congestion on packets that claim to come from an address (which may be spoofed). Also, re-feedback works in the network layer independently of other layers---despite its name re-feedback does not actually require feedback. It requires a source to act conservatively before it gets feedback. On the subject of lack of feedback, the feedback not established (FNE) codepoint is motivated by arguments for a state set-up bit in IP to prevent state exhaustion attacks. This idea was first put forward informally by David Clark and documented by Handley and Greenhalgh in [Steps_DoS]. 
The idea is that network layer datagrams should signal explicitly when they require state to be created in the network layer or the layer above (e.g. at flow start). Then a node can refuse to create any state unless a datagram declares this intent. We believe the proposed FNE codepoint serves the same purpose as the proposed state-set-up bit, but it has been overloaded with a more specific purpose, using it on more packets than just the first in a flow, but never less (i.e. it is idempotent). In effect the FNE codepoint serves the purpose of a `soft-state set-up codepoint'. The re-feedback paper [Re-fb] also makes the case for converting the economic interpretation of congestion into hard engineering mechanism, which is the basis of the approach used in this memo. The admission control gateways around the Diffserv region use hard engineering, not incentives, to prevent end users from sending more traffic than they have reserved. Incentive-based mechanisms are only used between networks, because they are expected to respond to incentives more rationally than end-users can be expected to. However, even then, a network can use fail-safes to protect itself from excessively unusual behaviour by neighbouring networks, whether due to an accidental misconfiguration or malicious intent. The guiding principle behind the incentive-based approach used between networks is that any gain from subverting the protocol should be precisely neutralised, rather than punished. If a gain is punished to a greater extent than is sufficient to neutralise it, it will most likely open up a new vulnerability, where the amplifying effect of the punishment mechanism can be turned on others. The re-feedback paper also makes the case against the use of congestion charging to police congestion if it is based on classic feedback (where only upstream congestion is visible to network elements). 
It argues this would open up receiving networks to `denial of funds' attacks and would require end users to accept dynamic pricing (which few would). Re-ECN has been deliberately designed to simplify policing at the borders between networks. These trust boundaries are the critical pinch-points that will limit the scalability of the whole internetwork unless the overall design minimises the complexity of security functions at these borders. The border mechanisms described in this memo run passively in parallel to data forwarding and they do not require per-flow processing. 9. Security Considerations This whole memo concerns the security of a scalable admission control system. In particular the analysis section. Below some specific security issues are mentioned that did not belong elsewhere or which comment on the overall robustness of the security provided by the design. Firstly, we must repeat the statement of applicability in the analysis: that we only consider new opportunities for /gainful/ attack that our proposal introduces, particularly if the attacker can avoid being identified. Despite only involving a few bits, there is sufficient complexity in the whole system that there are probably numerous possibilities for other attacks. However, as far as we are aware, none reap any benefit to the attacker. For instance, it would be possible for a downstream network to remove the congestion markings introduced by an upstream network, but it would only lose out on the penalties it could apply to a downstream network. When one network forwards a neighbouring network's traffic it will always be possible to cause damage by dropping or corrupting it. Therefore we do not believe networks would set their routing policies to interconnect in the first place if they didn't trust the other networks not to arbitrarily damage their traffic. Having said this, we do want to highlight some of the weaker parts of our argument. 
We have argued that networks will be dissuaded from faking congestion marking by the possibility that upstream networks will route round them. As we have said, these arguments are based on fairly delicate assumptions and will remain fairly tenuous until proved in practice, particularly close to the egress where less competitive routing is likely. We should also point out that the approach in this memo was only designed to be robust for admission control. We do not claim the incentives will always be strong enough to force correct flow pre- emption behaviour. This is because a user will tend to perceive much greater loss in value if a flow is pre-empted than if admission is denied at the start. However, in general the incentives for correct flow pre-emption are similar to those for admission control. Finally, it may seem that the 8 codepoints that have been made available by extending the ECN field with the RE flag have been used rather wastefully. In effect the RE flag has been used as an orthogonal single bit in nearly all cases. The only exception being when the ECN field is cleared to "00". The mapping of the codepoints in an earlier version of this proposal used the codepoint space more efficiently, but the scheme became vulnerable to a network operator focusing its congestion marking to mark more positive than neutral packets in order to reduce its penalties. With the scheme as now proposed, once the RE flag is set or cleared by the sender or its proxy, it should not be written by the network, only read. So the gateways can detect if any network maliciously alters the RE flag. IPSec AH integrity checking does not cover the IPv4 option flags (they were considered mutable---even the one we propose using for the RE flag that was `currently unused' when IPSec was defined). But it would be sufficient for a pair of gateways to make random checks on whether the RE flag was the same when it reached the egress gateway as when it left the ingress. 
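The random check between a pair of gateways could take many forms. The Python sketch below shows one possibility under stated assumptions: the ingress records the RE flag of a random sample of packets, and the egress later compares the flag it sees on the same packets. The packet identifiers, the function names and the way the sample reaches the egress are hypothetical, since this memo does not specify them.

```python
import random
from typing import Dict, Iterable, List, Tuple

Packet = Tuple[int, bool]  # (hypothetical packet identifier, RE flag)


def sample_at_ingress(packets: Iterable[Packet], probability: float,
                      rng: random.Random) -> Dict[int, bool]:
    """Record the RE flag of a random subset of packets leaving the ingress."""
    return {pid: re_flag for pid, re_flag in packets
            if rng.random() < probability}


def altered_at_egress(recorded: Dict[int, bool],
                      packets_at_egress: Iterable[Packet]) -> List[int]:
    """Return identifiers of sampled packets whose RE flag changed in transit."""
    seen = dict(packets_at_egress)
    return sorted(pid for pid, re_flag in recorded.items()
                  if pid in seen and seen[pid] != re_flag)


# Usage: a hypothetical network in the region flips the RE flag on every
# tenth packet; the gateways sample roughly one packet in ten.
rng = random.Random(1)
sent = [(pid, pid % 2 == 0) for pid in range(1000)]
received = [(pid, (not flag) if pid % 10 == 0 else flag) for pid, flag in sent]

recorded = sample_at_ingress(sent, 0.1, rng)
suspects = altered_at_egress(recorded, received)
print(len(recorded), "packets sampled,", len(suspects), "found altered")
```

Only sampled packets can reveal an alteration, so the sampling probability trades measurement cost against how quickly systematic alteration of the RE flag would be noticed.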
Indeed, if IPSec AH had covered the RE flag, any network intending to alter sufficient RE flags to make a gain would have focused its alterations on packets without authenticating headers (AHs).

No cryptographic algorithms have been harmed in the making of this proposal.

10. IANA Considerations

This memo includes no request to IANA.

11. Conclusions

This memo builds on a promising technique to solve the classic problem of making flow admission control scale to any size network. It involves the use of Diffserv in a deployment model that uses pre-congestion notification feedback to control admission into a network path [CL-deploy]. However as it stands, that deployment model depends on all network domains trusting each other to comply with the protocols, invoking admission control and flow pre-emption when requested.

We propose that the congestion feedback used in that deployment model should be re-echoed into the forward data path, by making a trivial modification to the ingress gateway. We then explain how the resulting downstream pre-congestion metric in packets can be monitored in bulk at borders to sufficiently emulate flow rate policing.

We claim the result of combining these two approaches is an admission control system that scales to any size network /and/ any number of interconnected networks, even if they all act in their own interests.

This proposal aims to convince its readers to "Design in Security from the start," by building modified ingress gateways from day one, even if border policing is not needed at first. This way, we will not build ourselves tomorrow's legacy problem.

Re-echoing congestion feedback is based on a principled technique called Re-ECN [Re-TCP], designed to add accountability for causing congestion to the general-purpose IP datagram service. Re-ECN proposes to consume the last completely unused bit in the basic IPv4 header.

12. Acknowledgements

All the following have given helpful comments and some may become co-authors of later drafts: Arnaud Jacquet, Alessandro Salvatori, Steve Rudkin, David Songhurst, John Davey, Ian Self, Anthony Sheppard, Carla Di Cairano-Gilfedder (BT), Mark Handley (who identified the excess canceled packets attack), Stephen Hailes, Adam Greenhalgh (UCL), Francois Le Faucheur, Anna Charny (Cisco), Jozef Babiarz, Kwok-Ho Chan, Corey Alexander (Nortel), David Clark, Bill Lehr, Sharon Gillett, Steve Bauer (MIT) (who publicised various dummy traffic attacks), Sally Floyd (ICIR) and comments from participants in the CFP/CRN inter-provider QoS and broadband working groups.

13. Comments Solicited

Comments and questions are encouraged and very welcome. They can be addressed to the IETF Transport Area working group's mailing list <tsvwg@ietf.org>, and/or to the authors.

14. References

14.1. Normative References

[PCN] Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F., Charny, A., Liatsos, V., Babiarz, J., Chan, K., Dudley, S., Westberg, L., Bader, A., and G. Karagiannis, "Pre-Congestion Notification Marking", draft-briscoe-tsvwg-cl-phb-02 (work in progress), June 2006.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2211] Wroclawski, J., "Specification of the Controlled-Load Network Element Service", RFC 2211, September 1997.

[RFC3168] Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, September 2001.

[RFC3246] Davie, B., Charny, A., Bennet, J., Benson, K., Le Boudec, J., Courtney, W., Davari, S., Firoiu, V., and D. Stiliadis, "An Expedited Forwarding PHB (Per-Hop Behavior)", RFC 3246, March 2002.

[RSVP-ECN] Le Faucheur, F., Charny, A., Briscoe, B., Eardley, P., Babiarz, J., and K.
Chan, "RSVP Extensions for Admission Control over Diffserv using Pre-congestion Notification", draft-lefaucheur-rsvp-ecn-01 (work in progress), June 2006.

[Re-TCP] Briscoe, B., Jacquet, A., and A. Salvatori, "Re-ECN: Adding Accountability for Causing Congestion to TCP/IP", draft-briscoe-tsvwg-re-ecn-tcp-02 (work in progress), June 2006.

14.2. Informative References

[CL-deploy] Briscoe, B., Eardley, P., Songhurst, D., Le Faucheur, F., Charny, A., Babiarz, J., Chan, K., Westberg, L., Bader, A., and G. Karagiannis, "A Deployment Model for Admission Control over DiffServ using Pre-Congestion Notification", draft-briscoe-tsvwg-cl-architecture-03 (work in progress), June 2006.

[CLoop_pol] Salvatori, A., "Closed Loop Traffic Policing", Politecnico Torino and Institut Eurecom Masters Thesis, September 2005.

[ECN-BGP] Mortier, R. and I. Pratt, "Incentive Based Inter-Domain Routeing", Proc Internet Charging and QoS Technology Workshop (ICQT'03) pp308--317, September 2003, <http://research.microsoft.com/users/mort/publications.aspx>.

[ECN-MPLS] Davie, B., Briscoe, B., and J. Tay, "Explicit Congestion Marking in MPLS", draft-davie-ecn-mpls-00 (work in progress), June 2006.

[IXQoS] Briscoe, B. and S. Rudkin, "Commercial Models for IP Quality of Service Interconnect", BT Technology Journal (BTTJ) 23(2)171--195, April 2005, <http://www.cs.ucl.ac.uk/staff/B.Briscoe/pubs.html#ixqos>.

[NSIS-RMD] Bader, A., Westberg, L., Karagiannis, G., Kappler, C., and T. Phelan, "RMD-QOSM - The Resource Management in Diffserv QOS Model", draft-ietf-nsis-rmd-06 (work in progress), February 2006.

[RFC2205] Braden, B., Zhang, L., Berson, S., Herzog, S., and S. Jamin, "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification", RFC 2205, September 1997.

[RFC2207] Berger, L. and T.
O'Malley, "RSVP Extensions for IPSEC Data Flows", RFC 2207, September 1997.

[RFC2208] Mankin, A., Baker, F., Braden, B., Bradner, S., O'Dell, M., Romanow, A., Weinrib, A., and L. Zhang, "Resource ReSerVation Protocol (RSVP) Version 1 Applicability Statement Some Guidelines on Deployment", RFC 2208, September 1997.

[RFC2747] Baker, F., Lindell, B., and M. Talwar, "RSVP Cryptographic Authentication", RFC 2747, January 2000.

[RFC2998] Bernet, Y., Ford, P., Yavatkar, R., Baker, F., Zhang, L., Speer, M., Braden, R., Davie, B., Wroclawski, J., and E. Felstaine, "A Framework for Integrated Services Operation over Diffserv Networks", RFC 2998, November 2000.

[RFC3540] Spring, N., Wetherall, D., and D. Ely, "Robust Explicit Congestion Notification (ECN) Signaling with Nonces", RFC 3540, June 2003.

[Re-fb] Briscoe, B., Jacquet, A., Di Cairano-Gilfedder, C., Salvatori, A., Soppera, A., and M. Koyabe, "Policing Congestion Response in an Internetwork Using Re-Feedback", ACM SIGCOMM CCR 35(4)277--288, August 2005, <http://www.acm.org/sigs/sigcomm/sigcomm2005/techprog.html#session8>.

[Smart_rtg] Goldenberg, D., Qiu, L., Xie, H., Yang, Y., and Y. Zhang, "Optimizing Cost and Performance for Multihoming", ACM SIGCOMM CCR 34(4)79--92, October 2004, <http://citeseer.ist.psu.edu/698472.html>.

[Steps_DoS] Handley, M. and A. Greenhalgh, "Steps towards a DoS-resistant Internet Architecture", Proc. ACM SIGCOMM workshop on Future directions in network architecture (FDNA'04) pp 49--56, August 2004.

Appendix A. Implementation

A.1. Ingress Gateway Algorithm for Blanking the RE flag

The ingress gateway receives regular feedback reporting the fraction of congestion marked octets for each aggregate arriving at the egress. So for each aggregate it should blank the RE flag on the same fraction of octets. It is more efficient to calculate the reciprocal of this fraction when the signalling arrives, Z_0 = (1 / Congestion-Level-Estimate).
Z_0 will be the number of octets of packets the ingress should send with the RE flag set between those it sends with the RE flag blanked. Z_0 will also take account of the sustainable rate reported during the flow pre-emption process, if necessary.

A suitable pseudo-code algorithm for the ingress gateway is as follows:

====================================================================
B_i = 0                      /* interblank volume                  */
for each PCN-capable packet {
    b = readLength()         /* set b to packet size               */
    B_i += b                 /* accumulate interblank volume       */
    if B_i < b * Z_0 {       /* test whether interblank volume...  */
        writeRE(1)
    } else {                 /* ...exceeds blank RE spacing * pkt size */
        writeRE(0)           /* ...and if so, clear RE             */
        B_i = 0              /* ...and re-set interblank volume    */
    }
}
====================================================================

A.2. Downstream Congestion Metering Algorithms

A.2.1. Bulk Downstream Congestion Metering Algorithm

To meter the bulk amount of downstream pre-congestion in traffic crossing an inter-domain border, an algorithm is needed that accumulates the size of positive packets and subtracts the size of negative packets. We maintain two counters:

V_b: accumulated pre-congestion volume
B: total data volume (in case it is needed)

A suitable pseudo-code algorithm for a border router is as follows:

====================================================================
V_b = 0
B = 0
for each PCN-capable packet {
    b = readLength(packet)   /* set b to packet size               */
    B += b                   /* accumulate total volume            */
    if readEECN(packet) == (Re-Echo || FNE) {
        V_b += b             /* increment...                       */
    } elseif readEECN(packet) == ( AM(-1) || PM(-1) ) {
        V_b -= b             /* ...or decrement V_b...             */
    }                        /* ...depending on EECN field         */
}
====================================================================

At the end of an accounting period this counter V_b represents the pre-congestion volume that penalties could be applied to, as described in Section 5.3.
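The two pseudo-code algorithms above (A.1 and A.2.1) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `blank_re_flags`, `Marking` and `BulkMeter` are hypothetical, packets are assumed to have non-zero length, a Congestion-Level-Estimate of zero is assumed to mean that no RE flags are blanked, and the classification of a packet's extended ECN field into positive, negative or neutral is assumed to have been done elsewhere.

```python
import math
from dataclasses import dataclass
from enum import Enum, auto


def blank_re_flags(lengths, congestion_level_estimate):
    """Yield the RE flag (1 = set, 0 = blanked) for each packet length in octets."""
    # Assumption: an estimate of zero means nothing is ever blanked.
    if congestion_level_estimate == 0:
        z_0 = math.inf
    else:
        z_0 = 1 / congestion_level_estimate
    b_i = 0                      # interblank volume
    for b in lengths:
        b_i += b                 # accumulate interblank volume
        if b_i < b * z_0:
            yield 1              # keep RE set
        else:
            yield 0              # blank RE ...
            b_i = 0              # ... and restart the interblank volume


class Marking(Enum):
    """Hypothetical labels for the marking states the border meter distinguishes."""
    POSITIVE = auto()            # Re-Echo or FNE
    NEGATIVE = auto()            # AM(-1) or PM(-1)
    NEUTRAL = auto()             # anything else


@dataclass
class BulkMeter:
    v_b: int = 0                 # accumulated pre-congestion volume, octets
    b: int = 0                   # total data volume, octets

    def observe(self, length, marking):
        self.b += length
        if marking is Marking.POSITIVE:
            self.v_b += length
        elif marking is Marking.NEGATIVE:
            self.v_b -= length


# Twenty 1000-octet packets with a 25% congestion level estimate.
flags = list(blank_re_flags([1000] * 20, 0.25))

meter = BulkMeter()
for length, marking in [(1500, Marking.POSITIVE), (1500, Marking.NEUTRAL),
                        (500, Marking.NEGATIVE), (1000, Marking.POSITIVE)]:
    meter.observe(length, marking)
```

In this sketch every fourth packet has its RE flag blanked, and the meter ends with V_b = 2000 octets out of B = 4500 octets of total volume.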
For instance, accumulated volume of pre-congestion through a border interface over a month might be V_b = 5PB (petabyte = 10^15 byte). This might have resulted from an average downstream pre-congestion level of 1% on an accumulated total data volume of B = 500PB.

A.2.2. Inflation Factor for Persistently Negative Flows

The following process is suggested to complement the simple algorithm above in order to protect against the various attacks from persistently negative flows described in Section 5.6.1. As explained in that section, the most important and first step is to estimate the contribution of persistently negative flows to the bulk volume of downstream pre-congestion and to inflate this bulk volume as if these flows weren't there. The process below has been designed to give an unbiased estimate, but it may be possible to define other processes that achieve similar ends.

While the above simple metering algorithm is counting the bulk of traffic over an accounting period, the meter should also select a subset of the whole flow ID space that is small enough to be able to realistically measure but large enough to give a realistic sample. Many different samples of different subsets of the ID space should be taken at different times during the accounting period, preferably covering the whole ID space.
During each sample, the meter should count the volume of positive packets and subtract the volume of negative packets, maintaining a separate account for each flow in the sample. It should run a lot longer than the large majority of flows, to avoid a bias from missing the starts and ends of flows, which tend to be positive and negative respectively. Once the accounting period finishes, the meter should calculate the total of the accounts V_{bI} for the subset of flows I in the sample, and the total of the accounts V_{fI} excluding flows with a negative account from the subset I.
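As a rough illustration of the per-sample bookkeeping just described, the following Python sketch keeps one account per flow and then forms the two totals. The function name `sample_totals`, the packet tuple layout and the use of +1, -1 and 0 to stand for positive, negative and neutral markings are assumptions made for this example; they are not defined by this memo.

```python
from collections import defaultdict


def sample_totals(packets):
    """Return (V_bI, V_fI) for one sample I.

    packets: iterable of (flow_id, length_in_octets, sign), where sign is
    +1 for a positive packet, -1 for a negative packet and 0 otherwise
    (a hypothetical encoding chosen for this sketch).
    """
    accounts = defaultdict(int)          # one running account per flow
    for flow_id, length, sign in packets:
        accounts[flow_id] += sign * length
    v_bi = sum(accounts.values())        # total over all sampled flows
    # Same total, but leaving out flows whose account ended up negative.
    v_fi = sum(v for v in accounts.values() if v >= 0)
    return v_bi, v_fi


sample = [
    ("a", 1500, +1), ("a", 500, -1),     # flow a ends at +1000
    ("b", 400, -1),                      # flow b ends at -400
    ("c", 500, +1), ("c", 1500, 0),      # flow c ends at +500
]
v_bi, v_fi = sample_totals(sample)
```

Here the sample gives V_bI = 1100 octets and V_fI = 1500 octets, so the negative flow b drags the first total below the second.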
Then the weighted mean of all these samples should be taken

a_S = sum_{forall I} V_{fI} / sum_{forall I} V_{bI}.

If V_b is the result of the bulk accounting algorithm over the accounting period (Appendix A.2.1) it can be inflated by this factor a_S to get a good unbiased estimate of the volume of downstream congestion over the accounting period a_S.V_b, without being polluted by the effect of persistently negative flows.

A.3. Algorithm for Sanctioning Negative Traffic

{ToDo: Write up algorithms similar to Appendix D of [Re-TCP] for the negative flow monitor with flow management algorithm and the variant with bounded flow state.}

Author's Address

Bob Briscoe
BT & UCL
B54/77, Adastral Park
Martlesham Heath
Ipswich IP5 3RE
UK

Phone: +44 1473 645196
Email: bob.briscoe@bt.com
URI: http://www.cs.ucl.ac.uk/staff/B.Briscoe/

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Disclaimer of Validity This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Copyright Statement Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. Acknowledgment Funding for the RFC Editor function is currently provided by the Internet Society.