ConEx | B. Briscoe |
Internet-Draft | BT |
Intended status: Informational | M. Sridharan |
Expires: August 18, 2014 | Microsoft |
February 14, 2014 |
Network Performance Isolation in Data Centres using Congestion Policing
draft-briscoe-conex-data-centre-02
This document describes how a multi-tenant (or multi-department) data centre operator can isolate tenants from network performance degradation due to each other's usage, but without losing the multiplexing benefits of a LAN-style network where anyone can use any amount of any resource. Zero per-tenant configuration and no implementation change is required on network equipment. Instead the solution is implemented with a simple change to the hypervisor (or container) beneath the tenant's virtual machines on every physical server connected to the network. These collectively enforce a very simple distributed contract - a single network allowance that each tenant can allocate among their virtual machines, even if distributed around the network. The solution uses layer-3 switches that support explicit congestion notification (ECN). It is best if the sending operating system supports congestion exposure (ConEx). Nonetheless, the operator can unilaterally deploy a complete solution while operating systems are being incrementally upgraded to support ConEx and ECN.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on August 18, 2014.
Copyright (c) 2014 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
A number of companies offer hosting of virtual machines on their data centre infrastructure—so-called infrastructure as a service (IaaS) or 'cloud computing'. A set amount of processing power, memory, storage and network are offered. Although processing power, memory and storage are relatively simple to allocate on the 'pay as you go' basis that has become common, the network is less easy to allocate, given it is a naturally distributed system.
This document describes how a data centre infrastructure provider can offer isolated network performance to each tenant by deploying congestion policing at every ingress to the data centre network, e.g. in all the hypervisors (or containers). The data packets pick up congestion information as they traverse the network, which is brought to the ingress using one of two approaches: feedback tunnels or ConEx (or a mix of the two). Then, these ingress congestion policers have sufficient information to limit the amount of congestion any tenant can cause anywhere in the whole meshed pool of data centre network resources. This isolates the network performance experienced by each tenant from the behaviour of all the others, without any tenant-related configuration on any of the switches.
How it works is very simple and quick to describe. Why this approach provides performance isolation may be more difficult to grasp, in particular why it does so across a network of links even though there is no isolation mechanism in each link. Essentially, rather than limiting how much traffic can go where, traffic is allowed anywhere and the policer finds out whenever and wherever any traffic causes a small amount of congestion so that it can prevent heavier congestion.
This document explains how it works, while a companion document [conex-policing] builds up an intuition for why it works. Nonetheless, to make this document self-contained, brief summaries of both the 'how' and the 'why' are given in Sections 3 and 4. Then Section 5 gives details of the design and Section 6 explains the aspects of the design that enable incremental deployment. Finally, Section 7 introduces other attempts to solve the network performance isolation problem and why they fall down in various ways.
The solution would also be just as applicable to isolate the network performance of different departments within the private data centre of an enterprise, which could be implemented without virtualisation. However, it will be described as a multi-tenant scenario, which is the more difficult case from a security point of view.
The following goals are met by the design, each of which is explained subsequently:
[Figure 1 diagram (ASCII art, layout lost in extraction): virtual machines V11..V1m through Vn1..Vnm sit above per-tenant policers (T1, T2) in the hypervisors on hosts H1, H2, ..., Hn, which connect through switches S1 and S2.]
The two (or more) policers associated with tenant T1 act as one logical policer.
Figure 1: Edge Policing and the Hose Traffic Model
[Figure 2 diagram (ASCII art, layout lost in extraction): a classifier splits the packet stream by tenant (T1, T2, ..., Ti). Each tenant's stream passes a policer and then a marking meter on its way to the downstream network. Each tenant has a congestion token bucket filled at rate w1, w2, ..., wi; the bucket depth controls the policer, and policed packets are lost. Legend: [_] packet stream, [*] marked packet.]
Figure 2: Bulk Congestion Policer Schematic
[conex-policing] explains how the operator or tenant would determine an appropriate allowance.
In the proposed approach, the network operator deploys capacity as usual—using previous experience to determine a reasonable contention ratio at every tier of the network. Then, the tenant contracts with the operator for the rate at which their congestion policer will allow them to contribute to congestion.
Network performance isolation traditionally meant that each user could be sure of a minimum guaranteed bit-rate. Such assurances are useful if traffic from each tenant follows relatively predictable paths and is fairly constant. If traffic demand is more dynamic and unpredictable (both over time and across paths), minimum bit-rate assurances can still be given, but they have to be very small relative to the available capacity, because a large number of users might all want to simultaneously share any one link, even though they rarely all use it at the same time.
This either means the shared capacity has to be greatly overprovided so that the assured level is large enough, or the assured level has to be small. The former is unnecessarily expensive; the latter doesn't really give a sufficiently useful assurance.
Round robin or fair queuing are other forms of isolation that guarantee that each user will get 1/N of the capacity of each link, where N is the number of active users at each link. This is fine if the number of active users (N) sharing a link is fairly predictable. However, if large numbers of tenants do not typically share any one link but at any time they all could (as in a data centre), a 1/N assurance is fairly worthless. Again, given N is typically small but could be very large, either the shared capacity has to be expensively overprovided, or the assured bit-rate has to be worthlessly small. The argument is no different for the weighted forms of these algorithms (WRR & WFQ).
Both these traditional forms of isolation try to give one tenant assured instantaneous bit-rate by constraining the instantaneous bit-rate of everyone else. This approach is flawed except in the special case when the load from every tenant on every link is continuous and fairly constant. The reality is usually very different: sources are on-off and the route taken varies, so that on any one link a source is more often off than on.
For instance, if 100 tenants are using a 1Gb/s link for 1% of the time, there is a good chance each will get the full 1Gb/s link capacity. But if just six of those tenants suddenly start using the link 50% of the time, whenever the other 94 tenants need the link, they will typically find 3 of these heavier tenants using it already. If a 1/N approach like round-robin were used, then the light tenants would suddenly get 1/4 * 1Gb/s = 250Mb/s on average. Round-robin cannot claim to isolate tenants from each other if they usually get 1Gb/s but sometimes they get 250Mb/s (and only 10Mb/s guaranteed in the worst case when all 100 tenants are active).
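The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the numbers quoted above; the function and variable names are illustrative and do not come from the draft.

```python
def round_robin_share_mbps(link_mbps: float, active_tenants: int) -> float:
    """Per-tenant share of a link under 1/N sharing with N active tenants."""
    if active_tenants < 1:
        raise ValueError("at least one tenant must be active")
    return link_mbps / active_tenants


LINK_MBPS = 1000.0        # the 1Gb/s link in the example
HEAVY_TENANTS = 6         # tenants that start using the link heavily
HEAVY_DUTY_CYCLE = 0.5    # each heavy tenant is active 50% of the time

# Expected number of heavy tenants already on the link at a random instant.
typical_heavy_active = HEAVY_TENANTS * HEAVY_DUTY_CYCLE

# A light tenant that arrives shares the link with those heavy tenants.
typical_share = round_robin_share_mbps(LINK_MBPS, int(typical_heavy_active) + 1)

# Worst case: all 100 tenants are active at once.
worst_case_share = round_robin_share_mbps(LINK_MBPS, 100)

print(typical_share)     # 250.0
print(worst_case_share)  # 10.0
```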
In contrast, congestion policing is the key to network performance isolation because it focuses policing only on those tenants that go fast over congested path(s) excessively and persistently over time. This keeps congestion below a design threshold everywhere so that everyone else can go fast. In this way, congestion policing takes account of highly variable loads (varying in time and varying across routes). And, if everyone's load happens to be constant, congestion policing converges on the same outcome as the traditional forms of isolation.
The other flaw in the traditional approaches to isolation, like WRR & WFQ, is that they actually prevent long-running flows from yielding to brief bursts from lighter tenants. A long-running flow can yield to brief flows and still complete nearly as soon as it would have otherwise (the brief flows complete sooner, freeing up the capacity for the longer flow sooner). However, WRR & WFQ prevent flows from even seeing the congestion signals that would allow them to co-ordinate between themselves, because they isolate each tenant completely into separate queues.
In summary, superficially, traditional approaches with separate queues sound good for isolation, but:
[conex-policing] explains why congestion policing works using numerical examples from a data centre and schematic traffic plots (in ASCII art). The bullets below provide a summary of that explanation, which builds from the simple case of long-running flows through a single link up to a full meshed network with on-off flows of different sizes and different behaviours:
[conex-policing] also includes a section that gives guidance on how to estimate appropriate fill rates and sizes for congestion token buckets.
The design involves the following elements, each detailed in the following subsections:
[Figure 3 diagram (ASCII art, layout lost in extraction): the IP-layer data flow runs from the transport sender through a (congested) network device and an audit function to the transport receiver. The network device adds congestion signals to the data flow, the receiver returns congestion feedback signals to the sender in the transport-layer feedback flow, and the sender re-inserts them as (new) IP-layer ConEx signals in the data flow.]
Figure 3: The ConEx Protocol in the Internet Architecture
The operator of the data centre infrastructure needs to trust this information, therefore it cannot just use the feedback in the end-to-end transport (e.g. TCP SACK or ECN echo congestion experienced flags) that might anyway be encrypted. Trusted congestion feedback may be implemented in either of the following two ways:
The feedback tunnel approach (a) is inefficient because it duplicates end-to-end feedback and it introduces at least a round trip's delay, whereas the ConEx approach (b) is more efficient and not delayed, because ConEx packets signal a conservative estimate of congestion in the upcoming round trip. Avoiding feedback delay is important for controlling congestion from aggregated short flows. However, ConEx signals will not necessarily be supported by the sending operating system.
Therefore, given ConEx IP packets are self-identifying, the best approach is to rely on ConEx signals when present and fill in with tunnelled feedback when not, on a packet-by-packet basis.
Both approaches are much easier if explicit congestion notification (ECN [RFC3168]) is enabled on network switches and if all packets are ECN-capable. For non-ECN-capable packets, ECN support can be turned on in the outer of an edge-to-edge tunnel. The reasons that ECN helps in each case are:
The above cases can be arranged in a 2x2 matrix, to show when edge-to-edge tunnelling is needed and what function the tunnel would need to serve:
ConEx-capable? | ECN-capable: Y | ECN-capable: N |
---|---|---|
Y | No tunnel needed | ECN-enabled tunnel |
N | Tunnel Feedback | ECN-enabled tunnel + Tunnel feedback |
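The matrix can also be read as a per-packet decision. The following Python sketch encodes it under that reading; the function name and the string labels are illustrative assumptions, not part of the draft.

```python
def tunnel_requirements(conex_capable: bool, ecn_capable: bool) -> set:
    """Return the tunnel functions a packet needs, following the 2x2 matrix.

    An empty set means no edge-to-edge tunnel is needed.
    """
    needs = set()
    if not ecn_capable:
        # Non-ECN-capable packets get ECN support in the tunnel's outer header.
        needs.add("ECN-enabled tunnel")
    if not conex_capable:
        # Without ConEx signals, congestion must be fed back by the tunnel.
        needs.add("Tunnel feedback")
    return needs


for conex in (True, False):
    for ecn in (True, False):
        print(conex, ecn, sorted(tunnel_requirements(conex, ecn)))
```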
We can now summarise the steps necessary to ensure an ingress congestion policer obtains trustworthy congestion signals:
Network switches/routers do not need any modification. However, both congestion detection by the tunnel (approach a) and ConEx audit (approach b) are significantly easier if switches support ECN.
Once switches support ECN, Data centre TCP [DCTCP] could optionally be used (DCTCP requires ECN). It also requires modified sender and receiver TCP algorithms as well as a more aggressive configuration of the active queue management (AQM) in the L3 switches or routers.
Innovation in the design of congestion policers is expected and encouraged, but here we will describe one specific design to be concrete.
A bulk congestion policing function would most likely be implemented as a shim in the hypervisor. The hypervisor would create one instance of a bulk congestion policer per tenant on the physical machine, and it would ensure that all traffic sent by that tenant's VMs into the network would pass through the relevant congestion policer by associating every new virtual machine with the relevant policer.
A bulk congestion policing function has already been outlined in Section 3. To recap, it consists of a token bucket that is filled with congestion tokens at a constant rate. The bucket is drained by the size of every packet that carries a congestion marking. If the tunnel-feedback approach (a) were used, the bucket would be drained by congestion feedback from the tunnel egress, rather than markings on packets. If the ConEx approach (b) were used, the bucket would be drained by ConEx markings on the actual data packets being forwarded. A congestion policer will need to drain in response to either form of signal, because it is recommended that both approaches are used in combination.
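The recap above can be sketched in Python as follows. This is a minimal illustration, not the draft's implementation: the class and method names are hypothetical, the bucket is assumed to start full, and the policer is assumed simply to discard while the bucket is empty.

```python
class BulkCongestionPolicer:
    """Single congestion token bucket for one tenant (illustrative sketch)."""

    def __init__(self, fill_rate: float, depth: float) -> None:
        self.fill_rate = fill_rate   # congestion tokens (bytes) added per second
        self.depth = depth           # maximum tokens the bucket can hold
        self.tokens = depth          # assumption: the bucket starts full
        self.last_update = 0.0       # time of the last refill, in seconds

    def _refill(self, now: float) -> None:
        # Add tokens at the constant fill rate, capped at the bucket depth.
        elapsed = max(0.0, now - self.last_update)
        self.tokens = min(self.depth, self.tokens + elapsed * self.fill_rate)
        self.last_update = max(self.last_update, now)

    def _drain(self, now: float, nbytes: float) -> None:
        self._refill(now)
        self.tokens = max(0.0, self.tokens - nbytes)

    def on_conex_marked_packet(self, now: float, packet_bytes: float) -> None:
        # Approach (b): drain by the size of a ConEx-marked data packet.
        self._drain(now, packet_bytes)

    def on_tunnel_feedback(self, now: float, marked_bytes: float) -> None:
        # Approach (a): drain by congestion reported by the tunnel egress.
        self._drain(now, marked_bytes)

    def should_discard(self, now: float) -> bool:
        # Assumption: discard only while the bucket is empty.
        self._refill(now)
        return self.tokens <= 0.0


policer = BulkCongestionPolicer(fill_rate=1000.0, depth=3000.0)
policer.on_conex_marked_packet(now=0.0, packet_bytes=1500)
policer.on_tunnel_feedback(now=0.0, marked_bytes=1500)
print(policer.should_discard(now=0.0))  # True: the bucket has been emptied
print(policer.should_discard(now=1.0))  # False: one second of refill
```

Both signal types drain the same bucket, which matches the recommendation to use the two approaches in combination.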
Various more sophisticated congestion policer designs have been evaluated [CPolTrilogyExp]. In these experiments, it was found that it is better if the policer gradually increases discards as the bucket becomes empty. Also isolation between tenants is better if each tenant is policed based on the combination of two buckets, not one (Figure 4):
In this arrangement each marked packet drains tokens from both buckets, and the probability of policer discard is taken as the worse of the two buckets.
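A rough Python sketch of this dual-bucket arrangement is given below. The linear ramp of discard probability, the ramp threshold, and all fill rates and depths are placeholder assumptions chosen for illustration; the draft does not specify them here, and the names are hypothetical.

```python
class Bucket:
    """Congestion token bucket whose discard probability ramps up as it empties."""

    def __init__(self, fill_rate: float, depth: float, ramp_fraction: float = 0.5) -> None:
        self.fill_rate = fill_rate          # tokens (bytes) added per second
        self.depth = depth                  # maximum tokens
        self.ramp_fraction = ramp_fraction  # assumption: ramp starts at this fill level
        self.tokens = depth                 # assumption: the bucket starts full

    def refill(self, elapsed: float) -> None:
        self.tokens = min(self.depth, self.tokens + elapsed * self.fill_rate)

    def drain(self, nbytes: float) -> None:
        self.tokens = max(0.0, self.tokens - nbytes)

    def drop_probability(self) -> float:
        # Assumed linear ramp: 0 above the threshold, rising to 1 when empty.
        ramp_start = self.ramp_fraction * self.depth
        if self.tokens >= ramp_start:
            return 0.0
        return 1.0 - self.tokens / ramp_start


class DualBucketPolicer:
    """Polices on the worse of a deep and a shallow bucket."""

    def __init__(self, deep: Bucket, shallow: Bucket) -> None:
        self.deep = deep
        self.shallow = shallow

    def advance(self, elapsed: float) -> None:
        self.deep.refill(elapsed)
        self.shallow.refill(elapsed)

    def on_marked_bytes(self, nbytes: float) -> None:
        # Each marked packet drains tokens from both buckets.
        self.deep.drain(nbytes)
        self.shallow.drain(nbytes)

    def drop_probability(self) -> float:
        # The worse (higher) discard probability of the two buckets applies.
        return max(self.deep.drop_probability(), self.shallow.drop_probability())


policer = DualBucketPolicer(
    deep=Bucket(fill_rate=100.0, depth=10000.0),
    shallow=Bucket(fill_rate=1000.0, depth=3000.0),
)
policer.on_marked_bytes(2250)
print(policer.drop_probability())  # 0.5, set by the shallow bucket
```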
[Figure 4 diagram (ASCII art, layout lost in extraction): two token buckets, one deep and one shallow, with fill rates labelled c*wi and wi. Both buckets are drained by marked packets, and the worse of the two buckets triggers policing. Legend: see previous figure.]
Figure 4: Dual Congestion Token Bucket (in place of each single bucket in the previous figure)
While the data centre network operator only needs to police congestion in bulk, tenants may wish to enforce their own limits on individual users or applications, as sub-limits of their overall allowance. Given that all the information used for policing is readily available within the transport layer of their own operating system, tenants can readily apply any such per-flow, per-user or per-application limitations. The tenant may operate their own fine-grained policing software, or such detailed control capabilities may be offered as part of the platform (platform as a service or PaaS).
A customer may run virtual machines on multiple physical nodes, in which case at the time each VM is instantiated the data centre operator will deploy a congestion policer in the hypervisor on each node where the customer is running a VM. The DC operator can arrange for these congestion policers to collectively enforce the per-customer congestion allowance, as a distributed policer.
A function to distribute a customer's tokens to the policer associated with each of the customer's VMs would be needed. This could be similar to the distributed rate limiting of [DRL], which uses a gossip-like protocol to fill the sub-buckets. Alternatively, a logically centralised bucket of congestion tokens could be used. It could be replicated for reliability, and then there could be simple 1-1 communication between the central bucket and each local token bucket.
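As one possible reading of the logically centralised alternative, the Python sketch below has each local bucket top itself up from a central pool. The grant policy (first come, first served, up to whatever the pool holds), the starting state and all names are assumptions made for illustration; replication and the communication protocol are not modelled.

```python
class CentralTokenPool:
    """Logically centralised bucket holding a tenant's congestion tokens."""

    def __init__(self, fill_rate: float, depth: float) -> None:
        self.fill_rate = fill_rate   # the tenant's overall allowance, tokens per second
        self.depth = depth           # maximum tokens held centrally
        self.tokens = depth          # assumption: the pool starts full

    def advance(self, elapsed: float) -> None:
        self.tokens = min(self.depth, self.tokens + elapsed * self.fill_rate)

    def grant(self, requested: float) -> float:
        # Assumed policy: hand out as much as is asked for and available.
        granted = min(max(0.0, requested), self.tokens)
        self.tokens -= granted
        return granted


class LocalBucket:
    """Token bucket of the policer on one physical node."""

    def __init__(self, depth: float) -> None:
        self.depth = depth
        self.tokens = 0.0            # assumption: starts empty until topped up

    def top_up(self, pool: CentralTokenPool) -> float:
        # 1-1 exchange with the central pool: ask for enough to fill up.
        granted = pool.grant(self.depth - self.tokens)
        self.tokens += granted
        return granted


pool = CentralTokenPool(fill_rate=1000.0, depth=5000.0)
node_a = LocalBucket(depth=3000.0)
node_b = LocalBucket(depth=3000.0)
print(node_a.top_up(pool))  # 3000.0
print(node_b.top_up(pool))  # 2000.0, all that is left in the pool
pool.advance(1.0)
print(node_b.top_up(pool))  # 1000.0, after one second of refill
```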
Importantly, congestion tokens can be freely reassigned between different VMs, because a congestion token is equivalent at any place or time in a network. In contrast, traditional bit-rate tokens cannot simply be reassigned from one VM to another without implications on the balance of network loading. This is because the parameters used for bit-rate policing depend on the topology and its capacity planning (open loop), whereas congestion policing complements the closed loop congestion avoidance system that adapts to the prevailing traffic and topology.
As well as distribution of tokens between the VMs of a tenant, it would similarly be feasible to allow transfer of tokens between tenants, also without breaking the performance isolation properties of the system. Secure token transfer mechanisms could be built above the underlying policing design described here, but that is beyond the current scope and therefore deferred to future work.
A mechanism to bring trustworthy congestion signals to the ingress (Section 5.1) is critical to this performance isolation solution. Section 5.1.1 compares the two solutions: b) ConEx, which is efficient and timely enough to police short flows; and a) tunnel-feedback, which is neither. However, ConEx requires deployment in host operating systems first, while tunnel feedback can be deployed unilaterally by the data centre operator in all hypervisors (or containers), without requiring support in guest operating systems.
This section describes the steps necessary to support both approaches. This would provide an incremental deployment route with the best of both worlds: tunnel feedback could be deployed initially for unmodified guest OSs despite its weaknesses, and ConEx could gradually take over as it was deployed more widely in guest OSs. It is important not to deploy the tunnel feedback approach without checking for ConEx-capable packets, otherwise it will never be possible to migrate to ConEx. The advantages of being able to migrate to ConEx are:
Initially, the approach would be confined to intra-data centre traffic. With the addition of ECN support on network equipment (at least bottleneck access routers) in the WAN between data centres, it could straightforwardly be extended to inter-data centre scenarios, including across interconnected backbone networks.
Once this approach becomes deployed within data centres and possibly across interconnects between data centres and enterprise LANs, the necessary support will be implemented in a wide range of equipment used in these scenarios. Similar equipment is also used in other networks (e.g. broadband access and backhaul), so that it would start to be possible for these other networks to deploy a similar approach.
The Related Work section of [CongPol] provides a useful comparison of the approach proposed here against other attempts to solve similar problems.
When the hose model is used with Diffserv, capacity has to be considerably over-provisioned for all the unfortunate cases when multiple sources of traffic happen to coincide even though they are all in-contract at their respective ingress policers. Even so, every node within a Diffserv network also has to be configured to limit higher traffic classes to a maximum rate in case of really unusual traffic distributions that would starve lower priority classes. Therefore, for really important performance assurances, Diffserv is used in the 'pipe' model where the policer constrains traffic separately for each destination, and sufficient capacity is provided at each network node for the sum of all the peak contracted rates for paths crossing that node.
In contrast, the congestion policing approach is designed to give full performance assurances across a meshed network (the hose model), without having to divide a network up into pipes. If an unexpected distribution of traffic from all sources focuses on a congestion hotspot, it will increase the congestion-bit-rate seen by the policers of all sources contributing to the hot-spot. The congestion policers then focus on these sources, which in turn limits the severity of the hot-spot.
The critical improvement over Diffserv is that the ingress edges receive information about any congestion occurring in the middle, so they can limit how much congestion occurs, wherever it happens to occur. Previously, Diffserv edge policers had to limit traffic generally in case it caused congestion, because they never knew whether it would (open loop control).
Congestion policing mechanisms could be used to assure the performance of one data flow (the 'pipe' model), but this would involve unnecessary complexity, given the approach works well for the 'hose' model.
Therefore, congestion policing allows capacity to be provisioned for the average case, not for the near-worst case when many unlikely cases coincide. It assures performance for all traffic using just one traffic class, whereas Diffserv only assures performance for a small proportion of traffic by partitioning it off into higher priority classes and over-provisioning relative to the traffic contracts sold for this class.
{ToDo: Refer to [conex-policing] for comparison with WRR & WFQ}
Seawall {ToDo} [Seawall]
{ToDo}
This document does not require actions by IANA.
{ToDo}
Thanks to Yu-Shun Wang for comments on some of the practicalities.
Bob Briscoe is part-funded by the European Community under its Seventh Framework Programme through the Trilogy 2 project (ICT-317756). The views expressed here are solely those of the author.
Detailed changes are available from http://tools.ietf.org/html/draft-briscoe-conex-data-centre