TCP Maintenance and Minor Extensions (tcpm) | B. Briscoe |
Internet-Draft | BT |
Updates: 793 (if approved) | March 09, 2015 |
Intended status: Experimental | |
Expires: September 10, 2015 |
Inner Space for all TCP Options (Kitchen Sink Draft - to be Split Up)
draft-briscoe-tcpm-inspace-mode-tcpbis-00
This document describes an experimental redesign of TCP's extensibility mechanism. It aims to traverse most known middleboxes, including connection splitters, by making it possible to tunnel all TCP options within the TCP Data. It provides a choice between in-order and out-of-order delivery for TCP options. In-order delivery is a useful new facility for options that control datastream processing. Out-of-order delivery has been the norm for TCP options until now, and is necessary for options involved with acknowledging data, otherwise flow control can deadlock. TCP's original design limits TCP option space to 40B. In the new design there is no such arbitrary limit, other than the maximum size of a segment. The TCP client can immediately start to use the extra option space optimistically from the very first SYN segment, by using a dual handshake. The dual handshake is designed to prevent a legacy server from getting confused and sending the control options to the application as user-data. The dual handshake is only one strategy; a single handshake will usually suffice once deployment is underway. In summary, the protocol should allow new TCP options to be introduced i) with minimal middlebox traversal problems; ii) with incremental deployment from legacy servers; iii) with zero handshaking delay; iv) with a choice of in-order and out-of-order delivery; v) without arbitrary limits on available space.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on September 10, 2015.
Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
TCP has become hard to extend, partly because the option space was limited to 40B when TCP was first defined [RFC0793] and partly because many middleboxes only forward TCP headers that conform to the stereotype they expect.
In 2011, [Honda11] tested a broad but small set of paths and found that there were few if any middlebox traversal problems over residential access networks, but the chance of a new option traversing other types of access was terrible. Cellular was especially bad (stripping options on 40% of paths for port 80 and 20% for other ports), but WiFi hotspots, enterprise, and university networks were close behind (typically, about 18% of paths blocked new extensions). This specification ensures new TCP capabilities can traverse most middleboxes by tunnelling TCP options within the TCP Data as 'Inner Options' (Figure 1). Then the TCP receiver can reconstruct the Inner Options sent by the sender, even if a middlebox resegments the datastream and even if it strips 'Outer' options from the TCP header that it does not recognise.
The two words 'Inner Space' are appropriate as a name for the scheme; 'Inner' because it encapsulates options within the TCP Data and 'Space' because the space for TCP options within the TCP Data is virtually unlimited—constrained only by the maximum segment size.
    ,-----.                TCP Payload                ,-----.
    | App |<----------------------------------------->| App |
    |-----|                                           |-----|
    |     |       Inner Options within TCP Data       |     |
    |     |<----------------------------------------->|     |
    |     |                                           |     |
    | TCP | TCP Header and            TCP header and  | TCP |
    |     | Outer Options  ,---------. Outer Options  |     |
    |     |<-------------->|Middlebox|<-------------->|     |
    |-----|                |---------|                |-----|
    | IP  |                |   IP    |                | IP  |
    :     :                :         :                :     :
Figure 1: Encapsulation Approach
Tunnelling options within TCP Data raises two difficult questions: i) immediate (out-of-order) delivery of certain options and ii) bootstrapping the inner control channel.
Traditional TCP options [RFC0793] are delivered unreliably and out of order, because they are within the main header, outside the TCP sequence space. This document calls these 'Outer Options'. When TCP options are placed within the TCP Data (Inner Options), it is easiest to include them within TCP's sequence space. Then TCP naturally delivers them reliably and in order without any extra machinery. However, in-order delivery is unacceptable for some options.
TCP options fall into three categories:
The simplest ('default') variant of the Inner Space protocol [I-D.briscoe-tcpm-inner-space] delivers all Inner Options reliably and in order within the datastream. Therefore the default-mode Inner Space protocol can only support segment-related options as Outer Options. This is irritating because even though only a few options are segment-related, if just one kind of option cannot traverse a middlebox, it often prevents a whole set of other extensions from being used even though they would have no problem traversing the middlebox as Inner Options. For instance, one MPTCP option (the Data ACK) and one tcpcrypt option (the MAC) have to be delivered immediately (out of order), even though all the other MPTCP and tcpcrypt options can be delivered in order.
The present specification extends the default-mode Inner Space protocol to add out-of-order delivery of Inner Options. It can then support all TCP options as Inner Options. This offers the prospect of completely circumventing middlebox problems and space problems for all TCP extensions.
The second difficult question addressed by the present specification is how to bootstrap the inner control channel—without any visible difference to the TCP wire protocol that would otherwise be unlikely to traverse many middleboxes. Given the Inner Space protocol places control options within TCP Data, it is critical that a legacy TCP receiver is never confused into passing this mix to an application as if it were pure data. Naïvely, both ends could handshake to check they understand the protocol, but this would introduce a round of delay.
The Inner Space protocol will have to use whichever bootstrap approach is least bad, because they all involve compromises. For the present specification, the dual handshake has been chosen over the only other candidate currently in the running [I-D.touch-tcpm-tcp-syn-ext-opt], in which the client complements the SYN with an out-of-band (OOB) segment. In both approaches the client starts the connection with two segments. However, with the OOB approach the two segments will always be necessary, whereas the dual handshake is only a transition strategy that becomes unnecessary for each server as it is upgraded. Both approaches will need to be tested for middlebox traversal. It seems likely that many firewalls will block the OOB segment and it is also expected that some middleboxes will block the data in the SYN used for one of the dual handshakes.
In the dual handshake approach the client sends two SYNs; one for an upgraded server, and the other for an ordinary server. Then, if the client discovers that the server does not understand the new protocol, it can abort the upgraded handshake before the server corrupts the application by passing it Inner Options. Otherwise, if the server does understand the new protocol, the client can abort the ordinary handshake, given it offers no extra option space. Either way, zero extra delay is added. Interworking of the dual handshake with TCP Fast Open [I-D.ietf-tcpm-fastopen] is carefully defined so that either server can pass data to the application as soon as the initial SYN arrives.
Solving the five problems of i) option-space exhaustion; ii) middlebox traversal; iii) legacy server confusion; iv) a choice of in-order and out-of-order frame delivery; and v) handshake latency; does not come without cost:
A number of extensions to TCP are in the process of definition and experimentation (TCPINC, MPTCP, etc). If a general-purpose middlebox traversal solution were available now, each new protocol design would not need complex machinery to detect and work round the byzantine range of middlebox behaviours. It would also make these extensions available to many more users.
It seems inevitable that ultimately more option space will be needed, particularly given that many of the TCP options introduced recently consume large numbers of bits in order to provide sufficient information entropy, which is not amenable to compression.
Extension of TCP option space requires support from both ends. This means it will take many years before the facility is functional for most pairs of end-points. Therefore, given the problem is already becoming pressing, a solution needs to start being deployed now.
This experimental specification extends the TCP wire protocol. It is independent of the dynamic rate control behaviour of TCP and it is independent of (and thus compatible with) any protocol that encapsulates TCP, including IPv4 and IPv6.
TCP is critical to the robust functioning of the Internet, therefore any proposed modifications to TCP need to be thoroughly tested.
The implications of this work are more than 'just' a low latency incrementally deployable way to extend TCP option space:
The body of the document starts with a full specification of the Inner Space extension to TCP (Section 2). It is rather terse, answering 'What?' and 'How?' questions, but deferring 'Why?' to Section 3. The careful design choices made are not necessarily apparent from a superficial read of the specification, so the Design Rationale section is fairly extensive. The body of the document ends with Section 5 that checks possible interactions between the new scheme and pre-existing variants of TCP, including interaction with partial implementations of TCP in known middleboxes.
Appendix A defines the encoding that the Inner Space protocol uses for TCP Data. Eventually, this appendix is likely to be published separately because the encoding is more generally applicable. Appendix B defines an Inner TCP Option that provides a capability to switch the mode of a TCP connection, where the term 'mode' is a very general concept that might be used to change the ordering semantics of a connection, or switch off the Inner Space capability part way through a connection. Eventually this appendix is likely to be published separately due to its general applicability. Appendix C specifies optional extensions to the protocol that will need to be implemented experimentally to determine whether they are useful. And Appendix D discusses the merits of the chosen design against some of the optional extensions.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. In this document, these words will appear with that interpretation only when in ALL CAPS. Lower case uses of these words are not to be interpreted as carrying RFC-2119 significance.
Note that the term 'Ordinary' is used for segments and connections, but the term 'Legacy' is used for hosts. This is because, if the Inner Space protocol were widely used in future, a host that could not open an Upgraded Connection would be considered deficient and therefore 'Legacy', whereas an Ordinary Connection would not be considered deficient; because it will always be legitimate to open an Ordinary Connection if extra option space or middlebox traversal is not needed.
During initial deployment, an Upgraded TCP Client sends two alternative SYNs: an Ordinary SYN in case the server is legacy and a SYN-U in case the server is upgraded. The two SYNs MUST have the same network addresses and the same destination port, but different source ports. Once the client establishes which type of server has responded, it continues the connection appropriate to that server type and aborts the other without completing the 3-way handshake.
The format of the SYN-U will be described later (Section 2.2.2). At this stage it is only necessary to know that the client can put either TCP options or payload (or both) in a SYN-U, in the space traditionally intended only for payload. So if the server's response shows that it does not recognise the Upgraded SYN-U, the client is responsible for aborting the Upgraded Connection. This ensures that a Legacy TCP Server will never erroneously confuse the application by passing it TCP options as if they were user-data.
Section 3.1 explains various strategies the client can use to send the SYN-U first and defer or avoid sending the Ordinary SYN. However, such strategies are local optimizations that do not need to be standardized. The rules below cover the most aggressive case, in which the client sends the SYN-U then the Ordinary SYN back-to-back to avoid any extra delay. Nonetheless, the rules are just as applicable if the client defers or avoids sending the Ordinary SYN.
Table 1 summarises the TCP 3-way handshake exchange for each of the two SYNs in the two right-hand columns, between an Upgraded TCP Client (the active opener) and either:
Because the two SYNs come from different source ports, the server will treat them as separate connections, probably using separate threads (assuming a threaded server). A load balancer might forward each SYN to separate replicas of the same logical server. Each replica will deal with each incoming SYN independently - it does not need to co-ordinate with the other replica.
| | | Ordinary Connection | Upgraded Connection |
|---|---|---|---|
| 1 | Upgraded Client | >SYN | >SYN-U |
| /\/\ | /\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ |
| 2 | Legacy Server | <SYN/ACK | <SYN/ACK |
| 3a | Upgraded Client | Waits for response to both SYNs | |
| 3b | " | >ACK | >RST |
| 4 | | Cont... | |
| /\/\ | /\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ |
| 2 | Upgraded Server | <SYN/ACK | <SYN/ACK-U |
| 3a | Upgraded Client | Waits for response to SYN-U | |
| 3b | " | >RST | >ACK |
| 4 | | | Cont... |
Each column of the table shows the required 3-way handshake exchange within each connection, using the following symbols:
The connection that starts with an Ordinary SYN is called the 'Ordinary Connection' and the one that starts with a SYN-U is called the 'Upgraded Connection'. An Upgraded Server MUST respond to a SYN-U with an Upgraded SYN/ACK (termed a SYN/ACK-U and defined in Section 2.2.2). Then the client recognises that it is talking to an Upgraded Server. The client's behaviour depends on which response it receives first, as follows:
If the client receives a response to the SYN, but a short while after that {ToDo: duration TBA} the response to the SYN-U has not arrived, it SHOULD retransmit the SYN-U. If latency is more important than the extra TCP option space, in parallel to any retransmission, or instead of any retransmission, the client MAY give up on the Upgraded (SYN-U) Connection by sending a reset (RST) and completing the 3-way handshake of the Ordinary Connection.
If the client receives no response at all to either the SYN or the SYN-U, it SHOULD solely retransmit one or the other, not both. If latency is more important than the extra TCP option space, it will retransmit the SYN. Otherwise it will retransmit the SYN-U. It MUST NOT retransmit both segments, because the lack of response could be due to severe congestion.
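The outcomes in Table 1 and the retransmission rule above can be sketched as follows. This is a minimal illustration rather than part of the specification: the names `Response`, `resolve_dual_handshake` and `choose_retransmission` are hypothetical, and timers and the retransmission of a delayed SYN-U are left out.

```python
from enum import Enum


class Response(Enum):
    """Server reply to the SYN-U (hypothetical names for this sketch)."""
    SYN_ACK = "SYN/ACK"        # ordinary reply: the server is legacy
    SYN_ACK_U = "SYN/ACK-U"    # upgraded reply: the server is upgraded


def resolve_dual_handshake(syn_u_response):
    """Return (ordinary_action, upgraded_action) for step 3b of Table 1."""
    if syn_u_response is Response.SYN_ACK_U:
        # Upgraded Server: abort the Ordinary Connection and continue
        # the Upgraded Connection.
        return ("RST", "ACK")
    if syn_u_response is Response.SYN_ACK:
        # Legacy Server: abort the Upgraded Connection before any Inner
        # Options can be passed to the application as user-data.
        return ("ACK", "RST")
    raise ValueError("unexpected response to SYN-U")


def choose_retransmission(latency_critical):
    """Pick the single segment to retransmit when neither SYN was answered."""
    # Only one of the two is retransmitted, never both, because the
    # silence could be due to severe congestion.
    return "SYN" if latency_critical else "SYN-U"
```

Returning the pair of actions keeps the two connections symmetric, which mirrors the two right-hand columns of Table 1.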
Once an Upgraded Connection has been successfully negotiated in the SYN, SYN/ACK exchange, either host can allocate any amount of the TCP Data space in any subsequent segment for extra TCP options. In fact, the sender has to use the upgraded segment structure in every subsequent segment of the connection that contains non-zero TCP Payload. The sender can use the upgraded structure in a segment carrying no TCP Payload, but it does not have to (see Section 2.3.1.5).
As well as extra option space, the facility offers other advantages, such as reliable ordered delivery of Inner TCP Options on empty segments and more robust middlebox traversal. If none of these features is needed, at any point the facility can be disabled for the rest of the connection, using the ModeSwitch TCP option in Appendix B. Interestingly, the ModeSwitch option itself can be very simple because it uses the reliable ordered delivery property of Inner Options, rather than having to cater for the possibility that a message to switch modes might be lost or reordered.
An Upgraded Segment is structured as shown in Figure 2. Up to the TCP Data Offset, the structure is identical to an Ordinary TCP Segment, with a base TCP Header (BaseHdr) and the usual facility to set the Data Offset (DO) to allow space for TCP options. These regular TCP options are renamed by this specification to Outer TCP Options or just Outer Options, and labelled as OuterOpts in the figure.
                        |                     SDS                      |
                        |--------------------------------------------->|
                        |P|        |   SOO    |                        |
                        |a|        ,--------->|          |             |
             DO         |d| Len+1  |        InOO         |             |
    ,------------------>| ,------->,-------------------->|             |
    +--------+----------+-+--------+----------+----------+-------------+
    | BaseHdr| OuterOpts| | InSpace|PrefixOpts|SuffixOpts|   Payload   |
    +--------+----------+-+--------+----------+----------+-------------+
                        |          '----------.----------'             |
                        |               Inner Options                  |
                        `-----------------------.----------------------'
                                                TCP Data
All offsets are specified in 4-octet (32-bit) words, except SDS and Pad, which are in octets.
Figure 2: The Structure of an Upgraded Segment (not to scale)
Unlike an Ordinary TCP Segment, the Payload of an Upgraded Segment does not start straight after the TCP Data Offset. Instead, Figure 2 shows that space is provided for additional Inner TCP Options before the TCP Payload. The size of this space is termed the Inner Options Offset (InOO). The TCP receiver reads the InOO field from the Inner Option Space (InSpace) option defined in Section 2.2.2.
Padding might have to be included at the start of the TCP Data to align the InSpace option on a 4-octet boundary from the start of the datastream (see Section 2.3.1.2).
Because the InSpace Option is only ever located in a standardized location it does not need to follow the RFC 793 format of a TCP option. Therefore, although we call InSpace an 'option', we do not describe it as a 'TCP option'. The Length (Len) of the InSpace option itself is read from a fixed location within the InSpace option.
The Sent Data Size (SDS) is also read from within the InSpace Option. If the datastream has been resegmented, it allows the receiver to know the size of the segment as it was when it was sent, even if the InSpace Options are no longer at the start of each segment (see Section 2.3).
The Suffix Options Offset (SOO) is also read from within the InSpace Option. It delineates the end of the Prefix TCP Options (PrefixOpts in the figure) and the start of the Suffix TCP Options (SuffixOpts). The receiver processes PrefixOpts before OuterOpts, then SuffixOpts afterwards in order with the datastream. Full details of option processing are given in Section 2.3.
The first segment in each direction (i.e. the SYN or the SYN/ACK) is identifiable as upgraded by the presence of 6 octets of magic number at the start of the TCP Data. The probability that an Upgraded Server will mistake arbitrary data at the beginning of the payload of an Ordinary Segment for the Magic Number has to be allowed for, but it is vanishingly small (see Section 3.2.2). Once an Upgraded Connection has been negotiated during the SYN - SYN/ACK exchange, a magic number is not needed to identify Upgraded Segments, because both ends then know the protocol that determines where subsequent InSpace options will be located.
The internal structure of the InSpace Option for an Upgraded SYN or SYN/ACK segment (SYN=1) is defined in Figure 3a) and for a segment with SYN=0 in Figure 3b) or an abbreviated form in Figure 3c).
        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    a) +---------------------------------------------------------------+
       |                        Magic Number A                         |
       +-------------------------------+---------------------------+---+
       |   Sent TCP Data Size (SDS)    |Inner Options Offset (InOO)|Len|
       +-------------------------------+---------------------------+---+
       |        Magic Number B         |Suffix Options Offset (SOO)|CU |
       +-------------------------------+---------------------------+---+

    b) +-------------------------------+-----------------------------+-+
       |            Marker             |            ZOMBI            |CU
       +-------------------------------+---------------------------+-+-+
       |   Sent TCP Data Size (SDS)    |Inner Options Offset (InOO)|Len|
       +-------------------------------+---------------------------+---+
       |     Currently Unused (CU)     |Suffix Options Offset (SOO)|CU |
       +-------------------------------+---------------------------+---+

    c) +-------------------------------+-----------------------------+-+
       |            Marker             |            ZOMBI            |P|
       +-------------------------------+---------------------------+-+-+
       |   Sent TCP Data Size (SDS)    |Inner Options Offset (InOO)|Len|
       +-------------------------------+---------------------------+---+
Figure 3: InSpace Option Format a) SYN=1; b) SYN=0, Len=2; c) SYN=0, Len=1
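As an illustration of the SYN=1 layout in Figure 3a, the following sketch unpacks the three 32-bit words of the InSpace Option. It assumes the fields are carried in network byte order with bit 0 as the most significant bit, as is usual for such diagrams. The constants `MAGIC_A` and `MAGIC_B` are placeholders invented for this example, not the values assigned by the specification, and `parse_inspace_syn` is a hypothetical name.

```python
import struct

# Placeholder values: the real magic numbers are given in the field
# definitions of the specification and are NOT reproduced here.
MAGIC_A = 0x12345678   # hypothetical 32-bit Magic Number A
MAGIC_B = 0x9ABC       # hypothetical 16-bit Magic Number B


def parse_inspace_syn(tcp_data):
    """Parse a 12-octet SYN=1 InSpace Option; return None if not found."""
    if len(tcp_data) < 12:
        return None
    # Word 0: Magic A. Word 1: SDS, then InOO and Len packed into 16 bits.
    # Word 2: Magic B, then SOO and CU packed into 16 bits.
    magic_a, sds, word1, magic_b, word2 = struct.unpack("!IHHHH", tcp_data[:12])
    if magic_a != MAGIC_A or magic_b != MAGIC_B:
        return None
    return {
        "sds": sds,            # Sent TCP Data Size, in octets
        "inoo": word1 >> 2,    # Inner Options Offset, 14 bits, in words
        "len": word1 & 0x3,    # Len, 2 bits
        "soo": word2 >> 2,     # Suffix Options Offset, 14 bits, in words
    }
```

Only the magic-number comparison is shown. The remaining conditions that a receiver checks before treating a segment as upgraded are not modelled.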
The fields are defined as follows (see Section 3.4 for the rationale behind these format choices):
The following field is only defined within a segment with SYN=1 (i.e. a SYN or SYN/ACK):
The following fields are only defined within a segment with SYN=0:
The objects that Inner Space places within the TCP Data can be divided into two types:
The rationale for these choices is given in Section 3.2.6. The following two subsections lay out the order in which these options are processed respectively when the sender writes them and when the receiver reads them.
If an Upgraded TCP Client uses a TCP Fast Open (TFO) cookie [I-D.ietf-tcpm-fastopen] in an Upgraded SYN-U, it MUST place the TFO option within the Inner TCP Options, beyond the Data Offset.
This rule is specific to TFO, but it can be generalised to any capability similar to TFO as follows: An Upgraded TCP Client MUST NOT place any TCP option in the Outer TCP Options of a SYN if it might cause a TCP server to pass user-data directly to the application before its own 3-way handshake completes.
If a client uses TCP Fast Open cookies on both the parallel connection attempts of a dual handshake, an Upgraded Server will deliver the TCP Payload to the application twice before the client aborts the Ordinary Connection. This is not a problem, because [I-D.ietf-tcpm-fastopen] requires that TFO is only used for applications that are robust to duplicate requests.
The sender MUST add (3 - ((seqno - isn - 1) % 4)) octets of non-zero padding (Pad in Figure 2) to align the start of the InSpace option on a 4-octet word boundary from the start of the datastream, where seqno is the TCP sequence number of the segment, isn is the initial sequence number and '%' is the modulo operation.
If the end of the last Inner TCP Option does not align on a 4-octet boundary, the sender MUST append sufficient no-op TCP options. The end of the Prefix TCP Options MUST be similarly aligned.
If the sending TCP is applying a block-mode transformation to the TCP Data (e.g. compression or encryption), the sender might have to add some padding options to align the end of the Inner Options with the end of a block. Any yet-to-be-written encryption specification will need to carefully define this padding in order not to weaken the cipher.
The sender MUST include all the TCP Data in TCP's sequence number and acknowledgement number space, i.e. any padding, the InSpace Option and any Inner Options as well as the TCP Payload.
Whenever the sender includes non-zero TCP Payload in a segment, it MUST also include an InSpace Option, whether or not there are any Inner Options (to enable reconstruction in case of resegmentation).
On the other hand, if the sender includes no TCP Payload in a segment (e.g. ACKs, RSTs), it SHOULD NOT include an InSpace Option unless it is necessary to send an Inner Option. {ToDo: Consider whether there is any reason to preclude Inner Options on a RST, FIN or FIN-ACK.}
A sender MUST consider the sequence space consumed by InSpace options, any padding and any Prefix Options as implicitly acknowledged. Therefore, the sender has no need to hold these items in its retransmit buffer. A sender MUST hold Suffix Options (and TCP Payload, of course) in its retransmit buffer until they are acknowledged.
These rules and those below concerning flow control and pure ACKs have significant implications, which are discussed alongside their rationale in Section 3.2.6.
The sender MUST count Suffix Options and the TCP Payload towards consumption of the receive window advertised by the remote host. Nonetheless, the sender MUST NOT count any padding, the InSpace Option and any Prefix Options towards consumption of the advertised receive window.
There might be a legacy middlebox on the path that discards segments containing out-of-window data but does not understand the way the Inner Space protocol modifies flow control. To traverse such a middlebox, a sending implementation SHOULD use a modified flow control algorithm that avoids the send window dropping below a minimum threshold Snd.Wind.Min (instead of zero). Each sender unilaterally chooses Snd.Wind.Min to allow for Fire-and-Forget Objects it might need in flight on its half-connection. The receiving sides of both half-connections play no part in this allowance. Section 3.2.6.2 discusses the rationale for this approach.
A reasonable value for the sender to choose for Snd.Wind.Min would be twice the size of the fire-and-forget objects currently in flight. This would ensure that a middlebox still considers all the fire-and-forget objects are in-window, even if a whole window were lost and retransmitted.
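The accounting rules above can be illustrated with a small sketch. It assumes that the fire-and-forget objects are exactly the items the sender does not hold in its retransmit buffer (padding, the InSpace Option and Prefix Options). The class `SegmentSizes` and the function `snd_wind_min` are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class SegmentSizes:
    """Octet counts of each part of one Upgraded Segment's TCP Data."""
    pad: int = 0
    inspace: int = 0
    prefix_opts: int = 0
    suffix_opts: int = 0
    payload: int = 0

    def sequence_space(self):
        # All of the TCP Data consumes sequence space.
        return (self.pad + self.inspace + self.prefix_opts
                + self.suffix_opts + self.payload)

    def fire_and_forget(self):
        # Assumed to be the parts that are not held for retransmission.
        return self.pad + self.inspace + self.prefix_opts

    def window_consumed(self):
        # Only Suffix Options and Payload count against the receive window.
        return self.suffix_opts + self.payload


def snd_wind_min(segments_in_flight):
    """Twice the fire-and-forget octets currently in flight."""
    return 2 * sum(s.fire_and_forget() for s in segments_in_flight)
```

For a segment with 1 octet of padding, an 8-octet InSpace Option, 4 octets of Prefix Options, 8 octets of Suffix Options and 100 octets of Payload, 121 octets of sequence space are consumed but only 108 count against the advertised receive window.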
There are three types of acknowledgement segment:
It is expected that impure ACKs will rarely be necessary. An example of an Impure ACK is a segment containing no TCP Payload, but still carrying a message authentication code (MAC) in a Prefix Option in order to authenticate and protect the integrity of the TCP header of the ACK.
If an Inner Space TCP implementation currently has no further TCP Payload or Suffix Options to send, and it receives Impure ACKs, it MUST NOT itself respond with further impure ACKs, i.e. it MUST NOT consume further sequence space solely to acknowledge impure ACKs.
Nonetheless, while it has no further TCP Payload or Suffix Options to send, it MAY cumulatively acknowledge the TCP Data in the impure ACKs it has received by emitting a pure ACK, but no more often than once per round trip time (see Section 3.2.6.2 for rationale). If it later starts sending further Payload Data and/or Suffix Options, it will cumulatively acknowledge the sequence space of all the TCP Data in the intervening impure ACKs it has received, as would be expected.
If a sequence of one or more Impure ACKs is dropped, the receiver will not know whether they were impure. The receiver's normal ACK feedback will request a retransmission of the missing sequence space. By definition, the sender does not hold fire-and-forget options in its retransmit buffer. Therefore, the sender MUST reconstruct a new Impure ACK of at least the same size as the gap in fire-and-forget options (if SACK has not been negotiated the sender will only know the size of the gap up to any subsequent in-order objects). The sender will include whatever Prefix Options are relevant at the time of retransmission (which might be none). If the size of the new Prefix Options is less than the gap to be filled, the sender MUST make up the shortfall with no-op Prefix Options. If the size of the new Prefix Options is greater than the gap to be filled, no harm will be done. This is because the receiver discards fire-and-forget options after processing them, so any overflow will not overwrite flow-controlled in-order data already in the receive buffer.
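The gap-filling rule can be sketched as below. This is a simplification under stated assumptions: `gap` is taken to be the number of octets left for Prefix Options once any padding and the InSpace Option of the rebuilt segment have been accounted for, and the function name `fill_fire_and_forget_gap` is hypothetical. The no-op is the standard one-octet TCP No-Operation option (kind 1).

```python
NOP = b"\x01"  # TCP No-Operation option, kind 1 [RFC0793]


def fill_fire_and_forget_gap(new_prefix_opts, gap):
    """Return Prefix Options that cover at least `gap` octets."""
    # If the options relevant now are shorter than the gap, make up the
    # shortfall with no-ops; if they are longer, they are sent unchanged.
    shortfall = max(0, gap - len(new_prefix_opts))
    return new_prefix_opts + NOP * shortfall
```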
The sender constructs the TCP Data in the following order:
The rules for reading Inner TCP Options are divided between the following two subsections, depending on whether SYN=1 or SYN=0.
This subsection applies when TCP receives a segment with SYN=1, e.g. when the server receives a SYN or the client receives a SYN/ACK.
Before processing any TCP options, unless the size of the TCP Data is less than 12 octets, an Upgraded Receiver MUST determine whether the segment is an Upgraded Segment by checking that all the following conditions apply:
If all these conditions pass, the receiver MAY walk the sequence of Inner TCP Options, using the length of each to check that the sum of their lengths equals InOO. The receiver then concludes that the received segment is an Upgraded Segment.
The receiver then processes the TCP Options in the following order:
The receiver removes the magic number, the InSpace Option and each TCP Option from the TCP Data as it processes each.
The receiver MUST NOT count the size of Prefix Options against the receive window. Strictly it ought to subtract the size of Suffix Options from the receive window on arrival, then add the size back again as it removes them. However, when SYN=1, the Suffix Options will never have to be buffered, so these redundant steps can be skipped.
Once only the TCP Payload (if any) remains, the receiver holds it ready to pass to the application. It then emits the appropriate Upgraded Acknowledgement to progress the handshake (see Section 2.1.1).
If any of the above tests to find the InSpace Option fails:
For the avoidance of doubt the above rules imply that, as long as an InSpace Option has not been found in the segment, the receiver might rerun the tests for it multiple times if multiple Outer TCP Options alter the TCP Data. However, once the receiver has found an InSpace Option, it MUST NOT rerun the tests for an Upgraded Segment in the same segment.
If the receiver has not found an InSpace Option after processing all the Outer Options, it emits the appropriate Ordinary Acknowledgement to progress the handshake (see Section 2.1.1). As normal, it holds any TCP Payload ready to pass to the application.
This subsection applies once the TCP connection has successfully negotiated to use the upgraded InSpace structure.
The receiver processes Prefix Options and Outer Options in the order they are received. But it processes Suffix Options in the order they were sent, which is not necessarily the order in which they are received. The receiver achieves this by processing an arriving segment with SYN=0 in the following order. (Steps 3 & 6 are included for completeness even though no current TCP options apply data transformations):
The receiver uses each InSpace Option to calculate the extent of the associated Inner Options (using SOO and InOO).
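A rough sketch of this calculation for a segment that has not been resegmented is given below. It assumes, following Figure 2, that the InSpace Option occupies Len+1 words after the padding and that both SOO and InOO are measured in 4-octet words from the end of the InSpace Option. The function name `split_tcp_data` and its parameters are hypothetical.

```python
WORD = 4  # offsets are in 4-octet words; Pad is in octets


def split_tcp_data(tcp_data, pad, len_field, inoo, soo):
    """Split the TCP Data of one Upgraded Segment into its parts."""
    inspace_end = pad + (len_field + 1) * WORD
    prefix_end = inspace_end + soo * WORD    # end of Prefix Options
    inner_end = inspace_end + inoo * WORD    # end of all Inner Options
    if soo > inoo or inner_end > len(tcp_data):
        raise ValueError("offsets inconsistent with segment size")
    return {
        "inspace": tcp_data[pad:inspace_end],
        "prefix_opts": tcp_data[inspace_end:prefix_end],
        "suffix_opts": tcp_data[prefix_end:inner_end],
        "payload": tcp_data[inner_end:],
    }
```

A real receiver would first read Len, InOO and SOO from the InSpace Option itself and would use SDS to locate the next InSpace Option after resegmentation. Both steps are omitted here.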
Once only the TCP Payload remains, the TCP receiver passes it to the application as normal.
Middleboxes exist that process some aspects of the TCP Header. The present specification defines a new location for Inner TCP Options beyond the Data Offset; this is intended for the exclusive use of the destination TCP implementation. Therefore:
A TCP implementation is not necessarily aware whether it is deployed in a middlebox or in a destination, e.g. a split TCP connection might use a regular off-the-shelf TCP implementation. Therefore, a general-purpose TCP that implements the present specification will need a configuration switch to disable any search for options beyond the Data Offset and to enable immediate forwarding of data in a SYN.
{ToDo: Define behaviour of forwarding or receiving nodes if the structure or format of an Upgraded Segment is not as specified.}
If an Upgraded TCP Receiver receives an InSpace Option with a Length it does not recognise as valid, it MUST drop the packet and acknowledge the octets up to the start of the unrecognised option.
Values of Sent Data Size greater than 2^16 - 21 (=65,515 = 0xFFEB) octets in a regular (non-jumbo) InSpace Option MUST be treated as the distance to the next InSpace option, but they MUST NOT be taken as indicative of the size of the TCP Data when it was sent. This is because the TCP Data in a regular IPv6 packet cannot be greater than (2^16 -1 - 20) octets (given the minimum TCP header is 20 octets). If the size of the TCP Data is greater than 0xFFEB octets, the sender MUST use a Jumbo InSpace Option (Appendix C.2).
A Sent Data Size of 0xFFFF octets MAY be used to minimise the occurrence of empty InSpace options without permanently disabling the Inner Space protocol for the rest of the connection.
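The threshold in the rules above can be illustrated with a short C sketch. The names INSPACE_MAX_REGULAR_SDS and sds_is_true_data_size() are hypothetical, introduced only for this illustration; the sketch checks nothing beyond the arithmetic stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest TCP Data size in a regular (non-jumbo) packet:
 * 2^16 - 1 - 20 = 65,515 = 0xFFEB octets (minimum TCP header is 20). */
#define INSPACE_MAX_REGULAR_SDS 0xFFEBu

/* Hypothetical helper: true if the Sent Data Size of a regular InSpace
 * Option may be taken as the size of the TCP Data when it was sent.
 * Larger values still give the distance to the next InSpace Option,
 * but say nothing about the size of the sent TCP Data. */
bool sds_is_true_data_size(uint16_t sds)
{
    return sds <= INSPACE_MAX_REGULAR_SDS;
}
```

A receiver following this sketch would treat 0xFFEB as a genuine size, and would treat 0xFFEC through 0xFFFF purely as an offset to the next InSpace Option.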
An implementation of the Inner Space protocol MUST support the EchoCookie TCP option [I-D.briscoe-tcpm-echo-cookie]. To indicate its support for EchoCookie, an Ordinary Client would send an empty EchoCookie TCP option on the SYN. Support for the Inner Space protocol makes this redundant. Therefore an Inner Space client MUST NOT send an empty EchoCookie TCP option on a SYN-U.
The EchoCookie TCP option replaces the SYN Cookie mechanism [RFC4987], which only has sufficient space to hold the result of one TCP option negotiation (the MSS), and then only a subset of the possible values (see the discussion under Security Considerations Section 7).
This section is informative, not normative.
In traditional [RFC0793] TCP, the space for options is limited to 40B by the maximum possible Data Offset. Before a TCP sender places options beyond that, it has to be sure that the receiver will understand the upgraded protocol, otherwise it will confuse and potentially crash the application by passing it TCP options as if they were payload data.
The Dual Handshake (Section 2.1.1) ensures that a Legacy TCP Server will never pass on TCP options as if they were user-data. If a SYN carries TCP Data, a TCP server typically holds it back from the application until the 3-way handshake completes. This gives the client the opportunity to abort the Upgraded Connection if the response from the server shows it does not recognise an Upgraded SYN.
The strategy of sending two SYNs in parallel is not essential to the Alternative SYN approach. It is merely an initial strategy that minimises latency when the client does not know whether the server has been upgraded. Evolution to a single SYN with greater option space could proceed as follows:
There is concern that, although dual handshake approaches might well eventually migrate to a single handshake, they do not scale when there are numerous choices to be made simultaneously. For instance:
Nonetheless, it is not necessary to try every possible combination of N choices, which would otherwise require 2^N handshakes (assuming each choice is between two options). Instead, a selection of the choices could be attempted together. At the extreme, two handshakes could be attempted, one with all the new features, and one without all the new features.
It has been proposed [Briscoe14] that extension of a header (as opposed to options) at layer X ought not to be located within the header at layer X, but instead within the layer encapsulated by that header (layer X+1), for a selection of principled and pragmatic reasons:
This section justifies the magic number approach by contrasting it with a more 'conventional' approach. A conventional approach would use a regular (Outer) TCP option to point to the dividing line within the TCP Data between the extra Inner Options and the TCP Payload.
This 'conventional' approach cannot provide extra option space over a path on which a middlebox strips TCP options that it does not recognise. [Honda11] quantifies the prevalence of such paths. It reports on experiments conducted in 2010-2011 that found unknown options were stripped from the SYN-SYN/ACK exchange on 14% of paths to port 80 (HTTP), 6% of paths to port 443 (HTTPS) and 4% of paths to port 34343 (unassigned). Further analysis found that the option-stripping middleboxes fell into two main categories:
The magic number approach ensures that all the TCP Headers and options up to the Data Offset are completely indistinguishable from an Ordinary Segment. Therefore, it will be highly likely (but not certain; see Appendix C.1.4) that the extra Inner Options will always be forwarded, while the conventional approach would fall far short of this ideal.
The magic number approach also ensures that the Inner Options and the option that points to them are both tucked away beyond the Data Offset (see Section 2.2.1). This makes it highly likely that the two will share the same fate—it would be extremely unusual for a middlebox to treat different parts of the TCP Data selectively.
Typically, if a TCP option were stripped, the concern would only be lack of function, not safety. But with option space extension, the concern is serious application corruption. If control options are placed beyond the Data Offset, and the option that says they are there gets stripped, it risks control options being passed to the application as (corrupt) data. Although option stripping can be detected during the handshake, this consumes round trips and it does not guarantee that option stripping will not start part-way through a connection (e.g. due to a path change). In contrast the magic number approach is inherently safe.
The downside of the magic number approach is that it is slightly non-deterministic, quantified as follows:
The above probability is based on the assumptions that:
Therefore even though 2^{-66} is a vanishingly small probability, the actual probability of a collision will be much lower.
If a perfect collision does occur, it will result in TCP removing a number of 32-bit words of data from the start of a byte-stream before passing it to the application.
The purpose of locating control options within the TCP Data is not to evade security. Security middleboxes can be expected to evolve to examine control options in the new inner location. Instead, the purpose is to traverse middleboxes that block new TCP options unintentionally—as a side effect of their main purpose—merely because their designers were too careless to consider that TCP might evolve. This category of middleboxes tends to forward the TCP Payload unaltered.
By sitting within the TCP Data, the Inner Space protocol should traverse enough existing middleboxes to reach critical mass and prove itself useful. In turn, this will open an opportunity to introduce integrity protection for the TCP Data (which includes Inner Options). Whereas today, no operating system would introduce integrity protection of Outer TCP options, because in too many cases it would fail and abort the connection.
Once the integrity of Inner Options is protected, it will raise the stakes. Any attempt to meddle with control options within the TCP Data will not just close off the theoretical potential benefit of a protocol advance that no-one knows they want yet; it will fail integrity checks and therefore completely break any communication. It is unlikely that a network operator will buy a middlebox that does that.
Then middlebox designers will be on the back foot. To completely block communications they will need a sound justification. If they block an attack, that will be fine. But if they want to block everything abnormal, they will have to block the whole communication, or nothing. So the operator will want to choose middlebox vendors who take much more care to ensure their policies track the latest protocol advances—to avoid costly support calls.
Some middleboxes discard a segment sent to a well-known port (particularly port 80) if the TCP Data does not conform to the expected app-layer protocol (particularly HTTP). Often such middleboxes only parse the start of the app-layer header (e.g. Web filters only continue until they find the URL being accessed, or DPI boxes only continue until they have identified the application-layer protocol).
The segment structure defined in Section 2.2.1 would not traverse such middleboxes. An alternative segment structure that avoids the start of the first two segments in each direction is defined in Appendix C.3. It is not mandatory to implement in the present specification. However, it is hoped that it will be included in some experimental implementations so that it can be decided whether it is worth making mandatory.
A middlebox that splits a TCP connection can coalesce and/or divide the original segments. Segmentation offload hardware is another common cause of resegmentation. Inclusion of the marker in the InSpace Option allows the receiver to reconstruct the original segment boundaries. The ZOMBI encoding (Appendix A) removes any occurrences of the marker other than those at the start of each segment.
Superficially, the receiver does not need the sent data size (SDS) field to find the end of each sent segment; it could scan for the marker at the start of the next segment instead. However, in the common case when a stream has not been resegmented, the receiver will find the marker at the start of the segment, but the next marker will not have been received yet. The SDS field allows the receiver to know immediately whether a whole segment has been received as sent. For the same reason, Minion [I-D.iyengar-minion-protocol] uses a (different) marker to tag the end of each message. In contrast, the Inner Space approach uses 2B to declare the original segment size, which saves having to scan the stream for an end marker.
Equally, one could argue that markers are unnecessary, because the sequence of sent data size fields from the start of the stream seems sufficient to find all the segment boundaries. However, using markers ensures that the receiver can pick out segment boundaries immediately on arrival, which is important for deadlock avoidance (see Section 3.2.6).
The Sent Data Size is not strictly necessary on a SYN (SYN=1, ACK=0) because a SYN is never resegmented. However, for simplicity, the layout for a SYN is made the same as for a SYN/ACK. This future-proofs the protocol against the possibility that SYNs might be resegmented in future. And it makes it easy to introduce the alternative segment structure of Appendix C.3 if it is needed.
Section 2.3 introduced the two types of objects that Inner Space places within the TCP Data:
The following two sections address each in turn: i) explaining why it is useful to introduce in-order flow-controlled TCP options and ii) explaining why it is feasible to encapsulate fire-and-forget options within the TCP datastream, despite its reliable ordered semantics.
Including Suffix Options within TCP's sequence space gives the sender a simple way to ensure that control options will be delivered reliably and in order to the remote TCP, even if the control options are on segments without user-data. By using TCP's existing stream delivery mechanisms, it adds no extra protocol processing, no extra packets and no extra bits.
The sender can even choose to place control options on a segment without user-data, e.g. to reliably re-key TCP-level encryption on a connection currently sending no data in one direction. The sender can even add an InSpace Option without further Inner Options except a no-op Suffix option. Then it can ensure that the segment will automatically be delivered reliably and in order to the remote TCP, even though it carries no user-data or other TCP control options, e.g. for a test probe, a tail-loss probe or a keep-alive.
Figure 4a) illustrates control options arriving reliably and in order at the receiving TCP stack in comparison with the traditional approach shown in Figure 4b), in which control options are outside the sequence space. In the traditional approach, during a period when the remote TCP is sending no user-data, the local TCP may receive control options E, B and D without ever knowing that they are out of order, and without ever knowing that C is missing.
[Figure 4 (ASCII diagram, not reproduced): a) control options A to E carried in sequence with the data stream; b) control options A, B, D and E carried alongside the data stream, with the segment carrying C dropped.]
Figure 4: Control options a) inside vs. b) outside TCP sequence space
By including Inner Options within the sequence space, each control option is automatically bound to the start of a particular byte in the data stream, which makes it easy to switch behaviour at a specific point mid-stream (e.g. re-keying or switching to a different control mode). With traditional TCP options, a bespoke reliable and ordered binding to the data stream would have to be developed for each TCP option that needs this capability (e.g. co-ordinating use of new keys in TCP-AO [RFC5925] or tcpcrypt [I-D.bittau-tcpinc-tcpcrypt]).
Including Inner Options in sequence also allows the receiver to tell the sender the exact point at which it encountered an unrecognised TCP option using only TCP's pre-existing byte-granularity acknowledgement scheme.
Middleboxes exist that rewrite TCP sequence and acknowledgement numbers, and they also rewrite options that refer to sequence numbers (at least those known when the middlebox was produced, such as SACK, but not any introduced afterwards). If Inner Options were not included in sequence, the number of bytes beyond the TCP Data Offset in each segment would not match the sequence number increment between segments. Then, such middleboxes could unintentionally corrupt the user-data and options by 'normalising' sequence or acknowledgement numbering. Fortunately, including Inner Options in sequence improves robustness against such middleboxes.
The Inner Space protocol allows Fire-and-Forget Options to be tunnelled within the TCP Data so that they can traverse middleboxes that would otherwise strip them or somehow normalise their contents. Two questions then arise: i) should Fire-and-Forget Objects (padding, the InSpace Option and Prefix Options) consume sequence space and ii) should they be covered by flow control? The answers to these questions will also be re-usable to multiplex streams within one TCP connection:
The rule above concerning sequence space is a compromise needed to traverse middleboxes. So, perhaps predictably, this begets further compromises. The rule concerning flow-control is principled. So perhaps predictably, it has to be compromised to traverse certain middleboxes. The rationale for these compromises is explained below, referring to the normative rules in the protocol specification where appropriate:
Inner Space uses TCP as a substrate protocol, i.e. on the wire, the headers look like an RFC793-compliant TCP, and there is only a difference if one looks inside the TCP Data. Other transport extensibility approaches have used UDP as a substrate protocol, for instance, to carry SCTP through middleboxes.
In design and implementation terms, it is much easier to turn UDP into a reliable protocol, than it is to selectively turn TCP into an unreliable protocol. However, UDP is already blocked on about 15% of Internet paths {ToDo: ref}, whereas vanilla TCP is still universally permitted. Therefore, because the goal is middlebox traversal, not just ease of implementation, Inner Space uses TCP as a substrate.
It may well turn out that Inner Space cannot reach some places that UDP can. It is expected that applications (or even the TCP stack) might sometimes have to resort to trying UDP as a substrate in such cases.
At an earlier stage in the specification of the Inner Space protocol [I-D.briscoe-tcpm-inner-space] before unordered delivery of Inner Options was introduced, Inner Options could all be processed in either user-space or kernel-space. The only exception was the interactions controlling the handshake on the first segment in each direction. However, with the addition of unordered delivery of Prefix Options, the protocol has to be implemented in the kernel, because the protocol modifies the behaviour of TCP, not just its payload.
The format of the InSpace Option (Figure 3) does not necessarily have to comply with the RFC 793 format for TCP options, because it is not intended to ever appear in a sequence of TCP options. In particular, it does not need an Option Kind, because the option is always in a known location. In effect the magic number serves as a multi-octet Option Kind for the first InSpace Option, and the location of each subsequent option is always known by the marker in the InSpace option as well as by the offset from the previous one, using the Sent Data Size field.
Other aspects of the layout are justified as follows:
When SYN=1 the layout of the InSpace Option includes:
When SYN=0, the following further considerations determined the layout of the InSpace Option:
The overhead of the Inner Space protocol is quantified as follows:
For example, if P=80% and D=10%, the connection rate will inflate by 8%. P is difficult to predict. D is likely to be small, and in the longer term it should reduce to the proportion of connections to remaining legacy servers, which are likely to be the less frequently accessed ones. In the worst case if both P & D are 100%, the maximum that the connection rate can inflate by is 100% (i.e. to twice present levels).
This is because a server or middlebox only holds dual connection state for one round trip, until the RST on one of the two connections. For example, keeping P & D as they were in the above example, if R = 3 round trips {ToDo: TBA}, connection state would inflate by 2.7%. In the longer term, any extra connection state would be focused on legacy servers, with none on upgraded servers. Therefore, if memory for dual handshake flow state was a problem, upgrading the server to support the Inner Space protocol would solve the problem.
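The two worked figures above are consistent with simple products of the stated parameters. The sketch below reproduces them in C. The formulas P*D and P*D/R are inferred from the example numbers rather than quoted from the specification, R is assumed to be the connection duration in round trips, and the function names are hypothetical.

```c
/* Fractional increase in connection rate caused by dual handshakes,
 * assuming (inferred from the worked example) it is the product P * D. */
double conn_rate_inflation(double p, double d)
{
    return p * d;
}

/* Fractional increase in connection state, assuming (inferred) that dual
 * state is held for one round trip of a connection lasting r round trips. */
double conn_state_inflation(double p, double d, double r)
{
    return p * d / r;
}
```

With P = 0.8, D = 0.1 and R = 3, conn_rate_inflation() gives 0.08 (8%) and conn_state_inflation() gives about 0.027 (2.7%), matching the examples in the text.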
For example, keeping P & D as they were in the above example, if J = 50KiB for IPv4 and K = 70 packets {ToDo: TBA}, traffic overhead would be 0.03% counting in bytes or 0.2% counting in packets.
This assumes an InSpace option adds 8B per segment (i.e. both Prefix and Suffix Options together on every segment will be rare). For example, keeping P as it was in the above example and taking Q=10% and F=750B, the traffic overhead is 0.09%. It is as difficult to predict Q as it is to predict P.
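The per-segment figure above can be checked with a one-line calculation. The formula P*Q*8/F is inferred from the example numbers (it is not quoted from the specification), and the names below are hypothetical.

```c
/* Octets added per segment by an InSpace Option, as assumed in the text. */
#define INSPACE_OPTION_OCTETS 8.0

/* Fractional traffic overhead, assuming (inferred from the worked
 * example) it is P * Q * 8 / F, with F in octets. */
double inspace_segment_overhead(double p, double q, double f)
{
    return p * q * INSPACE_OPTION_OCTETS / f;
}
```

With P = 0.8, Q = 0.1 and F = 750, this gives about 0.00085, which rounds to the 0.09% quoted above.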
It is believed that all TCP options that were designed as Outer Options can be relocated without alteration as Prefix Options, because the unreliable unordered semantics are the same as TCP Outer Options. However, some yet-to-be-defined TCP options might be better suited to the reliable ordered semantics of Suffix Options. Specifically, existing or proposed TCP options fall into the following categories:
{ToDo: The above list is not authoritative. Some TCP options include suboptions, some of which are discussed below, but others remain to be fully assessed.}
The specification of any future TCP option MUST state whether it is designed as a Suffix Option (reliable ordered) or as a Prefix / Outer Option (unreliable unordered) or "Don't Care". A TCP option MUST by default only be used as an Outer or Prefix Option, unless it is explicitly specified that it can (or must) be used as a Suffix Option.
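As a minimal sketch of how an implementation might encode this rule, the C fragment below maps the declared class of an option to the placements it may use. All names are hypothetical. Two interpretations are assumptions made here, not statements of the specification: "Don't Care" is taken to permit either placement, and an option designed as a Suffix Option is taken to be sent only as a Suffix Option.

```c
#include <stdbool.h>

/* Delivery class declared by a TCP option's specification (hypothetical
 * names).  OPT_CLASS_UNSPECIFIED models an option whose specification
 * says nothing about placement. */
enum opt_class {
    OPT_CLASS_UNSPECIFIED,
    OPT_CLASS_PREFIX_OR_OUTER,
    OPT_CLASS_SUFFIX,
    OPT_CLASS_DONT_CARE
};

/* Suffix placement needs an explicit statement in the option's
 * specification; the default is Outer or Prefix only. */
bool may_send_as_suffix(enum opt_class c)
{
    return c == OPT_CLASS_SUFFIX || c == OPT_CLASS_DONT_CARE;
}

/* Assumption: an option designed as a Suffix Option is not also sent
 * as an Outer or Prefix Option. */
bool may_send_as_prefix_or_outer(enum opt_class c)
{
    return c != OPT_CLASS_SUFFIX;
}
```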
The Inner Space protocol supports TCP Fast Open, by constraining the client to obey the rules in Section 2.3.1.1.
All the sub-types of the MPTCP option [RFC6824] except one could be located as Suffix or Prefix Options. That is, MP_CAPABLE, MP_JOIN, ADD_ADDR(2), REMOVE_ADDR, MP_PRIO, MP_FAIL, MP_FASTCLOSE. The Data Sequence Signal (DSS) of MPTCP consists of four separable parts: i) the Data ACK; ii) the mapping between the Data Sequence Number and the Subflow Sequence Number over a Data-Level Length; iii) the Checksum; and iv) the DATA_FIN flag. If MPTCP were re-factored to take advantage of the Inner Space protocol, all these parts except the Data ACK could be located as Suffix Options (the Checksum would not be necessary).
The MPTCP Data ACK has to remain as a Prefix or Outer Option otherwise there would be a risk of flow control deadlock, as pointed out in [Raiciu12]. For instance, a Web client might pipeline multiple requests that fill a Web server's receive buffer, while the Web server might be busy sending a large response to the first request before it reads the second request. If the Data ACK were a Suffix Option, the Web client would have to stop acknowledging the first response from the server (due to lack of receive window). Then the server would not be able to move on to the next request—a classic deadlock.
The TCP authentication option can be configured either to cover TCP Options or not (when it was defined only Outer Options existed). If it covers any TCP Options it has to be located as an Outer or Prefix Option to prevent the possibility of flow-control deadlock (because it would consume receive window on pure ACKs if it were located as a Suffix Option).
All sub-options of the tcpcrypt CRYPT option could be located as Suffix Options. However, as long as the tcpcrypt MAC option covers the TCP header and Outer Options, it has to be located as an Outer Option for the same deadlock reason as TCP-AO.
An Upgraded Server can support SYN Cookies [RFC4987] for Ordinary Connections. For Upgraded Connections Section 2.5 defines a new EchoCookie TCP option that is a prerequisite for InSpace implementations, and provides sufficient space for the more extensive connection state requirements of an InSpace server.
{ToDo: TCP States and Transitions, Connectionless Resets, ICMP Handling, Forward-Compatibility.}
The interaction with the assumptions about TCP made by middleboxes is covered extensively elsewhere:
An aim of the Inner Space protocol is for legacy applications to continue to just work without modification. Therefore it is expected that the dual handshaking logic and placement of options within the TCP Data will be implemented beneath the well-known socket interface.
Inner Space implementations will need to comply with the following behaviours to ensure that legacy applications continue to receive predictable behaviour from the socket interface:
Note that Inner Space has no impact on queries for the remote port from a TCP server. If an application calls getpeername() while the TCP server behind the socket is (unwittingly) engaged in a dual handshake, it will return the port of the remote client, even though this connection might subsequently be aborted. This is because a TCP server is not aware of whether it is part of a dual handshake.
Some applications interrogate the TCP stack to determine the path max transmission unit (PMTU), e.g. in order to optimize application message boundaries within the datastream. From the viewpoint of such applications, TCP options subtract the same amount from the PMTU whether they are Outer or Inner Options. However, the 8 (or 12) octet InSpace header and the alignment padding represent extra overhead. Therefore, for such applications, the TCP stack as seen through the socket API will seem similar to a tunnel that reduces the useful size of the PMTU. This could lead to fragmentation until such applications are updated. Nonetheless, most such applications already include code to adapt to PMTU reduction by tunnels.
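The extra overhead described above can be sketched as a small calculation in C. The function names are hypothetical. The sketch assumes that the alignment padding precedes the InSpace header and brings it to a 4-octet boundary measured from the start of the datastream, and it takes the header length (8 or 12 octets) as a parameter rather than deciding which applies.

```c
#include <stddef.h>

/* Padding octets needed so that the InSpace header starts on a 4-octet
 * boundary of the datastream.  stream_offset is the offset of the
 * segment's first octet within the stream (an assumption of this sketch). */
size_t inspace_alignment_pad(size_t stream_offset)
{
    return (4 - (stream_offset % 4)) % 4;
}

/* Octets of a segment's TCP Data left for Inner Options and payload
 * after the alignment padding and the InSpace header are subtracted. */
size_t inspace_useful_size(size_t tcp_data_size, size_t stream_offset,
                           size_t header_len)
{
    size_t overhead = inspace_alignment_pad(stream_offset) + header_len;
    return tcp_data_size > overhead ? tcp_data_size - overhead : 0;
}
```

For instance, under these assumptions a segment with 1448 octets of TCP Data that starts on a 4-octet boundary and carries an 8-octet InSpace header leaves 1440 octets for Inner Options and payload.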
It would be appropriate to enable the Inner Space protocol on a per-host or per-user basis. The necessary configuration switch does not need to be standardised, but it might allow the following three states:
The socket API might also need to be extended for future applications that want to control the Inner Space protocol explicitly. Experience will determine the best API, so these ideas are merely informational suggestions at this stage:
This specification requires IANA to allocate values from the TCP Option Kind name-space against the following names:
Early implementation before the IANA allocation MUST follow [RFC6994] and use experimental option 254 and respective Experiment IDs:
{ToDo: Values TBA and register them with IANA} then migrate to the assigned option after allocation.
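For early implementations, the shared experimental option layout of [RFC6994] is Kind, Length, a 16-bit Experiment ID (ExID) in network byte order, then the option data. The C sketch below builds such an option with Kind 254. The function name is hypothetical, and the ExID is a parameter because the values are still to be assigned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_EXPERIMENT_254 254u

/* Writes an RFC 6994 experimental option into out:
 *   Kind (254) | Length | ExID (16 bits, network order) | data
 * Returns the total option length in octets, or 0 if the option would
 * not fit in out or in the 1-octet Length field. */
size_t build_exp_option(uint8_t *out, size_t out_len, uint16_t exid,
                        const uint8_t *data, size_t data_len)
{
    size_t total = 4 + data_len;

    if (total > 255 || total > out_len)
        return 0;
    out[0] = TCPOPT_EXPERIMENT_254;
    out[1] = (uint8_t)total;           /* Length covers the whole option */
    out[2] = (uint8_t)(exid >> 8);     /* ExID, most significant octet first */
    out[3] = (uint8_t)(exid & 0xFF);
    if (data_len > 0)
        memcpy(out + 4, data, data_len);
    return total;
}
```

Migrating to the assigned option after allocation would then replace the Kind and drop the ExID octets, leaving the option data unchanged.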
Certain cryptographic functions have different coverage rules for the TCP Header and TCP Payload. Placing some TCP options beyond the Data Offset could mean that they are treated differently from regular TCP options. This is a deliberate feature of the protocol, but application developers will need to be aware that this is the case.
A malicious host can send bogus SYN segments with a spoofed source IP address (a SYN flood attack). The Inner Space protocol does not alter the feasibility of this attack. However, the extra space for TCP options on a SYN allows the attacker to include more TCP options on a SYN than before, so it can make a server do more option processing before replying with a SYN/ACK. To mitigate this problem, a server under stress could deprioritise SYNs with longer option fields to focus its resources on SYNs that require less processing.
Each SYN in a SYN flood attack causes a TCP server to consume memory. The Inner Space protocol allows a potentially large amount of TCP option state to be negotiated during the SYN exchange, which could allow attackers to exhaust the TCP server's memory more easily. The EchoCookie TCP option (see Section 2.5) allows the server to place this state in a cookie and send it on the SYN/ACK to the purported address of the client—rather than hold it in memory. Then, as long as the client returns the cookie on the acknowledgement and the server verifies it, the server can recover its full record of all the TCP options it negotiated and continue the connection without delay. On the other hand, the server's responses to SYNs from spoofed addresses will scatter to those spoofed addresses and the server will not have consumed any memory while waiting in vain for them to reply. See the Security Considerations in [I-D.briscoe-tcpm-echo-cookie] for how the EchoCookie facility protects against reflection and amplification attacks.
Some security devices block data in an initial SYN segment, classifying it as the signature of an attack. Attackers might indeed use data-in-SYN to strengthen the force of a SYN flood attack, but it has also always been valid for clients to use data-in-SYN for low latency service as well (today data-in-SYN is used by TCP Fast Open, but data-in-SYN has been permitted for similar reasons right back to the days of RFC 793). On its own, data-in-SYN MUST NOT be considered a sufficient signature of an attack. It can only be considered an attack signature if seen in combination with other symptoms of a SYN flood attack. The logic that led to data-in-SYN alone being considered an attack was probably well-intentioned, but it actually turns a security device into an attack on innocent low latency services.
The optional extension for DPI traversal specified in Appendix C.3 might create a new attack vector. The attack was originally proposed (by David Mazieres) when an earlier draft required the optional extension to be applied at the start of both half-connections. As long as the DPI traversal extension no longer applies in the server-client direction the attack seems less feasible. Nonetheless, the attack in the server-client direction is described here anyway (in case it prompts someone to think of a similar feasible attack in the client-server direction):
If the DPI traversal solution is to be used, and a feasible attack is developed in the client-server direction, a couple of directions to prevent such an attack could be explored:
The idea of this approach grew out of discussions with Joe Touch while developing draft-touch-tcpm-syn-ext-opt, and with Jana Iyengar and Olivier Bonaventure. Jana Iyengar also suggested the sender-only flow-control offset. The idea that it is architecturally preferable to place a protocol extension within a higher layer, and code its location into upgraded implementations of the lower layer, was originally articulated by Rob Hancock. {ToDo: Ref?} The following people provided useful comments: Joe Touch, Yuchung Cheng, John Leslie, Mirja Kuehlewind, Andrew Yourtchenko, Costin Raiciu, Marcelo Bagnulo Braun, Julian Chesterfield, Jaime Garcia, Ted Hardie, David Mazieres, Tim Shepard and Mark Handley.
Bob Briscoe's contribution is part-funded by the European Community under its Seventh Framework Programme through the Trilogy 2 project (ICT-317756) and the Reducing Internet Transport Latency (RITE) project (ICT-317700). The views expressed here are solely those of the author.
[I-D.ietf-tcpm-fastopen] | Cheng, Y., Chu, J., Radhakrishnan, S. and A. Jain, "TCP Fast Open", Internet-Draft draft-ietf-tcpm-fastopen-10, September 2014. |
[RFC0793] | Postel, J., "Transmission Control Protocol", STD 7, RFC 793, September 1981. |
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. |
[RFC6994] | Touch, J., "Shared Use of Experimental TCP Options", RFC 6994, August 2013. |
[Briscoe14] | Briscoe, B., "Tunnelling through Inner Space", IAB Workshop on Stack Evolution in a Middlebox Internet, January 2015. |
[Cheshire97] | Cheshire, S. and M. Baker, "Consistent Overhead Byte Stuffing", Proc. ACM SIGCOMM'97, Computer Communication Review 27(4):209--220, October 1997. |
[Honda11] | Honda, M., Nishida, Y., Raiciu, C., Greenhalgh, A., Handley, M. and H. Tokuda, "Is it Still Possible to Extend TCP?", Proc. ACM Internet Measurement Conference (IMC'11) 181--192, November 2011. |
[I-D.bittau-tcpinc-tcpcrypt] | Bittau, A., Boneh, D., Hamburg, M., Handley, M., Mazieres, D. and Q. Slack, "Cryptographic protection of TCP Streams (tcpcrypt)", Internet-Draft draft-bittau-tcpinc-tcpcrypt-00, October 2014. |
[I-D.briscoe-tcpm-echo-cookie] | Briscoe, B., "The Echo Cookie TCP Option", Internet-Draft draft-briscoe-tcpm-echo-cookie-00, October 2014. |
[I-D.briscoe-tcpm-inner-space] | Briscoe, B., "Inner Space for TCP Options", Internet-Draft draft-briscoe-tcpm-inner-space-01, October 2014. |
[I-D.ietf-httpbis-http2] | Belshe, M., Peon, R. and M. Thomson, "Hypertext Transfer Protocol version 2", Internet-Draft draft-ietf-httpbis-http2-17, February 2015. |
[I-D.iyengar-minion-protocol] | Iyengar, J., Cheshire, S. and J. Graessley, "Minion - Wire Protocol", Internet-Draft draft-iyengar-minion-protocol-02, October 2013. |
[I-D.touch-tcpm-tcp-syn-ext-opt] | Touch, J. and T. Faber, "TCP SYN Extended Option Space Using an Out-of-Band Segment", Internet-Draft draft-touch-tcpm-tcp-syn-ext-opt-01, September 2014. |
[I-D.wing-tsvwg-happy-eyeballs-sctp] | Wing, D. and P. Natarajan, "Happy Eyeballs: Trending Towards Success with SCTP", Internet-Draft draft-wing-tsvwg-happy-eyeballs-sctp-02, October 2010. |
[RFC2018] | Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP Selective Acknowledgment Options", RFC 2018, October 1996. |
[RFC2675] | Borman, D., Deering, S. and R. Hinden, "IPv6 Jumbograms", RFC 2675, August 1999. |
[RFC4987] | Eddy, W., "TCP SYN Flooding Attacks and Common Mitigations", RFC 4987, August 2007. |
[RFC5925] | Touch, J., Mankin, A. and R. Bonica, "The TCP Authentication Option", RFC 5925, June 2010. |
[RFC6555] | Wing, D. and A. Yourtchenko, "Happy Eyeballs: Success with Dual-Stack Hosts", RFC 6555, April 2012. |
[RFC6824] | Ford, A., Raiciu, C., Handley, M. and O. Bonaventure, "TCP Extensions for Multipath Operation with Multiple Addresses", RFC 6824, January 2013. |
[RFC7323] | Borman, D., Braden, B., Jacobson, V. and R. Scheffenegger, "TCP Extensions for High Performance", RFC 7323, September 2014. |
[Raiciu12] | Raiciu, C., Paasch, C., Barre, S., Ford, A., Honda, M., Duchene, F., Bonaventure, O. and M. Handley, "How Hard Can It Be? Designing and Implementing a Deployable Multipath TCP", Proc. USENIX Symposium on Networked Systems Design and Implementation, April 2012. |
This appendix is normative and mandatory to implement for the Inner Space protocol. This encoding is relegated to an appendix merely because it is applicable more generally than for just Inner Space. Therefore, in a future revision, this appendix might be removed and replaced by a reference to a stand-alone document.
The Inner Space protocol requires the sender to add a marker in every segment at the first 4-octet aligned word from the start of the datastream. Then, even if the stream is subsequently resegmented, the receiver can recover segments and their associated TCP options as they were sent. The sender uses the value 0x0000 as the 2-octet marker at the start of the InSpace option header. It uses the ZOMBI encoding to remove all other occurrences of 0x0000, treating the segment as a sequence of 2-octet shorts. Then, a marker will unambiguously locate the InSpace option at the start of each segment. From this InSpace option, the receiver can find the length of the segment. Then it can decode the ZOMBI encoding to return the segment to its original form.
The sender applies the ZOMBI encoding as follows:
Because an offset can never be zero, this process naturally removes all occurrences of 0x0000 from the segment.
The receiver reverses the above encoding, assuming the worst case of a resegmented stream unless it finds otherwise:
Below an implementation of the ZOMBI encode and decode algorithms is given in C. The decode algorithm would be preceded by marker-scanning code to find the location of the ZOMBI and SDS fields within the InSpace option. The SDS field will always be non-zero, therefore it will never be changed by the encoding, so the receiver can read it before starting to decode. In case length is odd, a non-zero pseudo-padding octet is considered to be appended to the segment while encoding or decoding (but it is not actually transmitted).
/* {ToDo: Test}
 * ZombiEncode encodes "length" bytes of data
 * starting directly after the marker pointed to by "ptr", where:
 *   length = sds - pad.
 */
void ZombiEncode(unsigned short *ptr, unsigned short length)
{
    /* length / 2, rounded up */
    const unsigned short *end = ptr + ((length + 1) >> 1);
    unsigned short *code_ptr = ++ptr;       /* point to ZOMBI */
    unsigned short code = 0x0001;

    while (++ptr < end) {                   /* start after ZOMBI */
        if (*ptr == 0) {
            *code_ptr = code;               /* offset to this zero */
            code_ptr = ptr;
            code = 0x0001;
        } else {
            code++;
        }
    }
    *code_ptr = code;   /* last offset runs to the end of the data */
}

/* {ToDo: Test}
 * ZombiDecode decodes "length" bytes of data
 * starting after the marker pointed to by "ptr", where
 *   length = sds - pad.
 * Returns number of shorts still to decode.
 */
short ZombiDecode(unsigned short *ptr, unsigned short length)
{
    /* length / 2, rounded up */
    const unsigned short *end = ptr + ((length + 1) >> 1);
    unsigned short code;

    ptr++;                                  /* start at ZOMBI */
    while (ptr < end) {
        code = *ptr;                        /* offset to next zero */
        *ptr = 0;
        ptr += code;
    }
    return (short)(ptr - end);
}
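The marker-scanning step that precedes decoding is not given in this appendix. A minimal sketch of it follows, assuming the receive buffer starts on a 4-octet boundary of the datastream and that a marker can only appear at such a boundary. The function name FindMarker and its byte-buffer interface are hypothetical, chosen for illustration only.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the specification).
 * Scans "len" octets of "buf" for the 2-octet marker 0x0000, looking
 * only at 4-octet aligned offsets. Assumes buf[0] lies on a 4-octet
 * boundary of the datastream.
 * Returns the octet offset of the marker, or -1 if none is found. */
long FindMarker(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 4) {
        if (buf[i] == 0 && buf[i + 1] == 0)
            return (long)i;
    }
    return -1;
}
```

Comparing octets rather than casting to a short keeps the sketch independent of host byte order and alignment. Two zero octets that straddle an alignment boundary are deliberately not treated as a marker.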
The ZOMBI encoding always uses a marker that is larger than the maximum possible segment size. Therefore, for a jumbo segment Appendix C.2, the sender uses 0x00000000 (4 octets of zeros) as the marker; it pads the segment to a multiple of 4 octets; and it scans the stream in 4-octet words, replacing any occurrences of the marker with the offset in 4-octet words to the next marker.
The ZOMBI encoding is similar to consistent overhead byte stuffing (COBS [Cheshire97]). The main difference is that COBS markers are only one octet. Therefore, in COBS, whenever the distance between zero-bytes is greater than 0xFE, it has to insert an extra byte into the stream with the special value of 0xFF. When decoding, 0xFF is removed rather than replaced by 0x00. Therefore, as well as 2 extra delimiting octets, COBS introduces a variable number of extra octets, but no more than 1 in 254 (a more accurate name would have been capped overhead byte stuffing, because the overhead is variable, not consistent).
In contrast, ZOMBI introduces a predictable overhead of 4 delimiting octets per segment (or 5 for odd length segments), with no unpredictable variation. Therefore, space for the known overhead can be set aside in the InSpace option, and the ZOMBI encode and decode operation can be zero-copy, which is not possible with COBS. A more accurate name for ZOMBI would have been constant overhead message boundary insertion. Nonetheless, the encoding to replace markers once the message boundaries have been inserted actually is zero overhead, so the cool acronym is not totally contrived.
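The overhead comparison above can be expressed as simple arithmetic. The sketch below uses only the figures quoted in this appendix (4 or 5 octets for ZOMBI; 2 delimiting octets plus at most 1 in 254 for COBS). The function names are hypothetical, and the COBS figure is the upper bound as stated here, not an exact count for any particular payload.

```c
/* Hypothetical helpers illustrating the overhead figures quoted above. */

/* ZOMBI: 4 delimiting octets, or 5 if the segment length is odd. */
unsigned long ZombiOverhead(unsigned long len)
{
    return 4 + (len & 1);
}

/* COBS upper bound as stated above: 2 delimiting octets plus
 * no more than 1 extra octet per 254 octets of data. */
unsigned long CobsMaxOverhead(unsigned long len)
{
    return 2 + len / 254;
}
```

For a 1460-octet segment this gives a fixed 4 octets for ZOMBI, against anywhere between 2 and 7 octets for COBS depending on the data.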
This appendix is normative and mandatory to implement for the Inner Space protocol. It defines the new ModeSwitch TCP option illustrated in Figure 5. The definition is relegated to an appendix merely because, in a future revision, it might be removed and replaced by a reference to a stand-alone document. The option provides a facility to disable the Inner Space protocol for the remainder of a connection. It also provides a general-purpose facility for a TCP connection to co-ordinate between the endpoints before switching into a yet-to-be-defined mode.
 0                   1                   2
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+---------------+---------------+-----------+-+-+
|  ModeSwitch   |   Length=3    |Flags (CU) |I|R|
+---------------+---------------+-----------+-+-+
Figure 5: The ModeSwitch TCP Option
The Option Kind is ModeSwitch, the value of which is to be allocated by IANA {ToDo: Value TBA}. ModeSwitch MUST be used only as an Inner Option, because it uses the reliable ordered delivery property of Inner Options. Therefore implementation of the Inner Space protocol is REQUIRED for an implementation of ModeSwitch. Nonetheless, ModeSwitch is a generic facility for switching a connection between yet-to-be-defined modes that do not have to relate to extra option space.
The sender MUST set the option Length to 3 (octets). The Length field MUST be forwarded unchanged by other nodes, even if its value is different.
The Flags field is available for defining modes of the connection. Only two connection modes are currently defined. The first 6 bits of the Flags field are Currently Unused (CU) and the sender MUST set them to zero. The CU flags MUST be ignored and forwarded unchanged by other nodes, even if their value is non-zero.
The two 1-bit connection mode flags that are currently defined have the following meanings:
The default Inner Space mode at the start of a connection is I=1, meaning Inner Space is in Enabled mode.
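The layout in Figure 5 could be serialised roughly as follows. This is a sketch only: the Option Kind has not been allocated, so the value 253 is a hypothetical stand-in, and the function and macro names are invented for illustration. The bit masks follow from Figure 5, where I and R are the last two bits of the third octet. Rejecting a Length below 3 on receipt is an assumption of this sketch, not a stated requirement.

```c
#include <stddef.h>
#include <stdint.h>

#define MODESWITCH_KIND   253   /* hypothetical placeholder; real value TBA */
#define MODESWITCH_LEN    3
#define MODESWITCH_FLAG_I 0x02  /* second-to-last bit of the Flags octet */
#define MODESWITCH_FLAG_R 0x01  /* last bit of the Flags octet */

/* Writes a ModeSwitch option into buf. The six CU bits are set to zero.
 * Returns the number of octets written, or 0 if buf is too small. */
size_t ModeSwitchWrite(uint8_t *buf, size_t buflen, int i_flag, int r_flag)
{
    if (buf == NULL || buflen < MODESWITCH_LEN)
        return 0;
    buf[0] = MODESWITCH_KIND;
    buf[1] = MODESWITCH_LEN;
    buf[2] = (uint8_t)((i_flag ? MODESWITCH_FLAG_I : 0) |
                       (r_flag ? MODESWITCH_FLAG_R : 0));
    return MODESWITCH_LEN;
}

/* Reads the I and R flags from a ModeSwitch option, ignoring the CU bits.
 * Returns 0 on success, -1 if buf does not hold a ModeSwitch option. */
int ModeSwitchRead(const uint8_t *buf, size_t buflen, int *i_flag, int *r_flag)
{
    if (buf == NULL || buflen < MODESWITCH_LEN)
        return -1;
    if (buf[0] != MODESWITCH_KIND || buf[1] < MODESWITCH_LEN)
        return -1;
    *i_flag = (buf[2] & MODESWITCH_FLAG_I) != 0;
    *r_flag = (buf[2] & MODESWITCH_FLAG_R) != 0;
    return 0;
}
```

The reader masks out the CU bits rather than checking that they are zero, in line with the requirement that non-zero CU flags be ignored.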
The procedure for changing a mode or modes is as follows:
The regular TCP sequence numbers and acknowledgement numbers of requests or confirmations can be used to disambiguate overlapping requests or responses.
Once a host switches to Disabled mode, it MUST NOT send any further InSpace Options. Therefore it can send no further Inner Options and it cannot switch back to Enabled mode for the rest of the connection.
To temporarily reduce InSpace overhead without permanently disabling the protocol, the sender can use a value of 0xFFFF in the Sent Data Size (see Section 2.4).
This appendix specifies protocol extensions that are OPTIONAL while the specification is experimental. If an implementation includes an extension, this section gives normative specification requirements. However, if the extension is not implemented, the normative requirements can be ignored.
{Temporary note: The IETF may wish to consider making some of these extensions mandatory to implement if early testing shows they are useful or even necessary. Or it may wish to make at least the receiving side mandatory to implement to ensure that two-ended experiments are more feasible.}
This appendix is normative. It is separated from the body of the specification because it is OPTIONAL to implement while the Inner Space protocol is experimental. It is not mandatory to implement because it will be more useful once the Inner Space protocol has become accepted widely enough that fewer middleboxes will discard SYN segments carrying this option (see Appendix D for when best to deploy it). It only works if both ends support it, but it can be deployed one end at a time, so there is no need for support in early experimental implementations.
{Temporary note: The choice between the explicit handshake in the present section or the handshake in Section 2.1.1 is a tradeoff between robustness against middlebox interference and minimal server state. During the IETF review process, one might be chosen as the only variant to go forward, at which point the other will be deleted. Alternatively, the IETF could require a server to understand both variants and a client could be implemented with either, or both. If both, the application could choose which to use at run-time. Then we will need a section describing the necessary API.}
This explicit dual handshake is similar to that in Section 2.1.1, except the SYN that the Upgraded Client sends on the Ordinary Connection is explicitly distinguishable from the SYN that would be sent by a Legacy Client. Then, if the server actually is an Upgraded Server, it can reset the Ordinary Connection itself, rather than creating connection state for at least a round trip until the client resets the connection.
For an explicit dual handshake, the TCP client still sends two alternative SYNs: a SYN-O intended for Legacy Servers and a SYN-U intended for Upgraded Servers. The two SYNs MUST have the same network addresses and the same destination port, but different source ports. Once the client establishes which type of server has responded, it continues the connection appropriate to that server type and aborts the other. The SYN intended for Upgraded Servers includes additional options within the TCP Data (the SYN-U defined as before in Section 2.2.1).
Table 2 summarises the TCP 3-way handshake exchange for each of the two SYNs in the two right-hand columns, between an Upgraded TCP Client (the active opener) and either a Legacy Server or an Upgraded Server. The table uses the same layout and symbols as Table 1, which has already been explained in Section 2.1.1.
 | | Ordinary Connection | Upgraded Connection |
---|---|---|---|
1 | Upgraded Client | >SYN-O | >SYN-U |
/\/\ | /\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ |
2 | Legacy Server | <SYN/ACK | <SYN/ACK |
3a | Upgraded Client | Waits for response to both SYNs | |
3b | " | >ACK | >RST |
4 | | Cont... | |
/\/\ | /\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ | /\/\/\/\/\/\/\/\/\ |
2 | Upgraded Server | <RST | <SYN/ACK-U |
3 | Upgraded Client | | >ACK |
4 | | | Cont... |
As before, an Upgraded Server MUST respond to a SYN-U with a SYN/ACK-U. Then, the client recognises that it is talking to an Upgraded Server.
Unlike before, an Upgraded Server MUST respond to a SYN-O with a RST. However, the client cannot rely on this behaviour, because a middlebox might be stripping Outer TCP Options, which would turn the SYN-O into a regular SYN before it reached the server. Then the handshake would effectively revert to the implicit variant. Therefore the client's behaviour still depends on which SYN/ACK arrives first, so its response to SYN/ACKs has to follow the rules specified for the implicit handshake variant in Section 2.1.1.
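The outcomes in Table 2, together with the 'both aborted' case of Appendix C.1.3, can be summarised as a small decision function. The sketch below is illustrative only: the type and function names are hypothetical, it considers only the state once both responses are known, and it does not reproduce the arrival-order rules of Section 2.1.1.

```c
/* Hypothetical types for illustration; not defined by the specification. */
enum SynResponse {
    RESP_SYNACK,     /* ordinary SYN/ACK */
    RESP_SYNACK_U,   /* SYN/ACK-U from an Upgraded Server */
    RESP_RST
};

enum ClientAction {
    ACT_CONTINUE_ORDINARY,  /* ACK the Ordinary Connection, RST the Upgraded one */
    ACT_CONTINUE_UPGRADED,  /* ACK the Upgraded Connection */
    ACT_BOTH_ABORTED,       /* RST the Upgraded Connection as well */
    ACT_UNSPECIFIED         /* combination not covered by this sketch */
};

/* Decides how an Upgraded Client proceeds, given the response to the
 * SYN-O ("ordinary") and the response to the SYN-U ("upgraded"). */
enum ClientAction ExplicitHandshakeDecide(enum SynResponse ordinary,
                                          enum SynResponse upgraded)
{
    if (upgraded == RESP_SYNACK_U)
        return ACT_CONTINUE_UPGRADED;   /* server is upgraded */
    if (upgraded == RESP_SYNACK && ordinary == RESP_SYNACK)
        return ACT_CONTINUE_ORDINARY;   /* Legacy Server rows of Table 2 */
    if (upgraded == RESP_SYNACK && ordinary == RESP_RST)
        return ACT_BOTH_ABORTED;        /* 'both aborted' case */
    return ACT_UNSPECIFIED;
}
```

If the Upgraded Connection succeeds but the Ordinary Connection was also accepted (for instance because a middlebox stripped the InSpaceO option), the sketch still returns ACT_CONTINUE_UPGRADED; the Ordinary Connection would then be dealt with under the implicit rules of Section 2.1.1.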
The rules for processing TCP options are also unchanged from those in Section 2.3.
The SYN-O is merely a SYN with an extra InSpaceO Outer TCP Option, as shown in Figure 6. The option indicates that the SYN is opening an Ordinary Connection, while explicitly signalling that the client supports the Inner Space protocol.
 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+---------------+---------------+
| Kind=InSpaceO |   Length=2    |
+---------------+---------------+
Figure 6: An InSpaceO TCP Option Flag
An InSpaceO TCP Option has Option Kind InSpaceO with value {ToDo: Value TBA} and MUST have Length = 2 octets.
To use this option, the client MUST place it with the Outer TCP Options. A Legacy Server will just ignore this TCP option, which is the normal behaviour for an option that TCP does not recognise [RFC0793].
If the client receives a RST on one connection, but a short while after that {ToDo: duration TBA} the response to the SYN-U has not arrived, it SHOULD retransmit the SYN-U. If latency is more important than the extra TCP option space, in parallel to any retransmission, or instead of any retransmission, the client MAY send a SYN without any InSpace TCP Option, in case this is the cause of the black-hole. However, the presence of the RST implies that the SYN with the InSpaceO TCP Option (the SYN-O) probably reached the server, therefore it is more likely (but not certain) that the lack of response on the other connection is due to transmission loss or congestion loss.
If the client receives no response at all to either the SYN-O or the SYN-U, it SHOULD solely retransmit one or the other, not both. If latency is more important than the extra TCP option space, it SHOULD send a SYN without an InSpaceO TCP Option. Otherwise it SHOULD retransmit the SYN-U. It MUST NOT retransmit both segments, because the lack of response could be due to severe congestion.
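The two retransmission cases above can be sketched as a single policy function. The names and the bit-mask return value are hypothetical. Where the text says the client MAY send a plain SYN "in parallel to, or instead of" a retransmission, this sketch picks the parallel option, which is only one of the permitted choices.

```c
/* Hypothetical flags for illustration; not defined by the specification. */
#define SEND_SYN_U     0x1u   /* retransmit the SYN-U */
#define SEND_PLAIN_SYN 0x2u   /* send a SYN without any InSpace TCP Option */

/* Chooses what to (re)send when the response to the SYN-U is overdue.
 * got_rst_on_syn_o: non-zero if a RST has arrived on the other connection.
 * latency_first:    non-zero if latency matters more than option space. */
unsigned RetransmitChoice(int got_rst_on_syn_o, int latency_first)
{
    if (got_rst_on_syn_o) {
        /* SHOULD retransmit the SYN-U; MAY add a plain SYN in parallel. */
        return latency_first ? (SEND_SYN_U | SEND_PLAIN_SYN) : SEND_SYN_U;
    }
    /* No response to either SYN: exactly one segment, never both. */
    return latency_first ? SEND_PLAIN_SYN : SEND_SYN_U;
}
```

The second branch never sets both flags, reflecting the MUST NOT on retransmitting both segments when the silence might be due to severe congestion.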
There is a small but finite possibility that the Explicit Dual Handshake might encounter the cases below. The Implicit Handshake (Section 2.1.1) is robust to these possibilities, but the Explicit Handshake is not, unless the following additional rules are followed:
If a path either holds back or discards data in a SYN-U, but there is evidence that the server is upgraded from a RST response to the SYN-O, the strategy below might at least allow a connection to use extra option space on all the segments except the SYN.
It is assumed that the symptoms described in the 'both aborted' case (Appendix C.1.3) have occurred, i.e. the server has responded to the SYN-O with a RST, but it has responded to the SYN-U with an Ordinary SYN/ACK not a SYN/ACK-U, so the client has had to RST the Upgraded Connection as well. In this case, the client SHOULD attempt the following (alternatively it MAY give up and fall back to opening an Ordinary TCP connection).
The client sends an 'Alternative SYN-U' by including an InSpaceU Outer TCP Option (Figure 7). This Alternative SYN-U merely flags that the client is attempting to open an Upgraded Connection. The client MUST NOT include any Inner Options or InSpace Option or Magic Number. If the previous aborted SYN/ACK-U acknowledged the data that the client sent within the original SYN-U, the client SHOULD resend the TCP Payload data in the Alternative SYN-U, otherwise it might as well defer it to the first data segment.
 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+---------------+---------------+
| Kind=InSpaceU |   Length=2    |
+---------------+---------------+
Figure 7: An InSpaceU Flag TCP option
An InSpaceU Flag TCP Option has Option Kind InSpaceU with value {ToDo: Value TBA} and MUST have Length = 2 octets.
To use this option, the client MUST place it with the Outer TCP Options. A Legacy Server will just ignore this TCP option, which is the normal behaviour for an option that TCP does not recognise [RFC0793]. Because the client has received a RST from the server in response to the SYN-O it can assume that the server is upgraded. So the client probably only needs to send a single Alternative SYN-U in this repeat attempt. Nonetheless, the RST might have been spurious. Therefore the client MAY also send an Ordinary SYN in parallel, i.e. using the Implicit Dual Handshake (Section 2.1.1).
If an Upgraded Server receives a SYN carrying the InSpaceU option, it MUST continue the rest of the connection as if it had received a full SYN-U (Section 2.2), i.e. by processing any Outer Options in the SYN-U and responding with a SYN/ACK-U.
This appendix is normative. It defines the format of the InSpace Option necessary to support jumbograms. It is separated from the body of the specification because it is OPTIONAL to implement while the Inner Space protocol is experimental. In experimental implementations, it will be sufficient to implement the required behaviour for when the Length of a received InSpace Option is not recognised (Section 2.4).
If the IPv6 Jumbo extension header is used, a sender MUST use the InSpace Option format defined in Figure 8.
All the fields have the same meanings as defined in Section 2.2.2, except that the Sent Data Size (SDS), the Inner Options Offset (InOO) and the Suffix Options Offset (SOO) use more bits: 32, 30 and 30 respectively. The Length (Len) field can be either 2, 3 or 4, where binary 00 represents 4.
When reading a segment, the Jumbo InSpace Option could be present in a packet that is not a jumbogram (e.g. due to resegmentation). Therefore a receiver MUST use the Jumbo InSpace Option to work along the stream irrespective of whether arriving packets are jumbo sized or not.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------------------------------------------------------+
|                            Marker                             |
+-----------------------------------------------------------+---+
|                           ZOMBI                           |Len|
+-----------------------------------------------------------+---+
|                     Sent Data Size (SDS)                      |
+-----------------------------------------------------------+-+-+
|                Inner Options Offset (InOO)                CU|P|
+-----------------------------------------------------------+-+-+
|                Suffix Options Offset (SOO)                |CU |
+-----------------------------------------------------------+---+
Figure 8: InSpace Option for a Jumbo Datagram
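The field widths in Figure 8 could be unpacked along the following lines. This is a sketch under stated assumptions: the struct and function names are hypothetical, all five 32-bit words are assumed to be present, and fields are assumed to be in network byte order. Handling of shorter options (Len of 2 or 3) depends on Section 2.2.2 and is not attempted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of Figure 8. */
struct JumboInSpace {
    uint32_t zombi;  /* 30 bits */
    unsigned len;    /* 2, 3 or 4 (binary 00 on the wire means 4) */
    uint32_t sds;    /* 32 bits */
    uint32_t inoo;   /* 30 bits */
    unsigned p;      /* 1 bit */
    uint32_t soo;    /* 30 bits */
};

/* Loads a 32-bit big-endian word. */
static uint32_t Load32(const uint8_t *b)
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Parses a full five-word Jumbo InSpace Option starting at its marker.
 * Returns 0 on success, -1 if buf is too short or the marker is absent. */
int JumboInSpaceParse(const uint8_t *buf, size_t buflen,
                      struct JumboInSpace *out)
{
    uint32_t w;

    if (buf == NULL || out == NULL || buflen < 20)
        return -1;
    if (Load32(buf) != 0)           /* marker is 4 octets of zeros */
        return -1;
    w = Load32(buf + 4);
    out->zombi = w >> 2;
    out->len = (w & 0x3) ? (unsigned)(w & 0x3) : 4;
    out->sds = Load32(buf + 8);
    w = Load32(buf + 12);
    out->inoo = w >> 2;             /* the CU bit (0x2) is ignored */
    out->p = (unsigned)(w & 0x1);
    w = Load32(buf + 16);
    out->soo = w >> 2;              /* the two CU bits are ignored */
    return 0;
}
```

Reading octet by octet avoids any dependence on host byte order or on the alignment of the option within the receive buffer.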
This appendix is normative. It is separated from the body of the specification because it is OPTIONAL to implement while the Inner Space protocol is experimental.
In experiments conducted between 2010 and 2011, [Honda11] reported that 7 of 142 paths (about 5%) blocked access to port 80 if the payload was not parsable as valid HTTP. This extension to the specification has been defined in case experiments prove that it significantly improves traversal of such deep packet inspection (DPI) boxes.
This extension places the expected app-layer headers at the start of the TCP Data in the SYN and in the first data segment in the client-to-server direction:
From the second InSpace Option onwards, the structure of the stream reverts to that already defined in Section 2.2.1. So the value of Sent Data Size (SDS#2) in the second InSpace Option (InSpace #2) defines the length of the remaining TCP Data before the end of the first data segment, as shown.
                                            TCP Data
                    .---------------------------'------------------.
                    |                   Inner Options              |
a) SYN=1            |               .---------'---------.          |
+--------+----------+-------------+-+---------+---------+----------+
| BaseHdr| OuterOpts|  Payload    | | PrefOpts| SuffOpts|InSpace#1 |
+--------+----------+-------------+-+---------+---------+----------+
|        DO         |             | |   SOO   |         |          |
`------------------>|             |P`-------->|         | Len = 3  |
                    |             |a|       InOO        |<---------'
                    |             |d|<------------------'
                    |
b) First SYN=0 segment
+--------+----------+--------+-+---------+--------+--------+-------+
| BaseHdr| OuterOpts|Payload | |InSpace#2|PrefOpts|SuffOpts|Payload|
+--------+----------+--------+-+---------+--------+--------+-------+
|        DO         |        | |   Len   |  SOO   |                |
`------------------>|        |P`-------->`------->|                |
                    |        |a|         |      InOO       |       |
                    |        |d|         `---------------->|       |
                    | SDS#1  |                SDS#2                |
                    `------->`------------------------------------>|
                    |        |                                     |
All offsets are specified in 4-octet (32-bit) words, except SDS and Pad, which are in octets.
Figure 9: Segment Structures to Traverse DPI boxes (not to scale)
It is recognised that having to work from the end of the first segment makes segment processing more involved. Experimental implementation of this approach will determine whether the extra complexity improves DPI box traversal sufficiently to make it worthwhile.
If it does work, it is believed that this extension will only be necessary on the initial SYN and the first data segment sent in the direction from TCP client to server. Therefore, the SYN/ACK and data segments sent by the TCP server will continue to use the regular Inner Space segment structure illustrated in Figure 2.
If a TCP client that implements this extension opens a connection with a server that does not, the client will fall back to ordinary TCP even though the server would have supported the Inner Space protocol without the DPI traversal extension. This is because the server does not look for the magic number at the end of the SYN, so it behaves like a legacy TCP server responding with an ordinary SYN/ACK, which in turn makes the client fall back to ordinary TCP. Such limited fall-back is considered sufficient to support experiments to see whether the DPI traversal extension is useful. If it is useful, a future standards track specification could make support for this DPI traversal extension mandatory for an Inner Space TCP server, but still optional for an Inner Space TCP client.
In the body of this specification, two variants of the dual handshake are defined:
Both schemes double up connection state (for a round trip) on the Legacy Server. But only the implicit scheme doubles up connection state (for a round trip) on the Upgraded Server as well. On the other hand, the explicit scheme risks delay accessing a Legacy Server if a middlebox discards the SYN-O (some firewalls and middleboxes discard packets with unrecognised TCP options [Honda11]). Table 3 summarises these points.
 | SYN (Implicit) | SYN-O (Explicit) |
---|---|---|
Minimum state on Upgraded Server | - | + |
Minimum risk of delay to Legacy Server | + | - |
There is no need for the IETF to choose between these. If the specification allows either or both, the tradeoff can be left to implementers at build-time, or to the application at run-time.
Initially clients might choose the Implicit Dual Handshake to minimise delays due to middlebox interference. But later, perhaps once more middleboxes support the scheme, clients might choose the Explicit scheme, to minimise state on Upgraded Servers.
This appendix is informative, not normative. It records outstanding issues with the protocol design that will need to be resolved before publication.
A detailed version history can be accessed at <http://datatracker.ietf.org/doc/draft-briscoe-tcpm-inner-space/history/>
Editorial changes: