The Transmission Control Protocol
On an IP network, applications use two standard transport protocols to communicate with each other. These are the User Datagram Protocol (UDP), which provides a lightweight and unreliable transport service, and the Transmission Control Protocol (TCP), which provides a reliable and controlled transport service. The majority of Internet applications use TCP, since its built-in reliability and flow control services ensure that data does not get lost or corrupted.
TCP is probably the most important protocol in use on the Internet today. Although IP does the majority of the legwork, moving datagrams and packets around the Internet as needed, TCP makes sure that the data inside of the IP datagrams is correct. Without this reliability service, the Internet would not work nearly as well as it does, if it worked at all.
It is also interesting to note that the first versions of TCP were designed before IP, with IP being extracted from TCP later. In fact, TCP is now designed to work with any packet-switched network, whether this be raw Ethernet or a distributed IP-based network like the Internet. This flexible design has resulted in TCP being adopted by other network architectures, including OSI's Transport Protocol 4 (TP4) and Apple Computer Corporation's AppleTalk Data Stream Protocol (ADSP).
The TCP Standard
TCP is defined in RFC 793, which has been republished as STD 7 (TCP is an Internet Standard protocol). However, RFC 793 contained some vague areas that were clarified in RFC 1122 (Host Network Requirements). In addition, RFC 2001 introduced a variety of congestion-related elements to TCP that have since been folded into the standard specification, although that RFC was itself superseded by RFC 2581 (a.k.a. RFC 2001 bis). As such, TCP implementations need to incorporate RFC 793, RFC 1122, and RFC 2581 in order to work reliably and consistently with other implementations.
RFC 793 states that the Protocol ID for TCP is 6. When a system receives an IP datagram that is marked as containing Protocol 6, it should pass the contents of the datagram to TCP for further processing.
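As a concrete illustration of this dispatching rule, the following sketch (mine, not part of the original text) examines the Protocol field of a raw IPv4 header to decide whether the payload should be handed to TCP:

    import struct

    IPPROTO_TCP = 6  # the Protocol ID assigned to TCP

    def payload_is_tcp(ip_header: bytes) -> bool:
        """Check whether an IPv4 header says its payload is a TCP segment.

        The Protocol field is the single byte at offset 9 of the IPv4 header.
        """
        (protocol,) = struct.unpack("!B", ip_header[9:10])
        return protocol == IPPROTO_TCP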
TCP Is a Reliable, Connection-Centric Transport Protocol
Remember that all of the transport-layer protocols (including TCP and UDP) use IP for their basic delivery services, and that IP is an unreliable protocol, providing no guarantees that datagrams or packets will reach their destination intact. It is quite possible for IP packets to get lost entirely (due to an untimely link failure on the network somewhere), or for packets to become corrupted (due to an overworked or buggy router), or for packets to get reordered as they cross different networks en route to the destination system, or for a myriad of other problems to crop up while packets are being bounced around the Internet.
For applications that need some sort of guarantee that data will arrive at its destination intact, this uncertainty is simply unacceptable. Electronic mail, TELNET, and other network applications are the basis of many mission-critical efforts, and as such they need some sort of guarantee that the data they transmit will arrive in its original form.
This reliability is achieved through the use of a virtual circuit that TCP builds whenever two applications need to communicate. As we discussed in Chapter 1, An Introduction to TCP/IP, a TCP session is somewhat analogous to a telephone conversation in that it provides a managed, full-duplex, point-to-point communications circuit for application protocols to use. Whenever data needs to be sent between two TCP-based applications, a virtual circuit is established between the two TCP providers, and a highly monitored exchange of application data occurs. Once all of the data has been successfully sent and received, the connection gets torn down.
Building and monitoring these virtual circuits incurs a fair amount of overhead, making TCP somewhat slower than UDP. However, UDP does not provide any reliability services whatsoever, which is an unacceptable trade-off for many applications.
Services Provided by TCP
Although it is possible for applications to provide their own reliability and flow control services, it is impractical for them to do so. Rather than developing (and debugging) these kinds of services, it is much more efficient for applications to leverage them as part of a transport-layer protocol, where every application has access to them. This arrangement allows shorter development cycles, better interoperability, and fewer headaches for everybody.
TCP provides five key services to higher-layer applications:
Virtual circuits
Whenever two applications need to communicate with each other using TCP, a virtual circuit is established between the two TCP endpoints. The virtual circuit is at the heart of TCP's design, providing the reliability, flow control, and I/O management features that distinguish it from UDP.
Application I/O management
Applications communicate with each other by sending data to the local TCP provider, which then transmits the data across a virtual circuit to the other side, where it is eventually delivered to the destination application. TCP provides an I/O buffer for applications to use, allowing them to send and receive data as contiguous streams, with TCP converting the data into individually monitored segments that are sent over IP.
Network I/O management
When TCP needs to send data to another system, it uses IP for the actual delivery service. Thus, TCP also has to provide network I/O management services to IP, building segments that can travel efficiently over the IP network, and turning individual segments back into a data stream appropriate for the applications.
Flow control
Different hosts on a network will have different characteristics, including processing capabilities, memory, network bandwidth, and other resources. For this reason, not all hosts are able to send and receive data at the same rate, and TCP must be able to deal with these variations. Furthermore, TCP has to do all of this seamlessly, without any action being required from the applications in use.
Reliability
TCP provides a reliable transport service by monitoring the data that it sends. TCP uses sequence numbers to monitor individual bytes of data, acknowledgment flags to tell if some of those bytes have been lost somewhere, and checksums to validate the data itself. Taken together, these mechanisms make TCP extremely reliable.
All told, these services make TCP an extremely robust transport protocol.
Virtual Circuits
In order for TCP to provide a reliable transport service, it has to overcome IP's own inherent weaknesses, possibly the greatest of which is the inability to track data as it gets sent across the network. IP only moves packets around the network, and makes no pretense towards offering any sort of reliability whatsoever. Although this lack of reliability is actually a designed-in feature of IP that allows it to move data across multiple paths quickly, it is also an inherent weakness that must be overcome in order for applications to communicate with each other reliably and efficiently.
TCP does this by building a virtual circuit on top of IP's packet-centric network layer, and then tracking data as it is sent through the virtual circuit. This concept is illustrated in Figure 7-1. Whenever a connection is made between two TCP endpoints, all of the data gets passed through the virtual circuit.
By using this virtual circuit layer, TCP accomplishes several things. It allows IP to do what it does best (which is moving individual packets around the network), while also allowing applications to send and receive data without them having to worry about the condition of the underlying network. And since each byte of data is monitored individually by TCP, it's easy to take corrective actions whenever required, providing reliability and flow control services on top of the chaotic Internet.
These virtual circuits are somewhat analogous to the way that telephone calls work. It is easy to see this analogy if you think of the two TCP endpoints as being the "telephones," and the applications as being the users of those telephones.
When an application wants to exchange data with another system, it first requests that TCP establish a workable session between the local and remote applications. This process is similar to you calling another person on the phone. When the other party answers ("Hello?"), they are acknowledging that the call went through. You then acknowledge the other party's acknowledgment ("Hi Joe, this is Eric"), and begin exchanging information ("The reason I'm calling is...").
Likewise, data traveling over a TCP virtual circuit is monitored throughout the session, just as a telephone call is. If at any time parts of the data are lost ("What did you say?"), the sending system will retransmit the lost data ("I said..."). If the connection degrades to a point where communications are no longer possible, then sooner or later both parties will drop the call. Assuming that things don't deteriorate to that point, the parties will agree to disconnect ("See ya") once all of the data has been exchanged successfully, and the call will be gracefully terminated.
This concept is illustrated in Figure 7-2. When a TCP connection needs to be established, one of the two endpoint systems will try to connect with the other endpoint. If the "call" goes through successfully, then the TCP stack on the remote system will acknowledge the connection request, which will then be followed by an acknowledgment from the sender. This three-way handshake ensures that the connection is sufficiently reliable for data to be exchanged.
Likewise, each clump of data that is sent is explicitly acknowledged, providing constant feedback that everything is going okay. Once all of the data has been sent, either endpoint can close the virtual circuit. However, the disconnect process also uses acknowledgments in order to ensure that both parties are ready to terminate the call. If one of the systems still had data to send, then they might not agree to drop the circuit.
The virtual circuit metaphor has other similarities with traditional telephone calls. For example, TCP is a full-duplex transport that allows each party to send and receive data over the same virtual circuit simultaneously, just like a telephone call does. This allows for a web browser to request an object and for the web server to send the requested data back to the client using a single virtual circuit, rather than requiring that each end establish its own communication channel.
Every TCP virtual circuit is dedicated to one pair of endpoints, also like a telephone call. If an application needs to communicate with multiple endpoints simultaneously, then it must establish unique circuits for each endpoint pair, just as telephone calls do. This is true even if the same applications are in use at both ends of the connection. For example, if a web browser were to simultaneously request four GIF images from the same server using four simultaneous HTTP "GET" commands, then four separate TCP circuits would be needed in order for the operations to complete, even though the same applications and hosts were being used for all of the requests.
For all of these reasons, it is easy to think of TCP's virtual circuits as being very similar to the familiar concept of telephone calls.
Application I/O Management
The primary benefit of the virtual circuit metaphor is the reliability that it allows. However, another set of key benefits is the I/O management services that this design provides.
One of the main features that comes from this design is that applications can send and receive information as streams of data, rather than having to deal with packet-sizing and management issues directly. This allows a web server to send a very large graphic image as a single stream of data, rather than as a bunch of individual packets, leaving the task of packaging and tracking the data to TCP.
This design helps to keep application code simple and straightforward, resulting in lower complexity, higher reliability, and better interoperability. Application developers don't have to build flow control, circuit-management, and packaging services into their applications, but can instead use the services provided by TCP, without having to do anything special. All an application has to do is read and write data; TCP does everything else.
TCP provides four distinct application I/O management services to applications:
• Internal Addressing. TCP assigns unique port numbers to every instance of every application that is using a TCP virtual circuit. Essentially, these port numbers act as extension numbers, allowing TCP to route incoming data directly to the appropriate destination application.
• Opening Circuits. Applications inform TCP when they need to open a connection to a remote application, and leave it to TCP to get the job done.
• Data Transfer. Whenever an application needs to send data, it just hands it off to TCP, and assumes that TCP will do everything it can to make sure that the data is delivered intact to the destination system.
• Destroying Circuits. Once applications have finished exchanging data, they inform TCP that they are finished, and TCP closes the virtual circuit.
Application addressing with TCP ports
Applications communicate with TCP through the use of ports, which are practically identical to the ports found in UDP. Applications are assigned 16-bit port numbers when they register with TCP, and TCP uses these port numbers for all incoming and outgoing traffic.
Conceptually, port numbers provide "extensions" for the individual applications in use on a system, with the IP address of the local system acting as the main phone number. Remote applications "call" the host system (using the IP address), and also provide the extension number (port number) of the destination application that they want to communicate with. TCP uses this information to identify the sending and receiving applications, and to deliver data to the correct application.
Technically, this procedure is a bit more complex than it is being described here. When an application wishes to communicate with another application, it will give the data to TCP through its assigned port number, telling TCP the port number and IP address of the destination application. TCP will then create the necessary TCP message (called a "segment"), marking the source and destination port numbers in the message headers, and storing whatever data is being sent in the payload portion of the message. A complete TCP segment will then get passed off to the local IP software for delivery to the remote system (which will create the necessary IP datagram and shoot it off).
Once the IP datagram is received by the destination system, the remote IP software will see that the data portion of the datagram contains a TCP segment (as can be seen by the Protocol Identifier field in the IP header), and will hand the contents of the segment to TCP for further processing. TCP will then look at the TCP header, see the destination port number, and hand the payload portion of the segment off to whatever application is using the specified destination port number.
This concept is illustrated in Figure 7-3. In that example, an HTTP client is sending data to the HTTP server running on port 80 of the destination system. When the data arrives at the destination system, TCP will examine the destination port number for that segment, and then deliver the contents of the segment to the application it finds there (which should be the HTTP server).
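To make the port-based demultiplexing concrete, here is a minimal sketch (not from the original text) that pulls the two port numbers out of a raw TCP segment; the example values in the comment are hypothetical:

    import struct

    def tcp_ports(tcp_segment: bytes):
        """Return (source_port, destination_port) from a raw TCP segment.

        The source and destination ports are the first four bytes of the
        TCP header, as two unsigned 16-bit integers in network byte order.
        """
        return struct.unpack("!HH", tcp_segment[:4])

    # A segment sent to an HTTP server might yield (49152, 80): an ephemeral
    # client port paired with the server's well-known port 80.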
Technically, a port identifies only a single instance of an application on a single system. The term "socket" is used to identify the port number and IP address concatenated together (i.e., port 80 on host 192.168.10.10 could also be referred to as socket 192.168.10.10:80). A "socket pair" consists of both endpoints on a virtual circuit, including the IP addresses and port numbers of both applications on both systems.
All TCP virtual circuits work on the concept of socket pairs. Multiple connections between two systems must have unique socket pairs, with at least one of the two endpoints having a different port number.
TCP port numbers are not necessarily linked with applications on a one-to-one basis. It is quite common for some applications to open multiple connections simultaneously, and these connections would all require unique socket pairs, even if there was only one application in use. For example, if an HTTP 1.0 client were to simultaneously download multiple graphic objects from an HTTP server, then each instance of the HTTP client would require a unique and separate port number in order for TCP to route the data correctly. In this case, there would be only one application, but there would be multiple bindings to the network, with each binding having a unique port number.
It is important to realize that circuits and ports are entirely separate entities, although they are tightly interwoven. The virtual circuit provides a managed transport between two endpoint TCP providers, while port numbers provide only an address for the applications to use when talking to their local TCP provider. For this reason, it is entirely possible for a server application to support several different client connections through a single port number (although each unique virtual circuit will have a unique socket pair, with the client-side address and/or socket being the unique element).
For example, Figure 7-4 shows a single HTTP server running on Arachnid, with two active virtual circuits (one for Ferret, and another for Greywolf). Although both connections use the same IP address and port number on Arachnid, the socket pairs themselves are unique, due to the different IP addresses and port numbers in use by the two client systems. In this regard, virtual circuits are different from the port number in use by the HTTP server, although these elements are also tightly related.
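The same behavior can be demonstrated with a short, self-contained loopback experiment. This sketch is illustrative only; port 8080 is an arbitrary choice, and getsockname()/getpeername() together reveal each circuit's socket pair:

    import socket
    import threading

    def serve(listener):
        for _ in range(2):
            conn, peer = listener.accept()
            # This circuit's socket pair: (local address, remote address)
            print("server side:", conn.getsockname(), "<->", peer)
            conn.close()

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 8080))  # one server port, many circuits
    listener.listen(5)
    server_thread = threading.Thread(target=serve, args=(listener,))
    server_thread.start()

    for _ in range(2):
        client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        client.connect(("127.0.0.1", 8080))
        print("client side:", client.getsockname(), "<->", client.getpeername())
        client.close()

    server_thread.join()
    listener.close()

Running this prints two socket pairs that share the server's address and port but differ in the client's randomly assigned port number, which is what keeps the two circuits distinct.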
Most of the server-based IP applications that are used on the Internet today use what are referred to as "well-known" port numbers, as we discussed in the previous chapter. For example, an HTTP server will listen on TCP port 80 by default, which is the well-known port number associated with HTTP servers. This way, any HTTP client that needs to connect to any HTTP server can use the default destination of TCP port 80. Otherwise, the client would have to specify the port number of the server that it wanted to connect with (you've seen this in some URLs that use http://www.somehost.com:8080/ or the like; 8080 is the port number of the HTTP server on www.somehost.com).
Most servers let you use any port number and are not restricted to the well-known port number. However, if you run your servers on non-standard ports, then you would have to tell every user that the server was not accessible on the default port. This would be hard to manage at best. By sticking with the defaults, all users can connect to your server using the default port number, which is likely to cause the least amount of trouble.
Historically, only server-based applications have been allowed to run on ports below 1024, as these ports could be used only by privileged accounts. By limiting access to these port numbers, it was more difficult for a hacker to install a rogue application server. However, this restriction is based on Unix-specific architectures and is not easily enforced on all of the systems that run IP today. Many application servers now run on operating systems that have little or no concept of privileged users, making this historical restriction somewhat irrelevant.
There are a number of predefined port numbers that are registered with the Internet Assigned Numbers Authority (IANA). All of the port numbers below 1024 are reserved for use with well-known applications, although there are also many applications that use port numbers outside of this range. Some of the more common port numbers are shown in Table 7-1. For a detailed listing of all of the port numbers that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/port-numbers).
Besides the reserved addresses that are managed by the IANA, there are also "unreserved" port numbers that can be used by any application for any purpose, although conflicts may occur with other users who are also using those port numbers. Any port number that is frequently used should be registered with the IANA.
To see the well-known ports used on your system, examine the /etc/services file on a Unix host, or the C:\WinNT\System32\Drivers\Etc\SERVICES file on a Windows NT host.
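These mappings can also be queried programmatically; the standard sockets interface reads the same services database shown above. A small sketch:

    import socket

    print(socket.getservbyname("http", "tcp"))    # 80
    print(socket.getservbyname("telnet", "tcp"))  # 23
    print(socket.getservbyport(25, "tcp"))        # smtp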
Opening a circuit
Applications communicate with each other using the virtual circuits provided by TCP. These circuits are established on an as-needed basis, getting created and destroyed as requested by the applications in use. Whenever an application needs to communicate with another application somewhere on the network, it will ask the local TCP provider to establish a virtual circuit on its behalf.
There are two methods for requesting that a virtual circuit be opened: either a client will request an open so that data can be sent immediately, or a server will open a port in "listen" mode, waiting for a connection request to arrive from a client.
The simplest of the two methods is the "passive open," which is the form used by servers that want to listen for incoming connections. A passive open indicates that the server is willing to accept incoming connection requests from other systems, and that it does not want to initiate an outbound connection. Typically, a passive open is "unqualified," meaning the server can accept an incoming connection from anybody. However, some security-sensitive applications will accept connections only from predefined entities, a condition known as a "qualified passive open." This type is most often seen with corporate web servers, ISP news servers, and other restricted-access systems.
When a publicly accessible server first gets started, it will request that TCP open a well-known port in passive mode, offering connectivity to any node that sends in a connection request. Any TCP connection requests that come into the system destined for that port number will result in a new virtual circuit being established.
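In sockets terms, a passive open corresponds to the bind/listen/accept sequence. The sketch below is a minimal, unqualified passive open, with port 8080 standing in for a real well-known port; it blocks until some client performs an active open against it:

    import socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("0.0.0.0", 8080))  # accept connection requests from anybody
    server.listen(5)                # the passive open: wait for active opens
    conn, client_addr = server.accept()
    print("virtual circuit established with", client_addr)
    conn.close()
    server.close()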
Client applications (such as a web browser) use "active opens" when making these connection requests. An active open is the opposite of a passive open, in that it is a specific request to establish a virtual circuit with a specific destination socket (typically this will be the well-known port number of the server associated with the specific client).
This process is illustrated in Figure 7-5. When an HTTP client needs to get a document from a remote HTTP server, it issues an "active open" to the local TCP software, providing it with the IP address and TCP port number of the destination HTTP server. The client's TCP provider then allocates a random port number for the application and attempts to establish a virtual circuit with the destination system's TCP software. The server's TCP software verifies that the connection can be opened (Is the port available? Are there security filters in place that would prevent the connection?), and then responds with an acknowledgment.
If the destination port is unavailable (perhaps the web server is down), then the TCP provider on the server system rejects the connection request. This is in contrast to UDP, which has to rely on ICMP Destination Unreachable: Port Unreachable error messages for this service. TCP is able to reject connections explicitly, and can therefore abort connection requests without having to involve ICMP.
If the connection request is accepted, then the TCP provider on the server system acknowledges the request, and the client would then acknowledge the server's acknowledgment. At this point, the virtual circuit would be established and operational, and the two applications could begin exchanging data, as illustrated in Figure 7-5.
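From the client's side, all of Figure 7-5 collapses into a single connect() call, which performs the three-way handshake before returning. A sketch, using www.example.com:80 as a stand-in destination:

    import socket

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("www.example.com", 80))       # SYN, SYN+ACK, ACK occur here
    print("local socket:", client.getsockname())  # TCP chose a random local port
    client.close()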
The segments used for the handshake process do not normally contain data, but instead are zero-length "command segments" that have special connection-management flags in their headers, signifying that a new virtual circuit is being established. In this context, the most important of these flags is the Synchronize flag, used by the two endpoints to signify that a virtual circuit is being established.
For example, the first command segment sent by the client in Figure 7-5 would have the Synchronize flag enabled. This flag tells the server's TCP software that this is a new connection request. In addition, this command segment will also provide the starting byte number (called the "sequence number") that the client will use when sending data to the server, with this data being provided in the Sequence Identifier field of the TCP header.
If the server is willing to establish a virtual circuit with the client, then it will respond with its own command segment that also contains the Synchronize flag and that also gives the starting sequence number that the server will use when sending data back to the client. This command segment will also have the Acknowledgment flag enabled, with the Acknowledgment Identifier field pointing to the client's next-expected sequence number.
The client will then return a command segment with the Acknowledgment flag enabled to the server, and with its Acknowledgment Identifier field pointing to the server's next-expected sequence number. Note that this segment does not have the Synchronize flag enabled, since the virtual circuit is now considered up and operational, with both systems now being able to exchange data as needed.
It is entirely possible for two systems to issue active opens to each other simultaneously, although this scenario is extremely rare (I know of no applications that do this purposefully). In theory, such an event is possible, although it probably happens only on very slow networks where the circuit-setup messages pass each other on the wire.
For more information on the Synchronize and Acknowledgment flags, refer to "Control Flags" later in this chapter. For more information on the sequence and acknowledgment numbers, refer to "Reliability," also later in this chapter.
Exchanging data
Once a virtual circuit has been established, the applications in use can begin exchanging data with each other. However, it is important to note that applications do not exchange data directly. Rather, each application hands data to its local TCP provider, identifying the specific destination socket that the data is for, and TCP does the rest.
Applications can pass data to TCP in chunks or as a contiguous byte-stream. Most TCP implementations provide a "write" service that is restricted in size, forcing applications to write data in blocks, just as if they were writing data to a file on the local hard drive. However, TCP's buffering design also supports application writes that are contiguous, and this design is used in a handful of implementations.
TCP stores the data that it receives into a local send buffer. Periodically, a chunk of data will get sent to the destination system. The recipient TCP software will then store this data into a receive buffer, where it will be eventually passed to the destination application.
For example, whenever a web browser issues an HTTP "GET" request, the request is passed to TCP as application data. TCP stores the data into a send buffer, packaging it up with any other data that is bound for the destination socket. The data then gets bundled into an IP datagram and sent to the destination system. The recipient's TCP provider then takes the data and passes it up to the web server, which fetches the requested document and hands it off to TCP. TCP sends chunks of the document data back to the client in multiple IP packets, where it is queued up and then handed to the application.
This concept is outlined in Figure 7-6, which shows an HTTP client asking for a document from a remote HTTP server. Once the TCP virtual circuit is established, the HTTP client writes "GET document" into the local send buffer associated with the virtual circuit in use by the client. TCP then puts this data into a TCP segment (creating the appropriate TCP headers), and sends it on to the specified destination system via IP. The HTTP server at the other end of the connection would then take the same series of steps when returning the requested document back to the client.
The important thing to remember here is that application data is transmitted as independent TCP segments, each of which requires acknowledgments. It is at this layer that TCP's reliability and flow control services are most visible.
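A hedged sketch of this exchange from the client's side: the application writes its request and reads the reply as byte streams, while TCP (not the application) decides how both are carved into segments and acknowledged:

    import socket

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("www.example.com", 80))
    client.sendall(b"GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n")

    document = b""
    while True:
        chunk = client.recv(4096)  # drain TCP's receive buffer
        if not chunk:              # an empty read means the server closed its end
            break
        document += chunk
    client.close()
    print(len(document), "bytes received")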
For more information on how TCP converts application data into IP datagrams, refer ahead to "Network I/O Management."
Closing a circuit
Once the applications have exchanged all of their data, the circuit can be closed. Closing a circuit is similar to opening one, in that an application must request the action (except in those cases where the connection has collapsed, and TCP is forced to terminate it).
Either end of the connection may close the circuit at any time, using a variety of different means. The two common ways to close are "active closes" that initiate a shutdown sequence, and "passive closes" that respond to an active close request.
Just as building a circuit requires a bidirectional exchange of special command segments, so does closing it. One end of the connection requests that the circuit be closed (the active close at work). The remote system then acknowledges the termination request and responds with its own termination request (the passive close). The terminating system then acknowledges the acknowledgment, and both endpoints drop the circuit. At this point, neither system is able to send any more data over the virtual circuit.
Figure 7-7 shows this process in detail. Once the HTTP client has received all of the data, it requests that the virtual circuit be closed. The HTTP server then returns an acknowledgment for the shutdown request, and also sends its own termination request. When the server's shutdown request is received by the client, the client issues a final acknowledgment, and begins closing its end of the circuit. Once the final acknowledgment is received by the server, the server shuts down whatever is left of the circuit. By this point, the connection is completely closed.
Just as TCP uses special Synchronize flags in the circuit-setup command segments, TCP also has special Finish flags that it uses when terminating a virtual circuit. The side issuing the active close sends a command segment with the Finish flag enabled and with a sequence number that is one byte higher than the last sequence number used by that side during the connection. The destination system responds with a command segment that also has the Finish flag enabled and with its sequence number also incremented by one. In addition, the Acknowledgment Identifier field for this response will still point to the next-expected sequence number, even though no other data should be forthcoming. In this regard, the Finish flag is considered to be one byte of data (just like the Synchronize flag), and as such must be acknowledged explicitly.
Once the Finish segments have been exchanged, the terminating system must respond with a final acknowledgment for the last Finish segment, with the Acknowledgment Identifier also pointing to the next-expected sequence number (even though there should not be another segment coming). However, this last segment will not have the Finish flag enabled, since the circuit is considered to be "down" and out of action by this point.
It's important to note that either endpoint can initiate the circuit-termination process, and there are no hard and fast rules for which end should do it, although typically it is left to the client to perform this service since it may make multiple requests over a single connection. POP3 is a good example of this process, as POP3 allows a client to submit multiple commands during a single session. The client would need to dictate when the circuit should be closed with this type of application. However, sometimes a server issues the active close. For example, Gopher servers close the virtual circuit after sending whatever data has been requested, as do HTTP 1.0 servers.
It's also important to note that "server" applications keep the port open until the application itself is terminated, allowing other clients to continue connecting to that server. However, the individual circuits will be torn down on a per-connection basis, according to the process described above.
Sometimes, the two systems do not close their ends of the circuit simultaneously. This results in a staggered close, also known as a "half-close," with each end issuing its close request at a different time. One example of this type can be found in the rsh utility, which is used to submit shell commands to rsh servers. On some systems, once an rsh command has been sent, the client will close its end of the connection, effectively switching the virtual circuit into half-duplex mode. The server will then process the shell command, send the results back to the client (for display or further processing), and then close its end of the connection. Once both ends have been closed, the circuit is dropped.
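In the sockets API, this staggered close maps onto the shutdown() call. The sketch below imitates the rsh pattern just described; it is not the real rsh wire protocol, and server.example.com is a hypothetical host:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("server.example.com", 514))  # 514/tcp is the rsh "shell" service
    s.sendall(b"ls /tmp\n")
    s.shutdown(socket.SHUT_WR)  # half-close: our Finish goes out, receiving still works

    results = b""
    while True:
        chunk = s.recv(4096)    # the reply arrives over the half-open circuit
        if not chunk:           # now the server has closed its end as well
            break
        results += chunk
    s.close()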
Another option for closing a circuit is to simply drop it, without going through an orderly shutdown. Although this method is likely to cause unnecessary traffic, it is not uncommon. Typically, it should be used only when an application has been abruptly terminated. If an application needs to close a circuit immediately, without going through the normal shutdown sequence, it will request an immediate termination, and TCP will issue a segment with the Reset flag set, informing the other end that the connection is being killed immediately.
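There is no portable "send a Reset" call in the sockets API, but on most stacks an abortive close can be provoked by enabling SO_LINGER with a zero timeout, as in this sketch:

    import socket
    import struct

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("www.example.com", 80))
    # linger structure: onoff=1, linger=0 seconds -> close() aborts with a Reset
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    s.close()  # the circuit is killed immediately; no Finish exchange occurs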
For more information on the Finish, Reset, and Acknowledgment flags, refer ahead to "Control Flags." For more information on the sequence and acknowledgment numbers, refer ahead to "Reliability."
Application design issues
Some applications open a connection and keep it open for long periods of time, while others open and close connections rapidly, using many circuits for a single operation.
For example, if you instruct your web browser to open a document from an HTTP 1.0 server, the HTTP client issues an active open to the destination server, which then sends the document to the client and closes the TCP connection. If there are any graphic objects in that document, the HTTP client has to open a separate connection for each of those objects. Thus, opening a single web page could easily result in twenty or more circuits being established and destroyed, depending on the number of objects embedded in the requested page.
Since this model generates a lot of traffic (and uses a lot of network resources on the server), this process was changed with HTTP 1.1, which now allows a single circuit to be used for multiple operations. With HTTP 1.1, a client may request a page and then reuse the existing circuit to download objects embedded within that page. This model results in significantly fewer virtual circuits being used, although it also makes the download process synchronous rather than asynchronous.
Most applications use a single circuit for everything, keeping that circuit open even when there may not be any noticeable activity. TELNET is one example of this, where the TELNET client will issue an active open during the initial connection, and then use that virtual circuit for everything until the connection is terminated. After logging in, the user may get up and walk away from the client system, and thus no activity may occur for an extended period of time, although the TCP connection between the two systems would remain active.
Whether the circuits are torn down immediately or kept open for extended periods of time is really a function of the application's design goal, rather than anything mandated by TCP. It is entirely possible for clients to open and close connections rapidly (as seen with web browsers that use individual circuits for every element in a downloaded document), or to open a single connection and maintain it in perpetuity (as seen with TELNET).
Keep-alives
Although RFC 793 does not make any provision for a keep-alive mechanism, some TCP implementations provide one anyway. There are good reasons for doing this, and bad ones as well.
By design, TCP keep-alives are supposed to be used to detect when one of the TCP endpoints has disappeared without closing the connection. This feature is particularly useful for applications where the client may be inactive for long periods of time (such as TELNET), and there's no way to tell whether the connection is still valid.
For example, if a PC running a TELNET client were powered off, the client would not close the virtual circuit gracefully. Unfortunately, when that happened the TELNET server would never know that the other end had disappeared. Long periods of inactivity are common with TELNET, so not getting any data from the client for an extended period would not raise any alarms on the TELNET server itself. Furthermore, since the TELNET server wouldn't normally send unsolicited data to the client, it would never detect the failure from a lack of acknowledgments either. Thus, the connection might stay open indefinitely, consuming system resources for no good purpose.
TCP keep-alives allow servers to check up on clients periodically. If no response is received from the remote endpoint, then the circuit is considered invalid and will be released.
RFC 1122 states that keep-alives are entirely optional, should be user-configurable, and should be implemented only within server-side applications that will suffer real harm if the client were to disappear. Although implementations vary, RFC 1122 also states that keep-alive segments should not contain any data, but may be configured to send one byte of data if required for compatibility with noncompliant implementations.
Most systems use an unsolicited command segment for this task, with the sequence number of the command segment set to one byte less than the sequence number of the next byte of data to be sent, effectively reusing the last sequence number of the last byte of data sent over the virtual circuit. This design effectively forces the remote endpoint to issue a duplicate acknowledgment for the last byte of data that was sent over that connection. When the acknowledgment arrives, the server knows that the client is still there and operational. If no response comes back after a few such tests, then the server can drop the circuit.
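From the application's point of view, keep-alives are just a socket option. A sketch follows; note that the idle/interval/count knobs are Linux-specific names that may be absent or spelled differently on other systems, which is why they are guarded here:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)  # enable keep-alives
    if hasattr(socket, "TCP_KEEPIDLE"):
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)  # idle seconds before first probe
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # seconds between probes
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before dropping the circuit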
Network I/O Management
When an application needs to send data to another application over TCP, it writes the data to the local TCP provider, which queues the data into a send buffer. Periodically, TCP packages portions of the data into bundles (called "segments") and passes them off to IP for delivery to the destination system, as illustrated in Figure 7-8.
Although this process sounds simple, it involves a lot of work, primarily due to the segment-sizing issues that TCP has to deal with. For every segment that gets created, TCP has to determine the most efficient segment size to use at that particular moment, which is an extremely complex affair involving many different factors.
However, this is also an extremely important service, since accurately determining the size of a segment dictates many of the performance characteristics of a virtual circuit.
For example, making the segment too small wastes network bandwidth. Every TCP segment contains at least 40 bytes of overhead for the IP and TCP headers, so if a segment only contained one byte of data, then the byte ratio of headers-to-data for that segment would be 40:1, a miserable level of throughput by anybody's standard. Conversely, sending 400 bytes of data would change this ratio to 1:10, which is better, although still not very good. Sending four kilobytes of data would change this ratio to 1:100, which would provide excellent utilization of the network's capacity.
On the other hand, sending too much data in a segment can cripple performance as well. If a segment were too big for the IP datagram to travel across the network (due to topology-specific restrictions), then the IP datagram itself would have to be fragmented in order for it to get across that network. This situation would not only require additional processing time on the router that had to fragment the packet, but it would also introduce delays to the destination system, as the receiving IP stack would have to wait for all of the IP fragments to arrive and be reassembled before the TCP segment could be passed to TCP for processing.
In addition, fragmentation also introduces reliability concerns. If a fragment is lost, then the recipient's fragmentation timers have to expire, and an ICMP Time Exceeded error message has to be issued. Then the sender has to resend the entire datagram, which is likely to result in even more fragmentation occurring. Furthermore, on networks that experience known levels of packet loss, fragmentation increases the network's exposure to damage, since a single lost fragment will destroy a large block of data. But if the data were sent as discrete packets to begin with, the same lost packet would result in only that one small segment being lost, which would take less time to recover from. For all of these reasons, avoiding fragmentation is also a critical function of accurately determining the most effective segment size for any given virtual circuit.
Determining the most effective segment size involves the following factors:
Send buffer size
The most obvious part of this equation involves the size of the send buffer on the local system. If the send buffer fills up, then a segment must be sent in order to make space in the queue for more data, regardless of any other factors. Receive buffer size
Similarly, the size of the receive buffer on the destination system is also a concern, as sending more data than the recipient can handle would cause overruns, resulting in the retransmission of lost segments. MTU and MRU sizes
TCP also has to take into consideration the maximum amount of data that an IP datagram can handle, as determined by the Maximum Transfer Unit (MTU) size of the physical medium in use on the local network, the Maximum Receive Unit (MRU) size of the destination system's network connection, and the MTU/MRU sizes of all the intermediary networks in between the two endpoint systems. If a datagram is generated that is too large for the end-to-end network to handle, then fragmentation would definitely occur, penalizing performance and reliability. Header size
IP datagrams have headers, which will steal anywhere from 20 to 60 bytes of data from the segment. Likewise, TCP also has variable-length headers which will steal another 20 to 60 bytes of space. TCP has to leave room for the IP and TCP headers in the segments that get created, otherwise the datagram would be too large for the network to handle, and fragmentation would occur. Data size and timeliness
The frequency at which queued data is sent is determined by the rate at which data is being generated. Obviously, if lots of data is being generated by an application, then lots of TCP segments will need to be sent quickly. Conversely, small trickles of data will still need to be sent in a timely manner, although this would result in very small segments. In addition, sometimes an application will request that data be sent immediately, bypassing the queue entirely. Taking all of these variables into consideration, the formula for determining the most efficient segment size can be stated as follows:
MESS = (lesser of (send buffer, receive buffer, MTU, MRU) - headers) or (data + headers)
Simply put, the most efficient segment size is determined by finding the lowest available unit of storage (send buffers, receive buffers, or the MTU/MRU values in use) minus the required number of bytes for the IP and TCP headers, except in those situations where there is only a little bit of data to send. In that case, the size of the data (plus the required headers) will determine the size of the segment that is being sent.
By limiting the segment size to the smallest available unit of storage, the segment can be sent from one endpoint to another without having to worry about fragmentation. In turn, this allows TCP to use the largest possible segment for sending data that can be sent end-to-end, which allows the most amount of data to be sent in the least amount of time.
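Restating the rule as code may make it easier to follow. This small function is a simplification of the calculation described above (all sizes in bytes, with the usual 40 bytes of IP and TCP headers assumed):

    def efficient_segment_size(send_buf, recv_buf, mtu, mru, queued_data, headers=40):
        """Return the payload size TCP would aim for on this circuit."""
        ceiling = min(send_buf, recv_buf, mtu, mru) - headers
        return min(ceiling, queued_data)  # a small write caps the segment size

    # Ethernet MTU/MRU of 1500 with 8 KB buffers and plenty of queued data:
    print(efficient_segment_size(8192, 8192, 1500, 1500, 100000))  # -> 1460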
Buffer size considerations
Part of determining the most efficient segment size is derived from the size of the send and receive buffers in use on the two systems. If the send buffer is very small, then the sender cannot build a very large segment. Similarly, if the receive buffer is small, then the sender cannot transmit a large segment (even if it could build one), as that would cause overruns at the destination system, which would eventually require the data to be retransmitted.
Every system has a different default buffer size, depending upon its configuration. Most PC-based client systems have eight kilobyte send and receive buffers, while many server-class systems have buffers of 16 kilobytes or more. It is not uncommon for high-end servers to have 32 or 48 kilobyte send and receive buffers. However, most systems will let you specify the default size for the receive buffers on your system, and they will also let the application developer configure specific settings for their particular application.
Sometimes the size of the local system's send buffer is the bottleneck. If the send buffer is very small, then the sending device just won't be able to generate large segments, regardless of the amount of data being written, the size of the receive buffer, or the MTU/MRU sizes in use on the two networks. Typically, this is not the case, although it can be in some situations, particularly with small hand-held computers that have very limited system resources.
Similarly, sometimes the size of the receive buffers in use at the destination system will be the limiting factor. If the receive buffer on the destination system is very small, then the sender must restrict the amount of data that it pushes to the receiving endpoint. This is also uncommon, but it is not unheard of. High-speed Token Ring networks are capable of supporting MTUs of 16 kilobytes and more, while the PCs attached to those networks may have TCP receive buffers of only eight kilobytes. In this situation, the segment size would be restricted to the available buffer space (eight kilobytes), rather than the MTU/MRU capabilities of the network (16 kilobytes).
Obviously, a sender already knows the size of its send buffers, but it also has to determine the size of the recipient's receive buffer before it can use that information in its segment-sizing calculations. This is achieved through the use of a 16-bit "Window" field that is stored in the header of every TCP segment that gets sent across a virtual circuit. Whenever a TCP segment is created, the sending endpoint stores the current size of its receive buffer in the Window field, and the recipient then reads this information once the segment arrives. This allows each system to constantly monitor the size of the remote system's receive buffers, thereby allowing it to determine the maximum amount of data that can be sent at any given time.
Note that the Window field is only 16 bits long, which limits the size of an advertised receive buffer to a maximum of 64 kilobytes. RFC 1323 defines a TCP option called the Window Scale option that allows two endpoints to negotiate 30-bit window sizes, allowing sizes of up to one gigabyte to be advertised.
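The receive buffer that feeds the Window field can be inspected and adjusted through the SO_RCVBUF socket option. A sketch; note that some systems round, double, or clamp the requested value, so it pays to read it back:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("default receive buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)  # request 64 KB
    print("adjusted receive buffer:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))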
For more information on the Window field, refer to "Window." For information on how to calculate the optimal default receive window size for your system, refer to "Notes on Determining the Optimal Receive Window Size." For more information on how the window value affects flow control, refer to "Receive window size adjustments." For more information on the TCP Window Scale option, refer to "Window Scale." All of these sections appear later in this chapter.
MTU and MRU size considerations
Although buffer sizing issues can have an impact on the size of any given segment at any given time, most of the time the deciding factor for segment sizes is the size of the MTU and MRU in use by the end-to-end network connection.
For example, even the weakest of systems will have a TCP receive buffer of two or more kilobytes, while the MTU/MRU for Ethernet networks is only 1.5 kilobytes. In this case (and almost all others), the MTU/MRU of the Ethernet segment will determine the maximum segment size for that system, since it indicates the largest amount of data that can be sent in a single datagram without causing fragmentation to occur.
Typically, the MTU and MRU sizes for a particular network are the same values. For example, Ethernet networks have an MTU/MRU of 1500 bytes, and both of these values are fixed. However, many dial-up networks allow an endpoint system to define different MTU and MRU sizes. In particular, many dial-up systems set the MTU to be quite small, while also setting the MRU to be quite large. This imbalance can actually help to improve the overall performance of the client, making it snappier than a fixed, medium-sized MTU/MRU pair would allow for.
To understand why this is so, you have to understand that most dial-up systems are clients, using applications such as POP3 and TELNET to retrieve large amounts of data from remote servers. Having a small MTU size forces the client to send segments quickly, since the MTU is the bottleneck in the segment-sizing calculations. Conversely, having a large MRU on a dial-up circuit allows the client to advertise a larger receive value, thereby letting the server send larger blocks of data down to the client. Taken together, the combination of a small MTU and a large MRU allows a dial-up client to send data quickly while also allowing it to download data in large chunks.
For example, one endpoint may be connected via a dial-up modem using a 1500-byte MRU, while the other node may be connected to a Token Ring network with a four-kilobyte MTU, as shown in Figure 7-9. In this example, the 1500-byte MRU would be the limiting factor when data was being sent to the dial-up client, since it represents the bottleneck. Furthermore, if the dial-up client had a 576-byte MTU (regardless of the 1500-byte MRU), then that value would be the limiting factor when data was being sent from the dial-up client up to the Token Ring-attached device.
Regardless of whether the client has a large or small MTU, it should be obvious that senders have to take the remote system's MRU into consideration when determining the most efficient segment size for a virtual circuit. At the same time, however, the sender also has to worry about the size of its local MTU. Both of these factors will determine the largest possible segment allowable on any given virtual circuit.
In order for all of this to work, both systems have to be able to determine each other's MRU sizes (they already know their own MTU sizes), and then independently calculate the maximum segment sizes that are allowed for the virtual circuit.
This determination is achieved by each system advertising its local MRU during the circuit-setup sequence. When each system sends its TCP start segments, it also includes its local MRU size (minus 40 bytes for the IP and TCP headers) in those segments, using a TCP option called the Maximum Segment Size option. Since each system advertises its MRU in the start segments, it is a simple procedure for each of the systems to read the advertised values and compare them with its own MTU values.
In truth, the MSS value advertised in the MSS option field tends to be based on the sender's MTU, rather than the MRU. Only a handful of systems actually use the MRU for their MSS advertisements. Although RFC 793 states that the MSS should be derived from the MRU, RFC 1122 clarified this position, stating that the MSS should be derived from the largest segment size that could be reassembled, which could be just about any value (although most implementations set this to the MTU size). Also, since most networks have fixed MTU/MRU pairs, most vendors set this value to the MTU size, knowing that it is the largest segment they can send. While this probably isn't the most technically accurate approach, it is what most implementations have chosen.
Note that RFC 793 states that the use of the MSS option is entirely optional, and therefore not required. If a system did not include an MSS option in its start segments, then a default value of 536 bytes (which is 576 bytes minus 40 bytes for the TCP and IP headers) should be used as the default. However, RFC 1122 reversed this position, stating that the MSS option is mandatory and must be implemented by all TCP providers.
Also note that some BSD-based systems can send only segments with lengths that are multiples of 512 bytes. So, even if an MTU of 576 bytes were available, the segments generated by these systems would be only 512 bytes long. Similarly, circuits capable of supporting MTU sizes of 1.5 kilobytes would use segments of only 1,024 bytes in length.
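On platforms that expose the TCP_MAXSEG socket option, the segment size that was settled on for a circuit can be read back directly. A sketch, again using www.example.com as a stand-in; an Ethernet-attached host will typically report 1460 (1500 minus 40 bytes of headers):

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("www.example.com", 80))
    print("MSS for this circuit:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG))
    s.close()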
For a list of the default MTU sizes used with the most common network topologies, refer to Table 2-5 in Chapter 2, The Internet Protocol. For more information on the MSS option, refer to "Maximum Segment Size."
Path MTU discovery
Even though TCP systems are able to determine the MTU values in use by the endpoints on a virtual circuit, they are not able to determine the MTU sizes of the networks in between the two endpoints, which may be smaller than the MTU/MRU values in use at either of the endpoint networks. In this scenario, fragmentation would still occur, since the MTU of the intermediary network would require that the IP datagrams be fragmented.
For example, if two systems are both on Token Ring networks using four-kilobyte MTUs, but there is an Ethernet network between them with a 1.5 kilobyte MTU, then fragmentation will occur when the four-kilobyte IP datagrams are sent over the 1.5 kilobyte Ethernet network. This process will lower the overall performance of the virtual circuit and may introduce some reliability problems.
By itself, TCP does not provide any means for determining the MTU of an intermediate network, and must rely on external means to discover the problem. One solution to this problem is to use a technique called Path MTU Discovery, which incorporates the IP Don't Fragment bit and the ICMP Destination Unreachable: Fragmentation Required error message to determine the MTU of the end-to-end IP network.
Essentially, Path MTU Discovery works by having one system create an IP packet of the largest possible size (as determined by the MTU/MRU pair for the virtual circuit), and then setting the Don't Fragment flag on the first IP packet. If the packet is rejected by an intermediary device (due to the packet being too large to forward without being fragmented), then the sender will try to resend the packet using a smaller segment size.
This procedure is repeated until ICMP errors stop coming back. At that point, the sender can use the size of the last-tested packet as the MTU for the entire network. Unfortunately, some systems assume that "no error messages" means the packet was delivered successfully, without conducting any further testing to verify that theory. Yet some routers and firewalls do not return ICMP errors (due to security concerns or configuration mistakes), so the absence of an error message does not necessarily mean that the packet got through.
This unreliability can cause a situation known as "Path MTU Black Hole," where the sender has chosen an MTU that is too large for the end-to-end network, but the network is unable or unwilling to inform the sender of the problem. In this scenario, the sender continues sending data with an MTU that is too large for the intermediary network to forward without fragmentation (which the sender has prohibited). Some implementations are aware of this problem; if it appears that packets are not getting through, they either reduce the size of the segments that they generate until acknowledgments are returned, or they clear the Don't Fragment flag, allowing fragmentation to occur.
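On Linux, Path MTU Discovery can be forced on a socket and the discovered value read back afterward. The numeric option values below come from the Linux ABI (used directly because not every Python build exposes them by name); this is a platform-specific sketch, not portable code:

    import socket

    IP_MTU_DISCOVER = 10  # setsockopt: per-socket PMTU discovery policy
    IP_PMTUDISC_DO = 2    # always set the Don't Fragment flag
    IP_MTU = 14           # getsockopt: current path MTU of a connected socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    s.connect(("www.example.com", 80))
    print("path MTU:", s.getsockopt(socket.IPPROTO_IP, IP_MTU))
    s.close()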
For a complete discussion of this subject, refer to "Notes on Path MTU Discovery" in Chapter 5, The Internet Control Message Protocol.
Header size considerations

As we discussed in Chapter 2, The Internet Protocol, most IP packets have a 20-byte header, with a maximum of 60 bytes being used for this data. TCP segments also have their own header information, with a minimum size of 20 bytes (the most common) and a maximum size of 60 bytes. Taken together, most TCP/IP datagrams have 40 bytes of header data (20 from IP and 20 from TCP), with the maximum amount of header data being limited to 120 bytes (60 bytes each from IP and TCP).
Whenever TCP creates a segment, it must leave room for these headers. Otherwise, the IP packet that was generated would exceed the MTU/MRU pair in use on that virtual circuit, resulting in fragmentation.
Although RFC 1122 states that TCP implementations must set aside 40 bytes for header data when a segment is created, this isn't always enough. For example, some of the newer advanced TCP options require an additional 10 or more bytes. If this overhead isn't taken into consideration, then fragmentation is likely to occur.
TCP can determine much of this information on its own, but not all of it. If the underlying IP stack uses IP options that TCP is not aware of, then TCP will not make room for them when segments are created, which will also likely result in fragmentation.
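The arithmetic behind these reservations is simple enough to state directly. The following helper is only a sketch of the calculation implied above, using the minimum 20-byte size for each header; any IP or TCP options in use must be added to the header figures, or the resulting segment will be too large:

    def max_payload(mtu, ip_header=20, tcp_header=20):
        # Largest TCP payload that fits in one unfragmented IP packet.
        # Options can grow either header to as much as 60 bytes,
        # shrinking the room left for application data.
        return mtu - ip_header - tcp_header

    assert max_payload(1500) == 1460   # Ethernet: 1500 - 40 bytes of headers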
For more information on IP header sizes, refer to "The IP Header" in Chapter 2, The Internet Protocol. For more information on TCP header sizes, refer to "The TCP Header" later in this chapter.
Data considerations

Remember that applications write data to TCP, which then stores the data into a local send buffer, generating a new segment whenever it is convenient or prudent to do so. Although segment sizes are typically calculated based on the available buffer space and the MTU/MRU values associated with a given virtual circuit, sometimes the nature of the data itself mandates that a segment be generated, even if that segment won't be the most efficient size.
For example, if an application writes only a little bit of data, then TCP will not be able to create a large segment, since there just isn't much data to send to the remote endpoint. However inefficient small segments may be, if there isn't a lot of data to send, TCP can't send large segments.
The decision process that TCP goes through to figure out when to send small amounts of data incorporates many different factors. If an application is able to tell TCP how much data is being written—and if TCP isn't busy doing other stuff—then TCP could choose to send the data immediately. Conversely, TCP could choose to just sit on the data, waiting for more data to arrive.
Sometimes, an application knows that it will be sending only a little bit of data, and can explicitly tell TCP to immediately send whatever data is being written. This service is provided through the use of a "push" service within TCP, allowing an application to tell TCP to go ahead and immediately send whatever data it gets.
The push service is required whenever an application needs to tell TCP that only a small amount of data is being written to the send buffer. This is most often seen with client applications such as POP3 or HTTP that send only a few bytes of data to a server, but it can also be seen from servers that write a lot of data. For example, if an HTTP server needed to send more data than would fit within a segment, the balance of the data would have to be sent in a separate (small) segment. Once the HTTP server got to the end of the data, it would tell TCP that it was finished and to go ahead and send the data without waiting for more. This step would be achieved by the application setting the Push flag during the final write.
Some applications cause the Push flag to be set quite frequently. For example, some TELNET clients will set the Push flag on every keystroke, causing the client to send each keystroke quickly, thereby causing the server to echo the text back to the user's display quickly.
Once TCP gets data that has been pushed, it stores the data in a regular TCP segment, but it also sets a Push flag within that segment's TCP header. This allows the remote endpoint to also see that the data is being pushed. This is an important service, since the Push flag also affects the receiving system's segment-handling process. Just as a sending TCP will wait for more data to arrive from an application before generating a segment, a receiving TCP will sometimes wait for more segments to arrive before passing the data to the destination application. But if a receiver gets a segment with the Push flag set, then it is supposed to go ahead and send the data to the application without waiting for any more segments to arrive.
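The Push flag itself is just one bit in the Control Flags byte of the TCP header (discussed later in this chapter). As a rough illustration, the following Python sketch packs a minimal, option-free 20-byte header with the Push and Acknowledgment flags set; the field layout follows RFC 793, but the function is purely illustrative (the checksum, for instance, is left at zero):

    import struct

    # Control flag bit values from RFC 793.
    FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20

    def tcp_header(src_port, dst_port, seq, ack, flags, window,
                   checksum=0, urgent=0):
        offset = 5 << 4    # 5 32-bit words (20 bytes), no options
        return struct.pack('!HHIIBBHHH', src_port, dst_port, seq, ack,
                           offset, flags, window, checksum, urgent)

    # A segment carrying "pushed" data sets PSH alongside ACK:
    hdr = tcp_header(40000, 110, seq=1, ack=1, flags=PSH | ACK, window=8192)
    assert len(hdr) == 20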
An interesting (but somewhat irrelevant) detail about the Push flag is that the practical usage is quite a bit different from the behavior defined in the standards. Although RFC 793 states that "a sending TCP is allowed to collect data ... until the push function is signaled, then it must send all unsent data," most TCP implementations do not allow applications to set the Push flag directly. Instead, most TCP implementations simply send data as they receive it (most of the time, applications write data to TCP in chunks rather than in continuous streams), and TCP will set the Push flag in the last segment that it sends. Some implementations will even set the Push flag on every segment that they send.
Similarly, many implementations ignore the Push flag on data they receive, immediately notifying the listening application of all new data, regardless of whether the Push flag is set on those segments.
Another interesting flag within the TCP header is the Urgent flag. The Urgent flag can be used by an application whenever it needs to send data that must be dealt with immediately. If an application requests that a segment be sent using the Urgent flag, then TCP is supposed to place that segment at the front of the send queue, sending it out as soon as possible. In addition, the recipient is supposed to read that segment ahead of any other segments that may be waiting to be processed in the receive buffer.
Urgent data is often seen with TELNET, which has some standardized elements that rely on the use of the TCP Urgent flag. Some of the standardized control characters used with TELNET (such as interrupt process and abort output) have specific behavioral requirements that benefit greatly from the "out-of-stream" processing that the Urgent flag provides. For example, if a user were to send an interrupt process signal to the remote host and flag this data for Urgent handling, then the control character would be passed to the front of the queue and acted upon immediately, allowing the output to be flushed faster than would otherwise happen.
However, the use of the Urgent flag has been plagued by incompatibility problems ever since RFC 793 was first published. The original wording of that document did not clarify where the urgent data should be placed in the segment, so some systems put it in one place while other systems put it in another. The wording was clarified in RFC 1122, which stated that the urgent pointer points to the last byte of the urgent data in the stream. Also of interest is the fact that the urgent pointer can refer to a byte location somewhere up ahead in the stream, in a future segment. All of the data up to and including the byte position specified by the urgent pointer is to be treated as part of the urgent block. Unfortunately, some systems (such as BSD and its derivatives) still do not follow this model, resulting in an ongoing set of interoperability problems with this flag in particular.
For more information on the Push and Urgent flags, refer to "Control Flags" later in this chapter.
Flow Control

When an application needs to send data to another application over TCP, it writes the data to the local TCP provider, which queues the data into a send buffer. Periodically, TCP will package portions of the data into segments and pass them off to IP for delivery to the destination system.
One of the key elements to this process is flow control, where a sending system will adjust the rate at which it tries to send data to the destination system. A change in rate may be required due to a variety of reasons, including the available buffer space on the destination system and the packet-handling characteristics of the network. For this reason, TCP incorporates a variety of flow control mechanisms, allowing the sending system to react to these changes easily.
Originally, RFC 793 proposed only a handful of flow control mechanisms, most of which were focused on the receiving end of the connection. Of these services, the two most important were:
Receive window sizing
TCP can send only as much data as a receiver will allow, based on the amount of space available in the remote system's receive buffer, the frequency at which the buffers are drained, and other related factors. Therefore, one way for a receiver to adjust the transfer rate is to increase or decrease the size of the buffer being advertised. This in turn controls how much data a sender can transmit at once.

Sliding receive windows

In addition to the Window size being advertised by a receiver, the concept of a "sliding window" allows the sender to transmit segments "on credit," before acknowledgments have arrived for segments that were already sent. This lets an endpoint send data even though the preceding data has not yet been acknowledged, trusting that an acknowledgment will arrive for that data shortly.

These mechanisms put the destination system in charge of controlling the rate at which the sender transmits data. As the original theory went, the receiver was likely to be the point of congestion in any transfer operation, and as such needed to have the last word on the rate at which data was being sent.
Over time, however, the need for sender-based flow control mechanisms has been proven, particularly since network outages may occur that require the sender to reduce its rate of transmission, even though the receiving system may be running smoothly. For this reason, RFC 1122 mandated that a variety of network-related flow control services also be implemented. Among these services are:
Congestion window sizing
In order to deal with congestion-related issues, the use of a "congestion window" is required at the sending system. The congestion window is similar in concept to the receive window in that it is expanded and contracted, although these actions are taken according to the underlying IP network's ability to handle the quantity of data being sent, rather than the recipient's ability to process the data.

Slow start

In an effort to keep congestion from occurring in the first place, a sender must first determine the capabilities of the IP network before it starts sending mass quantities of data over a newly established virtual circuit. This is the purpose of slow start, which works by setting the congestion window to a small size and gradually increasing it until the network's saturation point is found.

Congestion avoidance

Whenever network congestion is detected, the congestion window is reduced, and a technique called congestion avoidance is used to gradually rebuild the size of the congestion window, eventually returning it to its maximum size. When used in conjunction with slow start, this helps the sender determine the optimal transfer rate of a virtual circuit.

Taken together, the use of the receive and congestion windows gives a sending system a fairly complete view of the state of the network, including the state of both the recipient and the congestion on the network.
A note on local blocking

Although there are a variety of flow control mechanisms found with TCP, the simplest form of flow control is "local blocking," whereby a sending system refuses to accept data from a local application. This feature is needed whenever TCP knows that it cannot deliver any data to a specific destination system—perhaps due to problems with the receiver or the network—and the local send buffer is already full. Having nowhere to send the data, TCP must refuse to accept any new data from the sending application.
Note that TCP cannot block incoming network traffic (coming from IP). Since TCP is unable to tell which application a segment is destined for until its contents have been examined, TCP must accept every segment that it gets from IP. However, TCP may be unable to deliver the data to the destination application, due to a full queue or some other temporary condition. If this happens, TCP could choose to discard the segment, thereby causing the sender to retry the operation later (an effort which may or may not succeed).
Receive window size adjustments

In the section entitled "Network I/O Management," I first mentioned the TCP header's Window field, suggesting that it provided an insight into the size of the receive buffer in use on a destination system. Although this is an accurate assessment when looking at TCP's segment sizing process, the primary purpose of the Window field is to provide the receiving system with flow control management services. The Window field is used to tell a sender how much data a recipient can handle. In this model, the recipient dictates flow control.
According to RFC 793, the Window field "specifies the number of octets ... that the receiving TCP is currently prepared to receive." In this scenario, a sending system can transmit only as much data as will fit within the recipient's receive buffer (as specified by the Window field) before an acknowledgment is required. Once the sender has transmitted enough data to fill the receive buffer, it must stop sending data and wait for an acknowledgment from the recipient before sending any more data.
Therefore, one way to speed up and slow down the data transfer rate between the two endpoint systems is for the receiving system to change the buffer size being advertised in the Window field. If a system that had been advertising an eight-kilobyte window suddenly started advertising a 16-kilobyte window, the sender could pump twice as much data through the circuit before having to wait for an acknowledgment.
Conversely, if the recipient started advertising a four-kilobyte window, then the sender could transmit only half as much data before requiring an acknowledgment (this would be enforced by the sender's TCP stack, which would start blocking writes from the sending application when this occurred).
An important consideration here is that recipients are not allowed to arbitrarily reduce their window size, but instead are only supposed to shrink the advertised window when they have received data which has not yet been processed by the destination application. Arbitrarily reducing the size of the receive window can result in a situation where the sender has already sent a bunch of data in accordance with the window size that was last advertised. If the recipient were to suddenly reduce the window size, then some of the segments would probably get rejected, requiring the sender to retransmit the lost data.
Since the Window field is included in the header of every TCP segment, advertising a different buffer size is a very straightforward affair. If the recipient is willing to speed up or if it needs to slow down, it simply changes the value being advertised in the Window field of any acknowledgment segment that is being returned, and the sender will notice the change as soon as the segment containing the new value is received. Note that there may be some delay in this process, as it may take a while for that segment to arrive.
The size of the buffer also affects the number of segments that can be received, in that the maximum number of available segments is the Window size divided by the maximum segment size. Typically, systems will set their window size to four times the segment size (or larger), so if a system is using one kilobyte segments, then the smallest window size you would want to use on that system would be four kilobytes.
Unfortunately, since the Window field is only 16 bits long, the maximum size that can be advertised is 65,535 bytes. Although this is plenty of buffer space for most applications, there are times when it just isn't enough (such as when the MTU of the local network is also 64 kilobytes, resulting in a Window that is equal to only a single segment). One way around this limitation is the Window Scale option, as defined in RFC 1323. The Window Scale option allows two endpoints to negotiate 30-bit window sizes, allowing up to one gigabyte of buffer space to be advertised.
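The effect of the Window Scale option is a simple left-shift of the 16-bit field, as this small sketch shows (the shift count is capped at 14 by RFC 1323, which is where the one-gigabyte ceiling comes from):

    def effective_window(window_field, scale):
        # RFC 1323: real window = 16-bit Window field << scale, scale <= 14.
        return window_field << min(scale, 14)

    assert effective_window(0xFFFF, 0) == 65_535           # classic limit
    assert effective_window(0xFFFF, 14) == 1_073_725_440   # roughly 1 GB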
While it may seem best to use very large window sizes, it is not always feasible or economical to do so. Each segment that is sent must be kept in memory until it has been acknowledged. A hand-held system may not have sufficient resources to cache many segments, and thus would have to use small window sizes in order to limit the amount of data being sent.
In addition, there is a point at which the size of the receive window no longer has any effect on throughput, but instead the bandwidth and delay characteristics of the virtual circuit become the limiting factors. Setting a value larger than necessary is simply a waste of resources and can also result in slower recovery. For example, if a sender sees a large receive window being advertised then it might try to fill that window, even though a router in between the two endpoints may not be able to forward the data very quickly. This delay can result in a substantial queue building up in the router, and if a segment ever does get lost, then it will take a long time for the recipient to notice the problem and the sender to correct it. This would result in extremely long gaps between retransmissions, and may also result in some of the queued data getting discarded (requiring even more retransmissions).
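The point of diminishing returns mentioned above is the bandwidth-delay product of the circuit: the number of bytes that can usefully be "in flight" at once. A window larger than this figure cannot increase throughput; it only lets queues build up in intermediate routers. A back-of-the-envelope calculation:

    def bandwidth_delay_product(bits_per_second, rtt_seconds):
        # Bytes needed in flight to keep the path busy; anything more
        # is wasted buffer space (and potential router queueing).
        return int(bits_per_second / 8 * rtt_seconds)

    # A 1.5 Mb/s link with a 100 ms round-trip time needs only ~18 KB:
    assert bandwidth_delay_product(1_500_000, 0.100) == 18_750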
For more information on the Window field, refer to "Window." For more information on the TCP Window Scale option, refer to "Window Scale." For detailed instructions on how to calculate the optimal window size for a particular connection, refer to "Notes on Determining the Optimal Receive Window Size." All of these sections appear later in this chapter.
Sliding receive windows

Even though large window sizes can help to increase overall throughput, they do not by themselves provide for sustained levels of throughput. In particular, if a system had to use a synchronous "send-and-wait" design, sending data and then stopping to wait for an acknowledgment, the network would be quite jerky, with bursts of writes followed by long pauses. This problem is most noticeable on networks with high levels of latency, which cause extended periods of delay between the two endpoints.
In an effort to avoid this type of scenario, RFC 1122 states that a recipient should issue an acknowledgment for every two segments that it receives, if not more often. This design causes the receiver to issue acknowledgments quickly, and those acknowledgments in turn arrive back at the sender quickly.
Once an acknowledgment has arrived back at the sending system, the outstanding data is cleared from the send queue, thereby letting the sender transmit more data. In effect, the sending system can 뱒lide?the window over by the number of segments that have been successfully acknowledged, allowing it to transmit more data, even though not all of the segments have been acknowledged yet.
As long as a sender continues receiving acknowledgments, it is able to continue sending data, with the maximum amount of outstanding segments being determined by the size of the recipient's receive buffer. This concept is illustrated in Figure 7-10, which shows how a sender can increment the sliding window whenever it receives an acknowledgment for previously sent data. For example, as the sender is transmitting segment number three, it receives an acknowledgment for segment number one, allowing the sender to move the send buffer forward by one segment.
The key element here is that the sender can transfer only as many bytes of data as the receiver can handle, as advertised in the Window field of the TCP headers sent by the recipient. If the recipient's receive window is set to eight kilobytes and the sender transmits eight one-kilobyte segments without having received an acknowledgment, then it must stop and wait for an acknowledgment before sending any more data.
However, if the sender receives an acknowledgment for the first two segments after having sent eight of them, then it can go ahead and send two more, since the window allows up to eight kilobytes to be in transit at any time. On networks with low levels of latency (such as Ethernet), this feature can have a dramatic impact on
overall performance, providing for sustained levels of high utilization. On networks with very high levels of latency (such as those that use satellite links), the effect is less pronounced, although it is still better than the send-and-wait effect that would otherwise be felt.
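The sender-side bookkeeping behind a sliding window amounts to tracking the oldest unacknowledged byte and the next byte to send against the advertised window. The toy class below is only a sketch of that accounting (the variable names follow the conventional snd.una/snd.nxt naming, not any particular implementation):

    class SlidingSendWindow:
        def __init__(self, advertised_window):
            self.snd_una = 0                  # oldest unacknowledged byte
            self.snd_nxt = 0                  # next byte to be sent
            self.window = advertised_window

        def can_send(self):
            # Bytes of "credit" left before we must stop and wait.
            return self.window - (self.snd_nxt - self.snd_una)

        def send(self, nbytes):
            assert nbytes <= self.can_send()
            self.snd_nxt += nbytes

        def acknowledge(self, ack_number):
            # An arriving ACK slides the window forward, freeing credit.
            self.snd_una = max(self.snd_una, ack_number)

    w = SlidingSendWindow(8192)
    w.send(8192)                  # eight 1 KB segments: window is full
    w.acknowledge(2048)           # ACK for the first two segments...
    assert w.can_send() == 2048   # ...lets two more be sent immediately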
In situations where the window size is smaller than the MTU, a sliding window is harder to implement. Some systems will write only a single segment (up to the maximum allowed by the advertised receive buffer), and then stop to wait for an acknowledgment. Other systems will reduce the size of the local send buffers to half (or less) of the advertised receive window, thereby forcing multiple small segments to be written in an effort to increase the number of acknowledgments that are generated.
Another problem can occur with some TCP implementations that do not issue acknowledgments for every two segments that are received, but instead issue acknowledgments when they have received enough data to fill two maximum-sized segments. For example, if the system has a local MTU of 1500 bytes, but is receiving data in 500-byte chunks, then such a system would only issue acknowledgments for every six segments that arrive (6 × 500 = 3000, which is MTU times two). This process would result in a substantially slower acknowledgment cycle that could cause problems if the sender had a small send window. Although this problem is somewhat rare, it does happen.
For systems that implement this procedure correctly (sending acknowledgments for every two segments that are received, regardless of the maximum segment size), this design can substantially improve overall performance. By using the sliding window technique—and by using large windows—it is quite possible for two fast systems on a fast network to practically saturate the connection with data.
For more information on how frequent acknowledgments can impact performance, refer ahead to "Delayed acknowledgments."
The Silly Window Syndrome

The amount of buffer space that a system advertises depends on how much buffer space it has available at that given moment, which is dependent upon how quickly applications can pull data out of the receive buffer. This in turn is driven by many factors, such as the complexity of the application in use, the amount of CPU time available, the design of the TCP stack in use, and other elements.
Unfortunately, many of the first-generation TCP-based applications did a very poor job of cleaning out the receive buffers, taking only a few bytes at a time. With only a few bytes of buffer space freed, the system would advertise a receive window of only a few bytes. In turn, the sender would transmit only a very small segment, since that was all that was being advertised by the recipient. This process would repeat incessantly, with the recipient taking another few bytes out of the receive queue, advertising a small window, and then receiving yet another very small segment.
To prevent this scenario (known affectionately as the "Silly Window Syndrome"), RFC 1122 clarified the amount of buffer space that could be advertised, stating that systems can advertise a non-zero window only if the amount of buffer space available could hold a complete segment (as defined by the value shown in the MSS option), or if the buffer space is at least half of the "normal" window size. If neither of these conditions is met, then the receiver should advertise a zero-length window, effectively forcing the sender to stop transmitting.
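The receiver-side rule reduces to a short test, sketched here under the assumption that the "normal" window is the full receive buffer size:

    def advertised_window(free_buffer, mss, normal_window):
        # RFC 1122 Silly Window Syndrome avoidance: advertise space only
        # if a full segment (or half the normal window) would fit;
        # otherwise advertise zero and stall the sender.
        if free_buffer >= mss or free_buffer >= normal_window // 2:
            return free_buffer
        return 0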
The Nagle algorithm

The Silly Window Syndrome is indicative of a problem at the receiver's end of the virtual circuit. Data is not being read from the receive buffers quickly, resulting in small window sizes being advertised, which in turn causes the sender to transmit small segments. The result is that lots of network traffic gets generated for very small amounts of data.
However, a sending system can also cause these kinds of problems, although for totally different reasons. Some applications (such as TELNET) are designed to send many small segments in a constant barrage, which causes high levels of network utilization for small amounts of data. The problem also arises with applications that write data only in small chunks, such as writing 10 megabytes of data in 512-byte blocks. The number of packets generated in that model is extremely wasteful of bandwidth, particularly when the same transfer could be done using larger writes.
One solution proposed to this kind of problem is the Nagle algorithm, which was originally described in RFC 896. Simply put, the Nagle algorithm suggests that segments that are smaller than the maximum size allowed (as defined by the MSS option of the recipient or the discovered MTU of the end-to-end path) should be delayed until all prior segments have been acknowledged or until a full-sized segment can be sent. This rule forces TCP stacks to merge multiple small writes into a single write, which is then sent as a single segment.
On a low-latency LAN, the Nagle algorithm rarely comes into play, since a small segment will be sent and acknowledged very quickly, allowing another small segment to be sent immediately (effectively eliminating the use of the Nagle algorithm). On slow WAN links though, the Nagle algorithm comes into play quite often, since acknowledgments take a long time to be returned to the sender. This results in the next batch of small segments getting bundled together, providing a substantial increase in overall network efficiency.
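The sender-side decision rule is equally compact. The sketch below simply restates the algorithm as described above; nothing about it is implementation-specific:

    def nagle_should_send(data_len, mss, unacked_bytes, nagle_enabled=True):
        # Send now, or hold the data back to coalesce small writes?
        if data_len >= mss:
            return True               # a full segment is always worth sending
        if not nagle_enabled:
            return True               # Nagle disabled: send immediately
        return unacked_bytes == 0     # small write: send only if nothing is
                                      # outstanding; otherwise keep coalescing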
For these reasons, use of the Nagle algorithm is encouraged by RFC 1122, although its usage is not mandatory. Some applications (such as X Windows) react poorly when small segments are clumped together. In those cases, users must have the option of disabling the Nagle algorithm on a per-circuit basis. However, most TCP implementations do not provide this capability, instead allowing users to enable or disable its use only on a global scale, or leaving it up to the application developer to decide when it is needed.
This limitation can be somewhat of a problem, since some developers have written programs that generate inefficient segment sizes frequently, and have then gone and disabled the use of the Nagle algorithm on those connections in an effort to improve performance, even though doing so results in much higher levels of network utilization (and doesn't do much to improve performance in the end). If those developers had just written their applications to use large writes instead of multiple small writes, then the Nagle algorithm would never come into effect, and the applications would perform better anyway.
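On systems that do expose a per-circuit control, it is usually the TCP_NODELAY socket option, as in this brief Python example:

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable the Nagle algorithm for this connection only.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)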
Another interesting side effect that appears when the Nagle algorithm is disabled is that the delayed acknowledgment mechanism (as described later in "Delayed acknowledgments") does not tend to work well when small segments are being generated, since it waits for two full-sized segments to arrive before returning an acknowledgment for those segments. If it does not receive full-sized segments because a developer has turned off the Nagle algorithm, then the delayed acknowledgment mechanism will not kick in until a timer expires or until data is being returned to the sender (which the acknowledgments can piggyback onto).
This can be a particular problem when just a little bit of data needs to be sent. The sender will transmit the data, but the recipient will not acknowledge it until the timer expires, resulting in a very jerky session.
This situation can also happen when a small amount of data is being generated at the tail-end of a bulk transfer. However, the chances are good that in this situation the remote endpoint is going to generate some sort of data (such as a confirmation status code or a circuit-shutdown request). In that case, the delayed acknowledgment will piggyback onto whatever data is being returned, and the user will not notice any excess delays.
For all of these reasons, application developers are encouraged to write data in large, even multiples of the most-efficient segment size for any given connection, whenever that information is available. For example, if a virtual circuit has a maximum segment size of 1460 bytes (the norm for Ethernet), the application should write data in even multiples of 1460 (such as 2,920 byte blocks, or 5,840 byte blocks, and so forth). This way, TCP will generate an even number of efficiently sized segments, resulting in the Nagle algorithm never causing any delay whatsoever, and also preventing the delayed acknowledgment mechanism from holding up any acknowledgments.
Congestion window sizing

TCP's use of variable-length, sliding windows provides good flow control services to the receiving end of a virtual circuit. If the receiver starts having problems, it can slow down the rate at which data is being sent simply by scaling back the amount of buffer space being advertised. But if things are going well, the window can be scaled up, and traffic can flow as fast as the network will allow.
Sometimes, however, the network itself is the bottleneck. Remember that TCP segments are transmitted within IP packets, and that these packets can have their own problems outside of the virtual circuit. In particular, a forwarding device in between the two endpoints could be suffering from congestion problems, whereby it was receiving more data than it could forward, as is common with dial-up servers and application gateways.
When this occurs, the TCP segments will not arrive at their destination in a timely manner (if they make it there at all). In this scenario, the receiving system (and the virtual circuit) may be operating just fine, but problems with the underlying IP network are preventing segments from reaching their destination.
This problem is illustrated in Figure 7-11, which shows a device trying to send data to the remote endpoint, although another device on the network path is suffering from congestion problems, and has sent an ICMP Source Quench error message back to the sender, asking it to slow down the rate of data transfer.
Congestion problems can be recognized by the presence of an ICMP Source Quench error message, or by the recipient sending a series of duplicate acknowledgments (suggesting that a segment has been lost), or by the sender's acknowledgment timer reaching zero. When any of these problems occur, the sender must recognize them as being congestion-related, and take counter-measures that deal with them appropriately. Otherwise, if a sender were to simply retransmit segments that were lost due to congestion, the result would be even more congestion. Orderly congestion recovery is therefore required in order for TCP to maintain high performance levels, but without causing more congestion to occur.
At the heart of the congestion management process is a secondary variable called the "congestion window" that resides on the sender's system. Like the receive window, the congestion window dictates how much data a sender can transmit without stopping to wait for an acknowledgment, although rather than being set by the receiver, the congestion window is set by the sender, according to the congestion characteristics of the IP network.
During normal operation, the congestion window is the same size as the receive window. Thus, the maximum transfer rate of a smooth-flowing network is still restricted by the amount of data that a receiver can handle. If congestion-related problems occur, however, then the size of the congestion window is reduced, thereby making the limiting factor the sender's capability to transmit, rather than the receiver's capability to read.
How aggressively the congestion window is reduced depends upon the event that triggered the resizing action:
- If congestion is detected by the presence of a series of duplicate acknowledgments, then the size of the congestion window is cut in half, severely restricting the sender's ability to transmit segments. TCP then utilizes a technique known as "congestion avoidance" to slowly increment the size of the congestion window, cautiously ramping up the rate at which it can send data, until it returns to the full-throttle state.
- If congestion is detected by the TCP acknowledgment timer reaching zero or by the presence of an ICMP Source Quench error message, then the congestion window is shrunk so small that only one segment can be sent. TCP uses a technique known as "slow start" to begin incrementing the size of the congestion window until it is half of its original size, at which point the congestion avoidance technique is called into action to complete the ramp-up process.
Slow start and congestion avoidance are similar in their recovery techniques. However, they are also somewhat different and are used at different times. Slow start is used on every new connection—even those that haven't yet experienced any congestion—and whenever the congestion window has been dropped to just one segment. Conversely, congestion avoidance is used both to recover from non-fatal congestion-related events and to slow down the rate at which the congestion window is being expanded, allowing for smoother, more sensitive recovery procedures.
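The two reactions can be summarized in a short dispatch routine. This sketch uses the conventional "ssthresh" name for the point where slow start hands off to congestion avoidance; the event names are placeholders for illustration, not part of any real API:

    def on_congestion_event(cwnd, mss, event):
        # Returns (new_cwnd, ssthresh). Growth is exponential (slow start)
        # while cwnd < ssthresh, and linear (congestion avoidance) above it.
        ssthresh = max(cwnd // 2, 2 * mss)         # half the old window
        if event == 'duplicate_acks':              # mild congestion: halve
            return ssthresh, ssthresh
        if event in ('timeout', 'source_quench'):  # severe: one segment
            return mss, ssthresh
        return cwnd, ssthresh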
Slow start

One of the most common problems related to congestion is that senders attempt to transmit data as fast as they can, as soon as they can. When a user asks for a big file, the server gleefully tries to send it at full speed immediately.
While this might seem like it would help to complete the transfer quickly, in reality it tends to cause problems. If there are any bottlenecks between the sender and receiver, then this burst-mode form of delivery will find them very quickly, causing congestion problems immediately (most likely resulting in a dropped segment). The user may experience a sudden burst of data, followed by a sudden stop as their system attempts to recover one or more lost segments, followed by another sudden burst. Slow start is the technique used to avoid this particular scenario.
In addition, slow start is used to recover from near-fatal congestion errors, where the congestion window has been reset to one segment, due to an acknowledgment timer reaching zero, or from an ICMP Source Quench error message being received.
Slow start works by exponentially increasing the size of the congestion window. Every time a segment is sent and acknowledged, the size of the congestion window is increased by one segment's worth of data (as determined by the discovered MTU/MRU sizes of the virtual circuit), allowing for more and more data to be sent.
For example, if the congestion window is set to one segment (with the segment size being set to whatever value was determined during the setup process), a single segment will be transmitted. If this segment is acknowledged, then the congestion window is incremented by one, now allowing two segments to be transmitted simultaneously. The next two segments in the send buffer then get sent. If they are both acknowledged, they will each cause the congestion window to be incremented by one again, thus adding room for two more segments (with the congestion window now being set to four segments total). Note that not all of the outstanding segments have to be acknowledged before the congestion window is incremented, as shown in Figure 7-12.
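In code form, the per-acknowledgment growth rule is a one-liner; since every acknowledged segment adds one segment's worth of window, the window roughly doubles each round trip (1, 2, 4, 8, and so on). The ceiling here is the receiver's advertised window on a new connection, or the recovery threshold described below:

    def slow_start_on_ack(cwnd, mss, ceiling):
        # Each acknowledged segment grows the congestion window by one
        # full segment, producing exponential growth per round trip.
        return min(cwnd + mss, ceiling)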
If a connection is new, then the process is repeated until congestion is detected or the size of the congestion window is equal to the size of the receive window, as advertised by the receiver's Window field. If the ramping process is successful, then the virtual circuit will eventually be running at full speed, with the flow control being dictated by the size of the recipient's receive buffer. But if congestion is detected during the incrementing process, the congestion window will be locked to the last successful size. Any further congestion problems will result in the congestion window being reduced (as per the process described earlier in "Congestion window sizing").
However, if the slow start routines are being used to recover from a congestion event, then the slow start procedure is used only until the congestion window reaches half of its original size. At this point, the congestion avoidance technique is called upon to continue increasing the size of the congestion window (as described soon in "Congestion avoidance"). Since congestion is likely to occur again very quickly, TCP takes the more cautious, linear-growth approach outlined with congestion avoidance, as opposed to the ambitious, exponential growth provided with slow start.
Note that although RFC 1122 mandates the use of slow start with TCP, the procedure was not fully documented until RFC 2001 was published. Therefore, many of the earlier systems do not incorporate the slow start routines described here.
In addition, RFC 2414 advocates the use of four segments as the seed value for slow start, rather than the one segment proposed in RFC 2581 (TCP Congestion Control), which is arguably an improvement for applications that send more than one segment. For example, if an application needed to send two segments of data but the initial congestion window was locked at "one segment," then the application could send only one of those segments. As such, the remote endpoint would not receive all of the application data, and the delayed acknowledgment mechanisms would force a long pause before an acknowledgment was returned. But by setting the initial congestion window to "two segments," the sender can issue two full-sized segments, which will result in the recipient issuing an acknowledgment immediately. Although this allows connections to ramp up faster, note that RFC 2414 is only experimental and is not required to be implemented in any shipping TCP implementation.
Congestion avoidance

The congestion avoidance routines are called whenever a system needs to use a slower, more cautious form of window growth than the exponential mechanism offered by the slow start procedure. This may be required when a system detects congestion from the presence of multiple duplicate acknowledgments, or as part of the recovery mechanisms that are utilized when an acknowledgment timer reaches zero.
Although duplicate acknowledgments are not uncommon (and are allowed for by TCP's error-recovery mechanisms), the presence of many such acknowledgments tends to indicate that an IP datagram has been lost somewhere, most likely due to congestion occurring on the network. As such, RFC 2581 states that if three or more duplicate acknowledgments are received, then the size of the congestion window should be cut in half, and the congestion avoidance technique is to be used in an effort to return the network to full throttle.
Another scenario where congestion avoidance is used is when the sender's acknowledgment timer has expired, meaning that no acknowledgments are coming back from the other end. This signifies that there are serious congestion problems, or that the other system has left the network. In an effort to recover from this event, the congestion window is shrunk so small that only one segment can be sent. The slow start mechanism is then called upon and used until the congestion window is half of its original size. Congestion avoidance is then used to return the network to full speed, albeit at a slower, more cautious rate.
Congestion avoidance is very similar to slow start in that the size of the congestion window is expanded whenever acknowledgments arrive for segments that have been sent. However, rather than incrementing the congestion window on a one-for-one basis (as is done with slow start), the congestion window is incremented by only one segment when all of the segments sent within a single window are acknowledged.
For example, assume that a system's congestion window is set to allow four segments, although the recipient's receive window is advertising a maximum capacity of eight segments. Using congestion avoidance, a system would send four segments and then wait for all of them to be acknowledged before incrementing the size of the congestion window by one (now being set to "five segments").
If this effort was a success, then the next five segments would be sent, and if all of them were acknowledged, then the congestion window would be increased to six. This process would continue until either congestion occurred again or the congestion window equaled the size of the receive window being advertised by the recipient ("eight segments" in this example).
Note that it doesn't matter whether the remote system sends back a single acknowledgment for all of the segments previously sent, or individual acknowledgments for each of the segments. With congestion avoidance, all of the segments must be acknowledged before the size of the congestion window will be incremented.
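Expressed the same way as the slow start rule earlier, congestion avoidance grows the window by one segment per fully acknowledged window, rather than one segment per acknowledged segment, turning exponential growth into linear growth:

    def congestion_avoidance_on_window_acked(cwnd, mss, receive_window):
        # Called only once every segment in the current window has been
        # acknowledged; the window then grows by a single segment.
        return min(cwnd + mss, receive_window)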
Also note that although RFC 1122 mandates the use of congestion avoidance with TCP, the procedure was not fully documented until RFC 2001 was published. Therefore, many of the earlier systems do not incorporate the congestion avoidance routines described here.
Reliability

The most often touted TCP service is reliability, with TCP's virtual circuit design practically guaranteeing that data will get delivered intact. Using this design, TCP will do everything it can to get data to the proper destination, if at all possible. If this is not possible—perhaps due to a failure in the network or some other
catastrophic event—then naturally TCP won't be able to deliver the data. However, as long as the network and hosts are operational, TCP will make sure that the data is delivered intact.
TCP's reliability service takes many forms, employing many different technologies and techniques. Indeed, RFC 793 states that TCP must be able to "recover from data that is damaged, lost, duplicated, or delivered out of order." This is a broad range of service, and as such TCP's reliability mechanisms tend to be somewhat complex.
The most basic form of reliability comes from the use of checksums. TCP checksums are used to validate segments (including the TCP headers and any associated data). Furthermore, checksums are mandatory with TCP (as opposed to being optional as they are with UDP), requiring that the sender compute them, and that the recipient compare them to segments received. This provides a simple validation mechanism that lets a receiver test for corrupt data before handing the data off to the destination application.
Although checksums are useful for validating data, they aren't of any use if they never arrive. Therefore, TCP also has to provide delivery services that will ensure that data arrives in the first place. This service is provided by TCP's use of sequence numbers and acknowledgments, both of which work together to make TCP a reliable transport. Once a segment has been sent, the sender must wait for an acknowledgment to be returned stating that all of the data has been successfully received. If a segment is not acknowledged within a certain amount of time, the sender will eventually try to send it again. This design allows TCP to recover from segments that get lost in transit.
Furthermore, the use of unique sequence numbers allows a receiver to reorder any segments that may have come in out of sequence. Since IP is unpredictable, it is entirely possible that some datagrams will be routed over a slower link than the rest, causing some of them to arrive in a different order than they were sent. The receiving TCP system can use the sequence numbers to reorder segments into their correct sequence—as well as eliminate any duplicates—before passing the data off to the destination application.
Taken together, these services make TCP an extremely reliable transport protocol, which is why it is the transport of choice for most Internet applications.
In summary, the key elements of TCP's reliability service are:
Checksums
TCP uses checksums for every segment that is sent, allowing the destination system to verify that the data within the segment is valid.

Sequence numbers

Every byte of data that gets sent across a virtual circuit is assigned a sequence number. These sequence numbers allow the sender and receiver to refer to a range of data explicitly, and also allow the recipient to reorder segments that come in out of order, as well as eliminate any duplicates.

Acknowledgments

Every byte of data sent across a virtual circuit must be acknowledged. This task is achieved through the use of an acknowledgment number, which is used to state that a receiver has received all of the data within a segment (as opposed to receiving the segment itself), and is ready for more data.

Timers

Since TCP uses IP for delivery, some segments can get lost or corrupted on their way to the destination. When this happens, no acknowledgment will be received by the sender, requiring a retransmission of the questionable data. In order to detect this error, TCP incorporates an acknowledgment timer, allowing the sender to retransmit lost data that does not get acknowledged.

In practice, these mechanisms are tightly interwoven, with each of them relying on the others in order to provide a totally reliable implementation. They are discussed in detail in the following sections.
TCP checksums

TCP checksums are identical to UDP checksums, with the exception that checksums are mandatory with TCP (instead of being optional, as they are with UDP). Furthermore, their usage is mandatory for both the sending and receiving systems. RFC 1122 clearly states that the receiver must validate every segment received, using the checksum to verify that the contents of the segment are correct before delivering it to the destination application.
Checksums provide a valuable service in that they verify that data has not been corrupted in transit. All of the other reliability services provided by TCP—the sequence numbers, acknowledgments, and timers—serve only to ensure that segments arrive at their destination; checksums make sure the data inside the segments arrives intact.
Checksums are calculated by performing ones-complement math against the header and data of the TCP segment. Also included in this calculation is a "pseudo-header" that contains the source and destination IP addresses, the Protocol Identifier (6 for TCP), and the size of the TCP segment (including the TCP headers and data). By including the pseudo-header in the calculations, the destination system is able to validate that the sender and receiver information is also correct, in case the IP datagram that delivered the TCP segment got mixed up on the way to its final destination.
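The calculation itself is the standard 16-bit ones-complement sum. The sketch below builds the pseudo-header from the two IP addresses (as 4-byte strings), a zero pad byte, the Protocol Identifier of 6, and the segment length, then folds the sum as described:

    import struct

    def ones_complement_checksum(data: bytes) -> int:
        if len(data) % 2:
            data += b'\x00'                     # pad odd-length input
        total = sum(struct.unpack(f'!{len(data) // 2}H', data))
        while total > 0xFFFF:                   # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
        # Pseudo-header: source IP, destination IP, zero, protocol 6,
        # and the length of the TCP segment (header plus data).
        pseudo = src_ip + dst_ip + struct.pack('!BBH', 0, 6, len(segment))
        return ones_complement_checksum(pseudo + segment)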
TCP must validate the checksum before issuing an acknowledgment for the segment. If a segment is received with an invalid checksum, then the segment must be discarded. Discarding the segment is a "silent" event, with no notification of the failure being generated or sent.
This is required behavior, since the recipient has no way of determining which circuit the segment belongs to if the checksum is deemed invalid (the header could be the corrupt part of the segment). In such a situation, an error message could be sent to the wrong source, thereby causing additional (and unrelated) problems to ensue. Instead, the segment is thrown away, and the original sending system would eventually notice that the data was not successfully received (due to the acknowledgment timer expiring), and the segment would eventually be reissued.
Since each virtual circuit consists of a pair of sockets, the receiver has to know the IP address of the sender in order to deliver the data to the correct destination application. If there are multiple connections to port 80 on the local server (as would be found with an HTTP server), TCP has to know which system sent the data in order to deliver it to the right instance of the local server. Although this information is available from IP, TCP verifies the information using the checksum's pseudo-header.
Note that RFC 1146 introduced a TCP option for alternative checksum mechanisms. However, the Alternative Checksum option was classified as experimental, and RFC 1146 has since expired. Therefore, the Alternative Checksum option should not be used with any production TCP implementations.
Sequence numbers

A key part of TCP's reliability service is the use of sequence numbers and acknowledgments, allowing the sender and receiver to constantly inform each other of the data that has been sent and received. These two mechanisms work hand-in-glove to ensure that data arrives at the destination system.
RFC 793 states that "each [byte] of data is assigned a sequence number." The sequence number for the first byte of data within a particular segment is then published in the Sequence Identifier field of the TCP header. Thus, when a segment is sent, the Sequence Identifier field shows the starting byte number for the data within that particular segment. Note that sequence numbers do not refer to segments, but instead refer to the starting byte of a segment's data block.
Once a segment is received, the data is verified by the recipient (using the checksum), and if it's okay, then the recipient will send an acknowledgment back to the sender. The acknowledgment is also contained within a TCP segment, with the
Acknowledgment Identifier field in the TCP header pointing to the next sequence number that the recipient is willing to accept. The acknowledgment effectively says "I received all of the data up to this point and am ready for the next byte of data, starting at sequence number n."
When the acknowledgment arrives, the sender knows that the receiver has successfully received all of the data contained within the segment, and the sender is then able to transmit more data (up to the maximum amount of data that will fit within either the receiver's current receive window or the sender's current congestion window). This process is illustrated in Figure 7-13. In that example, the sender has identified the first byte of data being sent as 1, while the acknowledgment for that segment points to the first byte of data from the next segment that the receiver expects to get (101).
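The arithmetic in Figure 7-13 is worth restating: the acknowledgment for a data segment is simply its sequence number plus its payload length, taken modulo the 32-bit sequence space:

    def expected_ack(sequence_number, data_len):
        # The ACK points at the *next* byte the receiver wants, not at
        # any byte it has already received.
        return (sequence_number + data_len) % 2**32

    assert expected_ack(1, 100) == 101   # the exchange shown in Figure 7-13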
In practice, the first sequence number on a circuit is rarely 1. Sequence numbers are 32-bit integers, with a possible range in values from zero through 4,294,967,295. RFC 1122 states that systems must seed the sequence number value on all new circuits using a value derived from the local system's clock. Therefore, the first byte of data being sent across a virtual circuit should not be numbered 1, but instead should be numbered according to a value derived from the current time. Some systems violate this principle, starting at 1 even though they're not supposed to.
The main reason for seeding the sequence number with the system clock is for safety. If sequence numbers always start at a fixed integer (like 1), there is an increased opportunity for overlapping to occur, particularly on circuits that are opened and closed rapidly. For example, if two systems used the same port numbers for multiple connections, and a segment from the first connection got lost in transit, that segment may arrive at the destination during the next connection, thereby appearing to be a valid sequence number. For this reason, all TCP
implementations should always seed the sequence number for all new connections using a value derived from the local system's clock.
In addition, RFC 1948 discussed how this information could be used to launch a variety of attacks against a system, and that using predictable sequence numbers was not only a technical problem but a security risk as well. Essentially, predictable sequence numbers also mean that acknowledgment numbers can be predicted. Given that information, it is easy for a remote hacker to send fake packets to your servers, providing valid IP addresses and acknowledgment numbers. This loophole lets the bad guy compromise your systems without ever seeing a single packet. Unfortunately, some systems still use highly predictable sequence numbers today, and this problem has not gone away entirely.
Another concern with sequence numbers is that they can wrap around during long data transfers. Although there are more than four billion possible sequence numbers, this is not an infinite amount, so reusing sequence numbers will certainly happen on some circuits, particularly those that are kept open for extended periods of time. For example, if a 10-gigabyte file was transferred between two hosts, then the sequence numbers used on that virtual circuit would have to wrap around twice, with some (if not many) of the sequence numbers getting reused at some point. When this occurs, a segment that got lost or redirected in transit could show up late and appear to be a valid segment.
In order to keep reused sequence numbers from causing these kinds of problems, the recipient must limit the active sequence numbers to a size that will fit within the local receive buffer. Since the receive buffer limits the amount of data (in bytes) that can be outstanding and unacknowledged at any given time, a recipient system can simply ignore any segments with sequence numbers that are outside the boundaries of the current window range.
For example, if a recipient has an eight-kilobyte receive buffer, it can set an eight-kilobyte limit on the data that it receives. If it is currently processing segments with sequence numbers around 100,000, then it can safely ignore any segments that arrive with sequence numbers less than 92,000 or greater than 108,000.
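One common form of that acceptance test, sketched below, measures the distance from the next expected byte using modulo arithmetic so that the 32-bit wraparound is handled transparently (the example above describes a looser, symmetric range, but the idea is the same):

    def in_receive_window(seq, rcv_nxt, rcv_wnd):
        # Accept only sequence numbers inside [rcv_nxt, rcv_nxt + rcv_wnd),
        # computed modulo 2**32 so wrapped numbers compare correctly.
        return (seq - rcv_nxt) % 2**32 < rcv_wnd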
In addition, IP's built-in Time-To-Live mechanism also helps to keep older segments from showing up unexpectedly and wreaking havoc. If an IP datagram has a medium-sized Time-To-Live value, then the datagram may be destroyed before it ever reaches the destination. However, most TCP implementations set the Time-To-Live value at 60 seconds (a value recommended in RFC 1122). Since this value tends to be greater than the acknowledgment timer, it is quite possible that a sender will reissue a segment that has not been acknowledged, and that the old datagram will show up unexpectedly. Since the two segments would have the same sequence number, the recipient should be able to detect that they are duplicates, and simply discard the duplicate segment.
Another way to deal with this problem is to use the TCP Timestamp option to identify when a particular segment was sent. On extremely fast networks (such as those using Gigabit Ethernet), it takes only 17 seconds to completely cycle through all four billion sequence numbers. Since the Time-To-Live value on most IP datagrams is substantially larger than this, there is a high probability of an old datagram showing up with a recently used sequence number. RFC 1323 provides a solution to this problem, using the TCP Timestamp option as a secondary reference for each unique segment. When used together, these two mechanisms keep old segments from wreaking havoc when sequence numbers are being reused.
Since the Sequence Identifier field is a standard part of the TCP header, every segment that is sent must have a Sequence Identifier field, even if it is a segment that doesn't contain any data (such as acknowledgments). There's an obvious problem here: it does not contain any data, so what should it use for the Sequence Identifier? After all, the sequence number is supposed to refer to the first byte of data.
If a segment does not contain any data, then the next byte of data expected to be sent is used in the Sequence Identifier field of the TCP header. This sequence number would continue to be used until some data was actually sent, forcing the sequence number to be incremented.
Figure 7-14 illustrates how zero-length segments reuse sequence numbers. As the sender pumps data down to the recipient, the latter has to periodically acknowledge the data that it has received. These acknowledgments are sent as individual segments, with each segment having a Sequence Identifier in the TCP header. Since the client isn't sending any data, it is using the next byte it expects to send as the sequence number. This sequence number will be reused until data actually does get sent, at which point the client's sequence number will be incremented.
One drawback of this approach is that these acknowledgment segments are nonrecoverable if they get lost or become corrupted. Since these segments all carry the same sequence number, there is no way for the other end of the connection to identify one of them uniquely. The remote endpoint cannot ask for sequence number n to be resent, because there are many segments with that sequence number. Moreover, sequence number n has not actually been sent yet, since the Sequence Identifier field refers to the next byte of data expected to be sent.
However, this does not mean that the connection will collapse if a zero-length segment is lost. Since zero-length segments typically contain acknowledgments, if one of them is lost then the acknowledgment is lost as well. But if the sender has sent more data beyond that segment, then the recipient will likely return an acknowledgment for a higher-numbered sequence anyway, obviating the need for that particular acknowledgment to be resent. If the sender is not sending any more data, then it will eventually notice the missing acknowledgment and resend the questionable data. This action should result in the recipient re-issuing the lost acknowledgment, providing full recovery.
Some of the other zero-length command segments that get used—such as the command segments used to open and close virtual circuits—have their own special sequence number considerations. For example, "start" segments that use the Synchronize bit use a sequence number that is one lower than the sequence numbers used for data, while "close" segments that use the Finish or Reset bits use sequence numbers that are one greater than the sequence numbers used for data. By using sequence numbers outside the range of the sequence numbers used for data, these particular segments will not interfere with actual data delivery, and can be tracked individually if necessary.
For more information on the Synchronize, Finish, and Reset flags, refer to "Control Flags" later in this chapter.
Acknowledgment numbers

Acknowledgment numbers and sequence numbers are closely tied, working together to make sure that segments arrive at their destination.
Just as sequence numbers are used to identify the individual bytes of data being sent in a segment, acknowledgment numbers are used to verify that all of the data in that segment was successfully received. However, rather than pointing to the first byte of data in a segment that has just arrived, the acknowledgment number points to the next byte of data that a recipient expects to receive in the next segment.
This process is illustrated earlier in Figure 7-13. In that example, the sender transmits 100 bytes of data, using a sequence number of 1 to identify the first byte of data in the segment. The receiver returns an acknowledgment for the segment, indicating that it's ready to accept the next segment (starting at sequence number 101). Notice that the acknowledgment does not point to bytes 1 or 100, but instead points to 101, the next byte that the receiver expects to get.
This design is commonly referred to as being "cumulatively implicit," indicating that all of the data up to (but not including) the acknowledgment number has been received successfully, rather than explicitly acknowledging that a particular byte has been received. Implicit acknowledgments work well when data is flowing smoothly, as a receiver can continually request more data. However, when things go bad, implicit acknowledgments are not very robust. If a segment gets lost or corrupted, then the recipient has no way of informing the sender of the specific problem. Instead, it must re-request the next expected byte of data, since that's all the cumulatively implicit acknowledgment scheme allows for.
Remember that the sliding window mechanism allows a sending system to transmit as many segments as can fit within the recipient's advertised receive buffer. If a system is advertising an eight-kilobyte window, and the sender is using one-kilobyte segments, then as many as eight segments may be issued and in transit at any given moment. If the first segment is lost or damaged, the recipient may still get the remaining seven segments. Furthermore, it should hold those other segments in memory until the missing data arrives, preventing the need for all of the other segments to get resent.
However, the recipient must put the segments back into their original order before passing the data up to the destination application. Therefore, it has to notify the sender of the missing segment before it can process the remaining seven segments.
Most network protocols use either negative acknowledgments or selective acknowledgments for this service. Using a negative acknowledgment, the recipient can send a message back to the sender stating "segment n is missing, please resend." A selective acknowledgment can be used to notify the sender that "bytes a through g and bytes s through z were received, please resend bytes h through r."
However, TCP does not use negative or selective acknowledgments by default. Instead, a recipient system has to implement recovery using the implicit acknowledgment mechanism, simply stating "all bytes up to n have been received." When a segment is lost, the recipient has to resend the acknowledgment, thereby informing the sender that it is still waiting for a particular sequence number. The original sender then has to recognize the duplicate acknowledgment as a cry for help, stop transmitting new data, and resend the missing data.
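The receiver's side of this scheme can be modeled with a small sketch (hypothetical names; a real implementation tracks far more state). Out-of-order segments are held for reassembly, and the acknowledgment number always names the first byte of the earliest hole, which is what produces the duplicate acknowledgments just described:

    class Receiver:
        """Toy model of TCP's cumulatively implicit acknowledgments."""

        def __init__(self):
            self.rcv_next = 0          # next contiguous byte expected
            self.out_of_order = {}     # seq -> payload held for reassembly

        def on_segment(self, seq, payload):
            if seq == self.rcv_next:
                self.rcv_next += len(payload)
                # Held segments may now be contiguous; drain the queue.
                while self.rcv_next in self.out_of_order:
                    self.rcv_next += len(self.out_of_order.pop(self.rcv_next))
            elif seq > self.rcv_next:
                self.out_of_order[seq] = payload   # keep it, per RFC 1122
            return self.rcv_next                   # the cumulative ACK number

    rx = Receiver()
    print(rx.on_segment(0, b"x" * 100))    # 100 -- first segment accepted
    print(rx.on_segment(200, b"x" * 100))  # 100 -- hole at 100: duplicate ACK
    print(rx.on_segment(100, b"x" * 100))  # 300 -- hole filled, ACK jumps ahead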
Note that RFC 1106 introduced an experimental TCP option that allowed for the use of negative acknowledgments. However, the Negative Acknowledgment option was never widely used, and RFC 1106 has since expired. Therefore, the Negative Acknowledgment option should not be used with any production TCP implementations.
In addition, RFC 1072 introduced selective acknowledgments to TCP, by way of a set of TCP options. However, this work was later clarified in RFC 2018. Using the selective acknowledgment options described therein, a TCP segment can precisely state the data it has received—and thus the data that's missing—even if those blocks of data are non-contiguous. In this model, a receiver uses the normal acknowledgment scheme to state that it is looking for sequence number n, and then supplements this information with the Selective Acknowledgment Data option, stating that it also has bytes y through z in the receive queue. The sender would then resend the bytes from n up to (but not including) y, filling the hole in the receiver's queue. For a more detailed discussion of Selective Acknowledgments, refer to "Selective Acknowledgments Permitted" and "Selective Acknowledgment Data," both later in this chapter.
The cumulatively implicit acknowledgment scheme used by TCP is illustrated in Figure 7-15. In that example, each segment contains 100 bytes. The first segment is received successfully, so the recipient returns an implicit acknowledgment covering the first 100 bytes (pointing to byte 101, the next byte it expects). The second segment, however, is lost in transit, so the recipient doesn't see (or acknowledge) it.
When the third segment arrives, the recipient recognizes that it is missing bytes 101 through 200, yet having no way to issue a negative acknowledgment, it repeats the previous implicit acknowledgment, indicating that it is still waiting for byte 101.
What happens next depends on a variety of implementation issues. In the original specification, the sender could wait until an acknowledgment timer for sequence number 101 had expired before resending the segment. However, RFC 1122 states that if three duplicate acknowledgments are received for a segment—and if no other acknowledgments have been received for any subsequent segments—then the sender should assume that the segment was probably lost in transit. In this case, the sender should just retransmit the questionable segment rather than waiting for the acknowledgment timer for that segment to expire. This process is known as fast retransmit, and is documented in RFC 2581.
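The trigger condition for fast retransmit is simple enough to sketch. The three-duplicate threshold below comes from RFC 2581; the rest is a hypothetical simplification of a sender's bookkeeping:

    DUP_ACK_THRESHOLD = 3  # per RFC 2581

    class Sender:
        """Toy model of the fast retransmit trigger."""

        def __init__(self):
            self.last_ack = None
            self.dup_count = 0

        def on_ack(self, ack):
            """Return True when the segment starting at `ack` should be
            retransmitted immediately, without waiting for the timer."""
            if ack == self.last_ack:
                self.dup_count += 1
                if self.dup_count == DUP_ACK_THRESHOLD:
                    return True
            else:
                self.last_ack, self.dup_count = ack, 0
            return False

    tx = Sender()
    for ack in [101, 101, 101, 101]:   # the original ACK plus three duplicates
        if tx.on_ack(ack):
            print(f"fast retransmit from byte {ack}")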
It is important to note that fast retransmit does not work when the data has been lost from the tail-end of the stream. Since no other segments would have been sent after the lost segment, there would not be any duplicate acknowledgments, and as such fast retransmit would never come into play. In those situations, the missing data will be discovered only when the acknowledgment timer for that segment expires.
Regardless of the retransmission strategy used, once the sender has resent the lost segment, it has to decide whether to resend all of the segments following the lost one, or simply resume sending from the point where it left off when the missing segment was discovered. The most common mechanism used for this is called fast recovery, which is also described in RFC 2581. The fast recovery algorithm states that if data was retransmitted due to the presence of multiple duplicate acknowledgments, then the sender should simply resume transmitting segments on the assumption that none of the subsequent segments were lost. If other segments were in fact lost, that information will be discovered in the acknowledgments for the retransmitted segment. This position assumes that multiple segments are not likely to have been lost, and it is accurate most of the time.
Of course, this also depends on whether or not the recipient actually kept any other segments that may have been sent. Although RFC 1122 states that the recipient should keep the other segments in memory (thereby allowing for reassembly to occur locally rather than requiring a total retransmission), not all systems conform to this recommendation.
Another related issue is "partial acknowledgment," whereby a recipient has lost multiple segments. When that happens, the sender may discover and resend the first lost segment through the use of the fast retransmit algorithm. However, rather than getting an acknowledgment back for all of the segments sent afterwards, an acknowledgment is returned that points to only some of the data sent afterwards. Although there aren't any standards-track RFCs that dictate how this situation should be handled, the prevailing logic is to have the sender retransmit the next-specified missing segment and then continue resending from where it left off.
This process illustrates the importance of the recipient's receive buffer, particularly as it pertains to the sender. Every time the recipient received another segment that was out of order, it would have to store this data in the receive buffer until the missing segment arrived. This in turn would take up space in the receive buffer, and so the recipient would have to advertise a smaller receive buffer every time it sent a duplicate acknowledgment for the missing segment. This would in turn cause the sender to slow down (as described earlier in "Receive window size adjustments"), with the sender eventually being unable to send any additional segments. Once the recipient got the missing segment, it would reorder the segments and then pass the data off to the destination application. Then it could advertise a large receive buffer again, and the sender could resume sending data.
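This squeeze is easy to trace numerically. The sketch below assumes an 800-byte receive buffer and 100-byte segments (the same figures used in the example that follows):

    BUFFER_SIZE = 800   # receive buffer, in bytes (an assumed figure)
    SEGMENT = 100       # bytes per segment

    buffered = 0        # out-of-order bytes held while waiting for the hole
    for n in range(1, 7):             # six segments arrive beyond the gap
        buffered += SEGMENT
        print(f"segment {n}: duplicate ACK, window now {BUFFER_SIZE - buffered} bytes")
    # Once the missing segment arrives, the reordered data is passed up to
    # the application, the buffer drains, and the full 800-byte window can
    # be advertised again.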
Depending on the size and condition of the receive buffer, the sender may be able to resume sending data from where it left off (in our example, sequence number 401), without waiting for an acknowledgment for the next segment. This really depends on the number of unacknowledged segments currently outstanding and the maximum amount of unacknowledged data allowed by the recipient.
For example, if the size of the receive buffer was 800 bytes (using 100-byte segments)—and if only two segments were currently unacknowledged—then once the sender had resent the missing data, it could go ahead and resume transmitting additional segments without waiting for an acknowledgment for those other segments. But if the receive buffer had been cut down to just two hundred bytes, then the sender could not send any more data until the two outstanding segments had been acknowledged.
For more details on how the receive buffer governs flow control in general, refer back to "Flow Control." For more information on the selective acknowledgment option, refer ahead to "Selective Acknowledgment Data."
Acknowledgment timers

Most of the time, sporadic packet loss is dealt with by using the fast retransmit and fast recovery algorithms, as defined in RFC 2581. However, those algorithms are not always usable. For example, if the link has failed completely, then multiple duplicate acknowledgments will never be received. Also, if the last segment of a transmission was the one that got lost, then there would not be any additional segments to cause multiple duplicate acknowledgments to be generated. In these situations, TCP has to rely on an acknowledgment timer (also known as a retransmission timer) to detect when a segment has been lost in transit.
Whenever a sender transmits a segment, it has to wait for an acknowledgment to arrive before it can clear the segment from the outbound queue. On well-behaved networks, these acknowledgments come in quickly, allowing the sender to clear the data immediately, increment the sliding window, and move on to the next waiting segment. But if a segment is lost or corrupted and no acknowledgment ever arrives, the sender has to rely on the timer to tell it when to resend the unacknowledged data.
Determining the most efficient size for the acknowledgment timer is a complex process that must be handled carefully. Setting the timer too short would result in frequent and unnecessary retransmissions, while setting the timer too long would result in unproductive delays whenever loss actually occurred.
For example, the acknowledgment timer for two systems connected together on a high-speed LAN should be substantially shorter than the timer used for a slow connection over the open Internet. Using a short timer allows failure to be recognized quickly, which is desirable on a high-speed LAN where latency is not much of an issue. However, setting a long timer would be more practical when many slow networks were involved, as it would not be efficient to continually generate duplicate segments when the problem is slow delivery (rather than packet loss).
Most systems start with a default acknowledgment timer, and then adjust this timer on a per-circuit basis according to the round-trip delivery times encountered on that specific connection. However, even this approach can get complicated, because the default timer is likely to be inappropriate for many of the virtual circuits, since some of them will be used for slow, long-haul circuits while others will be used for local and fast connections.
For example, most modern systems use a default timer of 3000 milliseconds, which is really too large for local area networks that have a round-trip time less than 10 milliseconds (even though this is the recommended default in RFC 1122). Conversely, many earlier implementations had a default timer of 200 milliseconds, which is far too short for many dial-up and satellite links, resulting in frequent and totally unnecessary retransmissions.
Also, the round-trip delivery times of most networks change throughout the day, due to changes in network utilization, congestion, and routing updates that affect the path that segments take on the way to their destination. For these reasons, the default setting is only accurate some of the time, and must be modified to reflect the specific latency characteristics of each virtual circuit throughout the connection's lifetime.
The two formulas used for determining round-trip delivery times are Van Jacobson's algorithm and Karn's algorithm. Van Jacobson's algorithm is useful for determining a "smoothed round-trip time" across a network, while Karn's algorithm offers techniques for adjusting the smoothed round-trip time whenever network congestion is detected. Although the details of these two algorithms are outside the scope of this book, a grasp of their principles is needed to fully understand how they affect TCP acknowledgment timers.
The basis of Van Jacobson's algorithm is for a sender to watch the delay encountered by acknowledgment segments as they cross the network, constantly tweaking its timing variables according to the specific amount of time it takes to send a segment and then receive an acknowledgment for it.
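In rough terms, the variables being tweaked are a running average of the round-trip time and its mean deviation. The sketch below uses the widely published form of Van Jacobson's estimator, with gains of 1/8 and 1/4 (the values from his 1988 paper, later codified in the standards track); real implementations add clock-granularity and clamping details:

    ALPHA, BETA = 1 / 8, 1 / 4   # gains from Van Jacobson's 1988 paper

    class RttEstimator:
        """Smoothed round-trip time plus mean deviation; the timeout is
        derived from both, so a jittery path earns a more generous timer."""

        def __init__(self, first_sample):
            self.srtt = first_sample
            self.rttvar = first_sample / 2

        def update(self, sample):
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(sample - self.srtt)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * sample
            return self.rto()

        def rto(self):
            return self.srtt + 4 * self.rttvar   # retransmission timeout

    est = RttEstimator(0.300)             # first measured RTT: 300 ms
    for sample in (0.280, 0.320, 0.900):  # a congestion spike arrives last
        print(f"RTO is now {est.update(sample) * 1000:.0f} ms")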
Although Van Jacobson's original algorithm used acknowledgments to determine round-trip times for specific segments, this model did not provide guaranteed accuracy, since multiple acknowledgments could arrive for a single segment (due to loss, or due to acknowledgments of command segments, each of which would share the same sequence number). In order to provide a more accurate monitoring tool, RFC 1072 introduced a pair of TCP options that could be used for measuring the round-trip time of any given circuit, called the Echo and Echo Reply options. However, this work was abandoned in favor of a generic Timestamp option, as defined in RFC 1323.
RFC 1323 uses two fields in a single Timestamp option, allowing both systems to monitor the precise round-trip delivery time of every segment that they send. Whenever a system needs to send a segment, it places the current time into the Timestamp Value field of the Timestamp option in the TCP header of the outgoing segment. When the remote system receives the segment, it copies that data into the Timestamp Reply field of the response segment, and places its own timestamp into the Timestamp Value field. Upon receipt of the response, the original sender can compare the echoed timestamp to the current time, allowing it to determine the exact amount of latency on the network. The field-swapping operation is repeated in both directions, allowing the remote end to determine the same information. For more information on the Timestamp option, refer to "Timestamp" later in this chapter.
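The field swap itself is easy to model. In this sketch, tsval and tsecr stand in for the Timestamp Value and Timestamp Reply fields, and the millisecond clock is an assumption (RFC 1323 only requires a steadily increasing timestamp):

    import time

    def make_segment(own_clock_ms, echoed_tsval):
        """Stamp our own clock into the Timestamp Value field and echo
        the peer's most recent timestamp back in the reply field."""
        return {"tsval": own_clock_ms, "tsecr": echoed_tsval}

    t0 = int(time.monotonic() * 1000)
    request = make_segment(t0, None)                 # sender stamps its clock
    reply = make_segment(123_456, request["tsval"])  # peer echoes the stamp back
    t1 = int(time.monotonic() * 1000)
    print(f"measured round trip: {t1 - reply['tsecr']} ms")  # near zero here,
    # since no real network sits between the two calls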
Karn's algorithm builds on the basic round-trip time formula, focusing on how to deal with packet loss and congestion. For example, Karn's algorithm suggests that it is best to ignore the round-trip times of packets that get lost (i.e., those for which no acknowledgment has been received), in order to prevent one failure from unnecessarily tilting the smoothed round-trip time determined by Van Jacobson's algorithm. Karn's algorithm also suggests that the value of the acknowledgment timer should be doubled whenever questionable data has been retransmitted because the acknowledgment timer expired, in case the problem is a temporary link failure or congestion.
In this model, if the retransmitted segments also go unacknowledged, then the acknowledgment timer will be doubled yet again, with the process repeating until a system-specific maximum has been reached. This could be a maximum number of retransmissions, or a maximum timer value, or a combination of the two.
Systems based on BSD typically limit the length of the retransmission timer to either five minutes or a maximum of twelve attempts, whichever comes first. Windows-based systems limit retransmissions to five attempts, with each retransmission doubling the acknowledgment timer. Other implementations do not double the retransmission timer, but instead use a percentage-based formula or a fixed table, hoping to recover faster than blind-doubling would allow. Regardless, remember that the value that is being incremented or doubled is based on the smoothed round-trip time for that connection, so the maximum acknowledgment timer value could be either quite large or quite small.
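Putting the doubling rule together with those caps gives a schedule like the following sketch (the constants mirror the BSD and Windows behaviors just described, and are assumptions rather than universal values):

    MAX_RETRIES = 5      # Windows-style attempt cap; BSD allows up to twelve
    MAX_RTO = 5 * 60.0   # BSD-style five-minute ceiling, in seconds

    def backoff_schedule(initial_rto):
        """Yield (attempt, timeout) pairs, doubling the acknowledgment
        timer after each expiry per Karn's algorithm, up to the caps."""
        rto = initial_rto
        for attempt in range(1, MAX_RETRIES + 1):
            yield attempt, rto
            rto = min(rto * 2, MAX_RTO)

    for attempt, rto in backoff_schedule(3.0):   # the 3-second default timer
        print(f"retransmission {attempt}: wait {rto:.0f} s")
    # retransmission 1 waits 3 s, then 6, 12, 24, and finally 48 s.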
Some systems have shown problems in this area, failing to double the size of their retransmission timers whenever the timer expired. As such, these systems would send a retransmission, and then continue resending the data in short fixed intervals. Since these systems had low timers anyway (200 milliseconds was the default), a dial-up user connecting to this system would tend to get at least two or three retransmissions of the very first segment, until the round-trip smoothing started to kick in.
Also, some systems will cache the learned round-trip time for future use, allowing any subsequent connections to the same remote system (or network) to use the previously learned round-trip latency values. This feature allows the new connection to start with a default that should be appropriate for the specific endpoint system, instead of starting at the system default value (which is almost always wrong).
RFC 1122 mandates that both Van Jacobson's algorithm and Karn's algorithm be used in all TCP implementations so that acknowledgment timers converge on accurate values quickly. Subsequent experimentation has shown that these algorithms do in fact help to improve overall throughput and performance, regardless of the networks in use.
However, there are also times when these algorithms can actually cause problems, such as when Karn's algorithm results in an overly slow reaction to a sudden change in the network's characteristics. For example, if the round-trip time suddenly goes through the roof due to a change in the end-to-end network path, the acknowledgment timer on the sender will most likely get triggered before the data ever reaches the destination. When that happens, the sender will resend the unacknowledged data, the size of the retransmission timer will be doubled, and the acknowledgments for the questionable data may be ignored (since retransmissions aren't supposed to factor into the smoothed round-trip time). It will take several attempts before the smoothed round-trip time is updated to reflect the true round-trip latency of the new network path.
Delayed acknowledgments

Figure 7-15 earlier in this chapter shows the receiver sending an acknowledgment every time it receives a segment from the sender. However, this is not necessarily an effective use of resources. For one thing, the receiver has to spend CPU cycles calculating each acknowledgment, as does the sender when it gets the acknowledgment. Furthermore, frequent acknowledgments generate excessive amounts of network traffic, thereby consuming bandwidth that could otherwise be used by the sender to transmit data.
Rather than acknowledging every segment, it is better for the receiver to send acknowledgments only periodically. A mechanism called Delayed Acknowledgment serves this purpose, allowing multiple segments to be acknowledged at once. Remember that acknowledgments are implicit, stating that "all data up to n has been received." It is therefore possible for a recipient to acknowledge multiple segments simultaneously by simply setting the Acknowledgment Identifier to a higher inclusive value, rather than sending multiple distinct acknowledgments. Not only does this consume fewer network resources, it also requires less computational effort from the two endpoints.
This concept is illustrated in Figure 7-16. In that example, the recipient only sends an acknowledgment after receiving two segments. This approach not only generates less traffic, but also allows the sender to increment its sliding window by two segment sizes, thereby helping to keep traffic flowing smoothly.
RFC 1122 states that all TCP implementations should utilize the delayed acknowledgment algorithm. However, RFC 1122 also states that implementations that do so must not delay an acknowledgment for more than 500 milliseconds (to keep the sender's acknowledgment timers from expiring).
RFC 1122 also states that an acknowledgment should be sent for every two full-sized segments received. However, this depends upon the ability of the recipient to clear the buffer quickly, and also upon the latency of the network in use. If it takes a long time for cumulative acknowledgments to reach the sender, this design can negatively impact the sender's ability to transmit more data. Instead of helping, this behavior causes traffic to become bursty, with the sender transmitting lots of segments and then stopping to wait for an acknowledgment. Once the acknowledgment arrives, the sender sends several more segments, and then stops again.
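Both RFC 1122 rules can be captured in a few lines. In the sketch below (hypothetical names; a real stack arms a timer rather than waiting for the next arrival), an acknowledgment goes out as soon as two full-sized segments are pending or the 500-millisecond cap would be exceeded:

    ACK_DELAY_CAP = 0.500   # RFC 1122: never hold an ACK longer than 500 ms
    MSS = 1460              # assumed maximum segment size for this sketch

    class DelayedAck:
        def __init__(self):
            self.unacked_bytes = 0
            self.first_pending = None   # arrival time of oldest unACKed data

        def on_segment(self, size, now):
            """Return True when an acknowledgment should be sent now."""
            self.unacked_bytes += size
            if self.first_pending is None:
                self.first_pending = now
            if (self.unacked_bytes >= 2 * MSS
                    or now - self.first_pending >= ACK_DELAY_CAP):
                self.unacked_bytes, self.first_pending = 0, None
                return True
            return False

    rx = DelayedAck()
    print(rx.on_segment(MSS, 0.00))   # False -- first segment, ACK delayed
    print(rx.on_segment(MSS, 0.02))   # True  -- two full segments, ACK now
    print(rx.on_segment(536, 1.00))   # False -- small segment, ACK delayed
    print(rx.on_segment(536, 1.60))   # True  -- 500 ms cap exceeded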
Furthermore, some applications (such as TELNET) are "chatty" by design, with the client and server both sending data to each other on a regular basis. With these applications, delayed acknowledgments are especially helpful, since both systems can simply combine their acknowledgments with whatever data is being returned, reducing the number of segments crossing the network at any given moment.
For example, assume that a TELNET client is sending keystrokes to the server, which must be echoed back to the client. This generates lots of small segments from both systems: not only the segments carrying the keystroke data, but also the acknowledgments for each of those segments, as well as the segments containing the data being echoed back to the client.
By delaying the acknowledgment until the segment containing the echoed data is generated, the amount of network traffic can be reduced dramatically. Effectively, rather than sending an acknowledgment as soon as the client's keystroke data segment had been verified, the server would delay the acknowledgment for a little while. Then, if any data was being returned to the client (such as the echoed keystroke), the server would just set the Acknowledgment Identifier in that segment's TCP header, eliminating the need for a separate acknowledgment segment. When combined with the Nagle algorithm, delayed acknowledgments can really help to cut down the amount of network bandwidth being consumed by small segments.
Unfortunately, there are also some potential problems with this design, although they are typically only seen when used in conjunction with Path MTU Discovery. These problems occur whenever a system chooses to delay an acknowledgment until two full-sized segments have been received, but the system is receiving segments that are not "fully sized." This can happen when two devices announce large MTUs using the MSS option, but then the sender determines that a smaller MTU is required (as detected with Path MTU Discovery). When this happens, the recipient will receive many segments, but will not return an acknowledgment until it has received enough data to fill two full-sized segments (as determined by the MSS option).
In this case, the sender will send as much data as it can (according to the current limitations defined by the congestion window and the local sliding window), and then stop transmitting until an acknowledgment for that data is received. However, the recipient will not return an acknowledgment until the 500-millisecond maximum for delayed acknowledgments has been reached, and then will send one acknowledgment for all of the segments that have been received. The sender will increment its sliding window and resume sending data, only to stop again a moment later, resulting in bursty traffic.
This scenario happens only when Path MTU Discovery detects a smaller MTU than the size announced by the MSS option, which should be a fairly rare occurrence, although it does happen often enough to be a problem. This is particularly problematic with sites that use Token Ring, FDDI, or some other technology that allows for large MTU sizes, with an intermediary network that allows for only 1500-byte MTU sizes. For a more detailed discussion of this problem, refer to "Partially Filled Segments or Long Gaps Between Sends" later in this chapter.
The TCP Header

TCP segments consist of header and body parts, just like IP datagrams. The body part contains whatever data was provided by the application that generated it, while the header contains the fields that tell the destination TCP software what to do with the data.
A TCP segment is made up of at least ten header fields, and unlike the messages used by the other core protocols, a TCP segment does not necessarily contain any data. In addition, a variety of supplementary fields may show up as "options" in the header. The total size of the segment will vary according to the amount of data and the options in use.
Table 7-2 lists all of the mandatory fields in a TCP header, along with their size (in bits) and some usage notes. For more detailed descriptions of these fields, refer to the individual sections throughout this chapter.
Notice that the TCP header does not provide any fields for source or destination IP address, or any other services that are not specifically related to TCP. This is because those services are provided by the IP header or by the application-specific protocols (and thus contained within the data portion of the segment).
As can be seen, the minimum size of a TCP header is 20 bytes. If any options are defined, the header's size will increase (up to a maximum of 60 bytes). RFC 793 states that the header's length must be a multiple of 32 bits, so if an option is defined but uses only 16 bits, another 16 bits must be added using the Padding field.
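The 20-byte minimum is easy to verify by packing a bare header. The layout below follows RFC 793 (the checksum is left at zero, since computing it properly requires the IP pseudo-header):

    import struct

    # Source port, destination port, sequence number, acknowledgment number,
    # data offset/flags, window, checksum, urgent pointer -- 20 bytes total.
    TCP_HEADER = struct.Struct("!HHIIHHHH")

    def build_header(src_port, dst_port, seq, ack, flags, window):
        data_offset = 5                       # header length in 32-bit words
        offset_flags = (data_offset << 12) | flags
        return TCP_HEADER.pack(src_port, dst_port, seq, ack,
                               offset_flags, window, 0, 0)

    hdr = build_header(80, 1025, 1, 101, 0x010, 8192)   # ACK flag (0x010) set
    print(len(hdr))                                     # 20 -- the minimum size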
Figure 7-17 shows a TCP segment being sent from Arachnid (an HTTP 1.1 server) to Bacteria (an HTTP 1.1 client). This segment will be used for further discussion of the TCP header fields throughout the remainder of this chapter.
Source Port

Identifies the application that generated the segment, as referenced by the 16-bit TCP port number in use by the application.
Size
Sixteen bits.

Notes

This field identifies the port number used by the application that created the data.

Capture Sample
In the capture shown in Figure 7-18, the Source Port field is set to hexadecimal 00 50, which is decimal 80 (the well-known port number for HTTP). From this information, we can tell that this segment is a reply, since HTTP servers only send data in response to a request.
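That conversion is easy to check:

    import struct

    (port,) = struct.unpack("!H", bytes.fromhex("0050"))  # network byte order
    print(port)                                           # 80 -- the well-known HTTP port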