'UP!/Web Service'에 해당되는 글 24건

2008.08.21 The Transmission Control Protocol
2008.08.21 The User Datagram Protocol
2008.08.21 The Internet Control Message Protocol
2008.08.21 Multicasting and the Internet Group Management Protocol
2008.08.21 The Address Resolution Protocol
2008.08.21 The Internet Protocol
2008.08.21 An Introduction to TCP/IP
2008.08.21 Internet Core Protocols
2008.08.21 [토론4] 웹서비스와 시맨틱 웹
2008.08.21 RPC:Remote Procedure Call

The Transmission Control Protocol

UP!/Web Service 2008. 8. 21. 14:17

The Transmission Control Protocol

조회(401)

WEB Services | 2007/09/21 (금) 23:06

추천하기

스크랩하기

The Transmission Control Protocol

Summary	The Transmission Control Protocol provides a reliable, connection-oriented transport protocol for transaction-oriented applications to use. TCP is used by almost all of the application protocols found on the Internet today, as most of them require a reliable, error-correcting transport layer in order to ensure that data does not get lost or corrupted.
Protocol ID	6
Relevant STDs	2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 7 (RFC 793, republished)
Relevant RFCs	793 (Transmission Control Protocol); 896 (The Nagle Algorithm); 1122 (Host Network Requirements); 1323 (Window Scale and Timestamp); 2018 (Selective Acknowledgments); 2581 (TCP Congestion Control);
Related RFCs	1072 (Extensions for High Delay) 1106 (Negative Acknowledgments); 1146 (Alternate Checksums); 1337 (Observations on RFC 1323); 1644 (Transaction Extensions); 1948 (Defending Against Sequence Number Attacks); 2414 (Increasing the Initial Window); 2525 (Known TCP Implementation Problems); 2582 (Experimental New Reno Modifications to Fast Recovery)

On an IP network, applications use two standard transport protocols to communicate with each other. These are the User Datagram Protocol (UDP), which provides a lightweight and unreliable transport service, and the Transmission Control Protocol (TCP), which provides a reliable and controlled transport service. The majority of Internet applications use TCP, since its built-in reliability and flow control services ensure that data does not get lost or corrupted.

TCP is probably the most important protocol in use on the Internet today. Although IP does the majority of the legwork, moving datagrams and packets around the Internet as needed, TCP makes sure that the data inside of the IP datagrams is correct. Without this reliability service, the Internet would not work nearly as well as it does, if it worked at all.

It is also interesting to note that the first versions of TCP were designed before IP, with IP being extracted from TCP later. In fact, TCP is now designed to work with any packet-switched network, whether this be raw Ethernet or a distributed IP-based network like the Internet. This flexible design has resulted in TCP being adopted by other network architectures, including OSI's Transport Protocol 4 (TP4) and Apple Computer Corporation's AppleTalk Data Stream Protocol (ADSP).

The TCP Standard

TCP is defined in RFC 793, which has been republished as STD 7 (TCP is an Internet Standard protocol). However, RFC 793 contained some vagaries which were clarified in RFC 1122 (Host Network Requirements). In addition, RFC 2001 introduced a variety of congestion-related elements to TCP, which have been included into the standard specification, although this RFC was superseded by RFC 2581 (a.k.a., RFC 2001 bis). As such, TCP implementations need to incorporate RFC 793, RFC 1122, and RFC 2581 in order to work reliably and consistently with other implementations.

RFC 793 states that the Protocol ID for TCP is 6. When a system receives an IP datagram that is marked as containing Protocol 6, it should pass the contents of the datagram to TCP for further processing.

TCP Is a Reliable, Connection-Centric Transport Protocol

Remember that all of the transport-layer protocols (including TCP and UDP) use IP for their basic delivery services, and that IP is an unreliable protocol, providing no guarantees that datagrams or packets will reach their destination intact. It is quite possible for IP packets to get lost entirely (due to an untimely link failure on the network somewhere), or for packets to become corrupted (due to an overworked or buggy router), or for packets to get reordered as they cross different networks en route to the destination system, or for a myriad of other problems to crop up while packets are being bounced around the Internet.

For applications that need some sort of guarantee that data will arrive at its destination intact, this uncertainty is simply unacceptable. Electronic mail, TELNET, and other network applications are the basis of many mission-critical efforts, and as such they need some sort of guarantee that the data they transmit will arrive in its original form.

This reliability is achieved through the use of a virtual circuit that TCP builds whenever two applications need to communicate. As we discussed in Chapter 1, An Introduction to TCP/IP, a TCP session is somewhat analogous to a telephone conversation in that it provides a managed, full-duplex, point-to-point communications circuit for application protocols to use. Whenever data needs to be sent between two TCP-based applications, a virtual circuit is established between the two TCP providers, and a highly monitored exchange of application data occurs. Once all of the data has been successfully sent and received, the connection gets torn down.

Building and monitoring these virtual circuits incurs a fair amount of overhead, making TCP somewhat slower than UDP. However, UDP does not provide any reliability services whatsoever, which is an unacceptable trade-off for many applications.

Services Provided by TCP

Although it is possible for applications to provide their own reliability and flow control services, it is impractical for them to do so. Rather than developing (and debugging) these kinds of services, it is much more efficient for applications to leverage them as part of a transport-layer protocol, where every application has access to them. This arrangement allows shorter development cycles, better interoperability, and less headaches for everybody.

TCP provides five key services to higher-layer applications:

Virtual circuits
Whenever two applications need to communicate with each other using TCP, a virtual circuit is established between the two TCP endpoints. The virtual circuit is at the heart of TCP's design, providing the reliability, flow control, and I/O management features that distinguish it from UDP.

Application I/O management
Applications communicate with each other by sending data to the local TCP provider, which then transmits the data across a virtual circuit to the other side, where it is eventually delivered to the destination application. TCP provides an I/O buffer for applications to use, allowing them to send and receive data as contiguous streams, with TCP converting the data into individually monitored segments that are sent over IP.

Network I/O management
When TCP needs to send data to another system, it uses IP for the actual delivery service. Thus, TCP also has to provide network I/O management services to IP, building segments that can travel efficiently over the IP network, and turning individual segments back into a data-stream appropriate for the applications.

Flow control
Different hosts on a network will have different characteristics, including processing capabilities, memory, network bandwidth, and other resources. For this reason, not all hosts are able to send and receive data at the same rate, and TCP must be able to deal with these variations. Furthermore, TCP has to do all of this seamlessly, without any action being required from the applications in use.

Reliability
TCP provides a reliable transport service by monitoring the data that it sends. TCP uses sequence numbers to monitor individual bytes of data, acknowledgment flags to tell if some of those bytes have been lost somewhere, and checksums to validate the data itself. Taken together, these mechanisms make TCP extremely reliable.

All told, these services make TCP an extremely robust transport protocol.

Virtual Circuits

In order for TCP to provide a reliable transport service, it has to overcome IP's own inherent weaknesses, possibly the greatest of which is the inability to track data as it gets sent across the network. IP only moves packets around the network, and makes no pretense towards offering any sort of reliability whatsoever. Although this lack of reliability is actually a designed-in feature of IP that allows it to move data across multiple paths quickly, it is also an inherent weakness that must be overcome in order for applications to communicate with each other reliably and efficiently.

TCP does this by building a virtual circuit on top of IP's packet-centric network layer, and then tracking data as it is sent through the virtual circuit. This concept is illustrated in Figure 7-1. Whenever a connection is made between two TCP endpoints, all of the data gets passed through the virtual circuit.

By using this virtual circuit layer, TCP accomplishes several things. It allows IP to do what it does best (which is moving individual packets around the network), while also allowing applications to send and receive data without them having to worry about the condition of the underlying network. And since each byte of data is monitored individually by TCP, it's easy to take corrective actions whenever required, providing reliability and flow control services on top of the chaotic Internet.


		Figure 7-1. An overview of TCP's virtual circuits

These virtual circuits are somewhat analogous to the way that telephone calls work. It is easy to see this corollary if you think of the two TCP endpoints as being the 뱓elephones,?and the applications being the users of those telephones.

When an application wants to exchange data with another system, it first requests that TCP establish a workable session between the local and remote applications. This process is similar to you calling another person on the phone. When the other party answers (밐ello??, they are acknowledging that the call went through. You then acknowledge the other party's acknowledgment (밐i Joe, this is Eric?, and begin exchanging information (밫he reason I'm calling is뀛).

Likewise, data travelling over a TCP virtual circuit is monitored throughout the session, just as a telephone call is. If at any time parts of the data are lost (밯hat did you say??, the sending system will retransmit the lost data (밒 said뀛). If the connection degrades to a point where communications are no longer possible, then sooner or later both parties will drop the call. Assuming that things don't deteriorate to that point, then the parties will agree to disconnect (밪ee ya? once all of the data has been exchanged successfully, and the call will be gracefully terminated.

This concept is illustrated in Figure 7-2. When a TCP connection needs to be established, one of the two endpoint systems will try to connect with the other endpoint. If the 밹all?goes through successfully, then the TCP stack on the remote system will acknowledge the connection request, which will then be followed by an acknowledgment from the sender. This three-way handshake ensures that the connection is sufficiently reliable for data to be exchanged.

Likewise, each clump of data that is sent is explicitly acknowledged, providing constant feedback that everything is going okay. Once all of the data has been sent, either endpoint can close the virtual circuit. However, the disconnect process also uses acknowledgments in order to ensure that both parties are ready to terminate the call. If one of the systems still had data to send, then they might not agree to drop the circuit.


		Figure 7-2. TCP virtual circuits versus telephone calls

The virtual circuit metaphor has other similarities with traditional telephone calls. For example, TCP is a full-duplex transport that allows each party to send and receive data over the same virtual circuit simultaneously, just like a telephone call does. This allows for a web browser to request an object and for the web server to send the requested data back to the client using a single virtual circuit, rather than requiring that each end establish its own communication channel.

Every TCP virtual circuit is dedicated to one pair of endpoints, also like a telephone call. If an application needs to communicate with multiple endpoints simultaneously, then it must establish unique circuits for each endpoint pair, just as telephone calls do. This is true even if the same applications are in use at both ends of the connection. For example, if a web browser were to simultaneously request four GIF images from the same server using four simultaneous HTTP 밎ET?commands, then four separate TCP circuits would be needed in order for the operations to complete, even though the same applications and hosts were being used with all of the requests.

For all of these reasons, it is easy to think of TCP's virtual circuits as being very similar to the familiar concept of telephone calls.

Application I/O Management.

The primary benefit of the virtual circuit metaphor is the reliability that it allows. However, another set of key benefits is the I/O management services that this design provides.

One of the main features that comes from this design is that applications can send and receive information as streams of data, rather than having to deal with packetsizing and management issues directly. This allows a web server to send a very large graphic image as a single stream of data, rather than as a bunch of individual packets, leaving the task of packaging and tracking the data to TCP.

This design helps to keep application code simple and straightforward, resulting in lower complexity, higher reliability, and better interoperability. Application developers don't have to build flow control, circuit-management, and packaging services into their applications, but can instead use the services provided by TCP, without having to do anything special. All an application has to do is read and write data; TCP does everything else.

TCP provides four distinct application I/O management services to applications:

?Internal Addressing. TCP assigns unique port numbers to every instance of every application that is using a TCP virtual circuit. Essentially, these port numbers act as extension numbers, allowing TCP to route incoming data directly to the appropriate destination application.

?Opening Circuits. Applications inform TCP when they need to open a connection to a remote application, and leave it to TCP to get the job done.

?Data Transfer. Whenever an application needs to send data, it just hands it off to TCP, and assumes that TCP will do everything it can to make sure that the data is delivered intact to the destination system.

?Destroying Circuits. Once applications have finished exchanging data, they inform TCP that they are finished, and TCP closes the virtual circuit.

Application addressing with TCP ports

Applications communicate with TCP through the use of ports, which are practically identical to the ports found in UDP. Application are assigned 16-bit port numbers when they register with TCP, and TCP uses these port numbers for all incoming and outgoing traffic.

Conceptually, port numbers provide 밻xtensions?for the individual applications in use on a system, with the IP address of the local system acting as the main phone number. Remote applications 밹all?the host system (using the IP address), and also provide the extension number (port number) of the destination application that they want to communicate with. TCP uses this information to identify the sending and receiving applications, and to deliver data to the correct application.

Technically, this procedure is a bit more complex than it is being described here. When an application wishes to communicate with another application, it will give the data to TCP through its assigned port number, telling TCP the port number and IP address of the destination application. TCP will then create the necessary TCP message (called a 뱒egment?, marking the source and destination port numbers in the message headers, and storing whatever data is being sent in the payload portion of the message. A complete TCP segment will then get passed off to the local IP software for delivery to the remote system (which will create the necessary IP datagram and shoot it off).

Once the IP datagram is received by the destination system, the remote IP software will see that the data portion of the datagram contains a TCP segment (as can be seen by the Protocol Identifier field in the IP header), and will hand the contents of the segment to TCP for further processing. TCP will then look at the TCP header, see the destination port number, and hand the payload portion of the segment off to whatever application is using the specified destination port number.

This concept is illustrated in Figure 7-3. In that example, an HTTP client is sending data to the HTTP server running on port 80 of the destination system. When the data arrives at the destination system, TCP will examine the destination port number for that segment, and then deliver the contents of the segment to the application it finds there (which should be the HTTP server).


		Figure 7-3. Application-level multiplexing with port numbers

Technically, a port identifies only a single instance of an application on a single system. The term 뱒ocket?is used to identify the port number and IP address concantenated together (i.e., port 80 on host 192.168.10.10 could also be referred to as socket 192.168.10.10:80). A 뱒ocket pair?consists of both endpoints on a virtual circuit, including the IP addresses and port numbers of both applications on both systems.

All TCP virtual circuits work on the concept of socket pairs. Multiple connections between two systems must have unique socket pairs, with at least one of the two endpoints having a different port number.

TCP port numbers are not necessarily linked with applications on a one-to-one basis. It is quite common for some applications to open multiple connections simultaneously, and these connections would all require unique socket pairs, even if there was only one application in use. For example, if an HTTP 1.0 client were to simultaneously download multiple graphic objects from an HTTP server, then each instance of the HTTP client would require a unique and separate port number in order for TCP to route the data correctly. In this case, there would be only one application, but there would be multiple bindings to the network, with each binding having a unique port number.

It is important to realize that circuits and ports are entirely separate entities, although they are tightly interwoven. The virtual circuit provides a managed transport between two endpoint TCP providers, while port numbers provide only an address for the applications to use when talking to their local TCP provider. For this reason, it is entirely possible for a server application to support several different client connections through a single port number (although each unique virtual circuit will have a unique socket pair, with the client-side address and/or socket being the unique element).

For example, Figure 7-4 shows a single HTTP server running on Arachnid, with two active virtual circuits (one for Ferret, and another for Greywolf). Although both connections use the same IP address and port number on Arachnid, the socket pairs themselves are unique, due to the different IP addresses and port numbers in use by the two client systems. In this regard, virtual circuits are different from the port number in use by the HTTP server, although these elements are also tightly related.

Most of the server-based IP applications that are used on the Internet today use what are referred to as 뱖ell-known?port numbers, as we discussed in the previous chapter. For example, an HTTP server will listen on TCP port 80 by default, which is the well-known port number associated with HTTP servers. This way, any HTTP client that needs to connect to any HTTP server can use the default destination of TCP port 80. Otherwise, the client would have to specify the port number of the server that it wanted to connect with (you've seen this in some URLs that use http://www.somehost.com:8080/ or the like; 8080 is the port number of the HTTP server on www.somehost.com).


		Figure 7-4. An HTTP server with two connections, using two distinct socket pairs

Most servers let you use any port number and are not restricted to the well-known port number. However, if you run your servers on non-standard ports, then you would have to tell every user that the server was not accessible on the default port. This would be hard to manage at best. By sticking with the defaults, all users can connect to your server using the default port number, which is likely to cause the least amount of trouble.

Some network administrators purposefully run application servers on nonstandard ports, hoping to add an extra layer of security to their network. However, it is my opinion that security through obscurity is no security at all, and this method should not be relied upon by itself.

Historically, only server-based applications have been allowed to run on ports below 1024, as these ports could be used only by privileged accounts. By limiting access to these port numbers, it was more difficult for a hacker to install a rogue application server. However, this restriction is based on Unix-specific architectures and is not easily enforced on all of the systems that run IP today. Many application servers now run on operating systems that have little or no concept of privileged users, making this historical restriction somewhat irrelevant.

There are a number of predefined port numbers that are registered with the Internet Assigned Numbers Authority (IANA). All of the port numbers below 1024 are reserved for use with well-known applications, although there are also many applications that use port numbers outside of this range. Some of the more common port numbers are shown in Table 7-1. For a detailed listing of all of the port numbers that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/port-numbers).

Table 7-1. Some of the Port Numbers Reserved for Well-Known TCP Servers
Port Number	Description
20	File Transfer Protocol, Control Channel (FTP)
21	File Transfer Protocol, Data Channel (FTP-Data)
23	TELNET
25	Simple Mail Transfer Protocol (SMTP)
80	Hypertext Transfer Protocol (HTTP)
110	Post Office Protocol, v3 (POP3)
119	Network News Transfer Protocol (NNTP)

Besides the reserved addresses that are managed by the IANA, there are also 뱔nreserved?port numbers that can be used by any application for any purpose, although conflicts may occur with other users who are also using those port numbers. Any port number that is frequently used should be registered with the IANA.

To see the well-known ports used on your system, examine the /etc/services file on a Unix host, or the C:\WinNT\System32\Drivers\Etc\SERVICES file on a Windows NT host.

Opening a circuit

Applications communicate with each other using the virtual circuits provided by TCP. These circuits are established on an as-needed basis, getting created and destroyed as requested by the applications in use. Whenever an application needs to communicate with another application somewhere on the network, it will ask the local TCP provider to establish a virtual circuit on its behalf.

There are two methods for requesting that a virtual circuit be opened: either a client will request an open so that data can be sent immediately, or a server will open a port in 뱇isten?mode, waiting for a connection request to arrive from a client.

The simplest of the two methods is the 뱎assive open,?which is the form used by servers that want to listen for incoming connections. A passive open indicates that the server is willing to accept incoming connection requests from other systems, and that it does not want to initiate an outbound connection. Typically, a passive open is 뱔nqualified,?meaning the server can accept an incoming connection from anybody. However, some security-sensitive applications will accept connections only from predefined entities, a condition known as a 뱏ualified passive open.?This type is most often seen with corporate web servers, ISP news servers, and other restricted-access systems.

When a publicly accessible server first gets started, it will request that TCP open a well-known port in passive mode, offering connectivity to any node that sends in a connection request. Any TCP connection requests that come into the system destined for that port number will result in a new virtual circuit being established.

Client applications (such as a web browser) use 밶ctive opens?when making these connection request. An active open is the opposite of a passive open, in that it is a specific request to establish a virtual circuit with a specific destination socket (typically this will be the well-known port number of the server that is associated with the specific client).

This process is illustrated in Figure 7-5. When an HTTP client needs to get a document from a remote HTTP server, it issues an 밶ctive open?to the local TCP software, providing it with the IP address and TCP port number of the destination HTTP server. The client's TCP provider then allocates a random port number for the application and attempts to establish a virtual circuit with the destination system's TCP software. The server's TCP software verifies that the connection can be opened (Is the port available? Are there security filters in place that would prevent the connection?), and then respond with an acknowledgment.

If the destination port is unavailable (perhaps the web server is down), then the TCP provider on the server system rejects the connection request. This is in contrast to UDP, which has to rely on ICMP Destination Unreachable: Port Unreachable Error Messages for this service. TCP is able to reject connections explicitly and can therefore abort connection requests without having to involve ICMP.

If the connection request is accepted, then the TCP provider on the server system acknowledges the request, and the client would then acknowledge the server's acknowledgment. At this point, the virtual circuit would be established and operational, and the two applications could begin exchanging data, as illustrated in Figure 7-5.

The segments used for the handshake process do not normally contain data, but instead are zero-length 밹ommand segments?that have special connection-management flags in their headers, signifying that a new virtual circuit is being established. In this context, the most important of these flags is the Synchronize flag, used by two endpoints to signify that a virtual circuit is being established.

For example, the first command segment sent by the client in Figure 7-5 would have the Synchronize flag enabled. This flag tells the server's TCP software that this is a new connection request. In addition, this command segment will also


		Figure 7-5. A TCP virtual circuit being established

provide the starting byte number (called the 뱒equence number? that the client will use when sending data to the server, with this data being provided in the Sequence Identifier field of the TCP header.

If the server is willing to establish a virtual circuit with the client, then it will respond with its own command segment that also contains the Synchronize flag and that also gives the starting sequence number that the server will use when sending data back to the client. This command segment will also have the Acknowledgment flag enabled, with the Acknowledgment Identifier field pointing to the client's next-expected sequence number.

The client will then return a command segment with the Acknowledge flag enabled to the server, and with its Acknowledgment Identifier field pointing to the server's next-expected sequence number. Note that this segment does not have the Synchronize flag enabled, since the virtual circuit is now considered up and operational, with both systems now being able to exchange data as needed.

It is entirely possible for two systems to issue active opens to each other simultaneously, although this scenario is extremely rare (I know of no applications that do this purposefully). In theory, such an event is possible, although it probably happens only on very slow networks where the circuit-setup messages pass each other on the wire.

For more information on the Synchronize and Acknowledgment flags, refer to 밅ontrol Flags?later in this chapter. For more information on the sequence and acknowledgment numbers, refer to 밨eliability?also later in this chapter.

Exchanging data

Once a virtual circuit has been established, the applications in use can begin exchanging data with each other. However, it is important to note that applications do not exchange data directly. Rather, each application hands data to its local TCP provider, identifying the specific destination socket that the data is for, and TCP does the rest.

Applications can pass data to TCP in chunks or as a contiguous byte-stream. Most TCP implementations provide a 뱖rite?service that is restricted in size, forcing applications to write data in blocks, just as if they were writing data to a file on the local hard drive. However, TCP's buffering design also supports application writes that are contiguous, and this design is used in a handful of implementations.

TCP stores the data that it receives into a local send buffer. Periodically, a chunk of data will get sent to the destination system. The recipient TCP software will then store this data into a receive buffer, where it will be eventually passed to the destination application.

For example, whenever a web browser issues an HTTP 밎ET?request, the request is passed to TCP as application data. TCP stores the data into a send buffer, packaging it up with any other data that is bound for the destination socket. The data then gets bundled into an IP datagram and sent to the destination system. The recipient's TCP provider then takes the data and passes it up to the web server, which fetches the requested document and hands it off to TCP. TCP sends chunks of the document data back to the client in multiple IP packets, where it is queued up and then handed to the application.

This concept is outlined in Figure 7-6, which shows an HTTP client asking for a document from a remote HTTP server. Once the TCP virtual circuit is established, the HTTP client writes 밎ET document?into the local send buffer associated with the virtual circuit in use by the client. TCP then puts this data into a TCP segment (creating the appropriate TCP headers), and sends it on to the specified destination system via IP. The HTTP server at the other end of the connection would then take the same series of steps when returning the requested document back to the client.

The important thing to remember here is that application data is transmitted as independent TCP segments, each of which requires acknowledgments. It is at this layer that TCP's reliability and flow control services are most visible.


		Figure 7-6. Data being exchanged over a TCP virtual circuit

For more information on how TCP converts application data into IP datagrams, refer ahead to 밡etwork I/O Management.?/P>

Closing a circuit

Once the applications have exchanged all of their data, the circuit can be closed. Closing a circuit is similar to opening one, in that an application must request the action (except in those cases where the connection has collapsed, and TCP is forced to terminate it).

Either end of the connection may close the circuit at any time, using a variety of different means. The two common ways to close are 밶ctive closes?that initiate a shutdown sequence and 뱎assive closes?that respond to an active close request.

Just as building a circuit requires a bidirectional exchange of special command segments, so does closing it. One end of the connection requests that the circuit be closed (the active close at work). The remote system then acknowledges the termination request and responds with its own termination request (the passive close). The terminating system then acknowledges the acknowledgment, and both endpoints drop the circuit. At this point, neither system is able to send any more data over the virtual circuit.

Figure 7-7 shows this process in detail. Once the HTTP client has received all of the data, it requests that the virtual circuit be closed. The HTTP server then returns an acknowledgment for the shutdown request, and also sends its own termination request. When the server's shutdown request is received by the client, the client issues a final acknowledgment, and begins closing its end of the circuit. Once the final acknowledgment is received by the server, the server shuts down whatever is left of the circuit. By this point, the connection is completely closed.


		Figure 7-7. A TCP virtual circuit being torn down

Just as TCP uses special Synchronize flags in the circuit-setup command segments, TCP also has special Finish flags that it uses when terminating a virtual circuit. The side issuing the active close sends a command segment with the Finish flag enabled and with a sequence number that is one byte higher than the last sequence number used by that side during the connection. The destination system responds with a command segment that also has the Finish flag enabled and with its sequence number also incremented by one. In addition, the Acknowledgment Identifier field for this response will still point to the next-expected sequence number, even though no other data should be forthcoming. In this regard, the Finish flag is considered to be one byte of data (just like the Synchronize flag), and as such must be acknowledged explicitly.

Once the Finish segments have been exchanged, the terminating system must respond with a final acknowledgment for the last Finish segment, with the

Acknowledgment Identifier also pointing to the next-expected sequence number (even though there should not be another segment coming). However, this last segment will not have the Finish flag enabled, since the circuit is considered to be 밺own?and out of action by this point.

It's important to note that either endpoint can initiate the circuit-termination process, and there are no hard and fast rules for which end should do it, although typically it is left to the client to perform this service since it may make multiple requests over a single connection. POP3 is a good example of this process, as POP3 allows a client to submit multiple commands during a single session. The client would need to dictate when the circuit should be closed with this type of application. However, sometimes a server issues the active close. For example, Gopher servers close the virtual circuit after sending whatever data has been requested, as do HTTP 1.0 servers.

It's also important to note that 뱒erver?applications keeps the port open until the application itself is terminated, allowing other clients to continue connecting to that server. However, the individual circuits will be torn down on a per-connection basis, according to the process described above.

Sometimes, the two systems do not close their ends of the circuit simultaneously. This results in a staggered close—also known as a 밾alf-close?#151;with each end issuing passive close requests at different times. One example of this type can be found in the rsh utility, which is used to submit shell commands to rsh servers. On some systems, once an rsh command has been sent the client will close its end of the connection, effectively switching the virtual circuit into half-duplex mode. The server will then process the shell command, send the results back to the client (for display or further processing), and then close its end of the connection. Once both ends have been closed, the circuit is dropped.

Another option for closing a circuit is to simply drop it without going through an orderly shutdown. Although this method will likely cause unnecessary traffic, it is not uncommon. Typically, this method should only happen if an application is abruptly terminated. If an application needs to immediately close a circuit without going through the normal shutdown sequence, then it will request an immediate termination, and TCP will issue a segment with the Reset flag set, informing the other end that the connection is being killed immediately.

For more information on the Finish, Reset, and Acknowledgment flags, refer ahead to 밅ontrol Flags.?For more information on the sequence and acknowledgment numbers, refer ahead to 밨eliability.?/P>

Application design issues

Some applications open a connection and keep it open for long periods of time, while others open and close connections rapidly, using many circuits for a single operation.

For example, if you instruct your web browser to open a document from an HTTP 1.0 Web server, the HTTP client issues an active open to the destination HTTP 1.0 server, which then sends the document to the client and close the TCP connection. If there are any graphic objects on that document, the HTTP client has to open multiple unique connections for each of those objects. Thus, opening a single web page could easily result in twenty or more circuits being established and destroyed, depending on the number of objects embedded in the requested web page.

Since this model generates a lot of traffic (and uses a lot of network resources on the server), this process was changed with HTTP 1.1, which now allows a single circuit to be used for multiple operations. With HTTP 1.1, a client may request a page and then reuse the existing circuit to download objects embedded within that page. This model results in significantly fewer virtual circuits being used, although it also makes the download process synchronous rather than asynchronous.

Most applications use a single circuit for everything, keeping that circuit open even when there may not be any noticeable activity. TELNET is one example of this, where the TELNET client will issue an active open during the initial connection, and then use that virtual circuit for everything until the connection is terminated. After logging in, the user may get up and walk away from the client system, and thus no activity may occur for an extended period of time, although the TCP connection between the two systems would remain active.

Whether the circuits are torn down immediately or kept open for extended periods of time is really a function of the application's design goal, rather than anything mandated by TCP. It is entirely possible for clients to open and close connections rapidly (as seen with web browsers that use individual circuits for every element in a downloaded document), or to open a single connection and maintain it in perpetuity (as seen with TELNET).

Keep-alives

Although RFC 793 does not make any provision for a keep-alive mechanism, some TCP implementations provide one anyway. There are good reasons for doing this, and bad ones as well.

By design, TCP keep-alives are supposed to be used to detect when one of the TCP endpoints has disappeared without closing the connection. This feature is particularly useful for applications where the client may be inactive for long periods of time (such as TELNET), and there's no way to tell whether the connection is still valid.

For example, if a PC running a TELNET client were powered off, the client would not close the virtual circuit gracefully. Unfortunately, when that happened the TELNET server would never know that the other end had disappeared. Long periods of inactivity are common with TELNET, so not getting any data from the client for an extended period would not cause any alarms on the TELNET server itself. Furthermore, since the TELNET server wouldn't normally send unsolicited data to the client, it would never detect a failure from a lack of acknowledgments either. Thus, the connection might stay open for infinity, consuming system resources for no good purpose.

TCP keep-alives allow servers to check up on clients periodically. If no response is received from the remote endpoint, then the circuit is considered invalid and will be released.

RFC 1122 states that keep-alives are entirely optional, should be user-configurable, and should be implemented only within server-side applications that will suffer real harm if the client were to disappear. Although implementations vary, RFC 1122 also states that keep-alive segments should not contain any data, but may be configured to send one byte of data if required for compatibility with noncompliant implementations.

Most systems use an unsolicited command segment for this task, with the sequence number of the command segment set to one byte less than the sequence number of the next byte of data to be sent, effectively reusing the last sequence number of the last byte of data sent over the virtual circuit. This design effectively forces the remote endpoint to issue a duplicate acknowledgment for the last byte of data that was sent over that connection. When the acknowledgment arrives, the server knows that the client is still there and operational. If no response comes back after a few such tests, then the server can drop the circuit.

Network I/O Management

When an application needs to send data to another application over TCP, it writes the data to the local TCP provider, which queues the data into a send buffer. Periodically, TCP packages portions of the data into bundles (called 뱒egments?, and passes them off to IP for delivery to the destination system, as illustrated in Figure 7-8.

Although this process sounds simple, it involves a lot of work, primarily due to segment-sizing issues that TCP has to deal with. For every segment that gets created, TCP has to determine the most efficient segment size to use at that particular moment, which is an extremely complex affair involving may different factors.


		Figure 7-8. An overview of TCP's data-encapsulation process

However, this is also an extremely important service, since accurately determining the size of a segment dictates many of the performance characteristics of a virtual circuit.

For example, making the segment too small wastes network bandwidth. Every TCP segment contains at least 40 bytes of overhead for the IP and TCP headers, so if a segment only contained one byte of data, then the byte ratio of headers-to-data for that segment would be 40:1, a miserable level of throughput by anybody's standard. Conversely, sending 400 bytes of data would change this ratio to 1:10, which is better, although still not very good. Sending four kilobytes of data would change this ratio to 1:100, which would provide excellent utilization of the network's capacity.

On the other hand, sending too much data in a segment can cripple performance as well. If a segment were too big for the IP datagram to travel across the network (due to topology-specific restrictions), then the IP datagram itself would have to be fragmented in order for it to get across that network. This situation would not only require additional processing time on the router that had to fragment the packet, but it would also introduce delays to the destination system, as the receiving IP stack would have to wait for all of the IP fragments to arrive and be reassembled before the TCP segment could be passed to TCP for processing.

In addition, fragmentation also introduces reliability concerns. If a fragment is lost, then the recipient's fragmentation timers have to expire, and an ICMP Time Exceeded error message has to be issued. Then the sender has to resend the entire datagram, which is likely to result in even more fragmentation occurring. Furthermore, on networks that experience known levels of packet loss, fragmentation increases the network's exposure to damage, since a single lost fragment will destroy a large block of data. But if the data were sent as discrete packets to begin with, the same lost packet would result in only that one small segment being lost, which would take less time to recover from. For all of these reasons, avoiding fragmentation is also a critical function of accurately determining the most effective segment size for any given virtual circuit.

Determining the most effective segment size involves the following factors:

Send buffer size
The most obvious part of this equation involves the size of the send buffer on the local system. If the send buffer fills up, then a segment must be sent in order to make space in the queue for more data, regardless of any other factors.

Receive buffer size
Similarly, the size of the receive buffer on the destination system is also a concern, as sending more data than the recipient can handle would cause overruns, resulting in the retransmission of lost segments.

MTU and MRU sizes
TCP also has to take into consideration the maximum amount of data that an IP datagram can handle, as determined by the Maximum Transfer Unit (MTU) size of the physical medium in use on the local network, the Maximum Receive Unit (MRU) size of the destination system's network connection, and the MTU/MRU sizes of all the intermediary networks in between the two endpoint systems. If a datagram is generated that is too large for the end-to-end network to handle, then fragmentation would definitely occur, penalizing performance and reliability.

Header size
IP datagrams have headers, which will steal anywhere from 20 to 60 bytes of data from the segment. Likewise, TCP also has variable-length headers which will steal another 20 to 60 bytes of space. TCP has to leave room for the IP and TCP headers in the segments that get created, otherwise the datagram would be too large for the network to handle, and fragmentation would occur.

Data size and timeliness
The frequency at which queued data is sent is determined by the rate at which data is being generated. Obviously, if lots of data is being generated by an application, then lots of TCP segments will need to be sent quickly. Conversely, small trickles of data will still need to be sent in a timely manner, although this would result in very small segments. In addition, sometimes an application will request that data be sent immediately, bypassing the queue entirely.

Taking all of these variables into consideration, the formula for determining the most efficient segment size can be stated as follows:

MESS = (lesser of (send buffer, receive buffer, MTU, or MRU)) - headers) or (data + headers)

Simply put, the most efficient segment size is determined by finding the lowest available unit of storage (send buffers, receive buffers, or the MTU/MRU values in use) minus the required number of bytes for the IP and TCP headers, except in those situations where there is only a little bit of data to send. In that case, the size of the data (plus the required headers) will determine the size of the segment that is being sent.

By limiting the segment size to the smallest available unit of storage, the segment can be sent from one endpoint to another without having to worry about fragmentation. In turn, this allows TCP to use the largest possible segment for sending data that can be sent end-to-end, which allows the most amount of data to be sent in the least amount of time.

Buffer size considerations

Part of determining the most efficient segment size is derived from the size of the send and receive buffers in use on the two systems. If the send buffer is very small, then the sender cannot build a very large segment. Similarly, if the receive buffer is small, then the sender cannot transmit a large segment (even if it could build one), as that would cause overruns at the destination system, which would eventually require the data to be retransmitted.

Every system has a different default buffer size, depending upon its configuration. Most PC-based client systems have eight kilobyte send and receive buffers, while many server-class systems have buffers of 16 kilobytes or more. It is not uncommon for high-end servers to have 32 or 48 kilobyte send and receive buffers. However, most systems will let you specify the default size for the receive buffers on your system, and they will also let the application developer configure specific settings for their particular application.

Sometimes the size of the local system's send buffer is the bottleneck. If the send buffer is very small, then the sending device just won't be able to generate large segments, regardless of the amount of data being written, the size of the receive buffer, or the MTU/MRU sizes in use on the two networks. Typically, this is not the case, although it can be in some situations, particularly with small hand-held computers that have very limited system resources.

Similarly, sometimes the size of the receive buffers in use at the destination system will be the limiting factor. If the receive buffer on the destination system is very small, then the sender must restrict the amount of data that it pushes to the receiving endpoint. This is also uncommon, but is not unheard of High-speed Token Ring networks are capable of supporting MTUs of 16 kilobytes and more, while the PCs attached to those networks may only have TCP receive buffers of eight kilobytes. In this situation, the segment size would be restricted to the available buffer space (eight kilobytes), rather than the MTU/MRU capabilities of the network (16 kilobytes).

Obviously, a sender already knows the size of its send buffers, but the sender also has to determine the size of the recipient's receive buffer before it can use that information in its segment-sizing calculations. This is achieved through the use of a 16-bit 밯indow?field that is stored in the header of every TCP segment that gets sent across a virtual circuit. Whenever a TCP segment is created, the sending end-point stores the current size of their receive buffers into the Window field, and the recipient then reads this information once the segment arrives. This allows each system to constantly monitor the size of the remote system's receive buffers, thereby allowing them to determine the maximum amount of data that can be sent at any given time.

In addition, the Window field is only 16 bits long, which limits the size of a receive buffer to a maximum of 64 kilobytes. RFC 1323 defines a TCP option called the Window Scale option that allows two endpoints to negotiate 30-bit window sizes, allowing for sizes up to one gigabyte to be advertised.

For more information on the Window field, refer to 밯indow.?For information on how to calculate the optimal default receive window size for your system, refer to 밡otes on Determining the Optimal Receive Window Size.?For more information on how the window value affects flow control, refer to 밨eceive window size adjustments.?For more information on the TCP Window Scale option, refer to 밯indow Scale.?All of these sections appear later in this chapter.

MTU and MRU size considerations.

Although buffer sizing issues can have an impact on the size of any given segment at any given time, most of the time the deciding factor for segment sizes is based on the size of the MTU and MRU in use by the end-to-end network connection.

For example, even the weakest of systems will have a TCP receive buffer of two or more kilobytes, while the MTU/MRU for Ethernet networks is only 1.5 kilobytes. In this case (and almost all others), the MTU/MRU of the Ethernet segment will determine the maximum segment size for that system, since it indicates the largest amount of data that can be sent in a single datagram without causing fragmentation to occur.

Typically, the MTU and MRU sizes for a particular network are the same values. For example, Ethernet networks have an MTU/MRU of 1500 bytes, and both of

these values are fixed. However, many dial-up networks allow an endpoint system to define different MTU and MRU sizes. In particular, many dial-up systems set the MTU to be quite small, while also setting the MRU to be quite large. This imbalance can actually help to improve the overall performance of the client, making it snappier than a fixed, medium-sized MTU/MRU pair would allow for.

To understand why this is so, you have to understand that most dial-up systems are clients, using applications such as POP3 and TELNET to retrieve large amounts of data from remote servers. Having a small MTU size forces the client to send segments quickly, since the MTU is the bottleneck in the segment-sizing calculations. Conversely, having a large MRU on a dial-up circuit allows the client to advertise a larger receive value, thereby letting the server send larger blocks of data down to the client. Taken together, the combination of a small MTU and a large MRU allows a dial-up client to send data quickly while also allowing it to download data in large chunks.

For example, one endpoint may be connected via a dial-up modem using a 1500-byte MRU, while the other node may be connected to a Token Ring network with a four-kilobyte MTU, as shown in Figure 7-9. In this example, the 1500-byte MRU would be the limiting factor when data was being sent to the dial-up client, since it represented the bottleneck. Furthermore, if the dial-up client had a 576 byte MTU (regardless of the 1500-byte MRU), then that value would be the limiting factor when data was being sent from the dial-up client up to the Token Ring-attached device.


		Figure 7-9. An overview of the segment-sizing process, using MTU and MRU values

Regardless of whether or not the client has a large or small MTU, it should be obvious that senders have to take the remote system's MRU into consideration when determining the most efficient segment size for a virtual circuit. At the same time, however, the sender also has to worry about the size of its local MTU. Both of these factors will determine the largest possible segment allowable on any given virtual circuit.

In order for all of this to work, both systems have to be able to determine each other's MRU sizes (they already know their own MTU sizes), and then independently calculate the maximum segment sizes that are allowed for the virtual circuit.

This determination is achieved by each system advertising its local MRU during the circuit-setup sequence. When each system sends its TCP start segments, it also includes its local MRU size (minus forty bytes for the IP and TCP headers) in those segments, using a TCP option called the Maximum Segment Size option. Since each system advertises its MRU in the start segments, it is a simple procedure for each of the systems to read the values and compare it with its own MTU values.

In truth, the MSS value advertised in the MSS option field tends to be based on the sender's MTU, rather than the MRU. Only a handful of systems actually use the MRU for their MSS advertisements. Although RFC 732 states that the MSS should be derived from the MRU, RFC 1122 clarified this position, stating that the MSS should be derived from the largest segment size that could be reassembled, which could be just about any value (although most implementations set this to the MTU size). Also, since most networks have fixed MTU/MRU pairs, most vendors set this value to the MTU size, knowing that it is the largest segment they can send. While this probably isn't the most technically accurate approach, it is what most implementations have chosen.

Note that RFC 793 states that the use of the MSS option is entirely optional, and therefore not required. If a system did not include an MSS option in its start segments, then a default value of 536 bytes (which is 576 bytes minus 40 bytes for the TCP and IP headers) should be used as the default. However, RFC 1122 reversed this position, stating that the MSS option is mandatory and must be implemented by all TCP providers.

Also note that some BSD-based systems can send only segments with lengths that are multiples of 512 bytes. So, even if an MTU of 576 bytes were available, the segments generated by these systems would be only 512 bytes long. Similarly, circuits capable of supporting MTU sizes of 1.5 kilobytes would use segments of only 1,024 bytes in length.

For a list of the default MTU sizes used with the most-common network topologies, refer to Table 2-5 in Chapter 2, The Internet Protocol. For more information on the MSS option, refer to 밠aximum Segment Size.?/P>

Path MTU discovery

Even though TCP systems are able to determine the MTU values in use by the endpoints on a virtual circuit, they are not be able to determine the MTU sizes of the networks in between the two endpoints, which may be smaller than the MTU/MRU values in use at either of the endpoint networks. In this scenario, fragmentation would still occur, since the MTU of the intermediary system would require that the IP datagrams be fragmented.

For example, if two systems are both on Token Ring networks using four-kilobyte MTUs, but there is an Ethernet network between them with a 1.5 kilobyte MTU, then fragmentation will occur when the four-kilobyte IP datagrams are sent over the 1.5 kilobyte Ethernet network. This process will lower the overall performance of the virtual circuit and may introduce some reliability problems.

By itself, TCP does not provide any means for determining the MTU of an intermediate network, and must rely on external means to discover the problem. One solution to this problem is to use a technique called Path MTU Discovery, which incorporates the IP Don't Fragment bit and the ICMP Destination Unreachable: Fragmentation Required error message to determine the MTU of the end-to-end IP network.

Essentially, Path MTU Discovery works by having one system create an IP packet of the largest possible size (as determined by the MTU/MRU pair for the virtual circuit), and then setting the Don't Fragment flag on the first IP packet. If the packet is rejected by an intermediary device (due to the packet being too large to forward without being fragmented), then the sender will try to resend the packet using a smaller segment size.

This procedure is repeated until ICMP errors stop coming back. At this point, the sender could use the size of the last-tested packet as the MTU for the entire network. Unfortunately, some systems assume that 뱊o error messages?means that the packet was delivered successfully, without conducting any further testing to verify the theory. However, some routers and firewalls do not return ICMP errors (due to security concerns or configuration errors), which may result in the ICMP errors not being returned to the sender.

This unreliability can cause a situation known as 밣ath MTU Black Hole,?where the sender has chosen to use an MTU that is too large for the end-to-end network, but the network is unable or unwilling to inform the sender of the problem. In this scenario, the sender continues sending data with an MTU that is too large for the intermediary network to forward without being fragmented (which is prohibited by the sender). Some implementations are aware of this problem, and if it appears that packets are not getting through then they reduce the size of the segments that they generate until acknowledgments are returned, or they clear the Don't Fragment flag, allowing fragmentation to occur.

For a complete discussion on this subject, refer to 밡otes on Path MTU Discovery?in Chapter 5, The Internet Control Message Protocol.

Header size considerations

As we discussed in Chapter 2, The Internet Protocol, most IP packets have a 20-byte header, with a maximum of 60 bytes being used for this data. TCP segments also have their own header information, with a minimum value of 20 bytes (the most common), and a maximum size of 60 bytes. Taken together, most TCP/IP datagram have 40 bytes of header data (20 from IP and 20 from TCP), with the maximum amount of header data being limited to 120 bytes (60 bytes from IP and TCP each).

Whenever TCP creates a segment, it must leave room for these headers. Otherwise, the IP packet that was generated would exceed the MTU/MRU pair in use on that virtual circuit, resulting in fragmentation.

Although RFC 1122 states that TCP implementations must set aside 40 bytes of data when a segment is created, this isn't always enough. For example, some of the new advanced TCP options utilize an additional 10 or more bytes. If this information isn't taken into consideration, then fragmentation will likely occur.

TCP is able to determine much of this information, but not always. If the underlying IP stack also utilizes IP options that TCP is not aware of, then TCP will not make room for them when segments are created. This will also likely result in fragmentation.

For more information on IP header sizes, refer to 밫he IP Header?in Chapter 2, The Internet Protocol. For more information on TCP header sizes, refer to 밫he TCP Header?later in this chapter.

Data considerations

Remember that applications write data to TCP, which then stores the data into a local send buffer, generating a new segment whenever it is convenient or prudent to do so. Although segment sizes are typically calculated based on the available buffer space and MTU/MRU values associated with a given virtual circuit, sometimes the nature of the data itself mandates that a segment be generated, even if that segment won't be the most efficient size.

For example, if an application sends only a little bit of data, then TCP will not be able to create a large segment since there just isn't much data to send to the remote endpoint. This is regardless of the inefficiencies of sending small amounts of data; if there isn't a lot of data to send, TCP can't send large segments.

The decision process that TCP goes through to figure out when to send small amounts of data incorporates many different factors. If an application is able to tell TCP how much data is being written—and if TCP isn't busy doing other stuff—then TCP could choose to send the data immediately. Conversely, TCP could choose to just sit on the data, waiting for more data to arrive.

Sometimes, an application knows that it will be sending only a little bit of data, and can explicitly tell TCP to immediately send whatever data is being written. This service is provided through the use of a 뱎ush?service within TCP, allowing an application to tell TCP to go ahead and immediately send whatever data it gets.

The push service is required whenever an application needs to tell TCP that only a small amount of data is being written to the send buffer. This is most often seen with client applications such as POP3 or HTTP that send only a few bytes of data to a server, but it can also be seen from servers that write a lot of data. For example, if an HTTP server needed to send more data than would fit within a segment; the balance of the data would have to be sent in a separate (small) segment. Once the HTTP server got to the end of the data, it would tell TCP that it was finished and to go ahead and send the data without waiting for more. This step would be achieved by the application setting the Push flag during the final write.

Some applications cause the Push flag to be set quite frequently. For example, some TELNET clients will set the Push flag on every keystroke, causing the client to send the keystroke quickly, thereby causing the server to echo the text back to user's display quickly.

Once TCP gets data that has been pushed, it stores the data in a regular TCP segment, but it also sets a Push flag within that segment's TCP header. This allows the remote endpoint to also see that the data is being pushed. This is an important service, since the Push flag also affects the receiving system's segment-handling process. Just as a sending TCP will wait for more data to arrive from an application before generating a segment, a receiving TCP will sometimes wait for more segments to arrive before passing the data to the destination application. But if a receiver gets a segment with the Push flag set, then it is supposed to go ahead and send the data to the application without waiting for any more segments to arrive.

An interesting (but somewhat irrelevant) detail about the Push flag is that the practical usage is quite a bit different from the behavior defined in the standards. Although RFC 793 states that 밃 sending TCP is allowed to collect data ?until the push function is signaled, then it must send all unsent data,?most TCP implementations do not allow applications to set the Push flag directly. Instead, most TCP implementations simply send data as they receive it (most of the time, applications write data to TCP in chunks rather than in continuous streams), and TCP will set the Push flag in the last segment that it sends. Some implementations will even set the Push flag on every segment that they send.

Similarly, many implementations ignore the Push flag on data they receive, immediately notifying the listening application of all new data, regardless of whether the Push flag is set on those segments.

Another interesting flag within the TCP header is the Urgent flag. The Urgent flag can be used by an application whenever it needs to send data that must be dealt with immediately. If an application requests that a segment be sent using the Urgent flag, then TCP is supposed to place that segment at the front of the send queue, sending it out as soon as possible. In addition, the recipient is supposed to read that segment ahead of any other segments that may be waiting to be processed in the receive buffer.

Urgent data is often seen with TELNET, which has some standardized elements that rely on the use of the TCP Urgent flag. Some of the standardized control characters used with TELNET (such as interrupt process and abort output) have speccific behavioral requirements that benefit greatly from the 뱋ut-of-stream?processing that the Urgent flag defines. For example, if a user were to send an interrupt process signal to the remote host and flag this data for Urgent handling, then the control character would be passed to the front of the queue and acted upon immediately, allowing the output to be flushed faster than would otherwise happen.

However, the use of the Urgent flag has been plagued by incompatibility problems ever since RFC 793 was first published. The original wording of that document did not clarify where the urgent data should be placed in the segment, so some systems put it in one place while other systems put it in another. The wording was clarified in RFC 1122, which stated that the urgent pointer points to the last byte of data in the stream. Also of interest is the fact that the urgent pointer can refer to a byte location somewhere up ahead in the stream, in a future segment. All of the data up to and including the byte position specified by the urgent pointer are to be treated as a part of the urgent block. Unfortunately, some systems (such as BSD and its derivatives) still do not follow this model, resulting in an ongoing set of interoperability problems with this flag in particular.

For more information on the Push and Urgent flags, refer to 밅ontrol Flags?later in this chapter.

Flow Control

When an application needs to send data to another application over TCP, it writes the data to the local TCP provider, which queues the data into a send buffer. Periodically, TCP will package portions of the data into segments and pass them off to IP for delivery to the destination system.

One of the key elements to this process is flow control, where a sending system will adjust the rate at which it tries to send data to the destination system. A change in rate may be required due to a variety of reasons, including the available buffer space on the destination system and the packet-handling characteristics of the network. For this reason, TCP incorporates a variety of flow control mechanisms, allowing the sending system to react to these changes easily.

Originally, RFC 793 proposed only a handful of flow control mechanisms, most of which were focused on the receiving end of the connection. Of these services, the two most important were:

Receive window sizing
TCP can send only as much data as a receiver will allow, based on the amount of space available in the remote system's receive buffer, the frequency at which the buffers are drained, and other related factors. Therefore, one way for a receiver to adjust the transfer rate is to increase or decrease the size of the buffer being advertised. This in turn controls how much data a sender can transmit at once.

Sliding receive windows
In addition to the Window size being advertised by a receiver, the concept of a 뱒liding window?allows the sender to transmit segments 뱋n credit,?before acknowledgments have arrived for segments that were already sent. This lets an endpoint send data even though the preceding data has not yet been acknowledged, trusting that an acknowledgment will arrive for that data shortly.

These mechanisms put the destination system in charge of controlling the rate at which the sender transmits data. As the original theory went, the receiver was likely to be the point of congestion in any transfer operation, and as such needed to have the last word on the rate at which data was being sent.

Over time however, the need for sender-based flow control mechanisms has been proven, particularly since network outages may occur, which will require the sender to reduce its rate of transmission, even though the receiving system may be running smoothly. For this reason, RFC 1122 mandated that a variety of networkrelated flow control services also be implemented. Among these services are:

Congestion window sizing
In order to deal with congestion-related issues, the use of a 밹ongestion window?is required at the sending system. The congestion window is similar in concept to the receive window in that it is expanded and contracted, although these actions are taken according to the underlying IP network's ability to handle the quantity of data being sent, rather than the recipient's ability to process the data.

Slow start
In an effort to keep congestion from occurring in the first place, a sender must first determine the capabilities of the IP network before it starts sending mass quantities of data over a newly established virtual circuit. This is the purpose of slow start, which works by setting the congestion window to a small size and gradually increasing its size, until the network's saturation point is found.

Congestion avoidance
Whenever network congestion is detected, the congestion window is reduced, and a technique called congestion avoidance is used to gradually rebuild the size of the congestion window, eventually returning it to its maximum size. When used in conjunction with slow start, this helps the sender to determine the optimal transfer rate of a virtual circuit.

Taken together, the use of the receive and congestion windows gives a sending system a fairly complete view of the state of the network, including the state of both the recipient and the congestion on the network.

A note on local blocking

Although there are a variety of flow control mechanisms found with TCP, the simplest form of flow control is 뱇ocal blocking,?whereby a sending system refuses to accept data from a local application. This feature is needed whenever TCP knows that it cannot deliver any data to a specific destination system—perhaps due to problems with the receiver or the network—and the local send buffer is already full. Having nowhere to send the data, TCP must refuse to accept any new data from the sending application.

Note that TCP cannot block incoming network traffic (coming from IP). Since TCP is unable to tell which application a segment is destined for until its contents have been examined, TCP must accept every segment that it gets from IP. However, TCP may be unable to deliver the data to the destination application, due to a full queue or some other temporary condition. If this happens, TCP could choose to discard the segment, thereby causing the sender to retry the operation later (an effort which may or may not succeed).

Receive window size adjustments

In the section entitled 밡etwork I/O Management,?I first mentioned the TCP header's Window field, suggesting that it provided an insight into the size of the receive buffer in use on a destination system. Although this is an accurate assessment when looking at TCP's segment sizing process, the primary purpose of the Window field is to provide the receiving system with flow control management services. The Window field is used to tell a sender how much data a recipient can handle. In this model, the recipient dictates flow control.

According to RFC 793, the window field 뱒pecifies the number of octets ?that the receiving TCP is currently prepared to receive.?In this scenario, a sending system can transmit only as much data as will fit within the recipient's receive buffer (as specified by the Window field) before an acknowledgment is required. Once the sender has transmitted enough data to fill the receive buffer, it must stop sending data and wait for an acknowledgment from the recipient before sending any more data.

Therefore, one way to speed up and slow down the data transfer rate between the two endpoint systems is for the receiving system to change the buffer size being advertised in the Window field. If a system that had been advertising an eight-kilo398byte window suddenly started advertising a 16-kilobyte window, the sender could pump twice as much data through the circuit before having to wait for an acknowledgment.

Conversely, if the recipient started advertising a four-kilobyte window, then the sender could transmit only half as much data before requiring an acknowledgment (this would be enforced by the sender's TCP stack, which would start blocking writes from the sending application when this occurred).

An important consideration here is that recipients are not allowed to arbitrarily reduce their window size, but instead are only supposed to shrink the advertised window when they have received data which has not yet been processed by the destination application. Arbitrarily reducing the size of the receive window can result in a situation where the sender has already sent a bunch of data in accordance with the window size that was last advertised. If the recipient were to suddenly reduce the window size, then some of the segments would probably get rejected, requiring the sender to retransmit the lost data.

What happens when the receive buffer goes to zero, effectively preventing the sender from sending any data whatsoever? The answer varies by implementation, but generally speaking the sender will simply stop sending data until the receiver is ready to take data again. Any segments that were already sent may be rejected (or may get accepted), and as such the sender will have to deal with this issue when the window opens again.


		Also, many systems implement an incremental fall-back timer, where they will probe the receiver for a window update periodically whenever this situation occurs. In this scenario, the sender will probe the receiver, and if the size of the window is still zero, then the sender will double the size of its probe timer. Once the timer expires, the sender will probe the receiver again, and if the receive window is still zero, the timer will get doubled again. This process will continue as long as the probe results in an acknowledgement—even if the window remains at zero—up to an implementation-specific maximum (such as 64 or 128 seconds).

As soon as the stalled system is able to begin accepting more data, it is supposed to send an unsolicited acknowledgment to the remote system, advising it that the window is 뱋pen?again.

Since the Window field is included in the header of every TCP segment, advertising a different buffer size is a very straightforward affair. If the recipient is willing to speed up or if it needs to slow down, it simply changes the value being advertised in the Window field of any acknowledgment segment that is being returned, and the sender will notice the change as soon as the segment containing the new value is received. Note that there may be some delay in this process, as it may take a while for that segment to arrive.

The size of the buffer also affects the number of segments that can be received, in that the maximum number of available segments is the Window size divided by the maximum segment size. Typically, systems will set their window size to four times the segment size (or larger), so if a system is using one kilobyte segments, then the smallest window size you would want to use on that system would be four kilobytes.

Unfortunately, since the Window field is only 16 bits long, the maximum size that can be advertised is 65,535 bytes. Although this is plenty of buffer space for most applications, there are times when it just isn't enough (such as when the MTU of the local network is also 64 kilobytes, resulting in a Window that is equal to only a single segment). One way around this limitation is the Window Scale option, as defined in RFC 1323. The Window Scale option allows two endpoints to negotiate 30-bit window sizes, allowing up to one gigabyte of buffer space to be advertised.

While it may seem best to use very large window sizes, it is not always feasible or economical to do so. Each segment that is sent must be kept in memory until it has been acknowledged. A hand-held system may not have sufficient resources to cache many segments, and thus would have to use small window sizes in order to limit the amount of data being sent.

In addition, there is a point at which the size of the receive window no longer has any effect on throughput, but instead the bandwidth and delay characteristics of the virtual circuit become the limiting factors. Setting a value larger than necessary is simply a waste of resources and can also result in slower recovery. For example, if a sender sees a large receive window being advertised then it might try to fill that window, even though a router in between the two endpoints may not be able to forward the data very quickly. This delay can result in a substantial queue building up in the router, and if a segment ever does get lost, then it will take a long time for the recipient to notice the problem and the sender to correct it. This would result in extremely long gaps between retransmissions, and may also result in some of the queued data getting discarded (requiring even more retransmissions).

For more information on the Window field, refer to 밯indow.?For more information on the TCP Window Scale option, refer to 밯indow Scale.?For detailed instructions on how to calculate the most optimal window size for a particular connection, refer to 밡otes on Determining the Optimal Receive Window Size.?All of these sections appear later in this chapter.

Sliding receive windows

Even though large window sizes can help to increase overall throughput, they do not provide for sustained levels of throughput. In particular, if a situation required the use of a synchronous 뱒end-and-wait?design that required a system to send data and then stop to wait for an acknowledgment, the network would be quite jerky, with bursts of writes followed by long pauses. This problem is most noticeable on networks with high levels of latency that cause extended periods of delay between the two endpoints.

n an effort to avoid this type of scenario, RFC 1122 states that a recipient should issue an acknowledgment for every two segments that it receives, if not more often. This design causes the receiver to issue acknowledgments quickly. In turn, these acknowledgments arrive back to the sender quickly.

Once an acknowledgment has arrived back at the sending system, the outstanding data is cleared from the send queue, thereby letting the sender transmit more data. In effect, the sending system can 뱒lide?the window over by the number of segments that have been successfully acknowledged, allowing it to transmit more data, even though not all of the segments have been acknowledged yet.

As long as a sender continues receiving acknowledgments, it is able to continue sending data, with the maximum amount of outstanding segments being determined by the size of the recipient's receive buffer. This concept is illustrated in Figure 7-10, which shows how a sender can increment the sliding window whenever it receives an acknowledgment for previously sent data. For example, as the sender is transmitting segment number three, it receives an acknowledgment for segment number one, allowing the sender to move the send buffer forward by one segment.

The key element here is that the sender can transfer only as many bytes of data as the receiver can handle, as advertised in the Window field of the TCP headers sent by the recipient. If the recipient's receive window is set to eight kilobytes and the sender transmits eight one-kilobyte segments without having received an acknowledgment, then it must stop and wait for an acknowledgment before sending any more data.

However, if the sender receives an acknowledgment for the first two segments after having sent eight of them, then it can go ahead and send two more, since the window allows up to eight kilobytes to be in transit at any time. On networks with low levels of latency (such as Ethernet), this feature can have a dramatic impact on


		Figure 7-10. An overview of TCP's sliding window mechanism

overall performance, providing for sustained levels of high utilization. On networks with very high levels of latency (such as those that use satellite links), the effect is less pronounced, although it is still better than the send-and-wait effect that would otherwise be felt.

In situations where the window size is smaller than the MTU, a sliding window is harder to implement. Some systems will write only a single segment (up to the maximum allowed by the advertised receive buffer), and then stop to wait for an acknowledgment. Other systems will reduce the size of the local send buffers to half (or less) of the advertised receive window, thereby forcing multiple small segments to be written in an effort to increase the number of acknowledgments that are generated.

Another problem can occur with some TCP implementations that do not issue acknowledgments for every two segments that are received, but instead issue acknowledgments when they have received enough data to fill two maximum-sized segments. For example, if the system has a local MTU of 1500 bytes, but is receiving data in 500-byte chunks, then such a system would only issue acknowledgments for every six segments that arrive (6 × 500 = 3000, which is MTU times two). This process would result in a substantially slower acknowledgment cycle that could cause problems if the sender had a small send window. Although this problem is somewhat rare, it does happen.

For systems that implement this procedure correctly (sending acknowledgments for every two segments that are received, regardless of the maximum segment size), this design can substantially improve overall performance. By using the sliding window technique—and by using large windows—it is quite possible for two fast systems on a fast network to practically saturate the connection with data.

For more information on how frequent acknowledgments can impact performance, refer ahead to 밆elayed acknowledgments.?/P>

The Silly Window Syndrome

The amount of buffer space that a system advertises depends on how much buffer space it has available at that given moment, which is dependent upon how quickly applications can pull data out of the receive buffer. This in turn is driven by many factors, such as the complexity of the application in use, amount of CPU time available, the design of the TCP stack in use, and other elements.

Unfortunately, many of the first-generation TCP-based applications did a very poor job of cleaning out the receive buffers, taking only a few bytes at a time. The system only advertised a receive buffer of a few bytes. In turn, a sender would transmit only a very small segment, since that was all that was being advertised by the recipient. This process would repeat incessantly, with the recipient taking another few bytes out of the receive queue, advertising a small window, and then receiving yet another very small segment.

To prevent this scenario (known affectionately as the 밪illy Window Syndrome?, RFC 1122 clarified the amount of buffer space that could be advertised, stating that systems could only advertise a non-zero window if the amount of buffer space available could hold a complete segment (as defined by the value shown in the MSS option), or if the buffer space was at least half of the 뱊ormal?window size. If neither of these conditions are met, then the receiver should advertise a zero-length window, effectively forcing the sender to stop transmitting.

The Nagle algorithm.

The Silly Window Syndrome is indicative of a problem at the receiver's end of the virtual circuit. Data is not being read from the receive buffers quickly, resulting in small window sizes being advertised, which in turn causes the sender to transmit small segments. The result is that lots of network traffic gets generated for very small amounts of data.

However, a sending system can also cause these kinds of problems, although for totally different reasons. Some applications (such as TELNET) are designed to send many small segments in a constant barrage, which causes high levels of network utilization for small amounts of data. Other situations in which this is a problem are applications that write data only in small chunks, such as writing 10 megabytes of data in 512-byte blocks. The number of packets that will get generated in that model are extremely wasteful of bandwidth, particularly when this same transfer could be done using larger writes.

One solution proposed to this kind of problem is the Nagle algorithm, which was originally described in RFC 896. Simply put, the Nagle algorithm suggests that segments that are smaller than the maximum size allowed (as defined by the MSS option of the recipient or the discovered MTU of the end-to-end path) should be delayed until all prior segments have been acknowledged or until a full-sized segment can be sent. This rule forces TCP stacks to merge multiple small writes into a single write, which is then sent as a single segment.

On a low-latency LAN, the Nagle algorithm rarely comes into play, since a small segment will be sent and acknowledged very quickly, allowing another small segment to be sent immediately (effectively eliminating the use of the Nagle algorithm). On slow WAN links though, the Nagle algorithm comes into play quite often, since acknowledgments take a long time to be returned to the sender. This results in the next batch of small segments getting bundled together, providing a substantial increase in overall network efficiency.

For these reasons, use of the Nagle algorithm is encouraged by RFC 1122, although its usage is not mandatory. Some applications (such as X Windows) react poorly when small segments are clumped together. In those cases, users must have the option of disabling the Nagle algorithm on a per-circuit basis. However, most TCP implementations do not provide this capability, instead allowing users to enable or disable its use only on a global scale, or leaving it up to the application developer to decide when it is needed.

This limitation can be somewhat of a problem, since some developers have written programs that generate inefficient segment sizes frequently, and have then gone and disabled the use of the Nagle algorithm on those connections in an effort to improve performance, even though doing so results in much higher levels of network utilization (and doesn't do much to improve performance in the end). If those developers had just written their applications to use large writes instead of multiple small writes, then the Nagle algorithm would never come into effect, and the applications would perform better anyway.

Another interesting side effect that appears when the Nagle algorithm is disabled is that the delayed acknowledgment mechanism (as described later in 밆elayed acknowledgments? does not tend to work well when small segments are being generated, since it waits for two full-sized segments to arrive before returning an acknowledgment for those segments. If it does not receive full-sized segments due to a developer having turned off the Nagle algorithm, then the delayed acknowledgment mechanism will not kick in until a timer expires or until data is being returned to the sender (which the acknowledgments can piggyback onto).

This can be a particular problem when just a little bit of data needs to be sent. The sender will transmit the data, but the recipient will not acknowledge it until the timer expires, resulting in a very jerky session.

This situation can also happen when a small amount of data is being generated at the tail-end of a bulk transfer. However, the chances are good that in this situation the remote endpoint is going to generate some sort of data (such as a confirmation status code or a circuit-shutdown request). In that case, the delayed acknowledgment will piggyback onto whatever data is being returned, and the user will not notice any excess delays.

For all of these reasons, application developers are encouraged to write data in large, even multiples of the most-efficient segment size for any given connection, whenever that information is available. For example, if a virtual circuit has a maximum segment size of 1460 bytes (the norm for Ethernet), the application should write data in even multiples of 1460 (such as 2,920 byte blocks, or 5,840 byte blocks, and so forth). This way, TCP will generate an even number of efficiently sized segments, resulting in the Nagle algorithm never causing any delay whatsoever, and also preventing the delayed acknowledgment mechanism from holding up any acknowledgments.

Congestion window sizing

TCP's use of variable-length, sliding windows provides good flow control services to the receiving end of a virtual circuit. If the receiver starts having problems, it can slow down the rate at which data is being sent simply by scaling back the amount of buffer space being advertised. But if things are going well, the window can be scaled up, and traffic can flow as fast as the network will allow.

Sometimes, however, the network itself is the bottleneck. Remember that TCP segments are transmitted within IP packets, and that these packets can have their own problems outside of the virtual circuit. In particular, a forwarding device in between the two endpoints could be suffering from congestion problems, whereby it was receiving more data than it could forward, as is common with dial-up servers and application gateways.

When this occurs, the TCP segments will not arrive at their destination in a timely manner (if they make it there at all). In this scenario, the receiving system (and the virtual circuit) may be operating just fine, but problems with the underlying IP network are preventing segments from reaching their destination.

This problem is illustrated in Figure 7-11, which shows a device trying to send data to the remote endpoint, although another device on the network path is suffering from congestion problems, and has sent an ICMP Source Quench error message back to the sender, asking it to slow down the rate of data transfer.


		Figure 7-11. Detecting congestion with ICMP Source Quench error messages

Congestion problems can be recognized by the presence of an ICMP Source Quench error message, or by the recipient sending a series of duplicate acknowledgments (suggesting that a segment has been lost), or by the sender's acknowledgment timer reaching zero. When any of these problems occur, the sender must recognize them as being congestion-related, and take counter-measures that deal with them appropriately. Otherwise, if a sender were to simply retransmit segments that were lost due to congestion, the result would be even more congestion. Orderly congestion recovery is therefore required in order for TCP to maintain high performance levels, but without causing more congestion to occur.

At the heart of the congestion management process is a secondary variable called the 밹ongestion window?that resides on the sender's system. Like the receive window, the congestion window dictates how much data a sender can transmit without stopping to wait for an acknowledgment, although rather than being set by the receiver, the congestion window is set by the sender, according to the congestion characteristics of the IP network.

During normal operation, the congestion window is the same size as the receive window. Thus, the maximum transfer rate of a smooth-flowing network is still restricted by the amount of data that a receiver can handle. If congestion-related problems occur, however, then the size of the congestion window is reduced, thereby making the limiting factor the sender's capability to transmit, rather than the receiver's capability to read.

How aggressively the congestion window is reduced depends upon the event that triggered the resizing action:

?If congestion is detected by the presence of a series of duplicate acknowledgments, then the size of the congestion window is cut in half, severely restricting the sender's ability to transmit segments. TCP then utilizes a technique known as 밹ongestion avoidance?to slowly increment the size of the congestion window, cautiously ramping up the rate at which it can send data, until it returns to the full throttle state.

?If congestion is detected by the TCP acknowledgment timer reaching zero or by the presence of an ICMP Source Quench error message, then the congestion window is shrunk so small that only one segment can be sent. TCP uses a technique known as 뱒low start?to begin incrementing the size of the congestion window until it is half of its original size, at which point the congestion avoidance technique is called into action to complete the ramp-up process.

Slow start and congestion avoidance are similar in their recovery techniques. However, they are also somewhat different and are used different times. Slow start is used on every new connection—even those that haven't yet experienced any congestion—and whenever the congestion window has been dropped to just one segment. Conversely, congestion avoidance is used both to recover from non-fatal congestion-related events and to slow down the rate at which the congestion window is being expanded, allowing for smoother, more sensitive recovery procedures.

Slow start

One of the most common problems related to congestion is that senders attempt to transmit data as fast as they can, as soon as they can. When a user asks for a big file, the server gleefully tries to send it at full speed immediately.

While this might seem like it would help to complete the transfer quickly, in reality it tends to cause problems. If there are any bottlenecks between the sender and receiver, then this burst-mode form of delivery will find them very quickly, causing congestion problems immediately (most likely resulting in a dropped segment). The user may experience a sudden burst of data, followed by a sudden stop as their system attempts to recover one or more lost segments, followed by another sudden burst. Slow start is the technique used to avoid this particular scenario.

In addition, slow start is used to recover from near-fatal congestion errors, where the congestion window has been reset to one segment, due to an acknowledgment timer reaching zero, or from an ICMP Source Quench error message being received.

Slow start works by exponentially increasing the size of the congestion window. Every time a segment is sent and acknowledged, the size of the congestion window is increased by one segment's worth of data (as determined by the discovered MTU/MRU sizes of the virtual circuit), allowing for more and more data to be sent.

For example, if the congestion window is set to one segment (with the segment size being set to whatever value was determined during the setup process), a single segment will be transmitted. If this segment is acknowledged, then the congestion window is incremented by one, now allowing two segments to be transmitted simultaneously. The next two segments in the send buffer then get sent. If they are both acknowledged, they will each cause the congestion window to be incremented by one again, thus adding room for two more segments (with the congestion window being set to four segments total). All of the segments do not have to be acknowledged before the congestion window is incremented, as shown in Figure 7-12.


		Figure 7-12. An overview of the slow start algorithm

If a connection is new, then the process is repeated until congestion is detected or the size of the congestion window is equal to the size of the receive window, as advertised by the receiver's Window field. If the ramping process is successful, then the virtual circuit will eventually be running at full speed, with the flow control being dictated by the size of the recipient's receive buffer. But if congestion is detected during the incrementing process, the congestion window will be locked to the last successful size. Any further congestion problems will result in the congestion window being reduced (as per the process described earlier in 밅ongestion window sizing?.

However, if the slow start routines are being used to recover from a congestion event, then the slow start procedure is used only until the congestion window reaches half of its original size. At this point, the congestion avoidance technique is called upon to continue increasing the size of the congestion window (as described soon in 밅ongestion avoidance?. Since congestion is likely to occur again very quickly, TCP takes the more cautious, linear-growth approach outlined with congestion avoidance, as opposed to the ambitious, exponential growth provided with slow start.

Note that although RFC 1122 mandates the use of slow start with TCP, the procedure was not fully documented until RFC 2001 was published. Therefore, many of the earlier systems do not incorporate the slow start routines described here.

In addition, RFC 2414 advocates the use of four segments as the seed value for slow start, rather than the one segment proposed in RFC 2581 (TCP Congestion Control), which is arguably an improvement with applications that send more than one segment. For example, if an application needed to send two segments of data but the initial congestion window was locked at 뱋ne segment,?then the application could send only one of those segments. As such, the remote endpoint would not receive all of the application data, and the delayed acknowledgment mechanisms would force a long pause before an acknowledgment was returned. But by setting the initial congestion window to 뱓wo segments,?the sender can issue two full-sized segments, which will result in the recipient issuing an acknowledgment immediately. Although this allows connections to ramp-up faster, note that RFC 2414 is only experimental and is not required to be implemented in any shipping TCP implementations.

Congestion avoidance

The congestion avoidance routines are called whenever a system needs to use a slower, more sensitive form of congestion avoidance than the exponential mechanisms offered by the slow start procedure. A slow congestion avoidance mechanism may be required when a system detects congestion from the presence of multiple duplicate acknowledgments, or as part of the recovery mechanisms that are utilized when an acknowledgment timer reaches zero.

Although duplicate acknowledgments are not uncommon (and are allowed for by TCP's error-recovery mechanisms), the presence of many such acknowledgments tends to indicate that an IP datagram has been lost somewhere, most likely due to congestion occurring on the network. As such, RFC 1112 states that if three or more duplicate acknowledgments are received, then the size of the congestion window should be cut in half, and the congestion avoidance technique is to be used in an effort to return the network to full throttle.

Another scenario where congestion avoidance is used is if the sender's acknowledgment timer has expired, which means that no acknowledgments are coming back from the other end. This signifies that there are serious congestion problems, or that the other system has left the network. In an effort to recover from this event, the congestion window is shrunk so small that only one segment can be sent. Then the slow start mechanism is called upon and used the congestion window is half of its original size. Congestion avoidance is then used to return the network to full speed, albeit at a slower, more cautious rate.

Congestion avoidance is very similar to slow start in that the size of the congestion window is expanded whenever acknowledgments arrive for segments that have been sent. However, rather than incrementing the congestion window on a one-for-one basis (as is done with slow start), the congestion window is incremented by only one segment when all of the segments sent within a single window are acknowledged.

For example, assume that a system's congestion window is set to allow four segments, although the recipient's receive window is advertising a maximum capacity of eight segments. Using congestion avoidance, a system would send four segments and then wait for all of them to be acknowledged before incrementing the size of the congestion window by one (now being set to 밼ive segments?.

If this effort was a success, then the next five segments would be sent, and if all of them were acknowledged, then the congestion window would be increased to six. This process would continue until either congestion occurred again or the congestion window equals the size of the receive window being advertised by the recipient (밻ight segments?here).

Note that it doesn't matter if the remote system sends back a single acknowledgement for all of the segments previously sent, or if individual acknowledgments are returned for each of the segments. With congestion avoidance, all of the segments must be acknowledged before the size of the congestion window will be incremented.

Also note that although RFC 1122 mandates the use of congestion avoidance with TCP, the procedure was not fully documented until RFC 2001 was published. Therefore, many of the earlier systems do not incorporate the congestion avoidance routines described here.

Reliability

The most often touted TCP service is reliability, with TCP's virtual circuit design practically guaranteeing that data will get delivered intact. Using this design, TCP will do everything it can to get data to the proper destination, if at all possible. If this is not possible—perhaps due to a failure in the network or some other

catastrophic event—then naturally TCP won't be able to deliver the data. However, as long as the network and hosts are operational, TCP will make sure that the data is delivered intact.

TCP's reliability service takes many forms, employing many different technologies and techniques. Indeed, RFC 793 states that TCP must be able to 뱑ecover from data that is damaged, lost, duplicated, or delivered out of order.?This is a broad range of service, and as such TCP's reliability mechanisms tend to be somewhat complex.

The most basic form of reliability comes from the use of checksums. TCP checksums are used to validate segments (including the TCP headers and any associated data). Furthermore, checksums are mandatory with TCP (as opposed to being optional as they are with UDP), requiring that the sender compute them, and that the recipient compare them to segments received. This provides a simple validation mechanism that lets a receiver test for corrupt data before handing the data off to the destination application.

Although checksums are useful for validating data, they aren't of any use if they never arrive. Therefore, TCP also has to provide delivery services that will ensure that data arrives in the first place. This service is provided by TCP's use of sequence numbers and acknowledgments, both of which work together to make TCP a reliable transport. Once a segment has been sent, the sender must wait for an acknowledgment to be returned stating that all of the data has been successfully received. If a segment is not acknowledged within a certain amount of time, the sender will eventually try to send it again. This design allows TCP to recover from segments that get lost in transit.

Furthermore, the use of unique sequence numbers allows a receiver to reorder any segments that may have come in out of sequence. Since IP is unpredictable, it is entirely possible that some datagram will be routed over a slower link than the rest of the datagrams, causing some of them to arrive in a different order than they were sent. The receiving TCP system can use the sequence numbers to reorder segments into their correct sequence—as well as eliminate any duplicates—before passing the data off to the destination application.

Taken together, these services make TCP an extremely reliable transport protocol, which is why it is the transport of choice for most Internet applications.

In summary, the key elements of TCP's reliability service are:

Checksums
TCP uses checksums for every segment that is sent, allowing the destination system to verify that the data within the segment is valid.

Sequence numbers
Every byte of data that gets sent across a virtual circuit is assigned a sequence number. These sequence numbers allow the sender and receiver to refer to a range of data explicitly, and also allows the recipient to reorder segments that come in out of order, as well as eliminate any duplicates.

Acknowledgements
Every byte of data sent across a virtual circuit must be acknowledged. This task is achieved through the use of an acknowledgment number, which is used to state that a receiver has received all of the data within a segment (as opposed to receiving the segment itself), and is ready for more data.

Timers
Since TCP uses IP for delivery, some segments can get lost or corrupted on their way to the destination. When this happens, no acknowledgment will be received by the sender, which would require a retransmission of the questionable data. In order to detect this error, TCP also incorporates an acknowledgment timer, allowing the sender to retransmit lost data that does not get acknowledged.

In practice, these mechanisms are tightly interwoven, with each of them relying on the others in order to provide a totally reliable implementation. They are discussed in detail in the following sections.

TCP checksums

TCP checksums are identical to UDP checksums, with the exception that checksums are mandatory with TCP (instead of being optional, as they are with UDP). Furthermore, their usage is mandatory for both the sending and receiving systems. RFC 1122 clearly states that the receiver must validate every segment received, using the checksum to verify that the contents of the segment are correct before delivering it to the destination application.

Checksums provide a valuable service in that they verify that data has not been corrupted in transit. All of the other reliability services provided by TCP—the sequence numbers, acknowledgments, and timers—serve only to ensure that segments arrive at their destination; checksums make sure the data inside the segments arrives intact.

Checksums are calculated by performing ones-complement math against the header and data of the TCP segment. Also included in this calculation is a 뱎seudo-header?that contains the source and destination IP addresses, the Protocol Identifier (6 for TCP), and the size of the TCP segment (including the TCP headers and data). By including the pseudo-header in the calculations, the destination system is able to validate that the sender and receiver information is also correct, in case the IP datagram that delivered the TCP segment got mixed up on the way to its final destination.

TCP must validate the checksum before issuing an acknowledgment for the segment. If a segment is received with an invalid checksum, then the segment must be discarded. Discarding the segment is a 뱒ilent?event, with no notification of the failure being generated or sent.

This is required behavior, since the recipient has no way of determining which circuit the segment belongs to if the checksum is deemed invalid (the header could be the corrupt part of the segment). In such a situation, an error message could be sent to the wrong source, thereby causing additional (and unrelated) problems to ensue. Instead, the segment is thrown away, and the original sending system would eventually notice that the data was not successfully received (due to the acknowledgment timer expiring), and the segment would eventually be reissued.

Since each virtual circuit consists of a pair of sockets, the receiver has to know the IP address of the sender in order to deliver the data to the correct destination application. If there are multiple connections to port 80 on the local server (as would be found with an HTTP server), TCP has to know which system sent the data in order to deliver it to the right instance of the local server. Although this information is available from IP, TCP verifies the information using the checksum's pseudo-header.

Note that RFC 1146 introduced a TCP option for alternative checksum mechanisms. However, the Alternative Checksum option was classified as experimental, and RFC 1146 has since expired. Therefore, the Alternative Checksum option should not be used with any production TCP implementations.

Sequence numbers

A key part of TCP's reliability service is the use of sequence numbers and acknowledgements, allowing the sender and receiver to constantly inform each other of the data that has been sent and received. These two mechanisms work hand-in-glove to ensure that data arrives at the destination system.

RFC 793 states that 밻ach [byte] of data is assigned a sequence number.?The sequence number for the first byte of data within a particular segment is then published in the Sequence Identifier field of the TCP header. Thus, when a segment is sent, the Sequence Identifier field shows the starting byte number for the data within that particular segment. Note that sequence numbers do not refer to segments, but instead refer to the starting byte of a segment's data block.

Once a segment is received, the data is verified by the recipient (using the checksum), and if it's okay, then the recipient will send an acknowledgment back to the sender. The acknowledgment is also contained within a TCP segment, with the

Acknowledgment Identifier field in the TCP header pointing to the next sequence number that the recipient is willing to accept. The acknowledgment effectively says 밒 received all of the data up to this point and am ready for the next byte of data, starting at sequence number n.?/I>

When the acknowledgment arrives, the sender knows that the receiver has successfully received all of the data contained within the segment, and the sender is then able to transmit more data (up to the maximum amount of data that will fit within either the receiver's current receive window or the sender's current congestion window). This process is illustrated in Figure 7-13. In that example, the sender has identified the first byte of data being sent as 1, while the acknowledgment for that segment points to the first byte of data from the next segment that the receiver expects to get (101).


		Figure 7-13. Using sequence numbers to track data

In practice, sequence numbers should rarely be numbered 1. Sequence numbers are 32-bit integers, with a possible range in values from zero through 4,294,967,295. RFC 1122 states that systems must seed the sequence number value on all new circuits using a value derived from the local system's clock. Therefore, the first byte of data being sent across a virtual circuit should not be numbered 1, but instead should be numbered according to a value derived from the current time. Some systems violate this principle, starting at 1 even though they're not supposed to.

The main reason for seeding the sequence number with the system clock is for safety. If sequence numbers always start at a fixed integer (like 1), there is an increased opportunity for overlapping to occur, particularly on circuits that are opened and closed rapidly. For example, if two systems used the same port numbers for multiple connections, and a segment from the first connection got lost in transit, that segment may arrive at the destination during the next connection, thereby appearing to be a valid sequence number. For this reason, all TCP

implementations should always seed the sequence number for all new connections using a value derived from the local system's clock.

In addition, RFC 1948 discussed how this information could be used to launch a variety of attacks against a system, and that using predictable sequence numbers was not only a technical problem but a security risk as well. Essentially, predictable sequence numbers also mean that acknowledgment numbers can be predicted. Given that information, it is easy for a remote hacker to send fake packets to your servers, providing valid IP addresses and acknowledgment numbers. This loophole lets the bad guy compromise your systems without ever seeing a single packet. Unfortunately, some systems still use highly predictable sequence numbers today, and this problem has not gone away entirely.

Another concern with sequence numbers is that they can wrap around during long data transfers. Although there are more than four billion possible sequence numbers, this is not an infinite amount, so reusing sequence numbers will certainly happen on some circuits, particularly those that are kept open for extended periods of time. For example, if a 10-gigabyte file was transferred between two hosts, then the sequence numbers used on that virtual circuit would have to wrap around twice, with some (if not many) of the sequence numbers getting reused at some point. When this occurs, a segment that got lost or redirected in transit could show up late and appear to be a valid segment.

In order to keep reused sequence numbers from causing these kinds of problems, the recipient must limit the active sequence numbers to a size that will fit within the local receive buffer. Since the receive buffer limits the amount of data (in bytes) that can be outstanding and unacknowledged at any given time, a recipient system can simply ignore any segments with sequence numbers that are outside the boundaries of the current window range.

For example, if a recipient has an eight-kilobyte receive buffer, it can set an eight-kilobyte limit on the data that it receives. If it is currently processing segments with sequence numbers around 100,000, then it can safely ignore any segments that arrive with sequence numbers less than 92,000 or greater than 108,000.

In addition, IP's built-in Time-To-Live mechanism also helps to keep older segments from showing up unexpectedly and wreaking havoc. If an IP datagram has a medium-sized Time-To-Live value, then the datagram may be destroyed before it ever reaches the destination. However, most TCP implementations set the Time-To-Live value at 60 seconds (a value recommended in RFC 1122). Since this value tends to be greater than the acknowledgment timer, it is quite possible that a sender will reissue a segment that has not been acknowledged, and that the old datagram will show up unexpectedly. Since the two segments would have the same sequence number, the recipient should be able to detect that they are duplicates, and simply discard the duplicate segment.

Another way to deal with this problem is to use the TCP Timestamp option to identify when a particular segment was sent. On extremely fast networks (such as those using Gigabit Ethernet), it takes only 17 seconds to completely cycle through all four billion sequence numbers. Since the Time-To-Live value on most IP datagrams is substantially larger than this, there is a high probability of an old datagram showing up with a recently used sequence number. RFC 1323 provides a solution to this problem, using the TCP Timestamp option as a secondary reference for each unique segment. When used together, these two mechanisms keep old segments from wreaking havoc when sequence numbers are being reused.

Since the Sequence Identifier field is a standard part of the TCP header, every segment that is sent must have a Sequence Identifier field, even if it is a segment that doesn't contain any data (such as acknowledgments). There's an obvious problem here: it does not contain any data, so what should it use for the Sequence Identifier? After all, the sequence number is supposed to refer to the first byte of data.

If a segment does not contain any data, then the next byte of data expected to be sent is used in the Sequence Identifier field of the TCP header. This sequence number would continue to be used until some data was actually sent, forcing the sequence number to be incremented.

Figure 7-14 illustrates how zero-length segments reuse sequence numbers. As the sender pumps data down to the recipient, the latter has to periodically acknowledge the data that it has received. These acknowledgments are sent as individual segments, with each segment having a Sequence Identifier in the TCP header. Since the client isn't sending any data, it is using the next byte it expects to send as the sequence number. This sequence number will be reused until data actually does get sent, at which point the client's sequence number will be incremented.

One drawback of this approach is that these acknowledgment segments are nonrecoverable if they get lost or become corrupted. Since each of these segments contain duplicate sequence numbers, there's no way for the other end of the connection to uniquely identify them. The remote endpoint cannot ask that sequence number n be resent, because there are lots of segments with that sequence number. In addition, sequence number n has not yet been sent anyway, since the Sequence Identifier field is referring to the next byte of data expected to be sent.

However, this does not mean that the connection will collapse if a zero-length segment is lost. Since zero-length segments typically contain acknowledgments, if one of them is lost then the acknowledgment will be lost as well. But if the sender has sent more data beyond that segment, then the recipient will likely return an acknowledgment for a higher-numbered sequence anyway, obviating the need for that particular acknowledgment to be resent. If the sender is not sending any more data, then it will eventually notice the missing acknowledgment, and resend the


		Figure 7-14. Reusing sequence numbers with command segments

questionable data. This action should result in the recipient re-issuing the lost acknowledgment, providing full recovery.

Some of the other zero-length command segments that get used—such as the command segments used to open and close virtual circuits—have their own special sequence number considerations. For example, 뱒tart?segments that use the Synchronize bit use a sequence number that is one lower than the sequence numbers used for data, while 밹lose?segments that use the Finish or Reset bits use sequence numbers that are one greater than the sequence numbers used for data. By using sequence numbers outside the range of the sequence numbers used for data, these particular segments will not interfere with actual data delivery, and can be tracked individually if necessary.

For more information on the Synchronize, Finish, and Reset flags, refer to 밅ontrol Flags?later in this chapter.

Acknowledgment numbers

Acknowledgment numbers and sequence numbers are closely tied, working together to make sure that segments arrive at their destination.

Just as sequence numbers are used to identify the individual bytes of data being sent in a segment, acknowledgment numbers are used to verify that all of the data in that segment was successfully received. However, rather than pointing to the first byte of data in a segment that has just arrived, the acknowledgment number points to the next byte of data that a recipient expects to receive in the next segment.

This process is illustrated earlier in Figure 7-13. In that example, the sender transmits 100 bytes of data, using a sequence number of 1 to identify the first byte of data in the segment. The receiver returns an acknowledgment for the segment, indicating that it's ready to accept the next segment (starting at sequence number 101). Notice that the acknowledgment does not point to bytes 1 or 100, but instead points to 101, the next byte that the receiver expects to get.

In truth, acknowledgment numbers are closer in concept to a flow control throttle rather than being explicit acknowledgments. Rather than saying 밒 got the data,?they say 밒'm ready for more data, starting at byte number n.?/I>

This design is commonly referred to as being 밹umulatively implicit,?indicating that all of the data up to (but not including) the acknowledgment number has been received successfully, rather than explicitly acknowledging that a particular byte has been received. Implicit acknowledgments work well when data is flowing smoothly, as a receiver can continually request more data. However, when things go bad, implicit acknowledgments are not very robust. If a segment gets lost or corrupted, then the recipient has no way of informing the sender of the specific problem. Instead, it must re-request the next expected byte of data, since that's all the cumulatively implicit acknowledgment scheme allows for.

Remember that the sliding window mechanism allows a sending system to transmit as many segments as can fit within the recipient's advertised receive buffer. If a system is advertising an eight-kilobyte window, and the sender is using one-kilobyte segments, then as many as eight segments may be issued and in transit at any given moment. If the first segment is lost or damaged, the recipient may still get the remaining seven segments. Furthermore, it should hold those other segments in memory until the missing data arrives, preventing the need for all of the other segments to get resent.

However, the recipient must put the segments back into their original order before passing the data up to the destination application. Therefore, it has to notify the sender of the missing segment before it can process the remaining seven segments.

Most network protocols use either negative acknowledgments or selective acknowledgments for this service. Using a negative acknowledgment, the recipient can send a message back to the sender stating 뱒egment n is missing, please resend.?A selective acknowledgment can be used to notify the sender that 밷ytes a through g and bytes s through z were received, please resend bytes h through r.?/I>

However, TCP does not use negative or selective acknowledgments by default. Instead, a recipient system has to implement recovery using the implicit acknowledgment mechanism, simply stating 밶ll bytes up to n have been received.?When a segment is lost, the recipient has to resend the acknowledgment, thereby informing the sender that it is still waiting for a particular sequence number. The original sender then has to recognize the duplicate acknowledgment as a cry for help, stop transmitting new data, and resend the missing data.

Note that RFC 1106 introduced an experimental TCP option that allowed for the use of negative acknowledgments. However, the Negative Acknowledgment option was never widely used, and RFC 1106 has since expired. Therefore, the Negative Acknowledgment option should not be used with any production TCP implementations.

In addition, RFC 1072 introduced selective acknowledgments to TCP, by way of a set of TCP options. However this work has been clarified in RFC 2018. Using the selective acknowledgment options described therein, a TCP segment can precisely state the data it has received—and thus the data that's missing—even if those blocks of data are non-contiguous. In this model, a receiver uses the normal acknowledgment scheme to state that it is looking for sequence number n, and then supplements this information with the Selective Acknowledgment Data option, stating that it also has bytes y through z in the receive queue. The sender would then resend bytes n through x, filling the hole in the receiver's queue. For a more detailed discussion on Selective Acknowledgments, refer to 밪elective Acknowledgments Permitted?and 밪elective Acknowledgment Data?both later in this chapter.

The cumulatively implicit acknowledgment scheme used by TCP is illustrated in Figure 7-15. In that example, each segment contains 100 bytes. The first segment is received successfully, so the recipient sends an implicit acknowledgment for bytes zero through 100 back to the sender. The second segment, however, is lost in transit, so the recipient doesn't see (or acknowledge) it.

When the third segment arrives, the recipient recognizes that it is missing bytes 101 through 200, yet having no way to issue a negative acknowledgment, it repeats the previous implicit acknowledgment, indicating that it is still waiting for byte 101.

What happens next depends on a variety of implementation issues. In the original specification, the sender could wait until an acknowledgment timer for sequence number 101 had expired before resending the segment. However, RFC 1122 states that if three duplicate acknowledgments are received for a segment—and if no other acknowledgments have been received for any subsequent segments—then the sender should assume that the segment was probably lost in transit. In this


		Figure 7-15. Detecting data loss from multiple duplicate acknowledgments

case, the sender should just retransmit the questionable segment rather than waiting for the acknowledgment timer for that segment to expire. This process is known as fast retransmit, which is documented in RFC 2581.

It is important to note that fast retransmit does not work when the data has been lost from the tail-end of the stream. Since no other segments would have been sent after the lost segment, there would not be any duplicate acknowledgments, and as such fast retransmit would never come into play. In those situations, the missing data will be discovered only when the acknowledgment timer for that segment expires.

Regardless of the retransmission strategy used, once the sender had resent the lost segment, it would have to decide whether or not it needed to resend all of the segments following the lost segment, or simply resume sending from the point it left off when the missing segment was discovered. The most common mechanism that is used for this is called fast recovery, which is also described in RFC 2581. The fast recovery algorithm states that if data was retransmitted due to the presence of multiple duplicate acknowledgments, then the sender should just resume transmitting segments on the assumption that none of the subsequent segments were lost. If in fact there were other lost segments, then that information would be discovered in the acknowledgments for the retransmitted segment. This position assumes that multiple segments are not likely to have been lost, and is accurate most of the time.

Of course, this also depends on whether or not the recipient actually kept any other segments that may have been sent. Although RFC 1122 states that the recipient should keep the other segments in memory (thereby allowing for reassembly to occur locally rather than requiring a total retransmission), not all systems conform to this recommendation.

Another related issue is 뱎artial acknowledgment,?whereby a recipient has lost multiple segments. When that happens, the sender may discover and resend the first lost segment through the use of the fast retransmission algorithm. However, rather than getting an acknowledgment back for all of the segments sent afterwards, an acknowledgment is returned that only points to some of the data sent afterwards. Although there aren't any standards-track RFCs that dictate how this situation should be handled, the prevailing logic is to have the sender retransmit the next-specified missing segment and then continue resending from where it left off.

This process illustrates the importance of the recipient's receive buffer, particularly as it pertains to the sender. Every time the recipient received another segment that was out of order, it would have to store this data into the receive buffer until the missing segment arrived. This in turn would take up space in the receive buffer, and so the recipient would have to advertise a smaller receive buffer every time it sent a duplicate acknowledgment for the missing segment. This would in turn cause the sender to slow down (as described earlier in 밨eceive window size adjustments?, with the sender eventually being unable to send any additional segments. Once the recipient got the missing segment, it would reorder the segments and then pass the data off to the destination application. Then it could advertise a large receive buffer again, and the sender could resume sending data.

Depending on the size and condition of the receive buffer, the sender may be able to resume sending data from where it left off (in our example, sequence number 401), without waiting for an acknowledgment for the next segment. This really depends on the number of unacknowledged segments currently outstanding and the maximum amount of unacknowledged data allowed by the recipient.

For example, if the size of the receive buffer was 800 bytes (using 100-byte segments)—and if only two segments were currently unacknowledged—then once the sender had resent the missing data, it could go ahead and resume transmitting additional segments without waiting for an acknowledgment for those other segments. But if the receive buffer had been cut down to just two hundred bytes, then the sender could not send any more data until the two outstanding segments had been acknowledged.

For more details on how the receive buffer controls flow control in general, refer back to 밊low Control.?For more information on the selective acknowledgment option, refer ahead to 밪elective Acknowledgment Data.?/P>

Acknowledgment timers

Most of the time, spurious packet loss is dealt with by using the fast retransmit and fast recovery algorithms, as defined in RFC 2581. However, those algorithms are not always usable. For example, if the link has failed completely, then multiple duplicate acknowledgments will not be received. Also, if the last segment from a transmission were the one that got lost, then there would not be any additional segments that would cause multiple duplicate acknowledgments to get generated. In these situations, TCP has to rely on an acknowledgment timer (also known as a retransmission timer) in order to detect when a segment has been lost in transit.

Whenever a sender transmits a segment, it has to wait for an acknowledgment to arrive before it can clear the segment from the outbound queue. On well-heeled networks, these acknowledgments come in quickly, allowing the sender to clear the data immediately, increment the sliding window, and move on to the next waiting segment. But if a segment is lost or corrupted and no acknowledgment ever arrives, then the sender has to rely on the timer to tell it when it should resend unacknowledged data.

Determining the most efficient size for the acknowledgment timer is a complex process that must be handled carefully. Setting the timer too short would result in frequent and unnecessary retransmissions, while setting the timer too long would result in unproductive delays whenever loss actually occurred.

For example, the acknowledgment timer for two systems connected together on a high-speed LAN should be substantially shorter than the timer used for a slow connection over the open Internet. Using a short timer allows failure to be recognized quickly, which is desirable on a high-speed LAN where latency is not much of an issue. However, setting a long timer would be more practical when many slow networks were involved, as it would not be efficient to continually generate duplicate segments when the problem is slow delivery (rather than packet loss).

Most systems start with a default acknowledgment timer, and then adjust this timer on a per-circuit basis according to the round-trip delivery times encountered on that specific connection. However, even this approach can get complicated, because the default timer is likely to be inappropriate for many of the virtual circuits, since some of them will be used for slow, long-haul circuits while others will be used for local and fast connections.

For example, most modern systems use a default timer of 3000 milliseconds, which is really too large for local area networks that have a round-trip time less than 10 milliseconds (even though this is the recommended default in RFC 1122). Conversely, many earlier implementations had a default timer of 200 milliseconds, which is far too short for many dial-up and satellite links, resulting in frequent and totally unnecessary retransmissions.

Also, the round-trip delivery times of most networks change throughout the day, due to changes in network utilization, congestion, and routing updates that affect the path that segments take on the way to their destination. For these reasons, the default setting is only accurate some of the time, and must be modified to reflect the specific latency characteristics of each virtual circuit throughout the connection's lifetime.

The two formulas used for determining round-trip delivery times are Van Jacobsen's algorithm and Karn's algorithm. Van Jacobsen's algorithm is useful for determining a 뱒moothed round-trip time?across a network, while Karn's algorithm offers techniques for adjusting the smoothed round-trip time whenever network congestion is detected. Although these two algorithms are outside the scope of this book, understanding their principles is required in order to fully understand how they can impact TCP acknowledgment timers.

The basis of Van Jacobsen's algorithm is for a sender to watch the delay encountered by acknowledgment segments as they cross the network, constantly tweaking the variables in use according to the specific amount of time it takes to send a segment and then receive an acknowledgment for that segment.

Although Van Jacobsen's original algorithm used acknowledgments to determine round-trip times for specific segments, this model did not provide for guaranteed accuracy, as multiple acknowledgments could arrive for a single segment (due to loss or due to the use of acknowledgments to command segments, each of which would share the same sequence number). In order to provide for a more accurate monitoring tool, RFC 1072 introduced a pair of TCP options that could be used for measuring the round-trip time of any given circuit, called the Echo and Echo Reply options. However, this work was abandoned in favor of a generic Timestamp option, as defined in RFC 1323.

RFC 1323 uses two fields in a single Timestamp option, allowing both systems to monitor the precise round-trip delivery time of every segment that they send. Whenever a system needs to send a segment, it should place the current time into the Timestamp Value field of the Timestamp option in the TCP header of the outgoing segment. When the remote system receives the segment, it will copy that data into the Timestamp Reply field of the response segment, and place its own timestamp into the Timestamp Value field. Upon receipt of the response, the original sender will see the original timestamp for the original segment, and can compare that data to the current time allowing it to determine the exact amount of latency for the network. In addition, the field-swapping operation will also be repeated, allowing the remote end to determine the same information. For more information on the Timestamp option, refer to 밫imestamp?later in this chapter.

Karn's algorithm amplifies the basic round-trip time formula, although it focuses on how to deal with packet loss or congestion. For example, Karn's algorithm suggests that it is best to ignore the round-trip times on packets that get lost (i.e., where no acknowledgment has been received) in order to prevent one failure from unnecessarily tilting the smoothed round-trip time determined from using Van Jacobsen's algorithm. Karn's algorithm also suggests that the value of the acknowledgment timer should be doubled whenever questionable data has been retransmitted due to the acknowledgment timer expiring, in case the problem is with temporary link failure or congestion.

In this model, if the retransmitted segments also go unacknowledged, then the acknowledgment timer will be doubled yet again, with the process repeating until a system-specific maximum has been reached. This could be a maximum number of retransmissions, or a maximum timer value, or a combination of the two.

Systems based on BSD typically limit the length of the retransmission timer to either five minutes or a maximum of twelve attempts, whichever comes first. Windows-based systems limit retransmissions to five attempts, with each retransmission doubling the acknowledgment timer. Other implementations do not double the retransmission timer, but instead use a percentage-based formula or a fixed table, hoping to recover faster than blind-doubling would allow. Regardless, remember that the value that is being incremented or doubled is based on the smoothed round-trip time for that connection, so the maximum acknowledgment timer value could be either quite large or quite small.

Some systems have shown problems in this area, failing to double the size of their retransmission timers whenever the timer expired. As such, these systems would send a retransmission, and then continue resending the data in short fixed intervals. Since these systems had low timers anyway (200 milliseconds was the default), a dial-up user connecting to this system would tend to get at least two or three retransmissions of the very first segment, until the round-trip smoothing started to kick in.

Also, some systems will cache the learned round-trip time for future use, allowing any subsequent connections to the same remote system (or network) to use the previously learned round-trip latency values. This feature allows the new connection to start with a default that should be appropriate for the specific endpoint system, instead of starting at the system default value (which is almost always wrong).

RFC 1122 mandates that both Van Jacobsen's algorithm and Karn's algorithm be used in all TCP implementations so that acknowledgment timers will get synchronized quickly. Subsequent experimentation has shown that these algorithms do in fact help to improve overall throughput and performance, regardless of the networks in use.

However, there are also times when these algorithms can actually cause problems to occur, such as when Karn's algorithm results in an overly slow reaction to a sudden change in the network's characteristics. For example, if the round-trip time suddenly goes through the roof due to a change in the end-to-end network path, the acknowledgment timer on the sender will most likely get triggered before the data ever reaches the destination. When that happens, the sender will resend the unacknowledged data, the size of the retransmission timer will get doubled, and the acknowledgments for the questionable data may also be ignored (since retransmissions aren't supposed to interfere with the smoothed round-trip time). It will take several attempts before the smoothed round-trip time will get updated to reflect the true round-trip latency of the new network path.

Delayed acknowledgments.

Figure 7-15 earlier in this chapter shows the receiver sending an acknowledgment every time it receives a segment from the sender. However, this is not necessarily an effective use of resources. For one thing, the receiver has to spend CPU cycles on calculating the acknowledgment, as does the sender when it gets the acknowledgment. Furthermore, the use of frequent acknowledgments also generates excessive amounts of network traffic, thereby consuming bandwidth that could otherwise be used by the sender to transmit data.

Rather than acknowledging every segment, it is better for the receiver to only send acknowledgments on a periodic basis. A mechanism called Delayed Acknowledgment is used for this purpose, allowing multiple segments to be acknowledged simultaneously. Remember that acknowledgments are implicit, stating that 밶ll data up to n has been received.?It is therefore possible for a recipient to acknowledge multiple segments simultaneously by simply setting the Acknowledgment Identifier to a higher inclusive value, rather than sending multiple distinct acknowledgments. Not only does this consume less network resources, but it also requires less computational resources on the part of the two endpoints.

This concept is illustrated in Figure 7-16. In that example, the recipient only sends an acknowledgment after receiving two segments. This approach not only generates less traffic, but it also allows the sender to increment their sliding window by two segment sizes, thereby helping to keep traffic flowing smoothly.

RFC 1122 states that all TCP implementations should utilize the delayed acknowledgment algorithm. However, RFC 1122 also states that TCP implementations who do so must not delay an acknowledgment for more than 500 milliseconds (to prevent the sender's acknowledgment timers from reaching zero).

RFC 1122 also states that an acknowledgment should be sent for every two fullsized segments received. However, this depends upon the ability of the recipient to clear the buffer quickly, and also depends upon the latency of the network in


		Figure 7-16. An overview of the delayed acknowledgment algorithm

use. If it takes a long time for cumulative acknowledgments to reach the sender, this design could negatively impact the sender's ability to transmit more data. Instead of helping, this behavior would cause traffic to become bursty, with the sender transmitting lots of segments and then stopping to wait for an acknowledgment. Once the acknowledgment arrived, then the sender would send several more segments, and then stop again.

Furthermore, some applications (such as TELNET) are 밹hatty?by design, with the client and server both sending data to each other on a regular basis. With these applications, both systems would want to combine their acknowledgments with whatever data had to be sent, reducing the number of segments actively crossing the network at any given moment. Delayed acknowledgments are also helpful here, as the two systems can simply combine their acknowledgments with whatever data is being returned.

For example, assume that a TELNET client is sending keystrokes to the server, which must be echoed back to the client. This would generate lots of small segments by both systems, not only for the keystroke data, but also for the acknowledgments that would be generated for each segments containing the keystroke data, as well as for the segments containing the data being echoed back to the client.

By delaying the acknowledgment until the segment containing the echoed data is generated, the amount of network traffic can be reduced dramatically. Effectively, rather than sending an acknowledgment as soon as the client's keystroke data segment had been verified, the server would delay the acknowledgment for a little while. Then, if any data was being returned to the client (such as the echoed keystroke), the server would just set the Acknowledgment Identifier in that segment's TCP header, eliminating the need for a separate acknowledgment segment. When combined with the Nagle algorithm, delayed acknowledgments can really help to cut down the amount of network bandwidth being consumed by small segments.

Unfortunately, there are also some potential problems with this design, although they are typically only seen when used in conjunction with Path MTU Discovery. These problems occur whenever a system chooses to delay an acknowledgment until two full-sized segments have been received, but the system is receiving segments that are not 밼ully sized.?This can happen when two devices announce large MTUs using the MSS option, but then the sender determines that a smaller MTU is required (as detected with Path MTU Discovery). When this happens, the recipient will receive many segments, but will not return an acknowledgment until it has received enough data to fill two full-sized segments (as determined by the MSS option).

In this case, the sender will send as much data as it can (according to the current limitations defined by the congestion window and the local sliding window), and then stop transmitting until an acknowledgment for that data is received. However, the recipient will not return an acknowledgment until the 500-millisecond maximum for delayed acknowledgments has been reached, and then will send one acknowledgment for all of the segments that have been received. The sender will increment its sliding window and resume sending data, only to stop again a moment later, resulting in bursty traffic.

This scenario happens only when Path MTU Discovery detects a smaller MTU than the size announced by the MSS option, which should be a fairly rare occurrence, although it does happen often enough to be a problem. This is particularly problematic with sites that use Token Ring, FDDI, or some other technology that allows for large MTU sizes, with an intermediary network that allows for only 1500-byte MTU sizes. For a more detailed discussion of this problem, refer to 밣artially Filled Segments or Long Gaps Between Sends?later in this chapter.

The TCP Header

TCP segments consist of header and body parts, just like IP datagrams. The body part contains whatever data was provided by the application that generated it, while the header contains the fields that tell the destination TCP software what to do with the data.

A TCP segment is made up of at least ten fields. Unlike the other core protocols, some TCP segments do not contain data. In addition, there are a variety of supplementary header fields that may show up as 뱋ptions?in the header. The total size of the segment will vary according to the size of the data and any options that may be in use.

Table 7-2 lists all of the mandatory fields in a TCP header, along with their size (in bits) and some usage notes. For more detailed descriptions of these fields, refer to the individual sections throughout this chapter.

Table 7-2. The Fields in a TCP Segment
Field	Bits	Usage Notes
Source Port	16	Identifies the 16-bit port number in use by the application that is sending the data.
Destination Port	16	Identifies the 16-bit target port number of the application that is to receive this data.
Sequence Identifier	32	Each byte of data sent across a virtual circuit is assigned a somewhat unique number. The Sequence Identifier field is used to identify the number associated with the first byte of data in this segment.
Acknowledgment Identifier	32	Each byte of data sent across a virtual circuit is assigned a somewhat unique number. The Acknowledgment Identifier field is used to identify the next byte of data that a recipient is expecting to receive.
Header Length	4	Specifies the size of the TCP header, including any options.
Reserved	6	Reserved. Must be 뱙ero.?/FONT>
Flags	6	Used to relay control information about the virtual circuit between the two endpoint systems.
Window	16	User to store a checksum of the entire TCP segment.
Checksum	16	Used to store a checksum of the entire TCP segment.
Urgent Pointer	16	Identifies the last byte of any urgent data that must be dealt with immediately.
Options (optional)	varies	Additional special-handling options can also be defined using the options field. These options are the only thing that can cause a TCP header to exceed 20 bytes in length.
Padding (if required)	varies	A TCP segment's header length must be a multiple of 32 bits. If any options have been introduced to the header, the header must be padded so that it is divisible by 32 bits.
Data (optional)	varies	The data portion of the TCP segment. Not all TCP segments have data, since some of them are used only to relay control information about the virtual circuit.

Notice that the TCP header does not provide any fields for source or destination IP address, or any other services that are not specifically related to TCP. This is because those services are provided by the IP header or by the application-specific protocols (and thus contained within the data portion of the segment).

As can be seen, the minimum size of a TCP header is 20 bytes. If any options are defined, then the header's size will increase (up to a maximum of 60 bytes). RFC 793 states that a header must be divisible by 32 bits, so if an option has been defined, but it only uses 16 bits, then another 16 bits must be added using the Padding field.

Figure 7-17 shows a TCP segment being sent from Arachnid (an HTTP 1.1 server) to Bacteria (an HTTP 1.1 client). This segment will be used for further discussion of the TCP header fields throughout the remainder of this chapter.


		Figure 7-17. A typical TCP segment

Source Port

Identifies the application that generated the segment, as referenced by the 16-bit TCP port number in use by the application.

Size
Sixteen bits.

Notes
This field identifies the port number used by the application that created the data.

Capture Sample
In the capture shown in Figure 7-18, the Source Port field is set to hexadecimal 00 50, which is decimal 80 (the well-known port number for HTTP). From this information, we can tell that this segment is a reply, since HTTP servers only send data in response to a request.


		Figure 7-18. The Source Port field

'UP! > Web Service' 카테고리의 다른 글

The User Datagram Protocol (0)	2008.08.21
The Internet Control Message Protocol (0)	2008.08.21
Multicasting and the Internet Group Management Protocol (0)	2008.08.21
The Address Resolution Protocol (0)	2008.08.21
The Internet Protocol (0)	2008.08.21

Posted by 으랏차

The User Datagram Protocol

UP!/Web Service 2008. 8. 21. 14:16

The User Datagram Protocol

Summary	The User Datagram Protocol provides a low-overhead transport service for application ptotocols that do not need (or cannot use) the connection-oriented services offered by TCP. UDP is most often used with applications that make heavy use of broadcasts or multicasts, as well as applications that need fast turnaround times on lookups and queries.
Protocol ID	17
Relevant STDs	2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 6 (RFC 768, republished)
Relevant RFCs	768 (User Datagram Protocol); 1122 (Host Network Requirements)

There are two standard transport protocols that applications use to communicate with each other on an IP network. These are the User Datagram Protocol (UDP), which provides a lightweight and unreliable transport service, and the Transmission Control Protocol (TCP), which provides a reliable and controlled transport service.

The majority of Internet applications use TCP, since its built-in reliability and flow control services ensure that data does not get lost or corrupted. However, many applications that do not require the overhead found in TCP—or that cannot use TCP because the application has to use broadcasts or multicasts—will use UDP instead. UDP is more appropriate for any application that has to issue frequent update messages or that does not require every message to get delivered.

The UDP Standard

UDP is defined in RFC 768, which has been republished as STD 6 (UDP is an Internet Standard protocol). However, RFC 768 contained some vagaries that were clarified in RFC 1122 (Host Network Requirements). As such, UDP implementations need to incorporate both RFC 768 and RFC 1122 in order to work reliably and consistently with other implementations.

RFC 768 states that UDP is a stateless, unreliable transport protocol that does not guarantee delivery. Thus, UDP is meant to provide a low-overhead transport for applications to use when they do not need guaranteed delivery.

RFC 768 also states that the Protocol ID for UDP is 17. When a system receives an IP datagram that is marked as containing Protocol 17, it should pass the contents of the datagram to the local UDP service for further processing.

UDP Is an Unreliable, Datagram-Centric Transport Protocol

As we discussed in Chapter 1, An Introduction to TCP/IP, sending a message via UDP is somewhat analogous to sending a postcard in that it is totally untrustworthy, providing no guarantees of any kind of delivery. UDP messages are sent and then forgotten about immediately. As such, applications that need a reliable transport protocol should not use UDP.

However, UDP's lightweight model does provide some distinct benefits, particularly in comparison to TCP's highly managed connection model. While TCP provides high levels of reliability through highly managed virtual circuits, UDP offers high performance from having so little overhead. If reliability comes at the expense of performance, then conversely, performance can be gained by eliminating some of the overhead associated with reliability.

In addition, many applications simply cannot use TCP, since TCP's virtual circuit design requires dedicated end-to-end connections between two (and no more than two) endpoints. If an application needs to use broadcasts or multicasts in order to send data to multiple hosts simultaneously, then that application will have to use UDP to do so.

Limited reliability

Although applications that broadcast information on a frequent basis have to use UDP, they do gain some benefits from doing so. Since broadcasts are sent to every device on the local network, it would take far too long for the sender to establish individual TCP connections with every other system on the network, exchange data with them all, and then disconnect. Conversely, UDP's connectionless service allows the sender to simply send the data to all of the devices simultaneously. If any of the systems do not receive one of the messages, then they will likely receive one of the next broadcasts, and as such will not be substantially harmed by missing one or two of them.

Furthermore, streaming applications (such as real-time audio and video) can also benefit from UDP's low-overhead structure. Since these applications are streamoriented, the individual messages are not nearly as important as the overall stream of data. The user will not notice if a single IP packet gets lost every so often, so it is better to just continually keep sending the next message, rather than stopping everything to resend a single message. These applications actually see errorcorrection as a liability, so UDP's connectionless approach is a 밼eature?rather than a 뱎roblem.?/P>

Similarly, any application that needs only a lightweight query and response service would be unduly burdened by TCP's connection-oriented services and would benefit from UDP's low overhead. Some database and network-lookup services use UDP for just this reason, allowing a client and server to exchange data without having to spend a lot of time establishing a reliable connection when a single query is all that's required.

It should be pointed out that many of the applications that use UDP require some form of error correction, but that this error correction also tends to be specific to the application at hand, and is therefore embedded directly into the application logic. For example, a database client would need to be able to tell when no response came back from a query, and so the database client may choose to just reissue the entire query rather than try to fix a specific part of the datastream (this is how Domain Name System queries work). Applications that use UDP must therefore incorporate any required error-checking and fault-management routines internally, rather than rely on UDP to provide these services.

Another interesting point is that most of the network technologies in use today are fairly reliable to begin with, so unreliable protocols like UDP (and IP) are likely to reach their destinations without much problems. Most LANs and WANs are extremely reliable, losing only tiny amounts of data over the course of their lifetime. On these types of networks, UDP can be used without much concern. Even topologies that are unreliable (such as analog modems) typically provide a modicum of error-correction and retransmission services at the data-link layer.

For these reasons, UDP probably shouldn't be considered totally unreliable, although you must always remember that UDP doesn't provide any error-correction or retransmission services within the transport itself. It just inherits any existing reliability that is provided by the underlying medium.

Furthermore, UDP also provides a checksum service that allows an application to verify that whatever data has arrived is probably the same as that which was sent. The use of UDP's checksum service is optional, and not all of the applications that use UDP also use the checksum service (although they are encouraged to do so by RFC 1122). Some applications incorporate their own verification routines within the UDP data segment, augmenting or bypassing UDP's provisional data-verification services with application-specific equivalents.

Datagram-centric transport services

Another unique aspect of UDP is the way in which it deals with only one datagram at a time. Rather than attempting to manage a stream of application data the way that TCP does, UDP deals with only individual blocks of data, as generated by the application protocols in use. For example, if an application gives UDP a fourkilobyte block of data, then UDP will hand that data to IP as a single datagram, without trying to create efficient segment sizes (one of TCP's most significant traits). The data may be fragmented by IP when it builds and sends IP packets for that four-kilobyte block of data, but UDP does not care if this happens and is not involved in that process whatsoever.

Furthermore, since each IP datagram contains a fully formed UDP datagram, the destination system will not receive any portion of the UDP message until the entire IP datagram has been received. For example, if the underlying IP datagram has been fragmented, then UDP will not receive any portion of the message until all of the fragments have arrived and been reassembled by IP. But once that happens, then UDP (and the application in use with UDP) will get and read the entire fourkilobyte message in one shot.

Some UDP stacks require that the application have enough buffers to read the entire datagram. If the application cannot accept all of the data, then it will not get any of the data, since the datagram will be discarded.

Conversely, remember that TCP does not generally cause fragmentation to occur, since it attempts to avoid fragmentation through the use of efficiently sized segments. In that model, TCP would send multiple TCP segments, each of which could arrive independently and be made available to the destination application immediately. Although there are benefits to the TCP design, record-centric applications also have to perform more work when using it instead of UDP, since UDP provides one-shot access to all of the data.

In fact, UDP is particularly useful for applications that have to transfer fixed-length records of data (such as database records or even fixed-length files). For example, if an application needs to send six records from a database to another system, then it can generate six fixed-length UDP datagrams, and UDP will send those datagrams as independent UDP messages (which become independent IP datagrams). The recipient then receives the datagrams as self-contained records and will be able to immediately process them as six unique records.

In contrast, TCP's circuit-centric model would require the same application to write the data to the TCP virtual circuit, which would then break the data into segments for transport across the network. The recipient would then have to read through the segments as they arrived at the destination system, poking through the data and looking for end-of-record markers until all six records were received and found.

For all of these reasons, UDP is a more efficient protocol, although it is still unreliable. As such, application protocols that want to leverage the low-overhead nature of UDP must provide their own reliability services. In addition, these applications typically have to provide their own flow-control and packet-ordering services, ensuring that datagrams are not received out of order. Most applications incorporate a half-duplex data-exchange mechanism in order to provide these services. The application protocol waits for a clear-to-send signal from the remote system, transmits a datagram, and then stops to wait for the clear-to-send signal again.

For example, Trivial File Transfer Protocol (TFTP) clients use acknowledgment messages embedded in UDP datagrams to tell a server that it received the last block of data and that it is ready to receive another block. The TFIP server then sends another block of data as another UDP message and then wait to receive an acknowledgment before sending another block. Although this method is clumsy when compared to TCP's graceful sliding window concept, it has been proven to work over the years.

UDP Ports.

UDP does very little. In fact, it does almost nothing, acting only as a very basic facilitator for applications to use when they need to send or receive datagrams on an IP network. In order to perform this task, UDP has to provide two basic services: it must provide a way for applications to send data over the IP software, and it must also provide a way to get data that it has received from IP back to the applications that need it.

These services are provided by a multiplexing component within the UDP software. Applications must register with UDP, allowing it to map incoming and outgoing messages to the appropriate application protocols themselves.

This multiplexing service is provided by 16-bit port numbers that are assigned to specific applications by UDP. When an application wishes to communicate with the network, it must request a port number from UDP (server applications such as TFTP will typically request a pre-defined, specific port number, while most client applications will use whatever port number they are given by UDP). UDP will then use these port numbers for all incoming and outgoing datagrams

This concept is illustrated in Figure 6-1. Each of the applications that are using UDP have allocated a dedicated port number from UDP, which they use for all incoming and outgoing data.


		Figure 6-1. Application-level multiplexing with port numbers

When an application wishes to send data over the network, it gives the data to UDP through the assigned port number, also telling UDP which port on the destination system the data should be sent to. UDP then creates a UDP message, marking the source and destination port numbers, which is then passed off to IP for delivery (IP will create the necessary IP datagram).

Once the IP datagram is received by the destination system, the IP software sees that the data portion of the IP datagram contains a UDP message (as specified in the Protocol Identifier field in the IP header), and hands it off to UDP for processing. The UDP software looks at the UDP header, sees the destination port number, and hands the payload portion of the datagram to whatever application is using the specified port number. Figure 6-2 illustrates this concept using the Trivial File Transfer Protocol (TFTP), a small file transfer protocol that uses UDP.


		Figure 6-2. Data being sent from a TFTP client to a TFTP server

Technically, a 뱎ort?identifies only a single instance of an application on a single system. The term 뱒ocket?is used to identify the port number and IP address concantenated together (i.e., port 80 on host 192.168.10.10 would be referred to as the socket 192.168.10.10:80). Finally, a 뱒ocket pair?consists of both endpoints, including the IP addresses and port numbers of both applications on both systems. Multiple connections between two systems must have unique socket pairs, with at least one of the two endpoints having a different port number.

Although the concept of socket pairs with UDP is similar to the same concept as it works with TCP, there are some fundamental differences that must be taken into consideration when looking at how connections work with UDP versus how they work with TCP. Most importantly, while TCP can maintain multiple virtual circuits on a single port number through the use of socket pairs, UDP-based applications do not have this capability at all, and simply treat all data sent and received over a port number as data for a single 밹onnection.?/P>

For example, if a DNS server is listening for queries on port 53, then any queries that come in to that port are treated as equal, with the DNS server handling the multiplexing services required to distinguish between the different clients that are issuing the distinct queries. This is the opposite of how TCP works, where the transport protocol would create and manage virtual circuits for each of the connections. With UDP, all data is treated as a single 밹onnection,?and the application must manage any multiplexing services required on that port.

Well-known ports

Most server-based IP applications use what are referred to as 뱖ell-known?port numbers. For example, a TFTP server will listen on UDP port 69 by default, which is the well-known port number for TFTP servers. This way, any TFTP client that needs to connect to any TFTP server can use the default destination of UDP port

69. Otherwise, the client would have to specify the port number of the server that it wanted to connect with (you've seen this in some URLs that use http://www. somehost.com:8080/ or the like; 8080 is the port number of the HTTP server on www.somehost.com).

Most application servers allow you to use any port number you want. However, if you run your servers on non-standard ports, then you would have to tell every user that the server was not accessible on the default port. This would be a hard-to-manage implementation at best. By sticking with the defaults, all users can connect to your server using the default port number, which is likely to cause the least amount of trouble.

Some network administrators purposefully run application servers on nonstandard ports, hoping to add an extra layer of security to their network. However, it is my opinion that security through obscurity is no security at all and that this method should not be relied upon by itself.

Historically, only servers have been allowed to run on ports below 1024, as these ports could be used only by privileged accounts. By limiting access to these port numbers, it was more difficult for hacker to install a rogue application server. However, this restriction is based on Unix-specific architectures, and is not easily enforced on all of the systems that run IP today. Many application servers now run on operating systems that have little or no concept of privileged users, making this historical restriction somewhat irrelevant.

There are a number of predefined port numbers that are registered with the Internet Assigned Numbers Authority (IANA). All of the port numbers below 1024 are reserved for use with well-known applications, although there are also many applications that use port numbers outside of this range. Some of the more common port numbers are shown in Table 6-1. For a detailed listing of all of the port numbers that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/port-numbers).

Table 6-1. Some of the Port Numbers Reversed for Well-Known UDP Servers
Port Number	Description
53	Domain Name System (DNS)
69	Trivial File Transfer Protocol (TFTP)
137	NetBIOS Name Service (sometimes referred to as WINS)
161	Simple Network Management Protocol (SNMP)

Besides the reserved addresses that are managed by the IANA, there are also unreserved port numbers that can be used by any application for any purpose, although conflicts may occur with other users who are also using those port numbers. Any port number that is frequently used is encouraged to register with the IANA.

To see the well-known ports used on your system, examine the /etc/services file on a Unix host, or the C:\WinNT\System32\Drivers\Etc\SERVICES file on a Windows NT host.

The UDP Header

UDP messages consist of header and body parts, just like IP datagrams. The body part contains whatever data was provided by the application in use, while the header contains the fields that tell the destination UDP software what to do with the data.

A UDP message is made up of six fields (counting the data portion of the message). The total size of the message will vary according to the size of the data in the body part. The fields in a UDP message are shown in Table 6-2, along with their size (in bytes) and their usage.

Table 6-2. The fields in a UDP Message
Field	Bytes	Usage Notes
Source Port	2	Identifies the 16-bit port number in use by the application that is sending the data
Destination Port	2	Identifies the 16-bit target port number of the application that is to receive this data
Length	2	Specifies the size of the total UDP message, including both the header and data segments
Checksum	2	Used to store a checksum of the entire UDP message
Data	varies	The data portion of the UDP message

Notice that the UDP header does not provide any fields for source or destination IP addresses, or for any other services that are not specifically related to UDP. This is because those services are provided by the IP header or by the application-specific protocols (and thus contained within the UDP message's data segment).

Every UDP message has an eight-byte header, as can be seen from Table 6-2. Thus, the theoretical minimum size of a UDP message is eight bytes, although this would not leave any room for any data in the message. In reality, no UDP message should ever be generated that does not contain at least some data.

Figure 6-3 shows a UDP message sent from a TFTP client to a TFTP server. In that example, a TFTP session is opened between Greywolf (the client) and Arachnid (the server), with Greywolf sending a file (called testfile.txt) to Arachnid. We'll use this message for further discussion of the UDP header fields.


		Figure 6-3. A simple UDP message

The following sections describe the header fields of the UDP message in detail.

Source Port

Identifies the message's original sender, as referenced by the 16-bit UDP port number in use by the application.

Size
Sixteen bits.

Notes
This field identifies the port number used by the application that created the data.

Note that RFC 768 states 밪ource Port is an optional field, when meaningful, it indicates the port of the sending process, and may be assumed to be the port to which a reply should be addressed in the absence of any other information. If not used, a value of zero is inserted.?/BLOCKQUOTE>
Although Source Port is optional, it should always be used.

Capture Sample
In the capture shown in Figure 6-4, the Source Port field is set to hexadecimal 04 2c, which equates to decimal 1068.

Figure 6-4.
The Source Port field

'UP! > Web Service' 카테고리의 다른 글

The Transmission Control Protocol (0)	2008.08.21
The Internet Control Message Protocol (0)	2008.08.21
Multicasting and the Internet Group Management Protocol (0)	2008.08.21
The Address Resolution Protocol (0)	2008.08.21
The Internet Protocol (0)	2008.08.21

Posted by 으랏차

The Internet Control Message Protocol

UP!/Web Service 2008. 8. 21. 14:16

The Internet Control Message Protocol

Summary	The Internet Control Message Protocol provides a mechanism for IP devices to use when they need to exchange information about network problems that are preventing delivery. Although IP is an unreliable protocol that does not guarantee delivery, it is important to be able to inform a sender when delivery is not possible due to a semi-permanent, nontransient error.
Protocol ID	1
Relevant STDs	2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 4 (RFC 1812, republished); 5 (includes RFCs 791, 792, 919, 922, 950, and 1112)
Relevant RFCs	792 (Internet Control Message Protocol); 896 (Source Quench); 950 (Address Mask Extension); 1122 (Host Network Requirements); 1191 (Path MTU Discovery); 1256 (Router Discovery); 1812 (Router Requirements)
Related RFCs	1393 (Traceroute Extension)

IP is an unreliable protocol, and as such, delivery is not guaranteed. In this model, if important datagrams are lost, then a higher-layer protocol (such as a transport-layer protocol like TCP or an application-layer protocol like TFTP) will eventually recognize that a problem has occurred and will deal with it. As the theory goes, important data will eventually get through.

However, sometimes a problem crops up that prevents all datagrams from getting through to their destination. When these kinds of nontransient errors occur, IP fails for a specific and avoidable reason, and the sender should be notified about the problem so that it can either stop sending data to that destination or modify its behavior so that the specific problem is avoided. IP uses the Internet Control Message Protocol (ICMP) for reporting these kinds of problems.

The ICMP Specification

ICMP is documented in RFC 792, which is included in STD 5 (the IP standard). As such, ICMP is a part of STD 5 and is therefore considered to be an Internet Standard protocol. However, RFC 792 contained some vagaries that were clarified in RFC 1122 (Host Network Requirements) and RFC 1812 (Router Requirements). In addition, much of ICMP's core functionality has been redefined and clarified in STD 2. As such, ICMP implementations need to incorporate RFC 792, RFC 1122, RFC 1812, and STD 2 in order to work reliably and consistently with other implementations.

RFC 792 states that the Protocol ID for ICMP is 1. When a system receives an IP datagram that is marked as containing Protocol 1, it should pass the contents of the datagram to ICMP for further processing. However, ICMP is not a transport protocol and is not used to deliver application data. Rather, ICMP is a control protocol like IGMP, useful for informing devices of network events and changes.

RFC 792 states that ICMP must be used whenever IP itself needs to report a problem. Thus, although ICMP works at a layer above IP, IP also depends on ICMP in order to function properly. ICMP and IP are tightly interwoven, and for all practical purposes are inseparable. For this reason, every IP implementation is also required to include ICMP.

The Need for ICMP

Remember that IP is only responsible for getting datagrams from one host to another, one network at a time. Each IP datagram gets sent as an individual entity capable of following whatever path is available to it. Datagrams are moved across whatever hosts, routers, and networks are capable of getting that specific chunk of data closer to its destination.

In this model, any IP datagram can fail to get delivered for any number of reasons. Some datagrams will get discarded simply because the next-hop router is unavailable, and the current router is unable to forward them. Sometimes a datagram will be destroyed due to the user on the sending system providing a nonexistent destination IP address or port number to their local application. In all of these cases, the system that detects an error will simply destroy the IP datagram that's failing, and move on to the next datagram waiting to be processed.

Depending upon the exact cause of the failure, the system that destroyed the datagram may or may not choose to return an ICMP error message back to the original sender, notifying it of the failure and its cause. Typically this decision is made based on whether the failure is transient or semi-permanent.

Transient failures such as invalid checksums are generally ignored, since it is assumed that the sender will eventually notice the failure and retransmit any important data (which may be handled by TCP or by an application-specific reliability mechanism). The assumption is that if the data wasn't important enough for the sender to use a reliable protocol, then the sender probably doesn't care that delivery failed, and the problem can go unreported. In this model, transient errors can be safely ignored, since it is somewhat unlikely that the next packet will have the exact same problem. Eventually, the transport or application protocol in use will detect the error, and the failure itself does not indicate that there is a problem with the network at large.

Conversely, semipermanent failures (such as invalid destination IP addresses) need to be reported to the sender immediately, since these kinds of failures represent fundamental problems with the network itself, or at least indicate that there is a problem in the way that the sender is trying to use the network. In either case, semi-permanent failures should always be reported back to the sender, thus either causing it to stop sending data to that destination, or forcing it to modify its behavior so that the specific problem is avoided.

ICMP is the protocol used to send failure messages back to a system when a semipermanent delivery problem has been detected. This includes events such as a destination being unreachable, the IP Time-to-Live value reaching zero, and so forth. In addition, ICMP can be used for exchanging general information about the network, or for probing the network for certain characteristics. For example, the popular ping program uses ICMP messages to test basic connectivity between two devices.

In fact, ICMP is essentially just a collection of predefined messages, each of which provide very specific functionality. When a system needs to send an ICMP message, it chooses a message from the dictionary, places the code for the message into an ICMP-specific datagram, and then transmits the ICMP message via IP back to the system that sent the original (failing) datagram.

The recipient will see that the IP datagram contains an ICMP message (as indicated in the IP header's Protocol Type field), examine the message and its data, and then hand the message off to the appropriate protocol for additional processing. If the message is intended for ICMP itself (such a message might be an 밻cho request?message, generated by ping), then ICMP will deal with the message directly and not involve any other protocols.

If the message is intended for IP (such a message might be a 뱑edirect?message, suggesting that the sender should use a different router), then the message will be delivered to the system's IP software for processing. In this example, the IP software should update the local routing table to reflect the path suggested by the message and use that router for any subsequent traffic for the affected destination.

If the message is intended for a transport protocol (such a message might be 밺estination port is unreachable?, then the message will be delivered to the appropriate transport protocol for processing. The transport protocol should process the message directly and then inform the application protocol of the error, suggesting that it stop sending data to that particular destination socket. Most of the ICMP error messages are meant to be processed by the transport protocols.

When Not to Send ICMP Messages

Just as it is important to know when to send an error message, it is also important to know when an error message should not be sent. For example, any transient error that causes delivery to fail (such as an invalid checksum or a data-link delivery failure) should not be reported. However, the ICMP specifications also state that ICMP error messages should not be generated when their usage will generate an excessive amount of network traffic. For example, RFC 1122 states that ICMP error messages should not be sent as the result of receiving:

?Another ICMP error message (although ICMP error messages can be sent in response to ICMP query messages)

?An IGMP message of any kind

?An IP datagram with a destination IP address for a broadcast or multicast address

?An IP datagram with a nonspecific source address (all zeroes, a loopback address, or a broadcast or multicast address)

?A data-link frame with a broadcast or multicast address

?Any fragment—other than the first fragment—from a fragmented IP datagram

The first rule is obvious. You would not want to generate ICMP error messages in response to other ICMP error messages, as a message loop could cause a network storm that would prevent any other packets from getting sent. However, RFC 1122 states that devices can send ICMP error messages in response to ICMP query messages. For example, a router can issue an ICMP Redirect error message in response to an ICMP Echo Request query message that was sent to the wrong router.

The reason that systems shouldn't generate ICMP error messages in response to broadcast or multicast datagrams is to keep network traffic to a minimum. For example, assume that a user on one host broadcasts a UDP message to all of the hosts on the local network, but only a few of the hosts were running the application associated with that UDP traffic. If all of the other hosts on the network sent a Destination Port Unreachable error message back to the sender, then the number of ICMP messages could be quite high. Every time a broadcast was sent, several ICMP error messages would also get generated, which could theoretically overwhelm the network if there were enough of these errors.

With multicast traffic (such as is used for streaming audio or video), this problem would be exacerbated since there could be many thousands of multicast datagrams, which would be followed by many tens- or hundreds-of-thousands of ICMP error messages. On a large shared-access network (such as nonswitched Ethernet), the resulting collisions could theoretically render the network useless. In addition, ICMP error messages should not be generated in response to IGMP messages, for much the same reason.

Note however that ICMP query messages can be sent to a broadcast address, and those messages should be responded to with additional query messages (but should not be responded to with ICMP error messages). For example, a user can use ping to send an ICMP Echo Request query message to all of the hosts on the local network, and any of the hosts on that network may respond with ICMP Echo Reply query messages (although this behavior is entirely optional according to RFC 1122). Therefore, the 뱊o errors for broadcast or multicast traffic?rule only applies to ICMP error messages.

Similarly, systems should not return error messages in response to every fragment of a fragmented packet, since this process would also result in multiple messages being sent in response to a single IP datagram. Although the resulting congestion would probably not be as much of a problem as it would be with broadcast or multicast traffic, it would still be an unnecessary utilization of the available network bandwidth.

Reporting on Delivery Problems

As mentioned earlier, ICMP error messages are used when it is necessary to report a problem that is preventing delivery from occurring. Although IP is an unreliable protocol that may fail without warning, it is important for the network to know when problems occur that will prevent delivery from ever occurring.

This is a fundamentally different concept from packets simply becoming lost or corrupted. If a system is trying to send data to a destination network that is totally unavailable, then the sending system ought to know that no datagrams are ever going to make it through to the destination system. The problem should be reported to the sending application so it can stop sending data to that destination.

Although ICMP can be used to report on IP failures, it is important to note that ICMP does not make IP a reliable protocol. IP is still capable of losing packets, sending duplicate or out-of-sequence packets, or doing anything else that it wants to. Nor does the lack of ICMP messages mean that the network is functioning perfectly; a host may be ignoring messages for any number of reasons without an ICMP message ever getting returned to the sender.

By the same token, TCP is a reliable transport that uses negotiated connections for data exchange, and as such many of ICMP's error messages are not needed or used by TCP. For example, if an application specified an invalid port number on a destination system, the remote system's TCP software would simply reject the connection request, using the TCP Reset flag. The sender's TCP stack would be informed of the error immediately, and it would be redundant to send an ICMP message stating that the destination port was unreachable. This is true of almost every ICMP error message (but not all of them): TCP just doesn't need to know that a segment was not processed, because it's already keeping track of every segment that it sends.

Conversely, UDP does not have any circuit-management mechanisms, so it has no way to monitor individual segments. As such, UDP benefits greatly from ICMP error messages, and is the principle user of these messages.

Destination Unreachable error messages.

A Destination Unreachable error message can signify any number of problems. It can mean that a router was unable to find a path to a remote system, or it can mean that a port number on the destination system is currently unavailable, or it can signal a variety of other problems.

In order to provide more-detailed reporting, the Destination Unreachable message provides a variety of submessages (using the ICMP Message Code field as described later in 밇rror message headers?. The major Destination Unreachable submessages documented in the various ICMP RFCs, include:

Network Unreachable
This error message means that no route for the destination network could be found in the routing table on the reporting router, and is commonly seen when a user tries to connect to a private address that is non-routable across the Internet. This message can also result when datagrams are sent to a router that has corrupt or out-of-date routing tables.

For example, Figure 5-1 shows Ferret (192.168.10.10) trying to send an IP datagram to 192.168.30.10. Since the default route on Ferret points to Sasquatch, Ferret sends the datagram to that router for forwarding. However, Sasquatch does not have a routing table entry for the 192.168.30.0 network, so it returns

an ICMP Destination Unreachable: Network Unreachable error message back to the sender.


		Figure 5-1. A typical scenario for Network Unreachable error messages

It's important to note that this error can be generated by any router between the source and destination systems. Since the problem may be short-term (and thus, may get fixed), Network Unreachable error messages are not meant to be the interpreted as a final work on routing issues, but should only be used to indicate that a temporary routing problem has occurred.

Host Unreachable
This error message means that the IP datagram could not be delivered to the final destination system. This error message is generated by the last-hop router if that router does not know how to reach the destination system in particular. Like the Network Unreachable error, this is only an advisory message and should not be interpreted to mean that the host does not exist.

Protocol Unreachable
This error message indicates that the specified transport protocol is unavailable on the destination system. This message is normally seen when a user tries to use a nonstandard transport protocol (like XTP) to communicate with another host that isn't running that protocol.

Port Unreachable
This error message means that the specified destination port number is not in use on the destination host. Typically, this error indicates that a client application has attempted to connect to a server application that is not loaded or that isn't using the port number expected by the client.

For example, Figure 5-2 shows Ferret trying to send an UDP datagram to port 69 (the well-known port for TFTP) on Froggy. However, Froggy is not running a TFTP server, so it returns an ICMP Destination Unreachable: Port Unreachable error message back to the sender.


		Figure 5-2. A typical scenario for Port Unreachable error messages

Port Unreachable errors are almost always generated by UDP. Since TCP uses a handshake to establish a connection with a remote system, the destination system's TCP stack would use the TCP Reset flag to reject the connection request if the destination port number were not in use by any applications.

Fragmentation Required but DF Bit Is Set
This error message indicates that the IP datagram had to travel across a network whose MTU was smaller than the IP datagram, but the datagram's Don't Fragment flag in the IP header was enabled. Since the router is unable to forward or fragment the datagram, it must be rejected.

For example, Figure 5-3 shows the host at 192.168.30.10 trying to send a 16 kilobyte datagram (with the 밆on't Fragment?flag set in the IP header) to 192.168.10.10, which is on a segment that offers an MTU of only 1500 bytes. Since the router is unable to forward the datagram without fragmenting it, and the Don't Fragment flag is enabled, the router returns an ICMP Destination Unreachable: Fragmentation Required but DF Bit Is Set error message back to the sender.

Nowadays, the Fragmentation Required but DF Bit Is Set error message is most often seen with Path MTU Discovery, an algorithm defined in RFC 1191 that allows a sending system to discover the largest MTU size for an end-to-end


		Figure 5-3. A typical scenario for Fragmentation Required error messages

connection. Once the largest-possible MTU has been discovered, the sender can use that MTU on all subsequent datagrams, thereby preventing fragmentation from occurring on that connection. For more information on this procedure, refer to 밡otes on Path MTU Discovery?later in this chapter.

Source Route Failed
This error message means that the router was unable to deliver the packet to the next-hop router as specified in the datagram's Source Route IP Option fields. Source routing may fail because an invalid next-hop router was specified, or because the router was unable to send the datagram to the specified next-hop router.

Destination Network Unknown
This error message means that the destination network absolutely does not exist. It should be sent only when the data-link network has unequivocable proof that the destination network really does not exist. This is the opposite of the Network Unreachable error, which only suggests that a path to the destination network could not be found in the current routing tables (although the network may actually exist). Whereas the Network Unreachable error message suggests that a network may not exist, the Destination Network Unknown error message states it as fact.

Destination Host Unknown
This error message means that the destination host absolutely does not exist. It should be sent only when the data-link network has unequivocable proof that the destination system simply does not exist. This is the opposite of the Host

Unreachable error, which only suggests that the destination system cannot be found (although the host may actually exist). Whereas the Host Unreachable error message suggests that a host may not exist, the Destination Host Unknown error message states it as fact.

Network Unreachable for Type-of-Service
This error message is generated by intermediary routers if any of the next-hop networks in between the sender and destination systems either do not support the Type-of-Service value requested in the IP datagram, or do not support the default Type-of-Service.If a device sends an IP packet with a particular Type-of-Service defined, but no path is available that matches that particular Type-of-Service, then the router should reject the packet and inform the sender of the problem.

Host Unreachable for Type-of-Service
This error message is generated by the last-hop router if the last-hop network for the destination system either does not support the Type-of-Service value requested in the IP datagram, or does not support the default Type-of-Service. If a device sends an IP packet with a particular Type-of-Service defined, but the last-hop network does not support that particular Type-of-Service, then the last-hop router should reject the packet and inform the sender of the problem.

Communication Administratively Prohibited
This error message means that the destination system is configured to reject datagrams from the sending system. This error is generally used when firewall restrictions or other security measures are filtering datagrams based on some sort of criteria. This message effectively says, 밫he destination may be up and running, but it will not get the datagrams that you're sending. Stop sending them.?/P>

For example, Figure 5-4 shows Ferret (192.168.10.10) trying to send a datagram to Fungi (192.168.20.60). However, the router Sasquatch is configured to reject all datagrams from the 192.168.10.0 network, so it returns an ICMP Destination Unreachable: Communications Administratively Prohibited error message back to the sender.

Some firewalls are configured not to issue the Communication Administratively Prohibited messages, since such messages may be considered a security risk in their own right. Telling an attacker which hosts are being protected is not necessarily a good idea; sometimes saying nothing is the most secure option.

RFC 1122 also defines 밅ommunication With Destination Network Administratively Prohibited?(code 9) and 밅ommunication With Destination Host Administratively Prohibited?(code 10). However, these messages are reserved for use by U.S. military agencies, and aren't supposed to be used by the general public.


		Figure 5-4. Typical scenario for Communications Administratively Prohibited error messages

Host Precedence Violation
This error message means that the sender has specified a Precedence value for the IP datagram that is not supported by the intermediary network, destination network, destination host, or destination application. In order for the source to communicate with the destination, it must change the Precedence value of the IP datagram to a supported value.

Precedence Cutoff in Effect
This error will occur if a sending system defines a specific Precedence value that is lower than the minimum Precedence required on an intermediary or destination network. Such filters are typically found only on very expensive networks. In order for the source to communicate with the destination, it must increase the Precedence value of the IP datagram, or else use a different route for that specific destination.

According to RFC 1122, all Destination Unreachable error messages must be reported to the transport layer that triggered the failure. If the error indicates a 뱒oft failure?(such as Network Unreachable), the transport layer should not abort the connection, but instead should note that the problem occurred. If the connection fails due to this problem, the error may be passed to the application protocol so that it can inform the user of the failure. If the error indicates a 밾ard failure?(such as Network Unknown), then the connection should be terminated immediately.

For more information on the Destination Unreachable error message, refer to

밆estination Unreachable?later in this chapter.

Time Exceeded error messages

Time Exceeded error messages are used to indicate that a forwarding or reassembly operation took too long to complete and that the reporting device is discarding the data. In order to provide more-detailed reporting, the Time Exceeded message provides two different submessages (using the ICMP Message Code field as described later in 밇rror message headers?. The submessages offered by Time Exceeded include:

Time-to-Live Exceeded in Transit
This error message is used when an IP datagram's Time-to-Live value has reached zero but the datagram has not yet been delivered to its final destination. Since the Time-to-Live field indicates the maximum number of hops that a datagram can take, the router cannot forward a datagram with a Time-to-Live value of zero, and must destroy it instead. Since most systems use Time-to-Live values of 30 or more, the presence of this message generally indicates that a routing loop is preventing the datagram from being delivered.

However, this message is also used with the traceroute program to identify the routers in between a source and destination system. For more information on how traceroute uses this message, refer to 밡otes on traceroute?later in this chapter.

Fragment Reassembly Time Exceeded
This error message is used when a datagram has been fragmented but the destination system has not received all of the fragments within the allotted time (60 seconds on most Unix systems). This message generally indicates that a fragment has been lost in transit somewhere and that the destination system is discarding the other fragments that it has received.

According to RFC 1122, all Time Exceeded error messages must be reported to the transport layer that triggered the failure, although the transport layer should not abort the connection. Instead, the transport layer should note that the problem occurred, and if the connection fails due to this problem, the message may be passed to the application protocol so that it can inform the user of the failure.

For more information on the Time Exceeded error message, refer to 밫ime Exceeded?later in this chapter.

Redirect error messages

The Redirect error message is used whenever a router needs to inform a sender of a shorter path to the specified destination. This message is typically seen when users only have a single (default) route defined on a network with multiple routers, and they should be sending datagrams for a specific network to a router other than the default. If the users don't send the datagrams to the 밷etter?router, then the default router may use Redirect error messages to inform the sender of the correct router to use.

There are some rules that have to be taken into consideration with this design. First and foremost among them is the fact that Redirect error messages can only be sent when the 밷etter?router is on the same subnet as the host that is sending the packets. This is pretty obvious; if the specified router were on a remote subnet, then the host would not be able to forward packets through it to the destination system. Furthermore, Redirect error messages should be sent only from a router on the same subnet as the original sender, since other routers on other subnets are not likely to know the routing paths available to the sender.

RFC 1122 even goes so far as to state that systems should discard any redirect messages that do not originate from first-hop routers, or messages with a suggested route that does not point to another first-hop router. In addition, RFC 1122 also states that whenever a system receives a Redirect error message, it must update its routing tables to use the suggested router if it is on the same network as the device itself. This rule is important, since a failure to update the routing table results in the sender continuously transmitting packets through the wrong gateway.

Figure 5-5 shows Ferret (192.168.10.10) trying to send a datagram for 192.168.30.10 to Sasquatch (which is the default router for Ferret). The router examines its local routing table and sees that the best router for the 192.168.30.0 network is via Canary (192.168.10.1), which is also on the same network as the original sender. Since Canary is 밹loser?to the final destination network than Sasquatch, Sasquatch issues an ICMP Redirect error message back to Ferret, telling it to use Canary for that traffic.

Note that Sasquatch will go ahead and forward the pending datagram to Canary on Ferret's behalf (as shown in step 3), but after that initial transfer, Ferret should hand all datagrams for the 192.168.30.0 network to Canary directly (as shown in step 4).

Redirect error messages are most commonly seen on networks that rely on the Router Discovery protocol (as described in RFC 1256) for dynamic routing services. Router Discovery makes extensive use of the Redirect error message, and in that environment, these messages do not signify 뱎roblems?with the network.

In order to provide more-detailed reporting, the Redirect message provides four different submessages (using the ICMP Message Code field as described later in 밇rror message headers?. The submessages offered by Redirect include:

Redirect for Destination Network
This message is used when all traffic for the destination network should go through another router. This is the most common form of the Redirect error


		Figure 5-5. A typical scenario for Redirect error messages

message on local networks that rely on Router Discover for dynamic routing services.

Redirect for Destination Host
This message is used when all traffic for the final destination system should go through another router. Since routing tables can contain host-specific entries (just as they can have network-specific entries), this error is useful if a host-specific redirect is required.

Redirect for Destination Network Based on Type-of-Service
This error message is used when the sending system has requested a particular Type-of-Service for a specific destination, and all traffic with that Type-of-Service for the destination network should go through another router.

Redirect for Destination Host Based on Type-of-Service
This error message is used when the sending system has requested a particular Type-of-Service for a specific system, and all traffic with that Type-of-Service for that destination should be sent through another router.

The Redirect message can be a very useful tool for network administrators, since administrators only have to configure their network devices to use a single router, and that router will provide redirection services to the rest of the devices based on its own routing tables. In fact, this is how the Router Discovery protocol works. As the network topology changes, the routers can inform the other devices about the changes dynamically.

However, even though RFC 1122 states that all IP implementations must support Redirect message, many products do not do so, either failing to update their routing tables or ignoring the Redirect messages entirely. In addition, different implementations support Redirect to different degrees, with some systems taking the redirection message as permanent and modifying their routing tables accordingly, while other systems will only accept the change as a temporary suggestion and will modify their routing tables for only a few minutes. You should verify that your systems fully support the Redirect message to the extent that you require before attempting to implement dynamic routing services across your network using this mechanism.

Note that ICMP Redirect error messages are not sent when a route has been specified using the Source Route IP Option. In that case, the source route takes priority over the optimal path, and the first-hop router specified by the Source Route IP Option must forward the datagram to the next-hop router, regardless of any better paths that may be available.

For more information on the Redirect error message, refer ahead to 밨edirect?

Source Quench error messages

The Source Quench error message is perhaps the simplest of all the ICMP error messages. Whenever a device is sending too much data for the destination host to process, the recipient can send an ICMP Source Quench error message back to the sender, suggesting that the sender throttle back on the rate at which it is sending data. If the sender does not slow down, then some packets are likely to be discarded by the congested device.

Source Quench is most often seen when a dial-up server connects a high-bandwidth network (such as a LAN) to a low-bandwidth device (such as a dial-up client). In this kind of scenario, it is easy for a high-powered system on the LAN to transmit more data than the dial-up server can feed to the end-point system. Eventually, the dial-up server will fill its transfer buffers and will have to start dropping packets if the sender doesn't slow down. Source Quench allows the dial-up server to inform the sender of the congestion, effectively requesting it to please stop sending so much data.

According to RFC 1122, all Source Quench error messages must be reported to the transport layer that triggered the failure. In addition, RFC 1122 states that if the Source Quench error message is handed to TCP, then TCP should shrink the congestion window for that virtual circuit to 뱋ne segment?and implement the slow start recovery algorithm (as described in 밅ongestion window sizing?in Chapter 7, The Transmission Control Protocol).

It is also important to note that RFC 792 states that either the destination system or any router in between the source and destination systems could issue an ICMP Source Quench error message. However, RFC 1812 states that routers should not send Source Quench error messages, claiming they are of questionable benefit. In fact, RFC 1812 even says that routers can choose not to forward Source Quench error messages if they do not want to.

It is my opinion that this is probably a priori call, however, and that Source Quench is an extremely effective tool for controlling intensive traffic flows (particularly those that use UDP, which does not provide for any flow-control services whatsoever). As such, it is my opinion that Source Quench should be used whenever and wherever possible.

For more information on the Source Quench error message, refer to 밪ource Quench?later in this chapter.

Parameter Problem error messages

The Parameter Problem error message generally means that something is wrong with the IP datagram itself, and that the datagram is being discarded.

Parameter Problem errors almost always result from an incorrect usage of an IP option. For example, a device may have sent an IP datagram with a malformed Source Route option in the IP header. This datagram would fail to be delivered due to this error, and would get discarded by the recipient (or an intermediary gateway) once the error was detected. Although it might seem that a Destination Unreachable: Source Route Failed error message would be sent in this case, that would not be true since the problem was a malformed option, rather than an undeliverable address.

In order to provide more-detailed reporting, the Parameter Problem message provides three different submessages (using the ICMP Message Code field as described later in 밇rror message headers?. The submessages offered by Parameter Problem include:

Pointer Indicates the Error
This error means that there is a specific problem with the datagram's structure (such as a malformed header field). The location of the bad data is provided in the ICMP Message Data field of the Parameter Problem error message, allowing the sender to determine the cause of the failure.

Required Option Is Missing
This error means that a required IP option has not been defined. This message is used only with the Security options, which are used only by the U.S. military, and are therefore not discussed in this book.

Bad Length
This error indicates that the Header Length and/or Total Packet Length values of the IP datagram do not appear to be accurate.

Note that there is no 밹hecksum failure?submessage. If a checksum does not calculate correctly, then the entire packet is deemed to be corrupt and is discarded without warning. Any portion of the packet may have been corrupted, including the Destination Address field of the datagram's header. In this case, the recipient may not even be the desired destination system, so it should not do anything other than destroy the packet.

For more information on the Parameter Problem error message, refer to 밣arameter Problem?later in this chapter.

Probing the Network

Since ICMP is a generic messaging protocol, it is also useful for determining general characteristics about the network. ICMP query messages provide this service, allowing systems to request information about the network in general.

ICMP queries are conversational by nature, with one system seeking information from another, and with the remote system returning the requested information. This process can be seen with the ping program's use of Echo Request query messages, which are responded to with Echo Reply query messages. This model is in contrast to the one-way nature of ICMP error messages, which are sent but not responded to.

Echo Request and Echo Reply query messages

One of the simplest tests that a user may wish to perform is verifying that a remote system is up and running on the network. Such a test may be required when basic connectivity appears to be failing.

ICMP provides two query messages that work together to provide just this service. The ICMP Echo Request query message is a probe sent by a user to a destination system, which responds with an ICMP Echo Reply query message.

RFC 1122 states that 밻very host must implement an ICMP Echo server.?Since this service is mandatory, any user should be able to send an ICMP Echo Request to any host on the Internet and receive an ICMP Echo Reply message in return. However, this is not always the case, as firewalls may be blocking the packets (for security reasons), or the packets may simply fail to be delivered.

Furthermore, RFC 1122 also states that every host should implement an end-user-accessible application interface for sending Echo Request query messages to other hosts. Typically this application interface is implemented as a program called ping.

Almost every computing environment—even the most basic network print servers and fax gateways—offers some kind of ping program for testing basic connectivity. This is expected, since RFC 1122 mandates that any device with a TCP/IP stack must have one.

ping works by sending one or more ICMP Echo Request messages to a destination system, and then measuring the amount of time it took for the ICMP Echo Reply messages to be received from the system being probed. Some implementations do less than this, only printing a message that the destination system 밿s alive,?while others do more than this, allowing the user to specify oversized messages or specific Precedence and Type-of-Service values for the IP datagrams that hold the ICMP Echo Request messages.

Figure 5-6 shows a simple ping between two hosts on a local network. In that example, Ferret sends an ICMP Echo Request query message to Froggy, which responds with an ICMP Echo Reply query message. This is a typical example of how ping is used on any local network.


		Figure 5-6. A typical exchange of Echo Request and Echo Reply query messages

Although the structure of the Echo Request and Echo Reply messages are described later in detail in 밇cho Request and Echo Reply,?some aspects of the formatting of these messages are important to understanding how the messages can be used for more-detailed testing. For example, Echo Request messages have an ancillary field for storing test data (called the Optional Data field), allowing the sender to provide customized test data inside the ICMP message. Most ping applications use some sort of default data in the test block (such as 64 bytes of 7-bit

US-ASCII text), although many of them will also let you provide your own data in this field, or will at least let you change the quatity of data being sent.

For example, you could choose to have ping send 2,000 bytes of data in the Optional Data field. This Echo Request message could then be used to test the effects of fragmentation when the datagram had to be sent across a network with an MTU that was smaller than that (such as an Ethernet segment).

RFC 1122 states that any data that is received in an ICMP Echo Request must be returned in the resulting Echo Reply message, so if your test data made it to the recipient, you should get all of the data back. However, if the recipient does not support fragmentation (and not every IP device does), then an Echo Reply message would still need to be returned. It would only have to contain as much data as would fit within a single message (according to the MTU restrictions of the target system).

This concept is illustrated in Figure 5-7, which shows a Token-Ring-attached device with an MTU of four kilobytes (192.168.30.10) sending a two-kilobyte Echo Request message to an Ethernet-attached device with an MTU of 1500 bytes (192.168.10.10). In order for the router to forward the message to the Ethernet segment, it must split the IP datagram into two fragments, which will be reassembled by the destination host once both fragments have arrived.


		Figure 5-7. Fragmenting Echo Request and Echo Reply query messages

Once the message was received, 192.168.10.10 would respond with an ICMP Echo Reply message that was also two kilobytes long (this datagram would have to be fragmented by the sender before it could be sent, since it exceeds the local network's MTU size). The router would then forward both fragments back to 192.168. 30.10, who would reassemble them into a single datagram before the data is processed. This illustrates how it is possible to use oversized datagrams with ping to test fragmentation across a complex network.

Another interesting (but limited) test for ping is for a device to send an ICMP Echo Request query message to all of the devices on the local network, by specifying the broadcast address or the all-hosts multicast address for the local network in the Destination Address field of the IP datagram. However, RFC 1122 explicitly states that devices do not have to respond to these queries, and many implementations will not do so. In addition, not all versions of ping are capable of dealing with multiple responses to a single request, and will report erroneous information when this situation occurs.

In addition, some ping implementations also allow the user to specify the Precedence or Type-of-Service values for the underlying IP datagrams. According to RFC 1122, the remote system is supposed to use those same value in the Echo Reply messages that they return. By specifying non-default values, you can use ping to test how well your network handles classified data as opposed to regular (i.e., nonclassified) data, or how traffic is routed according to nondefault Type-of-Service or Precedence values.

Other IP options and services explicitly supported by ICMP Echo Request and Echo Reply query messages are the Source Route, Record Route, and Timestamp IP Options. For example, RFC 1122 states that if an ICMP Echo Request query message is received within an IP datagram that has the Source Route IP Option defined, then the ICMP Echo Reply query message must follow the same route back to the sender, in reverse order.

For more information on the Echo Request and Echo Reply query messages, refer ahead to 밇cho Request and Echo Reply.?/P>

Timestamp Request and Timestamp Reply query messages

Another pair of ICMP query messages that can be useful for testing the network is the Timestamp Request and Timestamp Reply query messages, which allow a sender to determine the amount of latency that a particular network is experiencing. This gives good insight into the performance characteristics of the network.

As can be imagined, this information also provides more insight than most people want to provide, at least to users outside the local network. For this reason, RFC 1122 states that the Timestamp Request and Timestamp Reply query messages are entirely optional. Indeed, there are only a few TCP/IP implementations that support them.

Timestamp works by having one system send a Timestamp Request query message to another host, with the current time in the Timestamp Request message's Originate Timestamp field. The recipient then creates a Timestamp Reply query message that contains the Originate Timestamp field from the original message, along with new Receive Timestamp and Transmit Timestamp fields.

Once the Timestamp Reply query message is received by the querying device, the different timestamps can be examined, allowing the system to determine the amount of time that it took for the remote system to process the Timestamp Request query message. This data can then be subtracted from the round-trip delivery time, allowing the system to calculate the length of time that it took for the datagrams to travel across the network. This provides a better indication of actual network latency than the Echo Request and Echo Reply query messages, which do not provide any information about the amount of time that the remote system spent processing the ICMP messages.

Note that the timestamps used in the Timestamp Request and Timestamp Reply query messages are based on the number of milliseconds since midnight, using Universal Time (UTC). By using UTC, network devices do not have to worry about timezone issues, since all devices are on the same timezone (UTC).

The ICMP Timestamp Request and Timestamp Reply messages can also be set to use specific Type-of-Service and Precedence values, allowing latency to be measured across specific networks. In addition, RFC 1122 states that if an ICMP Timestamp Request query message is received within an IP datagram that has the Source Route IP Option defined, then the ICMP Timestamp Reply query message must follow the same route back to the sender, in reverse order. This sequence allows a user to measure latency across a specific network path.

Also note that accurate measurements of network latency depend upon each system having accurate (or consistent) time values for their local clocks. Before relying on the Timestamp query messages for latency measurements, synchronize the clocks on the systems being tested using a protocol such as NTP.

For more information on the Timestamp Request and Timestamp Reply query messages, refer ahead to 밫imestamp Request and Timestamp Reply.?/P>

Address Mask Request and Address Mask Reply query messages

RFC 792 defined a variety of host-configuration messages, allowing diskless systems to obtain IP addresses and other data during the boot process by using ICMP messages. However, the primary ICMP query messages used for this—the Information Request and Information Reply query messages—have since been deprecated and are now obsolete. The Address Mask Request and Address Mask Reply query messages are also somewhat obsolete, although their usage has not been deprecated as of yet.

Essentially, the Address Mask Request and Address Mask Reply query messages allow a host to determine the subnet mask in use on the local network. Once a host has obtained an IP address (which could be obtained via Reverse ARP, BOOTP, or DHCP), they could then send an Address Mask Request message to the local broadcast address of 255.255.255.255. An Address Mask server on the network would then respond with an Address Mask Reply query message that is unicast back to the requesting device directly, providing the host with a subnet mask for the IP address being used.

RFC 1122 states that the Address Mask Request and Address Mask Reply query messages are entirely optional. Indeed, only a handful of systems use these messages today.

For more information on the Address Request and Address Reply query messages, refer to 밃ddress Mask Request and Address Mask Reply?later in this chapter.

Router Solicitation and Router Advertisement query messages.

RFC 1256 introduced the concept of a Router Discovery protocol that allows the devices on a network to locate routers using ICMP query messages, rather than having to be configured to use static routing entries or run a full-fledged routing protocol like RIP or OSPF.

The Router Discovery protocol consists of a Router Solicitation query message, which is issued by hosts when they first become active on the network (sent to the all-routers multicast address of 224.0.0.2). Each router on the network should then respond to the Router Solicitation query messages with a unicast Router Advertisement query message, informing the querying device directly of the IP addresses that can be used for packet forwarding. In addition, routers will also issue unsolicited Router Advertisement messages on a periodic basis (sent to the all-hosts multicast address of 224.0.0.1), allowing hosts both to change their routing tables as higher-priority routers become available, and to expire old routers if they are no longer being advertised after a certain length of time.

According to RFC 1256, unsolicited Router Advertisement query messages should be sent to the all-hosts multicast address of 224.0.0.1 through every network interface in use by the router, although directed Router Advertisements that result from Router Solicitations are only unicast back to the requesting device, through the most appropriate interface. Each of these advertisement messages must contain the IP address in use on the local interface, as well as a Preference value for the IP address being published. Network devices then use the IP address with the highest Preference (the lowest number) as their default router.

This process is illustrated in Figure 5-8. In that example, Ferret issues an ICMP Router Solicitation query message to the all-routers multicast group address of 224. 0.0.2, which both Sasquatch and Canary respond to (using the ICMP Router Advertisement query message, also sent to the all-routers multicast address of 224.0.0.2). Each response also contains a Preference value for that specific router, with Sasquatch having the highest preference (this router will become the default router for Ferret).


		Figure 5-8. Router Discovery using Router Solicitation and Advertisement query messages

Note that routers do not advertise the networks or paths that they can route for. Devices must discover (and build) their own routing tables dynamically, by way of the ICMP Redirect error message discussed earlier in 밨edirect error messages.?In that model, the Router Discovery client uses the Router Solicitation query message to choose a default route, and then relies on the Redirect error messages whenever the host tries to send a datagram to the default router for delivery, but the router knows of a better path for the specified destination system.

This concept is illustrated in Figure 5-9. As we saw in Figure 5-8, Sasquatch had the strongest preference, and was chosen as the default router by Ferret. If the host wants to send a datagram to 192.168.30.10 (a non-local device), it would send the packets to Sasquatch for delivery. However, Sasquatch would know that Canary was the better router for that destination, so it would return an ICMP Redirect error message back to Ferret, informing it that all traffic for the 192.168.30.0

network should go to Canary. Sasquatch would then forward on the packets it had already received, while Ferret would make note of the new path, and (hopefully) start to use that router for any future datagrams for that destination network.


		Figure 5-9. Router Discovery using Redirect error messages

On small or isolated networks that have only one or two routers on any given segment, the process of learning about the entire network will only take a few moments and is not nearly as complicated as it sounds here. Conversely, on large and complex networks with many different routers, this process can be significantly more complicated than it sounds here.

For more information on the Router Solicitation and Router Advertisement query messages, refer to 밨outer Solicitation?and 밨outer Advertisement,?both later in this chapter.

ICMP Messages

ICMP is used for two different types of operations: reporting on problems that prevent delivery (such as 밆estination Unreachable?errors), and probing the network through the use of request and reply messages (such as those found in the ping program's 밇cho Request?and 밇cho Reply?ICMP messages).

Every ICMP message is assigned a unique 뱈essage type,?which is simply a numeric code that represents each of the predefined messages. There are a variety of predefined message types that devices can choose from when they need to send a message to another device. Many of these messages were defined in RFC 792, while others were added in RFC 950, RFC 1122, RFC 1812, and RFC 1256. In addition, some messages have been rendered obsolete (such as the Information Request and Information Reply query messages), and are no longer considered part of the standard dictionary of message types.

For a detailed listing of all of the ICMP Message Types that are currently registered, refer to the Internet Assigned Numbers Authority's (IANA) online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/icmp-parameters). Note that RFC 1122 states that if a system receives an ICMP message with a type or code that it does not understand, it must ignore the message.

Table 5-1 lists the Message Types most often used with IPv4.

Table 5-1. Message Types and Their Usage
Type	Message Description	Message Family	Defined In
0	Echo Reply	Query (Reply)	RFC 792
3	Destination Unreachable	Error	RFC 1122
4	Source Quench	Error	RFC 792
5	Redirect	Error	RFC 792
8	Echo Request	Query (Request)	RFC 792
9	Router Advertisement	Query (Reply)	RFC 1256
10	Router solicitation	Query (Request)	RFC 1256
11	Time Exceeded	Error	RFC 1122
12	Parameter Problem	Error	RFC 792
13	Timestamp Request	Query (Request)	RFC 792
14	Timestamp Reply	Query (Reply)	RFC 792
17	Address Mask Request	Query (Request)	RFC 950
18	Address Mask Reply	Query (Reply)	RFC 950

Since these messages are used for specific types of functionality, they vary widely in their structure and formatting, with some messages having more fields than others. Refer to 밇rror message headers?for information on the structure of ICMP error messages, and 밦uery message headers,?both later in this chapter, for information on the structure of ICMP query messages.

ICMP Error Messages

ICMP error messages are used whenever a nontransient delivery problem occurs

that the sender should be made aware of. A variety of ICMP error messages

exist (as described later in 밇rror message types and codes?,

although each of these

messages have their own special requirements and treatments. However, all ICMP error messages have the same basic structure, and these fields are described in the next section.

Error message headers

ICMP error messages consist of a predefined set of fields that indicate the type of message being passed, and that also provide message-specific services (such as IP addresses or pointers to corrupt data). These predefined fields are followed by the IP header and the first eight bytes of data from the datagram that caused the error to be generated.

Since delivery problems can occur at any layer, ICMP error messages must include the full header and the first eight bytes of data from the failing IP datagram in order for the sender to see which packet failed, and in order for the sender to notify the transport and application protocols that generated the failure. This process allows the sender to pass the error message to the appropriate layer for subsequent processing.

Table 5-2 lists all of the fields in an ICMP error message's header, along with their size (in bytes), and some usage notes. For more information on these fields, refer to the individual discussions throughout this section.

Table 5-2. Format of the ICMP Error Message
Field	Size	Usage Notes
Message Type	1 byte	Indicates the specific ICMP error message.
Message Code	1 byte	Indicates the specific subclass of the specific ICMP error message.
Checksum	2 bytes	Used to validate the ICMP message's contents.
Message Data	4 bytes	Used for message-specific needs.
Original Headers	20?0 bytes	The full header of the IP datagram that failed.
Original Data	8 bytes	The first 64 bits of the IP datagram's data. This data will contain the Source and Destination Port fields for the transport protocol used by the sender, allowing the transport protocols to determine which specific application generated the failing datagram.

The total length of an ICMP error message is eight bytes. However, this measurement does not include the original datagram's IP headers, which can add anywhere from 20 to 60 bytes of additional data to the message, not does it include the first eight bytes of data from the failing packet. All told, an ICMP error message can be anywhere from 36 to 72 bytes, including all of these fields.

All ICMP messages are sent within IP datagrams directly. Each of these datagrams must have a protocol identifier of 1. Also, RFC 1122 (Host Network Requirements)

states that ICMP error messages sent from hosts should use the default Type-of-Service and Precedence values (0), although RFC 1812 (Router Requirements) states that error messages sent from routers should have a precedence value of 6 or 7.

The remainder of this section discusses the ICMP error messages in detail, using the capture shown in Figure 5-10 for reference purposes.


		Figure 5-10. A simple ICMP error message

Message Type

The Type field is used to specify the exact ICMP error message being sent.

Size
Eight bits.

Notes
ICMP essentially consists of a predefined dictionary of messages, allowing messages to be exchanged using numeric identifiers. The Type field is used to specify the major class of the message, while the Code field is used to specify the minor class, though some message Types do not have any subtype Codes.

Table 5-3 lists all of the ICMP error message Types used with IPv4, and their meanings.

Table 5-3. ICMP Error Message Types
Type	Message Description	Usage Notes
3	Destination Unreachable	The network, host, or port number specified by the original datagram is unreachable. This could be related to a variety of problems.
4	Source Quench	Either the destination system or an intermediary device is receiving more data than it can process, and is asking the sender to reduce the rate of transfer.
5	Redirect	This message will occur if a device tries to send a datagram to a destination through a router, but the router knows of a shorter path to the destination.
11	Time Exceeded	Either the datagram's Time-to-Live timer expired, or some fragments were not received in time to be reassembled by the destination system.
12	Parameter Problem	Another problem occurred. This error is almost always related to problems with IP or TCP Options.

Capture Sample
In the capture shown in Figure 5-11, the Type field is set to 3, which is used for all Destination Unreachable errors.


		Figure 5-11. The Message Type field

'UP! > Web Service' 카테고리의 다른 글

The Transmission Control Protocol (0)	2008.08.21
The User Datagram Protocol (0)	2008.08.21
Multicasting and the Internet Group Management Protocol (0)	2008.08.21
The Address Resolution Protocol (0)	2008.08.21
The Internet Protocol (0)	2008.08.21

Posted by 으랏차

Multicasting and the Internet Group Management Protocol

UP!/Web Service 2008. 8. 21. 14:16

Multicasting and the Internet Group Management Protocol

Summary	IP multicasting allows IP datagrams to be sent to explicit hosts on multiple networks simultaneously. This is different from traditional broadcasting in that not all hosts have to process the data, and hosts on remote networks can also receive the datagrams. The Internet Group Management Protocol provides a mechanism for IP devices to register with multicast routers, indicating the multicast groups they are interested in participating in.
Protocol ID	2
Relevant STDs	2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 5 (includes RFCs 791, 792, 919, 922, 950 and 1112)
Relevant RFCs	1112 (IP Multicasting and IGMPv1); 1122 (Host Network Requirements); 1469 (Token Ring Multicasting); 2236 (IMGP v2)
Related RFCs	1075 (Distance Vector Multicast Routing Protocol); 1256 (Router Discovery); 1584 (Multicast Extensions to OSPF); 2365 (Administratively Scoped IP Multicast); 2588 (IP Multicast and Firewalls)

When a system needs to communicate with other hosts on the local network, it typically does so either by broadcasting a packet to every host on the local network or by sending unicast datagrams to a specific destination.

Another alternative that can be used is 뱈ulticasting,?a technique that lets a system send data to a group address. Any hosts that are monitoring that group address will receive the data, while every other device on the network will simply ignore it.

This model is particularly useful for applications that require a host to send data to multiple destinations simultaneously, but would be hindered by the limitations found with broadcasts.

For example, applications such as streaming audio require that a host be able to send data to multiple hosts simultaneously. While it may seem that general-purpose broadcasting can be used for these types of applications, it is not necessarily the most efficient model to use. For one thing, broadcasts are meant for every host on the local network. As such, they will not reach hosts that are on remote networks (most routers filter out broadcast traffic). But if a host on a remote network needed to receive that data, then the broadcast traffic would have to be re-broadcast to the remote network, getting sent to every host on that remote network, regardless of whether or not they wanted the traffic.

Furthermore, with broadcast traffic, every device on the local network has to monitor all of the packets and then decide if the data is needed by any local applications. If the data wasn't useful, then it still had to be processed before this fact was discovered. With bandwidth-intensive applications like network multimedia, this traffic results in every system on the network spending a large amount of computing resources examining and then discarding unwanted messages. But by using multicasts instead of broadcasts, hosts can choose which network streams they want to monitor, and higher-layer protocols will only have to process the packets they are specifically interested in. Hosts that do not want any data from a particular multicast group will not have to process anything extra.

In short, multicasts allow for broadcast-like activity, while providing explicit support for remote hosts and networks on a selective basis. This is a particularly important service with bandwidth-intensive applications like network multimedia, but also is useful for any application that requires a host be able to communicate with multiple systems simultaneously.

For example, the Router Discovery protocol (defined in RFC 1256) makes extensive use of multicasting to allow devices to automatically locate the routers on a network, and there is a draft proposal that proposes to imbue DNS clients with the ability to locate DNS servers automatically using this same model. There are even proposals for using multicast transfers with TFTP to send data to multiple hosts simultaneously, and for replicating Usenet newsfeeds to multiple servers with a single stream (instead of many large unicast transfers).

Multicasts are useful for all of these applications, allowing data to be sent to multiple systems simultaneously, rather than sending it on a point-to-point basis to each of them discreetly. In fact, anything that can benefit from a single feed to multiple, distributed destinations can benefit from IP multicasts.

The IP Multicasting and IGMP Specifications

IP multicasting is documented in RFC 1112, which is included in STD 5 (the IP standard). As such, IP multicasting is a part of STD 5, and is therefore considered to be an Internet Standard protocol. All hosts are required to implement multicasting into their IP stacks if they are to be compliant with Internet standards.

In addition to the IP multicasting services, RFC 1112 also defined IGMPv1, a follow-up to IGMPv0 (originally published in RFC 998). IGMPv2 was introduced in RFC 2236 as an update to IGMPv1, and has gained a substantial number of implementations. Some IGMP implementations only support IGMPv1 while others support IGMPv2 and v1.*

Of the differences between IGMPv1 and v2, the most notable addition in IGMPv2 was the addition of the 밚eave Report?message, which allows a host to inform the network that it is no longer interested in receiving IP datagrams for a specific multicast group. This feature is required for networks that use data-link services that do not support multicasting directly (such as PPP). This chapter focuses on IGMPv2 primarily, and only mentions IGMPv1 when there is a significant difference.

RFC 2236 states that the Protocol ID for IGMP is 2. When a system receives an IP datagram that is marked as containing Protocol 2, it should pass the contents of the datagram to IGMP for further processing. However, note that IGMP is not a transport protocol and is not used for the delivery of multicast data. Rather, IGMP is a control protocol like ICMP, useful for informing devices of network events and changes.

It is also important to note that although all hosts are required to implement IP multicasting, they are not required to implement IGMP. If they do not implement IGMP, however, then they cannot tell local multicast routers that they wish to receive multicast packets. Although a host can send any data it wants to a multicast group IP address, hosts can only receive multicast data reliably if they implement IGMP. Most of the multicast-aware systems in use today implement IGMP, as well as the basic support for multicast services.

The default Time-to-Live value for IP datagrams sent to a multicast address is ? hop? unless the sending application explicitly specifies a larger value. Also, it should be noted that IGMP messages are only meant to be received and processed by devices on the local network, and as such the Time-to-Live value for IP datagrams that contain IGMP messages should always be ? hop? Although multicast data can be sent across router lines, IGMP messages should never be forwarded.

* IGMP stands for Internet Group Management Protocol.

An Introduction to IP Multicasting

In essence, IP multicasting is a function of the IP address in use by a particular 뱈ulticast group.?If a user wants to send an audio presentation to a variety of distributed hosts simultaneously, then that user would send the data to an IP address that was associated with a particular multicast group. Any hosts that were listening for traffic destined for that IP address would then receive the data and process it accordingly, while other hosts would simply ignore it.

The application protocol used for such a feed could be a RealAudio stream sent as a bunch of UDP messages over IP. These messages would not look any different from the same data being sent to a single destination system or to the local broadcast address. The only difference between the multicast datagrams and their unicast or broadcast equivalents would be the destination IP address in use with those datagrams: the unicasts would point to a specific destination system, the broadcasts would point to all hosts on the local network, and the multicasts would point to a group address.

In the IP world, multicast group addresses are known as Class D addresses and include all of the IP addresses in the range of 224.0.0.0 through 239.255.255.255. Each of the individual addresses in this range refer to specific multicast groups, many of which are associated with a specific application or service.

There are a number of predefined, reserved addresses that are registered with the Internet Assigned Numbers Authority (IANA). For example, all of the addresses in the range of 224.0.0.0 through 224.0.0.255 are reserved for use with routing protocols and other low-level, infrastructure-related services. Some of the more common registered addresses are shown in Table 4-1. For a detailed listing of all of the multicast groups that are currently registered, refer to the IANA's online registry (accessible on the Web at http://www.isi.edu/in-notes/iana/assignments/multicast-addresses).

Table 4-1. Some Common Multicast Group Addresses
Multicast Address	Description
224.0.0.1	All local multicast hosts (including routers). Never forwarded.
224.0.0.2	All local multicast routers. Never forwarded.
224.0.0.4	DVMRP routers. Never forwarded.
224.0.1.1	Network Time Protocol.
224.4.1.24	Microsoft's Windows Internet Name Service locator service.

Besides the reserved addresses that are managed by the IANA, there are also 뱔nreserved?addresses that can be used by any application for any purpose, although conflicts may occur with other users who are also using those multicast addresses. Any group address that is frequently used is encouraged to register with the IANA.

Any host can send data to a multicast group address simply by providing the multicast group address in the destination field of the IP datagram. Any hosts that are listening for traffic sent to that group address will pick up the data.

If an application wishes to participate in a local multicast group as a listener, all it has to do is inform the local system's IP stack and network adapter of the group address to be monitored. If this step is not performed, then the system will likely ignore traffic going to the multicast group addresses of the applications in use. By registering with the local interface, any data sent to the multicast address in use by the application will get picked up and monitored by the local system and then get delivered to the local application for processing. The following section, 밚ocal Multicasting? describes this process in detail.

If an application wishes to participate in a distributed multicast group across a router, then the application must still perform the local registration tasks described above, but it must also register its desire to receive traffic for that group address with any local multicast routers. This step is required before multicast routers will forward any remote multicast datagrams for that group address onto the local network. 밆istributed Multicasting?later in this chapter describes multicasting over multiple networks in detail, while 밠anaging Group Memberships? also later, describes the registration process.

Local Multicasting

Remember that IP datagrams (and IP addresses) are used only to represent a consistent view between IP-enabled devices, and that when two devices on the same local network want to communicate with each other they will use the topology-specific protocols and addresses defined for the local network medium to actually exchange data. For example, two hosts on the same local Ethernet segment will use Ethernet framing and addressing mechanisms when they want to exchange IP data, with the IP packets being passed in the Ethernet frames. Meanwhile, hosts on Token Ring networks will use the protocols and addressing mechanisms specified in the Token Ring standards when they want to exchange data.

These same rules apply with multicast traffic, which must also use the low-level protocols and addressing rules defined for whatever medium is in use with the local network. When multicast IP packets are being passed around the network, they are really being sent in a form that uses the multicasting services that are provided by the underlying network topology.

Most shared-medium networks provide three distinct types of addressing:

Unicast
Unicast data is sent from one device to another, using an explicit destination address. Devices on the network must monitor the network traffic, looking for any frames that are marked for their local hardware address. With Ethernet, this is a 48-bit address typically written in hexadecimal notation as something like c0:14:3d:22:1e:04.

Broadcast
Broadcast data is sent from one device to every other device on the local network, using a broadcast address specific to the network topology. Devices on the network must monitor the network traffic and read any frame that is marked for the broadcast address. With Ethernet, the 48-bit broadcast address is 밶ll on,?typically being written in hexadecimal notation as ff:ff:ff:ff:ff:ff.

Multicast
Multicast data is sent from one device to a specific multicast group address. Devices on the network choose which multicast groups they want to participate in, and then monitor the network traffic, looking for any frames that are destined for one of those multicast groups (ignoring any other frames that are going to any other multicast group addresses). Ethernet offers a variety of multicast group addresses, ranging from 01:00:5e:00:00:00 to 01:00:5e:7f:ff:ff.

In order for IP multicast traffic to be sent to multiple hosts simultaneously, the IP multicast addresses must be mapped to the multicast addresses in use by the local network, and the local IP stack must inform the local network adapter of the multicast groups that the adapter should be listening for. Unless the hardware is listening for traffic sent to the hardware-specific multicast addresses, the IP software (and thus UDP and the application-layer protocols) will never get the IP packets.

Some conversion is usually required in the process of mapping multicast IP addresses to multicast hardware addresses. For example, the range of available Ethernet multicast addresses only provides 23 of the 48 available bits for defining specific multicast group addresses, although IP uses 26 bits from the 32-bit Class D address block to identify specific multicast groups. Therefore, to map IP multicast addresses onto Ethernet multicast addresses, the last 23 bits from the IP multicast address are mapped onto the 23 bits available for the Ethernet multicast address.

For example, Figure 4-1 shows a multicast packet bound for 224.0.0.1 being sent to the Ethernet group address of 01:00:5e:00:00:01. The last 23 bits of the IP address (0000000 00000000 00000001) have simply been copied to the Ethernet group address block of 01:00:5e:00:00:00.

This mapping results in some overlap, with some multicast groups sharing a single Ethernet multicast address. In this case, the IP software on the destination


		Figure 4-1. Mapping Ethernet addresses to Class D IP addresses

systems must perform some filtering, discarding any IP packets that are destined for a multicast group address (as specified in the IP header) that the device does not want. For example, if the network adapter is monitoring for Ethernet-specific multicast addresses that fall within the duplicate range, then the IP software in use on the listening host must also examine the destination IP address shown in the IP datagram before handing the data to the specified transport protocol for subsequent processing.

Although this presents a challenge for users on Ethernet networks, some topologies have even greater degrees of overlap, and require much greater levels of filtering. For example, Token Ring networks have a limited number of group addresses available to them, so all Token Ring multicasts must share the same group address of c0:00:00:04:00:00.

Some networks don't provide any form of group address at all, and are strictly limited to unicast and broadcast frames. On those networks, the topology-specific broadcast address must be used for all multicast data, requiring that hosts filter through all broadcast packets. Note that some point-to-point networks don't even have broadcast addresses. Instead, they have only 뱒end?and 뱑eceive?wires. On those networks, multicast routers must use topology-specific unicast frames to deliver data to the destination systems.

Some switches offer an intelligent multicast filtering mechanism, whereby they can interpret an IP multicast request and apply a filter on the switch port for that device, effectively limiting the multicast traffic sent to the device according to the multicast IP addresses that the device has sent out.

Regardless of these issues, the important thing to remember is that IP multicasts use the addressing and delivery mechanisms offered by the underlying network topology, just as every other type of IP packet does. Some network topologies provide better mapping services than others, but almost all of the widely used LAN topologies allow for multicasts using some form of explicit group addressing.

When an application on a host wishes to begin receiving multicast packets, it must inform the local network adapter that it should begin accepting network packets for the specific multicast group (if that service is available, as it is on Ethernet). If a host is directly attached to multiple networks, it may register with any of the network interfaces explicitly, or it may register with all supported interfaces simultaneously. Applications have to explicitly state which networks they want to monitor. Typically, hosts with multiple interfaces will have a 밺efault?multicast interface that they use for reading and writing multicast traffic.

Sometimes this registration occurs at the same time as the network-wide IGMP registration process, and sometimes it occurs just before the IGMP registration process, as a precondition of the IGMP registration request. This is an implementation-specific issue. The important thing to remember is that registering with the network adapter(s) on the local system and registering with the local multicast routers (via IGMP messages) are two separate and distinct acts.

Distributed Multicasting

One of the big differences between IP multicasts and IP broadcasts is that multicast datagrams can be sent to remote networks, while broadcasts are normally restricted to the local network. By being able to forward multicast feeds across network boundaries, multicast routers can get multicast datagrams to hosts around the world, allowing the hosts to participate in global multicasting groups, with multicast datagrams going to every network that has a host that is participating in that particular multicast group.

This process is achieved by using multicasting routers that are particularly aware of multicast traffic and the routing issues associated with that type of traffic. This model involves a couple of theories that are somewhat different from traditional routing.

For one thing, the local multicast router must be listening for multicast packets. The router has to actively monitor the network for IP traffic that has a destination address of 224.0.0.1 through 239.255.255.255, and then forward that data on to any other multicast-enabled network segments that are attached to the router (or any tunnels to remote multicast networks that may be configured). Traditional routers do not monitor the network looking for packets that need to be forwarded, but instead wait for data that needs to be sent to the router (for forwarding) explicitly.

Also, since the destination address is a multicast address and not a unicast address, the multicast router has to forward any multicast data that it sees to every other network segment attached to the multicast router. In addition, since not every network is multicast-enabled (i.e., does not have a multicast router on it), many networks are bridged together using multicast 뱓unnels,?whereby multicast traffic is sent as unicast datagrams to another multicast router at the far end of the tunnel. Regardless of how the networks are connected, the model is the same: multicast routers send multicast data to the other multicast-enabled networks that they know of, and any multicast routers on those networks will pick up the frames and forward the data to any other multicast-enabled networks that they are attached to.

This allows hosts on remote networks to receive locally generated multicast data, and vice versa. And since any additional multicast routers on those networks will see and forward any multicast data that comes onto their network, they will also forward the data on to any other multicast-enabled network segments that they are attached to. This allows the data to be propagated across the entire Internet.

For example, Figure 4-2 shows an IP packet being sent from Ferret to the multicast address of 224.1.2.3. If any remote hosts (such as Fungi) want to monitor that group address, then Sasquatch has to monitor the network on its behalf, pick up any matching data, and then explicitly forward the packets out to the other segments (where they would eventually be received by Fungi).

There are some special considerations with this model that must be followed in order for everything to work smoothly. Otherwise, simply propagating multicasts across the different networks could result in them becoming saturated with unwanted data.

Limited forwarding

The primary rule that multicast routers follow is that they will only forward multicast traffic to the networks that have expressed an interest in that specific type of data. For example, Figure 4-2 shows two subnets. When Ferret sends a multicast datagram to the 224.1.2.3 group address, Sasquatch has to forward the datagram to the 192.168.20.0 network in order for Fungi to receive it. However, Sasquatch will only do this if Fungi has explicitly stated that it was interested in data for the 224. 1.2.3 multicast group.


		Figure 4-2. Routing in a multicast environment

This 뱒tatement of interest?is achieved by using IGMP messages, which are sent by the hosts when they first come onto the network, and which are also sent in response to IGMP queries issued by local multicast routers. There are lots of rules and conditions that are involved with this scenario which are discussed in more detail in 밠anaging Group Memberships?later in this chapter. However, the basis of the protocol is that hosts announce which multicast groups they want to listen to, and routers use this information to determine which networks they will forward data to. In essence, a multicast router will only forward multicast data to a network if there is an active listener for that group address on that network.

In addition, multicast routers must also forward any group registration information they have to all other multicast routers that they know about. This procedure ensures that other routers know to send multicast data to their networks. Otherwise, upstream routers would not send any data for that multicast group address down to the local network. Note that this process is handled by different multicast routing protocols in different ways, but that they all do the same basic thing: informing other networks of the groups that this particular network wants to receive traffic for.

Time-to-Live considerations.

Blindly forwarding multicast data around the Internet could cause significant utilization problems. For this reason, another important aspect of wide-area multicasting is the Time-to-Live value of the multicast IP datagram. As was discussed in Chapter 2, The Internet Protocol, the IP datagram's Time-to-Live header field dictates the maximum number of hops that a datagram can take before it must be discarded. Since all multicast datagrams are sent using IP, this field is also used with multicast datagrams, and is also used by multicast routers to keep traffic from propagating across too many multicast routers.

Multicast group addresses in the range of 224.0.0.0 through 224.0.0.255 are reserved for administrative use only, and as such are restricted to the local network. Any data sent to those addresses must have a Time-to-Live setting of 1 in order to keep those messages from being forwarded across multicast routers. RFC 2236 goes so far as to state that even if the Time-to-Live is not set to 1 on those datagrams, they still should not be forwarded.

For example, the all-hosts group address of 224.0.0.1 refers to all of the multicast-aware devices on the local network. Whenever a multicast router wants to locate the active multicast groups in use on the local network, it will send an IGMP query to the 224.0.0.1 group address. In addition, the multicast router must set the Time-to-Live value of that query to 1, preventing any other multicast routers on the local network from forwarding the query to any other networks they may be attached to.

In addition to these considerations, it is also important to note that ICMP errors must not be generated in response to IGMP or multicast datagrams, as that could result in many ICMP messages getting sent for every multicast or IGMP packet. Therefore, if a multicast router sees an IGMP message with the Time-to-Live value set to 1 (normally meaning that the datagram is about to expire and cannot be forwarded), then the ICMP software on that router must not return an ICMP Time Exceeded error message. Instead, the datagram should just be discarded silently. For more information on ICMP, refer to Chapter 5, The Internet Control Message Protocol.

Managing Group Memberships

Although any host can send traffic to a multicast group, applications that want to participate in a distributed multicast group as listeners must inform the local multicast routers of the groups that they want to listen for. This is done using IGMP messages that state the multicast group address that the host wants to participate in.

In addition, multicast routers also use IGMP messages to query the local network for hosts that have active memberships in multicast groups that are being routed to the local network. This is basically the entire function of IGMP: routers asking for hosts that are listening for multicast traffic, and hosts telling routers which groups they are listening for.

All IGMP messages have certain requirements. First of all, they must all have the Time-to-Live value of the IP datagram set to ? hop? thereby preventing them from being forwarded by any routers. In addition, RFC 2236 states that all IGMP messages should set the IP datagram to use the IP Router Alert Option, as discussed in 밨outer Alert?in Chapter 2.

Membership reports

The process of registering with a router is very straightforward. Whenever an application that needs to participate in a specific multicast group is started, an IGMP 밠embership Report?is sent by the host to the multicast address in use by the application. RFC 2236 states that multicast hosts should resend the first Membership Report within a few moments, in case the first one was lost or damaged.

For example, servers running the multicast-enabled Simple Network Time Protocol (SNTP) can use the multicast group address of 224.0.1.1 to automatically pass time-synchronization data across the network to SNTP-enabled clients. Whenever an SNTP system is started, it will immediately send a couple of IGMP Membership Reports to the 224.0.1.1 multicast address, as illustrated in Figure 4-3. Any multicast routers on the local network will see this report, and use this information when building their multicast-forwarding maps.


		Figure 4-3. An SNTP-enabled system announcing its presence to the network

Membership Reports are also sent in response to IGMP Membership Queries, which are periodically sent by multicast routers on the local network in order to determine which multicast groups still have active memberships. Whenever a query is received, at least one host from each multicast group on the local network must respond with a Membership Report. This procedure is described in more detail in 밠embership queries?later in this chapter.

Note that Membership Reports are never generated for the 밶ll-hosts?group address of 224.0.0.1. Although every multicast-enabled host should listen for traffic on that multicast address, no host should ever report its membership in that group.

Leave reports

Another announcement mechanism that is commonly seen with many multicast applications is the 밚eave Report,?used to announce that a particular host is no longer interested in receiving multicast traffic for a particular group. The idea behind this report is that it can be used to inform a local multicast router that the host is going away, and that the router can stop forwarding multicast data for the specified group address to this network.

There are some special considerations with this type of report. For one thing, the router has to verify that the host that is leaving the multicast group is indeed the last user of the specified group address, otherwise any other hosts on the local network that still wanted to receive traffic for that group wouldn't get it any more. This is achieved through the use of a group-specific query message (as we'll discuss in the next section, 밠embership queries?. If no hosts respond to the group-specific query, the router can stop forwarding traffic for that group's multicast address immediately, thereby reducing the local bandwidth consumption.

Furthermore, Leave Reports were introduced in the IGMPv2 specification, and were not a part of the IGMPv1 specification. As such, they will only be used on systems that use IGMPv2, and will be ignored by systems that are using IGMPv1.

It is important to note that Leave Reports are not sent to the multicast address of the group being left as are Membership Reports, but instead are sent to the 밶ll-routers?group address of 224.0.0.2, which is similar to the 밶ll-hosts?address, except that it is only monitored by multicast routers.

Figure 4-4 shows the NTP client from Figure 4-3 announcing that it is leaving the network. In the example network shown, only Sasquatch should receive the Leave Report, since it is the only monitored by multicast router on the network.

Devices that do not understand Leave Report messages should silently discard them, as they should all other unknown message types.


		Figure 4-4. An SNTP-enabled system announcing its departure from the network

Membership queries

Another commonly seen IGMP message is the 밠embership Query.?Queries are sent by multicast routers in order to verify that each of the multicast groups that are currently being forwarded still have listening members on the local network (by default, the interval between queries is 125 seconds). If no hosts respond to a Membership Query, then the router can stop forwarding multicast datagrams for any of the multicast groups that no longer have any active listeners on that network.

Typically, only one multicast router issues these queries. Any other multicast routers on the network then takes a passive role, watching for Membership Reports, but not sending any queries. The 뱏uery router?is elected by having the lowest-numbered IP address. If that router stops issuing queries, then the multicast router with the next-higher IP address will become the query router.

Hosts respond to Membership Queries with standard Membership Reports, and the multicast routers use this information to determine which multicast addresses should be forwarded to the local network. If no hosts respond for a multicast address that is already being forwarded, then the multicast router can assume that nobody wants that traffic anymore and can stop forwarding remote datagrams for that multicast address to this network.

Membership Queries are sent to the 밶ll-hosts?address of 224.0.0.1. With IGMPv2, routers are also allowed to send group-specific queries to test for membership in a particular group, if this is needed for some reason. Regardless of the query type,

the hosts on the network will respond to the Multicast Query by sending their Membership Reports to the multicast addresses in use by their local applications, just like always. Multicast routers are required to listen for data on every multicast address, so they will see all of the responses.

Figure 4-5 shows an IGMP Membership Query being sent to the 밶ll-hosts?group address of 224.0.0.1. The IGMP messages that are generated in response to the query would be simple Membership Reports—sent to the multicast addresses in use by the applications on those hosts—as shown back in Figure 4-3.


		Figure 4-5. An overview of the membership-sampling process

Keep in mind that the purpose of the Membership Query is to locate any hosts that are actively listening for a specific group, and is not intended to locate every host that may be listening. Since multicast routers only need to know that at least one host is listening for specific multicast addresses, they do not benefit from multiple hosts responding for a single multicast address.

For this reason, hosts incorporate a random length timer that they use before responding to a Membership Query. If no other hosts have responded on behalf of the multicast groups in use on this host before the timer expires, then the host will issue a response. Any other hosts that have not yet responded for that group will then abort their own responses. However, it should be noted that this procedure is conducted on a per-group basis: if a host aborts a response for a group because some other host has already responded to the Membership Query, then it does not abort all of the pending responses; instead it will only abort responses for the groups that have already been responded to by the other hosts.

IGMP Messages

Before a host can receive and process multicast datagrams, it has to do two things: it must inform the local network interface card that it wishes to accept and process network frames for specific multicast groups, and it must inform any multicast routers on the local network that it wishes to receive the IP packets for those multicast groups as well.

How the application communicates with the local network interface card is a function of the specific IP implementation in use on the local system (the process is discussed earlier in this chapter in 밚ocal Multicasting?. However, the mechanisms used by IP to inform the multicast routers on the local network of its desire to participate in certain multicast groups is a function of the Internet Group Management Protocol (this process is discussed earlier in 밠anaging Group Memberships?.

IGM Message Headers

IGMP messages consist of a predefined set of fields that indicate the type of message being passed and the multicast group that the message applies to. Each 뱈essage type?is used in specific scenarios and is indicated by a field in the header. Depending on the version of IGMP in use and the type of message being sent, an IGMP message will have four or five fields, with different fields being used for the different versions of IGMP.

Table 4-2 lists all of the fields in IGMPv1 messages, along with their size (in bits) and some usage notes. For more information on these fields, refer to the individual discussions throughout this section.

Table 4-2. The Fields in an IGMPv1 Message
Field	Bits	Description
Version	4	Indicates the version of IGMP in use (always 1).
Message Type	4	Indicates the message type (such as 뱏uery?or 뱑eport?.

Table 4-2. The Fields in an IGMPv1 Message (continued)
Field	Bits	Description
Unused	8	This field is unused with IGMPv1 query messages, and must be filled with zeroes.
Checksum	16	Used to store a checksum of the IGMP message.
Multicast Group	32	The 32-bit IP address of the multicast group that this message is associated with.

Table 4-3 lists all of the fields in IGMPv2 messages, along with their size (in bits) and some usage notes. For more information on these fields, refer to the individual discussions throughout this section.

Table 4-3. The Fields in an IGMPv2 Message
Field	Bits	Description
Message Type	8	Indicates the message type (such as 뱏uery?or 뱑eport?, and also uses some bits to indicate the version of the message.
Max Response Time	8	Indicates the longest period of time that a host should wait before responding to a query. This field is only used with IGMPv2 query messages, and is unused with IGMPv1 messages.
Checksum	16	Used to store a checksum of the IGMP message.
Multicast Group	32	The 32-bit IP address of the multicast group that this message is associated with.

Note that the size and use of the fields are such that IGMPv2 messages are almost identical to IGMPv1 messages. The only real differences between the two versions is that IGMPv2 messages have subsumed the Version field into the Message Type field, and that the IGMPv1 밬nused?field is now the IGMPv2 Maximum Response Time field.

The total length of a 뱊ormal?IGMP message is eight bytes. However, other IGMP messages may be longer, particularly with messages that are specifically used with multicast routing protocols and the like. If the Message Type is recognized as being one of the core messages, IGMPv2 implementation must ignore any additional data after the first eight bytes (although the checksum will be for the entire message, and not just the first eight bytes).

All IGMP messages are sent within IP datagrams directly. Each of these datagrams must have a protocol identifier of 2 and an IP Time-to-Live value of ? hop? In addition, all IGMPv2 messages must use the Router Alert Option in the header of the IP datagram.

Figure 4-6 shows an IGMPv2 Membership Query message, which will be used to dissect the headers of the IGMP Message.


		Figure 4-6. A simple IGMP Message

The following sections discuss the header fields in detail.

Version

With IGMPv1 messages, indicates the version of IGMP in use. However, IGMPv2 subsumed this field into the Message Type field, although IGMPv2 messages still set these four bits to 1.

Size
Four bits.

Default Value
1

Defined In
RFCs 1112 and 2236.

Notes
With IGMPv2, this field has been subsumed into the Message Type field, although it is still used to report version information in some cases. In all cases, this field should report a version number of 1, allowing older multicast routers to process the data. If this field reported 2, then IGMPv1 multicast routers would ignore the messages.

Capture Sample
In the capture shown in Figure 4-7, the Version field (i.e., the most significant four bits of the Type field) is set to 1, although the Message Type field shows that this is an IGMPv2 message. All of the messages currently defined for IGMPv2 use 1 in the four-bit version field.


		Figure 4-7. The Version field

'UP! > Web Service' 카테고리의 다른 글

The User Datagram Protocol (0)	2008.08.21
The Internet Control Message Protocol (0)	2008.08.21
The Address Resolution Protocol (0)	2008.08.21
The Internet Protocol (0)	2008.08.21
An Introduction to TCP/IP (0)	2008.08.21

Posted by 으랏차

The Address Resolution Protocol

UP!/Web Service 2008. 8. 21. 14:15

The Address Resolution Protocol

Summary	The Address Resolution Protocol provides a mechanism for IP devices to locate the hardware addresses of other devices on the local network. This service is required in order for IP-enabled systems to communicate with each other.
Relevant STDs	2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 37 (RFC 826, republished); 38 (RFC 903, republished)
Relevant RFCs	826 (Address Resolution Protocol); 903 (Reverse ARP); 1122 (Host Network Requirements); 1433 (Directed ARP); 1868 (UnARP Extension); 2131 (DHCP and DHCP ARP); 2390 (Inverse ARP)

When two IP-enabled devices on the same network segment want to communicate, they do so using the low-level protocols and addressing mechanisms defined for the specific medium in use.

For example, Ethernet devices use Ethernet-specific addresses when they communicate with each other, while Frame Relay networks use Frame Relay-specific addresses.

In order for IP systems to communicate with each other, they must first be able to identify the hardware addresses of the other devices on the same network segment that the local device is on. This service is provided by the Address Resolution Protocol.

The ARP Standard

ARP is defined in RFC 826, which has been republished as STD 37 (ARP is an Internet Standard protocol). However, RFC 826 contained some vagaries which were clarified in RFC 1122 (Host Network Requirements). As such, ARP implementations need to incorporate both RFC 826 and RFC 1122 in order to work reliably and consistently with other implementations.

RFC 826 introduced the concept of an Address Resolution Protocol as a useful way for devices to locate the Ethernet hardware address of another IP host on the same local network. As it turns out, ARP is also useful for many network topologies—not just Ethernet—and has since become incorporated into most of the other network topologies. All LAN media—and many WAN media—now use ARP to locate the hardware addresses of other IP devices on the local network.

When a device needs to send an IP packet to another device on the local network, the IP software will first check to see if it knows the hardware address associated with the destination IP address. If so, then the sender just transmits the data to the destination system, using the protocols and addressing appropriate for whatever network medium is in use by the two devices. However, if the destination system's hardware address is not known, then the IP software has to locate it before any data can be sent. At this point, IP will call on ARP to locate the hardware address of the destination system.

ARP achieves this task by issuing a low-level broadcast onto the network, requesting that the system that is using the specified IP address respond with its hardware address. If the destination system is powered up and on the network, it will see this broadcast (as will all of the other devices on the local network), and it will return an ARP response back to the original system. Note that the response is not broadcast back over the network, but is instead sent directly to the requesting system.

This process is illustrated in Figure 3-1. In that example, Ferret issues an ARP request for the hardware address associated with 192.168.10.40 to the network-specific broadcast address in use on that LAN. When Froggy sees the ARP request, it unicasts an ARP reply (containing Froggy's hardware address) to Ferret directly. Having gained the information it needed, Ferret can then send whatever data it has for Froggy to that host's hardware address, using the protocols that are specific to the underlying network medium.

ARP packets work at the data-link layer, the same as IP packets. As such, ARP packets are completely separate from IP packets; they even have a different protocol ID of 0806, instead of 0800 as used with IP.

ARP packets contain several fields, although only five of them are used to provide the bulk of ARP's functionality: the hardware address of the source, the IP address


		Figure 3-1. An overview of ARP

of the source, the hardware address of the destination, the IP address of the destination, and a 뱈essage-type?field that indicates whether the current ARP packet is a request or a response to a request.

When a device issues an ARP request, it fills in three of the four address-related fields, providing its own hardware and IP address, as well as the IP address of the target (the target's hardware address is unknown, so that field is filled with zeroes). In addition, it will set the message-type field to indicate that the current packet is an ARP request, and then broadcast the request onto the local network for all devices to see.

All of the local devices should monitor the network for ARP broadcasts, and whenever they see a request for themselves (as indicated in the destination IP address field of the ARP request), they should generate a response packet and send it back to the requesting system. The response packet will consist of the local device's IP and hardware addresses (placed into the sender fields), and the IP and hardware address of the original sender (placed in the destination fields of the response packet). The response will also be marked as such, with the message-type field indicating that the current packet is an ARP response. The new ARP packet is then unicast directly to the original requester, where it is received and processed.

Note that there is no timeout for an ARP request. The packet is sent and that's all there is to it. If a response comes back, that's great. If not, then ARP itself does not care, although the IP stack itself probably does care, and will continue to issue ARP lookups. However, this is also dependent upon the implementation of ARP,

and how it deals with queuing. Many implementations only have room in the ARP lookup queue for a single packet for each of the hosts being queried. If an ARP request is not satisfied and another packet arrives from IP for the host being queried, then the first query will likely be aborted, and a second query will be issued. If the host is not responding to the ARP queries—but IP is trying to send multiple datagrams to that host—then there will be multiple queries for that host.

In this model, a higher-layer protocol such as TCP may notice that a problem has occurred and attempt a retransmission, which would in turn generate another IP packet (and another ARP request). Other protocols (such as UDP and ICMP) will not do this, however, and will simply treat a failed ARP lookup as a general timeout error (if a timer is being kept).

For example, if you try to use the ping program to test a host that is powered off, that host will not be able to respond to ARP broadcasts, and as such the local ICMP software will not be able to send an ICMP Echo Request message to the destination system. However, ping may continue generating ICMP Echo Request messages, which will result in multiple ARP lookups. ping may report timeout errors to the user for each of the failed lookups (since no response was received to the ICMP Echo Request message that was never actually sent over the network), although these errors will be generated by ping, and not by ARP.

The ARP Cache.

When the requesting system gets an ARP response, it will store the hardware and IP address pair of the requested device into a local cache. The next time that system needs to send data, it will check the local cache, and if an entry is found it will go ahead and use it, eliminating the need to broadcast another request.

Likewise, the system that responded to the ARP broadcast will store the hardware and IP addresses of the system that sent the original broadcast. If it did not do so, it would have to issue an ARP broadcast to find out where to send the ARP response.

Since all of the systems on the network will see the ARP broadcast, they could go ahead and capture the IP and hardware addresses of the sender, storing this information in their own caches. However, doing this might cause existing entries in the cache to be flushed (an especially problematic issue with systems that have small caches), so only those systems who already have the sender's IP address in their cache should update their entries. Hosts that do not already have the device in their cache should ignore the broadcast.

Cache size issues

Most systems have a very limited ARP cache, with only enough room for a few entries. These entries will be overwritten as needed—a problem for busy networks.

For example, if a client system frequently accesses a variety of different servers (one for mail, one for web, one for routing, one for database, etc.), and its ARP cache was smaller than the number of systems it was connecting to, then it would have to overwrite entries in the cache on a continual basis. It would issue an ARP broadcast, cache the result, and then overwrite that entry with the results from another ARP broadcast for a different system a few moments later. If this problem were multiplied by hundreds of devices, the network could theoretically become saturated with ARP broadcasts.

Likewise, if many clients are accessing a single server, and if the ARP cache on that server is too small to hold entries for all of the clients, then the ARP cache on the server may get flushed continually. This would require the server to issue ARP requests whenever it needed to send data back to the clients. This in turn would flush even more entries, forcing the cycle to be repeated endlessly.

Note that many large, multiuser systems and network routers have very large ARP caches to prevent these types of problems from occurring in the first place. For example, high-end Cisco routers have an ARP cache that is large enough to hold several hundred entries, since the router is likely to exchange data with each PC on the local network quite frequently. Having a large cache on these types of devices is essential to smooth operation, since otherwise the servers and routers could communicate with only a few PCs simultaneously.

Cache timeout issues

Systems should flush entries from the ARP caches after they have been unused for a while. This allows well-known IP addresses to be moved to a different machine—or for a well-known machine to be given a new IP address—without communication problems coming about due to stale (and invalid) address mappings. ARP cache timeout values that are too high will cause problems whenever a host is assigned a different IP address, since the other hosts who have an older entry in their caches will still try to send data to the old (and invalid) hardware address.

Conversely, an ARP timeout that is too short will also result in problems, especially on busy networks with lots of devices. If network clients are constantly flushing their ARP caches due to short timeout values, then many broadcasts will be required. This hurts performance, since the IP software will not be able to send any data until an ARP broadcast has been sent and responded to.

RFC 826 does not give a specific timeout value for ARP cache entries, so each vendor implements this value differently. On desktop PCs, the timeout value can be quite low, with entries expiring after just 120 seconds on Windows 95 systems. Conversely, many server-class systems will set the timeout value to be ten minutes or more, while some routers set the timeout as high as four hours. These types of systems typically benefit from having high timeout values, since the client systems do not change addresses very often, and provide their new hardware address whenever they switch to a new IP address (by issuing ARP queries).

You may need to ask your system vendor for the default ARP timeout value, or for instructions on how to change it. However, RFC 826 does state that if a device sees an ARP packet whose IP address is already in the cache, then any timeout clocks for that entry should be reset. This helps the already-cached data to survive for as long as possible.

Static caching

To get around some of the problems encountered with entries expiring and being over-written, many products provide tools for adding static entries to the cache that never expire. On some systems, static entries are cleared from the cache when the system is restarted, requiring they be added again, while on other systems the static entries are more or less permanent, remaining until they are manually purged or until space is needed in the cache.

There are some benefits that come from using static ARP entries. Some network administrators incorporate static entries into an overall network security strategy. By adding static ARP entries for every device on the network, an administrator can prevent someone from using somebody else's IP address to access a sensitive host. Another benefit of static cache entries comes from the elimination of frequent lookups, since clients will always have ARP cache entries for the servers they use most frequently.

However, this benefit can also cause management headaches if systems are moved or renumbered frequently. If the network adapter is changed on a server—or if a well-known IP address is moved to another computer—then the static cache entry for that device will be invalid on all of the systems that have it. Systems will be unable to communicate until the static entries have been purged.

Proxy ARP

Sometimes it is useful to have a device respond to ARP broadcasts on behalf of another device. This is particularly useful on networks with dial-in servers that connect remote users to the local network. In such a scenario, a remote user might have an IP address that appears to be on the local network, although the user's system would not be physically present (it would be at a remote location, connected through the dial-in server).

Systems trying to communicate with this node believe that the device was local, and would use ARP to try and find the associated hardware address. However, since the system is remote, it would not see (nor respond to) the ARP lookups.

Normally, this type of problem is handled through Proxy ARP, which allows a dial-in server to respond to ARP broadcasts on behalf of any remote devices it services.

Figure 3-2 shows a dial-up client connected to an Ethernet network using an IP address that is associated with the local Ethernet network. Whenever another device on the network needed to locate the hardware address of the remote user, it would issue an ARP Request for that IP address (as shown in step one in the figure), and expect an Ethernet address back in the ARP Response.

However, since the dial-up client is actually on a remote network, it would not see the ARP request (unless bridging were enabled over the connection, which is highly unlikely). Instead, Sasquatch should respond to the ARP Request on behalf of the dial-up system, providing its own Ethernet address in the ARP Response packet (as shown in step two in the figure). Essentially, Sasquatch would trick the other hosts into thinking that it was the remote system.


		Figure 3-2. An overview of Proxy ARP

Having gained the necessary information, the other devices on the local network could then start communicating directly with the communications server (using the low-level protocols appropriate for the medium). The communications server would accept the packets on behalf of the dial-up user, see that they were intended for the dial-up host (as shown in the destination address field of the IP header), and forward them on.

Some interesting problems can occur with Proxy ARP, especially in regards to mobile users who frequently connect and disconnect with multiple dial-in servers that share a modem pool or ISDN line. Whenever a client connects to an application server on the local LAN, ARP entries are stored in the server's cache, and if the user were to disconnect from the dial-up server and reconnect to another dial-up server on the same LAN (negotiating the same IP address for both connections), the application server's ARP cache entry for the client would still point to the first dial-up server's hardware address. This situation would prevent the application server from being able to communicate with the client, at least until the Proxy ARP entry in the server's cache was flushed. This particular problem can be resolved in a variety of ways, however, with the most common solution being the use of Gratuitous ARP on the dial-up servers in the pool.

Variations on the ARP Theme

Just as an IP device can locate the hardware address for a specific IP address, so can a device locate the IP address associated with a known hardware address. Two variations on ARP take advantage of this capability, although for very different purposes:

Inverse ARP
Inverse ARP works exactly the opposite of regular ARP. Rather than a device needing to find the hardware address associated with a known IP address, Inverse ARP is used to find the IP address for a known hardware address.

Reverse ARP
Reverse ARP is used to allow a diskless workstation to request its own IP address, simply by providing its hardware address in an ARP request broadcast. A Reverse ARP server will see this request, assemble an ARP response that contains an IP address for the requester to use, and then send it back to the requesting device. The workstation will then use this IP address for its networking activity.

In addition, three other variations on ARP have been developed that attempt to overcome some of the problems that have proven to be common on IP networks:

DHCP ARP
DHCP ARP is used by devices that obtain an IP address using an address-allocation protocol such as DHCP. The purpose of DHCP ARP is to allow the device to probe the network for any other device that may be using the assigned IP address already, prior to actually trying to use the address. This process helps prevent problems with duplicate or overlapping address assignments, as the requester can just reject the assignment if another device responds to the DHCP ARP query.

Gratuitous ARP
Gratuitous ARP is used when a device issues an ARP broadcast for the sole purpose of keeping other devices informed of its presence on the network.

There is no information gained by the requesting device in this scenario. However, other devices that already know about this device will update their cache entries, keeping them from expiring too quickly.

UnARP
Similarly, UnARP provides a mechanism for de-registering ARP entries. When a device wishes to leave the network, it can issue an UnARP broadcast, causing the other devices to clear their ARP caches of entries for this device.

DHCP ARP, Gratuitous ARP, and UnARP can each be extremely useful when managing networks that change frequently.

Inverse ARP (InARP)

The design goal for ARP was to allow an IP system to locate the hardware address associated with the (known) IP address of another system on a local network. Inverse ARP—which is documented in RFC 2390—works in the exact opposite manner, instead allowing a system to locate the IP address associated with a known data-link address. This feature is necessary when the devices are known to each other at the data-link layer, but the IP addresses of those systems are not known, which will prevent the two systems from being able to communicate with each other using IP. This is a common scenario on networks that share data-link addresses across different physical networks, such as Frame Relay and ATM.

Frame Relay networks are somewhat different from traditional networks in that the devices themselves do not have hardware addresses. Instead, the devices have addresses for each of the circuits that they are connected to (known as 밺ata-link connection identifiers,?or DLCI addresses). Whenever a device needs to send data to another device, it sends the data to the DLCI address that is associated with the specific circuit that the destination device is available through.

From IP's point of view, if a device wants to send IP packets to another device on the Frame Relay network, then the sender must have an ARP cache of sorts that maps destination IP addresses to specific circuit identifiers, allowing the sender to transmit data for a known IP address through the correct Frame Relay circuit. Typically, Frame Relay access devices must be manually configured with this mapping information, showing that DLCI 153 is associated with the IP address of 192.168.30.122, for example. In this model, the sender is using a static ARP entry for each of the circuits that it is connected to. This scenario does not scale well, to say the least, particularly with Frame Relay networks that have hundreds of circuits (and therefore hundreds of IP-to-DLCI mappings).

This is the problem that Inverse ARP solves: by sending a query to each of the circuits that a device is attached to, a device can automatically build circuit-to-IP address mappings on a somewhat-automated basis. Once the destination systems on the far end of those circuits receive the Inverse ARP requests, they will respond with the IP address they are using on that specific circuit, allowing the sender to build these maps automatically.

Figure 3-3 provides a rough illustration of how this concept works. In that example, one of the devices wants to know the IP address of the other device at the far end of circuit 153, so it sends an Inverse ARP request through that circuit. The remote system sees the incoming request and responds with an Inverse ARP response that shows the IP address that is associated with that virtual circuit. Note that if the remote system were connected to multiple circuits, it would need to have different IP addresses for each of the circuits it were connected to (per-interface IP addresses are required in order for basic IP routing to work across the network).


		Figure 3-3. An overview of Inverse ARP

Every device on the Frame Relay network can go through this process, sending Inverse ARP requests through every circuit defined on the network, until all of the devices discover the IP addresses in use by all of the other devices. This information can then be used to build workable ARP caches for each of the systems. Note that these ARP entries should be treated the same as any other type of ARP data, with address pairs getting flushed or updated as needed by the system

An interesting aside here is that Frame Relay circuits can have different circuit identifiers at each end, depending upon the number of switches in between the two end-points. In that scenario, the two end-points will have different circuit identifiers for the IP addresses in their ARP caches. For example, the sending router shown in Figure 3-3 may see the circuit as DLCI 153, while the recipient may see the circuit as DLCI 298. However, since the end-points only communicate with the circuit (and not with the remote system directly), the hardware address associated with the remote IP address is somewhat irrelevant.

Although the model put forth by Inverse ARP could be incorporated into Ethernet and other LAN-based topologies, it has not yet been done so to my knowledge. Generally speaking, there is no need for these services on a local network where every device shares the same physical medium. Frame Relay and ATM networks are unique in that devices can share a data-link connection while not being connected to the same physical network medium.

For more information on Inverse ARP, refer to RFC 2390. For more information on using IP with Frame Relay networks, refer to RFC 2427. For more information on using IP with ATM networks, refer to RFC 2225.

Reverse ARP (RARP)

Reverse ARP (RARP) is most often used with diskless workstations that do not have a permanent IP address, or that do not have sufficient resources for storing a permanent IP address. Whenever such a device needs to obtain an IP address for its own use (generally at boot time), it issues an ARP broadcast that contains its own hardware address in both the Sender and Recipient Hardware Address fields. Note that the Sender and Recipient Protocol Address fields would be zeroed out.

A special host (called a RARP server) watches for RARP broadcasts (RARP packets have their own unique ethertype of 8035). When one is seen, the server attempts to find an entry for the requesting device's hardware address in a local table of IP-to-hardware address mappings. If a match is found, it returns a RARP response to the requesting device, providing it with the IP address needed to continue the boot process. This process is illustrated in Figure 3-4.


		Figure 3-4. An overview of Reverse ARP

Although RARP was functional enough to be useful, it did not offer much in the way of comprehensive resource configuration. RARP did not provide for automatically determining the subnet mask in use on the network, the routers available, or much of anything else. For this reason, RARP is used much less frequently today than it has been in the past, with most devices having long ago moved to the BOOTP or DHCP configuration protocols for address-assignment and configuration services. You probably won't see it in use except on networks that have lots of older equipment.

For more information on RARP, refer to RFC 903.

DHCP ARP

One of the key differences between DHCP and RARP is that DHCP uses a shared pool of addresses when assignments are made, rather than using fixed entries like RARP does. This allows a network manager to share a limited number of IP addresses among a large number of hosts, instead of having to assign fixed addresses to each host. Although this makes life a little easier for the administrator in one sense, it can also make life more difficult, particularly when it comes to making sure that more than one device is not trying to use the same address at the same time. This is particularly problematic when users do not release the address they have been assigned and another device starts trying to use it, or when a user has manually configured his system to use a specific address that is also being assigned out of the DHCP pool.

In order to minimize the opportunity for this problem to occur, RFC 2131 (the DHCP RFC) stated that devices should verify the integrity of the address they are getting from the DHCP server before they accept the assignment. RFC 2131 also defined a specific ARP request format to be used for this purpose, in order to keep other systems from getting confused by the ARP request.

With DHCP ARP, the requesting device issues a normal ARP request, except that instead of putting its own IP address in the Source Protocol Address field, it puts in ?.0.0.0? The rest of the packet looks like a normal ARP request, with the local hardware address in the Source Hardware Address field, the questionable IP address in the Destination Protocol Address field, and the Destination Hardware Address field containing zeroes.

This combination of fields serves two purpose. First of all, by providing the local hardware address in the Source Hardware Address field, a device that is already using the questionable address will see the ARP request and will be able to respond to it (ARP does not use IP for delivery, so it does not need an IP address for responses to work). At the same time, the lack of an IP address in the Source Protocol Address field prevents the other systems on the network from updating their ARP caches. Otherwise, the IP address provided in the Source Protocol

Address field would cause the other network devices to update any existing entry in their caches, even though another system may be using that IP address.

This process is demonstrated in Figure 3-5. In that example, Greywolf has been assigned an IP address from the DHCP server, but it tests the assignment using a DHCP ARP request before attempting to use it. In that case, it would create an ARP request packet, and set the Destination Protocol Address field to the address that it had been assigned by the DHCP server. Then it would fill the Destination Hardware Address and Source Protocol Address fields with zeroes.


		Figure 3-5. An overview of DHCP ARP

If Greywolf does not receive a response to the query, then it can assume that no other device on the local network is using that IP address, and start using it. If Greywolf did receive a response, however, then it should reject the IP address offered by the DHCP server, and restart the assignment process.

It is important to note that just because a client does not receive a response from a DHCP ARP request does not mean that the IP address is safe to use. For example, another device may have that address permanently configured, but may also be powered off at the moment the DHCP ARP was issued. When that device was powered back on, it would try to use the permanently assigned address, resulting in addressing conflict problems (such as reset TCP connections, unexpected ICMP messages, and other such problems). Furthermore, not every device understands the DHCP ARP format, and many older systems will just ignore an ARP request that has zeroes in the Source Protocol Address field. In fact, the UnARP protocol explicitly uses ARP responses instead of requests for just that reason.

RFC 2131 also suggests that clients 뱒hould broadcast an ARP reply to announce the client's new IP address and clear any outdated ARP cache entries in hosts on the client's subnet.?However, there is some confusion in regards to this clause. Most systems implement this type of functionality using the Gratuitous ARP mechanism, which uses a broadcast ARP request message (rather than an ARP reply)._. Although RFC 2131 clearly states that devices should issue ARP replies, many implementations will use an ARP request for this service.

For more information on DHCP ARP, refer to RFC 2131.

Gratuitous ARP.

Simply put, Gratuitous ARP is where a device broadcasts its own hardware and IP address pair, solely for the purpose of causing other devices to update their ARP caches. Remember that systems will not add new devices that they see in a broadcast (as this may cause the cache to flush more important entries). However, they will gladly update any entries that are already in the cache.

When a Gratuitous ARP request is broadcast, the sender puts its hardware and IP address information into the appropriate sender fields, and also places its IP address in the destination IP address field. It does not put its hardware address in the destination hardware field, however. Other devices on the network see this broadcast, and if they have the sender's information in their cache already, they either restart the cache entry's countdown timers or modify the entry to use the new hardware address (in case the network adapter on the sending host has been changed).

This process can be particularly useful with servers that communicate with many different clients frequently; the frequent use of Gratuitous ARP messages causes those clients to update their cache entries frequently, thereby keeping the clients from having to constantly reissue queries whenever they want to connect to this server. Furthermore, Gratuitous ARP messages can also be useful when a system frequently dials into a modem pool that is served by multiple communications servers, and the user has pre-selected an IP address to use for her connection. In that environment, the address being used may be associated with a different hardware address from one of the other communications servers (from one of the previous connections). By using Gratuitous ARP, the active dial-up server will ensure that all of the hosts on the LAN have the right hardware address in their ARP caches.

This process is illustrated in Figure 3-6, which shows Ferret issuing an ARP Request for its own IP address (192.168.10.10).

Nobody should reply to the ARP request, since nobody else should have the IP address listed in the destination field. If anybody does reply to the Gratuitous ARP


		Figure 3-6. An overview of Gratuitous ARP

broadcast, it would be because the responding host was also using the same IP address of the sending system. In this case, both systems should notice that there is an addressing conflict, and may choose to display a warning message on the system console stating that another host is using the local host's IP address.

In fact, many hosts use Gratuitous ARP whenever they come online just for this purpose, since it offers a cheap way to test for address conflicts on startup. Unfortunately, the use of Gratuitous ARP in this way can actually cause some problems to occur, and is not always a good idea. For example, if a misconfigured host issues a Gratuitous ARP for an IP address that is already in use by another device, then any clients who see that broadcast will update their ARP caches to use the new hardware address being published, even though the new mapping may point to the wrong host. From that moment on—or at least until the cache entries are updated—those clients will try to send data to the misconfigured host, rather than to the original (proper) host.

DHCP ARP is probably a better mechanism to use when testing for addressing conflicts on startup, since it allows the sender to detect whether or not another device is using the IP address before it actually tries to use it. Furthermore, the use of zeroes in the Source Protocol Address field prevents the other systems from updating their ARP caches with this host's information, which may be the more prudent choice.

UnARP

As discussed in 밫he ARP Cache?earlier in this chapter, there are no Time-to-Live mechanisms defined in the ARP standard. Each vendor implements its own ARP timers, ranging anywhere from two minutes to twenty minutes and more. This wide diversity can cause several problems.

For example, a DHCP client may be assigned an IP address, use it for a while, and then release it. Moments later, another client may be assigned the same IP address from the DHCP server. If any systems had the first DHCP client's hardware address stored in their cache, they would not be able to communicate with the new DHCP client.

One possible solution to this problem that has been proposed is UnARP, another variation on the ARP theme as defined in RFC 1868. UnARP dictates that a special ARP packet should be broadcast whenever a node disconnects from the network, explicitly telling other devices that the node is going away, and that the cache entry for that host should be flushed. This warning would allow another device (such as another DHCP client) to reuse the IP address immediately, without having to worry about stale caches causing any problems.

Essentially, UnARP is an unsolicited ARP response with zeroes in the Source and Destination Hardware Address fields that get broadcast across the network, as illustrated in Figure 3-7.


		Figure 3-7. An overview of UnARP

There are some key concepts to UnARP that are subtle enough to be easily missed:

?Zeroes are used in the Source and Destination Hardware Address fields to indicate that no hardware address is associated with the IP address provided in the Source Protocol Address field. In addition, the Hardware Address Length field is also set to zero.

?UnARP uses an ARP response packet rather than an ARP request packet. This is because UnARP uses zeroes in the hardware address fields, which could break lesser-bred ARP implementations that were expecting data in those fields. Since devices do not typically monitor for unsolicited ARP responses, support for UnARP would therefore require special code within the device, helping to minimize problems that might result from using zeroes in the hardware address fields.

Note that RFC 1868 is defined as an experimental RFC, and therefore many vendors do not support it. You should check with your vendor to see if it is among those that do.

For more information on UnARP, refer to RFC 1868.

The ARP Packet

An ARP packet works at approximately the same layer as IP. It communicates with the data-link services provided by the physical medium, and as such, ARP is a separate protocol than IP (and is identified separately by any network that both categorizes the protocols being carried in the low-level frames and supports ARP directly). The ethertype for ARP is 0806 as opposed to the ethertype for IP, which is 0800.

According to RFC 826, an ARP packet is made up of nine fields. The total size of the packet will vary according to the size of the physical addresses in use on the local network medium.

The fields in an ARP packet are listed in Table 3-1.

Table 3-1. The Fields in an ARP Packet
Header Field	Bytes	Usage Notes
Hardware Type	2	Identifies the medium-specific address type being requested.
Protocol Type	2	Identifies the higher-layer protocol information being discussed (which is always IP as far as we're concerned).

Table 3-1. The Fields in an ARP Packet (continued)
Header Field	Bytes	Usage Notes
Hardware Address Length	1	Specifies the size of the physical medium's hardware address, in bytes. Since each network uses different physical addressing mechanisms, this field is required for other fields further in the ARP packet.
Protocol Address Length	1	Specifies the size of the higher-layer protocol's address, in bytes. This is always 4 (32 bits) for IP.
Message Type	2	Identifies the purpose for this ARP packet (뱑equest?or 뱑esponse?.
Source Hardware Address	varies	Identifies the hardware address of the system issuing the ARP broadcast.
Source IP Address	varies	Identifies the higher-layer protocol address of the system issuing the ARP broadcast. This field's value will be 4 when IP is in use.
Destination Hardware Address	varies	Identifies the hardware address of the system responding to the ARP broadcast.
Destination IP Address	varies	Identifies the higher-layer protocol address of the system responding to the ARP broadcast. This field's value will be 4 when IP is used.

Notice that the ARP packet's format does not provide Time-to-Live or Version fields. This deficiency has caused a considerable amount of difficulty. Since there is no Time-to-Live field, every implementation uses its own cache timeout mechanisms, which results in inconsistent views of the network. Also, since there is no Version field, it is not possible to change the protocol's structure (such as adding a Time-to-Live field) without causing incompatibilities between the different implementations.

Figure 3-8 shows a capture of two ARP packets: an ARP Request from Ferret to the local broadcast address, and an ARP Response from Krill back to Ferret. These two packets will be used to illustrate the ARP packet throughout the rest of this section.

The following sections discuss the fields of the ARP packet.

Hardware Type

Identifies the hardware address type being requested, in decimal format. This would be 1 for DIX-Ethernet, 6 for IEEE 802.x Ethernet, 7 for ARCnet, etc.

Size
Sixteen bits.

Notes
Since ARP is used for getting the hardware address associated with an IP address—and some systems allow an IP address to be associated with multiple types of hardware—there must be a way to request the specific type of


		Figure 3-8. Two ARP packets

hardware address desired. Two devices on the same Ethernet LAN would want to communicate using the Ethernet-specific addresses, and the ARP request needs to be able to ask for that kind of address specifically.

The Hardware Type field uses decimal codes to indicate the address type being requested. Some of the more common hardware types are listed in Table 3-2. For a detailed listing of all of the known hardware types that are currently registered for ARP, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/arp-parameters).

Table 3-2. The Hardware Types Listed in RFC 1700 (Internet Assigned Numbers)
Hardware Type	Decimal Code
DIX-Ethernet and FDDI	1
IEEE 802 (includes 802.3 Ethernet and 802.5 Token Ring)	6
ARCnet	7
LocalTalk	11
SMDS (Switched Multimegabit Data Service)	14
Frame Relay	15
ATM (Asynchronous Transfer Mode)	19
Serial Line	20

Capture Sample
In the capture shown in Figure 3-9, the Hardware Type field is set to hexadecimal 0001, the code for DIX-Ethernet.


		Figure 3-9. The Hardware Type field

'UP! > Web Service' 카테고리의 다른 글

The Internet Control Message Protocol (0)	2008.08.21
Multicasting and the Internet Group Management Protocol (0)	2008.08.21
The Internet Protocol (0)	2008.08.21
An Introduction to TCP/IP (0)	2008.08.21
Internet Core Protocols (0)	2008.08.21

Posted by 으랏차

The Internet Protocol

UP!/Web Service 2008. 8. 21. 14:15

The Internet Protocol

Summary	The Internet Protocol provides a basic delivery service for transport protocols such as TCP and UDP. IP is responsible for getting data to its destination host and network. IP is not reliable, so the effort may fail.
Relevant STDs	2 (http://www.iana.org/); 3 (includes RFCs 1122 and 1123); 4 (RFC 1812, republished); 5 (includes RFCs 791, 792, 919, 922, 950, and 1112)
Relevant RFCs	781 (Timestamp Option); 791 (Internet Protocol); 815 (Fragmentation Reassembly); 919 (IP Broadcasts); 922 (Broadcasting on Sub-Nets); 950 (Sub-Net Recommendations); 1108 (Security Option); 1112 (IP Multicasting and IGMP v1); 1122 (Host Network Requirements); 1349 (Type-of-Service Flags); 1455 (Data-Link Security TOS Flags); 1812 (Router Requirements); 2113 (Router Alert Option)

As we learned in Chapter 1, An Introduction to TCP/IP, a variety of protocols are used for moving application data between different systems. We saw that hardware-specific protocols are used by devices when they need to exchange data directly, that the Internet Protocol is used to get IP datagrams across the different network segments to their final destination, and that TCP and UDP provide transport and connection management services to the application protocols used by end-user applications.

Although each of these layers provides unique and valuable services, the Internet Protocol is perhaps the most important to the overall operation of the Internet in general, since it is responsible for getting data from one host to another.

In this regard, IP can be thought of as being like a national delivery service that gets packages from a sender to a recipient, with the sender being oblivious to the routing and delivery mechanisms used by the delivery agent. The sender simply hands the package to the delivery agent, who then moves the package along until it is delivered.

For example, a package that is shipped from New York to Los Angeles is given to the delivery service (let's say UPS), with instructions on where the package has to go, although no instructions are provided on how the package should get to the destination. The package may have to go through Chicago first; the delivery agent at the New York UPS office makes that routing decision. Once the package reaches the Chicago UPS office, another delivery agent at that facility decides the best route for the package to take in order to get to Los Angeles (possibly going through Denver first, for example).

At each juncture, the local delivery agent does its best to get the package delivered using the shortest available route. When the package arrives at the Los Angeles facility, then another agent does its best to get it to the final destination system, using the destination address provided with the package to determine the best local routing.

Similarly, it is the function of IP to provide relaying and delivery decisions whenever an IP datagram has to be sent across a series of networks in order for it to be delivered to the final destination. The sending system does not care how the datagram gets to the destination system, but instead chooses the best route that is available at that specific moment. If this involves sending the datagram through another intermediary system, then that system also makes routing decisions according to the current condition of the network, forwarding the data on until it arrives at the destination system, as specified in the datagram's header.

The IP Standard.

IP is defined in RFC 791, which has been republished as STD 5 (IP is an Internet Standard protocol). However, RFC 791 contained some vagaries that were clarified in RFC 1122 (Host Network Requirements). As such, IP implementations need to incorporate both RFC 791 and RFC 1122 in order to work reliably and consistently with other implementations.

RFC 791 begins by stating 밫he Internet Protocol is designed for use in interconnected systems of packet-switched computer communication networks. The Internet protocol provides for transmitting blocks of data called datagrams from sources to destinations. The Internet protocol also provides for fragmentation and reassembly of long datagrams, if necessary, for transmission through 몊mall packet?networks.?/P>

RFC 791 goes on to say 밫he Internet Protocol is specifically limited in scope to provide the functions necessary to deliver a package of bits (an Internet datagram) from a source to a destination over an interconnected system of networks. There are no mechanisms to augment end-to-end data reliability, flow control, sequencing, or other services commonly found in host-to-host protocols.?/P>

That pretty much sums it up. A source system will send a datagram to a destination system, either directly (if the destination host is on the local network) or by way of another system on the local network. If the physical medium that connects the sending and receiving system offers enough capacity, IP will send all of the data in one shot. If this isn't possible, the data will be broken into fragments that are small enough for the physical medium to handle.

Once the datagram is sent, IP forgets about it and moves on to the next datagram. IP does not offer any error-correction, flow-control, or management services. It just sends datagrams from one host to another, one network at a time.

Remember this rule: the Internet Protocol is responsible only for getting datagrams from one host to another, one network at a time.

IP Datagrams Versus IP Packets

Hosts on an IP network exchange information using IP datagrams, which include both the units of data that contain whatever information is being exchanged and the header fields that describe that information (as well as describing the datagram itself). Whenever a device needs to send data to another system over an IP network, it will do so by creating an IP datagram, although the datagram is not what gets sent by IP, at least not in the literal sense.

Instead, IP datagrams get sent as IP packets, which are used to relay the IP datagrams to the destination system, one hop at a time. Although in many cases an IP datagram and an IP packet will be exactly the same, they are conceptually different entities, which is an important concept for understanding how IP actually works.

This concept is illustrated in Figure 2-1. In that example, Ferret needs to send an IP datagram to Fungi. However, since Fungi is on a remote network, Ferret has to send the packet containing the datagram to Sasquatch, who will then send another packet to Fungi.


		Figure 2-1. IP datagrams versus IP packets

IP datagrams contain whatever data is being sent (and the associated IP headers), while IP packets are used to get the datagram to the destination system (as specified in the IP headers). These IP packets are sent using the framing mechanisms defined for the specific network medium in use on the local network, and are subject to network events such as fragmentation or loss. However, the datagram itself will always remain as the original piece of data that was sent by the original sender, regardless of anything that happens to any of the packets that are used to relay the datagram.

For example, Figure 2-2 shows a four-kilobyte datagram that is being sent from Ferret to Fungi. Since this datagram is too large for the Ethernet network to send in a single frame, the datagram is split into four IP packets, each of which are sent as individual entities in individual Ethernet frames. Once all of the IP packets are received by the destination system, they will be reassembled into the original datagram and processed.

This model is necessary due to the way that IP provides a virtual network on top of the different physical networks that make up the global Internet. Since each of those networks have different characteristics (such as addressing mechanisms, frame sizes, and so forth), IP has to provide a mechanism for forwarding datagrams across those different networks reliably and cleanly. The datagram concept allows a host to send whatever data needs to be sent, while the IP packet allows the datagram to actually get sent across the different networks according to the characteristics of each of the intermediary networks.

This concept is fundamental to the design nature of the Internet Protocol, and is the key to understanding how IP operates on complex networks.


		Figure 2-2. Datagram fragmentation overview

Local Versus Remote Delivery

The IP header stores the IP addresses of both the source and destination systems. If the destination system is on the same physical network as the sending system, then the sender will attempt to deliver the datagram directly to the recipient, as shown in Figure 2-3. In this model, the sender knows that the recipient is on the same local network, so it transmits the data directly to the recipient, using the low-level protocols appropriate for that network medium.

However, if the two system are not connected to the same IP network, then the sender must find another node on the local network that is able to relay the IP datagram on to its final destination. This intermediate system would then have to deliver the datagram if the final recipient was directly accessible, or it would have to send the datagram on to yet another intermediary system for subsequent delivery. Eventually, the datagram would get to the destination system.

A slightly more complex representation of this can be seen in Figure 2-4. In that example, the sending system knows that the destination system is on a remote network, so it locates an intermediate system that can forward the data on to the final destination. It then locates the hardware address of the forwarding system, and passes the data to the intermediate system using the low-level protocols appropriate for the underlying medium. The intermediate system then examines the destination IP address of the datagram, chooses an exit interface, and sends the data to the final destination system using the low-level protocols appropriate to that network.


		Figure 2-3. An example of local delivery


		Figure 2-4. An example of routed delivery

The two network models shown in Figure 2-3 and Figure 2-4 are both relatively simple, and each represents the majority of the traffic patterns found on internal

corporate networks. Most networks only have a few segments, with the target being no more than a handful of hops away from the originating system.

But once datagrams start travelling over the Internet, things can get very complex very quickly. Rather than having to deal with only one or two routers, all of a sudden you may be looking at a dozen or more hops. However, IP handles complex networks the same way it handles small networks: one hop at a time. Eventually, the datagrams will get through. This concept is illustrated in Figure 2-5, which shows five different network segments in between the sending and destination systems.


		Figure 2-5. A complex, multi-hop network path

In the example shown in Figure 2-5, the sender has to give a packet to the local router, which will send another packet off to a router at the other end of a modem connection. The remote router then has to forward the data to yet another router across the carrier network, which has to send the data to its dial-up peer, which will finally deliver the datagram to the destination system. In order for all of this to work, however, each router must be aware of the path to the destination host, passing the data off to the next-hop router.

How IP finds remote hosts and networks

Every IP device—regardless of the function it serves—must have an IP address for every network that it is connected to. Most systems (such as PCs) only have a single network connection, and therefore only have a single IP address. But devices that have multiple network interfaces (such as routers or high-load devices like file servers) must have a dedicated IP address for every network connection.

When the IP protocols are loaded into memory, an inventory is taken of the available interfaces, and a map is built showing what networks the system is attached to. This map is called a routing table: it stores information such as the networks that the node is connected to and the IP address of the network interface connected to that network.

If a device only has a single interface, then there will be only one entry in the routing table, showing the local network and the IP address of the system's own network interface. But if a device is connected to multiple networks—or if it is connected to the same network several times—then there will be multiple entries in the routing table.

In reality, just about every IP device also has a 뱇oopback?network, used for testing and debugging purposes. The loopback network is always numbered 127.0.0.0, while the loopback interface always has the IP address of 127.0.0.1. This means that routing tables will generally show at least two entries: one for the physical connection and one for the loopback network.

When a system has to send a datagram to another system, it looks at the routing table and finds the appropriate network interface to send the outbound traffic through. For example, the router shown in the top-left corner of Figure 2-5 has two network connections: an Ethernet link with the IP address of 192.168.10.3 and a serial connection with an IP address of 192.168.100.1. If this router needed to send data to 192.168.10.10, then it would use the Ethernet interface for that traffic. If it needed to send datagrams to 192.168.100.100, it would use the serial interface. Table 2-1 shows what the router's routing table would look like based on this information.

Table 2-1. The Default Routing Table for 192.168.10.1
Destination Network	Interface/Router
127.0.0.0 (loopback network)	127.0.0.1 (loopback interface)
192.168.10.0 (local Ethernet network)	192.168.10.1 (local Ethernet interface)
192.168.100.0 (local serial network)	192.168.100.1 (local serial interface)

However, such a routing table would not provide any information about any remote networks or devices. In order for the router to send an IP datagram to 172.16.100.2, it would need to have an entry in the routing table for the 172.16.100.0

network. Systems are informed of these details by adding entries to the routing table. Most TCP/IP packages provide end-user tools that allow you to manually create and delete routing entries for specific networks and hosts. Using such a tool, you could inform the router that the 172.16.100.0 network is accessible via the router at 192.168.100.100. Once done, the routing table for the local router would be similar to the one shown in Table 2-2.

Table 2-2. The Routing Table for 192.168.10.1 with a Remote Route Added
Destination Network	Interface/Router
127.0.0.0 (loopback network)	127.0.0.1 (loopback interface)
192.168.10.0 (local Ethernet network)	192.168.10.1 (local Ethernet interface)
192.168.100.0 (local serial network)	192.168.100.1 (local serial interface)
172.16/0.0 (remote carrier network)	192.168.100 (next-hop router)

Since the router already knows how to send datagrams to 192.168.100.100, it now knows to send all datagrams for 172.16.100.2 to 192.168.100.100, under the assumption that the remote router would forward the packets for delivery. By adding entries for each network segment to the local routing table, you would be able to tell every device how to get datagrams to remote segments of the network. Such a routing table might look the one shown in Table 2-3.

Table 2-3 Complete Routing Table for 192.168.10.1, Showing Entire Network
Destination Network	Interface/Router
127.0.0.0 (loopback network)	127.0.0.1 (loopback interface)
192.168.10.0 (local Ethernet network)	192.168.10.1 (local Ethernet interface)
192.168.100.0 (local serial network)	192.168.100.1 (local serial interface)
172.16.100.0 (remote carrier network)	192.168.100.100 (next-hop router)
192.168.110.0 (remote serial network)	192.168.100.100 (next-hop router)
192.168.30.0 (remote Ethernet network)	192.168.100.100 (next-hop router)

Unfortunately, you would have to add entries for every segment of the network to every device on the network in order for everything to function properly. Each router would have to have a map showing every network and the routers that were to be used for that network. This task can be a lot of work, and is also highly prone to human error.

Several application protocols can be used to build maps of the network and distribute them to all of your systems without human intervention. The most popular of these for private networks is the Routing Information Protocol (RIP), which uses UDP broadcasts to distribute routing tables every thirty seconds. Another popular protocol is Open Shortest Path First (OSPF), which provides the same basic functionality as RIP but with more detail and less overhead. For external networks, neither of these protocols work well enough to support a significant number of networks, and other protocols (such as the Border Gateway Protocol) are more common for those environments.

In common practice, most network administrators run these dynamic routing protocols only on their routers (but not on their hosts) since they tend to consume a lot of CPU cycles, memory, and network bandwidth. They then define 밺efault?routes at the hosts, pointing them to the router(s) that serve the local network that the host is attached to. By using this model, clients need to keep only one entry in their routing tables, while the dedicated routers worry about keeping track of the overall network topology.

Table 2-4 shows what this might look like from the perspective of our example router. Notice that it has routing entries only for the locally attached networks, and that it now knows to send any other datagrams to the default router at 192.168.100.100. That router would then forward all of the datagrams that it gets to its default router as well.

Table 2-4. A Simplified Routing Table for 192.168.10.1
Destination Network	Interface/Router
127.0.0.0 (loopback network)	127.0.0.1 (loopback interface)
192.168.10.0 (local Ethernet network)	192.168.10.1 (local Ethernet interface)
192.168.100.0 (local serial network)	192.168.100.1 (local serial interface)
0.0.0.0 (default route)	192.168.100.100 (next-hope router)

Default routes can be built manually (using the tools provided with the IP software in use on the local system), or can be assigned during system boot (using a protocol such as BOOTP or DHCP). In addition, a protocol called Router Discovery can provide network devices with default route information dynamically, updating the devices' routing tables as the network topology changes.

The examples shown earlier illustrate that managing routing tables can be complex, even with relatively small networks. Unfortunately, the Internet consists of several hundred thousand such networks. If all of the routers connecting these networks together had to be tracked by all of the other routers, there would be so much router-management traffic that nothing else could get through. The Internet would collapse under its own weight.

Route aggregation

New address assignment schemes are being deployed that allow routes to be aggregated together. Now, when you request a block of Internet addresses from your Internet Service Provider, the ISP must assign one from a larger block that has already been assigned to them. This allows routing to happen at a much higher level. Rather than ISPs having to track and advertise thousands of network routes, they only have to advertise a few super-routes.

The ISP will still have to track all of the networks that are under it, but it won't have to advertise them to other ISPs. This feature cuts down on the amount of backbone router-update traffic immensely, without losing any functionality.

Geography-based aggregation schemes are also being deployed. For example, any network that begins with 194 is somewhere in Europe. This simple assignment allows major routers on the Internet to simply forward traffic for any network that begins with 194 to the backbone routers in Europe. Those routers will then forward the datagrams to the appropriate regional ISP, who will then relay the datagrams on to their final destination.

This process is conceptually similar to the way that area codes and prefixes help the phone company route a call. Telephone switches can route a long-distance call simply by examining the area code. The main switches in the remote area code will then examine the telephone number's three-digit prefix, and route the call to the appropriate central office. By the time you finish dialing the last four digits of the phone number, the call is practically already established.

By using aggregated routing techniques, IP datagrams can be moved around the Internet in much the same manner. Aggregation allows routers to use much smaller tables (around 50,000 routes instead of two million routes), which keeps CPU and memory requirements as low as possible, which, in turn, allows performance to be higher than it otherwise would be if every router had to keep track of every network's router path.

For more information about hierarchical routing, refer to 밅lassless Inter-Domain Routing (CIDR)?in Appendix B, IP Addressing Fundamentals.

Datagram Independence

In the preceding section, we used an analogy of a telephone number to illustrate how routers are able to route datagrams to their final destination quickly, based on the destination IP address. However, we should also point out that IP packets are not at all like telephone calls.

Telephone networks use the concept of 밹ircuits?to establish a point-to-point connection between two users. When two people establish a telephone call, a dedicated point-to-point connection is established and is preserved for the duration of the call. In contrast, IP networks treat every individual IP datagram as a totally unique entity, each of which is free to travel across whatever route is most suitable at that moment.

For example, if a user were to retrieve a document from a remote web server, the server would probably need to generate several IP datagrams in order to return the requested material. Each of these datagrams is considered to be a unique and separate entity, totally unrelated to the datagrams sent before or after.

Each of these datagrams may take whatever path is deemed most appropriate by the routers that are forwarding them along. Whereas the first datagram sent from the web server to the requesting client may travel across an underground fiber-optic cable, the second datagram may be sent across a satellite link, while a third may travel over a conventional network. This concept is illustrated in Figure 2-6.


		Figure 2-6. Every IP datagram is an individual entity and may take a different route

These routing decisions are made by the routers in between the source and destination systems. As the network changes, the routers that are moving datagrams around will have to adapt to the changing environment. Many things can cause the network to change: network cables can be ripped up, or downstream routers can become too busy to service a request, or any number of other events can happen to cause a route to become unavailable.

A result of this independence is that datagrams may arrive at their destination out of sequence, since one of them may have gone over a fast network, while another may have been sent over a slow network. In addition, sometimes datagrams get duplicated, causing multiple copies of the same packet to arrive at the destination system.

This architecture is purposefully designed into IP: one of the original design goals for the Internet Protocol was for it to be able to survive large-scale network out-ages in case of severe damage caused during war-time. By allowing each datagram to travel along the most-available path, every datagram's chances of survival increases dramatically. IP does not care if some of them happen to arrive out of sequence, get lost in transit, or even arrive multiple times; its job is to move the datagram, not to keep track of it. Higher-level protocols deal with any problems that result from these events.

Furthermore, by treating every datagram as an individual entity, the network itself is relieved of the responsibility of having to track every connection. This means that the devices on the network can focus on moving datagrams along, and do not have to watch for the beginning and end of every web browser's session. This feature allows overall performance to be as high as the hardware will allow, with as little memory and CPU requirements as possible.

Housekeeping and Maintenance

Every system that receives a packet—whether the system is the final destination or a router along the delivery path—will inspect it. If the packet has become corrupt or has experienced some other form of temporary failure, then the packet will be destroyed right then and there. Whenever one of these transient errors occurs, the datagram is destroyed rather than being forwarded on.

However, if a problem occurs that is semi-permanent—for example, if the current device does not have a routing table entry for the destination network, or if the packet does not meet certain criteria for forwarding across the next-hop network—then IP may call upon the Internet Control Message Protocol (ICMP) to return an error message back to the original sender, informing them of the failure. Although the datagram will still be destroyed by the last-hop device, it will also inform the sender of the problem, thereby allowing it to correct whatever condition was causing the failure to occur.

This distinction between transient and semi-permanent failures is important. Transient errors are caused by no fault of the sender (such as can happen when the Time-to-Live timer expires, or a checksum is miscalculated), while semi-permanent failures are problems with the packet or network that will always prevent delivery from occurring over this path. In the latter case, it is best either to inform the sender of the problem so that it can take whatever corrective actions are required, or to notify the application that tried to send the data of the problem.

Chapter 5, The Internet Control Message Protocol, discusses the error messages that are generated by ICMP whenever a semi-permanent problem is encountered. However, the remainder of this section also discusses some of the transient problems that may occur with IP delivery in particular.

Header checksums

Part of this integrity-checking service is handled through the use of a checksum applied against the IP datagram's header (but not against the data inside of the IP datagram). Every device that receives an IP datagram must examine the IP header and compare that information with the value stored in the header's checksum field. If the values do not match, then the datagram is assumed to be corrupt and is discarded immediately.

The data portion of the IP datagram is not verified, for three reasons. First of all, a device would have to examine the entire datagram to verify the contents. This process would require additional CPU processing time, which is more often than not going to be a waste of time.

Second, the data portion of an IP datagram always consists of a higher-level datagram, such as those generated by TCP and UDP. Since these protocols provide their own error-checking routines, the recipient system will have to conduct this verification effort anyway. The theory is that datagrams will move faster if routers do not have to verify their contents, a task which will be handled by the destination system anyway.

Finally, some application protocols are capable of working with partially corrupt data. In those cases, IP would actually be performing a disservice if it were to throw away datagrams with invalid checksums, since the application protocol would never get it. Granted, most applications do not work this way, but most applications will also utilize some form of error-correction service to keep this from becoming a problem.

Time-to-Live

Another validation service provided by IP is checking to see if a datagram has outlived its usefulness. This is achieved through a Time-to-Live field provided in the IP datagram's header. When a system generates an IP packet, it stores a value in the Time-to_Live header field. Every system that forwards the packet decreases the value of the Time-to-Live field by one, before sending the datagram on. If the Time-to-Live value reaches zero before the datagram gets to its final destination, then the packet is destroyed.

The purpose of the Time-to-Live field is to keep datagrams that are caught in an undeliverable loop from tying up network resources. Let's assume that a pair of routers both have bad information in their routing table, with each system pointing to the other for final delivery. In this environment, a packet would be sent from one router to the other, which would then return the packet, with this process repeating forever. Meanwhile, more packets may be introduced to this network from external devices, and after a while, the network could become saturated.

But by using a Time-to-Live field, each of these routers would decrement the value by one every time it forwarded a packet. Eventually the Time-to-Live value would reach zero, allowing the datagram to be destroyed. This safeguard prevents routing loops from causing network meltdowns.

The strict definition of the Time-to-Live field states that the value is a measure of time in seconds, or any forwarding act that took less than one second to perform. However, there are very few Internet routers that require a full second to perform forwarding, so this definition is somewhat misrepresentative. In actual practice, the Time-to-Live value is decremented for every hop, regardless of the actual time required to forward a datagram from one network segment to another.

It is also important to note that an ICMP failure-notification message gets sent back to the original sender when the Time-to-Live value reaches zero. For more information on this error message, refer to 밫ime Exceeded?in Chapter 5.

The default value for the Time-to-Live field should be set to 64 according to the Assigned Numbers registry (http://www.iana.org/). In addition, some of the higher-layer protocols also have default Time-to-Live values that they are supposed to use (such as 64 for TCP, and 1 for IGMP). These values are really only suggestions, however, and different implementations use different values, with some systems setting the Time-to-Live on all outgoing IP datagrams as high as 255.

Fragmentation and Reassembly.

Every network has certain characteristics that are specific to the medium in use on that network. One of the most important characteristics is the maximum amount of data that a network can carry in a single frame (called the Maximum Transmission Unit, or 밠TU?. For example, Ethernet can pass only 1500 bytes in a single frame, while the typical MTU for 16-megabit Token Ring is 17,914 bytes per frame.

RFC 791 specifies that the maximum allowed MTU size is 65,535 bytes, and that the minimum allowed MTU size is 68 bytes. No network should advertise or attempt to use a value that is greater or lesser than either of those values. Several RFCs define the specific default MTU values that are to be used with different networking topologies. Table 2-5 lists the common MTU sizes for the most-common media types, and also lists the RFCs (or other sources) that define the default MTU sizes for those topologies.

Table 2-5. Common MTU Sizes and the Related RFCs
Topology	MTU (in bytes)	Defined By
Hyperchannel	535	RFC 1374
16 MB/s Token Ring	17,914	IBM
802.4 Token Bus	8,166	RFC 1042

Table 2-5. Common MTU Sizes and the Related RFCs (continued)
Topology	MTU (in bytes)	Defined By
4 MBs Token Ring	4,464	RFC 1042
FDDI	4,352	RFC 1390
DIX Ethernet	1,500	RFC 894
Point-to-Point Protocol (PPP)	1,500	RFC 1548
802.3 Ethernet	1,492	RFC 1042
Serial-Line IP (SLIP)	1,006	RFC 1055
X.25 & ISDN	576	RFC 1356
ARCnet	508	RFC 1051

Since an IP datagram can be forwarded across any route available, every IP packet that gets generated by a forwarding device has to fit the packet within the available MTU space of the underlying medium used on the transient network. If you're on an Ethernet network, then IP packets have to be 1500 bytes or smaller in order for them to be carried across that network as discrete entities, regardless of the size of the original datagram.

There are really two concepts at work here: the size of the original IP datagram and the size of the packets that are used to relay the datagram from the source to the destination. If the datagram is too large for the sending system's local MTU, then that system has to fragment the datagram into multiple packets for local delivery to occur. In addition, if any of those IP packets are too large to cross another network segment somewhere between the sender and final recipient, then the packets must be fragmented by that router as well, allowing them to be sent across that network.

On an isolated network, size rarely matters since all of the systems on that network will share the same maximum frame size (a server and a client can both use at most 1500-byte datagrams, if both of them are on the same Ethernet segment). However, once you begin to mix different network media together, size becomes very important.

For example, suppose that a web server were on a Token Ring network that used 4,464-byte packets, while the end users were on a separate Ethernet segment that used 1500-byte packets. The TCP/IP software on the server would generate IP datagrams (and packets) that were 4,464 bytes long (according to the MTU characteristics of the local network), but in order for the IP datagrams to get to the client, the router in between these two segments would have to fragment the large packets into smaller packets that were small enough to move over the Ethernet network, as illustrated in Figure 2-7.


		Figure 2-7. One 4,464-byte packet being split into four 1500-byte packets

During the act of fragmentation, the router will do several things. First of all, it will examine the size of the data that is stored in the original packet, and then it will create as many fragments as are needed to move the original packet's data across the smaller segment. In the example shown in Figure 2-7, a single 4,464-byte IP packet would require four IP packets in order to travel across the 1500-byte Ethernet (the mathematics behind this process will be explained in a moment).

In this example, the destination host may not be able to reassemble the original datagram, since the datagram is larger than the MTU of the local Ethernet connection. RFC 1122 states that hosts must be able to reassemble datagrams of at least 576 bytes, and should be able to reassemble datagrams that are 밽reater than or equal to the MTU of the connected network(s).?In this case, the local MTU is 1500 bytes, although the original datagram was four kilobytes, so it is possible that the destination system would be unable to reassemble the original datagram. Although most systems do not have problems with this, it should not come as a surprise if a wireless hand-held device cannot reassemble 65 KB datagrams sent from high-speed servers.

When the original 4,464-byte packet was fragmented, the headers of each of the new 1500-byte IP packets would be given whatever information was found in the original packet's header, including the source and the destination IP addresses, the Time-to-Live value, the Type-of-Service flags, and so on.

With regards to fragmentation in particular, the most important of these fields is the Fragmentation Identifier field, which is used to mark each of the fragments as belonging to the same original IP datagram. The Fragmentation Identifier field is really more of a Datagram Identifier, and is a 16-bit 뱒erial number?that gets generated by the sending system whenever a datagram gets created. Whenever a packet gets fragmented, all of the resulting fragments use the original datagram's Fragmentation Identifier, and the destination system uses this information to collect all of the fragments together, and then reassemble the original datagram into its original form.

In addition, two fields within each of the fragments' IP headers will also be set, to reflect the fact that fragmentation has occurred. The fields that get set are the Fragmentation Offset and a Fragment Flags field (the latter is used to provide ordering and reassembly clues to the destination system).

Fragmentation Offset
This field is used to indicate the byte-range of the original datagram that a specific fragment provides. However, only the starting position of the byte-range is provided in this field (the remainder of the packet is assumed to contain the rest of that fragment). This starting position is stored in terms of eight-byte (64-bit) blocks of data. The Fragmentation Offset identifier allows the receiving system to re-order the fragments into their proper sequence once all of the fragments have arrived.

Fragment Flags
This field provides clues as to the current fragmentation status (if any). There are three one-bit flags, although only the last two are currently used. The first bit is reserved for future use and must always be set to 0. The second bit indicates whether or not fragmentation is allowed (0 means fragmentation is allowed and 1 means do not fragment). The third and final bit is used to indicate whether a current fragment is the last (0), or if more fragments will follow this one (1).

In addition to these changes, the Total Packet Length field for each of the newly minted IP packets also gets set according to the size of the fragments (rather than the size of the original datagram).

The resulting IP packets are then sent over the Internet as independent entities, just as if they had originally been created that way. Fragments are not reassembled until they reach the destination system. Once they reach the final destination, however, they are reassembled by the IP software running on the destination system, where they are combined back into their original datagram form. Once the original datagram has been reassembled, the IP datagram's data is forwarded to the appropriate transport protocol for subsequent processing.

There are a few rules that you must remember when trying to understand how IP fragments get created:

?Fragmentation only occurs on the data portion of a packet.

?Packet headers are not included in the fragmentation process. If the original datagram is 4,464 bytes long, then at least 20 bytes of that datagram are being used to store header information, meaning that the data portion is 4,444 bytes long. This 4,444 bytes is what will get fragmented.

?Each new fragment results in a new packet that requires its own IP headers, which consume at least 20 bytes in each new packet generated for a fragment. The IP software must take this factor into consideration when it determines the maximum amount of payload data that can be accommodated in each fragment, and thus the number of fragments that will be required for a particular MTU.

?Fragmentation must occur on an eight-byte boundary. If a datagram contains 256 bytes of data, but only 250 bytes can fit into a fragment, then the first fragment contains only 248 bytes of data (248 is the largest number divisible by eight that's less than 250). The remaining 8 bytes (256 - 248 = 8) will be sent in the next fragment.

?The Fragmentation Offset field is used to indicate which parts of the original datagram are in each fragment, by storing the byte count in quantities of eight-byte blocks. Rather than indicating that the starting position for a fragment's data is ?48 bytes,?the Fragmentation Offset field will show ?1 blocks?(248/8 =31). Also, note that the block count starts with 0 and not 1. This means that the 32nd block will be numbered 31 instead of 32.

As shown in Figure 2-7, in order for the original 4,464-byte IP datagram to be sent across the Ethernet network segment, four IP fragments will have to be created. Each of the new packets will contain an IP header (copied from the original datagram's header), plus however much data they could carry (although the quantity has to be divisible by eight). The result is four unique fragments, as shown in Figure 2-8.

The relevant fields from the original IP packet are shown in Table 2-6.

Table 2-6. Headers from the Original 4,464-byte Packet
Fragment	Fragment Identifier	Reserved Flag	May Fragment Flag	More Fragment Flags	Fragment Offset	Packet Length
1	321	0	0	0	0	4,464


		Figure 2-8. The mathematics of datagram fragmentation

After converting the single 4,464-byte IP packet into four 1500-byte IP fragments, the headers of each fragment will appear as shown in Table 2-7.

Table 2-7. Headers from Four 1500-byte Fragments
Fragment	Fragment Identifier	Reserved Flag	May Fragment Flag	More Fragment Flags	Fragment Offset	Packet Length
1	321	0	0	1	0	1,500
2	321	0	0	1	185	1,500
3	321	0	0	1	370	1,500
4	321	0	0	0	555	24

Each of the fragments contains the following header information:

?Each fragment belongs to the same original datagram, so each of them share the same 뱒erial number?in the Fragmentation Identifier field (321 in this case).

?The first bit in the 3-bit Flags field is reserved, and must be marked 0.

?Each packet may be fragmented further, so the 밠ay Fragment?flags are marked 0.

?The 밠ore Fragments?flag is used to indicate if more fragments are following after this fragment. Since the first three fragments all have another fragment coming behind them, they all have the More Fragments flag marked 1, while the last fragment identifies the end of the set by having a 0 in this field.

?Since the first fragment marks the beginning of the original data, the Fragment Offset field starts at 0. Since the first fragment held 1,480 bytes of data, the second fragment would have its Fragmentation Offset field set to 185 (1480 / 8 = 185). The second fragment was also able to store 1,480 bytes, so the Fragment Offset flag for the third packet will be set to 370 ((1480 × 2) / 8 = 370). The third fragment was also able to hold 1,480 bytes, so the fourth fragment's Fragment Offset flag will be set to 555 ((1480 × 3) / 8 = 555).

?In addition, each new IP packet created during the fragmentation process will also have its Total Packet Length field set to the size of the resulting IP packets, rather than set to the size of the original IP datagram.

In order for the destination system to reassemble the datagram, it must read the fragmentation-specific headers in each of the fragments as they arrive and order them into their correct sequence (as indicated by the Fragment Offset field). Since each fragment may arrive out of sequence (due to a slower link, a down segment, or whatever), the destination system has to store each fragment in memory until all of them have arrived before they can be rearranged and the data processed.

Once all of the segments have been received, the system will examine their headers and find the fragment whose Fragment Offset is 0. The IP software will then read the data portion of the IP packet containing that fragment, recording the number of eight-byte blocks that it finds. Then it will locate the fragment that shows the Fragment Offset needed to continue reading the data, and then read that fragment's data into memory. This process will continue until all of the data has been read from all of the packets. Once a packet has been read that has the 밠ore Fragments?flag set to 0—and if each of the Fragment Offset fields matches up without leaving any holes in the final datagram—then the process is complete.

If all of the fragments do not arrive within the predefined time (normally 60 seconds on most Unix-like systems), then all of the fragments will be destroyed, and an error message will be sent to the original sender, using the ICMP 밫ime Exceeded?error message. For more information on this error message, refer to 밫ime Exceeded?in Chapter 5.

This process can get fairly tricky, and it may seem like an awful lot of overhead. However, there are many benefits offered by fragmentation. First and foremost, fragmentation allows IP to use whatever packet sizes are required by the underlying medium. Furthermore, any traffic that is local to your own network probably won't require fragmentation, so you can use large packets on your local network. If IP were forced to use a lowest-common-denominator approach of very small packets for all data, then local performance would always be miserable. But by using a flexible MTU size, the local network can run at full speed, with fragmentation only occurring whenever large datagrams must leave the local network.

RFC 791 states that all systems must be able to send an IP datagram of at least 576 bytes. Indeed, many of the early IP routers required that IP datagrams be cut into 576-byte fragments if they were to be forwarded over a different media (regardless of that media's MTU capacity).

In addition, there are some techniques that can be used by a sending system to determine the most efficient segment size when sending data to a remote network, thereby preventing fragmentation from occurring. TCP connections use a 밠aximum Segment Size?header option that can be used to determine the MTU of the remote network, and most IP systems implement a technology called 밣ath MTU Discovery?that allows them to detect the largest available MTU on the end-to-end connection. For more information on the Maximum Segment Size option, refer to 밠aximum Segment Size?in Chapter 7, The Transmission Control Protocol. For more information on Path MTU Discovery, refer to 밡otes on Path MTU Discovery?in Chapter 5.

Prioritization and Service-Based Routing

One of the key differences between IP and other networking protocols is that IP offers direct support for prioritization, allowing network hosts and routers to send important packets before less important packets. This feature is particularly crucial with applications that are sensitive to high levels of delay resulting from network congestion.

For example, assume that an organization has two high-speed networks that are interconnected by a relatively slow wide area network (WAN), and that a lot of data has to cross the WAN frequently. In this example, the routers could forward data across the WAN only at whatever rate was allowed by the WAN itself. If the WAN were fixed at a maximum throughput of 256 KB/s, then the routers on the WAN could only send 262,144 bits across the WAN in a single second. This may be plenty of bandwidth for a few terminal emulation sessions—or even for a couple of simultaneous database updates—but it would not be enough for several simultaneous streaming video feeds in conjunction with those other applications.

The problem is that the routers just wouldn't be able to forward enough data across the WAN for all of the applications to work smoothly. The routers would have to start dropping packets once their buffers began filling up or as the queuing delays exceeded the maximum Time-to-Live values on some of the packets. UDP-based applications may not care much about these dropped packets, but TCP-based applications care very much about lost packets. They would attempt to resend any data that had not yet been acknowledged, and if congestion was sustained for a long period of time, then those applications would eventually just timeout.

This may not matter with some applications, but it would be a very big deal with some others, particularly those that are crucial to the operation of the business itself. For example, if users were unable to enter sales orders into a remote database, the problem would be somewhat greater than if they were unable to access a recreational video.

In order to ensure that congestion doesn't break the mission-critical applications on your network, IP supports two key concepts: prioritization and type-of-service handling. Every IP datagram has an 8-bit field (called the 밫OS byte? that consists of a three-bit precedence field used for prioritization and a four-bit field that indicates specific handling characters desired for a datagram (the last bit is currently unused).

By using three bits for precedence, IP has eight levels of prioritization (0 through 7), which provide eight distinct priority levels to all IP traffic. Table 2-8 lists the values of the Precedence field and their meaning as defined in RFC 791, with the highest priority level being 7 and the lowest being 0.

Table 2-8. The Precedence Flags and Their Meaning.
Precedence	Definition
0	Routine (normal)
1	Priority
2	Immediate
3	Flash
4	Flash Override
5	Critical
6	Internetwork Control
7	Network Control

Using these priority values, you could assign database applications a higher priority level than the streaming video traffic. The routers would then sift through data that was waiting in the queue, sending the higher priority traffic before sending the lower priority traffic. In this model, the database traffic would be sent out first,

while the streaming video traffic would be forced to wait until bandwidth was available. Your mission-critical applications would continue to function smoothly, while the less-critical applications would take a back seat, possibly suffering dramatic performance losses.

The remaining four bits of the TOS byte provide administrators with the ability to implement per-datagram routing based on the characteristics of the datagram's data. Thus, an IP datagram that contains Usenet news traffic can be marked as desiring a 뱇ow-cost?service, while Telnet traffic can be marked as desiring a 뱇ow-latency?service.

Originally, there were only three types of service defined in RFC 791. These services were identified with unique bits that were either on or off, depending on whether or not the specific type of service was desired. However, this interpretation was modified by RFC 1349, which added a fourth service class, and which also stated that the bits were to be interpreted as numeric values rather than independent flags. By making them numeric, the four bits provided for a maximum of sixteen possible values (0 through 15), rather than four distinct options (although the values cannot be combined and must be used independently).

There are a number of predefined Type-of-Service values that are registered with the Internet Assigned Numbers Authority (IANA). Some of the more common registered values are shown in Table 2-9.

For a detailed listing of all of the Type-of-Service values that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/ip-parameters).

Table 2-9. Type-of-Service Values and Their Meaning
Value	Service	Description
0	Normal	When all of the Type-of-Service Flags are off, the IP datagram is to be treated as a normal datagram, and is not to be given any special handling. Almost all IP datagrams are marked with all zeroes in the Type-of-Service field.
1	Minimize Delay	The Delay Flag is used to request that IP route this packet over a network that provides lower latency than normal. This may be useful for an application such as Telnet, where the user would want to see their keystrokes echoed back to them quickly. The Delay flag may be set to either 0 (normal) or 1 (low delay).
2	Maximize Through-put	The Throughput flag is used to request that IP route this packet over a network that provides higher throughput than normal. This may be useful for an application such as FTP, where the user would want to download a lot of data very quickly. The Throughput flag may be set to 0 (normal) or 1 (high throughput).

Table 2-9. Type-of-Service Values and Their Meaning (continued)
Value	Service	Description
4	Maximize Reliability	The Reliability flag is used to request that IP route this packet over a network that provides the most reliable service (perhaps as indicated by overall up-time, or by the number of secondary routes). This may be useful for an application such as NFS, where the user would want to be able to open a database on a remote server without worrying about a network failure. The Reliability flag may be set to 0 (normal) or 1 (high reliability).
8	Minimize Cost	The Cost flag was added by RFC 1349 and was not defined in RFC 791. For this reason, many systems do not recognize or use it. The Cost flag is used to request that IP route this packet over the least expensive route available. This may be useful for an application such as NNTP news, where the user would not need data very quickly. The Cost flag may be set to 0 (normal) or 1 (low cost).
15	Maximize Security	RFC 1455—an experimental specification for data-link layer security—states that his flag is used to request that IP route this packet over the most secure path possible. This may be useful with applications that exchange sensitive data over the open Internet. Since RFC 1455 is experimental, most vendors do not support this setting.

In addition, the IANA's online registry also defines a variety of default Type-of-Service values that specific types of applications should use. Some of the more common application protocols and their suggested Type-of-Service values are shown in Table 2-10. For a detailed listing of all of the suggested default Type-of-Service values, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/ip-parameters).

Table 2-10. Suggested Type-of-Service Values for Common Application Protocols
Application Protocol	Suggested TOS Value
Telnet	8
FTP Control Channel	8
FTP Data Channel	4
Trivial FTP	8
SMTP Commands	8
SMTP Data	4
DNS UDP Query	8
DNS TCP Query	0
DNS Zone Transfer	4
NNTP	1
ICMP Error Messages	0
SNMP	2

It is important to note that not all of the TCP/IP products on the market today use these values. Indeed, many implementations do not even offer any mechanisms for setting these values, and will not treat packets that are flagged with these values any differently than packets that are marked for 뱊ormal?delivery. However, most of the Unix variants on the market today (including Linux, BSD, and Digital Unix) do support these values, and set the appropriate suggested default values for each of the major applications.

Administrators that have complex networks with multiple routing paths can use these type of service flags in conjunction with TOS-aware routers to provide deterministic routing services across their network. For example, an administrator might wish to send low-latency datagrams through a terrestial fiber-optic connection rather than through a satellite link. Conversely, an administrator might wish to send a low-cost datagram through a slower (but fixed-cost) connection, rather than take up bandwidth on a satellite connection.

By combining the type of service flags with the prioritization bits, it is possible to dictate very explicit types of behavior with certain types of data. For example, you could define network filters that mark all Lotus Notes packets as medium priority and tag them with the low-latency TOS flag. This would not only provide your Notes users with preferential service over less-critical traffic, but it would also cause that traffic to be routed over faster network segments. Conversely, you could also define another set of filters that marked all streaming video traffic as lower priority and also enable the high-bandwidth TOS flag, forcing that traffic to use a more appropriate route.

As long as you own the end-to-end connection between the source and destination systems, you can pretty much do whatever you want with these flags, and you should be able to queue and route those datagrams according to the flags that you set. Keep in mind, however, that most ISPs will not treat these datagrams any different than unmarked datagrams (otherwise, you'd mark all of your packets with the high-priority and minimize-latency flags). Indeed, if you need a certain type of service from an ISP, then you will mostly likely end up paying for a dedicated link between your site and the destination network, since you will not be able to have your datagrams prioritized over other customer's packets across the ISP's backbone.

The IP Header

IP datagrams consist of two basic components: an IP header that dictates how the datagram is treated and a body part that contains whatever data is being passed between the source and destination systems.

An IP datagram is made up of at least thirteen fields, with twelve fields being used for the IP header, and one field being used for data. In addition, there are also a variety of supplemental fields that may show up as 뱋ptions?in the header. The total size of the datagram will vary according to the size of the data and the options in use.

Table 2-11 lists all of the mandatory fields in an IP header, along with their size (in bits) and some usage notes. For more detailed descriptions of these fields, refer to the individual sections throughout this chapter.

Table 2-11. The Fields in an IP Datagram
Field	Bits	Usage Notes
Version	4	Identifies the version of IP used to create the datagram. Every device that touches this datagram must support the version shown in this field. Most TCP/IP products use IP v4. NOTE: This book only covers IP v4.
Header Length	4	Specifies the length of the IP header in 32-bit multiples. Since almost all IP headers are 20 bytes long, the value of this field is almost always 5 (5 × 32 = 160 bits, or 20 bytes).
Type-of-Service Flags	8	Provide a prioritization service to applications, hosts, and routers on the Internet. By setting the appropriate flags in this field, an application could request that the datagram be given higher priority than others waiting to be processed.
Total Packet Length	16	Specifies the length of the entire IP packet, including both the header and the body parts, in bytes.
Fragment Identifier	16	Identifies a datagram, useful combining fragments back together when fragmentation has occurred.
Fragmentation Flags	3	Identifies certain aspects of any fragmentation that may have occurred, and also provides fragmentation control service, such as instructing a router not to fragment a packet.
Fragmentation Offset	13	Indicates the byte-range of the original IP datagram that this fragment provides, as measured in eight-byte offsets.
Time-to-Live	8	Specifies the remaining number of hops a datagram can take before it must be considered undeliverable and be destroyed.
Protocol Identifier	8	Identifies the higher-layer protocol stored within the IP datagram's body.
Header Checksum	16	Used to store a checksum of the IP header.
Source IP Address	32	Used to store the 32-bit IP address of the host that originally sent this datagram.
Destination IP Address	32	Used store the 32-bit IP address of the final destination for this datagram.

Table 2-11. The Fields in an IP Datagram (continued)
Field	Bits	Usage Notes
Options (optional)	varies	Just as IP provides some prioritization services with the Type-of-Service flags, additional special-handling options can also be defined using the Options field. Special-handling options include Source Routing, Timestamp, and others. These options are rarely used, and are the only thing that can cause an IP header to exceed 20 bytes in length.
Padding (if required)	varies	An IP datagram's header must be a multiple of 32 bits long. If any options have been introduced to the header, the header must be padded so that it is divisible by 32 bits
Data	varies	The data portion of the IP packet. Normally, this would contain a complete TCP or UDP message, although it could also be a fragment of another IP datagram.

As can be seen, the minimum size of an IP header is 20 bytes. If any options are defined, then the header's size will increase (up to a maximum of 60 bytes). RFC 791 states that a header must be divisible by 32 bits, so if an option has been defined, but it only uses eight bits, then another 24 zero-bits must be added to the header using the Padding field, thereby making the header divisible by 32.

Figure 2-9 shows an IP packet containing an ICMP Echo Request Query Message, sent from Ferret to Bacteria. It does not show any advanced features whatsoever.

The following sections discuss the individual fields in detail.

Version

Identifies the version of IP that was used to create the datagram. Most TCP/IP products currently use IP v4, although IP v6 is gaining acceptance. NOTE: This book only covers IP v4.

Size
Four bits.

Notes
Since the datagram may be sent over a variety of different devices on the way to its final destination, all of the intermediary systems (as well as the destination) must support the same version of IP as the one used to create the datagram in the first place. As features are added, removed or modified from IP, the datagram header structures will change. By using the Version field, these changes can be made without having to worry about how the different systems in use will react. Without the Version field, there would be no way to identify changes to the basic protocol structure, which would result in a frozen specification that could never be changed.


		Figure 2-9. A simple IP packet

Almost all TCP/IP products currently use IP v4, which is the latest 뱒tandard?version. However, a new version, IP v6, is rapidly gaining supporters and acceptance in the Internet community. It should also be pointed out that IP v4 is the first 뱑eal?version of IP, since prior versions were only drafts that were not widely deployed. NOTE: This book only covers IP v4.

Capture Sample
In the capture shown in Figure 2-10, the Version field is set to 4, indicating that this packet contains an IP v4 datagram.


		Figure 2-10. The Version field

Header Length

Specifies the size of the IP header, in 32-bit multiples.

Size
Four bits.

Notes
The primary purpose of this field is to inform a system where the data portion of the IP packet starts. Due to space constraints, the value of this field uses 32-bit multiples. Thus, 20 bytes is the same as 160 bits, which would be shown here as 5 (5 × 32 = 160). Since each of the header's mandatory fields are fixed in size, the smallest this value can be is 5.

If all of the bits in this field were 뱋n,?the maximum value would be 15. Thus, an IP header can be no larger than 60 bytes (15 × 32 bits = 480 bits = 60 bytes).

Capture Sample
In the capture shown in Figure 2-11, the Header Length field is set to 5, indicating that this packet has 20-byte header (20 bytes / 32 bits = 5), which is the default size when no options are defined.


		Figure 2-11. The Header Length field

'UP! > Web Service' 카테고리의 다른 글

Multicasting and the Internet Group Management Protocol (0)	2008.08.21
The Address Resolution Protocol (0)	2008.08.21
An Introduction to TCP/IP (0)	2008.08.21
Internet Core Protocols (0)	2008.08.21
[토론4] 웹서비스와 시맨틱 웹 (0)	2008.08.21

Posted by 으랏차

An Introduction to TCP/IP

UP!/Web Service 2008. 8. 21. 14:15

An Introduction to TCP/IP

If you've been using TCP/IP-based networking products for any length of time at all, you're probably already aware of how IP addressing, routing, and other fundamental aspects of the Internet family of protocols work, at least from a user's perspective.

What you probably don't know—unless you've been formally trained in these subjects—is what makes TCP/IP work from the wire's perspective, or from the perspective of the applications in use on your network. This chapter provides you with an introduction to these viewpoints, providing you with a better understanding of the nature of the traffic on your network.

A Brief History of the Internet

Before you can understand how TCP/IP works—or why it works the way it does—you first have to understand the origins of the networking protocols and the history of the Internet. These subjects provide a foundation for understanding the basic design principles behind TCP/IP, which in turn dictate how it is used today.

TCP/IP presented a radical departure from the traditional computer networking services in use during its development. In the early days of commercial computing (the late 1960s), most companies bought a single large system for all of their data processing needs. These systems used proprietary networking architectures and protocols, which primarily consisted of plugging dumb terminals or line printers into an intelligent communications controller, each of which used proprietary networking protocols to communicate with the central hosts.

Most of the early computer networks used this hierarchical design for their proprietary network protocols and services. As users' computing requirements expanded,

they rarely bought a different system from a different vendor, but instead added new components to their existing platforms or replaced the existing system with a newer, larger model. Cross-platform connectivity was essentially unheard of, and was not expected. To this day you still can't plug an IBM terminal into DEC system and expect it to work. The protocols in use by those devices are completely different from each other.

As the use of computers became more critical to national defense, it became clear to the U.S. military in particular that major research centers and institutions needed to be able to share their computing resources cooperatively, allowing research projects and supercomputers to be shared across organizational boundaries. Yet, since each site had different systems (and therefore different networking technologies) that were incompatible with the others, it was not possible for users at one site to use another organization's computing services easily. Nor could programs easily be ported to run on these different systems, as each of them had different languages, hardware, and network devices.

In an effort to increase the sharing of resources, the Advanced Research Projects Agency (ARPA) of the Department of Defense (DOD) began coordinating the development of a vendor-independent network to tie the major research sites together. The need for a vendor-independent network was the first priority, since each facility used different computers with proprietary networking technology. In 1968, work began on a private packet-switched network, which eventually became known as ARPAnet.

ARPAnet was the world's first wide-area packet-switching network, designed to allow individual units of data to be routed across the country as independent entities. Previous networks had been circuit-switched, involving dedicated end-to-end connections between two specific sites. In contrast, the ARPAnet allowed organizations to interconnect into a mesh-like topology, allowing data to be sent from one site to another using a variety of different routes. This design was chosen for its resilience and built-in fault-tolerance: if any one organization were bombed or otherwise removed from the network, it wouldn't affect the rest of the organizations on the network.

During this same time period, other network providers also began interconnecting with the ARPAnet sites, and when these various networks began connecting to each other, the term 밒nternet?came into use. Over the next few years, more organizations were added to the ARPAnet, while other networks were also being developed, and new network technologies such as Ethernet were beginning to gain popularity as well.

All of this led to the conclusion that networking should be handled at a higher layer than was allowed by the ARPAnet's packet-switching topology. It became increasingly important to allow for the exchange of data across different physical networks, and this meant moving to a set of networking protocols that could be implemented in software on top of any physical topology, whether that be a packet-switched WAN such as ARPAnet or a local area network (LAN) topology such as Ethernet.

TCP/IP to the Rescue

In 1973, work began on the TCP/IP protocol suite, a software-based set of networking protocols that allowed any system to connect to any other system, using any network topology. By 1978, IP version 4 (the same version that we use today) had been completed, although it would be another four years before the transition away from ARPAnet to IP would begin. Shortly thereafter, the University of California at Berkeley also began bundling TCP/IP with their freely distributed version of Unix, which was a widely used operating system in the research community.

The introduction and wide-scale deployment of TCP/IP represented a major ground-shift in computer networking. Until the introduction of TCP/IP, almost every other network topology required that hardware-based network nodes send traffic to a central host for processing, with the central host delivering the data to the destination node on behalf of the sender. For example, Figure 1-1 shows a host-centric networking architecture. In this model, devices are attached to a centralized system that coordinates all network traffic. A user at a terminal could not even send a screen of text to a printer without first sending the data to the central host, which would parse the data and eventually send it to the printer for printing.


		Figure 1-1. Host-centric networking

But with TCP/IP, each network device was treated as a fully functional, self-aware network end-point, capable of communicating with any other device directly, without having to talk to a central host first. IP networks are almost anarchic, with every device acting as a self-aware, autonomous unit, responsible for its own network services, as illustrated in Figure 1-2.


		Figure 1-2. Node-centric networking

This design allowed for application- and resource-sharing on a national scale, since a top-down model simply would not work with millions of widely distributed devices. In addition, this design also provided reliability in case any part of the network was damaged, since a host-based model would simply stop functioning if the central host was destroyed or disabled.

The Internet Today

Over time, the ARPAnet evolved into an open 뱊etwork-of-networks?using TCP/IP, with educational, commercial, and other organizations connected to each other through an interwoven mesh of networks. Today this type of mesh architecture is far less common, replaced by a much more structured hierarchy.

Rather than organizations connecting to each other directly, most organizations now connect to a local network access provider who routes network traffic upwards and outwards to other end-point networks.

Generally speaking, there are only a handful of top-level Internet Service Providers (ISPs), each of which provide major interconnection services around the country or globe. Most of these firms are telecommunications companies that specialize in large-scale networking (such as long-distance providers like MCI WorldCom and Sprint).

Below these top-level carriers are local or regional access providers who offer regional access and lower-speed connection services to end users directly (these mid-level carriers are sometimes referred to as Internet Access Providers, or 밒APs?. This design is represented in Figure 1-3.


		Figure 1-3. The hierarchical architecture of the Internet

Visually, the Internet can be thought of as a few major networking companies who provide large-scale 밷ackbone?services around the world, followed by a large number of secondary providers that resell bandwidth on those networks. At the end of the line are the end-leaf organizations that actually generate the traffic that crosses these networks.

The Internet, Defined.

Simply having a lot of interconnected networks does not by itself mean that you have the 밒nternet.?To 밿nternet?(with a lowercase 밿? means to interconnect networks. You can create an internet of Macintosh networks using AppleTalk and some routers, for example. The term 밒nternet?(with a capital 밒? refers to the specific global network of TCP/IP-based systems, originally consisting of ARPAnet and the other research networks.

There have been lots of private and public networks that have offered a multi-layer network design (private SNA* networks from the 1980s are a good example of this). Therefore, the Internet in particular is a collection of networks that support host-to-host communications using TCP/IP protocols.

* SNA stands for Systems Network Architecture, a proprietary IBM networking protocol.

Under this definition, the network is made up of intelligent end-point systems that are self-deterministic, allowing each end-point system to communicate with any host it chooses. Rather than being a network where communications are controlled by a central authority (as found in many private networks), the Internet is specifically meant to be a collection of autonomous hosts that can communicate with each other freely.

This is an important distinction, and one that is often overlooked. For example, many of the private networks have offered mail-delivery services for their customers, allowing a user on one network to send email to another user on another network, but only by going through a predefined mail gateway service. Conversely, the Internet allows users to exchange mail directly, without going through a central politburo first. In this regard, the Internet is a collection of self-deterministic, autonomous hosts.

Having hosts communicate with each other directly is not enough to make the Internet, however. Many networks have offered users the ability to communicate directly with other hosts on those networks, and those networks have not been considered as parts of the Internet per se. For example, there have been many private DECnet networks that have offered this capability, and Novell offers a similar service using IPX today.

The last key criteria is that the Internet is a collection of networks that allows host-to-host communications through voluntary adherence to open protocols and procedures defined by Internet standards. Therefore, in order for these networks to be parts of the Internet, they must also use Internet protocols and standards, allowing for vendor-neutral networking.

This is perhaps the most important part of the entire definition, since the use of consistent protocols and services is what allows the Internet to function at all. For example, it is not enough for a private network to allow users to send email messages to each other directly. Rather, those users must use the same protocols and services to exchange email messages, and those protocols must be defined as Internet standards.

TCP/IP's Architecture

A key part of understanding the distributed nature of TCP/IP is the realization that TCP/IP is a modular family of protocols, providing a wide range of highly segmented functions. TCP/IP is not a single monolithic protocol, but instead is a collection of protocols that range from application-specific functions like web browsing down to the low-level networking protocols like IP and TCP.

One common tool used for comparing different kinds of protocols is the OSI* Reference Model, which is a simplistic breakdown of networking functions from the physical wiring up to the applications that run on the network. By comparing TCP/IP to the OSI Reference Model, it is easier to understand how each of the major protocols interact with each other.

An Introduction to the OSI Reference Model

The OSI Reference Model is a conceptual model that uses seven 뱇ayers?to identify the various functions provided by a network, and these seven layers can be used to compare different protocols using a common framework. Each layer within the OSI Reference Model has a very specific function, and each layer depends on the other layers in order for the entire model to function properly. Each layer only communicates with the layers immediately above or below it. If there is a problem at one layer, it is the responsibility of that specific layer to provide feedback to the layers surrounding it.

The OSI Reference Model is extremely useful as a tool for discussing various network services. For example, if we were to look at a simple network service such as printing a document to a locally attached printer, we could use the OSI Reference Model to determine how this simple task was being achieved. We could also use the model to determine how printing over a Novell network was done, or how printing over a TCP/IP network was accomplished. Because all three of these examples use the same model, they can all be compared to each other even though they all use extremely different technologies to achieve the same objective.

Not all networking technologies have seven layers, nor do they all match up to the seven layers in the OSI Reference Model exactly. Most of them do not match it except in small, specific ways, although all of them can be compared to the model with a little bit of thought. This flexibility is what makes it such a popular tool.

The following list briefly describes each of the seven layers and the purpose each serve. Remember that this is a conceptual model, with very little direct meaning to the real world.

The physical layer
The physical layer is concerned with the physical wiring used to connect different systems together on the network. Examples include serial and parallel cables, Ethernet and Token Ring cabling, telephone wiring, and even the specific connectors and jacks used by these cabling systems. Without strictly standardized definitions for the cabling and connectors, vendors might not implement them in such a way that they would function with other implementations, which in turn would make it impossible for any communication

* OSI stands for Open Systems Interconnect, an alternate suite of network protocols.

whatsoever to occur. Each of these wiring systems therefore follows very strict standards, ensuring that network devices will at least be able to communicate without having to worry about issues such as voltage and impedance.

The data-link layer
The data-link layer defines how information is transmitted across the physical layer, and is responsible for making sure that the physical layer is functioning properly. Some networks—such as the public telephone network, radio, and television—use analog sine-waves to transmit information, while most computer networks use square-wave pulses to achieve this objective. If there are any problems with transmitting the information on the physical cabling (perhaps due to a damaged wire or circuit), then this layer must deal with those errors, either attempting to retransmit the information or reporting the failure to the network layer.

The network layer
The network layer is used to identify the addresses of systems on the network, and for the actual transmission of data between the systems. The network layer must be aware of the physical nature of the network, and package the information in such a way that the data-link layer can deliver it to the physical layer. For example, if a telephone line is the physical layer, then the network layer must package the information in such a way that the data-link layer can transmit it over an analog circuit. Likewise, if the physical layer is a digital Ethernet LAN, then the network layer must encapsulate the information into digital signals appropriate for Ethernet, and then pass it to the data-link layer for transmission.

On many networks, the network layer does not provide any integrity checking. It simply provides the packaging and delivery services, assuming that if the data-link layer is not reporting any errors then the network is operational. Broadcast television and radio work in this manner, assuming that if they can transmit a signal, then it can also be received. Many digital networking technologies also take this approach, leaving it up the higher-level protocols to provide delivery tracking and reliability guarantees.

The transport layer
The transport layer provides the reliability services lacking from the network layer, although only for basic transmission services, and not for any application- or service-specific functions. The transport layer is responsible for verifying that the network layer is operating efficiently, and if not, then the transport layer either requests a retransmission or returns an error to the layer above it. Since higher-level services have to go through the transport layer, all transport services are guaranteed when this layer is designed into the network software and used. Not all systems mandate that the transport layer provide reliability,

and many networks provide unreliable transport layers for nonessential services such as broadcast messages.

The session layer
The session layer is responsible for establishing connections between systems, applications, or users. The session layer may receive this request from any higher layer, and then will negotiate a connection using the lower layers. Once a connection is established, the session layer simply provides an interface to the network for the higher layers to communicate with. Once the higher layers are finished, the session layer is responsible for destroying the connection.

The presentation layer
The presentation layer provides a consistent set of interfaces for applications and services to utilize when establishing connections through the session layer. Although these interfaces could also exist at the session layer, that would burden it unnecessarily. It is better to have the session layer only manage sessions and not worry about verifying data or providing other extended services. An example of a service provided by the presentation layer is data-compression, allowing applications to take advantage of the performance gains that compression provides without forcing the applications to develop these services themselves, and without forcing the transport layer to provide this service when it may not always be needed.

The application layer
Finally, the application layer provides the network's interface to end-user application protocols such as HTTP and POP3. This layer should not be confused with the part of the end-user application that displays data to the end user. That function is an entirely separate service, and is outside the scope of the OSI Reference Model.

Although every network must use all seven layers of the OSI Reference Model in some form or another, not every network design provides distinct protocols or services that match all seven layers precisely. TCP/IP is one such networking design, with many layers that do not match up to each of the layers used by the OSI Reference Model.

Comparing TCP/IP to the OSI Reference Model

TCP/IP does not strictly conform to the OSI Reference Model. Some portions of the OSI Reference Model map directly to some of the protocols and services provided by TCP/IP, while many of the layers do not map to each other directly at all. For example, the actual delivery of data over the network is handled at the physical layer, and in this case, the wire is the physical layer. There are no services in TCP/IP that correspond with the physical or data-link layers. Rather, IP passes data to a network adapter's device driver, which provides an interface to the data-link layer in use with the physical layer.

Figure 1-4 shows how TCP/IP matches up with the OSI Reference Model. Notice that TCP/IP does not provide any physical or data-link layer services directly, but instead relies on the local operating system for those services.


		Figure 1-4. TCP/IP in comparison to the OSI Reference Model

The specific layers offered by TCP/IP include:

The Internet Protocol
IP itself works at the network layer of the OSI reference model. It is responsible for tracking the addresses of devices on the network, determining how IP datagrams are to be delivered, and sending IP packets from one host to another across a specific segment. In essence, IP provides a virtual representation of the network that is independent of any of the individual network segments, acting more like a national delivery service than a local courier service.

The Transport Protocols (TCP and UDP)
TCP/IP provides two protocols that work at the transport layer: TCP and UDP. TCP provides a highly monitored and reliable transport service, while UDP provides a simple transport with no error-correcting or flow-control services. It is also interesting to note that TCP and UDP also provide session layer services, managing all of the connections between the different hosts. When an application protocol such as HTTP is used to exchange data between a web client and a web server, the actual session-management for this exchange is handled by TCP.

Presentation Services
TCP/IP does not provide a presentation layer service directly. However, some applications use a character-based presentation service called the Network Virtual Terminal (NVTs are a subset of the Telnet specification), while others might use IBM's NetBIOS or Sun's External Data Representation (XDR) programming libraries for this service. In this regard, TCP/IP has many presentation layer services that it can use, but it does not have a formal service that every application protocol must use.

Application Protocols (HTTP, SMTP, etc.)
TCP/IP provides an assortment of application protocols, providing the end-user applications with access to the data being passed across the transport protocols. These protocols include the Simple Message Transfer Protocol (SMTP), which is used by electronic mail systems to move mail messages around the Internet, and the Hyper-Text Transfer Protocol (HTTP), which is used by web browsers to access data stored on web servers, among many others.

All of these services get called upon whenever an application wants to exchange data with another application across the Internet. For example, a mail client will use the SMTP application protocol whenever a user wants to send a mail message to a remote mail server, and the SMTP protocol uses rules defined by the NVT specification whenever it exchanges data with TCP. In turn, TCP provides error-correction and flow-control services back to SMTP. IP is used to move the TCP segments between the source and destination networks, while hardware-specific protocols (like Ethernet-specific framing) will be used to move the IP packets between the various systems on the network itself.

TCP/IP Protocols and Services In-Depth

Whenever data is exchanged between two applications across a TCP/IP network, each of the major layers provided by TCP/IP come into play.

This can be seen with email clients that use the Simple Mail Transfer Protocol (SMTP) to send mail to a local server, as is shown in Figure 1-5. The email software on the client contains local application-specific code for parsing and displaying email messages, but everything else is done with network protocols such as SMTP, TCP, and IP.

As data is passed through each of the different layers, packets are generated that contain two distinct elements: headers and data. As information is passed down through the protocol stack, each layer encapsulates the previous layer's information (including both the header and the data) into a new packet, containing a new layer-specific header and the newly minted data segment. This process is shown in Figure 1-6.


		Figure 1-5. Some of the layers used by TCP/IP applications


		Figure 1-6. The sub-parts of layers

At the bottom-most layer, the physical network is used to transfer bits of data (called 밼rames? between two devices on the network. IP packets are contained within these network-specific frames. The only reason IP is used for this process is because the data can go over a variety of different network topologies, and as such the TCP/IP applications must have a way of addressing and routing traffic consistently, regardless of the specific networks in use.

Embedded within the IP datagrams are TCP segments, which provide a reliable virtual circuit for the SMTP application protocol to use. TCP does things like open a connection between two application protocol end-points, resend lost data,

remove duplicates, and exert flow control, each of which is beyond the simple delivery function of IP itself, yet is common enough to be useful as a separate, distinct service.

The SMTP application protocol contains application-specific semantics. In this case, this might consist of an SMTP command such as 밨CPT TO ehall?and an application-specific response code such as 250 (뱋kay?. Note that the commands and data used by SMTP conform to the NVT specification, which prescribes how the data should be formatted, the types of data allowed, and so forth, although SMTP is doing all of the real work.

As can be seen, each of the layers in the TCP/IP suite provide specific functionality to the layers above and below it, making the overall design extremely modular. It is this modularity that makes TCP/IP so powerful, and also what makes it so complex.

Data-Link Services

When two devices on a network communicate with each other, they don't use IP to do so. Rather, they use protocols that are specific to the wire itself. For example, devices on an Ethernet segment use a predefined series of electrical impulses to communicate with each other. Whenever an Ethernet device wants to send data to another device on the same network, it raises and lowers the voltage of the shared medium so that a series of 뱋n?and 뱋ff?voltage patterns are generated. These changes in voltage are interpreted as bits by the other devices on the network.

The changes in voltage are dictated by protocols that are specific to the different types of physical networks. Ethernet networks have data-link protocols that will not work with technologies like Token Ring. Similarly, modems use protocols specific to different types of modem technology.

Much of IP's functionality is determined by the physical media that the IP device is connected to. When an IP device has information that it needs to send to another device on the same wire, it has to understand the characteristics of the wire in order to prepare the information so that is usable for that particular medium.

One of the issues that IP has to deal with is the mechanisms used for the network-specific addressing. Just as physical networks have to provide mechanisms for encapsulating and disseminating data on the wire, they also have to provide a way for devices to locate each other, using addressing methods defined by the low-level protocols.

On shared networks, each device must have a unique hardware address in order for devices to indicate which node the traffic is for. Ethernet networks use a 48-bit

Media Access Control (MAC) address for this purpose, while Frame Relay networks use Data-Link Connection Identifier (DLCI) addresses, and so on. This concept is illustrated in Figure 1-7, where IP traffic for 192.168.10.40 is sent to the Ethernet address of 00:00:c0:c8:b2:27, using Ethernet-specific signalling.


		Figure 1-7. Topology-specific protocols and addressing

In contrast, modems are point-to-point; only two devices can communicate over any given circuit. As such, modem circuits don't use addresses per se, but instead just send and receive data over dedicated 뱓ransmit?and 뱑eceive?wires as needed. The same is true of T-1 lines and most other point-to-point circuit-based networks.

In all of these cases, the IP stack running on the local device must understand the addressing mechanisms used by the hardware, and implement it accordingly, just as it must understand the framing characteristics and signalling mechanisms in use on the physical network.

The Internet Protocol

When an IP-enabled device wants to send data to another IP node, the data-link services on that device convert the IP datagrams into a format usable by the local network medium, and then send the data to the destination system using the addressing and framing mechanisms dictated by the network.

These steps occur on each of the networks that an IP datagram traverses on its way to the final destination system. If an IP datagram were sent from a dial-up user working at her home in Los Angeles to a server in Rome, Italy, the number of networks that would be crossed could be quite high. But at each step of the way, the data would be transmitted using the low-level protocols appropriate for each of the particular networks being crossed.

In this regard, IP provides a virtual representation of the global Internet to the hosts that are on it. IP provides datagram formatting and addressing mechanism that is not dependent upon any of the specific characteristics of the individual networks that make up the global Internet. Data can be sent to an IP address, and the data will be encapsulated and transmitted according to the rules of each of the intermediary networks, with the IP datagram being used to provide delivery clues to the sending, receiving, and intermediary devices. Essentially, routing occurs at the network layer (IP), while delivery occurs at the data-link layer (Ethernet, modems, whatever).

This concept is illustrated in Figure 1-8. In that example, data sent over a modem would be encapsulated into a form usable by the dial-up connection. Once received, the data would be determined to be an IP datagram, and would then get converted into a form that was usable by the LAN connection and sent out again. The receiving system (Ferret) would eventually get the packets.


		Figure 1-8. IP datagrams versus the topology-specific protocols

One thing to keep in mind is that this was the primary design goal of IP, allowing it to scale beyond the packet-switched networks that made up the original Internet (which could not be grown easily or cheaply). Without moving to a virtual networking protocol like IP, the Internet would still be using packet-switching networks, and we'd all have WAN connections on our desktops instead of Ethernet or Token Ring (or, more likely, we wouldn't be using IP). But by leveraging the virtual nature of IP, we can use any network we want anywhere we want, and the IP data will still be deliverable across any of them.

One side effect of this design is that the IP datagram is a separate entity from the IP packets that are being used to transfer the datagram from the source to the destination. Whenever a device needs to send data, it will form an IP datagram containing both the data that needs to be send and whatever headers are required to deliver the data over IP to the destination system. However, as this datagram is sent across the network, it will be shipped as a series of packets that get created and destroyed by each network device that processes or forwards the datagram on to its final destination. In essence, the datagram becomes a series of packets, each of which can go anywhere they need to in order for the datagram to be delivered.

Another interesting aspect of IP is that it does not guarantee that any of these packets will ever get delivered at all. A system may be able to send the data, but the data may not be received intact, or the data may be ignored by the destination system due to high processing loads or some other reason. Although some networking topologies provide an intelligent retransmission mechanism in case data is lost, many of them do not. As such, IP's designers had to assume that data would get lost sometimes.

In this regard, IP offers absolutely no guarantees, leaving it up to higher-layer protocols to perform this function if required. For this reason, IP can be thought of as being unreliable, particularly in the sense that application designers (and users) should not assume that every IP datagram will arrive at its destination intact. Some people refer to this as 밷est-effort?delivery, while others refer to it jokingly as 밷est-of-luck?delivery.

Another key design goal of IP was the concept of datagram independence. The IP protocol does not dictate that all datagrams must travel the same route. In fact, IP dictates just the opposite: any datagram can travel across any network path that the devices on the network deem most suitable.

For example, a user in California may be downloading data from a host in Ireland, and some part of the network may simply stop functioning. The sending system (or a router somewhere in between the two systems) would eventually detect this failure and would begin forwarding datagrams through a different network path. This feature gives IP a robust and flexible foundation, allowing networks to become self-healing, in a sense.

Since each datagram is independent, it is likely that some datagrams will take different paths to the same destination. As such, one datagram may end up crossing a satellite link, while another datagram crosses a fiber-optic line. When this happens, the second datagram will likely arrive at the destination system before the first datagram does. In another situation, the satellite link may experience some sort of problem that results in the first datagram getting sent twice.

In both of these cases, the network has caused the IP datagrams to get out of synch. But since IP is simply a virtual representation of the network, it does not care when this happens. If sequencing is important to an application, then it has to implement that service directly or by using TCP (appropriately the Transmission Control Protocol) for transport-layer services.

Another related concept is fragmentation. Assume for a moment that the sending system were on a high-capacity network such as Token Ring, while the destination system were on a low-capacity dial-up connection. Since the sending system generates datagrams according to the characteristics of the local network, it generates large datagrams appropriate for the Token Ring frames.

But when the next system tries to relay the IP datagram to the dial-up recipient, the IP datagrams are too large for the dial-up network to handle in one frame. When this happens, the datagram must be split into multiple fragments, with each of the fragments being sent across the network as independent entities. Each fragment follows all of the other rules outlined earlier, thus being capable of getting lost in transit, routed separately, or arriving out of sequence.

When the fragments arrive at the destination system (if they arrive), then they need to be reassembled into their original datagram form. However, this only occurs at the final destination system, and not at any intermediary routers, since any of the fragments could have gone off in another direction.

Taken together, these services make IP a highly unreliable and unpredictable network protocol. Datagrams can get lost or may be broken into multiple packets, all without warning. The only thing IP does is move data from one host to another, one network at a time. Of course, users have little interest in applications that must be provided by a higher-level protocol than either IP itself (for example, TCP) or the application.

Remember this rule: The Internet Protocol is only responsible for getting IP datagrams from one host to another, one network at a time.

For more information on IP, refer to Chapter 2, The Internet Protocol.

The Address Resolution Protocol

Since two IP devices on the same physical medium communicate with each other using the low-level protocols specific to that physical medium, the two devices must locate each other's hardware addresses before they can exchange any data. However, each networking topology has its own addressing mechanisms that are different from all of the others, and IP has to be able to locate hardware addresses for each of them.

Since there are so many different types of networking topologies, it is not possible for IP to be imbued with the specific knowledge of how to build the address mappings for each of them explicitly. Attempting to do so would be an extraordinarily inefficient use of the IP software's basic functionality, preventing the rapid adoption of new networking topologies and introducing other fundamental problems into the network.

Instead, the Address Resolution Protocol (ARP) is used as a helper to IP, and is called upon to perform the specific task of building each address mapping whenever address conversion is required. ARP works by issuing a broadcast on the selected medium, requesting that the device using a specific IP address respond with its hardware address. Once the destination device has responded, the sending system can establish communication with the receiving system and start sending data to the discovered hardware address. This process is shown in Figure 1-9, with 192.168.10.10 issuing a lookup for 192.168.10.40, who responds with its local Ethernet hardware address.


		Figure 1-9. Using ARP to locate the hardware address associated with a known IP address

The ARP requests and responses work at the physical layer and are embedded directly into the frames provided by the low-level protocols in use on the physical medium. ARP messages do not use IP, but instead are entirely separate protocols.

For more information on ARP, refer ahead to Chapter 3, The Address Resolution Protocol.

The Internet Control Message Protocol

From time to time, IP datagrams will fail to get delivered. This may be due to errors in the datagram structure, a general network outage, or a delivery timeout. IP doesn't really care about these problems, since it never promised delivery in the first place. However, applications care about these problems very much. They would like to be able to react to a failure, either by taking an alternative course of action, or by at least informing the user that a problem has occurred.

IP uses the Internet Control Message Protocol (ICMP) for error-reporting services. When a system needs to report a problem that is preventing delivery from occurring, it generates an ICMP message that describes the general problem, and then sends it back to the original sender of the original datagram. ICMP messages are not sent when a packet is lost in transit or when some other transient error occurs. Rather, ICMP error messages are only sent when there is a detectable problem that is preventing certain packets or datagrams from being delivered due to a specific reason. This indicates that the sending host should probably stop trying to send those kinds of datagrams to this specific destination system, or that a different path should be used.

Even if two IP-enabled systems are able to communicate effectively, there are no guarantees that everything will work, since the data inside the datagrams may be corrupt, or packets may get lost without any ICMP errors being generated. IP is an unreliable network protocol by its very definition, and as such does not provide any guarantees. ICMP does not change this fact.

ICMP runs on top of IP, allowing it to traverse the global Internet just as easily as TCP or UDP messages. This seems a bit confusing to many people: if an IP datagram could not be delivered, it would not seem that an ICMP error—delivered over IP—would make it back to the original sender. However, remember that most delivery errors occur due to problems on the next leg of the network, and that the original IP datagram at least made it as far as the system that's reporting a problem. In this scenario, the network between the original sender and the host that's reporting the problem is likely to be functioning properly.

There are a variety of ICMP message types, and not all of them are limited to reporting on network errors. There are also ICMP 뱏uery?messages, useful for diagnosing and testing the network interactively. The most common of these are the ICMP Echo Request and Echo Reply query messages, which are better known as ping to most users.

For more information on ICMP, refer to Chapter 5, The Internet Control Message Protocol.

The Transport Protocols.

Application protocols do not communicate with IP directly, but instead talk to one of two transport protocols: TCP or UDP. In turn, these transport protocols pass data to IP, which encapsulates the data into IP datagrams that get sent over the network. In essence, the transport protocols hide the network from the application protocols so that they do not have to deal with packet-sizing and other issues, while also shielding the network from having to multiplex the application protocol traffic (a task that IP can leave to the transport protocols).

For example, both UDP and TCP provide a multiplexing service to application protocols by way of application-specific port numbers. Essentially, port numbers act as virtual post office boxes for messages to be delivered to within a single host, allowing multiple applications to run on a single host. When datagrams arrive at a destination system, they are handed off to the transport protocol specified in the datagram, which then delivers the transport-specific message data to the port number specified in the header of the message. In this manner, many different application protocols can run on the same host, using different port numbers to identify themselves to the transport protocols.

The transport protocol that an application protocol uses is determined by the kinds of network- and application-management services that are required. TCP is a reliable, connection-oriented transport protocol, providing error-correction and flow-control services that can tolerate IP's knack for losing packets. Conversely, UDP is an unreliable, message-centric transport protocol that offers little functionality over IP alone. There are many applications that need to use one of these models or the other, and there are a handful of applications that use both of them. A good example of an application that could use them both is a network printer.

If many users share a network printer, all of them need to be kept informed of the printer's availability. Using UDP, the printer could send out periodic status messages such as 뱋ut of paper?or 밹over open.?The software on the client PCs would then pick up on these status updates, changing the desktop icon appropriately, or notifying an administrator that something has gone awry. UDP allows the printer to notify everyone of these updates simultaneously, since it's not a big deal if some of these updates get lost.

This concept is illustrated in Figure 1-10, in which the printer at 192.168.10.2 is periodically sending out UDP broadcasts, indicating its current status. Network systems that are interested in that information can monitor for those updates and can change their desktop icons or management station software appropriately. If a system does not receive one of these updates for some reason, then it will probably get the next update message, so it's not a big deal.


		Figure 1-10. Using UDP to broadcast status updates

Conversely, when a user wants to print a file, she would prefer to use TCP, since that would ensure that the printer received all of the data intact. When the user wants to print, the client software on the end user's PC establishes a TCP session with the printer, sends the data to the printer's software, and then closes the connection once the job was submitted.

If the printer is functioning properly, it accepts the data, and uses the error-correction and flow-control services offered by TCP to manage the data transfer. If the printer is not available (perhaps it is out of paper, or low on toner), then it sends an appropriate message using the existing TCP connection. This ensures that the client is notified of whatever problem is preventing the print job from being serviced.

This process is illustrated in Figure 1-11. Here, the desktop PC is trying to print a file to the printer, but since the printer is out of toner, it rejects the connection. Because TCP is a reliable, circuit-centric protocol, the client is sure to get the message, even if it didn't get all of the UDP broadcasts sent earlier.

As you can see, both TCP and UDP provide functionality that is above that offered by IP alone, and both protocols are required to build an effective set of network applications.


		Figure 1-11. Using TCP for transaction-oriented applications

The Transmission Control Protocol

TCP provides error-correction through the use of a connection-oriented transaction. Whenever an application needs to send data to another host, TCP builds a 뱒tart?segment and sends it to the destination node. When the other system sends a 뱒tart?segment back (along with an acknowledgment that it got the first segment), a monitored conversation between the two systems begins.

TCP works in much the same way as a telephone conversation. When an application wants to trade data with another system, it first tries to establish a workable session. This is similar to you calling another person on the phone. When the other party answers (밐ello??, they are acknowledging that the call went through. You then acknowledge the other party's acknowledgment (밐i Joe, this is Eric?, and begin exchanging information.

If at any time during the call parts of the data exchange are lost (밪orry, what did you say??, the sender retransmits the questionable data. If the connection degrades to a point where no communication is possible, then sooner or later both parties simply stop talking. Otherwise, once all of the data has been exchanged, the parties agree to disconnect (밪ee ya?, and close the call gracefully. TCP follows most of these same rules, as is illustrated in Figure 1-12.

TCP segments are encapsulated within IP datagrams. They still rely on IP to get the data where it's going. However, since IP doesn't offer any guarantees regarding delivery, TCP has to keep track of the status of the connection at all times. This is achieved through the use of sequence numbers and acknowledgment flags embedded within the TCP header. Every byte of data sent over TCP must be acknowledged (although these acknowledgments are usually clumped together). If one of the systems does not acknowledge a segment, then TCP will resend the


		Figure 1-12. TCP virtual circuits versus telephone calls

questionable data. This provides error correction and recovery functions that overcome IP's unreliable nature.

The use of sequence numbers also allows TCP to implement flow control and other services on top of IP. Applications can send as much data as they need to, and TCP will break the data into chunks that will fit within IP segments. If the receiving system is unable to process data quickly enough, it can tell the sending system to slow down, thereby reducing the likelihood that data will get lost.

In addition, it is important to realize that TCP offers a byte-stream service for applications to use whenever they need to read and write data. Whenever an application needs to send data—whether that data is a 20-byte message or a two-megabyte file—the application can send the data in a stream to TCP, where it will be converted into manageable chunks of data that are sent (and tracked) over IP cleanly. Once the IP datagrams are received by the destination system, the data is made available to the destination application immediately, where it can be read and processed.

Applications such as the Internet's Simple Message Transport Protocol (SMTP) and Hypertext Transfer Protocol (HTTP) both require the reliable and controlled connection services that TCP provides. In addition, these types of application protocols also benefit from TCP's streaming model, allowing the applications to send data as a continual stream of bytes that will be read and processed by the recipient upon their arrival. Without these services, mail messages sent over SMTP and GIF images sent over HTTP would not flow smoothly, and would likely get garbled. And since TCP provides these services directly, applications do not have to embed these routines within their internal application code.

For more information on TCP, see Chapter 7, The Transmission Control Protocol.

The User Datagram Protocol

Not every application requires guaranteed delivery, and these applications typically use UDP for transport services. Unlike TCP, UDP sends only the data it has received from the application, and makes no pretense towards guaranteed delivery or flow control or anything else. As such, UDP is much like IP, but is the protocol that applications use to communicate with each other, rather than using IP directly.

UDP is much like a postcard. If you were travelling around a foreign country, you might send postcards to friends and family from the different cities that you visit, informing them of recent events. You wouldn't worry about the postcards getting delivered quickly, or even if they got lost entirely, since you'll probably send more postcards from the next town anyway. You wouldn't necessarily want the postcards to get lost, but at the same time you wouldn't rely on the postcards for any urgent business (like 뱒end money to the embassy?. For anything important, you'd use the telephone (TCP) to ensure that your message arrived intact and was processed correctly.

You may wonder why a UDP protocol exists, when it would seem that IP could serve the same function. The reason is simple: IP doesn't do anything but get datagrams from one host to another. IP doesn't provide any application interfaces or management services. UDP does provide these services, and it provides a consistent environment for developers to use when writing low-overhead network applications. UDP also provides application multiplexing services through the use of port numbers, allowing many application protocols to be used on a single host. Trying to do this with IP would require either a lot more transport protocols, or an application multiplexing layer within IP directly, neither of which would be very efficient.

Another benefit of UDP is that it offers a message-centric delivery model, allowing chunks of data to be sent as single IP datagrams (instead of being streamed over virtual circuits like they would with TCP). For example, a UDP-based application protocol can write a four-kilobyte block of data to UDP, and that block will be handed to IP directly. IP will then create an IP datagram that contains the entire four kilobytes, and send this data as a series of IP packets to the destination system (according to the rules defined for the network medium in use). Once all of the data arrives, the IP datagram is reassembled and the entire four-kilobyte UDP message will be handed to UDP for processing.

In this model, it is easy for applications to exchange record-oriented data (such as a fixed-length file or a atabase record), since the entire record can be read by a single operation. Since the IP datagram (and thus the UDP message) will be contained in a single message, if the client has received any of the data, then they will

receive all of the data in that message. Conversely, TCP would require that the client continually read the queue, waiting for all of the data to arrive, and having no clear indication of when all the data for that record had arrived (without also using application-specific markers in the data stream, anyway).

Also, applications that need fast turnaround or that already have their own internal error-correction routines can make good use of UDP because of its low overhead. Some database software packages can be configured to use UDP, and many file transfer protocols also use UDP because it is a light, fast, and message-centric protocol that is easier and faster than TCP, and that does not require the overhead of TCP's virtual-circuit model.

For more information on UDP, refer to Chapter 6, The User Datagram Protocol.

Presentation Services

Whenever application protocols wish to communicate with each other, they must do so using a predefined set of rules that define the types of data that will be exchanged. For example, if an application protocol is to use textual data, then those characters must have the same byte-order and binary values on both systems. For example, one system cannot use US-ASCII while the other system uses EBCDIC characters. Nor can one system pass data in 밷ig-endian?form to a processor that only understands 뱇ittle-endian?data, since the bits will be interpreted backwards.

For these reasons, the application protocols must agree to use certain types of data, and must also agree on how to present that data so that it is interpreted consistently. Typically, this falls under the heading of 뱎resentation layer services,?with some network architectures providing detailed presentation-layer specifications that cover everything from character sets to numeric formatting rules. However, TCP/IP does not have a formally defined presentation layer. Instead, it has many informal mechanisms that act as presentation layers, with each of them providing specific kinds of presentation services to different kinds of applications.

Most of the application protocols used on the Internet today use the Network Virtual Terminal (NVT) specification for presentation services. NVTs are a subset of the Telnet specification, and provide a basic terminal-to-terminal session that applications use to exchange textual data. The NVT specification defines a simple definition for the characters to use (seven-bit, printable characters from the US-ASCII character set) and end-of-line markers.

However, NVTs do not provide for much in the way of complex data types, such as numeric formatting. If an application needs to exchange a complex piece of data—including extended characters, long integers, and record markers—then NVTs can not be used alone. For this reason, a variety of other presentation-layer services are also used with TCP/IP applications, although typically these services are restricted to vendor-specific applications and offerings.

One presentation service that is popular with Microsoft-based applications is IBM's NetBIOS, a set of network APIs that provide functionality suitable for PC-based network applications. Another popular service is Sun's External Data Representation (XDR) service, a set of APIs that are useful for passing complex data types. Yet another popular service is the Distributed Computing Environment's Remote Procedure Call (DCE RPC) mechanism, useful for passing network-specific data between highly dissimilar hosts.

Each of these mechanisms is popular with different groups and for different reasons. But most Internet-based applications use just NVTs since they are usable on a wide variety of systems. Remember that many of the computing systems in use on the Internet are still quite old and are incapable of supporting anything other than seven-bit ASCII text.

Application Protocols

A variety of application protocols exist that provide standardized mechanisms for the exchange of information across vendor bounds. Among these are file transfer protocols such as FTP, Gopher, and HTTP; groupware and electronic mail services such as SMTP, POP3, IMAP4, and NNTP; and protocols for locating network resources such as DNS, Finger, and LDAP, among many others.

It's important to realize that client applications generally consist of two distinct components: the application protocol (such as HTTP or POP3), and an end-user interface that displays information. For example, a web browser uses HTTP (the protocol) to retrieve HTML and GIFs from a web server, but the code for displaying that data is a separate service that is not covered by the protocol specification.

For more on the common application protocols found on the Internet today, refer to the book Internet Application Protocols, which covers most of these protocols.

How Application Protocols Communicate Over IP

Almost all IP applications follow the same basic model: a client sends a request of some kind to a server running on another system, and the server examines the request, acts upon it in some form, and then possibly returns some form of data back to the client. This is not always the case (many UDP-based 뱒ervers?do not return any data, but simply monitor network activity), but it holds true for most applications.

Server-based applications (like an email server or web server) are generally loaded by the operating system when the computer is started. The servers then go into a 뱇isten?state, watching for incoming connections. Conversely, client applications will only establish a connection when some sort of action is required (like 밽et new messages?.

Applications communicate with the transport protocols through the use of 뱎orts,?which are unique I/O identifiers used by the transport protocols and the specific instance of the application protocol. 밣orts?are conceptually similar to the mail-boxes used at your local post office. When a letter comes in for a recipient, it is placed into a known mailbox reserved for that specific recipient. Whenever the recipient comes by, he will pick up any messages in that mailbox and process the data at his convenience.

Similarly, ports provide TCP and UDP with a way to deliver data to higher-layer application protocols. Every time an application protocol opens a connection to one of the transport protocols, it will allocate a port from the transport protocol, and then use that port for all network I/O. Any traffic that is destined for that particular application will be routed to the appropriate port for the application to deal with.

Just as every device on an IP network has a unique IP address, every instance of every application protocol also has a unique port number that is used to identify it to the transport protocols on the local system. This concept is illustrated in Figure 1-13, which shows how UDP reserves ports for specific applications. Any UDP or TCP messages that come into a system will be identified as destined for a specific port number, and the transport layer will use that information to route the data to the correct application.

Some applications can open many simultaneous network connections, and in this case, each instance would get its own port number. One example of this is the ubiquitous web browser, which can open many simultaneous connections to a remote web server, depending on the number of files that need to be downloaded from a web page. Each of these HTTP connections will get created as independent network connections, with each of the connections having unique port numbers for the client side of the connection. Once the web browser finishes downloading the objects, then each of the individual connections will be closed.

Every connection between a client and a server consists of four pieces of information: a source IP address, a source port number, a destination address, and a destination port number. All together, these four pieces of information make connections unique. For example, if a web browser were to open two connections to a web server, then the IP addresses of both hosts would be the same. In addition, the well-known server port number (80) would also be the same.


		Figure 1-13. Application-level multiplexing with port numbers

Therefore, in order for the individual connections to be unique, the client must use a different port number for each of the unique connections. Servers do not care if a single client asks for multiple connections, as long as each connection comes from a unique port number on the client, since each connection must be uniquely identifiable to the server.

This four-way identifier is called a 뱒ocket pair?in IP lingo, and is the basis of all communications for all application protocols. A 뱎ort?identifies a connection point in the local stack (i.e., port number 80). A 뱒ocket?identifies a IP address and port number together (i.e., port 80 on host 192.168.10.20 could be written as 뱒ocket 192.168.10.20:80.?. A 뱒ocket pair?refers to a distinct connection between two different applications, including the IP addresses and port numbers in use by both. Each individual connection requires that the socket pair contain at least one unique element.

Servers Listen for Incoming Connections

Most server-based IP applications use what are referred to as 뱖ell-known?port numbers. For example, an HTTP server will listen on TCP port 80 by default, which is the well-known port number for an HTTP server. This way, any HTTP client that connects to HTTP servers can use the default of TCP port 80. Otherwise, the client would have to specify the port number of the server that it wanted to connect with (you've seen this in some URLs that use http://www.somehost.com:8080/ or the like, where ?080?/I> is the port number of the HTTP server on www.somehost.com).

Most application servers allow you to use any port number you want. However, if you were to run your web server on TCP port 8080 for example, then you would have to tell every Internet user that your web server was not accessible on TCP port 80. This would be an impossible task. By sticking with the default, all users can connect to your web server using the default of TCP port 80.

Some network administrators purposefully run application servers on non-standard ports, hoping to add an extra layer of security to their network. However, it is the author's opinion that security through obscurity is no security at all, and this method should not be relied upon by itself.

In addition to the reserved addresses that are managed by the IANA, there are also 뱔nreserved?port numbers that can be used by any application for any purpose, although conflicts may occur with other users who are also using those port numbers. Any port number that is frequently used is encouraged to register with the IANA.

For a detailed listing of all of the port numbers that are currently registered, refer to the IANA's online registry (accessible at http://www.isi.edu/in-notes/iana/assignments/port-numbers). To see the well-known ports used on your system, examine the /etc/services file on a Unix host, or the C:\WinNT\System32\Drivers\Etc\SERVICES file on a Windows NT host.

Clients Open Connections to Servers

In contrast to server-based applications that are always listening for incoming connections on a fixed port number, client applications will use a randomly assigned port number for their end of the connection. Whenever an IP application needs to send data, the transport protocol will allocate a random port number above 1024 and use this port number for all incoming and outgoing data associated with that application.

For example, when a POP3 client is used to establish a connection with a mail server, the client application will pass an application-specific command to TCP, specifying the server's IP address and port number as the destination. TCP will then add its own information—including stuff like the port number of the local

POP3 client—and hand the entire package off to IP for delivery. IP then does its best to get the message to the destination system.

When the mail server's IP stack receives the datagram, it verifies that it contains a TCP segment, and then hands the contents off to TCP for further processing. TCP will see that the destination port number refers to the local POP3 server, and then hand off the original application command. Once the server has processed the command, it will reverse the process, sending whatever data it generates back to the port number and IP address in use by the client. Once the transaction is finished, the client's TCP port will be released. Any subsequent connections would require a new connection be opened, with a different port number being allocated by the client.

'UP! > Web Service' 카테고리의 다른 글

The Address Resolution Protocol (0)	2008.08.21
The Internet Protocol (0)	2008.08.21
Internet Core Protocols (0)	2008.08.21
[토론4] 웹서비스와 시맨틱 웹 (0)	2008.08.21
RPC:Remote Procedure Call (0)	2008.08.21

Posted by 으랏차

Internet Core Protocols

UP!/Web Service 2008. 8. 21. 14:14

Internet Core Protocols:
The Definitive Guide

By Eric Hall
February 2000
1-56592-572-6, Order Number: 5726
469 pages, $39.95 US, $58.95 CA, ?8.50 UK.

Table of Contents

Foreword

Preface

1. An Introduction to TCP/IP
A Brief History of the Internet
TCP/IP's Architecture
TCP/IP Protocols and Services In-Depth
How Application Protocols Communicate Over IP

2. The Internet Protocol
The IP Standard
The IP Header
IP in Action
Troubleshooting IP

3. The Address Resolution Protocol
The ARP Standard
The ARP Packet
ARP in Action
Debugging ARP Problems

4. Multicasting and the Internet Group Management Protocol
The IP Multicasting and IGMP Specifications
IGMP Messages
Multicasting and IGMP in Action
Troubleshooting Multicasts and IGMP

5. The Internet Control Message Protocol
The ICMP Specification
ICMP Messages
ICMP in Action
Troubleshooting ICMP

6. The User Datagram Protocol
The UDP Standard
The UDP Header
Troubleshooting UDP

7. The Transmission Control Protocol
The TCP Standard
The TCP Header
TCP in Action
Troubleshooting TCP

A. The Internet Standardization Process

B. IP Addressing Fundamentals

C. Using the CD-ROM

Bibliography

'UP! > Web Service' 카테고리의 다른 글

The Internet Protocol (0)	2008.08.21
An Introduction to TCP/IP (0)	2008.08.21
[토론4] 웹서비스와 시맨틱 웹 (0)	2008.08.21
RPC:Remote Procedure Call (0)	2008.08.21
RMI (0)	2008.08.21

Posted by 으랏차

[토론4] 웹서비스와 시맨틱 웹

UP!/Web Service 2008. 8. 21. 14:13

숭실대학교 컴퓨터학부 교수

1.개요
2. 시맨틱 웹 기술
3. 시맨틱 웹 피라미드
4. 시맨틱 웹 시장
5. 시맨틱 웹 시장 동향
6. 결론

1. 개요

1989년에 Tim Berners-Lee에 의해서 제안된 웹(Werld Wide Web: WWW)은 사람들 간의 정보 공유에 매우 큰 영향을 미쳤다. 많은 사람들은 자신이 생각한 내용이나 연구 결과를 HTML이라는 간단한 markup 언어로 표현하여 웹에 올려 놓을 수 있고, 웹에 올려진 정보는 사람들 간의 정보 공유에 매우 큰 기여를 하였다. 이에 따라 수많은 정보가 인터넷 상에 발표되고 유통되어서 사람들은 정보의 바다에서 자신이 원하는 정보를 유용하게 찾을 수 있고 이를 활용할 수 있게 되었다[1]-[2].
그러나 정보의 바다에 수많은 정보가 올려짐에 따라서 자신이 원하는 정보를 발견하는 작업이 점점 더 어려워지는 현상이 발생하게 되었다. 정보의 양이 적을 때는 웹으로부터 원하는 정보를 발견하는 데 어려움을 느끼지 못했지만, 점점 정보의 양이 많아짐에 따라서 정보를 발견하는 데 매우 많은 시간을 투자해야만 하는 현상이 생기게 되었다. 또한 웹에는 정보만이 존재하는 것이 아니라 점차적으로 서비스를 제공하는 응용 프로그램이 등장하게 되었는데, 이와 같은 응용 프로그램을 유효 적절하게 찾아서 활용하는 데도 많은 어려움에 봉착하게 되었다.
이러한 어려움을 극복하기 위해서 컴퓨터공학 분야에서는 소프트웨어 에이전트라는 방안을 제안하였다.

소프트웨어 에이전트는 사람이 하는 일을 대신하여 수행할 수 있는 지능형 소프트웨어를 총칭하여 부르는 이름이다. 사람을 대신한다는 의미는 일을 시킨 사람이 만족할 정도의 작업 결과를 내놓을 수 있다는 것을 뜻한다. 따라서, 소프트웨어 에이전트는 사람이 지시한 작업을 지시한 사람이 만족할 수 있는 수준의 결과를 제출할 수 있어야 한다. 그러나 사람이 지시한 일이라는 것이 일반적인 분야에서는 매우 높은 지능을 요구하기 때문에 지능형 에이전트를 구축한다는 것은 매우 어려운 일이다. 창의적인 작업이나 매우 복잡한 이론과 기술을 복합하는 분야에 에이전트를 활용하는 데는 아직까지 기술적인 어려움이 많이 있다. 그러나 비교적 단순 작업을 수행하는 분야에서는 에이전트가 성공적으로 사람의 일을 대신할 수 있다. 그러므로 웹은 에이전트가 사람에게 도움을 주기에 매우 좋은 분야이다. 사람이 웹에 있는 정보나 서비스를 검색하는 행위는 아직까지 단순한 작업이기 때문이다. 즉, 사람들은 자신이 원하는 정보의 주요 키워드를 주면 소프트웨어 에이전트는 웹에 있는 문서들에서 사람이 입력한 키워드가 있는 정보나 서비스를 찾아주면 되는 단순 작업만을 수행하면 되기 때문에 소프트웨어 에이전트가 매우 성공적으로 사람이 하는 일을 대신할 수 있었다. 이와 같은 대표적인 소프트웨어 에이전트가 우리가 일상 생활에서 활용하는 Google 같은 웹 검색 프로그램이다.
그러나 사람들은 이와 같은 단순 작업만을 수행하는 에이전트를 보다 지능화하기를 원하게 되었다. 그 이유는 단순한 에이전트만을 이용해서는 바쁜 일상 생활에서 자신이 만족할 수 있는 결과를 얻을 수 없었기 때문이다. 예를 들면, 키워드를 이용한 검색 프로그램을 이용하면 자신이 원하는 정보가 바로 나오는 것이 아니라 수천 또는 수만의 결과가 나오고 다시 사람이 이 중에서 자신이 원하는 정보를 찾는 수고를 해야 한다. 웹에 정보의 양이 적을 때에는 문제가 별로 심각하지 않았지만, 정보의 양이 많아지고 사람들이 보다 빠르게 정보를 요구하는 욕구가 높아짐에 따라서 기존의 웹은 많은 문제점을 노출하기 시작했다. 또한, 사람은 에이전트를 이용하여 웹에 있는 서비스를 자동으로 받고 싶지만, 기존의 에이전트는 단순 작업만을 할 수 밖에 없기 때문에 이와 같은 요구 사항을 만족시킬 수 없었다.
이와 같이 소프트웨어 에이전트가 제대로 동작을 못한 이유는 웹의 구조적인 문제에서 유래한다. 즉, 웹에 있는 모든 정보는 사람을 위한 정보이다. 사람은 웹에 있는 정보를 이해할 수 있으나, 소프트웨어 에이전트는 웹에 있는 정보를 이해할 수 없기 때문이다. 이와 같은 문제를 극복하기 위해서 웹을 주창하였던 Tim Berners-Lee는 1999년에 W3C를 중심으로 차세대 웹기술인 시맨틱 웹(Semantic Web)을 제안하게 된다[1]-[2],[8].

2. 시맨틱 웹 기술

가. 지능형 에이전트 기술

시맨틱 웹의 출현은 지능형 에이전트를 위한 새로운 공간을 제공하기 위한 것이라고 할 수 있다. 즉, 웹이 사람을 위한 사이버 공간을 제공하였다면, 시맨틱 웹은 지능형 에이전트를 위한 공간을 제공하여 보다 사람이 편리하게 에이전트로부터 정보를 찾거나 서비스를 받을 수 있게 하는 것이다. 웹에 있는 HTML문서는 사람이 보고 이해하는 데는 불편함이 없도록 설계되어 있으나, 소프트웨어 에이전트는 HTML문서를 이해할 방법이 없다. 예를 들면, 우리가 프랑스 영화를 볼 때 불어를 모르면 영화의 내용을 이해할 수 없는 것과 같이 소프트웨어 에이전트도 HTML문서를 보면 그 내용을 이해할 수 없다. 번역된 자막을 보면 우리가 프랑스 영화의 내용을 이해할 수 있는 것과 유사하게 소프트웨어 에이전트는 시맨틱 웹 환경에서 HTML문서의 메타데이터를 보면 HTML문서의 내용을 파악하게 된다. 그러나 불어 대본이 가지고 있는 의미와 자막에 번역된 의미의 차이는 매우 크다. 번역된 자막은 영화를 이해하는 데 필요한 최소한의 정보만을 가지고 있을 수 밖에 없다. 이와 같이 시맨틱 웹 공간에서는 지능형 에이전트가 웹에 있는 정보나 서비스를 한정된 범위에서 이해할 수 있는 구조를 만들어 줌으로써, 사람이 지능형 에이전트를 이용하여 보다 복잡한 작업을 시킬 수 있게 된다.
따라서, 소프트웨어 에이전트가 이해할 수 있는 단순한 언어를 이용하여 HTML 문서의 내용을 메타데이터의 형태로 표현하면 된다. 즉, HTML 문서가 있는 웹 공간은 사람을 위한 사이버 공간이고, HTML 문서의 메타데이터가 있는 웹 공간은 소프트웨어 에이전트를 위한 사이버 공간이라고 할 수 있다. 이때 메타데이터는 HTML문서의 내용을 표현하고 있어야 하는데, 사람이 생각한 모든 내용을 표현한다는 것이 불가능하므로 사람의 생각 중에서 소프트웨어 에이전트에게 가장 필요한 부분만을 개념화하여 표현하는 방식을 취하게 된다. 이와 같이 시맨틱 웹에서는 두개의 공간을 준비하여 지능형 에이전트가 활동을 할 수 있는 구조를 제공하게 된다.
시 맨틱 웹 환경에서는 기존의 웹 공간과 이 웹에 있는 정보를 표현하는 메타 공간으로 구성된다. 이 메타 공간에 있는 정보는 웹 공간에 있는 HTML문서의 의미를 표현하고 있으며, 지능형 에이전트가 메타 공간에 있는 정보를 이해할 수 있도록 설계되어 있다. 메타 공간에 표현되는 언어는 한정된 vocabulary를 가지고 있고, 일정한 규칙을 가지도록 설계되었으므로 소프트웨어 에이전트는 메타 공간에 표현되는 언어를 이해할 수 있다. 따라서, 소프트웨어 에이전트는 사람이 지시한 작업을 수행하는데 웹에 있는 문서의 의미를 파악하면서 자동으로 문제를 해결할 수 있는 기반 구조를 가지게 되었다. 앞에서 들은 영화 예를 든다면, 영화의 자막이 없는 경우에 우리는 영화를 보면서 영화의 내용을 추측만하게 되지만 자막이 있는 경우는 영화의 내용을 알 수 있으므로 영화 스토리를 다른 사람에게 설명할 수 있게 된다. 기존의 웹에 있는 모든 소프트웨어 에이전트는 우리가 자막이 없는 프랑스 영화를 보는 것과 같이, 단순히 키워드 기반의 분석 작업에 의존하는 한계를 가지면서 사람이 지시한 작업을 수행하게 된다. 그러나 시맨틱 웹 환경에서의 지능형 에이전트는 웹에 있는 정보의 의미를 이해할 수 있는데, 이것은 우리가 영화의 자막을 보는 것과 유사하게 웹에 있는 HTML 문서의 메타데이터를 보면서 이해할 수 있게 된다.

나. 온톨로지 기술 개발

지능형 에이전트가 동작할 수 있도록 구조적인 공간을 만들어 주기 위해서는 사람이 만든 HTML문서의 내용을 메타데이터의 형식으로 표현하여야 한다. 이와 같은 메타데이터 공간을 지능형 에이전트가 활용하게 되므로 메타데이터를 표현하는 것은 매우 중요하다. 사람은 머리 속에 있는 생각과 개념을 이용하여 HTML문서를 표현하였으므로, 이와 같은 HTML 문서가 가지는 의미를 소프트웨어 에이전트에게 표현하기 위해서는 사람이 가지고 있었던 개념이나 생각을 표현하는 것이 필요하다. 이와 같은 시맨틱 개념을 이용하여 소프트웨어 에이전트를 위한 메타데이터를 표현하게 되는데, 사람이 생각한 시맨틱 개념을 소프트웨어 에이전트가 이해할 수 있도록 표현한 것을 온톨로지(Ontology)라고 한다[3].
온톨로지는 두가지 중요한 의미를 가지게 된다. 첫번째는, 메타데이터를 표현하기 위한 해당 분야의 개념이라고 할 수 있다. 예를 들면, 앞에서 예를 든 영화가 공상과학 영화라면 자막에는 과학 용어가 나오게 될 것이고 우리는 그 과학 용어에 대한 개념을 알고 있을 때 그 자막을 이해할 수 있다. 반대로 그 영화가 프랑스 역사물이라면 자막에 나오는 역사 사건에 대한 개념이 있을 때 그 영화의 내용을 이해할 수 있다. 이와 같이 자막과 같은 메타데이터가 있더라도, 그 자막에 있는 vocabulary에 대한 개념을 가지고 있을 때 그 내용을 알 수 있게 된다. 그러므로 온톨로지는 메타데이터에 있는 vocabulary에 대한 개념이라고 할 수 있다.
두번째는, 이와 같은 온톨로지는 공유(shared)된다는 점이다[3]. 즉, 온톨로지를 설계하는 것은 시맨틱 웹 프로그래머이고 이 프로그래머는 온톨로지를 설계할 때 일정한 규칙과 잘 정의된 방식에 따라서 설계하여 이 내용을 소프트웨어 에이전트에게 알려주게 된다. 이와 같은 과정을 거쳐서 온톨로지는 사람도 이해하고 소프트웨어 에이전트도 이해하는 특징을 가지게 된다. 여기서 소프트웨어 에이전트가 이해한다는 것은 에이전트가 온톨로지를 파싱할 수 있고, 그 구조를 이해한다는 것을 의미하는 것으로 지능을 가지고 이해한다는 것은 아니다. 따라서, 온톨로지 개발과정은 지능형 시스템의 지식베이스를 개발하는 과정과 매우 유사하다. 또한 온톨로지는 지능형 에이전트가 제공할 서비스에 따라서 다른 내용과 형식을 가지게 된다.

다. 메타데이터 표현 기술

시맨틱 웹이 성공하기 위해서는 메타데이타가 필수적이다. 사실 시맨틱 웹이 출현한 이유는 지능형 에이전트가 웹의 정보를 이해할 수 있는 공간을 구축하기 위함인데, 이와 같은 공간이 결국은 메타데이터이다. 즉, 웹에서는 HTML 문서와 같은 한 종류의 정보만 존재하지만, 시맨틱웹에서는 HTML문서가 가지고 있는 의미있는 정보를 지능형 에이전트가 이해할 수 있도록 메타데이터를 필요로 한다. 메타데이터 웹상의 문서나 정보의 의미를 표현하는데, 예를 들면, 그 문서의 주요 개념, 저자, 저작 년월일 등이다. 어찌보면 매우 간단한 정보같지만 기존의 웹 상에서는 이와 같은 정보가 명시적으로 표현되지 않았기 때문에 소프트웨어가 처리하는데 매우 불편했고 따라서 소프트웨어의 지능이 떨어지게 되었다[1],[3].

3. 시맨틱 웹 피라미드

(그림 1)은 시맨틱 웹의 계층적 구조를 잘 나타내주는 대표적인 그림이다. (그림 1)을 보면 시맨틱 웹의 하부는 Unicode, URI, XML이라는 것을 쉽게 알 수 있다. 시맨틱 웹에서는 XML 구조를 사용하지만 단순히 markup 용으로 XML을 활용하고 있다. 시맨틱 웹에서는 모든 자료를 그래프의 구조로 보고 있다. 그래프의 한 조각은 subject, object과 이 둘을 연결하는 링크 정보로 표현된다. 이와 같은 그래프 구조를 표현하기 위해서 시맨틱 웹에서는 RDF(Resource Description Framework) 방식을 이용하고 있다. 앞의 구조에서 Ontology vocabulary, Logic, Proof는 시맨틱 웹의 중요한 구성요소이다. 이 부분이 가지는 의미는 앞에서도 설명하였듯이 소프트웨어 에이전트가 이해할 수 있는 온톨로지 vocabulary로 메타데이터를 표현하면, 이 내용을 소프트웨어 에이전트가 이해하고 인공지능의 로직 기반의 추론 과정을 이용하여 사람이 지시한 작업을 지능적으로 수행한다는 것을 의미한다. 이와 같이 로직 기반의 지능형 추론과정을 거치게 되면, 소프트웨어 에이전트는 자신이 수행한 작업을 설명할 수 있게 된다. 따라서, 에이전트가 수행한 작업의 내용을 Proof 하는 기능을 가지게 된다. (그림 1)은 시맨틱 웹이 지능형 소프트웨어 에이전트가 작업을 수행할 수 있는 인프라를 구축하고 있다는 점을 중시하고 있다고 볼 수 있다. 마지막으로, 소프트웨어 에이전트는 나쁜 의도를 가지고 있으면 안되고 서로 믿을 수 있도록 설계되어야 한다는 내용을 표현하고 있다.

시맨틱 웹은 HTML, XML을 기반으로 하고 있으며, 인공지능, 소프트웨어 공학, 프로세스 모델링 등의 학제간 연구를 통해서 발전하고 있다. (그림 2)에서 보여 주는 바와 같이 미국과 EU에서는 RDF, DAML(DARPA Agent Markup Language), OIL(Ontology Inference Layer)와 같은 온톨로지 모델링 언어를 개발하였다. 그 후에 미국과 EU는 공동으로 DAML+OIL를 제안하게 되었고 W3C는 2004년에 OWL(Web Ontology Language)을 제안하게 된다[1]-[3]. OWL은 현재 시맨틱 웹에서 온톨로지 표현 언어의 표준으로 채택되고 있다. <표 1>은 온토롤지 모델 언어의 종류와 특징을 보여주고 있다.

4. 시맨틱 웹 시장

가. 시맨틱 웹 시장

시맨틱 웹 시장은 현재 매우 초기 단계에 있다. 현재의 실정은 1980년대 말의 웹 시장과 유사한 상황이다. 온톨로지 표현 언어는 표준화가 된 상태이지만, 메타데이터를 자유롭게 저작할 수 있는 저작 도구가 보편화되어 있지 않아서 일반대중이 활용하는 데는 많은 어려움이 있다. 그러나 초기의 웹 시장이 그러하듯이 전문적인 분야에서는 매우 활발하게 시장을 형성하고 있다. 미국과 EU에서는 W3C[7], DARPA[6], DERI[5] 등의 연구소와 IBM, HP, Google 등의 회사에서 LEAD(Live Early Adoption and development)와 같은 방식으로 컬러 애플리케이션(killer application)을 발굴하는 작업을 수행하고 있다. 현재 시맨틱 웹 시장은 온톨로지나 메타데이터를 적극 활용하는 분야나 활용할 수 밖에 없는 분야
에서 시장을 형상하고 있다. 전자의 경우는 블로그 시장을 예로 들 수 있는데, 블로그 시장에 있는 모든 사용자들은 자신의 정보를 광고하는 데 열심인 그룹들이다. 이 그룹에서는 자신의 정보를 널리 알리기 위해서 온톨로지 기반의 메타데이터를 적극 활용하는 분위기이므로 시맨틱 웹의 컬러 애플리케이션이 생길 수 있는 가능성이 매우 크다. 이런 이유 때문에 MIT 인공지능 연구실과 IBM T.J. Watson연구소, 그리고 EU의 SWAD-E(Semantic Web Advanced Development-Europe) 등에서 연구가 진행되어 왔다. 두번째 시장은 엔터프라이즈 소프트웨어 시장이다. 이 시장은 지식 공유를 매우 중시하는 분야이므로 온톨로지를 이용한 메타데이터를 저작한다는 불편함을 극복하고서라도 지능형 소프트웨어를 생산하려는 의지가 매우 높은 시장이다. 따라서, 향후의 ERP, CRM, SCM시장 등도 점진적으로 온톨로지 기반의 지능형 소프트웨어로 발전할 것이라는 예측이 많이 나오고 있다. 이와 같은 시장이 충분히 성숙되면, 일반 사용자가 시맨틱 웹 시장을 창출할 것으로 예상하고 있다. Google과 같은 회사에서는 2010년경에 일반 사용자들이 시맨틱 웹을 활발히 활용할 것이라는 전망하에 수년전부터 시맨틱 웹에 대한 연구를 진행하고 있다.

나. 시맨틱 웹 서비스 시장

현재 웹 시장은 서비스를 포함하는 시장으로 확산되고 있다. 즉, 과거의 웹에는 문서나 정지영상과 같은 정적인 콘텐츠를 포함하고 있었지만, 근래에 들어 웹에서 서비스를 처리할 수 있는 응용 소프트웨어가 출현하고 있다. 현재, WSDL, UDDI, SOAP과 같은 방식을 이용하여 웹에 있는 서비스를 표현하고, 저장하고, 검색하는 웹 서비스 시장이 활발히 만들어지고 있다. 시맨틱 웹 서비스 분야에서는 이와 같은 웹 서비스에 온톨로지를 이용하여 서비스에 대한 메타데이터를 표현하므로써 지능형 에이전트가 사람을 대신하여 작업을 수행할 수 있는 시장을 개척하는 시도가 이루어지고 있다. 궁극적으로 미래 사회에서는 현재 사람이 하는 모든 웹상의 트랜잭션을 지능형 에이전트가 대신하게 될 것이므로 시맨틱 웹 서비스 기술은 매우 필수적일 것으로 예측을 하고 있다. 시맨틱 웹 서비스 분야는 DARPA에서 주도적으로 연구와 개발이 진행되고 있다. 현재는 웹에 있는 서비스를 표현하기 위한 온톨로지 표현 언어에 대한 연구가 활발히 진행되고 있다. 또한 OWL-S를 이용한 웹서비스 온톨로지 구축을 하고 있다. 한가지 문제점은 기존의 대형 회사들이 이미 WSDL(Web Service description Language), BPEL4WS(Business Process Execution Language for Web Service)과 같은 표준안을 사용하고 있기 떄문에 기업과 연구소간의 공동 작업이 매우 어려운 실정이다. 또한, EU에서는 DERI(Digital Enterprise Research Institute)를 중심으로 시맨틱 웹 기술을 이용한 웹 서비스에 대한 연구가 활발히 진행되고 있다[5]. DERI에서는 WSMO conceptual 모델을 이용한 웹 서비스에 대한 연구를 진행 중에 있다. DARPA의 OWL-S나 DERI의 WSMO는 시맨틱 웹 서비스를 지향하는 점에서는 같은 방향이지만, OWL-S가 logic 중심이고 procedural 실행과정을 중시한 반면에 DERI의 WSMO는 보다 declarative한 방식을 취하고 있는 차이점이 있다.
시맨틱 웹 서비스 연구의 성공은 기업에서 사용할 수 있는 시맨틱 웹 서비스 온톨로지를 개발하는 것이 무엇보다 중요하다. 아무리 좋은 온톨로지라 하더라도 기업에서 현재 사용하고 있는 WSDL, BPEL과 연동할 수 없다면 매우 늦은 속도로 시맨틱 웹 서비스 기술과 시장이 형성될 것이다. 또한, 시맨틱 웹 서비스 온톨로지를 구축할 수 있는 인프라와 인력 양성이 중요한 문제점으로 대두되고 있다. 현재 OWL-S 온톨로지를 구축할 수 있는 저작 도구가 매우 빈약한 상태이고, OWL-S나 WSMO의 개념을 이해하여 온톨로지를 활용할 수 있는 전문 인력이 적다는 점도 문제점으로 대두되고 있다.

5. 시맨틱 웹 시장 동향

시맨틱 웹과 관련한 시장의 출현은 필수적인 것으로 간주되고 있다. 지금의 시맨틱 웹 시장은 1980년 말의 웹 시장과 매우 유사한 상황이다. 1980년대 말의 웹 시장이 형성되기 시작해서 지난 10여년간 매우 비약적으로 새로운 시장과 산업이 형성되었듯이 시맨틱 웹 시장도 향후 10년 동안 매우 빠른 속도로 확장될 것으로 예상하고 있다. Gartner 그룹의 분석에 의하면 온톨로지 기반의 기술이 향후 핵심 기술로 자리 잡을 것이며 2007년경에는 간단한 온톨로지가 응용 프로젝트의 75%를 차지할 것으로 예상하고 있다[9]. 2010년경에는 인공지능의 지식표현 기술[4]을 활용하는 강력한 온톨로지가 응용 프로젝트의 80% 이상을 차지할 것으로 예상하고 있다. 또한 Forrester의 연구 결과에 의하면 온톨로지를 이용하는 소프트웨어가 궁극적으로 성공할 것으로 예측하고 있다. 온톨로지 기반의 소프트웨어 에이전트를 활용하는 소프트웨어가 성공할 것이며 그렇지 못하는 응용 소프트웨어는 결과 쇠퇴할 것으로 예상하고 있고, 더 나아가서 점진적으로 자동화 할 수 있는 온톨로지 기술이 필요할 것으로 예상하고 있다[10]. TopQuadrant 사는 IT 분석가와 벤더들의 연구 결과를 바탕으로 2010년의 시맨틱 웹 시장을 (그림 3)과 같이 예상하고 있다. 이와 같은 분석은 향후에 활용될 Knowledge Management Infrastructure, 온톨로지 기반의 응용 소프트웨어, ERP/CRM 등과 같은 enterprise-class 응용 소프트웨어, 포털, 웹 서비스, 그리드 컴퓨팅, 유비쿼터스 컴퓨팅 등의 모든 소프트웨어를 대상으로 2010년의 시맨틱 웹 시장의 규모를 예측하고 있다.

TopQuadrant사의 분석을 보면, 시맨틱 웹을 위한 새로운 인프라를 구축하는 시장을 예상하고 있다. 현재 시맨틱 웹 시장을 활성화하기 위해서는 시맨틱 웹의 구축, 저장, 유통을 위한 제반 인프라가 필요한 실정이다. 이와 같은 현상은 1990년대 초반에 웹 시장이 시작할 때와 유사한 상황이라고 볼 수 있다. 근래에 들어 웹 시장에서는 많은 저작 도구와 관련된 유틸리티들의 출현으로 웹 산업의 활성화를 지원하고 있다. 시맨틱 웹 시장에서도 이와 같은 시도가 필요하다. 또한, 기존의 웹 소프트웨어를 시맨틱 웹 소프트웨어로 통합하기 위한 시장이 매우 활성화 될 것으로 예상한다. 이와 같은 시장은 시맨틱 웹 소프트웨어들의 통합 시장도 포함한다. 한가지 주목할 점은 시맨틱 웹 시장에서는 사람의 생각을 한정된 범위에서 이해하고 사람의 작업을 대신 수행할 수 있는 지능형 에이전트가 동작할 수 있는 인프라를 제안하고 있으므로, 이러한 기능을 활용할 수 있는 Discover & Access, Reasoning 시장이 활성화 된다는 점이다. 이와 같은 시장에서는 단순한 키워드 기반의 검색이 아니라 시맨틱 웹 상의 문서나 서비스의 의미를 파악한 지능형 에이전트가 자동으로 검색하고 계획할 수 있는 시장이 새롭게 형성되는 것을 의미한다. 또한, 통신(Communicate) 시장은 지능형 에이전트들 사이에 시맨틱 수준의 통신을 통하여 사용자가 요구한 작업에 대해서 전문가 수준의 결과를 얻기 위한 협업 시장을 의미한다.
IDC, Gartner, Meta Group, VSS, McKinset, TopQuadrant 사의 분석에 의하면 현재 2003년의 시맨틱 기술 시장이 20억 달러 규모에서 2010년에는 630억 달러 규모로 다른 소프트웨어, 하드웨어, 서비스 시장보다 현저하게 시장 규모가 확장될 것으로 예상하고 있다.

6. 결론

시맨틱 웹 분야는 새로 출현하는 시장이고 필연적으로 발전할 수 밖에 없는 분야로 간주되고 있다. 1980년 말에 웹 시장이 현재와 같이 큰 시장이 될 것이라고 예측한 사람들은 별로 많지 않았을 것이다. 2005년 현재 시맨틱 웹 시장에 대한 예측은 우리가 1980년 말에 웹 시장을 예측한 것과 같은 관점에서 접근할 필요가 있다고 본다. 그리고 시맨틱 웹은 지능형 에이전트가 동작하기 위한 인프라를 제공하고, 이를 기반으로 자동화되고 지능적인 소프트웨어가 동작하는 새로운 지능형 소프트웨어 시장을 가능하게 해준다. 이와 같은 시장은 매우 깊이 있는 원천 기술과 핵심 기술을 바탕으로 할 때, 국산 소프트웨어가 세계 시장을 선도할 수 있을 것이라고 예측할 수 있다. 단순한 프로그램 작업 만으로는 새롭게 도래하는 시맨틱 웹 시장을 선도할 수 없고, 보다 깊이 있는 원천 핵심 기술을 바탕으로 할 때 국산화율이 높은 시맨틱 웹 기반의 소프트웨어를 창출 할 수 있을 것이다.

<참 고 문 헌>

[1] Dieter Fensel, James Hendler, Henry Lieberman, and Wolfgang Wahlster, Spinning the Semantic
Web, MIT Press, 2003.
[2] Grigoris Antoniou and Frank van Harmelen, A Semantic Web Primer, MIT Press, 2004.
[3] Dieter Fensel, Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce,

2nd edition, Springer, 2004.
[4] Ronald J. Brachman and Hector J. Levesque, Knowledge Representation and Reasoning, Morgan
Kaufmann Publisher, 2004.
[5] DERI, http://www.deri.org
[6] DARPA, http://www.darpa.org
[7] W3C, http://www.w3c.org
[8] Tim Berners-Lee, James Hendler, Ora Lassila, “The Semantic Web,” Scientific American, May

2001.
[9] Gartner, Semantic Web Rechnologies Take Middleware to the Next Level, 2002.
[10] Forrester Research, How Things Will Communicate, 2001.

제공 : DB포탈사이트 DBguide.net

'UP! > Web Service' 카테고리의 다른 글

An Introduction to TCP/IP (0)	2008.08.21
Internet Core Protocols (0)	2008.08.21
RPC:Remote Procedure Call (0)	2008.08.21
RMI (0)	2008.08.21
UDP(User Datagram Protocol) (0)	2008.08.21

Posted by 으랏차

RPC:Remote Procedure Call

UP!/Web Service 2008. 8. 21. 14:13

* 원격 프로시저 호출(RPC:Remote Procedure Call)

원 격 프로시저 호출(RPC:Remote Procedure Call)은 이종 플랫폼에 상의 분산 어플리케이션의 상호처리작용, 이동 성, 유연성을 증가시키고 클라이언트/서버 인프라스트럭처의 하나이다. RPC는 어플리케이션 개발자를 다양한 운영체제와 네트워크 인터 페이스로부터 고립시킴으로써(RPC를 사용할 때는 함수 호출이 프로그래머의 인터페이스이다) 다중 운영체제와 네트워크 프로토콜 사이 의 어플리케이션 개발의 복잡도를 감소시킨다.

원격 프로시저 호출은 흐름제어를 caller와 callee로 대체하는 클 라이언트/서버(예를 들어 쿼리-응답) 상호작용에 잘 맞는다. 개념적으로, 클라이언트와 서버는 양쪽 모두 동시에 실행되지 않는 다. 대신에, 실행 쓰레드는 caller에서 callee로 옮겨 지고 나중에 다시 본래대로 돌아간다.

원격 프로시저를 호출하는 동안 다음의 절차를 거친다.

1. 클라이언트는 client stub 프로시저를 호출하고, 보통 방법처럼 파라미터를 전달한다. client stub은 클라이언트 자신의 주소공간에 들어있다.

2. client stub은 파라미터를 메시지로 마샬링(marshalling)한다. 마샬링(marshalling)은 파라미터의 표현을 표준 포맷으로 변환하는 것을 포함하며, 각 파라미터를 메시지로 복사한다.

3. client stub은 메시지를 전송 계층으로 전달하며, 전송계층은 그것을 원격 서버에 보낸다.

4. 서버상에서, 전송 계층은 메시지를 server stub으로 넘겨주고, server stub은 파라미터를 언마샬링(unmarshalling)하고 원하는 서버 루틴을 일반적인 프로시저 호출 메커니즘을 통해 호출한다.

5. 서버 프로시저가 끝나면 server stub(일반 프로시저 호출 복귀를 통해)으로 복귀하고 server stub은 리턴 값을 메시지로 마샬링(marshalling)한다. server stub은 메시지를 전송계층으로 넘긴다.

6. 전송 계층은 결과 메시지를 클라이언트의 전송 계층으로 전송하고 클라이언트의 전송 계층은 다시 client stub으로 보낸다.

7. client stub은 결과 파라미터를 언마샬링(unmarshalling)하고 실행결과를 caller에게 넘겨준다.

** 참조 사이트

1. RPC에 관한 소개
http://www.sei.cmu.edu/str/descriptions/rpc_body.html

2. RFC1050-RPC:Remote Procedure Call Protocol Specification Version 2
http://www.cse.ohio-state.edu/cgi-bin/rfc/rfc1057.html

'UP! > Web Service' 카테고리의 다른 글

Internet Core Protocols (0)	2008.08.21
[토론4] 웹서비스와 시맨틱 웹 (0)	2008.08.21
RMI (0)	2008.08.21
UDP(User Datagram Protocol) (0)	2008.08.21
SCTP (0)	2008.08.21

Posted by 으랏차

이전 1 2 3 다음

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`