Chapter 9: Flow Control and Quality of Service

Ethernet QoS

Ethernet supports QoS via the Priority field in the header tag defined by the IEEE 802.1Q-2003 specification. Whereas the 802.1Q-2003 specification defines the header tag format, the IEEE 802.1D-2004 specification defines the procedures for setting the priority bits. Because the Priority field is 3 bits long, eight priority levels are supported. Currently, only seven traffic classes are considered necessary to provide adequate QoS. The seven traffic classes defined in the 802.1D-2004 specification include the following:

• Network control information

• Voice applications

• Video applications

• Controlled load applications

• Excellent effort applications

• Best effort applications

• Background applications

The 802.1D-2004 specification defines a recommended set of default mappings between the seven traffic classes and the eight Ethernet priority values. In Ethernet switches that support seven or more queues per port, each traffic class can be mapped into its own queue. However, many Ethernet switches support fewer than seven queues per port. So, the 802.1D-2004 specification also defines recommendations for traffic class groupings when traffic classes must share queues. These mappings and groupings are not mandated, but they promote interoperability between Ethernet devices so that end-to-end QoS can be implemented successfully in a plug-and-play manner (even in multi-vendor environments).
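
To make the 3-bit Priority field concrete, the following minimal sketch splits the 16-bit Tag Control Information value of an 802.1Q tag into its fields. The function name and the dictionary keys are illustrative assumptions; only the bit layout (3-bit priority, 1-bit CFI, 12-bit VLAN ID) comes from the tag format.

```python
def parse_vlan_tci(tci: int) -> dict:
    """Split a 16-bit 802.1Q Tag Control Information value into its fields.

    The name and return shape are assumptions made for this example.
    """
    if not 0 <= tci <= 0xFFFF:
        raise ValueError("TCI must fit in 16 bits")
    return {
        "priority": (tci >> 13) & 0x7,  # 3-bit Priority field, values 0-7
        "cfi": (tci >> 12) & 0x1,       # 1-bit Canonical Format Indicator
        "vlan_id": tci & 0x0FFF,        # 12-bit VLAN identifier
    }


if __name__ == "__main__":
    # 0xA064 = 101 0 000001100100 in binary
    print(parse_vlan_tci(0xA064))
```

Because the field is 3 bits wide, the extracted priority is always in the range 0 through 7, which is where the eight priority levels come from.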

Currently, no functionality is defined in the Ethernet specifications for the Pause Opcode to interact with the Priority field. So, the Pause Opcode affects all traffic classes simultaneously. In other words, an Ethernet switch that supports the Pause Opcode and multiple receive queues on a given port must send a Pause Opcode via that port (affecting all traffic classes) if any one of the queues fills. Otherwise, tail-drop occurs for the queue that filled. However, tail-drop can interact with the Priority field. Many Ethernet switches produced by Cisco Systems support advanced tail-drop, in which queuing thresholds can be set for each Ethernet priority level. Tail-drop then affects each traffic class independently. When a particular traffic class exceeds its queue threshold, frames matching that traffic class are dropped while the queue remains above the threshold defined for that traffic class. Other traffic classes are unaffected unless they also exceed their respective thresholds. The Pause Opcode needs to be sent only if all queues fill simultaneously. Alternately, the Pause Opcode may be sent only when one or more of the high-priority queues fill, thus avoiding tail-drop for high-priority traffic while permitting tail-drop to occur for lower-priority traffic. In this manner, the Pause Opcode can interact with the Priority field, but this functionality is proprietary and is not supported by all Ethernet switches. For more information about Ethernet QoS, readers are encouraged to consult the IEEE 802.1Q-2003 and 802.1D-2004 specifications.
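
The per-priority threshold idea can be sketched as a single shared queue in which each priority level has its own depth limit. This is an illustrative model only, not a description of any vendor's implementation; the class name, method names, and threshold values are assumptions.

```python
from collections import deque


class ThresholdQueue:
    """Shared FIFO queue with a separate tail-drop threshold per priority.

    Hypothetical model: a frame is dropped when the current queue depth has
    reached the threshold configured for that frame's priority.
    """

    def __init__(self, thresholds):
        self.thresholds = dict(thresholds)  # priority -> maximum depth
        self.frames = deque()
        self.dropped = 0

    def enqueue(self, priority, frame):
        if len(self.frames) >= self.thresholds[priority]:
            self.dropped += 1               # tail-drop for this class only
            return False
        self.frames.append((priority, frame))
        return True

    def dequeue(self):
        return self.frames.popleft() if self.frames else None


if __name__ == "__main__":
    q = ThresholdQueue({0: 2, 7: 4})        # low priority fills sooner
    for prio in (0, 0, 0, 7, 7, 7):
        print(prio, q.enqueue(prio, "frame"))
```

In this model the low-priority class begins to lose frames at a depth of 2, while the high-priority class continues to be accepted until the depth reaches 4.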

IP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by IP.

IP Flow Control

IP employs several flow-control mechanisms. Some are explicit, and others are implicit. All are reactive. The supported mechanisms include the following:

• Tail-drop

• Internet Control Message Protocol (ICMP) Source-Quench

• Active Queue Management (AQM)

• Explicit Congestion Notification (ECN)

Tail-drop is the historical mechanism for routers to control the rate of flows between end nodes. It often is implemented with a FIFO algorithm. When packets are dropped from the tail of a full queue, the end nodes detect the dropped packets via TCP mechanisms. TCP then reduces its window size, which precipitates a reduction in the rate of transmission. Thus, tail-drop constitutes implicit, reactive flow control.

ICMP Source-Quench messages can be used to explicitly convey a request to reduce the rate of transmission at the source. ICMP Source-Quench messages may be sent by any IP device in the end-to-end path. Conceptually, the ICMP Source-Quench mechanism operates in a manner similar to the Ethernet Pause Opcode. A router may choose to send an ICMP Source-Quench packet to a source node in response to a queue overrun. Alternately, a router may send an ICMP Source-Quench packet to a source node before a queue overruns, but this is not common. Despite the fact that ICMP Source-Quench packets can be sent before a queue overrun occurs, ICMP Source-Quench is considered a reactive mechanism because some indication of congestion or potential congestion must trigger the transmission of an ICMP Source-Quench message. Thus, additional packets can be transmitted by the source nodes while the ICMP Source-Quench packets are in transit, and tail-drop can occur even after "proactive" ICMP Source-Quench packets are sent. Upon receipt of an ICMP Source-Quench packet, the IP process within the source node must notify the appropriate Network Layer protocol or ULP. The notified Network Layer protocol or ULP is then responsible for slowing its rate of transmission. ICMP Source-Quench is a rudimentary mechanism, so few modern routers depend on ICMP Source-Quench messages as the primary means of avoiding tail-drop.

RFC 2309 defines the concept of AQM. Rather than merely dropping packets from the tail of a full queue, AQM employs algorithms that attempt to proactively avoid queue overruns by selectively dropping packets prior to queue overrun. The first such algorithm is called Random Early Detection (RED). More advanced versions of RED have since been developed. The best known are Weighted RED (WRED) and DiffServ Compliant WRED. All RED-based algorithms attempt to predict when congestion will occur and abate based on rising and falling queue level averages. As a queue level rises, so does the probability of packets being dropped by the AQM algorithm. The packets to be dropped are selected at random when using RED. WRED and DiffServ Compliant WRED consider the traffic class when deciding which packets to drop, which results in administrative control of the probability of packet drop. All RED-based algorithms constitute implicit flow control because the dropped packets must be detected via TCP mechanisms. Additionally, all RED-based algorithms constitute reactive flow control because some indication of potential congestion must trigger the packet drop. The proactive nature of packet drop as implemented by AQM algorithms should not be confused with proactive flow-control mechanisms that exchange buffer resource information before data transfer occurs, to completely avoid frame/packet drops. Note that in the most generic sense, sending an ICMP Source-Quench message before queue overrun occurs based on threshold settings could be considered a form of AQM. However, the most widely accepted definition of AQM does not include ICMP Source-Quench.
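
The relationship between the average queue level and the drop probability can be illustrated with a simplified RED-style calculation. This sketch assumes a moving average of the queue length and a drop probability that rises linearly between a minimum and a maximum threshold; the parameter names and default values are assumptions, and refinements found in real implementations (such as spacing drops by the number of packets since the last drop) are omitted.

```python
import random


class RedQueue:
    """Simplified RED-style drop decision (illustrative, not a full RED)."""

    def __init__(self, min_th, max_th, max_p=0.1, weight=0.002, rng=None):
        if not 0 <= min_th < max_th:
            raise ValueError("need 0 <= min_th < max_th")
        self.min_th = min_th      # below this average, never drop
        self.max_th = max_th      # at or above this average, always drop
        self.max_p = max_p        # drop probability just below max_th
        self.weight = weight      # weight given to the newest sample
        self.avg = 0.0
        self.rng = rng or random.Random()

    def update_average(self, queue_len):
        # Exponentially weighted moving average of the queue length.
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        return self.avg

    def drop_probability(self):
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        span = self.max_th - self.min_th
        return self.max_p * (self.avg - self.min_th) / span

    def should_drop(self, queue_len):
        self.update_average(queue_len)
        return self.rng.random() < self.drop_probability()


if __name__ == "__main__":
    red = RedQueue(min_th=5, max_th=15, weight=0.5)
    for depth in (2, 8, 12, 20, 20, 20):
        red.update_average(depth)
        print(depth, round(red.avg, 2), round(red.drop_probability(), 3))
```

A weighted variant in the spirit of WRED could keep a separate set of thresholds per traffic class, so that lower-priority classes reach a nonzero drop probability at a lower average queue level.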

ECN is another method of implementing AQM. ECN enables routers to convey congestion information to end nodes explicitly by marking packets with a congestion indicator rather than by dropping packets. When congestion is experienced by a packet in transit, the congested router sets the two ECN bits to 11. The destination node then notifies the source node (see the TCP Flow Control section of this chapter). When the source node receives notification, the rate of transmission is slowed. However, ECN works only if the Transport Layer protocol supports ECN. TCP supports ECN, but many TCP implementations do not yet implement ECN. For more information about IP flow control, readers are encouraged to consult IETF RFCs 791, 792, 896, 1122, 1180, 1812, 2309, 2914, and 3168.
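
The marking step can be sketched as follows. The sketch assumes the two ECN bits are the low-order bits of the byte that also carries the DSCP (per RFC 3168) and that a router marks only packets whose sender has indicated ECN support; the function and constant names are illustrative.

```python
# ECN codepoints carried in the two low-order bits of the former TOS byte.
NOT_ECT = 0b00   # sender does not support ECN
ECT_1 = 0b01     # ECN-capable transport
ECT_0 = 0b10     # ECN-capable transport
CE = 0b11        # congestion experienced


def ecn_bits(tos_byte):
    """Return the 2-bit ECN codepoint from the TOS/DS byte."""
    return tos_byte & 0b11


def mark_congestion(tos_byte):
    """Return (new_tos_byte, marked).

    If the packet is not ECN-capable, the byte is returned unchanged and
    marked is False, meaning the router has to fall back to dropping.
    """
    if ecn_bits(tos_byte) == NOT_ECT:
        return tos_byte, False
    return tos_byte | CE, True   # sets both ECN bits, leaves DSCP intact


if __name__ == "__main__":
    print(mark_congestion(0b10))   # ECN-capable packet gets marked
    print(mark_congestion(0b00))   # non-ECN packet is left unchanged
```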

IP QoS

IP QoS is a robust topic that defies precise summarization. That said, we can categorize all IP QoS models into one of two very general categories: stateful and stateless. Currently, the dominant stateful model is the Integrated Services Architecture (IntServ), and the dominant stateless model is the Differentiated Services Architecture (DiffServ).

The IntServ model is characterized by application-based signaling that conveys a request for flow admission to the network. The signaling is typically accomplished via the Resource Reservation Protocol (RSVP). The network either accepts the request and admits the new flow or rejects the request. If the flow is admitted, the network guarantees the requested service level end-to-end for the duration of the flow. This requires state to be maintained for each flow at each router in the end-to-end path. If the flow is rejected, the application may transmit data, but the network does not provide any service guarantees. This is known as best-effort service. It is currently the default service offered by the Internet. With best-effort service, the level of service rendered varies as the cumulative load on the network varies.
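
The accept-or-reject decision and the per-flow state it requires can be sketched for a single router as follows. This is a toy model, not RSVP: the class name, the bandwidth-only admission test, and the flow identifiers are all assumptions made for illustration.

```python
class AdmissionController:
    """Toy per-router admission control with per-flow reservation state."""

    def __init__(self, capacity_bps):
        self.capacity_bps = capacity_bps
        self.reservations = {}            # flow_id -> reserved rate (bps)

    def request(self, flow_id, rate_bps):
        """Admit the flow only if its rate fits in the unreserved capacity."""
        reserved = sum(self.reservations.values())
        if flow_id in self.reservations:
            return False                  # already admitted in this sketch
        if reserved + rate_bps > self.capacity_bps:
            return False                  # rejected: flow gets best effort
        self.reservations[flow_id] = rate_bps
        return True

    def release(self, flow_id):
        """Discard the state for a flow that has ended."""
        self.reservations.pop(flow_id, None)


if __name__ == "__main__":
    router = AdmissionController(capacity_bps=100)
    print(router.request("a", 60))   # fits
    print(router.request("b", 50))   # would exceed capacity
```

The dictionary of reservations is the per-flow state that every router in the path must hold, which is the main scaling cost of the stateful model.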

The DiffServ model does not require any signaling from the application prior to data transmission. Instead, the application "marks" each packet via the Differentiated Services Codepoint (DSCP) field to indicate the desired service level. The first router to receive each packet (typically the end node's default gateway) conditions the flow to comply with the traffic profile associated with the requested DSCP value. Such routers are called conditioners. Each router (also called a hop) in the end-to-end path then forwards each packet according to Per Hop Behavior (PHB) rules associated with each DSCP value. The conditioners decouple the applications from the mechanism that controls the cumulative load placed on the network, so the cumulative load can exceed the network's cumulative capacity. When this happens, packets may be dropped in accordance with PHB rules, and the affected end nodes must detect such drops (usually via TCP but sometimes via ICMP Source-Quench). In other words, the DiffServ model devolves into best-effort service for some flows when the network capacity is exceeded along a given path.
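
A per-hop lookup can be sketched as below. The sketch assumes the DSCP occupies the six high-order bits of the DS byte and uses a deliberately tiny table; the PHB labels and the choice to fall back to the default behavior for unlisted codepoints are illustrative assumptions, not a complete policy.

```python
def dscp_from_ds_byte(ds_byte):
    """Extract the 6-bit DSCP from the 8-bit DS byte of the IP header."""
    return (ds_byte >> 2) & 0x3F


# Minimal illustrative table: DSCP value -> per-hop behavior label.
DEFAULT_PHB = "default (best effort)"
PHB_BY_DSCP = {
    0: DEFAULT_PHB,
    10: "assured forwarding AF11",
    46: "expedited forwarding",
}


def select_phb(ds_byte):
    """Return the PHB label for a packet, defaulting to best effort."""
    return PHB_BY_DSCP.get(dscp_from_ds_byte(ds_byte), DEFAULT_PHB)


if __name__ == "__main__":
    for ds in (0xB8, 0x28, 0x00, 0xFC):
        print(hex(ds), dscp_from_ds_byte(ds), select_phb(ds))
```

No per-flow state appears anywhere in the lookup, which is what makes the model stateless at each hop.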

Both of these QoS models have strengths and weaknesses. At first glance, the two models would seem to be incompatible. However, the two models can interwork, and various RFCs have been published detailing how such interworking may be accomplished. For more information about IP QoS, readers are encouraged to consult IETF RFCs 791, 1122, 1633, 1812, 2205, 2430, 2474, 2475, 2815, 2873, 2963, 2990, 2998, 3086, 3140, 3260, 3644, and 4094.

TCP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by TCP.

TCP Flow Control

TCP flow control is a robust topic that defies precise summarization. TCP implements many flow-control algorithms, and many augmentations have been made to those algorithms over the years. That said, the primary TCP flow-control algorithms include slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms control the behavior of TCP following initial connection establishment in an effort to avoid congestion and packet loss, and during periods of congestion and packet loss in an effort to reduce further congestion and packet loss.

As previously discussed, the TCP sliding window is the ever-changing receive buffer size that is advertised to a peer TCP node. The most recently advertised value is called the receiver window (RWND). The RWND is complemented by the Congestion Window (CWND), which is a state variable within each TCP node that controls the amount of data that may be transmitted. When congestion is detected in the network, TCP reacts by reducing its rate of transmission. Specifically, the transmitting node reduces its CWND. At any point in time, a TCP node may transmit data up to the Sequence Number that is equal to the lesser of the peer's RWND plus the highest acknowledged Sequence Number or the CWND plus the highest acknowledged Sequence Number. If no congestion is experienced, the RWND value is used. If congestion is experienced, the CWND value is used. Congestion can be detected implicitly via TCP's acknowledgement mechanisms or timeout mechanisms (as applies to dropped packets) or explicitly via ICMP Source-Quench messages or the ECE bit in the TCP header.
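
The transmit limit described above reduces to taking the smaller of the two windows and adding it to the highest acknowledged Sequence Number. The following sketch shows that arithmetic, including wraparound of the 32-bit sequence space; the function and variable names are illustrative.

```python
SEQ_MODULUS = 2 ** 32   # TCP Sequence Numbers are 32 bits and wrap around


def send_limit(highest_acked, rwnd, cwnd):
    """Highest Sequence Number that may be transmitted up to."""
    return (highest_acked + min(rwnd, cwnd)) % SEQ_MODULUS


def bytes_sendable(highest_acked, next_seq, rwnd, cwnd):
    """Bytes that may still be sent, given what is already in flight."""
    in_flight = (next_seq - highest_acked) % SEQ_MODULUS
    return max(0, min(rwnd, cwnd) - in_flight)


if __name__ == "__main__":
    # Congested case: CWND (4380) is smaller than RWND (65535).
    print(send_limit(1000, rwnd=65535, cwnd=4380))
    print(bytes_sendable(1000, next_seq=3000, rwnd=65535, cwnd=4380))
```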

When ECN is implemented, TCP nodes convey their support for ECN by setting the two ECN bits in the IP header to 10 or 01. A router may then change these bits to 11 when congestion occurs. Upon receipt, the destination node recognizes that congestion was experienced. The destination node then notifies the source node by setting to 1 the ECE bit in the TCP header of the next transmitted packet. Upon receipt, the source node reduces its CWND and sets the CWR bit to 1 in the TCP header of the next transmitted packet. Thus, the destination TCP node is explicitly notified that the rate of transmission has been reduced. For more information about TCP flow control, readers are encouraged to consult IETF RFCs 792, 793, 896, 1122, 1180, 1323, 1812, 2309, 2525, 2581, 2914, 3042, 3155, 3168, 3390, 3448, 3782, and 4015.
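
The sender's side of this exchange can be sketched as a small state machine. The flag values are the ECE and CWR bit positions in the TCP header flags byte; halving the CWND and never letting it fall below one segment are assumptions chosen for illustration, since the text above says only that the CWND is reduced.

```python
TCP_FLAG_ACK = 0x10
TCP_FLAG_ECE = 0x40   # ECN-Echo: receiver saw a congestion mark
TCP_FLAG_CWR = 0x80   # Congestion Window Reduced: sender has reacted


class EcnSender:
    """Illustrative sender reaction to the ECE bit (not a full TCP)."""

    def __init__(self, cwnd, mss):
        self.cwnd = cwnd
        self.mss = mss
        self.cwr_pending = False

    def on_segment_received(self, flags):
        # React once per notification until CWR has been sent.
        if flags & TCP_FLAG_ECE and not self.cwr_pending:
            self.cwnd = max(self.cwnd // 2, self.mss)  # assumed reduction
            self.cwr_pending = True

    def flags_for_next_segment(self, flags=TCP_FLAG_ACK):
        # Set CWR on the next transmitted segment, then clear the reminder.
        if self.cwr_pending:
            flags |= TCP_FLAG_CWR
            self.cwr_pending = False
        return flags


if __name__ == "__main__":
    sender = EcnSender(cwnd=8000, mss=1000)
    sender.on_segment_received(TCP_FLAG_ACK | TCP_FLAG_ECE)
    print(sender.cwnd, hex(sender.flags_for_next_segment()))
```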

TCP QoS

TCP interacts with the QoS mechanisms implemented by IP. Additionally, TCP provides two explicit QoS mechanisms of its own: the Urgent and Push flags in the TCP header. The Urgent flag indicates whether the Urgent Pointer field is valid. When valid, the Urgent Pointer field indicates the location of the last byte of urgent data in the packet's Data field. The Urgent Pointer field is expressed as an offset from the Sequence Number in the TCP header. No indication is provided for the location of the first byte of urgent data. Likewise, no guidance is provided regarding what constitutes urgent data. A ULP or application decides when to mark data as urgent. The receiving TCP node is not required to take any particular action upon receipt of urgent data, but the general expectation is that some effort will be made to process the urgent data sooner than otherwise would occur if the data were not marked urgent.

As previously discussed, TCP decides when to transmit data received from a ULP. However, a ULP occasionally needs to be sure that data submitted to the source node's TCP byte stream has actually been sent to the destination. This can be accomplished via the push function. A ULP informs TCP that all data previously submitted needs to be "pushed" to the destination ULP by requesting (via the TCP service provider interface) the push function. This causes TCP in the source node to immediately transmit all data in the byte stream and to set the Push flag to 1 in the final packet. Upon receiving a packet with the Push flag set to 1, TCP in the destination node immediately forwards all data in the byte stream to the required ULPs (subject to the rules for in-order delivery based on the Sequence Number field). For more information about TCP QoS, readers are encouraged to consult IETF RFCs 793 and 1122.

iSCSI Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by iSCSI.

iSCSI Flow Control

The primary flow-control mechanism employed by iSCSI is the Ready To Transfer (R2T) Protocol Data Unit (PDU). iSCSI targets use the R2T PDU to control the flow of SCSI data during write commands. The Desired Data Transfer Length field in the R2T PDU header controls how much data may be transferred per Data-Out PDU sequence. The R2T PDU is complemented by several other mechanisms. The MaxOutstandingR2T text key controls how many R2T PDUs may be outstanding simultaneously. The use of implicit R2T PDUs (unsolicited data) is negotiated via the InitialR2T and ImmediateData text keys. When unsolicited data is supported, the FirstBurstLength text key controls how much data may be transferred in or with the SCSI Command PDU, thus performing an equivalent function to the Desired Data Transfer Length field. The MaxRecvDataSegmentLength text key controls how much data may be transferred in a single Data-Out or Data-In PDU. The MaxBurstLength text key controls how much data may be transferred in a single PDU sequence (solicited or unsolicited). Thus, the FirstBurstLength value must be equal to or less than the MaxBurstLength value. The MaxConnections text key controls how many TCP connections may be aggregated into a single iSCSI session, thus controlling the aggregate TCP window size available to a session. The MaxCmdSN field in the Login Response BHS and SCSI Response BHS controls how many SCSI commands may be outstanding simultaneously. For more information about iSCSI flow control, readers are encouraged to consult IETF RFC 3720.
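
The interplay of these limits can be illustrated by planning how a single write command would be carved into bursts and PDUs. This is a simplified sketch under stated assumptions: unsolicited data is treated as one first burst, every R2T requests as much as MaxBurstLength allows, and MaxOutstandingR2T is ignored. The function name and the tuple layout are hypothetical.

```python
def plan_write(total_len, first_burst_len, max_burst_len,
               max_recv_dsl, unsolicited=True):
    """Return a list of (kind, burst_bytes, pdu_count) for one write.

    kind is "unsolicited" for the first burst sent without an R2T, or
    "r2t" for a burst solicited by one R2T PDU.
    """
    if first_burst_len > max_burst_len:
        raise ValueError("FirstBurstLength must not exceed MaxBurstLength")

    def pdus(nbytes):
        # Ceiling division: each PDU carries at most max_recv_dsl bytes.
        return -(-nbytes // max_recv_dsl)

    plan = []
    remaining = total_len
    if unsolicited and remaining > 0:
        burst = min(remaining, first_burst_len)
        plan.append(("unsolicited", burst, pdus(burst)))
        remaining -= burst
    while remaining > 0:
        burst = min(remaining, max_burst_len)   # one R2T per burst
        plan.append(("r2t", burst, pdus(burst)))
        remaining -= burst
    return plan


if __name__ == "__main__":
    for step in plan_write(600000, first_burst_len=65536,
                           max_burst_len=262144, max_recv_dsl=8192):
        print(step)
```

With the example values shown, a 600,000-byte write is planned as one unsolicited burst followed by three solicited bursts, the last of which carries the remainder.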

iSCSI QoS

iSCSI depends primarily on lower-layer protocols to provide QoS. However, iSCSI provides support for expedited command processing via the I bit in the BHS of the Login Request PDU, the SCSI Command PDU, and the TMF Request PDU. For more information about iSCSI QoS, readers are encouraged to consult IETF RFC 3720.

FC Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by FC.

FC Flow Control

FC switches produced by Cisco Systems also support a proprietary flow control feature called FC Congestion Control (FCC). Conceptually, FCC mimics the behavior of ICMP Source-Quench. When a port becomes congested, FCC signals the switch to which the source node is connected. The source switch then artificially slows the rate at which BB_Credits are transmitted to the source N_Port. Cisco Systems might submit FCC to ANSI for inclusion in a future FC standard.

FC QoS

FC supports several QoS mechanisms via fields in the FC header. The DSCP subfield in the CS_CTL/Priority field can be used to implement differentiated services similar to the IP DiffServ model. However, the FC-FS-2 specification currently reserves all values other than zero, which is assigned to best-effort service. The Preference subfield in the CS_CTL/Priority field can be used to implement a simple two-level priority system. The FC-FS-2 specification requires all Class 3 devices to support the Preference subfield. No requirement exists for every frame within a sequence or Exchange to have the same preference value. So, it is theoretically possible for frames to be delivered out of order based on inconsistent values in the Preference fields of frames within a sequence or Exchange. However, this scenario is not likely to occur because all FC Host Bus Adapter (HBA) vendors recognize the danger in such behavior. The Priority subfield in the CS_CTL/Priority field can be used to implement a multi-level priority system. Again, no requirement exists for every frame within a sequence or Exchange to have the same priority value, so out-of-order frame delivery is theoretically possible (though improbable). The Preemption subfield in the CS_CTL/Priority field can be used to preempt a Class 1 or Class 6 connection to allow Class 3 frames to be forwarded. No modern FC switches support Class 1 or Class 6 traffic, so the Preemption field is never used. For more information about FC QoS, readers are encouraged to consult the ANSI T11 FC-FS-2 specification.

FCP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by FCP.

FCP Flow Control

The primary flow-control mechanism employed by FCP is the FCP_XFER_RDY IU. FCP targets use the FCP_XFER_RDY IU to control the flow of SCSI data during write commands. The FCP_BURST_LEN field in the FCP_XFER_RDY IU header controls how much data may be transferred per FCP_DATA IU. The FCP_XFER_RDY IU is complemented by a variety of other mechanisms. The Class 3 Service Parameters field in the PLOGI ELS header determines how many FCP_XFER_RDY IUs may be outstanding simultaneously. This is negotiated indirectly via the maximum number of concurrent sequences within each Exchange. The use of implicit FCP_XFER_RDY IUs (unsolicited data) is negotiated via the WRITE FCP_XFER_RDY DISABLED field in the PRLI Service Parameter Page.

When unsolicited data is supported, the First Burst Size parameter in the SCSI Disconnect-Reconnect mode page controls how much data may be transferred in the unsolicited FCP_DATA IU, thus performing an equivalent function to the FCP_BURST_LEN field. The Maximum Burst Size parameter in the SCSI Disconnect-Reconnect mode page controls how much data may be transferred in a single FCP_DATA IU (solicited or unsolicited). Thus, the First Burst Size value must be equal to or less than the Maximum Burst Size value. FCP does not support negotiation of the maximum number of SCSI commands that may be outstanding simultaneously because the architectural limit imposed by the size of the CRN field in the FCP_CMND IU header is 255 (versus 4,294,967,296 for iSCSI). For more information about FCP flow control, readers are encouraged to consult the ANSI T10 FCP-3 and ANSI T11 FC-LS specifications.

FCP QoS

FCP depends primarily on lower-layer protocols to provide QoS. However, FCP provides support for expedited command processing via the Priority field in the FCP_CMND IU header. For more information about FCP QoS, readers are encouraged to consult the ANSI T10 FCP-3 specification.

FCIP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by FCIP.

FCIP Flow Control

FCIP does not provide any flow-control mechanisms of its own. The only FCIP flow-control functionality of note is the mapping function between FC and TCP/IP flow-control mechanisms. FCIP vendors have implemented various proprietary features to augment FCIP performance. Most notable are the FCP_XFER_RDY IU spoofing techniques. In some cases, even the FCP_RSP IU is spoofed. For more information about FCIP flow control, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-3 specification.

FCIP QoS

FCIP does not provide any QoS mechanisms of its own. However, RFC 3821 requires the FC Entity to specify the IP QoS characteristics of each new TCP connection to the FCIP Entity at the time that the TCP connection is requested. In doing so, no requirement exists for the FC Entity to map FC QoS mechanisms to IP QoS mechanisms. This may be optionally accomplished by mapping the value of the Preference subfield or the Priority subfield in the CS_CTL/Priority field of the FC header to an IntServ/RSVP request or a DiffServ DSCP value. FCIP links are not established dynamically in response to received FC frames, so the FC Entity needs to anticipate the required service levels prior to FC frame reception. One method to accommodate all possible FC QoS values is to establish one TCP connection for each of the seven traffic classes identified by the IEEE 802.1D-2004 specification. The TCP connections can be aggregated into one or more FCIP links, or each TCP connection can be associated with an individual FCIP link. The subsequent mapping of FC QoS values onto the seven TCP connections could then be undertaken in a proprietary manner. Many other techniques exist, and all are proprietary. For more information about FCIP QoS, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-3 specification.

Summary

This chapter reviews the flow-control and QoS mechanisms supported by Ethernet, IP, TCP, iSCSI, FC, FCP, and FCIP. As such, this chapter provides insight into network performance optimization. Application performance optimization requires attention to the flow-control and QoS mechanisms at each OSI Layer within each protocol stack.

Review Questions

1. What is the primary function of all flow-control mechanisms?

2. What are the two categories of QoS algorithms?

3. What is the name of the queue management algorithm historically associated with tail-drop?

4. Which specification defines traffic classes, class groupings, and class-priority mappings for Ethernet?

5. What is the name of the first algorithm used for AQM in IP networks?

6. What are the names of the two dominant QoS models used in IP networks today?

7. What is the name of the TCP state variable that controls the amount of data that may be transmitted?

8. What is the primary flow-control mechanism employed by iSCSI?

9. What are the names of the two QoS subfields currently available for use in FC-SANs?

10. What is the primary flow-control mechanism employed by FCP?

11. Are FCIP devices required to map FC QoS mechanisms to IP QoS mechanisms?