by James Long

Chapter 9: Flow Control and Quality of Service


Cisco Press


Upon completing this chapter, you will be able to:

  • List all of the flow control and QoS mechanisms related to modern storage networks

  • Describe the general characteristics of each of the flow control and QoS mechanisms related to modern storage networks

The chapters in Part I, “The Storage Networking Landscape,” and Part II, “The OSI Layers,” introduce the flow control and QoS mechanisms used in modern storage networks. Building upon those chapters, this chapter provides a comprehensive inventory of the flow control and QoS mechanisms used by Ethernet, IP, TCP, Internet SCSI (iSCSI), Fibre Channel (FC), Fibre Channel Protocol (FCP), and Fibre Channel over TCP/IP (FCIP). Readers are encouraged to review the flow control and QoS discussion at the beginning of Chapter 5, “The OSI Physical and Data Link Layers,” before reading this chapter. Additionally, readers are encouraged to review the frame/packet format descriptions and delivery mechanism discussions in the chapters in Part II, “The OSI Layers,” before reading this chapter. Finally, readers are encouraged to review the data transfer optimization discussions in Chapter 8, “The OSI Session, Presentation, and Application Layers,” before reading this chapter.

Conceptual Underpinnings of Flow Control and Quality of Service

To fully understand the purpose and operation of flow-control and QoS mechanisms, readers first need to understand several related concepts. These include the following:

  • The principle of operation for half-duplex upper layer protocols (ULPs) over full-duplex network protocols

  • The difference between half-duplex timing mechanisms and flow-control mechanisms

  • The difference between flow control and Quality of Service (QoS)

  • The difference between the two types of QoS algorithms

  • The relationship of delivery acknowledgement to flow control

  • The relationship of processing delay to flow control

  • The relationship of network latency to flow control

  • The relationship of retransmission to flow control

  • The factors that contribute to end-to-end latency

As previously mentioned, SCSI is a half-duplex command/response protocol. For any given I/O operation, either the initiator or the target may transmit at a given point in time. The SCSI communication model does not permit simultaneous transmission by both initiator and target within the context of a single I/O operation. However, SCSI supports full-duplex communication across multiple I/O operations. For example, an initiator may have multiple I/O operations outstanding simultaneously with a given target and may be transmitting in some of those I/O operations while receiving in others. This has the effect of increasing the aggregate throughput between each initiator/target pair. For this to occur, the end-to-end network path between each initiator/target pair must support full-duplex communication at all layers of the OSI model.

Readers should be careful not to confuse half-duplex signaling mechanisms with flow-control mechanisms. Communicating FCP devices use the Sequence Initiative bit in the FC Header to signal which device may transmit at any given point in time. Similarly, iSCSI devices use the F bit in the iSCSI Basic Header Segment (BHS) to signal which device may transmit during bidirectional commands. (iSCSI does not explicitly signal which device may transmit during unidirectional commands.) These mechanisms do not restrict the flow of data. They merely control the timing of data transmissions relative to one another.

Flow control and QoS are closely related mechanisms that complement each other to improve the efficiency of networks and the performance of applications. Flow control is concerned with pacing the rate at which frames or packets are transmitted. The ultimate goal of all flow-control mechanisms is to avoid receive buffer overruns, which improves the reliability of the delivery subsystem. By contrast, QoS is concerned with the treatment of frames or packets after they are received by a network device or end node. When congestion occurs on an egress port in a network device, frames or packets that need to be transmitted on that port must be queued until bandwidth is available. While those frames or packets are waiting in queue, other frames or packets may enter the network device and be queued on the same egress port. QoS policies enable the use of multiple queues per port and determine the order in which the queues are serviced when bandwidth becomes available. Without QoS policies, frames or packets within a queue must be transmitted according to a simple algorithm such as First In First Out (FIFO) or Last In First Out (LIFO). QoS mechanisms enable network administrators to define advanced policies for the transmission order of frames or packets. QoS policies affect both the latency and the throughput experienced by a frame or packet. The QoS concept also applies to frames or packets queued within an end node. Within an end node, QoS policies determine the order in which queued frames or packets are processed when CPU cycles and other processing resources become available.

All QoS algorithms fall into one of two categories: queue management and queue scheduling. Queue management algorithms are responsible for managing the number of frames or packets in a queue. Generally speaking, a frame or packet is not subject to being dropped after being admitted to a queue. Thus, queue management algorithms primarily deal with queue admission policies. By contrast, queue scheduling algorithms are responsible for selecting the next frame or packet to be transmitted from a queue. Thus, queue scheduling algorithms primarily deal with bandwidth allocation.

End-to-end flow control is closely related to delivery acknowledgement. To understand this, consider the following scenario. Device A advertises 10 available buffers to device B. Device B then transmits 10 packets to device A, but all 10 packets are transparently dropped in the network. Device B cannot transmit any more packets until device A advertises that it has free buffers. However, device A does not know it needs to send another buffer advertisement to device B. The result is a deadlock condition preventing device B from transmitting additional frames or packets to device A. If the network notifies device B of the drops, device B can increment its transmit buffers for device A. However, notification of the drops constitutes negative acknowledgement. Device A could send a data packet to device B containing in the header an indication that 10 buffers are available in device A. Although this does not constitute an acknowledgement that the 10 packets transmitted by device B were received and processed by device A, it does provide an indication that device B may transmit additional packets to device A. If device B assumes that the first 10 packets were delivered to device A, the result is an unreliable delivery subsystem (similar to UDP/IP and FC Class 3). If device B does not assume anything, the deadlock condition persists. Other contingencies exist, and in all cases, either a deadlock condition or an unreliable delivery subsystem is the result. Because the goal of flow control is to avoid packet drops due to buffer overrun, little motivation exists for implementing end-to-end flow control on unreliable delivery subsystems. So, end-to-end flow control is usually implemented only on reliable delivery subsystems. Additionally, end-to-end flow-control signaling is often integrated with the delivery acknowledgement mechanism.
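
The deadlock scenario can be illustrated with a minimal simulation. The sketch below models only the buffer-advertisement exchange described above; the class and variable names are illustrative and do not correspond to any particular protocol.

```python
# Minimal sketch of the deadlock described above: a transmitter that relies
# solely on buffer advertisements stalls when its packets are silently dropped
# and no acknowledgement (positive or negative) is ever returned.

class Transmitter:
    def __init__(self):
        self.credits = 0              # buffers advertised by the peer, not yet consumed

    def receive_advertisement(self, buffers):
        self.credits = buffers        # peer (device A) advertises free buffers

    def send_all(self):
        sent = 0
        while self.credits > 0:
            self.credits -= 1         # one advertised buffer consumed per packet
            sent += 1
        # If the network silently drops every packet and device A never learns
        # that it should re-advertise, self.credits stays at zero forever.
        return sent

b = Transmitter()
b.receive_advertisement(10)           # device A advertises 10 buffers
print(b.send_all())                   # 10 packets transmitted (and silently dropped)
print(b.send_all())                   # 0 -- device B is deadlocked awaiting a new advertisement
```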

End-to-end flow control is also closely tied to frame/packet processing within the receiving node. When a node receives a frame or packet, the frame or packet consumes a receive buffer until the node processes the frame or packet or copies it to another buffer for subsequent processing. The receiving node cannot acknowledge receipt of the frame or packet until the frame or packet has been processed or copied to a different buffer because acknowledgement increases the transmitting node’s transmit window (TCP) or EE_Credit counter (FC). In other words, frame/packet acknowledgement implies that the frame or packet being acknowledged has been processed. Thus, processing delays within the receiving node negatively affect throughput in the same manner as network latency. For the effect on throughput to be negated, receive buffer resources must increase within the receiving node as processing delay increases. Another potential impact is the unnecessary retransmission of frames or packets if the transmitter’s retransmission timer expires before acknowledgement occurs.

Both reactive and proactive flow-control mechanisms are sensitive to network latency. An increase in network latency potentially yields an increase in dropped frames when using reactive flow control. This is because congestion must occur before the receiver signals the transmitter to stop transmitting. While the pause signal is in flight, any frames or packets already in flight, and any additional frames or packets transmitted before reception of the pause signal, are at risk of overrunning the receiver’s buffers. As network latency increases, the number of frames or packets at risk also increases. Proactive flow control precludes this scenario, but latency is still an issue. An increase in network latency yields an increase in buffer requirements or a decrease in throughput. Because all devices have finite memory resources, degraded throughput is inevitable if network latency continues to increase over time. Few devices support dynamic reallocation of memory to or from the receive buffer pool based on real-time fluctuations in network latency (called jitter), so the maximum expected RTT, including jitter, must be used to calculate the buffer requirements to sustain optimal throughput. More buffers increase equipment cost. So, more network latency and more jitter results in higher equipment cost if optimal throughput is to be sustained.
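
The buffer requirement implied by proactive flow control can be approximated with a bandwidth-delay calculation. The sketch below assumes a fixed link rate, a maximum expected RTT (including jitter), and a fixed frame size; the example figures are illustrative only.

```python
import math

def receive_buffers_required(link_rate_bps, max_rtt_seconds, frame_size_bytes):
    """Approximate receive buffers needed to sustain optimal throughput with
    proactive flow control: enough frames must be creditable to cover the
    maximum expected round trip."""
    bytes_in_flight = (link_rate_bps / 8) * max_rtt_seconds   # bandwidth-delay product
    return math.ceil(bytes_in_flight / frame_size_bytes)

# Example: 4-Gbps link, 10-ms maximum RTT including jitter, 2112-byte frames
print(receive_buffers_required(4_000_000_000, 0.010, 2112))   # 2368 buffers
```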

Support for retransmission also increases equipment cost. Aside from the research and development (R&D) cost associated with the more advanced software, devices that support retransmission must buffer transmitted frames or packets until they are acknowledged by the receiving device. This is advantageous because it avoids reliance on ULPs to detect and retransmit dropped frames or packets. However, the transmit buffer either consumes memory resources that would otherwise be available to the receive buffer (thus affecting flow control and degrading throughput) or increases the total memory requirement of a device. The latter is often the design choice made by device vendors, which increases equipment cost.

The factors that contribute to end-to-end latency include transmission delay, serialization delay, propagation delay, and processing delay. Transmission delay is the amount of time that a frame or packet must wait in a queue before being serialized onto a wire. QoS policies affect transmission delay. Serialization delay is the amount of time required to clock all the bits of a frame or packet onto the wire. Frames or packets must be transmitted one bit at a time when using serial communication technologies. Thus, frame/packet size and bandwidth determine serialization delay. Propagation delay is the time required for a bit to propagate from the transmitting port to the receiving port. A signal propagates through optical fiber at roughly 5 microseconds per kilometer. Processing delay includes, but is not limited to, the time required to:

  • Classify a frame or a packet according to QoS policies

  • Copy a frame or a packet into the correct queue

  • Match the configured policies for security and routing against a frame or a packet and take the necessary actions

  • Encrypt or decrypt a frame or a packet

  • Compress or decompress a frame or a packet

  • Perform accounting functions such as updating port statistics

  • Verify that a frame or a packet has a valid CRC/checksum

  • Make a forwarding decision

  • Forward a frame or a packet from the ingress port to the egress port

The order and duration of these processing steps vary with the architecture of the network device, its configuration, and which steps are actually taken.
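
A back-of-the-envelope calculation of these components is sketched below. It assumes a 1500-byte frame, a 1-Gbps link, and 100 km of fiber at the 5-microseconds-per-kilometer figure given above; queuing (transmission) delay and processing delay are left as inputs because they vary with load and device architecture.

```python
def end_to_end_latency_us(frame_bytes, link_bps, distance_km,
                          queuing_delay_us=0.0, processing_delay_us=0.0):
    serialization_us = (frame_bytes * 8) / link_bps * 1_000_000  # time to clock the bits onto the wire
    propagation_us = distance_km * 5.0                           # ~5 microseconds per km of fiber
    return queuing_delay_us + serialization_us + propagation_us + processing_delay_us

# 1500-byte frame, 1-Gbps link, 100 km of fiber, no queuing or processing delay:
# 12 us of serialization delay plus 500 us of propagation delay
print(end_to_end_latency_us(1500, 1_000_000_000, 100))   # 512.0
```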

Ethernet Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by Ethernet.

Ethernet Flow Control

As discussed in Chapter 5, “The OSI Physical and Data Link Layers,” Ethernet supports reactive flow control via the Pause Operation Code (Pause Opcode). All 10-Gbps Ethernet implementations inherently support flow control and do not need to negotiate its use. 1000BASE-X negotiates flow control using the Pause bits in the Configuration ordered sets. Twisted-pair-based Ethernet implementations use the Technology Ability field to negotiate flow control. Except for 10-Gbps Ethernet implementations, three options may be negotiated: symmetric, asymmetric, or none. Symmetric indicates that the device is capable of both transmitting and receiving the Pause Opcode. Asymmetric indicates that the device is capable of either receiving or transmitting the Pause Opcode. None indicates that the Pause Opcode is not supported. All 10-Gbps Ethernet implementations support symmetric operation. A Pause Opcode may be sent before a queue overrun occurs, but many Ethernet switches do not behave in this manner.

Ethernet switches often employ “tail-drop” to manage flows. Tail-drop is not a mechanism per se, but rather a behavior. Tail-drop is the name given to the process of dropping packets that need to be queued in a queue that is already full. In other words, when a receive queue fills, additional frames received while the queue is full must be dropped from the “tail” of the queue. ULPs are expected to detect the dropped frames, reduce the rate of transmission, and retransmit the dropped frames. Tail-drop and the Pause Opcode often are used in concert. For example, when a receive queue fills, a Pause Opcode may be sent to stem the flow of new frames. If additional frames are received after the Pause Opcode is sent and while the receive queue is still full, those frames are dropped. For more information about Ethernet flow control, readers are encouraged to consult the IEEE 802.3-2002 specification.
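
The interplay between tail-drop and the Pause Opcode can be modeled in a few lines. The sketch below is a simplified illustration, not the behavior of any particular switch; the queue depth is arbitrary.

```python
# Simplified model of a receive queue that sends a Pause Opcode when it fills
# and tail-drops any frames that arrive while it remains full.

class ReceiveQueue:
    def __init__(self, depth):
        self.depth = depth
        self.frames = []
        self.pause_sent = False

    def enqueue(self, frame):
        if len(self.frames) >= self.depth:
            if not self.pause_sent:
                self.pause_sent = True       # signal the peer to stop transmitting
            return "tail-dropped"            # frame arrived while the queue was full
        self.frames.append(frame)
        return "queued"

q = ReceiveQueue(depth=2)
print([q.enqueue(f) for f in ("f1", "f2", "f3")])   # ['queued', 'queued', 'tail-dropped']
```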

Ethernet QoS

Ethernet supports QoS via the Priority field in the header tag defined by the IEEE 802.1Q-2003 specification. Whereas the 802.1Q-2003 specification defines the header tag format, the IEEE 802.1D-2004 specification defines the procedures for setting the priority bits. Because the Priority field is 3 bits long, eight priority levels are supported. Currently, only seven traffic classes are considered necessary to provide adequate QoS. The seven traffic classes defined in the 802.1D-2004 specification include the following:

  • Network control information

  • Voice applications

  • Video applications

  • Controlled load applications

  • Excellent effort applications

  • Best effort applications

  • Background applications

The 802.1D-2004 specification defines a recommended set of default mappings between the seven traffic classes and the eight Ethernet priority values. In Ethernet switches that support seven or more queues per port, each traffic class can be mapped into its own queue. However, many Ethernet switches support fewer than seven queues per port. So, the 802.1D-2004 specification also defines recommendations for traffic class groupings when traffic classes must share queues. These mappings and groupings are not mandated, but they promote interoperability between Ethernet devices so that end-to-end QoS can be implemented successfully in a plug-and-play manner (even in multi-vendor environments).
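
When a port supports fewer queues than traffic classes, the priority values must be grouped. The mapping below is an illustrative grouping for a hypothetical four-queue port; it follows the spirit of the 802.1D-2004 recommendations but is not a reproduction of the specification's tables.

```python
# Hypothetical grouping of the eight Ethernet priority values (0-7) onto a
# port that supports only four egress queues; queue 3 is serviced first.
PRIORITY_TO_QUEUE = {
    7: 3, 6: 3,   # network control and voice share the highest-priority queue
    5: 2, 4: 2,   # video and controlled load
    3: 1, 0: 1,   # excellent effort and best effort (priority 0 is the default)
    2: 0, 1: 0,   # spare and background traffic in the lowest-priority queue
}

def egress_queue(frame_priority):
    return PRIORITY_TO_QUEUE.get(frame_priority, 1)   # unmapped values fall back to best effort

print(egress_queue(6), egress_queue(0), egress_queue(1))   # 3 1 0
```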

Currently, no functionality is defined in the Ethernet specifications for the Pause Opcode to interact with the Priority field. So, the Pause Opcode affects all traffic classes simultaneously. In other words, an Ethernet switch that supports the Pause Opcode and multiple receive queues on a given port must send a Pause Opcode via that port (affecting all traffic classes) if any one of the queues fills. Otherwise, tail-drop occurs for the queue that filled. However, tail-drop can interact with the Priority field. Many Ethernet switches produced by Cisco Systems support advanced tail-drop, in which queuing thresholds can be set for each Ethernet priority level. Tail-drop then affects each traffic class independently. When a particular traffic class exceeds its queue threshold, frames matching that traffic class are dropped while the queue remains above the threshold defined for that traffic class. Other traffic classes are unaffected unless they also exceed their respective thresholds. The Pause Opcode needs to be sent only if all queues fill simultaneously. Alternately, the Pause Opcode may be sent only when one or more of the high-priority queues fill, thus avoiding tail-drop for high-priority traffic while permitting tail-drop to occur for lower-priority traffic. In this manner, the Pause Opcode can interact with the Priority field, but this functionality is proprietary and is not supported by all Ethernet switches. For more information about Ethernet QoS, readers are encouraged to consult the IEEE 802.1Q-2003 and 802.1D-2004 specifications.

IP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by IP.

IP Flow Control

IP employs several flow-control mechanisms. Some are explicit, and others are implicit. All are reactive. The supported mechanisms include the following:

  • Tail-drop

  • Internet Control Message Protocol (ICMP) Source-Quench

  • Active Queue Management (AQM)

  • Explicit Congestion Notification (ECN)

Tail-drop is the historical mechanism for routers to control the rate of flows between end nodes. It often is implemented with a FIFO algorithm. When packets are dropped from the tail of a full queue, the end nodes detect the dropped frames via TCP mechanisms. TCP then reduces its window size, which precipitates a reduction in the rate of transmission. Thus, tail-drop constitutes implicit, reactive flow control.

ICMP Source-Quench messages can be used to explicitly convey a request to reduce the rate of transmission at the source. ICMP Source-Quench messages may be sent by any IP device in the end-to-end path. Conceptually, the ICMP Source-Quench mechanism operates in a manner similar to the Ethernet Pause Opcode. A router may choose to send an ICMP Source-Quench packet to a source node in response to a queue overrun. Alternately, a router may send an ICMP Source-Quench packet to a source node before a queue overruns, but this is not common. Despite the fact that ICMP Source-Quench packets can be sent before a queue overrun occurs, ICMP Source-Quench is considered a reactive mechanism because some indication of congestion or potential congestion must trigger the transmission of an ICMP Source-Quench message. Thus, additional packets can be transmitted by the source nodes while the ICMP Source-Quench packets are in transit, and tail-drop can occur even after “proactive” ICMP Source-Quench packets are sent. Upon receipt of an ICMP Source-Quench packet, the IP process within the source node must notify the appropriate Network Layer protocol or ULP. The notified Network Layer protocol or ULP is then responsible for slowing its rate of transmission. ICMP Source-Quench is a rudimentary mechanism, so few modern routers depend on ICMP Source-Quench messages as the primary means of avoiding tail-drop.

RFC 2309 defines the concept of AQM. Rather than merely dropping packets from the tail of a full queue, AQM employs algorithms that attempt to proactively avoid queue overruns by selectively dropping packets prior to queue overrun. The first such algorithm is called Random Early Detection (RED). More advanced versions of RED have since been developed. The most well known are Weighted RED (WRED) and DiffServ Compliant WRED. All RED-based algorithms attempt to predict when congestion will occur and abate based on rising and falling queue level averages. As a queue level rises, so does the probability of packets being dropped by the AQM algorithm. The packets to be dropped are selected at random when using RED. WRED and DiffServ Compliant WRED consider the traffic class when deciding which packets to drop, which results in administrative control of the probability of packet drop. All RED-based algorithms constitute implicit flow control because the dropped packets must be detected via TCP mechanisms. Additionally, all RED-based algorithms constitute reactive flow control because some indication of potential congestion must trigger the packet drop. The proactive nature of packet drop as implemented by AQM algorithms should not be confused with proactive flow-control mechanisms that exchange buffer resource information before data transfer occurs, to completely avoid frame/packet drops. Note that in the most generic sense, sending an ICMP Source-Quench message before queue overrun occurs based on threshold settings could be considered a form of AQM. However, the most widely accepted definition of AQM does not include ICMP Source-Quench.
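
The essence of RED can be captured in a short function: the probability of dropping an arriving packet rises linearly as the average queue depth moves between a minimum and a maximum threshold. The sketch below uses arbitrary thresholds and omits the exponentially weighted queue average and count-based adjustments of a complete RED implementation.

```python
import random

def red_should_drop(avg_queue_depth, min_th=20, max_th=80, max_p=0.10):
    """Simplified RED admission decision for one arriving packet."""
    if avg_queue_depth < min_th:
        return False                    # no congestion anticipated; admit the packet
    if avg_queue_depth >= max_th:
        return True                     # behave like tail-drop above the maximum threshold
    # Between the thresholds, drop probability rises linearly up to max_p.
    drop_probability = max_p * (avg_queue_depth - min_th) / (max_th - min_th)
    return random.random() < drop_probability

# With an average depth of 60, roughly 6.7 percent of arriving packets are dropped.
print(sum(red_should_drop(60) for _ in range(100_000)) / 100_000)
```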

ECN is another method of implementing AQM. ECN enables routers to convey congestion information to end nodes explicitly by marking packets with a congestion indicator rather than by dropping packets. When congestion is experienced by a packet in transit, the congested router sets the two ECN bits to 11. The destination node then notifies the source node (see the TCP Flow Control section of this chapter). When the source node receives notification, the rate of transmission is slowed. However, ECN works only if the Transport Layer protocol supports ECN. TCP supports ECN, but many TCP implementations do not yet implement ECN. For more information about IP flow control, readers are encouraged to consult IETF RFCs 791, 792, 896, 1122, 1180, 1812, 2309, 2914, and 3168.

IP QoS

IP QoS is a robust topic that defies precise summarization. That said, we can categorize all IP QoS models into one of two very general categories: stateful and stateless. Currently, the dominant stateful model is the Integrated Services Architecture (IntServ), and the dominant stateless model is the Differentiated Services Architecture (DiffServ).

The IntServ model is characterized by application-based signaling that conveys a request for flow admission to the network. The signaling is typically accomplished via the Resource Reservation Protocol (RSVP). The network either accepts the request and admits the new flow or rejects the request. If the flow is admitted, the network guarantees the requested service level end-to-end for the duration of the flow. This requires state to be maintained for each flow at each router in the end-to-end path. If the flow is rejected, the application may transmit data, but the network does not provide any service guarantees. This is known as best-effort service. It is currently the default service offered by the Internet. With best-effort service, the level of service rendered varies as the cumulative load on the network varies.

The DiffServ model does not require any signaling from the application prior to data transmission. Instead, the application “marks” each packet via the Differentiated Services Codepoint (DSCP) field to indicate the desired service level. The first router to receive each packet (typically the end node’s default gateway) conditions the flow to comply with the traffic profile associated with the requested DSCP value. Such routers are called conditioners. Each router (also called a hop) in the end-to-end path then forwards each packet according to Per Hop Behavior (PHB) rules associated with each DSCP value. The conditioners decouple the applications from the mechanism that controls the cumulative load placed on the network, so the cumulative load can exceed the network’s cumulative capacity. When this happens, packets may be dropped in accordance with PHB rules, and the affected end nodes must detect such drops (usually via TCP but sometimes via ICMP Source-Quench). In other words, the DiffServ model devolves into best-effort service for some flows when the network capacity is exceeded along a given path.
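
An application can request a DSCP value by setting the IP TOS byte on its socket; the DSCP occupies the upper six bits of that byte. The sketch below marks a TCP socket with DSCP 46 (commonly associated with expedited forwarding); whether the first-hop conditioner honors or re-marks that value depends entirely on the configured traffic profile.

```python
import socket

EF_DSCP = 46                     # commonly associated with expedited forwarding
tos_byte = EF_DSCP << 2          # the DSCP occupies the six high-order bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos_byte)
# Conditioners along the path may honor, re-mark, or ignore this value
# according to the per-hop behaviors configured for DSCP 46.
```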

Both of these QoS models have strengths and weaknesses. At first glance, the two models would seem to be incompatible. However, the two models can interwork, and various RFCs have been published detailing how such interworking may be accomplished. For more information about IP QoS, readers are encouraged to consult IETF RFCs 791, 1122, 1633, 1812, 2205, 2430, 2474, 2475, 2815, 2873, 2963, 2990, 2998, 3086, 3140, 3260, 3644, and 4094.

TCP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by TCP.

TCP Flow Control

TCP flow control is a robust topic that defies precise summarization. TCP implements many flow-control algorithms, and many augmentations have been made to those algorithms over the years. That said, the primary TCP flow-control algorithms include slow start, congestion avoidance, fast retransmit, and fast recovery. These algorithms control the behavior of TCP following initial connection establishment in an effort to avoid congestion and packet loss, and during periods of congestion and packet loss in an effort to reduce further congestion and packet loss.

As previously discussed, the TCP sliding window is the ever-changing receive buffer size that is advertised to a peer TCP node. The most recently advertised value is called the receiver window (RWND). The RWND is complemented by the Congestion Window (CWND), which is a state variable within each TCP node that controls the amount of data that may be transmitted. When congestion is detected in the network, TCP reacts by reducing its rate of transmission. Specifically, the transmitting node reduces its CWND. At any point in time, a TCP node may transmit data up to the Sequence Number that is equal to the lesser of the peer’s RWND plus the highest acknowledged Sequence Number or the CWND plus the highest acknowledged Sequence Number. If no congestion is experienced, the RWND value is used. If congestion is experienced, the CWND value is used. Congestion can be detected implicitly via TCP’s acknowledgement mechanisms or timeout mechanisms (as applies to dropped packets) or explicitly via ICMP Source-Quench messages or the ECE bit in the TCP header.
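
The transmit limit described above reduces to a simple comparison. The sketch below computes how many additional bytes a sender may put in flight, given the highest acknowledged sequence number, the next sequence number to be sent, the peer's advertised RWND, and the local CWND; the function name and figures are illustrative.

```python
def tcp_send_limit(highest_acked_seq, next_seq_to_send, rwnd, cwnd):
    """Bytes that may still be transmitted: the usable window is the lesser of
    the peer's advertised receive window and the local congestion window."""
    usable_window = min(rwnd, cwnd)
    bytes_in_flight = next_seq_to_send - highest_acked_seq
    return max(0, usable_window - bytes_in_flight)

# No congestion detected: CWND has grown beyond RWND, so RWND governs.
print(tcp_send_limit(1000, 3000, rwnd=16384, cwnd=65535))   # 14384
# Congestion detected: CWND has been reduced below RWND, so CWND governs.
print(tcp_send_limit(1000, 3000, rwnd=16384, cwnd=4096))    # 2096
```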

When ECN is implemented, TCP nodes convey their support for ECN by setting the two ECN bits in the IP header to 10 or 01. A router may then change these bits to 11 when congestion occurs. Upon receipt, the destination node recognizes that congestion was experienced. The destination node then notifies the source node by setting to 1 the ECE bit in the TCP header of the next transmitted packet. Upon receipt, the source node reduces its CWND and sets the CWR bit to 1 in the TCP header of the next transmitted packet. Thus, the destination TCP node is explicitly notified that the rate of transmission has been reduced. For more information about TCP flow control, readers are encouraged to consult IETF RFCs 792, 793, 896, 1122, 1180, 1323, 1812, 2309, 2525, 2581, 2914, 3042, 3155, 3168, 3390, 3448, 3782, and 4015.
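
The ECN exchange amounts to a handful of bit values and reactions. The sketch below models the two-bit IP-header field and the TCP ECE/CWR response in plain Python; it is a conceptual illustration, not a packet parser.

```python
# ECN codepoints carried in the two-bit IP-header field.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11

def router_forward(ecn_bits, congested):
    """A congested router marks ECN-capable packets instead of dropping them."""
    if congested and ecn_bits in (ECT_0, ECT_1):
        return CE                                   # congestion experienced
    return ecn_bits

def destination_reaction(ecn_bits):
    """The destination sets ECE in its next TCP segment when it sees CE."""
    return {"ECE": ecn_bits == CE}

def source_reaction(tcp_flags):
    """The source reduces its CWND and sets CWR when ECE is received."""
    if tcp_flags["ECE"]:
        return {"reduce_cwnd": True, "CWR": True}
    return {"reduce_cwnd": False, "CWR": False}

print(source_reaction(destination_reaction(router_forward(ECT_0, congested=True))))
```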

TCP QoS

TCP interacts with the QoS mechanisms implemented by IP. Additionally, TCP provides two explicit QoS mechanisms of its own: the Urgent and Push flags in the TCP header. The Urgent flag indicates whether the Urgent Pointer field is valid. When valid, the Urgent Pointer field indicates the location of the last byte of urgent data in the packet’s Data field. The Urgent Pointer field is expressed as an offset from the Sequence Number in the TCP header. No indication is provided for the location of the first byte of urgent data. Likewise, no guidance is provided regarding what constitutes urgent data. A ULP or application decides when to mark data as urgent. The receiving TCP node is not required to take any particular action upon receipt of urgent data, but the general expectation is that some effort will be made to process the urgent data sooner than otherwise would occur if the data were not marked urgent.

As previously discussed, TCP decides when to transmit data received from a ULP. However, a ULP occasionally needs to be sure that data submitted to the source node’s TCP byte stream has actually been sent to the destination. This can be accomplished via the push function. A ULP informs TCP that all data previously submitted needs to be “pushed” to the destination ULP by requesting (via the TCP service provider interface) the push function. This causes TCP in the source node to immediately transmit all data in the byte stream and to set the Push flag to 1 in the final packet. Upon receiving a packet with the Push flag set to 1, TCP in the destination node immediately forwards all data in the byte stream to the required ULPs (subject to the rules for in-order delivery based on the Sequence Number field). For more information about TCP QoS, readers are encouraged to consult IETF RFCs 793 and 1122.
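
In the sockets API, the urgent mechanism is exposed as "out-of-band" data, and many stacks set the Push flag implicitly when an application's send buffer is flushed; no portable call maps directly to the push function. The sketch below sends one byte flagged as urgent with MSG_OOB; as noted above, how promptly the receiver acts on urgent data is left to the receiving ULP.

```python
import socket

def send_with_urgent_byte(sock, normal_payload, urgent_byte):
    """Send ordinary stream data followed by a single byte marked urgent.

    MSG_OOB causes the TCP stack to set the Urgent flag and point the Urgent
    Pointer at this byte; the receiver may fetch it ahead of the normal stream
    with recv(..., socket.MSG_OOB) if it chooses to expedite processing.
    """
    sock.sendall(normal_payload)
    sock.send(urgent_byte, socket.MSG_OOB)

# Usage (assumes an established, connected TCP socket named conn):
# send_with_urgent_byte(conn, b"bulk data", b"!")
```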

iSCSI Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by iSCSI.

iSCSI Flow Control

The primary flow-control mechanism employed by iSCSI is the Ready To Transfer (R2T) Protocol Data Unit (PDU). iSCSI targets use the R2T PDU to control the flow of SCSI data during write commands. The Desired Data Transfer Length field in the R2T PDU header controls how much data may be transferred per Data-Out PDU sequence. The R2T PDU is complemented by several other mechanisms. The MaxOutstandingR2T text key controls how many R2T PDUs may be outstanding simultaneously. The use of implicit R2T PDUs (unsolicited data) is negotiated via the InitialR2T and ImmediateData text keys. When unsolicited data is supported, the FirstBurstLength text key controls how much data may be transferred in or with the SCSI Command PDU, thus performing an equivalent function to the Desired Data Transfer Length field. The MaxRecvDataSegmentLength text key controls how much data may be transferred in a single Data-Out or Data-In PDU. The MaxBurstLength text key controls how much data may be transferred in a single PDU sequence (solicited or unsolicited). Thus, the FirstBurstLength value must be equal to or less than the MaxBurstLength value. The MaxConnections text key controls how many TCP connections may be aggregated into a single iSCSI session, thus controlling the aggregate TCP window size available to a session. The MaxCmdSN field in the Login Response BHS and SCSI Response BHS controls how many SCSI commands may be outstanding simultaneously. For more information about iSCSI flow control, readers are encouraged to consult IETF RFC 3720.
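
Several of these text keys constrain one another, so an initiator or target typically validates the negotiated set before use. The sketch below checks the relationships named above; the key names come from RFC 3720, but the validation logic itself is only an illustration.

```python
def validate_iscsi_negotiation(keys):
    """Return a list of violations among the negotiated iSCSI text keys."""
    problems = []
    if keys.get("InitialR2T") == "No" or keys.get("ImmediateData") == "Yes":
        # Unsolicited data is permitted, so FirstBurstLength is meaningful.
        if keys["FirstBurstLength"] > keys["MaxBurstLength"]:
            problems.append("FirstBurstLength must not exceed MaxBurstLength")
    if keys["MaxRecvDataSegmentLength"] <= 0:
        problems.append("MaxRecvDataSegmentLength must be positive")
    return problems

negotiated = {
    "InitialR2T": "No",               # unsolicited data allowed
    "ImmediateData": "Yes",
    "FirstBurstLength": 65536,
    "MaxBurstLength": 262144,
    "MaxRecvDataSegmentLength": 8192,
    "MaxOutstandingR2T": 1,
}
print(validate_iscsi_negotiation(negotiated))   # [] -- no violations
```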

iSCSI QoS

iSCSI depends primarily on lower-layer protocols to provide QoS. However, iSCSI provides support for expedited command processing via the I bit in the BHS of the Login Request PDU, the SCSI Command PDU, and the TMF Request PDU. For more information about iSCSI QoS, readers are encouraged to consult IETF RFC 3720.

FC Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by FC.

FC Flow Control

The primary flow-control mechanism used in modern FC-SANs (Class 3 fabrics) is the Buffer-to-Buffer_Credit (BB_Credit) mechanism. The BB_Credit mechanism provides link-level flow control. The FLOGI procedure informs the peer port of the number of BB_Credits each N_Port and F_Port has available for frame reception. Likewise, the Exchange Link Parameters (ELP) procedure informs the peer port of the number of BB_Credits each E_Port has available for frame reception. Each time a port transmits a frame, the port decrements the BB_Credit counter associated with the peer port. If the BB_Credit counter reaches zero, no more frames may be transmitted until a Receiver_Ready (R_RDY) primitive signal is received. Each time an R_RDY is received, the receiving port increments the BB_Credit counter associated with the peer port. Each time a port processes a received frame, the port transmits an R_RDY to the peer port. The explicit, proactive nature of the BB_Credit mechanism ensures that no frames are ever dropped in FC-SANs because of link-level buffer overrun. However, line-rate throughput can be very difficult to achieve over long distances because of the high BB_Credit count requirement. Some of the line cards available for FC switches produced by Cisco Systems support thousands of BB_Credits on each port, thus enabling long-distance SAN interconnectivity over optical networks without compromising throughput. When FC-SANs are connected over long-distance optical networks, R_RDY signals are sometimes lost. When this occurs, throughput drops slowly over a long period. This phenomenon can be conceptualized as temporal droop. This phenomenon also can occur on native FC inter-switch links (ISLs), but the probability of occurrence is much lower with local connectivity. The FC-FS-2 specification defines a procedure called BB_Credit Recovery for detecting and recovering from temporal droop. For more information about FC flow control, readers are encouraged to consult the ANSI T11 FC-FS-2 and FC-BB-3 specifications.
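
The BB_Credit rules reduce to a per-port counter, and the number of credits needed to sustain line rate over distance follows from the round-trip time of the link. Both are sketched below; the 2112-byte frame payload, the 5-microseconds-per-kilometer propagation figure, and the 8b/10b approximation are commonly cited values, and the helper names are illustrative.

```python
import math

class BBCreditPort:
    """Per-port BB_Credit accounting as described above."""
    def __init__(self, peer_advertised_credits):
        self.credits = peer_advertised_credits    # learned via FLOGI or ELP

    def can_transmit(self):
        return self.credits > 0

    def on_frame_transmitted(self):
        self.credits -= 1                         # one credit consumed per frame sent

    def on_r_rdy_received(self):
        self.credits += 1                         # the peer has freed a receive buffer

def bb_credits_for_distance(distance_km, link_gbps, frame_bytes=2112):
    """Approximate credits needed to keep the link full over the given distance."""
    rtt_seconds = distance_km * 2 * 5e-6                      # ~5 us per km, both directions
    bytes_in_flight = (link_gbps * 1e9 / 10) * rtt_seconds    # 8b/10b: ~1 data byte per 10 line bits
    return math.ceil(bytes_in_flight / frame_bytes)

# A 100-km, 4-Gbps ISL needs on the order of 190 credits to avoid droop.
print(bb_credits_for_distance(100, 4))
```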

FC switches produced by Cisco Systems also support a proprietary flow control feature called FC Congestion Control (FCC). Conceptually, FCC mimics the behavior of ICMP Source-Quench. When a port becomes congested, FCC signals the switch to which the source node is connected. The source switch then artificially slows the rate at which BB_Credits are transmitted to the source N_Port. Cisco Systems might submit FCC to ANSI for inclusion in a future FC standard.

FC QoS

FC supports several QoS mechanisms via fields in the FC header. The DSCP subfield in the CS_CTL/Priority field can be used to implement differentiated services similar to the IP DiffServ model. However, the FC-FS-2 specification currently reserves all values other than zero, which is assigned to best-effort service. The Preference subfield in the CS_CTL/Priority field can be used to implement a simple two-level priority system. The FC-FS-2 specification requires all Class 3 devices to support the Preference subfield. No requirement exists for every frame within a sequence or Exchange to have the same preference value. So, it is theoretically possible for frames to be delivered out of order based on inconsistent values in the Preference fields of frames within a sequence or Exchange. However, this scenario is not likely to occur because all FC Host Bus Adapter (HBA) vendors recognize the danger in such behavior. The Priority subfield in the CS_CTL/Priority field can be used to implement a multi-level priority system. Again, no requirement exists for every frame within a sequence or Exchange to have the same priority value, so out-of-order frame delivery is theoretically possible (though improbable). The Preemption subfield in the CS_CTL/Priority field can be used to preempt a Class 1 or Class 6 connection to allow Class 3 frames to be forwarded. No modern FC switches support Class 1 or Class 6 traffic, so the Preemption field is never used. For more information about FC QoS, readers are encouraged to consult the ANSI T11 FC-FS-2 specification.

FCP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by FCP.

FCP Flow Control

The primary flow-control mechanism employed by FCP is the FCP_XFER_RDY IU. FCP targets use the FCP_XFER_RDY IU to control the flow of SCSI data during write commands. The FCP_BURST_LEN field in the FCP_XFER_RDY IU header controls how much data may be transferred per FCP_DATA IU. The FCP_XFER_RDY IU is complemented by a variety of other mechanisms. The Class 3 Service Parameters field in the PLOGI ELS header determines how many FCP_XFER_RDY IUs may be outstanding simultaneously. This is negotiated indirectly via the maximum number of concurrent sequences within each Exchange. The use of implicit FCP_XFER_RDY IUs (unsolicited data) is negotiated via the WRITE FCP_XFER_RDY DISABLED field in the PRLI Service Parameter Page.

When unsolicited data is supported, the First Burst Size parameter in the SCSI Disconnect-Reconnect mode page controls how much data may be transferred in the unsolicited FCP_DATA IU, thus performing an equivalent function to the FCP_BURST_LEN field. The Maximum Burst Size parameter in the SCSI Disconnect-Reconnect mode page controls how much data may be transferred in a single FCP_DATA IU (solicited or unsolicited). Thus, the First Burst Size value must be equal to or less than the Maximum Burst Size value. FCP does not support negotiation of the maximum number of SCSI commands that may be outstanding simultaneously because the architectural limit imposed by the size of the CRN field in the FCP_CMND IU header is 255 (versus 4,294,967,296 for iSCSI). For more information about FCP flow control, readers are encouraged to consult the ANSI T10 FCP-3 and ANSI T11 FC-LS specifications.
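
The same kind of consistency check applies to the FCP parameters named above, and the Maximum Burst Size determines how many solicited FCP_DATA IUs a given write requires. The sketch below is illustrative only; the parameter values are arbitrary.

```python
import math

def validate_fcp_burst_sizes(first_burst_size, maximum_burst_size):
    """First Burst Size must not exceed Maximum Burst Size (both in bytes)."""
    return first_burst_size <= maximum_burst_size

def data_ius_for_write(write_length, maximum_burst_size):
    """Approximate number of FCP_DATA IUs needed when each IU may carry at
    most Maximum Burst Size bytes."""
    return math.ceil(write_length / maximum_burst_size)

print(validate_fcp_burst_sizes(65536, 1048576))      # True
print(data_ius_for_write(10 * 1048576, 1048576))     # 10 IUs for a 10-MB write
```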

FCP QoS

FCP depends primarily on lower-layer protocols to provide QoS. However, FCP provides support for expedited command processing via the Priority field in the FCP_CMND IU header. For more information about FCP QoS, readers are encouraged to consult the ANSI T10 FCP-3 specification.

FCIP Flow Control and QoS

This section summarizes the flow-control and QoS mechanisms supported by FCIP.

FCIP Flow Control

FCIP does not provide any flow-control mechanisms of its own. The only FCIP flow-control functionality of note is the mapping function between FC and TCP/IP flow-control mechanisms. FCIP vendors have implemented various proprietary features to augment FCIP performance. Most notable are the FCP_XFER_RDY IU spoofing techniques. In some cases, even the FCP_RSP IU is spoofed. For more information about FCIP flow control, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-3 specification.

FCIP QoS

FCIP does not provide any QoS mechanisms of its own. However, RFC 3821 requires the FC Entity to specify the IP QoS characteristics of each new TCP connection to the FCIP Entity at the time that the TCP connection is requested. In doing so, no requirement exists for the FC Entity to map FC QoS mechanisms to IP QoS mechanisms. This may be optionally accomplished by mapping the value of the Preference subfield or the Priority subfield in the CS_CTL/Priority field of the FC header to an IntServ/RSVP request or a DiffServ DSCP value. FCIP links are not established dynamically in response to received FC frames, so the FC Entity needs to anticipate the required service levels prior to FC frame reception. One method to accommodate all possible FC QoS values is to establish one TCP connection for each of the seven traffic classes identified by the IEEE 802.1D-2004 specification. The TCP connections can be aggregated into one or more FCIP links, or each TCP connection can be associated with an individual FCIP link. The subsequent mapping of FC QoS values onto the seven TCP connections could then be undertaken in a proprietary manner. Many other techniques exist, and all are proprietary. For more information about FCIP QoS, readers are encouraged to consult IETF RFC 3821 and the ANSI T11 FC-BB-3 specification.
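
One way to realize the per-traffic-class approach described above is a static table that maps each of the seven 802.1D traffic classes to a dedicated TCP connection and a DSCP value requested at connection setup. The table below is purely hypothetical; neither the connection layout nor the DSCP assignments are mandated by RFC 3821 or FC-BB-3.

```python
# Hypothetical mapping: one TCP connection per 802.1D traffic class, each
# marked with an illustrative DSCP value when the connection is requested.
TRAFFIC_CLASS_TO_CONNECTION = {
    "network_control":  {"tcp_connection": 0, "dscp": 48},
    "voice":            {"tcp_connection": 1, "dscp": 46},
    "video":            {"tcp_connection": 2, "dscp": 34},
    "controlled_load":  {"tcp_connection": 3, "dscp": 26},
    "excellent_effort": {"tcp_connection": 4, "dscp": 18},
    "best_effort":      {"tcp_connection": 5, "dscp": 0},
    "background":       {"tcp_connection": 6, "dscp": 8},
}

def connection_for_fc_frame(mapped_traffic_class):
    """Select the TCP connection (and DSCP) for an FC frame's mapped class."""
    return TRAFFIC_CLASS_TO_CONNECTION.get(mapped_traffic_class,
                                           TRAFFIC_CLASS_TO_CONNECTION["best_effort"])

print(connection_for_fc_frame("voice"))   # {'tcp_connection': 1, 'dscp': 46}
```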

Summary

This chapter reviews the flow-control and QoS mechanisms supported by Ethernet, IP, TCP, iSCSI, FC, FCP, and FCIP. As such, it provides insight into network performance optimization. Application performance optimization requires attention to the flow-control and QoS mechanisms at each OSI layer within each protocol stack.

Review Questions

  1. What is the primary function of all flow-control mechanisms?

  2. What are the two categories of QoS algorithms?

  3. What is the name of the queue management algorithm historically associated with tail-drop?

  4. Which specification defines traffic classes, class groupings, and class-priority mappings for Ethernet?

  5. What is the name of the first algorithm used for AQM in IP networks?

  6. What are the names of the two dominant QoS models used in IP networks today?

  7. What is the name of the TCP state variable that controls the amount of data that may be transmitted?

  8. What is the primary flow-control mechanism employed by iSCSI?

  9. What are the names of the two QoS subfields currently available for use in FC-SANs?

  10. What is the primary flow-control mechanism employed by FCP?

  11. Are FCIP devices required to map FC QoS mechanisms to IP QoS mechanisms?

Copyright © 2007 Pearson Education. All rights reserved.