Vendors on flow control
The purpose of this document is to detail Cabletron's implementation of 802.3x Annex 31B on the SmartSwitch 6000 and SmartSwitch Router 8600 as it applies to MAC control PAUSE operation. Cabletron products do not transmit PAUSE frames in all situations, but do respond to PAUSE frames as outlined in the specification. The specification states (in section 31B.3.2) that "it is not required that an implementation be able to transmit PAUSE Frames." Cabletron complies completely with all parts of the specification.
There are instances where sending PAUSE frames can cause an external head-of-line blocking situation. An example of this is when three stations oversubscribe the path to a fourth station (see diagram).
In the above diagram, stations A, B and C are transmitting to station D, and stations B and C are also transmitting to each other. Station A is transmitting at 100% of line rate to station D. Stations B and C are each transmitting only 10% of line rate to station D and 90% of line rate to each other.
Traffic equal to 120% of line rate is now destined for station D (see diagram). Some switches or routers would send PAUSE frames to stations A, B and C because station D's path is overloaded. These PAUSE frames will slow down or even stop the unrelated conversation between stations B and C. This is what is called external head-of-line blocking.
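The arithmetic in this example can be checked with a short sketch. The flow table below uses only the rates quoted above; the function names are hypothetical and chosen for illustration, and the model simply assumes that a link-level PAUSE stops everything a paused station sends.

```python
# Offered load per (source, destination) pair, as a fraction of line rate.
# The rates come from the example in the text.
flows = {
    ("A", "D"): 1.00,
    ("B", "D"): 0.10,
    ("C", "D"): 0.10,
    ("B", "C"): 0.90,
    ("C", "B"): 0.90,
}


def offered_load(flows, destination):
    """Total traffic destined for one station, as a fraction of line rate."""
    return sum(rate for (_, dst), rate in flows.items() if dst == destination)


def paused_bystanders(flows, congested):
    """Flows that do not go to the congested station but whose source
    would be paused because it also sends to that station."""
    paused_sources = {src for (src, dst) in flows if dst == congested}
    return sorted(
        (src, dst)
        for (src, dst) in flows
        if src in paused_sources and dst != congested
    )


load_d = offered_load(flows, "D")
bystanders = paused_bystanders(flows, "D")
print(f"Load destined for D: {load_d:.0%}")
print("Unrelated flows slowed by a link-level pause:", bystanders)
```

Station D's path is oversubscribed by 20%, yet the flows caught by the pause are the B-to-C and C-to-B conversations, which never touch station D.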
Another reason it would be undesirable to send PAUSE frames when a port is oversubscribed is that slowing down an end station does not allow for Quality of Service. Quality of Service cannot operate properly if a switch sends PAUSE frames, because this slows all of that port's traffic, including any traffic which may have high priority. The switch needs to see all the traffic and make the decision to forward only the high-priority traffic.
Cabletron's SmartSwitch family and the SmartSwitch Router will send PAUSE frames in special cases. One special case is when the switch or router is overwhelmed. Because the switch or router is overwhelmed, it is not going to be able to make decisions based on Quality of Service, or even handle simple forwarding situations. The switch will slow all traffic by sending PAUSE frames out all ports, which will allow time for the switch or router to catch up.
Cabletron's SmartSwitch family also has a feature that enables it to send PAUSE frames when an operator-set frame rate limit is approached. With this feature enabled on a port, PAUSE frames are sent as the frame rate approaches the preassigned limit. This keeps the rate inside the set limit without frame loss. This option is selectable on a per-port basis.
The IEEE 802.3x Task Force defined Ethernet flow control, also known as "Pause Frames", in 1997 as an optional clause of the specification for full duplex Ethernet. The IEEE 802.3 Ethernet standard does not require a device to implement Ethernet flow control in order to be completely standards compliant.
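For readers unfamiliar with what a Pause frame looks like on the wire, the following is a minimal sketch of the layout defined in 802.3x Annex 31B: a reserved multicast destination address, the MAC Control EtherType, the PAUSE opcode and a 16-bit pause time. The function name and the source address in the test are hypothetical; the frame check sequence is left to the hardware and is not computed here.

```python
import struct

# Constants from IEEE 802.3x Annex 31B.
PAUSE_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001
MIN_FRAME_WITHOUT_FCS = 60  # 64-byte minimum frame minus the 4-byte FCS


def build_pause_frame(source_mac: bytes, pause_quanta: int) -> bytes:
    """Return a PAUSE frame (without FCS) asking the link partner to stop
    sending for `pause_quanta` units of 512 bit times."""
    if len(source_mac) != 6:
        raise ValueError("source_mac must be exactly 6 bytes")
    if not 0 <= pause_quanta <= 0xFFFF:
        raise ValueError("pause_quanta must fit in 16 bits")
    header = PAUSE_DEST_MAC + source_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, pause_quanta)
    # Pad with zeros up to the minimum Ethernet frame size.
    return (header + payload).ljust(MIN_FRAME_WITHOUT_FCS, b"\x00")
```

A pause time of zero tells the link partner it may resume sending immediately, so the same frame format is used both to stop and to restart transmission.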
The problem Ethernet flow control is intended to solve is input buffer congestion on oversubscribed full duplex links which cannot handle wire-rate input. The Cisco products tested, the Catalyst 2948G and Catalyst 8510, are both non-blocking, shared-memory devices, and therefore would never issue Pause frames since:
(1) They can sustain wirespeed throughput on all ports, so none of the links can become oversubscribed from input traffic, which is the problem Ethernet flow control is intended to solve.
(2) There are no "input buffers" to become congested. The test used tries to force a port to send Pause frames by putting it into a head-of-line blocking configuration. By definition, there is no "head-of-line blocking" in these products. This is one of the fundamental advantages of shared memory architectures.
Ethernet flow control is not intended to solve the problem of steady-state overloaded networks or links. Unfortunately, perhaps the only feasible way any network-based test can "force" a device under test to issue a Pause frame is to put it into exactly this situation. The port being tested is usually put into a "head-of-line blocking" situation to simulate congestion. Typically this is achieved by oversubscribing a separate output port to which the port under test needs to send traffic. It can be argued that what the test primarily demonstrates is that the device under test is subject to head-of-line blocking. This test doesn't apply to non-blocking shared memory devices which are, by definition, not subject to head-of-line blocking.
If an output port is oversubscribed, having separate input ports issue Pause frames, and potentially cause blocking on other network devices by backward propagation of Pause frames, is not an appropriate long-term solution. For this reason, it can be argued that Pause Frames should not be used in a network core where they could potentially cause the delay of traffic unrelated to the oversubscription of a link. The right solution is to redesign the network with additional capacity, reduce the load, or provide appropriate end-to-end Quality of Service to ensure critical traffic can get through.
Ethernet flow control is also not intended to provide end-to-end flow control. End-to-end mechanisms, typically at the Transport Layer, are intended to address such issues. The most common example is TCP windows, which provide end-to-end flow control between source and destination for individual L3/L4 flows.
An example of where Ethernet flow control might be used appropriately is at the edge of a network where Gigabit Ethernet attached servers are operating at less than wirespeed, and the link only needs to be paused for a short time, typically measured in microseconds. The use of Pause frames to manage this situation may be appropriate under such circumstances.
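To put "microseconds" in context, the pause time carried in a Pause frame is expressed in quanta of 512 bit times, so the real duration depends on link speed. The sketch below converts quanta to microseconds; the function name is hypothetical and the 100-quanta figure is only an illustrative value, not one taken from any product.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum is 512 bit times (802.3x Annex 31B)
MAX_PAUSE_QUANTA = 0xFFFF  # the pause time field is 16 bits wide


def pause_duration_us(pause_quanta: int, link_bps: int) -> float:
    """Length of a pause, in microseconds, on a link of the given speed."""
    if not 0 <= pause_quanta <= MAX_PAUSE_QUANTA:
        raise ValueError("pause_quanta must fit in 16 bits")
    return pause_quanta * PAUSE_QUANTUM_BITS * 1_000_000 / link_bps


GIGABIT = 1_000_000_000
short_pause = pause_duration_us(100, GIGABIT)
longest_pause = pause_duration_us(MAX_PAUSE_QUANTA, GIGABIT)
print(f"100 quanta on Gigabit Ethernet: {short_pause:.1f} microseconds")
print(f"Longest single pause on Gigabit Ethernet: {longest_pause / 1000:.2f} milliseconds")
```

On Gigabit Ethernet one quantum is 512 nanoseconds, so even the longest pause a single frame can request is only a few tens of milliseconds.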
Unfortunately, Ethernet flow control is commonly misunderstood. It is not intended to address lack of network capacity, or end-to-end network issues. Properly used, Ethernet flow control can be a useful tool to address short term overloads on a single link.
Product Line Manager, Gigabit Ethernet Switching
The interoperability test highlighted both aspects of flow control mechanics: responding to (throttling) and transmitting flow control frames. Only a few test ports were actually used in this portion of the test. All Foundry Networks products are 802.3x compliant and implement flow control, but will only generate flow control messages when total system resources, rather than an individual buffer for a given port, are almost depleted. This protects the rest of the device (and the network) from a problem that could be caused by a single port. Due to the BigIron's switched shared memory architecture, consuming the system resources requires more test equipment than was actually available for this particular phase of the Interoperability Test. Had there been enough traffic generation ports to consume the system resources, the BigIron would have generated flow control messages.
Flow Control was originally invented to prevent packet drops by switches that were running at less than media-speed. At that time the method of control was usually back-pressure. This was not a standardized method, but was effective in preventing any data loss. It also could substantially lower overall throughput through the segments being flow controlled. And core switches and routers did not use it.
Now there is a standardized means of flow control in IEEE 802.3x. It doesn't change the debate, however, as to whether it should be implemented in a core switch. It is actually more detrimental than helpful to use flow control in the core. Flow control in the core can cause congestion in sections of the network that otherwise would not be congested. So how do you decide which segments to hold off when one of the segments gets congested? Even a network manager familiar with the traffic patterns of his/her network would, in many cases, be hard pressed to answer this question due to the dynamic nature of network traffic. And if particular links are constantly in a congested state, there is most likely a problem with the current implementation of the network.
Potential congestion in the backbone is much better handled through the CoS/QoS controls that many core switches, such as the HP ProCurve Routing Switch 9304M, have implemented. Prioritizing packets through multiple queues (the 9304M has 4 queues) provides far more sophisticated traffic control (such as targeting specific application packet types) than an all-or-nothing, or even a throttled, form of flow control. CoS/QoS can provide policy-based traffic shaping and guarantee that the proper traffic gets through in cases of temporarily limited bandwidth. This sophistication is becoming more important as different emphasis is placed on differing types of traffic.
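The contrast with all-or-nothing flow control can be illustrated with a small sketch of a four-queue output port. The strict-priority scheduling and per-queue tail drop used here are assumptions made for illustration; the text does not say how the 9304M services its queues, and the class name and queue depth are hypothetical.

```python
from collections import deque


class PriorityPort:
    """Output port with one FIFO per priority level (assumed strict priority)."""

    def __init__(self, num_queues=4, depth=2):
        self.queues = [deque() for _ in range(num_queues)]
        self.depth = depth  # hypothetical per-queue limit, in frames

    def enqueue(self, frame, priority):
        """Queue a frame; return False if its own class is full (tail drop)."""
        queue = self.queues[priority]
        if len(queue) >= self.depth:
            return False
        queue.append(frame)
        return True

    def dequeue(self):
        """Send the oldest frame from the highest-priority non-empty queue."""
        for queue in reversed(self.queues):
            if queue:
                return queue.popleft()
        return None


port = PriorityPort()
results = [port.enqueue(f"bulk-{i}", priority=0) for i in range(1, 4)]
voice_accepted = port.enqueue("voice-1", priority=3)
first_out = port.dequeue()
second_out = port.dequeue()
```

In this sketch the third bulk frame is dropped because its own queue is full, while the high-priority frame is still accepted and is transmitted first. A link-level pause would instead have held back both classes of traffic.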
The one area where flow control can add value is in an edge switch, such as the HP ProCurve Switch 4000M. At this point in the network, individual clients can be held off without potentially affecting large areas of the network. Flow control can be useful, for example, if the uplink is being swamped by individual clients. Even here, though, CoS/QoS will become more important over time. The 4000M supports tagging, can convert IP/TOS priorities to 802.1Q priorities, and can set these parameters for multiple switches at a time through TopTools, our network management application that ships with each managed product. HP will continue to develop these capabilities over time, making 802.3x largely irrelevant in the future.
HP ProCurve Networking
The scenario Tolly Group tested, using local congestion on an egress port to force flow control on a trunk to an upstream switch, while technically feasible, can have adverse effects in day-to-day network operation. In the test, Tolly is expecting the downstream switch, though fully capable of handling far more traffic than the small load offered, to generate a pause message to prevent further frames being sent to the congested port. However, 802.3x flow control is not implemented on a flow basis, but on a link basis. Therefore, the pause message Tolly was trying to generate would block not only traffic for the congested port, but all other traffic on the trunk as well. In essence, this operation would create head-of-line blocking, and thereby define throughput for many other devices on the downstream switch based upon the bandwidth of the congested device. Such behavior is definitely not what network managers are expecting. We considered the foregoing when designing the Accelar 1200. We decided the 1200 should send flow control frames in only two situations: when the ingress buffer on a particular port is full, it will send a pause message on that port, and when the switch fabric is congested, it will send pause messages on all ports. In both these situations, additional frames received would be dropped, and therefore it is better to keep them buffered upstream until the congestion has eased.
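The two-case policy described above can be written down as a short decision function. This is only a sketch of the rule as stated in the text, not the Accelar 1200's actual implementation; the function name, the port names and the frame-count representation of buffer fill are all hypothetical.

```python
def ports_to_pause(ingress_fill, ingress_capacity, fabric_congested):
    """Return the ports on which a pause message should be sent.

    ingress_fill maps a port name to the number of frames currently held
    in that port's ingress buffer (hypothetical representation).
    """
    if fabric_congested:
        # Case 2: the switch fabric is congested, so pause every port.
        return sorted(ingress_fill)
    # Case 1: pause only the ports whose own ingress buffer is full.
    return sorted(
        port for port, fill in ingress_fill.items() if fill >= ingress_capacity
    )


fill = {"port1": 64, "port2": 10, "port3": 0}
only_full = ports_to_pause(fill, ingress_capacity=64, fabric_congested=False)
everyone = ports_to_pause(fill, ingress_capacity=64, fabric_congested=True)
```

Note that congestion on an egress port, the condition the test tried to create, does not appear as an input at all, which is why the switch generated no pause messages in that scenario.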
Another aspect of this discussion I'd like to draw your attention to is that use of 802.3x flow control may actually interfere with QoS and CoS mechanisms such as Diff-Serv and 802.1p. We believe strongly that increased switch bandwidth and the advent of higher-level traffic management facilities make 802.3x unadvisable in the scenario tested.
Director, Accelar HW Product Management
More on the topic. Network World, 9/13/99.
Switch vendors pass interoperability tests
The main article on switch interoperability. Network World, 9/13/99.