The typical service provider SLA defines the loss, latency, and jitter that the provider's network will deliver between PE points of presence (POPs) in its network. In almost all cases, this is an average figure, so POPs near each other compensate for the more remote POPs in terms of latency contribution. Some providers also offer different loss/latency/jitter figures for different CoSs. Again, this is normally for traffic between provider POPs. What is of interest to enterprise applications, and hence to enterprise network managers, is the service's end-to-end performance, not just the bit in the middle. Specifically, the majority of latency and jitter (and most commonly loss, too) is introduced on the access circuits because of their constrained bandwidth and slower serialization times.
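To get a feel for why the access circuits dominate, consider the serialization delay of a single maximum-sized frame at common access speeds. The following sketch is illustrative only; the frame size and link speeds are assumptions, not figures from any particular provider's service:

```python
# Illustrative serialization delays for one maximum-sized frame on a few
# assumed access-circuit speeds (not figures from any particular SLA).

FRAME_BITS = 1500 * 8  # assume a 1500-byte IP packet

ACCESS_LINKS_BPS = {
    "256 kbps": 256_000,
    "512 kbps": 512_000,
    "T1 (1.544 Mbps)": 1_544_000,
    "100 Mbps Ethernet": 100_000_000,
}

for name, bps in ACCESS_LINKS_BPS.items():
    delay_ms = FRAME_BITS / bps * 1000
    print(f"{name:>20}: {delay_ms:7.2f} ms per 1500-byte frame")
```

At 256 kbps, a single large frame occupies the link for roughly 47 ms, which is why fragmentation and interleaving, and careful queuing, matter far more at the edge than in the provider core.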
To solve this problem, you need SLAs that reflect the service required by the applications. By this, I mean that latency and jitter can be controlled by implementing a priority queuing (PQ) mechanism; in a PQ system, loss is a function of the amount of traffic a user places in the queue, which the provider cannot control. For classes using something like Cisco class-based weighted fair queuing (CBWFQ), latency and jitter are a function of the load offered to the queuing mechanism. This is not surprising, because this mechanism is designed to allocate bandwidth to specific classes of traffic, not necessarily to deliver latency or jitter guarantees.
Some providers have signed up to deliver the Cisco Powered Network (CPN) IP Multiservice SLA, which provides 60-ms edge-to-edge latency, 20-ms jitter, and 0.5 percent loss between PE devices. With this strict delivery assured, designing the edge connectivity to meet end-to-end requirements is simplified.
With advances to the Cisco IP SLA, it will be possible to link the measurement of latency and jitter to class load. It then becomes reasonable for a provider to offer delay guarantees for CBWFQ classes, provided that the offered load is less than 100 percent of the class bandwidth. This puts the CBWFQ classes' latency and jitter performance under the enterprise's control: if the enterprise does not overload the class, good latency and jitter should be experienced; if the class is overloaded, that will not be the case.
There should be more to an SLA than loss, latency, and jitter characteristics. The SLA should define the metrics for each service delivered, the process each side should follow to deliver the service, and what remedies and penalties are available. Here is a suggested table of contents to consider when crafting an SLA with a provider:
Performance characteristics
Loss/latency/jitter for PQ traffic
Loss/latency/jitter for business data traffic
Loss/latency/jitter for best-effort traffic
Availability
Mean time to repair (MTTR)
Installation and upgrade performance
It is worth discussing each element in more detail. It is important to base performance characteristics on the requirements of the applications being supported and to consider them from the point of view of end-to-end performance. Starting with the PQ service, which will be used for voice, see Figure 10-1, which shows the results of ITU G.114 testing for voice quality performance. The E-model rating is simply a score derived from a set of tests used to assess user satisfaction with the quality of a telephone call.
Figure 10-1 SLA Metrics: One-Way Delay (VoIP)
If you select a mouth-to-ear delay budget of 150 ms, you may determine that the codec and LAN delay account for 50 ms, for example (this varies from network to network), leaving you 100 ms for the VPN. If the provider is managing the service to the CE, this 100 ms is the performance statistic to contract for. However, if the provider is managing the service only to the PE, perhaps only 30 ms is acceptable for the PE-to-PE portion if you are to stay within the end-to-end budget. This more stringent requirement comes from the serialization time on the access link (for maximum-sized fragments), the PQ's queue depth, and the size of the first-in, first-out (FIFO) transmit ring on the routers, which together can consume 35 ms on the ingress access link and another 35 ms on the egress access link.
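The arithmetic behind this budget is simple enough to keep in back-of-the-envelope form. Here is a minimal sketch using the example figures above; your codec, LAN, and access-link numbers will differ:

```python
# Worked delay budget using the example figures from the text; actual
# values vary from network to network.

mouth_to_ear_budget_ms = 150   # target drawn from ITU G.114 guidance
codec_and_lan_ms = 50          # example allowance for codec + campus LAN

ce_to_ce_budget_ms = mouth_to_ear_budget_ms - codec_and_lan_ms   # 100 ms

# If the SLA covers only PE to PE, the access link at each end (serialization,
# PQ depth, and FIFO transmit ring) comes out of the same budget.
access_link_ms = 35
pe_to_pe_budget_ms = ce_to_ce_budget_ms - 2 * access_link_ms     # 30 ms

print(f"CE-to-CE budget: {ce_to_ce_budget_ms} ms")
print(f"PE-to-PE budget: {pe_to_pe_budget_ms} ms")
```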
Whether the provider manages from CE to CE or PE to PE, targets must be set for the connection type, and reports need to be delivered against contracted performance. From the enterprise perspective, it's simplest to have the provider measure and report on performance from CE to CE; however, that does come with a drawback. To do so, the provider must be able to control the CE for the purposes of setting up IP SLA probes to measure the CE-to-CE performance and collect statistics. This is generally done by having the provider manage the CE device. However, not all enterprises want the IOS revision on the CE to be controlled by the provider, because the enterprise might want to upgrade its routers to take advantage of a new IOS feature. Clearly, this needs to be negotiated between the provider and enterprise to reach the optimum solution for the network in question.
For the data class, some research suggests that, for a user to retain his train of thought when using an application, the application needs to respond within one second (see Jakob Nielsen's Usability Engineering, published by Morgan Kaufmann, 1994). To reach this, it is reasonable to budget 700 ms for server-side processing and to require the end-to-end round-trip time to be less than 300 ms for the data classes.
Jitter, or delay variation, is a concern for real-time applications. With today's newest IP phones, adaptive jitter buffers compensate for jitter within the network and automatically optimize their settings. This is done by effectively turning a variable delay into a fixed delay by having the buffer delay all packets for a length of time that allows the buffer to smooth out any variations in packet delivery. This reduces the need for tight bounds on jitter to be specified, as long as the fixed delays plus the variable delays are less than the overall delay budget. However, for older jitter buffers, the effects of jitter above 30 or 35 ms can be catastrophic in terms of meeting user expectations for voice or other real-time applications. Clearly, knowledge of your network's ability to deal with jitter is required to define appropriate performance characteristics for the WAN.
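The smoothing behavior is easy to visualize. Here is a minimal sketch with made-up packet timings and an assumed 40-ms buffer depth (neither figure comes from any particular phone): every packet is played out at a fixed offset from the first arrival, so only packets arriving later than that offset are lost to the buffer.

```python
# Toy jitter-buffer model: variable network delay becomes a fixed playout
# delay; packets arriving after their playout time are discarded.

playout_delay_ms = 40                                 # assumed buffer depth
packets = [(0, 25), (20, 55), (40, 62), (60, 130)]    # (send_ms, arrive_ms)

first_send, first_arrive = packets[0]
for send_ms, arrive_ms in packets:
    playout_ms = first_arrive + playout_delay_ms + (send_ms - first_send)
    status = "played" if arrive_ms <= playout_ms else "discarded (too late)"
    print(f"packet sent at {send_ms:3d} ms: playout at {playout_ms} ms -> {status}")
```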
The effects of loss are evident in both the real-time and CBWFQ classes. For real-time traffic, it is possible for jitter buffers to use packet interpolation techniques to conceal the loss of up to 30 ms of voice samples. Given that a typical voice packet carries 20 ms of samples, this tells you that a loss of two or more consecutive packets will cause a blip to be heard in the voice conversation that packet interpolation techniques cannot conceal. Assuming a random-drop distribution within a single voice flow, a 0.25-percent packet drop rate within the real-time class results in a loss every 53 minutes that cannot be concealed. The enterprise must decide whether this is acceptable or whether tighter (or looser) loss characteristics are required.
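The 53-minute figure falls out of a short calculation. A sketch, assuming independent (random) drops within one voice flow and 20-ms voice packets as in the text:

```python
# Expected interval between voice gaps that packet interpolation cannot
# conceal, assuming independent drops and 20-ms voice packets.

packet_interval_s = 0.020        # each packet carries 20 ms of samples
drop_probability = 0.0025        # 0.25 percent loss in the real-time class

packets_per_second = 1 / packet_interval_s            # 50 packets per second
# One lost packet (20 ms) can be concealed; two in a row (40 ms) cannot,
# because that exceeds the ~30 ms that interpolation can cover.
p_two_consecutive = drop_probability ** 2

events_per_second = packets_per_second * p_two_consecutive
minutes_between_gaps = 1 / events_per_second / 60

print(f"Audible gap roughly every {minutes_between_gaps:.0f} minutes")   # ~53
```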
For the data classes, loss affects the attainable TCP throughput, as shown in Figure 10-2.
In Figure 10-2, you can see the maximum attainable TCP throughput for different packet-loss probabilities, given different round-trip time characteristics. As long as the throughput per class, loss, and round-trip time fall within the performance envelopes illustrated, the network should perform as required. The primary reporting concern with the data classes is how well they perform for delay and throughput, which depends almost entirely on the load offered to them by the enterprise. Should the enterprise send more than what is contracted for and configured within a data class, loss and delay grow rapidly, and the provider cannot control this. Realistically, some sort of cooperative model between the provider and enterprise is required to ensure that data classes are not overloaded, or, if they are, that performance guarantees are expected only when the class is less than 100 percent utilized.
Figure 10-2 TCP Throughput
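The relationship behind curves like those in Figure 10-2 can be approximated with the widely cited Mathis formula, throughput ≈ MSS / (RTT × √p). The sketch below uses that approximation with an assumed segment size and sample values; it is an illustration, not the data behind the figure:

```python
from math import sqrt

MSS_BYTES = 1460   # assumed TCP maximum segment size

def max_tcp_throughput_bps(rtt_s: float, loss_prob: float) -> float:
    """Approximate upper bound on one TCP flow (Mathis et al. approximation)."""
    return (MSS_BYTES * 8) / (rtt_s * sqrt(loss_prob))

for rtt_ms in (50, 100, 300):
    for loss in (0.0001, 0.001, 0.01):
        mbps = max_tcp_throughput_bps(rtt_ms / 1000, loss) / 1e6
        print(f"RTT {rtt_ms:3d} ms, loss {loss:.2%}: ~{mbps:6.2f} Mbps")
```

As the sketch shows, attainable throughput falls off with the square root of the loss probability and linearly with round-trip time, which is why keeping data classes inside their contracted load matters so much.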
Other subjects listed in the SLA are more straightforward. Availability, MTTR, and installation and upgrade performance are mostly self-explanatory:
Availability—Defines the hours that the service should be available and the percentage of time within that availability window that the service must be available without the provider's incurring penalties.
MTTR—Refers to how quickly the provider will repair faults within the network and restore service.
Installation and upgrade performance—Tells the provider how long it has to get a new site operational after the enterprise has placed the order, or how long it has to upgrade facilities should the enterprise order an upgrade.
Network Operations Training
Clearly, with a new infrastructure to support, system administrators need appropriate training in the technology itself, the procedures to use to turn up or troubleshoot new sites, and the tools they will have to assist them in their responsibilities. The question of whether to train the enterprise operations staff in the operation of MPLS VPNs (with respect to the service operation within the provider's network) is open. Some enterprises may decide that, because no MPLS encapsulation or MPLS protocols will be seen by the enterprise network operators, no training is necessary for this technology. However, experience to date has shown that when you troubleshoot issues with service provider staff, knowledge of MPLS VPN operation is helpful.
The following high-level topics were taught to a large enterprise that successfully migrated network operations to a provider-delivered MPLS VPN service. These topics can be used as a template to evaluate training offerings to see if all necessary topics are covered:
Routing protocols (PE-to-CE and BGP)
MPLS
QoS
Multicast
These topics can be covered with course outlines that are similar to the following:
Course 1: Routing on MPLS VPN Networks

Course Description
This course offers an integrated view of the PE-to-CE routing protocol and its interaction with the provider MPLS VPN, BGP, and basic MPLS/VPN operation. Both theory and hands-on practice are used to allow participants to configure, troubleshoot, and maintain networks using those protocols.

Prerequisite
Basic knowledge of TCP/IP, routing, and addressing schemes

Content
Routing (assuming EIGRP as the PE-to-CE protocol)
EIGRP introduction
EIGRP concepts and technology
EIGRP scalability
BGP route filtering and route selection
Transit autonomous systems
BGP route reflectors
BGP confederations
Local preference
Multiexit discriminator
AS-path prepending
BGP communities
Route flap dampening
MBGP
MPLS VPN technology
Terminology
MPLS VPN configuration on IOS platforms
CE-PE relations (BGP, OSPF, RIP, static)
Running EIGRP in an MPLS VPN environment
Course 2: QoS in MPLS VPNs

Course Description
This course covers the QoS issues encountered when connecting campus networks to MPLS VPN WANs.

Prerequisites
A good understanding of generic QoS tools and their utility
Basic knowledge of MPLS and IP

Content
Overview
Modular QoS command-line interface (MQC) classification and marking
CBWFQ and low-latency queuing (LLQ) (both fall into the broader category of congestion management)
Scaling QoS
QoS tunnel modes in MPLS VPN networks
Monitoring QoS performance
Course 3: Multicast

Course Description
This course describes basic multicast applications, the challenges and resolution of implementing multicast over an MPLS VPN, and basic troubleshooting of that environment.

Prerequisites
A good understanding of multicast use and configuration
Basic understanding of MPLS/VPN networks

Content
Multicast operation
PIM sparse mode
SSM
IPv6
Host-router interaction
Multicast on MPLS/VPN
Multicast Distribution Tree (MDT)
Default MDT
Data MDT
Deployment considerations
Implementation Planning
To ensure a smooth transition to the new network service, each site requires careful planning. The following is provided as an example of how to identify tasks, assign owners, and track the progress of actual versus planned activities. It is offered as a starting point for considering what activities are necessary to ensure a properly working installation at each site. This documentation exists for the following phases of the network transition:
Phase 1—Pre-cutover to ensure that all planning documents are complete and distributed
Phase 2—Connecting major sites to the new network
Phase 3—Cutover on a site-by-site basis
Phase 4—Post-cutover activities and sign-off
Phase 1
Phase 1 contains documentation that resembles Table 10-1.
Table 10-1 Typical Phase 1 Implementation Planning Tasks
Sequence | Due (by EoB) | Owner | Action
---|---|---|---
1 | 1/27/2006 | Adam | Team approval that all risks have been identified. |
2 | 1/27/2006 | Samantha | Create a plan for Tuesday the 31st to introduce core IDCs into production. |
3 | 1/31/2006 | Michael | Operations approves IDC connectivity. |
4 | 1/31/2006 | Adam | Routing team approves documentation. |
5 | 1/31/2006 | Michael | QoS team approves documentation. |
6 | 1/31/2006 | Samantha | Multicast team approves documentation. |
7 | 1/31/2006 | Mo | Documentation approved by providers. |
8 | 1/31/2006 | Samantha | Support document for engaging provider's support groups. |
Phase 2
Phase 2 is the stage of planning a major site (such as an IDC) connection to the new production network. Its completion could be monitored via a document like the one shown in Table 10-2.