COVID-19: Weekly health check of ISPs, cloud providers and conferencing services

ThousandEyes, which tracks internet and cloud traffic, is providing Network World with weekly updates on the performance of three categories of service provider: ISP, cloud provider, UCaaS


As COVID-19 continues to spread, forcing employees to work from home, ISPs, cloud providers and conferencing services, also known as unified communications as a service (UCaaS) providers, are experiencing increased traffic.

ThousandEyes is monitoring how these increases affect outages and the performance challenges these providers undergo. It will provide Network World a roundup of interesting events of the week in the delivery of these services, and Network World will provide a summary here. Stop back next week for another update, and see more details here.

Update Oct. 21

Globally, outages in all three categories fell 10%, from 261 the previous week to 236. In the US, outages fell 13%, from 128 to 111.

ISP outages worldwide declined 7%, from 199 to 185. The US saw a 15% drop, from 110 to 93.

Public cloud provider outages plunged 70% from 23 to 7 globally, while in the US they bottomed out, with zero reported outages.

Collaboration app network outages increased from two to three worldwide, with two of them occurring in the US, the same number as the week before.

A notable outage occurred about 3:30 a.m. EDT on Oct. 13, affecting the Zayo telecom network for more than 90 minutes and having an impact on other downstream providers. It started in Denver and spread to Zayo infrastructure in San Francisco, San Jose, Salt Lake City and parts of Australia and Europe. Click here for an interactive view of the outage.

Update Oct. 12

Globally, the number of outages observed in all three categories increased by 12% vs. the week before, from 233 to 261. In the US they increased 15%, from 111 to 128.

ISP outages worldwide increased 18%, from 168 to 199. In the US they rose 15%, from 96 to 110.

The number of outages in public cloud networks globally dropped from 28 to 23, a decrease of 18%. In the US they stayed steady at four.

In total there were two collaboration app outages, both in the US.

A notable outage for the week started Oct. 5 about 6 a.m. PDT and affected the collaboration app Slack. ThousandEyes tests returned 503 server errors, indicating the service was unavailable, as well as timeouts, suggesting that the application was running slower than normal. These problems were intermittent. No network issues were observed connecting to Slack’s edge servers, which are hosted within AWS. Slack confirmed issues within their backend systems and that they were resolved at 10 p.m. PDT. Click here for an interactive view of the outage.
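The distinction drawn above between 503 responses and timeouts can be sketched in a few lines of Python. This is only an illustration of how probe results might be sorted into symptoms; the function name `classify_probe` and its labels are assumptions for this example, not ThousandEyes' actual test logic.

```python
from typing import Optional


def classify_probe(status: Optional[int], timed_out: bool) -> str:
    """Map one probe result to a rough symptom label (hypothetical helper).

    `status` is the HTTP status code, or None if no response arrived.
    """
    if timed_out or status is None:
        # No answer before the deadline: the application may be slow or hung.
        return "timeout: no response within the probe's deadline"
    if status == 503:
        # The server answered, but reported that it cannot handle the request.
        return "service unavailable: server reachable but unable to serve"
    if 500 <= status <= 599:
        # Other 5xx codes (for example 502) also point at the server side.
        return "server error: fault on the application side"
    return "not a server-side failure"


print(classify_probe(503, timed_out=False))
print(classify_probe(None, timed_out=True))
```

In both failing cases a response or a timeout from a reachable front end points toward the application rather than the network path, which matches the reading given for the Slack incident.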

Update Oct. 5

Globally, the total number of outages observed across all three categories increased by 21% from the week before, from 193 to 233. The increase was reflected in the US, where outages rose from 84 to 111, a 32% increase.

The number of ISP outages worldwide increased by 15%, rising from 146 to 168, accounting for 72% of all outages observed. In the US, the number rose from 72 to 96, a 33% increase.

Globally, cloud provider outages more than doubled from 11 to 28, a 155% increase. In the US, the number rose from one to four.
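The percentages in these updates are consistent with the usual percent-change formula, (new - old) / old. A minimal Python sketch, assuming that is how the week-over-week figures are derived; the function name `percent_change` is illustrative only.

```python
def percent_change(previous: int, current: int) -> float:
    """Return the week-over-week change as a percentage of the previous count."""
    if previous == 0:
        # Growth from zero has no defined percentage.
        raise ValueError("percent change is undefined when the previous count is zero")
    return (current - previous) / previous * 100


# Figures taken from the Oct. 5 update.
print(round(percent_change(193, 233)))  # all outages, global
print(round(percent_change(11, 28)))    # cloud provider outages, global
```

Because the formula divides by the earlier count, a rise from zero cannot be expressed as a percentage, which may be why such weeks are reported as raw counts.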

For the first time in three weeks there were no collaboration-app-network outages observed globally. The week before there were two.

A notable outage occurred about 3 a.m. EDT on Sept. 30 when Cogent, a US-based multinational transit service provider, experienced a service disruption that affected users around the world attempting to access Microsoft, Amazon, Facebook, and Google services. The outage lasted 41 minutes spread over three hours and affected multiple parts of Cogent’s US network. The timing and pattern of the outage indicate traffic-engineering activity as the cause. The service was restored about 5:50 a.m. EDT. Click here for an interactive view of the outage.

Update Sept. 28

The number of outages observed worldwide in all three categories decreased by 16% from the week prior, from 230 to 193. In the U.S., the number of outages increased by 11, a 15% increase.

Global ISP outages dropped from 175 to 146, down 17%. But in the U.S., outages rose from 63 to 72, an increase of 14%.

Public cloud outages worldwide increased from eight to 11, up 38%. U.S. public cloud outages remained stable at one.

Collaboration app network outages jumped 300% globally, from one to four. Most of that was due to the 200% increase in the U.S., from one to three.

Google suffered a notable disruption about 9 p.m. EDT Sept. 24 that prevented many users around the world from accessing services including Gmail, YouTube, Google Calendar and Google Meet. Front-end servers remained reachable during the outage, but requests to access services returned errors. Google confirmed that a pool of servers that handled application traffic on the backend had crashed. Service was restored about 9:30 p.m. EDT. Click here for an interactive view of the outage.

Update Sept. 21

The number of outages reported globally in all three categories was 230 for the week Sept. 14-20, up 50% from 153 the week before. In the U.S., the count was 73 for the latest week, up two from the week before.

ISP outages rose 62%, from 108 to 175 worldwide, and from 57 to 63 in the U.S., an increase of 11%.

Public-cloud provider outages were down a third, from 12 to eight, with the count in the U.S. dropping from five to just one.

Collaboration-app network providers suffered a single outage this week, with that one occurring in the U.S. The week before, there were none.

Instagram and Amazon suffered notable outages during the week.

About 11:10 a.m. PDT on Sept. 17 Instagram experienced a service disruption that prevented many users worldwide from using the application. With no network or reachability issues with its front-end servers, and users receiving HTTP 502 error notifications, the cause appeared to be an application back-end issue. Service began to return about 11:15 a.m. PDT, with full service restored by 11:45 a.m. PDT. Click here for an interactive view of the outage.

About 2:45 p.m. EDT Sept. 14 Amazon suffered a 29-minute outage centered on nodes in Columbus, Ohio, and affecting Amazon cloud-compute instances at its Hilliard, Ohio, data center. The outage affected 99 interfaces and was contained to the one location. The impact was that some users experienced non-responsive or slow EC2 instances. The outage was cleared just past 3 p.m. EDT.

Update Sept. 14

Globally the number of outages observed between Sept. 7 and 13 in all three categories decreased by 40% from the week before, from 256 to 153, the lowest figure observed since early February. In the U.S., the number of outages dropped from 134 to 71, a 47% decrease.

The number of ISP outages worldwide dropped 50%, from 216 to 108. In the U.S. the decrease was 54%, from 123 to 57, the lowest weekly number since early April.

Cloud-provider outages globally decreased by 25%, from 16 to 12. In the U.S. they remained at five for the second consecutive week.

For the first week since early August, no collaboration app network outages were recorded globally or in the U.S.

Cogent Communications suffered three outages starting about 11:45 a.m. EDT on Sept. 11, lasting a total of 36 minutes. The three lasted 13 minutes, 4 minutes and 19 minutes, spread across just over an hour. All three centered on a Cogent node in Newark, N.J., and affected customers across the U.S. and also in the U.K., Netherlands, Canada, Mexico and India. The customers were using the network to access services such as Visa online services, Microsoft Office, and Shopify. The timing of the outages during business hours and their focus indicate that some form of control-plane condition was the cause. The problem was cleared about 12:50 p.m. EDT.

Update Sept. 7

Globally the number of outages observed in all three categories decreased by 33% from the week prior, from 381 to 256. In the U.S., the number of outages dropped by 46, decreasing from 180 to 134, a 26% decrease from the week prior.

Worldwide, the number of ISP outages decreased by 103, dropping from 319 to 216, a 32% decrease and accounting for 84% of all outages observed this week. In the U.S., the number of ISP outages decreased by 27, dropping from 150 to 123, an 18% decrease.

Cloud provider outages globally increased by a third from 12 to 16. In the U.S., outages more than doubled, rising from two to five.

There was just one collaboration app provider outage worldwide -- not in the U.S. -- down from two.

PCCW Global suffered two outages starting about 12:40 a.m. EDT Sept. 3, one lasting 20 minutes and the other six. The first centered on PCCW nodes located in Atlanta, Ga., and affected services using the Charlotte Colocation and Affiliated Computer Services networks. The second started about half an hour after the first cleared, centered on PCCW nodes in Ashburn, Va., and affected access to Oracle Cloud services. Both outages were cleared by 1:45 a.m. EDT. The likely cause was a traffic-engineering exercise.

About 6 p.m. PDT, Comcast suffered a four-minute outage that affected users in the western U.S. It centered on Comcast core devices in Sunnyvale, Calif., and mainly affected services across Comcast Xfinity networks (Comcast Cable Communications). The outage would likely have caused internet connectivity slowdowns and disruption for users.

Update Aug. 31

Globally the number of outages observed across all three categories increased by 29% from the week prior, rising from 296 to 381. This was the largest number of outages recorded in a single week this year. In the U.S. outages increased 70% compared to the week prior from 106 to 180.

The vast majority of outages were due to ISP problems. Worldwide the number jumped from 214 to 319, with the count in the U.S. growing from 80 to 150.

Public cloud outages declined worldwide from 27 to 12 and from four to two in the U.S.

Collaboration app network outages stayed steady at two worldwide, with both occurring in the U.S., where the count rose from zero to two.

CenturyLink suffered a major outage just after 6 a.m. EDT Aug. 30 that hit a broad range of providers and businesses including Twitter, Microsoft (Xbox Live), Discord, Reddit, Cloudflare, OpenDNS, and Hulu. Shortly after the outage began, providers started rerouting traffic from CenturyLink to alternate providers in an effort to alleviate the impact. However, given the size and distribution of CenturyLink’s network, many services were still unreachable, ThousandEyes said. At 8:13 a.m. EDT, CenturyLink announced it was investigating issues affecting some services within its Mississauga, Ontario, Canada data center. Having identified the cause as an incorrect flowspec announcement from the Mississauga data center, CenturyLink requested that its Tier 1 internet provider partners de-peer and ignore any traffic coming from its network. (BGP flow specification, or flowspec, is a feature that allows operators to rapidly deploy and propagate filter policies among a large number of BGP peer routers.) To resolve the issue, CenturyLink reset all the equipment and started with clean BGP routing tables, a process that took almost five hours to complete. Just before 3 p.m. EDT, CenturyLink announced that the issue had been resolved and all services had been restored.

Update Aug. 24

Globally the total number of outages observed across all three categories during the week Aug. 17-23 increased by 21% compared to the week prior, rising from 245 to 296. In the U.S., outages rose from 90 to 106, an increase of 18% from the week prior.

ISP outages worldwide rose from 166 to 214 and from 72 to 80 in the U.S.

Public cloud network outages dropped worldwide from 28 to 27, and stayed the same in the U.S. at four.

Collaboration app network outages rose from zero to two globally, but remained at zero in the U.S.

ThousandEyes flagged three notable outages during the week.

Just after 8 a.m. EDT on Aug. 18, Spotify suffered an outage that prevented users from streaming songs from the service. The outage lasted just over an hour, during which the service would play songs for a few seconds, then pause and return an error. The outage is believed to be associated with an expired TLS certificate. Click here for an explanation on the impact of certificate expiration.
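Certificate expiry of this kind can be watched for ahead of time. The sketch below, using only Python's standard library, computes how many days remain before a certificate's notAfter timestamp. The helper name `days_until_expiry` and the dates shown are made up for illustration and have nothing to do with the certificate involved in this incident.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the text format found in the dict returned by
    ssl.SSLSocket.getpeercert(), e.g. "Jan 10 00:00:00 2030 GMT".
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400  # 86,400 seconds per day


# Illustrative timestamps only: "now" is pinned so the result is repeatable.
pinned_now = ssl.cert_time_to_seconds("Jan 10 00:00:00 2030 GMT")
print(days_until_expiry("Jan 11 00:00:00 2030 GMT", now=pinned_now))
```

A monitoring job could call a helper like this on the notAfter field of each served certificate and raise an alert when the value falls below a chosen threshold.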

About 11:30 p.m. EDT on Aug. 17, Equinix suffered a power outage to a colocation center in Docklands, London. About 2 a.m. the failure of an output static switch from a UPS system triggered a fire alarm, resulting in loss of power for multiple customers. At 3:50 a.m. services started to be restored and were fully restored by 4:50 p.m. EDT. Affected customers included BT, Sky, Virgin Media, Giganet, Epsilon, SiPalto, EX Networks, Fast2Host, ICUK.net, and Evoke Telecom.

About 10:50 p.m. PDT on Aug. 19 Cogent Networks suffered a 36-minute outage affecting U.S. users’ access to Microsoft networks and associated services, as well as CDN content for services such as TikTok and ESPN. The outage affected nodes across the U.S. and apparently resulted from a configuration adjustment. A second outage two hours later at 11:26 p.m. PDT lasted 24 minutes and likely was connected to the first outage’s configuration adjustment. It affected users in the U.S., Asia-Pacific and Europe, Mid-East and Africa. Click here for an interactive view of the outages.

Update Aug. 17

Global outages across all three categories fell between the weeks of Aug. 3-9 and Aug. 10-16 from 294 to 245 (-17%) and in the U.S. from 123 to 90 (-27%).

ISP outages dropped worldwide from 227 to 166 and from 109 to 72 in the U.S.
