My humble beginnings
Back in the early 2000s, I was the sole network engineer at a startup. In the mornings, my role included managing four floors and 22 European locations, packed with different vendors' equipment and servers, across three companies. In the evenings, I administered the largest enterprise streaming network in Europe with a group of highly skilled staff.
Since we were an early startup, combined roles were the norm. I’m sure that most of you who joined as young engineers in such situations can understand how I felt back then. However, it was a good experience, so I battled through it. To keep my evenings stress-free and without any IT calls, I had to design in as much high availability (HA) as I possibly could. After all, all the interesting technological learning was in the second part of my day, working with content delivery mechanisms and complex routing.
All of this came back to me when I read a recent post on Cato Networks' self-healing SD-WAN for global enterprise networks.
Cato is enriching the self-healing capabilities of Cato Cloud. Rather than requiring the enterprise to have the skill and knowledge to think about every type of failure in an HA design, the Cato Cloud now heals itself end-to-end, ensuring service continuity.
The importance of redundancy
HA is a necessity for application stability. It is usually misinterpreted as a value add-on, although it is a must-have component. Our digital transformation relies on network stability; we therefore require a stable and consistent networking experience.
Delivering an always-on, highly available network is easier said than done. Local redundancy isn't enough; you need to plan through multiple layers of failover across the entire network and security infrastructure. This includes layers at the device, site, regional and global levels.
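To make the idea of layered failover concrete, here is a minimal sketch in Python. It assumes each candidate path has already been health-checked by something else; the Path class, the select_path function and the path names are all hypothetical and only illustrate preferring the most local healthy layer before falling back to a wider one.

```python
from dataclasses import dataclass
from typing import Optional

# Failover layers named in the text, ordered from most local to most global.
LAYERS = ["device", "site", "regional", "global"]


@dataclass
class Path:
    name: str      # hypothetical label for a forwarding path
    layer: str     # which layer of redundancy this path belongs to
    healthy: bool  # result of a health check assumed to run elsewhere


def select_path(paths: list[Path]) -> Optional[Path]:
    """Return the healthy path at the most local layer, or None if all are down."""
    for layer in LAYERS:
        for path in paths:
            if path.layer == layer and path.healthy:
                return path
    return None


# Illustrative state: device and site redundancy have both failed.
paths = [
    Path("standby-supervisor", "device", healthy=False),
    Path("second-edge-router", "site", healthy=False),
    Path("neighboring-region", "regional", healthy=True),
    Path("remote-continent", "global", healthy=True),
]

chosen = select_path(paths)
print(chosen.name if chosen else "no path available")
```

With the device and site layers both down, the sketch falls through to the regional path. A real design also has to decide how health is measured and how quickly to fail back, which this example deliberately leaves out.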
Every end-to-end component needs to be made redundant, which increases design complexity and the recurring costs of the additional equipment. This additional equipment may only be used for minimal periods, alongside spares kept in storage.
But the more equipment, the more complex the HA interaction. Within the IT network that I ran, for example, a particular vendor offered local device redundancy with supervisor engine failovers that could perform nonstop forwarding upon a primary supervisor failure. So essentially, the brain of the device was redundant.
The configuration was designed as per validated designs and tested appropriately during the deployment phase. However, it had a limitation: when there was a software bug or hardware failure, it never worked when I wanted it to.
I was often left to defend myself to the chief executive officer (CEO) without a professional explanation. Most of the time, I ended up just saying “sometimes these things just don't work.” As a result, even today when I hear of self-healing and nonstop forwarding, I always stop to take a breath.
Site-to-site redundancy
So I knew in the back of my mind that designing high availability for a single device was not 100 percent foolproof, but I still had to move on to high availability design in the enterprise between multiple diverse locations. Each site had different edge equipment. Let’s just say it was a complicated project.
I broke down my high availability strategy into regions, generally based on latency. During intervals at night, I would replicate data between different regions. Although this worked most of the time, any interference with latency would cause the job to fail.
My career quickly moved on to designing high availability for both greenfield and brownfield MPLS networks. Designing within the framework of such network types took extensive skill and effort. It is a challenge to get it right, and it requires knowledge and skill along with a lot of testing, feedback, and documentation.
In the mind of the engineer
Today’s infrastructure is more diverse and interconnected than before. There are even more moving parts. All of this makes HA complicated, and it gets even more complicated when the HA design lives solely in the mind of the engineer rather than in a central database where it can be modeled, updated and controlled.
How an engineer designs often depends on that individual's day-to-day working practice and previous technical knowledge. There are so many ways to design high availability, and as many ways to shoot yourself in the foot. If the design is in the mind of the engineer, it will consist of manual configurations that cause issues when a location fails.
Daisy chain of manual events
A location failure results in a daisy chain of manual events. Engineers must manually update policies in the firewalls and other security or networking appliances. There have never been any ‘follow-the-network’ security rules that change dynamically with the network.
More importantly, when connectivity is finally restored, you need to make sure that the outdated security rules do not break the application service.
The rollback process was usually a document put together by someone who had already left the organization. It would consist of a variety of steps, for example: if node 1 fails in data center two, change policy x on firewall A to policy y on firewall B. The list goes on. Most of the time we would just wait for the big bang.
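One way to picture the alternative is to hold that runbook as data instead of a document, so the rollback can be derived rather than remembered. The sketch below is only an illustration of the example step above: PolicyMove, RUNBOOK, failover_steps and rollback_steps are hypothetical names, and nothing here talks to a real firewall.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyMove:
    """One manual step: move traffic from one firewall policy to another."""
    from_firewall: str
    from_policy: str
    to_firewall: str
    to_policy: str

    def undo(self) -> "PolicyMove":
        # The rollback of a move is the same move in the opposite direction.
        return PolicyMove(self.to_firewall, self.to_policy,
                          self.from_firewall, self.from_policy)


# Hypothetical runbook keyed by (failed node, data center).
RUNBOOK: dict[tuple[str, str], list[PolicyMove]] = {
    ("node-1", "data-center-2"): [
        PolicyMove("firewall-A", "policy-x", "firewall-B", "policy-y"),
    ],
}


def failover_steps(node: str, data_center: str) -> list[PolicyMove]:
    """Steps to apply when the given node fails; empty if nothing is planned."""
    return list(RUNBOOK.get((node, data_center), []))


def rollback_steps(node: str, data_center: str) -> list[PolicyMove]:
    """Undo the failover steps in reverse order once connectivity is restored."""
    return [step.undo() for step in reversed(failover_steps(node, data_center))]


for step in failover_steps("node-1", "data-center-2"):
    print("apply:", step)
for step in rollback_steps("node-1", "data-center-2"):
    print("rollback:", step)
```

Keeping the steps in one structure means the rollback is always the mirror image of the failover, which is exactly the property a stale document cannot guarantee.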
Wait for the big bang
This is certainly not something you can test and pre-plan for. You just have to wait for the big bang to happen, which is usually around the three-year mark, and analyze what happens then.
I thought to myself, wouldn't it be great if we could push all this complexity to the cloud and let the cloud take care of it?
My take on Cato
Cato’s self-healing capabilities minimize the chances that problems will crop up in HA design. Cato Cloud replaces the myriad of appliances, VNFs and standalone services that make up the network with a single processing software engine for routing, optimizing, and securing all WAN and Internet traffic. A simpler network is a more reliable one.
Then Cato builds self-healing from the data center through the Cato network and to the remote office. Changes to the network automatically cause updates to security policies. It’s the kind of HA integration I wish I had when I ran my network.
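As a rough illustration of what security rules that follow the network could look like, the sketch below writes rules against logical names and resolves them against the current topology each time it changes. This is not Cato's implementation; Rule, compile_rules and the site, service and gateway names are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source_site: str   # logical site name, not an IP address
    dest_service: str  # logical service name
    action: str        # e.g. "allow" or "deny"


# Hypothetical topology: where each logical service is currently reachable.
topology = {"erp": "dc1-gateway"}

rules = [Rule("branch-paris", "erp", "allow")]


def compile_rules(rules: list[Rule], topology: dict[str, str]) -> list[tuple[str, str, str]]:
    """Resolve logical rules into concrete (source, gateway, action) entries."""
    return [
        (rule.source_site, topology[rule.dest_service], rule.action)
        for rule in rules
        if rule.dest_service in topology
    ]


before = compile_rules(rules, topology)

# Simulated failover: the service moves to a second data center.
topology["erp"] = "dc2-gateway"
after = compile_rules(rules, topology)

print(before)
print(after)
```

The rule itself never changes; only its resolved form does. That is the property the manual daisy chain lacked, because every firewall entry had to be edited by hand after a failure and again after recovery.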
Now that Cato has fully converged self-healing into its security and networking cloud platform, it has revolutionized the way we look at SD-WAN. By remediating network failures, updating the security infrastructure, and adapting workflows according to business priority, it is ushering in a new era of SD-WAN.