Embracing the AI Revolution: How AI Has Transformed Networks Forever

Thu, 17th Aug 2023

I remember the moment I realized AI was going to change everything for us.

Five years ago, when the founders of Mist Systems (then a technology partner of ours) stepped into our offices so we could explore business opportunities, they introduced us to a groundbreaking idea – an AI-driven network platform capable of autonomously troubleshooting and resolving network issues before they even became apparent to users.

It all sounded great. But since AI had been a buzzword for decades, I remained skeptical about its potential – that is, until I witnessed Mist's real AI in action. In fact, our own IT team had already been feeling the impact of Mist from a trial deployment they had been running. The platform could self-diagnose and remediate issues in real time, with an accuracy and speed we had not seen before.

In that moment, I realized just how big a deal AI was about to be for both Juniper and the world.

As many of you know, our AIOps platform has since become a cornerstone of Juniper's strategy. While the rest of the industry continues to talk, we are now in our 7th generation of industry-leading AI that has revolutionized the network and paved the way for exceptional user experiences. A 90% reduction in worldwide trouble tickets at a global software company. 85% fewer store visits at a multinational retailer. The fastest branch network rollout in the history of a national mobile operator.

Clearly, our AI-driven Juniper Mist platform has been a game changer for thousands of organizations.

But that's really just the beginning.

Having witnessed Juniper Mist's success, we knew it would only be a matter of time before AI applications exploded more broadly – and with them, the scale of AI models and data centers.

This is where it gets even more interesting for Juniper.

The AI Data Center: Connecting the AI Revolution

A while ago, silicon companies discovered that the graphics processing units (GPUs) they made for gaming use cases are very well suited for the type of learning and inference workloads that AI executes.

But a single GPU can only do so much AI processing on its own. Modern AI/ML clusters comprise hundreds or sometimes thousands of GPUs that provide the massive, parallel computational power that is required to train today's AI models.

And, of course, it is the network that ties these GPUs together and enables them to operate as a single, extremely powerful AI processing system.

Previous technology revolutions such as the cloud, mobile or streaming services have pushed networks to new heights, but the traffic in data centers generated from distributed machine-learning workloads dwarfs that of most other applications. AI requirements to communicate large datasets and solve for billions – even trillions – of model parameters stress the network like never before.

To put it in perspective: a typical GPU cluster of the kind our customers are looking to deploy, running at maximum performance, carries roughly as much network traffic every second as all of the internet traffic across America. And to understand the economics of an AI data center, know that GPU servers can cost as much as $400,000 each. So, maximizing GPU utilization and minimizing GPU idle time is one of the most important drivers of AI data center design.
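To make the economics concrete, here is a minimal back-of-the-envelope sketch in Python. Only the $400,000 server price comes from the figure above; the function name `idle_capital`, the cluster size of 100 servers and the 75% utilization are hypothetical values chosen purely for illustration, and the model deliberately ignores power, cooling and depreciation.

```python
def idle_capital(num_servers, cost_per_server, utilization):
    """Estimate the capital tied up in GPU servers that sit idle.

    utilization is the fraction of time the GPUs do useful work (0.0 to 1.0).
    This is a deliberately crude model: it looks at purchase price only.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    return num_servers * cost_per_server * (1.0 - utilization)


# Hypothetical cluster: 100 servers at the $400,000 price cited in the text,
# with GPUs busy 75% of the time (an assumed figure, not a measurement).
wasted = idle_capital(num_servers=100, cost_per_server=400_000, utilization=0.75)
print(f"Capital effectively idle: ${wasted:,.0f}")
```

Under these assumed numbers, every additional point of utilization the network can recover is worth a meaningful slice of the hardware budget, which is why idle time weighs so heavily on the design.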

Distributing the workloads across the GPUs and then syncing them to train the AI model requires a new type of network, one that can accelerate "job completion time" (JCT) and reduce the time the system spends waiting for the last GPU to finish its calculations ("tail latency").
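The link between tail latency and JCT can be illustrated with a small Python sketch. It assumes a simplified model in which every training step ends only when the slowest GPU has finished, so a single delayed GPU holds up the whole cluster. The function names, the 1.0-unit baseline step time and the 0.5-unit delay are hypothetical, chosen only to show the effect; real collective communication is far more involved.

```python
import random
import statistics


def step_time(gpu_times):
    """A synchronized step finishes only when the slowest GPU finishes."""
    return max(gpu_times)


def simulate_job(num_gpus, num_steps, straggler_prob, seed=0):
    """Return (job completion time, sum of per-step average GPU times).

    Each GPU normally needs 1.0 time unit per step. With probability
    straggler_prob it is delayed by 0.5 units (for example by network
    congestion). All values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    jct = 0.0
    average_work = 0.0
    for _ in range(num_steps):
        times = []
        for _ in range(num_gpus):
            t = 1.0
            if rng.random() < straggler_prob:
                t += 0.5
            times.append(t)
        jct += step_time(times)                 # everyone waits for the tail
        average_work += statistics.mean(times)  # what a typical GPU needed
    return jct, average_work


jct, average_work = simulate_job(num_gpus=512, num_steps=100, straggler_prob=0.01)
print(f"JCT: {jct:.1f}  average per-GPU work: {average_work:.1f}")
```

In this toy model, the gap between the two numbers is time the GPUs spend idle waiting on stragglers. The larger the cluster, the more likely it is that at least one GPU is delayed in any given step, which is why a network that trims the tail matters more as clusters grow.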

Data center networks optimized for AI/ML, therefore, must have special capabilities around congestion management, load balancing, latency, and above all else, minimizing JCT. These are system attributes that Juniper has excelled at for years. And as model sizes and datasets continue to grow, ML practitioners must accommodate more GPUs into their clusters. The network fabric should support seamless scalability without compromising performance or introducing communication bottlenecks.

As an engineer by trade who started my career at Juniper building highly specialized ASICs that unlocked internet growth in the 90s, I've had a front-row seat over the years to innovation cycles that have enabled our industry to reach new levels of scale, performance, and speed.

AI networking represents a once-in-a-generation inflection point that will present us with complex technical challenges for years to come. And I believe we have the pieces at Juniper to enable this future. For us, it means sticking to what I'm calling the three commandments of AI data center networks:

1. High Performance
Maximizing GPU utilization, the overarching economic factor in AI model training, requires a network that optimizes for JCT and minimizes tail latency. Faster model training means faster time to results, but it also means a less expensive data center with better optimized compute resources.

From day one, Juniper has been silicon-agnostic, and this commitment gives our customers different options for spine, leaf and data center interconnect, optimizing for various factors such as power efficiency and scale. We offer a broad portfolio of systems based on third-party and in-house designed silicon that are powering the largest networks on the planet while also providing customers at varying stages of their AI journey the flexibility to meet their needs and constraints.

2. Open Infrastructure
Performance matters, which is why everyone invests in it. But then… economics takes over. And economics is driven by competition, and competition is driven by openness. We've seen this play out in our industry before. And if I were a betting man, I'd bet that Ethernet wins. Again. An open platform maximizes innovation. It's not that proprietary technologies don't have their roles to play, but seldom does a single purveyor of technology out-innovate the rest of the market. And it simply never happens in environments where there is so much at stake. Juniper firmly supports the Ethernet standard and its powerful vendor ecosystem, including the new Ultra Ethernet Consortium. That open ecosystem drives down costs, spurs innovation and ultimately overtakes proprietary approaches like InfiniBand.

Along with the rest of the vast Ethernet ecosystem, Juniper continues to innovate networking technologies that speed data transfer, provide lossless transmission and enhance congestion control – critical aspects to powering the AI revolution.

3. Experience-first Operations
Data center networks are becoming increasingly complex, and new protocols must be added to the fabric to meet AI workload performance demands. While complexity will continue to go up, intent-based automation shields the network operator from that complexity. Juniper approaches the data center with a multivendor and operations-first mentality. We are adding extensions for AI clusters to Junos and our Apstra data center fabric management and automation solution. And by the way, Apstra is the industry's only multivendor platform of its kind. Because what good is open if you're locked in operationally after the first purchase?

AI is here, and there's no going back.

Juniper has already proven the impact that AI has in simplifying the management of wired, wireless and wide area networks to dramatically improve end-user experiences as well as the lives of network operators. But the pressure that machine learning and large-language models have put on networks will require us to keep innovating and solving new challenges.

And yes, these challenges are nothing short of extremely difficult. But solving the hardest problems around the globe is what has always driven us at Juniper. We are driven by a purpose to power connections and empower change in whatever form that may take. We're bringing to bear our legacy of high performance and our obsession with experience-first operations.

I'm confident Juniper's approach to data center networking will allow a new era of AI to flourish.
