C P Manoharan, Director, Business Development, APAC South, Spirent Communications

The rapid growth of artificial intelligence (AI) is impacting large data centres, as high volumes of traffic and intensive processing requirements push the limits of hyperscale facilities. To successfully support AI’s rapid growth, data centre architectures, and the high-speed networks they rely on, must be re-evaluated. AI application model complexity and size dictate the level of compute, memory, and network type needed to connect AI accelerators, such as the graphics processing units (GPUs) used for training and inferencing. At the same time, AI workloads are driving unprecedented demand for low-latency, high-bandwidth connectivity between servers, storage, and accelerators.

The required scale does not come from simply adding racks to a data centre. Handling large AI training and inference workloads requires a separate, scalable, routable backend network infrastructure to connect distributed GPU nodes. AI applications place less strain on the frontend Ethernet networks, where general-purpose servers handle data ingestion for the training process.

The requirements for the new backend network differ considerably from those of traditional data centre frontend access networks. In addition to higher traffic and increased network bandwidth per accelerator, the backend network needs to support thousands of synchronised parallel jobs, as well as data- and compute-intensive workloads. The network must be scalable, and provide the low-latency, high-bandwidth connectivity between servers, storage, and GPUs that is essential for AI training and inferencing.

Data centre networks must transform to support new AI workloads

The AI data centre journey is just beginning and will change dramatically as AI evolves, promising to be both transformative and expensive. Data centre architectures should be evaluated sooner rather than later, as new strategies will be required for success. Generative AI (GenAI) applications, alongside other emerging technologies, are poised to accelerate a new era of high-speed Ethernet backend networks for data centres.

Field deployments of 400G Ethernet have started, 800G chipsets are being manufactured, and standards specifications are in development for 1.6 Terabit Ethernet, with each iteration representing a doubling of bandwidth. Backend AI networks are projected to migrate quickly, with nearly all port speeds at 800 Gbps and above by 2027, and bandwidth growing at a triple-digit compound annual growth rate (CAGR).
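The arithmetic behind these projections is worth making concrete. The sketch below is purely illustrative (the 10x-in-two-years figure is a hypothetical example, not a Spirent projection); it shows how generation-over-generation doubling and a triple-digit CAGR relate:

```python
# Illustrative sketch only: per-port rates double each Ethernet generation,
# and the article projects triple-digit CAGR for total backend bandwidth.
# The 10x-over-2-years figure below is a hypothetical example.

def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1

# Per-port rate doubling across generations (Gbps): 400G -> 800G -> 1.6T.
generations = [400, 800, 1600]
for a, b in zip(generations, generations[1:]):
    assert b == 2 * a  # each iteration doubles bandwidth

# Two generation jumps in two years is itself a 100% CAGR per port.
print(f"per-port: {cagr(400, 1600, 2) * 100:.0f}% CAGR")

# Hypothetical: total backend bandwidth growing 10x over 2 years.
print(f"aggregate: {cagr(1.0, 10.0, 2) * 100:.0f}% CAGR")
```

Port-speed doubling alone yields a 100% per-port CAGR; adding port-count growth on top is what pushes aggregate bandwidth into triple digits.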

New Ethernet technologies for better AI networking

For large-scale AI deployments, latency sensitivity can have a large impact on training performance, so some operators take the more deterministic flow-control approach offered by InfiniBand. High-speed Ethernet and InfiniBand are expected to coexist in data centre backend networks for the foreseeable future.

Many organisations have begun deploying 400G and 800G Ethernet with the advanced RoCE v2 protocol (Remote Direct Memory Access over Converged Ethernet, version 2) as the data centre switch fabric. This low-cost data transfer network increases efficiency, improves central processing unit (CPU) utilisation and network performance, reduces network latency, and increases available bandwidth.
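Unlike InfiniBand's credit-based flow control, RoCE v2 fabrics typically rely on ECN congestion marks and sender-side rate control (DCQCN-style algorithms). The following is a heavily simplified, hypothetical sketch of that behaviour; the step size and cut factor are illustrative values, and real DCQCN is considerably more elaborate:

```python
# Hypothetical, simplified sketch of DCQCN-style rate control as used on
# RoCE v2 fabrics: the sender cuts its rate multiplicatively when it
# receives a congestion notification (triggered by ECN marks in the IP
# header) and recovers additively otherwise. Constants are illustrative.

LINE_RATE_GBPS = 400.0
ADDITIVE_STEP_GBPS = 5.0   # recovery step per interval (hypothetical)
ALPHA = 0.5                # congestion estimate / cut factor (hypothetical)

def next_rate(rate: float, cnp_received: bool) -> float:
    """Return the sender's next transmit rate in Gbps."""
    if cnp_received:
        return rate * (1 - ALPHA / 2)  # multiplicative decrease on congestion
    return min(rate + ADDITIVE_STEP_GBPS, LINE_RATE_GBPS)  # additive recovery

rate = LINE_RATE_GBPS
rate = next_rate(rate, cnp_received=True)   # congestion: 400 -> 300 Gbps
rate = next_rate(rate, cnp_received=False)  # recovery:   300 -> 305 Gbps
print(rate)
```

The quick-cut, slow-recover shape is what keeps queues short, and short queues are what deliver the low latency AI training traffic needs.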

A sound test and assurance strategy is the gateway to AI success

Test solutions help validate that cost-intensive AI/machine learning (ML) infrastructure is being used to its full capability, so that organisations can safely unlock AI’s full potential and turn marginal gains into monumental results by:

Quantifying use cases –

  • Identify AI-powered use cases that offer clear business outcomes where quality datasets are available.
  • Use digital twins to cost-efficiently and rapidly test use case efficacy and value, and provide feedback loops for continuous AI model learning.

Developing a data architecture and management strategy –

  • Data architecture, management, and hygiene should be addressed at an early stage to avoid cost shock, poor data quality, and inaccurate or biased AI models.
  • Validate that data centre interconnect architectures can cope with the volume of data and high-speed data transfers and access required by AI learning and inference clusters. Consider 400G/800G Ethernet supporting RoCE v2 to address the requirements of high performance, low latency, and a low-cost data transfer network.
  • Use real test data to accelerate AI model training with realistic scenarios and unique variations relevant to intended environments.
  • Use continuous test data from the live network to keep AI models current.

Pursuing automation –

  • Invest in an automation framework first, integrating AI to enhance and supercharge related processes.
  • Start with lower-risk internal processes and environments like labs and test beds to intelligently automate repetitive tasks, streamline complex processes, and reduce human error.

Ensuring efficacy –

  • AI, and especially GenAI (which is in its infancy), can present inaccurate information as though it were correct. Bad or erroneous data is often to blame, but so is misalignment with desired business outcomes.
  • Continuously verify AI recommendation efficacy against golden scenarios and desired outcomes while providing closed-loop feedback for learning.
  • Use digital twins to provide a safe and realistic offline validation environment.
  • Use active testing in the operational networks to rapidly verify implemented recommendations and provide feedback loops for reinforcement or to trigger resolutions.
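The verification loop described in these bullets can be sketched as a simple harness that replays golden scenarios, compares the AI system's recommendations against desired outcomes, and collects mismatches as closed-loop feedback. Every name below is hypothetical:

```python
# Hypothetical sketch of continuous efficacy verification: replay "golden"
# scenarios, compare the AI system's recommendation with the desired
# outcome, and collect mismatches as feedback for further learning.

from typing import Callable

GOLDEN_SCENARIOS = [
    # (scenario input, desired outcome) - illustrative placeholders
    ({"link_utilisation": 0.95}, "add_capacity"),
    ({"link_utilisation": 0.30}, "no_action"),
]

def verify(recommend: Callable[[dict], str]) -> list:
    """Return (scenario, expected, actual) for every mismatch."""
    feedback = []
    for scenario, expected in GOLDEN_SCENARIOS:
        actual = recommend(scenario)
        if actual != expected:
            feedback.append((scenario, expected, actual))
    return feedback

# A toy recommender standing in for the AI model under test.
toy_model = lambda s: "add_capacity" if s["link_utilisation"] > 0.8 else "no_action"
print(len(verify(toy_model)))  # 0 mismatches: model agrees with golden set
```

In practice the golden set would come from a digital twin or the live network, and the mismatch list would feed the reinforcement loop rather than a print statement.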

Testing security –

  • Explore using modern security solutions that are evolving to utilise AI to enhance their effectiveness and to counter AI-generated attacks.
  • Continuously test the efficacy of those security solutions for threat detection, false positives, prevention, and remediation response using hyper-realistic attacks, hacker attack behaviour, and evasion techniques.
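Efficacy testing of this kind ultimately reduces to measurable rates. A minimal, hypothetical sketch of scoring a security solution's verdicts from a controlled campaign of known-malicious and known-benign traffic:

```python
# Hypothetical sketch: score a security solution's verdicts from a
# controlled test campaign where ground truth is known for every flow.
# Detection rate and false-positive rate are the two headline metrics.

def score(verdicts: list) -> tuple:
    """verdicts: (is_attack, flagged) pairs -> (detection_rate, fp_rate)."""
    attacks = [flagged for is_attack, flagged in verdicts if is_attack]
    benign = [flagged for is_attack, flagged in verdicts if not is_attack]
    detection_rate = sum(attacks) / len(attacks)
    fp_rate = sum(benign) / len(benign)
    return detection_rate, fp_rate

# Illustrative campaign: 4 attacks (3 caught), 4 benign flows (1 flagged).
results = [(True, True), (True, True), (True, True), (True, False),
           (False, False), (False, True), (False, False), (False, False)]
print(score(results))  # (0.75, 0.25)
```

Tracking these rates continuously, rather than at a single certification point, is what catches efficacy drift as attacks and evasion techniques evolve.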