India’s artificial intelligence (AI) strategy is moving from conversation to capacity. The core elements of a national AI ecosystem are already visible – pools of compute, growing storage and a network fabric that ties them together. Affordable GPU-as-a-service and domestic access to competitive compute are changing the landscape for start-ups, research labs and enterprises that previously relied on distant clouds. Compute is necessary but not sufficient. The next layer is a data engine capable of feeding models with clean, contextual datasets, because classification, structuring and contextualisation matter as much as raw processing power. Paired with home-grown models and smaller language models, that combination of compute and curated data becomes the backbone for inclusive, local intelligence.

That backbone must be distributed rather than concentrated. Instead of one gargantuan campus, the practical model favours many right-sized regional clusters that co-locate compute, storage and relevant datasets with universities, labs and industry hubs. These clusters reduce the local friction of land, power and clearances, and they align compute density with local grid realities. When linked by high-capacity fibre and optical switching, the clusters form a national fabric that supports both hyperscale training and low-latency inference at the edge.

With the cluster model in place, the engineering focus turns to the fabric itself. AI at scale is as much a networking challenge as a compute one. Networks must shift from a coverage mentality to one that prioritises capacity, latency and resilience. New fibre routes, lower-loss cables and diverse conduits reduce the risk of correlated failures. Software-defined and intent-based networking lets traffic be steered by policy, so model updates and synchronous training flows can be prioritised over latency-sensitive inference when necessary.
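As a rough illustration of what intent-based steering can look like, the sketch below maps hypothetical traffic classes (synchronous training, model updates, edge inference, best effort) to priority markings. The class names, DSCP values and queue numbers are illustrative assumptions; a real deployment would express such intents in an SDN controller rather than in application code.

```python
from dataclasses import dataclass

# Hypothetical policy table; a real intent-based system (e.g. an SDN controller)
# would compile intents like these into device configuration.
POLICY = {
    "training_sync":  {"dscp": 46, "queue": 0},  # synchronous gradient exchange, highest priority
    "model_update":   {"dscp": 34, "queue": 1},  # bulk but deadline-bound weight pushes
    "edge_inference": {"dscp": 26, "queue": 2},  # latency-sensitive, pre-emptable per policy
    "best_effort":    {"dscp": 0,  "queue": 3},
}

@dataclass
class Flow:
    src: str
    dst: str
    traffic_class: str

def mark_flow(flow: Flow) -> dict:
    """Translate a high-level intent (traffic class) into per-flow markings."""
    policy = POLICY.get(flow.traffic_class, POLICY["best_effort"])
    return {"src": flow.src, "dst": flow.dst, **policy}

if __name__ == "__main__":
    flows = [
        Flow("gpu-cluster-a", "gpu-cluster-b", "training_sync"),
        Flow("edge-pop-7", "regional-dc-2", "edge_inference"),
    ]
    for f in flows:
        print(mark_flow(f))
```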

Data centre design must change in parallel with the network. Optical circuit switching, tight GPU clustering and disaggregated scale-out architectures aim to boost utilisation, cut interconnect latency and protect capex. GPUs anchor cloud training, CPUs manage aggregation and orchestration, and low-power system-on-chip and field-programmable gate array devices host inference at the periphery. Each layer is chosen for its balance of latency, energy and cost; in practice, there is no single blueprint. Siting decisions are shaped by sovereignty and affordability, with domestic chip initiatives, local cloud options and low-cost power locations all influencing where clusters go.
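A toy sketch of that tiering logic, not a prescriptive blueprint, is shown below: a workload is routed to a GPU cluster, a CPU tier or edge silicon depending on whether it trains or infers and on its latency budget. The threshold and tier names are assumptions for illustration.

```python
from typing import Optional

def place_workload(kind: str, latency_budget_ms: Optional[float] = None) -> str:
    """Route a workload to an illustrative hardware tier by role and latency budget."""
    if kind == "training":
        return "cloud GPU cluster"
    if kind == "inference" and latency_budget_ms is not None and latency_budget_ms <= 20:
        return "edge SoC / FPGA"
    return "regional CPU tier (aggregation and orchestration)"

if __name__ == "__main__":
    print(place_workload("training"))                           # cloud GPU cluster
    print(place_workload("inference", latency_budget_ms=10))    # edge SoC / FPGA
    print(place_workload("inference", latency_budget_ms=200))   # regional CPU tier
```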

Sustainability cannot be an afterthought. New processor generations and platform consolidation materially reduce server counts and power use. Liquid cooling and improved power usage effectiveness are part of modern data centre planning. Lifecycle issues, such as embodied carbon, cooling at edge sites and eventual e-waste, are now integrated into procurement decisions. Practical steps such as energy-rating systems for hardware and predictive energy management are being built into planning so that operations and environmental goals move together.
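A minimal, purely illustrative sketch of predictive energy management follows: forecast near-term site load from recent telemetry and defer deferrable batch work when a power budget would be exceeded. The budget, telemetry values and naive forecast are assumptions, not a production method.

```python
from statistics import mean

POWER_BUDGET_KW = 900.0                        # assumed facility power budget
recent_load_kw = [640, 710, 760, 820, 850]     # illustrative telemetry, last five intervals

def forecast_next_interval(history: list) -> float:
    """Naive forecast: recent moving average plus the latest trend."""
    trend = history[-1] - history[-2]
    return mean(history[-3:]) + trend

def should_defer(batch_job_kw: float, history: list) -> bool:
    """Defer the batch job if forecast load plus the job would breach the budget."""
    return forecast_next_interval(history) + batch_job_kw > POWER_BUDGET_KW

if __name__ == "__main__":
    print(forecast_next_interval(recent_load_kw))                      # ~840 kW
    print(should_defer(batch_job_kw=120.0, history=recent_load_kw))    # True: defer the job
```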

From infrastructure to intelligence

Infrastructure and hardware set the stage, but the immediate priority is getting data ready and governed so that models can run reliably. Accessible, well-structured data is a first-order need. Raw signals such as logs, telemetry, sector key performance indicators (KPIs) and transaction records are only useful when catalogued, anonymised and made queryable. Data engineering, spanning schema standardisation, data catalogues and tools that make logs AI-ready, is therefore as strategic as compute. Where data is abundant and well-managed, the same intelligence can be reused across sectors – mobility insights tied to transport telemetry, retail signals combined with network KPIs, or rural health indicators feeding targeted interventions. Good data engineering becomes useful only when accompanied by governance. Clear privacy guardrails, anonymisation standards and accountable sharing frameworks enable legitimate collaboration. Regulatory sandboxes and privacy-preserving techniques, such as federated learning or differential privacy, allow multiple parties to collaborate without sharing raw data. When datasets are governed and contextualised, domain large language models and specialised agents become possible: models that reflect local vocabulary, sector rules and compliance needs.
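To make one of those privacy-preserving techniques concrete, the sketch below applies a Laplace mechanism, the textbook building block of differential privacy, to release a noisy count from telemetry records. The dataset, epsilon value and query are illustrative; real deployments need formal sensitivity analysis and privacy accounting.

```python
import random

def dp_count(records: list, predicate, epsilon: float = 1.0) -> float:
    """Release a count with Laplace noise calibrated to epsilon (differential privacy)."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    # Difference of two exponentials with rate epsilon/sensitivity is Laplace noise
    # with scale sensitivity/epsilon.
    noise = (random.expovariate(epsilon / sensitivity)
             - random.expovariate(epsilon / sensitivity))
    return true_count + noise

if __name__ == "__main__":
    # Illustrative telemetry records; field names are assumptions.
    telemetry = [
        {"district": "D1", "latency_ms": 42},
        {"district": "D2", "latency_ms": 95},
        {"district": "D1", "latency_ms": 60},
    ]
    print(dp_count(telemetry, lambda r: r["latency_ms"] > 50, epsilon=0.5))
```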

Democratising access amplifies those gains. Shared capacity models, such as capacity leasing, AI-as-a-service and public test hubs, let start-ups and universities experiment without prohibitive upfront costs. Developer portals, subsidised GPU time and university-industry testbeds ensure that access is broad rather than exclusive. That practical access, combined with rigorous data engineering, is how domain models get trained on contextually relevant datasets.

Once data and access are in place, economic use cases follow quickly. Distributed compute creates optionality – idle edge resources during off-peak hours can be repurposed for inference or batch training, turning spare capacity into revenue. Network slices and dynamic allocation of physical resource blocks let operators offer differentiated connectivity for enterprises or mission-critical systems. Localised agents and domain application programming interfaces (APIs) turn base language models into productised workflows for retail, mobility, health and agriculture, creating packages customers will pay for. Realising those commercial models requires standards and integration. Open interfaces and modular APIs reduce vendor lock-in and speed deployment; without them, experimentation slows and costs rise. System integration, however, remains a bottleneck. Pairing radios with distributed units, validating bands and running end-to-end tests is time-consuming. Emerging automation tools such as AI-assisted testing and large-model-driven compliance checks show promise in shortening validation cycles, though these approaches remain exploratory. Ultimately, execution will matter as much as policy.
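The compliance-checking step that such tools aim to automate can be illustrated with a plain rule-based sketch: compare a candidate site configuration against a golden baseline and flag deviations. The parameter names, values and baseline are assumptions, and no language model is involved here.

```python
# Illustrative golden baseline for a radio site; keys and values are assumptions.
GOLDEN_BASELINE = {
    "band": "n78",
    "tx_power_dbm": 43,
    "encryption": "enabled",
    "sync_source": "gps",
}

def compliance_report(candidate: dict) -> list:
    """Return a list of settings that deviate from the golden baseline."""
    findings = []
    for key, expected in GOLDEN_BASELINE.items():
        actual = candidate.get(key, "<missing>")
        if actual != expected:
            findings.append(f"{key}: expected {expected!r}, found {actual!r}")
    return findings

if __name__ == "__main__":
    site_config = {"band": "n78", "tx_power_dbm": 46, "encryption": "disabled"}
    for finding in compliance_report(site_config):
        print("NON-COMPLIANT ->", finding)
```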

Operational readiness is the final link that turns prototypes into services. Responsible deployment needs operational guardrails such as explainability, auditing, encryption and robust policy frameworks. Models must be auditable, and datasets versioned and governed. Preparedness for threats, from cybersecurity incidents to future cryptographic risks, must be part of the operational checklist. Explainability and policy clarity are prerequisites for scaled public use and inter-organisational sharing.

That operational discipline includes integration, testing and machine learning (ML) operations. System integration is where strategy meets the real world; many integration tasks can take months, and automation is the only realistic way to compress them. AI-assisted testing and compliance automation can generate test cases, compare configurations and flag non-compliant settings automatically. But automation only works with governance – model versioning, controlled roll-outs, rollback plans and continuous monitoring are non-negotiable. Explainability, audit trails and robust performance metrics make models auditable and trusted by operations teams.

People convert capability into sustained practice. Large-scale skilling programmes, hackathons, university-industry testbeds and developer portals build grassroots momentum. Developer pipelines must be pragmatic, with sandboxed environments, reproducible datasets, continuous integration/continuous delivery for models and observability for both performance and fairness. Organisations that combine top-down mandates with bottom-up champions will move AI adoption from partial use to widespread practice.
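As a concrete illustration of the roll-out discipline described above, the sketch below shows a hypothetical model registry with a canary promotion that rolls back when an observed error rate breaches a budget. The registry, traffic share and threshold are assumptions rather than any specific platform's API.

```python
CANARY_TRAFFIC_SHARE = 0.05    # assumed share of traffic sent to the candidate
ERROR_RATE_THRESHOLD = 0.02    # assumed error budget for the canary

class ModelRegistry:
    """Hypothetical registry tracking which model version serves traffic."""
    def __init__(self):
        self.active = None
        self.previous = None
        self.artifacts = {}

    def register(self, version: str, artifact: str):
        self.artifacts[version] = artifact

    def promote(self, version: str):
        self.previous, self.active = self.active, version

    def rollback(self):
        if self.previous is not None:
            self.active = self.previous

def canary_rollout(registry: ModelRegistry, candidate: str, observed_error_rate: float) -> str:
    """Promote the candidate, then roll back if the monitored error rate breaches the budget."""
    registry.promote(candidate)                 # canary now receives CANARY_TRAFFIC_SHARE of traffic
    if observed_error_rate > ERROR_RATE_THRESHOLD:
        registry.rollback()                     # error budget breached: restore the previous version
        return "rolled back"
    return "promoted"

if __name__ == "__main__":
    reg = ModelRegistry()
    reg.register("v1", "models/v1"); reg.promote("v1")
    reg.register("v2", "models/v2")
    print(canary_rollout(reg, "v2", observed_error_rate=0.035))   # rolled back
    print(reg.active)                                             # v1 stays active
```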

Towards responsible scale

Scaling AI nationally means reconciling ambition with environmental and regulatory realities. High-density GPU clusters require a step change in power provisioning (tens to hundreds of megawatts per campus) and advanced cooling strategies such as liquid cooling. Aligning clusters with local renewable generation and grid realities reduces systemic risk and improves resilience. Procurement must account for embodied carbon, cooling at edge sites and end-of-life disposal, so that sustainability is built into the full deployment lifecycle.

Monetisation strategies exist, but they run into regulatory boundaries. Edge inference-as-a-service, time-sliced GPU leasing, verticalised model APIs and dynamic slicing are technically feasible ways to capture value. Yet policy constraints, such as net-neutrality debates, data-sovereignty rules and state-level land and power allocations, shape what can actually be offered and where. Practical deployment, therefore, requires co-design with regulators and state authorities. Data centre parks with pre-provisioned power and connectivity, simplified environmental clearances and clear frameworks for data sharing are operational enablers. When execution barriers are addressed, monetisation models can move from pilots to scale.

Readiness comes down to practical, measurable items. Data must be catalogued, anonymised and queryable under clear governance. Compute must be right-sized into clusters, with tight GPU clustering where needed and edge system-on-chip and field-programmable gate array devices for inference. Network fabric must be high capacity and low loss, with diverse conduits and intent-based routing. ML operations and testing capability must include versioned models, automated testing, rollback plans and observability. Skills and culture require developer access, training, hackathons and cross-functional teams. Policy and execution need regulatory sandboxes, data centre parks and incentives for energy-efficient design. Sustainability calls for embodied carbon accounting, energy-rating systems and predictive energy management.

When these boxes are ticked, the payoff will be twofold – local value creation and exportable domain intelligence. Clustered compute, governed data, accessible capacity and calibrated policy will enable India to build models and services that reflect its languages, sectors and regulatory realities. The result will not only be better local services but also contextualised solutions that can be adapted outside the country. This practical fabric of regional clusters, tied by a national backbone, governed data flows and workable commercial models, will be the pragmatic path forward. If done right, it will accelerate innovation, improve public services and create products that represent India’s scale and context on the global stage.

Based on discussions during the session titled “Democratising Intelligence: Building India’s AI Infrastructure” at India Mobile Congress 2025.