Jay Krishnan’s Track Record in Large-Scale AI Infrastructure
Jay Krishnan is widely regarded as an authority on secure, large-scale AI platforms. Over the past decade, he has led cloud engineering teams that automated multi-region disaster-recovery drills with zero downtime, designed regulator-approved confidential-computing stacks for financial services, and authored reference blueprints on burst GPU training that are cited by industry groups focused on sustainable compute. He is a regular speaker at regional cloud summits, where his talks center on elastic AI and governance.
His recent collaboration with senior leadership at NAIB IT Consultancy W.L.L, where the General Manager – AI & Cybersecurity oversees emerging AI infrastructure and cybersecurity practices across Dubai and Bahrain, reflects the growing importance of scalable, stateless architectures in enterprise innovation.
Why Burst Training Needs a Stateless Control Plane
Traditional trainers reserve GPUs for hours even when most time is lost to I/O or gradient exchange. Jay Krishnan argues that workloads such as prompt tuning, vector embedding, and contrastive learning gain little from that model.
“Each sample is independent,” he explains. “Compute should appear for ninety seconds, finish its tensor math, then disappear.”
The team therefore designed an orchestration layer where Cloud Run or Lambda issues shards, tracks metadata, and releases capacity the moment a task completes.
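As a rough illustration of that control loop, here is a minimal in-memory sketch. The `Dispatcher` class, the shard size, and the lambda worker are stand-ins for the actual Cloud Run / Lambda dispatch path described in the article, not the team's implementation:

```python
import queue
import uuid

def make_shards(dataset, shard_size):
    """Split a dataset into independent shards, one task per shard."""
    return [dataset[i:i + shard_size] for i in range(0, len(dataset), shard_size)]

class Dispatcher:
    """Issues shard tasks, tracks metadata, and releases capacity on completion."""
    def __init__(self, shards):
        self.tasks = queue.Queue()
        self.metadata = {}  # task_id -> status
        for shard in shards:
            task_id = str(uuid.uuid4())
            self.metadata[task_id] = "pending"
            self.tasks.put((task_id, shard))

    def run(self, worker):
        results = []
        while not self.tasks.empty():
            task_id, shard = self.tasks.get()
            self.metadata[task_id] = "running"
            results.append(worker(shard))    # stand-in for a serverless invocation
            self.metadata[task_id] = "done"  # capacity released the moment the task completes
        return results

shards = make_shards(list(range(10)), shard_size=4)
d = Dispatcher(shards)
out = d.run(lambda shard: sum(shard))
print(out)  # [6, 22, 17]
```

The point of the sketch is the lifecycle: each shard is an independent task, metadata is tracked per task, and nothing is held once a task finishes.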
Architectural Blueprint
Dispatch layer:
Worker layer:
Aggregation layer:
Cold-Start Economics
Failure Modes and Their Fixes
- Task duplication appeared when Redis visibility timeouts expired before kernel completion; longer timeouts and idempotent writes removed the problem.
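The duplication failure and its fix can be modeled in a few lines, assuming a Redis-style lease with a visibility timeout; the class and names here are illustrative, not the production code:

```python
class VisibilityQueue:
    """Toy model of a work queue with a visibility timeout: a leased task
    reappears if it is not acknowledged before the timeout expires."""
    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.pending = []
        self.leased = {}  # task_id -> lease expiry

    def put(self, task_id):
        self.pending.append(task_id)

    def lease(self, now):
        # Re-queue leases whose timeout expired before the worker finished.
        for task_id, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[task_id]
                self.pending.append(task_id)
        if not self.pending:
            return None
        task_id = self.pending.pop(0)
        self.leased[task_id] = now + self.visibility_timeout
        return task_id

    def ack(self, task_id):
        self.leased.pop(task_id, None)

results = {}  # idempotent store: writes are keyed by task_id

def write_result(task_id, value):
    results.setdefault(task_id, value)  # a duplicate delivery cannot overwrite

q = VisibilityQueue(visibility_timeout=5)
q.put("shard-0")
t0 = q.lease(now=0)    # first delivery
t1 = q.lease(now=10)   # timeout expired before completion: same task redelivered
write_result(t0, "v1")
write_result(t1, "v2") # duplicate write is a no-op
print(results)         # {'shard-0': 'v1'}
```

A long kernel outlasting a short timeout produces the duplicate delivery; the idempotent write makes the duplicate harmless, which is why the combination of longer timeouts plus idempotent writes removed the problem.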
- Burst throttling on Lambda triggered at roughly thirty-five thousand invocations per minute; spreading invocations across two additional regions and adding jitter smoothed throughput.
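The extra-regions-plus-jitter fix amounts to a scheduling policy; the region names, intervals, and jitter fraction below are hypothetical, not drawn from the deployment:

```python
import random

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # hypothetical region list

def schedule(task_index, base_interval=0.02, jitter=0.5):
    """Assign each invocation a region round-robin and a jittered start offset,
    so bursts spread across regions instead of hitting one region's
    throttle boundary in lockstep."""
    region = REGIONS[task_index % len(REGIONS)]
    offset = task_index * base_interval
    offset += random.uniform(0, jitter * base_interval)
    return region, offset

plan = [schedule(i) for i in range(6)]
```

Round-robin divides the per-minute invocation load by the number of regions, and the random offset breaks up synchronized spikes that would otherwise all arrive at the throttle limit together.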
- Version drift occurred when container tags diverged from dataset hashes; digest pinning and SHA-based data URLs eliminated mismatches.
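One way to realize SHA-based data URLs, sketched with Python's `hashlib`; the URL layout shown is an assumption, not the team's actual scheme:

```python
import hashlib

def sha_url(base, payload):
    """Content-addressed data URL: the path embeds the SHA-256 of the bytes,
    so a worker can verify that the dataset it fetched matches dispatch."""
    digest = hashlib.sha256(payload).hexdigest()
    return f"{base}/datasets/sha256/{digest}", digest

def verify(url, payload):
    """Reject a shard whose bytes do not hash to the digest in its URL."""
    return url.rsplit("/", 1)[-1] == hashlib.sha256(payload).hexdigest()

data = b"training shard bytes"
url, digest = sha_url("https://example.internal", data)
assert verify(url, data)
assert not verify(url, b"tampered bytes")
```

Container images get the same treatment by referencing them by immutable digest (`name@sha256:...`) instead of a mutable tag, so a re-pulled image can never silently diverge from the dataset it was validated against.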
Governance at Scale
Leadership Perspective
- Serverless functions can coordinate GPU bursts at enterprise scale while keeping control-plane latency low.
- Cold-start penalties are manageable; warm pools and snapshotting keep latency acceptable for batch workloads.
- Governance remains intact through automated metadata capture, region caps, and image-age policies.
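A minimal sketch of how such automated gates might look; the region caps, the thirty-day image-age limit, and the job fields are hypothetical illustrations of the policies the article names:

```python
from datetime import datetime, timedelta, timezone

REGION_CAPS = {"us-east-1": 2000, "eu-west-1": 1000}  # hypothetical per-region GPU caps
MAX_IMAGE_AGE = timedelta(days=30)                    # hypothetical image-age policy

audit_log = []  # metadata captured automatically on every admission

def admit(job, active_per_region, now=None):
    """Gate a burst job: enforce region caps and image age, and record an
    audit row automatically rather than relying on manual process."""
    now = now or datetime.now(timezone.utc)
    region, requested = job["region"], job["gpus"]
    if region not in REGION_CAPS:
        return False, "region not approved"
    if active_per_region.get(region, 0) + requested > REGION_CAPS[region]:
        return False, "region cap exceeded"
    if now - job["image_built"] > MAX_IMAGE_AGE:
        return False, "container image too old"
    audit_log.append({"job": job["id"], "region": region,
                      "gpus": requested, "admitted_at": now.isoformat()})
    return True, "admitted"

job = {"id": "burst-42", "region": "us-east-1", "gpus": 500,
       "image_built": datetime.now(timezone.utc) - timedelta(days=3)}
ok, reason = admit(job, active_per_region={"us-east-1": 1200})
print(ok, reason)  # True admitted
```

Because the audit row is written inside the admission path itself, governance metadata exists for every burst by construction rather than by convention.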
As one executive from NAIB IT Consultancy W.L.L remarked, “This model aligns perfectly with our vision of agile and cost-efficient AI deployment across borders.”
“We treat GPUs as a transient utility,” Jay Krishnan concludes. “When training ends, the fleet dissolves. Finance gets a lower bill, security trusts the isolation model, and scientists iterate without waiting.”