We enable AI ASICs to accelerate the full toolchain of agentic AI—not just inference—eliminating multi-hardware orchestration and delivering order-of-magnitude gains in serving efficiency.
The industry is shifting from single-model serving to multi-step agentic execution, but the underlying infrastructure is still built for model inference alone.
Today's agentic systems split execution across CPUs, GPUs, and AI ASICs. Every device hop adds latency, bandwidth cost, and engineering complexity.
Multi-step agents call tools dozens of times per query. Each CPU round-trip adds milliseconds that compound into seconds of end-to-end latency.
AI ASICs offer massive throughput—matrix engines, vector units, high-bandwidth memory—but that capacity is exercised only during model forward passes, sitting idle while tools execute.
We build the compiler and runtime framework that maps non-AI workloads onto AI ASICs, turning them into a general-purpose agentic serving substrate.
We systematically profile the agentic tool loop to identify compute-intensive operations that can benefit from ASIC acceleration—cryptographic primitives, data transforms, structured parsing, and more.
Our compiler maps non-AI operations to ASIC hardware primitives—converting high-precision modular arithmetic to INT8 matrix multiplies, aligning memory layouts to eliminate fine-grained shuffles, and scheduling operations for maximum throughput.
A lightweight runtime orchestrates both AI inference and tool execution on the same accelerator, keeping intermediate data on-chip and removing the CPU from the critical path of multi-step agent workflows.
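The runtime idea above can be sketched as a toy mock: one executor owns the accelerator, and both the model step and the tool step are kernels submitted to the same queue, so intermediates never leave device memory. All names here (`Executor`, `submit`, the placeholder kernels) are illustrative stand-ins, not the actual CROSS API.

```python
class Executor:
    """Stand-in for a single-accelerator command queue."""
    def __init__(self):
        self.device_mem = {}          # simulated on-chip buffers
        self.host_round_trips = 0     # the number we want to keep at zero

    def submit(self, kernel, src, dst):
        # Both inference and tool kernels read and write device buffers;
        # nothing is copied back to the host between steps.
        self.device_mem[dst] = kernel(self.device_mem[src])

def model_step(tokens):               # placeholder "forward pass"
    return [t + 1 for t in tokens]

def tool_step(values):                # placeholder "tool" kernel
    return [v * 2 for v in values]

ex = Executor()
ex.device_mem["prompt"] = [1, 2, 3]
ex.submit(model_step, "prompt", "hidden")   # inference on-device
ex.submit(tool_step, "hidden", "result")    # tool also on-device
assert ex.device_mem["result"] == [4, 6, 8]
assert ex.host_round_trips == 0             # CPU never entered the loop
```

The point of the sketch is only the control flow: in a split CPU/ASIC design, the second `submit` would be a host round-trip instead.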
Drop-in integration with existing cloud AI infrastructure. Compatible with TPUs, Trainium, MAIA, and similar platforms. No hardware modifications required—pure software unlock of latent ASIC capability.
AI ASICs should not be defined by their original purpose, but by their computational capabilities—high-throughput matrix engines, vector units, and efficient coarse-grained data movement.
Converts high-precision modular operations (1000+ bit) into INT8 matrix multiplications that map directly to AI ASIC matrix engines. Includes lazy modular reduction and chunk-wise multiplication for maximum hardware utilization.
1024-bit modular arithmetic → INT8 MatMul on MXU
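A minimal sketch of the chunk-wise idea, in plain Python: decompose 1024-bit operands into INT8-sized limbs, form the product with only small-integer multiply-accumulates (the shape a matrix engine provides), and defer carries until a single final pass, in the spirit of lazy reduction. The limb width is the natural INT8 choice; everything else here is illustrative, not the actual compiler's parameters.

```python
BASE_BITS = 8              # INT8-sized limbs
BASE = 1 << BASE_BITS

def to_limbs(x, n):
    """Decompose x into n base-2^8 limbs, least significant first."""
    return [(x >> (BASE_BITS * i)) & (BASE - 1) for i in range(n)]

def limb_mul(a_limbs, b_limbs):
    """Schoolbook limb product: every a_i * b_j is a small-integer
    multiply-accumulate -- exactly what an INT8 matrix engine executes.
    Carries are deferred ('lazy'), accumulating in wide registers."""
    out = [0] * (len(a_limbs) + len(b_limbs))
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            out[i + j] += a * b      # INT8 x INT8 -> wide accumulator
    return out

def from_limbs(limbs):
    """Single carry-propagation pass after all lazy accumulation."""
    return sum(l << (BASE_BITS * i) for i, l in enumerate(limbs))

a, b = 2**1023 + 12345, 2**1022 + 67890    # ~1024-bit operands
n = 128                                     # 128 INT8 limbs = 1024 bits
prod = from_limbs(limb_mul(to_limbs(a, n), to_limbs(b, n)))
assert prod == a * b
```

On hardware, the doubly nested loop becomes one batched matrix multiply over limb vectors, which is what makes the MXU mapping pay off.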
Eliminates fine-grained data shuffles by restructuring data layouts for ASIC memory hierarchies. Optimizes NTT and polynomial operations for coarse-grained data movement patterns.
Fine-grained shuffles → Coarse-grained ASIC-native moves
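To make the movement trade-off concrete, here is a toy NTT expressed as a dense matrix-vector product mod p: the transform becomes one coarse matmul (ASIC-native) instead of log N rounds of fine-grained butterfly shuffles. The parameters (p = 17, N = 4, ω = 4) are toy values chosen so the roots of unity exist; they are not the production NTT parameters.

```python
P, N, OMEGA = 17, 4, 4      # 4 is a primitive 4th root of unity mod 17

def ntt_matrix(omega):
    """Dense transform matrix: entry (i, j) = omega^(i*j) mod P."""
    return [[pow(omega, i * j, P) for j in range(N)] for i in range(N)]

def matvec(M, v):
    """One coarse matrix-vector product mod P -- the ASIC-friendly shape."""
    return [sum(m * x for m, x in zip(row, v)) % P for row in M]

fwd = ntt_matrix(OMEGA)
inv_omega = pow(OMEGA, P - 2, P)     # omega^-1 mod P (Fermat inverse)
n_inv = pow(N, P - 2, P)             # N^-1 mod P
bwd = [[(pow(inv_omega, i * j, P) * n_inv) % P for j in range(N)]
       for i in range(N)]

v = [3, 1, 4, 1]
assert matvec(bwd, matvec(fwd, v)) == v    # exact modular roundtrip
```

A butterfly NTT does less arithmetic but demands element-level shuffles at every stage; the dense form trades FLOPs (which the matrix engine has in abundance) for layout regularity.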
Five-layer framework—Packing, Mapping, Scheduling, Decomposing, and Binding—that systematically transforms arbitrary compute workloads into ASIC-native operations.
Production-ready implementations of HE operations (CKKS): multiplication, rotation, rescaling, basis conversion, and NTT acceleration—all running natively on AI ASICs.
View on GitHub →
Accelerates ZKP primitives on AI ASICs—enabling proof generation and verification workloads to run on the same unified substrate alongside AI inference and HE operations.
View on GitHub →
Our first proof-of-concept: running the full CKKS homomorphic encryption stack natively on Google TPUs. Presented at ASPLOS 2026 and published at HPCA 2026.
Full CKKS scheme: encode, encrypt, compute, decrypt—all on TPU
Matrix engines (MXU) run cryptographic kernels via BAT transformation
Bit-exact match with OpenFHE, interoperable ciphertext format
Dramatically faster than CPU-based HE for privacy-preserving AI
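The fixed-point bookkeeping at the heart of CKKS can be shown in a few lines: reals are scaled to integers by a factor Δ, multiplication doubles the scale, and a rescale step divides one factor of Δ back out. Real CKKS performs these steps on encrypted polynomials; this toy shows only the plaintext scale arithmetic, with an illustrative Δ.

```python
DELTA = 2**20                      # illustrative scaling factor

def encode(x):
    """Scale a real to a fixed-point integer 'message'."""
    return round(x * DELTA)

def decode(m):
    return m / DELTA

def mul_then_rescale(m1, m2):
    # The raw product carries scale DELTA^2; rescaling drops one
    # factor of DELTA so subsequent ops see a consistent scale.
    return (m1 * m2) // DELTA

a, b = encode(1.5), encode(2.25)
c = mul_then_rescale(a, b)
assert abs(decode(c) - 1.5 * 2.25) < 1e-5
```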
Learn CKKS homomorphic encryption from scratch
Read guide →
Number Theoretic Transform implementations and optimizations
Explore →
From TPU setup to writing high-performance kernels
Get started →
Research frontiers in cryptography acceleration
View challenges →
Existing solutions improve the software layer around agents. We go one layer deeper—making the full agentic workload run natively on AI infrastructure.
| Company | Layer | What They Do |
|---|---|---|
| LangChain / LangGraph | Control Plane | Durable execution, memory, streaming, human-in-the-loop orchestration |
| LlamaIndex | Control Plane | Event-driven, async-first workflow engine for multi-step agents |
| CrewAI / Composio | Control Plane | Multi-agent collaboration, tool integrations, governance |
| OpenAI / Anthropic / Cloud | Control Plane | SDKs, managed services for tools, sessions, tracing, deployment |
| CROSS | Execution Substrate | Move the full agent loop onto AI ASICs—inference + tools + crypto on unified hardware |
TPU v4 / v5e / v6e
Trainium / Inferentia
MAIA 100
MTIA
Cloud vendors are the first to feel the pain of agentic serving—they must support increasingly complex workflows where the bottleneck is no longer just model inference, but the surrounding tools and system functions.
Our team brings together expertise in AI accelerator architecture, compiler design, cryptography, and cloud infrastructure from leading research institutions and industry.
We're working with cloud providers to bring unified agentic serving to production. Let's talk about how CROSS can transform your AI ASIC fleet.