We enable AI ASICs to accelerate the full toolchain of agentic AI—not just inference—eliminating multi-hardware orchestration and delivering order-of-magnitude gains in serving efficiency.
The industry is shifting from single-model serving to multi-step agentic execution, but the underlying infrastructure is still built for model inference alone.
Today's agentic systems split execution across CPUs, GPUs, and AI ASICs. Every device hop adds latency, bandwidth cost, and engineering complexity.
Multi-step agents call tools dozens of times per query. Each CPU round-trip adds milliseconds that compound into seconds of end-to-end latency.
AI ASICs offer massive throughput—matrix engines, vector units, high-bandwidth memory—but that capacity is exercised only during model forward passes, sitting idle while tools execute.
We build the compiler and runtime framework that maps non-AI workloads onto AI ASICs, turning them into a general-purpose agentic serving substrate.
We systematically profile the agentic tool loop to identify compute-intensive operations that can benefit from ASIC acceleration—cryptographic primitives, data transforms, structured parsing, and more.
Our compiler maps non-AI operations to ASIC hardware primitives—converting high-precision modular arithmetic to INT8 matrix multiplies, aligning memory layouts to eliminate fine-grained shuffles, and scheduling operations for maximum throughput.
A lightweight runtime orchestrates both AI inference and tool execution on the same accelerator, keeping intermediate data on-chip and removing the CPU from the critical path of multi-step agent workflows.
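The runtime idea above can be sketched as a toy mock: one executor owns the accelerator, and both the model step and the tool step are kernels submitted to the same queue, so intermediates never leave device memory. All names here (`Executor`, `submit`, the placeholder kernels) are illustrative stand-ins, not the actual CROSS API.

```python
class Executor:
    """Stand-in for a single-accelerator command queue."""
    def __init__(self):
        self.device_mem = {}          # simulated on-chip buffers
        self.host_round_trips = 0     # the number we want to keep at zero

    def submit(self, kernel, src, dst):
        # Both inference and tool kernels read and write device buffers;
        # nothing is copied back to the host between steps.
        self.device_mem[dst] = kernel(self.device_mem[src])

def model_step(tokens):               # placeholder "forward pass"
    return [t + 1 for t in tokens]

def tool_step(values):                # placeholder "tool" kernel
    return [v * 2 for v in values]

ex = Executor()
ex.device_mem["prompt"] = [1, 2, 3]
ex.submit(model_step, "prompt", "hidden")   # inference on-device
ex.submit(tool_step, "hidden", "result")    # tool also on-device
assert ex.device_mem["result"] == [4, 6, 8]
assert ex.host_round_trips == 0             # CPU never entered the loop
```

The point of the sketch is only the control flow: in a split CPU/ASIC design, the second `submit` would be a host round-trip instead.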
Drop-in integration with existing cloud AI infrastructure. Compatible with TPUs, Trainium, MAIA, and similar platforms. No hardware modifications required—pure software unlock of latent ASIC capability.
AI ASICs should not be defined by their original purpose, but by their computational capabilities—high-throughput matrix engines, vector units, and efficient coarse-grained data movement.
Converts high-precision modular operations (1000+ bit) into INT8 matrix multiplications that map directly to AI ASIC matrix engines. Includes lazy modular reduction and chunk-wise multiplication for maximum hardware utilization.
1024-bit modular arithmetic → INT8 MatMul on MXU
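A minimal sketch of the chunk-wise idea, in plain Python: decompose 1024-bit operands into INT8-sized limbs, form the product with only small-integer multiply-accumulates (the shape a matrix engine provides), and defer carries until a single final pass, in the spirit of lazy reduction. The limb width is the natural INT8 choice; everything else here is illustrative, not the actual compiler's parameters.

```python
BASE_BITS = 8              # INT8-sized limbs
BASE = 1 << BASE_BITS

def to_limbs(x, n):
    """Decompose x into n base-2^8 limbs, least significant first."""
    return [(x >> (BASE_BITS * i)) & (BASE - 1) for i in range(n)]

def limb_mul(a_limbs, b_limbs):
    """Schoolbook limb product: every a_i * b_j is a small-integer
    multiply-accumulate -- exactly what an INT8 matrix engine executes.
    Carries are deferred ('lazy'), accumulating in wide registers."""
    out = [0] * (len(a_limbs) + len(b_limbs))
    for i, a in enumerate(a_limbs):
        for j, b in enumerate(b_limbs):
            out[i + j] += a * b      # INT8 x INT8 -> wide accumulator
    return out

def from_limbs(limbs):
    """Single carry-propagation pass after all lazy accumulation."""
    return sum(l << (BASE_BITS * i) for i, l in enumerate(limbs))

a, b = 2**1023 + 12345, 2**1022 + 67890    # ~1024-bit operands
n = 128                                     # 128 INT8 limbs = 1024 bits
prod = from_limbs(limb_mul(to_limbs(a, n), to_limbs(b, n)))
assert prod == a * b
```

On hardware, the doubly nested loop becomes one batched matrix multiply over limb vectors, which is what makes the MXU mapping pay off.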
Eliminates fine-grained data shuffles by restructuring data layouts for ASIC memory hierarchies. Optimizes NTT and polynomial operations for coarse-grained data movement patterns.
Fine-grained shuffles → Coarse-grained ASIC-native moves
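To make the movement trade-off concrete, here is a toy NTT expressed as a dense matrix-vector product mod p: the transform becomes one coarse matmul (ASIC-native) instead of log N rounds of fine-grained butterfly shuffles. The parameters (p = 17, N = 4, ω = 4) are toy values chosen so the roots of unity exist; they are not the production NTT parameters.

```python
P, N, OMEGA = 17, 4, 4      # 4 is a primitive 4th root of unity mod 17

def ntt_matrix(omega):
    """Dense transform matrix: entry (i, j) = omega^(i*j) mod P."""
    return [[pow(omega, i * j, P) for j in range(N)] for i in range(N)]

def matvec(M, v):
    """One coarse matrix-vector product mod P -- the ASIC-friendly shape."""
    return [sum(m * x for m, x in zip(row, v)) % P for row in M]

fwd = ntt_matrix(OMEGA)
inv_omega = pow(OMEGA, P - 2, P)     # omega^-1 mod P (Fermat inverse)
n_inv = pow(N, P - 2, P)             # N^-1 mod P
bwd = [[(pow(inv_omega, i * j, P) * n_inv) % P for j in range(N)]
       for i in range(N)]

v = [3, 1, 4, 1]
assert matvec(bwd, matvec(fwd, v)) == v    # exact modular roundtrip
```

A butterfly NTT does less arithmetic but demands element-level shuffles at every stage; the dense form trades FLOPs (which the matrix engine has in abundance) for layout regularity.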
Five-layer framework—Packing, Mapping, Scheduling, Decomposing, and Binding—that systematically transforms arbitrary compute workloads into ASIC-native operations.
Production-ready implementations of HE operations (CKKS): multiplication, rotation, rescaling, basis conversion, and NTT acceleration—all running natively on AI ASICs.
View on GitHub →
Accelerates ZKP primitives on AI ASICs—enabling proof generation and verification workloads to run on the same unified substrate alongside AI inference and HE operations.
View on GitHub →
Our first proof-of-concept: running the full CKKS homomorphic encryption stack natively on Google TPUs. Presented at ASPLOS 2026 and published at HPCA 2026.
Full CKKS scheme: encode, encrypt, compute, decrypt—all on TPU
Matrix engines (MXU) run cryptographic kernels via BAT transformation
Bit-exact match with OpenFHE, interoperable ciphertext format
Dramatically faster than CPU-based HE for privacy-preserving AI
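The fixed-point bookkeeping at the heart of CKKS can be shown in a few lines: reals are scaled to integers by a factor Δ, multiplication doubles the scale, and a rescale step divides one factor of Δ back out. Real CKKS performs these steps on encrypted polynomials; this toy shows only the plaintext scale arithmetic, with an illustrative Δ.

```python
DELTA = 2**20                      # illustrative scaling factor

def encode(x):
    """Scale a real to a fixed-point integer 'message'."""
    return round(x * DELTA)

def decode(m):
    return m / DELTA

def mul_then_rescale(m1, m2):
    # The raw product carries scale DELTA^2; rescaling drops one
    # factor of DELTA so subsequent ops see a consistent scale.
    return (m1 * m2) // DELTA

a, b = encode(1.5), encode(2.25)
c = mul_then_rescale(a, b)
assert abs(decode(c) - 1.5 * 2.25) < 1e-5
```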
Learn CKKS homomorphic encryption from scratch
Read guide →
Number Theoretic Transform implementations and optimizations
Explore →
From TPU setup to writing high-performance kernels
Get started →
Research frontiers in cryptography acceleration
View challenges →
Existing solutions improve the software layer around agents. We go one layer deeper—making the full agentic workload run natively on AI infrastructure.
| Company | Layer | What They Do |
|---|---|---|
| LangChain / LangGraph | Control Plane | Durable execution, memory, streaming, human-in-the-loop orchestration |
| LlamaIndex | Control Plane | Event-driven, async-first workflow engine for multi-step agents |
| CrewAI / Composio | Control Plane | Multi-agent collaboration, tool integrations, governance |
| OpenAI / Anthropic / Cloud | Control Plane | SDKs, managed services for tools, sessions, tracing, deployment |
| CROSS | Execution Substrate | Move the full agent loop onto AI ASICs—inference + tools + crypto on unified hardware |
TPU v4 / v5e / v6e
Trainium / Inferentia
MAIA 100
MTIA
Cloud vendors are the first to feel the pain of agentic serving—they must support increasingly complex workflows where the bottleneck is no longer just model inference, but the surrounding tools and system functions.
Our team brings together expertise in AI accelerator architecture, compiler design, cryptography, and cloud infrastructure from leading research institutions and industry.
We're working with cloud providers to bring unified agentic serving to production. Let's talk about how CROSS can transform your AI ASIC fleet.