AIMax

High-throughput inference orchestration across models, devices, and modalities.

A production C++ inference library built for embedded servers, agents, desktop apps, and enterprise AI platforms that need speed, batching, model management, and broad hardware coverage.

One-line install

powershell -c "irm https://esacode.com/install/aimax.ps1 | iex"

What AIMax is built to do

AIMax combines model loading, resource orchestration, multi-device execution, serving, batching, speculative decoding, quantization, and operational visibility in one production-oriented runtime.

Model loading

Parses model metadata, architecture, dimensions, and hyperparameters.
Places tensors across embedding, per-layer, and output-head structures.
Allocates buffers on GPU VRAM and CPU RAM as needed.
Loads models into inference-ready handles for immediate use.
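The placement step can be illustrated with a minimal sketch. It assumes GGUF-style tensor names (`token_embd.weight`, `blk.<N>.<...>`, `output.weight`); the `TensorKind` enum and `classify_tensor` function are hypothetical names for illustration, not part of the AIMax API.

```cpp
#include <cstddef>
#include <string>

// Hypothetical categories mirroring the embedding / per-layer / output-head split.
enum class TensorKind { Embedding, Layer, OutputHead, Other };

struct TensorSlot {
    TensorKind kind;
    int layer;  // layer index for TensorKind::Layer, -1 otherwise
};

// Classify a tensor by name. The naming scheme is an assumption (GGUF-style).
inline TensorSlot classify_tensor(const std::string& name) {
    if (name.rfind("token_embd.", 0) == 0) return {TensorKind::Embedding, -1};
    if (name.rfind("output.", 0) == 0) return {TensorKind::OutputHead, -1};
    if (name.rfind("blk.", 0) == 0) {
        // Parse the digits between "blk." and the next '.'.
        std::size_t pos = 4;
        int layer = 0;
        bool any_digit = false;
        while (pos < name.size() && name[pos] >= '0' && name[pos] <= '9') {
            layer = layer * 10 + (name[pos] - '0');
            ++pos;
            any_digit = true;
        }
        if (any_digit && pos < name.size() && name[pos] == '.')
            return {TensorKind::Layer, layer};
    }
    return {TensorKind::Other, -1};
}
```

With this scheme, `classify_tensor("blk.12.attn_q.weight")` yields the per-layer slot for layer 12, while names that match no rule fall into `Other` and can be handled separately.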

Resource orchestration

Detects CPU cores, GPU count, and VRAM availability.
Plans model placement across available devices.
Holds multiple resident models at once.
Auto-evicts or swaps under memory pressure.

Multi-device inference

Single-GPU inference.
Mixed GPU + CPU offload when models exceed VRAM.
Multi-GPU execution with layer split or tensor parallel approaches.
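A rough sketch of how a mixed GPU + CPU split might be planned when a model exceeds VRAM. It assumes equally sized layers and a fixed VRAM reserve for scratch buffers; `OffloadPlan` and `plan_offload` are hypothetical names, and a real planner would likely account for per-layer size differences and the KV cache.

```cpp
#include <cstdint>

struct OffloadPlan {
    int gpu_layers;  // layers [0, gpu_layers) run on the GPU
    int cpu_layers;  // the remaining layers run on the CPU
};

// Greedy plan: place as many layers on the GPU as fit into the free VRAM
// that remains after subtracting the reserve; the rest go to the CPU.
inline OffloadPlan plan_offload(int n_layers, std::uint64_t bytes_per_layer,
                                std::uint64_t free_vram, std::uint64_t reserve) {
    if (n_layers <= 0) return {0, 0};
    if (bytes_per_layer == 0) return {n_layers, 0};
    const std::uint64_t usable = free_vram > reserve ? free_vram - reserve : 0;
    const std::uint64_t fit = usable / bytes_per_layer;
    const int gpu = fit >= static_cast<std::uint64_t>(n_layers)
                        ? n_layers
                        : static_cast<int>(fit);
    return {gpu, n_layers - gpu};
}
```

For example, 32 layers of 200 MiB each with 4 GiB free and a 512 MiB reserve gives 17 GPU layers and 15 CPU layers under these assumptions.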

Modalities and hardware coverage

AIMax is intended as a broad inference runtime rather than a text-only engine.

Supported modalities

Text generation, vision encoding, speech-to-text, text-to-speech, text-to-image, and text-to-video workflows are all part of the intended scope.

Hardware targets

CPU backends for x86-64 (AVX2 and AVX-512) and ARM (NEON); NVIDIA GPUs via CUDA; Apple GPUs via Metal; cross-vendor GPUs via Vulkan.

CPU SIMD path

Detects CPU features at runtime (AVX2 and AVX-512) and provides AVX2 dequantization paths for Q8_0, Q4_K, and Q6_K. These paths are designed to produce output that is byte-identical to the scalar reference.
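To make "byte-exact parity" concrete, here is a portable sketch of a parity check for Q8_0. It assumes the GGML-style Q8_0 layout (32 signed 8-bit values sharing one scale) and keeps the scale as a `float` for simplicity. The `dequant_lanes8` function is only a stand-in for a vectorized path, written without intrinsics so it compiles anywhere; all names here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 32;

// Assumed Q8_0-style block: 32 int8 values and one shared scale.
struct BlockQ8_0 {
    float scale;
    std::int8_t qs[kBlockSize];
};

// Scalar reference: out[i] = scale * q[i].
inline std::vector<float> dequant_scalar(const std::vector<BlockQ8_0>& blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8_0& b : blocks)
        for (std::size_t i = 0; i < kBlockSize; ++i)
            out.push_back(b.scale * static_cast<float>(b.qs[i]));
    return out;
}

// Stand-in for a SIMD path: handles 8 lanes per step, like one 256-bit
// register of floats, but in plain C++.
inline std::vector<float> dequant_lanes8(const std::vector<BlockQ8_0>& blocks) {
    std::vector<float> out(blocks.size() * kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b)
        for (std::size_t i = 0; i < kBlockSize; i += 8)
            for (std::size_t lane = 0; lane < 8; ++lane)
                out[b * kBlockSize + i + lane] =
                    blocks[b].scale * static_cast<float>(blocks[b].qs[i + lane]);
    return out;
}

// Byte-exact parity compares raw bit patterns, not values within a tolerance.
inline bool byte_exact(const std::vector<float>& a, const std::vector<float>& b) {
    return a.size() == b.size() &&
           (a.empty() ||
            std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0);
}

// Deterministic test input: quantized values stay within [-127, 127].
inline std::vector<BlockQ8_0> make_blocks(std::size_t n) {
    std::vector<BlockQ8_0> v(n);
    for (std::size_t b = 0; b < n; ++b) {
        v[b].scale = 0.125f * static_cast<float>(b + 1);
        for (std::size_t i = 0; i < kBlockSize; ++i)
            v[b].qs[i] = static_cast<std::int8_t>(
                static_cast<int>(i * 7 + b * 3) % 255 - 127);
    }
    return v;
}
```

Comparing with `memcmp` rather than a tolerance is the point of the design: any reordering of floating-point operations in the fast path that changes even one bit shows up as a failure.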

Quantization and memory control

AIMax is designed to work with existing quantized models and to apply quantization policies at load time.

Pre-quantized model support

Supports quantized formats including Q8_0, Q6_K, Q5_K, Q4_K, Q4_0, and IQ4_XS.

KV-cache quantization

Stores attention KV caches at reduced precision to manage runtime memory more efficiently.
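The memory effect can be estimated with simple arithmetic. The sketch below assumes a standard transformer KV cache (keys and values for every layer, position, KV head, and head dimension) and, as an illustrative reduced-precision choice, a GGML-style Q8_0 encoding at 34 bytes per 32 elements. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Element count: factor 2 covers keys and values.
inline std::uint64_t kv_elements(std::uint64_t n_layers, std::uint64_t n_ctx,
                                 std::uint64_t n_kv_heads, std::uint64_t head_dim) {
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim;
}

// FP16 stores 2 bytes per element.
inline std::uint64_t kv_bytes_fp16(std::uint64_t elements) { return elements * 2; }

// Assumed Q8_0-style storage: each block of 32 elements takes 34 bytes
// (32 int8 values plus one FP16 scale).
inline std::uint64_t kv_bytes_q8_0(std::uint64_t elements) {
    assert(elements % 32 == 0);
    return elements / 32 * 34;
}
```

For a model with 32 layers, 8 KV heads, head dimension 128, and a 4096-token context, this gives 512 MiB at FP16 and 272 MiB with the assumed 8-bit encoding.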

Recipe-driven quantization

Can apply JSON quantization recipes at load time, mixing formats such as Q6_K for attention, Q4_K for FFN weights, and FP16 for KV cache.
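The JSON schema for recipes is not shown here, so the sketch below models only what a parsed recipe might look like in memory. It assumes rules match on a substring of the tensor name and that the first match wins; `QuantRecipe`, `format_for`, and the `Q8_0` fallback are hypothetical choices for illustration.

```cpp
#include <string>
#include <vector>

// One rule: tensors whose name contains `pattern` use `format`.
struct RecipeRule {
    std::string pattern;
    std::string format;
};

struct QuantRecipe {
    std::vector<RecipeRule> rules;  // evaluated in order, first match wins
    std::string default_format;     // used when no rule matches (assumed)
    std::string kv_cache_format;    // precision of the attention KV cache
};

// The mix named in the text: Q6_K for attention, Q4_K for FFN, FP16 KV cache.
inline QuantRecipe example_recipe() {
    QuantRecipe r;
    r.rules.push_back({"attn", "Q6_K"});
    r.rules.push_back({"ffn", "Q4_K"});
    r.default_format = "Q8_0";
    r.kv_cache_format = "FP16";
    return r;
}

// Resolve the target format for one tensor.
inline std::string format_for(const QuantRecipe& recipe,
                              const std::string& tensor_name) {
    for (const RecipeRule& rule : recipe.rules)
        if (tensor_name.find(rule.pattern) != std::string::npos)
            return rule.format;
    return recipe.default_format;
}
```

Ordered rules keep the recipe easy to reason about: a more specific pattern placed earlier overrides a broader one placed later.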

Serving, APIs, and concurrency

AIMax is intended to serve as both an embeddable library and a server-side serving stack.

Developer interfaces

C++ API for embedded use.
Header-only consumer support.
HTTP server with OpenAI-compatible endpoints.
Streaming SSE responses.
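As a sketch of what streaming SSE responses look like on the wire: each chunk is framed as a `data:` line followed by a blank line, and OpenAI-style streams end with a literal `[DONE]` sentinel. The helper names below are hypothetical and do not describe AIMax's server code.

```cpp
#include <cstddef>
#include <string>

// Frame one payload as a Server-Sent Events message.
// A payload with embedded newlines is split across several "data:" lines.
inline std::string sse_event(const std::string& payload) {
    std::string out;
    std::size_t start = 0;
    while (true) {
        const std::size_t nl = payload.find('\n', start);
        out += "data: ";
        out += payload.substr(
            start, nl == std::string::npos ? std::string::npos : nl - start);
        out += '\n';
        if (nl == std::string::npos) break;
        start = nl + 1;
    }
    out += '\n';  // a blank line terminates the event
    return out;
}

// Terminal sentinel used by OpenAI-style streaming responses.
inline std::string sse_done() { return sse_event("[DONE]"); }
```

A server would write one `sse_event(json_chunk)` per generated token or token group, then `sse_done()` before closing the stream.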

Scheduler and batching

Continuous batching scheduler for concurrent requests.
Prefix caching for shared-system-prompt reuse.
Speculative decoding with draft-model verification.
Paged draft KV support for better multi-request memory behavior.
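The idea behind continuous batching can be sketched with a toy scheduler: finished requests leave the batch immediately and waiting requests take their slots on the next step, instead of the whole batch draining first. `ContinuousBatcher` and `simulate` are hypothetical names, and the token decrement stands in for a real forward pass.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Request {
    int id;
    int tokens_left;  // tokens still to generate
};

class ContinuousBatcher {
public:
    explicit ContinuousBatcher(std::size_t max_batch) : max_batch_(max_batch) {
        assert(max_batch_ > 0);
    }

    void submit(Request r) { waiting_.push_back(r); }

    // One decode step: admit waiting requests into free slots, generate one
    // token per running request, and retire finished requests right away.
    // Returns the ids that finished during this step.
    std::vector<int> step() {
        while (running_.size() < max_batch_ && !waiting_.empty()) {
            running_.push_back(waiting_.front());
            waiting_.pop_front();
        }
        std::vector<int> finished;
        std::vector<Request> still_running;
        for (Request& r : running_) {
            --r.tokens_left;  // stands in for one batched forward pass
            if (r.tokens_left <= 0) finished.push_back(r.id);
            else still_running.push_back(r);
        }
        running_.swap(still_running);
        return finished;
    }

    std::size_t pending() const { return running_.size() + waiting_.size(); }

private:
    std::size_t max_batch_;
    std::deque<Request> waiting_;
    std::vector<Request> running_;
};

// Number of decode steps needed to finish requests of the given lengths.
inline int simulate(std::size_t max_batch, const std::vector<int>& lengths) {
    ContinuousBatcher batcher(max_batch);
    for (std::size_t i = 0; i < lengths.size(); ++i)
        batcher.submit({static_cast<int>(i), lengths[i]});
    int steps = 0;
    while (batcher.pending() > 0) {
        batcher.step();
        ++steps;
    }
    return steps;
}
```

With a batch size of 2 and request lengths 1, 3, and 2, the third request starts as soon as the first finishes, so everything completes in 3 steps. A scheduler that waits for the whole first batch would need 5.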

Chat and templating

Jinja-subset chat template handling.
Streaming speculative generation path.
Deterministic validation workflow for greedy vs speculative output.
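Why greedy and speculative output can be compared for equality is easiest to see from the acceptance rule. The sketch below assumes greedy verification: draft tokens are accepted while they match the target model's own greedy choice, and the target's token is used at the first mismatch. `verify_greedy` is a hypothetical name for illustration.

```cpp
#include <cstddef>
#include <vector>

// `draft` holds k proposed tokens. `target` holds the target model's greedy
// choice at each of the k+1 positions, where target[i] is conditioned on the
// prompt plus draft[0..i).
// Returns the tokens to emit: the matching draft prefix, followed by the
// target's token at the first mismatch (or its extra token if all k matched).
inline std::vector<int> verify_greedy(const std::vector<int>& draft,
                                      const std::vector<int>& target) {
    std::vector<int> accepted;
    std::size_t i = 0;
    while (i < draft.size() && i < target.size() && draft[i] == target[i]) {
        accepted.push_back(draft[i]);
        ++i;
    }
    if (i < target.size()) accepted.push_back(target[i]);
    return accepted;
}
```

Every emitted token equals the target model's greedy choice at that position, so under this rule the speculative stream should match plain greedy decoding token for token. That is the property an equality check can test.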

Multi-model operations

Beyond single-model serving, AIMax includes model-management features for resident fleets and active-model switching.

Resident model management

Multi-model hosting backed by ModelManager, with resident-model limits, LRU eviction, and VRAM-budget-aware residency control.
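A minimal sketch of how a resident-model limit and a VRAM budget could interact under an LRU policy. `ResidencyTracker` and `replay` are hypothetical and are not the ModelManager interface; the sketch tracks only names and sizes, not real model handles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ResidencyTracker {
public:
    ResidencyTracker(std::size_t max_models, std::uint64_t vram_budget)
        : max_models_(max_models), vram_budget_(vram_budget) {
        assert(max_models_ > 0);
    }

    // Mark `name` as used now. If it is not resident, evict least recently
    // used models until both limits hold, then admit it. Returns evicted names.
    // Note: a model larger than the whole budget is still admitted once
    // everything else is evicted; a real manager would likely reject it.
    std::vector<std::string> touch(const std::string& name, std::uint64_t bytes) {
        std::vector<std::string> evicted;
        for (auto it = lru_.begin(); it != lru_.end(); ++it) {
            if (it->name == name) {  // already resident: move to the front
                lru_.splice(lru_.begin(), lru_, it);
                return evicted;
            }
        }
        while (!lru_.empty() &&
               (lru_.size() + 1 > max_models_ || used_ + bytes > vram_budget_)) {
            used_ -= lru_.back().bytes;
            evicted.push_back(lru_.back().name);
            lru_.pop_back();
        }
        lru_.push_front({name, bytes});
        used_ += bytes;
        return evicted;
    }

    // Resident model names, most recently used first.
    std::vector<std::string> resident() const {
        std::vector<std::string> names;
        for (const Entry& e : lru_) names.push_back(e.name);
        return names;
    }

private:
    struct Entry {
        std::string name;
        std::uint64_t bytes;
    };
    std::size_t max_models_;
    std::uint64_t vram_budget_;
    std::uint64_t used_ = 0;
    std::list<Entry> lru_;  // front = most recently used
};

// Replay a sequence of (model, size) uses and report what stays resident.
inline std::vector<std::string> replay(
    std::size_t max_models, std::uint64_t vram_budget,
    const std::vector<std::pair<std::string, std::uint64_t>>& uses) {
    ResidencyTracker tracker(max_models, vram_budget);
    for (const auto& use : uses) tracker.touch(use.first, use.second);
    return tracker.resident();
}
```

Either limit can trigger eviction: the count limit evicts when too many models are resident, and the budget evicts when the next model would not fit, even if the count limit still has room.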

Model routing endpoints

Resident-model listing, model budget configuration, active-model switching, and explicit model eviction endpoints for server deployments.

Operational flexibility

Lets operators keep multiple models warm and switch between them without reload latency when using the resident-model workflow.

Observability and validation

AIMax places strong emphasis on correctness, diagnostics, and measurable runtime behavior.

Metrics and reporting

Per-request metrics including time-to-first-token and tokens per second.
VRAM and RAM usage reporting.
Structured logging for model load and inference events.
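The two per-request metrics can be derived from three timestamps. This sketch assumes one common convention: time-to-first-token runs from submission to the first token, and tokens per second counts only the tokens after the first, over the decode interval. `RequestMetrics` and `compute_metrics` are hypothetical names.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct RequestMetrics {
    double ttft_ms;         // time to first token, in milliseconds
    double tokens_per_sec;  // decode throughput after the first token
};

inline RequestMetrics compute_metrics(Clock::time_point submitted,
                                      Clock::time_point first_token,
                                      Clock::time_point finished,
                                      int generated_tokens) {
    using MillisD = std::chrono::duration<double, std::milli>;
    using SecondsD = std::chrono::duration<double>;

    RequestMetrics m{MillisD(first_token - submitted).count(), 0.0};
    const double decode_s = SecondsD(finished - first_token).count();
    // Throughput is undefined for a single token or a zero-length interval.
    if (generated_tokens > 1 && decode_s > 0.0)
        m.tokens_per_sec = (generated_tokens - 1) / decode_s;
    return m;
}
```

Separating the first token from the rest keeps prompt-processing cost out of the throughput figure, which makes the two numbers easier to interpret independently.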

Diagnostics

Per-layer norm inspection.
NaN detection support.
Dequant parity tests and source-level audit tests.
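Per-layer norm inspection and NaN detection can be combined in a single pass over each layer's activations. The sketch below is one way to do it, with hypothetical names (`LayerStats`, `inspect_layer`, `first_bad_layer`); it treats infinities as failures as well as NaNs.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct LayerStats {
    double l2_norm;          // sqrt of the sum of squares over finite values
    std::size_t non_finite;  // number of NaN or +/-Inf entries
};

inline LayerStats inspect_layer(const std::vector<float>& activations) {
    double sum_sq = 0.0;
    std::size_t bad = 0;
    for (float v : activations) {
        if (!std::isfinite(v)) {
            ++bad;
            continue;  // keep the norm meaningful despite bad entries
        }
        sum_sq += static_cast<double>(v) * static_cast<double>(v);
    }
    return {std::sqrt(sum_sq), bad};
}

// Index of the first layer containing a non-finite value, or -1 if none.
inline int first_bad_layer(const std::vector<std::vector<float>>& layers) {
    for (std::size_t i = 0; i < layers.size(); ++i)
        if (inspect_layer(layers[i]).non_finite > 0) return static_cast<int>(i);
    return -1;
}
```

Reporting the first offending layer narrows a numerical failure to one point in the network, and a sudden jump in norm between adjacent layers is often an early hint before values overflow.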

Validation workflows

Golden-output hardware validation.
Greedy vs speculative equality verification.
Byte-exact CPU SIMD parity testing for key quant formats.

Explore tested models and deployment paths

AIMax is positioned as the high-throughput serving layer inside the broader Esacode platform.
