AIMax

High-throughput inference orchestration across models, devices, and modalities.

A production C++ inference library built for embedded servers, agents, desktop apps, and enterprise AI platforms that need speed, batching, model management, and broad hardware coverage.

Install AIMax
One-line install
powershell -c "irm https://esacode.com/install/aimax.ps1 | iex"

What AIMax is built to do

AIMax focuses on model loading, resource orchestration, multi-device execution, serving, batching, speculative decoding, quantization, and operational visibility in one production-oriented runtime.

Model loading

  • Parses model metadata, architecture, dimensions, and hyperparameters.
  • Places tensors across embedding, per-layer, and output-head structures.
  • Allocates buffers on GPU VRAM and CPU RAM as needed.
  • Loads models into inference-ready handles for immediate use.

Resource orchestration

  • Detects CPU cores, GPU count, and VRAM availability.
  • Plans model placement across available devices.
  • Holds multiple resident models at once.
  • Auto-evicts or swaps under memory pressure.

Multi-device inference

  • Single-GPU inference.
  • Mixed GPU + CPU offload when models exceed VRAM.
  • Multi-GPU execution with layer-split or tensor-parallel approaches.
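
To make the layer-split idea concrete, here is a minimal sketch that divides a model's layers across devices in proportion to each device's free memory. `LayerRange` and `split_layers` are hypothetical names for illustration and are not taken from the AIMax API; the sketch also assumes all layers are roughly the same size, which a real planner would not.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type, not part of AIMax: a contiguous run of layers on one device.
struct LayerRange {
    std::size_t first;  // index of the first layer placed on the device
    std::size_t count;  // number of layers placed on the device
};

// Splits n_layers across devices in proportion to their free memory.
// Assumes equally sized layers; the device order decides the layer order.
std::vector<LayerRange> split_layers(std::size_t n_layers,
                                     const std::vector<std::uint64_t>& free_bytes) {
    std::uint64_t total = 0;
    for (std::uint64_t b : free_bytes) total += b;

    std::vector<LayerRange> plan;
    if (total == 0) return plan;  // no usable device memory reported

    std::uint64_t seen = 0;
    std::size_t begin = 0;
    for (std::uint64_t b : free_bytes) {
        seen += b;
        // Round the cumulative share so the last device always ends at n_layers.
        std::size_t end =
            static_cast<std::size_t>((seen * n_layers + total / 2) / total);
        plan.push_back({begin, end - begin});
        begin = end;
    }
    return plan;
}
```

With 32 layers and two devices reporting free memory in a 3:1 ratio, this plan puts layers 0 to 23 on the first device and layers 24 to 31 on the second.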

[Figure: AIMax benchmark dashboard]
[Figure: AIMax serving topology]

Modalities and hardware coverage

The requirements describe AIMax as a broad inference runtime rather than a text-only engine.

Supported modalities

Text generation, vision encoding, speech-to-text, text-to-speech, text-to-image, and text-to-video workflows are all part of the intended scope.

Hardware targets

CPU backends for x86-64 (AVX2 and AVX-512) and ARM (NEON); NVIDIA GPUs through CUDA; Apple GPUs through Metal; and cross-vendor GPUs through Vulkan.

CPU SIMD path

Detects CPU features at runtime (AVX2 and AVX-512) and provides AVX2 dequantization paths for Q8_0, Q4_K, and Q6_K, designed to match the scalar output byte for byte.
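
As an illustration of what byte-exact parity means, the sketch below pairs a scalar dequantizer for a simplified Q8_0-style block with a bitwise comparison helper. The block layout is an assumption for brevity (32 signed 8-bit weights with a `float` scale, where GGML-family Q8_0 stores the scale as fp16), and `BlockQ8`, `dequant_q8_scalar`, and `byte_exact` are hypothetical names, not AIMax identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 32;

// Simplified Q8_0-style block (assumed layout): one scale shared by 32 weights.
struct BlockQ8 {
    float scale;
    std::int8_t q[kBlockSize];
};

// Scalar reference path: each output is scale * quantized value.
void dequant_q8_scalar(const BlockQ8* blocks, std::size_t n_blocks, float* out) {
    for (std::size_t b = 0; b < n_blocks; ++b) {
        for (std::size_t i = 0; i < kBlockSize; ++i) {
            out[b * kBlockSize + i] =
                blocks[b].scale * static_cast<float>(blocks[b].q[i]);
        }
    }
}

// Compares raw bytes rather than using a tolerance, so any difference in
// rounding between two dequantization paths is reported as a mismatch.
bool byte_exact(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return false;
    if (a.empty()) return true;
    return std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```

A parity test would run the scalar path and a SIMD path over the same blocks and require `byte_exact` to return true for the two outputs.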

Quantization and memory control

AIMax is designed to work with existing quantized models and to apply quantization policies during load time.

Pre-quantized model support

Supports formats such as Q8_0, Q6_K, Q5_K, Q4_K, Q4_0, IQ4_XS, and similar quantized encodings.

KV-cache quantization

Stores attention KV caches at reduced precision to lower runtime memory use.
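
The effect of KV-cache precision on memory can be estimated with simple arithmetic. The sketch below assumes a standard transformer cache (one key and one value vector per layer, per KV head, per token) and ignores per-block scale overhead that real quantized formats add; `kv_cache_bytes` is a hypothetical helper, not an AIMax function.

```cpp
#include <cstdint>

// Estimated KV-cache size in bytes for one sequence.
// The factor of 2 covers keys and values.
std::uint64_t kv_cache_bytes(std::uint64_t layers,
                             std::uint64_t kv_heads,
                             std::uint64_t head_dim,
                             std::uint64_t tokens,
                             std::uint64_t bits_per_element) {
    std::uint64_t elements = 2 * layers * kv_heads * head_dim * tokens;
    return elements * bits_per_element / 8;
}
```

For an example shape of 32 layers, 8 KV heads, a head dimension of 128, and 8192 tokens, this gives 1 GiB at 16 bits per element and 512 MiB at 8 bits.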

Recipe-driven quantization

Can apply JSON quantization recipes at load time, mixing formats such as Q6_K for attention, Q4_K for FFN weights, and FP16 for KV cache.
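
One way to picture a recipe is as an ordered list of rules that map tensor names to target formats. The sketch below is an assumption about how such rules could be applied once parsed; the actual AIMax recipe schema is not described here, and `RecipeRule`, `pick_format`, and the tensor names in the usage note are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical parsed recipe rule: tensors whose name contains
// `name_contains` are stored in `format`.
struct RecipeRule {
    std::string name_contains;
    std::string format;
};

// Returns the format of the first matching rule, or `fallback` if none match.
std::string pick_format(const std::vector<RecipeRule>& rules,
                        const std::string& tensor_name,
                        const std::string& fallback) {
    for (const RecipeRule& rule : rules) {
        if (tensor_name.find(rule.name_contains) != std::string::npos) {
            return rule.format;
        }
    }
    return fallback;
}
```

With the rules `{"attn", "Q6_K"}` and `{"ffn", "Q4_K"}`, a tensor named `blk.0.attn_q.weight` would be assigned Q6_K, `blk.0.ffn_up.weight` would be assigned Q4_K, and anything else would take the fallback format.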

Serving, APIs, and concurrency

The AIMax requirements describe it as both an embeddable library and a server-side serving stack.

Developer interfaces

  • C++ API for embedded use.
  • Header-only consumer support.
  • HTTP server with OpenAI-compatible endpoints.
  • Streaming SSE responses.
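
For readers unfamiliar with SSE framing, the sketch below shows how a streamed chunk is typically written on the wire: a `data:` line followed by a blank line, with OpenAI-style streams ending in a `[DONE]` sentinel. The helper names are hypothetical and this is not AIMax's server code; it assumes each JSON payload is serialized on a single line.

```cpp
#include <string>

// Frames one single-line JSON payload as a Server-Sent Events message.
std::string sse_event(const std::string& json_payload) {
    return "data: " + json_payload + "\n\n";
}

// Terminator used by OpenAI-style streaming responses.
std::string sse_done() {
    return "data: [DONE]\n\n";
}
```

A server would write one `sse_event` per generated chunk and finish the response with `sse_done`.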

Scheduler and batching

  • Continuous batching scheduler for concurrent requests.
  • Prefix caching that reuses shared system prompts across requests.
  • Speculative decoding with draft-model verification.
  • Paged draft KV support for better multi-request memory behavior.
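
The draft-model verification step can be sketched for the greedy case as follows. The draft model proposes several tokens, the target model scores them in one pass, and the longest matching prefix is accepted along with one token from the target. `verify_greedy` is a hypothetical function written for illustration, and it assumes the target's greedy choice is already available at each draft position plus one extra position.

```cpp
#include <cstddef>
#include <vector>

using Token = int;

// draft:         k tokens proposed by the draft model.
// target_argmax: k + 1 tokens, where entry i is the target model's greedy
//                choice given the prompt plus draft[0..i-1].
// Returns the tokens to emit this step: the accepted draft prefix followed by
// the target's own token at the first mismatch, or the bonus token if every
// draft token was accepted.
std::vector<Token> verify_greedy(const std::vector<Token>& draft,
                                 const std::vector<Token>& target_argmax) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < draft.size() && i < target_argmax.size() &&
           draft[i] == target_argmax[i]) {
        out.push_back(draft[i]);
        ++i;
    }
    if (i < target_argmax.size()) {
        out.push_back(target_argmax[i]);
    }
    return out;
}
```

Because every emitted token is one the target model would have chosen greedily, the output sequence should be identical to plain greedy decoding, which is the property a greedy-versus-speculative equality check tests.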

Chat and templating

  • Jinja-subset chat template handling.
  • Streaming speculative generation path.
  • Deterministic validation workflow for greedy vs speculative output.

Multi-model operations

Beyond single-model serving, AIMax includes model-management features for resident fleets and active-model switching.

Resident model management

ModelManager-backed multi-model hosting with resident-model limits, LRU behavior, and VRAM-budget-aware residency control.
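
The combination of LRU behavior and a VRAM budget can be sketched with a small bookkeeping class. `ResidentSet` below is a hypothetical illustration, not the AIMax ModelManager: it tracks only names and sizes, and it assumes the newly requested model is always kept even if it alone exceeds the budget.

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

class ResidentSet {
public:
    explicit ResidentSet(std::uint64_t budget_bytes) : budget_(budget_bytes) {}

    // Marks `name` as most recently used, adding it if it is not resident.
    // Returns the names evicted to bring usage back under the budget.
    std::vector<std::string> touch(const std::string& name, std::uint64_t bytes) {
        auto it = index_.find(name);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // move to the front
            return {};
        }
        lru_.push_front({name, bytes});
        index_[name] = lru_.begin();
        used_ += bytes;

        std::vector<std::string> evicted;
        // Evict from the least recently used end, never the entry just added.
        while (used_ > budget_ && lru_.size() > 1) {
            const Entry& victim = lru_.back();
            used_ -= victim.bytes;
            evicted.push_back(victim.name);
            index_.erase(victim.name);
            lru_.pop_back();
        }
        return evicted;
    }

    bool resident(const std::string& name) const { return index_.count(name) != 0; }
    std::uint64_t used() const { return used_; }

private:
    struct Entry {
        std::string name;
        std::uint64_t bytes;
    };
    std::uint64_t budget_;
    std::uint64_t used_ = 0;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

With a budget of 10 units and three models of 4 units each, touching `a`, `b`, `a`, then `c` evicts `b`, since `a` was used more recently.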

Model routing endpoints

Resident-model listing, model budget configuration, active-model switching, and explicit model eviction endpoints for server deployments.

Operational flexibility

Lets operators keep multiple models warm and switch between them without reload latency when using the resident-model workflow.

Observability and validation

The AIMax requirements place strong emphasis on correctness, diagnostics, and measurable runtime behavior.

Metrics and reporting

  • Per-request metrics including time-to-first-token and tokens per second.
  • VRAM and RAM usage reporting.
  • Structured logging for model load and inference events.
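
The two per-request metrics named above can be derived from three timestamps. The sketch below is one common convention, assumed here rather than taken from AIMax: time-to-first-token is measured from request start, and tokens per second counts only the tokens produced after the first one. `RequestMetrics` and `summarize` are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>

using Clock = std::chrono::steady_clock;

struct RequestMetrics {
    double ttft_ms;            // time to first token, in milliseconds
    double tokens_per_second;  // decode throughput after the first token
};

RequestMetrics summarize(Clock::time_point start,
                         Clock::time_point first_token,
                         Clock::time_point end,
                         std::size_t generated_tokens) {
    using Millis = std::chrono::duration<double, std::milli>;
    using Seconds = std::chrono::duration<double>;

    double ttft_ms = Millis(first_token - start).count();
    double decode_s = Seconds(end - first_token).count();
    // Guard against single-token responses and zero-length decode windows.
    double tps = (generated_tokens > 1 && decode_s > 0.0)
                     ? static_cast<double>(generated_tokens - 1) / decode_s
                     : 0.0;
    return {ttft_ms, tps};
}
```

A request whose first token arrives after 250 ms and whose remaining 100 tokens take 2 s would report a TTFT of 250 ms and 50 tokens per second.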

Diagnostics

  • Per-layer norm inspection.
  • NaN detection support.
  • Dequant parity tests and source-level audit tests.
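
Per-layer norm inspection and NaN detection can be combined in a single pass over a layer's activations. The sketch below is a minimal illustration under that assumption; `LayerStats` and `inspect_layer` are hypothetical names, and the code assumes it is built without fast-math flags, which can make `std::isnan` unreliable.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct LayerStats {
    double l2_norm;         // L2 norm of the finite-or-infinite (non-NaN) values
    std::size_t nan_count;  // number of NaN entries seen
};

// Scans one layer's output once, counting NaNs and accumulating the norm
// of everything else in double precision.
LayerStats inspect_layer(const std::vector<float>& activations) {
    double sum_sq = 0.0;
    std::size_t nans = 0;
    for (float v : activations) {
        if (std::isnan(v)) {
            ++nans;
            continue;
        }
        sum_sq += static_cast<double>(v) * static_cast<double>(v);
    }
    return {std::sqrt(sum_sq), nans};
}
```

Logging these two numbers after each layer makes it easier to see where a norm starts to blow up or where the first NaN appears.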

Validation workflows

  • Golden-output hardware validation.
  • Greedy vs speculative equality verification.
  • Byte-exact CPU SIMD parity testing for key quant formats.

Explore tested models and deployment paths

AIMax is positioned as the high-throughput serving layer inside the broader Esacode platform.

Explore models

Support open-source AI tooling

Esacode is independently built. If our tools earn a place in your stack, you can help keep development going.