Alloy · A Prysm project
The fastest inference engine for Apple Silicon.
Run any model on your Mac. Faster than llama.cpp and MLX.
Benchmarks
Model
Device
M4 Max
Qwen3 0.6B · MLX 4-bit
tg128 decode · pp4096 prefill
Generation tok/s
Alloy
709.6 ± 17.2
MLX
395.8 ± 3.1
Prompt Processing tok/s
Alloy
8,824.6 ± 5.8
MLX
7,541.3 ± 57.6
Built For
App developers
Ship private AI in your app. Alloy serves any model behind an OpenAI-compatible API.
ML researchers
Write, train and fine-tune models on your Mac with Alloy's PyTorch backend.
Performance engineers
Alloy provides a Triton-like DSL so you can write and run Metal kernels from Python.
Get Started
From install to a running model in under a minute.
01Install
uv add 'alloy-kit[serve]'macOS 13+, Python 3.10+
02Serve
alloy serve -m qwen3:0.6bCompiles any GGUF or MLX model.
03Run
http://127.0.0.1:11434Point any OpenAI client at localhost. Done.
Built-in Features
Speculative decoding
Constrained decoding
Tool calling
KV-cache quantization
Vision & audio input
Embeddings
The Prysm Stack
Edge deploymentPrysmCompile any ONNX. Ship to Hailo-8, Jetson Orin, or Kria K26.
On-device · Apple SiliconAlloyRun and train any model on Apple Silicon. You're here.
Get Started
Free and open source.