API Documentation
Complete reference for submitting and querying local LLM benchmarks. Designed for agents and developers building on the localmaxxing platform.
Overview
localmaxxing is a public leaderboard for local LLM inference benchmarks. The API enables agents and developers to:
- Run inference benchmarks on models
- Collect performance metrics (tok/s, TTFT, peak VRAM, etc.)
- Submit results to POST /api/benchmarks
- Query leaderboard data and benchmark results
The base URL is https://localmaxxing.com; all endpoints are prefixed with /api. Results appear on the dashboard and public leaderboard immediately upon submission.
Authentication
Submitting benchmarks requires authentication. Two methods are supported:
1. Bearer API Key (recommended for agents)
Include your API key in the Authorization header:
Authorization: Bearer bhk_<40 hex chars>
2. Session Cookie
If you're calling the API from the browser (e.g., the submit form), your session cookie authenticates you automatically.
Example
curl -X POST https://localmaxxing.com/api/benchmarks \
-H "Authorization: Bearer bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" \
-H "Content-Type: application/json" \
-d '{ ... }'
Unauthenticated requests receive 401 Unauthorized. API keys are created and managed in your dashboard. A maximum of 10 keys per account is allowed.
Agent Metadata
Agents should call GET /api/agent-context before submitting. It returns accepted enum values, schemas, examples, methodology tips, and endpoint URLs in one cached response.
The machine-readable OpenAPI 3.1 spec is available at GET /api/openapi.json.
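Before sending anything, a client can sanity-check its credentials locally. A minimal sketch (the helper name and the pre-flight check are ours, not part of the API; the key shape follows the `bhk_<40 hex chars>` format documented above):

```python
import re

# Key format documented above: "bhk_" followed by 40 hex characters.
API_KEY_PATTERN = re.compile(r"^bhk_[0-9a-f]{40}$")

def auth_headers(api_key: str) -> dict:
    """Build request headers for Bearer authentication, with a local
    sanity check on the key shape before any request is sent."""
    if not API_KEY_PATTERN.match(api_key):
        raise ValueError("API key does not look like bhk_<40 hex chars>")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Catching a malformed key client-side avoids burning a round trip on a guaranteed 401.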
POST /api/benchmarks
Submit a benchmark result. Requires authentication. This is the primary endpoint for agents.
Required Fields
| Field | Type | Description |
|---|---|---|
| hfId | string | HuggingFace model ID, e.g. "Qwen/Qwen3-8B" |
| hardware | object | Hardware config — see Hardware section |
| engineName | string | Inference engine, e.g. "llama.cpp", "vllm", "sglang" |
| quantization | string | Quant format, e.g. "Q4_K_M", "AWQ", "fp8" |
tokSOut is required; at least one of the following secondary metrics must accompany it:
| Metric | Type | Description |
|---|---|---|
| tokSOut | number | Output tokens per second |
| tokSPrefill | number | Prefill/input tokens per second before generation |
| tokSTotal | number | Total tokens per second (prompt + output) |
| ttftMs | number | Time to first token in milliseconds |
| peakVramGb | number | Peak VRAM usage in GB |
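The tokSOut-plus-one-secondary rule can be checked client-side before submitting. A minimal sketch (field names come from the table above; the helper itself is hypothetical, not part of the API):

```python
# Secondary metrics accepted alongside the mandatory tokSOut.
SECONDARY_METRICS = ("tokSPrefill", "tokSTotal", "ttftMs", "peakVramGb")

def has_required_metrics(payload: dict) -> bool:
    """True if the payload satisfies the metric rule:
    tokSOut plus at least one secondary metric."""
    if not isinstance(payload.get("tokSOut"), (int, float)):
        return False
    return any(isinstance(payload.get(m), (int, float)) for m in SECONDARY_METRICS)
```

A payload failing this check would be rejected by the server with the "At least one additional metric" error shown in the Responses section.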
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| modelRevision | string | "main" | Git revision / branch / commit SHA |
| engineVersion | string | — | Engine version, e.g. "0.7.3" |
| backend | string | — | Backend variant, e.g. "cuda", "metal", "vulkan" |
| promptTokens | integer | 0 | Number of prompt tokens used |
| outputTokens | integer | 0 | Number of output tokens generated |
| contextLength | integer | 2048 | Context window size used |
| batchSize | integer | 1 | Batch size (concurrent requests) |
| prefillTokens | integer | — | Tokens already in KV cache before generation started |
| tokSPrefill | number | — | Prefill/input tokens per second before generation |
| peakVramGb | number | — | Peak VRAM usage in GB |
| notes | string | — | Free-text notes, max 2000 chars |
| engineFlags | object | — | Detailed engine flags — see Engine Flags |
Responses
Success — the created benchmark run:

{
  "id": "clxyz...",
  "modelId": "...",
  "hardwareId": "...",
  "engineId": "...",
  "userId": "...",
  "tokSOut": 87.4,
  "tokSPrefill": 1210.5,
  "status": "APPROVED",
  "createdAt": "2026-04-14T03:45:00.000Z",
  "model": { "hfId": "Qwen/Qwen3-8B", "displayName": "Qwen3-8B", ... },
  ...
}

Validation error:

{
  "error": "Validation failed",
  "details": {
    "fieldErrors": { "hardware.vramGb": ["Required"] },
    "formErrors": []
  }
}

Missing or invalid authentication (401):

{ "error": "Authentication required. Use a session cookie or Authorization: Bearer <api_key>" }

Unknown model:

{ "error": "Model \"some/bad-id\" not found on HuggingFace" }

Missing secondary metric:

{ "error": "At least one additional metric (TTFT, prefill tok/s, tok/s total, or peak VRAM) is required alongside tok/s output" }

Rate limited:

{
  "error": "Rate limit exceeded. You may submit once every 5 minutes.",
  "retryAfterMs": 240000,
  "lastSubmittedAt": "2026-04-14T03:40:00.000Z"
}

POST /api/benchmarks/dry-run
Validate a benchmark payload without writing to the database or consuming the rate limit. Requires the same Bearer API key or session cookie auth as real submissions.
{
"valid": true,
"parsed": { "hfId": "Qwen/Qwen3-8B", "modelRevision": "main", "promptTokens": 0 }
}
Hardware Object
The hardware field is a discriminated union on hwClass. Use the right shape for the hardware being tested.
DISCRETE_GPU: NVIDIA / AMD / Intel discrete graphics cards
{
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"gpuCount": 1,
"vramGb": 24,
"cpu": "Ryzen 9 5900X",
"ramGb": 64,
"os": "Ubuntu 22.04",
"powerWatts": 350
}
| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "DISCRETE_GPU" | Literal |
| gpuName | ✅ | string | e.g. "RTX 3090", "A100 80GB" |
| gpuCount | — | integer | Default 1 |
| vramGb | ✅ | number | Per-card VRAM in GB |
| cpu | — | string | CPU model |
| ramGb | — | number | System RAM in GB |
| os | — | string | Operating system |
| powerWatts | — | number | TDP / measured power draw |
UNIFIED: Apple Silicon / AMD APU / Intel Arc
{
"hwClass": "UNIFIED",
"chipVendor": "Apple",
"chipFamily": "M4",
"chipVariant": "M4 Pro",
"unifiedMemoryGb": 48,
"npuTops": 38,
"os": "macOS 15.4"
}
| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "UNIFIED" | Literal |
| chipVendor | ✅ | string | e.g. "Apple", "AMD" |
| chipFamily | ✅ | string | e.g. "M4", "Strix Point" |
| chipVariant | ✅ | string | e.g. "M4 Pro", "M4 Max" |
| unifiedMemoryGb | ✅ | number | Total unified memory |
| npuTops | — | number | NPU TOPS if applicable |
| cpu | — | string | CPU core descriptor |
| os | — | string | Operating system |
| powerWatts | — | number | Power draw |
CPU_ONLY: CPU-only inference
{
"hwClass": "CPU_ONLY",
"cpu": "Intel Xeon W9-3595X",
"ramGb": 512,
"os": "Ubuntu 24.04"
}
| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "CPU_ONLY" | Literal |
| cpu | ✅ | string | CPU model |
| ramGb | ✅ | number | System RAM in GB |
| os | — | string | Operating system |
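Because hardware is a discriminated union, a client can verify it picked the right shape before submitting. A sketch of such a check (required fields per class come from the three tables above; the helper is ours):

```python
# Required keys per hardware class, per the tables above.
REQUIRED_BY_CLASS = {
    "DISCRETE_GPU": ("gpuName", "vramGb"),
    "UNIFIED": ("chipVendor", "chipFamily", "chipVariant", "unifiedMemoryGb"),
    "CPU_ONLY": ("cpu", "ramGb"),
}

def missing_hardware_fields(hardware: dict) -> list:
    """Return the names of required fields missing from a hardware object,
    dispatching on the hwClass discriminator."""
    hw_class = hardware.get("hwClass")
    if hw_class not in REQUIRED_BY_CLASS:
        return ["hwClass"]
    return [k for k in REQUIRED_BY_CLASS[hw_class] if hardware.get(k) is None]
```

An empty result means the object at least has the required keys for its class; the server still validates types and values.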
Engine Flags Object
Optional. Provide engineFlags to record the exact launch configuration. If the object is supplied, commandSnippet is required within it, and any explicitly provided fields override flags parsed from the command.
{
"commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa",
"tensorParallel": 1,
"gpuLayers": 99,
"kvCacheDtype": "q8_0",
"flashAttn": true,
"attentionBackend": "flash_attn",
"concurrency": 4,
"temperature": 0.6,
"topP": 0.95
}
| Flag | Type | Description |
|---|---|---|
| commandSnippet | string | Full launch command (recommended — parsed automatically) |
| tensorParallel | integer | Tensor parallel degree (TP) |
| pipelineParallel | integer | Pipeline parallel degree |
| gpuLayers | integer | Number of layers offloaded to GPU (llama.cpp --n-gpu-layers) |
| splitMode | string | GPU split mode |
| kvCacheDtype | string | KV cache quantization, e.g. "q8_0", "fp8" |
| gpuMemUtil | float 0–1 | GPU memory utilization fraction (vLLM) |
| kvCacheSizeMb | integer | KV cache size in MB |
| prefixCaching | boolean | Whether prefix/prompt caching was enabled |
| attentionBackend | string | e.g. "flash_attn", "xformers", "sdpa" |
| flashAttn | boolean | Flash Attention enabled |
| chunkedPrefill | boolean | Chunked prefill enabled |
| prefillChunkSize | integer | Prefill chunk size |
| contBatching | boolean | Continuous batching enabled |
| cpuOffloadGb | float | GB of weights offloaded to CPU RAM |
| cpuLayers | integer | Number of layers on CPU |
| ropeScaling | string | RoPE scaling method, e.g. "yarn", "linear" |
| ropeScale | float | RoPE scale factor |
| yarnExtFactor | float | YaRN extension factor |
| engineQuant | string | Engine-level quantization override |
| sglangQuant | string | SGLang quantization method |
| maxRunningSeqs | integer | Max running sequences |
| schedulerDelayFactor | float | Scheduler delay factor |
| numParallel | integer | Number of parallel sequences (Ollama) |
| concurrency | integer | Concurrent requests used for throughput runs (vLLM / SGLang) |
| specDecoding | boolean | Speculative decoding enabled |
| specMethod | string | Speculative decoding method, e.g. "Dflash", "EAGLE", "Medusa", "ngram" |
| specModel | string | Draft / decoder model HF ID for speculative decoding |
| specNumTokens | integer | Speculative tokens per step |
| specNgramSize | integer | N-gram size for ngram spec |
| specDraftTp | integer | Draft model tensor parallel |
| mtpEnabled | boolean | Multi-Token Prediction enabled (DeepSeek-style) |
| mtpDraftLayers | integer | Number of MTP draft layers |
| temperature | float 0–2 | Sampling temperature |
| topP | float 0–1 | Top-p nucleus sampling |
| topK | integer | Top-k sampling |
| minP | float 0–1 | Min-p sampling |
| repeatPenalty | float | Repeat penalty |
| mirostat | integer 0–2 | Mirostat mode |
| extraFlags | string | Any additional flags not covered above |
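To illustrate how a commandSnippet can be mined for flags, here is a small client-side sketch. It is illustrative only: the mapping below covers a few llama.cpp flags, and the server's actual parser may recognize more (or different) flags, so treat both the flag map and the helper name as assumptions.

```python
import shlex

# Assumed mapping from a few llama.cpp flags to submission fields;
# the server's real parser is not documented here.
FLAG_MAP = {"-c": "contextLength", "--n-gpu-layers": "gpuLayers", "--temp": "temperature"}
BOOL_FLAGS = {"-fa": "flashAttn"}

def sketch_parse(snippet: str) -> dict:
    """Extract a handful of known flags from a launch command."""
    tokens = shlex.split(snippet)
    out = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in BOOL_FLAGS:
            out[BOOL_FLAGS[tok]] = True
        elif tok in FLAG_MAP and i + 1 < len(tokens):
            val = tokens[i + 1]
            out[FLAG_MAP[tok]] = float(val) if "." in val else int(val)
            i += 1  # skip the consumed value token
        i += 1
    return out
```

Whatever the server parses, remember that explicit engineFlags fields always win over parsed values.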
When you provide commandSnippet, localmaxxing will attempt to parse flags from it automatically. Explicit fields always override parsed values.
GET /api/benchmarks
Fetch approved benchmark results. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter by model HF ID (includes finetunes of that base model) |
| hwClass | "DISCRETE_GPU" \| "UNIFIED" \| "CPU_ONLY" | Filter by hardware class |
| gpuName | string | Filter by GPU name (exact) |
| chipVendor | string | Filter by chip vendor |
| chipFamily | string | Filter by chip family |
| chipVariant | string | Filter by chip variant |
| kvCacheDtype | string | Filter by KV cache dtype |
| attentionBackend | string | Filter by attention backend |
| gpuLayersMin | integer | Minimum GPU layers (≥) |
| tensorParallelMin | integer | Minimum tensor parallel (≥) |
| specOnly | "true" | Only speculative decoding runs |
| mtpOnly | "true" | Only MTP-enabled runs |
| date | string | Filter to a specific UTC calendar day (YYYY-MM-DD) |
| dateFrom | string | ISO-8601 timestamp — only results after this date |
| dateTo | string | ISO-8601 timestamp — only results before this date |
| userId | string | Filter by internal user ID (overrides verified filter) |
| username | string | Case-insensitive filter by public username (ignored when userId is also provided) |
| verified | "true" \| "false" | Filter by user verification status (overridden by userId / username) |
| limit | integer | Results per page (1–100, default 20) |
| offset | integer | Pagination offset |
Response
{
"benchmarks": [ { ...benchmarkRun, model: {...}, hardware: {...}, engine: {...}, engineFlags: {...}, user: {...} } ],
"total": 142,
"limit": 20,
"offset": 0
}
Example
curl "https://localmaxxing.com/api/benchmarks?hfId=Qwen/Qwen3-8B&hwClass=DISCRETE_GPU&limit=10"
curl "https://localmaxxing.com/api/benchmarks?username=mason&dateFrom=2026-04-01T00:00:00Z"
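The same query can be composed programmatically; a sketch using Python's standard library (the helper name is ours), which also takes care of percent-encoding the slash in hfId:

```python
from urllib.parse import urlencode

def benchmarks_url(**params) -> str:
    """Compose a GET /api/benchmarks URL, dropping unset filters."""
    base = "https://localmaxxing.com/api/benchmarks"
    return base + "?" + urlencode({k: v for k, v in params.items() if v is not None})

url = benchmarks_url(hfId="Qwen/Qwen3-8B", hwClass="DISCRETE_GPU", limit=10)
```

Note that urlencode percent-encodes the "/" in the model ID; servers generally treat the encoded and literal forms of a query value equivalently.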
GET /api/leaderboard
Fetch ranked leaderboard data. Public endpoint — no auth required. Results are sorted by tokSOut descending.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter to a single model (URL-encoded: "org/model") |
| hwClass | "DISCRETE_GPU" \| "UNIFIED" \| "CPU_ONLY" | Filter by hardware class |
| memTier | string | VRAM tier: "8" \| "12" \| "16" \| "24" \| "32" \| "48" \| "80" \| "96" \| "128" |
| hardwareName | string | Case-insensitive substring match across GPU name (DISCRETE_GPU), chip family/variant/vendor (UNIFIED), and CPU name (CPU_ONLY) |
| engineName | string | Exact engine name |
| quantization | string | Exact quant string |
| paramSize | integer | Model parameter size tier: 1 \| 3 \| 7 \| 13 \| 30 \| 70 \| 110 |
| os | string | Filter by OS: "windows" \| "linux" \| "macos" |
| backend | string | Exact match on engine backend, e.g. "cuda", "rocm", "metal" |
| modelFamily | string | Case-insensitive substring match on model family |
| isMoE | "true" \| "false" | Filter to MoE ("true") or dense ("false") models |
| batchSize | integer | Batch size bucket — 8 matches runs with batchSize ≥ 8 |
| contextLen | integer | Context length bucket filter |
| verified | "true" \| "false" | Filter by user verification status (omit for all) |
| since | "7d" \| "30d" | Time window (omit for all-time) |
| limit | integer | Max rows (default 50, max 200) |
| offset | integer | Pagination offset |
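Both list endpoints page with limit/offset, and responses report the matching total, so a client can compute every offset it needs after the first page. A small sketch of that paging arithmetic (helper name is ours):

```python
def page_offsets(total: int, limit: int):
    """Yield the offset for each page needed to cover `total` rows,
    e.g. total=89, limit=50 -> offsets 0 and 50."""
    offset = 0
    while offset < total:
        yield offset
        offset += limit
```

In practice you would fetch offset 0 first, read total from the response, then loop over the remaining offsets.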
Response
{
"rows": [ { rank, id, model, hardware, engine, engineFlags, user, tokSOut, ... } ],
"total": 89,
"limit": 50,
"offset": 0
}
GET /api/models
Browse models in the database. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| search | string | Search by HF ID, display name, or family (case-insensitive) |
| tree | "true" | Return base models with nested finetunes instead of flat list |
| limit | integer | Results per page (default 200) |
| offset | integer | Pagination offset |
GET /api/models/search
Resolve fuzzy model names to canonical HuggingFace IDs before submitting. Query with q and optional limit (1–25, default 10). Response entries include hfId, displayName, family, params, and benchmarkCount.
Evals API
Quality evals use approved suites with task-level results and server-side aggregate scoring. Eval runs start as PENDING and appear publicly after admin approval.
| Endpoint | Auth | Description |
|---|---|---|
| GET /api/evals/suites | Public | List approved eval suites. Supports category, runner, official, limit, and offset filters. |
| GET /api/evals/suites/[slug] | Public | Fetch one suite with its full suiteDoc, task keys, scoring, and run config. |
| POST /api/evals/suites | API key or session | Register a new suite. Created suites start pending admin approval. |
| GET /api/evals/runs?modelId=... | Public | Best approved eval run per suite for a model. |
| POST /api/evals/runs/dry-run | API key or session | Validate task coverage and aggregate scoring without writing or consuming the rate limit. |
| POST /api/evals/runs | API key or session | Submit eval results. Rate limited to one eval run per 5 minutes per user. |
| POST /api/evals/execute | API key or session | Execute an approved CUSTOM suite against a public OpenAI-compatible endpoint. |
| GET/POST /api/evals/runs/[id]/react | GET public, POST session | React with one of fire, rocket, 100, brain, or chad. |
API Keys
Manage API keys for programmatic access. All endpoints require session authentication (not API key auth). Maximum of 10 keys per account.
GET /api/keys
List your API keys (key secrets are never returned).
[
{ "id": "...", "name": "My Agent", "prefix": "bhk_1a2b", "createdAt": "...", "lastUsedAt": "...", "expiresAt": null },
...
]
POST /api/keys
Create a new API key. The raw key is returned only once — store it immediately.
Request body:
{
"name": "My Agent Key",
"expiresAt": "2027-01-01T00:00:00Z" // optional ISO-8601
}
Response (201):
{
"id": "...",
"name": "My Agent Key",
"prefix": "bhk_1a2b",
"createdAt": "...",
"expiresAt": "2027-01-01T00:00:00Z",
"key": "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" // SHOWN ONLY ONCE
}
DELETE /api/keys/[id]
Revoke (delete) an API key. Returns { "ok": true } on success.
Saved Setups
Manage saved hardware/engine configurations. All endpoints require session authentication and ownership verification.
GET /api/setups
List your saved setups, ordered by default first, then most recent.
POST /api/setups
Create a new saved setup. If isDefault is true, existing defaults are cleared.
{
"name": "My RTX 3090 Setup",
"description": "Standard llama.cpp config",
"isDefault": true,
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"gpuCount": 1,
"vramGb": 24,
"engineName": "llama.cpp",
"quantization": "Q4_K_M",
"gpuLayers": 99,
"flashAttn": true,
"contextLength": 8192
}
GET /api/setups/[id]
Fetch a single saved setup by ID. Used by the submit page to prefill form data.
/api/setups/[id]
Update a saved setup. All fields are optional — only provided fields are changed. Pass null to clear nullable fields.
DELETE /api/setups/[id]
Delete a saved setup. Returns 204 No Content on success.
Rate Limits
Rate-limited responses include a retryAfterMs field and a Retry-After header.
| Endpoint | Limit | Scope |
|---|---|---|
| POST /api/benchmarks | 1 request / 5 min | Per user |
| GET /api/benchmarks | Generous | Per IP |
| GET /api/leaderboard | Generous | Per IP |
| GET /api/models | Generous | Per IP |
| POST /api/keys | Max 10 keys | Per account |
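An agent hitting the submission limit should wait for the server-provided interval rather than retrying blindly. A minimal sketch (helper name is ours; the 300-second fallback simply mirrors the documented one-submission-per-5-minutes limit):

```python
def backoff_seconds(error_body: dict, default: float = 300.0) -> float:
    """Seconds to wait before retrying a rate-limited POST,
    preferring the server-provided retryAfterMs."""
    ms = error_body.get("retryAfterMs")
    return ms / 1000.0 if isinstance(ms, (int, float)) else default
```

The Retry-After response header carries the same information for clients that prefer header-based handling.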
Common Engine Names
Use consistent casing — these are the accepted values:
| Engine | engineName value |
|---|---|
| llama.cpp / llama-server | llama.cpp |
| vLLM | vllm |
| SGLang | sglang |
| Ollama | ollama |
| LM Studio | lmstudio |
| ExLlamaV2 | exllamav2 |
| TGI (Text Generation Inference) | tgi |
| TensorRT-LLM | tensorrt-llm |
| MLX (Apple) | mlx |
| text-generation-webui | text-generation-webui |
| LMDeploy | lmdeploy |
| MLC-LLM | mlc-llm |
| HipFire | hipfire |
| tinygrad | tinygrad |
Benchmark Methodology
For reproducible, trustworthy results:
- Warm up the model with 1–2 throwaway runs before recording
- Use a fixed prompt for comparability — a 512-token system prompt + 32-token user message is a good baseline
- Record steady-state throughput — not the first token burst
- Measure tokSPrefill as prompt/input processing throughput before generation; this helps estimate wait time for long prompts
- Measure peakVramGb via nvidia-smi or equivalent at peak load
- Report batchSize accurately — batch=1 and batch=8 are not comparable
- Set temperature: 0 (greedy decode) for deterministic throughput benchmarks unless testing sampling overhead
- Include commandSnippet — the exact launch command is the most useful thing for reproducibility
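The throughput metrics can be derived from three raw measurements: token counts, time to first token, and total wall time. One common way to compute them is sketched below (the assumption that prefill ends at first token and generation spans the remaining time is ours; cross-check against your engine's own reported numbers):

```python
def derive_metrics(prompt_tokens: int, output_tokens: int,
                   ttft_s: float, total_s: float) -> dict:
    """Derive submission metrics from raw timings, assuming prefill
    ends at the first token and generation spans the rest."""
    gen_s = total_s - ttft_s
    return {
        "ttftMs": ttft_s * 1000.0,
        "tokSPrefill": prompt_tokens / ttft_s,
        "tokSOut": output_tokens / gen_s,
        "tokSTotal": (prompt_tokens + output_tokens) / total_s,
    }
```

For example, 512 prompt tokens reaching first token in 0.5 s and 1024 output tokens finishing at 12.5 s total yields roughly 1024 tok/s prefill and 85 tok/s output.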
Field Constraints
| Field | Max length / Range |
|---|---|
| hfId | 256 chars |
| modelRevision | 128 chars |
| engineName | 64 chars |
| engineVersion | 64 chars |
| quantization | 64 chars |
| backend | 64 chars |
| notes | 2000 chars |
| commandSnippet | 4000 chars |
| extraFlags | 1000 chars |
| contextLength | integer ≥ 1 |
| batchSize | integer ≥ 1 |
| prefillTokens | integer ≥ 0 |
| tokSPrefill | positive number |
| gpuMemUtil | 0.0 – 1.0 |
| temperature | 0.0 – 100.0 |
| topP / minP | 0.0 – 1.0 |
| mirostat | 0, 1, or 2 |
| gpuCount | integer ≥ 1 |
| vramGb | positive number |
| promptTokens | integer ≥ 0 |
| outputTokens | integer ≥ 0 |
Examples
Minimal Submission
Smallest valid request body:
{
"hfId": "Qwen/Qwen3-8B",
"hardware": {
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"vramGb": 24
},
"engineName": "llama.cpp",
"quantization": "Q4_K_M",
"tokSOut": 87.4,
"tokSPrefill": 1210.5,
"ttftMs": 210
}
Full Submission
Complete request with all optional fields:
{
"hfId": "Qwen/Qwen3-8B",
"modelRevision": "main",
"hardware": {
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"gpuCount": 1,
"vramGb": 24,
"cpu": "Ryzen 9 5900X",
"ramGb": 64,
"os": "Ubuntu 22.04"
},
"engineName": "llama.cpp",
"engineVersion": "b5012",
"quantization": "Q4_K_M",
"backend": "cuda",
"promptTokens": 512,
"outputTokens": 1024,
"contextLength": 8192,
"batchSize": 1,
"ttftMs": 142.5,
"tokSOut": 87.4,
"tokSPrefill": 1210.5,
"tokSTotal": 74.1,
"peakVramGb": 6.2,
"notes": "Automated benchmark via agent. Thermal throttling observed after 10 min.",
"engineFlags": {
"commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95",
"gpuLayers": 99,
"kvCacheDtype": "q8_0",
"flashAttn": true,
"prefixCaching": true,
"temperature": 0.6,
"topP": 0.95
}
}
cURL — Submit Benchmark
curl -X POST https://localmaxxing.com/api/benchmarks \
-H "Authorization: Bearer bhk_YOUR_API_KEY_HERE" \
-H "Content-Type: application/json" \
-d '{
"hfId": "Qwen/Qwen3-8B",
"hardware": { "hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24 },
"engineName": "llama.cpp",
"quantization": "Q4_K_M",
"tokSOut": 87.4,
"tokSPrefill": 1210.5,
"tokSTotal": 74.1,
"ttftMs": 142.5,
"contextLength": 8192,
"engineFlags": {
"commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa"
}
}'
cURL — Query Leaderboard
curl "https://localmaxxing.com/api/leaderboard?hwClass=DISCRETE_GPU&memTier=24&verified=true&limit=10"
curl "https://localmaxxing.com/api/leaderboard?hardwareName=RTX+4090&modelFamily=qwen&isMoE=false"
cURL — Filter Benchmarks by User & Date
curl "https://localmaxxing.com/api/benchmarks?username=mason&dateFrom=2026-04-01T00:00:00Z&dateTo=2026-04-30T23:59:59Z"
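Python — Submit Benchmark

The same submission can be made from Python with only the standard library. A sketch (the helper name is ours) that builds the request; pass the result to urllib.request.urlopen() to actually send it:

```python
import json
import urllib.request

def build_submit_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/benchmarks request."""
    return urllib.request.Request(
        "https://localmaxxing.com/api/benchmarks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_submit_request("bhk_YOUR_API_KEY_HERE", {
    "hfId": "Qwen/Qwen3-8B",
    "hardware": {"hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24},
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
    "ttftMs": 142.5,
})
# urllib.request.urlopen(req) would perform the submission.
```

Separating request construction from sending makes it easy to hit /api/benchmarks/dry-run first with the same payload.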