API Documentation
Complete reference for submitting and querying local LLM benchmarks. Designed for agents and developers building on the localmaxxing platform.
Overview
localmaxxing is a public leaderboard for local LLM inference benchmarks. The API enables agents and developers to:
- Run inference benchmarks on models
- Collect performance metrics (tok/s, TTFT, peak VRAM, etc.)
- Submit results to POST /api/benchmarks
- Query leaderboard data and benchmark results
The base URL is https://localmaxxing.com; all endpoints are prefixed with /api. Results appear on the dashboard and public leaderboard immediately upon submission.
Authentication
Submitting benchmarks requires authentication. Two methods are supported:
1. Bearer API Key (recommended for agents)
Include your API key in the Authorization header:
Authorization: Bearer bhk_<40 hex chars>
2. Session Cookie
If you're calling the API from the browser (e.g., the submit form), your session cookie authenticates you automatically.
Example
curl -X POST https://localmaxxing.com/api/benchmarks \
-H "Authorization: Bearer bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" \
-H "Content-Type: application/json" \
-d '{ ... }'
Unauthenticated requests receive 401 Unauthorized. API keys are created and managed in your dashboard. A maximum of 10 keys per account is allowed.
Agent Metadata
Agents should call GET /api/agent-context before submitting. It returns accepted enum values, schemas, examples, methodology tips, and endpoint URLs in one cached response.
The machine-readable OpenAPI 3.1 spec is available at GET /api/openapi.json.
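Before sending anything, a client can sanity-check its credentials locally. A minimal sketch (the helper name and the pre-flight check are ours, not part of the API; the key shape follows the `bhk_<40 hex chars>` format documented above):

```python
import re

# Key format documented above: "bhk_" followed by 40 hex characters.
API_KEY_PATTERN = re.compile(r"^bhk_[0-9a-f]{40}$")

def auth_headers(api_key: str) -> dict:
    """Build request headers for Bearer authentication, with a local
    sanity check on the key shape before any request is sent."""
    if not API_KEY_PATTERN.match(api_key):
        raise ValueError("API key does not look like bhk_<40 hex chars>")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Catching a malformed key client-side avoids burning a round trip on a guaranteed 401.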
POST /api/benchmarks
Submit a benchmark result. Requires authentication. This is the primary endpoint for agents.
Required Fields
| Field | Type | Description |
|---|---|---|
| hfId | string | HuggingFace model ID, e.g. "Qwen/Qwen3-8B" |
| hardware | object | Hardware config — see Hardware section |
| engineName | string | Inference engine, e.g. "llama.cpp", "vllm", "sglang" |
| quantization | string | Quant format, e.g. "Q4_K_M", "AWQ", "fp8" |
tokSOut is required; at least one of the following secondary metrics must accompany it:
| Metric | Type | Description |
|---|---|---|
| tokSOut | number | Output tokens per second |
| tokSPrefill | number | Prefill/input tokens per second before generation |
| tokSTotal | number | Total tokens per second (prompt + output) |
| ttftMs | number | Time to first token in milliseconds |
| peakVramGb | number | Peak VRAM usage in GB |
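The tokSOut-plus-one-secondary rule can be checked client-side before submitting. A minimal sketch (field names come from the table above; the helper itself is hypothetical, not part of the API):

```python
# Secondary metrics accepted alongside the mandatory tokSOut.
SECONDARY_METRICS = ("tokSPrefill", "tokSTotal", "ttftMs", "peakVramGb")

def has_required_metrics(payload: dict) -> bool:
    """True if the payload satisfies the metric rule:
    tokSOut plus at least one secondary metric."""
    if not isinstance(payload.get("tokSOut"), (int, float)):
        return False
    return any(isinstance(payload.get(m), (int, float)) for m in SECONDARY_METRICS)
```

A payload failing this check would be rejected by the server with the "At least one additional metric" error shown in the Responses section.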
Optional Fields
| Field | Type | Default | Description |
|---|---|---|---|
| modelRevision | string | "main" | Git revision / branch / commit SHA |
| engineVersion | string | — | Engine version, e.g. "0.7.3" |
| backend | string | — | Backend variant, e.g. "cuda", "metal", "vulkan" |
| promptTokens | integer | 0 | Number of prompt tokens used |
| outputTokens | integer | 0 | Number of output tokens generated |
| contextLength | integer | 2048 | Context window size used |
| batchSize | integer | 1 | Batch size (concurrent requests) |
| prefillTokens | integer | — | Tokens already in KV cache before generation started |
| tokSPrefill | number | — | Prefill/input tokens per second before generation |
| peakVramGb | number | — | Peak VRAM usage in GB |
| notes | string | — | Free-text notes, max 2000 chars |
| engineFlags | object | — | Detailed engine flags — see Engine Flags |
Responses
Success — the created benchmark run:

{
  "id": "clxyz...",
  "modelId": "...",
  "hardwareId": "...",
  "engineId": "...",
  "userId": "...",
  "tokSOut": 87.4,
  "tokSPrefill": 1210.5,
  "status": "APPROVED",
  "createdAt": "2026-04-14T03:45:00.000Z",
  "model": { "hfId": "Qwen/Qwen3-8B", "displayName": "Qwen3-8B", ... },
  ...
}

Validation error:

{
  "error": "Validation failed",
  "details": {
    "fieldErrors": { "hardware.vramGb": ["Required"] },
    "formErrors": []
  }
}

Missing or invalid authentication (401):

{ "error": "Authentication required. Use a session cookie or Authorization: Bearer <api_key>" }

Unknown model:

{ "error": "Model \"some/bad-id\" not found on HuggingFace" }

Missing secondary metric:

{ "error": "At least one additional metric (TTFT, prefill tok/s, tok/s total, or peak VRAM) is required alongside tok/s output" }

Rate limited:

{
  "error": "Rate limit exceeded. You may submit once every 5 minutes.",
  "retryAfterMs": 240000,
  "lastSubmittedAt": "2026-04-14T03:40:00.000Z"
}

POST /api/benchmarks/dry-run
Validate a benchmark payload without writing to the database or consuming the rate limit. Requires the same Bearer API key or session cookie auth as real submissions.
{
"valid": true,
"parsed": { "hfId": "Qwen/Qwen3-8B", "modelRevision": "main", "promptTokens": 0 }
}
Hardware Object
The hardware field is a discriminated union on hwClass. Use the right shape for the hardware being tested.
DISCRETE_GPU: NVIDIA / AMD / Intel discrete graphics cards
{
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"gpuCount": 1,
"vramGb": 24,
"cpu": "Ryzen 9 5900X",
"ramGb": 64,
"os": "Ubuntu 22.04",
"powerWatts": 350
}
| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "DISCRETE_GPU" | Literal |
| gpuName | ✅ | string | e.g. "RTX 3090", "A100 80GB" |
| gpuCount | — | integer | Default 1 |
| vramGb | ✅ | number | Per-card VRAM in GB |
| cpu | — | string | CPU model |
| ramGb | — | number | System RAM in GB |
| os | — | string | Operating system |
| powerWatts | — | number | TDP / measured power draw |
UNIFIED: Apple Silicon / AMD APU / Intel Arc
{
"hwClass": "UNIFIED",
"chipVendor": "Apple",
"chipFamily": "M4",
"chipVariant": "M4 Pro",
"unifiedMemoryGb": 48,
"npuTops": 38,
"os": "macOS 15.4"
}
| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "UNIFIED" | Literal |
| chipVendor | ✅ | string | e.g. "Apple", "AMD" |
| chipFamily | ✅ | string | e.g. "M4", "Strix Point" |
| chipVariant | ✅ | string | e.g. "M4 Pro", "M4 Max" |
| unifiedMemoryGb | ✅ | number | Total unified memory |
| npuTops | — | number | NPU TOPS if applicable |
| cpu | — | string | CPU core descriptor |
| os | — | string | Operating system |
| powerWatts | — | number | Power draw |
CPU_ONLY: CPU-only inference
{
"hwClass": "CPU_ONLY",
"cpu": "Intel Xeon W9-3595X",
"ramGb": 512,
"os": "Ubuntu 24.04"
}
| Field | Required | Type | Notes |
|---|---|---|---|
| hwClass | ✅ | "CPU_ONLY" | Literal |
| cpu | ✅ | string | CPU model |
| ramGb | ✅ | number | System RAM in GB |
| os | — | string | Operating system |
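Because hardware is a discriminated union, a client can verify it picked the right shape before submitting. A sketch of such a check (required fields per class come from the three tables above; the helper is ours):

```python
# Required keys per hardware class, per the tables above.
REQUIRED_BY_CLASS = {
    "DISCRETE_GPU": ("gpuName", "vramGb"),
    "UNIFIED": ("chipVendor", "chipFamily", "chipVariant", "unifiedMemoryGb"),
    "CPU_ONLY": ("cpu", "ramGb"),
}

def missing_hardware_fields(hardware: dict) -> list:
    """Return the names of required fields missing from a hardware object,
    dispatching on the hwClass discriminator."""
    hw_class = hardware.get("hwClass")
    if hw_class not in REQUIRED_BY_CLASS:
        return ["hwClass"]
    return [k for k in REQUIRED_BY_CLASS[hw_class] if hardware.get(k) is None]
```

An empty result means the object at least has the required keys for its class; the server still validates types and values.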
Engine Flags Object
Optional. Provide engineFlags to record the exact launch configuration. If the object is supplied, commandSnippet is required within it, and any explicitly provided fields override flags parsed from the command.
{
"commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa",
"tensorParallel": 1,
"gpuLayers": 99,
"kvCacheDtype": "q8_0",
"flashAttn": true,
"attentionBackend": "flash_attn",
"concurrency": 4,
"temperature": 0.6,
"topP": 0.95
}
| Flag | Type | Description |
|---|---|---|
| commandSnippet | string | Full launch command (recommended — parsed automatically) |
| tensorParallel | integer | Tensor parallel degree (TP) |
| pipelineParallel | integer | Pipeline parallel degree |
| gpuLayers | integer | Number of layers offloaded to GPU (llama.cpp --n-gpu-layers) |
| splitMode | string | GPU split mode |
| kvCacheDtype | string | KV cache quantization, e.g. "q8_0", "fp8" |
| gpuMemUtil | float 0–1 | GPU memory utilization fraction (vLLM) |
| kvCacheSizeMb | integer | KV cache size in MB |
| prefixCaching | boolean | Whether prefix/prompt caching was enabled |
| attentionBackend | string | e.g. "flash_attn", "xformers", "sdpa" |
| flashAttn | boolean | Flash Attention enabled |
| chunkedPrefill | boolean | Chunked prefill enabled |
| prefillChunkSize | integer | Prefill chunk size |
| contBatching | boolean | Continuous batching enabled |
| cpuOffloadGb | float | GB of weights offloaded to CPU RAM |
| cpuLayers | integer | Number of layers on CPU |
| ropeScaling | string | RoPE scaling method, e.g. "yarn", "linear" |
| ropeScale | float | RoPE scale factor |
| yarnExtFactor | float | YaRN extension factor |
| engineQuant | string | Engine-level quantization override |
| sglangQuant | string | SGLang quantization method |
| maxRunningSeqs | integer | Max running sequences |
| schedulerDelayFactor | float | Scheduler delay factor |
| numParallel | integer | Number of parallel sequences (Ollama) |
| concurrency | integer | Concurrent requests used for throughput runs (vLLM / SGLang) |
| specDecoding | boolean | Speculative decoding enabled |
| specMethod | string | Speculative decoding method, e.g. "Dflash", "EAGLE", "Medusa", "ngram" |
| specModel | string | Draft / decoder model HF ID for speculative decoding |
| specNumTokens | integer | Speculative tokens per step |
| specNgramSize | integer | N-gram size for ngram spec |
| specDraftTp | integer | Draft model tensor parallel |
| mtpEnabled | boolean | Multi-Token Prediction enabled (DeepSeek-style) |
| mtpDraftLayers | integer | Number of MTP draft layers |
| temperature | float 0–2 | Sampling temperature |
| topP | float 0–1 | Top-p nucleus sampling |
| topK | integer | Top-k sampling |
| minP | float 0–1 | Min-p sampling |
| repeatPenalty | float | Repeat penalty |
| mirostat | integer 0–2 | Mirostat mode |
| extraFlags | string | Any additional flags not covered above |
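To illustrate how a commandSnippet can be mined for flags, here is a small client-side sketch. It is illustrative only: the mapping below covers a few llama.cpp flags, and the server's actual parser may recognize more (or different) flags, so treat both the flag map and the helper name as assumptions.

```python
import shlex

# Assumed mapping from a few llama.cpp flags to submission fields;
# the server's real parser is not documented here.
FLAG_MAP = {"-c": "contextLength", "--n-gpu-layers": "gpuLayers", "--temp": "temperature"}
BOOL_FLAGS = {"-fa": "flashAttn"}

def sketch_parse(snippet: str) -> dict:
    """Extract a handful of known flags from a launch command."""
    tokens = shlex.split(snippet)
    out = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in BOOL_FLAGS:
            out[BOOL_FLAGS[tok]] = True
        elif tok in FLAG_MAP and i + 1 < len(tokens):
            val = tokens[i + 1]
            out[FLAG_MAP[tok]] = float(val) if "." in val else int(val)
            i += 1  # skip the consumed value token
        i += 1
    return out
```

Whatever the server parses, remember that explicit engineFlags fields always win over parsed values.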
When you provide commandSnippet, localmaxxing will attempt to parse flags from it automatically. Explicit fields always override parsed values.
GET /api/benchmarks
Fetch approved benchmark results. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter by model HF ID (includes finetunes of that base model) |
| hwClass | "DISCRETE_GPU" \| "UNIFIED" \| "CPU_ONLY" | Filter by hardware class |
| gpuName | string | Filter by GPU name (exact) |
| chipVendor | string | Filter by chip vendor |
| chipFamily | string | Filter by chip family |
| chipVariant | string | Filter by chip variant |
| kvCacheDtype | string | Filter by KV cache dtype |
| attentionBackend | string | Filter by attention backend |
| gpuLayersMin | integer | Minimum GPU layers (≥) |
| tensorParallelMin | integer | Minimum tensor parallel (≥) |
| specOnly | "true" | Only speculative decoding runs |
| mtpOnly | "true" | Only MTP-enabled runs |
| date | string | Filter to a specific UTC calendar day (YYYY-MM-DD) |
| dateFrom | string | ISO-8601 timestamp — only results after this date |
| dateTo | string | ISO-8601 timestamp — only results before this date |
| userId | string | Filter by internal user ID (overrides verified filter) |
| username | string | Case-insensitive filter by public username (ignored when userId is also provided) |
| verified | "true" \| "false" | Filter by user verification status (overridden by userId / username) |
| limit | integer | Results per page (1–100, default 20) |
| offset | integer | Pagination offset |
Response
{
"benchmarks": [ { ...benchmarkRun, model: {...}, hardware: {...}, engine: {...}, engineFlags: {...}, user: {...} } ],
"total": 142,
"limit": 20,
"offset": 0
}
Example
curl "https://localmaxxing.com/api/benchmarks?hfId=Qwen/Qwen3-8B&hwClass=DISCRETE_GPU&limit=10"
curl "https://localmaxxing.com/api/benchmarks?username=mason&dateFrom=2026-04-01T00:00:00Z"
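The same query can be composed programmatically; a sketch using Python's standard library (the helper name is ours), which also takes care of percent-encoding the slash in hfId:

```python
from urllib.parse import urlencode

def benchmarks_url(**params) -> str:
    """Compose a GET /api/benchmarks URL, dropping unset filters."""
    base = "https://localmaxxing.com/api/benchmarks"
    return base + "?" + urlencode({k: v for k, v in params.items() if v is not None})

url = benchmarks_url(hfId="Qwen/Qwen3-8B", hwClass="DISCRETE_GPU", limit=10)
```

Note that urlencode percent-encodes the "/" in the model ID; servers generally treat the encoded and literal forms of a query value equivalently.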
GET /api/leaderboard
Fetch ranked leaderboard data. Public endpoint — no auth required. Results are sorted by tokSOut descending.
Query Parameters
| Param | Type | Description |
|---|---|---|
| hfId | string | Filter to a single model (URL-encoded: "org/model") |
| hwClass | "DISCRETE_GPU" \| "UNIFIED" \| "CPU_ONLY" | Filter by hardware class |
| memTier | string | VRAM tier: "8" \| "12" \| "16" \| "24" \| "32" \| "48" \| "80" \| "96" \| "128" |
| hardwareName | string | Case-insensitive substring match across GPU name (DISCRETE_GPU), chip family/variant/vendor (UNIFIED), and CPU name (CPU_ONLY) |
| engineName | string | Exact engine name |
| quantization | string | Exact quant string |
| paramSize | integer | Model parameter size tier: 1 \| 3 \| 7 \| 13 \| 30 \| 70 \| 110 |
| os | string | Filter by OS: "windows" \| "linux" \| "macos" |
| backend | string | Exact match on engine backend, e.g. "cuda", "rocm", "metal" |
| modelFamily | string | Case-insensitive substring match on model family |
| isMoE | "true" \| "false" | Filter to MoE ("true") or dense ("false") models |
| batchSize | integer | Batch size bucket — 8 matches runs with batchSize ≥ 8 |
| contextLen | integer | Context length bucket filter |
| verified | "true" \| "false" | Filter by user verification status (omit for all) |
| since | "7d" \| "30d" | Time window (omit for all-time) |
| limit | integer | Max rows (default 50, max 200) |
| offset | integer | Pagination offset |
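Both list endpoints page with limit/offset, and responses report the matching total, so a client can compute every offset it needs after the first page. A small sketch of that paging arithmetic (helper name is ours):

```python
def page_offsets(total: int, limit: int):
    """Yield the offset for each page needed to cover `total` rows,
    e.g. total=89, limit=50 -> offsets 0 and 50."""
    offset = 0
    while offset < total:
        yield offset
        offset += limit
```

In practice you would fetch offset 0 first, read total from the response, then loop over the remaining offsets.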
Response
{
"rows": [ { rank, id, model, hardware, engine, engineFlags, user, tokSOut, ... } ],
"total": 89,
"limit": 50,
"offset": 0
}
GET /api/models
Browse models in the database. Public endpoint — no auth required.
Query Parameters
| Param | Type | Description |
|---|---|---|
| search | string | Search by HF ID, display name, or family (case-insensitive) |
| tree | "true" | Return base models with nested finetunes instead of flat list |
| limit | integer | Results per page (default 200) |
| offset | integer | Pagination offset |
GET /api/models/search
Resolve fuzzy model names to canonical HuggingFace IDs before submitting. Query with q and optional limit (1–25, default 10). Response entries include hfId, displayName, family, params, and benchmarkCount.
Evals API
Quality evals use approved suites with task-level results and server-side aggregate scoring. Eval runs start as PENDING and appear publicly after admin approval.
| Endpoint | Auth | Description |
|---|---|---|
| GET /api/evals/suites | Public | List approved eval suites. Supports category, runner, official, limit, and offset filters. |
| GET /api/evals/suites/[slug] | Public | Fetch one suite with its full suiteDoc, task keys, scoring, and run config. |
| POST /api/evals/suites | API key or session | Register a new suite. Created suites start pending admin approval. |
| GET /api/evals/runs?modelId=... | Public | Best approved eval run per suite for a model. |
| POST /api/evals/runs/dry-run | API key or session | Validate task coverage and aggregate scoring without writing or consuming the rate limit. |
| POST /api/evals/runs | API key or session | Submit eval results. Rate limited to one eval run per 5 minutes per user. |
| POST /api/evals/execute | API key or session | Execute an approved CUSTOM suite against a public OpenAI-compatible endpoint. |
| GET/POST /api/evals/runs/[id]/react | GET public, POST session | React with one of fire, rocket, 100, brain, or chad. |
API Keys
Manage API keys for programmatic access. All endpoints require session authentication (not API key auth). Maximum of 10 keys per account.
GET /api/keys
List your API keys (key secrets are never returned).
[
{ "id": "...", "name": "My Agent", "prefix": "bhk_1a2b", "createdAt": "...", "lastUsedAt": "...", "expiresAt": null },
...
]
POST /api/keys
Create a new API key. The raw key is returned only once — store it immediately.
Request body:
{
"name": "My Agent Key",
"expiresAt": "2027-01-01T00:00:00Z" // optional ISO-8601
}
Response (201):
{
"id": "...",
"name": "My Agent Key",
"prefix": "bhk_1a2b",
"createdAt": "...",
"expiresAt": "2027-01-01T00:00:00Z",
"key": "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" // SHOWN ONLY ONCE
}
DELETE /api/keys/[id]
Revoke (delete) an API key. Returns { "ok": true } on success.
Saved Setups
Manage saved hardware/engine configurations. All endpoints require session authentication and ownership verification.
GET /api/setups
List your saved setups, ordered by default first, then most recent.
POST /api/setups
Create a new saved setup. If isDefault is true, existing defaults are cleared.
{
"name": "My RTX 3090 Setup",
"description": "Standard llama.cpp config",
"isDefault": true,
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"gpuCount": 1,
"vramGb": 24,
"engineName": "llama.cpp",
"quantization": "Q4_K_M",
"gpuLayers": 99,
"flashAttn": true,
"contextLength": 8192
}
GET /api/setups/[id]
Fetch a single saved setup by ID. Used by the submit page to prefill form data.
/api/setups/[id]
Update a saved setup. All fields are optional — only provided fields are changed. Pass null to clear nullable fields.
DELETE /api/setups/[id]
Delete a saved setup. Returns 204 No Content on success.
Rate Limits
Rate-limited responses include a retryAfterMs field and a Retry-After header.
| Endpoint | Limit | Scope |
|---|---|---|
| POST /api/benchmarks | 1 request / 5 min | Per user |
| GET /api/benchmarks | Generous | Per IP |
| GET /api/leaderboard | Generous | Per IP |
| GET /api/models | Generous | Per IP |
| POST /api/keys | Max 10 keys | Per account |
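An agent hitting the submission limit should wait for the server-provided interval rather than retrying blindly. A minimal sketch (helper name is ours; the 300-second fallback simply mirrors the documented one-submission-per-5-minutes limit):

```python
def backoff_seconds(error_body: dict, default: float = 300.0) -> float:
    """Seconds to wait before retrying a rate-limited POST,
    preferring the server-provided retryAfterMs."""
    ms = error_body.get("retryAfterMs")
    return ms / 1000.0 if isinstance(ms, (int, float)) else default
```

The Retry-After response header carries the same information for clients that prefer header-based handling.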
Common Engine Names
Use consistent casing — these are the accepted values:
| Engine | engineName value |
|---|---|
| llama.cpp / llama-server | llama.cpp |
| vLLM | vllm |
| SGLang | sglang |
| Ollama | ollama |
| LM Studio | lmstudio |
| ExLlamaV2 | exllamav2 |
| TGI (Text Generation Inference) | tgi |
| TensorRT-LLM | tensorrt-llm |
| MLX (Apple) | mlx |
| text-generation-webui | text-generation-webui |
| LMDeploy | lmdeploy |
| MLC-LLM | mlc-llm |
| HipFire | hipfire |
| tinygrad | tinygrad |
Benchmark Methodology
For reproducible, trustworthy results:
- Warm up the model with 1–2 throwaway runs before recording
- Use a fixed prompt for comparability — a 512-token system prompt + 32-token user message is a good baseline
- Record steady-state throughput — not the first token burst
- Measure tokSPrefill as prompt/input processing throughput before generation; this helps estimate wait time for long prompts
- Measure peakVramGb via nvidia-smi or equivalent at peak load
- Report batchSize accurately — batch=1 and batch=8 are not comparable
- Set temperature: 0 (greedy decode) for deterministic throughput benchmarks unless testing sampling overhead
- Include commandSnippet — the exact launch command is the most useful thing for reproducibility
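The throughput metrics can be derived from three raw measurements: token counts, time to first token, and total wall time. One common way to compute them is sketched below (the assumption that prefill ends at first token and generation spans the remaining time is ours; cross-check against your engine's own reported numbers):

```python
def derive_metrics(prompt_tokens: int, output_tokens: int,
                   ttft_s: float, total_s: float) -> dict:
    """Derive submission metrics from raw timings, assuming prefill
    ends at the first token and generation spans the rest."""
    gen_s = total_s - ttft_s
    return {
        "ttftMs": ttft_s * 1000.0,
        "tokSPrefill": prompt_tokens / ttft_s,
        "tokSOut": output_tokens / gen_s,
        "tokSTotal": (prompt_tokens + output_tokens) / total_s,
    }
```

For example, 512 prompt tokens reaching first token in 0.5 s and 1024 output tokens finishing at 12.5 s total yields roughly 1024 tok/s prefill and 85 tok/s output.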
Field Constraints
| Field | Max length / Range |
|---|---|
| hfId | 256 chars |
| modelRevision | 128 chars |
| engineName | 64 chars |
| engineVersion | 64 chars |
| quantization | 64 chars |
| backend | 64 chars |
| notes | 2000 chars |
| commandSnippet | 4000 chars |
| extraFlags | 1000 chars |
| contextLength | integer ≥ 1 |
| batchSize | integer ≥ 1 |
| prefillTokens | integer ≥ 0 |
| tokSPrefill | positive number |
| gpuMemUtil | 0.0 – 1.0 |
| temperature | 0.0 – 100.0 |
| topP / minP | 0.0 – 1.0 |
| mirostat | 0, 1, or 2 |
| gpuCount | integer ≥ 1 |
| vramGb | positive number |
| promptTokens | integer ≥ 0 |
| outputTokens | integer ≥ 0 |
Examples
Minimal Submission
Smallest valid request body:
{
"hfId": "Qwen/Qwen3-8B",
"hardware": {
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"vramGb": 24
},
"engineName": "llama.cpp",
"quantization": "Q4_K_M",
"tokSOut": 87.4,
"tokSPrefill": 1210.5,
"ttftMs": 210
}
Full Submission
Complete request with all optional fields:
{
"hfId": "Qwen/Qwen3-8B",
"modelRevision": "main",
"hardware": {
"hwClass": "DISCRETE_GPU",
"gpuName": "RTX 3090",
"gpuCount": 1,
"vramGb": 24,
"cpu": "Ryzen 9 5900X",
"ramGb": 64,
"os": "Ubuntu 22.04"
},
"engineName": "llama.cpp",
"engineVersion": "b5012",
"quantization": "Q4_K_M",
"backend": "cuda",
"promptTokens": 512,
"outputTokens": 1024,
"contextLength": 8192,
"batchSize": 1,
"ttftMs": 142.5,
"tokSOut": 87.4,
"tokSPrefill": 1210.5,
"tokSTotal": 74.1,
"peakVramGb": 6.2,
"notes": "Automated benchmark via agent. Thermal throttling observed after 10 min.",
"engineFlags": {
"commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95",
"gpuLayers": 99,
"kvCacheDtype": "q8_0",
"flashAttn": true,
"prefixCaching": true,
"temperature": 0.6,
"topP": 0.95
}
}
cURL — Submit Benchmark
curl -X POST https://localmaxxing.com/api/benchmarks \
-H "Authorization: Bearer bhk_YOUR_API_KEY_HERE" \
-H "Content-Type: application/json" \
-d '{
"hfId": "Qwen/Qwen3-8B",
"hardware": { "hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24 },
"engineName": "llama.cpp",
"quantization": "Q4_K_M",
"tokSOut": 87.4,
"tokSPrefill": 1210.5,
"tokSTotal": 74.1,
"ttftMs": 142.5,
"contextLength": 8192,
"engineFlags": {
"commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa"
}
}'
cURL — Query Leaderboard
curl "https://localmaxxing.com/api/leaderboard?hwClass=DISCRETE_GPU&memTier=24&verified=true&limit=10"
curl "https://localmaxxing.com/api/leaderboard?hardwareName=RTX+4090&modelFamily=qwen&isMoE=false"
cURL — Filter Benchmarks by User & Date
curl "https://localmaxxing.com/api/benchmarks?username=mason&dateFrom=2026-04-01T00:00:00Z&dateTo=2026-04-30T23:59:59Z"
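Python — Submit Benchmark

The same submission can be made from Python with only the standard library. A sketch (the helper name is ours) that builds the request; pass the result to urllib.request.urlopen() to actually send it:

```python
import json
import urllib.request

def build_submit_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/benchmarks request."""
    return urllib.request.Request(
        "https://localmaxxing.com/api/benchmarks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_submit_request("bhk_YOUR_API_KEY_HERE", {
    "hfId": "Qwen/Qwen3-8B",
    "hardware": {"hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24},
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
    "ttftMs": 142.5,
})
# urllib.request.urlopen(req) would perform the submission.
```

Separating request construction from sending makes it easy to hit /api/benchmarks/dry-run first with the same payload.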