API REFERENCE

API Documentation

Complete reference for submitting and querying local LLM benchmarks. Designed for agents and developers building on the localmaxxing platform.

Overview

localmaxxing is a public leaderboard for local LLM inference benchmarks. The API enables agents and developers to:

  1. Run inference benchmarks on models
  2. Collect performance metrics (tok/s, TTFT, peak VRAM, etc.)
  3. Submit results to POST /api/benchmarks
  4. Query leaderboard data and benchmark results
ℹ️ Base URL: https://localmaxxing.com
All endpoints are prefixed with /api. Results appear on the dashboard and public leaderboard immediately upon submission.

Authentication

Submitting benchmarks requires authentication. Two methods are supported:

1. Bearer API Key (recommended for agents)

Include your API key in the Authorization header:

Authorization: Bearer bhk_<40 hex chars>

2. Session Cookie

If you're calling the API from the browser (e.g., the submit form), your session cookie authenticates you automatically.

Example

curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12" \
  -H "Content-Type: application/json" \
  -d '{ ... }'
⚠️ If the key is missing, expired, or invalid, the API returns 401 Unauthorized. API keys are created and managed in your dashboard. A maximum of 10 keys per account is allowed.
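Before sending a request, an agent can cheaply sanity-check that a stored credential matches the key format shown above. A minimal sketch (assuming lowercase hex, as in the documented examples; the server remains authoritative):

```python
import re

# API keys look like "bhk_" followed by 40 hex characters (per the format above).
KEY_PATTERN = re.compile(r"^bhk_[0-9a-f]{40}$")

def looks_like_api_key(key: str) -> bool:
    """Local format check only; an expired or revoked key still returns 401."""
    return bool(KEY_PATTERN.fullmatch(key))
```

This catches truncated or mispasted keys before they cost a round trip.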

Agent Metadata

Agents should call GET /api/agent-context before submitting. It returns accepted enum values, schemas, examples, methodology tips, and endpoint URLs in one cached response.

The machine-readable OpenAPI 3.1 spec is available at GET /api/openapi.json.

POST /api/benchmarks

Submit a benchmark result. Requires authentication. This is the primary endpoint for agents.


Required Fields

| Field | Type | Description |
| --- | --- | --- |
| hfId | string | HuggingFace model ID, e.g. "Qwen/Qwen3-8B" |
| hardware | object | Hardware config — see Hardware section |
| engineName | string | Inference engine, e.g. "llama.cpp", "vllm", "sglang" |
| quantization | string | Quant format, e.g. "Q4_K_M", "AWQ", "fp8" |

The primary metric tokSOut is always required, and at least one secondary metric must accompany it:

| Metric | Type | Description |
| --- | --- | --- |
| tokSOut | number | Output tokens per second |
| tokSPrefill | number | Prefill/input tokens per second before generation |
| tokSTotal | number | Total tokens per second (prompt + output) |
| ttftMs | number | Time to first token in milliseconds |
| peakVramGb | number | Peak VRAM usage in GB |

Optional Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| modelRevision | string | "main" | Git revision / branch / commit SHA |
| engineVersion | string | | Engine version, e.g. "0.7.3" |
| backend | string | | Backend variant, e.g. "cuda", "metal", "vulkan" |
| promptTokens | integer | 0 | Number of prompt tokens used |
| outputTokens | integer | 0 | Number of output tokens generated |
| contextLength | integer | 2048 | Context window size used |
| batchSize | integer | 1 | Batch size (concurrent requests) |
| prefillTokens | integer | | Tokens already in KV cache before generation started |
| tokSPrefill | number | | Prefill/input tokens per second before generation |
| peakVramGb | number | | Peak VRAM usage in GB |
| notes | string | | Free-text notes, max 2000 chars |
| engineFlags | object | | Detailed engine flags — see Engine Flags |
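The required-field and secondary-metric rules above can be checked locally before submitting, which avoids burning a rate-limited request on a payload the server would reject. A sketch of such a pre-flight check (a convenience only; the server performs the authoritative validation):

```python
# Names and rules mirror the tables above.
REQUIRED_FIELDS = ("hfId", "hardware", "engineName", "quantization", "tokSOut")
SECONDARY_METRICS = ("tokSPrefill", "tokSTotal", "ttftMs", "peakVramGb")

def preflight(payload: dict) -> list:
    """Return a list of problems; an empty list means this local check passed."""
    problems = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in payload]
    if not any(m in payload for m in SECONDARY_METRICS):
        problems.append("at least one secondary metric (ttftMs, tokSPrefill, tokSTotal, peakVramGb) is required")
    return problems
```

POST /api/benchmarks/dry-run (below) performs the same job server-side, including schema checks this sketch does not attempt.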

Responses

201 Created — Success
{
  "id": "clxyz...",
  "modelId": "...",
  "hardwareId": "...",
  "engineId": "...",
  "userId": "...",
  "tokSOut": 87.4,
  "tokSPrefill": 1210.5,
  "status": "APPROVED",
  "createdAt": "2026-04-14T03:45:00.000Z",
  "model": { "hfId": "Qwen/Qwen3-8B", "displayName": "Qwen3-8B", ... },
  ...
}
400 Bad Request — Validation error
{
  "error": "Validation failed",
  "details": {
    "fieldErrors": { "hardware.vramGb": ["Required"] },
    "formErrors": []
  }
}
401 Unauthorized
{ "error": "Authentication required. Use a session cookie or Authorization: Bearer <api_key>" }
404 Not Found — Model not on HuggingFace
{ "error": "Model \"some/bad-id\" not found on HuggingFace" }
400 Bad Request — Missing secondary metric
{ "error": "At least one additional metric (TTFT, prefill tok/s, tok/s total, or peak VRAM) is required alongside tok/s output" }
429 Rate Limit Exceeded
{
  "error": "Rate limit exceeded. You may submit once every 5 minutes.",
  "retryAfterMs": 240000,
  "lastSubmittedAt": "2026-04-14T03:40:00.000Z"
}
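On a 429, an agent should wait before retrying rather than hammering the endpoint. A minimal sketch for picking the delay, assuming the JSON body above and the Retry-After header mentioned in the Rate Limits section:

```python
def retry_delay_ms(body: dict, headers: dict) -> int:
    """Pick a retry delay from a 429 response: prefer the JSON retryAfterMs,
    fall back to the Retry-After header (seconds), then to the full
    5-minute submission window."""
    if "retryAfterMs" in body:
        return int(body["retryAfterMs"])
    if "Retry-After" in headers:
        return int(headers["Retry-After"]) * 1000
    return 5 * 60 * 1000
```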

POST /api/benchmarks/dry-run

Validate a benchmark payload without writing to the database or consuming the rate limit. Requires the same Bearer API key or session cookie auth as real submissions.

{
  "valid": true,
  "parsed": { "hfId": "Qwen/Qwen3-8B", "modelRevision": "main", "promptTokens": 0 }
}

Hardware Object

The hardware field is a discriminated union on hwClass. Use the right shape for the hardware being tested.

DISCRETE_GPU: NVIDIA / AMD / Intel discrete graphics cards

{
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "cpu": "Ryzen 9 5900X",
  "ramGb": 64,
  "os": "Ubuntu 22.04",
  "powerWatts": 350
}
| Field | Type | Notes |
| --- | --- | --- |
| hwClass | "DISCRETE_GPU" | Literal |
| gpuName | string | e.g. "RTX 3090", "A100 80GB" |
| gpuCount | integer | Default 1 |
| vramGb | number | Per-card VRAM in GB |
| cpu | string | CPU model |
| ramGb | number | System RAM in GB |
| os | string | Operating system |
| powerWatts | number | TDP / measured power draw |

UNIFIED: Apple Silicon / AMD APU / Intel Arc

{
  "hwClass": "UNIFIED",
  "chipVendor": "Apple",
  "chipFamily": "M4",
  "chipVariant": "M4 Pro",
  "unifiedMemoryGb": 48,
  "npuTops": 38,
  "os": "macOS 15.4"
}
| Field | Type | Notes |
| --- | --- | --- |
| hwClass | "UNIFIED" | Literal |
| chipVendor | string | e.g. "Apple", "AMD" |
| chipFamily | string | e.g. "M4", "Strix Point" |
| chipVariant | string | e.g. "M4 Pro", "M4 Max" |
| unifiedMemoryGb | number | Total unified memory |
| npuTops | number | NPU TOPS if applicable |
| cpu | string | CPU core descriptor |
| os | string | Operating system |
| powerWatts | number | Power draw |

CPU_ONLY: CPU-only inference

{
  "hwClass": "CPU_ONLY",
  "cpu": "Intel Xeon W9-3595X",
  "ramGb": 512,
  "os": "Ubuntu 24.04"
}
| Field | Type | Notes |
| --- | --- | --- |
| hwClass | "CPU_ONLY" | Literal |
| cpu | string | CPU model |
| ramGb | number | System RAM in GB |
| os | string | Operating system |
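As an illustration of the discriminated union, a small helper that assembles the DISCRETE_GPU shape. It treats gpuName and vramGb as required (vramGb appears in the validation-error example earlier, and both appear in the minimal submission); that is an inference from the examples, not a stated schema:

```python
def discrete_gpu(gpu_name: str, vram_gb: float, gpu_count: int = 1, **extra) -> dict:
    """Build a DISCRETE_GPU hardware object. Optional fields (cpu, ramGb,
    os, powerWatts) can be passed via **extra using their exact JSON names."""
    hw = {
        "hwClass": "DISCRETE_GPU",  # the discriminator
        "gpuName": gpu_name,
        "vramGb": vram_gb,
        "gpuCount": gpu_count,
    }
    hw.update(extra)
    return hw
```

Equivalent builders for UNIFIED and CPU_ONLY would dispatch on the same hwClass discriminator.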

Engine Flags Object

Optional. Provide engineFlags to record the exact launch configuration. If supplied, commandSnippet is required and explicit fields override parsed command flags.

{
  "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa",
  "tensorParallel": 1,
  "gpuLayers": 99,
  "kvCacheDtype": "q8_0",
  "flashAttn": true,
  "attentionBackend": "flash_attn",
  "concurrency": 4,
  "temperature": 0.6,
  "topP": 0.95
}
| Flag | Type | Description |
| --- | --- | --- |
| commandSnippet | string | Full launch command (recommended — parsed automatically) |
| tensorParallel | integer | Tensor parallel degree (TP) |
| pipelineParallel | integer | Pipeline parallel degree |
| gpuLayers | integer | Number of layers offloaded to GPU (llama.cpp --n-gpu-layers) |
| splitMode | string | GPU split mode |
| kvCacheDtype | string | KV cache quantization, e.g. "q8_0", "fp8" |
| gpuMemUtil | float 0–1 | GPU memory utilization fraction (vLLM) |
| kvCacheSizeMb | integer | KV cache size in MB |
| prefixCaching | boolean | Whether prefix/prompt caching was enabled |
| attentionBackend | string | e.g. "flash_attn", "xformers", "sdpa" |
| flashAttn | boolean | Flash Attention enabled |
| chunkedPrefill | boolean | Chunked prefill enabled |
| prefillChunkSize | integer | Prefill chunk size |
| contBatching | boolean | Continuous batching enabled |
| cpuOffloadGb | float | GB of weights offloaded to CPU RAM |
| cpuLayers | integer | Number of layers on CPU |
| ropeScaling | string | RoPE scaling method, e.g. "yarn", "linear" |
| ropeScale | float | RoPE scale factor |
| yarnExtFactor | float | YaRN extension factor |
| engineQuant | string | Engine-level quantization override |
| sglangQuant | string | SGLang quantization method |
| maxRunningSeqs | integer | Max running sequences |
| schedulerDelayFactor | float | Scheduler delay factor |
| numParallel | integer | Number of parallel sequences (Ollama) |
| concurrency | integer | Concurrent requests used for throughput runs (vLLM / SGLang) |
| specDecoding | boolean | Speculative decoding enabled |
| specMethod | string | Speculative decoding method, e.g. "Dflash", "EAGLE", "Medusa", "ngram" |
| specModel | string | Draft / decoder model HF ID for speculative decoding |
| specNumTokens | integer | Speculative tokens per step |
| specNgramSize | integer | N-gram size for ngram spec |
| specDraftTp | integer | Draft model tensor parallel |
| mtpEnabled | boolean | Multi-Token Prediction enabled (DeepSeek-style) |
| mtpDraftLayers | integer | Number of MTP draft layers |
| temperature | float 0–2 | Sampling temperature |
| topP | float 0–1 | Top-p nucleus sampling |
| topK | integer | Top-k sampling |
| minP | float 0–1 | Min-p sampling |
| repeatPenalty | float | Repeat penalty |
| mirostat | integer 0–2 | Mirostat mode |
| extraFlags | string | Any additional flags not covered above |
💡 Tip: If you provide commandSnippet, localmaxxing will attempt to parse flags from it automatically. Explicit fields always override parsed values.
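To illustrate the kind of extraction the tip describes, here is a toy parser for a llama.cpp launch command. This is not the server's actual parser, just a sketch of the idea; explicit engineFlags fields win over parsed values either way:

```python
def parse_llamacpp_flags(command: str) -> dict:
    """Extract a few engineFlags-style values from a llama.cpp command line.
    Handles only the flags used in the examples above: -c (context size),
    --n-gpu-layers, and -fa / --flash-attn."""
    tokens = command.split()
    flags = {}
    for i, tok in enumerate(tokens):
        if tok == "-c" and i + 1 < len(tokens):
            flags["contextLength"] = int(tokens[i + 1])
        elif tok == "--n-gpu-layers" and i + 1 < len(tokens):
            flags["gpuLayers"] = int(tokens[i + 1])
        elif tok in ("-fa", "--flash-attn"):
            flags["flashAttn"] = True
    return flags
```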

GET /api/benchmarks

Fetch approved benchmark results. Public endpoint — no auth required.


Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| hfId | string | Filter by model HF ID (includes finetunes of that base model) |
| hwClass | DISCRETE_GPU \| UNIFIED \| CPU_ONLY | Filter by hardware class |
| gpuName | string | Filter by GPU name (exact) |
| chipVendor | string | Filter by chip vendor |
| chipFamily | string | Filter by chip family |
| chipVariant | string | Filter by chip variant |
| kvCacheDtype | string | Filter by KV cache dtype |
| attentionBackend | string | Filter by attention backend |
| gpuLayersMin | integer | Minimum GPU layers (≥) |
| tensorParallelMin | integer | Minimum tensor parallel (≥) |
| specOnly | "true" | Only speculative decoding runs |
| mtpOnly | "true" | Only MTP-enabled runs |
| date | string | Filter to a specific UTC calendar day (YYYY-MM-DD) |
| dateFrom | string | ISO-8601 timestamp — only results after this date |
| dateTo | string | ISO-8601 timestamp — only results before this date |
| userId | string | Filter by internal user ID (overrides verified filter) |
| username | string | Case-insensitive filter by public username (ignored when userId is also provided) |
| verified | "true" \| "false" | Filter by user verification status (overridden by userId / username) |
| limit | integer | Results per page (1–100, default 20) |
| offset | integer | Pagination offset |

Response

{
  "benchmarks": [ { ...benchmarkRun, model: {...}, hardware: {...}, engine: {...}, engineFlags: {...}, user: {...} } ],
  "total": 142,
  "limit": 20,
  "offset": 0
}

Example

curl "https://localmaxxing.com/api/benchmarks?hfId=Qwen/Qwen3-8B&hwClass=DISCRETE_GPU&limit=10"
curl "https://localmaxxing.com/api/benchmarks?username=mason&dateFrom=2026-04-01T00:00:00Z"
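When a result set exceeds one page, the limit/offset scheme above can be walked with a loop. A sketch where fetch_page stands in for the actual HTTP GET of /api/benchmarks and must return the JSON body shown above:

```python
def iter_benchmarks(fetch_page, limit: int = 100):
    """Yield every benchmark by paging through limit/offset.
    fetch_page(limit, offset) is a stand-in for an HTTP call returning
    the {"benchmarks": [...], "total": N, ...} body documented above."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page["benchmarks"]
        offset += limit
        if offset >= page["total"]:
            break
```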

GET /api/leaderboard

Fetch ranked leaderboard data. Public endpoint — no auth required. Results are sorted by tokSOut descending.


Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| hfId | string | Filter to a single model (URL-encoded: "org/model") |
| hwClass | DISCRETE_GPU \| UNIFIED \| CPU_ONLY | Filter by hardware class |
| memTier | string | VRAM tier: "8" \| "12" \| "16" \| "24" \| "32" \| "48" \| "80" \| "96" \| "128" |
| hardwareName | string | Case-insensitive substring match across GPU name (DISCRETE_GPU), chip family/variant/vendor (UNIFIED), and CPU name (CPU_ONLY) |
| engineName | string | Exact engine name |
| quantization | string | Exact quant string |
| paramSize | integer | Model parameter size tier: 1 \| 3 \| 7 \| 13 \| 30 \| 70 \| 110 |
| os | string | Filter by OS: "windows" \| "linux" \| "macos" |
| backend | string | Exact match on engine backend, e.g. "cuda", "rocm", "metal" |
| modelFamily | string | Case-insensitive substring match on model family |
| isMoE | "true" \| "false" | Filter to MoE ("true") or dense ("false") models |
| batchSize | integer | Batch size bucket — 8 matches runs with batchSize ≥ 8 |
| contextLen | integer | Context length bucket filter |
| verified | "true" \| "false" | Filter by user verification status (omit for all) |
| since | "7d" \| "30d" | Time window (omit for all-time) |
| limit | integer | Max rows (default 50, max 200) |
| offset | integer | Pagination offset |
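Since hfId contains a slash, query URLs are easiest to assemble with a proper URL encoder rather than string concatenation. A small helper using the parameter names from the table above:

```python
from urllib.parse import urlencode

def leaderboard_url(**params) -> str:
    """Build a /api/leaderboard query URL. urlencode handles escaping,
    e.g. the "/" inside an hfId becomes %2F."""
    base = "https://localmaxxing.com/api/leaderboard"
    return f"{base}?{urlencode(params)}" if params else base
```

Example: leaderboard_url(hfId="Qwen/Qwen3-8B", memTier="24", verified="true").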

Response

{
  "rows": [ { rank, id, model, hardware, engine, engineFlags, user, tokSOut, ... } ],
  "total": 89,
  "limit": 50,
  "offset": 0
}

GET /api/models

Browse models in the database. Public endpoint — no auth required.


Query Parameters

| Param | Type | Description |
| --- | --- | --- |
| search | string | Search by HF ID, display name, or family (case-insensitive) |
| tree | "true" | Return base models with nested finetunes instead of a flat list |
| limit | integer | Results per page (default 200) |
| offset | integer | Pagination offset |

GET /api/models/search

Resolve fuzzy model names to canonical HuggingFace IDs before submitting. Query with q and optional limit (1–25, default 10). Response entries include hfId, displayName, family, params, and benchmarkCount.

Evals API

Quality evals use approved suites with task-level results and server-side aggregate scoring. Eval runs start as PENDING and appear publicly after admin approval.

| Endpoint | Auth | Description |
| --- | --- | --- |
| GET /api/evals/suites | Public | List approved eval suites. Supports category, runner, official, limit, and offset filters. |
| GET /api/evals/suites/[slug] | Public | Fetch one suite with its full suiteDoc, task keys, scoring, and run config. |
| POST /api/evals/suites | API key or session | Register a new suite. Created suites start pending admin approval. |
| GET /api/evals/runs?modelId=... | Public | Best approved eval run per suite for a model. |
| POST /api/evals/runs/dry-run | API key or session | Validate task coverage and aggregate scoring without writing or consuming the rate limit. |
| POST /api/evals/runs | API key or session | Submit eval results. Rate limited to one eval run per 5 minutes per user. |
| POST /api/evals/execute | API key or session | Execute an approved CUSTOM suite against a public OpenAI-compatible endpoint. |
| GET/POST /api/evals/runs/[id]/react | GET public, POST session | React with one of fire, rocket, 100, brain, or chad. |

API Keys

Manage API keys for programmatic access. All endpoints require session authentication (not API key auth). Maximum of 10 keys per account.

GET /api/keys

List your API keys (key secrets are never returned).

[
  { "id": "...", "name": "My Agent", "prefix": "bhk_1a2b", "createdAt": "...", "lastUsedAt": "...", "expiresAt": null },
  ...
]
POST /api/keys

Create a new API key. The raw key is returned only once — store it immediately.

Request body:

{
  "name": "My Agent Key",
  "expiresAt": "2027-01-01T00:00:00Z"  // optional ISO-8601
}

Response (201):

{
  "id": "...",
  "name": "My Agent Key",
  "prefix": "bhk_1a2b",
  "createdAt": "...",
  "expiresAt": "2027-01-01T00:00:00Z",
  "key": "bhk_1a2b3c4d5e6f7890abcdef1234567890abcdef12"  // SHOWN ONLY ONCE
}
DELETE /api/keys/[id]

Revoke (delete) an API key. Returns { "ok": true } on success.

Saved Setups

Manage saved hardware/engine configurations. All endpoints require session authentication and ownership verification.

GET /api/setups

List your saved setups, ordered by default first, then most recent.

POST /api/setups

Create a new saved setup. If isDefault is true, existing defaults are cleared.

{
  "name": "My RTX 3090 Setup",
  "description": "Standard llama.cpp config",
  "isDefault": true,
  "hwClass": "DISCRETE_GPU",
  "gpuName": "RTX 3090",
  "gpuCount": 1,
  "vramGb": 24,
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "gpuLayers": 99,
  "flashAttn": true,
  "contextLength": 8192
}
GET /api/setups/[id]

Fetch a single saved setup by ID. Used by the submit page to prefill form data.

PATCH /api/setups/[id]

Update a saved setup. All fields are optional — only provided fields are changed. Pass null to clear nullable fields.

DELETE /api/setups/[id]

Delete a saved setup. Returns 204 No Content on success.

Rate Limits

⚠️ POST /api/benchmarks: 1 submission per 5 minutes per user. When rate limited, the response includes retryAfterMs and a Retry-After header.
| Endpoint | Limit | Scope |
| --- | --- | --- |
| POST /api/benchmarks | 1 request / 5 min | Per user |
| GET /api/benchmarks | Generous | Per IP |
| GET /api/leaderboard | Generous | Per IP |
| GET /api/models | Generous | Per IP |
| POST /api/keys | Max 10 keys | Per account |

Common Engine Names

Use consistent casing — these are the accepted values:

| Engine | engineName value |
| --- | --- |
| llama.cpp / llama-server | llama.cpp |
| vLLM | vllm |
| SGLang | sglang |
| Ollama | ollama |
| LM Studio | lmstudio |
| ExLlamaV2 | exllamav2 |
| TGI (Text Generation Inference) | tgi |
| TensorRT-LLM | tensorrt-llm |
| MLX (Apple) | mlx |
| text-generation-webui | text-generation-webui |
| LMDeploy | lmdeploy |
| MLC-LLM | mlc-llm |
| HipFire | hipfire |
| tinygrad | tinygrad |

Benchmark Methodology

For reproducible, trustworthy results:

  • Warm up the model with 1–2 throwaway runs before recording
  • Use a fixed prompt for comparability — a 512-token system prompt + 32-token user message is a good baseline
  • Record steady-state throughput — not the first token burst
  • Measure tokSPrefill as prompt/input processing throughput before generation; this helps estimate wait time for long prompts
  • Measure peakVramGb via nvidia-smi or equivalent at peak load
  • Report batchSize accurately — batch=1 and batch=8 are not comparable
  • Set temperature: 0 (greedy decode) for deterministic throughput benchmarks unless testing sampling overhead
  • Include commandSnippet — the exact launch command is the most useful thing for reproducibility
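The submitted metrics can be derived from three wall-clock timestamps. A sketch using one common convention (generation throughput measured over the window from first token to last; conventions vary, so describe yours in notes):

```python
def derive_metrics(t_start: float, t_first_token: float, t_end: float,
                   prompt_tokens: int, output_tokens: int) -> dict:
    """Derive submission metrics from timestamps in seconds.
    tokSOut covers only the generation phase, per the steady-state
    advice above; tokSTotal spans the whole request."""
    gen_seconds = t_end - t_first_token
    total_seconds = t_end - t_start
    return {
        "ttftMs": (t_first_token - t_start) * 1000,
        "tokSOut": output_tokens / gen_seconds,
        "tokSTotal": (prompt_tokens + output_tokens) / total_seconds,
    }
```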

Field Constraints

| Field | Max length / Range |
| --- | --- |
| hfId | 256 chars |
| modelRevision | 128 chars |
| engineName | 64 chars |
| engineVersion | 64 chars |
| quantization | 64 chars |
| backend | 64 chars |
| notes | 2000 chars |
| commandSnippet | 4000 chars |
| extraFlags | 1000 chars |
| contextLength | integer ≥ 1 |
| batchSize | integer ≥ 1 |
| prefillTokens | integer ≥ 0 |
| tokSPrefill | positive number |
| gpuMemUtil | 0.0 – 1.0 |
| temperature | 0.0 – 100.0 |
| topP / minP | 0.0 – 1.0 |
| mirostat | 0, 1, or 2 |
| gpuCount | integer ≥ 1 |
| vramGb | positive number |
| promptTokens | integer ≥ 0 |
| outputTokens | integer ≥ 0 |

Examples

Minimal Submission

Smallest valid request body:

{
  "hfId": "Qwen/Qwen3-8B",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "vramGb": 24
  },
  "engineName": "llama.cpp",
  "quantization": "Q4_K_M",
  "tokSOut": 87.4,
  "tokSPrefill": 1210.5,
  "ttftMs": 210
}

Full Submission

Complete request with all optional fields:

{
  "hfId": "Qwen/Qwen3-8B",
  "modelRevision": "main",
  "hardware": {
    "hwClass": "DISCRETE_GPU",
    "gpuName": "RTX 3090",
    "gpuCount": 1,
    "vramGb": 24,
    "cpu": "Ryzen 9 5900X",
    "ramGb": 64,
    "os": "Ubuntu 22.04"
  },
  "engineName": "llama.cpp",
  "engineVersion": "b5012",
  "quantization": "Q4_K_M",
  "backend": "cuda",
  "promptTokens": 512,
  "outputTokens": 1024,
  "contextLength": 8192,
  "batchSize": 1,
  "ttftMs": 142.5,
  "tokSOut": 87.4,
  "tokSPrefill": 1210.5,
  "tokSTotal": 74.1,
  "peakVramGb": 6.2,
  "notes": "Automated benchmark via agent. Thermal throttling observed after 10 min.",
  "engineFlags": {
    "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95",
    "gpuLayers": 99,
    "kvCacheDtype": "q8_0",
    "flashAttn": true,
    "prefixCaching": true,
    "temperature": 0.6,
    "topP": 0.95
  }
}

cURL — Submit Benchmark

curl -X POST https://localmaxxing.com/api/benchmarks \
  -H "Authorization: Bearer bhk_YOUR_API_KEY_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "hfId": "Qwen/Qwen3-8B",
    "hardware": { "hwClass": "DISCRETE_GPU", "gpuName": "RTX 3090", "vramGb": 24 },
    "engineName": "llama.cpp",
    "quantization": "Q4_K_M",
    "tokSOut": 87.4,
    "tokSPrefill": 1210.5,
    "tokSTotal": 74.1,
    "ttftMs": 142.5,
    "contextLength": 8192,
    "engineFlags": {
      "commandSnippet": "llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa"
    }
  }'

cURL — Query Leaderboard

curl "https://localmaxxing.com/api/leaderboard?hwClass=DISCRETE_GPU&memTier=24&verified=true&limit=10"
curl "https://localmaxxing.com/api/leaderboard?hardwareName=RTX+4090&modelFamily=qwen&isMoE=false"

cURL — Filter Benchmarks by User & Date

curl "https://localmaxxing.com/api/benchmarks?username=mason&dateFrom=2026-04-01T00:00:00Z&dateTo=2026-04-30T23:59:59Z"