{"_meta":{"description":"Agent bootstrap context for localmaxxing benchmark and eval submission.","docsUrl":"https://localmaxxing.com/docs/agent-skill","submitEndpoint":"POST https://localmaxxing.com/api/benchmarks","dryRunEndpoint":"POST https://localmaxxing.com/api/benchmarks/dry-run","evalSubmitEndpoint":"POST https://localmaxxing.com/api/evals/runs","evalDryRunEndpoint":"POST https://localmaxxing.com/api/evals/runs/dry-run","evalExecuteEndpoint":"POST https://localmaxxing.com/api/evals/execute","evalSuitesEndpoint":"GET https://localmaxxing.com/api/evals/suites","modelSearchEndpoint":"GET https://localmaxxing.com/api/models/search?q=<query>","auth":"Authorization: Bearer bhk_<40 hex chars>","rateLimit":"1 submission per 5 minutes per API key (applies separately to benchmarks and evals)","rateLimitHeaders":{"description":"Returned on every successful 201 response so agents can self-throttle.","X-RateLimit-Remaining":"Submissions remaining in the current 5-minute window (0 or 1).","X-RateLimit-Reset":"Unix timestamp (seconds) when the window resets."}},"requiredFields":{"hfId":{"type":"string","description":"HuggingFace model ID","example":"Qwen/Qwen3-8B","maxLength":256,"tip":"Use GET /api/models/search?q=<fuzzy name> to resolve a human-readable name to the canonical hfId before submitting."},"hardware":{"type":"object","description":"Hardware config — discriminated union on hwClass. 
See hardwareSchemas."},"engineName":{"type":"string","description":"Inference engine used","maxLength":64,"acceptedValues":["llama.cpp","vllm","sglang","ollama","lmstudio","exllamav2","tgi","tensorrt-llm","mlx","text-generation-webui","lmdeploy","mlc-llm","hipfire","tinygrad"]},"quantization":{"type":"string","description":"Quantization format of the model weights","maxLength":64,"commonValues":["Q4_K_M","Q5_K_M","Q6_K","Q8_0","IQ4_NL","IQ3_M","AWQ","GPTQ","fp8","fp16","bf16","int4","int8","INT4","NVFP4","bnb-nf4"]},"atLeastOneMetricRequired":{"description":"tokSOut is required for real submissions. At least one secondary metric (ttftMs, tokSPrefill, tokSTotal, or peakVramGb) must also be provided.","fields":{"tokSOut":{"type":"number","required":true,"description":"Output tokens per second"},"tokSPrefill":{"type":"number","description":"Prefill/input tokens per second before generation"},"tokSTotal":{"type":"number","description":"Total tokens per second (prompt + output)"},"ttftMs":{"type":"number","description":"Time to first token in milliseconds"},"peakVramGb":{"type":"number","description":"Peak VRAM usage in GB"}}}},"optionalFields":{"modelRevision":{"type":"string","default":"main","description":"Git revision / branch / commit SHA","maxLength":128},"engineVersion":{"type":"string","description":"Engine version string","example":"b5012","maxLength":64},"backend":{"type":"string","description":"Backend / device variant","maxLength":64,"commonValues":["cuda","rocm","metal","vulkan","cpu","openvino"]},"promptTokens":{"type":"integer","default":0,"min":0},"outputTokens":{"type":"integer","default":0,"min":0},"contextLength":{"type":"integer","default":2048,"min":1},"batchSize":{"type":"integer","default":1,"min":1},"prefillTokens":{"type":"integer","default":0,"min":0,"description":"Tokens already in KV cache before generation started (0 = cold start)."},"tokSPrefill":{"type":"number","description":"Prefill/input tokens per second before 
generation"},"peakVramGb":{"type":"number","description":"Peak VRAM usage in GB"},"notes":{"type":"string","maxLength":2000},"engineFlags":{"type":"object","description":"Detailed engine launch configuration. See engineFlagsSchema."}},"hardwareSchemas":{"DISCRETE_GPU":{"description":"NVIDIA / AMD / Intel discrete GPU rig. For a single-GPU rig use gpuName + vramGb. For a mixed-GPU rig (e.g. 1× RTX 3090 + 1× RTX 4090) use the gpus array instead.","required":{"hwClass":"DISCRETE_GPU"},"requiredEither":{"description":"Provide EITHER the flat fields OR the gpus array (not both).","flatFields":{"gpuName":"string","vramGb":"number"},"gpusArray":{"type":"array","items":{"gpuName":"string","count":"integer (default 1)","vramGb":"number"},"maxItems":16,"note":"When the gpus array contains more than one distinct GPU name, the profile is flagged isHeterogeneousGpu=true. gpuName on the parent is set to \"Multi-GPU\", vramGb is the total across all slots, gpuCount is the total card count."}},"optional":{"gpuCount":{"type":"integer","default":1,"note":"Ignored when gpus array is supplied — derived automatically."},"cpu":"string","ramGb":"number","os":"string","powerWatts":"number"},"examples":{"singleGpu":{"hwClass":"DISCRETE_GPU","gpuName":"RTX 3090","gpuCount":1,"vramGb":24,"cpu":"Ryzen 9 5900X","ramGb":64,"os":"Ubuntu 22.04"},"heterogeneousMultiGpu":{"hwClass":"DISCRETE_GPU","gpus":[{"gpuName":"RTX 3090","count":1,"vramGb":24},{"gpuName":"RTX 4090","count":1,"vramGb":24}],"cpu":"Threadripper PRO 5965WX","ramGb":128,"os":"Ubuntu 22.04"}}},"UNIFIED":{"description":"Apple Silicon / AMD APU / Intel Arc (shared memory)","required":{"hwClass":"UNIFIED","chipVendor":"string","chipFamily":"string","chipVariant":"string","unifiedMemoryGb":"number"},"optional":{"npuTops":"number","cpu":"string","os":"string","powerWatts":"number"},"example":{"hwClass":"UNIFIED","chipVendor":"Apple","chipFamily":"M4","chipVariant":"M4 Pro","unifiedMemoryGb":48,"os":"macOS 15.4"}},"CPU_ONLY":{"description":"CPU-only
inference","required":{"hwClass":"CPU_ONLY","cpu":"string","ramGb":"number"},"optional":{"os":"string"},"example":{"hwClass":"CPU_ONLY","cpu":"Intel Xeon W9-3595X","ramGb":512,"os":"Ubuntu 24.04"}}},"engineFlagsSchema":{"description":"If engineFlags is supplied, commandSnippet is required. localmaxxing parses flags automatically; explicit fields always override parsed values.","fields":{"commandSnippet":{"type":"string","maxLength":4000,"recommended":true,"description":"Full engine launch command — most important field for reproducibility."},"tensorParallel":{"type":"integer","min":1,"description":"Tensor parallelism degree (--tensor-parallel-size / -tp)."},"pipelineParallel":{"type":"integer","min":1,"description":"Pipeline parallelism degree."},"concurrency":{"type":"integer","min":1,"description":"Number of concurrent request slots / max concurrency (--concurrency, --num-parallel, etc.)."},"numParallel":{"type":"integer","min":1,"description":"Alias for concurrency used by some engines (llama.cpp --parallel)."},"gpuLayers":{"type":"integer","min":0,"description":"Number of layers offloaded to GPU (-ngl / --n-gpu-layers)."},"cpuLayers":{"type":"integer","min":0,"description":"Number of layers kept on CPU (not offloaded to GPU)."},"cpuOffloadGb":{"type":"float","min":0,"description":"GB of model weights offloaded to CPU RAM."},"splitMode":{"type":"string","maxLength":64,"description":"Multi-GPU split mode (e.g.
row, layer, none)."},"gpuMemUtil":{"type":"float","min":0,"max":1,"description":"GPU memory utilisation fraction (vLLM --gpu-memory-utilization)."},"kvCacheDtype":{"type":"string","maxLength":64,"commonValues":["q8_0","q4_0","fp8","fp16","auto"]},"kvCacheSizeMb":{"type":"integer","min":1},"prefixCaching":{"type":"boolean"},"attentionBackend":{"type":"string","maxLength":64,"commonValues":["flash_attn","xformers","sdpa","triton"]},"flashAttn":{"type":"boolean"},"chunkedPrefill":{"type":"boolean"},"prefillChunkSize":{"type":"integer","min":1},"contBatching":{"type":"boolean"},"maxRunningSeqs":{"type":"integer","min":1,"description":"Max sequences processed simultaneously (SGLang --max-running-requests, vLLM --max-num-seqs)."},"schedulerDelayFactor":{"type":"float","min":0,"description":"SGLang scheduler delay factor."},"ropeScaling":{"type":"string","maxLength":64,"commonValues":["yarn","linear","dynamic"]},"ropeScale":{"type":"float","min":0},"yarnExtFactor":{"type":"float","description":"YaRN extrapolation factor (--yarn-ext-factor)."},"engineQuant":{"type":"string","maxLength":64,"description":"Engine-level quant override (e.g. llama.cpp --override-kv / exllamav2 quant string)."},"sglangQuant":{"type":"string","maxLength":64,"description":"SGLang --quantization value (e.g. fp8, awq, gptq)."},"specDecoding":{"type":"boolean","description":"Whether speculative decoding is enabled."},"specMethod":{"type":"string","maxLength":64,"description":"Speculative decoding method (e.g. 
ngram, draft_model, medusa)."},"specModel":{"type":"string","maxLength":256,"description":"Draft model HF ID for model-based speculative decoding."},"specDraftModel":{"type":"string","maxLength":256,"description":"Alias for specModel used by some engines (--speculative-model)."},"specNumTokens":{"type":"integer","min":1,"description":"Number of speculative tokens per step."},"specNgramSize":{"type":"integer","min":1,"description":"N-gram size for ngram-based speculative decoding."},"specDraftTp":{"type":"integer","min":1,"description":"Tensor parallelism degree for the draft model."},"mtpEnabled":{"type":"boolean","description":"Multi-Token Prediction enabled (DeepSeek-style MTP)."},"mtpDraftLayers":{"type":"integer","min":1,"description":"Number of MTP draft layers."},"temperature":{"type":"float","min":0,"max":100,"note":"Use 0 for greedy / deterministic throughput benchmarks."},"topP":{"type":"float","min":0,"max":1},"topK":{"type":"integer","min":0},"minP":{"type":"float","min":0,"max":1},"repeatPenalty":{"type":"float","min":0},"mirostat":{"type":"integer","acceptedValues":[0,1,2]},"extraFlags":{"type":"string","maxLength":1000,"description":"Any additional flags not captured above, as a raw string."}}},"commandParsing":{"description":"When commandSnippet is provided, localmaxxing auto-extracts engineFlags from it. Explicit engineFlags fields always override parsed values. 
Providing both commandSnippet + explicit fields is the most reliable approach.","flagToField":{"-ngl / --n-gpu-layers":"gpuLayers (integer)","-tp / --tensor-parallel-size":"tensorParallel (integer)","--pipeline-parallel-size":"pipelineParallel (integer)","--gpu-memory-utilization":"gpuMemUtil (float 0–1)","--quantization (sglang)":"sglangQuant (string)","--kv-cache-dtype":"kvCacheDtype (string)","--enable-prefix-caching / --pc":"prefixCaching (true)","-fa / --flash-attn":"flashAttn (true)","--enable-chunked-prefill":"chunkedPrefill (true)","--max-running-requests / --max-num-seqs":"maxRunningSeqs (integer)","--parallel / -np (llama.cpp)":"numParallel (integer)","--concurrency":"concurrency (integer)","--speculative-model / --draft-model":"specDraftModel (string)","--num-speculative-tokens":"specNumTokens (integer)","--speculative-draft-tensor-parallel-size":"specDraftTp (integer)","--rope-scaling":"ropeScaling (string)","--rope-scale":"ropeScale (float)","--yarn-ext-factor":"yarnExtFactor (float)","--cpu-offload-gb":"cpuOffloadGb (float)","--split-mode":"splitMode (string)","--temp / --temperature":"temperature (float)","--top-p":"topP (float)","--top-k":"topK (integer)","--min-p":"minP (float)","--repeat-penalty":"repeatPenalty (float)","--mirostat":"mirostat (0|1|2)","--scheduler-delay-factor":"schedulerDelayFactor (float)","--mtp-num-draft-tokens (vllm)":"mtpEnabled=true + mtpDraftLayers (integer)"},"tips":["Always include the full launch command as commandSnippet — it is the single most useful field for reproducibility.","For llama.cpp, -ngl 99 means all layers on GPU; set gpuLayers: 99 explicitly too.","For vLLM, --tensor-parallel-size maps to tensorParallel; --gpu-memory-utilization maps to gpuMemUtil.","For SGLang, --quantization maps to sglangQuant (not the top-level quantization field).","If using speculative decoding, set specDecoding: true and provide either specModel or specDraftModel.","MTP (DeepSeek-style) runs: set mtpEnabled: true and mtpDraftLayers 
to the number of draft layers."]},"fieldConstraints":{"hfId":{"maxLength":256},"engineName":{"maxLength":64},"engineVersion":{"maxLength":64},"quantization":{"maxLength":64},"backend":{"maxLength":64},"modelRevision":{"maxLength":128},"notes":{"maxLength":2000},"commandSnippet":{"maxLength":4000},"extraFlags":{"maxLength":1000},"contextLength":{"min":1},"batchSize":{"min":1},"gpuMemUtil":{"min":0,"max":1},"temperature":{"min":0,"max":100},"topP":{"min":0,"max":1},"minP":{"min":0,"max":1},"mirostat":{"acceptedValues":[0,1,2]},"concurrency":{"min":1},"tensorParallel":{"min":1},"specNumTokens":{"min":1},"specDraftTp":{"min":1},"maxRunningSeqs":{"min":1}},"dryRun":{"description":"POST /api/benchmarks/dry-run accepts the exact same body as POST /api/benchmarks. It runs the full Zod schema validation and returns either { valid: true, parsed: {...} } or a 400 with the same error shape as a real submission. No DB writes occur. Does not consume the rate limit. Use this to verify your payload before committing a real submission.","endpoint":"POST https://localmaxxing.com/api/benchmarks/dry-run","auth":"Authorization: Bearer bhk_... (required — same as real submission)","successResponse":{"valid":true,"parsed":"/* normalised payload that would be written */"},"errorResponse":{"error":"Validation failed","details":"/* zod flatten() output */"}},"modelSearch":{"description":"GET /api/models/search?q=<query> resolves fuzzy model names to canonical hfIds. Searches hfId, displayName, and family fields. Returns up to 10 matches. Use this when you have a human-readable model name and need to find the right hfId before submitting.","endpoint":"GET https://localmaxxing.com/api/models/search?q=<query>","auth":"None required","queryParams":{"q":"Search string (required). 
Matched case-insensitively against hfId, displayName, family.","limit":"Max results (default 10, max 25)."},"responseFields":{"hfId":"Canonical HuggingFace model ID — use this as the hfId field in submissions.","displayName":"Human-readable model name.","family":"Model family (e.g. Qwen, Llama, Mistral).","params":"Parameter count in billions (may be null).","benchmarkCount":"Number of existing benchmark runs for this model."},"example":"GET /api/models/search?q=qwen3+8b → [{ hfId: \"Qwen/Qwen3-8B\", ... }]"},"reactions":{"description":"Users can react to benchmark runs with a single emoji. Reaction counts are returned in leaderboard and benchmark list responses.","leaderboardResponseFields":{"reactionCounts":{"type":"object","description":"Map of emoji → count for this run.","example":{"👍":5}},"myEmoji":{"type":"string | null","description":"Authenticated user's current reaction emoji on this run, or null if none."}},"reactEndpoint":{"method":"POST","path":"/api/runs/:id/react","auth":"Session cookie or Bearer API key","body":{"emoji":{"type":"string | null","description":"Emoji to react with (e.g. \"👍\"), or null to remove your reaction."}},"behavior":"Upserts one reaction per user per run. 
Sending null removes the reaction."}},"minimalExample":{"hfId":"Qwen/Qwen3-8B","hardware":{"hwClass":"DISCRETE_GPU","gpuName":"RTX 3090","vramGb":24},"engineName":"llama.cpp","quantization":"Q4_K_M","tokSOut":87.4,"tokSPrefill":1210.5,"ttftMs":210},"fullExample":{"hfId":"Qwen/Qwen3-8B","modelRevision":"main","hardware":{"hwClass":"DISCRETE_GPU","gpuName":"RTX 3090","gpuCount":1,"vramGb":24,"cpu":"Ryzen 9 5900X","ramGb":64,"os":"Ubuntu 22.04"},"engineName":"llama.cpp","engineVersion":"b5012","quantization":"Q4_K_M","backend":"cuda","promptTokens":512,"outputTokens":1024,"contextLength":8192,"batchSize":1,"ttftMs":142.5,"tokSOut":87.4,"tokSPrefill":1210.5,"tokSTotal":74.1,"peakVramGb":6.2,"notes":"Automated benchmark via agent.","engineFlags":{"commandSnippet":"llama-server -m Qwen3-8B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 -fa --temp 0.6 --top-p 0.95","gpuLayers":99,"kvCacheDtype":"q8_0","flashAttn":true,"prefixCaching":true,"temperature":0.6,"topP":0.95}},"vllmSpecDecodingExample":{"hfId":"deepseek-ai/DeepSeek-V3","hardware":{"hwClass":"DISCRETE_GPU","gpus":[{"gpuName":"H100 SXM","count":8,"vramGb":80}],"cpu":"AMD EPYC 9654","ramGb":768,"os":"Ubuntu 22.04"},"engineName":"vllm","engineVersion":"0.8.4","quantization":"fp8","backend":"cuda","contextLength":32768,"batchSize":1,"tokSOut":312.7,"tokSPrefill":2850.3,"tokSTotal":287.4,"engineFlags":{"commandSnippet":"vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --speculative-model deepseek-ai/DeepSeek-V3-Draft --num-speculative-tokens 5 --gpu-memory-utilization 0.92","tensorParallel":8,"gpuMemUtil":0.92,"specDecoding":true,"specMethod":"draft_model","specDraftModel":"deepseek-ai/DeepSeek-V3-Draft","specNumTokens":5,"specDraftTp":8,"kvCacheDtype":"fp8","prefixCaching":true,"maxRunningSeqs":256}},"sglangExample":{"hfId":"meta-llama/Llama-3.1-8B-Instruct","hardware":{"hwClass":"DISCRETE_GPU","gpuName":"RTX 4090","gpuCount":1,"vramGb":24,"os":"Ubuntu 
24.04"},"engineName":"sglang","engineVersion":"0.4.6","quantization":"fp8","backend":"cuda","contextLength":8192,"batchSize":1,"tokSOut":198.4,"engineFlags":{"commandSnippet":"python -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --tp 1 --quantization fp8 --max-running-requests 128 --enable-flashinfer-sampling","tensorParallel":1,"sglangQuant":"fp8","attentionBackend":"flash_attn","maxRunningSeqs":128,"concurrency":128}},"heterogeneousGpuExample":{"hfId":"meta-llama/Llama-3.1-70B","hardware":{"hwClass":"DISCRETE_GPU","gpus":[{"gpuName":"RTX 3090","count":1,"vramGb":24},{"gpuName":"RTX 4090","count":1,"vramGb":24}],"cpu":"Threadripper PRO 5965WX","ramGb":128,"os":"Ubuntu 22.04"},"engineName":"llama.cpp","quantization":"Q4_K_M","tokSOut":18.3,"engineFlags":{"commandSnippet":"llama-server -m Llama-3.1-70B-Q4_K_M.gguf -c 8192 --n-gpu-layers 99 --tensor-split 1,1","gpuLayers":99,"tensorParallel":2}},"appleSiliconExample":{"hfId":"meta-llama/Llama-3.2-3B-Instruct","hardware":{"hwClass":"UNIFIED","chipVendor":"Apple","chipFamily":"M4","chipVariant":"M4 Pro","unifiedMemoryGb":48,"os":"macOS 15.4"},"engineName":"mlx","quantization":"int4","tokSOut":143.2,"engineFlags":{"commandSnippet":"mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit"}},"evalSubmission":{"description":"POST /api/evals/runs submits evaluation results for an approved eval suite. Fetch the suite doc first via GET /api/evals/suites/:slug to see task keys and scoring rules. Submitted eval runs start PENDING and appear publicly after admin approval.","endpoint":"POST https://localmaxxing.com/api/evals/runs","auth":"Authorization: Bearer bhk_... (required)","requiredFields":{"suiteSlug":{"type":"string","description":"Slug of an APPROVED eval suite (e.g. 
\"hellaswag\", \"mmlu\")."},"hfId":{"type":"string","description":"HuggingFace model ID."},"hardware":{"type":"object","description":"Same hardwareSchemas as benchmark submission."},"results":{"type":"object","description":"Map of every suite taskKey -> score number or { score, nShots, nSamples }. Unknown keys and missing suite task keys are rejected."}},"optionalFields":{"modelRevision":{"type":"string","default":"main"},"quantization":{"type":"string","description":"Quantization format used for the eval run."},"quantFormat":{"type":"string","description":"Quant format string (e.g. GPTQ, AWQ)."},"runnerVersion":{"type":"string","description":"Version of the eval runner (e.g. lm-eval-harness v0.4.4)."},"runConfig":{"type":"object","description":"Arbitrary JSON with runner-specific config (temperature, seed, etc.)."},"notes":{"type":"string","maxLength":2000}},"example":{"suiteSlug":"hellaswag","hfId":"Qwen/Qwen3-8B","hardware":{"hwClass":"DISCRETE_GPU","gpuName":"RTX 3090","vramGb":24},"quantization":"Q4_K_M","results":{"hellaswag":0.72,"hellaswag_std":0.68}},"workflow":["1. GET /api/evals/suites to list approved suites.","2. GET /api/evals/suites/:slug to fetch the suite doc (tasks, scoringMethod, aggregation).","3. Run the evaluation on your rig using the suite doc as the spec.","4. POST /api/evals/runs/dry-run first to validate schema, approved suite existence, full task coverage, and aggregate scoring without consuming the rate limit.","5. POST /api/evals/runs with the results map.","6. The server validates task keys, computes the aggregate score, creates a PENDING run, and returns it."],"execute":{"endpoint":"POST https://localmaxxing.com/api/evals/execute","description":"Server-side execution for approved CUSTOM suites against a public OpenAI-compatible model server. 
baseUrl must be public http/https; localhost/private/internal addresses are rejected.","supportedScoringMethods":["exact_match","f1","llm_judge"],"autoSubmit":"When true, hardware is required and a PENDING eval run is created."}},"methodologyTips":["Warm up with 1-2 throwaway runs before recording metrics.","Use a fixed prompt for comparability (e.g. 512-token system + 32-token user).","Record steady-state throughput, not the first-token burst.","Measure tokSPrefill as prompt/input processing throughput before generation; it helps estimate wait time for long prompts.","Measure peakVramGb via nvidia-smi or equivalent at peak load.","Set temperature: 0 (greedy) for deterministic throughput benchmarks.","Always include commandSnippet — it is the most useful field for reproducibility.","For heterogeneous multi-GPU rigs, use the gpus array instead of flat gpuName/vramGb fields.","For speculative decoding runs, set specDecoding: true and include specModel or specDraftModel.","concurrency and numParallel refer to the same concept — use whichever matches your engine flag.","Use POST /api/benchmarks/dry-run first to validate your payload without consuming the rate limit.","Use GET /api/models/search?q=<name> to resolve fuzzy model names to canonical hfIds before submitting.","Check X-RateLimit-Remaining and X-RateLimit-Reset headers on 201 responses to self-throttle."]}