Test every AI model
against your real use case.
Define your test once — prompts, voices, audio files — and run against every model in parallel. Compare side by side with latency, accuracy, cost and an automatic ranking.
Benchmarks test generic data.
Your product isn't generic.
The model that tops MMLU or HumanEval may be the worst choice for classifying your support tickets, transcribing your audio, or reading your domain-specific documents.
- Generic datasets, generic prompts
- No cost data for your volume
- No latency for your region
- Can't test your system prompt
- No audio or voice evaluation
- Your prompts, your scenarios, your data
- Real cost with your API keys, no markup
- Latency measured on actual runs
- Test any system prompt before shipping
- LLM, STT and TTS — same platform
One setup.
Every model.
Three steps that keep your tests structured, repeatable, and shareable — for LLM, STT and TTS benchmarks.
Your test setup.
Once. Reuse everywhere.
Define prompts, test data, expected outputs, audio files or voice settings in one structured Scenario — then run it across any arena. Or describe what you need and let AI generate test items for you.
Customer support messages in English — mix frustrated, neutral and polite tones. Different issues: late delivery, wrong item, refund…
Battery lasts two days. Bluetooth drops in the kitchen.
expected: mixed · battery good · BT issuesGreat build quality. App sync is unreliable on Android.
expected: positive hardware · app frictionComfortable fit. Noise cancelling works well on flights.
expected: comfort + ANC praisedPick N models.
Run in parallel.
Connect providers from OpenAI, Anthropic, Google, Deepgram, ElevenLabs and more. Select multiple models per provider and run them all at once.
Decide with data.
Not gut feeling.
Every run gets a composite score from accuracy, speed and cost. Favorite outputs, copy everything, or export a report on paid plans.
Reviews are split. Strong consensus on durability…
The product has mixed reviews. Users praise…
Positive sentiment on hardware quality…
See who's winning.
In the real world.
Live rankings from anonymized arena runs across x-n.dev — LLM, STT and TTS. Not vendor benchmarks. Recomputed daily from real evaluations.
- Aggregated from real arena evaluations, not synthetic benchmarks
- Separate rankings for LLM, STT and TTS providers
- Updated daily — watch models move as more runs come in
Built for the moment
before you commit to a model.
From voice AI builders to procurement — whoever owns the model decision uses x-n.dev as their testing bench.
- Voice AI buildersTTS, STT and LLM are the core of your product. Compare every provider on your real calls — latency, accuracy and cost — and pick the best fit for each task.
- Product engineers & developersYou own the model decision. Benchmark cost, latency and quality across providers on your actual prompts — before it ships to production.
- AI agencies & consultanciesRun multi-client evaluations with each client’s own keys and scenarios. Deliver vendor-neutral recommendations backed by data and shareable reports.
- Procurement & AI opsEnterprise, compliance-driven model selection. Full run history, exports and an objective score — accuracy, latency, cost and ROI — to justify every choice.
Most tools compare LLMs.
What about when your product speaks or listens?
x-n.dev is one of the few platforms that evaluates voice AI providers with the same structure and rigor as language models.
Your model.
Your rules. Your data.
Connect any OpenAI-compatible endpoint — Ollama, vLLM, LM Studio, llama.cpp — and benchmark self-hosted models against cloud APIs in the same arena.
Base URL
Must be publicly reachable. Use ngrok or Cloudflare Tunnel for local models.
API Key (optional)
The right tool for
the right job.
Different tools solve different problems. Here's where x-n.dev fits.
~ = partially supported or requires significant setup
Pick a plan.
Pay for what you run.
Bring your own API keys — we never touch your provider bill. Pick the plan that fits your team and top up runs when you need more.
- Up to 3 active scenarios
- Up to 3 active arenas
- 1 simultaneous arena
- 1 user
- All providers — LLM, STT & TTS
- Bring your own API keys
- Audio retained for 7 days
- Latency, cost & token metrics
- Multi-org workspaces
- Up to 20 active scenarios
- Up to 15 active arenas
- Up to 3 simultaneous arenas
- Up to 3 users
- Parallel execution — all models run simultaneously
- Side-by-side comparison
- Audio retained for 30 days
- Results export
- Unlimited history
- All providers — LLM, STT & TTS
- Bring your own API keys
- Multi-org workspaces
- Up to 50 active scenarios
- Up to 30 active arenas
- Up to 5 simultaneous arenas
- Up to 10 users
- Parallel execution — all models run simultaneously
- Side-by-side comparison
- Audio retained for 90 days
- Results export
- Unlimited history
- All providers — LLM, STT & TTS
- Bring your own API keys
- Multi-org workspaces
For Enterprise without BYOK — Contact us
AI agencies / consultancies — Contact us
Common questions.
Free to start
Start comparing.
Stop guessing.
Bring your own API keys and run your first arena in minutes.