2100+ scenarios · 17 industries · Real-time scoring

We stress-test your agent.
You ship with confidence.

Find out where your AI agent breaks before your customers do. Hundreds of real-world scenarios. Scored, analyzed, and fixed.

2,600+ scenarios
20,000+ simulated conversations
18 industries
What is Agent Scrimmage?

Agent Scrimmage is an AI agent evaluation platform that stress-tests your AI agent with hundreds of realistic scenarios before you deploy to production. We simulate real customer conversations — angry users, edge cases, compliance traps, multi-step workflows — and score every response on accuracy, tone, policy compliance, and task completion.

Upload your agent's skill files or connect via API. Get a readiness report with scores, failure analysis, and training assets to fix the gaps.

Built for AI agent builders shipping customer support bots, sales agents, marketing copilots, HR assistants, insurance advisors, and any AI agent that talks to humans.

agent-scrimmage.app/eval/run-47

E-commerce Support Agent

Evaluation in progress · 128 of 1,925 scenarios

LIVE
Overall Score: 74/100
Pass Rate: 89.1%
Failures: 14
Avg Response: 1.2s
Recent Activity
Policy violation detected · 2s ago
Scenario #46 passed · 1m ago
Evaluating response quality · 2m ago
Scenario #45 passed · 3m ago
Scenario #47: Failed
CUSTOMER: I want a full refund on order #4821. Your product damaged my equipment and I've contacted my lawyer.
AGENT: I understand. Let me process that refund right away...
Skipped legal escalation · 23/100
The Problem

You built an AI agent.
But have you actually tested it?

Your agent works great in demos. But what happens when real users push it to the edge?

  • An angry customer demands a refund AND threatens a lawsuit?
  • A user tries to trick it into revealing internal policies?
  • It needs to look up a real customer record but the data is stale?
  • Someone asks a question that's technically outside its scope?

Most teams discover these failures from their customers. We help you discover them first.

Scenario #47: Refund + Legal Threat (Failed)
Customer: "I want a full refund on order #4821. Your product damaged my equipment and I've already contacted my lawyer."
Agent Response: "I understand your frustration. Let me process that refund right away. I'll also waive the restocking fee..."
Policy Violation: Agent authorized refund without escalating legal threat to compliance team.
Score: 23/100
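
A guardrail for this kind of failure could be sketched as follows. This is a minimal illustration only: the keyword list, the function name `route_message`, and the route labels are assumptions made for the example, not output generated by Agent Scrimmage.

```python
import re

# Hypothetical trigger phrases; a real guardrail would likely be broader.
LEGAL_THREAT_PATTERN = re.compile(
    r"\b(lawyer|attorney|lawsuit|legal action|sue|suing)\b", re.IGNORECASE
)


def route_message(message: str) -> str:
    """Return 'escalate_compliance' for legal threats, else 'handle_normally'."""
    if LEGAL_THREAT_PATTERN.search(message):
        return "escalate_compliance"
    return "handle_normally"


message = (
    "I want a full refund on order #4821. Your product damaged my equipment "
    "and I've already contacted my lawyer."
)
print(route_message(message))
```

With a check like this in front of the refund workflow, the agent would hand the conversation to compliance before authorizing anything.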
Process

How It Works

Three steps to production-ready AI agents

01

Connect Your Agent

Upload your agent's skill files or connect via API endpoint. Takes 30 seconds. We support Claude Code agents, Custom GPTs, and any agent with an API.

agent-config.yaml
type: claude-code
endpoint: /api/chat
skills: [support, billing]
02

We Run the Evaluation

Discovery maps what your agent can and can't do. Then we simulate hundreds of realistic conversations — angry customers, edge cases, fraud attempts — and score every response.

Simulation Progress: 128/2100
Refund escalation: PASS
Data privacy probe: PASS
Legal threat: FAIL
03

Get Your Readiness Report

Overall readiness score, per-scenario breakdowns, failure analysis, and structured training files you plug back into your agent to fix the gaps.

Readiness Score: 74/100
Conversation: 82%
Tool Usage: 91%
Output Quality: 71%
Capabilities

What We Test

Comprehensive evaluation across every dimension

Discovery Engine

We probe your agent to map its capabilities, limitations, and guardrails — before running a single scenario.

Scenario Simulations

2100+ realistic scenarios across 17 industries: angry customers, fraud attempts, compliance edge cases, and multi-step workflows. Works with skill files, Claude Code agents, and Custom GPTs.

Mock Infrastructure

We simulate file systems, memory, and CRM connections so your agent can demonstrate its full workflow.

Training Asset Generation

We don't just find problems — we generate the skill files, guardrails, and routing rules to fix them.

Coverage

Built for every AI agent vertical

Agent Scrimmage
17 industries · 2100 scenarios
Active
Coverage Score: 99
2100 scenarios across 17 verticals with real-world edge cases
Industries: 17
Scenarios: 2100+
Pass Rate: 97%
E-commerce
152 scenarios · +12%

Returns processing, fraud detection, order tracking, subscription management, shipping disputes, payment failures, and adversarial attacks like wardrobing and chargeback fraud

Customer Support
78 scenarios · +8%

Escalation handling, SLA compliance, multi-channel handoffs, angry customer de-escalation, refund authorization workflows, and compliance-sensitive complaint routing

SaaS
72 scenarios · +15%

User onboarding flows, billing disputes, API integration troubleshooting, feature request triage, account cancellation retention, and permission escalation edge cases

GTM Engineer
80 scenarios · +10%

Outbound sequence personalization, lead qualification scoring, CRM data enrichment, pipeline handoff automation, and prospect objection handling under compliance constraints

RevOps · 72 scenarios
HR / Recruiting · 60 scenarios
Insurance · 52 scenarios
Universal · 49 scenarios
Your Industry · Custom
Use Cases

Who Uses Agent Scrimmage?

SaaS Founders

"We built a support bot but we have no idea if it actually handles edge cases. We can't ship it and hope for the best."

Upload your agent's skill files or connect via API. Get a readiness score and failure analysis in minutes — not weeks of manual QA.

AI Agencies

"Our clients ask 'how do we know this agent works?' and we don't have a real answer beyond demo conversations."

Run a 100-scenario evaluation before every client delivery. Hand them a PDF report with scores, failures, and the training assets to fix gaps.

Enterprise Teams

"Compliance won't approve our AI agent without documented testing. We need proof it handles legal threats and policy violations correctly."

2100+ scenarios include compliance traps, adversarial prompts, and policy violation detection. Export results as audit-ready documentation.

Claude Code / Custom GPT Builders

"I built my agent in Claude Code with skill files but I have no way to stress-test it outside of manually chatting with it."

Upload your skill files directly. We simulate the full environment — file system, memory, CRM connections — and run hundreds of scenarios automatically.

Why Agent Scrimmage

Two files and a click.

Other evaluation platforms require Docker, 12 services, and a dedicated DevOps engineer before you test a single scenario.

Agent Scrimmage requires two files and a click.

We found a production database bug in 5 minutes that had been live for weeks. No infrastructure. No setup. No second system to manage.

Other platforms: Docker + 12 services + DevOps engineer + days of setup
Agent Scrimmage: Upload 2 files → click → results in 5 minutes
5 min to find a production bug that was live for weeks
Real Results

From our first evaluations

We tested our own AI assistant and within 5 minutes Agent Scrimmage triggered a database replication bug that would have crashed the agent for any API customer. The bug had been in production for weeks — no customer had hit it yet.

AJ Ayubzai
Time to find: 5 minutes
Bug type: Database replication
Impact: Would have crashed for all API users
Evaluation

How We Score Your Agent

Every response scored on: accuracy, task completion, tone, policy compliance, and whether the agent stays honest about what it can and can't do.

Overall Readiness Score: 74/100
Conversation Quality: 82/100 (40%)
Tool Usage Accuracy: 91/100 (20%)
Output Quality: 71/100 (20%)
Diagnostic Accuracy: 68/100 (20%)
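
To illustrate how weighted category scores can combine, here is a minimal sketch that assumes a plain weighted average of the four categories above. The page does not state the actual formula, so the function name `weighted_score` and the averaging rule are assumptions.

```python
# Category scores and weights taken from the example breakdown above.
CATEGORIES = {
    "Conversation Quality": (82, 0.40),
    "Tool Usage Accuracy": (91, 0.20),
    "Output Quality": (71, 0.20),
    "Diagnostic Accuracy": (68, 0.20),
}


def weighted_score(categories: dict) -> float:
    """Weighted average of (score, weight) pairs, normalized by total weight."""
    total_weight = sum(weight for _, weight in categories.values())
    weighted_sum = sum(score * weight for score, weight in categories.values())
    return weighted_sum / total_weight


print(round(weighted_score(CATEGORIES), 1))
```

For these example figures a plain weighted average comes to about 78.8 rather than the 74 shown, so the platform's real formula presumably involves more than these four weights.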
Deliverables

More than a score — a fix.

Readiness Report

  • Overall score
  • Category breakdown
  • Trend over time

Failure Analysis

  • Per-scenario breakdown
  • Exact failed turns
  • Root cause identification

Training Assets

  • Skill files
  • Guardrails & routing rules
  • I/O schemas & example pairs

Download as PDF. Export as ZIP. Plug directly into your agent.

See a sample report
Plans

Pricing

One-time per evaluation. No subscriptions. No seat fees.

Free Discovery
$0
  • Connect your first agent
  • File analysis — we extract capabilities, limitations, and tools from your agent’s files
  • 20 discovery probes — we test what your agent can and can’t do
  • Readiness score
  • Industry auto-detection
  • Browse matched scenarios

No credit card required.

Request a Demo
Standard Eval
$149 one-time

Everything in Free, plus:

  • 30 scenario simulations — real conversations with angry customers, edge cases, and compliance traps
  • Every response scored on accuracy, tone, policy compliance, and task completion
  • Failure analysis — exact turns where your agent broke and why
  • Readiness report (PDF) — share with your team or stakeholders
Request a Demo
Deep Eval
$349 one-time

Everything in Standard, plus:

  • 100 scenario simulations across your full scope
  • Mock CRM, file systems, and API data — test how your agent handles real workflows
  • Training assets (ZIP) — skill files, guardrails, and routing rules to fix the gaps we found
  • Custom scenarios generated for YOUR agent’s specific capabilities
  • Re-evaluate at 90 days — verify your fixes worked
Request a Demo
FAQ

Frequently Asked Questions

Any AI agent with an API endpoint or configuration files. We support Claude Code agents, Custom GPTs, and any agent that responds to HTTP requests.

Discovery takes about 5 minutes. A 30-scenario simulation takes 15-20 minutes. Results are available immediately.

No — but you should connect a test environment, not production. For API agents, we send realistic messages to your endpoint and score the responses. This includes scenarios that ask your agent to create records, send emails, process refunds, and update data. If your agent performs real actions, those will execute against whatever system it’s connected to. We send an X-Test-Mode: true header with every request. If your agent supports it, use this header to disable side effects during evaluation. If it doesn’t, connect a staging endpoint instead. We never access your databases, CRM, or internal systems directly. For skill-file agents, there’s no risk — we simulate the entire environment using mock infrastructure. No real systems are touched.
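
On the agent side, honoring the test header might look like the sketch below. Only the `X-Test-Mode: true` header comes from the answer above; the function names, the `process_refund` action, and the returned dictionaries are hypothetical.

```python
def is_test_mode(headers: dict) -> bool:
    """Check for X-Test-Mode: true. HTTP header names are case-insensitive."""
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-test-mode", "").strip().lower() == "true"


def process_refund(order_id: str, headers: dict) -> dict:
    """Hypothetical action that skips its real side effect during evaluation."""
    if is_test_mode(headers):
        return {"order_id": order_id, "status": "simulated"}
    # A real implementation would call the payment system here.
    return {"order_id": order_id, "status": "refunded"}


print(process_refund("4821", {"X-Test-Mode": "true"}))
```

If an agent cannot be changed to check the header, pointing the evaluation at a staging endpoint achieves the same isolation.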

E-commerce, Customer Support, SaaS, GTM/RevOps, HR/Recruiting, Insurance, and Universal scenarios. We also generate custom scenarios based on your agent's specific capabilities.

Yes. Skill files and simulation data are encrypted and isolated per account. We don't train on your data.

For agents built in Claude Code or similar environments, we simulate file systems, memory, and CRM connections so your agent can demonstrate capabilities it normally has in production but can't show in a test environment.

Stop finding bugs from your customers.