Agentic benchmark built from real case filings.
Real World Tasks
Tasks include drafting briefs, finding precedents, spotting errors, and predicting outcomes.
Designed by former lawyers from top firms like Latham & Watkins and Greenberg Traurig.
Long Horizon
Each task can take hundreds of steps of searching, reading, and drafting.
Humans work in teams for days to solve tasks like these.
Dataset-Grounded
To perform well, agents have to search across our dataset of all US case law and regulations.
14 million real cases, including filings that are not publicly available.
About Midpage
14M Cases
Plus 6M statutes and regulations, with a comprehensive legal dataset and a proprietary citator showing which cases are overruled.
300+ Firms
Midpage is used by over 300 law firms directly and reaches hundreds of thousands through partner organizations.
5 Partnerships
We are the data supplier to 5 multi-billion-dollar organizations. They use our data, search, and MCP.
200,000 Visitors
Every month, 200k+ visitors read cases directly on our website.
Benchmark Results
Score vs cost
Average estimated cost per task vs. average benchmark score. Baseline without MCP on the completed 600-task litigation run.
Claude Opus 4.7 max
Avg cost: $7.66
Avg score: 52.6%
Avg latency: 23.8 min
MCP off / on
MCP includes tools for searching across our case law and regulations corpus. Values are measured on the completed 600-task FrontierLaw benchmark runs.
Claude Opus 4.7 max
MCP off: 52.6% avg score · $7.66 avg cost · 23.8 min avg latency
MCP on: 51.8% avg score · $7.34 avg cost · 21.8 min avg latency
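To make the off/on comparison easier to read, the differences can be computed directly from the figures above. This is a minimal sketch: the `RunSummary` class and `delta` function are illustrative names, not part of any Midpage tooling, and the only inputs are the six numbers reported on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Averages for one benchmark run (hypothetical container)."""
    avg_score_pct: float
    avg_cost_usd: float
    avg_latency_min: float


def delta(off: RunSummary, on: RunSummary) -> RunSummary:
    """Return on-minus-off differences, rounded to two decimals."""
    return RunSummary(
        avg_score_pct=round(on.avg_score_pct - off.avg_score_pct, 2),
        avg_cost_usd=round(on.avg_cost_usd - off.avg_cost_usd, 2),
        avg_latency_min=round(on.avg_latency_min - off.avg_latency_min, 2),
    )


# Figures as reported above for the 600-task runs.
mcp_off = RunSummary(avg_score_pct=52.6, avg_cost_usd=7.66, avg_latency_min=23.8)
mcp_on = RunSummary(avg_score_pct=51.8, avg_cost_usd=7.34, avg_latency_min=21.8)

d = delta(mcp_off, mcp_on)
print(f"score: {d.avg_score_pct:+.1f} pts")
print(f"cost: {d.avg_cost_usd:+.2f} USD")
print(f"latency: {d.avg_latency_min:+.1f} min")
```

On these figures, turning MCP on lowers the average score by 0.8 percentage points, the average cost by $0.32, and the average latency by 2.0 minutes.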
Method
Midpage is building the first benchmark and RL env for agentic litigation. In the US alone, millions of cases are filed each year. Using those filings as verifiable environments, we want to help teach LLMs how to become excellent lawyers.
This is a private benchmark. To avoid leaking the questions, we do not give collaborators access to the full set of sample tasks. Submissions are run with the candidate model and harness inside our environments.
Responses are graded from 0 to 1 through rubrics hand-crafted by our attorneys. Our tasks use the Harbor format created by Laude.org.
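As an illustration of how a rubric can map a response to a score between 0 and 1, here is a minimal sketch. It assumes a weighted checklist in which each criterion is either met or not. The actual rubric structure is not described on this page, so the `Criterion` fields, the weights, and the example criteria are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line item (hypothetical structure)."""
    description: str
    weight: float
    met: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted fraction of criteria met, always between 0 and 1."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


# Invented example criteria, for illustration only.
example = [
    Criterion("Cites controlling precedent", weight=2.0, met=True),
    Criterion("Complies with court formatting rules", weight=1.0, met=True),
    Criterion("Addresses the opposing argument", weight=1.0, met=False),
]

score = rubric_score(example)
print(score)
```

Normalizing by the total weight keeps scores comparable across tasks whose rubrics have different numbers of criteria.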
RL Envs
Like the benchmark, the RL environments use the Harbor format. Rollouts use the provider's own agent harness, and evaluators score from 0 to 1. This scales to tens of thousands of tasks. To avoid cheating, we use cases that are not yet in the model's training sets.
Midpage collects case data the same day courts publish it, long before it reaches training datasets. These matters include multiple motions from the parties and one or more decisions from the judge. The documents are often 10 to 50 pages long and represent days of human work. To solve them, agents have to work like human case teams: research arguments, counterarguments, and precedent, then draft long final outputs that are consistent with the rest of the case file and compliant with court rules.
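The idea of using only cases a model has not yet seen can be sketched as a simple date filter. This is an assumption about one way to apply it, not a description of Midpage's pipeline: the `CaseRecord` fields, the docket IDs, and the cutoff date below are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CaseRecord:
    """Minimal case metadata (hypothetical fields)."""
    docket_id: str
    published: date


def eligible_cases(cases: list[CaseRecord], training_cutoff: date) -> list[CaseRecord]:
    """Keep only cases published strictly after the model's training cutoff."""
    return [c for c in cases if c.published > training_cutoff]


# Invented records and cutoff, for illustration only.
cases = [
    CaseRecord("example-001", date(2024, 3, 1)),
    CaseRecord("example-002", date(2025, 9, 15)),
    CaseRecord("example-003", date(2025, 11, 2)),
]
cutoff = date(2025, 6, 30)

fresh = eligible_cases(cases, cutoff)
print([c.docket_id for c in fresh])
```

A strict comparison is used so that a case published on the cutoff date itself is excluded.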
Contact Us
Talk to us about the benchmark, our MCP, our dataset, and our RL environments.
Request leaderboard
DM us at @ottozastrow
Or contact: benchmark@midpage.ai
