Agentic benchmark built from real case filings.
Real World Tasks
Tasks include drafting briefs, finding precedents, spotting errors, and predicting outcomes.
Designed by former lawyers from top firms like Latham & Watkins and Greenberg Traurig.
Long Horizon
Each task can take hundreds of steps of searching, reading, and drafting.
Humans work in teams for days to solve tasks like these.
Dataset-Grounded
To perform well, agents have to search across our dataset of all US case law and regulations.
14 million real cases, including filings that are not publicly available.
Examples
Prompt
About to file Micron's MTD — can you run a thorough cite check on the memo before it goes out? Flag anything in a table.
About Midpage
14M Cases
Plus 6M statutes and regulations: a comprehensive legal dataset with a proprietary citator that shows which cases have been overruled.
300+ Firms
Midpage is used directly by over 300 law firms and reaches hundreds of thousands of users through partner organizations.
5 Partnerships
We are the data supplier to five multi-billion-dollar organizations, which use our data, search, and MCP.
200,000 Visitors
Every month, 200k+ visitors read cases directly on our website.
Benchmark Results
Accuracy vs. cost
X-axis: average cost per task; Y-axis: average accuracy.
GPT-5.4 · high: $13.80 average cost per task, 72% average accuracy.
MCP off / on
MCP includes tools for searching across our case law and regulations corpus.
Opus 4.6: 49% accuracy with MCP off, 67% with MCP on.
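For illustration, here is a minimal sketch of how an agent harness might call such a search tool over MCP, using the official MCP Python SDK. The server URL and the search_caselaw tool name are hypothetical placeholders, not Midpage's actual endpoint or tool schema.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Hypothetical endpoint; Midpage's real MCP server may differ.
MCP_URL = "https://example.midpage.ai/mcp"


async def main() -> None:
    async with streamablehttp_client(MCP_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover what the server exposes (e.g. case-law search,
            # citator lookups).
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Call an assumed search tool against the corpus.
            result = await session.call_tool(
                "search_caselaw",
                {"query": "motion to dismiss trade secret misappropriation"},
            )
            print(result.content)


if __name__ == "__main__":
    asyncio.run(main())
```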
Method
Midpage is building the first benchmark and RL env for agentic litigation. In the US alone, millions of cases are filed each year. Using those filings as verifiable environments, we want to help teach LLMs how to become excellent lawyers.
This is a private benchmark. To avoid leaking the questions, we do not give collaborators access to the full set of sample tasks. Submissions are run with the candidate model and harness inside our environments.
Responses are graded from 0 to 1 against rubrics hand-crafted by our attorneys. Our tasks use the Harbor format. Shoutout to Alex at Laude.org.
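As a rough illustration of rubric-based grading (the criteria and weights below are invented for the example; the attorney-written rubrics themselves are not public):

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One attorney-written check, e.g. 'flags the overruled citation'."""
    description: str
    weight: float
    passed: bool


def grade(criteria: list[Criterion]) -> float:
    """Weighted score in [0, 1]: fraction of rubric weight satisfied."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total if total else 0.0


# Invented example rubric for a cite-check task.
rubric = [
    Criterion("Flags the overruled case", weight=2.0, passed=True),
    Criterion("Correct pin cites throughout", weight=1.0, passed=True),
    Criterion("Output presented as a table", weight=0.5, passed=False),
]
print(grade(rubric))  # 0.857...
```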
RL Envs
Like the benchmark, the RL environments use the Harbor format. Rollouts run in the provider's own agent harness, and evaluators score from 0 to 1. This scales to tens of thousands of tasks. To avoid contamination, we use cases that are not yet in models' training sets.
Midpage collects case data the same day courts publish it, long before it reaches training datasets. Each matter includes multiple motions from the parties and one or more decisions from the judge. The documents are often 10 to 50 pages long and represent days of human work. To solve them, agents have to work like human case teams: research arguments, counterarguments, and precedent, then draft long final outputs that are consistent with the rest of the case file and compliant with court rules.
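To make the shape of these environments concrete, here is a hypothetical sketch of the task/rollout/score loop. The class and field names are our own illustration; the actual Harbor task schema is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Task:
    """One litigation task built from a real case file."""
    case_file: str     # motions, exhibits, docket entries
    instructions: str  # e.g. "Draft the opposition to this MTD."


class AgentHarness(Protocol):
    """The provider's own harness: model, tools, and control loop."""
    def run(self, task: Task) -> str: ...


def rollout(task: Task, harness: AgentHarness,
            evaluator: Callable[[Task, str], float]) -> float:
    """Run one episode and return a reward in [0, 1] for RL."""
    draft = harness.run(task)      # hundreds of search/read/draft steps
    return evaluator(task, draft)  # rubric-based score, 0 to 1
```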
Contact Us
Talk to us about the benchmark, our MCP, our dataset, and our RL environments.
Request leaderboard
DM @ottozastrow
Or contact: benchmark@midpage.ai
