Julian Bryant, Esq.
San Diego, CA
800-1672

TLDR: Quick Summary

Language models power A.I. tools such as ChatGPT, Anthropic’s Claude, and Google’s NotebookLM, which can substantially boost attorney productivity.[1]
The MMLU benchmark is a standard test for evaluating language models. It consists of thousands of multiple-choice questions, and a model’s score is the percentage of questions it answers correctly.
The publicly published MMLU scores are not very useful for deciding which A.I. language models to use for legal reasoning tasks.
Here I privately test several language models on the ‘professional law’ subset of MMLU questions.
These private tests show that the models perform between 5% and 21% worse on legal reasoning than their published public scores suggest.
OpenAI’s “o1” models performed the best and provided the correct answer 80% of the time.

Background

Recently, A.I. systems have gained the ability to process natural language in a way that can substantially increase attorney productivity. At the core of these systems are language models.[1] Several new language models are released every week. Companies usually evaluate the quality of these models using standard benchmark tests. There are hundreds of these tests, but this brief focuses on just one of them – the MMLU (Massive Multitask Language Understanding) benchmark.

Developed in 2021, MMLU tests models with multiple-choice questions spanning 57 subjects—from elementary mathematics to professional fields like law and medicine. While newer benchmarks now better assess complex reasoning capabilities, MMLU remains valuable for the breadth of subjects it tests and for its dedicated legal subcategory, which consists of over 1,700 questions. For legal professionals, MMLU’s professional law subset offers insights into how well these models understand legal concepts and reasoning.
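
For readers who want to inspect the questions themselves, the professional law subset is publicly available. The short snippet below is only a sketch, assuming the "cais/mmlu" copy of the dataset hosted on Hugging Face; it simply counts the questions in each split.

```python
# Count the MMLU professional-law questions, assuming the publicly hosted
# "cais/mmlu" dataset on Hugging Face (pip install datasets).
from datasets import load_dataset

law = load_dataset("cais/mmlu", "professional_law")
print({split: len(rows) for split, rows in law.items()})
# The test, validation, and dev splits together account for the 1,700+
# questions mentioned above.
```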

Because I regularly use private versions of these language models to assist in some of my legal work, I needed a better understanding of how these models are evaluated. So, I began to write my own benchmarking scripts to test available models for legal reasoning ability.
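
As a rough illustration (not my exact harness), a minimal version of such a script might look like the sketch below. It assumes the Hugging Face "cais/mmlu" dataset and the OpenAI Python client; the model name and the 25-question sample size are placeholders.

```python
# Minimal sketch of a legal-reasoning benchmark run. Assumes the Hugging Face
# "cais/mmlu" dataset (fields: question, choices, answer) and the OpenAI client.
import random

from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LETTERS = "ABCD"


def ask(model: str, question: str, choices: list[str]) -> str:
    """Send one multiple-choice question and return the model's letter answer."""
    options = "\n".join(f"{LETTERS[i]}) {c}" for i, c in enumerate(choices))
    prompt = (
        f"{question}\n\n{options}\n\n"
        "Answer with only the letter of the correct choice."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = reply.choices[0].message.content.strip().upper()
    # Take the first A-D that appears, in case the model adds explanation.
    return next((ch for ch in text if ch in LETTERS), "?")


def run_batch(model: str, n: int = 25, seed: int = 0) -> float:
    """Score the model on a random sample of n professional-law questions."""
    data = load_dataset("cais/mmlu", "professional_law", split="test")
    rng = random.Random(seed)
    correct = 0
    for i in rng.sample(range(len(data)), n):
        row = data[i]
        if ask(model, row["question"], row["choices"]) == LETTERS[row["answer"]]:
            correct += 1
    return correct / n


if __name__ == "__main__":
    print(f"Accuracy: {run_batch('gpt-4-turbo'):.0%}")
```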

[1] Often called large language models (LLMs), though this term is becoming outdated as some effective models are smaller and many now handle multiple types of data beyond text.

Private Benchmark Test Results

(MMLU Professional Law Question Subset)

Below are both the publicly published MMLU benchmark scores and the scores I got from my private tests. Because of time and budget constraints, I could not test each model against the entire set of roughly 1,700 questions in the professional law dataset. Instead, my private results are averages from testing each model on 25 questions at a time.
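
Continuing the sketch from the Background section, averaging a handful of 25-question batches might look like the lines below (the number of batches and the model name are illustrative, not my exact procedure).

```python
# Illustrative only: average several 25-question batches with different random
# seeds, using run_batch() from the earlier sketch, to smooth out sampling noise.
scores = [run_batch("gpt-4-turbo", n=25, seed=s) for s in range(4)]
print(f"Mean accuracy over {len(scores)} batches: {sum(scores) / len(scores):.1%}")
```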

The table below illustrates the significant discrepancy between published scores and actual performance on legal questions:

OpenAI

                            o1-Mini      o1-Preview    GPT-4 Turbo
Publicly-Released Scores    82.5%        86.9%         75.0%
My MMLU Benchmark Score     80.0%        80.0%         64.0%

Gemini

                            Flash 2.0 Thinking    Flash Pro 1.5
Publicly-Released Scores    N/A                   89.5%
My MMLU Benchmark Score     72.0%                 68.0%

Anthropic

                            Sonnet 3.7    Sonnet 3.5    Sonnet 3.0
Publicly-Released Scores    N/A           89.5%         81.5%
My MMLU Benchmark Score     72.0%         72.0%         60.0%

Note that the private scores on the legal subset of questions are consistently 5.2% to 21.5% lower than the public scores. In the worst case, Claude Sonnet 3.0 scored only 60%, meaning that for any given legal question it would provide the correct answer only 60% of the time. Below is an example of a question from the professional law dataset. (Feel free to skip ahead to the Conclusion if you wish.)

Sample MMLU Professional Law Test Question

Question: On December 30, a restaurant entered into a written contract with a bakery to supply the restaurant with all of its bread needs for the next calendar year. The contract contained a provision wherein the restaurant promised to purchase "a minimum of 100 loaves per month at $1 per loaf." On a separate sheet, there was a note stating that any modifications must be in writing. The parties signed each sheet. Both sides performed fully under the contract for the first four months. On May 1, the president of the bakery telephoned the manager of the restaurant and told him that, because of an increase in the cost of wheat, the bakery would be forced to raise its prices to $1.20 per loaf. The manager said he understood and agreed to the price increase. The bakery then shipped 100 loaves (the amount ordered by the restaurant) to the restaurant, along with a bill for $120. The restaurant sent the bakery a check for $100 and refused to pay any more. Is the restaurant obligated to pay the additional $20?

Choose the correct answer from the following choices:

A) Yes, because the May 1 modification was enforceable even though it was not supported by new consideration.

B) Yes, because the bakery detrimentally relied on the modification by making the May shipment to the restaurant.

C) No, because there was no consideration to support the modification.

D) No, because the modifying contract was not in writing; it was, therefore, unenforceable under the UCC.
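
For illustration, the benchmark simply presents a question like this to the model as a prompt with its lettered choices and compares the returned letter against the dataset's answer key. Using the ask() helper from the earlier sketch (the model name is illustrative, and the question string is abbreviated here), that looks like:

```python
# The question string is abbreviated; the choices are the four options above.
# ask() is the helper from the earlier sketch; the model name is illustrative.
question = "On December 30, a restaurant entered into a written contract with a bakery ..."
choices = [
    "Yes, because the May 1 modification was enforceable even though it was not supported by new consideration.",
    "Yes, because the bakery detrimentally relied on the modification by making the May shipment to the restaurant.",
    "No, because there was no consideration to support the modification.",
    "No, because the modifying contract was not in writing; it was, therefore, unenforceable under the UCC.",
]
print(ask("o1-mini", question, choices))  # prints a single letter, A-D
```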

Conclusion

This testing reveals an important nuance: specialized benchmarks provide more accurate guidance than general performance scores. While the 5% – 21% gap in legal reasoning deserves attention, even 80% accuracy can be useful. Top models can analyze legal questions instantly and at scale—dramatically enhancing efficiency in document review, contract analysis, or preliminary research when properly supervised. Understanding these specific capabilities allows legal professionals to leverage A.I.’s strengths while implementing appropriate guardrails around its current limitations.

Next Steps

  1. I plan to build out my testing system to efficiently and thoroughly test more models using a broader range of benchmark tests. I’m currently working on evaluations and explanations for the MMLU-Pro, GPQA, LongBench v2, and BIG-Bench Hard benchmarks.
  2. I’m getting more interested in developing more specialized benchmark tests. For example, because my practice focuses on estate planning and probate issues in California, evaluation questions specific to this legal domain and jurisdiction would be much more valuable than general legal questions. Domain-specific benchmarks could test an AI’s understanding of particular practice areas and jurisdictional knowledge, providing attorneys with much more relevant performance data for their specific needs. This approach would likely benefit lawyers across all specialties who are considering A.I. adoption.
  3. While accuracy is one important metric, there are privacy and cost considerations as well. Next month I plan to release an evaluation of different A.I. tools in terms of ethical duties regarding client confidentiality.

Additional Information