How is GPT-4 at Contract Analysis?

Where Generative AI Is Good in Contract Data Extraction … And Where It Isn’t

TL;DR: GPT-4 is impressive overall but, on contract review tasks, it’s inconsistent and makes mistakes; it’s probably not yet ready as a standalone approach if predictable accuracy matters.

Back in 2011, when we started Kira Systems, other contract review software companies were using rules- or comparison-based approaches to finding data in contracts. We used supervised machine learning to find information in contracts, and did well against our competitors. Eventually, most of the world’s leading law and audit/consulting firms (and a bunch of corporates) became Kira customers. In recent months, Generative AI solutions have become all the rage. Are they going to supplant other machine learning approaches in contract analysis, just as machine learning approaches beat out rules?

We thought it would be useful to drill into this. Our thoughts on the subject come in three parts:

How Large Language Models perform on contract analysis tasks. Given GPT-4’s easy accessibility, we’ll discuss this with examples from it.
The advantages and disadvantages of LLMs versus other tech.
Where we think LLMs can be especially helpful in contract review, and where contracts AI tech is headed.

Throughout, we use “Large Language Models,” “LLMs,” and “Generative AI” interchangeably. We recognize that this isn’t a great equivalence: Generative AI uses Large Language Models, but not every Large Language Model is generative. Nonetheless, we think the shorthand works well enough here.

This piece will cover how LLMs do at contract analysis. Specifically, we tested GPT-4 on a number of contract analysis challenges.

This is a long piece, so here’s a table of contents in case you would like to skip parts:

Why We’re (Not?) Worth Reading On This Topic

How GPT-4 Does At Identifying Data In Contracts

Header Detection?
Differently-Phrased Clauses
Sensitivity To Prompts
Summary

Before we get to our evaluation, let’s cover why you might find our perspective on this helpful.

Why We’re (Not?) Worth Reading On This Topic

On the “worth reading” side, we have a lot of experience in contracts AI, and ⅔ of us have computer science PhDs.

Noah was a corporate lawyer, then co-founded leading machine learning contract analysis company Kira Systems in 2011, and was its CEO until its sale to Litera in 2021. He’s now Zuva’s CEO. He has been involved in contract analysis AI longer than most.
Adam joined Kira in 2016 as a Research Scientist, having finished his PhD in computer science at the University of Waterloo (where he worked with Professor Gordon Cormack, a longtime leader in eDiscovery). At Zuva, he leads our Research and Product Development teams.
Sam joined Kira in 2018, after finishing his computer science postdoctoral work. He is currently a Senior Research Scientist at Zuva. His work has included a focus on differential privacy.

On the “not worth reading” side, perhaps:

Our experience prevents us from understanding a completely different present and future. We would like to think we’re open minded, but this is possible.
We are biased because we have a horse in the race: Zuva sells contracts AI software, and maybe we are trying to defend how we do things, as opposed to looking at the space clearly.

At Zuva, our mission is to make it dead easy to use the world’s best contracts AI. Frankly, we don’t really care about the technical approach we take to get there. Supervised ML, LLMs, even rules if appropriate - whatever. We just would like to (1) build great contracts AI and (2) make it dead easy to use. We think doing those things has the potential to get us to a good outcome. Also, even if we are biased, bias can sometimes drive crisper thinking on an issue. This is basically how the adversarial legal system works.

How GPT-4 Does At Identifying Data In Contracts

Adam, Noah, and Dr. Alexander Hudek (Kira and Zuva’s co-founder) have run a number of contract analysis tests on ChatGPT, beginning with ChatGPT’s release, but we tarried in writing up our findings. With GPT-4’s release, we thought the time was right to share what we have learned.

When we first started playing with ChatGPT, we were pretty wowed.

GPT-3.5 was very impressive, and GPT-4 (on limited testing) seems even better. We think these can be really helpful for the right use cases. However, we now have a more nuanced view of how GPT-4 performs on contract review tasks.

Due to (1) the amount of attention on GPT-4 and (2) how easy it is to try ChatGPT, we are going to discuss LLM performance with examples from GPT-4. We know there are other Large Language Models out there, including some that have been specifically trained on legal documents. It’s very likely they perform differently, but it’s hard to say whether that means they are better or worse.

Let’s get into some examples.

Header Detection?

One really tricky thing about doing accurate post-signature contract analysis is that wordings can be non-standard. Sometimes documents come in the form of poor quality scans. At other times, they are drafted in atypical ways.

To measure how GPT-4 performed on slightly altered wording, we first gave GPT-4 some contract clauses (copied from a contract filed on Edgar) and asked it to identify what clauses were there.

Differently-Phrased Clauses

Contract clauses can be worded a lot of different ways. Poor quality scans, and contracts written in different jurisdictions (sometimes by non-English-first-language, non-lawyer authors) can contribute to even more disparate wordings.

Change of control clauses are among the most important to identify in connection with M&A contract review (aka, due diligence). At the time we sold Kira, 18 of the top 25 global M&A law firms were Kira customers, so we have a fair amount of experience with this use case. 30–60% of a typical M&A legal bill goes to due diligence. Not all of this is contract review, and not all the contract review segment of due diligence is about finding change of control clauses, but (accurately!) finding change of control clauses tends to be a big part of this work. Companies who are buying other companies literally spend millions of dollars with Biglaw firms on a regular basis to get this done right. Not only are change of control clauses important to get right, but there are a lot of ways to word them. As with other areas of contract review, this problem can be exacerbated by atypical drafting and poor quality scans.

We decided to test how GPT-4 would do if we ran some typical change of control clauses through. We took it relatively easy here, only giving pretty standard change of control language and not, say, introducing the distortion of a poor quality scan or non-English-first-language drafted clauses.

All of 3, 6, 7, and 9 are change of control clauses. GPT-4 showed real room for improvement, though at least it got 6.
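To put a rough number on "room for improvement," the usual measures are precision (how many flagged clauses were right) and recall (how many real clauses were found). The snippet below is a minimal sketch, not a record of the test above: the set of true clauses (3, 6, 7, and 9) comes from the text, but the set of flagged clauses is a hypothetical assumption made only for illustration, and the function name `precision_recall` is ours.

```python
def precision_recall(flagged, actual):
    """Return (precision, recall) for a set of flagged clause numbers."""
    flagged, actual = set(flagged), set(actual)
    true_positives = len(flagged & actual)
    # Guard against division by zero when nothing was flagged or nothing exists.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# From the text: clauses 3, 6, 7, and 9 are change of control clauses.
actual_clauses = {3, 6, 7, 9}

# ASSUMPTION (illustration only): the model flagged clause 6 and nothing else.
hypothetical_flagged = {6}

precision, recall = precision_recall(hypothetical_flagged, actual_clauses)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under that assumption, everything flagged is correct, yet only one of the four clauses is found. For due diligence, where missing a clause is the expensive mistake, recall is the number to watch.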

One objection to this test is that GPT-4 (or another Large Language Model) could be trained to find variations of change of control clauses, just as other contracts AI was heavily trained to find clauses. Zuva has a not-small machine learning research team, and we are always interested in how to get our tech to perform better. When we have tested using LLMs to find information in contracts, we have found we can get comparable accuracies. The catch is that they are orders of magnitude more expensive to train and use. For example, in one of our papers we compared a (non-generative) LLM to our current tech for finding named entities. The LLM was 4% more accurate, which is not nothing, but cost 10,000 times more than our baseline to achieve that¹. We’ll delve deeper into costs in part 2.

Sensitivity To Prompts

In the course of testing GPT-4 on summarizing text and question answering (which we’ll discuss in more detail in part 3), we noticed that it was sensitive to prompts. This is fairly unsurprising, given what we’ve seen with other LLMs. Be mindful of this if using an LLM in your work.
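One simple way to check for prompt sensitivity in your own work is to ask the same question about the same clause with several differently worded prompts and see how often the answers agree. The following is a minimal sketch under stated assumptions: the answers are hypothetical placeholders typed in by hand (no model is called), and `agreement_rate` is an illustrative helper name, not part of any library.

```python
from collections import Counter


def agreement_rate(answers):
    """Fraction of answers matching the most common answer (1.0 = fully consistent)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize case and whitespace so "Yes" and "yes " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


# HYPOTHETICAL answers to one question about one clause,
# collected from four differently worded prompts.
hypothetical_answers = ["Yes", "yes", "No", "Yes "]
print(agreement_rate(hypothetical_answers))
```

A rate well below 1.0 on questions with a clear right answer suggests the output depends on the phrasing of the prompt more than on the contract.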

Summary

GPT-4 has generally really impressed us. It appears able to do a lot very well. That said, we are unconvinced that it yet offers predictable accuracy on contract analysis tasks. (And we have not yet really pushed it, for example on poor quality scans and less standard wordings, though perhaps it would do well on these.) If you use Generative AI to help you find contracts with a change of control clause, it will identify change of control clauses in contracts. If, however, you need it to find all (or nearly all) change of control clauses over a group of contracts, we wouldn’t yet count on it². Still, this technology is improving, and our view may change as we test further iterations. Also, as we’ll discuss in a further installment, we think LLMs offer significant benefits today when combined with other machine learning contract analysis technologies. Exciting!

¹ In our experience, Zuva’s ML is far more effective on clause detection than finding contextually-relevant entities. Accordingly, we didn’t expect to “win” this comparison. But we think it’s important to test and see where we can improve.

² In situations where accurate contract review really matters, we recommend using contract analysis AI in conjunction with a user interface that includes a document viewer. A document viewer increases the odds that a human working with the AI might catch mistakes the AI makes.

How is GPT-4 at Contract Analysis? | Zuva (2024)
