
Legal tech: Beyond the myths #2 - The focus on accuracy

By Arnoud Engelfriet

What’s the difference between a lawyer and a lawyerbot? In the twenty years I’ve worked as a lawyer, no one has ever asked me how accurately I worked. But every time we introduce our lawyerbot to a new audience, the first question we get is always “How accurate is it?” Which is fine, as we have a good answer: 95.1%. But what does that even mean in a legal context?

Rote language: food for robots

As I wrote last time, computers aren’t intelligent, and never will be. They are good at computation, and therefore also at routine tasks like comparing texts, looking for statistical patterns and so on. This is great for legal tools: lawyers produce huge numbers of comparable texts with clear patterns – the rote language and standard expressions that are needed to trigger certain regulations or avoid problematic jurisprudence.

The difference, however, also means that the way a robot analyzes texts and makes legal predictions or recommendations is fundamentally different from the way a human lawyer does. This in turn has grave consequences for the perceived quality of AI for legal work. It bears repeating: we tend to consider lawyerbots to be simulated human lawyers, and their work the automated performance of human work. This is wrong, but very hard to root out.

Of course we are used to human lawyers making mistakes. A rookie lawyer can miss important case law. A senior partner may focus too much on his own hobby horse, or be out of touch with the latest developments. In a rush, certain clauses may be glossed over and implications missed. We understand and can work with such mistakes, as we can relate to them.

Machine learning: a deep dive

Robots make an entirely different class of mistakes. These have to do with the way that robots screen text. Let’s have a little dive into the technology for that.

Most robot lawyers work with so-called machine learning, a process where the computer learns to recognize patterns in data, usually based on statistical similarities derived from a given set of examples (the training dataset). Usually the process is focused on classification: assigning a label to a piece of text, e.g. “this is a liability clause” or “this is a verdict for the plaintiff”. Most contract review tools work this way. The labels help classify and value the contract, especially when a value judgment (“this liability clause is 2.5 million Euros”) can be used in the classification.

Another popular application is information extraction: “the first contract party is Royal Shell”, “this verdict cites the ECHR Sunday Times case” or “the defendant was not served with a notice of default”. This is the domain of so-called natural language processing, where human-programmed or statistics-derived rules of grammar are applied to identify information: “this clause has the supplier as the subject and uses a verb indicating obligations without an ancillary verb indicating trying, therefore this is a supplier warranty”.

In both cases, however, note that the computer has no actual grasp of the legal consequences: it is applying formulas and numbers to derive conclusions. “Must”, “shall” and “will” are all verbs indicating obligations, therefore “Supplier must” is a supplier warranty. And in particular with classification, the computer will always classify the sentence as belonging to one of its categories. There is no “ignore if you’re not certain” or “unclear” category. (In fact, if there were, the computer would probably classify all of the clauses as “unclear”, since there is no downside to it for doing so.)
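To make the point concrete, here is a minimal sketch of such a grammar-style rule in Python. The patterns and the function name is_supplier_warranty are illustrative assumptions, not the rules of any actual product; real tools use far richer grammars.

```python
import re

# Hypothetical rule: the party name followed by an obligation verb...
OBLIGATION = re.compile(r"\bSupplier\s+(must|shall|will)\b", re.IGNORECASE)
# ...without wording that indicates merely trying (assumed word list).
EFFORT = re.compile(
    r"\b(endeavou?r|attempt|try|strive|use\s+(?:reasonable|best)\s+efforts)\b",
    re.IGNORECASE,
)

def is_supplier_warranty(clause: str) -> bool:
    """Apply the rule mechanically; no legal understanding is involved."""
    return bool(OBLIGATION.search(clause)) and not EFFORT.search(clause)

examples = [
    "Supplier shall deliver the goods within 30 days.",
    "Supplier shall use reasonable efforts to deliver on time.",
    "Supplier will not be liable for any indirect damages.",
]
for clause in examples:
    print(is_supplier_warranty(clause), "-", clause)
```

The third example is the interesting one: the rule fires on “Supplier will”, even though the clause is an exclusion of liability rather than a warranty. The pattern matched, so the label is assigned.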

Confidence in robot review

The usual way of handling this limitation is to examine the prediction’s confidence. Most machine learning systems provide predictions with indications of their confidence or certainty. If an input looks very much like the training set, the system is highly confident. If the input is rather unusual, the confidence may be only 28%. During training, an engineer would search for a minimum confidence that gives the lowest number of false classifications or mistaken extractions. A prediction with only 28% confidence is likely to be ignored.
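The search for a minimum confidence can be sketched as follows. This is a simplified illustration under assumptions: the validation results are invented, the function name best_threshold is hypothetical, and the cost simply counts accepted wrong predictions plus rejected correct ones.

```python
def best_threshold(results):
    """results: list of (confidence, was_correct) pairs from a validation run.

    Returns the candidate threshold with the fewest total mistakes, where a
    mistake is either accepting a wrong prediction or rejecting a right one.
    """
    candidates = sorted({confidence for confidence, _ in results})

    def cost(threshold):
        accepted_wrong = sum(1 for c, ok in results if c >= threshold and not ok)
        rejected_right = sum(1 for c, ok in results if c < threshold and ok)
        return accepted_wrong + rejected_right

    return min(candidates, key=cost)

# Invented validation results, for illustration only.
validation = [
    (0.95, True), (0.91, True), (0.88, False), (0.74, True),
    (0.62, True), (0.45, False), (0.28, False), (0.20, False),
]
print(best_threshold(validation))  # 0.62
```

With this threshold the 28% prediction is indeed ignored, but note that the wrong prediction made with 88% confidence is still accepted.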

It is, however, a mistake to think that a prediction with a high confidence is likely to be accurate. This has to do with the fact that robots are not trained to look for the right answer. Instead, they are trained to provide an answer that best matches their training data.

Training, training, training

When we first developed our NDA-reading robot NDA Lynn, we noticed that confidentiality agreements with a California choice of law were always getting rejected as being very onerous for the recipient. This despite the fact that the clauses dealing with security, notification and so on were as standard as they could be. Further digging revealed, however, that the training dataset contained only very strict, one-sided NDAs with California law. From this, Lynn had concluded that California law is a good predictor of a very strict NDA. It thus made perfect sense to first look at the choice of law, and if that is California then to give a quick answer.
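The effect is easy to reproduce with a toy learner. The sketch below is not how Lynn works internally (that is not described here); it is a deliberately simple one-word rule learner, with an invented six-document training set and hypothetical names, that shows how a skewed dataset turns the choice of law into the deciding feature.

```python
def learn_stump(examples):
    """Pick the single word whose presence best predicts the label 'strict'."""
    vocabulary = sorted({word for text, _ in examples for word in text.split()})

    def training_accuracy(word):
        hits = sum(
            (word in text.split()) == (label == "strict")
            for text, label in examples
        )
        return hits / len(examples)

    return max(vocabulary, key=training_accuracy)

# Invented training set: every strict NDA happens to use California law.
training = [
    ("unlimited liability perpetual term california law", "strict"),
    ("no residuals clause audit rights california law", "strict"),
    ("one-sided penalty clause california law", "strict"),
    ("mutual duties two year term dutch law", "balanced"),
    ("standard security notification duties english law", "balanced"),
    ("mutual duties capped liability new york law", "balanced"),
]

key_word = learn_stump(training)
new_nda = "standard security notification duties california law"
prediction = "strict" if key_word in new_nda.split() else "balanced"
print(key_word, "->", prediction)  # california -> strict
```

On the training data the rule “California means strict” is 100% accurate, so the learner has done exactly what it was asked to do. The perfectly standard new NDA is rejected anyway.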

It is thus imperative to have a training dataset that is as complete and diverse as possible. This is however enormously hard, even when the system is restricted to only one jurisdiction. First of all, there are no public datasets with contracts, so assembling a large corpus of data is very labor intensive. Companies working in this field may be able to get (anonymized) documents from their first customers, but that introduces a bias: a law firm with an IP focus for high-tech enterprises will create different contracts than a law firm focused on SME businesses.

This is essentially the same problem as the allegations of bias that pop up whenever a machine learning system makes predictions or analyses of human behavior, such as with spotting potential fraudsters or even simple face recognition. In legal it is a bit harder to spot, as it may require deep human review of the clause to see that something fishy is going on. And to add to that, two lawyers may reasonably disagree on interpretation of a legal clause or implications of a court verdict.

That said, of course there are measures to objectively evaluate quality. A simple approach is to set aside 20% of the dataset for evaluation after the machine learning system has been trained. As this 20% was labeled beforehand, the system output can easily be compared against the human-chosen labels. In more advanced approaches, this split is done multiple times along different lines, generating multiple models with different test sets. If all comparisons reveal a good quality of the predictions, then the dataset (and the models) are suitable for practical use.
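Both approaches fit in a few lines of Python. This is a minimal sketch: the function names are assumptions, the “model” is any function that maps a text to a label, and the training step itself is left out.

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle once, then set aside a fraction of the labeled examples."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def k_fold(examples, k=5):
    """Yield k different (train, test) splits; each example is tested once."""
    for i in range(k):
        test = [e for j, e in enumerate(examples) if j % k == i]
        train = [e for j, e in enumerate(examples) if j % k != i]
        yield train, test

def accuracy(model, test):
    """Share of test examples where the model's label matches the human label."""
    return sum(model(text) == label for text, label in test) / len(test)

# Stand-in model for illustration: it always answers "liability".
always_liability = lambda text: "liability"
labeled = [("clause a", "liability"), ("clause b", "term")]
print(accuracy(always_liability, labeled))  # 0.5
```

With five folds, each model is trained on 80% of the data and tested on the remaining 20%, and every document ends up in a test set exactly once.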

Still, this presumes that the dataset is representative. A very high accuracy only means that the test dataset was well recognized – in other words, that the predicted labels match the human-assigned labels.

Handling computer mistakes

All this goes into the central question of trust. Trust is derived from accuracy in past performance, but how do we measure accuracy if it is so different from how humans work?

The usual answer involves the difference between true positives, false positives, true negatives and false negatives. Very quickly: a true positive and a true negative mean correct identification or rejection, and false positives and negatives are both mistakes. However, these terms assume a yes/no, true/false or guilty/not guilty dichotomy. In a legal robot, we usually deal with multiple classes: a contract clause can be any of 30 to 50 types, verdicts contain a lot more information than “guilty / not guilty” and let’s not even go into the number of options in a legal demand letter.
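One common way to carry these terms over to multiple classes is to count them per label: for each clause type, a false positive is a clause wrongly given that label, and a false negative is a clause of that type given some other label. The sketch below assumes invented data and a hypothetical function name; it reports precision and recall per label.

```python
from collections import Counter

def per_class_scores(pairs):
    """pairs: list of (human_label, predicted_label).

    Returns {label: (precision, recall)}, treating each label in turn as
    the 'positive' class and all other labels as 'negative'.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fp[predicted] += 1  # wrongly claimed for this label
            fn[expected] += 1   # missed for this label
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        claimed = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / claimed if claimed else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores

# Invented review results, for illustration only.
pairs = [
    ("liability", "liability"), ("liability", "term"),
    ("term", "term"), ("term", "term"),
    ("payment", "payment"), ("payment", "liability"),
]
for label, (precision, recall) in per_class_scores(pairs).items():
    print(f"{label}: precision {precision:.2f}, recall {recall:.2f}")
```

Overall accuracy here is four out of six, but the per-label numbers show where the mistakes sit: half of the liability clauses were missed, and half of the clauses labeled as liability were something else.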

Merely looking at a classification being wrong is not enough. You can have small or big mistakes. Human lawyers can easily make small mistakes: overlooking a dependent clause, forgetting to change a minimum term after a statute has changed, and so on. Big mistakes (say, taking a liability clause as a contract term) rarely if ever happen, and then only to the most junior of rookies. However, for a computer these are all more or less the same: it’s not part of the model, it was labeled differently, this is what the dataset looked like. So getting your liability clause misinterpreted or a carve-out to a payment penalty scheme overlooked can happen just as easily.

This is a key reason why lawyerbots are harder to trust: they make strange mistakes, which just as often may be rookie mistakes. And rookie mistakes have huge consequences. This means that a human lawyer feels obliged to double-check the lawyerbot’s work all the time, which in turn destroys any added value (e.g. time saving) the lawyerbot may claim to have had.

Going forward

Understanding the function of AI tools is tremendously important. Anyone who considers them mere automated versions of human lawyers is setting themselves up for a huge disappointment.

The key issue when it comes to accuracy is not to strive for 100% perfection. This is impossible, just like with human lawyers. Even when a dataset is composed of thousands of documents (as with NDA Lynn: 14,000 NDAs), there is still a chance a new contract has vastly different language, and thus the system will perform worse. Continuous re-training based on mistakes is thus key. This requires input from the human lawyers: which label did you expect here?

Similarly important is the realization that a lawyerbot does not distinguish between big and small mistakes. The only way to address this issue is to ensure there are multiple steps in the robot’s process. For example: when a liability clause is detected, double-check for certain expected wording and reject the detection if that’s missing (such as a number or reference to contract value). Check if another clause has been detected as the same, but with higher confidence. And so on. This is an iterative process that again requires human lawyers to get to know their robot counterpart.
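The two checks from this example can be sketched as a small post-processing step. Everything here is an assumption for illustration: the Detection structure, the wording pattern (any digit, or the phrase “contract value”) and the sample clauses.

```python
import re
from dataclasses import dataclass

@dataclass
class Detection:
    clause: str
    label: str
    confidence: float

# Assumed expected wording: a number or a reference to the contract value.
AMOUNT_OR_VALUE = re.compile(r"\d|contract value", re.IGNORECASE)

def confirmed_liability_clause(detections):
    """Step 1: drop 'liability' detections that lack the expected wording.
    Step 2: of those remaining, keep only the highest-confidence one."""
    candidates = [
        d for d in detections
        if d.label == "liability" and AMOUNT_OR_VALUE.search(d.clause)
    ]
    return max(candidates, key=lambda d: d.confidence, default=None)

detections = [
    Detection("The agreement renews automatically after the initial term.",
              "liability", 0.81),
    Detection("Invoices are payable within 30 days.", "liability", 0.55),
    Detection("Each party's liability shall not exceed the contract value.",
              "liability", 0.93),
]
best = confirmed_liability_clause(detections)
print(best.clause if best else "no liability clause confirmed")
```

The first detection is rejected for missing the expected wording despite its 81% confidence. The second passes the wording check by accident (it contains a number) and is only discarded because another clause was detected as the same type with higher confidence. Neither check is sufficient alone, which is why the rules need tuning together with the lawyers who use the tool.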

Finally, it comes down to positioning. How can a lawyerbot be deployed to save time or money, whilst minimizing the impact of inaccurate analysis? But not only that: the human lawyer who reviews the bot’s work should also be positioned to easily provide feedback on the bot’s output. Nothing frustrates acceptance of a tool as much as being unable to change its workings. Back in the box it will go. Every lawyerbot tool therefore should come with a feedback button, and of course its designers should listen to the feedback. Let’s get that right!


About the Author

Arnoud Engelfriet is co-founder of the legal tech company JuriBlox, and creator of its AI contract review tool Lynn Legal. Arnoud has been working as an IT lawyer since 1993. After a career at Royal Philips as IP counsel, he became a partner at ICTRecht Legal Services, which has grown from a two-person firm in 2008 to an 80+ person legal consultancy firm.
