
The Ultimate Guide to eDiscovery

Everything you wanted to know about eDiscovery, but were afraid to ask

Chapter 6

Predictive Coding

AI & Machine Learning in Discovery

With the growing volumes of evidence and the expense of manual review, many legal teams are considering computer-assisted review to help weed through digital records faster and with less human intervention. Predictive coding is the common name for this process, which is emerging as an accepted practice in some extremely large cases.

Predictive Coding 101

The typical professional spends about a third of their work day reading, organizing, and responding to email. Those emails quickly add up: more than 100 billion business emails are sent every day. These messages don’t just crowd our inboxes—they may become subject to litigation. When that happens, emails and other ESI will need to be collected, reviewed, and, if responsive and unprivileged, produced. Reviewing even a tiny fraction of the emails sent every day can be, to put it mildly, a daunting task.

Getting through the ever-growing volumes of emails and other ESI is one of the main challenges in litigation today. To make this process more efficient, some lawyers and technologists are turning to emerging technologies like predictive coding and computer-assisted review, through which they hope to reduce the time and expense associated with eyes-on review.

But while predictive coding solves some problems, it introduces new complications. And despite a spate of recent cases employing the technology, questions remain as to how effective it can be.

What Is Predictive Coding?

Predictive coding is the automation of document review. In other words, instead of manually reading every single document in a collection, document reviewers use computer-categorizing software that classifies documents according to how they match concepts in sample documents.

This typically works by taking information gained from manual coding and applying that logic to a larger group of documents. Reviewers use a set of documents to identify potentially responsive documents and then train the computer to identify similar ones. Technology is used to predict how certain documents would be coded, based on how they were coded manually. Hence, the name predictive coding.

But how do you know if your system’s predictions are accurate? To guide the process and measure effectiveness, these processes generally incorporate statistical models and/or sampling techniques alongside human review.
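The core idea—learn from human-coded examples, then predict codes for the rest—can be sketched in a few lines of Python. This is a deliberately simplified illustration: the sample documents are hypothetical, and the word-overlap scoring rule stands in for the far more sophisticated models that real predictive coding products use.

```python
from collections import Counter

def train(coded_sample):
    """Build word-frequency profiles from human-coded documents.

    coded_sample: list of (text, is_responsive) pairs produced by reviewers.
    """
    responsive, non_responsive = Counter(), Counter()
    for text, is_responsive in coded_sample:
        target = responsive if is_responsive else non_responsive
        target.update(text.lower().split())
    return responsive, non_responsive

def predict(text, responsive, non_responsive):
    """Predict a coding decision: True if the document's words overlap
    the responsive profile more than the non-responsive one."""
    words = text.lower().split()
    r_score = sum(responsive[w] for w in words)
    n_score = sum(non_responsive[w] for w in words)
    return r_score > n_score

# Hypothetical sample coded by human reviewers
sample = [
    ("merger price negotiation confidential", True),
    ("acquisition price terms draft", True),
    ("lunch order for friday", False),
    ("office party friday schedule", False),
]
r, n = train(sample)
print(predict("draft merger terms", r, n))     # True: predicted responsive
print(predict("friday lunch schedule", r, n))  # False: predicted non-responsive
```

The statistical sampling described above is what keeps a toy like this honest: without measuring predictions against human judgment on a held-out sample, there is no way to know whether the model's scores mean anything.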

To avoid confusion, it is important to note that not all technology-assisted review, or TAR, involves predictive coding, although the terms are sometimes used interchangeably. Predictive coding is just one form of TAR, which is itself a broader category that encompasses many uses of technology in the document review process. It is also important to note that predictive coding does not replace culling or early case assessment in the review process. For example, some TAR methods, like culling irrelevant data from a document collection, are used before applying predictive coding technology.

Will Computers Replace Document Review Attorneys?

Since computers can analyze millions of documents in minutes, some imagine that they may displace many of the lawyers, paralegals, and other professionals who are traditionally needed for document review. Yet, even the most automated review workflows can’t completely replace human reviewers. They simply change their role in the process.

Even with TAR, the reviewer’s role is still critical and is at least as important as during a traditional linear review. Rather than reviewing each and every document, the reviewer is now responsible for coding documents within the sample set. Those documents, in turn, refine the computer’s understanding of either a specific issue or the concept of responsiveness. Reviewers are also responsible for validating the review result.



How Does TAR Work?

After culling, most predictive coding workflows call for selecting and coding a sample from the collection. The purpose of the initial sample is to see how prevalent responsive documents may be in the collection. The two types of samples are:

Control Set: A random sample of documents coded by human reviewers at the start of a search or review process that is separate from and independent of the training set. Control sets are used to measure the effectiveness of the machine learning algorithm.

Training Set: A sample of documents coded by one or more subject matter experts as relevant or non-relevant, from which a machine learning algorithm then infers how to distinguish future documents.
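The control set's job—scoring the model against human judgment on documents it never trained on—can be illustrated with two standard metrics, recall and precision. This is a sketch under assumptions: the control documents and the stand-in "model" below are hypothetical, not part of any real review tool.

```python
def recall_precision(control_set, predict):
    """Score a predictive model against a human-coded control set.

    control_set: list of (text, human_label) pairs held out from training.
    predict: function mapping a document's text to a True/False prediction.
    """
    tp = fp = fn = 0
    for text, human_label in control_set:
        machine_label = predict(text)
        if machine_label and human_label:
            tp += 1          # machine and human both say responsive
        elif machine_label and not human_label:
            fp += 1          # machine over-includes
        elif human_label:
            fn += 1          # machine missed a responsive document
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Hypothetical control set, and a stand-in model that simply flags
# any document mentioning "contract"
control = [
    ("contract amendment attached", True),
    ("contract signed yesterday", True),
    ("team offsite agenda", False),
    ("pricing discussion notes", True),
]
model = lambda text: "contract" in text
recall, precision = recall_precision(control, model)
print(recall, precision)  # recall ~0.67 (one responsive doc missed), precision 1.0
```

Low recall on the control set is the signal that the training set needs more, or better, examples.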

In TAR that uses machine learning, the predictive algorithms are continuously updated, based on the judgement of the review team. This “continuous active learning” deemphasizes the role of control and training sets, allowing the predictive coding to change alongside human reviewers.

As with the manual review process, predictive coding workflows are iterative. The iterative process involves repeatedly updating the training set with additional examples of coded documents to improve results. It is critical for a lawyer to review the results, including documents coded by the machine, to verify quality. Attorneys should continually review the results and refine their search method in order to train the system.

Concerns About Predictive Coding

Because predictive coding relies on highly complex algorithms to determine responsiveness, rather than individual human judgement, some critics describe it as a “black box.” That is, information goes in and information comes out, but the internal processes are opaque.

How the technology works, and thus how coding decisions are made, is understood by only a few experts—and very few of them are lawyers. Whereas an associate in a traditional eyes-on review can explain his or her individual coding decisions, understanding why predictive coding treated a document a specific way takes highly specialized technological knowledge.

There is also debate about the level of scrutiny courts should apply to TAR claims. Magistrate Judge David J. Waxse, for example, has argued that courts must act as gatekeepers when evaluating the appropriateness of TAR. Evidence about TAR’s capabilities would have to meet the standards established in the Supreme Court’s Daubert v. Merrell Dow Pharmaceuticals decision and Federal Rule of Evidence 702. On the other hand, Magistrate Judge Andrew J. Peck has said that those standards are “not applicable to how documents are searched for and found in discovery.”

Concerns over TAR, along with the complexity and cost of predictive coding products, may have contributed to the legal industry’s slow adoption of predictive coding review.

How Do I Defend the Use of Predictive Coding?

Because predictive coding is still a nascent practice, it may be necessary to defend any use of the technology in litigation. A defensible process will look something like this:

  • Assembling a data sample.
  • Coding the sample in a linear fashion.
  • Using the sample to assemble a training set.
  • Coding the training set.
  • Running the algorithm against the training set and comparing the machine coding with the human coding.
  • Repeating the process until the human reviewer is satisfied with the machine’s understanding of the coding criteria.
  • Running the algorithm so that the coding is applied to the rest of the collection.
  • Assessing the machine-generated coding.
  • Manually reviewing the production set for a final quality check.
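The iterative heart of the steps above—train, compare machine coding with human coding, expand the training set, repeat—can be sketched as a loop. Everything here is illustrative: the toy "model" (a set of words seen in responsive documents), the agreement threshold, and the sample documents all stand in for whatever a real review platform provides.

```python
def agreement(machine, human):
    """Fraction of documents where machine and human coding agree."""
    return sum(m == h for m, h in zip(machine, human)) / len(human)

def train(training_set):
    """Toy 'model': remember which words appeared in responsive docs."""
    vocab = set()
    for text, is_responsive in training_set:
        if is_responsive:
            vocab.update(text.split())
    return vocab

def predict(model, text):
    """Code a document responsive if it shares any word with the model."""
    return any(word in model for word in text.split())

def iterative_review(training_set, extra_batches, threshold=0.9, max_rounds=5):
    """Retrain until machine coding agrees closely enough with human
    coding on the training set, expanding the set with newly coded
    documents after each unsatisfactory round."""
    batches = iter(extra_batches)
    model = train(training_set)
    for _ in range(max_rounds):
        model = train(training_set)
        human = [label for _, label in training_set]
        machine = [predict(model, text) for text, _ in training_set]
        if agreement(machine, human) >= threshold:
            break
        training_set = training_set + next(batches, [])
    return model, training_set

# Round 1: the model confuses "price" docs, so agreement is only 0.5
initial = [("merger price", True), ("price list for lunch", False)]
# Reviewers code another batch; the disagreement is diluted below threshold
batch = [
    ("cafeteria menu", False), ("quarterly merger update", True),
    ("team outing", False), ("snack order", False),
    ("parking notice", False), ("holiday schedule", False),
    ("printer issue", False), ("gym hours", False),
]
model, final_set = iterative_review(initial, [batch])
print(len(final_set))  # 10: the loop pulled in the second batch
```

In practice the loop's exit condition is a validation decision made by attorneys, not a hard-coded threshold, and the final production set still gets the manual quality check described above.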

Predictive Coding Terminology and Case Law

A Glossary of TAR Terminology

TAR involves complex algorithms, based on sophisticated mathematical and linguistic models. As in the law, the expert lingo can seem impenetrable to outsiders. Here are some of the common concepts and terms you might need to know to navigate this world.

  • Algorithm: A specified set of computations used to accomplish a particular goal. The algorithms used in eDiscovery are implemented through computer software.
  • Artificial Intelligence: A general term for computer programs that are designed to simulate human judgement. Artificial intelligence includes machine learning, which allows computers to change when exposed to new data, without needing manual programming.
  • Boolean Search: A search methodology that combines keywords with connecting words like “and” or “or” to find specific combinations. In more complex litigation, more sophisticated Boolean strings are often used with a fuzzy search technique, designed to account for variations in spelling and word choice.
  • Concept and Categorization Tool Search Systems: These techniques rely on a thesaurus to capture documents that use different words to express the same thought.
  • Clustering: A grouping method in which documents are organized into categories so that those in one category are more similar to each other than to those in another category. Clustering is an automated process, and the resulting categories may or may not be valuable for review.
  • Fuzzy Search Models: A method to refine a search beyond specific words, recognizing that words can have multiple forms. In fuzzy search models, even if search terms don’t use the exact words in a relevant document, the document might still be found.
  • Natural Language Search: A non-Boolean search method, whereby search commands are input as one would speak naturally. This is the type of search associated with search engines like Bing and Google.
  • Probabilistic Search: Search based on language models, including Bayesian belief networks, which make inferences about the relevance of documents based on how concepts are communicated in a collection.
  • Subjective Coding: The classification of documents based on subjective judgement about their responsiveness, privilege, or other categories.
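Two of the glossary terms above, Boolean search and fuzzy search, can be contrasted in a short Python sketch. The documents are hypothetical; the fuzzy matcher here uses the standard library's `difflib` similarity ratio as a simple stand-in for the proprietary fuzzy techniques discovery tools actually use.

```python
import difflib

documents = [
    "Confidential: merger negotiation update",
    "Confidental merger negotation notes",   # misspellings, as in real email
    "Cafeteria menu for the week",
]

def boolean_search(docs, *terms):
    """Boolean AND search: every term must appear, spelled exactly."""
    return [d for d in docs if all(t in d.lower() for t in terms)]

def fuzzy_search(docs, term, cutoff=0.8):
    """Fuzzy search: tolerate spelling variations via similarity ratio."""
    hits = []
    for d in docs:
        words = d.lower().split()
        if difflib.get_close_matches(term, words, n=1, cutoff=cutoff):
            hits.append(d)
    return hits

print(boolean_search(documents, "merger", "negotiation"))  # exact spelling only
print(fuzzy_search(documents, "negotiation"))              # catches "negotation" too
```

The Boolean query misses the misspelled second document; the fuzzy query finds it, which is exactly the gap fuzzy techniques exist to close.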

Predictive Coding Case Law

The first court case to embrace predictive coding as an allowable review strategy was the 2012 decision in Da Silva Moore v. Publicis Groupe. “What the bar should take away from this opinion is that computer-assisted review is an available tool and should be seriously considered for use in large-data-volume cases where it may save the producing party (or both parties) significant amounts of legal fees in document review,” U.S. Magistrate Judge Andrew Peck of the Southern District of New York wrote at the time. “Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review.”

In the years since, a handful of subsequent opinions have continued to address the issue. Here are some of the most notable:

In re Actos (Pioglitazone) Products Liability Litigation, MDL No. 6:11-md-2299 (W.D. La. July 27, 2012).

This product liability action included a case management order allowing the parties to use a “search methodology proof of concept to evaluate the potential utility of advanced analytics as a document identification mechanism for the review and production” of ESI. The search protocol provided for the use of a TAR tool on the emails of four key custodians.

Global Aerospace, Inc. v. Landow Aviation, L.P., No. CL 61040 (Va. Cir. Ct. Apr. 23, 2012).

Virginia Circuit Court Judge James H. Chamblin ruled that the defendants may use predictive coding for the purposes of processing and producing ESI.

Nat’l Day Laborer Org. Network v. U.S. Immigration & Customs Enforcement Agency, 2012 WL 2878130 (S.D.N.Y. July 13, 2012).

U.S. District Court Judge Shira Scheindlin held that the federal government’s searches for responsive documents, requested pursuant to FOIA, were inadequate because of the government’s failure to properly employ modern search technologies. Judge Scheindlin urged the government to “learn to use twenty-first century technologies,” including predictive coding as opposed to simple keyword search.

In re Biomet M2a Magnum Hip Implant Prods. Liab. Litig., 2013 U.S. Dist. LEXIS 172570 (N.D. Ind. Aug. 21, 2013).

U.S. District Court Judge Robert L. Miller Jr. ruled that defendants need not identify which of the documents, from among those they had already produced, were used in the training of the defendants’ TAR algorithm. However, the court said that it was troubled by the defendants’ lack of cooperation.

Hyles v. NYC, No. 110-cv-03119 (S.D.N.Y. Aug. 1, 2016)

In this case, Judge Peck again addressed TAR, ruling that plaintiffs could not force the producing party to use TAR in its review. “It certainly is fair to say that I am a judicial advocate for the use of TAR in appropriate cases,” Judge Peck wrote, but he found no reason to compel its use in this case. “Responding parties are best situated to evaluate the procedures, methodologies, and technologies appropriate for preserving and producing their own electronically stored information.”

Chapter 7
Discovery Software

The proliferation of PCs, laptops, smartphones, and even the Internet of Things has created exploding volumes of discoverable, electronic evidence. That growth in electronic documents threatens to break the discovery process. The only solution is to use equally disruptive technology to tame the challenges of modern discovery.

