Everything you wanted to know about eDiscovery, but were afraid to ask
With the growing volume of evidence and the expense of manual review, many legal teams are considering computer-assisted review to help them weed through digital records faster and with less human intervention. Predictive coding is the common name for this process, which is emerging as an accepted practice in some extremely large cases.
The typical professional spends about a third of the work day reading, organizing, and responding to email. Those emails quickly add up: more than 100 billion business emails are sent every day. These messages don’t just crowd our inboxes—they may become subject to litigation. When that happens, emails and other electronically stored information (ESI) will need to be collected, reviewed, and, if responsive and unprivileged, produced. Reviewing even a tiny fraction of the emails sent every day can be, to put it mildly, a daunting task.
Getting through the ever-growing volumes of emails and other ESI is one of the main challenges in litigation today. To make this process more efficient, some lawyers and technologists are turning to emerging technologies like predictive coding and computer-assisted review, through which they hope to reduce the time and expense associated with eyes-on review.
But while predictive coding solves some problems, it introduces new complications. And despite a spate of recent cases employing the technology, questions remain as to how effective it can be.
Predictive coding is the automation of document review. In other words, instead of manually reading every single document in a collection, document reviewers use categorization software that classifies documents according to how closely they match concepts in sample documents.
This typically works by taking information gained from manual coding and applying that logic to a larger group of documents. Reviewers code a small set of documents to identify potentially responsive material and then train the computer to identify similar documents. Technology is used to predict how documents would be coded, based on how they were coded manually. Hence, the name predictive coding.
But how do you know if your system’s predictions are accurate? To guide the process and measure effectiveness, these processes generally incorporate statistical models and/or sampling techniques alongside human review.
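To make the train-then-predict idea concrete, here is a deliberately simplified sketch in Python. It uses a tiny Naive Bayes text classifier, one common machine learning approach, not the algorithm of any particular eDiscovery product, and all documents and coding labels are hypothetical:

```python
from collections import Counter, defaultdict
import math

def tokenize(text):
    return text.lower().split()

class NaiveBayesCoder:
    """Toy classifier illustrating predictive coding: learn from manually
    coded documents, then predict "responsive" / "non-responsive" labels
    for documents no human has reviewed yet."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        scores = {}
        total = sum(self.label_counts.values())
        for label in self.label_counts:
            # log prior plus log likelihood with add-one smoothing
            score = math.log(self.label_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(doc):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical sample set, coded by a human reviewer
train_docs = [
    "merger agreement draft attached for review",
    "merger negotiation timeline and deal terms",
    "lunch on friday at noon",
    "office fantasy football standings",
]
train_labels = ["responsive", "responsive", "non-responsive", "non-responsive"]

coder = NaiveBayesCoder().fit(train_docs, train_labels)
print(coder.predict("revised merger deal terms"))  # responsive
print(coder.predict("friday lunch plans"))         # non-responsive
```

Real systems use far more sophisticated models and feature engineering, but the workflow is the same shape: human coding decisions become training data, and the model extends those decisions to the rest of the collection.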
To avoid confusion, it is important to note that not all technology-assisted review, or TAR, involves predictive coding, although the terms are sometimes used interchangeably. Predictive coding is just one form of TAR, a broader category that encompasses many uses of technology in the document review process. Nor does predictive coding replace culling or early case assessment: some TAR methods, like culling irrelevant data from a document collection, are applied before predictive coding technology comes into play.
Since computers can analyze millions of documents in minutes, some imagine that they may displace many of the lawyers, paralegals, and other professionals who are traditionally needed for document review. Yet even the most automated review workflows can’t completely replace human reviewers. They simply change their role in the process.
Even with TAR, the reviewer’s role is still critical and is at least as important as during a traditional linear review. Rather than reviewing each and every document, the reviewer is now responsible for coding documents within the sample set. Those documents, in turn, refine the computer’s understanding of either a specific issue or the concept of responsiveness. Reviewers are also responsible for validating the review result.
After culling, most predictive coding workflows call for selecting and coding a sample from the collection. The purpose of the initial sample is to see how prevalent responsive documents may be in the collection. The two types of samples are:
Control Set: A random sample of documents coded by human reviewers at the start of a search or review process that is separate from and independent of the training set. Control sets are used to measure the effectiveness of the machine learning algorithm.
Training Set: A sample of documents coded by one or more subject matter experts as relevant or non-relevant, from which a machine learning algorithm then infers how to distinguish future documents.
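The role of the control set can be illustrated with the standard effectiveness metrics. In this hypothetical sketch (the coding labels and predictions are invented for illustration), prevalence estimates how common responsive documents are, while recall and precision compare the machine's predictions against the human coding of the control set:

```python
def prevalence(control_labels):
    """Estimated richness: the fraction of the control set coded responsive."""
    return control_labels.count("responsive") / len(control_labels)

def recall(control_labels, predictions):
    """Of the documents humans coded responsive, how many did the machine find?"""
    found = sum(1 for truth, pred in zip(control_labels, predictions)
                if truth == "responsive" and pred == "responsive")
    total = control_labels.count("responsive")
    return found / total if total else 0.0

def precision(control_labels, predictions):
    """Of the documents the machine flagged responsive, how many actually were?"""
    correct = sum(1 for truth, pred in zip(control_labels, predictions)
                  if truth == "responsive" and pred == "responsive")
    flagged = predictions.count("responsive")
    return correct / flagged if flagged else 0.0

# Hypothetical control set of 8 documents coded by human reviewers,
# alongside the machine's predictions for the same documents.
truth = ["responsive", "responsive", "responsive", "non-responsive",
         "non-responsive", "non-responsive", "non-responsive", "non-responsive"]
preds = ["responsive", "responsive", "non-responsive", "responsive",
         "non-responsive", "non-responsive", "non-responsive", "non-responsive"]

print(prevalence(truth))        # 0.375
print(recall(truth, preds))     # 2 of 3 responsive documents found
print(precision(truth, preds))  # 2 of 3 flagged documents were correct
```

Because the control set is coded independently of training, these numbers give the review team a yardstick that the model cannot simply memorize its way past.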
In TAR that uses machine learning, the predictive algorithms are continuously updated based on the judgment of the review team. This “continuous active learning” deemphasizes the role of control and training sets, allowing the predictive model to evolve alongside the human reviewers.
As with the manual review process, predictive coding workflows are iterative: the training set is repeatedly updated with additional examples of coded documents to improve results. It is critical to have a lawyer review all documents and processed data, including spot-checking documents coded by the machine to verify quality. Attorneys should continually review the results and refine their search method in order to train the system.
Because predictive coding relies on highly complex algorithms to determine responsiveness, rather than individual human judgment, some critics describe it as a “black box.” That is, information goes in and information comes out, but the internal processes are opaque.
How the technology works, and thus how coding decisions are made, is understood by only a few experts—and very few of them are lawyers. Whereas an associate in a traditional eyes-on review can explain his or her individual coding decisions, understanding why predictive coding treated a document a specific way takes highly specialized technological knowledge.
There is also debate about the level of scrutiny courts should apply to TAR claims. Magistrate Judge David J. Waxse, for example, has argued that courts must act as gatekeepers when evaluating the appropriateness of TAR. Evidence about TAR’s capabilities would have to meet the standards established in the Supreme Court’s Daubert v. Merrell Dow Pharmaceuticals decision and Federal Rule of Evidence 702. On the other hand, Magistrate Judge Andrew J. Peck has said that those standards are “not applicable to how documents are searched for and found in discovery.”
Concerns over TAR, along with the complexity and cost of predictive coding products, may have contributed to the legal industry’s slow adoption of predictive coding review.
Because predictive coding is still a nascent practice, it may be necessary to defend any use of the technology in litigation. A defensible process will look something like this: