Blair and Maron (Still) Must Die!

This is part two of Michael Simon's piece on why Blair and Maron's dated study needs to stop exerting so much influence in the eDiscovery industry. Part one can be read here.

Computer systems have gotten much better since 1985, when Blair and Maron were investigating attorneys’ efficacy at document search. Just consider that back then Radio Shack was still selling nearly 10 percent of all computers. The Sedona Commentary of course recognizes this improved state of affairs:

In the years since Blair and Maron, and with increasing attention focused on the e-discovery space, the IR community has been engaged in research and the development of methods, tools, and techniques that compensate for endemic ambiguity and variation in human language, thereby improving the recall and precision of searches. Sedona Conference Commentary on Search and Retrieval at 26.

Blair himself understood how much technology and process had improved. In STAIRS Redux in 1996, Blair recognized how much things had changed in just ten years, first as to the cost side of the equation:

In the late 1970s when the study was done, word processing systems were practically unknown in business . . . the cost of input alone was $26 X 40,000, or $1,040,000. . . To our minds, this, clearly, was too great a price to pay for the “advantage” of full-text retrieval. . .

Today, the “better than nothing” argument for simple full-text retrieval is more convincing than it was 10 or 15 years ago. Most documents now begin as machine readable documents, so there is no up front cost of typing the documents into an information system. STAIRS Redux at 18

But Blair also realized that the situation had improved as to the results, especially with the use of exemplary documents:

But our other caveats remain, namely, that for an information retrieval system to be used for a “mission critical” or, high recall, application, the capabilities of simple full-text retrieval alone are just not up to the task. In such systems, the simple full-text retrieval must be augmented with a carefully thought out logical or intellectual structure, usually based on the activity that those documents serve (Blair, 1990, 1995b). In the STAIRS study the obvious logical structure that could have been used was the written complaint on which the lawsuit was based. STAIRS Redux at 18-19

By 2001, Blair was writing, in his paper "Exemplary Documents: A Foundation for Information Retrieval Design" (along with Wharton School professor Stephen Kimbrough) about how "exemplary documents" could be particularly useful for retrieval in the legal space. Blair and Kimbrough believed that the complaint in "corporate and government litigation" could be very useful, along with deposition transcripts, by serving as exemplar documents to help refine and power search systems.

Anything Could Be “Iterative” if You Have Infinite Time and Patience

Modern search systems, including eDiscovery platforms, have also improved so greatly in speed that they can now be used for the kind of iterative searching recommended by industry experts. Some modern commentators, including the EDRM, claim that the Blair and Maron study also made use of iterative searching, but that claim doesn’t stand up to the published description of the work in the study:

. . . They generated a total of 51 different information requests, which were translated into formal queries by either of two paralegals, . . . The paralegals searched on the database until they found a set of documents they believed would satisfy one of the initial requests. The original hard copies of these documents were retrieved from files, and Xerox copies were sent to the lawyer who originated the request. The lawyer then evaluated the documents, ranking them according to whether they were “vital,” “satisfactory,” “marginally relevant,” or “irrelevant” to the original request. The lawyer then made an overall judgment concerning the set of documents received, stating whether he or she wanted further refinement of the query and further searching. The reasons for any subsequent query revisions were made in writing and were fully recorded. An Evaluation at 291

The 1985 study does not tell us how long this process took, though it certainly describes an interminable amount of work for 51 search queries. We do have STAIRS Redux to give us an overall idea of how slow the process actually was by telling us that the relevancy determinations took six months. Think about that when you complain that your current eDiscovery review platform took more than 5 seconds to get you your next iteration of search results.

So, was the Blair and Maron 1985 process iterative? Technically, yes. But so was Victorian Era play-by-mail chess, if we want to stretch the definition of “iterative” that far. To say that six months for 51 search queries should not be distinguished from “current iterative approaches” seems to me to be, frankly, ludicrous.

Comparing Apples to Oranges, Or Is That Apples to Fruit Bats?

Another critical point to note is that the 1985 test was not a discovery document review; it was an entirely different situation, with very different goals. The test was preparation for trial, using a database of 40,000 documents (roughly 350,000 pages) with the goal being to find specific, perhaps even unique documents with the exact language needed to support testimony and exhibits at trial. Even though the test subjects had previously selected all 40,000 of those documents themselves, they still found that this was very tough to do:

Stated succinctly, it is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents, . . . An Evaluation at 295

In fact, the difficulty of the test didn’t stop there. The test subjects didn’t just have to find unique words and phrases for the documents they sought; they also had to craft queries that would exclude other documents:

In order for a simple full-text system to retrieve effectively, the user/searcher must be able to predict (and use as his query terms) those words, phrases and word combinations that occur in most of the relevant documents, and which do not occur in most of the nonrelevant documents. Blair and Maron, “Full-Text Information Retrieval: Further Analysis and Clarification,” Information Processing & Management, 1989 at 438

Not surprisingly, adding this twist to the test requirements made it all the more difficult. Plus, the Blair and Maron test had a different focus than is typical for eDiscovery reviews. The goal of any search is to get the best results, or as the experts would say, maximize both recall and precision. Recall and precision can be expressed by simple ratios:

Recall = Number of responsive documents retrieved / Number of responsive documents overall

Precision = Number of responsive documents retrieved / Number of documents retrieved

If a collection of documents contains, for example, 1,000 documents, 100 of which are relevant to a particular topic and 900 of which are not, then a system that returned only these 100 documents in response to a query would have a precision of 1.0, and a recall of 1.0.

If the system returned all 100 of these documents, but also returned 50 of the irrelevant documents, then it would have a precision 100/150 = .667, and still have a recall of 100/100 = 1.0.

If it returned only 90 of the relevant documents along with 50 irrelevant documents, then it would have a precision of 90/140 = 0.64, and a recall of 90/100 = 0.9. Sedona Commentary at 24
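The three Sedona scenarios above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `precision_recall` and the scenario labels are made up for this example and do not come from the Sedona Commentary or from Blair and Maron.

```python
def precision_recall(relevant_retrieved, total_retrieved, total_relevant):
    """Return (precision, recall) from raw document counts."""
    precision = relevant_retrieved / total_retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall


# Counts taken from the Sedona example: 1,000 documents, 100 of them relevant.
# Each tuple is (relevant retrieved, total retrieved, total relevant).
scenarios = {
    "only the 100 relevant": (100, 100, 100),
    "100 relevant plus 50 irrelevant": (100, 150, 100),
    "90 relevant plus 50 irrelevant": (90, 140, 100),
}

for label, counts in scenarios.items():
    p, r = precision_recall(*counts)
    print(f"{label}: precision={p:.3f}, recall={r:.3f}")
```

Note that the size of the whole collection (1,000) never enters either ratio; only the retrieved set and the relevant set matter.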

However, it is extremely difficult to get both high precision and high recall at the same time; recall and precision typically have something of an inverse relationship so that improving one tends to degrade the other, as the Sedona Commentary notes:

Importantly for the practitioner, there is typically a trade-off between precision and recall. One can often adjust a system to retrieve more documents – increasing recall – but the system achieves this result at the expense of retrieving more irrelevant documents – decreasing precision. Effectively, one can cast either a narrow net and retrieve fewer relevant documents, along with fewer irrelevant documents, or cast a broader net and retrieve more relevant documents, but at the expense of retrieving more irrelevant documents. Sedona Conference Commentary, page 24
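To make the "narrow net versus broad net" idea concrete, here is a toy sketch. The ranked list below is entirely hypothetical (it is not data from any study), and `precision_recall_at` is a name invented for this illustration. It simply shows what tends to happen as you take more and more results from the top of a ranked list.

```python
# Hypothetical ranked results, best match first.
# True = responsive document, False = non-responsive document.
ranked = [True, True, False, True, False, True, False, False, True, False]


def precision_recall_at(ranked, k):
    """Precision and recall if only the top k results are retrieved."""
    hits = sum(ranked[:k])          # responsive documents in the top k
    total_responsive = sum(ranked)  # responsive documents in the whole list
    return hits / k, hits / total_responsive


# A narrow net (k=2), a medium net (k=5), and a broad net (k=10).
for k in (2, 5, 10):
    p, r = precision_recall_at(ranked, k)
    print(f"top {k:2d}: precision={p:.2f}, recall={r:.2f}")
```

In this made-up list, widening the net raises recall while precision falls, which is the pattern the Sedona passage describes. Real collections will not behave this neatly.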

For the Blair and Maron test, precision was paramount. The study was designed so that nothing else mattered. Remember that the searchers were tasked with finding specific smoking gun documents and only those documents. Considering the difficulty of that task, the team got an average of 79 percent precision, which is pretty impressive actually.

The goal of searching within eDiscovery is not to find a specific document but to find documents that are relevant, potentially relevant or even just likely to lead to relevant information so as to provide a certifiable production under FRCP 26(g) or some state equivalent. That’s a very different sort of task than what the Blair and Maron team faced. Yet, in a way the test team seemed to do pretty well at this kind of broad searching:

Retrieving all the documents in which Company A was mentioned was too broad a search; it retrieved over 5,000 documents. An Evaluation at 296

For an eDiscovery search, finding the small subset of critical documents needed for deposition or trial would be the start of a carefully constructed process funnel, with all or many of those documents potentially going to eyes-on review (or some form of technology-assisted review) for further winnowing down. But in the Blair and Maron test, each time the team did not correctly jump through the many flaming hoops needed to find the exact “smoking gun” it counted as a fail. At the very same time, when the team failed to eliminate any non-smoking gun documents, Blair and Maron counted it as a fail – for each and every document!

So, let us be clear: the Blair and Maron study team did not fail at an eDiscovery search test, but instead at the very different kind of test which they were given.

You Call That a Valid Sample Size?!

And speaking about that team . . . here is where it gets really “interesting.” Let me ask you a question: if you were going to cite a study to tell lawyers that they are not proficient at searching for documents, how many lawyers would you expect to have participated in that study? Imagine yourself standing in front of audiences or writing articles telling the entire legal profession that they suck at searching. Would you be comfortable with citing a study that found 1,000 lawyers failed at it? The people I ask (and I do ask, all the time) are usually just fine with that, as it seems like plenty of participants for a scientifically credible study.

I then ask the same people: what if there were only 100 lawyers in the study? Most people start hemming and hawing a bit, as that’s kind of a small number. We might get some pushback from the legal profession at that point, people start to think.

How about if there were 10? Ummmm, just 10? Yeah, everybody I talk to gets really nervous at 10, as that seems like something likely to get thrown right back in your face by a profession filled with people who argue for a living.

So how about two?

Yes.

Really.

Two.

Don’t believe me? Just read the study:

CONDUCT OF THE TEST

For the test, we attempted to have the retrieval system used in the same way it would have been during actual litigation. Two lawyers, the principal defense attorneys in the suit, participated in the experiment. An Evaluation at 291

The two lawyers had two paralegals helping to translate their requests into system queries, but again that is just two lawyers creating those queries. That’s a sample set of two lawyers in a study that we are still to this day using to beat up on an entire industry of over 1.2 million lawyers.

Not being a statistician myself, I cannot say how many lawyers Blair and Maron should have used for this study to provide statistically valid results. But this survey software company seems to know a lot about statistics, and a quick run-through on their handy online sample size calculator (using a 5 percent confidence interval, that is, a 5 percent margin of error, at the default 95 percent confidence level) gives us an idea of a proper sample size within a population of 1.2 million: 383.
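For readers who want to see roughly where a number like that comes from, here is a sketch of one common textbook approach: Cochran's formula with a finite population correction. This is an assumption about the general method, not a description of how that particular online calculator works. The function name `sample_size` and its defaults (z = 1.96 for 95 percent confidence, p = 0.5 as the most conservative proportion) are choices made for this illustration.

```python
def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Estimate the sample size needed for a given margin of error.

    Uses Cochran's formula with a finite population correction.
    z=1.96 corresponds to a 95 percent confidence level, and p=0.5
    is the most conservative assumption about the true proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    return n0 / (1 + (n0 - 1) / population)     # finite population correction


n = sample_size(1_200_000)
print(f"Estimated sample size: about {n:.0f} lawyers")
```

Under these assumptions the estimate comes out at about 384, within one of the 383 quoted above; small differences like that typically come from rounding conventions or a slightly different z value. Either way, it is a long way from two.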

Conclusion: It’s Long Past Time for Us to Bury Blair and Maron

Does this all mean that attorneys don’t suck at searching for documents? Maybe. There are plenty of other statistically valid, sufficiently recent studies, such as the TREC studies, that show that keyword searching is not necessarily the best way to search for documents. But at the same time, studies such as the most recent one by the Electronic Discovery Institute show that PC/TAR use by attorneys is in the single digits. The remaining 90+ percent of attorneys must be using something to select their relevant documents and we can most likely rule out Ouija boards or Magic Eight-Balls. Thus, attorneys will likely stick to keyword searches for some time.

Perhaps they should change. But if they do, it shouldn’t be because of a 32-year-old study, using 48-year-old technology, focusing on a task very different from eDiscovery and completed by all of two attorneys. Maybe if we just let Blair and Maron finally die, we can then come up with something better to convince them to actually change.

This post was authored by Michael Simon, an attorney and consultant with over 15 years of experience in the eDiscovery industry. Principal at Seventh Samurai and Adjunct Professor at Michigan State University College of Law, he regularly writes and presents on pressing eDiscovery issues. He can be reached at michael.simon@seventhsamurai.com.