
[Video] How to Search a FOIA Document Dump—And Hold the Government Accountable at the Same Time

June 5, 2019  |  5 min read


Last Sunday, The New York Times published an in-depth report looking into potential conflicts between Secretary of Transportation Elaine Chao, her family’s international shipping business, American shipping interests, and China.

The geopolitical implications are interesting enough, but for those of us in the discovery realm, the Times’s report is also another compelling illustration of how journalists are increasingly using data, and the open records laws which provide access to that data, to make headlines.

Public records requests, after all, are not too different from discovery projects, requiring the collection, review, and production of massive document collections. And, for the receiving party, those productions are often similar to the data dumps that pervade litigation. Except they’re worse.

So, what does this report tell us about FOIA, data-journalism, and government accountability? And how can the right technology help those looking for the next big lede in a pile of government emails—or opening those records to the public in the first place?

Follow along in the video to find out, or read the approximate transcription that follows. And if you want to jump into the FOIA files yourself, you can do so here.

Hello everyone, I’m Casey Sullivan of Logikcull.com and today I want to talk about The New York Times. You might have seen the Times’s report on Elaine Chao this weekend. The front-page feature, running nearly 7,000 words long, details the potential conflicts between Elaine Chao, Secretary of the Department of Transportation, her family’s international shipping business, and the American shipping interests she is, ostensibly, supposed to represent.

It’s a compelling read, for anyone who likes big headlines or big ships. But, what stands out to us here at Logikcull is the role that data and public records requests played in breaking this story.

Many of the new revelations came to light through email correspondence the Times obtained under the Freedom of Information Act, or FOIA. Stories like these highlight the importance of data, and discovery, to our democracy, and they’re why we give journalists like these free Logikcull accounts—so they can have the tools they need to find what matters in data.

Now, FOIA requests aren’t too different from discovery requests, but they can present even more challenges. For producing parties, FOIA caseloads can be massive, leading to years-long backlogs—and often litigation. The tools FOIA teams have to review and produce documents, too, can make the process even slower. The State Department, for example, recently estimated that it would take 66 years to review 100,000 emails, partly because of its frustratingly manual review process. But that’s a separate issue.

For receiving parties (nonprofits, journalists, businesses), FOIA productions are often very similar to litigation data dumps, with the important information hidden under a mountain of junk. Except, unlike in legal discovery, FOIA doesn’t allow requesting parties to specify the form of production. Which means that FOIA documents often come in a single, thousand-page-long PDF, possibly delivered on a CD-ROM, with no file metadata. Often, they can’t even be searched without further processing.

The government emails made available by the Times are a good example of this. One giant PDF of emails—some of which look like they were printed, scanned, then redacted and produced.

What’s a public advocate or public records officer to do?

Honestly, getting through these productions isn’t too tricky if you have the right tools, which we know the Times does.

Let’s take a look at how Logikcull would treat these emails.

Once you’ve waited your months or years to obtain your FOIA production, you can just drag and drop it into Logikcull and get into your data in a matter of minutes. While it’s uploading, Logikcull is performing more than 3,000 automated processing steps, like making all those imaged files text searchable.

Alright, so we’re into the data. But first, you’ll notice that we’ve got a problem here. Since the data was produced in a single PDF, we’ve got a single PDF. How do you call out the most critical data when dealing with a giant PDF? How do you tag docs, comment to colleagues, etc.?

Easy, you split it up. We’re going to take the quickest route and just split page by page here, but you can also split docs by bookmarks, on specified pages, etc.
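For readers who want to see the idea outside of Logikcull, here is a minimal Python sketch of page-by-page splitting. It is not how Logikcull works internally. It assumes the PDF's text has already been extracted or OCR'd into one string with a form-feed character between pages (the convention tools such as pdftotext follow by default). The `Page` class, the `split_into_pages` function, and the sample text are hypothetical names and data made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One page of the production, treated as its own reviewable document."""
    number: int                                   # 1-based page number in the original PDF
    text: str                                     # extracted (or OCR'd) text of the page
    tags: set[str] = field(default_factory=set)   # reviewer tags, empty to start


def split_into_pages(full_text: str) -> list[Page]:
    """Split extracted text into one Page per form-feed-delimited chunk."""
    chunks = full_text.split("\f")
    # Extractors often end the output with a trailing form feed, which
    # would otherwise produce an empty extra page at the end.
    if chunks and not chunks[-1].strip():
        chunks.pop()
    return [Page(number=i, text=chunk.strip()) for i, chunk in enumerate(chunks, start=1)]


# Invented two-page sample standing in for the extracted production.
production = "From: A\nSubject: Hello\fFrom: B\nSubject: Ethics review\f"
pages = split_into_pages(production)
pages[1].tags.add("hot")  # tag the second page for follow-up

for page in pages:
    print(page.number, sorted(page.tags), page.text.splitlines()[0])
```

Blank pages in the middle of the file are kept as empty `Page` objects so that page numbers still line up with the original PDF.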

Now that we’ve broken up our production we can get into searching. I can do a traditional linear review, and here we don’t have many pages so that’s not a problem. I can tag pages and comment to collaborators, who will be instantly notified. If I’m a FOIA officer, performing a review of docs before they’re released to the public, I can apply redactions and either tag or comment with the relevant FOIA exemption. I’ll tag this with my favorite FOIA exemption, B9, or Exemption 9, which exempts from disclosure geological and geophysical information concerning wells.

To get through a real data dump, though, you’ll want to cull out data you don’t need and focus in on the stuff you do. Filters will let you narrow your docs by custodian, date range, sender, etc.

Because of the format these documents came in, we’re going to be lacking some metadata that would appear in discovery contexts: things like custodians, create date, to: and from: fields. But we can work around that, for example, by keyword searching for something like “‘From: Browne, Noah’ AND ‘To: Leiby, Thomas.’” Not ideal, but it works.
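To make that workaround concrete, here is a rough Python sketch of an AND search over phrases in page text. It illustrates the general technique, not Logikcull's search engine. The `contains_all` helper is a hypothetical name, and the sample pages use placeholder names (Doe, Roe) and invented content rather than anything from the actual production.

```python
def contains_all(text: str, phrases: list[str]) -> bool:
    """Return True if every phrase appears in text, ignoring case and spacing."""
    # Collapse runs of whitespace (including line breaks) so a header that
    # wraps or picks up extra spaces during OCR still matches.
    normalized = " ".join(text.split()).casefold()
    return all(" ".join(p.split()).casefold() in normalized for p in phrases)


# Invented sample pages keyed by page number.
pages = {
    1: "From: Doe, Jane\nTo: Roe, Richard\nSubject: Schedule",
    2: "From: Roe, Richard\nTo: Doe, Jane\nSubject: RE: Schedule",
    3: "From: Doe,  Jane\nTo: Someone, Else\nSubject: Lunch",
}

# Equivalent to searching for "From: Doe, Jane" AND "To: Roe, Richard".
query = ["From: Doe, Jane", "To: Roe, Richard"]
hits = [number for number, text in pages.items() if contains_all(text, query)]
print(hits)  # [1]
```

Only page 1 matches, because the sender and recipient labels are part of the phrases. The same limitation noted above applies here too: if OCR garbles a name or a label, an exact phrase match will miss that page.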

We can also do straight keyword search for topics that we want to dive right into. Here, we’ll do “ethics” and voila, we can see all the internal correspondence over potential ethics concerns.
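As a small illustration of what a straight keyword search does, the Python sketch below finds whole-word, case-insensitive matches and returns a snippet of surrounding text for each one. It is a simplified stand-in, not Logikcull's implementation. The `keyword_hits` function, its `width` parameter, and the sample pages are all made up for this example.

```python
import re


def keyword_hits(pages: dict[int, str], keyword: str, width: int = 30) -> list[tuple[int, str]]:
    """Return (page number, snippet) for each whole-word, case-insensitive match."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = []
    for number, text in pages.items():
        flat = " ".join(text.split())  # flatten line breaks for readable snippets
        for match in pattern.finditer(flat):
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(flat))
            hits.append((number, flat[start:end]))
    return hits


# Invented sample pages keyed by page number.
pages = {
    1: "Please loop in the Ethics office before the trip.",
    2: "Lunch at noon?",
    3: "No ethics\nconcerns were raised.",
}

hits = keyword_hits(pages, "ethics")
for number, snippet in hits:
    print(number, "...", snippet, "...")
```

Matching on word boundaries keeps a search for "ethics" from also returning longer words that merely contain it, and the snippet gives a reviewer enough context to decide whether to open the page.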

Suddenly, a FOIA document dump isn’t that hard to get through. If you’d like to play around in this very project, you can access it here. And if you’re a journalist who would like access to Logikcull, the same tool Pulitzer Prize-winning Times journalists like Eric Lipton use, shoot an email to hi@logikcull.com and we’d be happy to give you an account, for free.