At Logikcull, we do an unhealthy amount of thinking about files. After all, most of our clients send us ingestions and download document productions in ZIP format, and ZIP files have been synonymous with data transfer since before the Web era. We use them any time we need to move data from one place to another, whether we’re shipping business records, email collections, or video game saves.
We use them without realizing it: modern Microsoft Office documents and spreadsheets, for instance, are ZIP files under another name. In law, too, ZIPs are everywhere, from the humble enclosure of a single draft memorandum to the mammoth hundred-gigabyte document production encapsulating thousands of billable hours of attorney work product.
It’s a testament to their versatility, durability, and utility that we use ZIPs so often without thinking too hard about where the format came from or how it works. Intuitively, ZIPs do two things: they collect a bunch of files into one big file, which we engineers call archiving, and they (usually) make those files smaller than they would be otherwise, which we call compression. A single archive tends to be easier to move between computers than a pile of loose files, because one file is simpler to handle than thousands; and a smaller file moves faster, because computer networks are much slower than the computers they connect.
One of the reasons ZIPs are a pretty good format is because we’ve had a lot of time to master the principles involved.
A long time ago, when electronic computers were a very new technology, there wasn’t any standard or convention for how to separate individual data entries or records from each other. Computers rose to prominence in the post-World War II era as tools of government agencies and businesses, which were being overwhelmed by a massive accumulation of newly acquired scientific, economic, political, and legal data.
The tools of the preceding era for storing, searching, and transferring data had been manila folders, paperclips, and filing cabinets. In 1960, a file system wasn’t a piece of computer software that kept the photos your aunt just emailed you from appearing in your latest surreply to that pesky motion to dismiss; it was a set of rules for how people stored and searched for paper documents in a metal file cabinet or on a shelf. Most people were familiar with at least one such system: the one their local library used to keep track of books.
So, to ease the cognitive burden of living and working in a suddenly digital environment, computer designers and engineers adopted some familiar patterns. They took the concept of a filing system and applied it to electronic records stored in machines.
Perhaps the most successful early example of a computer file system was ERMA (Electronic Recording Machine for Accounting), which Bank of America introduced in 1955 to process the exponentially increasing volume of checks its prosperous customers were writing. Other companies and government agencies followed suit, including IBM and AT&T’s Bell Labs. By 1975 most mass-production computers offered some type of electronic file system for organizing records.
With the proliferation of file systems came many different ways of naming, organizing, and describing files, as well as a powerful need to transfer the data stored in the files between computers or onto tapes for storage.
Billion-dollar business acquisitions and government projects suddenly hinged on whether one organization’s millions of files’ worth of data could be transferred into the other organization’s computers. It seems quaint to worry about such things now, but in the mainframe era these were mammoth problems with no established solutions!
Early archive formats encoded the contents of files, along with some common metadata (their names and locations in the file system, the times at which they were created and last accessed), in a rational scheme that produced a single file. Anyone with knowledge of the scheme could decode the archive’s contents and add them to a different file system, throwing away whatever information didn’t make sense in the files’ new home. The name of the oldest such scheme still in widespread use, TAR, conceals its origin story: it originally stood for Tape ARchive.
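The idea is easy to see in miniature. Here’s a sketch using Python’s standard-library `tarfile` module (the file name and contents are invented for illustration): each entry in the archive carries its name, size, and timestamps alongside its data, and any TAR-aware program can read them back out.

```python
import io
import tarfile
import time

# Build a tiny TAR archive in memory: one file plus its metadata.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"quarterly figures"
    info = tarfile.TarInfo(name="records/report.txt")  # name and path in the file system
    info.size = len(data)
    info.mtime = int(time.time())  # last-modified time travels with the file
    tar.addfile(info, io.BytesIO(data))

# Anyone with knowledge of the scheme can decode the archive's contents.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar.getmembers():
        print(member.name, member.size, member.mtime)
```

A receiving system is free to keep or discard any of that metadata, which is exactly how archives bridge dissimilar file systems.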
One persistent problem for these early archivers was the size of the single files they produced. If you create an archive of every file on a computer, the resulting file is obviously at least as large as all of those files combined.
Data transfer and storage have almost always been the slowest and most expensive aspects of computer operation, so making data smaller has always been a very profitable business.
Before AI fever, there was data compression fever.
And like AI, data compression has a reputation even among software engineers as an arcane art. Engineers in the field regularly throw around phrases like “psychoacoustic modeling” and “differential phase-shift keying.” But the truth is that many of the underlying principles are very simple. For instance, it takes fewer characters to write the words “sixty quadrillion” than the number 60,000,000,000,000,000, so if that number shows up in a document a thousand times, you’ll save a lot of ink and hand cramping by writing the two words rather than the twenty-two digits and commas.
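That principle, substituting a short reference for a long repeated pattern, is exactly what general-purpose compressors do. A minimal sketch with Python’s standard-library `zlib` module, which implements DEFLATE (the method modern ZIP files most commonly use), applied to the ink-saving example above:

```python
import zlib

# The same 22-character number written out a thousand times.
text = ("60,000,000,000,000,000\n" * 1000).encode()

compressed = zlib.compress(text)
print(len(text), "bytes before,", len(compressed), "bytes after")
# DEFLATE spots the repetition and stores, in effect, "repeat the
# previous 23 bytes" instead of writing the digits out again.
```

The more repetitive the input, the more dramatic the savings; truly random data, with no patterns to exploit, barely compresses at all.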
Data compression is the MacGuffin at the center of HBO’s hit series Silicon Valley, and for good reason: many millionaires in their 50s owe their fortunes to an understanding of the minutiae of arithmetic coding and the Burrows-Wheeler transform. By the early 1980s, data compression engineers had achieved a sort of rockstar status within the software industry for their arcane statistical wizardry and low-level programming incantations.
Among the luminaries of this generation was a man from Milwaukee named Phil Katz. If his name isn’t familiar, it may surprise you to learn that you have been unconsciously repeating his initials for more than 20 years.
In the mid-to-late 80s, PCs became a rapidly growing component of American business and education, producing ever-larger volumes of data. In the blink of an eye, fixed disks changed from an expensive luxury into an absolute necessity, every form of software was sold in a box containing floppy disks, and office supply stores started selling floppies directly to consumers in bulk. In 1986 a high-capacity 5-1/4” floppy cost about $4 at retail, the equivalent of $10 today. Anyone who could reliably cut consumption of floppies had the ear of investors and computer enthusiasts.
That very year, after some disputes with an employer he felt didn’t adequately appreciate his talents, Katz founded a compression software startup called PKWare. He wrote a program that combined the roles of archiving and compression, producing single files that both aggregated multiple files and condensed them into a much smaller amount of storage space.
Katz spoke the language of machines better than most, and the compression program he wrote was both speedier and more efficient than its predecessor, a program named ARC. His own program was originally named PKARC, but a trademark dispute with ARC’s publisher forced him to find a new name for his brainchild that would fit in the 3 letters allowed for file extensions in those days. The legend goes that he heard someone praise the “zippy” operation of the software, and he settled on .zip.
A few years later PKZIP was the de facto compression and archive standard for PCs, and .ZIP took its place in the pantheon alongside .DOC, .PDF, and .TXT.
Two factors drove ZIP to the top of the pile over its numerous competitors.
First, the PKZIP program was distributed as shareware, meaning that there was a free version that people could distribute along with their compressed archives, so that anyone could decompress them without having to buy their own copy of the software.
Second, and perhaps most importantly, Katz distributed the ZIP standard—that is, the rulebook for how to make and read ZIP files—along with every copy of the software. This meant that any software engineer reasonably versed in the art could write their own program to work with the ZIP format, and many did.
The ZIP standard includes what is known as a record signature: a pattern of data that marks, for the computer’s benefit, where one compressed file’s record ends and the next begins. The signature appears at least twice for every file in a ZIP. Its exact content is arbitrary; it just needs to be consistent, and it so happens that Katz made its first two bytes his initials, PK. Sure enough, if you open a ZIP with a program that can examine the raw contents of the file (sometimes called a “hex editor”), you’ll see “PK” all over the place.
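You don’t even need a hex editor to see this. A quick sketch in Python: build a one-file ZIP in memory and inspect the raw bytes. The archive opens with the local file header signature, the letters PK followed by two version bytes, and the initials recur at the start of the central directory and end-of-archive records.

```python
import io
import zipfile

# Build a minimal one-file ZIP in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

raw = buf.getvalue()
# The archive opens with a local file header: "PK" plus bytes \x03\x04.
print(raw[:4])  # b'PK\x03\x04'
# The local header, the central directory entry, and the end-of-central-
# directory record each begin with Katz's initials.
print(raw.count(b"PK"))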
Phil Katz died tragically in 2000, so we’ll never know what he would have thought about ZIP’s footprint in the information age. The format has been extended multiple times in his absence, new features added and old ones changed; but his initials remain, an indelible reminder of his legacy.