Computer software source code and e-discovery
Andrew Schulman
http://www.SoftwareLitigationConsulting.com
While electronic discovery (e-discovery) focuses largely on data stored in or generated by computers, there is an additional area whose handling is becoming an essential e-discovery skill: code, that is, the software which computers run in order to create and process data. This article quickly compares and contrasts source code (the most readable type of software evidence) with e-discovery generally, noting cases the reader may consult for more details (most are federal patent cases; some are cited for fact patterns rather than for the central holding in the case). For a discussion of source-code examination/review for litigation, see the FAQ and outline of forthcoming book on source-code examination/discovery.
One e-discovery resource defines software as “Any set of coded instructions (programs) stored on computer-readable media that tells a computer what to do.” Software is a form of text, written by humans. It is in some ways not much different from other text (such as Word documents or emails), except that software text conforms to certain rules (programming languages, such as C++, Java, and JavaScript), and that computers interpret this text as instructions to perform actions.
What is source code?
All software-based products, devices, and services are built from files containing source code, which is text directly written by computer programmers. This is often transformed into a different type of file, called object code, which is what directly instructs the computer in its operation. The relationship between source code and a software product or service (used by consumers or within an enterprise) is somewhat analogous to the relationship between a blueprint and a building.
However, there are two important differences between blueprints and source code. First, while humans consult blueprints in order to construct buildings, software products/services are created directly from source code. The source code kept inside a company thus has a direct relationship to the company’s publicly-visible products; contrast the more indirect, though still important, relationship of an organization’s other internal documents (such as blueprints, internal memos, or emails) to its public or external behavior.
Second, because so much modern business, government, and other activity is based on software, source code has wide applicability in litigation. Everything in the modern world which uses software was built in part from source code. Source code often embodies rules and de facto policies of organizations (see the book Code and Other Laws of Cyberspace by Lawrence Lessig).
While many software products and services are widely accessible (how many million copies of Windows or Facebook or the iPhone iOS operating system exist on the planet right now?), their underlying source code is tightly held (apart from the important exception of so-called “open source”) as a proprietary so-called “crown jewel.” Litigation in which software is at issue generally focuses on the closely-held source code, rather than on more widely-available object code, in part because the source code is more readily understood by a larger number of people (perhaps surprisingly, even many computer programmers have little idea of how to reverse engineer object code).
When is source code relevant and necessary in litigation?
Source code may be relevant in litigation to explain the electronic data which has been produced (see cases in which a “code book” or “data dictionary” must be produced); or as “discovery about discovery” (source code may embody a company’s de facto policy for document preservation, or for logging of otherwise-ephemeral data).
But source code also often has a more intrinsic relevance, as in a software patent case or other intellectual property litigation, involving software copyright or trade secrets. Source code’s direct relevance is not limited to IP cases. Source code may also be relevant in:
- antitrust (for example, the technical aspects of “browser integration” in Microsoft antitrust cases),
- products liability (“bad software”),
- criminal law (DUI arrestees demanding to review Breathalyzer source code),
- medical malpractice or products liability (software used in medical devices),
- securities law (see recent headlines regarding Bernie Madoff’s computer programmers),
- environmental law (models used by the EPA), and
- election law (voters demanding audits of the software used in voting machines).
Source code is thus often a location of relevant information in litigation; relevance, however, is not the same as necessity. Source code requires an expert or consultant for its interpretation, and the consequent cost of using source code likely demands a showing of necessity beyond mere relevance. See e.g. Generac Power v. Kohler, 2012 WL 2049945 (ED Wisc.), emphasizing alternatives to source code, such as deposing the programmer (courts will particularly favor this approach when the source code belongs to a third party). Also see OpenTV v. Liberate, 219 FRD 474 (ND Cal., 2003), applying the Zubulake cost/benefit factors to establish which party has the burden to extract relevant source code from a larger collection.
How does source code differ from other electronic discovery?
In many ways, source code shares the characteristics of other electronically stored information (ESI): volume, dispersal, searchability, and so on. But it differs in key ways, briefly described below.
Searchability: Source code generally appears in plain-text files, and so is searchable by a wider variety of tools than e.g. a Word document. However, knowing what to search for is a different matter: relevant keywords often appear in source code as part of a single “CamelCase” word (e.g. “GetEmployeePolicyLocation”, when looking for “employee policy”), or single punctuated unit (e.g., “employee_policies.get_location”). Much e-discovery software is currently ill-suited to quickly finding relevant terms inside other terms. If there were one area of source code with which e-discovery and IT personnel could become more familiar, it would be source-code searching.
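The search problem described above can be illustrated with a minimal Python sketch. It assumes that breaking identifiers at case changes and punctuation is an adequate approximation; the function names split_identifier and find_phrase are hypothetical, and no particular e-discovery product is known to work this way.

```python
import re

def split_identifier(identifier):
    """Break an identifier such as 'GetEmployeePolicyLocation' into lowercase words."""
    words = []
    # First split on punctuation that joins name parts (underscore, dot, and so on).
    for part in re.split(r"[^A-Za-z0-9]+", identifier):
        # Then split on case boundaries: acronym runs, capitalized or lowercase words, digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [w.lower() for w in words]

def find_phrase(source_text, phrase):
    """Return identifiers whose word parts contain the phrase's words consecutively."""
    wanted = phrase.lower().split()
    if not wanted:
        return []
    hits = []
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", source_text):
        words = split_identifier(token)
        for i in range(len(words) - len(wanted) + 1):
            if words[i:i + len(wanted)] == wanted:
                hits.append(token)
                break
    return hits

# Invented sample line using the two identifiers mentioned in the text.
sample = "loc = GetEmployeePolicyLocation(id); p = employee_policies.get_location(id)"
print(find_phrase(sample, "employee policy"))
print(find_phrase(sample, "get location"))
```

A plain substring search for “employee policy” finds nothing in the sample line, whereas the split-word search finds the CamelCase identifier. The sketch matches words exactly, so “policy” does not match “policies”; handling such variants would require stemming or similar techniques.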
Structured format: Source code conforms to rules dictated by programming languages, platforms (such as Android, Windows, or the Apple iOS and OSX operating systems), APIs (application programming interfaces), and other constraints (e.g. interoperability). Source code examination needs to carefully use this structured format, as opposed to merely performing a keyword search. Whereas a simple keyword search might end up treating source code as a disconnected collection of code fragments, a true source-code examination will rely, for example, on the relationship between caller and callee functions; on aliasing of names used for functions, parameters, variables, and data structures; on the flow of data usage; and on externally-generated events. Also in contrast to a simple keyword search, a source-code examination will be alert to the possibility of code which is unused in the finished product; determining code usage may require careful tracing (distinct from searching) through the source code, or may even require dynamic testing of the “live” product. Finally, the dependence of source code on external constraints means that the code can be correlated with external information, such as API documentation, SDK (software development kit) sample code, header files, and so on.
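The difference between keyword searching and using the code's structure can be sketched as follows. This is an illustration only: it assumes the code under review is written in Python and uses Python's standard ast module, whereas real reviews typically rely on dedicated source-navigation tools. The function names and the sample program are invented.

```python
import ast
from collections import defaultdict

def build_call_graph(source):
    """Map each function name to the set of names it calls (by simple name only)."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for func in ast.walk(tree):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph[func.name]  # make sure functions that call nothing still get an entry
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    if isinstance(node.func, ast.Name):
                        graph[func.name].add(node.func.id)
                    elif isinstance(node.func, ast.Attribute):
                        graph[func.name].add(node.func.attr)
    return graph

def callers_of(graph, name):
    """Return the functions that directly call the named function."""
    return sorted(caller for caller, callees in graph.items() if name in callees)

def uncalled_functions(graph):
    """Return functions that no other function in the graph calls directly."""
    called = set().union(*graph.values())
    return sorted(name for name in graph if name not in called)

# Invented sample program.
sample = '''
def save_log(entry):
    write_disk(entry)

def write_disk(entry):
    pass

def purge_old_logs():
    pass

def main():
    save_log("x")
'''
graph = build_call_graph(sample)
print(callers_of(graph, "write_disk"))
print(uncalled_functions(graph))
```

In this sample, purge_old_logs is defined but never called, the kind of possibly unused code that a keyword search alone would not flag (main is also uncalled, but is presumably the entry point). The sketch does not follow aliases: a function assigned to another name and called through that name would be missed, which is one reason careful tracing or dynamic testing may still be needed.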
Importance of preserving directory paths and filenames: In source code, more than in other electronically-stored documents, the original path/filenames are often crucial. Source-code files are linked to each other by their filenames (e.g., one file will “include” another, by path and filename). Filenames often reflect the name of a class implemented in that file, so searching for the class implementation is aided by the filename. Directory path names and filenames will often reference specific technologies used by, or implemented in, the file; thus, the path/filenames themselves may be important evidence, apart from the contents of those files. Thus, if source-code directories or files are renamed during discovery production, this not only makes relevant information more difficult to find, but may destroy evidence. Some of these points are made in a useful one-hour webinar: Dan Raffle, Lois Thomas, and Craig Motta, Surviving Source Code Reviews (webinar, Oct. 26, 2012).
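How renaming can break the links between files can be shown with a small Python sketch. It assumes C or C++ source files that use quoted #include directives, and it resolves each include only relative to the including file and to the top of the production; real build systems add further search paths. The function name unresolved_includes is hypothetical.

```python
import re
from pathlib import Path

# Matches lines such as:  #include "net/socket.h"   (angle-bracket system includes are skipped)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)
SOURCE_SUFFIXES = {".c", ".h", ".cpp", ".hpp"}

def unresolved_includes(root):
    """List (including_file, included_name) pairs whose target is absent from the production."""
    root = Path(root)
    missing = []
    for path in sorted(root.rglob("*")):
        if path.suffix.lower() not in SOURCE_SUFFIXES or not path.is_file():
            continue
        text = path.read_text(errors="replace")
        for name in INCLUDE_RE.findall(text):
            # Assumption: look next to the including file, then at the production root.
            candidates = [path.parent / name, root / name]
            if not any(c.is_file() for c in candidates):
                missing.append((path.relative_to(root).as_posix(), name))
    return missing
```

Run against a produced directory tree, for example unresolved_includes("production") where "production" is a placeholder path, the function reports includes that point to files not found in the production. A long list of unresolved includes may suggest that files were renamed, moved, or omitted, though it may also simply reflect search paths this sketch does not know about.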
Review process: In part because source code appears less searchable by non-programmers, it is often produced during discovery in an unusual way: neither a straight production, nor an on-site inspection, but something in between, a source code “review,” similar to a “quick peek” procedure. Source code will be copied onto a secure computer, the other side’s expert or examiner will review that copy under a stringent protective order (PO), and will request specific extracts to be printed and Bates stamped (see e.g. early software patent case, Rates Tech. v. Elcotel, 118 FRD 133, 135 (MD Fla., 1987)). The source-code computer will generally be detached from the internet, but must provide the reviewing party with reasonable searching and source-navigation tools (see Big Baboon v. Dell, 723 F.Supp.2d 1224 (CD Cal, 2010)). Such tools include SciTools’ “Understand” and Microsoft Visual Studio (see GeoTag v. Frontier Comm., ED Tex., 2013). There will often be fights over the quantity of source code to be printed (see EPL Holdings v. Apple, 2013 U.S. Dist. LEXIS 71301; Digital Reg of Texas v. Adobe, 2013 U.S. Dist. LEXIS 23447).
Centralization and dispersal: Source-code productions in discovery are frequently incomplete, not always with bad intent. Compared to other ESI, however, source code generally is more centralized, with less good reason for “missing” files (see spoliation below).
Verification & spoliation: When source code corresponds to an externally-available product, the source-code production can be compared with this external product. Few other types of ESI have such external verifiability. This serves not only authentication, but also detection of spoliation. Portions of the source code seem to go missing in many cases. While this is sometimes egregious spoliation (e.g. Keithley v. Homestore.com, 2008 WL 5234270 (ND Cal)), it may also reflect the way that the crown jewels have been dispersed (e.g. Windows Media Player source code absent from central “tree” during Microsoft antitrust litigation).
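One simple form of such a comparison can be sketched in Python: extract the readable text strings embedded in the shipped product file and check whether each one appears anywhere in the produced source code. This is only one possible technique, assumed here for illustration; the function names and the sample bytes are invented.

```python
import re

def printable_strings(binary, min_len=6):
    """Extract runs of printable ASCII from raw bytes, similar to the Unix `strings` utility."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, binary)]

def strings_missing_from_source(binary, source_texts, min_len=6):
    """Return strings found in the product file that appear nowhere in the produced source."""
    combined = "\n".join(source_texts)
    return [s for s in printable_strings(binary, min_len) if s not in combined]

# Invented stand-ins for a product file and a one-file source production.
product = b"\x00\x01Upload failed: %s\x00\x7fLicenseCheck v2 init\x00ab\x00"
produced = ['printf("Upload failed: %s", err);']
print(strings_missing_from_source(product, produced))
```

Here the product contains the text “LicenseCheck v2 init”, which appears nowhere in the produced source. Such a gap is a lead for further inquiry rather than proof that code is missing: the string might come from a third-party library, or might be written differently in the source (for example with escape sequences).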
Versions: While many types of ESI also exist in multiple versions (e.g. iterations of an inter-company Word document), versioning is particularly important to source code. Source code is generally kept in a “version control” repository (e.g., Perforce, Git, SVN). The other side will often vaguely ask for “the code” without a clear idea of what code they mean: which specific product/service, which version numbers, and for which platform. Overbroad requests are common (see e.g. Symantec v. McAfee, 1998 U.S. Dist. LEXIS 22591 (ND Cal.), seeking all source code for 3.5 years). When there is a publicly-accessible product or service, the requestor could often be quite specific. Sometimes source code for long-defunct, unreleased, or future products needs to be requested (see BigBand v. Imagine, 2010 WL 2898288).
Intermingling and third parties: Third-party source code may be intermingled in the producing party’s repository (see Robotic Parking v. Hoboken, 2010, where third party intervenor sought protective order). Conversely, some of the source code relevant to a claim or defense may be held by a third party, who must then be subpoenaed. Discovery rules will be enforced tightly for third parties, even if relaxed for the principal parties to the litigation (see Realtime Data v. MetroPCS, 2012 WL 1905080 (SD Cal.)).
Form & manner of production: In addition to preservation of all available datestamps, it is important that source code be preserved as ordinarily kept. So-called “comments” (descriptions written by programmers) should not be “redacted” (see trade-secrets case MSC v. Altair, ED Mich., 2012). Whether to produce via a source-control system such as Perforce, or as plain-text files, should be negotiated.
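One way to document that files were produced as ordinarily kept is a manifest recorded before production. The following Python sketch is offered as an assumption about what such a manifest might contain (relative path, size, last-modified time, and a SHA-256 hash); it is not a procedure drawn from the cases cited, and the function name build_manifest is hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def build_manifest(root):
    """Record path, size, last-modified time (UTC), and SHA-256 hash for every file under root."""
    root = Path(root)
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        rows.append({
            "path": path.relative_to(root).as_posix(),  # original directory path and filename
            "size": stat.st_size,
            "modified_utc": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return rows
```

Comparing a manifest built at the time of collection with one built from the produced copy would show whether any path, datestamp, or file content changed in between. Note that copying files with ordinary tools can itself alter the last-modified time, so the manifest is best built from the original location.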
Burden of explanation: The source-code owner sometimes has a burden, besides producing its source code in discovery, to also provide some explanation, “roadmap” or hand-holding to the requesting party (see Leader Tech v. Facebook, 2010 WL 254960 (D. Del.); LaserDynamics v. Asus, 2009 WL 153161 (ED TX)). This issue also emerges when a party produces source code rather than directly answering interrogatories (see FRCP 33(d)). At the same time, it is a rare requestor who will rest on the producing party’s own explanations (see “trust but verify” in Fleming v. Escort, 2010 WL 3833995 (D. Idaho)).
To summarize, because software is a form of text, it can in some ways be treated like any other electronically discoverable document, such as an email or spreadsheet. It is another area for litigants to mine for relevant information. However, computer code has several unusual features, summarized above, which make its use in e-discovery a somewhat distinct skill.
See also chapter 9 on discovery in notes/outline for forthcoming book on source-code examination.
An earlier version of this paper was submitted to ASU-Arkfeld eDiscovery & Digital Evidence Conference, March 2014 (“Essential Skills and Cutting Edge Strategies”); updated Nov. 2016.