Software Litigation Consulting | Andrew Schulman

I’ve been looking into the possible use of the US National Software Reference Library (NSRL), http://www.nsrl.nist.gov, maintained by the National Institute of Standards and Technology (NIST), as a library of software prior art. Such a library would be useful both to the US Patent & Trademark Office (PTO) and to patent litigators.

The NSRL was originally intended largely as a set of hashes of known files, so that a criminal investigator examining a computer can tell which files do NOT need to be examined. However, NSRL is moving beyond this to “digital curation,” for example, of a Stanford University Library collection of 15,000 software products from the early days of microcomputing. The collection is currently stored in boxes and indexed only by product name (which is consistent with most library software archives); NSRL, in contrast, is performing file-level cataloging of it.

The next step would be to index the contents of the files themselves. Software binary/object code files often contain useful strings of text, relevant for example to patent prior-art searching. Such “deep indexing” or data mining of code file contents is a goal of the “CodeClaim” project (to be described in a forthcoming blog post).

Such deep indexing of binary code files has been done in some limited areas, such as the superb PDP-10 software archive at http://pdp-10.trailing-edge.com/ in which files have been extracted from tape images, each file given its own web page, and contents of executable files included on the page, enabling a Google search for strings. See also sites such as totalhash.com which, for a variety of reasons, dump strings from Windows executable files (EXEs, DLLs, etc.) onto web pages, which are then indexed by Google (see e.g. Google search for “CEventManagerHelper::UnregisterSubscriber() : m_piEventManager->UnregisterSubscriber()”).

The core NSRL product is a hashset of 36,108,465 file hashes, listing one example of every file in the NSRL. For example, ten copies of the exact same file contents will share a single MD5 hash, even if each of the files has a different filename or file date, or came from different sources. NSRL calls this the “minimal” hashset. It is a file named NSRLFile.txt, about 4 GB in size, contained in a 2.4 GB zip file (filename rds_243m.zip) from the NSRL downloads page.

Entries in NSRLFile.txt look like this:

"SHA-1","MD5","CRC32","FileName","FileSize","ProductCode","OpSystemCode","SpecialCode"
"00000DE72943102FBFF7BF1197A15BD0DD5910C5", "AD6A8D47736CEE1C250DE420B26661B7", "7854257F", "PROGMAN.EXE", 182032, 10912, "358",""
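As a rough illustration of working with this format, the following Python sketch reads records of the shape shown above using the standard csv module. The function name read_nsrl_records and the SAMPLE text are made up for illustration. The sketch assumes the file is ordinary comma-separated text with a header row, as the sample suggests; the real file's text encoding is not verified here.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def read_nsrl_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the header names."""
    # skipinitialspace tolerates a blank after a comma, before a quoted field.
    reader = csv.DictReader(lines, skipinitialspace=True)
    for row in reader:
        yield row


# Two lines copied from the sample above (header plus one record).
SAMPLE = (
    '"SHA-1","MD5","CRC32","FileName","FileSize",'
    '"ProductCode","OpSystemCode","SpecialCode"\n'
    '"00000DE72943102FBFF7BF1197A15BD0DD5910C5",'
    '"AD6A8D47736CEE1C250DE420B26661B7","7854257F",'
    '"PROGMAN.EXE",182032,10912,"358",""\n'
)

if __name__ == "__main__":
    for rec in read_nsrl_records(io.StringIO(SAMPLE)):
        print(rec["FileName"], rec["MD5"], int(rec["FileSize"]))
```

For the full 4 GB file, the same function can be handed an open file object instead of a StringIO, so that rows are streamed rather than loaded into memory at once.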

Note that file dates are not included. Of course, the same exact file contents could be associated with different file dates, just as the same file contents can be associated with different file names. Dates of various types (OS file system create and write dates, (c) notice dates within files, linker dates within files) are of course crucial for a prior-art library. A method of associating dates with files will be noted later.
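One of the date types mentioned above, the linker date, can be read directly from a Windows PE file: the COFF header that follows the "PE" signature holds a 32-bit TimeDateStamp. The Python sketch below shows one way to read it. The function name pe_link_timestamp is hypothetical, and this is not necessarily the method to be described later. Note that older 16-bit (NE) executables have no such field, and the stamp can be zeroed or otherwise unreliable, so it is only one piece of evidence for a date.

```python
import struct
from datetime import datetime, timezone
from typing import Optional


def pe_link_timestamp(data: bytes) -> Optional[datetime]:
    """Return the COFF TimeDateStamp of a PE image, or None if not a PE file."""
    # Every PE file starts with an MS-DOS header beginning with "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # The 32-bit field at offset 0x3C gives the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if pe_offset + 12 > len(data):
        return None
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    # COFF header: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4).
    (stamp,) = struct.unpack_from("<I", data, pe_offset + 8)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        print(pe_link_timestamp(fh.read(4096)))
```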

The collection contains media files (*.gif, *.wav, *.jpg). Crucial for a collection of prior art software, it also contains binary/object code files, for example:

"0000046FD530A338D03422C7D0D16A9EE087ECD9", "680CA0BCE1FC7BC4136ADF4E210869C5","277D6BD5", "TokenTypes.class",2075,20318,"358",""
"00000DE72943102FBFF7BF1197A15BD0DD5910C5", "AD6A8D47736CEE1C250DE420B26661B7","7854257F", "PROGMAN.EXE",182032,10912,"358",""
"00000FF9D0ED9A6B53BC6A9364C07074DE1565F3", "A5D49D6DA9D78FD1E7C32D58BC7A46FB","2D729A1E", "cmnres.pdb.dll",76800,10055,"358",""

A test of file extensions (not a guaranteed method to determine file type, but close enough for current purposes) in NSRLFile.txt provides a sense of what’s currently in the NSRL:

Many of the 36 million files are images (3.9 million GIF, 1.3 million JPG, 0.95 million PNG)
Files are predominantly from Microsoft Windows
A little over 1.2% are marked as “Linux”
There are files marked as “MacOSX”, “Mac OS 9+”, etc., but these do not appear to include binary code files (e.g., FaceTime)
There appear to be few mobile application files, e.g. *.ipa, *.apk
Many of the files are archive files, e.g. *.gz, *.zip, *.cab
Many of the files are compressed installers, e.g. *.msi, *.dmg; note that NSRL has researched “smart unpacking” of files
Many of the files are still compressed using Microsoft KWAJ, e.g. *.dl_, *.ex_
The most-frequently-occurring binary code file extension is *.class (Java), with 1.9 million different files
There are 811,468 different files with the extension .DLL (dynamic link library files for Windows)
There are 295,870 different files with the extension .EXE (Windows executables, possibly with some older DOS EXEs)
There are many different versions of code files with the same name, e.g. 835 different files (that is, with different MD5 hashes) that have “kernel32.dll” in the name
There are many text files which contain (or potentially contain) source code, including 3.8 million HTML files, and about 1.7 million C/C++ files.
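A tally of this kind can be approximated with a few lines of Python. The sketch below counts extensions over a list of file names; count_extensions is an illustrative name, and the short name list in the example stands in for the FileName column of NSRLFile.txt. As noted above, an extension is only a hint about file type.

```python
import os
from collections import Counter
from typing import Iterable


def count_extensions(filenames: Iterable[str]) -> Counter:
    """Count file names by lower-cased final extension."""
    counts: Counter = Counter()
    for name in filenames:
        # splitext keeps only the last suffix: "cmnres.pdb.dll" gives ".dll".
        ext = os.path.splitext(name)[1].lower()
        counts[ext or "(none)"] += 1
    return counts


if __name__ == "__main__":
    names = ["PROGMAN.EXE", "TokenTypes.class", "cmnres.pdb.dll", "kernel32.dll"]
    for ext, n in count_extensions(names).most_common():
        print(ext, n)
```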

The following describes tests performed with Windows dynamic link library (DLL) files.

Even without access to the underlying files at NSRL itself, the presence of MD5 hashes makes it possible for anyone with a sufficiently-extensive collection of files, and a utility such as md5sum, to do some testing of the files in the NSRL database.

For example, NSRL includes a file with the MD5 hash 2bcbe445d25271e95752e5fde8a69082, and its minimal set of hashes provides the filename “IMPTIFF.DLL”.

The CodeClaim collection of code files contains about 490,000 files which are also in NSRL. One of these 490,000 files has the MD5 hash 2bcbe445d25271e95752e5fde8a69082. In CodeClaim, this file is X:\CD0138\CORELWPA\PROGRAMS\IMPTIFF.DLL; the file-system date is March 23, 1995.
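The matching described above amounts to hashing each local file and checking the result against the set of MD5 values from NSRLFile.txt. The following is a minimal sketch in Python, not the CodeClaim implementation; the function names are illustrative, and the known_md5 set is assumed to have been loaded from the NSRL hashset beforehand.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Set, Tuple


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the upper-case hex MD5 of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # NSRLFile.txt lists hashes in upper case, so normalize to match.
    return digest.hexdigest().upper()


def find_known_files(root: Path, known_md5: Set[str]) -> Iterator[Tuple[Path, str]]:
    """Yield (path, md5) for every file under root whose MD5 is in known_md5."""
    for path in sorted(root.rglob("*")):
        if path.is_file():
            md5 = md5_of_file(path)
            if md5 in known_md5:
                yield path, md5


if __name__ == "__main__":
    import sys

    # The hash below is the IMPTIFF.DLL example from the text.
    known = {"2BCBE445D25271E95752E5FDE8A69082"}
    for path, md5 in find_known_files(Path(sys.argv[1]), known):
        print(md5, path)
```

Because the comparison is by content hash, a match is found regardless of the local file's name or date.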

Of the 811,000 files with the extension DLL in NSRL, CodeClaim currently has about 27,000. I have begun testing a subset of these: about 9,900 uniquely-named DLL files, with a total size of 2.28 GB. “Uniquely-named” means for example that one file with the name “kernel32.dll” was used out of the 90 different versions in CodeClaim; this file was selected at random, and is unlikely to be the newest or largest.
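The “uniquely-named” selection can be sketched as grouping paths by case-insensitive file name and picking one member of each group at random. The Python below is an illustration under that reading, not the procedure actually used; pick_one_per_name and the X:\A and X:\B paths in the example are hypothetical.

```python
import random
from collections import defaultdict
from pathlib import PureWindowsPath
from typing import Dict, Iterable, List, Optional


def pick_one_per_name(paths: Iterable[str], seed: Optional[int] = None) -> Dict[str, str]:
    """Map each lower-cased file name to one randomly chosen path with that name."""
    rng = random.Random(seed)
    by_name: Dict[str, List[str]] = defaultdict(list)
    for p in paths:
        # PureWindowsPath parses backslash paths on any platform.
        by_name[PureWindowsPath(p).name.lower()].append(p)
    return {name: rng.choice(group) for name, group in sorted(by_name.items())}


if __name__ == "__main__":
    sample = [
        r"X:\CD0138\CORELWPA\PROGRAMS\IMPTIFF.DLL",
        r"X:\A\kernel32.dll",  # hypothetical path
        r"X:\B\KERNEL32.DLL",  # hypothetical path
    ]
    for name, path in pick_one_per_name(sample, seed=0).items():
        print(name, path)
```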

A “strings” utility was run on the 9,900 DLL files, resulting in about 278 MB of output, about 10% of the size of the underlying code files. This 10% is both an over-estimate and an under-estimate of the usable text to be found, at least in Windows-based code files. It is an over-estimate because the output contains a large amount of junk which merely looked like readable text to the “strings” utility. It is an under-estimate because “strings” is only one of at least a dozen methods of extracting useful text from binary code files. For example, given GUIDs or UUIDs in the file, these can often be turned into the corresponding textual name of a protocol or service; there are several other types of numeric-to-string lookup.
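For readers unfamiliar with “strings”, the basic idea is to scan a binary for runs of printable characters above some minimum length. The Python sketch below is a simplified stand-in for such a utility, not the tool used in the test: it looks for printable ASCII runs and for the 16-bit little-endian form in which Windows code files often store text. The function name and the minimum length of 4 are assumptions.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return printable ASCII and UTF-16LE runs of at least min_len characters."""
    # Runs of printable 7-bit ASCII (space through tilde).
    ascii_re = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    # The same characters, each followed by a zero byte (UTF-16LE).
    wide_re = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    found = [m.group().decode("ascii") for m in ascii_re.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide_re.finditer(data)]
    return found


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        for s in extract_strings(fh.read()):
            print(s)
```

Any byte sequence that happens to fall in the printable range is reported, which is the source of the junk mentioned above.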

How useful would strings contained in binary code files be, for a library of software prior art? A search for “->” quickly turned up many source-code fragments which had made their way into the binary code files, presumably as “asserts” or logging statements. For example:

!FFlag(lppcminfo->dwPcm, PCM_RECTEXCLUDE) && FFlag(lppcminfo->dwPcm, PCM_RECTBOUND)
!(mod & 0x0004) || (!lpbxi->fDBCSPrio && *lpchIns == ((BYTE)'\x20')) || (lpbxi->fDBCSPrio && *lpchIns == 0x81 && *(lpchIns + 1) == 0x40)
!_pmsParent->IsShadow() && ((char *)("Dirtying page in shadow multistream.") != 0)
%s — g_PluginModuleInstance->DeInitializeContext() failed.
%s:pChannel->RespondToFastConnect returned 0x%08lx
( LSeekHf( qbthr->hf, ( (LONG)( qcb->bk) * (LONG)( qbthr )->bth.cbBlock + (LONG)sizeof( BTH ) ), 0 ))==( ( (LONG)( qcb->bk) * (LONG)( qbthr )->bth.cbBlock + (LONG)sizeof( BTH ) ) )
((sidTree != sidParent) || (pdeChild->GetColor() == DE_BLACK)) && ((char *)("Dir tree corrupt - root child not black!") != 0)
(FreeBlock >= ChangeLogDesc->FirstBlock) && (FreeBlock->BlockSize <= ChangeLogDesc->BufferSize) && ( ((LPBYTE)FreeBlock + FreeBlock->BlockSize) <= ChangeLogDesc->BufferEnd)
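The search for “->” can be narrowed slightly so that it only keeps lines in which the arrow sits between two identifiers, as in a C or C++ member access. The Python sketch below shows one such filter; the regular expression and the name source_like are illustrative assumptions, not the search actually run.

```python
import re
from typing import Iterable, List

# An identifier, then "->", then another identifier, e.g. "lpbxi->fDBCSPrio".
MEMBER_ACCESS = re.compile(r"[A-Za-z_]\w*->[A-Za-z_]\w*")


def source_like(strings: Iterable[str]) -> List[str]:
    """Keep only the strings that contain a C-style pointer member access."""
    return [s for s in strings if MEMBER_ACCESS.search(s)]


if __name__ == "__main__":
    import sys

    # Reads "strings" output, one string per line, from standard input.
    for line in source_like(l.rstrip("\n") for l in sys.stdin):
        print(line)
```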

To emphasize, we know that these snippets of code are present in the underlying NSRL collection, because the files examined in this quick test all had MD5 hashes found in NSRLFile.txt.

But so what? What difference does it make that some strings of text which resemble source code are located in commercial products? How useful is this for constructing a searchable library of software prior art?

The next step is to see how the types of terminology found in code files are also used in the claims of software patents. This will be discussed in the next blog post.

On a first reading, the article below may not seem to have anything to do with IP litigation, but the National Software Reference Library could be an important basis for a prior-art software library (that is, a collection not of publications about software, but of text extracted from the software itself, for use as prior art). Modern software generally contains a large amount of useful text. This text would need to be extracted from binary/object files, and then indexed.

The National Software Reference Library
by Barbara Guttman

LinkedIn IP Litigation discussion

The list of products in the collection is available at http://www.nsrl.nist.gov/RDS/rds_2.43/NSRLProd.txt (3 MB text file). Of course, to be useful as searchable prior art, either to litigators or the PTO, more would be needed than this list of products or even the list of individual files comprising the products. I’m going to do some tests of text extraction against some of the files in their collection.

The fingerprints right now are file-level MD5, SHA1, etc. The original purpose, as I understand it, was so that criminal investigators would know what files they did NOT need to look at when examining a suspect’s computer. They do seem to be expanding the goals, so that now for example they’re working with Stanford to incorporate a large collection of software from 1975-1995 as part of a “digital curation” effort: http://www.nsrl.nist.gov/Documents/nsrl_curategear_2013%20bg%20dw.pdf .

Use of the collection as software prior art would require going down below their current file-level granularity, to do string extraction from binaries, extraction of class headers, etc.

I started a process like this, named CodeClaim, with Frank van Gilluwe and Clive Turvey. CodeClaim is a database of software prior art, generated from the software binary code itself, as opposed to documents about the software, and in contrast to today’s databases of open source code, such as Black Duck and Palamida. Clive Turvey and I wrote a lot of back-end code, and it was used to process several hundred CDs and a few gigabytes of sample firmware code. The processing we did employed the first few of about 20 different information-extraction methods. Some proof-of-concept testing showed that strings of text in commercial software tend to contain information that would be responsive to queries based on the terminology appearing in patent claim limitations. I also did some preliminary work on weighting of terms (so that, e.g., boilerplate startup or RTL code appearing in every executable would play a reduced role in responding to queries).
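One standard way to give boilerplate a reduced role is inverse document frequency (IDF) weighting, in which a string found in every file receives a weight of zero. The Python sketch below illustrates that general technique only; it is not the weighting scheme used in CodeClaim, and the file names and strings in the example are made up.

```python
import math
from collections import Counter
from typing import Dict, Set


def inverse_document_frequency(docs: Dict[str, Set[str]]) -> Dict[str, float]:
    """Weight each string by log(N / number of files containing it)."""
    n_files = len(docs)
    file_counts: Counter = Counter()
    for strings in docs.values():
        # Each file's strings are a set, so a string counts once per file.
        file_counts.update(strings)
    return {s: math.log(n_files / count) for s, count in file_counts.items()}


if __name__ == "__main__":
    docs = {
        "a.dll": {"runtime startup text", "pChannel->RespondToFastConnect"},
        "b.dll": {"runtime startup text", "Dir tree corrupt"},
    }
    for text, weight in sorted(inverse_document_frequency(docs).items()):
        print(f"{weight:.3f}  {text}")
```

In this example the string shared by both files gets weight 0, while strings unique to one file get log(2).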

Technical and legal aspects of CodeClaim are discussed, though not by name, in:

Open to Inspection: Using Reverse Engineering to Uncover Software Prior Art, Part 1 (New Matter [Calif. State Bar IP Section], Summer 2011)
Open to Inspection: Using Reverse Engineering to Uncover Software Prior Art, Part 2 (New Matter, Fall 2011)

Andrew Schulman
