link to published version: Communications of the ACM, June, 2008


BRAP Forensics: Boutique Computer Activity Mining vs. Personal Privacy Management

Hal Berghel


BRAP forensics is one of the latest additions to the digital forensics toolset.  One of the more subtle forms of computer activity mining, it has considerable potential for privacy abuse.

I use the acronym BRAP for BRowser and APplications.  Some practitioners distinguish browser forensics from applications footprinting, but the two investigative procedures are so closely related (browsers are, after all, applications) that subsuming them both under the same category of computer activity mining seems more reasonable.

Computer activity mining (CAM) involves the recovery of information about a computer user, or a computer's use, from the computer itself.  As such, it is one of the core areas of modern digital forensics along with log analysis, timeline analysis, keystroke capture and analysis, system imaging, etc.  Log analysis is perhaps the best known example as it has been a staple of network forensics for years, and is a primary tool for network administrators to reverse engineer hacks of their systems.  It is so common in fact that sophisticated hackers consider log cleansing the final stage of a successful hack. 

Another core area of digital forensics is media analysis (aka file system forensics) - the practice of recovering data from non-volatile storage devices.  Where CAM focuses on activity, media analysis focuses on data.  BRAP forensics bridges the gap as it reveals stored data as well as information about user behavior.  That's what makes it interesting - and threatening to those concerned with personal privacy management.

In addition, the courts have lately made computer activity mining an important area of electronic discovery (e-discovery).  Law enforcement routinely looks to CAM for evidence of wrongdoing.  This holds particularly true in the prosecution of cases involving unacceptable computer use, sexual harassment, child pornography, EULA violations, computer fraud and identity theft, and intellectual property.  As with media analysis, BRAP forensics should be thought of as indiscriminate.  Once the warrant is served and the forensics completed, the personal privacy toothpaste is out of the tube.

BROWSER GUANO

While the browsing experience is familiar to most computer users, the nuances remain vaporous.  These nuances are the grist for the BRAP forensics mill.  Internet Explorer (IE) on Windows  is noteworthy in this regard because it leaves behind a surplus of browser guano.  We'll focus on IE, though examples may be derived from non-Windows operating systems and alternative browsers.

The browser is the navigation and rendering tool for the web.  When the user clicks on an icon or link, the browser sends an HTTP request to a remote resource.  That triggers a download of information.  There are many byproducts of this exchange - some well understood, some less so. 

Cookies are one such byproduct.  Since HTTP is "stateless," the web development community introduced these identifiers to store information about the client-server exchange for subsequent connections, either during the current browser session (session identifiers) or during subsequent browser sessions (persistent identifiers).  Persistent IE identifiers reside in Documents and Settings>(user)>Cookies under the name of the website that produced them.  For example, when I visited www.microsoft.com just now, seven cookies from webtrends.com, atdmt.com, indextools.com and dcstest.wtlive.com were deposited in this folder on my computer.  The WebTrends website reports that "Influential technology companies such as Microsoft have used WebTrends Marketing Lab 2 to get a real-time view into both online visitor activity and offline customer information," so I have some idea of why the cookies were left.  The two WebTrends cookies look like this when parsed.

SITE: m.webtrends.com/       
VARIABLE: ACOOKIE
VALUE: C8ctADEzMS4yMTYuMTE5LjIxLTEwNTUwMjE5NjguMjk5MTU4OTIAAAAAAAABAAAAcAAAAOk5yEeaOchHAQAAABMAAADpOchHmjnIRwAAAAA-              
CREATION TIME: 02/29/2008 08:59:30           
EXPIRE TIME: 02/26/2018  08:59:21  
FLAG FIELD: 2147484672

SITE: statse.webtrendslive.com/        
VARIABLE: ACOOKIE
VALUE: C8ctADEzMS4yMTYuMTE5LjIxLTE4ODIyNTE5NjguMjk5MTU4OTIAAAAAAAABAAAA/WAAAO05yEftOchHAQAAAEooAADtOchH7TnIRwAAAAA-              
CREATION TIME: 02/29/2008 08:59:34           
EXPIRE TIME: 02/26/2018  08:59:25  
FLAG FIELD: 2147484672

The precise meaning of the "value" field is irrelevant to our present pursuit.  The two datapoints of interest are the timestamps - first because the creation timestamp records when my computer was touched by WebTrends, and second because that record won't expire for 10 years - neither of which leaves me with a particularly warm feeling about the experience!  As I wrote many years ago ("Caustic Cookies," CACM, April, 2001), cookies are slowly transforming our private sanctuaries into electronic auditoriums.
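The ten-year figure can be checked directly from the two parsed timestamps.  What follows is a minimal sketch in Python, assuming the MM/DD/YYYY layout shown in the parsed listing; the variable names are mine and are for illustration only.

```python
from datetime import datetime

# Timestamp layout as printed in the parsed cookie listing (assumed MM/DD/YYYY).
FMT = "%m/%d/%Y %H:%M:%S"

created = datetime.strptime("02/29/2008 08:59:30", FMT)
expires = datetime.strptime("02/26/2018 08:59:21", FMT)

# Lifetime of the persistent cookie as a timedelta.
lifetime = expires - created

print(lifetime)                                  # exact difference
print(round(lifetime.total_seconds() / 86400))   # rounded to the nearest day
```

The difference comes out a few seconds short of 3,650 days, which is ten 365-day years.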
 
What is more, these cookies collect like lint even if IE is stiffened!  The default browser privacy setting for the risk averse might involve putting the privacy setting on HIGH for the Internet zone (IE>Tools>privacy), because the BLOCK ALL COOKIES setting restricts functionality beyond tolerable levels.  The HIGH setting should block tracking cookies and cookies from sites without a compact privacy policy.  However, since IE doesn't clear private data on closing (as, say, Firefox does), one must do it manually (IE>Tools>Delete Browsing History>Delete All).  Therein lies the rub: the private data is archived in Windows every time the system creates a restore point (XP, 2000) or an incremental shadow copy (Vista)!  So, if the information isn't manually deleted before that day's backup, it's easy pickings for the BRAP 'forensicist.'  System restore points and shadow copies include personal data whether you know it or not.  In some cases you can shut them off, but then there's no recovery mode for the operating system.  In short, the computer most likely has a record of some or all websites visited, and this record is recoverable.  The operative question is: is this what you want?

The same applies to cache and URL history.  This data is organized in a largely cryptic INDEX.DAT file in Documents and Settings\<user>\Local Settings\Temporary Internet Files\Content.IE5.  To illustrate, Figure 1A shows a hex editor's perspective of INDEX.DAT after a single IE visit to Google.com.  Note that the names of the cache folders are identified in the header of INDEX.DAT.  Figure 1B shows the parsed contents of the file.  As with cookies, if the user doesn't manually remove all of this data, it accumulates in the backup files and is readily accessed.  Other tools exist to recover cached images.


Figure 1A.  A hex editor perspective on the INDEX.DAT file and the four cache folders.

 



Figure 1B.  The parsed contents of INDEX.DAT
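For a quick look before reaching for a proper parser, the URL strings in INDEX.DAT can be pulled out with a "strings"-style scan.  The sketch below is only that: it is not how PASCO works, it ignores the record structure and timestamps, and it rests on two assumptions of mine that are not stated above, namely that the file opens with the ASCII banner "Client UrlCache MMF Ver " and that URLs are stored as plain ASCII.  The function name and the sample bytes are invented for illustration.

```python
import re

# Assumption: IE 5+ cache index files begin with this ASCII banner.
SIGNATURE = b"Client UrlCache MMF Ver "

# A run of printable ASCII starting with a URL scheme; a crude stand-in
# for real record parsing.
URL_RUN = re.compile(rb"https?://[\x21-\x7e]+")

def triage_index_dat(data):
    """Return unique URL-like strings from raw INDEX.DAT bytes, in first-seen order."""
    if not data.startswith(SIGNATURE):
        raise ValueError("banner missing: probably not an IE cache index")
    seen = {}
    for match in URL_RUN.finditer(data):
        seen.setdefault(match.group().decode("ascii"), None)
    return list(seen)

# Synthetic stand-in for a real file, which would be read with
# open(path, "rb").read() from a forensic copy rather than the live system.
sample = (
    SIGNATURE + b"5.2\x00" + b"\x00" * 100
    + b"URL \x02\x00\x00\x00" + b"http://www.google.com/\x00"
    + b"\x00" * 40 + b"http://www.google.com/\x00"
)

print(triage_index_dat(sample))
```

A scan like this only says which URLs are present; when and how often they were visited still requires a record-aware parser.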


 

LEARNING TO LIVE WITH APP RESIDUE

Unintended residue is also a byproduct of typical application use, especially with Microsoft productivity tools.  We'll illustrate the point with the now-classic example of how Word metadata embarrassed Tony Blair's government.

Users become familiar with Word metadata through the properties box (e.g., WORD>File>properties>summary).  In 2003, Richard Smith extracted the revision log from a document that Tony Blair's government had sent to Colin Powell and that was used to justify the attack on Iraq.  As it turned out, parts of the document were copied from an article written by a postgraduate student.  The source document was easily identified because the copy preserved spelling, grammatical and typographical transgressions.  The metadata in the source document appears below.

 

--------------------
Statistics
--------------------
File    = blair.doc
Size    = 65024 bytes
Magic   = 0xa5ec (Word 8.0)
Version = 193
LangID  = English (US)

 

Document was created on Windows.

Magic Created : MS Word 97
Magic Revised : MS Word 97

--------------------
Last Author(s) Info
--------------------
1 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
2 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
3 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
4 : JPratt : C:\TEMP\Iraq - security.doc
5 : JPratt : A:\Iraq - security.doc
6 : ablackshaw : C:\ABlackshaw\Iraq - security.doc
7 : ablackshaw : C:\ABlackshaw\A;Iraq - security.doc
8 : ablackshaw : A:\Iraq - security.doc
9 : MKhan : C:\TEMP\Iraq - security.doc
10 : MKhan : C:\WINNT\Profiles\mkhan\Desktop\Iraq.doc

--------------------
Summary Information
--------------------
Title        : Iraq- ITS INFRASTRUCTURE OF CONCEALMENT, DECEPTION AND INTIMIDATION
Subject      :
Authress     : default
LastAuth     : MKhan
RevNum       : 4
AppName      : Microsoft Word 8.0
Created      : 03.02.2003, 09:31:00
Last Saved   : 03.02.2003, 11:18:00
Last Printed : 30.01.2003, 21:33:00

The metadata of immediate interest are the four abbreviated names in the revision history: phamill, JPratt, ablackshaw, and MKhan, which were usernames for four people in the Blair government.  The log reveals three autorecovery backups to phamill's LOCALS~1\Temp directory under userid="cic22", a subsequent copy by JPratt onto a floppy (A drive), another copy made by ablackshaw onto a floppy, and the final editing on MKhan's computer.  According to Smith, Parliamentary hearings revealed that Pratt passed on a floppy to Blackshaw, who sent it to Colin Powell for his presentation to the United Nations.  The revelation of this information, together with the plagiarism, proved to be a credibility train wreck for the governments involved.
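The reading of the revision log given here is mechanical enough to script.  The following Python sketch works on the text of the "Last Author(s) Info" section exactly as listed, not on the .doc file itself; the function name and the profile-folder pattern are my own assumptions for illustration.

```python
import re

# The ten "Last Author(s) Info" entries, copied from the metadata listing.
REVISION_LOG = r"""
1 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
2 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
3 : cic22 : C:\DOCUME~1\phamill\LOCALS~1\Temp\AutoRecovery save of Iraq - security.asd
4 : JPratt : C:\TEMP\Iraq - security.doc
5 : JPratt : A:\Iraq - security.doc
6 : ablackshaw : C:\ABlackshaw\Iraq - security.doc
7 : ablackshaw : C:\ABlackshaw\A;Iraq - security.doc
8 : ablackshaw : A:\Iraq - security.doc
9 : MKhan : C:\TEMP\Iraq - security.doc
10 : MKhan : C:\WINNT\Profiles\mkhan\Desktop\Iraq.doc
"""

# Assumption: a user name follows one of these Windows profile roots.
PROFILE_DIR = re.compile(r"\\(?:DOCUME~1|Profiles)\\([^\\]+)\\", re.IGNORECASE)

def parse_revision_log(text):
    """Yield (entry number, author, path) for each 'N : author : path' line."""
    for line in text.strip().splitlines():
        number, author, path = line.split(" : ", 2)
        yield int(number), author, path

entries = list(parse_revision_log(REVISION_LOG))

# Names exposed either as the saving author or inside a profile path.
names = set()
for _, author, path in entries:
    names.add(author.lower())
    match = PROFILE_DIR.search(path)
    if match:
        names.add(match.group(1).lower())

# Saves whose path is on drive A:, i.e. the floppy copies.
floppy_saves = [number for number, _, path in entries if path.upper().startswith("A:")]

print(sorted(names))
print(floppy_saves)
```

Note that phamill never appears as an author; the name leaks only through the profile folder in the autorecovery path, which is why the paths deserve as much attention as the author column.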

Think about the millions of email attachments in global circulation daily. How many people actually know about the volume of metadata that they're broadcasting?

RECYCLING THAT DOESN'T HELP THE ENVIRONMENT

We all like to think of the delete key as the quintessential digital cleansing experience. But as we know, modern operating systems do not overwrite deleted file data areas but rather just reassign the affected disk space to the operating system for further use.  The intermediate step in this process in Windows involves a recycle bin or recycler.  But, putting digital waste in the recycle bin doesn't destroy anything.  In fact it exposes the user to even more risk because the file information is compressed into a smaller part of the disk which makes recovery easier.

If you think about it, all of the data necessary to recover a deleted file must go in the recycle bin.  Otherwise the file couldn't be undeleted.  In Windows XP, for example, the information is stored in a file, INFO2.  The information retained includes path, file size, delete time/date, and unique recycle ID.  Of course, one could recover this information with a hex editor, but it's much easier just to parse it:

INFO2 File: info2

INDEX  DELETED TIME         DRIVE NUMBER  PATH                                                SIZE
17     03/07/2008 11:53:50  2             C:\dumpster\Firefox Downloads\AdbeRdr812_en_US.exe  0
0      12/31/1969 16:00:00  0             C                                                   0

In this case, I had emptied the recycle bin, sanitized it with Evidence Eliminator, and then deleted an Adobe Reader installer so that it is the only file in the bin.  Note that I can recover the original location of the file, the time/date deleted, the placement of the file within the recycler, etc. from the data recovered in the recycle bin.  Until the recycle bin is emptied, this file is very much readable.  But even if the recycle bin is emptied, only this metadata is lost.  The actual file data remains recoverable with a hex editor (unless the clusters have been re-allocated to another file - which isn't all that likely on high capacity drives).  (cf. this column in the August 2006 CACM for additional detail).  Another interesting twist is that even if image files are deleted, the recycle bin has been emptied, and the registry and disk have been sanitized, the thumbnails of the deleted images might still be recoverable if they were ever indexed by Windows Explorer, because the image index, THUMBS.DB, stays behind with the folder.
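To make the INFO2 discussion concrete, here is a hedged Python sketch of a record parser.  The byte layout is an assumption on my part, based on how the Windows XP format is commonly described rather than on anything stated above: a 16-byte header whose last four bytes give the record size, followed by 800-byte records holding an ANSI path, the record index, the drive number, the deletion time as a 64-bit count of 100-nanosecond ticks since 1601, the file size, and a UTF-16 copy of the path.  It is not RIFIUTI's code, and the sample file is synthetic.

```python
import struct
from datetime import datetime, timedelta

# Assumed layout (see the lead-in): 16-byte header, then 800-byte records.
HEADER_SIZE = 16
# ANSI path, record index, drive number, deletion time, size, UTF-16 path.
RECORD = struct.Struct("<260sIIQI520s")

def ticks_to_datetime(ticks):
    """Convert 100-nanosecond ticks since 1601-01-01 to a naive datetime."""
    return datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)

def parse_info2(data):
    """Return one dict per record found in raw INFO2 bytes."""
    (record_size,) = struct.unpack_from("<I", data, 12)
    if record_size != RECORD.size:
        raise ValueError(f"unexpected record size {record_size}")
    records = []
    for offset in range(HEADER_SIZE, len(data) - record_size + 1, record_size):
        _ansi, index, drive, deleted, size, wide = RECORD.unpack_from(data, offset)
        records.append({
            "index": index,
            "drive": drive,
            "deleted": ticks_to_datetime(deleted),
            "size": size,
            # The path field is NUL-padded; keep the part before the first NUL.
            "path": wide.decode("utf-16-le").split("\x00", 1)[0],
        })
    return records

# Synthetic one-record file built from the values in the parsed listing,
# so the sketch can be exercised without a real recycle bin.
path = r"C:\dumpster\Firefox Downloads\AdbeRdr812_en_US.exe"
when = datetime(2008, 3, 7, 11, 53, 50)  # time zone deliberately ignored here
delta = when - datetime(1601, 1, 1)
ticks = (delta.days * 86400 + delta.seconds) * 10**7
sample = b"\x00" * 12 + struct.pack("<I", RECORD.size)
sample += RECORD.pack(path.encode("ascii"), 17, 2, ticks, 0, path.encode("utf-16-le"))

records = parse_info2(sample)
print(records[0])
```

The offsets should be verified against a known-good INFO2 file before the output is relied upon.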

CONCLUSION

It is important that the computer user understand BRAP forensics because of its potential for invasion of privacy.  It provides the uninvited an indirect portal into our personal lives.  We express ourselves in our use of software and the Web.  Far from innocuous, browsers and applications software may reveal more of our behavior than we expect.  In terms of subtlety, BRAP forensics goes beyond the older, more traditional, areas of computer activity mining.  Where a computer log provides information that is relatively objective and impersonal, BRAP forensics provides information that is subjective and personal.  Think of it this way: knowing that someone logged into a computer and used a word processor is far less invasive than knowing that someone created a document for a specific person, visited a sequence of websites, viewed certain image files, saved the document, and then copied it on a USB memory stick with a known unique ID.  BRAP forensics drills down to this level of granularity.  And the small size of today's removable storage media encourages the circulation of personal and private information.

What I find most objectionable is that the production of this data residue is counter-intuitive.  The bottom line is that this residue exists for the convenience of myopic software developers who believe that their vision of computer use is so incontrovertible that there is no need to entertain other points of view - i.e., those that put a premium on safeguarding personal privacy.  How hard would it be to offer the user complete control over the backup of non-system files and metadata?  Or to allow users the option of surfing the web without recording tracking cookies or URL histories?  Or to create a file system where "delete" actually means delete?  To the typical user, learning of these developer excesses retroactively is akin to learning that all of the world's typewriters had been secretly producing invisible carbons for the government.  Who would have imagined that anyone ever thought this was a good idea?  While hardware-based encryption systems like BitLocker are an improvement, software use of personal information should follow the "need-to-know" paradigm.  Encrypting data residue is never as effective as not storing it in the first place.

URL PEARLS

Readers interested in more information on media analysis might consult this column in the August, 2006 and April, 2007 issues of Communications of the ACM.

The basic BRAP utilities discussed above were developed by Keith Jones and are an ideal starting point for both the BRAP 'forensicist' and the voyeur.  These tools are open source and available on the sourceforge.net website.  Galleta is indispensable for expedient cookie analysis because of the strange cookie data format used by Internet Explorer including, among other oddities, timestamps that are defined in terms of 100 nanosecond increments since midnight, January 1, 1601!  INDEX.DAT and INFO2 were parsed by Jones' utilities PASCO and RIFIUTI, respectively.  Mandiant (www.mandiant.com) has a streamlined utility, Web Historian, that saves parsed history data in an Excel spreadsheet for easier analysis.  SANS (sans.org) now offers a half-day course in browser forensics.  Based on my experience with SANS, I would expect this to be the most thorough treatment available.
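The 1601-based timestamp is the Windows FILETIME convention, and converting one is short work.  This is a minimal sketch in Python, assuming the value arrives as the two 32-bit halves of a Win32 FILETIME structure; the function and constant names are mine.  The check value used here is the FILETIME of the Unix epoch.

```python
from datetime import datetime, timedelta

FILETIME_EPOCH = datetime(1601, 1, 1)

def filetime_to_datetime(low, high):
    """Combine the 32-bit halves of a FILETIME and convert to a naive UTC datetime."""
    ticks = (high << 32) | low  # count of 100-nanosecond intervals since 1601
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# Check value: midnight, January 1, 1970 (the Unix epoch) as a FILETIME.
UNIX_EPOCH_TICKS = 116_444_736_000_000_000
low, high = UNIX_EPOCH_TICKS & 0xFFFFFFFF, UNIX_EPOCH_TICKS >> 32

print(filetime_to_datetime(low, high))  # 1970-01-01 00:00:00
```

Integer division by ten drops the sub-microsecond remainder, which is the finest resolution Python's datetime can hold.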

The data clusters described above are indexed in the Windows Registry Hive.  The most important file in BRAP Forensics is NTUSER.DAT.  A good overview of the linkage between the registry hive and critical activity files like NTUSER.DAT is provided in AccessData's Registry Quick Find Chart at www.accessdata.com/media/en_US/print/papers/wp.Registry_Quick_Find_Chart.en_us.pdf.

Perhaps the easiest way to see how the registry hive organizes BRAP data is DeviceLock's Active Registry Monitor (devicelock.com). Registry Monitor has a "compare" feature that reveals differences between registry scans that were produced by applications.
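The idea behind such a "compare" feature can be sketched in a few lines of Python.  This is not how Active Registry Monitor is implemented; it simply diffs two snapshots held as dictionaries keyed by (key path, value name).  The snapshot contents below are made up for illustration.

```python
def compare_snapshots(before, after):
    """Return (added, removed, changed) between two {(key, name): data} snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed

# Hypothetical snapshots taken before and after typing a URL into the browser.
TYPED = r"HKCU\Software\Microsoft\Internet Explorer\TypedURLs"
before = {(TYPED, "url1"): "http://www.google.com/"}
after = {
    (TYPED, "url1"): "http://www.microsoft.com/",
    (TYPED, "url2"): "http://www.google.com/",
}

added, removed, changed = compare_snapshots(before, after)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Whatever shows up in the added and changed sets after an application runs is, in effect, that application's registry footprint.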

Many of these capabilities are bundled into computer forensics tools such as EnCase (www.guidancesoftware.com), Windows Forensics Toolchest (foolmoon.net/security/wft/index.html), and The Forensics Toolkit (accessdata.com/Products/ftk2test.aspx).

The Tony Blair/Colin Powell case illustrates how effective BRAP forensics may be.  For an overview of the plagiarism side of the case, see www.casi.org.uk/discuss/2003/msg00457.html.  For the BRAP forensics perspective, see Richard Smith's account at www.computerbytesman.com/privacy/blair.htm.  The fragment of metadata listed above was extracted from the source document (www.computerbytesman.com/privacy/blair.doc) by Harlan Carvey's metadata extraction and parsing tool WMD.PL (cf. cfed-ttf.blogspot.com/2008/01/what-is-your-ms-office-metadata-telling.html).  The British government admitted to the plagiarism, cf. www.sfgate.com/cgi-bin/article.cgi?file=/chronicle/archive/2003/02/08/MN200631.DTL.


Hal Berghel is an educator, administrator, inventor, author, columnist, lecturer and sometimes talk show guest. He is both an ACM and IEEE Fellow and has been recognized by both organizations for distinguished service. He is the Associate Dean of the Howard R. Hughes College of Engineering at UNLV, and his consultancy, Berghel.Net, provides security services for government and industry.