Dependence-Preserving Data Compaction for Scalable Forensic Analysis
Md Nahid Hossain, Junao Wang, R. Sekar, and Scott D. Stoller

Increasingly, large organizations are being targeted in long-running attack campaigns. When the break-in is eventually discovered, forensic analysis begins. System audit logs provide crucial information that underpins such analysis. Unfortunately, audit data collected over months or years can grow to enormous sizes. Large data size is not only a storage concern: forensic analysis tasks can become very slow when they must sift through billions of records. In this paper, we first present two powerful techniques that reduce the number of records by a factor of 4.5 to 17.5. An important benefit of our techniques is that they provably preserve the accuracy of forensic analysis tasks such as backtracking and forward analysis. We then present techniques for representing this data in a main-memory graph to speed up forensic analysis. Our representation is very compact, using less than 3 bytes per event on the largest data sets used in our experiments. This compactness is achieved without slowing down forensic analysis tasks: our algorithms are able to process a million records per second on a single core.