Hash-based Classification of Data: Class-based Similarity Hashing, IFIP 2008


In this paper, we introduce the notion of class-aware similarity hashes, or classprints, an outgrowth of recent work on similarity hashing. Specifically, we build on the notion of context-based hashing to design a framework both for identifying the type of a data object from its content and for building characteristic similarity hashes for individual data items that can be used for correlation.

The most important feature of the presented work is that the process can be fully automated and that no prior knowledge of the underlying data is necessary beyond the selection of a training set of objects. The approach relies entirely on these representative sets to characterize a particular data type. We present an empirical study that demonstrates the practicality of this work on real data, and we sketch out a complete implementation.