Automatic Detection of Fake File Systems

Neil C. Rowe

Cebrowski Institute, U.S. Naval Postgraduate School

Code CS/Rp, 833 Dyer Road, NPGS

Monterey, CA, 93943, USA

ncrowe@nps.edu

Abstract

We develop methods for assessing the typicality of the file system of a computer. This is helpful in analyzing, for instance, captured terrorist machines to decide if their information is genuine and for testing whether a honeypot is convincing. We have implemented a program that computes 28 metrics on a file system including features such as the average number of files per directory, the average number of programs per directory, the length of an average filename, the size of the average file, and the average time the file was last modified. We also can infer analogous directories with different names or paths on two file systems. We show that comparing the metrics can reveal a reasonably convincing "fake" file system created using random selections from Web page names at our institution together with some other random choices. We conclude with some discussion of possible improvements incorporating more context.

This paper appeared in the Proceedings of the Intelligence Analysis Conference, McLean, Virginia, USA, May 2005.

1. Introduction

Consider an enemy's computer that is captured when an enemy operations center is overrun by an army unit. The magnetic disk of the computer could contain valuable intelligence about the enemy. It could indicate enemy operations, personnel, weapons, and planning, and in a convenient, easy-to-use package. Even if key secrets are encrypted, many important things cannot be: programs, program-process data, and system-resource data. And encryption can be broken if the encryption key can be found or extracted, or if a gimmicked encryption program is substituted for a correct one. So an enemy will undoubtedly realize the dangers of their computers falling into our hands or being secretly accessed by us. They may then attempt counterespionage, deliberate insertion of false information, on those computers as an active form of counterintelligence. Our attempted exploitation of such planted information could be damaging to us.

The challenge is to decide when information provided in a computer's files is fake. Fake information can be created by computer programs to look much like real information, more easily than fake physical documents (Rendell, 1994) which can be betrayed by handwriting, medium, and provenance, clues not nearly as available in cyberspace. Many of the contextual clues that intelligence analysts use to measure the authenticity of intelligence are lacking in digital information: source, timing, and manner of presentation. The potential spectrum of fake digital information is broad. Fake parameters like the targets in air plans are easy to design, while fake narrative like news reports is often too difficult to construct except by highly skilled experts. But an interesting class of intermediate documents is now becoming suitable for automatic construction of fakes: file systems of computer systems, the directories and their files. Fake file systems are essential to information-security tools called "honeypots" (The Honeynet Project, 2004), computer systems designed solely to attract attackers to provide decoy targets and enable study of their attack methods. Honeypots have become increasingly important in the last few years as part of a complete security strategy for organizations. It would be useful to have a software tool to assess the convincingness of such a file system. This paper will present such a tool and associated software, and describe the approach used to develop them.

2. A fake file system

Simulating a file system on a computer means providing realistic-looking files, directories, and associated data like sizes and modification dates. For a honeypot, the files must also look interesting. Just copying the file system of another computer will probably not do, because identical things on different computers (other than the operating system and software) are unusual and hence suspicious. However, a too-simple mechanical method for generating fake information may be easy to discern. So some care is necessary to design a fake file system.

As an example, we developed a simple prototype FDir, a Web site that provides directory and file information for a set of nonexistent directories and files (Rowe, 2004a). Figure 1 is an example view of the top level, a display that imitates Microsoft MS-DOS or 'Command Prompt' listings of directories. Users can click on the names of directories to see similar listings for subdirectories, and they can click on the names of files to see fake file contents of various kinds. Some of the files appear to be encrypted as in Figure 2. Some files give error messages when the user attempts to open them, including several kinds of authorization and access errors to suggest secrets are being stored; some files listed as large cause an 'out of space' error after a long wait. The openable files show images with captions taken from actual image-caption pairs on Web pages at our School. However, the directory paths to these are often different from their originals because each directory includes a few randomly added subdirectories, and long paths are likely to have at least one; for instance, an image of a meteorological satellite system (Figure 3) is located in the directory for the Snort intrusion-detection system. The random connections encourage viewers to see connections between unrelated things, a useful strategy for counterespionage since intelligence personnel are especially looking for unexpected connections. With Figure 3 for instance, it appears that satellites are spying on intruders.

Figure 2: Example "encrypted" file from FDir.

3. Background

Fake data raises classic issues in intelligence and counterintelligence. People are surprisingly poor at detecting deception (Eckman, 2001), and this has often been exploited in military operations both offensively and defensively (Dunnigan & Nofi, 2001). Deception is especially facilitated in information systems because of the frequent lack of visual and aural clues that might reveal it. So we are seeing increases in Internet scams and frauds (Mintz, 2002). Sophisticated scams can even be accomplished with honeypot technology (Rong & Yang, 2003) although this has not happened yet. Deliberate deception has been proposed as a defensive technique for information systems in general (Gerwehr et al, 2000; Rowe, 2004b), and fake music files are being used to attack music-sharing services by cluttering them with junk (Kushner, 2003).

Detecting deception is the key skill in counterintelligence. Whaley and Busby (2002) suggest that such counterintelligence is very difficult and requires a special kind of personality, someone adept at connecting pieces and seeing discrepancies. Discrepancies are the key (Heuer, 1982) because as deceptions try to accomplish major goals, they can become too elaborate to coordinate. But discrepancies can require patience to find (Johnson et al, 2001), and automated tools could be useful, especially with online data which is already digital. Simple clues are available with speech and text to suggest where to look, such as sentence length, complexity, specificity, and informality (Qin, Burgoon, and Nunamaker, 2004). Mathematical models can be built of the likelihood of various events and these can be tracked dynamically (Carofiglio, de Rosis, & Castelfranchi, 2001; Yu & Singh, 2003). Discrepancies can be either individual (specific unusual events) or aggregate (on statistics of groups of events). This is analogous to the distinction in automated 'intrusion detection systems' for defending computer systems from attack (Proctor, 2001) between misuse detection and anomaly detection, the two major approaches. However, defining unusual events is generally quite specific to the details of operating systems, whereas results of statistical analysis of computer systems are more robust over a wide range of systems. Thus we explore an anomaly-based statistical approach here.

4. Building a fake file system

To detect a fake file system, we must characterize the population of "normal" file systems that the fake attempts to imitate. For instance, if we want to fake an office-staff computer at a military base, we have the population of all office-staff computers there. Then a good approach is to make the fake some kind of average of the population. The averaging can be like a mean (for numeric parameters like file sizes and number of files in an average directory), or like a median (for the 'most representative' memo), or like a mode (for the software loaded on the most machines), or like a random selection from a subpopulation (for a network log file created by randomly choosing lines from many log files). For more realism, averages can be done for subparts of the file systems, so for instance we compute an average 'bin' directory from only the other 'bin' directories. Several situations can occur:

· The corresponding items are identical, as with executables in different installations of the same operating system. Then it is important to include this item in the fake.

· The corresponding items are similar but differ in quantitative metrics like size. Then the fake should be similar with some appropriate 'average' metric.

· The corresponding items are mostly different but follow patterns. Then a stochastic grammar could be inferred and used to generate new random items. Such a grammar must take into account contextual parameters like the use of a user's name in files.

· Some corresponding items are absent. Then the average item should be used with probability equivalent to the fraction of times it is present.

· Some corresponding items have considerably more elaborate structure than others, as when one system has an empty directory where another has a directory filled with files. Then a stochastic grammar can be used to generate the necessary tree and fill it with plausible random structure.
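The first, second, and fourth cases above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the generator used for FDir: the function name fake_item and its parameters are hypothetical, and the use of a geometric mean (a mean of log sizes) for the 'average' size is our assumption.

```python
import math
import random
import statistics

def fake_item(observed_sizes, n_systems, rng):
    """Decide whether a fake system gets an item and, if so, its size.

    observed_sizes: sizes in bytes (all > 0) of the corresponding item on
        the real systems that have it.
    n_systems: number of real systems sampled (hypothetical parameter).
    Returns None when the item is left out, otherwise a size in bytes.
    """
    # Include the item with probability equal to the fraction of
    # sampled systems on which it is present.
    if rng.random() >= len(observed_sizes) / n_systems:
        return None
    # Identical on every system that has it: copy it exactly.
    if len(set(observed_sizes)) == 1:
        return observed_sizes[0]
    # Similar but different sizes: use the mean of the log sizes
    # (an assumed choice of "average").
    logs = [math.log(size) for size in observed_sizes]
    return round(math.exp(statistics.fmean(logs)))

rng = random.Random(0)
size = fake_item([10, 1000], 2, rng)  # present on both systems, so always included
```

Passing the random generator explicitly makes a generated fake reproducible from a seed.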

Appropriate choice of an averaging method is important to minimize discrepancies in a fake file system. It must also take into account the likelihood of the whole file system, not just the pieces. For instance, an earlier version of the fake-directory site used file and directory names like 'horcalp' and 'qmcb833' generated by a stochastic context-free grammar (a set of rules with associated probabilities where one symbol on the left side is replaced by a sequence of symbols on the right side). The grammar was mostly phonetic, so it would try to alternate vowels with consonants to make the result more pronounceable, a feature often seen with operating-system filenames. The tendency of computer programmers to overuse abbreviations (although eight-character filenames are rarely required anymore) means many operating-system filenames do look this way. However, whole directories of such filenames are rare and look suspicious; file and directory names drawn randomly from real online files and directories are much more convincing. So for our FDir prototype, we used names drawn from all the Web sites at our School.
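To make the idea of a stochastic context-free grammar concrete, here is a minimal Python sketch of a phonetic name generator. The rules, probabilities, and letter sets are invented for illustration and are not those of the earlier fake-directory site.

```python
import random

# Hypothetical rules: each nonterminal maps to (probability, expansion) choices.
GRAMMAR = {
    "NAME": [(0.7, ["WORD"]), (0.3, ["WORD", "DIGITS"])],
    "WORD": [(0.5, ["SYL", "SYL"]), (0.5, ["SYL", "SYL", "SYL"])],
    "SYL": [(0.6, ["C", "V"]), (0.4, ["C", "V", "C"])],
    "DIGITS": [(1.0, ["D", "D", "D"])],
}
# Terminal symbols expand to one character chosen uniformly.
TERMINALS = {"C": "bcdfghlmnprst", "V": "aeiou", "D": "0123456789"}

def generate(symbol, rng):
    """Expand a grammar symbol into a string by random rule choice."""
    if symbol in TERMINALS:
        return rng.choice(TERMINALS[symbol])
    weights = [prob for prob, _ in GRAMMAR[symbol]]
    expansions = [expansion for _, expansion in GRAMMAR[symbol]]
    chosen = rng.choices(expansions, weights=weights, k=1)[0]
    return "".join(generate(part, rng) for part in chosen)

name = generate("NAME", random.Random(1))
```

Every syllable starts with a consonant followed by a vowel, which is what makes the output pronounceable; it is also what makes a whole directory of such names look uniform and hence suspicious.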

Note that corresponding files may have different paths on different computers; for instance, application executables could be stored under 'c:/Program Files' under Windows and under '/app1' under Unix. Matching will need to find the correspondence.

5. A first-order probabilistic model

Let us apply these ideas to assess the realism of a file system. We can use statistics on real directories. Part of this is the likelihood of each individual file or directory listing, what we call 'first-order' information. For instance in Figure 1, 'Docs', 'ICON', 'Travel', and 'admin' are all common names of directories for online information about organizations. If we do not have enough statistics on a word to estimate its likelihood, we can use those of its superconcepts (the upward type pointers) and synonyms like those provided by the Wordnet thesaurus system (Miller et al, 1990) and assume an even apportionment of counts among sibling concepts. For instance, if we do not have any statistics on the filename 'herpetology', we can note it is one of 30 subdivisions of 'biology', and divide the count on biology by 30 to estimate a count on 'herpetology'. Reasoning about the counts of superconcepts and synonyms is a form of 'statistical inheritance'. For this analysis, we should split words into subwords when we can; announcement_april_01_2002_picture.html is plausible because each of its words is common. Acronyms like CMDC, GSOIS, and RSL are rare but plausible because short acronyms are common in organizations.
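The 'herpetology' estimate can be sketched in Python as follows. The data structures (a count table, a parent table, and a table of sibling counts) and the count of 600 for 'biology' are hypothetical stand-ins; a real implementation would draw superconcepts from a thesaurus.

```python
def estimated_count(word, counts, parent, n_children):
    """Estimate a count for a word by statistical inheritance.

    counts: observed counts for words that have statistics.
    parent: maps a word to its superconcept.
    n_children: maps a superconcept to its number of subdivisions.
    """
    if word in counts:
        return float(counts[word])
    if word not in parent:
        return 0.0  # nothing to inherit from
    superconcept = parent[word]
    # Apportion the superconcept's count evenly among its subdivisions.
    inherited = estimated_count(superconcept, counts, parent, n_children)
    return inherited / n_children[superconcept]

counts = {"biology": 600}              # hypothetical observed count
parent = {"herpetology": "biology"}
n_children = {"biology": 30}
estimate = estimated_count("herpetology", counts, parent, n_children)  # 600 / 30
```

The recursion lets a word inherit through several levels when its immediate superconcept also lacks statistics.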

We can also assign likelihoods for other information about files. For our fake directory system, the extensions 'htm' and 'html' are familiar, 'rzp' looks similar to the common 'zip', and 'cry' suggests 'cryptographic', a plausible adjective for a file. We can also assess the likelihoods of the file sizes (the third item on every line of Figure 1) using the mean and standard deviation of the distribution of their logarithms, which tends to be normally distributed. As for dates and times, we can also compute a mean and standard deviation, but it is also helpful to obtain the mean within the day, within the week, and within the year to see periodic patterns. A vector average of unit vectors is appropriate for periodic values, where the direction of the vector corresponds to time modulo the period. We can also measure typicality of substructures of files, such as how often a document has an abstract or a graphical header.
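The vector average for periodic values can be sketched in Python as below. The function name circular_mean is illustrative, and returning the length of the average vector as a measure of concentration is our addition.

```python
import math

def circular_mean(values, period):
    """Average periodic values by averaging unit vectors.

    Each value becomes a unit vector whose direction is the value modulo
    the period. Returns (mean, concentration): the mean is in [0, period],
    and the concentration is the length of the average vector, near 1 when
    values cluster and near 0 when they are spread around the cycle.
    """
    angles = [2 * math.pi * v / period for v in values]
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    mean = (math.atan2(y, x) / (2 * math.pi) * period) % period
    return mean, math.hypot(x, y)

# Modification times of 23:50 and 00:10, in minutes within a 1440-minute day.
mean_minute, concentration = circular_mean([1430, 10], 1440)
```

An arithmetic mean of 1430 and 10 would give 720 (noon), whereas the vector average gives a time at midnight, which is the sensible answer for two files modified twenty minutes apart.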

Although there can be considerable variation in the features of individual files, statistics on directories are less variable and thus more useful for assessing the reasonableness of a file system. For instance, we can compute the average, standard deviation, median, largest value, and smallest value for the size of a file in a directory, the number of characters in the filename, or the date it was last modified. Directories that differ significantly in any of these statistics from those of a population of typical directories are suspicious. We can define 'significantly' by the population standard deviation. For instance, if the longest filename in a typical directory averages 25 characters with a standard deviation of 20, then a longest filename of 65 characters in a given test directory is two standard deviations from the expected value; under a normal distribution a deviation that large has only about a 5% chance of arising by chance, so it is suspicious. We can add the degrees of significance of individual metrics to get a cumulative metric of significance.
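The filename-length example can be worked through in Python. This is a sketch that assumes a normal distribution and a two-sided test; the function names are illustrative.

```python
import math

def deviations(value, population_mean, population_sd):
    """Signed number of standard deviations between a value and the mean."""
    return (value - population_mean) / population_sd

def two_sided_probability(z):
    """Chance that a normal variable lies at least |z| deviations from its mean."""
    return math.erfc(abs(z) / math.sqrt(2))

z = deviations(65, 25, 20)        # 2.0 standard deviations
p = two_sided_probability(z)      # about 0.046
```

A cumulative metric can then be formed by summing the absolute deviations over all the statistics of a directory.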

6. Experiments with a file-system metric tool

As a testbed for these ideas, we developed a software tool in Java to assess the characteristics of a file system. An advantage of Java is that the same class file can analyze a file system on any machine with the Java Run-Time Environment, including Windows, Linux, and Unix machines. The tool calculates 28 metrics on each directory of a file system. The metrics were chosen to reflect features most obvious in a quick inspection of directories: the form and types of filenames, the types of files, the sizes of files, the date distribution, and the shape of the directory tree. The eight simple metrics used were the number of files in the directory, the depth of the directory in the file hierarchy, the number of system files in the directory, the number of document files, the number of image files, the number of Web files, the number of program source files, and the number of filenames that were known English words. The five aggregating metrics -- for each of which we calculated the mean, standard deviation, minimum, and maximum -- were the filename length, the natural logarithm of the size of the file (or directory data) in bytes, when the file was last modified, the time within the day that the file was last modified, and the number of pieces in the filename as separated by punctuation marks, typeface case changes, or digits. Our program calculated these 28 metrics for each directory and its subdirectories as well as standard errors on each metric. Table 1 shows some example metrics for five systems at our school: (1) a computer-science Unix file server, (2) an operations-research Unix file server, (3) an oceanography Unix file server, (4) our office desktop Windows machine, and (5) the Linux machine of a colleague. There are differences, but some surprising similarities too.

Table 1: Ten representative metrics on five file systems.

Metric	Sys. 1	Sys. 2	Sys. 3	Sys. 4	Sys. 5
Total files	215814	50860	315678	81346	2515746
Av. dir. size	19.2	18.2	76.3	18.5	76.3
# document files per directory	0.3	0.9	0.3	1.4	0.04
# Web files per directory	1.2	0.1	0.7	0.3	0.1
# of English filenames per directory	2.5	5.5	1.5	1.4	4.3
Av. filename length	8.9	9.0	10.8	10.5	6.4
Av. log of file size	7.9	9.7	7.4	7.7	2.7
Av. day modified	11041	11459	11035	11706	12341
Av. minute in day	936	794	987	932	822
Av. # filename parts	2.1	2.7	3.1	2.3	2.0
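Two of the ingredients behind such metrics, splitting a filename into pieces and summarizing a per-file quantity over a directory, can be sketched in Python as follows. This is not the Java tool itself: dropping the extension before splitting and using the population standard deviation are our assumptions.

```python
import os
import re
import statistics

def filename_parts(name):
    """Split a filename into pieces at punctuation, digits, and case changes.

    The extension is dropped first (an assumption), and runs of digits
    act only as separators.
    """
    stem = os.path.splitext(name)[0]
    pieces = []
    for chunk in re.split(r"[^A-Za-z]+", stem):
        # Split again wherever a lowercase letter is followed by an uppercase one.
        pieces.extend(p for p in re.split(r"(?<=[a-z])(?=[A-Z])", chunk) if p)
    return pieces

def aggregate(values):
    """Mean, standard deviation, minimum, and maximum of a per-file quantity."""
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }

names = ["announcement_april_01_2002_picture.html", "myFileName.txt", "readme"]
parts_summary = aggregate([len(filename_parts(n)) for n in names])
```

The same aggregate function applies unchanged to filename lengths, log file sizes, and modification dates.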

We can then compare two file systems as a whole by comparing the metrics for their top-level directories, which include statistics on everything beneath them. Deciding whether the difference between two metrics is statistically significant is a classic problem in statistics, for which it is reasonable to assume a normal distribution and a standard error of the square root of the sum of the squares of the standard deviations of the metrics. Thus we use $(m_1 - m_2) / \sqrt{\sigma_1^2 + \sigma_2^2}$ as our measure of significance for an individual metric, where $m_i$ and $\sigma_i$ are the value and standard deviation of the metric on file system $i$. We can average the absolute value of this over the 28 statistics to get an overall level of significance.
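A Python sketch of this comparison follows, using the standard error stated above (the square root of the sum of the squared standard deviations). The function names are illustrative, and taking absolute values before averaging is our reading of the overall measure.

```python
import math

def signed_significance(m1, sd1, m2, sd2):
    """Difference between two metric values in units of the combined standard error.

    Assumes at least one of the standard deviations is nonzero.
    """
    return (m1 - m2) / math.sqrt(sd1 ** 2 + sd2 ** 2)

def overall_significance(scores):
    """Average magnitude of the per-metric scores."""
    return sum(abs(s) for s in scores) / len(scores)

score = signed_significance(10.0, 3.0, 6.0, 4.0)   # 4 / 5 = 0.8
overall = overall_significance([score, -1.2])       # (0.8 + 1.2) / 2 = 1.0
```

Keeping the sign on individual scores shows the direction of each discrepancy; the magnitudes are averaged so that opposite discrepancies do not cancel.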

As an example, we used our tool to compare metrics of our entire fake directory system (the original version before we made improvements suggested by this analysis) with those of five machines around our office: two Windows machines, two Unix machines, and one Macintosh. Results are shown in Table 2. There were 453,974 fake file and directory names, and 126,930 real files in the two Windows systems. The table gives signed standard errors; negative numbers mean the fake's value was larger than the Windows systems' value.

Table 2: Comparison of the 28 metrics between five real and one fake file system.

Simple metric	Value
Directory size	-2.34
Depth	1.00
Number of system files	-1.87
Number of user files	0.09
Number of image files	0.22
Number of Web files	-2.35
Number of program files	0.50
Number of English words	-2.87

Aggregating metric	Average	Standard deviation	Minimum	Maximum
Filename length	0.00	-1.50	2.00	-1.11
File size	0.46	-3.20	0.30	-2.22
Date	3.31	0.43	8.35	-1.22
Time in day	0.50	-3.42	1.74	-1.35
Filename parts	0.12	-1.00	0.25	-0.85

The average error over the 28 statistics was 1.59, as contrasted with 0.13 in comparing the two Windows systems. It appears that the standard deviations and the maxima for file-system metrics were generally too small; we were attempting to avoid seeming too unusual, but went too far. Dates in the fake system were too early, but some real computer systems not used recently are like this. The fake system had too many files per directory and too many English-word filenames (since they came from Web pages). But all these factors are easy to adjust. Our tool also compares corresponding subdirectories, which is useful because matching importance varies considerably: Similarities of software directories are much more important than similarities of user directories. Note that a better strategy might be to compare to one normal system that is the closest to what the fake imitates.
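The overall figure can be checked directly from the values in Table 2. The Python sketch below averages their magnitudes and lists the metrics that are at least two standard errors off; the cutoff of 2.0 and the shortened metric labels are our choices for illustration.

```python
# Signed standard errors transcribed from Table 2.
TABLE_2 = {
    "directory size": -2.34, "depth": 1.00, "system files": -1.87,
    "user files": 0.09, "image files": 0.22, "Web files": -2.35,
    "program files": 0.50, "English words": -2.87,
    "filename length mean": 0.00, "filename length sd": -1.50,
    "filename length min": 2.00, "filename length max": -1.11,
    "file size mean": 0.46, "file size sd": -3.20,
    "file size min": 0.30, "file size max": -2.22,
    "date mean": 3.31, "date sd": 0.43, "date min": 8.35, "date max": -1.22,
    "time in day mean": 0.50, "time in day sd": -3.42,
    "time in day min": 1.74, "time in day max": -1.35,
    "filename parts mean": 0.12, "filename parts sd": -1.00,
    "filename parts min": 0.25, "filename parts max": -0.85,
}

# Average magnitude over the 28 statistics.
average_error = sum(abs(v) for v in TABLE_2.values()) / len(TABLE_2)

# Metrics at least two standard errors away, largest discrepancy first.
CUTOFF = 2.0  # hypothetical threshold
flagged = sorted(
    (name for name, v in TABLE_2.items() if abs(v) >= CUTOFF),
    key=lambda name: -abs(TABLE_2[name]),
)
```

Sorting by magnitude puts the earliest-date statistic first, followed by the standard deviations of time in day and of file size.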

Equivalent directories on different computers may have different names and paths (since, for instance, software can be installed in many places). So we wrote an additional tool to automatically infer apparently equivalent but differently-named directories. Examples we found were:

· c:/Documents and Settings/Neil Rowe maps to c:/Documents and Settings/Williams

· c:/Program Files/Java/j2re1.4_01 maps to c:/jsdk1.4.2_01/jre

· /work/ssiripal/java/demo/jfc/Java2D/src maps to c:/j2sdk1.4.2_01/demo/plugin/jfc/Java2D/src/java2d

This tool searches for pairs of directories that (a) have more than a threshold number (we used 10 in experiments) of subdirectories in common, and (b) differ by no more than a threshold (we used 20) in the sum of the count deviations for the five types of files. Using it we found 379 such mappings between directories for 10,068 directories on the four representative file systems, and these mappings created 4636 additional aggregations of statistics for the directories. We then applied these mappings and recalculated the statistics on the merged directories. No significant difference was observed with our test directories, but we expect there will be improved accuracy with a more realistic fake file system.
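The two tests (a) and (b) can be sketched in Python as follows. This is an illustration rather than the mapping tool itself: the input format is hypothetical, and we assume that the 'sum of the count deviations' means the sum of absolute differences in the five file-type counts.

```python
from itertools import product

def equivalent_directories(dirs_a, dirs_b, min_common=10, max_deviation=20):
    """Find pairs of directories that appear equivalent across two systems.

    dirs_a, dirs_b: map a directory path to a pair
        (set of subdirectory names, tuple of five file-type counts).
    A pair qualifies when it shares more than min_common subdirectory names
    and its file-type counts differ by at most max_deviation in total.
    """
    mappings = []
    for (path_a, (subs_a, counts_a)), (path_b, (subs_b, counts_b)) in product(
        dirs_a.items(), dirs_b.items()
    ):
        common = len(subs_a & subs_b)
        deviation = sum(abs(x - y) for x, y in zip(counts_a, counts_b))
        if common > min_common and deviation <= max_deviation:
            mappings.append((path_a, path_b))
    return mappings

shared = {f"d{i}" for i in range(12)}
system_a = {"/x": (shared, (1, 2, 3, 4, 5))}
system_b = {
    "c:/y": (shared | {"extra"}, (2, 2, 3, 4, 9)),
    "c:/z": ({"d0"}, (1, 2, 3, 4, 5)),
}
pairs = equivalent_directories(system_a, system_b)
```

Comparing every pair is quadratic in the number of directories; for large file systems an index from subdirectory name to the directories containing it would narrow the candidates first.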

7. Further modeling issues

Our prototype comparator tool can detect some obvious honeypots, but it cannot detect discrepancies due to overly similar files. For instance, we expect every system to have directory names including the names of the main users; if these names are the same on every machine of a local-area network, they are suspicious. This can be detected by calculating statistics on the mappings between systems found by our mapping tool described above. System pairs with a too-low number of possible mappings (that is, for which too many files and directories are identical) are suspicious.

We also can examine "second-order" statistics on relationships; relationship analysis is central in intelligence (Coffman, Greenblatt, & Markus, 2004). For instance, we can count how often two files occur together in a directory. Many relationship discrepancies are reduced in our fake-directory tool by choosing file names for a directory from the same real-world directory, but there are too many relationships to fake all of them. The main problem with using second-order probabilities is that their statistics are considerably sparser than first-order statistics. Thus we need to do statistical inheritance from larger populations of files and directories and generalized properties of them; for instance, we can look at the number of times a synonym for 'project' occurred as a subdirectory under a directory name that was a branch of engineering.
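Counting how often two names occur together in a directory can be sketched in Python as below; the function name and input format are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(directories):
    """Count, for each pair of names, the directories containing both.

    directories: an iterable of lists of file names, one list per directory.
    Pairs are stored in sorted order so (a, b) and (b, a) are the same key.
    """
    pair_counts = Counter()
    for names in directories:
        for pair in combinations(sorted(set(names)), 2):
            pair_counts[pair] += 1
    return pair_counts

counts = cooccurrence_counts([["a", "b", "c"], ["b", "a"], ["c"]])
```

The number of possible pairs grows with the square of the number of distinct names, which is why these statistics are so much sparser than first-order counts.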

A different issue that arises in counterintelligence is the effectiveness of the supplied fake information, a factor which acts counter to plausibility. For instance, the most plausible of fake air targeting orders for spies to find would be those closest to real orders, since good decision-making is usually best represented in the real orders. But deceptions too close to reality are ineffective. So a tradeoff must be made.

8. Conclusion

We have developed a theory of plausibility of the file system of a computer using its typical characteristics. To go with this theory we have built two main tools, one that builds realistic-looking fake file systems, and one that calculates statistics on file systems and compares them to see if they are significantly different. The tools can be used for both intelligence and counterintelligence, both to build honeypots and to test their convincingness. But clearly much further work is possible by incorporating mapping statistics, second-order effects, and effectiveness ratings into our tools.

Acknowledgements

This work was supported by the National Science Foundation under the Cyber Trust Program. Views expressed are those of the author and do not represent policy of the U.S. Government.

References

Carofiglio, V., de Rosis, F., & Castelfranchi, C., 2001. Ascribing and Weighting Beliefs in Deceptive Information Exchanges. Proc. User Modeling, 222-224.

Coffman, T., Greenblatt, S., & Markus, S., 2004. Graph-based Technologies for Intelligence Analysis. Communications of the ACM, 47 (3): 45-47.

Dunnigan, J. F., & Nofi, A. A., 2001. Victory and Deceit, second edition: Deception and Trickery in War. San Jose, CA: Writers Club Press.

Eckman, P., 2001. Telling lies: Clues to Deceit in the Marketplace, Politics, and Marriage. New York: Norton.

Gerwehr, S., Weissler, R., Medby, J. J., Anderson, R. H., & Rothenberg, J., 2000. Employing Deception in Information Systems to Thwart Adversary Reconnaissance-Phase Activities. Project Memorandum, National Defense Research Institute, Rand Corp., PM-1124-NSA.

Heuer, R. J., 1982. Cognitive Factors in Deception and Counterdeception. Daniel, D., and Herbig, K. (Eds.), Strategic Military Deception, New York: Pergamon, pp. 31-69.

The Honeynet Project, 2004. Know Your Enemy, 2nd Edition. Boston: Addison-Wesley.

Johnson, P., Grazioli, S., Jamal, K., & Berryman, R., 2001. Detecting Deception: Adversarial Problem Solving in a Low Base Rate World. Cognitive Science, 25 (3): 355-392.

Kushner, D., 2003. Digital Decoys. IEEE Spectrum, 40 (5), 27.

Miller, G., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K., 1990. Five Papers on Wordnet. International Journal of Lexicography, 3 (4), Winter 1990.

Mintz, A. P. (ed.), 2002. Web of Deception: Misinformation on the Internet, CyberAge Books, New York.

Proctor, P. E., 2001. Practical Intrusion Detection Handbook. Upper Saddle River, NJ: Prentice-Hall PTR.

Qin, T., Burgoon, J., and Nunamaker, J., 2004. An Exploratory Study of Promising Cues in Deception Detection and Application of Decision Tree. Proc. 37th Hawaii Intl. Conf. on Systems Sciences.

Rendell, K., 1994. Forging History: the Detection of Fake Letters and Documents. Norman, OK: University of Oklahoma Press.

Rong, C., & Yang, G., 2003. Honeypots in Blackhat Mode and its Implications. Proc. 4th Intl. Conf. on Parallel and Distributed Computing Applications and Technology, 185-188.

Rowe, N., 2004a. A Model of Deception during Cyber-Attacks on Computer Systems. Symposium on Multi-Agent Security and Survivability, Philadelphia, PA.

Rowe, N., 2004b. Designing Good Deceptions in Defense of Information Systems. Computer Security Applications Conference, Tucson, AZ, 418-427.

Whaley, B., & Busby, J., 2002. Detecting Deception: Practice, Practitioners, and Theory. Godson, R., & Wirtz, J. (Eds.), Strategic Denial and Deception (New Brunswick: Transaction Publishers), pp. 181-221.

Yu, B., & Singh, M., 2003. Detecting Deception in Reputation Management. Proc. Conf. Multi-Agent Systems (AAMAS), Melbourne, AU, 73-80.