

NAVAL POSTGRADUATE SCHOOL

MONTEREY, CALIFORNIA


THESIS


USING TEXTURE VECTOR ANALYSIS TO MEASURE COMPUTER AND DEVICE FILE SIMILARITY

by Bruce Allen

December 2019

Thesis Co-Advisors: Neil C. Rowe and James Bret Michael

Approved for public release. Distribution is unlimited



REPORT DOCUMENTATION PAGE

Form Approved OMB No. 0704–0188

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden to Washington headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202–4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503.

1. AGENCY USE ONLY (Leave blank)

2. REPORT DATE: December 2019

3. REPORT TYPE AND DATES COVERED: Master’s Thesis

4. TITLE AND SUBTITLE: USING TEXTURE VECTOR ANALYSIS TO MEASURE COMPUTER AND DEVICE FILE SIMILARITY

5. FUNDING NUMBERS

6. AUTHOR(S): Bruce Allen

7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Naval Postgraduate School, Monterey, CA 93943

8. PERFORMING ORGANIZATION REPORT NUMBER

9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES): Navy Director for Acquisition Career Management

10. SPONSORING / MONITORING AGENCY REPORT NUMBER

11. SUPPLEMENTARY NOTES: The views expressed in this document are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government. IRB Protocol Number: N/A.

12a. DISTRIBUTION / AVAILABILITY STATEMENT: Approved for public release. Distribution is unlimited.

12b. DISTRIBUTION CODE

13. ABSTRACT (maximum 200 words)


Executable programs run on computers and digital devices. These programs are stored as executable files in storage media such as disk drives or solid-state drives within the device, and are opened and run. Some executable files are pre-installed by the device vendor. Other executable files may be installed by downloading them from the Internet or by copying them in from an external storage medium such as a memory stick or CD. It is useful to study file similarity between executable files to verify valid updates, identify potential copyright infringement, identify malware, and detect other abuse of purchased software. An alternative to relying on simplistic methods of file comparison, such as comparing their hash codes to see if they are identical, is to identify the “texture” of files and then assess the similarity of this texture between files. To test this idea, we experimented with a sample of 23 Windows executable file families and 1,386 files. We identify points of similarity between files by comparing sections of data in their standard deviations, means, modes, mode counts, and entropies. When vectors are sufficiently similar, we calculate the offsets (shifts) between the sections to get them to align. Using a histogram, we find the most likely offsets for blocks of similar code. Results of the experiments indicate that this approach can measure file similarity efficiently. By plotting similarity vs. time, we track the progression of similarity between files.

14. SUBJECT TERMS

15. NUMBER OF PAGES: 85

16. PRICE CODE

17. SECURITY CLASSIFICATION OF REPORT: Unclassified

18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified

19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified

20. LIMITATION OF ABSTRACT: UU


NSN 7540-01-280-5500    Standard Form 298 (Rev. 2–89), prescribed by ANSI Std. 239–18




Approved for public release. Distribution is unlimited


USING TEXTURE VECTOR ANALYSIS TO MEASURE COMPUTER AND DEVICE FILE SIMILARITY


Bruce Allen, Civilian

B.S., CSU Sacramento, 1989


Submitted in partial fulfillment of the requirements for the degree of


MASTER OF SCIENCE IN COMPUTER SCIENCE

from the

NAVAL POSTGRADUATE SCHOOL

December 2019


Approved by: Neil C. Rowe, Thesis Co-Advisor


James Bret Michael, Thesis Co-Advisor


Peter J. Denning, Chair, Department of Computer Science




ABSTRACT


Executable programs run on computers and digital devices. These programs are stored as executable files in storage media such as disk drives or solid-state drives within the device, and are opened and run. Some executable files are pre-installed by the device vendor. Other executable files may be installed by downloading them from the Internet or by copying them in from an external storage medium such as a memory stick or CD. It is useful to study file similarity between executable files to verify valid updates, identify potential copyright infringement, identify malware, and detect other abuse of purchased software. An alternative to relying on simplistic methods of file comparison, such as comparing their hash codes to see if they are identical, is to identify the “texture” of files and then assess the similarity of this texture between files. To test this idea, we experimented with a sample of 23 Windows executable file families and 1,386 files. We identify points of similarity between files by comparing sections of data in their standard deviations, means, modes, mode counts, and entropies. When vectors are sufficiently similar, we calculate the offsets (shifts) between the sections to get them to align. Using a histogram, we find the most likely offsets for blocks of similar code. Results of the experiments indicate that this approach can measure file similarity efficiently. By plotting similarity vs. time, we track the progression of similarity between files.




Table of Contents


  1. Introduction

  2. Background
     2.1 Contents of an Executable File
     2.2 Identifying File Similarity

  3. Calculating File Similarity
     3.1 Calculating Texture-Vector Data
     3.2 Calculating Similarity Offsets between Sections
     3.3 Calculating Similar-Section Offset Histograms
     3.4 Calculating Similarity Measures Between Files
     3.5 Tracking Versions of Executable Code

  4. Preparing the Dataset
     4.1 Preparing the Dataset of Executable Files
     4.2 Preparing the Texture-Vector Files
     4.3 Tuning Rejection Thresholds
     4.4 Preparing the Similarity-graph Files

  5. Results
     5.1 Evaluating Similarities by File Family
     5.2 Evaluating Similarities Across File Families
     5.3 Examining Similarity using the Texture-Vector Browser GUI Tool
     5.4 Examining Similarity using Gephi

  6. Conclusions and Future Work
     6.1 Conclusions
     6.2 Future Work

  Appendix A  The Texture-Vector Similarity Toolset
     A.1 Download
     A.2 Usage
     A.3 Texture-Vector Similarity Toolset Data

  Appendix B  Texture-Vector Threshold Settings

  Appendix C  Texture Vector Data Syntax

  Appendix D  Similarity-Graph Data Syntax

  Appendix E  Source Code
     E.1 Texture-Vector Similarity Source Code
     E.2 Source Code for Batch Processing
     E.3 Source Code for Statistical Analysis
     E.4 Source Code License

  List of References

  Initial Distribution List


List of Figures


Figure 3.1   Inference process.

Figure 3.2   Texture patterns of two very similar executable files and lines connecting them indicating points of similarity.

Figure 3.3   An illustration showing an uncompensated histogram (triangular region) and its equivalent compensated histogram (rectangular region) used for the similarity calculation.

Figure 3.4   Example of a high similarity value in file family iexplore_exe.

Figure 4.1   Histogram of file sizes for our dataset.

Figure 4.2   Histogram of file modification times for our dataset.

Figure 5.1   Histogram of similarity matches across all files in our dataset.

Figure 5.2   Similarity using texture-vectors vs. similarity using Prof. Rowe’s byte analysis.

Figure 5.3   False-positive similarity between two files caused by homogeneous compressed data.

Figure 5.4   Example of a low similarity value in file family winprint_dll.

Figure 5.5   Sorted node listing with node 326 selected.

Figure 5.6   Files (nodes) and similarity measures (edges) associated with file node 326 showing modification times and similarity to node 326.

Figure 5.7   A detailed comparison of files 312 and 326 showing a high degree of similarity.

Figure 5.8   Files similar to the latest Microsoft Office file in file family powerpnt_exe.

Figure 5.9   Comparison of files 326 and 310.

Figure 5.10  Similarity increases as versions approach the latest version.

Figure A.1   Example Texture-Vector Similarity GUI settings dialog.

Figure A.2   The Texture-Vector Similarity GUI showing similarity between two similar files.

Figure A.3   The TV file selection table.

Figure A.4   The similarity edge selection table for file node 10.

Figure A.5   The Texture-Vector Browser GUI showing file node 79 selected.


List of Tables


Table 4.1   Files by file family.

Table 4.2   Default texture-vector threshold settings.

Table 5.1   Mean similarity and number of comparisons made within file families.

Table 5.2   Mean file similarity between file families.




CHAPTER 1:

Introduction


Software of unknown pedigree abounds. This is partly because software is distributed as executable code, or a “binary,” and evaluating the contents of a binary is technically challenging.

Executable code consists of machine instructions, register references, memory addresses, hardcoded data, and text that is referenced by it. Machine instructions have operators and operands (arguments). When the source code changes with new versions and executable code is recompiled, most operands change. Small changes in source code can result in considerably different operands in the executables. Nonetheless, comparisons between versions of a binary can be made because most operators remain the same amongst the versions.

Machine instructions in executable code are interpreted by a processor. Programmers rarely write machine code directly. Instead, they write higher-level source code in a high-level language such as C++ and compile the source code into machine code.

Numerous updates to a binary can occur over the useful life of the executable to address new software requirements, fix software defects, or port the software to a different computing platform. Each of these requires recompilation and results in a new binary.

Executable code can be analyzed using reverse-engineering tools that recover information about the binary’s structure, function, and behavior. Some tools recognize data regions inside the code, while more advanced tools analyze the machine instructions to make inferences about the code’s function. Because of the differences in instruction set architectures (ISAs), tools use models of ISAs. However, reverse engineering of a binary can be resource-intensive and can be stymied by deliberate anti-reversing techniques used to protect the binary file.

Executable code is vulnerable to malware. By replacing machine instructions with malicious ones, executable code can be transformed into malware. Malware can divert execution of code to perform one or more malicious tasks. Detecting malware contained in adversarial malware binaries is technically challenging, even with the use of artificial-intelligence techniques such as deep learning [1].

We introduce here an approach based on texture vectors to allow executables to be compared against each other without requiring reverse engineering of the binaries. Our approach can be used as a first step to determine whether reverse engineering is needed. Chapter 2 covers related work. Chapter 3 describes the algorithms for creating texture vectors and processing them to draw conclusions about similarities between executable code files. Chapter 4 describes the dataset we used and how we prepared texture-vector and similarity-graph data. Texture-vector analysis using a dataset of executable files is presented in Chapter 5, followed by conclusions and recommendations for future work in Chapter 6. Details of the tools that implement the algorithms are presented in the appendices.




CHAPTER 2:

Background


    2.1 Contents of an Executable File

      A binary contains more than just executable code. It includes fixed data, reserved space, and links to executable code that is external to the file [2]. Similarities in fixed data and fixed links are easiest to find because they can be matched directly. Reserved space usually consists of bytes with zero values, and is found in many places in a typical executable file. It can complicate similarity measurements since there can be many false matches with zero bytes.

      The portion of an executable file that contains the actual executable code consists of machine instructions and their associated operands. When executable code is modified, many machine instructions remain the same but usually their locations shift. Then the memory addresses encoded in their operands may change to compensate for this shift unless the code uses addressing relative to a register. However, register arguments encoded in operands may also shift. For 32-bit processors, many machine instructions are spaced four bytes apart; for 64-bit processors, eight bytes apart. Hence it may be possible to detect code similarity of machine instructions by comparing bytes at 4-byte or 8-byte boundaries.
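
      As an illustrative sketch only (this is not a tool from the thesis), the boundary-comparison idea can be expressed as the fraction of matching bytes sampled at a fixed stride; the function name and interface below are ours:

```python
def stride_match_fraction(a: bytes, b: bytes, stride: int = 4) -> float:
    """Fraction of positions, sampled every `stride` bytes, at which the two
    inputs have identical byte values (compared over the shorter length)."""
    n = min(len(a), len(b))
    positions = range(0, n, stride)
    if len(positions) == 0:
        return 0.0
    matches = sum(1 for i in positions if a[i] == b[i])
    return matches / len(positions)

# Example: two buffers whose bytes at 4-byte boundaries agree even though
# the bytes in between (e.g., operands) differ.
print(stride_match_fraction(bytes([0x55, 1, 2, 3, 0x8B, 4, 5, 6]),
                            bytes([0x55, 9, 9, 9, 0x8B, 9, 9, 9])))  # 1.0
```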


    2.2 Identifying File Similarity

      Numerous approaches exist for identifying similarities between files. They can be used on text files, binary files, images, video, and audio. A few apply to files containing executable code. Some of these executable-analysis tools visualize software evolution in source code using version-control information or source-code file analysis [3]. A three-dimensional graph can show where code accesses the operating system or other information about code flow, and graph how these numbers change over the evolution of a software product. The Code Time Machine tool [4] does this to show the evolution of code metrics for a given file. It shows values along a time-line for the number of lines of code, number of methods, and cyclomatic complexity (i.e., the number of paths the code can take given the possible conditions written into the code). Another three-dimensional graph relates files and file relations between versions: circles represent releases, squares represent files, and edges represent associations [5]. Other tools that graph code evolution are CVSScan [6] and EPOSee [7].

      There are many types of files. Three important ones are:




CHAPTER 3:

Calculating File Similarity


In this chapter we present our texture-vector approach. We perform three layers of calculations to make inferences about similarity and how and where files are similar. Our steps are:

  1. Calculate texture-vector datasets from the two files to be compared.

  2. Compare texture-vector datasets to identify similarity offsets and produce a similarity offset histogram.

  3. Calculate statistics from the heights of the similarity offset histogram to produce a single similarity measure for the comparison of the two files.

This process is illustrated in Figure 3.1.



Figure 3.1. Inference process.


    3.1 Calculating Texture-Vector Data

      Texture vectors are calculated from the byte values of contiguous sections of binary data. Although many transform algorithms are possible, we are specifically interested in transforms that can both represent some unique characteristic of the data and possess a value that can be meaningfully compared to other values to measure similarity.

      Sections measured as similar by many transforms have stronger similarities than others. We tested the following transforms for calculating texture vectors on the integer values of the bytes:

      • Standard Deviation: The standard deviation of the byte values in a section of binary data. Two sections with a similar amount of deviation may be similar.


      • Mean: The average byte value in the section. When executable code changes, operators may remain the same and help maintain the same mean.

      • Mode: The most frequent byte value in the section. Often this value was zero in our data. This value is nonmetric and can only be used in computing similarity distances in the sense that it is identical or not.

      • Mode Count: The number of occurrences of the most frequent byte value in the section.

      • Entropy: The Shannon entropy of the byte values in the section. Two sections may be similar if the amount of randomness in each section is similar.

We considered a Fourier transform for texture values but chose not to use it because it preserves information and returns output the same size as its input. We could have used lowpass or highpass filtering to reduce the number of values it produced, but found that similar information was provided by the entropy measure.

We picked a section size of 500 bytes for the texture vectors after experimenting with values as low as 50 and as high as 50,000. A section size that was too small produced texture vectors with too much fluctuation, and a section size that was too large diluted the texture-vector characteristics. We also picked 500 rather than a power of two so as not to align with data-structure sizes or boundaries intrinsic to particular data, such as organizational boundaries of contents placed within executable code.
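
The texture-vector computation can be sketched in a few lines of Python. This is only an illustration of the five transforms over 500-byte sections, not the thesis's calc_tv.py tool; the function names are ours.

```python
import math
from collections import Counter

SECTION_SIZE = 500  # section size used in this thesis


def texture_vector(section: bytes):
    """Return (standard deviation, mean, mode, mode count, Shannon entropy)
    of the byte values in one section."""
    n = len(section)
    mean = sum(section) / n
    std = math.sqrt(sum((b - mean) ** 2 for b in section) / n)
    counts = Counter(section)
    mode, mode_count = counts.most_common(1)[0]
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return (std, mean, mode, mode_count, entropy)


def texture_vectors(data: bytes, section_size: int = SECTION_SIZE):
    """Split a file's bytes into contiguous sections and compute one texture
    vector per section."""
    return [texture_vector(data[i:i + section_size])
            for i in range(0, len(data), section_size)]
```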


3.1.1 Calculating Texture-Vector Distance

Two texture vectors are defined as similar when the first texture vector is within a threshold of closeness to the second texture vector by the weighted square of the L2 (Euclidean) distance metric [19]. The similarity can be thought of as 1/d², where d is the distance, with the weighted squared distance calculated as

d² = w₁(dv₁)² + w₂(dv₂)² + w₃(dv₃)² + w₄(dv₄)² + w₅(dv₅)²,

where dvᵢ is the difference at a given vector element and wᵢ is the weight for that element. For example, if texture-vector 1 has values [100, 30, 220, 50, 80], texture-vector 2 has values [101, 32, 225, 51, 80], and the weights [w₁, w₂, w₃, w₄, w₅] are [0.25, 0.25, 0.0, 0.25, 0.25], then the squared L2 distance is d² = 0.25·1² + 0.25·2² + 0.0·5² + 0.25·1² + 0.25·0² = 0.25 + 1.0 + 0 + 0.25 + 0 = 1.5.

A threshold of similarity was used for our graphics; for instance, if the acceptance threshold is 1.0, these vectors are not similar because 1.5 > 1.0. We set weight values by experiment, as explained in Chapter 4.3. A good threshold identifies numerous correct similarities between the sections of data from which the texture vectors were calculated while excluding non-similarities.
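
A minimal sketch of this weighted squared-distance test, reproducing the worked example above (the weights and threshold here are the illustrative values from that example, not the defaults of Table 4.2):

```python
def squared_distance(tv1, tv2, weights):
    """Weighted square of the L2 distance between two texture vectors."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, tv1, tv2))


def is_similar(tv1, tv2, weights, threshold):
    """Two texture vectors are similar when the weighted squared distance
    is within the acceptance threshold."""
    return squared_distance(tv1, tv2, weights) <= threshold


# Worked example from the text:
tv1 = [100, 30, 220, 50, 80]
tv2 = [101, 32, 225, 51, 80]
weights = [0.25, 0.25, 0.0, 0.25, 0.25]
print(squared_distance(tv1, tv2, weights))  # 1.5
print(is_similar(tv1, tv2, weights, 1.0))   # False, because 1.5 > 1.0
```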

In our experiments, we saw many byte ranges of very low entropy, for example where all but three byte values in a section were 0. If low-entropy occurrences are random, then they average out somewhat in the mean power histogram described in Chapter 3.3. If they are not random, they can still be useful in identifying similarity. We decided not to remove any texture vectors, such as those with extremely low entropy, because we want our similarity algorithms to use all the data. However, future work should consider weighting bytes by their inverse document frequency, traditionally computed as the logarithm of the inverse of their relative frequency.


    3.2 Calculating Similarity Offsets between Sections

      We calculate similarity offsets by comparing all the texture vectors in one file against all the texture vectors in another file and counting the offsets between the files where the texture vector distance is within the threshold of closeness. When there are many offsets with the same value, this gives high confidence in those byte matches.
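
      Continuing the sketch above (it reuses the squared_distance helper and the 500-byte section size), the offset collection amounts to a pairwise comparison of the two files' texture vectors; the function name and interface are ours:

```python
def similarity_offsets(tvs_a, tvs_b, weights, threshold, section_size=500):
    """Byte offsets (section position in file A minus section position in
    file B) at which two sections' texture vectors are within the closeness
    threshold."""
    offsets = []
    for i, tv_a in enumerate(tvs_a):
        for j, tv_b in enumerate(tvs_b):
            if squared_distance(tv_a, tv_b, weights) <= threshold:
                offsets.append((i - j) * section_size)
    return offsets
```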

      We implemented a display to show consistently strong offsets between two files. The display draws lines connecting similar texture vectors. The pattern and quantity of similarity lines indicate the nature and degree of file similarity. Figure 3.2 shows an example of two very similar versions of executable code, where the texture vector pattern of each file is shown across the top and bottom, and the lines between them indicate points of similarity. The files are both roughly 220 KB in length.



      Figure 3.2. Texture patterns of two very similar executable files and lines connecting them indicating points of similarity.


    3.3 Calculating Similar-Section Offset Histograms

      We calculate a similarity offset histogram from the set of offsets identified when searching for sufficiently similar texture vectors. There can be many thousands of offset values where similar-section matches can occur. To quantify this distribution of offsets, we create a similarity offset histogram and distribute calculated offset values across approximately 400 buckets, which sufficiently categorizes offsets in a viewable form. Consistent offset values are found as peaks on the histogram of offset values and represent likely meaningful similarities.

      We calculate the measure of similarity between two files from the heights in the similarity offset histogram. A large spread in heights suggests matches concentrated at specific offsets, indicating genuine similarity, while minimal spread in heights suggests a random distribution of similarity offsets, likely a result of false positives.

      Because it is mathematically possible to have more similarity offsets near the middle of the histogram than at the sides, we must adjust histogram counts by offset value. We created a compensated histogram that has an even probability of heights across it, and calculated similarity from that. We calculated the compensated histogram by removing the right side of the histogram, where the possibility for histogram counts is decreasing, and adding it to the left side, where the possibility for histogram counts is increasing. This is shown in Figure 3.3, where the triangular region is the uncompensated histogram and the rectangular region is the compensated one. The vertical axis plots the number of similarity offsets found for each bucket. The offset value along the horizontal axis is the difference between the byte location of the similar-section offset in one file and the byte location of the matching similar-section offset in the other. The horizontal axis spans from the negative of the size of the file on the left to the positive size of the file on the right. Although the ordering of the files is user-selected, the calculated histogram is identical; the calculation is symmetric. Because these histograms overlap on the graph, we draw them slightly transparent so they blend, allowing us to see all their parts.
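
      The sketch below reflects our reading of this compensation step: bucket counts from the right half of the offset histogram, where the number of possible alignments falls off, are folded back onto the left half, and the similarity measure of Chapter 3.4 is then the standard deviation of the compensated bucket heights. The bucket count and helper names are illustrative, not taken from the toolset.

```python
import statistics


def offset_histogram(offsets, size_a, size_b, num_buckets=400):
    """Distribute byte offsets, which range from about -size_b to +size_a,
    across a fixed number of buckets."""
    lo, hi = -size_b, size_a
    width = (hi - lo) / num_buckets
    buckets = [0] * num_buckets
    for offset in offsets:
        index = min(int((offset - lo) / width), num_buckets - 1)
        buckets[index] += 1
    return buckets


def compensated_histogram(buckets):
    """Fold the right half of the histogram onto the left half so that the
    expected (triangular) count profile becomes roughly flat."""
    half = len(buckets) // 2
    return [buckets[i] + buckets[len(buckets) - 1 - i] for i in range(half)]


def similarity_measure(buckets):
    """Similarity measure between two files: standard deviation of the
    compensated bucket heights (Chapter 3.4)."""
    return statistics.pstdev(compensated_histogram(buckets))
```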



      Figure 3.3. An illustration showing an uncompensated histogram (triangular region) and its equivalent compensated histogram (rectangular region) used for the similarity calculation.


    3.4 Calculating Similarity Measures Between Files

      We calculate the measure of similarity between two files from the magnitude of the standard deviation of the heights of the compensated histogram described in Chapter 3.3. An example of a calculated similarity measure, along with the texture vectors, similarity offsets, and similar-section offset histograms, is shown in Figure 3.4. The top part describes the files being compared, the weights used in calculating the texture-vector distance, and statistics about the view, including the calculated similarity measure of 334.3535. The middle part shows the two texture-vector patterns, which visually appear identical, along with the center region saturated black with similarity lines. The bottom part shows the similarity histograms, where the similar-section offset histograms have spikes and low points. We will conclude in Chapter 5 that these two files are nearly identical.



      Figure 3.4. Example of a high similarity value in file family iexplore_exe.


    3.5 Tracking Versions of Executable Code

We can also graph a network of relationships between different versions of the same executable. By using the file modification time for the horizontal axis and the calculated similarity measure described in Chapter 3.4 as the vertical axis, we can show the relationships between versions. Files that have a larger similarity measure to the selected file are plotted higher on the vertical axis. Files whose similarity measure is below a user-selectable measure are not plotted. By adjusting the similarity threshold using the SD slider described in Appendix A.2.4, we can remove files with minimal similarity to reveal clusters of files that match with greater similarity. Using this graph, we can make inferences; for example, releases with a similar modification time may be a result of bug fixes or security updates; releases with a smaller similarity measure may have more functional differences or may have added malware. An example of this graph is shown in Figure 5.6.
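
A minimal matplotlib sketch of such a version-tracking plot, under the assumption that the similarity measures to the selected file have already been computed; this is not the Texture-Vector Browser itself, and the names are ours:

```python
import matplotlib.pyplot as plt


def plot_version_similarity(neighbors, threshold=1.0):
    """Plot files similar to a selected file: x = file modification time,
    y = similarity measure to the selected file; files below the
    user-selectable threshold are not plotted.
    `neighbors` is a list of (modification_time, similarity) pairs
    (assumed input format)."""
    kept = [(t, s) for t, s in neighbors if s >= threshold]
    if not kept:
        return
    times, sims = zip(*kept)
    plt.scatter(times, sims)
    plt.xlabel("Modification time")
    plt.ylabel("Similarity to selected file")
    plt.show()
```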




CHAPTER 4:

Preparing the Dataset


The dataset we studied consisted of executable files, texture-vector files, and similarity-graph files.


4.1 Preparing the Dataset of Executable Files

The initial set of files was a sample of executable .exe and .dll files extracted from the Real Data Corpus [20]. The Real Data Corpus consists of “images” (copies) of used disk drives and other devices obtained from non-U.S. countries. The files were extracted using the icat extraction tool from The Sleuth Kit forensics tool, https://forensicswiki.org/wiki/The_Sleuth_Kit. Prof. Rowe picked 23 representative families of executables defined by a file name for each. Since many of the files were faulty, he used a software wrapper that loaded files for each distinct file content (as indicated by its hash code) until the wrapper found a non-faulty copy. Names were changed from the original ones to distinguish files with the same names and different contents. The initial set consisted of 1,386 files. Of these, 162 were excluded because their size was greater than 1 MB and 55 were excluded because their size was less than 1 KB. Of the remaining 1,169 files, 35 were excluded because they were identical based on their MD5 cryptographic hash, leaving 1,134 files in our dataset. Figure 4.1 shows the distribution of file sizes. Note that since all files are from various non-U.S. countries, our collection may exclude important versions of software.


[Figure: “Number of files by file size”; horizontal axis: file size in bytes (log scale, 10^4 to 10^6); vertical axis: number of files (0 to 250).]

Figure 4.1. Histogram of file sizes for our dataset.


The file modification times were extracted by Prof. Rowe using a separate program find_mod_times.py that uses DFXML metadata for the files created using the fiwalk program, https://www.forensicswiki.org/wiki/Fiwalk. We wrote a program set_modtimes.py (see Appendix E.2.3) to set the file timestamps of these files using the MD5 cryptographic hash and timestamp information. We set these timestamps so that the file timestamp information can be captured as metadata when creating texture-vector datasets. The earliest valid modification timestamp value was used for each hashcode. Timestamps before 1979 were considered invalid. The distribution of files by file modification time is shown in Figure 4.2.
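
A minimal sketch of that timestamp-setting step (the actual set_modtimes.py is listed in Appendix E.2.3); it assumes a mapping from MD5 hash to the earliest valid modification time has already been extracted from the DFXML metadata:

```python
import hashlib
import os


def md5_of_file(path):
    """Hex MD5 digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def set_modification_times(paths, mod_time_by_md5):
    """Set each file's modification time from the time recorded for its MD5 hash.
    mod_time_by_md5 maps hex MD5 digests to datetime objects (assumed input format)."""
    for path in paths:
        mod_time = mod_time_by_md5.get(md5_of_file(path))
        if mod_time is not None and mod_time.year >= 1979:  # timestamps before 1979 are invalid
            seconds = mod_time.timestamp()
            os.utime(path, (seconds, seconds))  # set access and modification times together
```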


[Figure: “Number of files by modification time”; horizontal axis: modification time (1980 to 2020); vertical axis: number of files (0 to 160).]

Figure 4.2. Histogram of file modification times for our dataset.


Statistics on the 23 file families that we studied are shown in Table 4.1. This includes source-code family tabulate_drive_data_py, which allows us to compare some versioned source-code files too.

Filenames for executable files in our dataset were assigned by Prof. Rowe to have a country-of-origin prefix followed by a drive code, followed by the absolute path to the file within the drive, followed by the filename, and finally followed by the .tmp suffix. All slashes and spaces are replaced with underscores for convenient storage in a Linux file system. A .tmp suffix is appended so that the file manager does not display them as executable files.


    4.2 Preparing the Texture-Vector Files

      We created the texture-vector .tv files with the sbatch_calc_tv.bash program described in Appendix E.2. Due to the computational burden, we calculated texture vectors on the Naval Postgraduate School (NPS) Hamming supercomputer using sbatch parallel processing.


      Table 4.1. Files by file family. File sizes are in bytes.

      File Family             File count  Min file size  Max file size  Mean file size  Std. dev. of file size
      a0003775_dll                    14           1591         853504        258271.6                318135.5
      bthserv_dll                     37           1067          92160         31455.4                 19509.8
      ccalert_dll                     23         189560         267880        225524.2                 21199.8
      cdfview_dll                    244           1178         409600        144513.2                 39662.1
      dunzip32_dll                    34          11091         149040        114370.9                 26991.3
      hotfix_exe                      33          53248         112912         94098.4                 13263.9
      iexplore_exe                   216           3506         903168        461304.5                277712.7
      mobsync_exe                     80           8192         970752        156818.5                141438.6
      msrdc_dll                        6         159232         194048        174933.3                 15696.5
      nvrshu_dll                      32         151552         262144        240128.0                 33724.4
      pacman_exe                       2         165594         241693        203643.5                 53810.1
      policytool_exe                 104           1224         787508         54764.8                 84605.1
      powerpnt_exe                    19           2310         676112        366290.8                236454.6
      rtinstaller32_exe                4         135168         158312        146740.0                  9843.3
      safrslv_dll                     29           1582          65536         41681.3                 12648.2
      tabulate_drive_data_py          23          18647          47544         34090.3                  7213.7
      typeaheadfind_dll                2          35920          39856         37888.0                  2783.2
      udlaunch_exe                     4         118784         118784        118784.0                     0.0
      vsplugin_dll                     8          65606         118801         88180.2                 15049.3
      webclnt_dll                     80           1261         611328         96930.6                 92513.1
      winprint_dll                     7          12048          44544         29627.4                 13120.4
      wmplayer_exe                   120           2864         520192        142871.3                101072.6
      xrxwiadr_dll                    13           8192         311296        123327.4                 75040.4

      Sbatch is a Slurm workload manager that schedules jobs across multiple processors (see https://slurm.schedmd.com/overview.html). This program runs one job per file. Jobs take varying times to complete because file sizes vary. Computing the texture vectors for the 1,134 files, with a job queue size of 500, took about two minutes.

      We then copied these .tv files to the Texture-Vector Similarity repository, renaming them to their MD5 cryptographic hash value, for access by the Texture-Vector Similarity GUI tool, by running md5copy_500.py (see Appendix E.2.3).


    4.3 Tuning Rejection Thresholds

Similarity is indicated when the square of the L2 distance measure is less than an acceptance threshold, as described in Chapter 3.1.1. We performed our tuning with two arbitrarily selected larger files in the ccalert_dll file family. We began with a default weight of 0.5 for the standard deviation, mean, mode count, and entropy transforms and, after some experimentation, we selected a distance rejection threshold of 5.0 because it resulted in reasonable similarity offsets without an oversaturation of matches. We selected a default weight of 0.0 for the mode because mode values do not quantifiably compare with each other, though an alternative could be to set distances between modes to 0 for identical values and 1 for nonidentical values.

We checked our tuning of weight values by setting all weights to 0.0 and then, one weight at a time, examining the saturation of matched offsets as we adjusted the weight for each texture contribution from 0.0 to 1.0. For each weight adjustment, we observed that the quantity of similarity offsets identified varied as we changed the weight, and that there was a visually understandable quantity of similarity at weight 0.5. Given this, we accepted our weight and rejection-threshold values as our defaults. These defaults are shown in Table 4.2.

Table 4.2. Default texture-vector threshold settings.

Setting              Type       Value
Standard Deviation   Weight     0.5
Mean                 Weight     0.5
Mode                 Weight     0.0
Mode Count           Weight     0.5
Entropy              Weight     0.5
Rejection threshold  Threshold  5.0


4.4 Preparing the Similarity-graph Files

We created the similarity-graph files by running the sbatch_ddiff_tv.bash program as described in Appendix E.2. We calculated the similarity metrics on the NPS Hamming supercomputer using sbatch parallel processing with a job queue size of 700, resulting in a graph of 1,134 nodes and 463,486 edges from which we can create a similarity matrix across all file families. We compared files across file families in order to measure similarity between known dissimilar files. There are 642,411 possible edges, but we dropped 178,925 of them because they had fewer than two similarity matches. This processing took about fifteen hours. Runtime for each file pair varied because file sizes varied.

Node data consists of the node index, filename, file family, file size, file-modification time, and file MD5 hashcode, as described in Appendix A.2.4. Edge data consists of the edge’s source and target file node indexes along with the standard deviation, mean, maximum, and sum similarity metrics described in Chapter 3.4.




CHAPTER 5:

Results


To evaluate the ability of our tools to identify similarities between executable files, we examined the 642,411 texture-vector similarity measures calculated for each pair of the 1,134 files. Of the 642,411 possible comparisons, 463,486 produced nonzero similarity values. Similarity measure values varied from zero to about 300. The distribution of these 463,486 similarity values across all files in our dataset is shown in Figure 5.1. Due to the uneven distribution of these values, a similarity threshold cannot be calculated using a normal (Gaussian) distribution. Most similarity measure values were less than ten, which is where the curve becomes level. This suggests that actual similarity between two files may be indicated when their similarity measure is greater than ten.



[Figure: “File similarity across all files”; horizontal axis: similarity measure (log scale, 10^0 to 10^2); vertical axis: number of similarity matches (0 to 17,500).]

Figure 5.1. Histogram of similarity matches across all files in our dataset.


    5.1 Evaluating Similarities by File Family

      To establish a baseline of similarity measure values for similar files, we calculated the mean similarity measures for files within file families (see Table 5.1). The number of comparisons made within each file family is also shown. These values establish the similarity measures expected within individual file families, giving similarity values under ground truth.


      Table 5.1. Mean similarity and number of comparisons made within file families.

      File Family             Mean similarity  Comparisons within family
      a0003775_dll                        4.5                         72
      bthserv_dll                         3.7                        478
      ccalert_dll                        11.4                        253
      cdfview_dll                        10.0                       1487
      dunzip32_dll                        5.1                        554
      hotfix_exe                          8.5                        115
      iexplore_exe                      130.2                      22311
      mobsync_exe                         6.1                       1808
      msrdc_dll                           4.5                         15
      nvrshu_dll                         32.9                        496
      pacman_exe                          1.5                          1
      policytool_exe                      2.6                        223
      powerpnt_exe                       76.0                        125
      rtinstaller32_exe                  13.4                          6
      safrslv_dll                         3.3                         18
      tabulate_drive_data_py              2.8                        253
      typeaheadfind_dll                   2.3                          1
      vsplugin_dll                        3.2                         24
      webclnt_dll                         3.8                       2247
      winprint_dll                        1.1                         21
      wmplayer_exe                        9.1                       6742
      xrxwiadr_dll                       15.9                         66


    5.2 Evaluating Similarities Across File Families

      We tested whether the similarity measure between files of the same file family was higher than the similarity measure between files in different file families. The confusion matrix for file similarity across all file families in our dataset is in Table 5.2. Rows and columns represent file families using the numbers in the second column. The mean similarity measures between files within file families are typically greater than the mean similarity between files in other file families, showing that our approach for identifying file similarity is useful. We also compared similarity using texture vectors with similarity using Prof. Rowe’s byte analysis, which identifies file similarity by comparing byte values at two-, four-, and eight-byte intervals. This is shown in Figure 5.2. Here we see a trend upward and to the right, indicating that both approaches agree in measuring similarity.




      Figure 5.2. Similarity using texture-vectors vs. similarity using Prof. Rowe’s byte analysis.


      Table 5.2. Mean file similarity between file families. Rows and columns are indexed by the family numbers in the "No." column; the table is split into two blocks of columns.

      Family                  No.    1     2     3     4     5     6      7     8     9    10    11    12
      a0003775_dll             1    4.5   1.2   5.4   2.1   3.8   3.0    3.6   2.8   2.2   3.3   5.1   2.3
      bthserv_dll              2    1.2   3.7   1.2   0.7   0.7   0.7    0.5   0.7   0.6   0.3   0.9   0.6
      ccalert_dll              3    5.4   1.2  11.4   2.5   3.9   3.2    2.2   2.9   3.6   4.3   4.8   2.2
      cdfview_dll              4    2.1   0.7   2.5  10.0   1.3   1.1    1.6   2.2   1.8   0.9   1.7   0.8
      dunzip32_dll             5    3.8   0.7   3.9   1.3   5.1   2.3    4.0   2.1   2.0   3.4   3.6   1.6
      hotfix_exe               6    3.0   0.7   3.2   1.1   2.3   8.5    1.3   2.0   1.4   3.8   3.1   1.6
      iexplore_exe             7    3.6   0.5   2.2   1.6   4.0   1.3  130.2   9.3   1.6   2.4   1.5   7.5
      mobsync_exe              8    2.8   0.7   2.9   2.2   2.1   2.0    9.3   6.1   1.5   2.3   2.7   1.6
      msrdc_dll                9    2.2   0.6   3.6   1.8   2.0   1.4    1.6   1.5   4.5   1.4   2.0   0.9
      nvrshu_dll              10    3.3   0.3   4.3   0.9   3.4   3.8    2.4   2.3   1.4  32.9   6.2   2.1
      pacman_exe              11    5.1   0.9   4.8   1.7   3.6   3.1    1.5   2.7   2.0   6.2   1.5   2.2
      policytool_exe          12    2.3   0.6   2.2   0.8   1.6   1.6    7.5   1.6   0.9   2.1   2.2   2.6
      powerpnt_exe            13    3.5   0.4   2.2   1.1   3.1   1.5   41.2   5.8   1.4   2.6   2.2   4.6
      rtinstaller32_exe       14    3.4   0.9   4.1   2.0   3.6   2.0    1.6   2.3   2.2   2.3   2.8   1.2
      safrslv_dll             15    1.9   0.9   2.2   1.1   1.2   1.6    1.1   1.1   0.7   2.0   2.0   1.0
      tabulate_drive_data_py  16    0.1   0.1   0.1   0.1   0.1   0.1    0.2   0.1   -     0.1   -     0.3
      typeaheadfind_dll       17    0.9   0.6   1.3   0.7   0.4   0.3    0.4   0.5   0.7   0.1   0.7   0.5
      udlaunch_exe            18    2.9   0.4   3.3   1.1   2.5   -      1.3   1.7   1.8   3.3   3.0   -
      vsplugin_dll            19    3.0   0.6   3.4   1.0   2.0   2.5    4.0   1.8   1.2   3.2   3.0   1.6
      webclnt_dll             20    3.3   1.0   3.6   1.1   2.3   1.3    1.8   1.8   1.5   2.2   2.8   1.0
      winprint_dll            21    0.8   0.5   0.9   0.4   0.6   0.5    0.4   0.5   0.5   0.4   0.6   0.6
      wmplayer_exe            22    3.1   0.4   3.1   0.9   2.4   2.0   21.7   3.6   1.3   3.0   2.8   2.4
      xrxwiadr_dll            23   11.5   0.8  12.1   2.5   9.2   4.1    3.2   4.6   3.3  12.9  13.0   3.8

      Family                  No.   13    14    15    16    17    18    19    20    21    22    23
      a0003775_dll             1    3.5   3.4   1.9   0.1   0.9   2.9   3.0   3.3   0.8   3.1  11.5
      bthserv_dll              2    0.4   0.9   0.9   0.1   0.6   0.4   0.6   1.0   0.5   0.4   0.8
      ccalert_dll              3    2.2   4.1   2.2   0.1   1.3   3.3   3.4   3.6   0.9   3.1  12.1
      cdfview_dll              4    1.1   2.0   1.1   0.1   0.7   1.1   1.0   1.1   0.4   0.9   2.5
      dunzip32_dll             5    3.1   3.6   1.2   0.1   0.4   2.5   2.0   2.3   0.6   2.4   9.2
      hotfix_exe               6    1.5   2.0   1.6   0.1   0.3   -     2.5   1.3   0.5   2.0   4.1
      iexplore_exe             7   41.2   1.6   1.1   0.2   0.4   1.3   4.0   1.8   0.4  21.7   3.2
      mobsync_exe              8    5.8   2.3   1.1   0.1   0.5   1.7   1.8   1.8   0.5   3.6   4.6
      msrdc_dll                9    1.4   2.2   0.7   -     0.7   1.8   1.2   1.5   0.5   1.3   3.3
      nvrshu_dll              10    2.6   2.3   2.0   0.1   0.1   3.3   3.2   2.2   0.4   3.0  12.9
      pacman_exe              11    2.2   2.8   2.0   -     0.7   3.0   3.0   2.8   0.6   2.8  13.0
      policytool_exe          12    4.6   1.2   1.0   0.3   0.5   -     1.6   1.0   0.6   2.4   3.8
      powerpnt_exe            13   76.0   1.5   1.0   0.2   0.2   1.5   2.8   1.7   0.3  12.6   8.2
      rtinstaller32_exe       14    1.5  13.4   1.1   0.1   0.4   3.1   1.9   2.0   0.6   1.7   6.3
      safrslv_dll             15    1.0   1.1   3.3   0.1   0.8   -     1.4   1.2   0.6   1.1   2.6
      tabulate_drive_data_py  16    0.2   0.1   0.1   2.8   0.1   -     0.2   0.2   -     0.1   0.3
      typeaheadfind_dll       17    0.2   0.4   0.8   0.1   2.3   0.2   0.5   0.8   0.4   0.2   0.6
      udlaunch_exe            18    1.5   3.1   -     -     0.2   -     2.1   0.9   0.5   2.2   3.6
      vsplugin_dll            19    2.8   1.9   1.4   0.2   0.5   2.1   3.2   1.7   0.5   2.7   3.5
      webclnt_dll             20    1.7   2.0   1.2   0.2   0.8   0.9   1.7   3.8   0.7   1.5   5.0
      winprint_dll            21    0.3   0.6   0.6   -     0.4   0.5   0.5   0.7   1.1   0.4   0.6
      wmplayer_exe            22   12.6   1.7   1.1   0.1   0.2   2.2   2.7   1.5   0.4   9.1   5.7
      xrxwiadr_dll            23    8.2   6.3   2.6   0.3   0.6   3.6   3.5   5.0   0.6   5.7  15.9


      Although the greatest average similarity for a given file family is usually within that file family, there are exceptions, as between file families a0003775_dll and xrxwiadr_dll. This inconsistency could be due to differences in file size or to other attributes within the files in these two file groups. An example similarity analysis plot illustrating the problem is Figure 5.3. Ranges of homogeneous texture vectors contain similar low mode counts and moderately high entropy values, suggesting that our similarity measure is primarily attributable to regions of compressed data rather than similarity in code. The few similarity matches in other regions suggest that there is actually little similarity between these two files.



      Figure 5.3. False-positive similarity between two files caused by homogeneous compressed data.


      As seen in Table 5.1, the mean similarity between files within a file family varies greatly by file family. For example, mean similarity within the iexplore_exe family is 130.2. An example comparison of two very similar files was shown in Figure 3.4, where the histogram shows regions of low similarity and regions of high similarity, resulting in the high calculated similarity value. Mean similarity within the winprint_dll file family is 1.1. An example comparison of two files within this family is shown in Figure 5.4. The histogram shows a fairly even dispersion of similarity, with no offset in particular matching more than other offsets.



      Figure 5.4. Example of a low similarity value in file family winprint_dll.


      Average similarity measures between files across file families also vary greatly, as shown in Table 5.2. Average similarity between files of different file families tends to be high when average similarity within those file families is high, for example between iexplore_exe and powerpnt_exe, which measures 76.0.


    5.3 Examining Similarity using the Texture-Vector Browser GUI Tool

      Our Texture-Vector Browser GUI tool can examine trends in file similarity based on file creation times and file-similarity measures. Figure 5.6 shows an example. The horizontal axis is the file modification time. This can be the time the file was created if it was never modified, or the time it was modified by an update or by contamination with a virus. The vertical axis is the measure of similarity between the file the user selects and the other files in the view, which, if the Stay in group mode is selected, will be files within its family. Files higher up on the vertical axis are more similar to the selected file than files lower down, where the similarity measure, as described in Chapter 3.4, is the value on the vertical axis. By clicking on a node, the focus of the view changes to show the similarities between the file associated with the clicked node and other files. By clicking on an edge, the view shows the similarity graph involving the two files associated with the edge.

      Using the node listing capability described in Appendix A.2.4 and by sorting the list by file group and modification time, we find and select the file in the ccalert_dll file group with the latest timestamp, as shown in Figure 5.5.



      Figure 5.5. Sorted node listing with node 326 selected.


      In our dataset, this file is named AE10-1158_Program_Files_Norton_AntiVirus_Engine_18.5.0.125_ccalert.dll.tmp, indicating that it is on drive AE10-1158 from the United Arab Emirates. It is indexed in our similarity graph dataset as node 326 (in green). The file naming convention is explained in Chapter 4.1. This graph shows node 326 and its similar neighbors and similar edges, where the similarity measure, described in Chapter 3.4, is 1.0 or more. The horizontal axis is the file modification time and the vertical axis is the relative similarity between file (node) 326 and the other files, as described in Appendix A.2.4.



      Figure 5.6. Files (nodes) and similarity measures (edges) associated with file node 326 showing modification times and similarity to node 326.


      There are two clusters of similarity. One cluster of size 20 spans from about year 2004 to 2010 with a similarity measure that increases in time from about five to ten. The other cluster of size three is dated near 2010 and has a similarity measure to the selected file of about 25.

      Files Program Files/Norton AntiVirus Engine 17.0.136 ccAlert.dll on drive AE10-1160 and Program Files/Norton AntiVirus Engine 17.8.0.5 on drive AE10-1147, which are the two yellow dots at the top of the figure, have a significantly greater similarity of 25 than the other nodes that meet the similarity threshold. The two most recent of the less similar nodes, nodes 310 and 309, indicate AntiVirus Engine 16.8.0.41 and 16.0.0.125, so apparently version 16 was quite a bit different from version 17. The older files in this family are less similar and indicate a different versioning scheme or do not indicate a version number.

      Figure 5.7 shows the analysis of the edge that connects nodes 326 and 312, corresponding to Program Files/Norton AntiVirus Engine 18.5.0.125_ccalert.dll on drive AE10-1158 and Program Files Norton AntiVirus Engine 17.0.136_ccAlert.dll on drive AE10-1160. This display was obtained in the GUI by clicking on the edge shown in Figure 5.6 that connects these two files. The texture vector patterns appear very similar, and the similarity histogram spikes with a similarity count of nearly 370 near file offset 0, a large number, indicating that these two files are similar. We can click on any of the yellow dots in the GUI to select the corresponding file and compare other files against it.



      Figure 5.7. A detailed comparison of files 312 and 326 showing a high degree of similarity.


      Figure 5.8 shows similarity of files within the powerpnt_exe file family to the Powerpoint file with the most recent timestamp in the dataset, file (node) 295. Not all files in this file family have version numbers in their names. By hovering the cursor over yellow dots representing files similar to node 295, we see that files with a similarity measure of over 100 after year 2005 correspond to Microsoft Office 12, while the less similar file (node) 303 has a similarity measure of about one near year 2003 and is labeled Microsoft Office 10.



      Figure 5.8. Files similar to the latest Microsoft Office file in file family powerpnt_exe.


      Comparing nodes 326 and 310, which correspond to Norton AntiVirus Engine 18.5.0.125 and 16.8.0.41, we get the texture-vector graph shown in Figure 5.9. Here, there is more variance in the file offset, but the similarity frequency spikes to about 72, indicating that there is significant similarity. We also see more variation in the texture vector pattern, and the newer version is slightly larger in size, about 220 KB instead of 210 KB. By inspecting general changes in the five texture patterns, it appears that the additional 10 KB is inserted within the first 150 KB of the file.



      Figure 5.9. Comparison of files 326 and 310.


      5.3.1 Composition Analysis

        By looking at the five bands in the texture-vector diagram, we can make inferences about the regions of the executable code files being compared, in particular the locations of header, code, and data sections. In Figure 5.7, for the first two textures, covering the first 1,000 bytes, the standard deviation, mean, mode, and entropy values are lower than the values in other regions, while the mode count is higher. We infer that this represents a header, and the transition in the texture represents a transition to another type of content. The region from approximately byte 1,000 to byte 160,000 contains relatively medium values of the standard deviation, mean, and entropy, mode values that are either very high or very low, and consistently low mode counts. We infer that this is the code section. The third region, from approximately byte 160,000 through to the end at byte 219,512, usually has a low mode value, while values in the other four statistics vary but are consistent between the two files. We infer that this is a region of data mostly unchanged between versions. We also infer that the additional 10 KB added in the newer version was new code.


      5.3.2 Progressive Time Similarity

        Software files tend to be most similar to the previous version. Figure 5.10 shows an example for the nvrshu_dll file family. Here, the file with the latest timestamp, WINDOWS system32 nvrshu.dll from the MY01-023 drive from Malaysia, is selected. We see sporadic measures of similarity between 10 and 30 for files before year 2005, but for files after 2005, we see a gradual increase in similarity over time from about 40 to 61.



        Figure 5.10. Similarity increases as versions approach the latest version.


      5.3.3 Version Analysis

        With these diagrams, we can study the origin and evolution of versions of files. Although an original file should have the earliest file creation time, file creation times can be modified inadvertently or maliciously. Another clue is that the original file often has the least amount of code. Node 326 in Figure 5.6, file Program Files/Common Files/Symantex/Shared ccAlert.dll.tmp from drive PA002-049 from Panama, is likely the original file in its group because its file modification time is earliest and its similarity to the latest files decreases over time.

        A newer version of code that introduces new features is likely to contain more code than the version before it as in Figure 5.9. A newer version that is only a bug fix will be similar in size to the version before it and will have similar texture-vector patterns as in Figure 5.7.

        Files released at approximately the same time may be targeted at different operating-system platforms or different feature sets. For example, 13 files in the webclnt_dll file family were released over two days, 2006-01-03 and 2006-01-04. This is too clustered to be a response to new functionality or bug fixes. These files could be a response to a virus, because some of their file sizes are the same and their texture-vector patterns appear identical. However, bear in mind that our sample is incomplete and important versions of software may be missing.


    4. Examining Similarity using Gephi

Although the Texture-Vector Browser GUI tool was specifically designed for examining network graphs created from the dataset of similarity-graph files, graph analytics can also be done with popular open-source tools such as the Gephi graph-visualization tool. Steps for working with similarity-graph data using Gephi are presented in Appendix D.
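
As a minimal sketch of what Gephi accepts, the fragment below writes node and edge tables as CSV files that Gephi's Data Laboratory can import; the Id, Label, Source, Target, and Weight column names follow Gephi's import conventions, while the node and edge values shown are illustrative and the toolset's actual CSV layout is described in Appendix D.

import csv

nodes = [(295, "powerpnt.exe, Office 12"), (303, "powerpnt.exe, Office 10")]  # illustrative
edges = [(303, 295, 1.0)]                                                     # illustrative

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    writer.writerows(nodes)

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)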




CHAPTER 6:

Conclusions and Future Work


    1. Conclusions

      This thesis proposed applying a vector of transforms to executable code to create texture-vector data, and then using analytics to identify similarities between executable files. We tested a sample of executable code files with our methods. Our experiments showed that files within file families had greater average similarity than files across file families. We found that the visual patterns in the texture vectors were effective in identifying similar regions in two files, as well as sections that may be compressed.


    2. Future Work

      This work used texture vectors calculated from a section size of 500 bytes. A larger section size might reveal similarity across a larger span of data, equivalent to applying a low-pass filter to texture-vector values. A section size that is a power of two, or that is aligned to the size of fixed-size data structures, might naturally align better with the section boundaries from which texture vectors are calculated.
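
      As a minimal sketch of the low-pass analogy, smoothing the per-section values with a moving average approximates the effect of a larger section size; the window length and values below are illustrative.

import numpy as np

section_means = np.array([3.0, 110.0, 112.0, 108.0, 90.0, 15.0, 12.0])  # illustrative values
window = 4                                    # 4 sections of 500 bytes approximate a 2,000-byte section
kernel = np.ones(window) / window
smoothed = np.convolve(section_means, kernel, mode="valid")
print(smoothed)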

      Texture vectors may be useful for classifying file types or detecting types of data embedded within a file. Further work in this direction might consist of defining data patterns that map to particular data types.

      The open-source tool Gephi offers many capabilities such as filtering and neighbor analytics that can be used to augment the similarity analytics provided by our tool. Future work might use it to obtain additional insight about file similarity.




      APPENDIX A:

      The Texture-Vector Similarity Toolset


      The Texture-Vector Similarity toolset bundles the previously mentioned features to provide a texture-vector approach for identifying similarities between files. While created for analyzing similarity between executable files, it can identify similarities in other file types. The Texture-Vector Similarity distribution, which bundles the toolset with sample data and other analytics tools, provides the following:

      • The calc_tv.py tool for calculating texture-vector files

      • The tv.py tool for calculating similarity metrics between two texture-vector files

      • The tv_browser.py tool for examining the similarity-graph dataset

      • Miscellaneous programs for organizing the dataset and calculating statistics from it

      • The texture-vector and similarity-graph dataset

The distribution consists of approximately 3,300 lines of code in 65 files. It is primarily written in Python and uses the Qt 5 GUI widget toolkit for its graphical interface. Usage for these tools is presented in Appendix A.2, and source code for these tools is presented in Appendix E. Texture-vector files are described in Appendix C, and similarity-graph files are described in Appendix D.

Users interested in examining similarity between files that are not included in our dataset are encouraged to do so by running the calc_tv.py and tv.py tools directly.
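
For readers who want to see the shape of the computation before running the tools, the following minimal sketch (not the calc_tv.py implementation; the function name and input file are illustrative) computes the five texture-vector statistics over 500-byte sections using numpy.

from collections import Counter
import numpy as np

def texture_vectors(data: bytes, section_size: int = 500):
    vectors = []
    for start in range(0, len(data), section_size):
        section = np.frombuffer(data[start:start + section_size], dtype=np.uint8)
        counts = Counter(section.tolist())
        mode_value, mode_count = counts.most_common(1)[0]
        probabilities = np.array(list(counts.values()), dtype=float) / len(section)
        entropy = float(-(probabilities * np.log2(probabilities)).sum())
        vectors.append((float(section.std()), float(section.mean()),
                        int(mode_value), int(mode_count), entropy))
    return vectors

with open("example.dll", "rb") as f:            # hypothetical input file
    print(texture_vectors(f.read())[:3])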

Users who wish to analyze texture-vector files with their own tools can use the Texture-Vector Generator tool to create files in JSON format describing the file metadata, the section size used, the texture-vector labels, and the texture vectors as described in Appendix C.


    1. Download

      The Texture-Vector Similarity toolset and requisite texture-vector datasets are publicly available on the GitHub repository at https://github.com/NPS-DEEP/tv_sim. Clone or download the Texture-Vector Similarity toolset from this site. For license information, please see the COPYING file in this repository or refer to Appendix E.4.


      The repository includes the following:

      • The Texture-Vector Similarity toolset.

      • .tv Texture Vector files calculated from Windows .exe and .dll executable code using default settings.

      • Node and Edge graph data.

      • Miscellaneous Python code used for generating .tv and graph data.

        The repository does not include any Windows .exe and .dll executable code from which the .tv files were generated.

        The following Linux example clones the Texture-Vector Similarity toolset into the gits/ subdirectory under your home path:

        • mkdir ~/gits

        • cd ~/gits

        • git clone https://github.com/NPS-DEEP/tv_sim

          If you are a Windows user, you may prefer to download the ZIP file from https://github.com/NPS-DEEP/tv_sim and extract it into a directory of your choosing.

          These tools require Python3, numpy, scipy, and PyQt5.

          • Windows users: To see if Python3 is present, open a command window and type python and look for Python3 in the response. Once Python is installed, open a command window and type:

            • python -m pip install PyQt5 numpy scipy

          • Mac/Linux users: To see if Python3 is present, open a terminal and type python3 and look for Python3 in the response. Once Python is installed, open a terminal and type:

            • python3 -m pip install PyQt5 numpy scipy

    2. Usage

All tools in the Texture-Vector Similarity toolset are in the python subdirectory. For example, if you installed the toolset under ~/gits, the tools will be at ~/gits/tv_sim/python.


Change to the python subdirectory so that the tools may be run directly from it.

Program sbatch_ddiff_tv.bash runs on the Slurm workload manager and executes sbatch_ddiff_tv.py on multiple processors. Program sbatch_ddiff_tv.py calculates difference metrics for a given input file, using function similarity_math from file similarity_math.py to calculate difference metrics between two files.

Output from each batch job is directed to a data file associated with that batch job. Each data file consists of similarity-metric data formatted as CSV. Entries with a zero similarity measure are skipped. The result of the run is one CSV file for nodes and many CSV files for edges. When the run completes, we collate the edge files into one and make sure the edge titles are at the top of the file so that it is suitable as input to Gephi.
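
The collation step could look like the following minimal sketch; the edges_*.csv naming pattern and output file name are assumptions, not the names used by the batch scripts.

import glob

edge_files = sorted(glob.glob("edges_*.csv"))   # assumed per-job output file names
with open("all_edges.csv", "w") as out:
    header_written = False
    for path in edge_files:
        with open(path) as f:
            header = f.readline()               # line of edge titles
            if not header_written:
                out.write(header)
                header_written = True
            for line in f:
                out.write(line)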

The syntax and composition of the node and edge CSV files are described in Appendix D.


      1. Data Preparation

        Tools for preparing large sets of files and for assisting in validating correctness are in the sbatch_prep/ directory.


    1. Source Code for Statistical Analysis

      Source code for various statistical analyses is available in the statistics/ directory.


    2. Source Code License

All code is provided with the following notice:


The software provided here is released by the Naval Postgraduate School, an agency of the U.S. Department of Navy. The software bears no warranty, either expressed or implied. NPS does not assume legal liability nor responsibility for a User’s use of the software or the results of such use.


Please note that within the United States, copyright protection, under Section 105 of the United States Code, Title 17, is not available for any work of the United States Government and/or for any works created by United States Government employees. User acknowledges that this software contains work which was created by NPS government employees and is therefore in the public domain and not subject to copyright.




Initial Distribution List


  1. Defense Technical Information Center Ft. Belvoir, Virginia


  2. Dudley Knox Library Naval Postgraduate School Monterey, California