Data Deception in Stream Processing

Andrew J. Adams, Neil C. Rowe, and Arijit Das

U.S. Naval Postgraduate School

This paper appeared in the 15th International Conference on Cyber Warfare and Security, Munich, Germany, July 2016.

1. Introduction

Adversaries in cyberwarfare may be able to eavesdrop on a military's data to gain a tactical advantage. This work investigated the automated creation of deceptive data to fool eavesdropping adversaries so that they obtain faulty information. Such "fake data" (Rowe and Rrushi, 2016) could impede their planning as well as make them more cautious about eavesdropping in the future.

One good opportunity for eavesdropping is to intercept streams of data being reported by sensors. Apache Storm is a popular stream-processing package, so this work investigated harnessing it for deception, specifically the real-time deceptive modification of stream data.

2. Apache Storm

Apache Storm is an open-source, distributed computation system that processes unbounded streams of data in real time (The Apache Software Foundation, 2014). The Apache Software Foundation promotes the system as scalable, fault-tolerant, reliable, easy to set up and operate, and very fast. Its website states that Storm can process over a million tuples per second per node. Storm functions by delegating task responsibility among several distributed components. It found initial prominence in assessing continuous social Internet content and dynamically adjusting advertising approaches aimed at users in real time (Chardonnens et al., 2013). Because Storm uses distributed resources, it scales to the diverse and fast-paced cloud environment.

The core of Storm operations is a set of programs that interact dynamically with streamed information. The base topology is built with "spouts" and "bolts." Spouts connect to raw data sources, convert data into key-value tuples, and emit them as an unbounded, continuous stream (Goetz and O'Neill, 2014). Bolts receive streams from spouts, process the data, and emit one or more streams as output. Each bolt and spout can be implemented with a Java class.
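The spout-to-bolt data flow described above can be modeled in a few lines of plain Java. This is only a sketch of the idea: the class name, the "unit,location" line format, and the placeholder values are assumptions made for illustration, and none of these types belong to Storm's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Plain-Java model of the spout/bolt idea. These types are hypothetical
// stand-ins and are NOT Storm's API.
class SpoutBoltSketch {

    // "Spout": converts raw "unit,location" lines (an assumed format)
    // into key-value tuples, one per line.
    static Iterator<Map<String, String>> spout(List<String> rawLines) {
        final Iterator<String> lines = rawLines.iterator();
        return new Iterator<Map<String, String>>() {
            @Override public boolean hasNext() { return lines.hasNext(); }
            @Override public Map<String, String> next() {
                String[] parts = lines.next().split(",", 2);
                Map<String, String> tuple = new LinkedHashMap<>();
                tuple.put("unit", parts[0]);
                tuple.put("location", parts.length > 1 ? parts[1] : "");
                return tuple;
            }
        };
    }

    // Pulls every tuple from the spout, passes it through the "bolt",
    // and collects the bolt's output.
    static List<Map<String, String>> run(List<String> rawLines,
                                         UnaryOperator<Map<String, String>> bolt) {
        List<Map<String, String>> out = new ArrayList<>();
        Iterator<Map<String, String>> stream = spout(rawLines);
        while (stream.hasNext()) {
            out.add(bolt.apply(stream.next()));
        }
        return out;
    }

    public static void main(String[] args) {
        // A trivial bolt that overwrites the location field with a placeholder.
        UnaryOperator<Map<String, String>> decoyBolt = tuple -> {
            Map<String, String> copy = new LinkedHashMap<>(tuple);
            copy.put("location", "Port Bravo");
            return copy;
        };
        for (Map<String, String> t : run(Arrays.asList("Unit A,Port Alpha"), decoyBolt)) {
            System.out.println(t);
        }
    }
}
```

In a real topology the stream is unbounded and the bolts run in parallel across worker nodes; the loop here only illustrates the order in which data moves.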

3. Experiments

In our experiments, Storm topologies were used to automatically create false data points for simulated naval data. We examined four original Storm topologies that make small, random alterations to the naval unit and location identified. Experimental topologies were built with Apache Incubator's Storm-Starter Project, an open-source tool. The Topology Main class WordCountTopology.java determines data flow through subordinate spout and bolt classes as shown below.

Each spout and bolt was created as an individual Java class, and the classes were combined using the Topology's Main class. For the experiment, the file input for Coordinate-Replacer included string values of objects within the coordinate predicate. The files for Location-Replacer included string values of objects within the location predicate. The files for Unit-Replacer and Hull-Replacer both included string values of objects within the unit predicate. To simulate streamed data, eight sizes of each predicate-associated file were created. They varied in size from five to 1,000,000 strings. Larger files were populated by repeatedly copying smaller sets of strings. Experimental input details are shown in the table below.
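Populating a larger input by repeatedly copying a smaller set of strings could look like the following sketch. The class name, method name, and seed strings are hypothetical, and only the two endpoint sizes given in the text (five and 1,000,000) are used; writing the result to a file is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for building simulated stream input of a chosen size.
class InputFileSketch {

    // Builds a list of exactly `size` strings by cycling through `seed`.
    static List<String> repeatToSize(List<String> seed, int size) {
        if (seed.isEmpty()) {
            throw new IllegalArgumentException("seed must not be empty");
        }
        List<String> out = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            out.add(seed.get(i % seed.size()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder strings; the experiment's actual values are not reproduced here.
        List<String> seed = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        for (int size : new int[] {5, 1_000_000}) {
            System.out.println(repeatToSize(seed, size).size());
        }
    }
}
```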

We first developed standalone Java programs that replace coordinates, locations, units, and hull numbers. The table below shows the maximum and minimum delta for CoordinateReplacer.class, that is, the largest and smallest alterations that this bolt can induce. The maximum distance is created when replacement number 99 is inserted into the 3rd and 4th places of latitude and longitude. The minimum distance is created when 01 is chosen for the 6th and 7th places of either latitude or longitude.
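The digit replacement described above can be sketched as follows. This is an illustration under assumptions rather than the CoordinateReplacer code itself: it assumes a coordinate arrives as a string whose 3rd through 7th characters are digits, that the replacement is a two-digit number from 01 to 99 written over two adjacent places, and that the place is chosen uniformly at random. The class and method names and the sample string are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of two-digit coordinate replacement.
class CoordinateReplacerSketch {

    // 1-based character places that may be altered: 3rd through 7th.
    static final int FIRST_PLACE = 3;
    static final int LAST_PLACE = 7;

    // Overwrites the characters at 1-based places `place` and `place + 1`
    // with `replacement` written as two digits (01..99).
    // Does not verify that the overwritten characters were digits.
    static String replacePair(String coord, int place, int replacement) {
        if (coord.length() < LAST_PLACE) {
            throw new IllegalArgumentException("coordinate string too short");
        }
        if (place < FIRST_PLACE || place + 1 > LAST_PLACE) {
            throw new IllegalArgumentException("place must be 3..6");
        }
        if (replacement < 1 || replacement > 99) {
            throw new IllegalArgumentException("replacement must be 1..99");
        }
        int i = place - 1; // convert to 0-based index
        String twoDigits = (replacement < 10 ? "0" : "") + replacement;
        return coord.substring(0, i) + twoDigits + coord.substring(i + 2);
    }

    // Picks a random place (3..6) and a random replacement (01..99).
    static String randomize(String coord, Random rng) {
        int place = FIRST_PLACE + rng.nextInt(LAST_PLACE - FIRST_PLACE);
        int replacement = 1 + rng.nextInt(99);
        return replacePair(coord, place, replacement);
    }

    public static void main(String[] args) {
        String sample = "28123456"; // made-up digit string, not real data
        System.out.println(replacePair(sample, 3, 99)); // alters 3rd and 4th places
        System.out.println(replacePair(sample, 6, 1));  // alters 6th and 7th places
        System.out.println(randomize(sample, new Random()));
    }
}
```

Replacing high-order places moves the position farther than replacing low-order places, which is why the choice of place governs the size of the delta.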

The LocationReplacer class manipulates the name of the port in which a vessel is located. The intent of location manipulation was to affect an adversary's knowledge of the order of battle rather than exact GPS locations. Randomly chosen alternative locations replace the true location names passed within a tuple. We use the Random class to select a random value from a pre-populated array of alternative location strings. The UnitReplacer class has the same structure as LocationReplacer.class, with minor changes to the hard-coded values. Rather than changing a location, it changes the vessel being identified. For a more subtle effect, the HullReplacer class alters only the hull number of a unit.
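Selecting a replacement from a pre-populated array with the Random class might look like the sketch below. The class name and port names are placeholders, and redrawing when the pick equals the true location is an added safeguard assumed here, not something the description above states.

```java
import java.util.Random;

// Hypothetical sketch of random replacement from a pre-populated array.
class LocationReplacerSketch {

    private final String[] alternatives;
    private final Random rng;

    LocationReplacerSketch(String[] alternatives, Random rng) {
        // At least two distinct values are required so that a pick that
        // differs from any given true location always exists.
        boolean distinct = false;
        for (String a : alternatives) {
            if (!a.equals(alternatives[0])) {
                distinct = true;
            }
        }
        if (!distinct) {
            throw new IllegalArgumentException("need at least two distinct alternatives");
        }
        this.alternatives = alternatives.clone();
        this.rng = rng;
    }

    // Returns a randomly chosen alternative that differs from the true location
    // (the redraw is an assumption of this sketch).
    String replace(String trueLocation) {
        String pick;
        do {
            pick = alternatives[rng.nextInt(alternatives.length)];
        } while (pick.equals(trueLocation));
        return pick;
    }

    public static void main(String[] args) {
        String[] ports = {"Port Alpha", "Port Bravo", "Port Charlie"}; // placeholders
        LocationReplacerSketch replacer = new LocationReplacerSketch(ports, new Random());
        System.out.println(replacer.replace("Port Alpha"));
    }
}
```

A unit replacer of the same shape would differ only in the contents of the array.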

The Storm-Starter Project simplified the integration of deceptive content since, with the exception of the new replacer bolts, all project components required only minor edits to meet the new objectives. We used a simulated Storm cluster built at our school (Pontius, 2014) for remote-mode topology execution. It consisted of several virtual machines running Linux Mint OS: a master machine ran Nimbus, ZooKeeper, and the Storm user interface, and three Worker virtual machines ran Supervisor. One virtual machine acted as a test control node for topology submission and user-interface viewing.

The Storm-Starter Project and Maven simplified local-mode topology execution. Once all classes were finished, standard Maven commands were used to compile and execute the topology projects. A single Ubuntu machine simulated all dependencies and interactions of a topology's traversal of a cluster.

4. Results

We performed 24 test runs on each deception topology using varying amounts of string input: each topology ingested the eight input files three times. Upon completion of topology execution over the entire file, Maven displays build success, total time for completion to the millisecond, GMT time finished, and total memory used. The total time for completion of each run was used for speed comparisons. Increased string input did not greatly affect completion time. Regardless of whether the topology made five deceptive replacements or 1,000,000, all executions fell between 15 and 30 seconds. Execution times are displayed in the figures below. The file size in strings is listed along the X axis. The first, second, and third executions for each file size are represented by a diamond, square, and triangle, positioned by time to finish. The average time of the three executions for a file is represented by the graphed line.

Average run times for all sizes of input were very similar because Storm divides operations into individual tasks and runs them in parallel. If the amount of streamed input is increased, the Master simply assigns responsibility to more Worker nodes. Most of the execution time is due to Storm's startup overhead and does not increase significantly with more data on which to operate. A 20-second operation is not ideal when a sub-second effect on a single data point is needed. However, it is efficient when amortized over large data sets: 20 seconds for 1,000,000 strings is about 20 microseconds per string. The physical computer resources used for simulation or intricacies in the Maven execution process may explain the outliers encountered. Speed testing thus showed that Storm-based deception topologies can maintain fast operations largely independent of the size of the data set used. It appears that a topology could execute deceptive alterations on real-time data at the same speed at which an adversary is stealing it.

After fully analyzing the topologies in local mode, we packaged Coordinate-Replacer for use on Pontius' virtual Storm cluster by making a few code changes in the TopologyMain.java class. The Master machine distributed execution responsibility to one of the three workers. Associated distribution and timing diagnostics from the Storm user interface are shown below, including all active topologies on the cluster and the machines responsible for their tasks. Although the operations were not physically separated, this method of testing more closely mirrored an implementable Storm cluster.

[Figure: screenshot of the Storm user interface (uilongerrun.JPG)]

The CoordinateReplacer.class bolt successfully made small random changes. Altering two digits between the 3rd and 7th characters of latitude and longitude changed the position enough to affect tactics but not enough to be unrealistic. Because the changes were unpredictable, there was no simple deception pattern to detect. The figure below shows positions generated near Midway Island in the Pacific Ocean. It is important never to replace all four altered points with the original values, and never to replace an at-sea location with one on land, which would expose the presence of deception.
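One way to enforce the two constraints just mentioned is to redraw candidates until one passes both checks. The sketch below assumes that a position is carried as a single string, that "replacing with the original values" means the candidate equals the original string, and that some land test is available; the class name, the candidate source, the land predicate, and the retry limit are all hypothetical.

```java
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical guard that rejects unusable deceptive positions.
class DeceptionGuardSketch {

    // Draws candidates until one differs from the original and is not on land.
    // Returns an empty Optional if no acceptable candidate appears within maxTries.
    static Optional<String> safeReplacement(String original,
                                            Supplier<String> candidates,
                                            Predicate<String> isOnLand,
                                            int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            String candidate = candidates.get();
            if (!candidate.equals(original) && !isOnLand.test(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Placeholder candidate source and land test, for illustration only.
        String[] pool = {"POS-1", "POS-LAND", "POS-2"};
        int[] next = {0};
        Optional<String> result = safeReplacement(
                "POS-1",
                () -> pool[next[0]++ % pool.length],
                p -> p.endsWith("LAND"),
                10);
        System.out.println(result.orElse("no acceptable replacement"));
    }
}
```

Returning an empty result leaves the caller to decide what to do when no acceptable position is found, for example dropping the tuple.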

5. Conclusions

Modifications of data qualify as "ruses" in military tactics. We have shown that they are easy to carry out on stream data using Apache Storm. Stream data is increasingly important to militaries as the volume of sensor data continues to grow while storage capacity does not keep pace. Location data, a key subject of this work, is essential for targeting. Thus these techniques will become increasingly important in the future.

Disclaimer

The views expressed are those of the authors and do not represent the U.S. Government.

6. References

The Apache Software Foundation. (2014). Apache Storm. [Online]. Available: https://storm.apache.org.

T. Chardonnens, et al., "Big data analytics on high velocity streams: A case study," in 2013 IEEE Int. Conf. on Big Data, Silicon Valley, CA, 2013, pp. 784–787.

P. Goetz and B. O'Neill, Storm Blueprints: Patterns for Distributed Real-Time Computation, 1st Ed. Birmingham, UK: Packt Publishing, 2014.

B. Pontius, "Information security considerations for applications using Apache Accumulo," M.S. thesis, Dept. Comp. Sci., Naval Postgraduate School, Monterey, CA, 2014.

N. Rowe and J. Rrushi, Introduction to Cyberdeception. New York: Springer, 2016.

[Photo: Prof. Rowe discussing this poster paper at the conference.]