Automated Retrieval of Security Statistics from the World Wide Web


Michael McVicker, Paul Avellino, and Neil C. Rowe



Extended Abstract


Many statistics pertaining to information security are cited with little supporting evidence.  Consider the fraction of cyber-attacks due to insiders [1].  The U.S. Secret Service in 1996 estimated 60%; Network World in 2000 estimated 70% to 90% for “corporate networks” (and said only 1 in 50 attacks is detected); Deloitte and Touche gave it as 35% in 2004 in a study of the financial industry; a CERT briefing in 2005 gave it as 20%; a Carnegie Mellon report in 2004 gave it as 39%; and the CSI/FBI annual survey for 2006 estimated that 26% of financial losses, at any rate, came from insiders.  Which should we believe?  The figures are based on different data collection methods, and some are more reliable than others.  Most do not adequately identify their sources.

To explore this, we have developed a Java data-mining program that collects statements of security-related statistics from the World Wide Web.  Besides providing a single source for scattered data, our program permits comparing the sources and influences of statistics [2].

A Web spider first retrieves links (5000 in our experiments) from a Google search of conjunctions of computer-security words such as “computer + virus + attack” and “firewall + intrusion”.  Our program then applies filtering not possible with Google.  We first delete duplicate pages on different sites and those where the keywords are not in the main body of text (as in links or image references).  We search automatically within the page text for specific patterns involving statistical expressions.  For instance, we look for numbers followed by “%”, or “$” followed by a number, or phrases like “one out of four” or “one fifth.”  Sentences matching on such expressions are then examined for regular expressions such as “<integer>%”, “<integer>.<integer>”, and “<integer> out of <integer>”.  Those sentences that pass all the tests are listed for the user.
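As an illustration only, the pattern tests described above might be sketched in Java as follows.  The class name StatisticPatterns, its method names, and the particular regular expressions are hypothetical; the text does not list the program's actual pattern set, and the sentence splitter here is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: keeps sentences that contain a statistical expression. */
class StatisticPatterns {
    // Assumed building blocks; the program's real pattern set is not listed in the text.
    private static final String NUM = "\\d+(?:,\\d{3})*(?:\\.\\d+)?";
    private static final String WORD_NUM =
        "(?:one|two|three|four|five|six|seven|eight|nine|ten)";
    private static final String ANY_NUM = "(?:" + NUM + "|" + WORD_NUM + ")";

    private static final List<Pattern> PATTERNS = List.of(
        // A number followed by "%" or "percent", e.g. "5%" or "50 percent".
        Pattern.compile(NUM + "\\s*(?:%|percent\\b)", Pattern.CASE_INSENSITIVE),
        // "$" followed by a number, e.g. "$2,000".
        Pattern.compile("\\$\\s*" + NUM),
        // Ratios such as "1 out of 50" or "one in five".
        Pattern.compile("\\b" + ANY_NUM + "\\s+(?:out\\s+of|in)\\s+" + ANY_NUM + "\\b",
                        Pattern.CASE_INSENSITIVE),
        // Word fractions such as "one fifth" or "two-thirds".
        Pattern.compile("\\b" + WORD_NUM
                        + "[\\s-](?:half|third|quarter|fourth|fifth|tenth)s?\\b",
                        Pattern.CASE_INSENSITIVE));

    /** Returns true if any pattern occurs somewhere in the sentence. */
    static boolean containsStatistic(String sentence) {
        for (Pattern p : PATTERNS) {
            if (p.matcher(sentence).find()) {
                return true;
            }
        }
        return false;
    }

    /** Splits page text into rough sentences and keeps those with a statistic. */
    static List<String> extract(String pageText) {
        List<String> hits = new ArrayList<>();
        // Naive splitter: break after '.', '!' or '?' when followed by whitespace.
        for (String sentence : pageText.split("(?<=[.!?])\\s+")) {
            if (containsStatistic(sentence)) {
                hits.add(sentence.trim());
            }
        }
        return hits;
    }
}
```

Calling StatisticPatterns.extract on the main body text of a page returns the candidate sentences.  A splitter this simple breaks on abbreviations such as “U.S.”, so a real system would need a more careful sentence segmenter.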

We use our previously developed software [3] to destem words (remove suffixes) to increase matches. We also ignore sentences containing some “stop phrases” representing false positives on uninteresting statistics, like “the following review helpful” as used on  About 5% of the initial pages found by Google had relevant sentences.  Matched sentences are stored in a table with their address, date of last modification, and date of creation.  (For pages without a “date last modified” field, we used the last date in the page text.)
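A minimal sketch of these two filtering aids, under stated assumptions, is given below.  The suffix stripper is a crude stand-in and not the software of [3]; the class name SentenceFilter, the suffix list, and the minimum stem length of three letters are all hypothetical, and only the one stop phrase quoted above comes from our text.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of stop-phrase filtering and crude suffix removal. */
class SentenceFilter {
    // Only this phrase is quoted in the text; a real list would hold more entries.
    private static final List<String> STOP_PHRASES =
        List.of("the following review helpful");

    // Assumed suffix list, checked in order; not the stemmer of reference [3].
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    /** Lowercases a word and strips the first matching suffix, keeping at least 3 letters. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Returns true if the sentence contains any stop phrase, ignoring case. */
    static boolean hasStopPhrase(String sentence) {
        String s = sentence.toLowerCase(Locale.ROOT);
        for (String phrase : STOP_PHRASES) {
            if (s.contains(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, “attacks”, “attacked”, and “attacking” all reduce to “attack”, so keyword and similarity tests can match across inflections.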

To trace the history of particular statistics, our program also searched for other occurrences of the sentences found and for similar sentences.  Similarity was measured, following ideas from [2], by the vector inner product of the frequency distributions of words in the two sentences (which usually amounts to the fraction of words in common) times an adjustment factor to reward longer sentences.  A set of “stop words” like “and” and “the” was excluded from this calculation.  The adjustment factor was the sum of the sentence lengths divided by one plus that sum.  We found that requiring around 80% of the words to be identical worked best in general.  But even then there were false positives on a number of sentences involving financial news unrelated to information security, such as “One in five sustained financial losses due to an attack on mobile data platforms.”
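The similarity measure can be sketched in Java as shown below, under several assumptions that go beyond the description above.  The word-count vectors are normalized to unit length (a cosine measure), chosen because the result then approximates the fraction of words in common; sentence lengths are counted after stop-word removal; and the 0.8 threshold is applied to the final score.  The class name SentenceSimilarity, the tokenizer, and all stop words other than “and” and “the” are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the sentence-similarity score. */
class SentenceSimilarity {
    // "and" and "the" come from the text; the other entries are assumptions.
    private static final Set<String> STOP_WORDS = Set.of("and", "the", "a", "an", "of");

    // Assumed reading of "around 80%", applied to the final score.
    static final double THRESHOLD = 0.8;

    /** Counts the non-stop-word tokens of a sentence. */
    static Map<String, Integer> wordCounts(String sentence) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9%$']+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Normalized inner product of the count vectors times the length adjustment. */
    static double similarity(String s1, String s2) {
        Map<String, Integer> a = wordCounts(s1);
        Map<String, Integer> b = wordCounts(s2);
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double cosine = dot / (norm(a) * norm(b));
        // Adjustment factor: sum of sentence lengths divided by one plus that sum.
        int totalLength = length(a) + length(b);
        double adjustment = totalLength / (1.0 + totalLength);
        return cosine * adjustment;
    }

    static boolean isSimilar(String s1, String s2) {
        return similarity(s1, s2) >= THRESHOLD;
    }

    /** Euclidean length of a count vector. */
    private static double norm(Map<String, Integer> counts) {
        double sumOfSquares = 0.0;
        for (int c : counts.values()) {
            sumOfSquares += (double) c * c;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Number of counted tokens in a sentence. */
    private static int length(Map<String, Integer> counts) {
        int n = 0;
        for (int c : counts.values()) {
            n += c;
        }
        return n;
    }
}
```

For two identical sentences with 12 counted words each, the normalized inner product is 1 and the adjustment factor is 24/25, so the score is 0.96.  The factor approaches 1 as sentences grow, which is how it rewards longer sentences.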

Inspection of our experimental results revealed some exact matches, indicating free use of quotation on the Web, but also paraphrases that we could find only with our similarity-matching methods.


Table 1: Example search results for strings similar to "50% of the attacks were not preceded by a scan."

Date          Sentence found
5 Dec 2005    more than 50 percent of attacks are not preceded by a scan of any kind
5 Dec 2005    more than 50 percent of attacks are not preceded by a scan of any kind
7 Dec 2005    in fact, more than half of all attacks aren't preceded by a scan of any kind, he said
13 Dec 2005   in fact, more than half of all attacks aren't preceded by a scan of any kind, cukier said
15 Dec 2005   in fact, more than half of all attacks aren't preceded by a scan of any kind, cukier said
23 Mar 2006   thirty five percent of the observed attacks were preceded by at least one scan
9 Dec 2005    james clark school of engineering shows port scans precede attacks only about 5 percent of the time, with more than half of all attacks not preceded by a scan of any kind, says michel cukier, center for risk and reliability professor at the engineering school
21 Apr 2006   who found that 50% of directed attacks were preceded by a vulnerability scan, while much of the remainder consisted of having a scan and the attack bundled as a single activity such as is observed in worms, for example



The number of exact matches for any statistical description on the Web rarely exceeded ten, which suggests that statements in the area of information security are not quoted or cited very heavily.

Table 1 shows example results for similarity matching on “50% of the attacks were not preceded by a scan.”  The two highest matches were to a University of Maryland website that posted the results of a study on December 5, 2005.  Over the next ten days, three articles quoted the study and several others paraphrased it.  Three more statistical descriptions were found in this cluster, each significantly different from the search string.  Citation of the author of the original study, Michel Cukier, was a key clue connecting these significantly different strings.

Although most of the statistics we found quoted the source of an original survey or study, often the source was not on the Internet.  For example, Cukier’s study is no longer available.  But we did find the original source of the interesting statistic “a new virus mutates 1 of 4,000,000,000 different ways.”  This refers to a virus created in the early 1990s by a hacker known as Dark Avenger, who bragged about it in a Web blog.  Excerpts of the blog were quoted in a book [4], published in hardcover in 1992 and then online in 1993.  Our program found four sites that posted the book, as well as a Discover Magazine article [5] published in 1993 and written by the same authors as the book.  We also found a sentence extracted from an article published in 1996 describing a timeline for PC viruses, one of which was Dark Avenger’s virus.  Although this sentence was not a direct quote from the original book, it still scored highly.  Finally, we also found lower-rated matching sentences from an online tutorial that was republished many times between 1996 and 2005.

In future work, we hope to build a more comprehensive database of security statistics by running spiders across the Web periodically.  This will require broadening our initial Google search criteria and creating new phrase patterns to search for, which we can find by noting additional sentence patterns on pages that contain known patterns of interest.




[1]     R. Bejtlich, The Tao of Network Security Monitoring: Beyond Intrusion Detection, Reading, MA: Addison-Wesley, 2004.

[2]     I. Mani, Automatic Summarization (Natural Language Processing), John Benjamins, 2001.

[3]     N. Rowe, MARIE-4: A high-recall, self-improving Web crawler that finds images using captions.  IEEE Intelligent Systems, Vol. 17, no. 4, July/August 2002, 8-14.

[4]     P. Mungo and B. Clough, Approaching Zero: The Extraordinary Underworld of Hackers, Phreakers, Virus Writers, and Keyboard Criminals, Random House, 1992.

[5]     P. Mungo and B. Clough, “The Bulgarian Connection – Many US Computer Viruses Traced to one Man in Bulgaria”, Discover, 1993.


This paper appeared in the Proceedings of the 8th IEEE Workshop on Information Assurance, West Point, NY, June 2007.


Manuscript received March 7, 2007.  This work was supported in part by the National Science Foundation under the Cyber Trust program. 

The authors are with the Naval Postgraduate School, Code CS/Rp, 1411 Cunningham Road, NPGS, Monterey CA 93943 USA.