EXTENDED ABSTRACT

The usefulness of the World Wide Web as a digital library of precise and reliable information is reduced by the increasing presence of advertising on Web pages. But no one is required to read or see advertising, and this cognitive censorship can be automated by software. Such filters can be useful to the U.S. government, which must permit its employees to use the Web but which is prohibited by law from endorsing commercial products. While the task would seem at first simpler than pornography filtering or general firewalls, subtleties in recognizing advertising make full success daunting.

Our work is evaluating the quality of methods and products for automatic ad censorship. Commercial products include AdKiller (www.adkiller.com), Ad Subtract Pro (www.adsubtract.com), Advertising Killer (www.buypin.com), AdWiper (www.adwiper.com), FilterGate (www.adscience.co.uk), and WebWasher (www.webwasher.com). Other products prevent the annoying popup windows that are usually ads. The functions of these products include removal of ad-like images embedded in the page, prevention of popup windows and Javascript alert boxes, prevention of blinking text, and prevention of playing of embedded audio. Or so the vendors claim. Not a single one provides any statistics on the accuracy of its product (e.g., recall and precision) to support its grand claims of removing ads.

We are experimenting to determine how effective various techniques are in ad censoring. We constructed a censor of our own ("Big Head") with manipulable features. We use Java servlet software to implement a page server that fetches HTML source text and edits it to create a modified page for display. The modified page has blanks in the places of inferred ads, and substitutes local links for remote links to permit further censoring. Blanks are made the same shape as the censored ads so that meaningful page layout can be preserved.
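The page-editing step can be illustrated with a minimal sketch. It is written in Python rather than as a Java servlet, and it is not the Big Head implementation: the function names, the regular expressions, and the proxy path "/censor?url=" are all assumptions made for illustration. It shows the two edits described above, namely replacing an inferred ad image by a blank of the same shape and rewriting links so that the next page is also fetched through the censoring server.

```python
import re
from urllib.parse import quote, urljoin

# Simplified patterns: assume double-quoted attribute values (an assumption).
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')
HREF = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)


def blank_ads(html, is_ad):
    """Replace each <img> judged to be an ad by an empty box of the same size.

    is_ad is a caller-supplied function that takes a dict of the tag's
    attributes and returns True for inferred ads (hypothetical interface).
    """
    def replace(match):
        attrs = {name.lower(): value for name, value in ATTR.findall(match.group(0))}
        if not is_ad(attrs):
            return match.group(0)  # leave non-ads untouched
        width = attrs.get("width", "0")
        height = attrs.get("height", "0")
        return ('<span style="display:inline-block;width:%spx;height:%spx"></span>'
                % (width, height))
    return IMG_TAG.sub(replace, html)


def localize_links(html, page_url, proxy_prefix="/censor?url="):
    """Point every href at the local censoring server (hypothetical path)."""
    def replace(match):
        absolute = urljoin(page_url, match.group(1))
        return 'href="%s%s"' % (proxy_prefix, quote(absolute, safe=""))
    return HREF.sub(replace, html)
```

A production censor would use a real HTML parser instead of regular expressions; the sketch only shows why keeping the width and height of the removed image preserves the page layout.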

We initially examined a variety of Web sites to develop a set of clue properties for both image ads and their associated text, considering the text within a fixed-size window around the HTML image reference. We defined ads as information intended to arouse a desire to purchase or patronize something. It became clear that identification needs both logical and probabilistic methods to achieve high recall (fraction of ads removed from pages), although high precision (fraction of ads among the items eliminated) was easy to achieve by simply picking the popup windows and narrow banner-size images. Certain image dimensions are strong clues for ads, especially 480 by 60 banners and 150 by 500, 120 by 600, and 160 by 600 images along the sides of the page. Images stored on sites different from the page's site (i.e., with a different first part of their URL) were also very likely to be ads, as were images whose file names contained long integers. These criteria are sufficiently strong to give 95% precision in identifying image ads.
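The strong criteria above lend themselves to simple logical tests. The following Python sketch is one possible reading of them, not the authors' code: the function name is hypothetical, the image sizes are the ones listed above, and treating "long integer" as a run of five or more digits is an assumption.

```python
import re
from urllib.parse import urlparse

# Image dimensions (width, height) named in the text as strong ad clues.
BANNER_SIZES = {(480, 60), (150, 500), (120, 600), (160, 600)}

# Assumption: a "long integer" is a run of at least five digits.
LONG_INTEGER = re.compile(r"\d{5,}")


def strong_ad_clues(page_url, image_url, width, height):
    """Return the names of the strong clues that fire for one image reference."""
    clues = []
    if (width, height) in BANNER_SIZES:
        clues.append("banner-size")
    page_host = urlparse(page_url).netloc.lower()
    image_host = urlparse(image_url).netloc.lower()
    # A relative image URL has no host and is therefore on the page's own site.
    if image_host and image_host != page_host:
        clues.append("off-site")
    filename = urlparse(image_url).path.rsplit("/", 1)[-1]
    if LONG_INTEGER.search(filename):
        clues.append("long-integer-name")
    return clues
```

For example, an image at a different host with a banner shape and a numeric file name triggers all three clues, while a small logo served from the page's own site triggers none.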

Additional weaker criteria included the words of the image file name (the image URL), the words of any directly associated text (the "alt" string), and other words within a fixed-size window around the image reference. Good clue words and phrases were obtained from a study of random commercial Web pages. Examples are "ad", "buy", "shop", "free", "join", "click", and "now". The strength of each clue was estimated as the fraction of the time that the image was an ad when the word was associated with it. In addition, image ads were usually larger than 2500 pixels, and "alt" text for ads was usually less than 100 characters long; both tendencies can be modeled by probability distributions derived from statistics on example pages. Evidence from these weaker clues was combined using a linear model (or weighted average), and items were eliminated if their weighted sum exceeded a fixed threshold. In a quick test on representative pages where ads were manually identified in advance, our program correctly recognized 19 of 20 ads and 153 of 156 non-ads. Public access is at http://triton.cs.nps.navy.mil:8080/rowe/rowedemos.html.
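A minimal Python sketch of this weighted combination follows. It is an illustration under stated assumptions, not the program tested above: the per-source weights, the threshold, the use of the strongest clue word per source, and all function names are hypothetical, and the size and "alt"-length distributions are left out.

```python
import re

# Hypothetical weights for the three text sources; they sum to 1, so the
# weighted sum below is also a weighted average.
SOURCE_WEIGHTS = {"filename": 0.5, "alt": 0.3, "nearby": 0.2}
THRESHOLD = 0.3  # hypothetical fixed threshold


def clue_strengths(examples):
    """Estimate each word's strength as the fraction of the images
    associated with that word that were ads.

    examples is an iterable of (words, is_ad) pairs from labeled pages.
    """
    seen, ads = {}, {}
    for words, is_ad in examples:
        for word in set(words):
            seen[word] = seen.get(word, 0) + 1
            ads[word] = ads.get(word, 0) + (1 if is_ad else 0)
    return {word: ads[word] / seen[word] for word in seen}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def ad_score(sources, strengths):
    """Weighted sum over sources of the strongest clue word found in each."""
    score = 0.0
    for name, weight in SOURCE_WEIGHTS.items():
        words = tokenize(sources.get(name, ""))
        evidence = max((strengths.get(w, 0.0) for w in words), default=0.0)
        score += weight * evidence
    return score


def is_ad(sources, strengths):
    """Eliminate the item when its weighted sum exceeds the threshold."""
    return ad_score(sources, strengths) > THRESHOLD
```

With strengths learned from labeled examples, an image named "buy_now.gif" with "alt" text "Click now" would score well above the threshold, while a plain company logo with no clue words would score zero and be kept.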

Text ads also have exploitable syntax. Analogously to what we developed for finding image captions in our MARIE-4 Web crawler, incitements to purchase typically use a limited range of grammatical expressions recognizable by a partial parser. Good examples are expressions consisting of the imperative form of a verb indicating acquisition ("buy", "get", "join", "click", etc.) followed by a noun indicating a purchasable quantity (a physical object or a service), with optionally a qualifying adjective on the noun or an adverb indicating a desirable property of the acquisition ("now", "free", "soon", etc.). Such a parser can approach semantic understanding of advertising text and improve precision of ad identification.
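The grammatical pattern just described can be sketched as a very small recognizer. This Python example is only an approximation of a partial parser: the verb and adverb lists come from the examples in the text, while the determiner, adjective, and noun lists, and the function name, are hypothetical stand-ins for a real lexicon.

```python
import re

ACQUIRE_VERBS = {"buy", "get", "join", "click"}   # examples from the text
ADVERBS = {"now", "free", "soon"}                 # examples from the text
DETERMINERS = {"a", "an", "the", "your", "our"}   # hypothetical lexicon
ADJECTIVES = {"new", "free", "best", "cheap"}     # hypothetical lexicon
NOUNS = {"book", "books", "car", "membership",    # hypothetical purchasable
         "software", "tickets"}                   # objects and services


def is_purchase_incitement(phrase):
    """Match: imperative verb, optional determiner, optional adjectives,
    a purchasable noun, then optional trailing adverbs."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    if not tokens or tokens[0] not in ACQUIRE_VERBS:
        return False
    i = 1
    if i < len(tokens) and tokens[i] in DETERMINERS:
        i += 1
    while i < len(tokens) and tokens[i] in ADJECTIVES:
        i += 1
    if i >= len(tokens) or tokens[i] not in NOUNS:
        return False
    i += 1
    while i < len(tokens) and tokens[i] in ADVERBS:
        i += 1
    return i == len(tokens)  # the whole phrase must fit the pattern
```

Under these assumed word lists, "Buy the new book now!" and "Get free software" are accepted, while a declarative phrase such as "The book was new" is rejected because it does not begin with an acquisition verb.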

Our research is ongoing. Future work will obtain more reliable performance statistics on representative Web pages, and will investigate methods of identifying more difficult kinds of ads. Though we have not yet considered it, elimination of popup windows and Javascript applets is usually straightforward from analysis of the HTML source code. We hope to publish performance comparisons of different methods and different vendor products soon.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Joint Conference on Digital Libraries '02, July 8-12, Portland, Oregon.
