Automatic Removal of Advertising from Web-Page Display
Neil C. Rowe, Jim Coffman, Yilmaz Degirmenci, Scott Hall, Shong Lee, and Clifton Williams
Code CS/Rp, Computer Science Department, U.S. Naval Postgraduate School
833 Dyer Rd., Monterey, CA 93943 USA, (831) 656-2462, ncrowe@nps.navy.mil
The usefulness of the World Wide Web as a digital library of precise and
reliable information is reduced by the increasing presence of advertising on
Web pages. But no one is required to read or see advertising, and this
cognitive censorship can be automated by software. Such filters can be useful
to the U.S. government, which must permit its employees to use the Web but is
prohibited by law from endorsing commercial products. While the task would
seem at first simpler than pornography filtering or general firewalls,
subtleties in recognizing advertising make full success daunting.
Our work evaluates the quality of methods and products for automatic ad
censorship. Commercial products include AdKiller (www.adkiller.com), Ad
Subtract Pro (www.adsubtract.com), Advertising Killer (www.buypin.com),
AdWiper (www.adwiper.com), FilterGate (www.adscience.co.uk), and WebWasher
(www.webwasher.com). Other products prevent the annoying popup windows that
are usually ads. According to their vendors, these products remove ad-like
images embedded in the page and prevent popup windows, JavaScript alert boxes,
blinking text, and the playing of embedded audio. However, not a single vendor
provides any statistics on the accuracy of its product (e.g., recall and
precision) to support its claims of removing ads.
We are experimenting to determine how effective various techniques are in ad
censoring. We constructed a censor of our own ("Big Head") with manipulable
features. We use Java servlet software to implement a page server that fetches
HTML source text and edits it to create a modified page for display. The
modified page has blanks in the places of inferred ads, and substitutes local
links for remote links so that pages reached through those links can be
censored as well. Blanks are made the same shape as the censored ads so that
a meaningful page layout is preserved.
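As an illustration only, the page-editing step described above might be sketched as follows. The sketch is in Python rather than the Java servlets of the actual system, and the proxy address `PROXY_PREFIX`, the helper names, and the `is_ad` callback are all hypothetical. It assumes image sizes are given as pixel values in `width` and `height` attributes and that remote links are double-quoted absolute URLs.

```python
import re
from urllib.parse import quote

# Hypothetical address of the local filtering server (not given in the paper).
PROXY_PREFIX = "http://localhost:8080/filter?url="

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Only double-quoted absolute http(s) links are handled in this sketch.
HREF_ATTR = re.compile(r'(href\s*=\s*")(https?://[^"]+)(")', re.IGNORECASE)


def attr(tag, name):
    """Return a simple (space-free) attribute value from an HTML tag, or None."""
    pattern = r'\b%s\s*=\s*["\']?([^"\'\s>]+)' % re.escape(name)
    m = re.search(pattern, tag, re.IGNORECASE)
    return m.group(1) if m else None


def blank_for(tag):
    """Build an empty placeholder with the same width and height as the image."""
    width = attr(tag, "width") or "0"
    height = attr(tag, "height") or "0"
    return ('<span style="display:inline-block;width:%spx;height:%spx"></span>'
            % (width, height))


def censor_page(html, is_ad):
    """Blank out images judged to be ads and route remote links via the proxy."""
    def replace_image(match):
        tag = match.group(0)
        return blank_for(tag) if is_ad(tag) else tag

    def localize_link(match):
        target = quote(match.group(2), safe="")
        return match.group(1) + PROXY_PREFIX + target + match.group(3)

    return HREF_ATTR.sub(localize_link, IMG_TAG.sub(replace_image, html))
```

A production filter would use a real HTML parser instead of regular expressions; the regular expressions here only keep the sketch short.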
We initially examined a variety of Web sites to develop a set of clue
properties for both image ads and their associated text, considering the text
within a fixed-size window around the HTML image reference. We defined ads as
information intended to arouse a desire to purchase or patronize something. It
became clear that identification needs both logical and probabilistic methods
to achieve high recall (the fraction of ads removed from pages), although high
precision (the fraction of ads among the items eliminated) was easy to achieve
by simply picking the popup windows and narrow banner-size images. Certain
image dimensions are strong clues for ads, especially 480 by 60 banners and
150 by 500, 120 by 600, and 160 by 600 images along the sides of the page.
Images stored on sites different from the page's site (i.e., with a different
first part of their URL) were also very likely to be ads, as were images whose
file names contained long integers. These criteria are sufficiently strong to
give 95% precision in identifying image ads.
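The strong criteria above can be illustrated with a minimal Python sketch. The function name and arguments are hypothetical, and the cutoff of five or more digits for a "long integer" is an assumption, since the text does not quantify it. "Different site" is interpreted here as a different host name in the URL.

```python
import re
from urllib.parse import urlparse

# Image sizes (width, height) that the text names as strong ad clues.
AD_SIZES = {(480, 60), (150, 500), (120, 600), (160, 600)}

# Assumed cutoff: a run of five or more digits counts as a "long integer".
LONG_INTEGER = re.compile(r"\d{5,}")


def is_probable_image_ad(page_url, image_url, width=None, height=None):
    """Apply the three strong clues: ad-like size, off-site host, long integer."""
    if (width, height) in AD_SIZES:
        return True
    page_host = urlparse(page_url).netloc.lower()
    image_host = urlparse(image_url).netloc.lower()
    # A relative image URL has no host and so is stored on the page's own site.
    if image_host and image_host != page_host:
        return True
    filename = urlparse(image_url).path.rsplit("/", 1)[-1]
    return bool(LONG_INTEGER.search(filename))
```

For example, `is_probable_image_ad("http://www.example.com/", "/img/logo.gif", 88, 31)` is false under these rules, while the same image served from another host or sized 480 by 60 would be flagged.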
Additional weaker criteria included the words of the image file name (the
image URL), the words of any directly associated text (the "alt" string), and
other words within a fixed-size window around the image reference. Good clue
words and phrases were obtained from a study of random commercial Web pages.
Examples are "ad", "buy", "shop", "free", "join", "click", and "now". The
strength of each clue was estimated as the fraction of the time that the image
was an ad when the word was associated with it. In addition, image ads were
usually larger than 2500 pixels, and "alt" text for ads was usually less than
100 characters long; both tendencies can be modeled by probability
distributions derived from statistics on example pages. Evidence from these
weaker clues was combined using a linear model (a weighted average), and items
were eliminated if their weighted sum exceeded a fixed threshold. In a quick
test on representative pages where ads had been manually identified in
advance, our program correctly recognized 19 of 20 ads and 153 of 156
non-ads. Public access is available at
http://triton.cs.nps.navy.mil:8080/rowe/rowedemos.html.
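To make the weak-clue combination concrete, the following Python sketch estimates clue strengths from labeled examples, combines them by a weighted average, and applies a threshold. It is an illustration under stated assumptions, not the system's implementation: the function names, the equal default weights, and the threshold of 0.5 are hypothetical, and the size and "alt"-length distributions are left out. A small helper also computes precision and recall from raw counts.

```python
from collections import Counter


def estimate_clue_strengths(examples):
    """Map each word to the fraction of its occurrences where the image was an ad.

    `examples` is an iterable of (words, is_ad) pairs from labeled pages.
    """
    seen = Counter()
    seen_in_ads = Counter()
    for words, is_ad in examples:
        for word in set(words):
            seen[word] += 1
            if is_ad:
                seen_in_ads[word] += 1
    return {word: seen_in_ads[word] / seen[word] for word in seen}


def weighted_score(words, strengths, weights=None):
    """Weighted average of the strengths of the known clue words present."""
    weights = weights or {}  # assumed: every clue has weight 1.0 unless given
    total = 0.0
    weight_sum = 0.0
    for word in set(words):
        if word in strengths:
            w = weights.get(word, 1.0)
            total += w * strengths[word]
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def is_ad(words, strengths, threshold=0.5):
    """Eliminate the item if its combined score exceeds the (assumed) threshold."""
    return weighted_score(words, strengths) > threshold


def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from counts of correct and incorrect eliminations."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```

Applied to the quick test reported above (19 ads removed, 1 ad missed, and 3 of the 156 non-ads removed by mistake), `precision_recall(19, 3, 1)` gives a recall of 0.95 and a precision of 19/22, or about 0.86.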
Text ads also have exploitable syntax. Analogously to what we developed for
finding image captions in our MARIE-4 Web crawler, incitements to purchase
typically use a limited range of grammatical expressions recognizable by a
partial parser. Good examples are expressions consisting of the imperative
form of a verb indicating acquisition ("buy", "get", "join", "click", etc.)
followed by a noun indicating a purchasable item (a physical object or a
service), optionally with a qualifying adjective on the noun or an adverb
indicating a desirable property of the acquisition ("now", "free", "soon",
etc.). Such a parser can approach semantic understanding of advertising text
and improve the precision of ad identification.
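The grammatical pattern described above can be approximated by a very small pattern matcher, sketched here in Python. This is not the MARIE-4 parser: the verb and adverb lists come from the examples in the text, while the adjective, noun, and determiner lists are hypothetical stand-ins for a real lexicon. The imperative form is only approximated by requiring the verb to begin a sentence.

```python
import re

# Verbs and adverbs quoted in the text.
ACQUIRE_VERBS = {"buy", "get", "join", "click"}
DESIRABLE_ADVERBS = {"now", "free", "soon"}
# Hypothetical word lists standing in for a real lexicon.
ADJECTIVES = {"new", "cheap", "great"}
PURCHASABLE_NOUNS = {"book", "books", "tickets", "software", "membership"}
DETERMINERS = {"a", "an", "the", "your", "our"}


def match_incitement(sentence):
    """Return the matched phrase (determiners dropped), or None if no match.

    Pattern: sentence-initial verb, optional adjective, noun, optional adverb.
    """
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower())
              if t not in DETERMINERS]
    if not tokens or tokens[0] not in ACQUIRE_VERBS:
        return None
    pos = 1
    if pos < len(tokens) and tokens[pos] in ADJECTIVES:
        pos += 1
    if pos >= len(tokens) or tokens[pos] not in PURCHASABLE_NOUNS:
        return None
    pos += 1
    if pos < len(tokens) and tokens[pos] in DESIRABLE_ADVERBS:
        pos += 1
    return " ".join(tokens[:pos])


def find_incitements(text):
    """Split text into sentences and collect every matching incitement."""
    sentences = re.split(r"[.!?]+", text)
    return [m for m in map(match_incitement, sentences) if m]
```

Under these word lists, "Buy cheap tickets now." is recognized as an incitement, while "We buy old books." is not, because the verb does not begin the sentence.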
Our research is ongoing. Future work will obtain more reliable performance
statistics on representative Web pages, and will investigate methods of
identifying more difficult kinds of ads. Although we have not yet addressed
it, elimination of popup windows and JavaScript applets is usually
straightforward from analysis of the HTML source code. We hope to publish
performance comparisons of different methods and different vendor products
soon.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee. Joint Conference on Digital Libraries '02, July 8-12, Portland,
Oregon. Copyright 2002 ACM 1-58113-000-0/00/0000...$5.00. JCDL'02,
July 13-17, 2002, Portland, Oregon, USA. Copyright 2002 ACM
1-58113-513-0/02/0007...$5.00.