Using Context to Disambiguate Web Captions
Neil C. Rowe
Code CS/Rp, 833 Dyer Road
U.S. Naval Postgraduate School
Monterey, CA 93943 USA
Abstract
The easiest way to index multimedia from ordinary Web pages is to find their captions. However, captions are not used consistently, and retrieval effectiveness for caption-based multimedia browsers is significantly poorer than that for text retrieval. We show that statistical "context" information about the Web pages at a site can help recognize image captions by quantifying their "representativeness". Experiments were conducted on a random sample of 5010 image captions drawn from 3.2 million candidates found on 5 million Web pages, and on 1220 audio and video captions drawn from 720,000 candidates on those same Web pages. They showed that while statistical context information was definitely a good clue, it usually added little beyond what good local clues in the candidate caption-image pair itself provide, and it provided no help for caption-audio and caption-video pairs.
Keywords: captions, World Wide Web, data mining, disambiguation, context, similarity
** This paper appeared in the Internet Computing Conference, Las Vegas, NV, June 2004. **
1. Introduction
Captions are the best tool for indexing images and other multimedia objects in large unstructured document collections such as the Web. But finding captions on the Web is not straightforward because page authors use widely different ways of placing and displaying them [1, 2]. Commercial "image search" software typically addresses only the "easy" captions, obtaining high precision (accuracy) but low recall (coverage). Our previous work on the MARIE-4 system [3] showed that seven key factors for captions were statistically significant on a randomly selected training set. The most valuable of these factors was the occurrence of particular words in the proposed caption (e.g. "caption", "figure", "photograph", "shows", "above", "left") or the absence of particular words (e.g. "click", "page", "button", "bytes", "free", "now"). The other valuable factors were the caption type (e.g. italics, centered text, item in a table, alternative text for the image, text of a clickable link); the format of the associated image; the length of the caption; particular words in the image-file name, especially words shared between the image file name and the caption; digits in the image-file name; and image size. Distance of the caption from the image was shown to be unhelpful within 1000 characters of the image.
However, this work did not examine the clue of the consistency of the captions at a site or in a directory. It examined the local "link context" [4], but there is also global context in the directory and site containing the page of the image. Many sites use designers, design templates, and style sheets to ensure a consistent "look and feel" for their pages. This can mean using boldface for all captions, centering them below the image, capitalizing the words of the caption, preceding the caption with a tag word like "Figure", putting the caption in a link to an image, and so on. Recognizing a consistent caption style should make it easier to recognize atypical captions, such as abnormally short or unusually placed ones, that are nonetheless like other captions on the site in other ways. The present work examines the problem more closely.
2. Defining caption context from statistical analysis
To experiment more carefully with the effect of caption context, we first wrote a program to calculate statistics on key features of the use of all nontrivial images on a Web site, as found by our MARIE-4 crawler and caption-rater. ("Nontrivial" meant we automatically excluded images less than 2000 bytes in size and those occurring three or more times, which eliminated most graphics icons.) Since directories at a Web site can also differ considerably in features, we subcategorized the data by directories of 10 pages or more. Fifteen statistics were chosen to reflect factors we saw frequently in style sheets:
1) vertical relationship of the caption to the image (above, below, or to the side);
2) horizontal alignment (centered or not);
3) whether the caption begins with a tag ("Figure", "Table", etc.);
4) whether the caption has an image-suggesting keyword (e.g. "photo", "shows", and "above");
5) length of the caption, defined as short (less than 25 characters), medium, or long (more than 100);
6) whether the caption is at the top of the page (defined as the first 1000 characters);
7) whether the caption is a single sentence;
8) whether the caption is capitalized;
9) size of the image, defined as small (image height plus width less than 400), medium, or large (image height plus width greater than 900);
10) format of the image (GIF, JPEG, or PNG);
11) whether the image file name is an English word or an appended pair of English words;
12) whether the image file name contains hyphens or underscores;
13) whether the image file name contains digits;
14) one of eleven categories of caption type (italics, boldface, font, big heading, medium heading, small heading, paragraph or list item, table item, caption-suggesting wording, alternative text, or clickable link or explicit caption);
15) the average confidence rating of a caption candidate over the directory or site.
For the statistics based on
numeric ranges, we chose the ranges to give approximately an even distribution
on the training set.
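A minimal sketch of how such directory-level statistics might be accumulated, assuming each candidate has already been reduced to a 0/1 feature vector and an estimated caption probability (the data layout and function name here are illustrative, not those of the MARIE-4 implementation):

```python
def directory_statistics(candidates, num_features=15):
    """Compute probability-weighted mean feature vectors per directory.

    `candidates` is an iterable of (directory, feature_vector, caption_prob)
    tuples, where feature_vector is a list of num_features 0/1 values
    encoding the properties above, and caption_prob is the candidate's
    estimated probability of being a caption.
    """
    sums = {}
    weights = {}
    for directory, features, prob in candidates:
        total = sums.setdefault(directory, [0.0] * num_features)
        for i, f in enumerate(features):
            total[i] += prob * f                 # weight features by caption probability
        weights[directory] = weights.get(directory, 0.0) + prob
    return {d: [s / weights[d] for s in sums[d]]
            for d in sums if weights[d] > 0}
```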
Table 1 shows the average fraction of caption-image pairs (weighted by their caption probability) having certain features over some representative sites. Clearly there are important differences in the statistics between sites.
It might be objected that aggregate statistics such as these are inferior to a set of prototypical caption-image pairs for representing the tendencies of a directory or site, since many sites have several distinct kinds of common image-caption pairs. However, even then statistics can help, because each of the common kinds will get good representation in the aggregate statistics, and no one kind will override the others. Nonstatistical approaches to context like [5] only work well for contexts centered on active agents, which is not the case here.
Table 1: Statistical characteristics of captions on example Web sites.
Web site | Nontrivial image-text pairs | Caption below image? | Tagged caption? | Single-sentence caption? | JPEG-format image? | English file name?
web.nps.navy.mil | 8,752 | 0.54 | 0.06 | 0.81 | 0.26 | 0.65
www.history.navy.mil | 301,945 | 0.79 | 0.53 | 0.57 | 1.00 | 0.99
www.nawcwpns.navy.mil | 4,972 | 0.58 | 0.04 | 0.82 | 0.16 | 0.50
www.apple.com | 82,827 | 0.41 | 0.01 | 0.38 | 0.76 | 0.57
www.amazon.com (first 50,000 pages) | 97,234 | 0.13 | 0.00 | 0.99 | 0.99 | 0.01
www.nationalgeographic.com | 19,287 | 0.49 | 0.00 | 0.75 | 0.89 | 0.61
www.kepnerfamily.com | 1,098 | 0.40 | 0.00 | 0.67 | 1.00 | 0.29
www.lacoast.gov | 4,261 | 0.47 | 0.17 | 0.82 | 0.66 | 0.66
www.hazegray.org | 6,285 | 0.66 | 0.00 | 0.88 | 1.00 | 0.47
www.dmoz.org (first 30,000) | 3,640 | 0.47 | 0.00 | 0.94 | 0.22 | 0.50
dizzy.library.arizona.edu | 31,702 | 0.44 | 0.00 | 0.75 | 0.81 | 0.40
www.ipfw.edu | 14,346 | 0.47 | 0.04 | 0.80 | 0.83 | 0.47
Other specialized forms of context may also be useful with captions. For instance, PowerPoint presentations converted into image files are common on the Web. These can be inferred for directories with files in numerical sequence like "image01", "image02", etc. Images that fit these patterns are less likely to be captioned since most slides contain their own captions within the image itself. But their lack of captions can also usually be inferred from negative word clues like "slide" and "presentation" in the name of the image file, so we did not implement any special mechanism for them.
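For illustration, a simple check for the numbered-filename pattern described above might look like the following sketch (the threshold and function name are assumptions; as noted, the actual system relied on word clues instead):

```python
import re

def looks_like_slide_sequence(filenames, min_fraction=0.5):
    """Return True if a large fraction of the file names share a common
    prefix followed by digits, e.g. image01.jpg, image02.jpg, ...,
    which suggests a converted slide presentation."""
    pattern = re.compile(r'^([A-Za-z_-]+)(\d+)\.\w+$')
    prefixes = [m.group(1).lower() for f in filenames
                if (m := pattern.match(f))]
    if not prefixes:
        return False
    most_common = max(set(prefixes), key=prefixes.count)
    return prefixes.count(most_common) >= min_fraction * len(filenames)
```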
3. Baseline probabilities for captions
As a model of the likelihood of a caption candidate based on local clues, [3] used a neural network. But this has a number of disadvantages, including overfitting of the training data and oversensitivity to a single positive factor. Since both of these issues could seriously affect context effects, we switched to a Naive Bayes approach for the experiments reported here, where such issues can be handled well [6]. We used the odds form of Naive Bayes:
o(C | E_1 & E_2 & ... & E_m) = o(C) * [o(C|E_1)/o(C)] * [o(C|E_2)/o(C)] * ... * [o(C|E_m)/o(C)]
where C represents the condition of the candidate being a caption, the E_i terms represent evidence factors, and o(X) = prob(X)/(1-prob(X)). We used this formula for the nine major factors that were sufficiently supported in newly conducted tests on our 5338-case training set (1716 captions and 3622 noncaptions): caption words, image file-name words, fraction of the candidate that was nonalphabetic, length of the caption, HTML tags used for the caption candidate, number of common words between the caption and the image file name, image size, image format, and number of digits in the image file name. We also used the above formula for combining subclues of the first two factors, the words of the text and of the image file name (approximating by 1 the odds ratio for a caption given the absence of any particular word, since any useful word was rare). Using the Naive Bayes approach, we improved precision over the neural-network approach from 73% to 84% for the top 10% of caption candidates in the training set, and from 67% to 73% for the top 30%. A mildly nonlinear function was applied to these Naive Bayes values to make them closer to the observed probabilities of captions in the training set.
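A minimal sketch of the odds-form Naive Bayes combination described above; in practice the conditional probabilities would come from training-set counts, and the numbers in the usage example are placeholders, not values from the paper:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def combine_odds(prior_prob, conditional_probs):
    """Odds-form Naive Bayes: multiply the prior odds o(C) by the
    odds ratio o(C|E_i)/o(C) contributed by each evidence factor."""
    o_prior = odds(prior_prob)
    o = o_prior
    for p in conditional_probs:          # p approximates prob(C | E_i)
        o *= odds(p) / o_prior
    return o

def odds_to_prob(o):
    return o / (1.0 + o)

# Placeholder example: prior caption probability 0.3 and three evidence
# factors with conditional probabilities 0.6, 0.5, and 0.7.
print(odds_to_prob(combine_odds(0.3, [0.6, 0.5, 0.7])))
```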
4. The effect of context on caption identification
Given that we can identify the statistical traits of Web sites and their directories, can this help recognize less-obvious captions on those pages? To study this, we interpreted context to mean the similarity of a caption-image pair to other caption-image pairs in the same directory. We used case-based reasoning to find similarity [7]. Similarity can be established by a "mean-caption" approach that compares the features of an image-caption pair to the mean values for its directory. The directory's context feature vector was taken as the weighted means of the fifteen properties described in section 2 over the directory. We weighted cases by their estimated probability of being a caption-image pair, as computed by the methods of section 3, so the values for more-likely captions were weighted more heavily. We inherited context statistics from superdirectories containing the Web page when the page's own directory had fewer than 10 pages.
To measure the representativeness of a caption-image pair P for a Web directory, we took the inner product of the mean feature vector for the directory with the feature vector of P (a vector of 1's and 0's indicating the features present, designed so there were always fifteen 1's in the vector). As usual with inner products, we multiplied corresponding vector components, then divided the sum by the norm of the directory feature vector times the square root of 15. We then applied a mildly nonlinear function to this result to better approximate the actual probability of captions with that output; this function was obtained by fitting to the training set. The final result is in the range 0 to 1.
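A sketch of this representativeness computation, assuming the candidate's vector contains exactly fifteen 1's and the directory vector is the weighted mean from section 2; the final fitted calibration function is omitted here since its form is not given in the text:

```python
import math

def representativeness(candidate_vec, directory_mean_vec):
    """Normalized inner product of a candidate's 0/1 feature vector with
    its directory's weighted-mean feature vector.  Since the candidate
    vector holds exactly fifteen 1's, its norm is sqrt(15), so the result
    lies between 0 and 1."""
    dot = sum(c * d for c, d in zip(candidate_vec, directory_mean_vec))
    dir_norm = math.sqrt(sum(d * d for d in directory_mean_vec))
    if dir_norm == 0.0:
        return 0.0
    return dot / (dir_norm * math.sqrt(15))
```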
Representativeness alone is not a clue for a caption. If a candidate is highly representative of its site, but the site has poor caption candidates (like www.amazon.com), the candidate should be rated unlikely. But if the candidate is not representative of its site, we cannot conclude much from its context. As the simplest adequate modification of the odds approach, we used a formula combining X, the context information, E, the other evidence for a caption, and R, the representativeness metric; the exponent multiplier of 0.5 in it was found by experiment.
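The displayed formula itself does not survive in this text, so the following is only an illustrative guess at a representativeness-weighted odds adjustment that is consistent with the surrounding description; the use of the directory's average candidate odds, the prior odds, and this exact functional form are assumptions, not the paper's formula:

```python
def context_adjusted_odds(local_odds, directory_avg_odds, prior_odds,
                          representativeness, k=0.5):
    """Illustrative sketch only: nudge the locally estimated odds toward
    the directory's average candidate odds, with the strength of the
    adjustment growing with representativeness R.  An unrepresentative
    candidate (R near 0) is left almost unchanged, while a representative
    candidate on a site full of poor candidates is pushed down."""
    return local_odds * (directory_avg_odds / prior_odds) ** (k * representativeness)
```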
We tested the effect of context information on a new random sample "test4" of 5010 entries drawn from near-exhaustive caption indexes created for 52 of the large Web sites of the earlier work, plus a few more "mil" sites from our earlier experiments. The 52 sites were chosen to be diverse; the runs were exhaustive, with the exception of very large sites like www.stanford.edu, www.dmoz.org, and www.amazon.com. These Web "crawls" in December 2003 and January 2004 also provided the data for the context statistics in our tests. Altogether, our crawler and subsequent filtering found 3,258,399 caption candidates from examining around 5,000,000 pages in those two months. We selected the 5010 candidates for test4 by a random selection designed to pick from each site with a probability roughly proportional to the square root of the site's number of candidates, so sites with many images did not overly bias the evaluation but still had more representation than small sites. To see if the choice of media had an effect, we also similarly created a test set of 1220 audio and video captions drawn randomly from the 201,661 audio and 518,834 video captions found on the same Web sites during the same crawl.
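One way to achieve the stated effect (per-site sample counts roughly proportional to the square root of each site's candidate count) is sketched below; the function name, rounding, and seeding are illustrative assumptions, not the paper's procedure:

```python
import math
import random

def sample_candidates(candidates_by_site, target_size, seed=0):
    """Draw a test sample whose per-site counts are roughly proportional
    to the square root of each site's number of candidates."""
    rng = random.Random(seed)
    roots = {s: math.sqrt(len(c)) for s, c in candidates_by_site.items()}
    total_root = sum(roots.values())
    sample = []
    for site, cands in candidates_by_site.items():
        n = min(len(cands), round(target_size * roots[site] / total_root))
        sample.extend(rng.sample(cands, n))
    return sample
```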
Table 2 gives measured precision (the fraction of correct captions among all candidates identified as captions) as a function of measured recall (the fraction of correct captions identified out of all captions in the test set).
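Precision at each recall level can be read off a ranking of the candidates by their estimated probability; a minimal sketch (variable names are illustrative):

```python
def precision_at_recall(scored_labels, recall_levels):
    """Given (score, is_caption) pairs, report precision at the point in
    the score-ranked list where each recall level is first reached."""
    ranked = sorted(scored_labels, key=lambda x: x[0], reverse=True)
    total_pos = sum(1 for _, y in ranked if y)
    if total_pos == 0:
        return {}
    results = {}
    tp = 0
    for i, (_, y) in enumerate(ranked, start=1):
        tp += y
        recall = tp / total_pos
        precision = tp / i
        for r in recall_levels:
            if r not in results and recall >= r:
                results[r] = precision
    return results

# e.g. precision_at_recall(pairs, [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])
# would produce values analogous to a row of Table 2.
```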
Table 2: Experimental results for precision as a function of recall.
Test / Recall | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 | 0.2 | 0.1
T1: test4 (5010 items), clues but no context | .28 | .30 | .32 | .33 | .35 | .39 | .43 | .53 | .61 | .64
T2: test4, context factor only, exponent multiplier 0.5 | .28 | .31 | .34 | .35 | .36 | .37 | .39 | .40 | .43 | .46
T3: test4, average candidate weight in a directory alone | .28 | .31 | .33 | .35 | .34 | .34 | .35 | .38 | .41 | .36
T4: test4, with both local clues and context factor, multiplier 0.5 | .28 | .30 | .32 | .34 | .37 | .41 | .43 | .54 | .61 | .60
T5: same as T4 except multiplier 1.0 | .28 | .30 | .32 | .34 | .37 | .40 | .44 | .53 | .60 | .57
T6: same as T4 ignoring title and filename captions, multiplier 0.5 | .37 | .40 | .41 | .44 | .47 | .50 | .58 | .62 | .60 | .58
T7: same as T4 but no inheritance of context | .28 | .30 | .32 | .34 | .36 | .40 | .43 | .52 | .61 | .61
T8: like T1 but only the subset of test4 from *.epa.gov (207 items) | .26 | .30 | .36 | .37 | .38 | .46 | .47 | .53 | .65 | .58
T9: like T4 but only *.epa.gov | .26 | .29 | .36 | .38 | .41 | .50 | .51 | .62 | .65 | .58
T10: like T1 but only the subset *.stanford.edu (346 items) | .22 | .24 | .25 | .26 | .28 | .29 | .29 | .34 | .46 | .65
T11: like T4 but only *.stanford.edu | .22 | .23 | .25 | .26 | .29 | .31 | .32 | .38 | .53 | .63
T12: like T1 but only the subset *.history.navy.mil (177 items) | .39 | .38 | .38 | .41 | .46 | .42 | .48 | .60 | -- | --
T13: like T4 but only *.history.navy.mil | .39 | .38 | .38 | .42 | .44 | .43 | .45 | .51 | .46 | .38
T14: audio and video captions (1220), clues but no context | .18 | .26 | .30 | .31 | .32 | .34 | .38 | .46 | .57 | --
T15: audio and video captions with context | .18 | .26 | .30 | .31 | .32 | .34 | .33 | .35 | .39 | .42
To summarize the results:
5. Context from image properties
An obvious question is what kind of additional knowledge would further help in disambiguating captions. Our hunch is that it is indeed possible to significantly improve performance, because people can still do the task better than our system, albeit more slowly. So the obvious idea is to include information from image content analysis despite its large processing-time requirements. A glance at the objects in a photograph often makes it easy for people to connect a caption to an image. So we need to do at least some simple image processing to determine the general characteristics of the image and to guess the major shapes within it.
General classification of images (color photographs, black-and-white photographs, manipulated photographs, line drawings, block diagrams, simple graphics, etc.) is not difficult to do and can provide useful extra information for connecting images to captions. Other useful and not-difficult classifications are indoors/outdoors, day/night, people/scenery, and manipulated/unmanipulated. More detailed taxonomies [8] can help but are hard to implement in automatic classifiers. The lower-level image primitives of [9] appear more promising for assigning feature vectors to images: such things as the average size of regions, the general kind of division of the image (e.g. vertically into two halves), and the appearance of straight versus curved edges, double edges, regularly curving edges, regularly shaped regions, edges within regions, granulation, glossiness, and color.
Our previous work has shown that it is often not difficult to distinguish the foreground, background, and subject of an image by the relative location, size, and contrast of regions [10]. Image similarity can also be computed using feature vectors, and images similar to those known to be captioned will tend to be captioned as well.
6. Conclusion
It appears that we have reached the limit of what can be accomplished in distinguishing captions on Web pages without content analysis of the accompanying media. Since content analysis requires considerably more processing time per instance than the methods described here, better performance may be impractical for building large Web media indexes until computers become significantly faster. Nonetheless, we have shown that methods without content analysis can improve significantly on the current commercial state of the art if designed carefully.
References
[1] Sclaroff, S., La Cascia, M., Sethi, S., & Taycher, L., Unifying textual and visual cues for content-based image retrieval on the World Wide Web. Computer Vision and Image Understanding, 75(1/2), 86-98, 1999.
[2] Srihari, R., Zhang, Z., & Rao, A., Intelligent indexing and semantic retrieval of multimodal documents. Information Retrieval, 2(2), 245-275, 2000.
[3] Rowe, N., MARIE-4: A high-recall, self-improving Web crawler that finds images using captions. IEEE Intelligent Systems, 17(4), 8-14, July/August 2002.
[4] Pant, G., Deriving link-context from HTML tag tree. Proc. 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, pp. 49-55, 2003.
[5] Ranganathan, A., & Campbell, R., An infrastructure for context-awareness based on first-order logic. Personal and Ubiquitous Computing, 7(6), pp. 353-364, December 2003.
[6] Korpipaa, P., Koskinen, M., Peltola, J., Makela, S.-M., & Seppanen, T., Bayesian approach to sensor-based context awareness. Personal and Ubiquitous Computing, 7(2), pp. 113-124, July 2003.
[7] Mitchell, T., Machine Learning. Boston, MA: WCB McGraw-Hill, 1997.
[8] Burford, B., Briggs, P., & Eakins, J., A taxonomy of the image: On the classification of content for image retrieval. Visual Communications, 2(2), pp. 123-161, 2003.
[9] Saint-Martin, F., Semiotics of Visual Language. Bloomington, IN: Indiana University Press, 1990.
[10] Rowe, N., Finding and labeling the subject of a captioned depictive natural photograph. IEEE Transactions on Data and Knowledge Engineering, 14(1), 202-207, January/February 2002.