Using Context to Disambiguate Web Captions

 

 

 

Neil C. Rowe

 

Code CS/Rp, 833 Dyer Road

U.S. Naval Postgraduate School

Monterey, CA 93943 USA

 

Abstract

 

The easiest way to index multimedia from ordinary Web pages is to find their captions.  However, captions are not used consistently, and retrieval effectiveness for caption-based multimedia browsers is significantly poorer than that for text retrieval.  We show that statistical "context" information about the Web pages at a site can help recognize image captions by quantifying their "representativeness".  Experiments were conducted on a random sample of 5010 image captions from 3.2 million candidates from 5 million Web pages, and 1220 audio and video captions from 720,000 candidates from those same Web pages. They showed that while statistical context information was definitely a good clue, it usually did not appear to add much beyond what good local clues in the candidate caption-image pair itself provide, and provided no help for caption-audio and caption-video pairs.

 

Keywords: captions, World Wide Web, data mining, disambiguation, context, similarity

 

** This paper appeared in the Internet Computing Conference, Las Vegas, NV, June 2004. **

 

1. Introduction

 

Captions are the best tool to index images and other multimedia objects in large unstructured document collections such as the Web.  But finding captions is not straightforward on the Web because page authors use widely different ways of placing them and displaying them [1, 2].  Commercial "image search" software typically addresses only the "easy" captions, obtaining high precision (accuracy) but low recall (coverage).  Our previous work on the MARIE-4 system [3] showed that seven key factors for captions were statistically significant on a randomly selected training set.  The most valuable of these factors was the occurrence of particular words in the proposed caption (e.g. "caption", "figure", "photograph", "shows", "above", "left", etc.) or the absence of particular words (e.g. "click", "page", "button", "bytes", "free", "now", etc.).  The other valuable factors were shown to be the caption type (e.g. italics, centered text, item in table, alternative text for the image, text of a clickable link, etc.); format of the associated image; length of the caption; particular words in the image-file name, especially common words between the image file name and the caption; digits in the image-file name; and image size.  Distance of the caption from the image was shown to be unhelpful within 1000 characters of the image.

 

However, that work did not examine the clue of consistency among the captions at a site or within a directory.  Previous research has examined the local "link context" of a page [4], but there is also global context in the directory and site of the page containing the image.  Many sites use designers, design templates, and style sheets to ensure a consistent "look and feel" to their pages.  This can mean using boldface for all captions, centering captions below their images, capitalizing the words of each caption, preceding each caption with a tag word like "Figure", putting the caption in a link to an image, etc.  Recognizing a consistent caption style should make it easier to recognize atypical captions, such as abnormally short or unusually placed ones, that are nonetheless like the other captions on the site in other ways.  The present work examines this problem more closely.

 

2. Defining caption context from statistical analysis

 

To experiment more carefully with the effect of caption context, we first wrote a program to calculate statistics on key features of the use of all nontrivial images on a Web site, as found by our MARIE-4 crawler and caption-rater.  ("Nontrivial" meant we automatically excluded images less than 2000 bytes in size and those occurring three or more times, which eliminated most graphics icons.)  Since directories at a Web site also can differ considerably in features, we subcategorized the data by directories of 10 pages or more.  Fifteen statistics were chosen to reflect factors we saw frequently in style sheets:

1)      vertical relationship of the caption to the image (above, below, or to the side);

2)      horizontal alignment (centered or not);

3)      whether the caption begins with a tag ("Figure", "Table", etc.) or not;

4)      whether the caption has an image-suggesting keyword (e.g. "photo", "shows", and "above");

5)      length of the caption, defined as short (less than 25 characters), medium, or long (more than 100);

6)      whether the caption is at the top of the page (defined as the first 1000 characters);

7)      whether the caption is a single sentence;

8)      whether the caption is capitalized;

9)      size of the image, defined as small (image height plus width is less than 400), medium, or large (image height plus width is greater than 900);

10)   format of the image (GIF, JPEG, or PNG);

11)   whether the image file name is an English word or appended pair of English words;

12)   whether the image file name contains hyphens or underscores;

13)   whether the image file name contains digits;

14)   one of eleven categories of the caption type (italics, boldface, font, big heading, medium heading, small heading, paragraph or list item, table item, caption-suggesting wording, alternative text, clickable link or explicit caption);

15)   the average confidence rating of a caption candidate over the directory or site.

For the statistics based on numeric ranges, we chose the ranges to give approximately an even distribution on the training set.
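To make the binned statistics concrete, the numeric-range features (5) and (9) could be computed as below.  This is a hypothetical Python sketch using the cutoffs stated above, not the MARIE-4 code; the function names are illustrative.

```python
# Hypothetical binning for statistics (5) and (9) above; the cutoffs
# (25/100 characters, 400/900 pixels of height plus width) are those
# stated in the text, chosen to split the training set roughly evenly.

def caption_length_class(caption):
    """Statistic 5: short (< 25 characters), medium, or long (> 100)."""
    n = len(caption)
    if n < 25:
        return "short"
    if n > 100:
        return "long"
    return "medium"

def image_size_class(width, height):
    """Statistic 9: small (height + width < 400), medium, or
    large (height + width > 900)."""
    s = width + height
    if s < 400:
        return "small"
    if s > 900:
        return "large"
    return "medium"
```

Each of the fifteen statistics reduces in this way to a small set of categories, so a caption-image pair can be summarized as a short categorical feature vector.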

 

Table 1 shows the average fraction of caption-image pairs (weighting by their caption probability) having certain features over some representative sites.  Clearly there are important differences in the statistics between sites.

 

It might be objected that aggregate statistics such as these are inferior to a set of prototypical caption-image pairs to represent the tendencies of a directory or site since many sites have several distinct kinds of common image-caption pairs.  However, even then statistics can help because each of the common kinds will get good representation in aggregate statistics, and no one kind will override the others.  Nonstatistical approaches to context like [5] only work well for contexts centered on active agents, which is not the case here.

 

Table 1: Statistical characteristics of captions on example Web sites.

Web site                               Nontrivial  Caption  Tagged    Single-    JPEG-    English
                                       image-text  below    caption?  sentence   format   file
                                       pairs       image?             caption?   image?   name?

web.nps.navy.mil                            8,752     0.54      0.06      0.81     0.26      0.65
www.history.navy.mil                      301,945     0.79      0.53      0.57     1.00      0.99
www.nawcwpns.navy.mil                       4,972     0.58      0.04      0.82     0.16      0.50
www.apple.com                              82,827     0.41      0.01      0.38     0.76      0.57
www.amazon.com (first 50,000 pages)        97,234     0.13      0.00      0.99     0.99      0.01
www.nationalgeographic.com                 19,287     0.49      0.00      0.75     0.89      0.61
www.kepnerfamily.com                        1,098     0.40      0.00      0.67     1.00      0.29
www.lacoast.gov                             4,261     0.47      0.17      0.82     0.66      0.66
www.hazegray.org                            6,285     0.66      0.00      0.88     1.00      0.47
www.dmoz.org (first 30,000)                 3,640     0.47      0.00      0.94     0.22      0.50
dizzy.library.arizona.edu                  31,702     0.44      0.00      0.75     0.81      0.40
www.ipfw.edu                               14,346     0.47      0.04      0.80     0.83      0.47

 

 

Other specialized forms of context may also be useful with captions.  For instance, PowerPoint presentations converted into image files are common on the Web.  These can be inferred for directories with files in numerical sequence like "image01", "image02", etc.  Images that fit these patterns are less likely to be captioned since most slides contain their own captions in the image itself.  But their uncaptionability can also usually be inferred by negative word clues like "slide" and "presentation" in the name of the image file, so we did not implement any special mechanism for them.
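A directory of numerically sequenced image files can be detected with a simple heuristic; the following Python sketch is an illustration (the filename pattern and the consecutiveness requirement are our assumptions, not an implemented MARIE-4 mechanism).

```python
import re

def looks_like_slide_sequence(filenames):
    """Heuristic: a directory of files like image01.gif, image02.gif, ...
    (a common result of exporting presentation slides as images) suggests
    the images carry their own captions and need no external caption."""
    stems = []
    for name in filenames:
        m = re.match(r"([A-Za-z]+)(\d+)\.\w+$", name)
        if not m:
            return False
        stems.append((m.group(1).lower(), int(m.group(2))))
    prefixes = {p for p, _ in stems}
    numbers = sorted(n for _, n in stems)
    # Require one shared alphabetic prefix and consecutive numbering.
    return len(prefixes) == 1 and numbers == list(range(numbers[0], numbers[0] + len(numbers)))
```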

 

3. Baseline probabilities for captions

 

As a model for the likelihood that a candidate is a caption based on local clues, [3] used a neural network.  But this has a number of disadvantages, including overfitting of the training data and oversensitivity to a single positive factor.  Since both of these issues could seriously confound context effects, for the experiments reported here we switched to a Naive Bayes approach, in which such issues can be handled well [6].  We used the odds form of Naive Bayes:

 

o(C|(E1&E2&...&Em)) = o(C) * (o(C|E1)/o(C)) * (o(C|E2)/o(C)) * ... * (o(C|Em)/o(C))

 

where C represents the condition of the candidate being a caption, the E_i terms represent evidence factors, and o(X) = prob(X)/(1-prob(X)).  We used this formula for the nine major factors that were sufficiently supported in newly conducted tests on our 5338-case training set (1716 captions and 3622 noncaptions): caption words, image file-name words, fraction of the candidate that was nonalphabetic, length of the caption, HTML tags used for the caption candidate, number of common words between caption and image file name, image size, image format, and number of digits in the image file name.  We also used the above formula for combining subclues of the first two factors, the words of the text and the image file name (approximating by 1 the odds ratio for a caption given the absence of any particular word, since any useful word was rare).  Using the Naive Bayes approach, we improved precision over the neural-network approach from 73% to 84% for the top 10% of caption candidates in the training set, and from 67% to 73% for the top 30%.  A mildly nonlinear function was applied to these Naive Bayes values to make them closer to the observed probabilities of captions in the training set.
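The odds-form combination is straightforward to implement.  Below is a minimal Python sketch of the formula above; the probability estimates passed in would, in our setting, come from training-set frequencies.

```python
def odds(p):
    """Convert a probability to odds: o(X) = prob(X) / (1 - prob(X))."""
    return p / (1.0 - p)

def prob(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)

def naive_bayes_odds(prior_p, conditional_ps):
    """Odds-form Naive Bayes:
    o(C|E1&...&Em) = o(C) * prod_i [ o(C|Ei) / o(C) ].
    prior_p is prob(C); conditional_ps are the prob(C|Ei) estimates.
    Returns the combined value as a probability."""
    o_prior = odds(prior_p)
    result = o_prior
    for p in conditional_ps:
        result *= odds(p) / o_prior
    return prob(result)
```

Note that with a single evidence factor the formula reduces to o(C|E1), and evidence no stronger than the prior leaves the prior unchanged, which is the behavior that makes the odds form convenient for combining many weak clues.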

 

4. The effect of context on caption identification

 

Given that we can identify the statistical traits of Web sites and their directories, can this help recognize less-obvious captions on those pages?  To study this, we interpreted context to mean the similarity of a caption-image pair to the other caption-image pairs in the same directory, and used case-based reasoning to measure that similarity [7].  Similarity can be established by a "mean-caption" approach that compares the features of a caption-image pair to the mean values for its directory.  The directory's context feature vector was taken as the weighted means of the fifteen properties described in section 2 over the directory.  We weighted cases by their estimated probability of being a caption-image pair, as computed by the methods of section 3, so that the values for more-likely captions counted more.  When a page's own directory had fewer than 10 pages, we inherited context statistics from the superdirectories containing the page.
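The probability-weighted directory mean can be sketched as follows.  This is an illustrative Python fragment, not the original implementation; each candidate is assumed to pair a 0/1 feature vector with its section-3 caption probability.

```python
def directory_context_vector(candidates):
    """Weighted mean of per-candidate binary feature vectors, with each
    candidate weighted by its estimated caption probability, so that
    likely captions dominate the directory's context vector.
    candidates: list of (feature_vector, caption_probability) pairs,
    where feature_vector is a list of 0/1 values of equal length."""
    total_w = sum(w for _, w in candidates)
    n = len(candidates[0][0])
    mean = [0.0] * n
    for vec, w in candidates:
        for i, v in enumerate(vec):
            mean[i] += w * v
    return [m / total_w for m in mean]
```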

 

To measure the representativeness of a caption-image pair P for a Web directory, we took the inner product of the mean feature vector for the directory with the feature vector of P (a vector of 1's and 0's indicating the features present, designed so there were always fifteen 1's in the vector).  As usual with inner products, we multiplied corresponding vector components, then divided by the norm of the directory feature vector times the square root of 15 (the norm of any vector with exactly fifteen 1's).  We then applied a mildly nonlinear function to this result to better approximate the actual probability of captions with that output; this function was obtained by fitting to the training set.  The final result is in the range 0 to 1.
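Since the candidate vector always contains fifteen 1's, this normalization makes the measure a cosine-style similarity; a Python sketch (omitting the fitted nonlinear correction described above):

```python
import math

def representativeness(mean_vec, cand_vec):
    """Inner product of the directory's mean feature vector with the
    candidate's 0/1 feature vector, divided by the norm of the mean
    vector times sqrt(15); sqrt(15) is the norm of any 0/1 vector with
    fifteen 1's, so the result lies in [0, 1]."""
    dot = sum(m * c for m, c in zip(mean_vec, cand_vec))
    norm = math.sqrt(sum(m * m for m in mean_vec))
    return dot / (norm * math.sqrt(15))
```

A candidate identical to a directory whose pairs all share its features thus scores 1.0, and a candidate sharing none of the directory's common features scores near 0.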

 

Representativeness alone is not sufficient evidence for a caption.  If a candidate is highly representative of its site, but the site has poor caption candidates (like www.amazon.com), the candidate should be rated unlikely.  And if the candidate is not representative of its site, we cannot conclude much from its context.  As the simplest adequate modification of the odds approach, we used the formula

 

 

where X is the context information, E is the other evidence for a caption, and R is the representativeness metric; the exponent multiplier value of 0.5 was found by experiment.

 

We tested the effect of context information on a new random sample "test4" of 5010 entries drawn from near-exhaustive caption indexes created for 52 of the large Web sites used in our earlier experiments, plus a few more "mil" sites.  The 52 sites were chosen to be diverse; the runs were exhaustive except for very large sites like www.stanford.edu, www.dmoz.org, and www.amazon.com.  These Web "crawls" in December 2003 and January 2004 also provided the data for the context statistics in our tests.  Altogether, our crawler and subsequent filtering found 3,258,399 caption candidates from examining around 5,000,000 pages in those two months.  We selected the 5010 candidates for test4 by random selection with probability roughly proportional to the square root of the number of candidates from each site, so sites with many images did not overly bias the evaluation but still had more representation than small sites.  To see whether the choice of media had an effect, we also created in the same way a test set of 1220 audio and video captions drawn randomly from the 201,661 audio and 518,834 video captions found on the same Web sites during the same crawl.
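The square-root-proportional sampling can be sketched as below.  This is an illustrative reconstruction under our own assumptions (per-site quotas rounded from square-root weights), not the exact selection procedure used.

```python
import math
import random

def sample_across_sites(site_candidates, total, seed=0):
    """Draw a test sample with per-site quotas roughly proportional to
    the square root of each site's candidate count, so image-heavy
    sites do not dominate but still outweigh small sites.
    site_candidates: dict mapping site name -> list of candidates."""
    rng = random.Random(seed)
    weights = {s: math.sqrt(len(c)) for s, c in site_candidates.items()}
    z = sum(weights.values())
    sample = []
    for site, cands in site_candidates.items():
        k = min(len(cands), round(total * weights[site] / z))
        sample.extend(rng.sample(cands, k))
    return sample
```

With a 100-candidate site and a 1-candidate site, for example, the quotas come out 10:1 rather than the 100:1 of proportional sampling.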

 

Table 2 gives measured precision (fraction of captions correctly identified in all captions identified) as a function of measured recall (fraction of captions correctly identified of all captions in the test set).
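These precision-at-recall figures can be computed from a confidence-ranked candidate list; a minimal Python sketch of the evaluation (our illustration, not the evaluation code used):

```python
def precision_at_recall(ranked_labels, recall_levels):
    """ranked_labels: candidate labels (True = real caption), sorted by
    descending confidence.  Returns a dict mapping each recall level to
    the precision at the first ranking cutoff where that fraction of
    all true captions has been recovered."""
    total_pos = sum(ranked_labels)
    out = {}
    tp = 0
    for i, lab in enumerate(ranked_labels, start=1):
        tp += lab
        recall = tp / total_pos
        precision = tp / i
        for r in recall_levels:
            if r not in out and recall >= r:
                out[r] = precision
    return out
```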

 

Table 2: Experimental results for precision as a function of recall.

Test / Recall                                                          1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
T1:  test4 (5010 items), clues but no context                          .28   .30   .32   .33   .35   .39   .43   .53   .61   .64
T2:  test4, context factor only, exponent multiplier 0.5               .28   .31   .34   .35   .36   .37   .39   .40   .43   .46
T3:  test4, average candidate weight in a directory alone              .28   .31   .33   .35   .34   .34   .35   .38   .41   .36
T4:  test4, with both local clues and context factor, multiplier 0.5   .28   .30   .32   .34   .37   .41   .43   .54   .61   .60
T5:  same as T4 except multiplier 1.0                                  .28   .30   .32   .34   .37   .40   .44   .53   .60   .57
T6:  same as T4 ignoring title and filename captions, multiplier 0.5   .37   .40   .41   .44   .47   .50   .58   .62   .60   .58
T7:  same as T4 but no inheritance of context                          .28   .30   .32   .34   .36   .40   .43   .52   .61   .61
T8:  like T1 but only the subset of test4 from *.epa.gov (207 items)   .26   .30   .36   .37   .38   .46   .47   .53   .65   .58
T9:  like T4 but only *.epa.gov                                        .26   .29   .36   .38   .41   .50   .51   .62   .65   .58
T10: like T1 but only the subset *.stanford.edu (346 items)            .22   .24   .25   .26   .28   .29   .29   .34   .46   .65
T11: like T4 but only *.stanford.edu                                   .22   .23   .25   .26   .29   .31   .32   .38   .53   .63
T12: like T1 but only the subset *.history.navy.mil (177 items)        .39   .38   .38   .41   .46   .42   .48   .60   --    --
T13: like T4 but only *.history.navy.mil                               .39   .38   .38   .42   .44   .43   .45   .51   .46   .38
T14: audio and video captions (1220), clues but no context             .18   .26   .30   .31   .32   .34   .38   .46   .57   --
T15: audio and video captions with context                             .18   .26   .30   .31   .32   .34   .33   .35   .39   .42

 

 

To summarize the results:

1)      the context factor alone (T2) and the average candidate weight of a directory alone (T3) matched local clues alone (T1) at high recall but were noticeably worse at low recall;

2)      adding the context factor to local clues (T4, T5) gave only small gains over local clues alone, mostly in the middle recall range;

3)      ignoring title and filename captions (T6) improved precision considerably at most recall levels;

4)      inheriting context from superdirectories (T7) made little difference;

5)      results varied by site: context helped *.epa.gov (T9 versus T8), helped *.stanford.edu slightly (T11 versus T10), and hurt *.history.navy.mil at low recall (T13 versus T12);

6)      for audio and video captions, context gave no benefit and reduced precision at low recall (T15 versus T14).

 

5. Context from image properties

 

An obvious question is what kind of additional knowledge would further help in disambiguating captions.  Our hunch is that it is indeed possible to significantly improve performance because people can still do the task better than our system, albeit more slowly.  So the obvious idea is to include information from image content analysis despite its large processing-time requirements.  A glance at the objects in a photograph often makes it easy for people to connect a caption to an image.  So we need to do at least some simple image processing to determine the general characteristics of the image and guess the major shapes within it.

 

General classification of images (color photographs, black-and-white photographs, manipulated photographs, line drawings, block diagrams, simple graphics, etc.) is not difficult to do and can provide useful extra information for connecting captions to images.  Other useful and not-difficult classifications are indoors/outdoors, day/night, people/scenery, and manipulated/unmanipulated.  More detailed taxonomies [8] can help but are hard to implement in automatic classifiers.  The lower-level image primitives of [9] appear more promising for assigning feature vectors to images: such things as average size of regions, the general kind of division of the image (e.g. vertically into two halves), the appearance of straight versus curved edges, double edges, regularly curving edges, regularly shaped regions, edges within regions, granulation, glossiness, and color.

 

Our previous work has shown that it is often not difficult to distinguish foreground, background, and subject of an image by relative location, size, and contrast of regions [10].  Image similarity can also be computed using feature vectors, and images similar to those known to be captioned images will tend to be captioned images as well.
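The closing idea, that images whose feature vectors resemble known captioned images tend to be captioned themselves, amounts to nearest-neighbor classification over image feature vectors.  A hypothetical Python sketch (the majority-vote rule and the choice of cosine similarity are our assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def likely_captioned(image_vec, labeled, k=3):
    """Nearest-neighbor sketch: predict that an image is captioned if
    most of its k most-similar labeled images are captioned.
    labeled: list of (feature_vector, was_captioned) pairs."""
    sims = sorted(labeled, key=lambda fv: cosine(image_vec, fv[0]), reverse=True)
    votes = [c for _, c in sims[:k]]
    return sum(votes) / len(votes) > 0.5
```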

 

6. Conclusion

 

It appears that we have reached the limit of what can be accomplished in distinguishing captions on Web pages without content analysis of the accompanying media.  Since content analysis requires considerably more processing time per instance than the methods described here, better performance may be impractical in building large Web media indexes until computers become significantly faster.   Nonetheless, we have shown that methods without content analysis can improve significantly on the current commercial state of the art if designed carefully.

 

References

 

[1] Sclaroff, S., La Cascia, M., Sethi, S., & Taycher, L., Unifying textual and visual cues for content-based image retrieval on the World Wide Web.  Computer Vision and Image Understanding, 75(1/2), 86-98, 1999.

[2] Srihari, R., Zhang, Z., & Rao, A., Intelligent indexing and semantic retrieval of multimodal documents.  Information Retrieval, 2(2), 245-275, 2000.

[3] Rowe, N., MARIE-4: A high-recall, self-improving Web crawler that finds images using captions.  IEEE Intelligent Systems, 17(4), July/August 2002, 8-14.

[4] Pant, G., Deriving link-context from HTML tag tree.  Proc. 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, pp. 49-55, 2003.

[5] Ranganathan, A., & Campbell, R., An infrastructure for context-awareness based on first order logic.  Personal and Ubiquitous Computing, 7(6) (December), pp. 353-364, 2003.

[6] Korpipaa, P., Koskinen, M., Peltola, J., Makela, S.-M., & Sappanen, T., Bayesian approach to sensor-based context awareness.  Personal and Ubiquitous Computing, 7(2) (July), pp. 113-124, 2003.

[7] Mitchell, T., Machine Learning.  Boston, MA: WCB McGraw-Hill, 1997.

[8] Burford, B., Briggs, P., and Eakins, J., A taxonomy of the image: On the classification of content for image retrieval.  Visual Communications, 2(2), pp. 123-161, 2003.

[9] Saint-Martin, F., Semiotics of Visual Language.  Bloomington, IN: Indiana University Press, 1990.

[10] Rowe, N., Finding and labeling the subject of a captioned depictive natural photograph.  IEEE Transactions on Data and Knowledge Engineering,  14(1), 202-207, January/February 2002.