Retrieving Captioned Pictures Using Statistical Correlations and a Theory of Caption-Picture Co-reference

Neil C. Rowe

Computer Science, CS/Rp, U.S. Naval Postgraduate School, Monterey, CA 93943, USA

(This document derived from N. C. Rowe, Retrieving captioned pictures using statistical correlations and a theory of caption-picture co-reference. Fourth Annual Symposium on Document Analysis and Retrieval, Las Vegas, NV, April 1995.)

1. Introduction

The MARIE project is investigating new methods for efficient information retrieval of captioned multimedia from multimedia libraries. Captions are essential to understanding multimedia and to finding relevant examples quickly. Our approach, shared by (Srihari, 1994), is to analyze both the caption and the picture in advance into semantic networks, then match user English queries to these semantic networks in a way that maximizes retrieval speed. Our theory and algorithms are being tested on a particular example, 100,000 captions constituting the entire unclassified portion of the photographic library at NAWC-WD, the U.S. Navy test facility in China Lake, California, USA (Rowe and Guglielmo, 1993), plus some of the pictures. Our focus is thus on technical captions and technical pictures that require specialized domain-dependent knowledge to interpret. (Rowe and Guglielmo, 1993) shows experimental results that confirm that our approach to caption-only processing gets higher precision for similar recall than a keyphrase-matching system for the same application.

An important problem with captioned pictures is the ambiguity of words and shapes. A sidewinder can be both a missile developed at NAWC-WD and a snake in the desert around NAWC-WD (there are many pictures of local flora and fauna in this database). Such ambiguity is a central problem in information retrieval (Krovetz and Croft, 1992). There is also ambiguity in the shapes in the pictures: If the user asks for a picture of an F-18 aircraft and the aircraft is not mentioned in the caption, it may still appear in the background, but how close need the shape be to that of an ideal F-18 to be identified as one? We will use the concept of "information filtering" (Belkin and Croft, 1992) to try to rule out the unreasonable possibilities first.

Our approach to these problems has four parts. First, since captions are much faster to analyze than pictures, we have a comprehensive approach to disambiguation of caption-word senses based on a syntactic statistical parser and using binary word-correlation statistics. Second, we have a set of constraints covering the reference from captions to pictures. Third, we have a set of generalized correlations from picture subjects to captions. Fourth, we can disambiguate picture shapes using statistics on co-occurrence of shape pairs in a picture in a particular relationship. The second is fully implemented, the first is close to fully implemented, the third is partially implemented, and the fourth is not yet implemented. This paper will describe the four parts of our approach in order, and will focus on the theory behind each.

2. Example captions

lgb skipper bomb on a-7c bu# 156739 aircraft (cl on tail). side view of aircraft on runway.

This is a typical caption (whose picture is shown in section 7). This follows the common pattern of two noun phrases ended by periods where the first noun phrase describes the subject of the picture and the second describes the picture itself. A variety of technical domain knowledge is necessary to interpret this caption, like that an A-7C is an aircraft, BU #156739 is an aircraft ID number, Skipper is the name of a bomb, and LGB means it is a laser-guided weapon. Interpretation also requires the syntactic knowledge that ID numbers come after aircraft types like participles instead of adjectives, and the semantic knowledge that "CL" on a tail means letters that are written on the aircraft. Both acronyms can be decoded automatically by examining the corpus of captions: CL is China Lake, the location of the base and a phrase that is used in unabbreviated form hundreds of times in the captions, and LGB can be figured out from a few captions that write out "Laser-Guided Bomb" explicitly in the same sentence as LGB.
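One way such automatic decoding of acronyms might work is sketched below: scan a caption for a run of consecutive words whose initial letters spell the acronym. The function name `expand_acronym` and the simple tokenization are assumptions made for illustration only; this is not claimed to be the method used in the MARIE project.

```python
import re

def expand_acronym(acronym, sentence):
    """Return the first run of consecutive words in `sentence` whose
    initials spell `acronym`, or None if there is no such run.
    Hypothetical helper for illustration only."""
    letters = acronym.lower()
    # Crude tokenization (an assumption): keep alphabetic runs only.
    words = re.findall(r"[a-z]+", sentence.lower())
    size = len(letters)
    for start in range(len(words) - size + 1):
        window = words[start:start + size]
        if all(word[0] == letter for word, letter in zip(window, letters)):
            return " ".join(window)
    return None

lgb = expand_acronym("LGB", "laser-guided bomb (lgb) on a-7c")
cl = expand_acronym("CL", "aircraft at china lake")
```

A fuller version would presumably count how often each candidate expansion occurs across the whole corpus before accepting it, since a single coincidental run of initials is weak evidence.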

tp 1356. a-7b/e srt-6 escape system test. synchro firing at 1075'' n x 30'' w. dummy just breaking canopy.

The corresponding picture of this caption is shown in Section 5. Here TP 1356 is the number of a test, and functions as a "supercaption" describing a set of pictures, since it appears on the front of a number of captions. SRT-6 is the version of the escape system that is being tested; typically this occurs after the aircraft type because it is the true subject of the test. 1075'' n x 30'' w is one of a number of special formats used for precision in experimental setups, which we handle by special domain-dependent lexical rules. "Dummy" and "canopy" are props associated with this test, which have specific meaning for the test; we could guess their senses with statistics on how often particular senses correlate with the word "test" or the concept "ejection test". Note here we have four sentences of a different sort than in the first example: applicable supercaption, abstract subject of the picture, visible subject of the picture, and auxiliary subject that is especially interesting.

cartoon. two f-14's. one carrying lightweight cheetah, the other carrying heavy phoenix assisted by balloons.

This illustrates the value of domain-dependent statistics, since there are four words here that cannot be interpreted in their most common English senses. Cheetah and Phoenix are names of missiles (which could be inferred by noticing how often they are launched or tested in the other captions). Balloons are special devices hanging from the missile. A cartoon is just a line drawing, a technical term here.

3. Our statistical parsing

This application, like many technical applications, exhibits large numbers of synonyms. Furthermore, queries can introduce additional synonyms, supertypes, and subtypes of the terms in the captions. So it is important to have a comprehensive thesaurus of related terms for information retrieval. Because the existing keyword matching system at NAWC-WD cannot address these issues, it is considered unhelpful most of the time, and is mostly ignored by personnel. (Rowe and Guglielmo, 1993) reports on MARIE-1, a prototype implementation in Prolog that we developed for them, a system that appears much more in the direction of what their staff and users want.

But MARIE-1 took a man-year to construct and only handled 220 pictures (averaging 20 words per caption) from the database. It used the standard approach of intelligent natural-language processing for information retrieval (Rau, 1988; Sembok and van Rijsbergen, 1990) of hand-coding of lexical and semantic information for the words in a narrow domain. We used the DBG software from Language Systems, Inc. (in Woodland Hills, CA) to help construct the parser for MARIE-1. Nonetheless, considerable additional work was needed to adapt DBG to our domain. Even with only 220 captions, they averaged 50 words in length and required a lexicon and type hierarchy of 1000 additional words beyond the 1000 we could use from the prototype DBG application for cockpit speech. A large number of additional semantic rules had to be written for the many long and complicated noun-noun sequences that had no counterpart in cockpit speech. These required difficult debugging because DBG's multiple-pass semantic processing is tricky to figure out, and the inability of DBG to backtrack and find a second interpretation meant that we could only find one bug per run. DBG's syntactic features required a grammar with fixed assigned probabilities on each rule, which necessitated careful tradeoffs that considered the entire corpus, to choose what was often a highly sensitive number. The lack of context sensitivity meant that this number had to be programmed artificially for each rule to obtain adequate performance (for which some researchers have claimed success), instead of being taken from applicable statistics on the corpus, which makes more sense. But this "programming" was more trial-and-error than anything.

MARIE-1's approach would be unworkable for the 29,538 distinct words in the full 100,000-caption NAWC database, as suggested by the problems of this approach discussed in (Smeaton, 1992). So for MARIE-2 we are using two new ideas: simple lexicon derivation in part from a standard thesaurus system, and statistical parsing to avoid needing to formulate subtle semantic distinctions. Statistical parsing assigns probabilities of co-occurrence to sets of words, and uses these probabilities to guess the most likely interpretation of a sentence. The probabilities can be derived from statistics on a corpus, a representative set of example sentences, and they can capture fine semantic distinctions that would otherwise require additional lexicon information. We discuss it more in the next section.

The NAWC-WD captions require a lexicon of about 42,000 words, including words closely related to those in the captions. The standard thesaurus system we used was Wordnet (Miller et al., 1990), which covers synonyms, superconcepts, subconcepts, superparts, and subparts, plus rough word frequencies and morphological processing. From the word frequencies, we identified standard aliases of all the words we needed, shortening the required Wordnet lexicon information by 85%. The full breakdown of the lexicon was:

Table 1:

Number of captions: 36,191
Number of word occurrences: 610,182
Number of distinct caption words: 29,082
Words recognized by Wordnet: 6,729
Explicitly defined: 2,626
Morphological variants of preceding words: 2,335
Numbers: 3,412
Person names: 2,791
Place names: 387
Product names: 264
Prefixed codes: 3,256
Other recognized special formats: 10,179
Misspellings and mispunctuations: 1,174
Abbreviations: 1,093
Remaining words (assumed equipment names): 1,876
Total number of word senses recognized: 69,447

The special-format rules do things like interpret "BU# 462945" as an aircraft identification number and "02/21/93" as a date. Misspellings and abbreviations were obtained mostly automatically, without human checking, from rule-based systems described in (Rowe and Laitinen, 1994). The effort for lexicon-building was only 0.4 of a man-year, which suggests good portability of this approach to other technical-caption domains.
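As a rough illustration of what special-format rules of this kind might look like, the sketch below uses regular expressions to tag the two formats just mentioned plus the test-number prefix from Section 2. The patterns, the tag names, and the function `classify_token` are assumptions for illustration, not the rules actually used in MARIE-2.

```python
import re

# Hypothetical special-format rules as (tag, pattern) pairs.
# The patterns are illustrative guesses, not the MARIE-2 rule set.
SPECIAL_FORMATS = [
    ("aircraft_id", re.compile(r"^bu#\s*\d{6}$", re.IGNORECASE)),
    ("date", re.compile(r"^\d{2}/\d{2}/\d{2}$")),
    ("test_number", re.compile(r"^tp\s*\d+$", re.IGNORECASE)),
]

def classify_token(text):
    """Return the tag of the first special format that matches, or None."""
    candidate = text.strip()
    for tag, pattern in SPECIAL_FORMATS:
        if pattern.match(candidate):
            return tag
    return None

tags = [classify_token(t) for t in ("BU# 462945", "02/21/93", "tp 1356", "aircraft")]
```

Ordering the rules from most to least specific keeps the first-match policy predictable when formats overlap.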

4. Statistical parsing

We use a bottom-up chart parser with a stochastic grammar (Charniak, 1993). Entries are created for every word sense of the words in the captions, and then we attempt to combine chart entries according to a context-free grammar in Chomsky Normal Form (in which the rules must have only one or two replacement symbols each). We pick the best combination first at every step, or the most likely parse-rule application, so our parsing amounts to a branch-and-bound search; the approach of (Jones and Eisner, 1992) has similarities.
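The control strategy described above can be sketched as an agenda-driven chart parser that always expands the cheapest pending entry first. This is a minimal sketch under stated assumptions: costs are treated as negative log probabilities (so lower is better and costs add), only binary rules are shown, and the toy lexicon, rule table, and function name `best_first_parse` are hypothetical rather than taken from MARIE-2.

```python
import heapq
import itertools

def best_first_parse(words, lexicon, binary_rules, goal="NP"):
    """Best-first chart parsing for a grammar with binary rules.
    lexicon: word -> list of (category, cost)
    binary_rules: (left_cat, right_cat) -> list of (parent_cat, cost)
    Returns (cost, tree) for the cheapest goal entry spanning all words,
    or None if no parse exists."""
    tie = itertools.count()  # tie-breaker so the heap never compares trees
    agenda = []
    for i, word in enumerate(words):
        for cat, cost in lexicon.get(word, []):
            heapq.heappush(agenda, (cost, next(tie), i, i + 1, cat, word))
    chart = {}  # (start, end, category) -> (cost, tree)
    while agenda:
        cost, _, start, end, cat, tree = heapq.heappop(agenda)
        if (start, end, cat) in chart:
            continue  # a cheaper entry for this span was already finalized
        chart[(start, end, cat)] = (cost, tree)
        if start == 0 and end == len(words) and cat == goal:
            return cost, tree
        for (s, e, c), (other_cost, other_tree) in list(chart.items()):
            if e == start:  # existing entry lies immediately to the left
                for parent, rule_cost in binary_rules.get((c, cat), []):
                    heapq.heappush(agenda, (other_cost + cost + rule_cost, next(tie),
                                            s, end, parent, (parent, other_tree, tree)))
            if s == end:  # existing entry lies immediately to the right
                for parent, rule_cost in binary_rules.get((cat, c), []):
                    heapq.heappush(agenda, (cost + other_cost + rule_cost, next(tie),
                                            start, e, parent, (parent, tree, other_tree)))
    return None

# Toy data (hypothetical): "side" is ambiguous between ADJ and NP.
lexicon = {"side": [("ADJ", 0.5), ("NP", 1.0)], "view": [("NP", 0.3)]}
binary_rules = {("ADJ", "NP"): [("NP", 0.4)], ("NP", "NP"): [("NP", 0.9)]}
result = best_first_parse(["side", "view"], lexicon, binary_rules)
```

Because all costs are nonnegative, the first goal entry popped from the agenda is the cheapest one, which is what lets the search stop early instead of enumerating every parse.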

The main innovation in our parsing approach is in the way weights are assigned to parses. For rules with only one replacement symbol, the weights are evenly distributed over all generalization symbols, of which there is usually only one. But for rules with two replacement symbols, like NP -> NP NP, we extract the headwords and look up their correlation statistic. The headword is defined for each nonterminal in the grammar as the most important word (with its sense) for its corresponding phrase; for instance, the headword of an NP is the principal noun, and the headword of a PP is the direct object of the preposition. For example, for the phrase "modified F-18 wing front", "F-18" is the headword of "modified F-18" and "wing" is the headword of "wing front"; so we can rate this parse based on how often we have counted co-occurrences of "F-18" and "wing", which is actually quite often.
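To make the headword lookup concrete, the following sketch rates a two-symbol rule application by a smoothed relative frequency of its two headwords. The counts, the add-one style smoothing, and the function name `combination_weight` are all assumptions for illustration; the paper does not specify the exact weighting formula.

```python
from collections import Counter

def combination_weight(left_head, right_head, pair_counts, smoothing=1.0):
    """Smoothed relative frequency of a headword pair among all counted pairs.
    Unseen pairs get a small nonzero weight instead of zero."""
    total = sum(pair_counts.values())
    slots = len(pair_counts) + 1  # one extra slot reserved for unseen pairs
    return (pair_counts[(left_head, right_head)] + smoothing) / (total + smoothing * slots)

# Hypothetical headword co-occurrence counts.
pair_counts = Counter({("f-18", "wing"): 40, ("modified", "front"): 1, ("f-18", "front"): 3})

# Two competing ways to combine phrases, identified by their headword pairs.
candidates = [("f-18", "wing"), ("modified", "front")]
best = max(candidates, key=lambda pair: combination_weight(pair[0], pair[1], pair_counts))
```

Here the frequently observed pair ("f-18", "wing") wins, mirroring the "modified F-18 wing front" example.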

The use of headwords reduces the problem of arbitrarily complex correlations to the more manageable one of binary correlations. We argue that most semantic phenomena in language can be explained by binary correlations like subject-object, subject-modifier, and object-object.

Obtaining the statistics on pairs of word senses is the main obstacle to this approach. Eventually we will use the statistics on headwords associated in correct parses of our library of captions; for now, we use initial statistics that we obtained by counting pairs of successive words in the captions. This works well for noun-phrase constructions, so common in our corpus, because ideas that caption writers often associate tend to occur frequently in succession in the captions. So the frequent occurrence of "sidewinder missile" makes it easier to recognize "sidewinder aim-9r missile".
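Counting pairs of successive words is simple to sketch. The version below assumes whitespace tokenization and lowercasing, which is cruder than a real lexical analyzer; the function name `count_successive_pairs` and the sample captions are hypothetical.

```python
from collections import Counter

def count_successive_pairs(captions):
    """Count how often each ordered pair of adjacent words occurs."""
    counts = Counter()
    for caption in captions:
        words = caption.lower().split()  # naive tokenization (assumption)
        counts.update(zip(words, words[1:]))
    return counts

captions = ["sidewinder missile on f-18", "sidewinder missile test"]
pairs = count_successive_pairs(captions)
```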

It is important to use indirect counts rather than direct counts, that is, counts obtained by upward inheritance in the type hierarchy. For example, "vehicle" occurs only 411 times in the corpus; but "aircraft" occurs 5901 times (and "F-18" 14 times, etc.), and we should include the occurrences of those words in the count on "vehicle". We also apportion the observed counts to each of the possible word-sense combinations, using heuristic weights to guess how often a word sense occurs. The heuristic weights are derived by taking a weighted sum of the counts on related words; so for instance, sidewinder sense 1 of a missile is rated much higher than sidewinder sense 2 of a snake, since there are more occurrences of missiles and armaments in this database than snakes and fauna.
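The upward inheritance of counts can be sketched as follows, using the three counts quoted above. The sketch assumes a type hierarchy in which each word has at most one parent and there are no cycles; the function name `indirect_counts` and the dictionary layout are illustrative assumptions.

```python
def indirect_counts(direct_counts, parent_of):
    """Add each word's direct count to every ancestor in the type hierarchy."""
    totals = dict(direct_counts)
    for word, count in direct_counts.items():
        ancestor = parent_of.get(word)
        while ancestor is not None:
            totals[ancestor] = totals.get(ancestor, 0) + count
            ancestor = parent_of.get(ancestor)
    return totals

# Direct counts from the text; the hierarchy fragment is an assumption.
direct = {"vehicle": 411, "aircraft": 5901, "f-18": 14}
parent_of = {"f-18": "aircraft", "aircraft": "vehicle"}
totals = indirect_counts(direct, parent_of)
```

With only these three words, the indirect count for "vehicle" rises from 411 to 6326, which shows how much evidence direct counts alone would discard.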

Even using indirect counts instead of direct counts, correlation data will be sparse and unreliable for infrequent pairs of words. Thus it is important to use "statistical inheritance" (Rowe, 1985) to obtain adequate data from correlations on superconcepts of rare concepts. For instance, statistics on the correlation between "F-18" and "Sidewinder" can be obtained from statistics on the correlation of "aircraft" and "missile", scaling that latter number down by the product of the sampling ratios. The reliability of such estimates has been addressed in sampling theory (Cochran, 1977).
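The scaling step can be written out as a one-line estimate. This sketch assumes the sampling ratio of a concept is its count divided by the count of its superconcept, and that the two concepts are distributed independently within their superconcepts; all numbers and the function name `inherited_pair_estimate` are hypothetical.

```python
def inherited_pair_estimate(super_pair_count, sub_a, super_a, sub_b, super_b):
    """Estimate a pair count for two rare concepts from the pair count of
    their superconcepts, scaled by the product of the sampling ratios."""
    ratio_a = sub_a / super_a
    ratio_b = sub_b / super_b
    return super_pair_count * ratio_a * ratio_b

# Hypothetical counts: 800 aircraft-missile pairs, 600 of 6000 aircraft
# mentions are F-18s, and 500 of 2000 missile mentions are Sidewinders.
estimate = inherited_pair_estimate(800, 600, 6000, 500, 2000)
```

In this made-up case the estimate is about 20 co-occurrences of "F-18" and "Sidewinder", a figure that could be used when the directly observed count is too small to trust.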

5. References from a caption to a picture

Because captions reference a picture object, they have additional semantics concerning what they denote or depict (Rowe, 1994). Based on our study of a variety of captioned pictures for a variety of domains, the main caption discourse postulate is that the linguistic focus of the caption is depicted in full in the picture, if it is possible to depict it. So for "sled on track" we should expect to see all of a sled, but not necessarily all of a track. Linguistic focus is generally the subject of the sentence. Focus can be extended with conjunctions and "with" phrases, as in "sycamores and live oaks with ground squirrels". Gerunds can also be the focus of a sentence, as in "exterior building painting", in which case the depiction is of the tools and props that are associated with performance of the verb. Plurals and numerically quantified subjects imply multiple depicted objects, as in "three hangars in view". If a caption contains multiple sentences, then usually the focus of each sentence has an independent depictability guarantee. NAWC-WD captions follow this last convention, although National Geographic often likes to use sentences after the first to refer to contextual information not directly visible in the picture.

Here below is the picture with caption "photovoltaic cell panels for generating power to ultimately operate a radar. left to right: nasa employee and richard fulmer with the batteries and power inverter." Panels, an employee, Richard Fulmer, batteries, and a power inverter are visible, but generating, power, an operate action, and radar are not.

The linguistic focus is important because usually its shape or shapes must be easily distinguishable in the picture, so as to reduce the amount of visual processing necessary to find it. Otherwise, the picture is not considered a "good" picture for that caption.

However, some linguistic foci are not a-priori depictable, like "test", "program", "view", and "welcome" (of visiting dignitaries). For example, consider the picture below with caption "tp 1356. a-7b/e srt-6 escape system test. synchro firing at 1075'' n x 30'' w. dummy just breaking canopy." The test number tp 1356, the escape system, the test, and the firing are not depictable, but the dummy is. Such nondepictables can trigger frame or script invocations, and indirectly their associated objects. Also, when such nondepictables are in linguistic focus on the surface, often prepositional phrase constructions shift the focus to the object of a preposition, as in "side view of aircraft". This often occurs when the nondepictable represents meta-information, information about the picture as an artifact itself.

Prepositional phrases describing physical relationships in a picture are also common in captions. The discourse postulate for the objects of such prepositions is that usually they are depicted at least in part, like "track" in "sled on track" for which some but not necessarily all of the track is visible. The reason is that physical relationships involving the focus of the caption must be made clear for a "good picture", and the relationships cannot be clear unless some of the related object is visible.

Since caption sentences are generally noun phrases without true verbs, participles and gerunds are the predominant manifestation of actions. Actions are generally hard to depict in a still picture, though there are exceptions like "rocket firing" where the firing can be identified with areas of smoke and flames. Thus the caption discourse postulate is that the direct object of the verbal is usually depicted to make the verb clear. Examples are "aircraft" in "crew loading aircraft", "circuit boards" in "personnel soldering circuit boards", and "painting" in "building painting progress". Like physical prepositions, verbs and verbals and their objects are only guaranteed to be partially visible in a picture because they are secondary to sentence focus. For example, consider the picture below whose caption is "the soldering assembly area in michelson lab. richard maxwell soldering resistor on g-r simulator circuit board." Soldering is depicted in the orientation of the hands and the soldering iron, and the resistor is depicted (though it is small and hard to see in this reproduction).

Captions try to make it easier for a person to achieve visual understanding of a picture. To this end, adjectives and adverbs are usually used to restrict the subtypes of the objects depicted, to permit faster recognition. Examples are "seat" in "seat ejection", "computer" in "computer peripherals", "low" in "flying low", and "g-r simulator" in "g-r simulator board" of the above caption. As with nouns, not all adjectives and adverbs are depictable, as "cold" in "cold engine" and "west" in "heading west". Some unusual modifiers actually cancel depictability assumptions of their referent, such as "burning building" and "proposed entrance". All these details need to be kept in a lexicon.

6. Additional caption inferences

In an information-retrieval system using captions, depictability can be ascertained for objects mentioned in a query but not in a caption. Some key additional inference rules are expressed in predicate calculus in (Rowe, 1994). These include the notion that if you can see X in the picture, then you can also see any generalization of X. On the other hand, you may or may not be able to see a specialization of X, as when a caption says "aircraft" and you want to know if it is a military aircraft. Such situations require a new kind of answer to a query, "possible, but look and see". As for part-whole relationships, depiction of a part implies partial depictability of its whole, so a caption that focuses on an airplane wing must partially depict the airplane. On the other hand, depictability of a whole does not necessarily imply depictability of a part, so when a picture depicts an airplane, you cannot necessarily see its cargo or fuel. What can be inferred visible are exterior features, and not all of them, but only those that can be seen from any orientation, like the wing or painting of an aircraft. Such information needs to be indicated in the type hierarchy associated with the lexicon.
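The inference rules in this paragraph can be sketched as a small lookup over a type hierarchy and a part-whole table. The tables, the four answer strings, and the function names are hypothetical; the predicate-calculus rules themselves are in (Rowe, 1994) and are not reproduced here.

```python
# Hypothetical fragments of a type hierarchy and a part-whole table.
IS_A = {"f-18": "military aircraft", "military aircraft": "aircraft", "aircraft": "vehicle"}
PART_OF = {"wing": "aircraft", "fuel": "aircraft"}
ALWAYS_VISIBLE_PARTS = {"wing"}  # exterior parts seen from any orientation

def ancestors(term):
    """All generalizations of a term, nearest first."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found

def depiction_status(query, depicted):
    """Answer whether `query` is visible, given that `depicted` is in focus."""
    if query == depicted or query in ancestors(depicted):
        return "depicted"  # generalizations of a depicted object are depicted
    if depicted in ancestors(query):
        return "possible"  # a specialization: "possible, but look and see"
    whole = PART_OF.get(depicted)
    if whole is not None and (query == whole or query in ancestors(whole)):
        return "partial"  # a depicted part implies its whole is partly shown
    whole = PART_OF.get(query)
    if whole is not None and (whole == depicted or whole in ancestors(depicted)):
        return "depicted" if query in ALWAYS_VISIBLE_PARTS else "unknown"
    return "unknown"

answers = [depiction_status("aircraft", "f-18"), depiction_status("fuel", "aircraft")]
```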

Negative inferences can often be made concerning the presence of important objects. At NAWC-WD, these important objects are the equipment being tested, so if a sled track is described in a caption and no ejection seat is mentioned on the track, it can be assumed that no ejection seat is present because sled tracks are intended to test ejection seats. In the last picture above, we can infer from the caption alone that no missiles or airplanes are shown. Negative inferences are valuable in information retrieval because they allow some pictures to be eliminated from consideration with certainty even when caption information is incomplete.

Photographic libraries like that of NAWC-WD often also include "supercaptions", captions describing groups of pictures. 36% of the NAWC-WD pictures have them, and the default discourse postulate is that their information is appended to the information for each caption. But a number of the NAWC-WD supercaptions have a sentence with differential semantics, enumerations of differences between the subcaptions. Then correspondences must be set up between the items of differential information and the subcaptions, and there can be ambiguity. For instance, in "pre and post test views", we are not sure which pictures are which; and in "overall, closeup of front, accelerometers in turn position, and tail view in room 123" we can only be sure of what the first and last picture depict if there are five pictures in a sequence. Thus, supercaption differential information can often only be assigned as possibly-true to subcaptions.

Another important class of inferences concerns picture types. In studying a wide range of photographs from different sources, we observed certain common categories: (1) portraits of people; (2) illustration of single non-human subjects (like equipment at NAWC-WD, animals in National Geographic, or the first and third pictures in this paper); (3) terrain overviews (like aerial photos at NAWC-WD, landscapes in NG, or the picture above); (4) process documentation (like documentation of tests at NAWC-WD, cultural-ritual illustrations in NG, or the second and fourth pictures in this paper); and (5) artificially staged scenes to illustrate ideas. These categories can be further subdivided in domain-dependent ways; for instance, category (4) at NAWC-WD has clearly distinguished subcategories for sled tests, firing tests, fire-safety tests, assembly operations, and site visits. Of course, some photographs are multi-purpose, and others have ambiguous captions, so that they could belong to more than one category. Queries can also imply categories. Thus a good heuristic for quickly ruling out obviously poor matches to a query is that the query categories must intersect the caption categories.
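The category-intersection heuristic amounts to a cheap set test applied before any detailed matching. In this sketch the category numbers follow the list above, while the caption identifiers, the data layout, and the function names are assumptions.

```python
def may_match(query_categories, caption_categories):
    """True if the query and caption share at least one picture category."""
    return not set(query_categories).isdisjoint(caption_categories)

def prefilter(query_categories, captions):
    """Keep only the caption ids whose categories intersect the query's."""
    return [cid for cid, cats in captions.items() if may_match(query_categories, cats)]

# Hypothetical caption ids mapped to category numbers (1) through (5).
captions = {"c1": {2}, "c2": {3}, "c3": {2, 4}}
survivors = prefilter({4}, captions)
```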

The additional semantics provided by the inferences so far described can be used to improve the precision and recall of an information-retrieval system for pictures based on captions, so it has more than theoretical interest. Using the above ideas, (Rowe, 1994) reports 60% higher retrieval of appropriate matches to an English query, for queries generated by naive users, while at the same time retrieving 37% fewer data items. Thus knowledge-based retrieval improved both recall and precision.

7. References from a picture to a caption

Just as captions have linguistic foci, pictures that depict have visual foci, something not true of pictures in general. That is, if a picture is to be considered a "good" depiction of something, and worth storing in a multimedia library, the object(s) depicted can usually be inferred from the picture alone. However, photography is a less precise enterprise than writing captions because photographs sometimes must be taken in a hurry, and the best angle to the subject or best distance from the photographic subject is not always possible, and it is also much harder to "edit" the results. So visual focus can only be established by a set of factors that correlate with it.

We have identified six major factors that can be applied to the regions identified in a picture to rate how likely a region or set of contiguous regions is to be a visual focus. First, a visual focus tends to be a big region or set of regions (with exceptions for photographs illustrating the context of some subject). Second, a visual focus tends to be surrounded by a strong edge, or clear discontinuity in brightness, color, or texture. Third, a visual focus tends to be either a uniform color or color mix, although its brightness may vary considerably. Fourth, a visual focus tends not to touch the boundary of the picture, though large objects can touch a little (with one major exception: people and some animals are generally considered depicted if their faces are depicted). Fifth, a visual focus has its center of mass close to the center of the photograph. Sixth, there are few other regions or region clusters having the same properties as the visual focus (with exceptions for some natural pictures like those of flowers in a field).

For the example picture below, we must first identify the collection of regions of varying shapes and colors as an aircraft. There is an edge around it, but there are also strong horizontal edges on the ground. The aircraft can be inferred to be a visual focus, however, because it is big, near the center of the picture, mostly not touching the boundaries, and the only region having its mixture of colors. The caption on this picture is "lgb skipper bomb on a-7c bu# 156739 aircraft (cl on tail). side view of aircraft on runway." Unfortunately, the bomb is too small compared to the aircraft to permit easy recognition of it, the other linguistic focus.

So early visual processing should be adjusted, in thresholds and in the techniques used, to find such a region or regions, using parameters for textural discrimination between regions if necessary; (Seem and Rowe, 1994) describes the techniques we are exploring for this in one domain. The tendency of these six factors to correlate with visual focus naturally maps to a neural net with the factors as inputs. The neural net should be trainable, since there are no human experts to consult with on the proper weightings of the factors. The weights on the factors also need adjusting to the domain and picture type within the domain because they can obviously vary significantly. For example, for most NAWC-WD pictures, the fourth and fifth factors are very important, and the first factor is quite unimportant because there are many occasions when the context in which a small object is embedded is more important than the object. But process documentation pictures (type (4) of the last section) and some wildlife pictures like the one above are often taken in a hurry at NAWC-WD, and for them the first, fourth, and fifth factors must all be weighted lightly.
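As a minimal sketch of how the six factors could feed a trainable net, the code below uses a single logistic unit, the simplest possible network. The choice of a single unit, the feature names, the scaling of each factor to the range 0 to 1, and the learning rate are all assumptions; the paper does not specify the network architecture.

```python
import math

# Hypothetical names for the six factors, each assumed scaled to [0, 1].
FACTORS = ("size", "edge_strength", "color_uniformity",
           "off_boundary", "centrality", "uniqueness")

def focus_score(features, weights, bias=0.0):
    """Logistic combination of the six factors: likelihood of visual focus."""
    total = bias + sum(weights[name] * features[name] for name in FACTORS)
    return 1.0 / (1.0 + math.exp(-total))

def train_step(features, label, weights, bias, rate=0.1):
    """One gradient-descent step on the log loss for one labeled region."""
    error = focus_score(features, weights, bias) - label
    new_weights = {name: weights[name] - rate * error * features[name]
                   for name in FACTORS}
    return new_weights, bias - rate * error

weights = {name: 0.0 for name in FACTORS}
region = {name: 1.0 for name in FACTORS}  # a region strong on every factor
before = focus_score(region, weights)
weights, bias = train_step(region, 1, weights, 0.0)
after = focus_score(region, weights, bias)
```

Training a separate weight set for each domain and picture type would be one way to accommodate the variation in factor importance noted above.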

8. Correlations between picture shapes

Before matching picture to caption, some unreasonable interpretations of the objects in the picture can be ruled out by using statistics on pictures already certified as being correctly analyzed, analogously to what we do with binary word-correlation statistics. For instance, we can count how many times a missile appears below an aircraft wing in all the occurrences of that missile in pictures. Then if we see a missile in a new picture below some object we cannot identify, its likelihood of being an aircraft wing can be estimated as high, no matter how blurry or unclear its shape.
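The counting described here can be sketched with relation triples taken from already-certified pictures. The triple format, the sample observations, and the function name `relation_likelihoods` are hypothetical, and this part of the approach is described in the paper as not yet implemented.

```python
from collections import Counter

def relation_likelihoods(observations, known_label, relation):
    """Given (label_a, relation, label_b) triples from certified pictures,
    estimate how likely each label_b is when known_label stands in the
    given relation to an unidentified object."""
    counts = Counter(b for a, r, b in observations
                     if a == known_label and r == relation)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

# Hypothetical certified observations.
observations = ([("missile", "below", "aircraft wing")] * 8
                + [("missile", "below", "launch rail")] * 2
                + [("missile", "on", "trailer")])
likely = relation_likelihoods(observations, "missile", "below")
```

With these made-up observations, an unidentified object above a missile would be rated four times as likely to be an aircraft wing as a launch rail.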

Consider this example, "photographic equipment. extended range tracking mount on trailer." We could use statistics that say trailers usually have four wheels and that each wheel touches the ground, and that trailers often have a hitch attached at one end. There are several dark regions above the trailer that appear to be "on" it, so perhaps these are the "mount". The background regions can be excluded from consideration as foci because of their weak edges and numerosity.

9. Putting it all together

A semantic network can be built of the relationships between shapes in the picture, and then matched to the semantic network of the caption. In general, the caption graph will be a subgraph of the picture graph because captions are intended as summaries. So we have a subgraph isomorphism problem in trying to match the two. This is a different subgraph isomorphism problem from the linguistic query-caption match that comprises the last stage of linguistic processing; here we can use additional clues, like the known relative sizes of objects in the real world, to rule out inconsistent sets of matches. This particular instance of subgraph isomorphism is well suited for relaxation techniques, since many properties can be inferred for the regions in the picture if desired in order to rule out possibilities for match items in the caption. After relaxation, backtracking can be used to generate possible matches. Then the two semantic networks can be conjoined into one single network that can be used to answer queries better than either network alone.
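The two-stage matching described here, relaxation followed by backtracking, can be sketched on a toy version of the "mount on trailer" example. The relaxation step is implemented as discrete constraint filtering, which is one common reading of the term and an assumption here; the region names, candidate sets, and the function name `match_graphs` are hypothetical.

```python
def match_graphs(caption_nodes, caption_edges, candidates, picture_edges):
    """Find all one-to-one assignments of caption nodes to picture regions
    such that every caption edge (u, relation, v) is present in the picture.
    candidates: caption node -> set of regions it might denote."""
    # Relaxation: repeatedly drop regions that cannot support some edge.
    domains = {node: set(candidates[node]) for node in caption_nodes}
    changed = True
    while changed:
        changed = False
        for u, rel, v in caption_edges:
            keep_u = {a for a in domains[u]
                      if any((a, rel, b) in picture_edges for b in domains[v] if b != a)}
            keep_v = {b for b in domains[v]
                      if any((a, rel, b) in picture_edges for a in domains[u] if a != b)}
            if keep_u != domains[u] or keep_v != domains[v]:
                domains[u], domains[v] = keep_u, keep_v
                changed = True

    # Backtracking over the reduced domains.
    results = []
    def extend(index, mapping):
        if index == len(caption_nodes):
            results.append(dict(mapping))
            return
        node = caption_nodes[index]
        for region in sorted(domains[node]):
            if region in mapping.values():
                continue  # each region may be used only once
            mapping[node] = region
            if all((mapping[u], rel, mapping[v]) in picture_edges
                   for u, rel, v in caption_edges if u in mapping and v in mapping):
                extend(index + 1, mapping)
            del mapping[node]
    extend(0, {})
    return results

matches = match_graphs(
    ["mount", "trailer"],
    {("mount", "on", "trailer")},
    {"mount": {"r1", "r2", "r3"}, "trailer": {"r3", "r4"}},
    {("r1", "on", "r3"), ("r2", "on", "r3"), ("r5", "on", "r4")},
)
```

In this toy case relaxation alone narrows the trailer to region r3, and backtracking then reports the two remaining candidates for the mount, which is the kind of residual ambiguity that real-world size clues could resolve.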


Belkin, N. J. and Croft, W. B. Information filtering and information retrieval: two sides of the same coin? Communications of the ACM, 35, 12 (December 1992), 29-38.

Charniak, E. Statistical Language Learning. Cambridge, MA: MIT Press, 1993.

Cochran, W. G. Sampling Techniques, third edition. New York: Wiley, 1977.

Jones, M. and Eisner, J. A probabilistic parser applied to software testing documents. Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA, July 1992, 323-328.

Krovetz, R. and Croft, W. B. Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10, 2 (April 1992), 115-141.

Miller, G., Beckwith, R., Fellbaum, C., Gross, D., and Miller, K. Five papers on Wordnet. International Journal of Lexicography, 3, 4 (Winter 1990).

Rau, L. Knowledge organization and access in a conceptual information system. Information Processing and Management, 23, 4 (1988), 269-284.

Rowe, N. Antisampling for estimation: an overview. IEEE Transactions on Software Engineering, SE-11, 10 (October 1985), 1081-1091.

Rowe, N. Inferring depictions in natural-language captions for efficient access to picture data. Information Processing and Management, 30, 3 (1994), 379-388.

Rowe, N. and Guglielmo, E. Exploiting captions in retrieval of multimedia data. Information Processing and Management, 29, 4 (1993), 453-461.

Rowe, N. and Laitinen, K. Semiautomatic deabbreviation of technical text. Technical report, Computer Science, Naval Postgraduate School, April 1994.

Seem, D. and Rowe, N. Shape correlation of low-frequency underwater sounds. Journal of the Acoustical Society of America, 90, 5 (April 1994).

Sembok, T. and van Rijsbergen, C. SILOL: A simple logical-linguistic document retrieval system. Information Processing and Management, 26, 1 (1990), 111-134.

Smeaton, A. F. Progress in the application of natural language processing to information retrieval tasks. The Computer Journal, 35, 3 (1992), 268-278.

Srihari, R. K. Use of collateral text in understanding photos. Artificial Intelligence Review, to appear 1994.


This work was sponsored by DARPA as part of the I3 Project under AO 8939.
