Automatic Classification of Objects in Captioned Depictive Photographs for Retrieval

 

Neil C. Rowe and Brian Frew

U.S. Naval Postgraduate School
Code CS/Rp, Department of Computer Science
Monterey, California 93943
USA
(408) 656-2462
rowe@cs.nps.navy.mil

Abstract

We investigate the robust classification of objects within photographs in a large and varied picture library of natural photographs.  We assume the photographs have captions that describe and locate, only imprecisely, some of the objects present in the picture, as is common in libraries.  Our approach neither matches to shape templates nor attempts full picture understanding, neither of which works well for natural photographs where appearance varies considerably with lighting and perspective.  Instead, we strike a robust compromise by statistically characterizing photograph regions with 17 key domain-independent parameters covering shape, color, texture, and contrast.  We explored two ways to use the parameters to classify picture regions, case-based reasoning and a neural network, both of which require training.  We found the neural network outperformed case-based reasoning, especially when we included caption information and a separate neuron inferring the likelihood that a region was the "visual focus" of the picture.  With these additions, 25-category shape classification succeeded 48.1% of the time on a set of pictures randomly selected from a large picture library currently in use.  Our work represents good progress on the difficult problem of retrieval by content from large real-world picture libraries.

 

1.  Introduction

Increasing attention is being paid to retrieval from picture libraries.  Picture captions are especially helpful in finding the right pictures for a user need [Srihari 1994].  Our previous research in the MARIE project [Rowe 1994; Rowe 1996] proposed and confirmed by testing a theory of how captions refer to pictures.  [Rowe 1994] in particular identified linguistic clues that indicated whether a noun or verb in a caption was fully depicted, partially depicted, or definitely undepicted, and also inferred presence of supertypes from the presence of types, partial presence of wholes from the presence of parts, and broad picture categories.

However, picture captions are rarely sufficient descriptions of a picture for all purposes because of space limitations and the unanticipated uses to which a picture can be put.  For instance, for the MARIE-2 testbed of the picture library of the U.S. Navy air test facility NAWC-WD in China Lake, California, many pictures show airplanes, but the captions never describe the airplane-image size, something of great importance to someone wanting a good picture of an airplane.  Other features important to users but rarely mentioned in the captions are orientations of objects, colors, the time of day, the illumination, and the identities of objects in the background; identities of objects are also sparingly provided in captions for complex scenes.  If these features are to be queried in picture retrieval, they must be computed from the images, preferably during library setup.  This requires analyzing what "regions" of the picture (clusters of pixels of similar characteristics) represent, using the region's properties and general principles (for instance, that man-made objects have many right angles).  This is difficult, but there are clues from the caption and from the intent of the photographer to clearly present a photographic subject.

Previous work that does attempt content analysis and matching for a wide variety of images generally makes important simplifications of the task, assumptions impossible for large real-world picture libraries.  [Barber et al 1994] analyzes a broad class of images and matches regions of the picture to ideals summarized by feature vectors, but requires user definition of the outlines of the shapes to be recognized.  Petkovic et al [this volume] and [Flickner et al 1995] simplify matching in a picture library to colors and simple shape properties; Smith and Chang [this volume] simplifies to colors and global spatial layout; [Ogle and Stonebraker 1995] simplifies to colors; and [Kato 1992] simplifies to a few image properties.  There is some good work on recognition of objects in artificial graphical images by Chuah et al [this volume], engineering drawings and floor plans by Rabitti and Savino [1992], and aerial photographs by Choo et al [1990].

Robust recognition of objects in scenes observed by a mobile robot is addressed in [Draper et al 1989] and [Strat 1992]; both efforts used domain-dependent context-based inferences to infer identities of objects, using frame-like "schemas" in the first and frame-like "contexts" in the second.  Some image-understanding work requires detailed predefined models of what can be seen, e.g. [Lamdan and Wolfson 1988], but in most real-world picture libraries (including our testbed), far too many different objects appear to build models for all of them.  There is a variety of work on "shape matching" where just the outline of a picture region is matched to a template [Scassellati et al 1994; Rickman and Stonham 1993; Mehrotra and Gary 1995], but these methods do not work well for natural photographs where objects can appear in many orientations and lighting conditions; region color and texture are more helpful then.  Similar problems of sensitivity to photographic conditions apply to two-dimensional strings for compact representation of spatial relationships in images in visual databases [Chang et al 1987].

So our work reported here differs from previous work in image-content analysis in that: (1) it is concerned with "natural" (real-world) images from photographs of three-dimensional objects, so pixel-by-pixel matching to templates will not work; (2) it is concerned with recognizing parts of a picture, not relating the parts; (3) it requires only minimal specialized domain knowledge or visual-object models, so it can handle a broad class of pictures; (4) it does not require user input beyond a natural-language query; (5) it is fully implemented.  We also focus on just image-content analysis, and our methods would be just one module in a multimedia-retrieval system like those described in this volume by Mani et al, Merialdo and Dubois (as an "agent"), and Griffioen, Yavatkar, and Adams (as "extraction of embedded semantic information").

 

2. Finding regions in the pictures

Our experiments used the MARIE-2 multimedia retrieval system we are developing [Rowe 1996].  MARIE-2 is written in Quintus Prolog, like the programs reported in this paper.  Our experiments used a random sample of 127 pictures from the China Lake picture library of about 100,000 pictures.  The library includes views of facilities, views of equipment, views of how equipment should be mounted on aircraft (since this is a naval air test facility), views of tests, views of routine base activities, public relations photos, historical photos, and views of natural features of the area.

The originals were high-quality 8.5x11 inch color prints.  The 127 selected for testing were digitized and reduced to approximately 100 by 100 colored pixels each, where each pixel was represented by 8-bit red, green, and blue values.  We chose to work with this low level of resolution because it saves a great deal of space, enabling magnetic-disk storage of full libraries of such reduced images.  This image resolution is sufficient for most browsing, and many World Wide Web sites with image collections use something similar on overview pages; but such a resolution prevents detailed image analysis such as classification of aircraft.  Dithering was necessary to give best visual display for a fixed number of bits, although it complicates the subsequent image processing.

Fig. 1 shows an example picture we analyzed.  It is shown here in black and white, but is stored and analyzed in our system in color.  It has the caption: "Photovoltaic cell panels for generating power to ultimately operate a radar.  Left to right: NASA employee and Richard Fulmer with the batteries and power inverter."

Figure 1: Example input picture.

Our challenge was to process a wide variety of such pictures in a robust way.  We first used mostly-standard methods of pixel-level image processing from [Ballard and Brown 1982] to find regions of homogeneous characteristics in the picture, using implementations adapted from the program in [Seem and Rowe 1994].  These methods were, in order: (1) image averaging; (2) color gradient thresholding; (3) clumping of the results into pixel regions; (4) computation of basic region properties; (5) merging of single-cell regions into adjacent regions; and (6) iterative merging of the remaining adjacent regions with similar characteristics.  Fig. 2 shows results of this processing on the example of Fig. 1, with the regions numbered for future reference.  Because of its visual variety, this picture was of above-average difficulty for our software.

Figure 2: Regions found by our program for the picture in Fig. 1.

Image averaging of each square of four adjacent cells was done first to compensate for dithering.  Then gradient thresholding was done, with a tight color gradient threshold so as to separate many pixel boundaries and avoid splitting regions later.  Clumping was then done in a single pass, and the tight gradient meant about 500-2000 initial regions were created for the 10,000-pixel pictures.
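The averaging step can be illustrated with a minimal Python sketch (the programs reported here were written in Quintus Prolog, so this is not the original code).  The text does not say whether the four-cell squares overlap; the sketch assumes a sliding 2 by 2 window clamped at the picture edge, so the picture keeps its size.  The function name `average_2x2` and the list-of-tuples image layout are hypothetical.

```python
def average_2x2(image):
    """Replace each pixel by the mean of the 2x2 square whose top-left
    corner it is.  `image` is a list of rows of (r, g, b) tuples.
    Assumption: the window slides one pixel at a time and is clamped at
    the right and bottom edges, so the output has the same size."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        y2 = min(y + 1, height - 1)      # clamp at the bottom edge
        row = []
        for x in range(width):
            x2 = min(x + 1, width - 1)   # clamp at the right edge
            block = [image[y][x], image[y][x2], image[y2][x], image[y2][x2]]
            row.append(tuple(sum(p[c] for p in block) / 4.0 for c in range(3)))
        result.append(row)
    return result


# A dithered black-and-white checkerboard is smoothed toward mid-gray.
checker = [[(0, 0, 0), (255, 255, 255)],
           [(255, 255, 255), (0, 0, 0)]]
smoothed = average_2x2(checker)
```

Averaging of this kind removes the pixel-to-pixel alternation that dithering introduces, which would otherwise exceed a tight gradient threshold almost everywhere.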

Next we computed 26 statistical properties of each region of the image, properties providing a good summary of the basic visual properties of the image regions.  We chose these properties from study of the test library, observing that color and texture are often more important than shape in identifying many regions (like sky and terrain) in natural photographs.  The region properties we computed are listed in Fig. 3.  They include geometric properties, brightness and color properties, and shape properties for the regions.  Dimensions (statistics B-H and U) are measured in numbers of pixels; brightnesses (statistics K-Q and T) are measured with a 0-255 gray scale for each color; skews (statistics I-J) are proportional to the size of the bounding box; diagonality (statistic V) is a fraction of the boundary length; curviness (statistic W) is in radians (and minus twice pi for closed curves); correlations (statistics R-S) run -1 to 1; and counts (statistics X and Y) are unadjusted.  We did not compute statistics on linear features as [Draper et al 1989] did, because such features rarely appeared important in pictures intended for depiction.

 

| Code | Explanation | Code | Explanation |
|---|---|---|---|
| A | region number | B | area in pixels |
| C | circumference in pixels | D | number of picture-boundary pixels |
| E | minimum x-coordinate | F | minimum y-coordinate |
| G | maximum x-coordinate | H | maximum y-coordinate |
| I | x-skew of center of mass in box | J | y-skew of center of mass in box |
| K | average red brightness | L | average green brightness |
| M | average blue brightness | N | standard deviation of red brightness |
| O | standard deviation of green brightness | P | standard deviation of blue brightness |
| Q | average brightness variation between adjacent cells | R | correlation of brightness with x |
| S | correlation of brightness with y | T | average strength of region boundary |
| U | smoothed-boundary length | V | smoothed-boundary diagonality |
| W | smoothed-boundary curviness | X | smoothed-boundary number of inflection points |
| Y | number of right angles in boundary | Z | whether boundary is open or closed |

Figure 3: The 26 basic region statistics and their codes.

Next we merged single-cell regions into their neighbor regions in a best-first way.  Merges were ranked by the weighted sum of the deviations in red, green, and blue values (statistics K, L, and M above) between the cell and the average of the merged-into region; the weights were the reciprocals of the corresponding standard deviations (statistics N, O, and P), a common method for distance metrics.  Region properties were updated with each merge.
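The ranking of single-cell merges can be sketched as follows in Python.  This is an illustration under stated assumptions, not the original Prolog: the function names are hypothetical, the deviations are taken as absolute differences, and the `min_std` floor (to avoid dividing by a zero standard deviation) is an addition not described in the text.  The example values for regions 1 and 7 are taken from Fig. 4.

```python
def color_deviation(cell_rgb, region_mean_rgb, region_std_rgb, min_std=1.0):
    """Weighted sum of red, green, and blue deviations between a cell and a
    region's average color; each weight is the reciprocal of the region's
    standard deviation for that color.  `min_std` is an assumed floor."""
    total = 0.0
    for value, mean, std in zip(cell_rgb, region_mean_rgb, region_std_rgb):
        total += abs(value - mean) / max(std, min_std)
    return total


def best_merge(cell_rgb, neighbors):
    """Pick the neighboring region with the smallest color deviation.
    `neighbors` maps region number -> (mean_rgb, std_rgb)."""
    return min(neighbors,
               key=lambda rid: color_deviation(cell_rgb, *neighbors[rid]))


# Statistics K-M (means) and N-P (standard deviations) of regions 1 and 7.
neighbors = {
    1: ((580, 600, 808), (45, 55, 61)),
    7: ((252, 107, 200), (70, 92, 111)),
}
chosen = best_merge((560, 590, 790), neighbors)  # a bright bluish cell
```

A best-first scheme would repeat this choice over all single-cell regions, performing the lowest-deviation merge first and then updating the merged region's statistics.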

Finally, we did best-first merges of the remaining regions.  After some experimentation, best performance was obtained when merges were ranked on the weighted sum of three factors: the color deviation between the regions (calculated and weighted as with single-cell merges), the deviation in the neighbor-brightness variation (statistic Q) over the regions (weighted similarly), and the weighted decrease in density (computed from statistics B, E, F, G, and H) of the bounding boxes (rectangles).  The weighting on the last was twenty times the square root of the area of the larger region, which we found prevented accidental merges of small regions until more pixels could provide more accurate properties of them.  The first factor modeled color similarity; the second, pixel-level textural similarity (to distinguish regions of similar color but markedly different uniformities); and the third, resulting-region compactness (to discriminate against creation of thin and curvy regions).  Again after each merge, region properties were updated, using in part the properties of the merged regions.
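The compactness factor can be illustrated with a short Python sketch of bounding-box density.  The exact formula for the "decrease in density" is not given in the text, so the sketch only computes the density of a region's box and of the box that would enclose two merged regions, plus the stated weighting of twenty times the square root of the larger area.  Function names are hypothetical, and counting box sides inclusively (max - min + 1) is an assumption.

```python
import math


def box_density(area, min_x, min_y, max_x, max_y):
    """Fraction of the bounding rectangle covered by the region
    (from statistics B, E, F, G, and H).  Sides are counted inclusively."""
    width = max_x - min_x + 1
    height = max_y - min_y + 1
    return area / float(width * height)


def merged_box_density(a, b):
    """Density of the bounding box enclosing regions a and b if merged.
    Each region is a tuple (area, min_x, min_y, max_x, max_y)."""
    return box_density(a[0] + b[0],
                       min(a[1], b[1]), min(a[2], b[2]),
                       max(a[3], b[3]), max(a[4], b[4]))


def density_weight(area_a, area_b):
    """Weighting on the density factor: 20 * sqrt(area of larger region)."""
    return 20.0 * math.sqrt(max(area_a, area_b))


# Regions 1 and 7 of Fig. 4, used only to show the arithmetic.
region1 = (3012, 1, 1, 110, 37)
region7 = (626, 16, 45, 57, 69)
before = (box_density(*region1), box_density(*region7))
after = merged_box_density(region1, region7)
```

Here the merged box is sparser than either original box, so a merge of this pair would be penalized by the compactness factor.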

An overall-rank threshold defines the end of merging.  However, if more than 30 regions remain, we judge that too many initial regions were created, so the color-gradient and merge thresholds are increased and picture analysis is redone.  Similarly, if no region touches the picture border, we judge that too few initial regions were created, decrease the thresholds, and redo the analysis.

Fig. 4 shows the 26 basic statistics on the final set of regions computed for the picture in Fig. 1, the regions shown in Fig. 2.  To understand the numbers, it is helpful to compare them for region 1 (the sky, represented by the second column from the left) and region 7 (the batteries and power inverter, the black area in the lower left, represented by the eighth column).  Region 1 has 3012 pixels (statistic B) to 626 for region 7.  Region 1 has a 110 by 37 bounding rectangle (statistics E-H) while region 7 has a smaller 42 by 25 one, consistent with the size difference.  One significant difference is that 162 pixels of region 1 are on the edge of the picture (statistic D) while none of region 7's are.  Another is that region 1 brightness (statistics K-M) is considerably greater than region 7 brightness, especially in the blue as should be expected with sky.  Two interesting things about the brightness are that region 1 shows significant decreasing brightness with depth in the picture (statistic S), as is typical of sky, and region 7 has significant variation in its blue level (statistic P), apparently from the light of the sky reflecting off different colors.  Finally, the shape statistics (T-Z) are not too helpful because of the inaccuracies in the boundaries, but the greater contrast of region 7's edge (statistic T) is useful.  (Note that pixels on the edge of the picture are excluded from region 1's boundary since they do not characterize its shape.)

 

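| A | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| B | 3012 | 1099 | 689 | 574 | 214 | 675 | 626 | 346 | 5 | 75 | 437 | 5 | 74 | 191 | 118 |
| C | 162 | 231 | 136 | 128 | 70 | 92 | 125 | 45 | 5 | 52 | 88 | 5 | 45 | 79 | 30 |
| D | 162 | 21 | 13 | 0 | 0 | 47 | 0 | 48 | 0 | 0 | 29 | 0 | 0 | 19 | 25 |
| E | 1 | 57 | 1 | 48 | 85 | 1 | 16 | 90 | 98 | 50 | 53 | 99 | 83 | 22 | 16 |
| F | 1 | 16 | 28 | 33 | 42 | 43 | 45 | 47 | 51 | 54 | 54 | 57 | 63 | 65 | 68 |
| G | 110 | 110 | 48 | 78 | 99 | 37 | 57 | 110 | 98 | 62 | 89 | 99 | 101 | 60 | 41 |
| H | 37 | 64 | 48 | 61 | 67 | 74 | 69 | 74 | 55 | 70 | 74 | 61 | 72 | 74 | 74 |
| I | .04 | .58 | .55 | .12 | .69 | .79 | .32 | .87 | .77 | .03 | .31 | .79 | .64 | .23 | .53 |
| J | .61 | .01 | .01 | .23 | .48 | .52 | .53 | .65 | .41 | .68 | .78 | .58 | .82 | .86 | .92 |
| K | 580 | 434 | 445 | 540 | 229 | 655 | 252 | 618 | 408 | 455 | 327 | 370 | 424 | 585 | 398 |
| L | 600 | 295 | 399 | 521 | 36 | 666 | 107 | 597 | 278 | 343 | 170 | 255 | 311 | 520 | 312 |
| M | 808 | 358 | 559 | 677 | 80 | 834 | 200 | 774 | 391 | 416 | 246 | 357 | 408 | 605 | 408 |
| N | 45 | 96 | 48 | 51 | 31 | 45 | 70 | 28 | 34 | 69 | 63 | 37 | 95 | 59 | 125 |
| O | 55 | 111 | 65 | 76 | 44 | 61 | 92 | 42 | 33 | 79 | 85 | 50 | 131 | 75 | 173 |
| P | 61 | 120 | 79 | 115 | 51 | 66 | 111 | 46 | 18 | 81 | 101 | 68 | 180 | 76 | 177 |
| Q | 75 | 168 | 138 | 136 | 90 | 101 | 170 | 83 | 46 | 153 | 164 | 67 | 244 | 134 | 244 |
| R | -.26 | -.28 | -.06 | -.40 | .10 | -.50 | -.26 | -.07 | 0 | .15 | -.10 | 0 | -.44 | .18 | .34 |
| S | -.62 | -.09 | -.26 | -.17 | .12 | .22 | .13 | .42 | -.85 | .27 | -.09 | -.96 | .25 | .40 | .51 |
| T | 186 | 197 | 175 | 216 | 157 | 234 | 287 | 346 | 343 | 206 | 241 | 321 | 200 | 301 | 273 |
| U | 163 | 234 | 150 | 133 | 76 | 92 | 130 | 45 | 7.9 | 55 | 90 | 7.9 | 48 | 80 | 32 |
| V | .24 | .28 | .15 | .30 | .24 | .27 | .31 | .27 | .33 | .37 | .37 | .33 | .35 | .29 | .34 |
| W | 3.0 | 3.6 | 2.8 | 2.7 | 2.1 | 3.0 | 2.4 | 2.9 | -.04 | 3.5 | 4.5 | -.04 | 2.6 | 3.2 | 2.6 |
| X | 14 | 24 | 11 | 10 | 6 | 9 | 15 | 6 | 0 | 7 | 13 | 0 | 6 | 10 | 3 |
| Y | 3 | 5 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 0 |
| Z | o | o | o | c | c | o | c | o | c | c | o | c | c | o | o |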

Figure 4: Basic statistics computed for the 15 regions of Fig. 2 (see Fig. 3 for explanation of the statistics codes).  Rows represent statistics and columns represent regions.

 


3. Classifying regions

 

3.1. Case-based methods of identifying regions

One approach to region identification is to use human judgment to classify (label) some example regions and store these as cases for case-based retrieval.  Then for an unknown region, we find the case region with the closest metric distance to the unknown's statistics, and assign the case's classification to the unknown.  Our closeness metric was the square root of the weighted sum of the squares of the differences in 17 corresponding derived statistics between unknown region and case region.  Weights were initially the reciprocals of the standard deviations of the statistic values over all cases; they were adjusted manually to improve performance.  The 17 derived statistics are shown in Fig. 5.  We chose them by examining pairs of regions in the library and meaningfully combining the 26 basic region statistics until we had a set of derived properties we thought sufficient to distinguish any two regions in the pictures.  They define a feature space analogous to those in the chapters by Blum et al (for sounds) and Manmatha et al (for handwritten words), but of necessarily greater dimensionality given the greater complexity of images.
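The closeness metric and nearest-case lookup can be sketched in Python as follows.  This is a minimal illustration, not the original Prolog: the function names are hypothetical, and for brevity the example uses only three statistics (redness, greenness, and blueness) rather than all 17.  The two cases reuse the average colors of regions 1 and 7 from Fig. 4, and the unknown region's values are invented for the example.

```python
import math


def case_distance(unknown, case, weights):
    """Square root of the weighted sum of squared differences between the
    derived statistics of an unknown region and those of a stored case."""
    return math.sqrt(sum(w * (u - c) ** 2
                         for u, c, w in zip(unknown, case, weights)))


def classify_by_nearest_case(unknown, cases, weights):
    """Return the label of the stored case closest to the unknown region.
    `cases` is a list of (statistics, label) pairs."""
    best = min(cases, key=lambda case: case_distance(unknown, case[0], weights))
    return best[1]


# Two stored cases (colors of regions 1 and 7 in Fig. 4) and the final
# weights for redness, greenness, and blueness.
cases = [((580, 600, 808), "sky"), ((252, 107, 200), "equipment")]
weights = (0.005, 0.004, 0.004)
label = classify_by_nearest_case((600, 610, 790), cases, weights)
```

Because every unknown region is compared against every stored case, the cost of one classification grows with the size of the case library.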


 

| Code | Name | Definition | Case Weight |
|---|---|---|---|
| a | circularity | area / (circumference * circumference) | 1.22 |
| b | narrowness | height / width of bounding rectangle | 0.31 |
| c | marginality | 1 / (1 + (circumference / number of border cells)) | 5.06 |
| d | redness | average red brightness | 0.005 |
| e | greenness | average green brightness | 0.004 |
| f | blueness | average blue brightness | 0.004 |
| g | pixel texture | average brightness variation between adjacent cells | 0.017 |
| h | brightness trend | root mean square brightness correlation with x and y | 4.47 |
| i | symmetry | fractional skew of center of mass from box center | 3.24 |
| j | contrast | average strength of the region edge | 0.010 |
| k | diagonality | smoothed-boundary non-verticality and non-horizontality | 9.07 |
| l | curviness | smoothed-boundary curviness | 0.694 |
| m | segments | smoothed-boundary number of inflection points | 0.130 |
| n | rectangularity | smoothed-boundary number of right angles | 0.650 |
| o | size | area in pixels | 0.001 |
| p | density | density of pixels in bounding rectangle | 0.050 |
| q | height in picture | y-skew (unsigned) of center of mass within box | 3.45 |

Figure 5: The 17 derived statistics used in classifying regions, with their final weights used in case-based reasoning.
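Four of the derived statistics have definitions explicit enough to compute directly from the basic statistics of Fig. 3, as in the following Python sketch.  The function name is hypothetical, box sides are assumed to be counted inclusively, and treating marginality as 0 when a region has no border cells (the limit of the formula) is an assumption.

```python
def derived_shape_statistics(area, circumference, border_cells,
                             min_x, min_y, max_x, max_y):
    """Compute circularity (a), narrowness (b), marginality (c), and
    density (p) from basic statistics B, C, D, and E-H."""
    width = max_x - min_x + 1     # inclusive pixel count (assumption)
    height = max_y - min_y + 1
    if border_cells > 0:
        marginality = 1.0 / (1.0 + circumference / float(border_cells))
    else:
        marginality = 0.0         # assumed limit when no cell touches the border
    return {
        "circularity": area / float(circumference * circumference),
        "narrowness": height / float(width),
        "marginality": marginality,
        "density": area / float(width * height),
    }


# Region 1 (the sky) and region 7 of Fig. 4.
sky = derived_shape_statistics(3012, 162, 162, 1, 1, 110, 37)
batteries = derived_shape_statistics(626, 125, 0, 16, 45, 57, 69)
```

For region 1 the circumference and the border-cell count are equal, so its marginality is 0.5, while region 7 touches no border and gets 0.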

We tested this approach with a case library of all the recognizable regions in a random subset of 64 of our 127 test pictures (containing 935 of the 1685 picture regions).  We ran tests involving both 5 and 25 broad object classes that we developed from our experience (in both captions and pictures) with the database.  The 5 classes used were equipment, landscape, structure, “being” (living creature), and gas (clouds, smoke, etc.); the 25 classes were airplane, airplane part, bomb, bomb part, building, building part, equipment, fire, flower, helicopter, helicopter part, horse, missile, missile part, mountain, pavement, person, person body part, rock, ship part, sky, tank, terrain, wall, and water.  The intent of these classes was to provide a start of a visual taxonomy of the objects in the pictures, a start that could then be refined by caption information.  Some of the 25 are domain-specific, but they could be generalized.

To ensure a high-quality case library, we excluded from both the cases and test regions the numerous unidentifiable small regions, a few unidentifiable larger regions, and some regions created by merging errors during pixel-level visual processing; these amounted to 605 of 1685 regions.  For Fig. 2 for instance, we judged regions 3, 4, 5, 7, 10, and 11 as equipment; region 1 as sky; and region 5 as a part of a person.  Region 2 cannot be classified since the people were incorrectly merged into the fence in pixel-level processing; regions 9 and 12 cannot be classified since they are too small; and regions 6 and 8 cannot be classified because they could be either terrain or pavement, and the caption provides no clue.

With 5 classes, the percentage of correct classifications for the test regions was 57.4%; with 25 categories, 29.8%.  For Fig. 2 and 25 classes, only one region was properly identified, region 1 as sky; 3 and 5 were identified as terrain, 4 as an airplane part, 7 as a helicopter part, 10 as a person part, and 11 as a flower.  Of course, accuracy could be improved by increasing the number of cases.

Finding the best weights (the last column of Fig. 5) even for this level of performance required significant trial-and-error: 78 test runs on all 64 case pictures at 13 minutes of CPU time per run (1.39 seconds per region analysis), for a total of 17.1 hours, plus an additional ten hours spent analyzing the output.  (These times are for Quintus Prolog in the default semi-compiled mode.)  This tedious adjustment of weights is an important disadvantage of this approach.

 

3.2 Neural-network methods of identifying regions

Neural networks have been successful at shape recognition [Rickman and Stonham 1993], and have outperformed case-based reasoning for some text-retrieval tasks [Schutz, Hull, and Pedersen 1995], so they should be considered for visual-data retrieval too.  A simple neural-network approach is to have a neuron for each region classification t, taking as inputs the 17 derived region statistics for a particular region r, and providing as output the degree to which r has classification t.  Then the classification associated with the largest output can be taken as r's classification.  As usual [Wasserman 1989, chapter 3], inputs have weights to control their importance to each output, a linear sum is taken of the weighted inputs, and a nonlinear gain function (here a logarithm) is applied to the result; so our neuron number t computes log(W_ta*s_a + W_tb*s_b + ... + W_tq*s_q), where W_tx is the weight on the derived statistic x for neuron t, and s_x is the value of derived statistic x for the region r.
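The single-level network can be sketched in a few lines of Python.  This is an illustration only: the names are hypothetical, the example uses three inputs instead of 17, the weights are invented, and the small positive floor applied before taking the logarithm is an assumption added so the sketch never takes the log of a non-positive sum.

```python
import math


def neuron_output(weights, stats):
    """Logarithm of the weighted sum of the derived statistics."""
    total = sum(w * s for w, s in zip(weights, stats))
    return math.log(max(total, 1e-9))   # floor is an assumption of this sketch


def classify(neuron_weights, stats):
    """Return the classification whose neuron has the largest output.
    `neuron_weights` maps each classification to its list of weights."""
    return max(neuron_weights,
               key=lambda t: neuron_output(neuron_weights[t], stats))


# Invented weights for two classifications and three input statistics.
neuron_weights = {"sky": [0.1, 0.1, 2.0], "equipment": [1.0, 1.0, 0.1]}
winner = classify(neuron_weights, [0.2, 0.3, 0.9])
```

Since the logarithm is monotonic, the winning neuron is simply the one with the largest weighted sum; the gain function matters for the size of the error terms used in training.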

The weights were initially 1.  Adjustment of weights (feedback) was done only when the neuron for the correct classification did not have the highest output among the neurons, and was computed by adding to each weight the product of these factors:

·        the corresponding input value;

·        the reciprocal of the standard deviation of that input value over all training cases, to normalize the input value;

·        the ratio of the number of nonexamples of the neuron's class to the number of examples, to balance positive and negative feedback over the training examples;

·        for neurons of erroneous classes, the output of the correct-classification neuron minus the output of this neuron, the negative error;

·        for neurons of correct classes, the output of the largest-output incorrect-classification neuron minus the output of this neuron, the positive error;

·        a “learning speed” constant.

This means weights can range greater than 1 or less than 0, but they rarely went negative.  We ran the training examples, as a set, through the neural net several hundred times.
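One feedback step built from the listed factors might look like the following Python sketch.  It is a literal reading of the list, not the original Prolog: the names, the default learning-speed value, and the logarithm floor are assumptions, and the text does not say whether incorrect neurons that already score below the correct one are skipped (here they are not).

```python
import math


def neuron_output(weights, stats):
    """Logarithm of the weighted input sum (floor is an assumption)."""
    return math.log(max(sum(w * s for w, s in zip(weights, stats)), 1e-9))


def feedback(weights, stats, stds, examples, nonexamples, correct, rate=0.01):
    """Adjust `weights` in place for one training region.
    weights: classification -> list of weights; stds: standard deviation of
    each input over the training cases; examples / nonexamples: counts per
    classification; rate: the "learning speed" (value assumed here).
    Returns True if an adjustment was made."""
    outputs = {t: neuron_output(weights[t], stats) for t in weights}
    best_wrong = max((t for t in weights if t != correct),
                     key=lambda t: outputs[t])
    if outputs[correct] > outputs[best_wrong]:
        return False                  # correct neuron already highest
    for t in weights:
        if t == correct:
            error = outputs[best_wrong] - outputs[t]   # positive error
        else:
            error = outputs[correct] - outputs[t]      # negative error for winners
        balance = nonexamples[t] / float(examples[t])
        for i, s in enumerate(stats):
            weights[t][i] += s / stds[i] * balance * error * rate
    return True


weights = {"sky": [1.0, 1.0], "equipment": [2.0, 1.0]}
changed = feedback(weights, [0.5, 0.5], [0.25, 0.5],
                   {"sky": 10, "equipment": 30},
                   {"sky": 30, "equipment": 10}, correct="sky")
```

In this example the "equipment" neuron wrongly has the larger output, so the "sky" weights are raised and the "equipment" weights are lowered.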

We first tried such a single-level network.  We got 67.1% accuracy (389/580) for 5 output classes, and 33.4% (194/580) for 25 output classes; only the first was significantly better than with case-based reasoning.  In both cases it helped to turn off weight adjustment during testing.  For Fig. 2 and 25 classes, the neural net got three regions correct: 1 (sky), 4 (equipment), and 10 (equipment); but it misidentified 2 and 5 as equipment, and 11 as terrain.  Fig. 6 lists the final weights for the 5 output classes found by the neural net after training.  This performance was somewhat disappointing, so we looked for additional sources of information to help the neurons.


 

| Statistic Code | Weight for “equipment” | Weight for “landscape” | Weight for “structure” | Weight for “being” | Weight for “gas” |
|---|---|---|---|---|---|
| a | 5.42037 | -6.97767 | -1.94022 | -0.175702 | 6.17973 |
| b | 4.26491 | -8.3301 | 2.66959 | 2.55869 | 1.34759 |
| c | -2.65447 | -1.83798 | 7.46498 | -3.76228 | 3.34481 |
| d | -1.09679 | 0.809329 | 4.76544 | 2.73314 | -4.60454 |
| e | 5.05693 | 5.958 | -4.6198 | 3.70639 | -7.52645 |
| f | -2.14902 | -4.8451 | 0.489836 | -6.63653 | 15.7119 |
| g | 1.35821 | -0.716794 | 3.14483 | 1.64459 | -2.82619 |
| h | 0.518828 | 0.74166 | 0.849864 | -0.039771 | 0.47357 |
| i | 1.19423 | 2.70491 | -2.10099 | 1.60758 | -0.775907 |
| j | 1.37304 | 1.35419 | -0.781583 | -0.625126 | 1.27464 |
| k | 0.780265 | 0.819155 | -0.859497 | 1.71527 | 0.132284 |
| l | 0.357971 | 2.2851 | -3.13827 | 0.713661 | 2.37169 |
| m | 1.17743 | -0.37642 | 2.06484 | -0.196143 | -0.169705 |
| n | 0.0904502 | 0.113563 | 1.07907 | 0.491628 | 0.725293 |
| o | -1.98266 | 4.27702 | -2.03083 | 0.643991 | 1.59248 |
| p | 1.2905 | 0.138557 | 0.260798 | 0.0888679 | 0.725936 |
| q | 0.947604 | 1.28191 | -0.05991 | 1.50196 | -0.880632 |

Figure 6: Weights found by the neural network, without linguistic information, for connections between the 17 input region statistics and the 5 output region classifications.

 

3.3 A neuron for focus identification

Pictures are not put in picture libraries unless they depict something well.  This means that such pictures usually have a "visual focus", an image subject (or more formally, a region or set of regions of primary visual importance).  The visual focus should generally be inferable from region properties, although photography is less precise than verbal description because photographs are sometimes taken in a hurry and do not provide the best possible view of the subject.  When a picture is captioned, visual focus often corresponds to the linguistic focus [Grosz 1977] or verbal emphasis of the caption.  From examination of a wide range of pictures depicting things, we identified six factors that contribute to the probability that a region in a picture is likely to be part of the visual focus:

·        it is big;

·        it is surrounded by a strong discontinuity in color or texture;

·        it has a uniform color mix, though its brightness may vary;

·        it does not touch the boundary of the picture (except for people’s bodies when their faces are shown);

·        its center of mass is close to the center of the photograph;

·        its properties differ from those of any other region in the picture (except for some natural subjects like vegetation).

For Fig. 2 for instance, regions 1, 2, 3, 6, 8, 11, and 15 are unlikely to be foci by border touching, the fourth factor.  Of the remaining regions, 4 and 7 are the most likely to be foci by size, the first factor, and by uniformity of color, the third factor.  Region 4 is best on location of the center of mass, the fifth factor.  Hence region 4 is most likely to be a visual focus of the picture; and indeed, it corresponds to the subject noun phrase of the first sentence of the caption, the linguistic focus.

Since the first five factors were used for region classification, it was easy to build and train a focus neuron on them.  That is, a neuron computing log(W_fo*s_o + W_fj*s_j + W_fg*s_g + W_fc*s_c + W_fq*s_q), where o, j, g, c, and q are the derived-statistics codes, and the weights are special focus-neuron weights.  Regions that corresponded to caption word senses in one or more of the 25 classes were assumed to be part of the visual focus, and the focus-neuron weights were given positive feedback for them, while other regions caused negative feedback.  Word senses for the nouns in the captions were obtained by parsing with an improved version of the caption-parsing program referred to in [Rowe 1996].  Using among other things the WordNet thesaurus system, we mapped the word senses we could to the 25 classes (taking also "instrumentality" for "equipment", "explosion" for "fire", "angiosperm" for "flower", "paved surface" for "pavement", "personnel" for "person", and "cloud" for "sky").  This simple way of using linguistic information did allow too many possible visual foci, but was quick and easy to implement.  The result was a 38.3% success rate (223/580) in focus identification for the set of test regions.  This is not great, but it was achieved without the region-classification information of the last section, and is significantly better than random guessing since about 10% of the regions are foci.
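The focus neuron and its training labels can be sketched in Python as follows.  The names and the all-ones weights are hypothetical, the logarithm floor is an assumption, and the input values are illustrative (loosely based on region 4 of Fig. 4, with no normalization applied).

```python
import math

# Derived-statistics codes used by the focus neuron: size, contrast,
# pixel texture, marginality, and height in picture.
FOCUS_CODES = ("o", "j", "g", "c", "q")


def focus_output(focus_weights, derived):
    """Logarithm of the weighted sum of the five focus-related statistics.
    Both arguments map a statistic code to a number."""
    total = sum(focus_weights[code] * derived[code] for code in FOCUS_CODES)
    return math.log(max(total, 1e-9))   # floor is an assumption of this sketch


def is_training_focus(region_class, caption_classes):
    """A region counts as part of the visual focus during training when its
    class matches a class derived from a caption word sense."""
    return region_class in caption_classes


weights = {code: 1.0 for code in FOCUS_CODES}        # hypothetical weights
region = {"o": 574, "j": 216, "g": 136, "c": 0.0, "q": 0.23}
score = focus_output(weights, region)
label = is_training_focus("equipment", {"equipment", "person"})
```

With untrained weights the size term dominates the sum, which is one reason the weights need feedback before the output is useful.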

 

3.4 Using focus information to improve region identification

We then used the numerical output of the focus neuron to improve the region-classification neurons.  For instance with our test library, the visual focus should correlate negatively with the sky classification, since sky is rarely the subject of these photographs, but the visual focus should correlate positively with the aircraft classification, the main subject of the library.  So for every neuron for a classification t that is a superconcept for at least one caption word sense, we multiply the output of the region-class neuron by the output of the focus neuron to get a cumulative output.  Otherwise, we multiply the neuron output by 0.5, a default found to work well by experiment.
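The combination rule can be written as a short Python sketch.  The function name and the numeric outputs are hypothetical; only the multiplication by the focus output and the default factor of 0.5 come from the text.

```python
def combined_outputs(class_outputs, focus_output, caption_classes,
                     default_factor=0.5):
    """Scale each region-class neuron output: by the focus-neuron output
    when the class covers at least one caption word sense, and by the
    default factor otherwise."""
    combined = {}
    for classification, output in class_outputs.items():
        if classification in caption_classes:
            combined[classification] = output * focus_output
        else:
            combined[classification] = output * default_factor
    return combined


# Invented outputs: "sky" scores higher on its own, but only "airplane"
# is supported by the caption, and the region looks like a visual focus.
combined = combined_outputs({"airplane": 1.2, "sky": 1.5},
                            focus_output=0.9,
                            caption_classes={"airplane"})
best = max(combined, key=combined.get)
```

In this example the caption-supported class overtakes the class that had the larger raw output, which is the intended effect of the focus information.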

This approach improved classification success, again taking a success as when the neuron for the correct classification had the largest output of the neurons.  We obtained a success rate of 48.1% (i.e., 48.1% of the regions were identified correctly) for the modified network on 25 classes, a significant improvement over the success rates both without the focus information and in case-based reasoning.  In addition, the mistakes made were more intelligent, like confusing an aircraft region with a missile region rather than an aircraft with a sky region.  To show what kind of mistakes were made, Fig. 7 shows the confusion matrix for 500 harder test regions, for the final state of the neural network after training with focus information.  Rows represent classes chosen by the program, columns represent the correct classes, and the entries are counts.  Here class 1=airplane, 2=airplane part, 3=helicopter part, 4=bomb, 5=bomb part, 6=missile, 7=missile part, 8=tank part, 9=equipment, 10=building part, 11=wall, 12=mountain, 13=pavement, 14=terrain, 15=sky, 16=person part, 17=smoke, and 18=fire.  (The remaining seven of the 25 classes are omitted from Fig. 7 since they occurred in only three events: sky was misidentified as water once, rock was misidentified as pavement once, and building was misidentified as sky once.)  The counts show that human artifacts like classes 1-11 are easier to identify than natural objects, and that identification of people is very hard (apparently because of the widely varying colors).  Note that these results were reached without any region-relationship constraints, which could significantly help.  Running the neural net averaged 0.0852 seconds per region using semi-compiled Quintus Prolog.


           

Class   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
1       0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2       0  41   0   0   2   0   3   0  13   0   0   1  19  12  11   8   0   0
3       0   0   1   0   0   0   0   0   3   0   0   0   1   0   0   0   0   0
4       0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
5       0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
6       0   1   0   0   0   0   0   0   1   0   0   0   0   0   0   2   0   1
7       1   1   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0   0
8       0   0   0   0   0   0   1   0   0   0   0   1   0   0   0   0   0   0
9       0  24   1   0   0   0   0   0  73   1   0   2   9   3   9   9   1   0
10      0   0   0   0   0   0   0   0   2   2   0   0   0   1   1   0   0   0
11      0   0   0   0   0   0   0   0   2   0   0   0   0   0   2   0   0   0
12      0   8   0   0   1   0   0   0   4   0   0   0   0   2   2   3   0   0
13      0   6   0   1   0   1   1   0  15   0   0   0  14   3  12   0   0   0
14      0   3   0   0   0   0   0   0  10   0   0   0  10  23   6   0   0   0
15      0  12   0   0   4   2   1   0   3   2   0   0   2   7  27   0   2   5
16      0   2   0   0   0   0   0   0   4   0   0   1   2   1   1   0   0   0
17      0   3   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   1
18      0   3   0   0   0   0   1   0   3   0   0   0   2   0   0   1   1   7


Figure 7: Confusion matrix for 500 test regions on the final version of the neural net with focus information, after training.
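As an aid to reading Fig. 7, the following sketch shows one way to summarize such a matrix.  It is written in Python rather than the Prolog of our implementation, the function names are hypothetical, and the counts are transcribed from the figure, so any transcription slip would carry over.  Rows are the class chosen by the program and columns are the correct class, as in the figure.

```python
from typing import List, Optional

# Counts transcribed from Fig. 7 (classes 1-18 in order).
# CONFUSION[chosen - 1][correct - 1] is the number of regions.
CONFUSION: List[List[int]] = [
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 41, 0, 0, 2, 0, 3, 0, 13, 0, 0, 1, 19, 12, 11, 8, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 24, 1, 0, 0, 0, 0, 0, 73, 1, 0, 2, 9, 3, 9, 9, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0],
    [0, 8, 0, 0, 1, 0, 0, 0, 4, 0, 0, 0, 0, 2, 2, 3, 0, 0],
    [0, 6, 0, 1, 0, 1, 1, 0, 15, 0, 0, 0, 14, 3, 12, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 10, 23, 6, 0, 0, 0],
    [0, 12, 0, 0, 4, 2, 1, 0, 3, 2, 0, 0, 2, 7, 27, 0, 2, 5],
    [0, 2, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 2, 1, 1, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 3, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 2, 0, 0, 1, 1, 7],
]


def total_regions(matrix: List[List[int]]) -> int:
    """Number of regions counted in the matrix."""
    return sum(sum(row) for row in matrix)


def correct_regions(matrix: List[List[int]]) -> int:
    """Regions on the diagonal, where chosen class equals correct class."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def fraction_identified(matrix: List[List[int]], cls: int) -> Optional[float]:
    """Fraction of regions whose correct class is cls (1-based) that were
    labeled cls; None if no region of that class appears in the matrix."""
    column_total = sum(row[cls - 1] for row in matrix)
    if column_total == 0:
        return None
    return matrix[cls - 1][cls - 1] / column_total


print(total_regions(CONFUSION), correct_regions(CONFUSION))
print(fraction_identified(CONFUSION, 9), fraction_identified(CONFUSION, 16))
```

With the counts as transcribed, the 18 listed classes account for 497 regions (the three omitted events bring the total to 500), of which 188 lie on the diagonal.  Column 16 (person part) has no diagonal entries at all, which is the pattern behind the remark that people are very hard to identify.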

 

4. Conclusions

We have attempted the difficult task of robust classification for retrieval of visual objects in a library of widely varying depictive natural photographs, using mostly domain-independent ideas.  Images of this kind have not been attempted before, so our results are difficult to compare with others.  Like much information retrieval, our methods are imperfect, but they may be sufficient for browsing.  Our results suggest that traditional region-segmentation methods suffice for most regions in most natural photographs in a 100 by 100 pixel reduction, and that a simple neural net can correctly identify general object classes about half the time (48.1% in our tests).  Results also suggest that a neural net is preferable to case-based reasoning for region classification.  They also show that the notion of visual focus helps in classification, as does an enumeration of relevant concepts mentioned in captions.  In this we confirm the advantages of multimodal redundancy cited in Mani et al. and Hauptmann and Witbrock [this volume].  Our errors appear to be due to the small size of our training set (each region had to be manually identified), some inaccuracy of our segmentation, and our failure to exploit relationship constraints between regions, all of which are remediable with additional work.  The largely domain-independent nature of our methods (albeit not in some of our defined classes) suggests scalability of our approach.  Our work could provide a valuable component in the multimedia-retrieval systems discussed in Section 1 of this volume.

 

5. References

Ballard, D. and Brown, C., 1982.  Computer Vision.  Englewood Cliffs, New Jersey: Prentice-Hall.

Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D., Equitz, W., and Faloutsos, C., 1994.  Efficient and Effective Querying by Image Content.  Journal of Intelligent Information Systems, 3 (3-4): 231-262.

Chang, S., Shi, Q., and Yan, C., 1987.  Iconic Indexing by 2-D Strings.  IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9 (May): 413-428.

Choo, A., Maeder, A., and Pham, B., 1990.  Image Segmentation for Complex Natural Scenes.  Image and Vision Computing, 8 (2): 155-163.

Draper, B., Collins, R., Brolio, J., Hanson, A., and Riseman, E., 1989.  The Schema System.  International Journal of Computer Vision, 2: 209-250.

Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P., 1995.  Query by Image and Video Content: The QBIC System.  Computer, 28 (9): 23-32.

Grosz, B., 1977.  The Representation and Use of Focus in a System for Understanding Dialogs.  Proceedings of the International Joint Conference on Artificial Intelligence, Cambridge, Massachusetts, 67-76.

Kato, T., 1992.  Database Architecture for Content-Based Image Retrieval.  Proceedings of SPIE, Image Storage and Retrieval Systems, San Jose, California (February).

Lamdan, Y. and Wolfson, H., 1988.  Geometric Hashing: a General and Efficient Model-Based Recognition Scheme.  Second IEEE International Conference on Computer Vision, Tampa, Florida, 238-249.

Mehrotra, R. and Gary, J., 1995.  Similar-Shape Retrieval in Shape Data Management.  Computer, 28 (9): 57-62.

Ogle, V. and Stonebraker, M., 1995.  Chabot: Retrieval from a Relational Database of Images.  Computer, 28 (9): 40-48.

Rabitti, F. and Savino, P., 1992.  Automatic Image Indexation to Support Content-Based Retrieval.  Information Processing and Management, 28 (5): 547-565.

Rickman, R. and Stonham, J., 1993.  Similarity Retrieval from Image Databases--Neural Networks Can Deliver.  In Proceedings of SPIE, vol. 1908, Storage and Retrieval for Image and Video Databases, San Jose, California (February), 85-94.

Rowe, N., 1994.  Inferring Depictions in Natural-Language Captions for Efficient Access to Picture Data.  Information Processing and Management, 30 (3): 379-388.

Rowe, N., 1996.  Using Local Optimality Criteria for Efficient Information Retrieval with Redundant Information Filters.  ACM Transactions on Information Systems, 14 (2).

Scassellati, B., Alexopoulos, S., and Flickner, M., 1994.  Retrieving Images by 2D Shape: A Comparison of Computation Methods with Human Perceptual Judgments.  In Proceedings of SPIE, vol. 2185, Storage and Retrieval for Image and Video Databases II, San Jose, California (February), 2-14.

Schutz, H., Hull, D., and Pedersen, J., 1995.  A Comparison of Document Representations and Classifiers for the Routing Problem.  Proceedings of Eighteenth International Conference on Research and Development in Information Retrieval, Seattle, Washington, 229-237.

Seem, D. and Rowe, N., 1994.  Shape Correlation of Low-Frequency Underwater Sounds.  Journal of the Acoustical Society of America, 90 (5): 2099-2103.

Srihari, R., 1994-1995.  Use of Captions and Other Collateral Text in Understanding Photographs.  Artificial Intelligence Review, 8 (5-6): 409-430.

Strat, T., 1992.  Natural Object Recognition.  New York: Springer-Verlag.

Wasserman, P., 1989.  Neural Computing.  New York: Van Nostrand Reinhold.       

 

6. Acknowledgments

 

This work was sponsored by DARPA as part of the I3 Project under AO 8939, and by the U.S. Army Artificial Intelligence Center.

This article is Chapter 4 in Intelligent Multimedia Information Retrieval, ed. M. Maybury, pp. 65-79, Cambridge, MA: AAAI Press, 1997.