Critical Issues in Content Repurposing for Small Devices

Neil C. Rowe

U.S. Naval Postgraduate School

Abstract

Small handheld devices are increasingly popular, but it is difficult to display images, audio, and video on them as users would like. The limitations of such devices in display and processing for multimedia require significant planning to overcome. We discuss several ideas that are being studied, including panning and zooming, translation, substitution of links, and reformulation of data. Many of these require some ranking of data content, so we also discuss methods for that, as well as the process of redoing a data display. This topic is a highly active area of research, and many innovations will be appearing soon.

This is a chapter in the Encyclopedia of Multimedia Technology and Networking, ed. M. Pagani, Hershey, PA: The Idea Group, 2005.

Introduction

Content repurposing is the reorganizing of data for presentation on different display hardware (Singh, 2004). It has been particularly important recently with the growth of handheld devices such as "personal digital assistants" (PDAs), sophisticated telephones, and other small specialized devices. Unfortunately, such devices pose serious problems for multimedia delivery. With their tiny screens (150 by 150 for a basic Palm PDA or 240 by 320 for a more modern one, versus 640 by 480 for standard computer screens), one cannot display much information (like most of a Web page); with their low bandwidths, one cannot display video and audio transmissions from a server ("streaming") with much quality; and with their small storage capabilities, large media files cannot be stored for later playback. Furthermore, new devices and old ones with new characteristics have been appearing at a high rate, so software vendors are having difficulty keeping pace. So some real-time, systematic, and automated planning could be helpful in figuring out how to show desired data, especially multimedia, on a broad range of devices.

Background

The World Wide Web is the de facto standard for providing easily accessible information to people. So it is desirable to use it and its language HTML as a basis for display for small handheld devices. This would enable people to look up ratings of products while shopping, check routes while driving, and perform knowledge-intensive jobs while walking. HTML is, in fact, device-independent: It requires the display device and its Web-browser software to make decisions about how to display its information within guidelines. But HTML does not provide enough information to devices to ensure much user-friendliness of the resulting display: It does not tell the browser where to break lines or which graphics to keep colocated. Display problems are exacerbated when screen sizes, screen shapes, audio capabilities, or video capabilities are significantly different. "Microbrowser" markup languages like WML, S-HTML, and HDML, which are based on HTML but designed to better serve the needs of small devices, help, but they solve only some of the problems.

Content repurposing is a general term for reformatting information for different displays. It occurs frequently with content management for an organization's publications (Boiko, 2002) where "content" or information is broken into pieces and entered in a "repository" to be used for different publications. However, a repository is not cost-effective unless the information is reused many times, something not generally true for Web pages. Content repurposing for small devices also involves real-time decisions about priorities. For these reasons, the repository approach is not often used with small devices.

Content repurposing can be done either before or after a request for it. Preprocessing can create separate pages for different devices, and the device fetches the page appropriate to it. It can also involve conditional statements in pages which cause different code to be executed for different devices; such statements can be written in JavaScript or PHP embedded within HTML, or in more complex server code using such facilities as Java Server Pages (JSP) and Active Server Pages (ASP). It can also involve device-specific planning (Karadkar, 2004). Many popular Web sites provide preprocessed pages for different kinds of devices. Preprocessing is cost-effective for frequently needed content, but requires setup time and can require considerable storage space if there is a large amount of content and many ways to display it.

Content repurposing can also be either client-side or server-side. Server-side means a server supplies repurposed information for the client device; client-side means the device itself decides what to display and how. Server-side repurposing saves work for the device, which is important for primitive devices, and can adjust to fluctuations in network bandwidth (Lyu et al., 2003), but requires added complexity in the server and significant time delays in getting information to the server. Devices can have designated "proxy" servers for their needs. Client-side repurposing, on the other hand, can respond quickly to changing user needs. Its disadvantages are the additional processing burden on an already-slow device, and higher bandwidth demands since information is not eliminated until after it reaches the device. The limitations of small devices require most audio and video repurposing to be server-side.

Methods of Content Repurposing

Repurposing Strategies

Content repurposing for small devices can be accomplished by several methods, including panning, zooming, reformatting, substitution of links, and modification of content.

A default repurposing method of the Internet Explorer and Netscape browser software is to show a "window" on the full display when it is too large to fit on the device screen. Then the user can manipulate slider bars on the bottom and side of the window to view all the content ("pan" over it). Some systems break content into overlapping "tiles" (Kasik, 2004), precomputed units of display information, and users can pan only from tile to tile; this can prevent splitting of key features like buttons and simplifies client-side processing, but only works for certain kinds of content. Panning may be unsatisfactory for large displays like maps since considerable screen manipulation may be required, and good understanding may require an overview. But it works fine for most content.

Another idea is to change the scale of view, "zooming" in (closer) or out (further). This can be either automatic or user-controlled. The MapQuest city-map utility (www.mapquest.com) provides user-controlled zooming by dynamically creating maps at several levels of detail, so the user can start with a city and progressively narrow in on a neighborhood (as well as do panning). A problem for zooming out is that some details like text and thin lines cannot be shrunk beyond a certain minimum size and still remain legible. Such details may be optional; for instance, MapQuest omits most street names and many of the streets in its broadest view. But this may not be what the user wants. Different details can be shrunk at different rates, so that lines one pixel wide are not shrunk at all (Ma & Singh, 2003), but this requires content-specific tailoring.

The formatting of the page can be modified to use equivalent constructs that display better on a destination device (Government of Canada, 2004). For instance with HTML, the fonts can be made smaller or narrower (taking into account viewability on the device) by "font" tags, line spacing can be reduced, or blank space can be eliminated. Since tables take extra space, they can be converted into text. Small images or video can substitute for large images or video when their content permits. Text can be presented sequentially in the same box in the screen to save display space (Wobbrock et al., 2002). For audio and video, the sampling or frame rate can be decreased (one image per second is fine for many applications provided the rate is steady). Visual clues can be added to the display to indicate items just offscreen (Baudisch & Rosenholtz, 2003).

Clickable links can point to blocks of less-important information, thereby reducing the amount of content to be displayed at once. This is especially good for media objects (which can require both bandwidth and screen size) but also helps for paragraphs of details. Links can be thumbnail images, which is helpful for pages familiar to the user. Links can also point to pages containing additional links so the scheme can be hierarchical. Buyukkokten et al. (2002) in fact experimented with repurposing displays containing links exclusively. But insertion of links requires rating the content of the page by importance, a difficult problem in general (as discussed below), to decide what content is converted into links. It also requires careful wording of text links, since something like "picture here" is unhelpful, but a too-long link may be worse than no link at all. Complex link hierarchies may also cause users to get lost.

One can also modify the content of a display by just eliminating unimportant or useless detail and rearranging the display (Gupta et al., 2003). For instance, advertisements, acknowledgements, and horizontal bars can be removed, as well as JavaScript code and Macromedia Flash (SWF) images since most are only decorative. Removed content need not be contiguous, as with removal of a power subsystem from a system diagram. In addition, forms and tables can lose their associated graphics. The lines in block diagrams can often be shortened when their lengths do not matter. Color images can be converted to black-and-white, though one must be careful to maintain feature visibility, perhaps by exaggerating the contrast. User assistance in deciding what to eliminate or summarize is helpful as user judgment provides insights that cannot easily be automated, as with selection of "highlights" for video (Pea et al., 2004). An important special application is selection of information from a page for each user in a set of users (Han, Perret, & Naghshineh, 2000). Appropriate modification of the display for a mobile device can also be quite radical; for instance, a good way to support route-following on a small device could be to give spoken directions rather than a map (Kray et al., 2003).
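As a rough illustration of eliminating decorative content, the sketch below uses Python's standard html.parser module to keep the visible text of a page while discarding script code, style sheets, and embedded objects. Horizontal rules and other markup are dropped as a side effect. The set of skipped elements is an assumption made for this example, and a real system such as the DOM-based approach cited above would preserve far more of the page structure.

```python
from html.parser import HTMLParser

class DecorationStripper(HTMLParser):
    """Collect page text while skipping the contents of selected elements."""

    # Assumed list of elements whose contents are treated as decorative.
    SKIP = {"script", "style", "object"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than zero while inside a skipped element
        self.parts = []  # pieces of visible text, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def visible_text(page):
    """Return the text of an HTML page with decorative elements removed."""
    parser = DecorationStripper()
    parser.feed(page)
    parser.close()
    return " ".join(parser.parts)
```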

Content Rating by Importance

Several of the techniques mentioned above require judgment as to what is important in the data to be displayed. The difficulty of automating this judgment varies considerably with the type of data.

Many editing tools mark document components with additional information like "style" tags, often in a form compatible with the XML language. This information can assign additional categories to information beyond those of HTML, like identifying text as an "introduction", "promotion", "abstract", "author biography", "acknowledgements", "figure caption", "links menu", or "reference list" (Karben, 1999). These categories can be rated in importance by content-repurposing software, and only text of the top-rated categories shown when display space is tight. Such categorization is especially helpful with media objects (Obrenovic, Starcevic, & Selic, 2004), but their automatic content analysis is difficult and it helps to persuade people to categorize them at least partially.
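The category-rating idea can be sketched as a simple greedy selection, assuming that each block of text has already been tagged with a category. The ratings table, the character budget as a stand-in for display space, and the function name are hypothetical; actual repurposing software would measure space in device-specific terms.

```python
def select_blocks(blocks, ratings, budget):
    """Choose blocks to display, top-rated categories first, within a budget.

    blocks: list of (category, text) pairs in page order.
    ratings: dict mapping category name to importance; categories that
    are not listed are treated as rating 0 (an assumption).
    budget: maximum total number of characters to display.
    Returns the chosen blocks in their original page order.
    """
    # Visit blocks from highest to lowest rating, breaking ties by page order.
    order = sorted(range(len(blocks)),
                   key=lambda i: (-ratings.get(blocks[i][0], 0), i))
    chosen = []
    used = 0
    for i in order:
        size = len(blocks[i][1])
        if used + size <= budget:
            chosen.append(i)
            used += size
    return [blocks[i] for i in sorted(chosen)]
```

Restoring page order at the end keeps, for instance, an abstract ahead of the introduction even though it was selected by rating.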

In the absence of explicit tagging, methods of automatic text summarization from natural-language processing can be used. This technology, useful for building digital libraries, can be adapted for the content repurposing problem to display an inferred abstract of a page. One approach is to select sentences from a body of text that are the most important as measured by various metrics (McDonald & Chen, 2002; Alam et al., 2003) like titles and section headings, first sentences of paragraphs, and distinctive keywords. Keywords alone may suffice to summarize text when the words are sufficiently distinctive (Buyukkokten et al., 2002). Distinctiveness can be measured by the classic TF-IDF measure, which is K log(N/n), where K is the number of occurrences of the word in the "document" or text to be summarized, N is the number of documents in a sample, and n is the number of documents in that sample having the word at least once. Other useful inputs for text summarization are the headings of pages linked to (Delort, Bouchon-Meunier, & Rifqi, 2003) since neighbor pages provide content clues. Content can also be classified into semantic units by aggregating clues or even by "parsing" the page display. For instance, the "@" symbol suggests a paragraph of contact information.
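As a minimal sketch of the TF-IDF measure of keyword distinctiveness, the following Python scores each word of a page against a small sample of documents using the common form K log(N/n). The tokenizer, the use of the natural logarithm, the handling of words absent from the sample, and the function names are assumptions made for illustration rather than details of any cited system.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into alphabetic words (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf_scores(document, sample):
    """Return {word: K * log(N / n)} for each word of `document`.

    K is the number of occurrences of the word in the document, N is the
    number of documents in the sample, and n is the number of sample
    documents containing the word. Words absent from the sample are
    treated as n = 1 (an assumption that avoids division by zero).
    """
    counts = Counter(tokenize(document))
    sample_words = [set(tokenize(d)) for d in sample]
    total = len(sample_words)
    scores = {}
    for word, k in counts.items():
        n = sum(1 for words in sample_words if word in words)
        scores[word] = k * math.log(total / max(n, 1))
    return scores

def top_keywords(document, sample, limit=3):
    """Return the `limit` highest-scoring words, ties broken alphabetically."""
    scores = tf_idf_scores(document, sample)
    return sorted(scores, key=lambda w: (-scores[w], w))[:limit]
```

A word that appears in every sample document, such as "the", gets a score of zero however often it occurs, which is why the measure favors distinctive keywords.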

Media objects pose more serious problems than text, however, since they can require large bandwidths to download, and images can require considerable display space. In many cases the media can be inferred to be decorative and can be eliminated, as for many banners and sidebars on pages as well as background sounds. Simple criteria can distinguish decorative graphics from photographs (Rowe, 2002): size (photographs are larger), frequency of the most common color (graphics have a higher frequency), number of different colors (photographs have more), extremeness of the colors (graphics are more likely to have pure colors), and average variation in color between adjacent pixels in the image (photographs have less). Hu and Bagga (2004) extend this to classify images in order of importance as "story", "preview", "host", "commercial", "icons and logos", "headings", and "formatting". Images can be rated by these methods, and then only the top-rated images displayed until they fill the screen. Such rating methods are rarely necessary for video and audio, which are almost always accessed by explicit links. Planning can be done on the server for efficient delivery (Chandra, Ellis, & Vahdat, 2000) and the most important media objects can be delivered first.
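The five image criteria listed above can be computed directly from pixel data, as in the sketch below. It only measures the features; how to weight or threshold them to separate graphics from photographs is left open, since the source does not give those details. The image representation (a list of rows of RGB tuples), the definition of a "pure" color, and all names are assumptions for illustration.

```python
from collections import Counter

def image_features(pixels):
    """Compute simple clues from a 2-D list of (r, g, b) tuples (values 0-255)."""
    height = len(pixels)
    width = len(pixels[0])
    flat = [p for row in pixels for p in row]
    counts = Counter(flat)
    # Fraction of pixels that have the single most common color.
    top_color_fraction = counts.most_common(1)[0][1] / len(flat)
    # Fraction of pixels whose channels are all 0 or 255 ("pure" colors).
    pure_fraction = sum(
        1 for p in flat if all(c in (0, 255) for c in p)
    ) / len(flat)
    # Mean absolute channel difference between horizontally adjacent pixels.
    diffs = [
        sum(abs(a - b) for a, b in zip(row[i], row[i + 1])) / 3
        for row in pixels
        for i in range(width - 1)
    ]
    return {
        "size": width * height,
        "top_color_fraction": top_color_fraction,
        "distinct_colors": len(counts),
        "pure_fraction": pure_fraction,
        "mean_adjacent_difference": sum(diffs) / len(diffs) if diffs else 0.0,
    }
```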

In some cases, preprocessing can analyze the content of the media object and extract the most representative parts. Video is a good example because it is characterized by much frame-to-frame redundancy. A variety of techniques can extract representative frames (say one per shot) that convey the gist of the video and reduce the display to a "slide show". If an image is graphics containing sub-objects, then the less-important sub-objects can be removed and a smaller image constructed. An example is a block diagram where text outside the boxes represents notes that can be deleted. Heuristics useful for finding important sub-objects are nearby labels, objects at ends of long lines, and adjacent blank areas (Kasik, 2004). Processing can also in some applications do "visual abstraction" where, say, a rectangle is substituted for a complex part of the diagram that is known to be a conceptual unit (Egyed, 2002).
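One simple way to pick a representative frame per shot is sketched below, exploiting the frame-to-frame redundancy just mentioned. It assumes that a shot boundary occurs wherever consecutive frames differ by more than a threshold, and it takes the middle frame of each shot as representative. The frame representation (flat lists of grayscale values), the threshold value, and the function names are assumptions for illustration; the source does not name a particular technique.

```python
def frame_difference(a, b):
    """Mean absolute difference between two equal-length grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def key_frames(frames, threshold=30.0):
    """Return the indices of one representative frame per shot.

    frames: list of frames, each a flat list of grayscale values.
    A new shot is assumed to begin wherever two consecutive frames
    differ by more than `threshold`; the middle frame of each shot
    is chosen as its representative.
    """
    if not frames:
        return []
    shot_starts = [0]
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            shot_starts.append(i)
    # Each shot ends where the next one starts (exclusive bound).
    shot_ends = shot_starts[1:] + [len(frames)]
    return [(start + end - 1) // 2 for start, end in zip(shot_starts, shot_ends)]
```

Displaying the frames at the returned indices in sequence gives the "slide show" form of the video.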

Redrawing the Display

Many of the methods discussed require changing the layout of a page of information. Thus content repurposing needs to use methods of efficient and user-friendly display formatting (Kamada & Kawai, 1991; Tan, Ong, & Wong, 1993). This can be a difficult constraint optimization problem where the primary constraints are those of keeping related information together as much as possible in the display. Examples of what needs to be kept together are section headings with their subsequent paragraphs, links with their describing paragraphs, images with their captions, and images with their text references. Some of the necessary constraints, including device-specific ones, can be learned from observing users (Anderson, Domingos, & Weld, 2001). Even with good page design, content search tools are helpful with large displays like maps to enable users to find things quickly without needing to pan or zoom.

Future Work

Content repurposing is currently an active area of research and we are likely to see a number of innovations in the near future in both academia and industry. The large number of competing approaches will dwindle as consensus standards are reached for some of the technology, much as de facto standards have emerged in Web-page style. It is likely that manufacturers of small devices will provide increasingly sophisticated repurposing in their software to reduce the burden on servers. XML will increasingly be used to support repurposing, as it has achieved widespread acceptance in a short time for many other applications. XML will be used to provide standard descriptors for information objects within organizations. But XML will not solve all problems, and the issue of incompatible XML taxonomies could impede progress.

Conclusion

Content repurposing has recently become a key issue in management of small wireless devices, as people want to display on them the information they can display on traditional screens and have discovered that it often looks bad on a small device. So strategies are being devised to modify display information for these devices. Simple strategies are effective for some content, but there are many special cases of information which require more sophisticated methods due to their size or organization.

References

Alam, H., Hartono, R., Kumar, A., Rahman, F., Tarnikov, Y., & Wilcox, C. (2003). Web page summarization for handheld devices: a natural language approach. Proceedings of 7th International Conference on Document Analysis and Recognition, 1153-1158.

Anderson, C., Domingos, P., & Weld, D. (2001, May). Personalizing Web sites for mobile users. Proceedings of 10th International Conference on the World Wide Web, Hong Kong, China, 565-575.

Baudisch, P., & Rosenholtz, R. (2003). Halo: a technique for visualizing off-screen objects. Proceedings of Conference on Human Factors in Computing Systems, Ft. Lauderdale, FL, 481-488.

Boiko, B. (2002). Content management bible. New York: Hungry Minds.

Buyukkokten, O., Kaljuvee, O., Garcia-Molina, H., Paepke, A., & Winograd, T. (2002, January). Efficient Web browsing on handheld devices using page and form summarization. ACM Transactions on Information Systems, 20 (1), 82-115.

Chandra, S., Ellis, C., & Vahdat, A. (2000, December). Application-level differentiated multimedia Web services using quality aware transcoding. IEEE Journal on Selected Areas in Communications, 18 (12), 2544-2565.

Delort, J.-Y., Bouchon-Meunier, B., & Rifqi, M. (2003, August). Enhanced Web document summarization using hyperlinks. Proceedings of 14th ACM Conference on Hypertext and Hypermedia, Nottingham, UK, 208-215.

Egyed, A. (2002, October). Automatic abstraction of class diagrams. IEEE Transactions on Software Engineering and Methodology, 11 (4), 449-491.

Government of Canada (2004). Tip sheets: Personal Digital Assistants (PDA). Retrieved May 5, 2004 from www.chin.gc.ca/English/Digital_Content/Tip_Sheets/Pda.

Gupta, S., Kaiser, G., Neistadt, D., & Grimm, P. (2003, May). DOM-based content extraction of HTML documents. Proceedings of 12th International Conference on the World Wide Web, Budapest, Hungary, 207-214.

Han, R., Perret, V., & Naghshineh, M. (2000, December). WebSplitter: A unified XML framework for multi-device collaborative Web browsing. Proceedings of ACM Conference on Computer Supported Cooperative Work, Philadelphia, PA, 221-230.

Hu, J., & Bagga, A. (2004, January-March). Categorizing images in Web documents. IEEE Multimedia, 11 (1), 22-30.

Jing, H., & McKeown, K. (2000). Cut and paste based text summarization. Proceedings of First Conference of North American Chapter of the Association for Computational Linguistics, Seattle, WA, 178-185.

Kamada, T., & Kawai, S. (1991, January). A general framework for visualizing abstract objects and relations. ACM Transactions on Graphics, 10 (1), 1-39.

Karadkar, U., Furuta, R., Ustun, S., Park, Y., Na, J.-C., Gupta, V., Ciftci, T., & Park, Y. (2004, August). Display-agnostic hypermedia. Proceedings of 15th ACM Conference on Hypertext and Hypermedia, Santa Cruz, CA, 58-67.

Karben, A. (1999, March). News you can reuse -- content repurposing at The Wall Street Journal Interactive Edition. Markup Languages: Theory & Practice, 1 (1), 33-45.

Kasik, D. (2004, January-March). Strategies for consistent image partitioning. IEEE Multimedia, 11 (1), 32-41.

Kray, C., Elting, C., Laakso, K., & Coors, V. (2003). Presenting route instructions on mobile devices. Proceedings of 8th International Conference on Intelligent User Interfaces, Miami, FL, 117-124.

Lyu, M., Yen, J., Yau, E., & Sze, S. (2003, November). A wireless handheld multi-modal digital video library client system. Proceedings of 5th ACM International Workshop on Multimedia Information Retrieval, Berkeley, CA, 231-238.

Ma, R.-H., & Singh, G. (2003). Effective and efficient infographic image downscaling for mobile devices. Proceedings of 4th International Workshop on Mobile Computing, Rostock, Germany.

McDonald, D., & Chen, H. (2002, July). Using sentence-selection heuristics to rank text in XTRACTOR. ACM-IEEE Joint Conference on Digital Libraries, Portland, OR, 28-35.

Obrenovic, Z., Starcevic, D., & Selic, B. (2004, January-March). A model-driven approach to content repurposing. IEEE Multimedia, 11 (1), 62-71.

Pea, R., Mills, M., Rosen, J., & Dauber, K. (2004, January-March). The DIVER project: interactive digital video repurposing. IEEE Multimedia, 11 (1), 54-61.

Rowe, N. (2002, July/August). MARIE-4: A high-recall, self-improving Web crawler that finds images using captions. IEEE Intelligent Systems, 17 (4), 8-14.

Singh, G. (2004, January-March). Content repurposing. IEEE Multimedia, 11 (1), 20-21.

Tan, K., Ong, G., & Wong, P. (1993, July). A heuristics approach to automatic data flow diagram layout. Proceedings of 6th International Workshop on Computer-Aided Software Engineering, Singapore, 314-323.

Wobbrock, J., Forlizzi, J., Hudson, S., & Myers, B. (2002, October). WebThumb: interaction techniques for small-screen browsers. Proceedings of 15th ACM Symp. on User Interface Software and Technology, Paris, France, 205-208.

Definitions of Terms

content management: Management of Web pages as assisted by software, "Web page bureaucracy".

content repurposing: Reorganizing or modifying the content of a graphical display to fit effectively on a different device than its original target.

microbrowser: A Web browser designed for a small device.

key frames: Representative shots extracted from a video that illustrate its main content.

pan: Move an image window with respect to the portion of the larger image from which it is taken.

PDA: "Personal Digital Assistant", a small electronic device that functions like a notepad.

streaming: Sending multimedia data to a client device at a rate that enables it to be played without having to store it.

tag: HTML and XML markers that delimit semantically meaningful units in their code.

XML: Extensible Markup Language, a general language for structuring information for exchange on the Internet; like HTML it derives from SGML, but unlike HTML it lets users define their own tags.

zoom: Change the fraction of an image being displayed when that image is taken from a larger one.