4
Collecting and Organizing Web Content Mira Dontcheva 1 Steven M. Drucker 3 Geraldine Wade 4 David Salesin 1,2 Michael F. Cohen 3 1 Computer Science & Engineering University of Washington Seattle, WA 98105-4615 {mirad, salesin}@cs.washington.edu 2 Adobe Systems 801 N. 34th Street Seattle, WA 98103 [email protected] 3 Microsoft Research, 4 Microsoft One Microsoft Way Redmond, WA 98052-6399 {sdrucker, gwade, mcohen}@microsoft.com ABSTRACT Our work focuses on lowering user overhead for collecting and organizing Web content through the use of automation and visualization. We discuss a framework that enables users to semi-automatically collect, view, and share personal Web content. Our approach allows a user to interactively select webpage elements of interest and automatically collect sim- ilar content. Further, we describe a technique for creating visual summaries of the collected information by combin- ing user labeling with predefined layout templates. These summaries are interactive in nature and provide a variety of visual representations for the collected content. Finally, the summaries can be saved or sent to other users to continue the research at another place or time. 1. INTRODUCTION As the amount of content delivered over the World Wide Web grows, so does the consumption of information. To- day people collect and organize information in ways different from before: in order to plan a vacation, they no longer go to a travel agent; when making purchases, they no longer go to a neighborhood specialty store; and when learning about a new topic, they no longer go to the library. Instead, they browse the Web. One study [6] of Web usage in 2005 shows that users visit thousands of webpages per week and in any thirty minute browsing session may visit hundreds of indi- vidual pages. Although advancements in search technologies make it much easier to find information, users often browse the Web with a particular task in mind and are concerned not only with finding but also with collecting, organizing, and sharing content. These types of browsing sessions typi- cally last a long time, span several sessions, involve gathering large amounts of heterogeneous content, and can be difficult to organize ahead of time, as the categories emerge through the tasks themselves, as discussed in [13]. Research in cur- rent practices for collecting and organizing Web content [9] shows that users save bookmarks or tabs, collect content in documents, store pages locally, and print them out. These . methods require a great deal of overhead as pages must be saved manually and organized into folders, which distracts from the real task of analyzing the content and making deci- sions. The goal of our work is to lower the overhead of gath- ering and organizing Web content. To achieve this goal we propose providing the user user with automatic techniques for extracting Web content so that he does not have to man- ually collect each piece of information. Further, we suggest that visual depictions are a better mechanism for organizing content for the purposes of analysis and sharing than current practices with files and folders. In this paper we describe on- going research on automatic mechanisms for collecting and presenting Web content. We describe a framework that al- lows users to collect Web content semi-automatically and view and organize that content through visual summaries. We have implemented a first version of our framework as a browser extension and present results of preliminary evalu- ations. Next we describe our framework, discuss user evalu- ation, and present future research directions. For a detailed description of our work, please see [3]. 2. WEB CONTENT SUMMARIES Our approach is motivated by Gibson et al.’s analysis [5] of the amount of template content on the Web. Their study shows that 40-50% of the content on the Web is template content and that template material has grown and is contin- uing to grow at a rate of approximately 6% per year. Schrae- fel et al. [12] further show that people are often interested in collecting not only full webpages but also pieces of web- pages. These findings motivated us to design an interface that leverages the growing amount of templated informa- tion on the Web and allows users to gather only the pieces of content they find relevant to their task. Our framework presents the user with an interface for semi-automatically collecting pieces of webpages and visual summaries that display the gathered content uniformly. Our current implementation consists of three components: an ex- traction pattern repository, a database, and a set of layout templates. Please refer to Figure 1 for a visual description of our system. The user specifies relevant content by interac- tively selecting Web page elements and labeling each element with a keyword, chosen from a predefined set including name, appearance, or address. The selected content is stored in the summary database and is also used to build an extraction pattern that can be used to collect more content from similar pages. Extraction patterns specify the locations of selected elements using the structure of the webpage and, optionally, text that must be matched in any new pages. The extraction Personal Information Management - A SIGIR 2006 Workshop 44

Collecting and Organizing Web Contentpim.ischool.washington.edu/pim06/files/dontcheva-paper.pdf · calendar, and grid. Figures 2 and 3 show the same content composed with two different

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Collecting and Organizing Web Contentpim.ischool.washington.edu/pim06/files/dontcheva-paper.pdf · calendar, and grid. Figures 2 and 3 show the same content composed with two different

Collecting and Organizing Web Content

Mira Dontcheva1 Steven M. Drucker3 Geraldine Wade4 David Salesin1,2 Michael F. Cohen3

1Computer Science & EngineeringUniversity of WashingtonSeattle, WA 98105-4615

{mirad, salesin}@cs.washington.edu

2Adobe Systems801 N. 34th StreetSeattle, WA 98103

[email protected]

3Microsoft Research, 4MicrosoftOne Microsoft Way

Redmond, WA 98052-6399{sdrucker, gwade, mcohen}@microsoft.com

ABSTRACTOur work focuses on lowering user overhead for collectingand organizing Web content through the use of automationand visualization. We discuss a framework that enables usersto semi-automatically collect, view, and share personal Webcontent. Our approach allows a user to interactively selectwebpage elements of interest and automatically collect sim-ilar content. Further, we describe a technique for creatingvisual summaries of the collected information by combin-ing user labeling with predefined layout templates. Thesesummaries are interactive in nature and provide a variety ofvisual representations for the collected content. Finally, thesummaries can be saved or sent to other users to continuethe research at another place or time.

1. INTRODUCTIONAs the amount of content delivered over the World Wide

Web grows, so does the consumption of information. To-day people collect and organize information in ways differentfrom before: in order to plan a vacation, they no longer goto a travel agent; when making purchases, they no longer goto a neighborhood specialty store; and when learning abouta new topic, they no longer go to the library. Instead, theybrowse the Web. One study [6] of Web usage in 2005 showsthat users visit thousands of webpages per week and in anythirty minute browsing session may visit hundreds of indi-vidual pages. Although advancements in search technologiesmake it much easier to find information, users often browsethe Web with a particular task in mind and are concernednot only with finding but also with collecting, organizing,and sharing content. These types of browsing sessions typi-cally last a long time, span several sessions, involve gatheringlarge amounts of heterogeneous content, and can be difficultto organize ahead of time, as the categories emerge throughthe tasks themselves, as discussed in [13]. Research in cur-rent practices for collecting and organizing Web content [9]shows that users save bookmarks or tabs, collect content indocuments, store pages locally, and print them out. These

.

methods require a great deal of overhead as pages must besaved manually and organized into folders, which distractsfrom the real task of analyzing the content and making deci-sions. The goal of our work is to lower the overhead of gath-ering and organizing Web content. To achieve this goal wepropose providing the user user with automatic techniquesfor extracting Web content so that he does not have to man-ually collect each piece of information. Further, we suggestthat visual depictions are a better mechanism for organizingcontent for the purposes of analysis and sharing than currentpractices with files and folders. In this paper we describe on-going research on automatic mechanisms for collecting andpresenting Web content. We describe a framework that al-lows users to collect Web content semi-automatically andview and organize that content through visual summaries.We have implemented a first version of our framework as abrowser extension and present results of preliminary evalu-ations. Next we describe our framework, discuss user evalu-ation, and present future research directions. For a detaileddescription of our work, please see [3].

2. WEB CONTENT SUMMARIESOur approach is motivated by Gibson et al.’s analysis [5]

of the amount of template content on the Web. Their studyshows that 40-50% of the content on the Web is templatecontent and that template material has grown and is contin-uing to grow at a rate of approximately 6% per year. Schrae-fel et al. [12] further show that people are often interestedin collecting not only full webpages but also pieces of web-pages. These findings motivated us to design an interfacethat leverages the growing amount of templated informa-tion on the Web and allows users to gather only the piecesof content they find relevant to their task.

Our framework presents the user with an interface forsemi-automatically collecting pieces of webpages and visualsummaries that display the gathered content uniformly. Ourcurrent implementation consists of three components: an ex-traction pattern repository, a database, and a set of layouttemplates. Please refer to Figure 1 for a visual description ofour system. The user specifies relevant content by interac-tively selecting Web page elements and labeling each elementwith a keyword, chosen from a predefined set including name,appearance, or address. The selected content is stored inthe summary database and is also used to build an extractionpattern that can be used to collect more content from similarpages. Extraction patterns specify the locations of selectedelements using the structure of the webpage and, optionally,text that must be matched in any new pages. The extraction

Personal Information Management - A SIGIR 2006 Workshop

44

Page 2: Collecting and Organizing Web Contentpim.ischool.washington.edu/pim06/files/dontcheva-paper.pdf · calendar, and grid. Figures 2 and 3 show the same content composed with two different

Figure 1: The user-selected page elements define extraction patterns that are stored in the extraction patternrepository. Every new page the user visits is matched with the patterns, and if matching page elements arefound, the user can add them to the summary database. The layout templates filter the database and composesummary views. The user can view the database contents at any time with any view.

pattern repository contains these patterns and when the uservisits a new page, all specified patterns are matched withthe page to automatically find analogous content. If thereare matches, the user can add the matching content to thedatabase with just the press of a button. We also provide aninterface for collecting content from multiple pages simulta-neously by allowing the user to select hyperlinks that pointto the pages of interest.

The summary database stores the collected content ac-cording to the user-specified labels and source webpage. Toorganize and present the gathered content, we employ layouttemplates that create visual summaries. A layout templatespecifies how the content should be organized and format-ted and defines a set of user interactions available within asummary. The templates filter the content in the databaseaccording to the labeling specified by the user during contentselection. To demonstrate possible content presentations, wehave implemented several layout templates, including a map,calendar, and grid. Figures 2 and 3 show the same contentcomposed with two different layout templates, a grid andmap. Layout templates can also be designed for differentpurposes or devices. Figure 4 shows the a visual summarycomposed with the PDA layout template. Our framework isnot limited to these designs and can incorporate additionallayout templates.

2.1 Preliminary EvaluationWe performed preliminary evaluations of our system by

1) measuring the performance of our extraction patterns ona set of popular websites, and 2) soliciting user feedback inan informal user study.

To measure the performance of our extraction patternswe used nine popular websites for shopping (Amazon, Ebay,Netflix), travel planning (Expedia, Orbitz, Travelocity), andacademic reference (ACM Digital Library, CiteSeer, IEE-EXplore). For each website, we visited 25 randomly selectedwebpages. On the first visited page for each website, we se-lected 3-6 page elements to build an extraction pattern andthen collected content from 24 more pages. We then mea-

sured how much of the content was collected from the 24webpages and computed an accuracy for each pattern. Wedefine accuracy as the number of page elements collected us-ing the extraction patterns divided by the number of pageelements that were present in all the visited webpages. Forseven of the nine websites our extraction patterns were 99-100% accurate requiring only one extraction pattern. Thetwo websites where our system had trouble are Ebay (82%accuracy) and Expedia (59% accuracy). Each product web-page on Ebay can be customized by the seller, and thus thereare a variety of possible presentations. We measured pagescreated by the same seller and our system was 100% ac-curate. Similarly, Expedia uses several different layouts forhotels. We suspect the hotel layouts have regional differencesand also depend on when the hotel was added to the Expe-dia database. Although the scope of this analysis is limited,it suggests that structural extraction patterns are appropri-ate for the task of collecting similar content. We are nowin the process of collecting Web data over a period of sev-eral months and investigating how to adapt the extractionpatterns to structural changes of webpages.

To evaluate the user interface of our system, we conducteda pilot user study with nine participants. The participantshad individual sessions of approximately one hour and wereasked to perform three Web research tasks as part of thescenario “a family visit to Seattle.” Each participant wasgiven a brief tutorial and then asked to collect content fromthree different websites. All participants completed the taskssuccessfully and were very positive about using our tool. Sixof them were willing to use it immediately. Although we de-signed our interface with the goal of minimizing user effort,at the onset of the user evaluation we were concerned aboutrequiring manual specification of extraction patterns. Wewere pleasantly surprised to find that all nine participantswere happy to specify their own patterns, as it considerablyaccelerated the rest of the task. One user mentioned, “I’mwilling to do that [manually select pieces of content], be-cause to me it’s a method for narrowing down the space of

Personal Information Management - A SIGIR 2006 Workshop

45

Page 3: Collecting and Organizing Web Contentpim.ischool.washington.edu/pim06/files/dontcheva-paper.pdf · calendar, and grid. Figures 2 and 3 show the same content composed with two different

Figure 2: The grid layout template places all the con-tent in the database on a grid in groups, accordingto extraction pattern. When the user moves the cur-sor over an item in the grid, a descriptive paragraphfor that item is presented.

what I’m looking at. By waiting for me to specify what I careabout, it allows me to eliminate what I don’t care about.”

Five of the participants requested that we provide mech-anisms for configuring both the label choices and the sum-mary layouts. We expected that the set of labels currentlydefined by the templates might not be meaningful for allusers or in all situations. For such cases, it would be bestto allow the user to create labels; however, any new labelswould have to be incorporated in the templates. In the fu-ture, we hope to create a graphical authoring tool for thetemplates, which would make it possible to dynamically addnew labels as the user is gathering content. We expect thatthere will be two types of users of our system: those whouse the defaults and those who create their own layouts andschemas for collecting and presenting information.

3. RELATED WORKOur work has its roots in the area of information workspaces

and systems like WebBook [2], Data Mountain [11], and Top-icShop [1], which present different mechanisms for present-ing and organizing collections of webpages. Our system ismost closely related to Hunter Gatherer [12] and InternetScrapbook [14], which provide the ability to save webpageelements and place them together in a document. Like thesetwo systems, we provide an interface that allows the user toselect pieces of webpages; however, we go further by makinguse of the user selection to create extraction patterns thatcan then be used to gather content automatically. We alsoprovide a mechanism for organizing the gathered content inricher ways, such as on a map or calendar, and for differentpurposes, such as for a PDA or printer.

Our use of user-specified labels for relating content fromdifferent sources links our work to Semantic Web applica-tions such as Piggy Bank [8] and Thresher [7]. Our workdiffers from these in two ways. First, we create visual sum-maries of the gathered content, while Piggy Bank and Thresherprovide a database interface. Second, our interface for label-

Figure 3: With the map layout template, the usercan view the content with respect to a geographicalmap. The template filters the database and presentsonly content that includes an address.

ing the content is more accessible to the average Web user,because all the user has to do is right-click and choose akeyword. Piggy Bank requires that the user write scripts,while Thresher expects that the user has some knowledge ofSemantic Web constructs.

Bibliographic reference tools, such as RefWorks [10] andEndNote [4], are specifically designed for collecting and man-aging academic references. These systems have specializedfilters, much like our extraction patterns, for electronic ref-erences websites and can automatically import all the datafor a particular reference into a database. Our system com-plements such systems by giving the user an interface forselecting personally relevant bibliographic information. Fur-thermore, our approach is more general and is applicable toother Web content.

4. FUTURE WORKThis paper presents a framework for collecting and view-

ing content found on the Web in a personalized way. Basedon this approach we implemented a system that is applicablefor a variety of tasks, including comparison shopping, travelplanning, and reference collecting. An evaluation of our sys-tem leads us to the firm conclusion that such an approach ismuch needed and welcomed by Web users. Furthermore, ourwork suggests new directions for research, such as designingan end-user interface for customizing layouts, providing apublic repository of extraction patterns and summaries, andusing visual summaries for other Web tasks.

Currently, the layout templates we use are pre-definedwith the help of a designer and are limited in the typesof interactions possible. We envision allowing the user to in-teractively specify the templates either with an authoringtool or directly in the summary by applying filters or in-teractively selecting which content should be displayed. Weplan to provide a set of default templates that the user cancustomize interactively to create new specialized summaryviews.

Personal Information Management - A SIGIR 2006 Workshop

46

Page 4: Collecting and Organizing Web Contentpim.ischool.washington.edu/pim06/files/dontcheva-paper.pdf · calendar, and grid. Figures 2 and 3 show the same content composed with two different

Figure 4: The PDA layout template is designed fora small-screen device so that the user can take thesummary anywhere. The layout separates the con-tent into tabs according to Web site and providesdetailed views when the user clicks on an item.

While we are satisfied that our user study participantsdidn’t find the overhead of selecting and labeling page ele-ments too time consuming, we were curious how they feltabout using a public repository of extraction patterns in-stead of manually selecting page elements. We asked theparticipants whether they would be willing to share and usea public repository of extraction patterns. The participantswere timid about using patterns defined by others as theyfelt they were no longer in control of the gathered infor-mation. Seven of the participants said they would be hes-itant to use a public repository, as they were unsure howmuch work would be required. We anticipate that althougha public repository of extraction patterns may not be usedby everyone, it could be useful for automatically labelingpage elements, especially since four of the nine participantsdid not remember to label the page elements they selected.Furthermore, we anticipate that such as repository may beuseful for sharing layout customizations or even summariesso users can share task specific content collections. For exam-ple, when planning a trip, users can use an existing summaryfor such a trip as a starting point in their search.

Finally, we plan to evaluate our system more rigorously.We would like to evaluate the robustness of our extractionpatterns over time and understand how often website tem-plates change. Given the positive reaction in the pilot study,we plan to release the system and study users as they carryout their own tasks over several months. We want to explorehow people accumulate content and use our framework overtime. We expect that users will find new tasks for our systemin addition to those we have already explored.

5. REFERENCES[1] B. Amento, L. Terveen, and W. Hill. Experiments in

social data mining: The topicshop system. ACMTransactions on Computer-Human Interaction(TOCHI), pages 54–85, 2003.

[2] S. Card, G. Roberston, and W. York. The webbookand the web forager: An information workspace for the

world-wide web. In Proc. of the SIGCHI conference onHuman factors in computing systems, 1996.

[3] M. Dontcheva, S. M. Drucker, G. Wade, D. Salesin,and M. F. Cohen. Summarizing personal webbrowsing sessions. In UIST ’06: Proc. of the 19thannual ACM symposium on User interface softwareand technology, to appear, 2006.

[4] EndNote. http://www.endnote.com.

[5] D. Gibson, K. Punera, and A. Tomkins. The volumeand evolution of web page templates. In WWW ’05:Special interest tracks and posters of the 14thinternational conference on World Wide Web, pages830–839, New York, NY, USA, 2005. ACM Press.

[6] K. Hawkey and K. Inkpen. Web browsing today: theimpact of changing contexts on user activity. In CHI’05: CHI ’05 extended abstracts on Human factors incomputing systems, pages 1443–1446, New York, NY,USA, 2005. ACM Press.

[7] A. Hogue and D. Karger. Thresher: automating theunwrapping of semantic content from the world wideweb. In WWW ’05: Proc. of the 14th internationalconference on World Wide Web, pages 86–95, NewYork, NY, USA, 2005. ACM Press.

[8] D. Huynh, S. Mazzocchi, and D. Karger. Piggy bank:Experience the semantic web inside your web browser.In ISWC ’05: Proc. of 4th international Semantic WebConference, 2005.

[9] W. Jones, H. Bruce, and S. Dumais. Once found, whatthen?: A study of “keeping” behaviors in the personaluse of web information. In Proc. of ASIST 2002, 2002.

[10] RefWorks. http://www.refworks.com.

[11] G. Robertson, M. Czerwinski, K. Larson, D. Robbins,D. Thiel, and M. van Dantzich. Data mountain: usingspatial memory for document management. In Proc.of the 11th annual ACM symposium on User interfacesoftware and technology (UIST 1998), pages 153–162,1998.

[12] m. schraefel, Y. Zhu, D. Modjeska, D. Wigdor, andS. Zhao. Hunter gatherer: interaction support for thecreation and management of within-web-pagecollections. In WWW ’02: Proc. of the 11thinternational conference on World Wide Web, pages172–181, New York, NY, USA, 2002. ACM Press.

[13] A. J. Sellen, R. Murphy, and K. L. Shaw. Howknowledge workers use the web. In CHI ’02: Proc. ofthe SIGCHI conference on Human factors incomputing systems, pages 227–234, New York, NY,USA, 2002. ACM Press.

[14] A. Sugiura and Y. Koseki. Internet scrapbook:automating web browsing tasks by demonstration. InUIST ’98: Proc. of the 11th annual ACM symposiumon User interface software and technology, pages 9–18,New York, NY, USA, 1998. ACM Press.

Personal Information Management - A SIGIR 2006 Workshop

47