Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
Georg Rehm1, Marina Santini2, Alexander Mehler3, Pavel Braslavski4, Rüdiger Gleim3, Andrea Stubbe5, Svetlana Symonenko6, Mirko Tavosanis7, Vedrana Vidulin8
Language Resources and Evaluation Conference – LREC 2008
1 University of Tübingen, Germany; SFB 441: Linguistic Data Structures
2 DSV, Sweden; KTH-Stockholm University
3 University of Bielefeld, Germany; Computational Linguistics Dept.
4 Inst. of Engineering Science, RAS; Ekaterinenburg, Russia
5 conject AG; Munich, Germany
6 Nitol, LLC; Moscow, Russia
7 Università di Pisa, Italy; Dipartimento di Studi italianistici
8 Jožef Stefan Institute; Ljubljana, Slovenia
Corresponding author:
Introduction
Genres are specific types of text.
Genres have, roughly speaking, three characteristic properties:
- Content: topic
- Form: layout, design, text structure etc.
- Function: communicative purpose etc.
Genres are socially specified sets of rules and conventions.
Genres are recognised by particular discourse communities.
Genres usually have established names.
Examples of Traditional Genres
Scope of this Talk
There are not only hundreds (Dimter, 1981), but thousands (Adamzik, 1995) of genres:
- Shopping list
- Love letter
- Flyer
- Weather forecast
- CV
- PhD thesis
- …
This talk is not about traditional, paper-based genres.
This talk is about web genres.
Web Genres
Studies have shown that genres also exist on the web, e.g.:
- Personal homepage
- FAQ
- Blog
- Search engine
- Encyclopedia
- Web shop
Web genres are more complex than traditional genres:
- The web is a hypertext system
- Interactive features
- Multimedia
Automatic Web Genre Identification
If we were able to identify web genres automatically, search engines could exploit this information, for example to find:
- textbook web pages that contain “language resource”
- PhD thesis web pages that contain “RCG parsing”
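The two queries above amount to combining a keyword match with a genre filter. A minimal sketch of that idea follows; the `WebPage` class, the `search` function, and the example pages are hypothetical, and the genre labels are simply assumed to have been assigned by some genre identification system beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebPage:
    url: str
    genre: str  # assumed to be assigned by an automatic genre classifier
    text: str


def search(pages, phrase, genre):
    """Return the pages of the given genre whose text contains the phrase."""
    phrase = phrase.lower()
    genre = genre.lower()
    return [
        page for page in pages
        if page.genre.lower() == genre and phrase in page.text.lower()
    ]


# Hypothetical mini index, for illustration only.
pages = [
    WebPage("http://example.org/a", "textbook", "A language resource is a data set ..."),
    WebPage("http://example.org/b", "blog", "My favourite language resource this week"),
    WebPage("http://example.org/c", "PhD thesis", "We study RCG parsing in depth."),
]

hits = search(pages, "language resource", "textbook")
print([page.url for page in hits])
```

Page b also contains the phrase, but it is filtered out because its genre is not "textbook".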
About 20 different approaches have been published in this area (incl. the identification of traditional genres). They mainly use
- Machine learning methods
- Hand-crafted genre detection rules
Automatic Web Genre Identification
All approaches have some characteristics in common.
Nearly every group of researchers
- have their own personal definition of “web genre”,
- create their own document collection,
- create their own set of web genre labels,
- annotate their corpora with these web genre labels.
[Diagram: a web genre identification approach consists of a classification algorithm, a corpus (collection of web documents), and a tag set (genre categories); each of the three is marked "DIY".]
Automatic Web Genre Identification
[Diagram: five separate approaches (Approach 1 to Approach 5), each with its own algorithm, its own corpus, and its own tag set.]
It’s impossible to compare such isolated approaches.
Towards a Reference Corpus of Web Genres
[Diagram: Approaches 1 to 5, each with its own algorithm, all use one Reference Corpus of Web Genres, which enables comparative evaluation.]
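What comparative evaluation on a shared corpus could look like is sketched below. This is only an illustration: the tiny `reference_corpus`, the two toy classifiers, and the choice of plain accuracy as the measure are assumptions, not part of the proposal.

```python
def accuracy(classify, corpus):
    """Fraction of documents for which classify(doc) equals the gold label."""
    correct = sum(1 for doc, gold in corpus if classify(doc) == gold)
    return correct / len(corpus)


def compare(approaches, corpus):
    """Evaluate every approach on the same corpus with the same measure."""
    return {name: accuracy(classify, corpus) for name, classify in approaches.items()}


# Hypothetical shared corpus: (document text, gold genre label) pairs.
reference_corpus = [
    ("Q: How do I reset my password? A: ...", "FAQ"),
    ("Posted on Monday: today I went hiking", "Blog"),
    ("Q: Where can I download the tool? A: ...", "FAQ"),
    ("Add to cart. Price: 10 EUR", "Shop"),
]


def approach_1(doc):
    # Toy hand-crafted rule: question/answer pages are FAQs, the rest blogs.
    return "FAQ" if doc.startswith("Q:") else "Blog"


def approach_2(doc):
    # Toy baseline: always predict the same genre.
    return "FAQ"


scores = compare({"Approach 1": approach_1, "Approach 2": approach_2}, reference_corpus)
print(scores)
```

Because both approaches are scored against the same documents and the same labels, their results are directly comparable, which is not the case when each approach brings its own corpus and tag set.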
Towards a Reference Corpus of Web Genres
[Diagram: Approaches 1 to 5, each with its own algorithm, share two components: a reference collection of web documents and a shared genre category set or sets.]
Assigning Genre Labels to Web Pages
The construction of a genre corpus involves the task of assigning genre labels to web documents by a group of annotators.
Previous studies have shown that this is a very hard task.
[Diagram: each web document is tagged with a genre category taken from the set of genre categories.]
Preliminary Study
We conducted a survey amongst the group of authors:
- Goal: to measure the agreement of genre labels assigned to a random sample of 50 web documents by persons who are engaged in genre-related research.
- Seven of the nine authors participated.
Result: the categories assigned by the participants contain a very high number of disparate terms at various levels of abstraction.
Conclusion: the task of assigning genre labels to web documents is – even for linguists who work on genres – very hard.
Assigning Genre Labels to Web Pages
Consistency: High
- Participant 1: News article
- Participant 2: Article/commentary
- Participant 3: Article
- Participant 4: Feature
- Participant 5: A newsletter article
- Participant 6: News article
- Participant 7: Journalistic
Assigning Genre Labels to Web Pages
Consistency: Low
- P1: Entry page of the website of a research journal
- P2: Table of contents with snippets
- P3: Portal, link collection
- P4: Bibliography/List of Articles
- P5: A homepage of a subscription-based academic journal
- P6: Homepage
- P7: Index, Content Delivery
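One simple way to put a number on such label sets is pairwise exact-match agreement, sketched below with the participants' labels from the two examples above. This measure is an assumption chosen for illustration; the slides do not say which agreement measure the survey used.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Share of annotator pairs that chose exactly the same label.

    Labels are compared after trimming whitespace and lowercasing.
    """
    normalised = [label.strip().lower() for label in labels]
    pairs = list(combinations(normalised, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Labels from the "Consistency: High" example.
high = [
    "News article", "Article/commentary", "Article", "Feature",
    "A newsletter article", "News article", "Journalistic",
]

# Labels from the "Consistency: Low" example.
low = [
    "Entry page of the website of a research journal",
    "Table of contents with snippets",
    "Portal, link collection",
    "Bibliography/List of Articles",
    "A homepage of a subscription-based academic journal",
    "Homepage",
    "Index, Content Delivery",
]

print(round(pairwise_agreement(high), 3))
print(pairwise_agreement(low))
```

Even in the example rated as highly consistent, only one of the 21 annotator pairs matches exactly, because the free-text labels sit at different levels of abstraction. Exact matching only becomes meaningful once annotators choose from a shared category set.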
Genre Category Sets in Previous Approaches
Almost all category sets used in previous approaches are
- limited in size and scope and
- contain categories that cannot be considered genres:
Lim et al. (2005): Personal homepages; Public homepages; Commercial homepages; Bulletin collections; Link collections; Image collections; Simple tables/lists; Input pages; Journalistic materials; Research reports; Official materials; Informative materials; FAQs; Discussions; Product specifications; Others
Vidulin et al. (2007): Blog; Childrens’; Commercial/Promotional; Community; Content Delivery; Entertainment; Error Message; FAQ; Gateway; Index; Informative; Journalistic; Official; Personal; Poetry; Scientific; Shopping; User Input
Shared Genre Category Sets
A set of genre categories is needed so that we can assign web genre labels to web documents.
Requirements for this shared category set:
- It should be precise, scalable, as unambiguous as possible, and reflect the genre-reality as it presents itself on the web.
- The majority of researchers in this field should agree upon the category set or sets.
We used a wiki to come up with an initial proposal of 78 web genre categories.
Our Proposal for a Shared Genre Category Set
1. About Page
2. Abstract
3. Agenda (Schedule, Calendar)
4. Announcement
5. Application
6. Bibliography
7. Biography
8. Chronicle
9. Code Listings
10. Column / Editorial / Lead Article
11. Comic
12. Contact Form
13. Contract / Disclaimer / Terms and Conditions
14. Corporate Blog
15. Curriculum Vitae / CV / Resume
16. Data / Statistics / Data Sheet
17. Diary, Blog
18. Dictionary
19. Directory of Persons or Organisations
20. Discussion Group / Newsgroup
21. Download
22. Drama / Play
23. Encyclopedia
24. Errata
25. Error Message / Empty Page / Under Construction Page
26. Essay
27. Exercises (Problems)
28. FAQ
29. Feature Story / News Reportage
30. Game (Quiz, Puzzle)
31. Glossary
32. Guestbook
33. Homepage / Front Page / Entry Page
34. Horoscope
35. Index
36. Instruction
37. Interview
38. Invitation
39. Job Listing
40. Joke
41. Law / Regulation / Rule / Proclamation
42. Letter / Mail / E-Mail
43. Letter to the Editor
44. Linkfarm
45. Link Collection / Hotlist
46. List of Products
47. List of Projects
48. Login Page
49. Media (Images, videos, music, sound)
50. Meeting minutes
51. News Article
52. News Collection / Newsletter / Digest
53. Obituary
54. Official Report
55. Ordering Form / Booking Form
56. Pamphlet
57. Petition
58. Promotional / Advertisement
59. Poem / Poetry / Lyrics
60. Pornographic
61. Prose Fiction
62. Quotation
63. Reportage
64. Research Report
65. Review (Testimonial)
66. Script (Manuscript)
67. Search Form
68. Sermon
69. Shop
70. Specification
71. Speech
72. Splash Page / Gateway / Welcome Page
73. Strategic Plans
74. Survey
75. Table of contents / Sitemap / Navigation
76. Thesis
77. Travel Guide
78. Tutorial
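Several categories in the proposal list alternative names separated by slashes. The sketch below shows one way these could be used to map a free-text label onto a category number; the `build_lookup` and `normalise` helpers are hypothetical, only a handful of the 78 categories are included, and treating every slash-separated name as an alias is an assumption.

```python
def build_lookup(categories):
    """Map every slash-separated name of a category to its number."""
    lookup = {}
    for number, names in categories.items():
        for name in names.split(" / "):
            lookup[name.strip().lower()] = number
    return lookup


# A small excerpt of the proposed category set, keyed by category number.
categories = {
    15: "Curriculum Vitae / CV / Resume",
    28: "FAQ",
    33: "Homepage / Front Page / Entry Page",
    51: "News Article",
    75: "Table of contents / Sitemap / Navigation",
}

lookup = build_lookup(categories)


def normalise(label):
    """Return the category number for a free-text label, or None if unknown."""
    return lookup.get(label.strip().lower())


for label in ["CV", "Entry page", "News article", "Portal"]:
    print(label, "->", normalise(label))
```

Labels that match none of the listed names, such as "Portal", stay unmapped and would need a decision by the annotators or an extension of the category set.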
Tagging HTML Documents with Genre Categories
1) tag HTML documents; the most common approach
2) tag websites
3) tag page segments
Reference Collection of Web Documents
We plan to build the reference corpus in two stages:
- First, we will apply our shared set of genre categories to existing collections as a proof of concept.
This is an initial step towards an objective evaluation and towards the compatibility of individual approaches.
- Second, we will use a crawler to gather more recent as well as more diverse sets of documents.
Reference Collection of Web Genres (Selection)
Web Corpus for English (Santini, 2007): editorial, biography, do-it-yourself guide, feature article (20 web pages each).
German corpus (Mehler et al., 2007, 2008): conference website (50 sites), personal academic homepage (68 sites), project website (52 sites), city website (180 sites).
Hierarchical Web Genre Collection (Stubbe and Ringlstetter, 2007): 32 genre classes, 40 HTML files per class, English.
Corpus of 400 blog posts, Italian (Tavosanis, 2007).
English (65,177 pages) and Russian (29,650 pages) corpora (Sharoff, 2007).
Corpus Management and Annotation Tools
Construction of the reference corpus requires tools that support
- compiling a document collection and
- annotating HTML documents.
We use the HyGraph toolbox:
- Supports researchers in the process of corpus compilation, annotation and analysis
- Annotate at various levels
- Assign confidence values
- Support for multiple tag sets and category systems
- Uses stand-off annotation
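Stand-off annotation means that labels are stored separately from the documents and only point at them. The sketch below illustrates the idea with a record that carries an annotation level, a tag set name, and a confidence value, mirroring the features listed above. It is not HyGraph's actual data model or file format; the `GenreAnnotation` class, its field names, the example URLs, and the tag set name are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple

LEVELS = {"document", "website", "segment"}


@dataclass
class GenreAnnotation:
    target: str        # URL of the annotated document or website
    level: str         # "document", "website" or "segment"
    tag_set: str       # name of the category system the label belongs to
    label: str         # genre label from that tag set
    confidence: float  # annotator's confidence between 0 and 1
    span: Optional[Tuple[int, int]] = None  # character offsets, segments only

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie between 0 and 1")
        if (self.level == "segment") != (self.span is not None):
            raise ValueError("a span is required for segments and only for segments")


# Hypothetical annotations; the HTML files themselves are never modified.
annotations = [
    GenreAnnotation("http://example.org/index.html", "document",
                    "shared-78", "Homepage / Front Page / Entry Page", 0.9),
    GenreAnnotation("http://example.org/index.html", "segment",
                    "shared-78", "News Article", 0.6, span=(0, 120)),
]

# The annotations live in their own file, next to the untouched documents.
serialised = json.dumps([asdict(a) for a in annotations], indent=2)
print(serialised)
```

Keeping the labels outside the HTML allows the same document to carry labels from several tag sets and at several levels without conflicting markup.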
Towards a Reference Corpus of Web Genres
[Diagram: the reference collection of web documents and the shared genre category set or sets together form the Reference Corpus of Web Genres.]
Summary and Future Work
We construct a reference corpus of web genres.
The goal is to provide a shared resource for researchers who work on web genre identification and on the evaluation of such systems.
Future work includes the further realisation of this resource:
- Apply a set of genre categories to existing corpora.
- Collect a large set of new documents that will be categorised based on annotation guidelines using HyGraph.
- Assign genre labels to single web documents first and to page segments as well as complete websites later.
Q/A
Thanks for your attention!
Please get in touch if you (plan to) work in the field of automatic web genre identification or a related area:
http://129.70.40.20/WebGenreWiki/
A mailing list will be available soon.