Automatic content enrichment of cultural data

Bas de Beer - 9045732
Supervisor: Maarten Marx
August 22, 2007
Bachelor Informatica, Universiteit van Amsterdam



Contents

1 Introduction

2 Research questions

3 Data retrieval and extraction
  3.1 The AUB data
  3.2 Enriched data
    3.2.1 Find a source that provides the data
    3.2.2 Extract the data
  3.3 Storing the enriched data

4 Dimensions of information
  4.1 Introduction
  4.2 The Spatial Dimension
    4.2.1 Introduction
    4.2.2 The AUB data for the spatial dimension
    4.2.3 Where is the venue?
    4.2.4 Where can I buy tickets?
    4.2.5 Where can I park my car?
    4.2.6 Where can I eat close by the venue?
    4.2.7 What is the route to the venue by car/public transport/bike from my home/work/?
  4.3 Temporal Dimension
    4.3.1 Introduction
    4.3.2 The AUB data for the temporal dimension
    4.3.3 The availability of temporal data
    4.3.4 When does the presale start
  4.4 The named entities dimension
    4.4.1 Introduction
    4.4.2 The AUB dataset for the named entity dimension
    4.4.3 Named entity homepage retrieval module
    4.4.4 The venue website
    4.4.5 Production website
    4.4.6 Group website
    4.4.7 Participants
    4.4.8 Conclusion of the evaluations
  4.5 The financial dimension
    4.5.1 Introduction
    4.5.2 The AUB data for the financial dimension
    4.5.3 How much does a ticket cost
    4.5.4 How much does parking cost
    4.5.5 How much does the public transport cost from my home/my work/my ... to the venue
  4.6 The reviews Dimension
    4.6.1 Introduction
  4.7 Displaying the enriched data

5 Conclusion

A Appendices
  A-1 Genres


List of Tables

4.1 Missing spatial data in the AUB database
4.2 Missing temporal data in the AUB database
4.3 Overview of manual answers to the temporal data questions
4.4 AUB data availability with regards to the named entity dimension
4.5 root-subroot-path-file probability [13]
4.6 Port scan of the location URLs in the AUB database
4.7 MRR and recall for venue website
4.8 Production test data
4.9 MRR and recall for production website
4.10 MRR and recall for group website
4.11 Missing financial data in the AUB database


List of Figures

3.1 Database schema
4.1 Parkings availability display
4.2 Restaurants display
4.3 Finding a named entity homepage
4.4 Google on "Bellevue" (left) and "Bellevue Amsterdam" (right)
4.5 Location url retrieval results on name and city
4.6 location url retrieval results on name
4.7 production homepage retrieval on production title
4.8 production homepage retrieval on production title and genre
4.9 Group homepage retrieval results
4.10 Ticket prices screenshot
4.11 Website screenshot


Abstract

The AUB is an organization that promotes culture and maintains a website that offers an overview of most of the cultural events in Amsterdam. For each event information is available, but it is often incomplete or missing. Completing or adding content by hand is a time-consuming task, so the AUB is looking for a way to do this automatically. In this thesis we explore some methods to do this. First, a categorization is made of the content we would like to find. We then determine sources where we could find this content. We also use the available data in the database and the search engines Google, Yahoo and Live! to find URLs that might contain the extra content. These results are filtered for duplicates and combined, and the URLs found are then checked for sanity and stored in a database. An effort is also made to extract pieces of information from the pages found. Furthermore, extra content is generated by extracting information from preset web pages. The information found is then made available on a web page.

Website url : http://85.144.192.165/cem.php


CHAPTER 1

Introduction

Imagine you want to see a theater play in another town. For this you need tickets, information about the start and end times, and either a place to park your car or public transport information if you find the parking fee too steep. You might want to know whether there is a break in the play and at what time the theater opens. Do they serve dinner, or are there any nice places close by? Reading something about the actors, the writer or what the play is about can also be of interest.

From a user's point of view it would be extremely handy if all this information were combined and displayed on one site. Combining information in this way is known as a mashup. Wikipedia defines a mashup as:

a website or application that combines content from more than one source into an integrated experience.

When you, for example, use Google Maps and a list of the best restaurants in town to create a map that displays the location of those restaurants, you have created a mashup. Many companies and websites also offer interfaces (APIs) to (part of) their content. Well-known examples are Yahoo, Amazon and eBay. Often these interfaces are free up to a limited number of API calls a day.

In our case we need more than APIs to those standard services. We need to combine information from websites such as aub.nl or marktplaats.nl to obtain information about tickets and from bereikbaaramsterdam.nl to learn about parking. Public transport information can be found at ov9292.nl, and traffic information is available at, for example, traphic.nl. For dinner information Iens.nl or dinnersite.nl are good sources, and to get the website of an actor you will need to Google them. Reviews are scattered over a number of websites. None of these websites offers an API, so we have to find other methods of obtaining this information.

In this thesis we research the automatic creation of a mashup for every event offered in the AUB database by combining information retrieval and information extraction methods. We first decide what information we want to include in our mashup and offer a categorization based on this information. We then try to find web pages that we want to include or that might contain the information we are after. Where possible we extract this information and add it to the mashup. The results are made available online on a web page that allows you to search for events and see their enriched data.


CHAPTER 2

Research questions

The NUB ("Nederlands Uit Bureau")¹ is an organization that promotes culture and offers an overview of cultural events in a number of cities and regions in the Netherlands. The AUB ("Amsterdams Uit Buro")² is the part of the NUB that focuses on the cultural events taking place in Amsterdam. They host a database with information about events and venues. All changes and additions are made by hand and by different people. The database is therefore prone to errors. Furthermore, the data that can be offered is currently limited to the attributes available in the database³.

The AUB has asked us to research the possibility of automating the addition of more content to the events on their website. The goal is twofold: first, they want to offer their clients better and more information, and second, they would like to get some information about the 'buzz' an event is creating.

In this thesis we will use the data that is available in the database of the AUB and try to combine this with data available on the internet. So the main question is:

How can we enrich the events in the AUB database?

This is a broad question, and to answer it we have to identify the smaller problems it consists of. To know what we need, we first have to see what we have. We therefore focus first on the AUB database, as this is our prime source: we want to know what data is available in it and how reliable that data is. So the first subquestion is:

Question 1: What is the quality of the AUB data

We will handle this question in sections 4.2.2, 4.3.2, 4.4.2, 4.5.2 and 4.6.1.

Knowing what we have, we can think of what we want. So we should ask ourselves what information we should look for.

Question 2: What extra data would we like to add

We will categorize this wanted information into so-called dimensions. For each of these dimensions we then ask ourselves questions that we think are useful in the context of a cultural event. To avoid confusion with the research questions, we will call them "dimensional questions". These dimensional questions can be found in sections 4.2.1, 4.3.1, 4.4.1, 4.5.1 and 4.6.1.

¹ http://www.nub.nl
² http://www.aub.nl
³ The same field is already sometimes used for different types of information for different events. For example, the field prijs informatie can contain prices or details about opening hours.

To answer the dimensional questions we have to find a source. When we know a good source we don't have to look further. We know, for example, that ov9292.nl is a good source for public transport information. But often we don't know this in advance, for example when we are looking for information about an actor who participates in a play. We then have to think about where and how to get this information. This is the third question.

Question 3: Where and how to find this extra data

This is addressed in sections 4.2 to 4.6. In the same sections we will evaluate the quality of the added data.

The information found has to be stored in such a way that extra data can easily be added. So in section 3.3 we ask ourselves how we should store the extra data.

Question 4: How can we store this extra data

Finally we want to present the data to a user. The amount of data can become quite large, so we should think about a way to present the data to this user in a proper way.

Question 5: How to present this extra data in a useful way to the user

This last question is discussed in section 4.7.


CHAPTER 3

Data retrieval and extraction

3.1 The AUB data

In this section we first discuss where and how we get the cultural data we want to enrich. This data is stored in the AUB database, to which the AUB has kindly given us access. This database revolves around three main objects: events, productions and locations. The most basic object is the event. Each event has an event ID, a production ID and a location ID. Each event belongs to one production, and a production can have one or more events; this is thus a one-to-many relation. The same is true for the location-event relation. The main differences between events belonging to the same production are the time and/or date and the location (or venue). A theater play can, for example, play in the Stadsschouwburg Amstelveen on Wednesday and Thursday and in the Stadsschouwburg Amsterdam on Saturday and Sunday, with an extra matinee on Sunday. In this example we have 5 events, 4 dates, 2 venues and 1 production.
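The theater example can be written down as a small data model. The sketch below is illustrative only: the class, the field names, the IDs and the concrete dates are assumptions chosen for the example and do not reflect the actual AUB schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """One performance: a production at a location on a given date and time."""
    event_id: int
    production_id: int
    location_id: int
    day: date
    start: str  # start time as "HH:MM" (hypothetical representation)


# Hypothetical IDs: production 1; location 10 = Stadsschouwburg Amstelveen,
# location 20 = Stadsschouwburg Amsterdam.
events = [
    Event(1, 1, 10, date(2007, 9, 5), "20:00"),  # Wednesday
    Event(2, 1, 10, date(2007, 9, 6), "20:00"),  # Thursday
    Event(3, 1, 20, date(2007, 9, 8), "20:00"),  # Saturday
    Event(4, 1, 20, date(2007, 9, 9), "14:00"),  # Sunday matinee
    Event(5, 1, 20, date(2007, 9, 9), "20:00"),  # Sunday evening
]

n_events = len(events)
n_dates = len({e.day for e in events})
n_venues = len({e.location_id for e in events})
n_productions = len({e.production_id for e in events})
print(n_events, n_dates, n_venues, n_productions)  # 5 4 2 1
```

Counting the distinct values of each ID reproduces the numbers in the example: the one-to-many relations mean that the production ID and location ID repeat across events.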

Each object has attributes. At the event level the main attributes are the date and the time, at the production level the name of the production and its genre (see appendix A-1), and at the location level the name of the venue and its address details. We will discuss the attributes further in sections 4.2.2, 4.3.2, 4.4.2 and 4.5.2.

We used the data of August and September 2007. In this period the database held 959 venues, 1070 productions and 2984 events. This data can be queried by sending an HTTP request to the server of the AUB. In this request a start and end date can be added to specify a period, or the data of a single object can be requested instead of all data. The server of the AUB returns this data as an XML file.

This XML file can be parsed using an off-the-shelf XML parser. As we used Python as our scripting language, we used libxml2dom for Python¹. For each result (be it a location, production, event or the three combined) we can then extract the data we need, depending on what we want to find, using XPath expressions² and our knowledge of the XML file structure³. This is the cultural data we have and want to enrich.

¹ http://www.boddie.org.uk/python/libxml2dom.html
² XPath is a language designed to navigate through XML documents. See: http://www.w3.org/TR/xpath
³ The tag names of the XML file differ depending on whether you choose the complete dataset or only a special group such as locations. For example, the tag that holds the url of the location changes names from "locatie url" in the locations set to "loc url" in the complete dataset.
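The parsing step described in this section could look roughly as follows. This is a minimal sketch, not the thesis code: it uses the standard library module xml.etree.ElementTree (which supports a subset of XPath) instead of libxml2dom, and the sample document, its nesting and all tag names are assumptions. In particular, loc_url is only a guess at the tag the text refers to as "loc url".

```python
import xml.etree.ElementTree as ET

# Hypothetical response; the real AUB tag names and nesting may differ.
SAMPLE = """<events>
  <event id="1">
    <title>Example production</title>
    <date>2007-09-05</date>
    <location id="10">
      <name>Stadsschouwburg Amsterdam</name>
      <loc_url>http://venue.example/</loc_url>
    </location>
  </event>
  <event id="2">
    <title>Example production</title>
    <date>2007-09-06</date>
    <location id="11">
      <name>Almeerderstrand</name>
    </location>
  </event>
</events>"""


def extract_events(xml_text):
    """Return one dict per <event>, with None for fields that are absent."""
    root = ET.fromstring(xml_text)
    rows = []
    for event in root.findall("./event"):
        rows.append({
            "event_id": event.get("id"),
            "title": event.findtext("title"),
            "date": event.findtext("date"),
            "venue": event.findtext("location/name"),
            "venue_url": event.findtext("location/loc_url"),
        })
    return rows


rows = extract_events(SAMPLE)
for row in rows:
    print(row["event_id"], row["venue"], row["venue_url"])
```

Returning None for a missing element makes the gaps in the data explicit, which is useful later when deciding which attributes still need to be enriched.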


3.2 Enriched data

In the previous section we extracted the attributes of the objects (events, locations or productions) that we want to enrich. We can now use the values of these attributes to find the enriched data. This is done in three basic steps.

• Find a source that provides this data.

• Extract the data

• Store the enriched data

We will discuss these steps in further detail in the following sections.

3.2.1 Find a source that provides the data

The internet is a huge source of information, but it is in a form intended for human reading, not in a database form with records and fields that can easily be queried for the required information. So locating web pages that contain the correct information is an important part of finding this information. Such a page can be the homepage of a person or location (section 4.4 focuses on finding homepages) or a website dedicated to a special service, such as a website that lists restaurants. Some thought and research has therefore been put into finding such websites. Using such a single source gives the opportunity to use the underlying formatting of its web pages to extract the information.

When such a source is not readily available, we have to use a method to retrieve a set of documents that might answer our question. The most logical method is to use search engines to provide such a list. Search engines serve as the index for the information available on the internet.
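When several search engines are queried (the abstract mentions Google, Yahoo and Live!), their result lists have to be combined and filtered for duplicates. The sketch below shows one possible way to do this; the interleaving strategy, the normalization rules and the function names are assumptions for illustration, since this section does not specify how the results are merged.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to a comparison key: lower-case host without 'www.', path without trailing slash."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge_results(*ranked_lists):
    """Interleave ranked lists rank by rank, keeping the first occurrence of each URL."""
    seen = set()
    merged = []
    longest = max((len(lst) for lst in ranked_lists), default=0)
    for rank in range(longest):
        for lst in ranked_lists:
            if rank < len(lst):
                key = normalize(lst[rank])
                if key not in seen:
                    seen.add(key)
                    merged.append(lst[rank])
    return merged


# Hypothetical result lists from two engines.
engine_a = ["http://www.bellevue.example/", "http://a.example/x"]
engine_b = ["http://bellevue.example", "http://b.example/"]
merged = merge_results(engine_a, engine_b)
print(merged)
```

Interleaving by rank keeps pages that every engine ranks highly near the top of the merged list, while the normalization catches the common case of the same homepage being returned with and without "www." or a trailing slash.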

3.2.2 Extract the data

Defining a (list of) web page(s) as a source is only a part of the problem. Extracting the data from a page is the next part. There is a clear distinction between extracting information from a web page whose structure you know and extracting it from a list of web pages returned by a search engine.

Single source data

Having a known source for your data allows you to use the formatting of this source to extract the data⁴. This is normally done by writing a set of rules that make use of this formatting, or by using wrappers (Muslea [6]).

We use a technique that parses the HTML into a DOM tree⁵, based on Gupta et al. [3]. To create a DOM tree we use the off-the-shelf parser for Python⁶ called Beautiful Soup⁷. The HTML file is read as a normal text file using the standard file reading features of Python, and this parser then makes a DOM tree of this text file. The advantage of a DOM tree is that it makes finding and selecting nodes (tags or text) easy. It also allows us to move to the parent or child of the node found, and it has methods for removing parts of the DOM tree such as header and JavaScript sections. Those normally don't contain useful information in our case.

⁴ We only concern ourselves with HTML pages for the moment. In reality, results returned by search engines can also be in other formats such as PDF.
⁵ The DOM (Document Object Model) tree is the tree structure of HTML tags. It makes a node of every tag and piece of text and preserves the nested structure of HTML. So text will be a child of a tag in a proper HTML page.

We created a module using the DOM tree structure together with regular expressions that we reuse on most of the pages we want to extract data from. It needs some adjusting for every new page, plus the input of keywords representing what we want to find. These adjustments can be done in about 30 minutes per page once you get the hang of it. The keywords used are the data we extracted in section 3.1.

The main advantage of this method is that the reliability of the extracted data is normally high. The disadvantage is, of course, the manual adjustment needed for each new page you want to include.
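The idea of keyword-driven extraction from a known page can be sketched as follows. To keep the example self-contained it uses the standard library html.parser instead of Beautiful Soup, so it is not the module described above; the class and function names, the sample page and the rule "take the text node that follows the keyword" are all assumptions.

```python
from html.parser import HTMLParser

SKIP = {"head", "script", "style"}            # sections that rarely hold useful content
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags that never have a closing tag


class TextNodes(HTMLParser):
    """Collect (enclosing tag, text) pairs, ignoring head, script and style sections."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.nodes = []   # (parent tag, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags in sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP.intersection(self.stack):
            parent = self.stack[-1] if self.stack else None
            self.nodes.append((parent, text))


def text_after(html, keyword):
    """Return the text node that follows the first node containing `keyword`, or None."""
    parser = TextNodes()
    parser.feed(html)
    texts = [text for _, text in parser.nodes]
    for i, text in enumerate(texts[:-1]):
        if keyword.lower() in text.lower():
            return texts[i + 1]
    return None


PAGE = ("<html><head><title>Venue</title><script>var x = 'Address';</script></head>"
        "<body><table><tr><td>Address</td><td>Leidsekade 90</td></tr></table></body></html>")
print(text_after(PAGE, "address"))  # Leidsekade 90
```

The per-page adjustment mentioned in the text corresponds here to choosing the keyword and the rule that relates the keyword to the wanted value; for another page the value might sit in a sibling row or a nested tag instead.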

Link propagation

A special case of this extraction method is link propagation. Here we actively search for a link that we can follow to a next page. This is useful because many pages first present a list of results, each of which links to another page with additional information about that result.

Extracting the link can easily be done by making use of the fact that a link always has the format 'href="link"'. When we have extracted the link, we can then extract data from the page it links to using the method in the previous section, or extract another link.

A small module to do this has been written that also can be reused with minor adjustments.
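A minimal sketch of link propagation is given below. It is not the module mentioned above: the class and function names are hypothetical, the example page and base URL are made up, and selecting the link by its anchor text is an assumption about how the right link is chosen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs for every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []     # pieces of anchor text of that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)  # resolve relative links
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_link(html, base_url, keyword):
    """Return the first link whose anchor text contains `keyword`, or None."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    for url, text in collector.links:
        if keyword.lower() in text.lower():
            return url
    return None


RESULTS = ('<ul><li><a href="/detail?id=7">Example production</a></li>'
           '<li><a href="other.html">Another production</a></li></ul>')
next_page = find_link(RESULTS, "http://tickets.example/search", "example production")
print(next_page)  # http://tickets.example/detail?id=7
```

The returned URL can then be fetched and passed either to a page-specific extractor or to find_link again, which gives the chain of pages described above.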

Multiple source data

When you have multiple sources, however, such as a list of documents returned by a search engine, the method of the previous two sections becomes unfeasible due to the unknown structure of the result pages.

For simple extractions that have enough regularity, a set of rules and some regular expressions is a good method. When you want to find the price of an object, for example, a currency symbol is very often attached. You can then use the DOM tree structure of the previous sections and regular expressions to look for this currency format.
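As an illustration of such a rule, the following sketch looks for euro amounts in a piece of text. The pattern is an assumption about common Dutch price notations (for example "€ 12,50", "EUR 10.00", "€5" and "€ 17,-"); it is not the expression used in the thesis and does not cover every format.

```python
import re

# Currency marker, optional whitespace, whole euros, then optionally a
# decimal separator followed by two digits or a dash ("17,-").
PRICE = re.compile(r"(?:€|EUR)\s*(\d{1,4})(?:[.,](\d{2}|-))?")


def find_prices(text):
    """Return all amounts found in `text` as floats, in order of appearance."""
    prices = []
    for whole, cents in PRICE.findall(text):
        fraction = int(cents) if cents.isdigit() else 0
        prices.append(int(whole) + fraction / 100)
    return prices


sample = "Toegang: € 12,50 (CJP EUR 10.00), kinderen €5 of € 17,-"
print(find_prices(sample))  # [12.5, 10.0, 5.0, 17.0]
```

Applied to the text nodes of a DOM tree rather than to the raw HTML, such a pattern avoids matching amounts that only occur inside scripts or attributes.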

When the data you want to extract becomes more complicated, other methods have to be used. Several models have been introduced in recent years that use machine learning approaches to extract the data. They require sets of hand-extracted examples to create their own appropriate set of rules to extract a certain type of data. [11] and [10], for example, use an approach with POS tagging in combination with question answering techniques. POS tagging is the process of marking up the words in a text as corresponding to a particular part of speech, based on both their definition and their context. For a more comprehensive overview of extraction methods see McCallum [5].

⁶ http://www.python.org/
⁷ www.crummy.com/software/BeautifulSoup/


3.3 Storing the enriched data

The enriched data has to be stored and linked to the correct object. We use a MySQL database to store the retrieved data. We want this database to be flexible, as we may want to extend it in the future when we have found extra information we would like to display.

The objects of the AUB database (event, production or location) that we found in section 3.1 also form the main tables for the enriched data. Every piece of data we find enriches one of these objects. This way we can add the event id, production id or location id as a foreign key to a table to connect it to the other enriched data as well as to the original data. When we, for example, find the website of a location, it is stored with the location id as a foreign key.

On the other hand, some data stored in the original AUB database should actually have its own id. For example, the name of a theater group is stored as free text in the original database, which makes it less suitable to serve as a key. We now link the website of such a theater group to the production id. But when a theater group is involved in more than one production, this leads to data redundancy, because we store the website once for each production.
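The storage idea and the redundancy it causes can be demonstrated with a small example. The sketch uses an in-memory SQLite database from the Python standard library instead of MySQL so that it runs without a server; the table and column names and the sample rows are assumptions and do not reproduce the schema of figure 3.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to
conn.executescript("""
CREATE TABLE location   (location_id   INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE production (production_id INTEGER PRIMARY KEY, title TEXT, group_name TEXT);
CREATE TABLE location_website (
    location_id INTEGER NOT NULL REFERENCES location(location_id),
    url         TEXT NOT NULL
);
CREATE TABLE group_website (
    production_id INTEGER NOT NULL REFERENCES production(production_id),
    url           TEXT NOT NULL
);
""")

# Enriched data for a location is stored with the location id as foreign key.
conn.execute("INSERT INTO location VALUES (10, 'Stadsschouwburg Amsterdam')")
conn.execute("INSERT INTO location_website VALUES (10, 'http://venue.example/')")

# Two productions by the same (free-text) group: the group website is stored twice.
conn.executemany("INSERT INTO production VALUES (?, ?, ?)",
                 [(1, "Play A", "Group X"), (2, "Play B", "Group X")])
conn.executemany("INSERT INTO group_website VALUES (?, ?)",
                 [(1, "http://groupx.example/"), (2, "http://groupx.example/")])

rows, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT url) FROM group_website").fetchone()
print(rows, distinct)  # 2 1
```

Two stored rows for one distinct URL is exactly the redundancy described above; giving theater groups their own id and table would remove it, at the cost of first having to match the free-text group names.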

Apart from this directly linked data, we also have to store some information that is linked to an object only at the moment it is displayed. For example, parkings (see section 4.2.5) have their own table that has no foreign key, only its own primary key. This results in the database schema in figure 3.1. To keep the figure clear we have left out all the attributes except the primary keys. Tables connected to a

Figure 3.1: Database schema


CHAPTER 4

Dimensions of information

4.1 Introduction

The ultimate goal is to present the enriched information to a person. But what is this enriched information? We would like to make this rather vague term a little clearer by bringing some structure to it. For this purpose the notion of a dimension is introduced. Each dimension stands for a separate area of information. Within these dimensions we can then enrich the information.

To find our dimensions we take the viewpoint of a person who quickly wants to access information that is linked to an event. We can picture this to be an employee of the AUB who serves clients on the phone or at the desk, for example. If a person wants to know something he will start to ask questions. To find the dimensions we can do the same. It turns out that a lot of these questions start with the same words, so we can turn to the theory of question answering to help us with categorizing (Radev et al. [10]). The following types of questions are normally distinguished: WHERE, WHEN, WHAT, HOW, WHO, WHY. Combining this with some empirical testing with events, we found that we can distinguish 5 dimensions for cultural events. They are listed below with the main question type in brackets.

1. Spatial (WHERE)

2. Temporal (WHEN)

3. Named entities (WHO/WHAT)

4. Financial (HOW MUCH)

5. Reviews (WHAT/WHY)
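The relation between question words and dimensions can be expressed as a simple lookup. The sketch below is only an illustration of the categorization idea, with hypothetical names; it deliberately leaves WHAT and WHY questions unassigned, because the list above maps WHAT to more than one dimension.

```python
# (question prefix, dimension) pairs taken from the list above.
DIMENSIONS = [
    ("how much", "Financial"),
    ("where", "Spatial"),
    ("when", "Temporal"),
    ("who", "Named entities"),
]


def dimension_of(question):
    """Return the dimension suggested by the opening words of a question, or None if unclear."""
    q = question.lower().strip()
    for prefix, dimension in DIMENSIONS:
        if q.startswith(prefix):
            return dimension
    return None


print(dimension_of("Where can I park my car?"))        # Spatial
print(dimension_of("How much does a ticket cost"))     # Financial
print(dimension_of("What is the route to the venue"))  # None
```

The last example shows why the question word is only a first hint: the route question is treated as part of the spatial dimension in section 4.2.1 although it starts with WHAT.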

In the next 5 sections we will ask ourselves some dimensional questions for each dimension. These questions can, in our opinion, help a user to make a better decision about where, when, why or how to go to an event. First we will examine the AUB database for answers. If this data doesn't suffice, we will describe where else an answer to these dimensional questions might be found, and where we think this is feasible we will try to obtain it. Each answer and, where necessary, its retrieval method will then be evaluated.


4.2 The Spatial Dimension

4.2.1 Introduction

In the spatial dimension we will look at the spatial aspects of an event. These are typically (but not exclusively) questions that start with "WHERE". For each event we can ask ourselves the following questions:

1. Where is the venue?

2. Where can I buy tickets?

3. Where can I park my car?

4. Where can I eat close by the venue?

5. What is the route to the venue by car/public transport/bike from my home/work/?

Questions 1, 3 and 5 are of interest for reaching the venue as easily as possible. The second question should list all places where you can buy tickets. The fourth question addresses a common problem when going to an unfamiliar town to see an event.

In the next section we will first look at the AUB database.

4.2.2 The AUB data for the spatial dimension

The AUB database currently stores 959 locations in Amsterdam. Each location may have the following attributes that hold data we can use to find additional information in the spatial dimension:

• location name

• street

• house number

• postal code

• city

• ov info

• price info (contains information about where to buy tickets)

• longitude

• latitude

Unfortunately, quite some data is missing in this part of the database. An overview is given in table 4.1.

The GPS data is a recent addition to the database; it was obtained from the street, house number or postal code using the Google geocoding service. These GPS coordinates have also been evaluated with regard to their correctness.¹ It showed that 8.75% was not correct, mostly due to wrong addresses. The data in the database is the input for finding the extra content we want, so errors in this data will propagate to that extra content.

field                            nr of records   absolute missing   % missing
location name                              959                  0        0.0%
street                                     959                  7        0.7%
house number                               959                 92        9.5%
postal code                                959                  0        0.0%
city                                       959                  0        0.0%
public transport info                      959                431         45%
price info                                2984                495       16.6%
GPS coordinates (lat and lon)              959                 87        8.6%

Table 4.1: Missing spatial data in the AUB database

Table 4.1 also shows that 7 street names and 92 house numbers are missing. We checked all venues without a house number and found two main reasons. First, a lot of venues are city squares and parks that don't have a house number. Second, on 30 occasions the house number has been added to the street name in the street name field. In only 5 cases is there a normal street without a house number. The 7 missing streets refer to locations that don't have an address, such as the beach "Almeerderstrand".

4.2.3 Where is the venue?

A good way to present the position of a venue is to show it on a map. This can be done using an address, but also using GPS coordinates. These coordinates also allow for easy distance calculations to other places of interest nearby. So we would like to assign GPS coordinates to all locations. This means we have to find them for the locations that have not been assigned any yet (see table 4.1).

To do this we used a simple PHP script that obtains the details of all the venues that have not been assigned GPS coordinates yet and subsequently sends each address via an HTTP request to the Google geocoding service². The resulting XML file is then easily parsed to obtain the longitude and latitude, which we subsequently store in the database.

All 87 missing GPS coordinates have been found this way. We still expect that the same 8.75% error rate mentioned in the previous section will also affect this new data. This can only be solved by obtaining the correct addresses, and this can, in our opinion, only be done properly by correcting them by hand.
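The two steps of the geocoding script, building a request for an address and reading the coordinates from the XML response, can be sketched as follows. The sketch is written in Python rather than PHP, performs no network access, and rests on several assumptions: the endpoint is a placeholder, the parameter names are hypothetical, and the response is assumed to contain a coordinates element of the form "longitude,latitude,altitude". None of this is taken from the thesis or guaranteed to match the real service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

GEOCODER_URL = "http://geocoder.example/geo"  # placeholder, not the real service address


def build_request(street, house_number, city):
    """Build the request URL for one address (parameter names are hypothetical)."""
    address = " ".join(part for part in (street, house_number, city) if part)
    return GEOCODER_URL + "?" + urlencode({"q": address, "output": "xml"})


def parse_coordinates(xml_text):
    """Return (latitude, longitude) from a response, or None when no coordinates are present.

    Assumes a <coordinates> element holding "longitude,latitude[,altitude]".
    """
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Compare on the local name so the sketch works with or without XML namespaces.
        if element.tag.rsplit("}", 1)[-1] == "coordinates" and element.text:
            lon, lat = element.text.split(",")[:2]
            return float(lat), float(lon)
    return None


# Illustrative address and response; the coordinate values are made up for the example.
url = build_request("Leidsekade", "90", "Amsterdam")
response = "<kml><Point><coordinates>4.8812,52.3637,0</coordinates></Point></kml>"
print(url)
print(parse_coordinates(response))  # (52.3637, 4.8812)
```

Skipping empty address parts matters for the venues discussed in section 4.2.2, where the house number is missing or has been merged into the street name.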

We then show the address and the location on the website using Google Maps³.

¹ This evaluation was done by a group of UvA students at the start of 2007. Their evaluation, however, is not available to the public.
² http://www.google.com/apis/maps/documentation/#GeocodingExamples
³ http://maps.google.nl

4.2.4 Where can I buy tickets?

There are a lot of places where you might get tickets:

1. The AUB ticketshop

2. VVV offices

3. the venue itself

4. record shops (pop concerts only)

5. Ticketservice selling points (pop concerts, musicals, ballet and opera)

6. online ticketshops

7. auction websites (marktplaats, ebay)

But how do we know which of these options sell tickets for an event?

The first two options are listed in the AUB database. Not all events require tickets or are sold through these channels, which explains the 16.6% missing price info in table 4.1. When they are found in the AUB database, we can add the AUB and VVV offices to the list of selling points for the event.

The venue itself will always be included in this list.

The fourth option is harder to verify. There are only a small number of shops that offer this service, and their websites give no information about the events they sell tickets for. They are, however, connected to the same system as option 5, so in a way they are simply an extension of the Ticketservice and offer the same tickets. These can be found at the Ticketservice⁴ website.

So we wrote a module that checks this website by sending an HTTP request to the search page of the Ticketservice. When the event is found, the result page lists its name, which can be verified by parsing the page (see section 3.2.2) and checking whether the event is listed on it. If so, we can add all Ticketservice selling points to our event as possible ticket sellers. We collected those by parsing the addresses page of Ticketservice and collecting all the addresses in Amsterdam. We then adjusted the script we used in section 4.2.3 to return the GPS coordinates of each selling point. We will extend this module in section 4.5.3 when we look for the prices of the tickets.
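The check whether an event is listed on the result page comes down to comparing the event title from the AUB database with the titles parsed from the page. The following sketch shows one way to make that comparison tolerant of differences in case, accents and punctuation. The function names and sample titles are hypothetical, and the substring match is an assumption; it can give false positives for very short titles.

```python
import re
import unicodedata


def normalize_title(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def event_listed(result_titles, event_title):
    """True when `event_title` matches one of the titles parsed from the result page."""
    wanted = normalize_title(event_title)
    return any(wanted and wanted in normalize_title(title) for title in result_titles)


# Hypothetical titles as they might be parsed from a search result page.
titles = ["Café Théâtre - Live in Concert", "Example production"]
print(event_listed(titles, "cafe theatre"))   # True
print(event_listed(titles, "Other show"))     # False
```

When event_listed returns True, the list of selling points collected from the addresses page can be attached to the event.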

We can then show all these places using the Google Maps API (footnote 5). The user is asked for input so that we can show the selling points close to where the user lives, works, etc. A nice addition would be to also collect the opening hours. Those are not available on the Ticketservice website and should be collected somewhere else, for example at the websites of these offices.

The sixth and seventh option are virtual selling points and thus we cannot show them on a map. We will discuss these further in section 4.5.3 when we focus on finding the prices of the tickets.

4.2.5 Where can i park my car?

To find nearby parking, a list has been added to the database with the GPS coordinates of the parkings in Amsterdam. The details of these parkings are listed on the site bereikbaaramsterdam.nl (footnote 6). The GPS coordinates of the parkings have been found using the Google geocoding service (footnote 7) and the addresses of the parkings. For each venue the nearest parkings can then be found using a Euclidean distance algorithm. As GPS coordinates are a 2-dimensional system (latitude and longitude), the distance between two points $(x_1, x_2)$ and $(y_1, y_2)$ is:

$$D = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

Footnote 4: http://www.ticketservice.nl
Footnote 5: http://www.google.com/apis/maps/documentation/
Footnote 6: http://www.bereikbaaramsterdam.nl
Footnote 7: http://www.google.com/apis/maps/documentation/#GeocodingExamples

The calculations are done online for each distinct venue where the event will take place (most of the time there is only one). The parkings are then displayed using Google Maps with the venue at the center.
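The nearest-parking lookup can be sketched as below. As in the text, latitude and longitude are treated as planar coordinates, which is an approximation; the function names and the coordinates in the usage note are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two (latitude, longitude) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest(venue, places, k=5):
    """Return the k places closest to the venue.

    places is a list of (name, (latitude, longitude)) tuples.
    """
    return sorted(places, key=lambda place: euclidean(venue, place[1]))[:k]
```

For example, `nearest((52.0, 4.0), parkings, k=2)` returns the two entries of `parkings` whose coordinates lie closest to the given point.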

Clicking a parking symbol on this Google map shows, for each hour in the next week, the predicted chance that you can actually park there. This prediction is part of the system that tries to regulate the traffic in the city of Amsterdam. Currently 13 parkings are included in this service and their available space can be found online at the following page:

http://www.bereikbaaramsterdam.nl/live/main.asp?name=pagina&item_id={PARKING_ID}.

Where the {PARKING_ID} is the id of the parking as assigned by bereikbaaramsterdam.nl.

Again we extracted this data with the procedure in section 3.2.1. We only store the next 6 hours. Anything beyond this period mostly predicts an excellent chance, so it has no added value. Every 10 minutes a script run from the crontab updates the predictions. Figure 4.1 shows a screenshot of this information.

Figure 4.1: Parkings availability display

4.2.6 Where can i eat close by the venue?

For the nearby restaurants we would like to have the GPS coordinates of the restaurants. In order to obtain those we wrote a script that parses all the pages of www.dinnersite.nl that refer to Amsterdam. This can be done by sending the following HTTP request:

http://www.dinnersite.nl/zoek.php?plaats=amsterdam&p={PAGE_ID}

where {PAGE_ID} is a page number that starts at 1 and runs as far as there are pages with restaurants. The script stops as soon as the pages become empty.
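The paging loop can be sketched as follows. The download and parsing step is passed in as a function, because its details depend on the HTML of the site; `collect_pages` and `fetch_page` are hypothetical names, and only the URL pattern comes from the text above.

```python
URL_TEMPLATE = "http://www.dinnersite.nl/zoek.php?plaats=amsterdam&p={page_id}"


def collect_pages(fetch_page, first_page=1, max_pages=1000):
    """Collect the items of consecutive result pages until a page is empty.

    fetch_page(url) is assumed to download and parse one result page and
    to return the list of restaurants found on it (empty when none).
    max_pages is a safety limit so the loop always terminates.
    """
    results = []
    for page_id in range(first_page, first_page + max_pages):
        items = fetch_page(URL_TEMPLATE.format(page_id=page_id))
        if not items:
            break
        results.extend(items)
    return results
```

Injecting `fetch_page` keeps the stop condition separate from the parsing, so the loop can be exercised without network access.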

On these pages are the GPS coordinates of 1422 restaurants in Amsterdam, together with their addresses and names. These are stored in a database table together with the id the dinner site has given those restaurants to allow easy linking.

We can now easily display the found restaurants using Google Maps. Again the nearest places are calculated with a Euclidean distance algorithm. In order to keep the map from becoming cluttered we only show the 15 closest restaurants. Figure 4.2 shows a screenshot of the display of the restaurants on the website.

Figure 4.2: Restaurants display

4.2.7 What is the route to the venue by car/public transport/bike from my home/work/?

Several route planners are available on the internet. Most of them are not easy to integrate in a webpage: they require that you visit their own webpage. An exception is the one provided by Google, which can be integrated in your site using the Google API. Again we ask for the postal code of the location of the user. This will be the departure point. The arrival point is the location where the event takes place. A route is then calculated.

In Amsterdam the route by car is often different from the route by bike, due to the one-way roads in the center and the city parks that are only accessible by bike. An excellent route planner for the bike is offered by Routecraft (footnote 8). Alas, this planner has been made with Macromedia Flash and no API is offered. This makes it almost impossible to integrate it in a webpage.

For public transport routes, ov9292.nl offers an excellent route planner that also plans routes for cars. Integrating this route planner is not free, however. The price of the lease depends on the number of calls made to this service.

4.3 Temporal Dimension

4.3.1 Introduction

In the temporal dimension we look at the time aspects of an event. These are typically questions that begin with 'WHEN'. A list of questions is given below.

1. When does the event start?

2. When does the event end?

3. When do the doors of the venue open to the public?

4. When is the latest time you can pick up your reservations?

5. When does the presale start?

6. When is the ticket box open?

7. When does the last public transport leave?

8. When is the break?

Footnote 8: http://www.routecraft.com/fietsplanner

In the following paragraphs we will discuss what the AUB database has to offer in terms of the temporal dimension and we will discuss where and how we might get the answers.

4.3.2 The AUB data for the temporal dimension

The AUB database has the following attributes that can hold temporal information.

• start time

• start time tent

• date event

• start date (used for expositions only)

• end date (used for expositions only)

• part of day (1,2,3 for morning, afternoon, evening/night)

• opening of location (holds various information about opening hours and/or holiday closings)

• various information (sometimes contains reservation info)

• price information (sometimes contains opening times of reservation and ticket box offices)

Unfortunately this part of the database has few attributes that can contain direct answers to our questions, and those that can often have missing data. An overview of these attributes and their missing data is shown in table 4.2.

field                   nr events   absolute missing   % missing
start time              2398        247                10.3%
start time tent         2398        2398               100.0%
date                    2398        118                4.9%
start date exposition   118         0                  0.0%
end date exposition     118         0                  0.0%
part of day             2398        0                  0.0%
opening location        2398        880                36.7%

Table 4.2: Missing temporal data in the AUB database

This data would suggest that almost 5% of all events have no date. But we checked and found that all these events are expositions. Expositions use the special attributes start date and end date. Table 4.2 shows that for all 118 expositions these attributes have been filled out. Combining this with the rest of the activities we can say that all activities have a date. All the other fields, however, are frequently left blank, including the start time.


4.3.3 The availability of temporal data

We do not have one single source that can provide us with the data we want. So we have to find separate sources for each event and each dimensional question we want to answer. We started with a small survey to see what is available and whether we can automate the answering of the questions. We tried to find the answers to the dimensional questions above for a sample of 25 events by hand. The results are summed up in table 4.3. Even with considerable effort we could only trace a very small number of answers. When we did find an answer this was in all but 1 of the cases (a presale date) at the website of the venue where the event is taking place, so it seems that getting this website is important (see section 4.4.4 for the retrieval of the website of the venue).

end time   doors open   pick up reservation   presale   opening ticketbox   break
0          1            2                     4         6                   0

Table 4.3: Overview of manual answers to the temporal data questions

When we look at table 4.3 only the opening times of the ticketbox and the presale dates seem available. The AUB already lists the ticketbox hours in more than 60% of the cases. We found an answer in only 25%. This does not seem to be worth the effort. The only question we might be able to answer is that of the presale date.

4.3.4 When does the presale start

Having no source, we first have to retrieve documents that might contain the answer. For this we use Google. The query we send to Google contains the production title, the venue name and the word "voorverkoop" (presale in Dutch). As we look for a specific format (a date), we use a regular expression that can find dates of several forms (footnote 9). We also check again for the appearance of the production title. If we find multiple instances on 1 Google result page we compare those. If they match it strengthens our claim. If not they are discarded. We also make use of the fact that the presale date is normally before the actual event date, so we discard all dates equal to or later than the event date. When available the presale date is shown in the temporal dimension part of the website.
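The date extraction and filtering can be sketched as below. The regular expression is not the one used in this project: the month table (full Dutch names and common abbreviations), the accepted separators and the function names are assumptions made for this sketch.

```python
import re
from datetime import date

# Assumed month table: full Dutch names and common abbreviations.
MONTHS = {
    "jan": 1, "januari": 1, "feb": 2, "februari": 2, "mrt": 3, "maart": 3,
    "apr": 4, "april": 4, "mei": 5, "jun": 6, "juni": 6, "jul": 7, "juli": 7,
    "aug": 8, "augustus": 8, "sep": 9, "september": 9, "okt": 10,
    "oktober": 10, "nov": 11, "november": 11, "dec": 12, "december": 12,
}
# Longest names first so that "januari" is tried before "jan".
_MONTH_ALT = "|".join(sorted(MONTHS, key=len, reverse=True))
# Day, separator, month name, the same separator again, four-digit year.
DATE_RE = re.compile(
    r"\b(\d{1,2})([ \-/\\])(" + _MONTH_ALT + r")\2(\d{4})\b", re.IGNORECASE
)


def find_dates(text):
    """Return all valid dates of the form '1 jan 1970' or '01-januari-1970'."""
    found = []
    for day, _sep, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month.lower()], int(day)))
        except ValueError:
            pass  # impossible date such as 31 feb
    return found


def presale_date(snippets, event_date):
    """Return the presale date if all dates before the event date agree."""
    candidates = {d for s in snippets for d in find_dates(s) if d < event_date}
    return candidates.pop() if len(candidates) == 1 else None
```

Returning `None` when the candidate dates disagree mirrors the rule above that non-matching instances are discarded.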

In total we found 83 presale dates for the events in August and September. We did an evaluation of the validity of these dates. As we have no test data we examined the snippets returned by Google and when in doubt opened the pages Google linked to. We found that 65 indeed do match. 87% refer to pop concerts. The genre (see Appendix A-1) of the rest of the found presales is miscellaneous.

4.4 The named entities dimension

4.4.1 Introduction

Footnote 9: The regular expression matches all the dates of the forms: (0)1 (jan|Jan|JAN|Januari|januari) 1970, (0)1-(jan|Jan|JAN|Januari|januari)-1970 and (0)1\(jan|Jan|JAN|Januari|januari)\1970

Named entities are entities that refer to one or many rigid designators. Rigid designators are a notion first mentioned by Kripke [8] and they can be seen as a unique identifier of the entity in each possible universe. For example, Kripke argues that George Washington could not be described by "the first president of the US", for if in another universe Washington had died in infancy he would never have fitted this description and somebody else would have. In other words, a named entity can be seen as a unique name by which persons, organizations, locations, expressions of times, quantities etc. can be identified. In our case we can make the following list of named entities stored in the AUB database.

• The venue

• The event or production title

• The persons involved such as writers, actors and directors.

• The group (as in theater group)

We can ask the following dimensional questions for each named entity.

1. Who or what is this named entity.

2. What is its relation to the event (e.g. its role)

Named entities are typically entities that can have a homepage. Such a homepage will probably offer the background information we are after and tell us more about who or what this named entity is. It also might contain answers to dimensional questions that we have not been able to answer, like those in the temporal dimension (section 4.3). We will first check the AUB database for these url's and then we will discuss a module to extract these homepages from the internet.

4.4.2 The AUB dataset for the named entity dimension

The AUB database has only a few fields that hold named entities or related data. An overview is below:

• production title

• production url

• location name

• location url

• production participants (holds the names and roles of the participants)

• group name

An overview of the availability is given in table 4.4 using the months of August and September 2007 as sample data and checking for distinct productions and locations only.

4.4.3 Named entity homepage retrieval module

We would like to provide a list of possible homepages for each named entity. We should strive to maximize the times that the correct result appears on this list. In other words we want to maximize the recall. We also would like this result to appear as high on the list as possible and preferably at the first position. So we also strive to maximize the precision at 1.


field             nr productions   absolute missing   % missing
production name   1070             0                  0%
production url    1070             662                61.87%
location name     959              0                  0%
location url      959              491                51.2%
participants      1070             643                60.09%
group name        1070             882                76.82%

Table 4.4: AUB data availability with regards to the named entity dimension

To achieve this we created a module that sends queries to three search engines (Google, Yahoo and Live!) and extracts the url's they return. It then combines those url's into one list of candidate homepages. Next it performs an extra check to improve the chance that this is the page we were after and then reranks the list. The results are then stored in the database.

Figure 4.3 gives a schematic overview of the process involved in obtaining those url's. In the following sections we will discuss the separate parts of the module, and we will evaluate it while we locate the homepages of the named entities.

Figure 4.3: Finding a named entity homepage

Extract data

We can extract data from the AUB database using the method described in section 3.1. What data we need depends on the named entity we want to find and will be explored in more detail in the next section as well as in sections 4.4.4, 4.4.5, 4.4.6 and 4.4.7.

Create and send queries

With the data we have now, we have to make our queries. In section 4.4.1 we already discussed that the name by which we describe a named entity should point to 1 named entity. When we for example want to find the homepage of the theater Bellevue in Amsterdam, querying on the name "Bellevue" only (as it is stored in the AUB database) will be less accurate than adding "Amsterdam" to it. The difference can be seen in figure 4.4.

The queries we form are then sent to the search engines Google, Yahoo and Live!. Using three search engines should improve the precision once we combine the results. The main idea is that no search engine will have indexed the entire web and that they thus will complement each other when combined (Henzinger [2]).

Extract list of candidate url’s

The search engines will each return an HTML page. We use the methods described in section 3.2.1 to extract the url's. Extra care has to be taken to discard advertisements and internal links. So we narrow our area of search to the result list, using the characteristics of the DOM-tree we have created combined with our knowledge of the HTML structure of the result pages, and we discard url's longer than 80 characters as they turn out to be internal links.

Figure 4.4: Google on "Bellevue" (left) and "Bellevue Amsterdam" (right)

The search engine results do not offer any relevance scores, but only a rank in the result list. We take this ranking to assign a weight to each result. We reverse the rank and assign the first result a score of n, where n is the number of results on the page. We have set this to 10 for each search engine during these experiments. The next result will then be assigned a score of n − 1 and so on. This way we get three lists of {url, score} tuples, one for each search engine.

Create candidate url’s using words in the name

When you want to find a homepage, often your first intuition is to type in the name of what you want to find, add an extension (for example .nl) and give it a go before you start using Google. So to include this intuition we also create a url with the name of the named entity and add a .nl extension. If the name has more than 2 words, we make different combinations, but we keep the order intact. So for example "Het Nationale Theater" will produce: "www.hetnationaletheater.nl", "www.hetnationale.nl" and "www.nationaletheater.nl". We use a ping to check if they actually exist and if so assign a score equivalent to a top ranked document found by a search engine.

After this step we thus have a fourth list of {url,score} tuples. This one might be empty.
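Generating the candidate url's from the words in a name can be sketched as follows. The text only gives one example, so the rule used here (all contiguous runs of at least two words, longest first) is an assumption that happens to reproduce that example; the check whether the host actually exists is left out, and `candidate_urls` is a hypothetical name.

```python
import re


def candidate_urls(name, extension=".nl", min_words=2):
    """Build candidate host names from contiguous runs of words in a name."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    if not words:
        return []
    count = len(words)
    urls = []
    # Longest runs first; every run keeps the original word order.
    for length in range(count, min(min_words, count) - 1, -1):
        for start in range(count - length + 1):
            run = "".join(words[start:start + length])
            urls.append("www." + run + extension)
    return urls
```

A single-word name such as "Bellevue" yields just `www.bellevue.nl`.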

Combining the candidate url’s

We now have to combine those 4 lists into 1 list and add up the scores for url's that have been found by several search engines. We first make sure that all the url's are in the same format and all lower cased to allow for comparison, so we remove slashes at the end and add "http://" to each url if necessary.
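The normalisation step can be sketched as below; `normalize_url` is a hypothetical helper, not the code used in this project.

```python
def normalize_url(url):
    """Lower-case a url, drop trailing slashes and add a scheme if missing."""
    url = url.strip().lower().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url
```

After this step `WWW.Bellevue.nl/` and `http://www.bellevue.nl` compare as equal, so their scores can be added up.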

Aslam [1] made an overview of different combination algorithms. The most common are (weighted) Borda Fuse, (weighted) Bayes Fuse and CombMNZ. They also showed that in the absence of training data and relevance scores for the separate documents, (weighted) Borda Fuse is a good solution. Its algorithm is simple but slightly less effective than the other variants, as shown by Ogilvie and Callan [9]; those variants, however, require a training set and relevance scores.

The Borda count is a voting algorithm that adds the votes for each candidate. The candidates in our case are our url's. For each url the Borda count is then $\sum_{k=1}^{4} \mathrm{score}_k$, where $k$ is the number of the list (1 to 4). When a url was not found on a list the assigned score is zero.

We can then sort on the score to get a combined and ranked list with {url,score} tuples. Thenext example shows that the Bellevue theater has moved up to the second place.

www.bellevue.nl, 33
www.theaterbellevue.nl, 28
www.ci.bellevue.wa.us, 26
...
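The rank-based scoring and the Borda combination described above can be sketched as follows. The function names are hypothetical and the url's in the usage note are placeholders.

```python
from collections import defaultdict


def rank_scores(ranked_urls, n=10):
    """Score a ranked result list: n points for the first url, n - 1 for the next."""
    scores = {}
    for position, url in enumerate(ranked_urls[:n]):
        # Keep the best (first) position if a url occurs twice in one list.
        scores.setdefault(url, n - position)
    return scores


def borda_fuse(score_lists):
    """Add up the scores per url; a url missing from a list contributes zero."""
    totals = defaultdict(int)
    for scores in score_lists:
        for url, score in scores.items():
            totals[url] += score
    # Highest total first; ties are broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

With three result lists `["a", "b", "c"]`, `["b", "a"]` and `["b", "d"]`, url `b` collects 9 + 10 + 10 = 29 points and ends up first, ahead of `a` with 10 + 9 = 19.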

Check and rerank

In this step we look at the results to decide whether we found the page we were looking for andto improve and rerank our prior results. For this we use two techniques.

The first technique is parsing the webpages found and checking the title tag of each page for the name of the person or venue we are looking for. The reasoning is that this is a unique tag on each webpage, and because it displays the title of the page on top it is normally used just for this purpose. A small test on 373 venue url's stored in the AUB database shows that 97% have a title tag. Of these 97%, over 75% have words in this tag that correspond with (parts of) the name of the venue.

Each url that passes the test is assigned extra rank points (equal to n = 10), so that sorting again results in a reranking.
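The title check can be sketched as below. The matching rule (at least one word of the name occurs in the title) is an assumption based on the "(parts of) the name" wording above, and `TitleParser` and `title_matches` are hypothetical names.

```python
import re
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Extracts the text inside the title tag of an HTML page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_matches(html, name):
    """Return True if the page title shares at least one word with the name."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title_words = set(re.findall(r"\w+", parser.title.lower()))
    name_words = set(re.findall(r"\w+", name.lower()))
    return bool(title_words & name_words)
```

Very common words such as articles would also count as a match under this rule, so a stop word list may be needed in practice.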

The second technique is checking the URL length as proposed by Westerveld et al. [13]. They showed that homepages normally have quite short url's. They defined the following classes of url's:

• root: a domain name, optionally followed by index.html (e.g. http://www.muziektheater.nl)

• subroot: a domain name, followed by a single directory, optionally followed by index.html (e.g. http://www.muziektheater.nl/agenda/)

• path: a domain name, followed by an arbitrarily deep path, but not ending in a file name other than index.html (e.g. http://www.muziektheater.nl/agenda/augustus/2007/)

• file: anything ending in a filename other than index.html (e.g. http://www.muziektheater.nl/agenda/augustus/2007/ballet.html)

They analysed the probability that a url in one of these classes actually is a homepage. An overview of these probabilities is given in table 4.5.

class     probability
root      0.717
subroot   0.132
path      0.057
file      0.057

Table 4.5: root-subroot-path-file probability [13]

We used this probability to rerank our results by finding the class of each retrieved url and multiplying its current cumulated score with this probability. Sorting then again reranks the url's and we have our final list. Below is the result for the Bellevue theater after this check.

www.theaterbellevue.nl, 30
www.ci.bellevue.wa.us, 28
www.bellevue.nl, 23
...
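The classification of a url and the reranking with the class probability can be sketched as follows. The probabilities are those of table 4.5; treating a last path segment that contains a dot as a file name is an assumption of this sketch, and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Probabilities from table 4.5.
HOMEPAGE_PROBABILITY = {"root": 0.717, "subroot": 0.132, "path": 0.057, "file": 0.057}


def url_class(url):
    """Classify a url as root, subroot, path or file."""
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlparse(url).path.split("/") if s]
    # A trailing index.html does not change the class.
    if segments and segments[-1].lower() == "index.html":
        segments = segments[:-1]
    if not segments:
        return "root"
    if "." in segments[-1]:  # assumed to be a file name
        return "file"
    return "subroot" if len(segments) == 1 else "path"


def rerank_by_url_class(scored):
    """Multiply each (url, score) by its class probability and sort again."""
    reranked = [
        (url, score * HOMEPAGE_PROBABILITY[url_class(url)]) for url, score in scored
    ]
    return sorted(reranked, key=lambda item: -item[1])
```

Because the root probability is more than five times the subroot probability, a root url can overtake a deeper url that had a much higher combined score.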

Store in the database

Last but not least, we have to store the found results in a database. For each named entity we store a list of results. We add the object id (location or production) and the rank that allows us to retrieve them again in the correct order. The database has already been discussed in section 3.3.

4.4.4 The venue website

Now we use this module to find the homepages of the named entities. Table 4.1 shows that in 49% of the cases there is a value for the venue website in the database. We can then use this as our testdata. We will examine the quality of this data first.

Evaluation testdata

To evaluate our testdata we check, for each venue url stored in the database, whether port 80 (footnote 10) accepts incoming requests. This gives a better indication than sending a ping request, as servers can be configured not to respond to ping requests. To demonstrate this we also sent ping requests to the same set of url's. The results of this test are in table 4.6.

result    size sample   percentage   ping percentage
success   470           96%          85%
failure   470           4%           15%

Table 4.6: Port scan of the location url’s in the AUB database

This indicates that the testdata, although not perfect, is usable to evaluate the retrieval of the venue websites. It however provides no guarantee that the website indeed belongs to the venue it has been connected to, but in this case we assume the AUB database to be correct.
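A port check of this kind can be sketched with a plain TCP connection attempt. This is an illustration only, not the scanner used for table 4.6; `accepts_connections` is a hypothetical name and the timeout value is arbitrary.

```python
import socket


def accepts_connections(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

Unlike a ping, this test only succeeds when something is actually listening on the web port.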

Footnote 10: Port 80 is the default port that the server "listens to" for Web requests.

Evaluation of location homepage

For every location we create a list of possible homepages using the algorithm described in the previous section. For those that have a url in the test set we can measure the performance of the results at several stages of this algorithm:

1. after retrieval by each search engine.

2. after the combining of these results

3. after our reality check.

We also measure the success rate of creating url’s using the words in the name of the namedentity (see section 4.4.3)

Performance measures

Precision The precision is the proportion of retrieved and relevant documents to all the documents retrieved (footnote 11). In this case only 1 correct url can be retrieved. We measure the rank at which the correct url appears in the list of candidates. We do this at several cut-off ranks that we call p@1, p@2, p@5 and p@10. So if a match is made with the first url on the list it is assigned p@1 and if it appears at the seventh place it is assigned p@10. p@1 also implies p@2, p@5 etc.

Mean Reciprocal Rank (MRR) The reciprocal rank of a query is the reciprocal of the rank at which the first correct response is returned, or 0 if none of the first n responses contains a correct answer. The score for a sequence of queries, the MRR, is the mean of the individual queries' reciprocal ranks. It gives an indication of the overall performance.

Recall This is the proportion of the documents retrieved out of the total documents available. In this case this represents the number of times the correct url has been found at all within the first 10 results.
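The three measures can be computed as in the sketch below, for the situation described here where each query has exactly one correct url. The function names and the report layout are hypothetical.

```python
def reciprocal_rank(candidates, correct, n=10):
    """1/rank of the correct url within the first n candidates, else 0."""
    for rank, url in enumerate(candidates[:n], start=1):
        if url == correct:
            return 1.0 / rank
    return 0.0


def evaluate(runs, cutoffs=(1, 2, 5, 10)):
    """Compute MRR, p@k and recall for a non-empty list of runs.

    runs is a list of (candidate_list, correct_url) pairs.
    """
    total = len(runs)
    deepest = max(cutoffs)
    report = {
        "MRR": sum(reciprocal_rank(c, correct, deepest) for c, correct in runs) / total
    }
    for k in cutoffs:
        hits = sum(1 for c, correct in runs if correct in c[:k])
        report["p@%d" % k] = hits / total
    # Recall: the correct url was found at all within the deepest cut-off.
    report["recall"] = report["p@%d" % deepest]
    return report
```

With one correct url per query, recall coincides with p@10, which is why the tables in this section only list MRR and recall.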

We compared two different queries: name of the venue, and name of the venue + city. The latter showed more promising results in the example in figure 4.4. An overview of the results is given in figures 4.5 and 4.6 and table 4.7.

search method/engine   MRR    recall
Google                 0.62   67.1%
Yahoo                  0.51   58%
Live                   0.40   48.1%
CombNoSanity           0.63   79.9%
CombSanity             0.70   79.9%

Table 4.7: MRR and recall for venue website.

From this data we can see the following:

• The combination of search engines improves the recall.

• The sanity check that is performed increases p@1 and the MRR.

• Changing the query does not affect the result very much.

Footnote 11: http://en.wikipedia.org/wiki/Information_retrieval#Precision

Figure 4.5: Location url retrieval results on name and city

Figure 4.6: Location url retrieval results on name

4.4.5 Production website

Next we use the module described in section 4.4.3 to find the websites of the productions. We have an attribute called prod url in the AUB database and we use this as testdata.

Evaluation testdata

Again we evaluate this testdata first by scanning port 80 for accepting incoming requests. Table 4.8 shows an overview.

result         size sample   percentage
success        489           94%
ping success   489           6%

Table 4.8: Production test data

Again the stored url's seem acceptable to use as a test set.

Evaluation production homepage

We use the same measurements we introduced when evaluating the venue website. Again we try two different queries: the production title alone and the production title combined with the genre of the production (see appendix A-1). An overview of the results is given in figures 4.7 and 4.8 and table 4.9.

search method/engine   MRR    recall
Google                 0.19   25.5%
Yahoo                  0.12   20.8%
Live                   0.11   18.8%
CombNoSanity           0.20   28.3%
CombSanity             0.25   28.3%

Table 4.9: MRR and recall for production website.

This data shows the following results:

• The combination of search engines again improves the recall.

• Adding the genre makes the results worse.

• The sanity check again improves the p@1 and the MRR

• The results are very low. This is probably the fault of the test set. This set might not contain production url's. We will discuss this further in section 4.4.6.


Figure 4.7: production homepage retrieval on production title

Figure 4.8: production homepage retrieval on production title and genre


4.4.6 Group website

The group is not always available. Many events are not performed by a group. So we only evaluate the productions where the group attribute has been filled out. We test against the prod url test set evaluated in the previous section. As adding anything to the query does not seem to matter we only use the group name.

Evaluation group homepage

We use again the same measurements. We have no other data that makes a logical match with the group name so we only query on this name and we use the production url's stored in the AUB database as our test set. This is the same testdata as in the previous section. The results are shown in figure 4.9 and table 4.10.

Figure 4.9: Group homepage retrieval results

search method/engine   MRR    recall
Google                 0.65   69.5%
Yahoo                  0.49   66.1%
Live                   0.49   57.2%
CombNoSanity           0.65   77.3%
CombSanity             0.69   77.3%

Table 4.10: MRR and recall for group website.

• We find that the group fits the testdata much better. The prod url then seems to be a mix between the production url and the group url with an emphasis on the group url.

• The combination of search engines again improves the recall.

• Adding the genre to the query does not help.

• The sanity check again improves the p@1 and the MRR.

4.4.7 Participants

The participants attribute has not always been filled out. But when it has, it contains one or more {role,name} tuples. The roles and names are separated by a :-symbol. The different tuples are separated by a |-symbol. We used the module to retrieve these homepages as well. Alas we do not have any testdata in the AUB database that allows a proper evaluation.

4.4.8 Conclusion of the evaluations

In all cases the module has improved the recall due to the combination of the search engine results. Also the p@1 has improved after we used the check algorithms. We have found no results of previous similar experiments using search engines and real websites so it is difficult to assign a broader value to these results.

4.5 The financial dimension

4.5.1 Introduction

Being Dutch, we consider the costs of an outing important. So in this section we will examine the financial side of visiting an event. Questions about money normally will start with "HOW MUCH". We focus on the financial information of a cultural event as far as it can be directly related to the event or its location. So for example having dinner before the show starts of course will cost you, but how much really depends on your taste and wallet, and thus this typically is a dimensional question we will not try to answer. This limits our financial dimensional questions to the following list:

1. How much does a ticket cost.

2. How much does parking cost.

3. How much does the public transport cost from my home/my work/my ... to the venue.

4.5.2 The AUB data for the financial dimension

Again we check the AUB database first. The database has a few fields for the price that are listed below.

• online sales url

• prices

• sales through AUB

• sold out


We checked for the availability of this data for all the events in the months August and September. An overview is given in table 4.11.

field               nr venues   absolute missing   % missing
online sales url    2984        2557               85.7%
prices              2984        1257               42.3%
sales through AUB   2984        2984               100%
sold out            959         8                  0.3%

Table 4.11: Missing financial data in the AUB database

The purpose of the sales through AUB attribute is a little bit unclear as it is zero all the time, which seems to indicate that sales through the AUB are not available. In section 4.2.4 we found however that the price information attribute does indicate that sales through the AUB are possible.

4.5.3 How much does a ticket cost

There are several sources where we can get the price of a ticket.

• AUB database

• Ticketservice website

• The location website

• Auction sites like Marktplaats/eBay

• Resellers like tickets4U.nl or onlineticketshop.nl

However, none of these sources provide a complete list of all the prices.

The AUB database has an attribute called "prijs" that contains price information, but as shown in table 4.11 it is often left empty. Ticketservice is the biggest online seller of tickets, but it focuses mainly on pop concerts, festivals, musicals and ballet. The last two options are exponents of the thriving market for buying a stock of tickets for popular events and then selling them at double or triple the normal price when the event sells out (footnote 12). It is a fact however that a lot of cultural events never are in danger of selling out, so we should expect only to find offers for popular events that will only be performed once or twice. This definition fits pop concerts best.

To add some extra prices apart from those stored in the AUB database we tried three options.

1. Marktplaats

2. Ticketservice

3. two online resellers (tickets4U.nl and onlineticketshop.nl)

Footnote 12: An article in de Volkskrant of 07-08-2007 asked for a ban on this practice.

Marktplaats

Marktplaats is divided in categories and a few focus on the selling of tickets for concerts, theater and other events. Using only those categories narrows down our search to tickets only instead of other related products such as cd's. Marktplaats provides an RSS feed link that you can normally use to get updates on your selection of wannahaves. With some adjustments we could use this service to query Marktplaats for anything in the relevant categories. We created a module that does just that for all the productions we have.

We can then parse the XML and extract the price from the description tag (footnote 13). Subsequently we calculate the average of the found prices. The price normally is per ticket, but in about 10% of the cases the price is for multiple tickets at once. These prices will erroneously raise our average. To avoid this, we calculate a lower and upper boundary that we set at 50% and 150% of the average respectively, and then discard all the prices that fall outside those boundaries. We then calculate a new average together with a minimum and maximum price. These are stored in the database. If we have found some tickets at Marktplaats, a Marktplaats option will appear on the website.
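The outlier filter on the prices can be sketched as follows; `summarize_prices` is a hypothetical name and the prices in the usage note are made up.

```python
def summarize_prices(prices, low=0.5, high=1.5):
    """Drop prices outside 50%-150% of the average, then summarize the rest."""
    if not prices:
        return None
    mean = sum(prices) / len(prices)
    kept = [p for p in prices if low * mean <= p <= high * mean]
    if not kept:
        # Possible when the prices are very far apart, e.g. [1, 100].
        return None
    return {"average": sum(kept) / len(kept), "min": min(kept), "max": max(kept)}
```

For the prices 50, 55, 60, 45 and 200 the first average is 82, so only prices between 41 and 123 are kept; the offer of 200 (probably several tickets at once) is dropped and the new average is 52.5.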

For the Ticketservice and the two resellers we use the search function of the sites to examine if they sell tickets for the event at all. We use the link propagation module introduced in section 3.2.2 to get the pages that list the prices. Each event has its own page. Searching for prices by doing a regular expression search on the "euro"-sign then gives us the price. When we find more than 1 price we calculate the average. If we found a price by one of these providers, the option will appear on the website together with the price(s) found.

Evaluation

The number of events for which second-hand tickets are available is low and, as expected, those are mainly pop concerts. Out of a total of 1070 productions we found results for only 40 events at Marktplaats on the 16th of August 2007. The ticketservice provides slightly better results: it returned 87 prices. The online resellers found only 10 events in total.

Prices do differ hugely, however. For example, the concert of the singer Norah Jones yields averages of 55 euro at ticketservice, 80 euro at tickets4U, 93 euro at Marktplaats and an astonishing 137 euro at ticketsonline.nl.

Figure 4.10 shows a screenshot of the ticket prices for a production.

4.5.4 How much does parking cost?

The prices of parking can be found on the website bereikbaar-amsterdam.nl. It offers information about the street parking rates and the rates of the parking garages.

The street parking rate of any street in Amsterdam can be found by filling out a form, which returns a standard HTML page with this information. It turns out that this can also be done by including the street name in an HTTP request. This allowed us to obtain this page for every venue by looping through all the venues stored in the database. We parse the result page, again using the module from section 3.2.1.

Obtaining extra information about the parking garages works in much the same manner. We have to send a different HTTP request and instead of street names we use the parking id's we have found in section 4.2.5. Again we extract the parking rates of the garages (and the opening hours) and then store this data in the parkings table. Figure 4.1 in section 4.2.5 also shows these prices.

13There is no separate tag for the price in this XML file.

Figure 4.10: Ticket prices screenshot

Evaluation

At first we managed to assign a street parking rate to 543 of the 944 locations this way. The missing rates can be explained by the absence of an address in the AUB database. It also turned out that the HTTP request is sensitive to the use of upper case and special characters. By removing all special characters and converting the street names to lower case we were able to improve the recall by 40% to 763. A sample check of 50 random venues with an assigned rate showed 96% to be correct. The faults are probably due to street names being mixed up.
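The normalisation of street names before sending the request could be sketched as follows. The exact cleaning rules are not recorded above, so stripping accents, replacing punctuation by spaces and the functions `normalize_street` and `build_query` are assumptions. The query parameter name `street` and the base URL in the example are placeholders, not the real interface of bereikbaar-amsterdam.nl.

```python
import re
import unicodedata
from urllib.parse import urlencode


def normalize_street(name):
    """Lower-case a street name and remove special characters (assumed rules)."""
    # Decompose accented characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace anything that is not a letter, digit or space by a space
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ascii_only.lower())
    # Collapse runs of whitespace
    return " ".join(cleaned.split())


def build_query(base_url, street):
    """Build a request URL; the 'street' parameter name is hypothetical."""
    return base_url + "?" + urlencode({"street": normalize_street(street)})


if __name__ == "__main__":
    print(normalize_street("Pieter Cornelisz. Hooftstraat"))
    print(build_query("http://example.invalid/rates", "Nieuwezijds Voorburgwal"))
```

For reference, going from 543 to 763 matched locations is an increase of (763 - 543) / 543, roughly 40%, which corresponds to covering about 81% of the 944 locations instead of about 58%.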

We did a manual check of all the parking garages (23) and they have all been assigned data. This data is equal to that on bereikbaar-amsterdam.nl in all cases.

4.5.5 How much does the public transport cost from my home/my work/my ... to the venue?

For the price of public transport we once again point to the commercial route planner that is available from ov9292.nl. This planner includes a price calculator.


4.6 The Reviews Dimension

4.6.1 Introduction

This dimension focusses on retrieving reviews about an event. With this information a person can decide what to see or why to see it. We divide these reviews into two types: official and non-official. Official reviews are those published by selected and respected sources; non-official reviews include blog posts, forum contributions, etc. Official reviews are mostly limited to a few specific genres such as theater, musical and ballet. Non-official reviews do not have this limitation. We can then ask the following questions:

• What is this event about?

• Why should you see this event?

• What is the quality of this event?

The first question is straightforward. The second and third questions are subjective and should be left to a person to decide upon. We can only try to gather the information to help form this opinion. Reviews seem a reasonable way to do this.

Finding the official reviews

The database of the AUB has nothing to offer with regard to reviews, so we have to find those somewhere else. We use a small fixed list, provided by the AUB, of websites that offer reviews. As these websites are primarily focused on theater, we will only try to add reviews to events in this genre. We made the following short list for some preliminary testing.

• www.theatercentraal.nl

• www.ihtr.nl

• www.8weekly.nl

Extracting the reviews can be done in several ways. One method is to write a set of rules for each site, using the structure of the site and regular expressions to collect the information. This produces accurate results, but it is time consuming and requires that you write a new template each time you want to add a new site to the list.

A more general method has been described by Van Waveren [12] and was used to extract news articles from the web. It makes use of the fact that news articles (and reviews alike) are blocks of text. In an HTML page these clearly stand out from all the markup, links, etc. By then using the difference between block and inline elements14, these blocks of text can be extracted. In our case, however, we would like the complete review instead of snippets. Storing the reviews in MySQL, building a full-text index on them and using MySQL full-text search15 enables you to search within these reviews for keywords.
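The idea of separating text blocks from markup by means of block and inline elements can be illustrated with the sketch below. It is not the algorithm of [12] and not the Python version written for this project. It is a simplified illustration built on the standard `html.parser` module, in which the set of block tags, the `min_words` threshold and the names `BlockTextExtractor` and `extract_text_blocks` are all assumptions.

```python
from html.parser import HTMLParser

# A small, assumed subset of HTML block-level elements
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "blockquote",
              "br", "table", "tr", "ul", "ol"}
SKIP_TAGS = {"script", "style"}


class BlockTextExtractor(HTMLParser):
    """Collect text between block-level boundaries; inline tags are ignored."""

    def __init__(self, min_words=20):
        super().__init__()
        self.min_words = min_words
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Join the buffered text and normalise whitespace
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        # Keep only blocks long enough to be running text (assumed heuristic)
        if len(text.split()) >= self.min_words:
            self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_text_blocks(html, min_words=20):
    parser = BlockTextExtractor(min_words)
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last block boundary
    return parser.blocks
```

Short blocks such as navigation links are dropped by the word threshold, while inline elements such as bold text or links inside a paragraph remain part of the surrounding block, so a review is returned as complete paragraphs rather than snippets.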

We tried retrieving reviews with both methods: the first by writing a set of rules for the first two sites, and the second by rewriting the algorithm of [12] in Python and feeding it the recent reviews of those sites. In both cases we found little. The websites above offer an average of 5 reviews per month over the past 6 months. Taking into account that a review only has value for an event that is still playing, this indicates that the number of reviews that actually matter will be low. An alternative source is the website recensies.nl. This is a beta version of a website that collects reviews from a number of sites. It also offers these as an RSS feed. It makes the distinction between official and unofficial reviews less clear, however.

14Block elements act as a paragraph divider whereas inline elements do not. See http://htmlhelp.com/reference/html40/block.html for a listing.

15http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

The reviews have not yet been added to the mashup; we leave this to future research.

4.7 Displaying the enriched data

The results of the enrichment are presented on a website. This website offers the possibility to search for an event by name and/or genre. We kept a strict separation between the PHP script that gets the data from the database and the HTML code that takes care of the display. When a new source of information has been found, adding it to the display can easily be done by writing a small piece of code that collects the data and then adds it to a dimension. The website then displays the found information by dimension. On a few occasions, however, this is less strictly implemented. For example, displaying the prices of the parking garages is much more naturally done using the map that is part of the spatial dimension.

The spatial dimension itself depends on the location of the event, and the location in turn depends on the date of the event. This means that the spatial dimension has become a part of the temporal dimension when it comes to displaying it. The Named Entity dimension, the financial dimension and the reviews dimension are connected to the production and are displayed in separate fields.

Figure 4.11 shows a screenshot of this website.

Figure 4.11: Website screenshot

The website is online at: 85.144.192.165/cem.php.


CHAPTER 5

Conclusion

In order to answer our main question of how we can automatically enrich cultural data, we have to look at the sub-questions of section 2.

Question 1: What is the quality of the AUB data? We found that the AUB database is not complete. For most of the dimensional questions we asked, the answers provided by the AUB database are incomplete or missing altogether.

Question 2: What extra data would we like to add? In effect we create a mashup of information around an event. To gain insight into what information we would like to have, we introduced the concept of dimensions and dimensional questions. For each dimension a list of dimensional questions has been made. The answers to those questions are the data we would like to add.

Question 3: Where and how to find this extra data? We found that proper information is not available on the internet for all questions. It seems that more information is available for popular events, and that small events rely heavily on the homepage of the venue or artist for information. Finding those homepages is therefore important and can be done with reasonable precision.

Thus the lack of sources is the real limitation of enriching cultural data. We can sometimes determine a solid source, but we also sometimes have to rely on retrieving unknown sources. Depending on the source, several extraction techniques can be used. The more we know about the underlying format of the source, the more reliable the extraction of the information will be. Scattered sources, each with a different underlying structure, complicate the extraction and will lead to more false information. More research in this area of extraction is therefore needed.

Question 4: How can we store this extra data? The enriched data can best be stored using the same structure as the AUB database. We distinguished three objects in this database: events, locations and productions, which are connected to each other. However, the creation of extra objects is advised to avoid data redundancy. The enriched data will be stored in the database with a connection to these objects. Some information, on the other hand, is not connected to an object and will be displayed at the moment a user issues a query for an event.


Question 5: How to present this extra data in a useful way to the user? The dimensions mentioned before also help to structure the display of the found information. The webpage offers the possibility to search on the names and the genres of the events of the months August and September 2007.


Bibliography

[1] J.A. Aslam and M. Montague, Models for Metasearch, Proc. of the 24th Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 276-284, 2001.

[2] M. Henzinger, Search Technologies for the Internet, Science, pages 468-471, July 2007.

[3] Suhit Gupta, Gail Kaiser, David Neistadt, Peter Grimm, DOM-based Content Extraction of HTML Documents, pp. 207-214, In: Proceedings of the Twelfth International World Wide Web Conference, ACM Press, Budapest, Hungary, May 2003, ISBN 1581136803.

[4] J. H. Lee. Analyses of multiple evidence combination. In N. J. Belkin, A. D. Narasimhalu, and P. Willett, editors, Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267-275, Philadelphia, Pennsylvania, USA, July 1997. ACM Press, New York.

[5] McCallum, Information Extraction, distilling structured data from unstructured text, ACM Queue, November 2005.

[6] I. Muslea, S. Minton, and C. A. Knoblock. A hierarchical approach to wrapper induction.In Proc. of Third Intl. Conf. on Autonomous Agents, pages 190-197, 1999.

[7] K. B. Ng and P. B. Kantor. An investigation of the preconditions for effective data fusion in IR: A pilot study. In Proceedings of the 61st Annual Meeting of the American Society for Information Science, 1998.

[8] S. Kripke, Naming and Necessity, In Semantics of Natural Language, 1972.

[9] P. Ogilvie and J. Callan, Combining Document Representation for known item search, 2003.

[10] Radev, Dragomir and Fan, Weiguo and Qi, Hong and Wu, Harris and Grewal, Amardeep (2002) Probabilistic Question Answering on the Web. In Proceedings International WWW Conference (11), Honolulu, Hawaii, USA.

[11] Ravichandran, D., and Hovy, E. Learning Surface Text Patterns for a Question Answering System, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 41-47, 2001.

[12] F. van Waveren, Extracting and classifying election-related news items from the world wide web, UvA thesis, 2006.

[13] T. Westerveld and W. Kraaij and D. Hiemstra, Retrieving Web pages using content, links, URLs and anchors, In TREC-2001 Notebook Proceedings, 2001. Available online at trec.nist.gov/pubs/.


APPENDIX A

Appendices

A-1 Genres

The genres available in the AUB database, connected at the production level:

• 60-70-80-90

• Ballet

• Bewegingstheater

• Big band

• Cabaret

• Cabaret-theatervorm

• Circus/variete

• Combo

• Dance

• Dans

• Debat

• Documentaire

• Expositie in galerie

• Festival

• Film

• Folklore/Niet Westerse dans

• Frans/Duits

• Gesproken woord

• Het Nederlandse lied

• House

• Jazz

• Jeugd

• Kamermuziek

• Klassiek

• Klassiek - wereldmuziek

• Klassieke muziek

• Koorzang

• Lezing

• Lichte muziek/chanson

• Literaire voordracht

• Mainstream

• Mime

• Moderne dans

• Musical/show

• Muziek

• Muziektheater

• Nederlandstalig

• Niet westers theater

• Opera

• Orkest

• Overigen

• Pop

• Popmuziek

• Poppentheater

• R&B/Hiphop

• Recital

• Rock

• Soul/reggae

• Speelfilm

• Stadswandeling/Rondvaart

• Stand-up comedy

• Symposium/congres

• Tentoonstelling

• Theater

• Toneel

• Video

• Vocaal

• Wereld-dance

• Wereldmuziek

• Wereldpop
