
Web Content Trend Analysis

Andreas Juffinger
Know-Center Graz
Inffeldgasse 21a
Graz, Austria
[email protected]

Reinhard Willfort
Innovation Service Network
Hugo-Wolf-Gasse 6a
Graz, Austria
[email protected]

Michael Granitzer
Know-Center Graz
Inffeldgasse 21a
Graz, Austria
[email protected]

ABSTRACT
The World Wide Web as a social network reflects changes of interest in certain domains. It has been shown that free online content in Blogs, Wikis, News and Forums is a valuable source of information to identify trends in certain domains. In this paper we present the methodology and the technical realisation of our framework for Web content trend analysis. The methodology incorporates techniques from ontology learning, population and evolution as well as information extraction to create reasonable models for upcoming trends in specific domains of interest. Aspects of ontology evolution, as a qualitative measure, and inverse document tagging, as a quantitative measure, for trend analysis are presented.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Content Analysis and Indexing; H.4 [Information Systems Applications]: Miscellaneous

Keywords
Web Crawling, Ontology Learning, Ontology Evolution, Trend Analysis

1. INTRODUCTION
The evolution of the information society is characterised by high innovation rates and a continuously increasing complexity of new technologies. In such a dynamic environment, organisations of any type can only stay competitive if they treat knowledge as a valuable asset and manage it consciously. Ontologies (explicit formal specifications of the terms in a domain and the relations among them, Gruber [6]) and ontology-based approaches are considered the foundation for the next generation of information services. While ontologies enable software agents to exchange knowledge and information in a standardised, intelligent manner, describing today's vast amount of information in terms of ontological knowledge over time remains a challenge.

In this work we propose the problem formulation, our methodology and a trend analysis support system based on semi-automatic ontology learning as a solution for the identified problems. From our perspective, Web content trend analysis is the task of identifying upcoming concepts within a domain, as well as identifying significant shifts in concept occurrence within this domain. Web content trend analysis is therefore the task of tracking time-dependent qualitative and quantitative differences in ontologies. For this we have identified the following necessary main steps: (a) Web content retrieval, (b) ontology learning and population, (c) ontology evolution and versioning and (d) quantitative analysis.

The remainder of this paper is organised as follows: the rest of this Section introduces the four main steps in detail together with related work. In Section 2 we introduce our methodology and the overall system architecture, before we explain technical details in Section 3, present evaluations of parts of the overall system in Section 4, and conclude in Section 5.

1.1 Web Content Retrieval
Downloading the content of Web pages is generally achieved by some kind of crawling system. A crawler, also known as a wanderer, robot or spider, has the task of retrieving documents, extracting links and then fetching the newly discovered documents. This principal crawling loop is defined as follows: starting from a collection St=0 of hyperlinks in the frontier, the crawler selects one link s ∈ St, resolves the hostname to an IP address, establishes a connection and fetches the document using a specific transfer protocol. The retrieved document is then preprocessed and parsed by a MIME type specific parser. The discovered links S′ are then added to the frontier, St ∪ S′ → St+1. A crawler performs this loop until some higher-level target is reached, or it runs out of links.
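To make the loop concrete, the following minimal sketch implements it in Python. The frontier handling, the naive HTML link parser, the politeness delay and the stopping criterion are illustrative assumptions; they are not the configuration of the crawler described in Section 3.

```python
# Minimal sketch of the principal crawling loop described above.
# The politeness delay, the naive link parser and the stopping criterion
# are illustrative assumptions, not the actual crawler configuration.
import time
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href attributes from anchor tags (a very naive stand-in for a MIME-type-specific parser)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

def crawl(seed_links, max_documents=100, delay_seconds=1.0):
    frontier = list(seed_links)               # S_t=0
    seen, documents = set(frontier), []
    while frontier and len(documents) < max_documents:
        url = frontier.pop(0)                 # select one link s in S_t
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                          # unreachable or broken link, skip
        documents.append((url, html))
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:             # discovered links S'
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)     # S_t united with S' -> S_t+1
        time.sleep(delay_seconds)             # avoid overloading the web servers
    return documents
```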

Search engines rely on huge amounts of Web content; Web content retrieval is therefore mainly driven by search engine providers, who invented the first Web crawling systems. The first Web search engine, "Wandex", used an index collected by the World Wide Web Wanderer, a web crawler developed by Matthew Gray at MIT in 1993. Since then a number of crawling systems have been published for different applications, as outlined in [9]. We aim to take a snapshot of a predefined set of web sites within a time period that is as short as possible, without overloading the web servers. For this


we have developed a distributed Web2.0 crawling system, as described in earlier work [9], to retrieve the necessary amount of Web content within the defined time constraints for this work. Our system allows users to define an arbitrary number of sites and pages as starting points. One can define blacklist and whitelist address patterns for filtering, as well as a number of other rules and constraints to control the amount and quality of the crawled data.

1.2 Ontology Learning and Population
According to Buitelaar et al. [2] and other authors [14, 20], ontology learning is concerned with knowledge acquisition from data. In the context of this paper we address primarily text as our data format. Ontology learning from text is based on natural language processing, machine learning and artificial intelligence, and overlaps with knowledge acquisition due to the similar aim of extracting explicit knowledge implicitly contained in textual data. The aim of ontology population is, according to the definition in Amardeilh [1], the semi-automatic insertion of instances, properties and relations into the knowledge base, as defined by the domain ontology. As argued in Buitelaar et al. [2], ontology population is closely related to ontology learning. Ontology learning and population often follow the incremental growing paradigm, as outlined in Buitelaar et al. [2] and Maedche [14]: starting from scratch, new concepts and relations are identified and added to the ontology, thereby growing the ontology.

Based on this definition and our use case we have derived the following ontology learning and ontology population loop, as shown in Fig. 1.

Figure 1: The Principle of an Ontology Learning and Population Loop (information extraction and natural language processing; concept and relationship identification; concept positioning and relation insertion; with feedback from domain expert(s))

In ontology learning from text, one starts with information extraction (IE) and natural language processing (NLP) to extract relevant terms and term relations from the corpus. After this, concept candidates and relations are identified and positioned in the ontology. After placing new concepts and inserting the newly identified relations, there are two possible scenarios to close the ontology learning loop. First, in the case of a semi-automatic approach, the results are shown to a domain expert or knowledge worker before iterating the loop. Second, in a fully automatic approach, iteration happens without this intervention. Any hybrid model between these two extremes is possible.

We have experienced that one should not automatically extend the ontology by more than five concepts plus relations, for reasons of ascertainability. Our ontology learning loop is comparable with related work [14], where ontology learning iterates four phases (import, extract, prune and refine). The phases can be mapped to our building blocks as follows: import/reuse is implicitly included through the fact that this loop always operates on the same ontology (complete reuse); the extract phase happens in the dash-dotted area of Fig. 1; finally, pruning and refinement can be mapped to the expert integration step.

1.3 Ontology Evolution
The process of keeping track of the changes of an ontology over time is called ontology evolution. It has been shown that nearly every ontology is dynamic [4], and it is necessary to represent changes in ontologies [15, 21]. The ontology evolution loop is defined as shown in Fig. 2. Naturally, the above ontology learning loop is a subpart of this evolution loop: in each iteration, one has to repeat the ontology learning loop to extend the earlier ontology. For Web content trend analysis one is concerned with technologies for tracking changes over time. As in Noy and Klein [17], we distinguish between traced and untraced evolution. Traced evolution is treated as a series of changes in the ontology, whereas untraced evolution deals with two versions of an ontology and no knowledge of the steps that led from one to the other. In traced evolution one uses the ontology from the last evolution iteration and evaluates each concept and relation for presence in the current data. New concepts are inserted, and concepts and relations that are no longer present are deleted. Each operation is a single step towards the evolved ontology, and therefore a trace is left.

Figure 2: The Principle of an Ontology Evolution Loop (ontology alignment, matching and evolution; information extraction and natural language processing; concept and relationship identification; concept positioning and relation insertion; with feedback from domain expert(s))

To clarify the terminology, one has to distinguish between ontology evolution and ontology versioning. Ontology versioning means that there are multiple variants of an ontology, usually originating from the same ontology variant, whereas ontology evolution is the process of evolving an ontology over time. Several authors [17, 15, 11] have outlined that ontology evolution is the ontological equivalent of database schema evolution or versioning [22]. Note that we deal exclusively with ontologies where the changes are not reflected in the ontology with appropriate classes, concepts and relations. We consider ontology evolution in the sense of ontology versioning [10], where each single change triggers a new ontology variant or version.


1.4 Quantitative Analysis
Trend analysis on the Web is currently focused on technological and design aspects, and only a few attempts have been made, mainly by search engine providers, to identify trends in Web user interests. Those content trend analysis systems are nearly all based on some kind of logs, such as query logs in search engines, or web server and editor logs in wikis. For example, Google Trends1 is based on query logs, and Wikipedia extracts trends from web server logs to rate the most popular articles. None of them is domain specific or user controllable in the sense of defining the sites of interest. Well-known trend analysts, such as J. Naisbitt [16] and M. Horx [8], are bound to manual analysis of huge amounts of data, with only little help from information technology.

For trend analysis it is crucial not only to extract knowledge qualitatively, but also to keep track of quantitative aspects. Inverse document annotation can help to extract quantitative aspects and answer questions such as: "How important is concept A?", "How important is the relation between concepts A and B?", "How many documents have been used to extract the knowledge?". A natural quantitative measure for concept importance is the number of documents in which a certain concept occurs. Similarly, a quantitative measure for relations is the number of co-occurrences of two concepts on the sentence and document level. It is clear that for Web content one cannot always ensure that the crawled data is of the same quantity (the same number of Web pages or even the same amount of data). Therefore, to allow a meaningful quantitative comparison, normalisation has to take place.

2. METHODOLOGY
In this context it is necessary to clarify the meaning of tokens, words, terms, named entities and concepts. In a first step, we have to distinguish between tokens and terms. A token or word is a single string in a text, a syntactic atom. A term, or multi-token term, is a more specialised entity of everyday language, with a certain meaning in a particular subject [14]. Note that we also refer to terms in the sense of a synonym set, similar to the definition in WordNet2. Named entities are terms of predefined categories such as the names of persons, organisations and locations. Concepts are much more complex constructs, with an intensional definition, a set of instances or extensions and a set of linguistic realizations [2].

On a coarse level our methodology tackles the four main steps explained in the introduction. Web content retrieval is discussed in detail in earlier work [9], ontology learning and population are discussed in the remainder of this Section, and ontology evolution and versioning as well as quantitative analysis are dealt with in Section 3.

The ontology learning process can be split into different layers [2], as shown in Fig. 3. This layer cake presents the different tasks that need to be solved in order to create a fully functional ontology, with the complexity increasing with each layer from the bottom to the top. Starting from layer three, concepts, one can call the resulting graph an ontology as defined by philosophy. The layers four to seven are highly relevant in the field of computer and information science for formal semantics. Ontologies which are enriched with isA and other relations, as well as rules, can be used to infer explicit knowledge implicitly contained in the knowledge base or ontology. For Web content trend analysis it is sufficient to implement the ontology learning layers up to level three, concepts, enriched with weak relations such as occurs with.

1 http://www.google.com/trends
2 http://wordnet.princeton.edu/

Figure 3: Ontology Learning Layers (layer: example)
Tokens and Words: university, college, school
Terms: {university, college}
Concepts: university := <I, E, L>
Concept Hierarchies: isA(professor, person)
Relations: teaches(professor, student)
Rules: matriculated(P, university) -> isA(P, student)

To achieve an ontology for web content analysis we have implemented the following three main steps: firstly, we collapse tokens into terms and create a term network; secondly, the term network is used to assess new concepts; and thirdly, relevant relations in the term network become occurs with relations.

2.1 From Tokens To Term Network
Starting with tokens, we perform natural language processing to collapse multiple tokens into one term. This includes a normalisation of plural/singular words using a very simple scheme: if the singular transformation of a word (e.g. cutting a trailing 's') is contained within the corpus, merge these terms to form a synset. Additional synsets are formed by retrieving synonyms from WordNet. Named entities are also extracted through a rule-based system implemented in GATE with hand-crafted rules.
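The following is a minimal sketch of this plural/singular merging heuristic; the toy vocabulary is an illustrative assumption, and in our pipeline additional synsets would further be merged via WordNet synonyms.

```python
# Sketch of the simple plural/singular normalisation heuristic described
# above: if cutting a trailing 's' yields a word that also occurs in the
# corpus vocabulary, both surface forms are merged into one synset.
# The vocabulary below is a toy example, not data from our corpus.
def build_synsets(vocabulary):
    """vocabulary: set of token strings observed in the corpus."""
    synsets = {}                          # surface form -> canonical term
    for word in sorted(vocabulary):
        singular = word[:-1] if word.endswith("s") else word
        if singular != word and singular in vocabulary:
            synsets[word] = singular      # merge plural onto the singular form
        else:
            synsets.setdefault(word, word)
    return synsets

vocab = {"trend", "trends", "ontology", "crawler", "crawlers"}
print(build_synsets(vocab))
# {'crawler': 'crawler', 'crawlers': 'crawler', 'ontology': 'ontology',
#  'trend': 'trend', 'trends': 'trend'}
```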

Figure 4: Generating a Term Network from a Corpus

The term network is then generated based on the co-occurrence of tokens belonging to different terms. We also include tokens from named entities in the co-occurrence analysis, which allows us to build a single network with terms and named entities on which later steps can operate. The generation of a single combined network has the advantage that terms without a path between them become navigable if a named entity deals with both "things". Finally, concept candidates are identified by a rule-based system with Hearst Patterns [7].
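A minimal sketch of building such a weighted co-occurrence network is given below; the sentence-level scope and the minimum count threshold are illustrative assumptions, and the statistical significance test actually used for relations is described in Section 3.2.

```python
# Sketch of building a weighted co-occurrence term network from sentences.
# The sentence-level scope and the raw-count threshold are illustrative
# assumptions; significance testing of edges is discussed in Section 3.2.
from collections import Counter
from itertools import combinations

def build_term_network(sentences, min_cooccurrence=2):
    """sentences: list of lists of already normalised terms (and named entities)."""
    edges = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            edges[(a, b)] += 1            # undirected edge between two terms
    # keep only edges observed often enough to be worth testing further
    return {pair: count for pair, count in edges.items()
            if count >= min_cooccurrence}
```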

2.2 From Term Network To Concepts
The necessity to extend an ontology with a new concept identified in the term network occurs in our framework during the creation of the initial seed ontology and at each evolution step. The application of our methodology is identical for both cases and consists of three steps to extend an existing ontology Ot with information extracted from a corpus (Ot → Ot+1): (1) activate all or selected concepts in Ot and calculate the activation field over the term network and concept candidate nodes, (2) select the most activated concept candidates, and (3) extend the existing ontology with these candidates.

This ontology extension process is outlined in Fig. 5. The term network is activated with the help of an earlier ontology. Then we calculate an activation field using leaky capacitor spreading activation, similar to Liu and Weichselbraun [13]. The different activation levels of the concept candidates are then used to select the most important candidates. In both cases, seed ontology learning and ontology evolution, the extended ontology is then used in the next iteration cycle. With this approach we can ensure that only concepts which are associated with the existing ontology are considered as extension candidates, thereby growing a domain-specific ontology.

Figure 5: The Ontology Extension Process
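The activation step can be pictured with the following sketch of leaky-capacitor style spreading activation; the decay and spread factors, the adjacency representation and the iteration count are assumptions for illustration, loosely following the idea in [13].

```python
# Sketch of leaky-capacitor style spreading activation over the term
# network: energy is injected at nodes mapped to existing ontology
# concepts, a fraction of each node's activation spreads to its
# neighbours per iteration, and a decay factor lets activation leak
# away. The decay/spread values are illustrative assumptions.
def spreading_activation(graph, seeds, iterations=5, decay=0.8, spread=0.5):
    """graph: dict node -> list of (neighbour, edge_weight); seeds: dict node -> injected energy."""
    activation = dict(seeds)
    for _ in range(iterations):
        incoming = {}
        for node, level in activation.items():
            neighbours = graph.get(node, [])
            total_weight = sum(w for _, w in neighbours) or 1.0
            for neighbour, weight in neighbours:
                incoming[neighbour] = incoming.get(neighbour, 0.0) \
                    + spread * level * (weight / total_weight)
        activation = {n: decay * activation.get(n, 0.0) + incoming.get(n, 0.0)
                      for n in set(activation) | set(incoming)}
    return activation  # concept candidates are ranked by their activation level
```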

2.3 Relation Diffusion
Since we have relations between terms based on co-occurrence, and a mapping of concepts to their terms, we can derive the relations between concepts from the term network. We have further outlined that only statistically significant relations are kept in the term network, so that no further relevance testing is necessary when diffusing the relations from the term network into the ontology. A second dimension of relevance has been introduced by applying link type suggestion as in [23] to estimate the type of a relation, which enables us to reject a number of artifact links and to improve the precision of our method.

3. TECHNICAL DETAILS
In this Section we discuss the architecture and selected aspects of our framework implementation and realisation. We describe the technical realisation of our versioning system, the concept assessment implementation, and our quantitative analysis, by means of the user interface module. The Web crawling system is described in detail in Juffinger et al. [9], and in the information extraction module we employ the knowledge discovery framework KnowMiner [12] and state-of-the-art open source frameworks3.

3.1 System Architecture
Since our framework exclusively evolves the ontology, and we therefore know all modification steps, we can deal with traced evolution only. In a closed-world application such as our framework, one can rule out external changes to the ontology. These restrictions allow us to ensure that our ontology variants are always full, backward or upward compatible and never incompatible, as defined by Klein and Fensel [10].

The building blocks of our framework are shown in Fig. 6. The system design focuses on loosely coupled building blocks. Following the Pipes and Filters and the Layer design patterns [5], we ensure communication between horizontal blocks exclusively through the underlying blocks. For example, the data crawled by the crawling system is stored in its raw format in the database; the information extraction module reads the data from the database and stores the extracted tokens and terms back to the database. Each block is further designed to allow parallel execution in a workstation cluster based on Hadoop4. For this it is necessary that each block implements the Map and Reduce programming paradigm [3].

We store raw and processed data in a PostgreSQL5 database. For performance and independence reasons, we store the results of the information extraction block as one XML document per raw document in the database. The XML structure allows us to store evolving information units over time. For instance, in a first version the document contained plain terms and tokens, and now we also include term types (verbs, nouns, ...), term stems, and sentences. This approach allows one to store evolving results of nearly arbitrary techniques without any modifications of the underlying database schema, at the cost of slightly more computational complexity when using the extracted information.
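To illustrate this idea, the sketch below constructs such a per-document XML record with Python's standard library; the element and attribute names are hypothetical examples, not our actual storage schema.

```python
# Sketch of writing one XML record per raw document. The element and
# attribute names (document, sentence, term, stem, type) are hypothetical;
# they only illustrate how additional information units (stems, term types,
# sentences) can be added over time without changing the database schema.
import xml.etree.ElementTree as ET

def to_xml(doc_id, sentences):
    """sentences: list of sentences, each a list of (token, stem, pos_type) tuples."""
    root = ET.Element("document", id=doc_id)
    for sentence in sentences:
        s = ET.SubElement(root, "sentence")
        for token, stem, pos_type in sentence:
            ET.SubElement(s, "term", stem=stem, type=pos_type).text = token
    return ET.tostring(root, encoding="unicode")

print(to_xml("42", [[("crawlers", "crawler", "noun"), ("fetch", "fetch", "verb")]]))
```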

The ontology learning module builds upon the concept assessment, the object triple mapping (OTM) and the general graph algorithmic modules. These modules themselves build upon the Sesame6 module.

3 http://gate.ac.uk/, http://lucene.apache.org/
4 http://lucene.apache.org/hadoop/
5 http://www.postgresql.org
6 http://www.openrdf.org


Figure 6: The Trend Analysis Framework (building blocks: Web Crawling, Information Extraction, Relation Extraction, Concept Assessment, Ontology Learning with OTM, Graph Algorithms and a Graph Store on top of Sesame, a Storage Layer with a Versioned Triple Store, plus Versioning and User Feedback)

based repository and querying library, developed by AdunaSoftware. Sesame offers the possibility to implement storagebackends - we developed a version enabled backend, whichallows us to transparently introduce versioning functionalityto the RDF repository at our needs. Further we have devel-oped a graph store, which is able to read arbitrary graphsfrom a Sesame storage backend. The store can handle dif-ferent mappings to nodes and relations, which allows one toaccess different aspects and subgraphs stored in the backend.The graph store further provides a simple graph API for di-rected, weighted graphs. Building upon the graph store,one can implement different graph algorithms in a standardgraph notation with nodes and edges, independent of thesemantic of the underlying RDF network. For the timebeing, we have implemented several graph traversal, graphshortest-path and spreading activation algorithms. On topof the ontology learning modules, we provide a user feedbackmodule, which allows user intervention during the ontologylearning loop, as defined in Fig. 1.

3.2 Relation Extraction
For use in Web content trend analysis we do not need to discover formal relationships between concepts; rather, we are interested in whether two concepts have something in common. We therefore formulate the following hypothesis: if there is enough statistical evidence that two terms occur together more frequently than expected from the distribution of the terms within the corpus, then these terms share something in common. To test the statistical evidence we use the chi-square test of independence with the contingency table shown in Table 1.

Table 1: Chi-Square Contingency Table

                             All Tokens    Term 2
Corpus Distribution              A            B
Co-Occurrence with Term 1        C            D

The values A, B, C and D are the respective counts of the terms based on the scope (the whole document, or co-occurrence on sentence/document level). This results in term tuples from within the corpus which are in some way associated with each other. With this simple heuristic we can create a representative network of terms of moderate size for a domain.
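A sketch of this independence test on the 2x2 table is given below; the critical value of 3.841 (one degree of freedom at the 5% level) is an illustrative threshold, not necessarily the one used in our system.

```python
# Sketch of the chi-square test of independence on the 2x2 contingency
# table from Table 1, using the standard closed-form expression for a
# 2x2 table. The critical value 3.841 (1 degree of freedom, 5% level)
# is an illustrative threshold.
def chi_square_2x2(a, b, c, d):
    """a, b: corpus distribution counts; c, d: co-occurrence counts with term 1."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def terms_associated(a, b, c, d, critical_value=3.841):
    """True if term 1 and term 2 co-occur more often than independence would predict."""
    return chi_square_2x2(a, b, c, d) > critical_value
```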

3.3 Concept Assessment
Hearst Patterns are used to detect associations between possible concepts and/or instances. The detected associations are then validated using external data sources (WordNet, Wikipedia). To validate a concept we test whether the extracted concept association is valid within the external data source, e.g. whether the concept is contained in WordNet. This validation method allows us to achieve a higher precision for detected concepts. Note that we only take validated concepts into account in further calculations.

The detected concept associations are furthermore attached to existing terms within the term network as possible concept candidates. With the use of spreading activation and the injection of energy into the term network through the ontology, as shown in Fig. 5, we are able to rank the concept candidates and to select the most relevant ones to extend the existing ontology.

3.4 Quantitative Analysis
The quantitative analysis part is responsible for extracting the document, term and term tuple counts from the corpus. We store exactly this information while processing the input, and therefore no additional extraction step is necessary. As outlined in the introduction, one has to normalise these counts to allow meaningful comparison. In the style of Information Retrieval (IR), we normalise the counts according to the Term Frequency Inverse Document Frequency (TFIDF) weighting scheme [18, 19].
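A minimal sketch of such a TFIDF-style normalisation of concept counts is shown below; the exact weighting variant (here a relative document frequency damped by a logarithmic idf term) is an assumption for illustration, not necessarily the exact scheme we use.

```python
# Sketch of a TFIDF-style normalisation of concept counts so that crawls
# of different sizes become comparable. The particular weighting variant
# (relative document frequency times a logarithmic idf term) is an
# illustrative assumption.
import math

def normalised_weights(concept_doc_counts, total_documents):
    """concept_doc_counts: dict concept -> number of documents containing it."""
    weights = {}
    for concept, doc_count in concept_doc_counts.items():
        tf = doc_count / total_documents                   # relative frequency in this crawl
        idf = math.log(total_documents / (1 + doc_count))  # dampen ubiquitous concepts
        weights[concept] = tf * idf
    return weights
```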

(a) Manually defined seed ontology

(b) Valued seed ontology

Figure 7: Ontology Modelling and Quantisation

After term concept assignment of a manually defined seed ontology, Fig. 7(a), we are able to visualise the normalised quantity in the ontology graph, as shown in Fig. 7(b). In our current visualisation module, based on Eclipse GEF7, we illustrate the normalised importance as graph nodes with an appropriate diameter. Note that only the size is a measure of importance. The different colours encode the type of concept or instance as extracted in our framework, whereby we distinguish between concepts and instances of persons, geographical locations, and companies. The time-dependent quantisation is visualised by growing or shrinking rings around the graph nodes. To use this functionality, the user has to navigate to an ontology version; after selecting a basis ontology version, we can compare the quantisation and visualise relative changes as shown in Fig. 8(b). This visualisation allows the user to quickly identify shifts in concept popularity and relation importance, in other words, to perform Web content trend analysis.

(a) Ontology Extension

(b) Quantified Ontology

Figure 8: Ontology Extension and QuantitativeAnalysis

4. EVALUATION

7 http://www.eclipse.org/gef

Because the ontology learning and ontology evolution process is semi-automatic, and its outcome therefore depends on the user and other non-deterministic factors, we are not able to evaluate the overall system performance. For the Web2.0 crawling system, which is also applicable to other applications, we evaluated the noise reduction performance. The outcome of the crawling system is a huge amount of plain text, which makes manual evaluation infeasible. To evaluate our noise reduction method we used variable ranking as outlined in [?]. The scoring function is the TFIDF (term frequency - inverse document frequency) weighting value from Information Retrieval. For the evaluation we used a crawl of about 2000 documents, for which we had filtered and unfiltered content available. The corpus contained the filtered and the unfiltered version of each document, and therefore had an overall size of about 4000 documents. We then calculated the TFIDF weight for each term in the corpus. Treating each document pair as a cluster, we then calculated the cosine distance of each filtered document to each unfiltered document. The assumption was that we only cut irrelevant noise from the documents, and therefore the distance between an unfiltered document and its filtered version should be smaller than the distance to every other document. In other words: is it possible to identify the original document when only the filtered document is given? In our experiment this was true for 94% of all filtered documents, which leads to the interpretation that our noise reduction method works quite well. Of course, this would trivially be true if we simply did not filter the documents at all, so we also evaluated the amount of filtering. The current system cut off about 37% of the terms, leading to a recognition error of 6%.
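The recognition criterion can be sketched as a nearest-neighbour check on TFIDF vectors; the vector representation below (dictionaries of term weights) is a simplified stand-in for the pipeline actually used in the evaluation.

```python
# Sketch of the evaluation criterion: a filtered document counts as
# correctly recognised if its unfiltered counterpart is the most similar
# unfiltered document under cosine similarity of TFIDF vectors.
import math

def cosine_similarity(u, v):
    """u, v: sparse TFIDF vectors as dicts mapping term -> weight."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recognition_rate(filtered, unfiltered):
    """filtered/unfiltered: lists of TFIDF vectors aligned by index (document pairs)."""
    hits = 0
    for i, filtered_vec in enumerate(filtered):
        best = max(range(len(unfiltered)),
                   key=lambda j: cosine_similarity(filtered_vec, unfiltered[j]))
        hits += int(best == i)
    return hits / len(filtered)
```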

5. CONCLUSION
The impact and advances of our work are explained by means of an example. Note that for the ontology in Fig. 7(a), we have crawled English articles from the Heise8 Web site. After an initial crawl we presented the results of the quantitative analysis to the user as in Fig. 7(b). We then applied ontology learning as explained above and were able to suggest Vista as an instance of Operating System, with relations to Software and Microsoft, as shown in Fig. 8(a). Two days later we crawled the Web site again, and after normalisation we visualised the quantitative differences as outlined in Fig. 8(b). Although intensive user studies are pending, we have experienced that this framework is very useful for trend analysts and, from our point of view, a big step towards fully automatic trend identification in text corpora.

6. FURTHER WORK
Future work includes enhancements to the user interface, such as the realisation of a Gartner-Quadrant-like visualisation of concepts as well as a theme river depiction. Furthermore, we plan intensive user studies and evaluations of our ontology learning methodology in a controlled environment.

Acknowledgement
The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Ministry of Transport, Innovation and Technology, the Austrian Ministry of Economics and Labor and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

8 http://www.heise.de/english


7. REFERENCES
[1] F. Amardeilh. OntoPop or How to annotate Documents and Populate Ontologies from Text. In Proc. of the 3rd European Semantic Web Conference, 2006.
[2] P. Buitelaar, P. Cimiano, and B. Magnini. Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005.
[3] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation, 2004.
[4] D. Fensel. Ontologies: Dynamic Networks of Meaning. In Proc. of the 1st Semantic Web Working Symposium, 2001.
[5] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns - Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[6] T. R. Gruber. Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In Proc. of the Int. Workshop on Formal Ontology, 1993.
[7] M. A. Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. Technical Report S2K-92-09, 1992.
[8] M. Horx. Zukunft passiert (Future happens). Molden Verlag, 2006.
[9] A. Juffinger, T. Neidhart, M. Granitzer, R. Kern, A. Weichselbraun, G. Wohlgenannt, and A. Scharl. Distributed Web2.0 Crawling for Ontology Evolution. In Proc. of the 2nd IEEE Int. Conf. on Digital Information Management, 2007.
[10] M. Klein and D. Fensel. Ontology Versioning on the Semantic Web. In Proc. of the Int. Semantic Web Working Symposium (SWWS), Stanford University, 2001.
[11] M. Klein and N. F. Noy. A component-based framework for ontology evolution. Technical report, Department of Computer Science, Vrije Universiteit Amsterdam, 2003.
[12] W. Klieber, V. Sabol, M. Granitzer, W. Kienreich, and R. Kern. KnowMiner - Ein Service-orientiertes Knowledge Discovery Framework (A service-oriented knowledge discovery framework). In Proc. of GI-Edition. Bonner Koellen Verlag, 2006.
[13] W. Liu and A. Weichselbraun. Semi-automatic ontology extension using spreading activation. Journal of Universal Knowledge Management, 2005.
[14] A. Maedche. Ontology Learning for the Semantic Web. Kluwer Academic Publishers, 2003.
[15] A. Maedche, S. Staab, S. Handschuh, and R. Volz. Managing Multiple Ontologies and Ontology Evolution in Ontologging. In Proc. of the Conf. on Intelligent Information Processing, 2002.
[16] J. Naisbitt. Mind Set!: Reset Your Thinking and See the Future. HarperCollins, 2006.
[17] N. F. Noy and M. Klein. Ontology Evolution: Not the Same as Schema Evolution. Knowledge and Information Systems, 2004.
[18] C. J. van Rijsbergen. Information Retrieval, 2nd Edition. Butterworths, 1979.
[19] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 1988.
[20] S. Staab and R. Studer. Handbook on Ontologies. Springer, 2005.
[21] L. Stojanovic, A. Maedche, B. Motik, and N. Stojanovic. User-driven Ontology Evolution Management. In Proc. of the 13th Int. Conference on Knowledge Engineering and Knowledge Management, 2002.
[22] V. Ventrone and S. Heiler. Semantic heterogeneity as a result of domain evolution. SIGMOD Record, 1991.
[23] A. Weichselbraun, G. Wohlgenannt, A. Scharl, M. Granitzer, T. Neidhart, and A. Juffinger. Applying vector space models to ontology link type suggestion. In Proc. of the 16th ACM Conference on Information and Knowledge Management (to appear), 2007.