
CONTEXT AND SEMANTICS ( IN INFORMATION SYSTEMS … · 2014-11-18 · CONTEXT AND SEMANTICS IN INFORMATION SYSTEMS 2 About me Ernesto William De Luca University of Applied Sciences


How important are they?

PERSONALIZATION, CONTEXT AND SEMANTICS IN INFORMATION SYSTEMS

2

About me

Ernesto William De Luca

University of Applied Sciences Potsdam (http://iw.fh-potsdam.de/)

Substitute Professor for Information Science and Information Retrieval

E-mail: [email protected]

3

Outline

• About me
• Introduction
    • Where are we?
    • Problems
    • Future of Search
• Topics of the course

4

Introduction – Where are we? Evolution of Search


5

Introduction – Where are we? Intelligent query extension

Source: www.semager.com

6

Introduction – Where are we? Extracting and providing knowledge

Source: www.google.com

7

Introduction – Where are we? Search term suggestion

Source: www.google.com


8

Introduction – Where are we? Context dependent result presentation

Source: www.google.com


9

Introduction – Where are we? Clustering of results

Source: de.vivisimo.com

10

Introduction – Where are we? Recognizing relations

Source: www.wolframalpha.com

11

Introduction – Where are we? Generating knowledge

Source: www.wolframalpha.com

12

Introduction – Where are we? Detecting homonyms and acronyms


13

Introduction – Where are we? Understanding natural language queries

Source: www.powerset.com

14

Introduction – Where are we? Summary

• Extracting and processing
    • structured information
    • unstructured information
• Providing knowledge
    • improving services
• Inferring knowledge
• Searching for/in knowledge
• Presenting knowledge

15

Outline

• About me
• Introduction
    • Where are we?
    • Problems
    • Future of Search
• Topics of the course

16

Introduction – Problems Entity recognition can always be improved

Source: www.semager.com


17

Introduction – Problems Clustering by senses is necessary

Source: de.vivisimo.com

18

Introduction – Problems Collaboratively gained search term suggestions may be out of scope

Source: www.google.com

19

Introduction – Problems Facts are not known for each search

Source: www.google.com

20

Introduction – Problems Knowledge not linked correctly

Source: www.wolframalpha.com


21

Introduction – Problems Recommendations are not personalized

Source: www.wolframalpha.com

22

Introduction – Problems Semantic suggestions were language independent

23

Introduction – Problems Understanding Query in Natural Language

Source: www.powerset.com

24

Outline

• About me
• Introduction
    • Where are we?
    • Problems
    • Future of Search
• Topics of the course


25

Introduction – Future of Search

• Knowledge helps to understand the intention of queries
• Knowledge is retrieved from interactions

"You live in Berlin. Last week you searched for spicy recipes."

26

Introduction – Future of Search

Knowledge will always be structured

<div class="contactinfo">
  Rob Crowther. Web hacker at
  <a href="http://example.org">Example.org</a>.
  You can contact me
  <a href="mailto:[email protected]">via e-mail</a>
  or on my work phone at 0123 456789.
</div>

Instead, with semantic annotation:

<div xmlns:contact="http://www.w3.org/2001/vcard-rdf/3.0#" class="contactinfo" about="http://example.org/staff/robertc">
  <span property="contact:fn">Rob Crowther</span>.
  <span property="contact:title">Web hacker</span> at
  <a rel="contact:org" href="http://example.org">Example.org</a>.
  You can contact me
  <a rel="contact:email" href="mailto:[email protected]">via e-mail</a>
  or on my
  <span property="contact:tel">
    <span property="contact:type">work</span> phone at
    <span property="contact:value">0123 456789</span>
  </span>.
</div>

27

Introduction – Future of Search

Knowledge is retrieved everywhere

Britney Spears' current hair color is blond.

06.09.2009 Source: www.viply.de

You are watching soccer: Hamburger SV vs. FC Bayern. Intermediate result in the AOL Arena is 0:0.

Source: www.sky.de

28

Introduction – Future of Search

Speech can be in- and output

"How long do I have to wait for the bus to the main station?"

"10 minutes"

Image source: www.portel.de (Huawei Android-Smartphone © Huawei)


29

Introduction – Future of Search

Linked knowledge for answering complex questions.

Known facts:
• James T. Kirk is the captain of the Enterprise.
• Enterprise is a TV series.
• James T. Kirk was killed by Soran.
• Soran was played by Malcolm McDowell.

"Who played the killer of James T. Kirk?"

"Malcolm McDowell played the killer of James T. Kirk."
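The chain of reasoning in this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the facts are stored as hand-written (subject, relation, object) triples, and the relation names `captain_of`, `is_a`, `killed_by` and `played_by` as well as the helper functions are hypothetical, not part of any system shown on the slides.

```python
# Hypothetical mini knowledge base: (subject, relation, object) triples.
FACTS = [
    ("James T. Kirk", "captain_of", "Enterprise"),
    ("Enterprise", "is_a", "TV series"),
    ("James T. Kirk", "killed_by", "Soran"),
    ("Soran", "played_by", "Malcolm McDowell"),
]


def lookup(subject, relation, facts=FACTS):
    """Return all objects o such that (subject, relation, o) is a known fact."""
    return [o for s, r, o in facts if s == subject and r == relation]


def follow(subject, relations, facts=FACTS):
    """Follow a chain of relations, starting from a single subject."""
    current = [subject]
    for relation in relations:
        current = [o for s in current for o in lookup(s, relation, facts)]
    return current


# "Who played the killer of James T. Kirk?" becomes a two-step chain.
answer = follow("James T. Kirk", ["killed_by", "played_by"])
print(answer)
```

The question is answered by linking two facts: first the killer is looked up, then the actor who played that character.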

30

Introduction – Future of Search

Entity recognition at query time

31

Introduction – Future of Search

Automatic text summarization with minimal loss of semantics.

This website is about the movie Star Trek: Generations. In this movie, both captains of the Enterprise (James T. Kirk and Jean-Luc Picard) work together to stop Soran.

INTRODUCTION TO INFORMATION RETRIEVAL

Evolution of the Search


33

chair

Interaction and Retrieval (Humans)

34

Interaction and Retrieval (Data)

Content type: Analog, Digitalized, Digital, Hybrid

Libraries, Archives, Internet, Social Networks

35

Interaction and Retrieval (Media Type)

Text

Picture Video

Audio

36

State of the Art

Current Problems
• Archives / Libraries / Digital Libraries (DL)
    • Mostly English supported search
    • Mostly keyword-based search
    • Librarians
        • Search experts
        • But no domain-specific knowledge
• Search Engines
    • Huge amount of Web documents
    • Mostly keyword-based (monolingual) search
    • Manually and automatically derived categories
    • Based on statistical methods only
    • Lack of semantics (given a query)

General Goal:
• Find relevant information related to the user query
• Structure and classification of information


37

State of the Art

IR Problems

Dagobert Soergel: Important problems in information retrieval
• Problem 1. Assisting the user in clarifying and analyzing the problem and determining information needs.
• Problem 2. Knowing how people use and process information.
• Problem 3. Knowledge representation.
• Problem 4. Procedures for processing knowledge/information.
• Problem 5. The human-computer interface.
• Problem 6. Designing integrated workbench systems.
• Problem 7. Designing user-enhanced information systems.
• Problem 8. System evaluation.

38

State of the Art – Current Search Engines

Google, Vivísimo, Teoma

39

Rough timeline of the generations of information retrieval in digital libraries

Current Research – Evolution of Information Retrieval

Bruce R. Schatz, "Information Retrieval in Digital Libraries: Bringing the Search to the Net." Science, Vol. 275, 1997.

40

Multilingual Social Semantic Digital Library
Involves the world community in sharing multilingual knowledge

Current Research – Evolution of Digital Content

Digital Enterprise Research Institute (www.deri.org)

Sebastian Kruk: "Digital Libraries of the Future. Use of Semantic Web and Social Bookmarking to support E-Learning in Digital Libraries", Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, 2006.


41

i2010 (“information space, innovation and investment and inclusion”) to establish a single European information space

Current Research – Evolution of the Legal and Technical Landscape

(Source: DLA Piper, 2007)

42

source: http://web2.wsj2.com/

Current Research – Structured Interaction and Retrieval

43

Summary

Current Research – What is Web 2.0? "Definition" (O'Reilly).

Web 1.0                    Web 2.0                       New
DoubleClick                Google AdSense                personalised
Ofoto                      Flickr                        tagging, community
Akamai                     BitTorrent                    P2P
mp3.com                    Napster                       P2P
Britannica Online          Wikipedia                     community, free content
personal websites          blogging                      dialog
Evite                      upcoming.org and EVDB
domain name speculation    search engine optimization
page views                 cost per click                pay for participation
screen scraping            web services                  interoperability
publishing                 participation
CMS                        wikis                         flexibility, freedom
directories (taxonomy)     tagging ("folksonomy")        community, freedom
stickiness                 syndication                   open content

44

Current Research?

Semantic Wikis

Nova Spivack: Metaweb


45

Current Research? Is this the digital future?

INTRODUCTION TO INFORMATION RETRIEVAL

Topics

47

Introduction – Topics of the course

Information Retrieval

48

Introduction – Where are we? Evolution of Search

Multilingual Search


49

Introduction – Where are we? Evolution of Search

Named-Entity Disambiguation

50

Introduction – What can we do?

Information Retrieval

Multilingual Search

Named-Entity Disambiguation


INFORMATION RETRIEVAL

An Introduction

2

Introduction to Information Retrieval

• What is IR?
    • The IR task
    • IR process
    • Components of an IR system
    • Web search systems
• Related research fields

3

What Do We Discuss?

• How do search engines work?
    • How do they collect information?
    • What “tricks” do they apply?
    • How can search methods be used “outside the web”?
• How can we improve search approaches?
    • Do they support natural language?
    • How can user interaction be improved?
• How can we speed up computation?
    • Data structures
    • Caching
    • Compression, …

4

What Do We Discuss?

• How can we decide whether a search approach really works?
    • In general for all queries or for specific queries
    • For specific document collections or the whole web
    • What kind of measures can we use?
• What else can we do?
    • Other types of media?
    • Other tasks?


5

What is Information Retrieval (IR)?

• Salton 1968: Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
• Wikipedia (German version): IR is a research field that deals with computer-supported, content-based, and vague search in unstructured data collections.
• Information in the form of:
    • In general: unstructured data
    • Most often: text documents
    • But also: images, videos, music, …

6

From Data to Knowledge

• Data
    • Tokens/characters that can be processed by a machine
    • Data has no further meaning besides its simple presence
    • Collection of facts, figures, statistics, …

[Figure: hierarchy Data → Information → Knowledge → Wisdom]

7

From Data to Knowledge

• Information
    • Interpreted data that gets a meaning
    • Abstract content of the data
    • Useful data in the process of asking interrogative questions, e.g.
        • Who?
        • What?
        • Where?
        • How many?
        • When?
• Information = Data + Meaning/Purpose

[Figure: hierarchy Data → Information → Knowledge → Wisdom]

8

From Data to Knowledge

• Knowledge
    • Meaningful combination of information
    • Information has been processed, organized or structured, or otherwise applied or put into action
• Knowledge = Information + Processing

[Figure: hierarchy Data → Information → Knowledge → Wisdom]


9

From Data to Knowledge

• Wisdom
    • Recently introduced
    • Requires understanding
        • An appreciation of why
    • Evaluated understanding
    • Further added value through unique and personal judgment (ethical aspects)

[Figure: hierarchy Data → Information → Knowledge → Wisdom]

10

From Data to Knowledge

11

What is Information Retrieval?

• Indexing and retrieval (finding and re-finding) of text documents

• Search for web pages in the World Wide Web is the current “killer application”

• It is mainly search for relevant documents given a certain question (query)

• It is also efficient search of documents in very large document collections

• It is not data retrieval as in databases

12

Databases

• Storage and retrieval of data
• Data is stored in a clearly defined structure
    • e.g., in tables with different columns
    • Structure and meaning of the data are precisely determined in a schema
• Query language
    • Artificial, with restricted syntax and vocabulary
    • Exact and complete specification of what is requested
• All exactly matching items shall be retrieved
• All items are equally relevant


13

Information Retrieval Systems

• Unstructured data
    • E.g., natural language texts
• Query language
    • Mainly based on natural language
• Impossible to exactly specify the results of interest
    • Interest in partial matches
    • Relevance as central issue
• Goal: fast access to relevant documents
    • E.g., through a sorted list
• IR system must be able to interpret the content of documents in the context of the user query

14

Databases vs. IR Systems

van Rijsbergen, 1979

                       Data Retrieval    Information Retrieval
Matching               exact             partial, best
Inference              deduction         induction
Model                  deterministic     probabilistic
Classification         monothetic        polythetic
Query language         artificial        natural
Query specification    complete          incomplete
Items wanted           matching          relevant
Error response         sensitive         insensitive
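The contrast between exact and partial (best) matching can be made concrete with a small sketch. It is a toy illustration, not a description of any real database or search engine: the records, documents and both function names are made up for this example, and the ranking simply counts how many query words a document contains.

```python
# Hypothetical toy data: structured records and unstructured documents.
RECORDS = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Potsdam"},
]
DOCS = {
    "d1": "train from berlin to munich",
    "d2": "capital of germany",
    "d3": "berlin is the capital of germany",
}


def data_retrieval(records, field, value):
    """Exact match: an item either satisfies the query or it does not."""
    return [r["id"] for r in records if r[field] == value]


def information_retrieval(docs, query):
    """Partial match: rank documents by how many query words they contain."""
    words = query.lower().split()
    scores = {d: sum(w in text.split() for w in words) for d, text in docs.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(d, s) for d, s in ranked if s > 0]


print(data_retrieval(RECORDS, "city", "Berlin"))
print(information_retrieval(DOCS, "capital of Berlin"))
```

A slightly misspelled value returns nothing in the data retrieval case (error sensitive), while the ranked list still contains documents that match only part of the query (error insensitive).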

15

Task of an IR System

• In general
    • Answer a question or
    • Find a specific piece of information
• Typical simplification
    • Given to the system:
        • A pre-existing set of “canned” natural language documents
        • A query in form of a text string
    • Seek:
        • An ordered set of documents relevant to the query
        • The most relevant out of the repository and display them to the user

16

Task of an IR System

• Build a system that retrieves documents that are most likely relevant to the user

[Figure: information need → query → retrieval system (working on a document collection) → documents]


17

Other Tasks of an IR System

Classification

[Figure: documents → retrieval system → categories k1, k2, …, kn-1, kn]

18

Other Tasks of an IR System

Filtering

[Figure: documents → retrieval system → categories pos, neg]

19

Other Tasks of an IR System

Routing

[Figure: documents → retrieval system → users u1, u2, …, un-1, un]

20

Other Tasks of an IR System

Clustering

[Figure: documents → retrieval system → clusters c1, c2, …, cn-1, cn]


21

What is Information Retrieval?

• Starting point
    • User with information need, i.e., a lack of information
• Three phases
    • Asking a question (Information Need)
    • Constructing an answer (Response)
    • Assessment of the answer (Evaluation)
• Iterative process
    • Several questions might be necessary to satisfy the information need

22

Information Need

• Perceived gap in the user’s knowledge
• Concrete Information Need
    • A specific piece of information is required
    • E.g.: What is the capital of Germany? When does the train to Munich leave?
• Problem-oriented Information Need
    • Research about a specific topic
    • A collection of several documents is required
    • E.g.: What is the current state of the art in web search engine technology?

23

Asking a Question

• Person asking (user)
    • Is in a certain cognitive state (context, frame of mind)
    • Is aware of a gap of knowledge but might not be able to describe it
    • However: is required to specify his information need
• Paradox of Finding Out About
    • “The need to describe that which you do not know in order to find it” (Roland Hjerppe)
    • You can only ask the right question if you know what the result is
• Query
    • Expression of this ill-defined state

24

Answering the Question

• Say the question answerer is human
    • Does the answerer know the answer himself?
    • Can he translate the user’s ill-defined question into a better one?
    • Is he able to verbalize his answer?
    • Will the user understand this verbalization?
    • Can he provide the needed background?
• Say the question answerer is a computer system
    • …


25

Assessing the Answer – Relevance

• How well does it answer the question? or How relevant is the answer to the user?
• Relevance is a subjective assessment and can include:
    • Right topic
    • From the right time frame (up to date)
    • From a trusted source
    • Answer considers goals and intended usage of the user (information need)

26

Assessing the Answer – Relevance

• Answer is complete and precise
    • Q: Who teaches this lecture?
    • A: Ernesto William De Luca, who is working at the University of Applied Sciences Potsdam, Germany.
• Question is partially answered
    • Q: Where is Berlin?
    • A: In Germany.
• Answer suggests a source for more information
    • Q: What is Information Retrieval?
    • A: Attend this course.

27

Assessing the Answer – Relevance

• Answer gives background information
    • Q: What is Information Retrieval?
    • A: IR has been a computer science discipline for about 60 years.
• Answer reminds the user of other relevant knowledge
    • Q: What is Information Retrieval?
    • A: If you are interested in IR, it might be helpful to have some background knowledge in databases.

28

Relevance for Keyword-based Search

• Simplest form
    • Exact occurrence of the query in the document
• Less restrictive
    • Single words of the query have to occur often in the document
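Both notions of keyword relevance can be sketched in a few lines. The function names and the sample document are assumptions made for this illustration; whitespace tokenization and lowercasing are deliberate simplifications.

```python
def exact_match(query, document):
    """Simplest form: the whole query string occurs in the document."""
    return query.lower() in document.lower()


def term_frequency_score(query, document):
    """Less restrictive: count how often each query word occurs in the document."""
    doc_words = document.lower().split()
    return sum(doc_words.count(word) for word in query.lower().split())


doc = "the car dealer sells a used car and a new car"

print(exact_match("used car", doc))            # the phrase occurs literally
print(term_frequency_score("car dealer", doc))  # "car" three times, "dealer" once
print(term_frequency_score("automobile", doc))  # no literal occurrence at all
```

The last call shows the limit of purely literal matching: a document about cars gets a score of zero for the query "automobile".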


29

Relevance for Keyword-based Search

• Problems
    • Missed documents due to synonyms
        • “Big Apple” vs. “New York”
        • “automobile” vs. “car”
        • “profession” vs. “occupation”
    • Irrelevant documents due to ambiguous terms
        • “bank” (financial institution vs. river bank; in German, “Bank” also means a bench to sit on)
        • “apple” (company vs. fruit)
        • “bit” (unit of data vs. act of eating)

30

IR is an Iterative Process

• Dialog instead of a single question
    • The exchange does not (necessarily) end with the first answer
    • User recognizes elements of a useful answer
    • Answer changes his knowledge although the information need is not satisfied yet
    • User modifies the initial query
• During the search process:
    • Questions and understanding change
    • Information need itself might also change

31

Berrypicking Model (Bates 1989)

• New information may yield new ideas and directions
• The information need
    • Is not satisfied by a single answer but rather
    • By a set of information found along the way.

[Figure: search path through the successive queries Q0 to Q5, with points labeled T and E]

32

IR System Architecture

[Figure: IR system architecture. Information need → query → retrieval system → documents; the retrieval system consists of interface, ranking, and preprocessing and indexing of the document collection]


33

IR System Architecture

34

Underlying Model

35

Underlying Model

• Fundamental component
• Framework for representing
    • Queries
    • Data items and
    • their relationships
• The “intuition” for ranking
• Types of Models
    • Boolean model
    • Vector space model
    • Probabilistic model
    • …
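As an illustration of how a model provides the "intuition" for ranking, the following is a minimal sketch of the vector space model. It assumes raw term counts as vector components and cosine similarity as the ranking function; real systems usually add term weighting, which is left out here. All names are chosen for this example.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a bag of words: term -> raw count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Order documents by decreasing similarity to the query."""
    q = to_vector(query)
    scored = [(name, cosine(q, to_vector(text))) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: -pair[1])


docs = {"d1": "apple pie recipe", "d2": "apple computer company"}
for name, score in rank("apple pie", docs):
    print(name, round(score, 3))
```

Queries and documents live in the same vector space, so "relationship" between them reduces to a geometric similarity.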

36

Internal Data Representation


37

Internal Data Representation

• Represent data such that
    • Content/meaning is described appropriately
    • Efficient access based on a query is possible
    • Memory usage is kept as small as possible
• Types of index structures
    • Inverted index
    • Suffix trees
    • …
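A minimal sketch of an inverted index, assuming whitespace tokenization and lowercasing; the sample documents and the function name are made up for illustration. Instead of scanning every document at query time, the index maps each term directly to the documents that contain it.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


DOCS = {
    1: "Berlin is the capital of Germany",
    2: "The train to Munich leaves Berlin",
    3: "Munich is in Germany",
}
INDEX = build_inverted_index(DOCS)

print(INDEX["berlin"])
print(INDEX["munich"])
```

Looking up a term is a single dictionary access, which is what makes efficient access based on a query possible.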

38

Pre-Processing and Indexing

• Transform raw data into internal representation
• For documents:
    • Interpreting sequences of characters
    • Recognizing
        • Words and phrases
        • Sentence structures
        • Part-of-speech
    • Syntactical analysis
    • Morphological analysis
    • Statistical and linguistic methods
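The first pre-processing steps can be sketched with the Python standard library. This is a deliberately crude illustration: the tokenizer only keeps ASCII letters and digits, the sentence splitter only looks at end-of-sentence punctuation, and `naive_stem` is a toy stand-in for real morphological analysis. All three function names are hypothetical.

```python
import re


def tokenize(text):
    """Interpret a character sequence as a list of lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Very rough sentence splitting after '.', '!' and '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def naive_stem(token):
    """Toy morphological normalisation: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


print(split_sentences("Where is Berlin? In Germany."))
print([naive_stem(t) for t in tokenize("Searching documents")])
```

The output of such a pipeline (normalized terms per document) is what an index structure would then store.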

39

Queries

40

Queries

• Way to express information need
• Types:
    • Boolean
    • Natural language
    • Stylized natural language
    • Form-based (GUI)
• E.g., Boolean:
    • Terms
        • Words and phrases
    • Operators
        • AND, OR, NOT
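Boolean queries map naturally onto set operations. The sketch below assumes a tiny made-up collection and evaluates terms by scanning the documents; the helper names `term`, `AND`, `OR` and `NOT` are chosen for this example only.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval in digital libraries",
    2: "data mining and knowledge discovery",
    3: "information extraction and data mining",
}
ALL_IDS = set(DOCS)


def term(word):
    """Set of documents that contain the word."""
    return {doc_id for doc_id, text in DOCS.items() if word.lower() in text.lower().split()}


def AND(a, b):
    return a & b


def OR(a, b):
    return a | b


def NOT(a):
    return ALL_IDS - a


# information AND NOT retrieval
print(AND(term("information"), NOT(term("retrieval"))))
```

Note that a Boolean query only decides membership in the result set; it does not by itself order the results.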


41

Matching and Relevance Ranking

• Searches in the internal data storage for documents matching the query
• A relevance metric is used to order the retrieved document set
• Order, e.g.:
    • Chronologically
    • By number of hits of the query terms
    • By popularity
    • Refined metrics
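Two of the orderings listed above can be sketched as different sort keys over the same matched set. The documents, dates and function names are invented for this illustration; "hits" here simply means the number of occurrences of the query terms.

```python
from datetime import date

# Hypothetical toy collection with a publication date per document.
DOCS = [
    {"id": "a", "date": date(2009, 9, 6), "text": "star trek generations movie review"},
    {"id": "b", "date": date(2008, 1, 15), "text": "star trek star wars comparison"},
    {"id": "c", "date": date(2007, 5, 1), "text": "cooking spicy recipes"},
]


def hits(query, text):
    """Number of occurrences of the query terms in the text."""
    words = text.split()
    return sum(words.count(t) for t in query.lower().split())


def match(query, docs):
    """Matching step: keep documents with at least one hit."""
    return [d for d in docs if hits(query, d["text"]) > 0]


def by_hits(query, docs):
    """Ranking step: most hits first."""
    return sorted(match(query, docs), key=lambda d: -hits(query, d["text"]))


def chronologically(query, docs):
    """Ranking step: newest first."""
    return sorted(match(query, docs), key=lambda d: d["date"], reverse=True)


print([d["id"] for d in by_hits("star trek", DOCS)])
print([d["id"] for d in chronologically("star trek", DOCS)])
```

The matched set is identical in both cases; only the relevance metric used for ordering differs.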

42

Interface and Visualization

• Interaction with the user
    • Take queries
    • Visualize results
        • Ranked list
        • Information per document
        • Structuring of result set
        • Displaying similarities
    • Handle interaction like
        • Relevance feedback
        • Query refinement
        • Filtering

43

Relevance Feedback

• Different possibilities to improve search result
    • Reranking based on user-marked relevant and irrelevant documents
    • Query modifications
        • Reformulate entire query
        • Expansion, e.g., with synonyms
        • Refinement, i.e., additional search on current search results
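Two of the query modifications, expansion with synonyms and refinement on the current results, can be sketched as follows. The thesaurus is a hand-made assumption for this example (a real system might draw on a lexical resource), and all function names are hypothetical.

```python
# Hypothetical hand-made thesaurus.
SYNONYMS = {
    "car": {"automobile"},
    "profession": {"occupation"},
}

DOCS = {
    "d1": "used automobile for sale",
    "d2": "car rental in berlin",
    "d3": "choosing a profession",
}


def search(terms, docs):
    """Return ids of documents that contain at least one of the terms."""
    return sorted(d for d, text in docs.items() if set(terms) & set(text.split()))


def expand(terms, synonyms=SYNONYMS):
    """Query expansion: add known synonyms of every query term."""
    expanded = set(terms)
    for t in terms:
        expanded |= synonyms.get(t, set())
    return expanded


def refine(result_ids, extra_term, docs):
    """Query refinement: additional search restricted to the current results."""
    return [d for d in result_ids if extra_term in docs[d].split()]


first = search({"car"}, DOCS)
wider = search(expand({"car"}), DOCS)
narrower = refine(wider, "berlin", DOCS)
print(first, wider, narrower)
```

Expansion widens the result set, refinement narrows it again; both operate on the query rather than on the ranking.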

44

Relevance Feedback

• Filters
    • Reduce set of candidate results
    • Often on metadata like date, domain, file type, author, size, maximum number to retrieve


45

Web Search

• Application of IR on Web documents in the World Wide Web
• Differences to standard IR
    • Documents must be collected previously → Crawling
    • Using structure as given through HTML/XML
    • Documents are not static; change cannot be controlled
    • Using the hyperlink structure
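The crawling step can be sketched as a breadth-first traversal of the hyperlink structure. To keep the example self-contained, the "web" is a small in-memory dictionary and `fetch` is a stand-in for an HTTP request; both are assumptions for illustration, and a real crawler would additionally need politeness rules, URL normalization and re-visiting of changed pages.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
WEB = {
    "http://example.org/": ("home page", ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("page a", ["http://example.org/b"]),
    "http://example.org/b": ("page b", ["http://example.org/", "http://example.org/missing"]),
}


def fetch(url):
    """Stand-in for an HTTP request; returns None for unknown pages."""
    return WEB.get(url)


def crawl(seed, max_pages=100):
    """Breadth-first crawl that follows hyperlinks and collects page texts."""
    collected = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # dead link: nothing to collect
        text, links = page
        collected[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


print(list(crawl("http://example.org/")))
```

The `seen` set prevents the crawler from looping on the cyclic link structure of the toy web.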

46

Web Search System

[Figure: web search system. Information need → query → retrieval system → documents; the document collection is filled by a spider / crawler]

47

Other Tasks Close to IR

• Automated document categorization
• Automated document clustering
• Automated text summarization
• Question answering
• Information filtering (spam filtering)
• Information extraction
• Information integration
• Recommending information or products
• Searching and ranking in Web 2.0

Related Research Fields

INFORMATION RETRIEVAL


49

Related Research Fields

• Database Management
• Library and Information Science
• Artificial Intelligence
• Natural Language Processing
• Machine Learning, Data Mining

50

Database Management

• Focused on
    • Structured data stored in relational tables
    • Not on free text
• Deals with efficient handling of well-defined queries in a formal language (SQL)
• Clear semantics for data and queries
• Currently, it also deals with semi-structured data like XML; this brings databases and IR closer together

51

Library and Information Science

• Focused on
    • Human-computer interaction, user interface, visualization
    • Organization and search of information in libraries
• Deals with the effective categorization of human knowledge
• Deals with the analysis of the relationship between persons and publications
• Current research in the field of digital libraries brings this field closer to IR

52

Artificial Intelligence

• Focused on methods to acquire, represent, and derive (new) knowledge
• Formalisms to represent knowledge and queries
    • Predicate logic
    • Description logics
    • Bayesian networks
• Current research in the field of the semantic web and ontologies brings this field closer to IR


53

Natural Language Processing

• Focused on the syntactic, semantic, and pragmatic analysis of natural language text
• This could allow for a search related to meaning instead of keywords
• Natural language processing for IR
    • Word sense disambiguation: methods for detecting the word sense of ambiguous words in context
    • Information extraction: methods to identify specific information in text
    • Methods to answer natural language queries on document collections

54

Machine Learning, Data Mining

• Focused on the development of systems that can improve their performance based on experience
• Supervised learning: automatic classification through learning from pre-classified training examples
• Unsupervised learning: automatic methods for grouping non-classified examples

55

Machine Learning, Data Mining

• Machine learning for IR
    • Text categorization
        • Automatic classification in hierarchies (e.g., Yahoo)
        • Adaptive filtering, recommendation
        • Automatic spam filters
    • Text clustering
        • Clustering of IR query results
        • Automatic learning of hierarchies
    • Text mining
        • Learning for information extraction
    • Learning user preferences
    • Learning to rank

Named-Entity Disambiguation

INFORMATION RETRIEVAL


57

Retrieval and Named-Entity Disambiguation

• If newspapers all over the world write about an event, the event stays the same.
• Idea:
    • Extract knowledge (concepts and entities) from news articles
    • Find knowledge in other news articles to detect
        • Same articles in other languages
        • Duplicate articles
        • Follow-up articles (additional knowledge)

58

Retrieval and Named-Entity Disambiguation

[Figure: concept graph with “is a” relations]

59

Retrieval and Named-Entity Disambiguation

[Figure: data flow from a document to its semantic representation]
• Processing steps: Entity Recognition, Entity Analysis, Lexical Analysis
• Resources: dictionaries, common-sense ontology, WordNet
• Extracted information:
    • synonyms, generalisations, relations
    • entities (persons, locations, organizations)
    • relations between found entities

60

Retrieval and Named-Entity Disambiguation

[Figure: data flow from the semantic representation of one document to a semantic profile]
• Processing steps: post-processing, clustering, profile matching
• Results:
    • translation of representations
    • semantic groups and representatives of these groups
    • relevance assumption


Conclusions

INFORMATION RETRIEVAL

62

Summary

• Basic ideas of IR
• Overview of the main components of IR and web search systems

63

Resources

• R. Baeza-Yates, B. Ribeiro-Neto: Modern Information Retrieval, New York, NY: ACM Press, 1999.

• R. Ferber: Information Retrieval. Suchmodelle und Data-Mining-Verfahren für Textsammlungen und das Web. dpunkt-Verl.: Heidelberg, 2003.

• C. D. Manning, P. Raghavan, H. Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008.

• C. D. Manning, H. Schütze: Foundations of Statistical Natural Language Processing, The MIT Press, 2002.

• Proceedings of the 7th European Summer School in Information Retrieval, 2009.


USER MODELING IN KNOWLEDGE MINING

2

Overview

• Data and Knowledge Mining
• User Modeling
• User Profiles
• Scenarios

KNOWLEDGE MINING

Ernesto William De Luca

An Introduction

[email protected]

4

Introduction

Today every enterprise uses electronic information processing systems.
• Production and distribution planning
• Stock and supply management
• Customer and personnel management

However: Data alone are not enough.
• General patterns, structures, regularities go undetected.


5

Data

Examples of Data
• “Columbus discovered America in 1492.”
• “Mr Jones owns a Volkswagen Golf.”

Characteristics of Data
• refer to single instances (single objects, persons, events, points in time etc.)
• describe individual properties
• are often available in huge amounts (databases, archives)
• are usually easy to collect or to obtain (e.g. cash registers with scanners in supermarkets, Internet)
• do not allow us to make predictions


6

Knowledge

Examples of Knowledge
• “All masses attract each other.”
• “Every day at 10:20 am a train runs from Frankfurt to Darmstadt.”

Characteristics of Knowledge
• refers to classes of instances (sets of objects, persons, points in time etc.)
• describes general patterns, structures, laws, principles etc.
• consists of as few statements as possible (this is an objective!)
• is usually difficult to find or to obtain (e.g. natural laws, education)
• allows us to make predictions

Knowledge Mining

7

Criteria to Assess Knowledge

Not all statements are equally important, equally substantial, equally useful. Knowledge must be assessed.

Assessment Criteria
• Correctness (probability, success in tests)
• Generality (range of validity, conditions of validity)
• Usefulness (relevance, predictive power)
• Comprehensibility (simplicity, clarity, parsimony)
• Novelty (previously unknown, unexpected)

Priority
• Science: correctness, generality, simplicity
• Economy: usefulness, comprehensibility, novelty

8

How do we find knowledge?

“We are drowning in information, but starving for knowledge.”
John Naisbitt

Attempts to Solve the Problems
• Intelligent Data Analysis
• Knowledge Discovery in Databases
• Data Mining


9

Knowledge Discovery and Data Mining

Due to the growing volume of data:
• Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. [Fayyad et al. 1996]
• Data Mining is that step of the knowledge discovery process in which data analysis methods are applied to find interesting patterns.

10

Data Mining Tasks

• Classification: Is this customer credit-worthy?
• Segmentation, Clustering: What groups of customers do I have?
• Concept Description: Which properties characterize fault-prone vehicles?
• Prediction, Trend Analysis: What will the exchange rate of the dollar be tomorrow?
• Dependence/Association Analysis: Which products are frequently bought together?
• Deviation Analysis: Are there seasonal or regional variations in turnover?

11

Data Mining Methods

• Classical Statistics (charts, parameter estimation, hypothesis testing, model selection, regression)
  tasks: classification, prediction, trend analysis
• Bayes Classifiers (probabilistic classification, naive and full Bayes classifiers)
  tasks: classification, prediction
• Decision and Regression Trees (top down induction, attribute selection measures, pruning)
  tasks: classification, prediction
• k-nearest Neighbor/Case-based Reasoning (lazy learning, similarity measures, data structures for fast search)
  tasks: classification, prediction
• Artificial Neural Networks (multilayer perceptrons, radial basis function networks, learning vector quantization)
  tasks: classification, prediction, clustering
• Cluster Analysis (k-means and fuzzy clustering, hierarchical agglomerative clustering)
  tasks: segmentation, clustering
• Association Rule Induction (frequent item set mining, rule generation)
  tasks: association analysis
• Inductive Logic Programming (rule generation, version space, search strategies, declarative bias)
  tasks: classification, association analysis, concept description

12

Data Mining Methods

Example: Association Rules

• Supermarkets have large amounts of customer data available.
• There is great interest in joint purchases (products that are bought together), e.g. for the arrangement of products on shelves.
• Association rules describe joint purchases:
  “If a customer buys bread and wine, he/she will also buy cheese in 80% of the cases.”
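The 80% figure in the bread, wine and cheese rule corresponds to what is usually called the confidence of an association rule. The following is a minimal sketch, assuming shopping baskets are given as sets of product names; the function names `support` and `confidence` and the ten baskets are made up for illustration and are not taken from the lecture.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of all baskets that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent."""
    ante = frozenset(antecedent)
    both = ante | frozenset(consequent)
    n_ante = sum(1 for basket in transactions if ante <= basket)
    if n_ante == 0:
        return 0.0  # the rule never applies in this data
    n_both = sum(1 for basket in transactions if both <= basket)
    return n_both / n_ante


# Hypothetical toy data: ten shopping baskets.
baskets = [frozenset(b) for b in [
    {"bread", "wine", "cheese"},
    {"bread", "wine", "cheese"},
    {"bread", "wine", "cheese"},
    {"bread", "wine", "cheese", "milk"},
    {"bread", "wine"},
    {"bread", "milk"},
    {"wine"},
    {"cheese", "milk"},
    {"milk"},
    {"bread", "cheese"},
]]

rule_conf = confidence(baskets, {"bread", "wine"}, {"cheese"})
rule_supp = support(baskets, {"bread", "wine", "cheese"})
print(f"confidence = {rule_conf:.2f}, support = {rule_supp:.2f}")
```

In this toy data five baskets contain bread and wine, and four of those also contain cheese, which gives the 80% of the example rule. Support additionally tells how often the whole combination occurs at all.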


13

Data Mining Methods

Example: Clustering

Goal: Arrange the given data tuples into classes or clusters.

Data tuples assigned to the same cluster should be as similar as possible.

Data tuples assigned to different clusters should be as dissimilar as possible.

Similarity is most often measured with the help of a distance function. (The smaller the distance, the more similar the data tuples.)


14

Data Mining Methods

Example: Hierarchical Agglomerative Clustering

• Centroid (red): distance between the centroids (mean value vectors) of the two clusters.
• Average Linkage: average distance over all pairs of points, one from each of the two clusters.
• Single Linkage (green): distance between the two closest points of the two clusters.
• Complete Linkage (blue): distance between the two farthest points of the two clusters.
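The four cluster distances can be written down directly from their definitions. This is a small sketch, assuming points are numeric tuples and Euclidean distance is the underlying distance function; all function names and the two example clusters are chosen for illustration only.

```python
import math
from itertools import product


def euclid(p, q):
    """Euclidean distance between two points given as numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def centroid(cluster):
    """Mean value vector of a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def centroid_distance(a, b):
    return euclid(centroid(a), centroid(b))


def average_linkage(a, b):
    # Mean over all pairs (p, q) with p from a and q from b.
    return sum(euclid(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def single_linkage(a, b):
    # Distance between the two closest points.
    return min(euclid(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    # Distance between the two farthest points.
    return max(euclid(p, q) for p, q in product(a, b))


# Two hypothetical clusters in the plane.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

for name, fn in [("centroid", centroid_distance),
                 ("average", average_linkage),
                 ("single", single_linkage),
                 ("complete", complete_linkage)]:
    print(f"{name:9s} {fn(cluster_a, cluster_b):.3f}")
```

Single linkage can never exceed average linkage, and average linkage can never exceed complete linkage, since a minimum, a mean and a maximum are taken over the same set of pairwise distances.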


15

Data Mining Methods

Example: Hierarchical Agglomerative Clustering

Start with every data point in its own cluster. (i.e., start with so-called singletons: single element clusters)

In each step merge those two clusters that are closest to each other.

Keep on merging clusters until all data points are contained in one cluster.

The result is a hierarchy of clusters that can be visualized in a tree structure (dendrogram)
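The merging procedure described above can be sketched in a few lines. This is a naive illustration, assuming single linkage and Euclidean distance; the function `agglomerate` and the four one-dimensional points are hypothetical, and the list of merges it returns is the information a dendrogram would display.

```python
import math


def euclid(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def single_link(a, b, points):
    """Single-linkage distance between two clusters of point indices."""
    return min(euclid(points[i], points[j]) for i in a for j in b)


def agglomerate(points):
    """Return the merge history as (cluster_1, cluster_2, distance) triples."""
    # Start with singletons: every data point is its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters that are closest to each other.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_link(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = tuple(sorted(clusters[x] + clusters[y]))
        merges.append((clusters[x], clusters[y], d))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional data set.
points = [(0.0,), (1.0,), (5.0,), (7.0,)]
merges = agglomerate(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The search for the closest pair is repeated from scratch in every step, which is easy to read but slow for large data sets; library implementations avoid this by caching and updating the distances.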


16

Lesson Learned

Knowledge Mining

Knowledge mining can be simply characterized by the following mapping:

DATA + PRIOR_KNOWLEDGE + GOAL → NEW_KNOWLEDGE

where GOAL is an encoding of the knowledge needs of the user(s), and NEW_KNOWLEDGE is knowledge satisfying the GOAL.

Such knowledge can be in the form of data mining methods, statistical summaries, visualizations, natural language summaries, or other knowledge representations.

Ryszard S. Michalski: Knowledge Mining: A Proposed New Direction. Invited talk at the Sanken Symposium on Data Mining and Semantic Web, Osaka University, Japan, March 10-11, 2003


USER MODELING

Ernesto William De Luca

[email protected]

User Profiles and Scenarios

18

User Modeling

Motivation
• Intelligent Access to Information
• Consideration of user's preferences
• Information Filtering and Processing

User Profile
• “Modeling a user”
• Interests, preferences
• Behavior, patterns of interaction

User Modeling

19

User Modeling

Problems and Challenges

• How can we provide an individual experience?
• How can we help users find only relevant information?
• How can we give personal recommendations?
• How can we recognize what the user wants to be recommended?

20

User Modeling

Goals

Personalization
• to understand the user
• to understand the user needs
• to find semantically-related content
• to identify features that influence the user’s (or item’s) current situation (context)

Knowledge to be used
• Implicit Knowledge
• Explicit Knowledge


21

User Modeling

Implicit Knowledge

• Observation of the behavior of the user ("over-the-shoulder look")
• Inferences about preferences or interaction patterns

Application Flow
• Collecting samples
• Identification of common patterns
• Clustering
• Building a profile
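A very small version of this application flow could look as follows. It is a sketch under strong assumptions: the observed behavior is a click log of (article id, topic) pairs, and the clustering step is replaced by plain frequency counting per topic. The function `build_profile`, the threshold `min_share` and the log entries are hypothetical.

```python
from collections import Counter


def build_profile(events, min_share=0.25):
    """Build an interest profile from observed (item, topic) interactions.

    Topics whose share of all interactions is below `min_share` are treated
    as noise and left out of the profile.
    """
    counts = Counter(topic for _item, topic in events)   # common patterns
    total = sum(counts.values())
    return {topic: n / total
            for topic, n in counts.items()
            if n / total >= min_share}                   # the profile


# Collected samples: which articles the user opened (hypothetical log).
clicks = [
    ("a1", "sports"), ("a2", "sports"), ("a3", "politics"), ("a4", "sports"),
    ("a5", "music"), ("a6", "politics"), ("a7", "sports"), ("a8", "politics"),
]

profile = build_profile(clicks)
print(profile)
```

The user never states any preference here; the profile is inferred only from what was observed, which is the defining property of implicit knowledge.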


22

User Modeling

Explicit Knowledge

• Initialization by the user
• Specification of rules and/or attributes

Example: Rules for E-Mail Sorting and Filtering
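User-specified rules for e-mail sorting can be represented as simple data. The following is an illustrative sketch, not the rule format of any particular mail client: the `Rule` fields, the folder names and the addresses are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    field: str      # message field to inspect, e.g. "sender" or "subject"
    contains: str   # text that must occur in that field (case-insensitive)
    folder: str     # target folder if the rule matches


def sort_mail(message: Dict[str, str], rules: List[Rule],
              default: str = "Inbox") -> str:
    """Return the folder of the first matching rule, else the default folder."""
    for rule in rules:
        if rule.contains.lower() in message.get(rule.field, "").lower():
            return rule.folder
    return default


# Rules as a user might specify them explicitly (hypothetical values).
rules = [
    Rule(field="sender", contains="@example.org", folder="Work"),
    Rule(field="subject", contains="newsletter", folder="Newsletters"),
]

messages = [
    {"sender": "alice@example.org", "subject": "Meeting"},
    {"sender": "news@shop.example", "subject": "Weekly Newsletter"},
    {"sender": "bob@mail.example", "subject": "Hello"},
]
folders = [sort_mail(m, rules) for m in messages]
print(folders)
```

In contrast to implicit knowledge, nothing is learned from behavior: the system only applies what the user has stated, in the order in which the rules were given.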


23

User Profile

Semantic Information (World Knowledge)

Senses of the word “chair”:
• President, Chairman, Person, Chairwoman, chairperson, officer, meetings, Organization, …
• Professorship, chair, position, professor, Pedagogy, …
• chair, Furniture, Support, back, armchair, …
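One simple way to use such sense information in a user profile is to pick the sense of “chair” whose related terms overlap most with the terms already in the profile. This is only a sketch of that idea, assuming plain word overlap as the measure; the sense labels `person`, `position` and `furniture` and the function `best_sense` are hypothetical names, while the related terms are the ones listed above.

```python
# Related terms per sense of "chair", taken from the slide; labels are made up.
SENSES = {
    "person": {"president", "chairman", "chairwoman", "chairperson",
               "officer", "meetings", "organization"},
    "position": {"professorship", "position", "professor", "pedagogy"},
    "furniture": {"furniture", "support", "back", "armchair"},
}


def best_sense(profile_terms, senses=SENSES):
    """Return the sense label with the largest term overlap, or None."""
    terms = {t.lower() for t in profile_terms}
    scores = {label: len(terms & related) for label, related in senses.items()}
    label = max(scores, key=scores.get)
    return label if scores[label] > 0 else None


print(best_sense(["professor", "Pedagogy", "lecture"]))
print(best_sense(["armchair", "sofa"]))
print(best_sense(["guitar"]))
```

A user whose profile is dominated by teaching-related terms would thus get results about professorships rather than furniture when searching for “chair”.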


24

User Profile

Context-aware Information (Situational Knowledge)

User context
• Surroundings (weather, location)
• Company (alone, with friends)
• Mood/emotions
• any user-related factor


25

User Profile

User Characteristics (Level of Knowledge)


26

User Profile

User Characteristics (Culture and Language)


27

User Modeling

Possible Solutions / Scenarios

Personalization (User Profiles)
• User Behaviour Analysis
• Semantic Profiles
• Context-aware Profiles

Knowledge to be used
• Implicit Knowledge
• Explicit Knowledge

Scenarios
• Music, News and Movies

28

User Modeling - Semantic User Profiles

Music and News Scenario

Semantic User Profiling

Goal: Recommend news articles based on the previous behavior of a user.

How:
• User behavior is analyzed.
• Semantic knowledge is linked to current news articles.
• Algorithms were developed to analyze semantic information and user behavior.
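The idea of recommending news from previous behavior can be illustrated with a basic content-based recommender. This is not the algorithm developed in the project, only a sketch under simplifying assumptions: articles are reduced to lists of terms, the profile is the sum of the terms of all read articles, and similarity is the cosine between term-count vectors. The semantic linking step is not modeled, and all names and articles are invented.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(read_articles, candidates, top_n=2):
    """Rank candidate articles by similarity to what the user has read."""
    profile = Counter()
    for terms in read_articles:
        profile.update(terms)
    scored = [(cosine(profile, Counter(terms)), title)
              for title, terms in candidates.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, then by title
    return [title for score, title in scored[:top_n] if score > 0]


# Hypothetical reading history and current news articles.
history = [["football", "league", "goal"], ["football", "transfer"]]
news = {
    "Cup final": ["football", "goal", "league"],
    "Election": ["vote", "party"],
    "Transfer news": ["transfer", "football", "club"],
}
recommended = recommend(history, news)
print(recommended)
```

Linking terms to semantic knowledge, as described on the slide, would allow a match even when an article uses different words for the same concept, which pure term overlap cannot do.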


29

User Modeling - Semantic User Profiles

User Profile Management (UPM)

Our Goals:
• Understand user interests / needs
• Tracking user behaviour on dynamic websites
• Recommend user-relevant information
• Create user profiles
• Tracking and understanding relations between content on dynamic websites
• Including user interests / needs
• Aggregation of different user profiles from different applications

Semantic User Profiling

30

User Modeling - Context-Aware Profiles

Movie Scenario

Definition: “Context is any information that can be used to characterize the situation of an entity” [Dey, 2001]

Goal: implicit identification of context-related preferences based on analysis of users’ interaction histories and current usage contexts

How: Key contextual and metadata features are identified and used for the creation of several sets of user-specific and context-aware recommendations.

Context-aware User Profiling
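A minimal way to make recommendations context-aware is to look only at that part of the interaction history which matches the current usage context. The sketch below assumes a single contextual feature (`company`) and a single metadata feature (`genre`); the data, the feature names and both functions are hypothetical and much simpler than the approach outlined on the slide.

```python
from collections import Counter


def context_preferences(history, context):
    """Count genres of past interactions that match the current context."""
    matching = [entry["genre"] for entry in history
                if all(entry.get(key) == value for key, value in context.items())]
    return Counter(matching)


def recommend_in_context(history, catalog, context, top_n=2):
    """Recommend unseen movies whose genre was preferred in this context."""
    prefs = context_preferences(history, context)
    seen = {entry["title"] for entry in history}
    ranked = sorted(
        (m for m in catalog if m["title"] not in seen and prefs[m["genre"]] > 0),
        key=lambda m: (-prefs[m["genre"]], m["title"]),
    )
    return [m["title"] for m in ranked[:top_n]]


# Hypothetical interaction history with one contextual feature.
history = [
    {"title": "A", "genre": "comedy", "company": "friends"},
    {"title": "B", "genre": "comedy", "company": "friends"},
    {"title": "C", "genre": "drama", "company": "alone"},
    {"title": "D", "genre": "action", "company": "friends"},
]
catalog = [
    {"title": "E", "genre": "comedy"},
    {"title": "F", "genre": "drama"},
    {"title": "G", "genre": "action"},
    {"title": "A", "genre": "comedy"},
]

with_friends = recommend_in_context(history, catalog, {"company": "friends"})
alone = recommend_in_context(history, catalog, {"company": "alone"})
print(with_friends, alone)
```

The same user receives different recommendation sets depending on the context, without ever having stated these preferences explicitly.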

31

User Modeling - Multilingual User Profiles

Web Scenario

32

User Modeling

How can we manage it?

[Figure: User-centric Recommendation. Labels: Multilingualism, …; Location, Time, Item genre, …; Text, Audio, Picture, Video, Maps, ...; Personalized Information Management]


PERSONALIZED INFORMATION MANAGEMENT

Examples

34

Information Management

Definition: Information management is the collection and management of information from one or more sources and the distribution of that information to one or more audiences.

It means the organization of and control over the structure, processing and delivery of information.

Examples: Recommender and Retrieval Systems

35

Information Management

[Figure: Information sources: Text, Picture, Video, Audio, Experts, Maps]

36

Information Management

User Profiles


37

Information Management

Structured Interaction and Retrieval

source: http://web2.wsj2.com/

38

Personalized Information Management

Document-oriented Search

The huge amount of information available is often distributed across multiple, heterogeneous sources and must be manually collected and processed.

39

Personalized Information Management

Document-oriented Search

With document-oriented Personalized Information Management we can filter the flood of information by:
• Intelligent processing of data from heterogeneous sources
• Semantic enrichment and association of collected information
• Search interfaces that are intuitively usable and easy to master
• Personalized filtering and presentation of information
• Interactive visualizations of data
• Personalized document management
• Recommendations of related information
• Support for collaborative knowledge exchange with other users

40

http://www.pia-services.de/

Personalized Information Management

Document-oriented Search - Example


41

[Figure: example interface, labeled components: Directories, Saved Searches, Newsletters, Query Suggestions, Clusters, Tag Cloud, Ratings, Semantic Information, User Profile, Tags, Advanced Search, Paper Details, Expert Search, Assistant, Relevance]

42

Personalized Information Management

Knowledge-driven Search

80% of all knowledge is within the minds of people and therefore difficult to access.

Spree unlocks this knowledge by:
• Identification of expertise in communities and companies.
• Domain-specific ontology-based modeling and classification of expertise.
• Topic-specific classification of user/customer questions.
• Automatic identification of qualified experts.
• Communication services for real-time knowledge exchange.

43

Personalized Information Management - Knowledge-driven Search

Searching for Experts

[Figure: User, Experts, ??]

44

Personalized Information Management - Knowledge-driven Search

Searching for Experts - Example


45

Personalized Information Management - Knowledge-driven Search

Searching for Experts - Example Scenario

We can connect users and experts:
• Questions are automatically analyzed and categorized.
• Qualified experts are identified and contacted.
• Knowledge transfer happens through chat, blog and email in real-time.
• Created knowledge is easily searchable and accessible.

46
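The first two steps of this scenario, categorizing a question and identifying qualified experts, can be sketched as follows. This is a toy illustration and not how Spree works internally: topics are assumed to be plain keyword sets instead of an ontology, and the topic names, keywords and expert names are invented.

```python
import re


def categorize(question, topics):
    """Assign the topic whose keywords overlap most with the question."""
    words = set(re.findall(r"[a-z0-9]+", question.lower()))
    scores = {topic: len(words & keywords) for topic, keywords in topics.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def find_experts(question, topics, experts):
    """Return the names of all experts registered for the question's topic."""
    topic = categorize(question, topics)
    if topic is None:
        return []
    return sorted(name for name, areas in experts.items() if topic in areas)


# Hypothetical topic keywords and expert registrations.
topics = {
    "databases": {"sql", "index", "query"},
    "networking": {"tcp", "router", "dns"},
}
experts = {
    "alice": {"databases"},
    "bob": {"networking", "databases"},
    "carol": {"networking"},
}

db_experts = find_experts("Why is my SQL query slow?", topics, experts)
print(db_experts)
```

The identified experts would then be contacted through the communication services mentioned above, and the resulting answers stored so that they remain searchable.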

Personalized Information Management

Personalized Health Assistance

Developing a health assistant for migrants:
• The prevention service provides easy access to prevention measures.
• The health information service improves the access to health information.

Services are multimodal, context-aware and use multilingual, intuitive user interfaces for easy access for users with little technological experience.

47

Personalized Information Management

Personalized Health Assistance – Inf. Services

48

Role of Information in the Knowledge Society