
Berendt: Knowledge and the Web, 2015, http://www.cs.kuleuven.be/~berendt/teaching/

Knowledge and the Web –

Inferring new knowledge from data(bases):

Knowledge Discovery in Databases

Bettina Berendt

KU Leuven, Department of Computer Science

http://people.cs.kuleuven.be/~bettina.berendt/teaching

Last update: 25 November 2015

Where are we?

Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering

What should we recommend to a customer/user?

What's spam and what isn't?

Classification / prediction: how is that done?

In which weather will someone play (tennis etc.)?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Classification / prediction: What makes people happy?

“Classification along a numerical scale“: other forms of sentiment analysis

When we don't know the classes yet, but need to discover them: What “news stories“ are there today?

What „circles“ of friends do you have?

Topic detection: What topics exist in a collection of texts, and how do they evolve?

News texts, scientific publications, speeches, …

From your questions to the speakers

These days you hear a lot about Big Data. Nobody seems to have a really good definition for it, though. Do you see linked data as a part of Big Data or more as something separate?

A note on last week's remark on the challenges of wrong data “used by machines“ vs. “used by people“ (1)


Forms of data analysis

• Confirmatory
  • Hypothesis testing
  • Experimental procedure, data gathered for this purpose
  • Inferential statistics
  • Causality

• Exploratory
  • Data mining
  • Already-existing data
  • Data mining & machine learning models
  • “Correlation“ (in a wide sense)

• Different basic assumptions, different evaluation methodologies, even when they use the same models (e.g. regression)!

Styles of reasoning

• Descriptive vs. predictive

• Deductive vs. inductive inference

• Data mining prediction is always inductive inference!

From your questions

Are there any economic indicators, related to the (country of representation of a) speaker that influence how many speeches are given by a certain country in the European parliament?

Are economically more powerful countries more influential in the European parliament?

Why does Germany have so much influence on European politics or is this a false statement?

Empiricism and apophenia

Empiricism and apophenia: correlation, causation, and instrumentality

“Correlation replaces causation“: Business logic and prediction vs. explanation ...

A related issue: number of data points / From your questions

Does the weather in Finland during the European Parliament elections affect the voting behaviour of the Finnish people?



The KDD process: The output

“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data“ (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

• non-trivial process: a multiple-step process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

The process part of knowledge discovery

CRISP-DM

• CRoss Industry Standard Process for Data Mining

• A data mining process model that describes approaches commonly used by expert data miners to tackle problems.

Knowledge discovery, machine learning, data mining

• Knowledge discovery = the whole process

• Machine learning = the application of induction algorithms and other algorithms that can be said to „learn“ = the „modeling“ phase

• Data mining: sometimes = KD, sometimes = ML

How much time will you actually spend modelling?

Standard data mining algorithms work on single tables

Important Q for data preparation: How to get from an RDF graph to a table?


Descriptive and predictive modelling / learning

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

From your questions

Economic power can be measured by different factors, including Gross Domestic Product per capita

A simple descriptive statistic: Correlation

[Figure: plots with both axes running from 0 to 25]

“Truly numerical data“: Pearson correlation
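As a minimal sketch of how the Pearson coefficient can be computed: the function name `pearson` and the two small data series are made up for illustration and do not come from the course data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

# Hypothetical data: GDP per capita (arbitrary units) and number of speeches
gdp = [10, 20, 30, 40, 50]
speeches = [12, 25, 31, 48, 52]
r = pearson(gdp, speeches)
print(f"r = {r:.3f}")
```

A value near +1 or -1 indicates a nearly linear relationship; it says nothing about causation.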

From your questions

Is there a correlation between the countries of the speakers who give speeches about the environment and the countries that have the best environmental policies? (pollution, renewable energy, waste generation, etc.)

Rank data: Spearman‘s rank correlation coefficient
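One common way to obtain Spearman's coefficient, sketched here under the assumption that ties receive the average of their ranks, is to rank both variables and then compute the Pearson correlation of the ranks. The helper names `ranks` and `spearman` are illustrative.

```python
def ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    ss_x = sum((a - mx) ** 2 for a in rx)
    ss_y = sum((b - my) ** 2 for b in ry)
    return sp / (ss_x * ss_y) ** 0.5
```

Because only the order of the values matters, any monotonically increasing relationship (for example y = x squared on positive x) yields a coefficient of 1.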

Unclear to me / From your questions

Is there a correlation between BBC coverage and the topic of the talks given at the European Parliament?

Is there a correlation between the government type of a country and how much its members talk about democracy?

Understand your data (1): Understand your concepts and how your variables measure them


Attributes

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

What’s in an attribute?

• Each instance is described by a fixed predefined set of features, its “attributes”
• But: number of attributes may vary in practice
  • Possible solution: “irrelevant value” flag
• Related problem: existence of an attribute may depend on the value of another one
• Possible attribute types (“levels of measurement”, aka “scales of measurement”):
  • Nominal, ordinal, interval and ratio


Task: align example measures, scale of measurement, and allowed operations

Examples:
• Temperature (Celsius)
• Grades at school/university
• Pass or no pass (exam)
• Metres
• Temperature („warm“, „cold“, ...)
• Weather („good“, „bad“)
• Weather („sunny“, „windy“, „cold crisp day“, ...)
• Likert-scale values („on a scale of 1-7, ...“)
• Duration of work tasks (in minutes)
• ECTS credits

Scale levels: nominal, ordinal, interval, ratio

Operations: =, ≠; <, >; +, -; *, /; %; mode; median; arithmetic mean; geom. mean

Nominal quantities

• Values are distinct symbols
  • Values themselves serve only as labels or names
  • Nominal comes from the Latin word for name
• Example: attribute “outlook” from weather data
  • Values: “sunny”, “overcast”, and “rainy”
• No relation is implied among nominal values (no ordering or distance measure)
• Only equality tests can be performed

Ordinal quantities

• Impose order on values
• But: no distance between values defined
• Example: attribute “temperature” in weather data
  • Values: “hot” > “mild” > “cool”
• Note: addition and subtraction don’t make sense
• Example rule: temperature < hot ⇒ play = yes
• Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

Interval quantities

• Interval quantities are not only ordered but measured in fixed and equal units
• Example 1: attribute “temperature” expressed in degrees Fahrenheit
• Example 2: attribute “year”
• Difference of two values makes sense
• Sum or product doesn’t make sense
  • Zero point is not defined!

Ratio quantities

• Ratio quantities are ones for which the measurement scheme defines a zero point
• Example: attribute “distance”
  • Distance between an object and itself is zero
• Ratio quantities are treated as real numbers
  • All mathematical operations are allowed
• But: is there an “inherently” defined zero point?
  • Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)
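The cumulative nature of the four scale levels can be sketched in code: each level permits everything the previous one permits, plus more. The mapping below follows the usual textbook assignment of summary statistics to scale levels; the names `Scale`, `REQUIRED_SCALE` and `allowed_statistics` are illustrative.

```python
from enum import IntEnum

class Scale(IntEnum):
    """Levels of measurement, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4

# Weakest scale level on which each summary statistic is meaningful
REQUIRED_SCALE = {
    "mode": Scale.NOMINAL,             # needs only equality tests
    "median": Scale.ORDINAL,           # needs an ordering
    "arithmetic mean": Scale.INTERVAL, # needs meaningful differences
    "geometric mean": Scale.RATIO,     # needs a true zero point
}

def allowed_statistics(scale):
    """Return the statistics that make sense for data on the given scale."""
    return [name for name, needed in REQUIRED_SCALE.items() if scale >= needed]

print(allowed_statistics(Scale.ORDINAL))
```

For an ordinal attribute such as “temperature” with values hot/mild/cool, this yields the mode and the median but not the arithmetic mean.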

Understanding your data (2): Visualize!

[Figure: plots with both axes running from 0 to 25]

Understanding your data (3): How to visualize non-numerical data?

How could you visualize data on this to avoid drawing wrong conclusions already at the outset?


Supervised and unsupervised learning and examples dealt with here

• Supervised learning

• Classification / classifier learning

• Regression

• Unsupervised learning

• Association rule mining

• Clustering

A question to the speakers that I don‘t quite understand

A lot of hierarchies in RDF specifications are built using some human compromise between the properties of a concept and the hierarchy in which the concept is classified. Unsupervised learners already outperform humans in some classification tasks.

How does this automatisation influence the availability of linked open data?

How to: our proposal

• Basic KDD techniques: frame your research question in terms of one of these tasks, use software to analyse your data (e.g. RapidMiner)

• Advanced KDD techniques (topic detection, sentiment analysis): use 3rd-party software (Sebastijan will provide a list)

• More advanced ideas? Ask / consult with us!


From your questions

Which European politicians have a high chance of receiving a Nobel Prize?

For the sake of the argument, let us rephrase this a bit to give a typical classification task (see later for a more appropriate formalization):

People with what features (feature values) get a Nobel Prize?

Constructing decision trees

• Strategy: top down, in recursive divide-and-conquer fashion
  • First: select attribute for root node; create branch for each possible attribute value
  • Then: split instances into subsets, one for each branch extending from the node
  • Finally: repeat recursively for each branch, using only instances that reach the branch
• Stop if all instances have the same class
• Will illustrate key ideas with ID3, a very simple decision-tree learning algorithm

Which attribute to select?

Criterion for attribute selection

• Which is the best attribute?
  • Want to get the smallest tree
  • Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets
• Strategy: choose attribute that gives greatest information gain

Computing information

• Measure information in bits
  • Given a probability distribution, the info required to predict an event is the distribution’s entropy
  • Entropy gives the information required in bits (can involve fractions of bits!)
• Formula for computing the entropy:

$\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$

Example: attribute Outlook

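$\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1 \log_2 1 - 0 \log_2 0 = 0$ bits

$\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -\tfrac{2}{5} \log_2 \tfrac{2}{5} - \tfrac{3}{5} \log_2 \tfrac{3}{5} = 0.971$ bits

$\mathrm{info}([2,3],[4,0],[3,2]) = \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693$ bits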
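The Outlook example can be checked numerically with a short sketch. The function names `entropy` and `info` mirror the notation used here but are otherwise illustrative; a term with probability 0 is treated as contributing 0 bits, which is the usual convention.

```python
import math

def entropy(*probabilities):
    """Entropy in bits of a probability distribution; 0 * log(0) counts as 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def info(*subsets):
    """Average entropy of the subsets, weighted by subset size.

    Each subset is a list of class counts, e.g. [2, 3] for 2 yes and 3 no.
    """
    total = sum(sum(counts) for counts in subsets)
    result = 0.0
    for counts in subsets:
        size = sum(counts)
        result += size / total * entropy(*(c / size for c in counts))
    return result

print(info([4, 0]))                  # pure subset
print(info([2, 3]))                  # sunny subset
print(info([2, 3], [4, 0], [3, 2]))  # all three Outlook subsets
```

The last value comes out slightly above 0.6935, so it matches the 0.693 bits quoted for the split up to rounding.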

Computing information gain

Information gain: information before splitting – information after splitting

Information gain for attributes from weather data:

gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits

gain(Outlook) = info([9,5]) – info([2,3],[4,0],[3,2]) = 0.940 – 0.693 = 0.247 bits

Continuing to split

gain(Temperature ) = 0.571 bits

gain(Humidity ) = 0.971 bits

gain(Windy ) = 0.020 bits

Final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes

Splitting stops when data can’t be split any further
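The recursive procedure can be sketched as a bare-bones ID3 on the 14 weather instances. This is an illustrative simplification (nominal attributes only, no pruning, majority class when no attributes remain), not a full implementation; the names `DATA`, `gain` and `id3` are chosen for this sketch.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temp", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", False, "no"), ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"), ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
DATA = [dict(zip(ATTRIBUTES + ["play"], row)) for row in ROWS]

def info(instances):
    """Entropy (in bits) of the class distribution of the instances."""
    counts = Counter(inst["play"] for inst in instances)
    n = len(instances)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def split(instances, attribute):
    """Group instances by their value of the given attribute."""
    subsets = {}
    for inst in instances:
        subsets.setdefault(inst[attribute], []).append(inst)
    return subsets

def gain(instances, attribute):
    """Information before splitting minus information after splitting."""
    subsets = split(instances, attribute).values()
    after = sum(len(s) / len(instances) * info(s) for s in subsets)
    return info(instances) - after

def id3(instances, attributes):
    """Return a nested dict {attribute: {value: subtree}} or a class label."""
    classes = Counter(inst["play"] for inst in instances)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]  # leaf: (majority) class
    best = max(attributes, key=lambda a: gain(instances, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3(subset, rest)
                   for value, subset in split(instances, best).items()}}

tree = id3(DATA, ATTRIBUTES)
print(tree)
```

On this data the root is Outlook; the sunny branch splits on Humidity, the rainy branch on Windy, and the overcast branch is a pure “yes” leaf.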

Wishlist for a purity measure

Properties we require from a purity measure:

• When node is pure, measure should be zero
• When impurity is maximal (i.e. all classes equally likely), measure should be maximal
• Measure should obey multistage property (i.e. decisions can be made in several stages)

Entropy is the only function that satisfies all three properties!

Properties of the entropy

The multistage property:

Simplification of computation:

Note: instead of maximizing info gain we could just minimize information
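A numerical check may make the two properties concrete. The class counts [2, 3, 4] are an assumed example, not taken from the surviving slide text. The multistage property says that deciding among three classes at once costs the same as first deciding "class 1 or not" and then, in 7 of 9 cases, deciding between classes 2 and 3. The simplification avoids computing fractions inside the logarithms.

```python
import math

def info(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# One-stage decision among three classes
direct = info([2, 3, 4])

# Two-stage decision: class 1 vs. rest, then class 2 vs. class 3
staged = info([2, 7]) + 7 / 9 * info([3, 4])

# Simplified computation: logarithms of the raw counts only
simplified = (-2 * math.log2(2) - 3 * math.log2(3) - 4 * math.log2(4)
              + 9 * math.log2(9)) / 9

print(direct, staged, simplified)
```

All three expressions evaluate to the same number of bits, up to floating-point error.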

Variants

• Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
• Various improvements, e.g. C4.5: deals with numeric attributes, missing values, noisy data; other measures instead of information gain

(details see exercise session / individual)

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

Classification rules

• Popular alternative to decision trees
• Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)
• Tests are usually logically ANDed together (but may also be general logical expressions)
• Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule
• Individual rules are often logically ORed together
  • Conflicts arise if different conclusions apply

An example

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
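One way to avoid conflicts between rules, assumed here for illustration, is to treat the rule set as an ordered list in which the first matching rule decides. The representation as (condition, conclusion) pairs and the name `classify` are choices made for this sketch.

```python
# Each rule is a pair (condition, conclusion); conditions test one instance.
RULES = [
    (lambda d: d["outlook"] == "sunny" and d["humidity"] == "high", "no"),
    (lambda d: d["outlook"] == "rainy" and d["windy"], "no"),
    (lambda d: d["outlook"] == "overcast", "yes"),
    (lambda d: d["humidity"] == "normal", "yes"),
    (lambda d: True, "yes"),  # default rule: none of the above
]

def classify(day):
    """Return the conclusion of the first rule whose condition holds."""
    for condition, conclusion in RULES:
        if condition(day):
            return conclusion

day = {"outlook": "rainy", "humidity": "normal", "windy": True}
print(classify(day))  # rule 2 fires before rule 4
```

The example day matches both the second and the fourth rule, which draw different conclusions; the ordering resolves the conflict in favour of “no”.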

Transition: Trees for numeric prediction

Regression: the process of computing an expression that predicts a numeric quantity

Regression tree: “decision tree” where each leaf predicts a numeric quantity

Predicted value is average value of training instances that reach the leaf

Model tree: “regression tree” with linear regression models at the leaf nodes

Linear patches approximate continuous function

An example

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…         …            …         …      …


From your questions

Lead question

“How does the dependent variable depend on the independent one?“

“Can we predict the likely value of the dependent variable for a new data instance (with a given value of the independent variable)?“

Introduction to Linear Regression (the statistical approach)

The Pearson correlation measures the degree to which a set of data points form a straight line relationship.

Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.

Slides 44-49: slightly adapted from https://home.ubalt.edu/tmitch/631/PowerPoint_Lectures/chapter17/chapter17.ppt

Introduction to Linear Regression (cont.)

Any straight line can be represented by an equation of the form Y = bX + a, where b and a are constants.

The value of b is called the slope constant and determines the direction and degree to which the line is tilted.

The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.

How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.

The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.

The regression equation is designed to produce the minimum sum of squared errors.

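The equation for the regression line is $\hat{Y} = bX + a$, where $b = \frac{\sum (X - M_X)(Y - M_Y)}{\sum (X - M_X)^2}$ and $a = M_Y - b M_X$, with $M_X$ and $M_Y$ the means of X and Y.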
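The least-squares line for Y = bX + a can be computed directly from the data. The sketch below uses the standard closed-form solution; the function names and the four data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope b and intercept a of the least-squares line Y = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sum of squared deviations of X
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return b, a

def sum_squared_errors(xs, ys, b, a):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # lies exactly on Y = 2X + 1
b, a = fit_line(xs, ys)
print(b, a, sum_squared_errors(xs, ys, b, a))
```

Any other choice of b and a gives a larger sum of squared errors on the same data, which is what "best fit" means here.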

From your questions

Human Development Index

Multiple regression

(details: see exercise session)

From your questions

This has (assumed) categorical predictors, which can be modelled by dummy variables in a linear regression.

Dummy variables
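Dummy coding turns a categorical predictor with k values into k - 1 columns of 0/1, with one value serving as the reference category. The sketch below is illustrative: the function name `dummy_code`, the choice of the alphabetically first value as default reference, and the example values are all assumptions.

```python
def dummy_code(values, reference=None):
    """Encode categorical values as 0/1 columns, omitting a reference level.

    Returns (column_names, rows); a row of all zeros is the reference level.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    columns = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in columns]
            for value in values]
    return columns, rows

# Hypothetical categorical predictor
government = ["monarchy", "republic", "federation", "republic"]
columns, rows = dummy_code(government)
print(columns)
for value, row in zip(government, rows):
    print(value, row)
```

The resulting 0/1 columns can be used as ordinary numeric predictors in a multiple regression; each coefficient then expresses the difference from the reference category.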

From your questions

Logistic regression – input data

Logistic regression – fitting a curve

Logistic regression - prediction

From your questions

Note: Logistic regression also exists in multivariate form (= with multiple predictor variables)
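The three steps (input data, fitting a curve, prediction) can be sketched for a single predictor. The fitting method used here, plain gradient descent on the log-loss, is one simple option chosen for illustration; the data, learning rate and step count are made up.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(w * x + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between predicted probability and observed 0/1 outcome
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b

# Input data: a numeric predictor and a binary outcome (hypothetical)
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(xs, ys)

def predict_probability(x):
    """Predicted probability that the outcome is 1 for predictor value x."""
    return sigmoid(w * x + b)

print(predict_probability(1), predict_probability(6))
```

A class prediction is obtained by thresholding the probability, typically at 0.5.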


From your questions

To what extent are a politician‘s topics of choice influenced by * their field of study during higher education?

* phrasing: See remark on “correlation vs. causation“ above!

Are speeches in the European Parliament related to what the public think or search online?

Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)

Where to put: spaghetti, butter?

"Market basket data": attributes with boolean domains

In a table each row is a basket (aka transaction)

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce

Solution approach: The apriori principle and the pruning of the search tree (1)

[Figure: lattice of itemsets]

1-itemsets: {spaghetti}, {tomato sauce}, {bread}, {butter}

2-itemsets: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}

3-itemsets: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}

4-itemset: {spaghetti, tomato sauce, bread, butter}

Solution approach: The apriori principle and the pruning of the search tree (2)

[Figure: lattice of itemsets, continued]

More formally: Generating large k-itemsets with Apriori

Min. support = 40%

Step 1: candidate 1-itemsets
• spaghetti: support = 3 (60%)
• tomato sauce: support = 3 (60%)
• bread: support = 4 (80%)
• butter: support = 1 (20%)

Step 2: large 1-itemsets
• {spaghetti}
• {tomato sauce}
• {bread}

candidate 2-itemsets
• {spaghetti, tomato sauce}: support = 2 (40%)
• {spaghetti, bread}: support = 2 (40%)
• {tomato sauce, bread}: support = 2 (40%)

Step 3: large 2-itemsets
• {spaghetti, tomato sauce}
• {spaghetti, bread}
• {tomato sauce, bread}

candidate 3-itemsets
• {spaghetti, tomato sauce, bread}: support = 1 (20%)

Step 4: large 3-itemsets
• { }
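The level-wise search can be sketched compactly on the five example baskets. This is a simplified illustration of the Apriori idea (join large itemsets, prune candidates with a non-large subset, count support), not an optimized implementation; the names `TRANSACTIONS`, `support_count` and `apriori` are chosen for the sketch.

```python
from itertools import combinations

TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    """Return {large itemset: support count} for all large itemsets."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]
    large = {}
    k = 1
    while candidates:
        counts = {c: support_count(c, transactions) for c in candidates}
        large_k = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        large.update(large_k)
        k += 1
        # Join step: unions of two large itemsets that have exactly k items
        joined = {a | b for a in large_k for b in large_k if len(a | b) == k}
        # Prune step (apriori principle): all (k-1)-subsets must be large
        candidates = [c for c in joined
                      if all(frozenset(s) in large_k
                             for s in combinations(c, k - 1))]
    return large

large_itemsets = apriori(TRANSACTIONS, min_support=0.4)
for itemset, count in large_itemsets.items():
    print(sorted(itemset), count)
```

With a minimum support of 40% this finds three large 1-itemsets and three large 2-itemsets; butter and the only candidate 3-itemset fall below the threshold.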

From itemsets to association rules

Schema: If subset then large k-itemset with support s and confidence c

s = (support of large k-itemset) / # tuples

c = (support of large k-itemset) / (support of subset)

Example:

If {spaghetti} then {spaghetti, tomato sauce}

Support: s = 2 / 5 (40%)

Confidence: c = 2 / 3 (67%)
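Support and confidence of a rule follow directly from the two definitions. The sketch below recomputes the example on the five baskets; the function name `rule_measures` is illustrative.

```python
TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def rule_measures(subset, itemset, transactions):
    """Support and confidence of the rule 'if subset then itemset'."""
    n_itemset = sum(1 for t in transactions if itemset <= t)
    n_subset = sum(1 for t in transactions if subset <= t)
    support = n_itemset / len(transactions)
    confidence = n_itemset / n_subset
    return support, confidence

s, c = rule_measures({"spaghetti"}, {"spaghetti", "tomato sauce"}, TRANSACTIONS)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Two of the five baskets contain both items, and two of the three spaghetti baskets also contain tomato sauce.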

From local associations to global models: clustering

To what extent are a politician‘s topics of choice influenced by their field of study during higher education?

Can we find clusters of educational background and topics?


The basic idea of clustering: group similar things

[Figure: data points plotted against “Attribute 1”, forming two groups labelled “Group 1” and “Group 2”]

Concepts in Clustering

• Defining distance between points
  • Euclidean distance
  • any other distance (cityblock metric, Levenshtein, Jaccard sim. ...)
• A good clustering is one where
  • (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,
  • (Inter-cluster distance) while the distances between different clusters are maximized
  • Objective to minimize: F(Intra,Inter)
• Clusters can be evaluated with “internal” as well as “external” measures
  • Internal measures are related to the inter/intra cluster distance
  • External measures are related to how well the current clusters represent the “true” classes
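Three of the distance and similarity notions named above can be written in a few lines each. The function names are illustrative; Jaccard is shown as a similarity between sets, from which a distance is obtained as 1 minus the similarity.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cityblock(p, q):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard_similarity(a, b):
    """Share of common elements among all elements of two sets."""
    return len(a & b) / len(a | b)

print(euclidean((0, 0), (3, 4)))
print(cityblock((0, 0), (3, 4)))
print(jaccard_similarity({"energy", "climate"}, {"climate", "trade"}))
```

Which measure is appropriate depends on the attribute types: Euclidean and cityblock need numeric attributes, whereas Jaccard works on sets such as the topics a speaker addresses.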

K Means Example (K=2)

Pick seeds

Reassign clusters

Compute centroids

Reassign clusters

Compute centroids

Reassign clusters

Converged!

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

K-means algorithm
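The loop from the example (pick seeds, reassign clusters, compute centroids, repeat until nothing changes) can be sketched as follows. Euclidean distance and explicitly given seeds are assumptions of this sketch, and the six points are made up; `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, seeds, max_iterations=100):
    """Cluster points around len(seeds) centroids; returns (centroids, assignment)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iterations):
        # Reassign clusters: each point goes to its nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment
        # Compute centroids: mean of the points assigned to each cluster
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, seeds=[(1, 1), (2, 1)])
print(centroids, assignment)
```

Both seeds start in the lower-left group, so the first pass assigns four points to the second cluster; after one centroid update the assignment settles on the two visually obvious groups. The result of k-means generally depends on the seeds.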


Clustering non-numerical data

(to follow)


Next lecture
