Berendt: Knowledge and the Web, 2015, http://www.cs.kuleuven.be/~berendt/teaching/

Knowledge and the Web –

Inferring new knowledge from data(bases):

Knowledge Discovery in Databases

Bettina Berendt

KU Leuven, Department of Computer Science

http://people.cs.kuleuven.be/~bettina.berendt/teaching

Last update: 25 November 2015

Where are we?

Agenda

Motivation: application examples

Forms of data analysis and styles of reasoning

The process of knowledge discovery

Description and prediction

Data understanding: two important notes (among other issues)

Types of learning tasks

Classification

Regression

Association-rule mining

Clustering

What should we recommend to a customer/user?

What’s spam and what isn’t?

Classification / prediction: how is that done?

In which weather will someone play (tennis etc.)?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Classification / prediction: What makes people happy?

“Classification along a numerical scale”: other forms of sentiment analysis

When we don’t know the classes yet, but need to discover them: What “news stories” are there today?

What “circles” of friends do you have?

Topic detection: What topics exist in a collection of texts, and how do they evolve?

News texts, scientific publications, speeches, …

From your questions to the speakers

These days you hear a lot about Big Data. Nobody seems to have a really good definition for it, though. Do you see linked data as a part of Big Data or more as something separate?

A note on last week’s remark on the challenges of wrong data “used by machines” vs. “used by people” (1)

A note on last week’s remark on the challenges of wrong data “used by machines” vs. “used by people” (2)

A note on last week’s remark on the challenges of wrong data “used by machines” vs. “used by people” (3)

Forms of data analysis

• Confirmatory
  • Hypothesis testing
  • Experimental procedure, data gathered for this purpose
  • Inferential statistics
  • Causality

• Exploratory
  • Data mining
  • Already-existing data
  • Data mining & machine learning models
  • “Correlation” (in a wide sense)

• Different basic assumptions, different evaluation methodologies, even when they use the same models (e.g. regression)!

Styles of reasoning

• Descriptive vs. predictive

• Deductive vs. inductive inference

• Data mining prediction is always inductive inference!

From your questions

Are there any economic indicators, related to the (country of representation of a) speaker that influence how many speeches are given by a certain country in the European parliament?

Are economically more powerful countries more influential in the European parliament?

Why does Germany have so much influence on European politics or is this a false statement?

Empiricism and apophenia

Empiricism and apophenia: correlation, causation, and instrumentality

“Correlation replaces causation”: Business logic and prediction vs. explanation ...

A related issue: number of data points / From your questions

Does the weather in Finland during the European Parliament elections affect the voting behaviour of the Finnish people?

The KDD process: The output

“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)

• non-trivial process: Multiple process
• valid: Justified patterns/models
• novel: Previously unknown
• useful: Can be used
• understandable: by human and machine

The process part of knowledge discovery

CRISP-DM

• CRoss Industry Standard Process for Data Mining

• a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

Knowledge discovery, machine learning, data mining

Knowledge discovery = the whole process

Machine learning = the application of induction algorithms and other algorithms that can be said to “learn”; = “modeling” phase

Data mining: sometimes = KD, sometimes = ML

How much time will you actually spend modelling?

Standard data mining algorithms work on single tables

Important Q for data preparation: How to get from an RDF graph to a table?
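
One possible answer to this question, as a minimal sketch only: treat every subject as a row and every predicate as a column. The triples, the `ex:` prefix and the helper name `triples_to_table` are made up for illustration, and joining multiple values into one cell is just one of several design choices (others: one column per value, or counts).

```python
from collections import defaultdict

def triples_to_table(triples):
    """Pivot (subject, predicate, object) triples into one row per subject."""
    cells = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        cells[s][p].append(o)
    columns = sorted({p for _, p, _ in triples})
    rows = []
    for s in sorted(cells):
        row = {"subject": s}
        for p in columns:
            values = cells[s].get(p, [])
            # Multi-valued predicates are joined; missing ones become None.
            row[p] = "|".join(values) if values else None
        rows.append(row)
    return columns, rows

# Hypothetical example triples (not taken from the course data).
triples = [
    ("ex:speaker1", "ex:country", "Finland"),
    ("ex:speaker1", "ex:topic", "environment"),
    ("ex:speaker1", "ex:topic", "energy"),
    ("ex:speaker2", "ex:country", "Belgium"),
]
columns, rows = triples_to_table(triples)
for row in rows:
    print(row)
```

The missing value for `ex:speaker2` illustrates why data preparation matters: a graph has no notion of an empty cell, a table does.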

Descriptive and predictive modelling / learning

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

From your questions

Are economically more powerful countries more influential in the European parliament?

...

Economically powerful countries can be based on different factors, including

Gross Domestic Product per Capita

...

A simple descriptive statistic: Correlation

[Figure: four scatter plots with panels y1, y2, y3 and y4, each over a horizontal axis from 0 to 25]

“Truly numerical data”: Pearson correlation
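
The formula on this slide is not preserved in the text. As a sketch of the standard definition (sum of products of deviations, divided by the square root of the product of the two sums of squared deviations), the Pearson coefficient can be computed as follows. The function name and the sample data are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and the two sums of squared deviations.
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # lies exactly on a straight line
print(pearson(xs, ys))
print(pearson(xs, ys[::-1]))   # same line, but decreasing
```

A value near +1 or -1 indicates a nearly straight-line relationship; it says nothing about causation.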

From your questions

Is there a correlation between the countries of the speakers who give speeches about the environment and the countries that have the best environmental policies? (pollution, renewable energy, waste generation, etc.)

Rank data: Spearman’s rank correlation coefficient
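
The slide's formula is not preserved either. A common way to define Spearman's coefficient, sketched here under that assumption, is to replace the values by their ranks (ties get the average rank) and then compute the Pearson correlation of the ranks. Function names and data are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]  # monotone, but not linear
print(spearman(xs, ys))
```

Because only the order of the values enters, the coefficient is suitable for ordinal data, where differences between values are not defined.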

Unclear to me / From your questions

Is there a correlation between BBC coverage and the topic of the talks given at the European Parliament?

Is there a correlation between the government type of a country and how much its members talk about democracy?

Understand your data (1): Understand your concepts and how your variables measure them

Attributes

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

What’s in an attribute?

• Each instance is described by a fixed predefined set of features, its “attributes”
• But: number of attributes may vary in practice
  • Possible solution: “irrelevant value” flag
• Related problem: existence of an attribute may depend on the value of another one
• Possible attribute types (“levels of measurement”, aka “scales of measurement”):
  • Nominal, ordinal, interval and ratio

Task: align example measures, scale of measurement, and allowed operations

Columns to align: Example | Scale level | Operations

Examples:
• Temperature (Celsius)
• Grades at school/university
• Pass or no pass (exam)
• Metres
• Temperature (“warm”, “cold”, ...)
• Weather (“good”, “bad”)
• Weather (“sunny”, “windy”, “cold crisp day”, ...)
• Likert-scale values (“on a scale of 1-7, ...”)
• Duration of work tasks (in minutes)
• ECTS credits

Scale levels: nominal, ordinal, interval, ratio

Operations: =, ≠ | <, > | +, - | *, / | % | mode | median | arithmetic mean | geom. mean

Nominal quantities

• Values are distinct symbols
  • Values themselves serve only as labels or names
  • Nominal comes from the Latin word for name
• Example: attribute “outlook” from weather data
  • Values: “sunny”, “overcast”, and “rainy”
• No relation is implied among nominal values (no ordering or distance measure)
  • Only equality tests can be performed

Ordinal quantities

• Impose order on values
• But: no distance between values defined
• Example: attribute “temperature” in weather data
  • Values: “hot” > “mild” > “cool”
• Note: addition and subtraction don’t make sense
• Example rule: temperature < hot ⇒ play = yes
• Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

Interval quantities

• Interval quantities are not only ordered but measured in fixed and equal units
• Example 1: attribute “temperature” expressed in degrees Fahrenheit
• Example 2: attribute “year”
• Difference of two values makes sense
• Sum or product doesn’t make sense
  • Zero point is not defined!

Ratio quantities

• Ratio quantities are ones for which the measurement scheme defines a zero point
• Example: attribute “distance”
  • Distance between an object and itself is zero
• Ratio quantities are treated as real numbers
  • All mathematical operations are allowed
• But: is there an “inherently” defined zero point?
  • Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)

Understanding your data (2): Visualize!

[Figure: four scatter plots with panels y1, y2, y3 and y4, each over a horizontal axis from 0 to 25]

Understanding your data (3): How to visualize non-numerical data?

Is there a correlation between the government type of a country and how much its members talk about democracy?

How could you visualize data on this to avoid drawing wrong conclusions already at the outset?

Supervised and unsupervised learning and examples dealt with here

• Supervised learning
  • Classification / classifier learning
  • Regression

• Unsupervised learning
  • Association rule mining
  • Clustering

A question to the speakers that I don’t quite understand

A lot of hierarchies in RDF specifications are built using some human compromise between the properties of a concept and the hierarchy in which the concept is classified. Unsupervised learners already outperform humans in some classification tasks.

How does this automatisation influence the availability of linked open data?

How to: our proposal

• Basic KDD techniques: frame your research question in terms of one of these tasks, use software to analyse your data (e.g. RapidMiner)

• Advanced KDD techniques (topic detection, sentiment analysis): use 3rd-party software (Sebastijan will provide a list)

• More advanced ideas? Ask / consult with us!

From your questions

Which European politicians have a high chance of receiving a Nobel Prize?

For the sake of the argument, let us rephrase this a bit to give a typical classification task (see later for a more appropriate formalization):

People with what features (feature values) get a Nobel Prize?

Constructing decision trees

• Strategy: top down
  • Recursive divide-and-conquer fashion
  • First: select attribute for root node; create branch for each possible attribute value
  • Then: split instances into subsets, one for each branch extending from the node
  • Finally: repeat recursively for each branch, using only instances that reach the branch
• Stop if all instances have the same class
• Will illustrate key ideas with ID3, a very simple decision-tree learning algorithm

Which attribute to select?

Criterion for attribute selection

• Which is the best attribute?
  • Want to get the smallest tree
  • Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets
• Strategy: choose attribute that gives greatest information gain

Computing information

• Measure information in bits
  • Given a probability distribution, the info required to predict an event is the distribution’s entropy
  • Entropy gives the information required in bits (can involve fractions of bits!)
• Formula for computing the entropy:
  $\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log p_1 - p_2 \log p_2 - \ldots - p_n \log p_n$

Example: attribute Outlook

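$\mathrm{info}([4,0]) = \mathrm{entropy}(1, 0) = -1 \log 1 - 0 \log 0 = 0$ bits

$\mathrm{info}([2,3]) = \mathrm{entropy}(2/5, 3/5) = -2/5 \log 2/5 - 3/5 \log 3/5 = 0.971$ bits

$\mathrm{info}([2,3],[4,0],[3,2]) = (5/14) \times 0.971 + (4/14) \times 0 + (5/14) \times 0.971 = 0.693$ bits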
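
The numbers above can be checked with a few lines of code. This is a sketch that assumes base-2 logarithms (information in bits) and the usual convention that 0 log 0 counts as 0; the function names are illustrative.

```python
import math

def entropy(*probabilities):
    """Entropy in bits; terms with p = 0 contribute nothing (0 log 0 = 0)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def info(counts):
    """Entropy of a node given its class counts, e.g. [2, 3]."""
    total = sum(counts)
    return entropy(*(c / total for c in counts))

def info_split(subsets):
    """Average entropy of the subsets, weighted by subset size."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * info(s) for s in subsets)

print(info([4, 0]))                            # pure node
print(info([2, 3]))                            # close to 0.971
print(info_split([[2, 3], [4, 0], [3, 2]]))    # close to 0.693
```

Without intermediate rounding the last value is about 0.6935, which the slide reports as 0.693 bits.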

Computing information gain

• Information gain: information before splitting − information after splitting

  gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits

• Information gain for attributes from weather data:

  gain(Outlook) = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity) = 0.152 bits
  gain(Windy) = 0.048 bits
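
As a sketch, the four gains can be recomputed from the nominal weather table shown earlier in the slides. The data layout (tuples with the class in the last position) and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Nominal weather data: (outlook, temperature, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def info(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-c / total * math.log2(c / total)
               for c in Counter(labels).values())

def gain(data, attribute_index):
    """Information before splitting minus information after splitting."""
    labels = [row[-1] for row in data]
    subsets = defaultdict(list)
    for row in data:
        subsets[row[attribute_index]].append(row[-1])
    after = sum(len(s) / len(data) * info(s) for s in subsets.values())
    return info(labels) - after

for i, name in enumerate(ATTRIBUTES):
    print(f"gain({name}) = {gain(DATA, i):.3f} bits")
```

Outlook has the largest gain, which is why it is selected for the root node.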

Continuing to split (within the branch Outlook = Sunny)

gain(Temperature ) = 0.571 bits

gain(Humidity ) = 0.971 bits

gain(Windy ) = 0.020 bits

Final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes

Splitting stops when data can’t be split any further
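
The whole procedure (select the attribute with the greatest information gain, split, recurse, stop at pure nodes) can be sketched compactly. This is an illustrative ID3-style implementation for nominal attributes only, not the code used in the course; the tree representation (a class label, or a pair of attribute name and branches) is an assumption of this example.

```python
import math
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
DATA = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def info(rows):
    """Entropy (in bits) of the class labels of the given rows."""
    counts = Counter(row[-1] for row in rows)
    return sum(-c / len(rows) * math.log2(c / len(rows))
               for c in counts.values())

def gain(rows, i):
    """Information gain of splitting the rows on attribute index i."""
    values = {row[i] for row in rows}
    subsets = [[r for r in rows if r[i] == v] for v in values]
    return info(rows) - sum(len(s) / len(rows) * info(s) for s in subsets)

def id3(rows, attribute_indices):
    """Return a class label (leaf) or (attribute name, {value: subtree})."""
    classes = [row[-1] for row in rows]
    if len(set(classes)) == 1 or not attribute_indices:
        # Pure node, or nothing left to split on: predict the majority class.
        return Counter(classes).most_common(1)[0][0]
    best = max(attribute_indices, key=lambda i: gain(rows, i))
    remaining = [i for i in attribute_indices if i != best]
    branches = {}
    for value in sorted({row[best] for row in rows}, key=str):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining)
    return (ATTRIBUTES[best], branches)

def classify(tree, instance):
    """Follow the branches until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[ATTRIBUTES.index(attribute)]]
    return tree

tree = id3(DATA, list(range(len(ATTRIBUTES))))
print(tree)
```

On this data the root tests outlook, the sunny branch tests humidity and the rainy branch tests windy. Values that never occur in the training data for a branch are not handled in this sketch.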

Wishlist for a purity measure

• Properties we require from a purity measure:
  • When node is pure, measure should be zero
  • When impurity is maximal (i.e. all classes equally likely), measure should be maximal
  • Measure should obey multistage property (i.e. decisions can be made in several stages)
• Entropy is the only function that satisfies all three properties!

Properties of the entropy

The multistage property:

Simplification of computation:

Note: instead of maximizing info gain we could just minimize information
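
The two formulas named on this slide are not preserved in the text. As a hedged reconstruction from the standard definition of entropy (the notation on the original slide may differ), the multistage property for three classes, and the simplified computation from class counts $n_1, \ldots, n_k$ with $N = \sum_i n_i$, can be written as:

```latex
\mathrm{entropy}(p, q, r) = \mathrm{entropy}(p, q + r)
  + (q + r) \cdot \mathrm{entropy}\!\left(\frac{q}{q + r}, \frac{r}{q + r}\right)

\mathrm{info}([n_1, \ldots, n_k])
  = -\sum_{i=1}^{k} \frac{n_i}{N} \log \frac{n_i}{N}
  = \frac{-\sum_{i=1}^{k} n_i \log n_i + N \log N}{N}
```

The second line follows from $\log(n_i / N) = \log n_i - \log N$. Since the information before splitting is the same for every candidate attribute, choosing the attribute with the smallest information after splitting gives the same result as choosing the one with the greatest gain.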

Variants

• Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan
• Various improvements, e.g.
  • C4.5: deals with numeric attributes, missing values, noisy data
  • other measures instead of information gain
  (details: see exercise session / individual)

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…         …            …         …      …

Classification rules

• Popular alternative to decision trees
• Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)
• Tests are usually logically ANDed together (but may also be general logical expressions)
• Consequent (conclusion): classes, set of classes, or probability distribution assigned by rule
• Individual rules are often logically ORed together
  • Conflicts arise if different conclusions apply

An example

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
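
One way to resolve conflicts between rules is to read the rule set as an ordered list in which the first matching rule decides. The sketch below makes that assumption explicit; the slide itself does not say how the rules are meant to be combined, and the function name is illustrative.

```python
def classify(outlook, humidity, windy):
    """Apply the rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above

# A rainy, windy day with normal humidity matches rule 2 ("no") and
# rule 4 ("yes"); the ordering resolves the conflict in favour of rule 2.
print(classify("rainy", "normal", True))
```

Read in this order, the rules classify all 14 instances of the nominal weather table correctly.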

Transition: Trees for numeric prediction

Regression: the process of computing an expression that predicts a numeric quantity

Regression tree: “decision tree” where each leaf predicts a numeric quantity

Predicted value is average value of training instances that reach the leaf

Model tree: “regression tree” with linear regression models at the leaf nodes

Linear patches approximate continuous function

An example

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…         …            …         …      …

From your questions

Are economically more powerful countries more influential in the European parliament?

...

Economically powerful countries can be based on different factors, including

Gross Domestic Product per Capita

...

Lead question

“How does the dependent variable depend on the independent one?”

“Can we predict the likely value of the dependent variable for a new data instance (with a given value of the independent variable)?”

Introduction to Linear Regression (the statistical approach)

The Pearson correlation measures the degree to which a set of data points form a straight line relationship.

Regression is a statistical procedure that determines the equation for the straight line that best fits a specific set of data.

Slides 44-49: slightly adapted from https://home.ubalt.edu/tmitch/631/PowerPoint_Lectures/chapter17/chapter17.ppt

Introduction to Linear Regression (cont.)

Any straight line can be represented by an equation of the form Y = bX + a, where b and a are constants.

The value of b is called the slope constant and determines the direction and degree to which the line is tilted.

The value of a is called the Y-intercept and determines the point where the line crosses the Y-axis.

Introduction to Linear Regression (cont.)

How well a set of data points fits a straight line can be measured by calculating the distance between the data points and the line.

The total error between the data points and the line is obtained by squaring each distance and then summing the squared values.

The regression equation is designed to produce the minimum sum of squared errors.

Introduction to Linear Regression (cont.)

The equation for the regression line is Ŷ = bX + a, where the slope b and the Y-intercept a are chosen to minimize the sum of squared errors.
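
The slide's formula for b and a is not preserved in the text. The sketch below uses the standard least-squares solution for a single predictor (slope = sum of products of deviations divided by the sum of squared deviations of X; intercept = mean of Y minus slope times mean of X). Function names and data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope b and intercept a for Y = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp / ss_x
    a = mean_y - b * mean_x
    return b, a

def sum_squared_errors(xs, ys, b, a):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
b, a = fit_line(xs, ys)
print(b, a, sum_squared_errors(xs, ys, b, a))
```

Any other line, for example one with a slightly different slope, gives a larger sum of squared errors on the same points.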

From your questions

Are economically more powerful countries more influential in the European parliament?

...

Economically powerful countries can be based on different factors, including

Gross Domestic Product per Capita

Human Development Index

...

Multiple regression

(details: see exercise session)

From your questions

Is there a correlation between the government type of a country and how much its members talk about democracy?

This has (assumed) categorical predictors, which can be modelled by dummy variables in a linear regression.

Dummy variables
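
As a sketch of the idea: a categorical predictor with k values is replaced by k − 1 columns of 0/1 values, with one value left out as the reference category. The government types below are hypothetical placeholders, not values from the course data, and `dummy_encode` is an illustrative helper name.

```python
def dummy_encode(values, reference):
    """Encode a categorical variable as 0/1 columns, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical categories for illustration only.
government_types = ["republic", "monarchy", "republic", "federation"]
levels, rows = dummy_encode(government_types, reference="republic")
print(levels)
for value, row in zip(government_types, rows):
    print(value, row)
```

In a linear regression each dummy coefficient then expresses the difference between that category and the reference category.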

From your questions

Which European politicians have a high chance of receiving a Nobel Prize?

Logistic regression – input data

Logistic regression – fitting a curve

Logistic regression - prediction
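
The figures for these slides are not preserved. As a minimal sketch of the prediction step only: a fitted logistic regression turns a linear score into a probability through the logistic (sigmoid) function. The intercept and slope below are invented for illustration and are not fitted to any data.

```python
import math

def logistic(z):
    """Map any real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, intercept, slope):
    """Predicted probability of the positive class for predictor value x."""
    return logistic(intercept + slope * x)

# Hypothetical coefficients, chosen only to show the shape of the curve.
intercept, slope = -4.0, 0.5
for x in (0.0, 8.0, 16.0):
    print(x, predict_probability(x, intercept, slope))
```

With these coefficients the probability is exactly 0.5 at x = 8, where the linear score is zero; a class prediction is obtained by thresholding the probability, commonly at 0.5.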

From your questions

Which European politicians have a high chance of receiving a Nobel Prize?

Note: Logistic regression also exists in multivariate form (= with multiple predictor variables)

From your questions

To what extent are a politician’s topics of choice influenced by * their field of study during higher education?

* phrasing: See remark on “correlation vs. causation” above!

Are speeches in the European Parliament related to what the public think or search online?

Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)

Where to put: spaghetti, butter?

Data

"Market basket data": attributes with boolean domains

In a table each row is a basket (aka transaction)

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce

Solution approach: The apriori principle and the pruning of the search tree (1)-(4)

Search tree of candidate itemsets, by number of items:

1 item: {spaghetti}, {tomato sauce}, {bread}, {butter}

2 items: {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}

3 items: {spaghetti, tomato sauce, bread}, {spaghetti, tomato sauce, butter}, {spaghetti, bread, butter}, {tomato sauce, bread, butter}

4 items: {spaghetti, tomato sauce, bread, butter}

More formally: Generating large k-itemsets with Apriori

Min. support = 40%

step 1: candidate 1-itemsets
• Spaghetti: support = 3 (60%)
• tomato sauce: support = 3 (60%)
• bread: support = 4 (80%)
• butter: support = 1 (20%)

Contd.

step 2: large 1-itemsets

Spaghetti

tomato sauce

bread

candidate 2-itemsets

{Spaghetti, tomato sauce}: support = 2 (40%)

{Spaghetti, bread}: support = 2 (40%)

{tomato sauce, bread}: support = 2 (40%)

step 3: large 2-itemsets

{Spaghetti, tomato sauce}

{Spaghetti, bread}

{tomato sauce, bread}

candidate 3-itemsets

{Spaghetti, tomato sauce, bread}: support = 1 (20%)

step 4: large 3-itemsets { }
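
The four steps can be traced with a short level-wise sketch of Apriori on the five baskets. It is an illustration under simplifying assumptions (candidates are built by joining any two large itemsets of the previous level), not an optimized or authoritative implementation; names are illustrative.

```python
from itertools import combinations

TRANSACTIONS = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    """Return {large itemset: support count}, found level by level."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    large = {}
    k = 1
    while candidates:
        counts = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in counts.items()
                 if s / len(transactions) >= min_support}
        large.update(level)
        k += 1
        # Join step: unions of two large itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (apriori principle): every (k-1)-subset must be large.
        candidates = [c for c in joined
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))]
    return large

large = apriori(TRANSACTIONS, min_support=0.4)
for itemset, count in sorted(large.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Butter is dropped after step 1, so no itemset containing butter is ever counted; this is the pruning of the search tree described above.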

From itemsets to association rules

Schema: If subset then large k-itemset with support s and confidence c

s = (support of large k-itemset) / # tuples

c = (support of large k-itemset) / (support of subset)

Example:

If {spaghetti} then {spaghetti, tomato sauce}

Support: s = 2 / 5 (40%)

Confidence: c = 2 / 3 (≈ 67%)
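The two formulas can be checked on the example with a short sketch. The transactions are the five baskets from the Apriori example; `rule_metrics` is a hypothetical helper name that follows the slide's schema (a subset and the large k-itemset it belongs to).

```python
# The five baskets from the Apriori example.
transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(subset, itemset, transactions):
    """Support and confidence of 'if subset then itemset' (slide schema)."""
    itemset_count = support_count(itemset, transactions)
    s = itemset_count / len(transactions)
    c = itemset_count / support_count(subset, transactions)
    return s, c


s, c = rule_metrics({"spaghetti"}, {"spaghetti", "tomato sauce"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

The sketch assumes the subset occurs in at least one transaction; otherwise the confidence is undefined (division by zero).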


From local associations to global models: clustering

To what extent are a politician's topics of choice influenced by their field of study during higher education?

Can we find clusters of educational background and topics?


The basic idea of clustering: group similar things

[Figure: scatter plot of points over Attribute 1 (x-axis) and Attribute 2 (y-axis), falling into two groups, Group 1 and Group 2]

Concepts in Clustering

Defining distance between points:

Euclidean distance

any other distance (city-block metric, Levenshtein, Jaccard similarity, ...)

A good clustering is one where

(Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,

(Inter-cluster distance) while the distances between different clusters are maximized

Objective to minimize: F(Intra, Inter)

Clusters can be evaluated with “internal” as well as “external” measures

Internal measures are related to the inter/intra-cluster distance

External measures are related to how well the current clusters represent the “true” classes
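A small sketch may make these notions concrete. The slides do not fix F or the way inter-cluster distance is measured, so everything here is an assumption for illustration: intra-cluster distance is the sum of pairwise distances within each cluster, inter-cluster distance is measured between cluster centroids, and the toy points are invented.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cityblock(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def jaccard_distance(a, b):
    """1 - Jaccard similarity of two non-empty sets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)


def intra_cluster(clusters, dist):
    """Sum of pairwise distances between objects in the same cluster."""
    return sum(
        dist(p, q)
        for cluster in clusters
        for i, p in enumerate(cluster)
        for q in cluster[i + 1:]
    )


def centroid(cluster):
    return tuple(sum(xs) / len(xs) for xs in zip(*cluster))


def inter_cluster(clusters, dist):
    """Sum of distances between centroids of different clusters (assumption)."""
    cents = [centroid(c) for c in clusters]
    return sum(dist(p, q) for i, p in enumerate(cents) for q in cents[i + 1:])


# Hypothetical toy data: the same four points grouped in two different ways.
good = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
bad = [[(0.0, 0.0), (4.0, 0.0)], [(0.0, 1.0), (4.0, 1.0)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name, intra_cluster(clusters, euclidean), inter_cluster(clusters, euclidean))
```

The "good" grouping has the smaller intra-cluster sum and the larger inter-cluster distance, which is what an internal measure rewards.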


K Means Example (K=2)

[Figure: sequence of scatter plots; centroids are marked with x]

1. Pick seeds

2. Reassign clusters

3. Compute centroids

4. Reassign clusters

5. Compute centroids

6. Reassign clusters

7. Converged!

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt


K-means algorithm
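The content of this slide did not survive extraction. As a stand-in, here is a minimal sketch of the standard iteration: assign each point to its nearest centroid, recompute the centroids, and repeat until the assignment stops changing. The function name `kmeans`, the `max_iter` cap, the use of Euclidean distance, and the toy points are assumptions for illustration, not the slide's own formulation.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Plain k-means with Euclidean distance (illustrative sketch)."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Reassign clusters: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Converged: no point changed cluster.
        assignment = new_assignment
        # Compute centroids: mean of the points assigned to each cluster.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # An empty cluster keeps its old centroid.
                centroids[j] = tuple(sum(xs) / len(xs) for xs in zip(*members))
    return centroids, assignment


# Hypothetical toy data with K=2; the first two points serve as seeds.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignment = kmeans(points, seeds=points[:2])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The result depends on the seeds: different starting points can converge to a different, possibly worse, clustering.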



Clustering non-numerical data

(to follow)


Next lecture

More on KDD concepts and methods for your projects

Supervised and unsupervised learning and examples dealt with here

• Supervised learning

  • Classification / classifier learning

  • Regression

• Unsupervised learning

  • Association rule mining

  • Clustering

What's the human input in both types?


References / background reading; acknowledgements

The slides are based on Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

In particular, pp. 8-57 are based on the instructor slides for that book (chapters 1-4), available at http://books.elsevier.com/companions/9780120884070/:

http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or

http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)

Scales (aka levels) of measurement are explained well here:

http://en.wikipedia.org/wiki/Level_of_measurement [15 Nov 2014]

Recommended