Lecture 2 Part B

8/3/2019 Lecture 2 Part B

http://slidepdf.com/reader/full/lecture-2-part-b 1/11

Data Mining

W2

Marketing and Sales

• Companies precisely record massive amounts of marketing and sales data

• Applications:

  – Special offers

  – Identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)

Marketing and Sales

• Association techniques find groups of items that tend to occur together in a transaction

• Historical analysis of purchasing patterns

• Identifying prospective customers

• Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)
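
To make "items that tend to occur together in a transaction" concrete, here is a minimal sketch that counts how often each pair of items shares a basket and keeps the pairs whose support (fraction of transactions containing both items) reaches a threshold. The baskets, the item names, the function name `frequent_pairs` and the threshold are all made up for illustration; this is only the pair-counting idea, not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical checkout data: each set is one transaction (basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs found in at least
    min_support (a fraction between 0 and 1) of the baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    n = len(baskets)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }


pairs = frequent_pairs(transactions, min_support=0.5)
for pair, support in pairs.items():
    print(pair, support)
```

With this toy data only bread and milk appear together in at least half of the baskets. A real system would go on to larger item sets and to rules with a confidence measure.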

Machine Learning & Statistics

• Historical difference:

  – Statistics: testing hypotheses

  – Machine learning: finding the right hypothesis

• But: huge overlap

  – Decision trees (C4.5 and CART)

• Today: perspectives have converged (joined)

  – Most ML algorithms employ statistical techniques

Overfitting

• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.

• It generally occurs when a model is excessively complex, for example when it has too many parameters relative to the number of observations.

• An overfitted model will generally have poor predictive performance, because it can exaggerate minor fluctuations in the data.

• Especially when learning runs for too long or training examples are rare, the learner may adjust to very specific random features of the training data that have no causal relation to the target function.

• During overfitting, performance on the training examples still increases while performance on unseen data becomes worse.
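
The last point can be illustrated with a small sketch that compares a two-parameter straight line against a polynomial with one parameter per observation. Everything here is an assumption chosen for illustration: the toy data (true relationship y = x with noise of ±0.5 on the training targets), the helper names, and the use of Lagrange interpolation as the "excessively complex" model.

```python
# Toy data, made up for illustration: the underlying relationship is y = x.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.5, 0.5, 2.5, 2.5]   # y = x plus noise of +0.5 / -0.5
test_x = [0.5, 1.5, 2.5]         # unseen points
test_y = [0.5, 1.5, 2.5]         # noise-free targets


def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpolate(xs, ys, x):
    """Lagrange polynomial through every training point:
    as many parameters as there are observations."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


slope, intercept = fit_line(train_x, train_y)


def simple_model(x):
    return slope * x + intercept


def complex_model(x):
    return interpolate(train_x, train_y, x)


results = {
    "simple": (mse(simple_model, train_x, train_y),
               mse(simple_model, test_x, test_y)),
    "complex": (mse(complex_model, train_x, train_y),
                mse(complex_model, test_x, test_y)),
}
for name, (train_err, test_err) in results.items():
    print(f"{name:8s} train MSE = {train_err:.3f}  test MSE = {test_err:.3f}")
```

On this data the interpolating polynomial reproduces the training targets exactly (training error 0) but has a test error of about 0.17, while the straight line has a training error of about 0.2 and a test error of about 0.03. The complex model has fitted the noise.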

Overfitting

• Remedy: a modified learning strategy

  – E.g. pruning (simplifying a description)

  – Pre-pruning: stops at a simple description before the search proceeds to an overly complex one

  – Post-pruning: generates a complex description first and simplifies it afterwards
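
As a rough illustration of post-pruning, the sketch below takes an already grown, overly complex decision tree and simplifies it bottom-up: a subtree is replaced by a leaf whenever that does not lower accuracy on held-out validation data. This particular rule (often called reduced-error pruning) is one possible choice and is not named in the slides; the `Node` class, the hand-built tree and the validation examples are hypothetical. Pre-pruning would instead stop the tree from growing in the first place, for example with a depth limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: int                      # majority class at this node
    feature: Optional[int] = None   # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch for x[feature] <= threshold
    right: Optional["Node"] = None  # branch for x[feature] > threshold


def predict(node, x):
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)


def count_leaves(node):
    if node.feature is None:
        return 1
    return count_leaves(node.left) + count_leaves(node.right)


def post_prune(node, root, validation):
    """Bottom-up: turn a subtree into a leaf unless validation accuracy drops."""
    if node.feature is None:
        return
    post_prune(node.left, root, validation)
    post_prune(node.right, root, validation)
    before = accuracy(root, validation)
    saved = (node.feature, node.left, node.right)
    node.feature, node.left, node.right = None, None, None  # try a leaf
    if accuracy(root, validation) < before:
        node.feature, node.left, node.right = saved          # undo

# Hand-built overfitted tree: the deepest split fits one noisy training example.
tree = Node(label=1, feature=0, threshold=5.0,
            left=Node(label=0),
            right=Node(label=1, feature=0, threshold=7.5,
                       left=Node(label=1),
                       right=Node(label=1, feature=0, threshold=8.5,
                                  left=Node(label=0),
                                  right=Node(label=1))))

# Held-out examples as (features, class); the underlying rule is "x > 5 means class 1".
validation = [((1.0,), 0), ((4.0,), 0), ((6.0,), 1), ((8.0,), 1), ((9.0,), 1)]

leaves_before, acc_before = count_leaves(tree), accuracy(tree, validation)
post_prune(tree, tree, validation)
leaves_after, acc_after = count_leaves(tree), accuracy(tree, validation)
print(leaves_before, acc_before, "->", leaves_after, acc_after)
```

In this toy case the tree shrinks from four leaves to two, and validation accuracy rises from 0.8 to 1.0 because the split that fitted the noisy example is removed.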

Data Mining & Ethics

• It is widely accepted that before people decide to provide personal information, they need to know how it will be used and what it will be used for, what steps will be taken to protect its confidentiality and integrity, what the consequences of supplying or withholding the information are, and any rights of claim they may have. Whenever such information is collected, individuals should be told all this straightforwardly, in plain language they can understand.

Data Mining & Ethics

• The potential use of data mining techniques means that the ways in which a repository of data can be used may stretch far beyond what was conceived when the data was originally collected.

• This creates a serious problem: it is necessary to determine the conditions under which the data was collected and for what purposes it may be used.

• Does the ownership of data bestow the right to use it in ways other than those purported when it was originally recorded?

• Clearly in the case of explicitly collected personal data it does not. But in general the situation is complex.

Data Mining & Ethics

• Ethical issues arise in practical applications

• Data mining is often used to discriminate

  – E.g. loan applications: using some information (e.g. sex, religion, race) is unethical

• The ethical situation depends on the application

  – E.g. the same information is OK in a medical application

• Attributes may contain problematic information

  – E.g. area code may correlate with race

Data Mining & Ethics

• Important questions:

  – Who is permitted access to the data?

  – For what purpose was the data collected?

  – What kind of conclusions can be legitimately drawn from it?

• Caution must be attached to results

• Are resources put to good use?
