8/3/2019 Lecture 2 Part B
http://slidepdf.com/reader/full/lecture-2-part-b 1/11
Data Mining
Week 2
Marketing and Sales
• Companies precisely record massive amounts of marketing and sales data
• Applications:
  – Special offers
  – Identifying profitable customers (e.g. reliable credit-card owners who need extra money during the holiday season)
Marketing and Sales
Association techniques find
• groups of items that tend to occur together in a transaction
Applications:
• Historical analysis of purchasing patterns
• Identifying prospective customers
• Focusing promotional mail-outs (targeted campaigns are cheaper than mass-marketed ones)
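The core idea — groups of items that tend to occur together in transactions — can be sketched as a simple co-occurrence count with a minimum-support threshold. This is a minimal, illustrative Python sketch; the transaction data and the `min_support` value are hypothetical, and a full association miner (e.g. Apriori) would go on to derive rules and confidence values from such itemsets:

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket data: each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "milk", "chips"},
    {"beer", "chips", "milk"},
]

def frequent_pairs(transactions, min_support=0.4):
    """Count how often each pair of items co-occurs in a transaction and
    keep pairs whose support (fraction of transactions containing both
    items) is at least min_support."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

print(frequent_pairs(transactions))
# e.g. ("bread", "milk") appears in 3 of 5 transactions, so its support is 0.6
```

In a real application the frequent pairs (and larger itemsets) would then be turned into rules such as "customers who buy bread also buy milk" and scored by confidence; the support threshold is what keeps the search tractable.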
Machine Learning & Statistics
• Historical difference:
  – Statistics: testing hypotheses
  – Machine learning: finding the right hypothesis
• But: huge overlap
  – Decision trees (C4.5 and CART)
• Today: the perspectives have converged
  – Most ML algorithms employ statistical techniques
Overfitting
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
• Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
• A model that has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
• Especially in cases where learning was performed for too long or where training examples are rare, the learner may adjust to very specific random features of the training data that have no causal relation to the target function.
• As overfitting progresses, performance on the training examples still increases while performance on unseen data becomes worse.
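The contrast between fitting noise and fitting the underlying relationship can be illustrated with a small self-contained sketch. The data is synthetic (y = 2x plus Gaussian noise) and both models are deliberately extreme illustrations, not a prescribed method: a lookup-table "memorizer" reaches zero training error by reproducing the noise exactly, while a simple least-squares line cannot fit the training data perfectly but generalizes:

```python
import random

random.seed(0)

# Synthetic data: the true relationship is y = 2x, blurred by noise.
def make_data(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 1.0)) for x in xs]

train = make_data(20)
test = make_data(20)

def mse(model, data):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Overfit model: memorize every training point and predict the y of the
# nearest training x. Training error is exactly zero, noise and all.
def memorizer(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Simple model: an ordinary least-squares line through the training data.
def fit_line(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + slope * (x - mx)

line = fit_line(train)

print("memorizer train/test MSE:", mse(memorizer, train), mse(memorizer, test))
print("line      train/test MSE:", mse(line, train), mse(line, test))
```

The memorizer's training error is zero while its test error is not — the gap between the two is exactly the signature of overfitting described above.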
Overfitting
• Modified strategy
  – E.g. pruning (simplifying a description)
• Pre-pruning: stops at a simple description before the search proceeds to an overly complex one
• Post-pruning: generates a complex description first and simplifies it afterwards
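Post-pruning can be sketched as reduced-error pruning on a toy decision tree. Everything here is hypothetical for illustration — the dict-based tree representation, the helper names, and the data: the complex tree is grown first, then any subtree whose replacement by a majority-class leaf does not lower accuracy on held-out validation data is collapsed. Pre-pruning would instead refuse to create the deep split while growing (e.g. via a depth limit):

```python
def predict(node, x):
    """Walk the tree; any non-dict node is a leaf holding a class label."""
    if not isinstance(node, dict):
        return node
    branch = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return predict(branch, x)

def accuracy(node, data):
    return sum(predict(node, x) == y for x, y in data) / len(data)

def majority(labels):
    return max(set(labels), key=labels.count)

def post_prune(node, val_data):
    """Bottom-up reduced-error pruning: replace a subtree with a
    majority-class leaf whenever that does not lower accuracy on the
    validation examples reaching this node."""
    if not isinstance(node, dict) or not val_data:
        return node
    left_val = [(x, y) for x, y in val_data if x[node["feature"]] < node["threshold"]]
    right_val = [(x, y) for x, y in val_data if x[node["feature"]] >= node["threshold"]]
    node["left"] = post_prune(node["left"], left_val)
    node["right"] = post_prune(node["right"], right_val)
    leaf = majority([y for _, y in val_data])
    if accuracy(leaf, val_data) >= accuracy(node, val_data):
        return leaf
    return node

# An overly complex tree: the inner split on feature 1 was fit to training noise.
tree = {"feature": 0, "threshold": 5.0,
        "left": "A",
        "right": {"feature": 1, "threshold": 0.5, "left": "B", "right": "A"}}

# Held-out validation data: every point with feature 0 >= 5 is really class "B".
val = [((2.0, 0.9), "A"), ((7.0, 0.9), "B"), ((8.0, 0.1), "B"), ((9.0, 0.8), "B")]

pruned = post_prune(tree, val)
print(pruned)  # the noisy subtree collapses to the single leaf "B"
```

The noisy inner split misclassifies two of the three right-side validation points, so replacing it with the majority leaf "B" improves held-out accuracy and the subtree is pruned; the useful root split survives because collapsing it would hurt validation accuracy.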
Data Mining & Ethics
• It is widely accepted that, before people decide to provide personal information, they need to know how it will be used and what it will be used for, what steps will be taken to protect its confidentiality and integrity, what the consequences of supplying or withholding the information are, and any rights of redress they may have. Whenever such information is collected, individuals should be told all of this straightforwardly, in plain language they can understand.
Data Mining & Ethics
• The potential use of data mining techniques means that the ways in which a repository of data can be used may stretch far beyond what was conceived when the data was originally collected.
• This creates a serious problem: it is necessary to determine the conditions under which the data was collected and for what purposes it may be used.
• Does the ownership of data bestow the right to use it in ways other than those purported when it was originally recorded?
• Clearly, in the case of explicitly collected personal data it does not. But in general the situation is complex.
Data Mining & Ethics
• Ethical issues arise in practical applications
• Data mining is often used to discriminate
  – E.g. loan applications: using some information (e.g. sex, religion, race) is unethical
• The ethical situation depends on the application
  – E.g. the same information may be acceptable in a medical application
• Attributes may contain problematic information
  – E.g. area code may correlate with race
Data Mining & Ethics
• Important questions:
  – Who is permitted access to the data?
  – For what purpose was the data collected?
  – What kinds of conclusions can legitimately be drawn from it?
• Caution must be attached to results
• Are resources put to good use?