Starting point!
Data exploration starts with data.
The real starting point!
Data exploration starts with data.
Data exploration starts with identifying a need.
Customer
• Problem owners
• Problem holders
• Useful
• Profitable
Participation Motivation
The process (CRISP-DM)
The process (Pyle)
• Exploring the problem, exploring the solution, implementation specification: 20% work, 80% importance
• Preparation, survey, data modeling: 80% work, 20% importance
The problem
• Identify the right problem
• Define solvable problem(s)
• Transfer the problem understanding to the miner
Example
“I really need a model of the Monday and Friday failure rates so we can stop them!”
• What is a failure?
• How is it detected/measured?
• Is it a quality problem or just fluctuation of error rates?
• Which problem components need to be looked at?
• ...
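One way to probe the "quality problem or just fluctuation" question is a pooled two-proportion z statistic. The sketch below is only an illustration: the function name `two_proportion_z`, the failure counts, and the 1.96 threshold are assumptions, not part of the original example.

```python
import math


def two_proportion_z(fail_a, total_a, fail_b, total_b):
    """Pooled z statistic for the difference between two failure rates."""
    p_a = fail_a / total_a
    p_b = fail_b / total_b
    pooled = (fail_a + fail_b) / (total_a + total_b)
    # Standard error of the difference under the "same rate" assumption.
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se


# Hypothetical counts: Monday/Friday failures vs. midweek failures.
z = two_proportion_z(fail_a=30, total_a=1000, fail_b=20, total_b=1000)

# Roughly a 5% two-sided threshold; a small |z| is compatible with fluctuation.
looks_like_fluctuation = abs(z) < 1.96
```

A check like this only makes sense once "failure" and its measurement have been defined, which is exactly what the questions above ask for.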
The solution
What does the solution look like?
- a program used by an expert
- a data set to be referred to
- a model to be used for prediction
- a presentation / report
- ...
How (and by whom) is the solution implemented?
Data mining
• Prepare: both the data and the miner
• Survey: understand the data
is the data adequate?
• Model: refining the details
depends on nature of data and the solution goal
Why preparation?
GIGO: fix the data
Get a data set which is of maximum use
preserves the information
enhanced for problem & model
PIE
Prepared Information Environment
1. prepare the training/testing data
2. transform prepared values to original
3. apply the same preparation to new data
[Diagram: data, new data, PIE-in, model, PIE-out, report]
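The three PIE steps can be sketched with a simple min-max scaling. This is a minimal illustration, not the preparation method from the slides: the class `MinMaxPIE`, its method names, and the sample values are all hypothetical, and it assumes the training values are not all equal.

```python
from dataclasses import dataclass


@dataclass
class MinMaxPIE:
    """Hypothetical PIE: min-max scaling learned from training data only."""
    lo: float
    hi: float

    @classmethod
    def fit(cls, training_values):
        # Step 1: learn the preparation from the training/testing data.
        return cls(lo=min(training_values), hi=max(training_values))

    def pie_in(self, value):
        # Step 3: apply the same preparation to any value, old or new.
        return (value - self.lo) / (self.hi - self.lo)

    def pie_out(self, prepared):
        # Step 2: transform a prepared value back to the original scale.
        return prepared * (self.hi - self.lo) + self.lo


training = [10.0, 20.0, 30.0, 50.0]   # made-up training values
pie = MinMaxPIE.fit(training)

prepared_new = pie.pie_in(40.0)        # new data, scaled with training parameters
restored = pie.pie_out(prepared_new)   # back in original units
```

The point of the design is that the scaling parameters are fixed once from the training data, so new data and model output pass through exactly the same transformation.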
Why survey?
Get a broad idea of the data:
• what is covered
• what is not covered, or is covered poorly
Dangerous areas:
• bias in data
• sparse data (in a dynamic area)
Is the data adequate?
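A first pass at the adequacy question might count what is missing and what is covered only thinly. The sketch below is an assumption-laden illustration: the function `survey`, the `min_count` threshold, and the sample records are hypothetical.

```python
from collections import Counter


def survey(records, field, min_count=5):
    """Summarize coverage of one field: missing share and sparse values."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        # Share of records where the field is absent or None.
        "missing_share": 1 - len(present) / len(values) if values else 0.0,
        # Values seen fewer than min_count times, i.e. poorly covered areas.
        "sparse_values": sorted(v for v, n in counts.items() if n < min_count),
    }


# Made-up records: "Mon" is well covered, "Fri" is sparse, two are missing.
records = [{"day": "Mon"}] * 6 + [{"day": "Fri"}] * 2 + [{"day": None}] * 2
summary = survey(records, "day", min_count=5)
```

Counts like these do not reveal bias by themselves, but they show where the data is too thin to support a model.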
Modeling hype
Universal approximator
can be applied to any data
Data-driven
no theoretical knowledge required
Modeling definition
Model: “a representation … to show the construction or serve as a copy of something”
= makes information understandable or usable =
Modeling in data mining
Modeling is iterative:
1. Define problem
2. Select tool
3. Collect data
4. Make model
5. Apply
6. Evaluate
Traditional statistical methods: first model, then data
Model types
• Active or passive
• Explanatory or predictive
• Static or continuously learning
Ten golden rules
1. Select clear problem with tangible benefit
2. Specify required solution
3. Define how solution is implemented
4. Understand the domain
5. Let the problem drive the modeling
6. Stipulate assumptions
7. Refine the model iteratively
8. Make the model as simple as possible (but no simpler)
9. Find areas of instability
10. Find areas of uncertainty
Critique
• Model evaluation is missing
• Iteration of planning stage
• Domain expert as data miner