Data Mining CS157B Section 2 Larry Varela. What is Data Mining? Data Mining is "The science of extracting useful information from large data sets or databases"

Data Mining

CS157B Section 2

Larry Varela

What is Data Mining?

Data Mining is "The science of extracting useful information from large data sets or databases". -- http://en.wikipedia.org/wiki/Data_mining

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information within a particular context.

History of Data Mining

Although data mining is a relatively new term, the technology has been around for more than 20 years.

Companies have used powerful computers to sift through volumes of supermarket scanner data and analyze market research reports for years.

Recent innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down the cost.

Data Mining History cont…

Data mining was derived from three previously defined disciplines:

Classical statistics - embraces concepts such as regression analysis, standard distribution, standard deviation, variance, discriminant analysis, cluster analysis, and confidence intervals, all of which are used to study data and data relationships.

Artificial intelligence - attempts to apply human-thought-like processing to statistical problems. AI concepts have been adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

Machine learning - attempts to let software learn about the data it studies, such that future decisions are based on the quality of the studied data.

What is it used for?

Data Mining enables businesses to automatically explore and understand their data while identifying patterns, relationships, and dependencies that impact business outcomes. (Descriptive application)

Business outcomes include: revenue growth, profit improvement, cost containment, and risk management.

Data Mining enables the uncovering and identification of relationships expressed as business rules, or predictive models.

These outputs can then be communicated in traditional reporting formats to guide business planning and strategies.

In addition, these outputs can also be expressed as programming code that can then be deployed into business software to generate predictions of future outcomes. (Predictive application)

Common Types of Relationships

Classes: Stored data is used to locate information in predetermined groups. For example, a coffee chain could mine customer purchase data to determine when customers arrive and what they typically purchase. This information could be used to increase traffic by having daily specials.

Clusters: Data items can be grouped according to logical relationships. For example, data can be mined to identify technology market segments or recent consumer purchasing trends.

Associations: Data can be mined to identify associations between items purchased or queried. For example, the beer-diaper case Dr. Lee mentioned during the last class is an instance of associative mining.

Sequential patterns: Data is mined to anticipate or predict behavior patterns and trends. For example, a Corvette dealer could predict the likelihood of power-folding convertible tops being purchased based on recent increased purchases of convertible style vehicles.
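As a rough illustration of the association case, the sketch below computes the two usual measures of an association rule, support and confidence, for a diapers-to-beer rule. The baskets are made-up data and the function names are hypothetical; nothing here comes from a real store.

```python
# Hypothetical market baskets (made-up data for illustration only).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually considered interesting only when both numbers clear thresholds chosen by the analyst.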

How does data mining work?

Data mining consists of five major elements:

Extract, transform, and load transaction data onto the data warehouse system.

Store and manage the data in a multidimensional database system.

Provide data access to business analysts and/or information technology professionals.

Analyze the data using application software.

Present the data in a readable format.

-- info quoted from http://www.anderson.ucla.edu
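The five elements can be sketched end to end on a toy scale. This is only an illustration: an in-memory SQLite table stands in for the data warehouse (it is not a multidimensional database), and the transaction records, table name, and column names are all assumptions.

```python
import sqlite3

# 1. Extract: raw transaction records (made-up data).
raw = [("latte", "2.50"), ("mocha", "3.00"), ("latte", "2.50"), ("tea", "1.75")]

# Transform: convert the price strings to numbers.
rows = [(item, float(price)) for item, price in raw]

# Load, and 2. store: an in-memory SQLite table stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# 3. Provide access, and 4. analyze: a query summarizing sales per item.
summary = conn.execute(
    "SELECT item, COUNT(*), SUM(price) FROM sales GROUP BY item ORDER BY item"
).fetchall()

# 5. Present the result in a readable format.
for item, count, revenue in summary:
    print(f"{item:<6} sold={count} revenue={revenue:.2f}")
conn.close()
```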

Data mining Techniques

Classical Techniques: Statistics, Neighborhoods and Clustering

Next Generation Techniques: Trees, Networks and Rules

Trees

Within a decision tree, each branch is a classification question, and the leaves of the tree are partitions of the dataset with their classification.

Decision trees can be viewed as segmentations of the original dataset where each segment would be one of the leaves of the tree.

Decision tree technology can be used for exploration of datasets and/or business problems. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or propose questions that need to be answered.

Types of Decision Trees

Classification tree analysis is a term used when the predicted outcome is the class to which the data belongs.

Regression tree analysis is a term used when the predicted outcome can be considered a real number (e.g. the price of a house, or a patient’s length of stay in a hospital).

CART analysis is a term used to refer to both of the above procedures. The name CART is an acronym from the words Classification And Regression Trees, and was first introduced by Breiman et al. [BFOS84].

-- info quoted from http://en.wikipedia.org/wiki/Decision_tree
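The difference between the two tree types shows up at the leaves. In this sketch (the leaf contents and function names are hypothetical), a classification leaf predicts the majority class of the records it holds, while a regression leaf predicts the mean of their numeric targets, which is the usual convention in CART-style trees.

```python
from collections import Counter
from statistics import mean

def classification_leaf(labels):
    """Predict the most common class among the records in a leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(values):
    """Predict the mean of the numeric targets in a leaf."""
    return mean(values)

# Made-up leaf contents for illustration.
class_prediction = classification_leaf(["visits", "visits", "no visits"])
price_prediction = regression_leaf([200_000, 250_000, 300_000])
print(class_prediction, price_prediction)
```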

Decision Tree Example

Angelo is the manager of a children’s zoo. Recently Angelo has been experiencing customer attendance problems. On some days so many visitors arrive that the staff is overworked, yet on other days no visitors arrive and the staff has too much unproductive free time. Angelo’s objective is to optimize staff availability by predicting when people will visit the park. To accomplish this, Angelo needs to understand why people decide to visit on particular days. He assumes that weather must be an important underlying factor, so he plans to base his predictions on the weather forecast for the upcoming week. Angelo records the following:

Weather outlook (sunny, overcast, or rain)

Temperature

Percent humidity

Whether it was windy or not

Zoo attendance on that particular day

Decision Tree Example

Independent variables: OUTLOOK, TEMP, HUMIDITY, WINDY. Dependent variable: VISITOR ATTENDANCE.

OUTLOOK   TEMP  HUMIDITY  WINDY  VISITOR ATTENDANCE
sunny     85    85        FALSE  no visits
sunny     80    90        TRUE   no visits
overcast  83    78        FALSE  visits
rain      70    96        FALSE  visits
rain      65    70        TRUE   no visits
overcast  64    65        TRUE   visits
sunny     72    95        FALSE  no visits
sunny     69    70        FALSE  visits
sunny     75    70        TRUE   visits
overcast  72    90        TRUE   visits
overcast  81    75        FALSE  visits
rain      71    80        TRUE   no visits

Decision Tree Example cont…

Angelo then applies a decision tree model to solve his problem.

[Figure: decision tree]

OUTLOOK? (Visits = 9, No Visits = 5)
    sunny: HUMIDITY?
        <=70
        >70
    overcast: Visit = 4, No Visit = 0
    rain: WINDY?
        FALSE
        TRUE

Decision Tree Example cont…

The decision tree created is a model of the data that encodes the distribution of the class label in terms of the predictor attributes. The top node represents all the data. The classification tree algorithm finds out that the best way to explain the dependent variable, VISIT, is by using the variable OUTLOOK.

Angelo’s first conclusion: if the OUTLOOK is OVERCAST people always visit the zoo, and there exist some crazy people who visit the zoo even in the rain.

He then divided the sunny group into two groups and found that people don't like to visit the zoo if the humidity is higher than seventy percent.

Finally, he divided the rain category into two and found that people also do not visit the zoo when it is rainy and windy.
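One common way to decide which variable explains the dependent variable best is information gain, the criterion used by ID3-style algorithms. The slides do not say which criterion Angelo's tool used, so that choice is an assumption. The sketch uses only the 12 rows listed in the table (the root counts of 9 and 5 imply two more rows that are not shown) and compares the two categorical variables, OUTLOOK and WINDY.

```python
from collections import Counter
from math import log2

# (outlook, windy, attendance) for the 12 rows listed in the table.
rows = [
    ("sunny", False, "no visits"), ("sunny", True, "no visits"),
    ("overcast", False, "visits"), ("rain", False, "visits"),
    ("rain", True, "no visits"), ("overcast", True, "visits"),
    ("sunny", False, "no visits"), ("sunny", False, "visits"),
    ("sunny", True, "visits"), ("overcast", True, "visits"),
    ("overcast", False, "visits"), ("rain", True, "no visits"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy reduction obtained by splitting the rows on one column."""
    labels = [row[-1] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gain_outlook = information_gain(rows, 0)
gain_windy = information_gain(rows, 1)
print(f"OUTLOOK gain={gain_outlook:.3f}, WINDY gain={gain_windy:.3f}")
```

On these rows OUTLOOK yields the larger gain, which is consistent with it being chosen for the first split.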

Decision Tree Example Conclusion

Angelo dismisses most of the staff on days that are sunny and humid or rainy and windy, because almost no one is going to visit the zoo on those days. On days when a lot of people will visit, he hires extra staff.

The conclusion is that the decision tree helped Angelo turn a complex data representation into a much easier structure.
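The rules Angelo read off the tree fit in a few lines of code. The sketch below is one way to write them (the function name is hypothetical) and checks them against the 12 rows listed in the table.

```python
def predict_attendance(outlook, humidity, windy):
    """Apply the tree's rules: split on OUTLOOK, then on HUMIDITY or WINDY."""
    if outlook == "overcast":
        return "visits"
    if outlook == "sunny":
        return "no visits" if humidity > 70 else "visits"
    # Remaining case: rain.
    return "no visits" if windy else "visits"

# (outlook, temp, humidity, windy, attendance) rows as listed in the table.
records = [
    ("sunny", 85, 85, False, "no visits"),
    ("sunny", 80, 90, True, "no visits"),
    ("overcast", 83, 78, False, "visits"),
    ("rain", 70, 96, False, "visits"),
    ("rain", 65, 70, True, "no visits"),
    ("overcast", 64, 65, True, "visits"),
    ("sunny", 72, 95, False, "no visits"),
    ("sunny", 69, 70, False, "visits"),
    ("sunny", 75, 70, True, "visits"),
    ("overcast", 72, 90, True, "visits"),
    ("overcast", 81, 75, False, "visits"),
    ("rain", 71, 80, True, "no visits"),
]

# Temperature is ignored because the tree never splits on it.
correct = sum(
    predict_attendance(outlook, humidity, windy) == actual
    for outlook, _temp, humidity, windy, actual in records
)
print(f"{correct} of {len(records)} rows match the tree's rules")
```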

Decision Tree Advantages

Decision trees are simple to understand and interpret.

Data preparation for a decision tree is basic or unnecessary.

They are able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable.

It is possible to validate a model using statistical tests.

They are robust and perform well with large data in a short time.

Data Mining Pitfalls

Sometimes data mining may impose patterns on data where none exist. This imposition of irrelevant correlations is termed data dredging or data fishing.

Large data sets invariably happen to have some exciting relationships peculiar to that data. Therefore any conclusions reached are likely to be highly suspect.
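Data dredging is easy to reproduce. In the sketch below, both the target and every candidate feature are independent random noise, so no real relationship exists, yet searching enough features still turns up one that looks noticeably correlated. The sample size, feature count, and seed are arbitrary choices made for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)  # fixed seed so the run is repeatable
n_samples, n_features = 20, 200

# Target and features are independent noise: any correlation is spurious.
target = [random.gauss(0, 1) for _ in range(n_samples)]
features = [
    [random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)
]

strongest = max(abs(pearson(feature, target)) for feature in features)
print(f"strongest |correlation| among {n_features} noise features: {strongest:.2f}")
```

Checking a discovered pattern on data that was not used to find it is the usual guard against this effect.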

References

Wikipedia.org (2006) Data mining. Retrieved on 3/20/2006 from http://en.wikipedia.org/wiki/Data_mining

Bill Palace (1996) What is Data Mining? Retrieved on 3/20/2006 at http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

Data-Mining-Software.com (2006) Data Mining History. Retrieved on 3/20/2006 at http://www.data-mining-software.com/data_mining_history.htm

Alex Berson, Stephen Smith, and Kurt Thearling (1999) An Overview of Data Mining Techniques. Retrieved on 3/20/2006 from http://www.thearling.com/text/dmtechniques/dmtechniques.htm

[BFOS84] L. Breiman, J. Friedman, R. A. Olshen and C. J. Stone, Classification and regression trees. Wadsworth, 1984.
