Data Mining (and Machine Learning) With Microsoft Tools Michael Lisin, Plaster Group May 8, 2014


Why Reinvent a Toilet?


Definitions


Concept: Definition / Solution For
• Data Mining: Algorithms to discover unknown data patterns
• Machine Learning: Algorithms to predict based on data patterns
• Statistics: Branch of mathematics, methods of data collection and interpretation
• Data Science: All of the above + Data Visualization

What Do You Think?


Is Linear Regression?
• Data Mining
• Machine Learning
• Statistics
• All of the above

Linear Regression is a straight line describing how variable Y responds to changes in variable X
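A minimal sketch of fitting that straight line by ordinary least squares. The function name fit_line and the spend/units data are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (X) and units sold (Y).
spend = [1.0, 2.0, 3.0, 4.0]
units = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(spend, units)
predicted = intercept + slope * 5.0  # prediction for a new X value
```

The slope says how much Y changes per unit change in X, which is the relationship the slide describes.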

MS DM Environment
• SQL Server 2000 - 2014
• Excel Data Mining Add-Ins (optional, recommended)
• Interact with: Excel (add-ins), SQL Server Management Studio, SQL Server Data Tools (SSDT), Custom Code


SQL Edition
Editions compared: Enterprise, BI, Standard
Component: Capability
• SSIS: Text Mining
• SSAS: DM basic
• SSAS: DM advanced (CV, prediction queries, …)

SSDT, Custom Code

Start With a Question


Many Potential Questions

MS DM Capabilities

Generic question: What are the data patterns?

Best if more specific and directed at a problem, for example:
• How do we combine our products to increase profits?
• How do we predict the demand for a product / service?
• Why are customers buying from us?
• Where can we best cut costs?
• What are the opportunities to reduce risks?
• Who are our best customers?
• …

Approach
• Define problem / questions
• Prepare data
• Build model
• Validate model
• Implement predictions
• Automate model refresh
• Extend / custom applications


More Technical

SQL DM Algorithms Summary

Problem types: Discrete, Continuous, Sequence, Common Group, Similar Group, TXT Semantic

Algorithms [Excel tools]:
• Decision Trees [Classify, Estimate]
• Linear Regression [Advanced]
• Time Series [Forecast (T), Forecast]
• Clustering [Detect Categories (T), Except, Cluster]
• Sequence Clustering [Advanced]
• Neural Network [Advanced]
• Logistic Regression [Fill From Sample (T), Scenario Analysis (T), Prediction Calculator (T)]
• Association Rules [Shopping Basket (T), Associate]
• Naïve Bayes [Analyze Key Influencers (T)]
• Text Mining (matching, grouping, extracting)


Predict Using Models

DMX = Data Mining Extensions to query models for predictions

DMX Query:

    SELECT
        Model.[Bike Buyer],
        PredictProbability(Model.[Bike Buyer]),
        NewData.Email
    FROM [Model]
    NATURAL PREDICTION JOIN
        (SELECT Age, [Commute Distance], Email
         FROM …) As NewData

Output:

    Yes  0.1423  rob4@...
    No   0.5698  elizabeth5@...
    Yes  0.9045  eugene10@...
    …

Demo


Appendix


SQL Server Data Mining Algorithms


• Decision Tree
• Linear Regression
• Clustering
• Sequence Clustering
• Association
• Naive Bayes
• Neural Network
• Time Series
• Text Mining
  – Fuzzy Grouping
  – Term Extraction
  – Term Lookup

Key SQL Server Algorithms - 1


Decision Tree makes predictions based on the relationships between input columns in a dataset. It identifies the input columns whose values correlate with the predictable column and predicts from that tendency toward a particular outcome. Example: predict which customers are likely to be satisfied with a company, based on some input variables (# purchases, avg. transaction size).
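As a rough illustration of how a tree picks an input column to split on, here is a sketch of a single split (a "stump") chosen by plain accuracy. This is an assumption made for brevity: it is not the scoring method SQL Server uses, and a real tree would repeat the search on each side. The function best_split and the customer data are hypothetical.

```python
from collections import Counter

def best_split(rows, labels):
    """Find the (column, threshold) whose two sides are most often predicted correctly."""
    best = None
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[col] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[col] > threshold]
            if not left or not right:
                continue  # a split must leave rows on both sides
            # Each side predicts its majority label; count the rows that agree.
            correct = (Counter(left).most_common(1)[0][1]
                       + Counter(right).most_common(1)[0][1])
            if best is None or correct > best[0]:
                best = (correct, col, threshold)
    return best

# Hypothetical inputs: (# purchases, avg. transaction size) -> satisfied?
rows = [(1, 20), (2, 90), (8, 30), (9, 80), (7, 25), (1, 85)]
labels = ["no", "no", "yes", "yes", "yes", "no"]
correct, column, threshold = best_split(rows, labels)
```

On this toy data the number of purchases (column 0) separates the outcomes, while transaction size does not.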

Linear Regression is a variation of the Decision Trees algorithm that calculates a linear relationship between a dependent and an independent variable, and then uses that relationship for prediction. The algorithm is most applicable to predicting a continuous attribute. Example: product demand, price, site visitors.

Clustering is a segmentation algorithm that uses iterative techniques to group cases in a dataset into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and segmenting customers.
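One common iterative technique is k-means, sketched below on one-dimensional data. Using k-means here is an assumption for illustration, and the function k_means_1d, the starting centers and the spend figures are all hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = sum(members) / len(members)
    return centers, clusters

# Hypothetical yearly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = k_means_1d(spend, centers=[0.0, 50.0])
```

The two resulting groups correspond to low-spend and high-spend customer segments.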



Sequence Clustering is similar to the Clustering algorithm; however, instead of finding clusters of cases that contain similar attributes, it finds clusters of cases that contain similar paths in a sequence. It is used to explore data that contains events that can be linked by following paths, or sequences. For example: the click paths that are created when users navigate a Web site; the order in which a user follows a process.

Association is useful for recommending products to customers (recommendation engine) based on items they have already bought, or in which they have indicated an interest. Example: market basket analysis.
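Market basket analysis can be sketched with two standard measures: support (how often two items appear together) and confidence (how often the second item appears given the first). The function pair_rules, the 0.5 support cutoff and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return {(a, b): confidence of a -> b} for item pairs meeting min_support."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        if count / n >= min_support:  # support = share of baskets with both items
            rules[(a, b)] = count / item_counts[a]  # confidence of a -> b
            rules[(b, a)] = count / item_counts[b]  # confidence of b -> a
    return rules

baskets = [
    ["bike", "helmet"],
    ["bike", "helmet", "gloves"],
    ["bike", "gloves"],
    ["helmet"],
]
rules = pair_rules(baskets)
```

A rule with high confidence, such as gloves -> bike here, is a candidate recommendation.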

Naive Bayes is a classification algorithm. It uses Bayes' theorem but does not take into account dependencies that may exist between inputs, so its assumptions are said to be naive. It can be used for initial exploration of data; the results can later be applied to create additional mining models with more computationally intense and more accurate algorithms. Example: send mailers only to those customers who are likely to respond.
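A minimal sketch of the idea for categorical inputs: multiply the prior of each outcome by the per-input likelihoods, treating the inputs as independent. The functions train and predict, the add-one smoothing and the mailer data are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count label frequencies and per-feature value frequencies for each label."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return label_counts, value_counts

def predict(model, row):
    """Pick the label maximising P(label) * product of P(value | label)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Add-one smoothing; the "+ 2" assumes two possible values per feature.
            score *= (value_counts[(i, label)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical inputs: (past buyer?, region) -> reaction to a mailer.
rows = [("yes", "city"), ("yes", "rural"), ("no", "city"),
        ("no", "rural"), ("yes", "city"), ("no", "rural")]
labels = ["respond", "respond", "ignore", "ignore", "respond", "ignore"]
model = train(rows, labels)
label, scores = predict(model, ("yes", "city"))
```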



Neural Network algorithm combines each possible state of the input attribute with each possible state of the predictable attribute, and uses the data to calculate probabilities. It is useful for analyzing complex input data, such as from a manufacturing or commercial process, or business problems for which a significant quantity of data is available but for which rules cannot be easily derived by using other algorithms.

Time Series algorithm provides regression algorithms that are optimized for forecasting continuous values, such as product sales, over time. Whereas other Microsoft algorithms, such as decision trees, require additional columns of new information as input to predict a trend, a time series model does not.
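To illustrate forecasting from the series alone, with no extra input columns, here is a sketch of a first-order autoregression: each value is regressed on the previous one and the fit is rolled forward. This simple model is an assumption for illustration, not the algorithm SQL Server implements; fit_ar1, forecast and the sales figures are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of series[t] = a + b * series[t - 1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    b = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr)) / sum(
        (p - mean_p) ** 2 for p in prev
    )
    a = mean_c - b * mean_p
    return a, b

def forecast(series, steps):
    """Roll the fitted relation forward, feeding each prediction back in."""
    a, b = fit_ar1(series)
    last, out = series[-1], []
    for _ in range(steps):
        last = a + b * last
        out.append(last)
    return out

sales = [100.0, 110.0, 121.0, 133.1]  # hypothetical monthly sales, +10% each month
next_two = forecast(sales, 2)
```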

Text Mining algorithm analyzes unstructured text data. This allows companies to analyze unstructured data such as a "comments" section on a customer satisfaction survey. This algorithm is available in SQL Server Integration Services.


SQL Text Mining


Term Extraction Transformation

Creates (extracts) a list of terms discovered in the source
Writes the terms (+score) to a transformation output column
Limitations:
• English only
• Nouns or noun phrases only

Term Lookup Transformation

Matches terms extracted from text in an input with terms in a reference table. Counts the number of times a term in the lookup table occurs in the input data set, writes the count together with the term from the reference table to columns in the transformation output.
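The counting step can be sketched as follows. Whole-word, case-insensitive matching is an assumption here, and the function term_lookup, the comments and the reference terms are hypothetical.

```python
import re
from collections import Counter

def term_lookup(texts, reference_terms):
    """Count how often each reference term occurs across the input texts."""
    counts = Counter()
    for text in texts:
        for term in reference_terms:
            # Whole-word, case-insensitive match; re.escape keeps terms literal.
            pattern = r"\b" + re.escape(term) + r"\b"
            counts[term] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

comments = [
    "Delivery was late, but the bike is great.",
    "Great bike. Late delivery again!",
]
counts = term_lookup(comments, ["delivery", "bike", "refund"])
```

Each reference term ends up paired with its count, mirroring the term and count output columns described above.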

Fuzzy Grouping Transformation

Selects a canonical row for each group and identifies fuzzy (through exact) text matches to it. Output: UID, Group ID, Similarity Score 0..1
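A rough sketch of the grouping idea using Python's difflib for the similarity score. The 0.8 threshold, the rule of joining the first canonical row that is similar enough, and the function fuzzy_group are assumptions for illustration, not the SSIS implementation.

```python
from difflib import SequenceMatcher

def fuzzy_group(rows, threshold=0.8):
    """Return (row id, group id, similarity); group id is the canonical row's id."""
    canonical = []  # (row id, text) of rows chosen as group representatives
    result = []
    for row_id, text in enumerate(rows):
        group_id, similarity = row_id, 1.0  # default: the row starts its own group
        for canon_id, canon_text in canonical:
            score = SequenceMatcher(None, text.lower(), canon_text.lower()).ratio()
            if score >= threshold:
                # Join the first canonical row that is similar enough.
                group_id, similarity = canon_id, score
                break
        if group_id == row_id:
            canonical.append((row_id, text))
        result.append((row_id, group_id, round(similarity, 2)))
    return result

names = ["Plaster Group", "Plaster Grup", "Contoso Ltd", "plaster group"]
groups = fuzzy_group(names)
```

Rows that share a group id are treated as variants of the same canonical text.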

Supplemental
• Sampling (training and test sets, uniform representation):
  – Row (Quantity) Sampling Transformation
  – Percentage Sampling Transformation
• Sort Transformation
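The idea behind percentage sampling for training and test sets can be sketched as a seeded shuffle followed by a cut. The function percentage_split, the 30% test fraction and the seed are hypothetical choices.

```python
import random

def percentage_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then cut it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * test_fraction)
    return shuffled[cut:], shuffled[:cut]

rows = list(range(10))
training, test = percentage_split(rows)
```

Shuffling before cutting gives every row the same chance of landing in either set, which is what keeps the two sets representative.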

Interesting Links
• Sources of free data for research
  – https://opendata.socrata.com
  – http://datamarket.azure.com
  – http://aws.amazon.com/datasets
  – http://www.google.com/publicdata/directory
• Algorithms
  – http://msdn.microsoft.com/en-us/library/ms174879.aspx
  – http://research.microsoft.com/apps/pubs/default.aspx?id=69669
  – http://academic.research.microsoft.com/Paper/4499824
  – http://academic.research.microsoft.com/Paper/226089.aspx
  – http://www.sematopia.com/2006/04/k-means-and-em-clustering-algorithms/
  – http://en.wikipedia.org/wiki/Expectation-maximization_algorithm
  – http://axon.cs.byu.edu/Dan/678/papers/Cluster/Xu.pdf
  – http://www.epa.gov/bioiweb1/statprimer/tableall.html#multivclustr
  – http://research.microsoft.com/pubs/69669/tr-98-35.pdf
  – http://mathworld.wolfram.com/K-MeansClusteringAlgorithm.html
  – http://msdn.microsoft.com/en-us/library/dd299424(v=SQL.100).aspx
  – http://msdn.microsoft.com/en-us/library/cc280445.aspx
  – http://www.sqlserverdatamining.com/ssdm/Home/DataMiningAddinsLaunch/tabid/69/Default.aspx


Useful Terms
• Population: a group of use cases
  – Valid: purchasers = customers who purchased items
  – Questionable: purchasers = customers who indicated in a survey that they would buy an item. The actual population here is customers who answered surveys; intent does not indicate behavior, plus the sample may be insufficient
• Sample: random subset of data. Correct sample size selection requires knowledge of the data.
• Range: all values, including exceptions and outliers
• Bias: incorrect results, often from incorrect non-random sample selection, e.g. selecting Seattle to represent WA
• Mean (or average): sum of values / number of samples
• Distribution: frequency of a value, typically arranged as a graph around the mean
• Variance = Σ(value − mean)² / number of samples (the sample form divides by number of samples − 1)
• Standard Deviation = √Variance
• Correlation: how closely one variable changes together with another variable
• Overfitting: model accurately fits the sample, but not the real world
• Underfitting: model is not able to establish a useful pattern
• Cross validation: checking the model on a subset of inputs not used in model generation
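The mean, variance, standard deviation and correlation terms can be worked through in a few lines. This sketch uses the population form of variance (dividing by the number of samples) and the Pearson form of correlation; both are assumptions about which variant is meant, and the data is hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance: average squared distance from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def std_dev(values):
    return math.sqrt(variance(values))

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std_dev(xs) * std_dev(ys))

visits = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical visits per customer
spend = [2 * v for v in visits]     # spend rises in step with visits
r = correlation(visits, spend)
```

Here the mean is 5, the variance 4 and the standard deviation 2, and the correlation is 1 because spend moves exactly in step with visits.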
