Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

  • View
    234

  • Download
    0

Embed Size (px)

Citation preview

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    1/185

    sas. = . . Getting S t a e d withSASEnterprise Mine( . 5.3

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    2/185

    Getting Started with

    SAS Enterprise MinerTM 5.3

    SAS

    Documentation

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    3/185

    The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2008.Getting Started with SAS Enterprise Miner TM 5.3. Cary, NC: SAS Institute Inc.

    Getting Started with SAS Enterprise MinerTM 5.3

    Copyright 2008, SAS Institute Inc., Cary, NC, USA

    ISBN-13: 978-1-59994-827-0

    All rights reserved. Produced in the United States of America.

    For a hard-copy book: No part of this publication may be reproduced, stored in aretrieval system, or transmitted, in any form or by any means, electronic, mechanical,photocopying, or otherwise, without the prior written permission of the publisher, SASInstitute Inc.

    For a Web download or e-book: Your use of this publication shall be governed by theterms established by the vendor at the time you acquire this publication.

    U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of thissoftware and related documentation by the U.S. government is subject to the Agreementwith SAS Institute and the restrictions set forth in FAR 52.22719 Commercial ComputerSoftware-Restricted Rights (June 1987).

    SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.1st printing, June 2008

    SAS Publishing provides a complete selection of books and electronic products to helpcustomers use SAS software to its fullest potential. For more information about oure-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web siteat support.sas.com/pubs or call 1-800-727-3228.

    SAS and all other SAS Institute Inc. product or service names are registered trademarksor trademarks of SAS Institute Inc. in the USA and other countries. indicates USAregistration.

    Other brand and product names are registered trademarks or trademarks of theirrespective companies.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    4/185

    Contents

    Chapter 14

    Introduction to SAS Enterprise Miner 5.3 Software 1

    Data Mining Overview 1Layout of the Enterprise Miner Window 2

    Organization and Uses of Enterprise Miner Nodes 8

    Usage Rules for Nodes 19

    Overview of the SAS Enterprise Miner 5.3 Getting Started Example 19

    Example Problem Description 20

    Software Requirements 22

    Chapter 2 4 Setting Up Your Project 23

    Create a New Project 23

    Example Data Description 26

    Locate and Install the Example Data 26

    Configure the Example Data 26

    Define the Donor Data Source 29

    Create a Diagram 43

    Other Useful Tasks and Tips 44

    Chapter 3 4 Working with Nodes That Sample, Explore, and Modify 45

    Overview of This Group of Tasks 45

    Identify Input Data 45

    Generate Descriptive Statistics 46

    Create Exploratory Plots 51

    Partition the Raw Data 54

    Replace Missing Data 55

    Chapter 4 4 Working with Nodes That Model 61

    Overview of This Group of Tasks 61

    Basic Decision Tree Terms and Results 61

    Create a Decision Tree 62

    Create an Interactive Decision Tree 75

    Chapter 54

    Working with Nodes That Modify, Model, and Explore 103

    Overview of This Group of Tasks 103

    About Missing Values 103

    Impute Missing Values 104

    Create Variable Transformations 105

    Develop a Stepwise Logistic Regression 121

    Preliminary Variable Selection 125

    Develop Other Competitor Models 128

    Chapter 64

    Working with Nodes That Assess 135

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    5/185

    iv

    Overview of This Group of Tasks 135

    Compare Models 135

    Score New Data 139

    Chapter 74

    Sharing Models and Projects 153

    Overview of This Group of Tasks 153

    Create Model Packages 154Using Saved Model Packages 155

    View the Score Code 157

    Register Models 158

    Save and Import Diagrams in XML 160

    Appendix 1 4 Recommended Reading 163

    Recommended Reading 163

    Appendix 24

    Example Data Description 165

    Example Data Description 165

    Glossary 169

    Index 175

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    6/185

    1

    C H A P T E R

    1Introduction to SAS Enterprise

    Miner 5.3 Software

    Data Mining Overview 1

    Layout of the Enterprise Miner Window 2

    About the Graphical Interface 2

    Enterprise Miner Menus 4

    Diagram Workspace Pop-up Menus 8

    Organization and Uses of Enterprise Miner Nodes 8

    About Nodes 8

    Sample Nodes 9

    Explore Nodes 11

    Modify Nodes 13

    Model Nodes 15

    Assess Nodes 17

    Utility Nodes 18

    Usage Rules for Nodes 19

    Overview of the SAS Enterprise Miner 5.3 Getting Started Example 19

    Example Problem Description 20

    Software Requirements 22

    Data Mining Overview

    SAS defines data mining as the process of uncovering hidden patterns in largeamounts of data. Many industries use data mining to address business problems and

    opportunities such as fraud detection, risk and affinity analyses, database marketing,

    householding, customer churn, bankruptcy prediction, and portfolio analysis.The SAS

    data mining process is summarized in the acronym SEMMA, which stands for

    sampling, exploring, modifying, modeling, and assessing data.

    3 Sample the data by creating one or more data tables. The sample should be large

    enough to contain the significant information, yet small enough to process.

    3 Explore the data by searching for anticipated relationships, unanticipated trends,

    and anomalies in order to gain understanding and ideas.

    3 Modify the data by creating, selecting, and transforming the variables to focus the

    model selection process.

    3 Model the data by using the analytical tools to search for a combination of the

    data that reliably predicts a desired outcome.

    3 Assess the data by evaluating the usefulness and reliability of the findings fromthe data mining process.

    You might not include all of these steps in your analysis, and it might be necessary to

    repeat one or more of the steps several times before you are satisfied with the results.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    7/185

    2 Layout of the Enterprise Miner Window 4 Chapter 1

    After you have completed the assessment phase of the SEMMA process, you apply thescoring formula from one or more champion models to new data that might or might not

    contain the target. The goal of most data mining tasks is to apply models that are

    constructed using training and validation data in order to make accurate predictions

    about observations of new, raw data.

    The SEMMA data mining process is driven by a process flow diagram, which you canmodify and save. The Graphical User Interface is designed in such a way that the

    business analyst who has little statistical expertise can navigate through the datamining methodology, while the quantitative expert can go behind the scenes to

    fine-tune the analytical process.

    SAS Enterprise Miner 5.3 contains a collection of sophisticated analysis tools that

    have a common user-friendly interface that you can use to create and compare multiplemodels. Analytical tools include clustering, association and sequence discovery, market

    basket analysis, path analysis, self-organizing maps / Kohonen, variable selection,

    decision trees and gradient boosting, linear and logistic regression, two stage modeling,

    partial least squares, support vector machines, and neural networking. Data

    preparation tools include outlier detection, variable transformations, variableclustering, interactive binning, principal components, rule building and induction, data

    imputation, random sampling, and the partitioning of data sets (into train, test, and

    validate data sets). Advanced visualization tools enable you to quickly and easily

    examine large amounts of data in multidimensional histograms and to graphicallycompare modeling results.

    Enterprise Miner is designed for PCs or servers that are running under Windows XP,

    UNIX, Linux, or subsequent releases of those operating environments. The figures and

    screen captures that are presented in this document were taken on a PC that was

    running under Windows XP.

    Layout of the Enterprise Miner Window

    About the Graphical Interface

    You use the Enterprise Miner graphical interface to build a process flow diagram that

    controls your data mining project.Figure 1.1 shows the components of the Enterprise Miner window.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    8/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 About the Graphical Interface 3

    Figure 1.1 The Enterprise Miner Window

    The Enterprise Miner window contains the following interface components:

    3 Toolbar and Toolbar shortcut buttons The Enterprise Miner Toolbar is a graphicset of node icons that are organized by SEMMA categories. Above the Toolbar is a

    collection of Toolbar shortcut buttons that are commonly used to build process flow

    diagrams in the Diagram Workspace. Move the mouse pointer over any node, or

    shortcut button to see the text name. Drag a node into the Diagram Workspace to

    use it. The Toolbar icon remains in place and the node in the Diagram Workspaceis ready to be connected and configured for use in your process flow diagram. Click

    on a shortcut button to use it.

    3

    Project Panel Use the Project Panel to manage and view data sources,diagrams, model packages, and project users.

    3 Properties Panel Use the Properties Panel to view and edit the settings of data

    sources, diagrams, nodes, and model packages.

    3 Diagram Workspace Use the Diagram Workspace to build, edit, run, and save

    process flow diagrams. This is where you graphically build, order, sequence and

    connect the nodes that you use to mine your data and generate reports.

    3 Property Help Panel The Property Help Panel displays a short description ofthe property that you select in the Properties Panel. Extended help can be found

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    9/185

    4 Enterprise Miner Menus 4 Chapter 1

    in the Help Topics selection from the Help main menu or from the Help button onmany windows.

    3 Status Bar The Status Bar is a single pane at the bottom of the window thatindicates the execution status of a SAS Enterprise Miner task.

    Enterprise Miner Menus

    Here is a summary of the Enterprise Miner menus:

    3 File

    3 New

    3 Project creates a new project.

    3 Diagram creates a new diagram.

    3 Data Source creates a new data source using the Data Source wizard.

    3 Library creates a new SAS library.

    3 Open Project opens an existing project. You can also create a new project

    from the Open Project window.

    3 Recent Projects lists the projects on which you were most recently

    working. You can open recent projects using this menu item.3 Open Model Package opens a model package SAS Package (SPK) file that

    you have previously created.

    3 Explore Model Packages opens the Model Package Manager window, in

    which you can view and compare model packages.

    3 Open Diagram opens the diagram that you select in the Project Panel.

    3 Close Diagram closes the open diagram that you select in the Project Panel.

    3 Close this Project closes the current project.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    10/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Enterprise Miner Menus 5

    3

    Delete this Project deletes the current project.

    3 Import Diagram from XML imports a diagram that has been defined by an

    XML file.

    3 Save Diagram As saves a diagram as an image (BMP or GIF) or as an

    XML file. You must have an open diagram and that diagram must be selected

    in the Project Panel. Otherwise, this menu item appears as Save As and isdimmed and unavailable.

    3 Print Diagram prints the contents of the window that is open in the

    Diagram Workspace. You must have an open diagram and that diagram must

    be selected in the Project Panel. Otherwise, this menu item is dimmed and

    unavailable.

    3 Print Preview displays a preview of the Diagram Workspace that can be

    printed. You must have an open diagram and that diagram must be selected

    in the Project Panel. Otherwise, this menu item is dimmed and unavailable.

    3 Exit ends the Enterprise Miner session and closes the window.

    3 Edit

    3 Cut deletes the selected item and copies it to the clipboard.

    3 Copy copies the selected node to the clipboard.

    3 Paste pastes a copied object from the clipboard.

    3 Delete deletes the selected diagram, data source, or node.

    3 Rename renames the selected diagram, data source, or node.

    3 Duplicate creates a copy of the selected data source.

    3 Select All selects all of the nodes in the open diagram, selects all texts in theProgram Editor, Log, or Output windows.

    3 Clear All clears text from the Program Editor, Log, or Output windows.

    3 Find/Replace opens the Find/Replace window so that you can search for and

    replace text in the Program Editor, Log, and Results windows.

    3 Go To Line opens the Go To Line window. Enter the line number on whichyou want to enter or view text.

    3 Layout

    3 Horizontally creates an orderly horizontal arrangement of the layout of

    nodes that you have placed in the Diagram Workspace.

    3 Vertically creates an orderly vertical arrangement of the layout of nodes

    that you have placed in the Diagram Workspace.

    3 Zoom increases or decreases the size of the process flow diagram within thediagram window.

    3 Copy Diagram to Clipboard copies the Diagram Workspace to the clipboard.

    3 View

    3 Program Editor opens a SAS Program Editor window in which you can enter

    SAS code.

    3 Log opens a SAS Log window.

    3 Output opens a SAS Output window.

    3 Explorer opens a window that displays the SAS libraries (and their contents)

    to which Enterprise Miner has access.

    3 Graphs opens the Graphs window. Graphs that you create with SAS code in

    the Program Editor are displayed in this window.

    3 Refresh Project updates the project tree to incorporate any changes that weremade to the project from outside the Enterprise Miner user interface.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    11/185

    6 Enterprise Miner Menus 4 Chapter 1

    3 Actions

    3 Add Node adds a node that you have selected to the Diagram Workspace.

    3 Select Nodes opens the Select Nodes window.

    3 Connect nodes opens the Connect Nodes window. You must select a node in

    the Diagram Workspace to make this menu item available. You can connect the

    node that you select to any nodes that have been placed in your DiagramWorkspace.

    3 Disconnect Nodes opens the Disconnect Nodes window. You must select anode in the Diagram Workspace to make this menu item available. You can

    disconnect the selected node from a predecessor node or a successor node.

    3 Update updates the selected node to incorporate any changes that you have

    made.

    3 Run runs the selected node and any predecessor nodes in the process flow

    that have not been executed, or submits any code that you type in the Program

    Editor window.

    3 Stop Run interrupts a currently running process flow.

    3 View Results opens the Results window for the selected node.

    3 Create Model Package generates a mining model package.

    3 Export Path as SAS Program saves the path that you select as a SAS

    program. In the window that opens, you can specify the location to which you

    want to save the file. You also specify whether you want the code to run the

    path or create a model package.

    3 Options

    3 Preferences opens the Preferences window. Use the following options to

    change the user interface:

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    12/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Enterprise Miner Menus 7

    3 Look and Feel you can select Cross Platform, which uses a standardappearance scheme that is the same on all platforms, or System which uses

    the appearance scheme that you have chosen for your platform.

    3 Property Sheet Tooltips controls whether tooltips are displayed on various

    property sheets appearing throughout the user interface.

    3 Tools Palette Tooltips controls how much tooltip information you wantdisplayed for the tool icons in the Toolbar.

    3 Sample Methods generates a sample that will be used for graphicaldisplays. You can specify either Top or Random.

    3 Fetch Size specifies the number of observations to download for graphical

    displays. You can choose either Default or Max.

    3 Random Seed specifies the value you want to use to randomly sample

    observations from your input data.

    3 Generate C Score Code creates C score code when you create a report. The

    default is No.

    3 Generate Java Score Code creates Java score code when you create a

    report. The default is No. If you select Yes for Generate Java Score Code,

    you must enter a filename for the score code package in the Java Score Code

    Package box.3 Java Score Code Package identifies the filename of the Java Score Code

    package.

    3 Grid Processing enables you to use grid processing when you are running

    data mining flows on grid-enabled servers.

    3 Window

    3 Tile displays windows in the Diagram Workspace so that all windows are

    visible at the same time.

    3 Cascade displays windows in the Diagram Workspace so that windowsoverlap.

    3 Help

    3 Contents opens the Enterprise Miner Help window, which enables you to

    view all the Enterprise Miner Reference Help.

    3 Component Properties opens a table that displays the component

    properties of each tool.

    3 Generate Sample Data Sources creates sample data sources that you can

    access from the Data Sources folder.

    3 Configuration displays the current system configuration of your EnterpriseMiner session.

    3 About displays information about the version of Enterprise Miner that youare using.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    13/185

    8 Diagram Workspace Pop-up Menus 4 Chapter 1

    Diagram Workspace Pop-up Menus

    You can use the Diagram Workspace pop-up menus to perform many tasks. To open

    the pop-up menu, right-click in an open area of the Diagram Workspace. (Note that you

    can also perform many of these tasks by using the pull-down menus.) The pop-up menu

    contains the following items:

    3 Add node accesses the Add Node window.3 Paste pastes a node from the clipboard to the Diagram Workspace.

    3

    Select All selects all nodes in the process flow diagram.

    3 Select Nodes opens a window that displays all the nodes that are on your

    diagram. You can select as many as you want.

    3 Layout creates an orderly horizontally or vertically aligned arrangement of the

    nodes in the Diagram Workspace.

    3 Zoom increases or decreases the size of the process flow diagram within the

    diagram window by the amount that you choose.

    3 Copy Diagram to Clipboard copies the Diagram Workspace to the clipboard.

    Organization and Uses of Enterprise Miner Nodes

    About Nodes

    The nodes of Enterprise Miner are organized according to the Sample, Explore,

    Modify, Model, and Assess (SEMMA) data mining methodology. In addition, there are

    also Credit Scoring and Utility node tools. You use the Credit Scoring node tools to

    score your data models and to create freestanding code. You use the Utility node toolsto submit SAS programming statements, and to define control points in the process flow

    diagram.

    Note: The Credit Scoring tab does not appear in all installed versions ofEnterprise Miner.

    4

    Remember that in a data mining project, it can be an advantage to repeat parts of

    the data mining process. For example, you might want to explore and plot the data atseveral intervals throughout your project. It might be advantageous to fit models,

    assess the models, and then refit the models and then assess them again.

    The following tables list the nodes and give each nodes primary purpose.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    14/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Sample Nodes 9

    Sample Nodes

    Node Name Description

    Append Use the Append node to append data sets that are exported by two

    different paths in a single process flow diagram. The Append nodecan also append train, validation, and test data sets into a new

    training data set.

    Data Partition Use the Data Partition node to partition data sets into training, test,

    and validation data sets. The training data set is used for

    preliminary model fitting. The validation data set is used to monitor

    and tune the model weights during estimation and is also used for

    model assessment. The test data set is an additional hold-out data

    set that you can use for model assessment. This node uses simple

    random sampling, stratified random sampling, or clustered sampling

    to create partitioned data sets. See Chapter 3.

    Filter Use the Filter node to create and apply filters to your training data

    set and optionally, to the validation and test data sets. You can use

    filters to exclude certain observations, such as extreme outliers and

    errant data that you do not want to include in your mining analysis.

    Filtering extreme values from the training data tends to produce

    better models because the parameter estimates are more stable. By

    default, the Filter node ignores target and rejected variables.

    Input Data Source Use the Input Data Source node to access SAS data sets and other

    types of data. This node introduces a predefined Enterprise Miner

    Data Source and metadata into a Diagram Workspace for processing.

    You can view metadata information about your data in the Input

    Data Source node, such as initial values for measurement levels and

    model roles of each variable. Summary statistics are displayed for

    interval and class variables. See Chapter 3.

    Merge Use the Merge node to merge observations from two or more data

    sets into a single observation in a new data set.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    15/185

    10 Sample Nodes 4 Chapter 1

    Node Name Description

    Sample Use the Sample node to take random, stratified random samples,

    and to take cluster samples of data sets. Sampling is recommended

    for extremely large databases because it can significantly decrease

    model training time. If the random sample sufficiently represents the

    source data set, then data relationships that Enterprise Miner finds

    in the sample can be extrapolated upon the complete source data set.

    The Sample node writes the sampled observations to an output data

    set and saves the seed values that are used to generate the random

    numbers for the samples so that you can replicate the samples.

    Time Series Use the Time Series node to convert transactional data to time series

    data to perform seasonal and trend analysis. This node enables you

    to understand trends and seasonal variations in the transaction data

    that you collect from your customers and suppliers over the time, by

    converting transactional data into time series data. Transactional

    data is time-stamped data that is collected over time at no particular

    frequency. By contrast, time series data is time-stamped data that is

    collected over time at a specific frequency. The size of transaction

    data can be very large, which makes traditional data mining tasks

    difficult. By condensing the information into a time series, you candiscover trends and seasonal variations in customer and supplier

    habits that might not be visible in transactional data.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    16/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Explore Nodes 11

    Explore Nodes

    Node Name Description

    Association Use the Association node to identify association relationships within

    the data. For example, if a customer buys a loaf of bread, how likelyis the customer to also buy a gallon of milk? You use the Association

    node to perform sequence discovery if a time-stamped variable (a

    sequence variable) is present in the data set. Binary sequences are

    constructed automatically, but you can use the Event Chain Handler

    to construct longer sequences that are based on the patterns that the

    algorithm discovered.

    Cluster Use the Cluster node to segment your data so that you can identify

    data observations that are similar in some way. When displayed in a

    plot, observations that are similar tend to be in the same cluster,

    and observations that are different tend to be in different clusters.

    The cluster identifier for each observation can be passed to other

    nodes for use as an input, ID, or target variable. This identifier canalso be passed as a group variable that enables you to automatically

    construct separate models for each group.

    DMDB The DMDB node creates a data mining database that provides

    summary statistics and factor-level information for class and

    interval variables in the imported data set.

    In Enterprise Miner 4.3, the DMDB database optimized the

    performance of the Variable Selection, Tree, Neural Network, and

    Regression nodes. It did so by reducing the number of

    passes through the data that the analytical engine needed to make

    when running a process flow diagram. Improvements to the

    Enterprise Miner 5.3 software have eliminated the need to use the

    DMDB node to optimize the performance of nodes, but the DMDBdatabase can still provide quick summary statistics for class and

    interval variables at a given point in a process flow diagram.

    Graph Explore The Graph Explore node is an advanced visualization tool that

    enables you to explore large volumes of data graphically to uncover

    patterns and trends and to reveal extreme values in the

    database. You can analyze univariate distributions, investigate

    multivariate distributions, create scatter and box plots, constellation

    and 3D charts, and so on. If the Graph Explore node follows a node

    that exports a data set in the process flow, it can use either a sample

    or the entire data set as input. The resulting plot is fully interactive:

    you can rotate a chart to different angles and move it anywhere on

    the screen to obtain different perspectives on the data. You can also

    probe the data by positioning the cursor over a particular bar within

    the chart. A text window displays the values that correspond to that

    bar. You may also want to use the node downstream in the process

    flow to perform tasks, such as creating a chart of the predicted

    values from a model developed with one of the modeling nodes.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    17/185

    12 Explore Nodes 4 Chapter 1

    Node Name Description

    Market Basket The Market Basket node performs association rule mining over

    transaction data in conjunction with item taxonomy. Transaction

    data contain sales transaction records with details about items

    bought by customers. Market basket analysis uses the information

    from the transaction data to give you insight about which products

    tend to be purchased together. This information can be used to

    change store layouts, to determine which products to put on sale, or

    to determine when to issue coupons or some other profitable course

    of action.

    The market basket analysis is not limited to the retail marketing

    domain. The analysis framework can be abstracted to other areas

    such as word co-occurrence relationships in text documents.

    The Market Basket node is not included with SAS Enterprise Miner

    for the Desktop.

    MultiPlot Use the MultiPlot node to explore larger volumes of data graphically.

    The MultiPlot node automatically creates bar charts and scatter

    plots for the input and target variables without requiring you to

    make several menu or window item selections. The code that is

    created by this node can be used to create graphs in a batch

    environment. See Chapter 3.

    Path Analysis Use the Path Analysis node to analyze Web log data and to

    determine the paths that visitors take as they navigate through a

    Web site. You can also use the node to perform sequence analysis.

    SOM/Kohonen Use the SOM/Kohonen node to perform unsupervised learning by

    using Kohonen vector quantization (VQ), Kohonen self-organizing

    maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear

    smoothing. Kohonen VQ is a clustering method, whereas SOMs are

    primarily dimension-reduction methods.

    StatExplore Use the StatExplore node to examine variable distributions and

    statistics in your data sets. You can use the StatExplore node tocompute standard univariate distribution statistics, to compute

    standard bivariate statistics by class target and class segment, and to

    compute correlation statistics for interval variables by interval input

    and target. You can also combine the StatExplore node with other

    Enterprise Miner tools to perform data mining tasks such as using

    the StatExplore node with the Metadata node to reject variables,

    using the StatExplore node with the Transform Variables node to

    suggest transformations, or even using the StatExplore node with

    the Regression node to create interactions terms. See Chapter 3.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    18/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Modify Nodes 13

    Node Name Description

    Variable Clustering Variable clustering is a useful tool for data reduction, such as

    choosing the best variables or cluster components for

    analysis. Variable clustering removes collinearity, decreases variable

    redundancy, and helps to reveal the underlying structure of the input

    variables in a data set. When properly used as a variable-reduction

    tool, the Variable Clustering node can replace a large set of variables

    with the set of cluster components with little loss of information.

    Variable Selection Use the Variable Selection node to evaluate the importance of input

    variables in predicting or classifying the target variable. To preselect

    the important inputs, the Variable Selection node uses either an

    R-Square or a Chi-Square selection (tree-based) criterion. You can

    use the R-Square criterion to remove variables in hierarchies,

    remove variables that have large percentages of missing values, and

    remove class variables that are based on the number of unique

    values. The variables that are not related to the target are set to a

    status of rejected. Although rejected variables are passed to

    subsequent nodes in the process flow diagram, these variables are

    not used as model inputs by a more detailed modeling node, such as

    the Neural Network and Decision Tree nodes. You can reassign thestatus of the input model variables to rejected in the Variable

    Selection node. See Chapter 5.

    Modify Nodes

    Node Name Description

    Drop Use the Drop node to drop certain variables from your scored

    Enterprise Miner data sets. You can drop variables that have roles

    of Assess, Classification, Frequency, Hidden, Input, Predict,

    Rejected, Residual, Target, and Other from your scored data sets.

    Impute Use the Impute node to impute (fill in) values for observations that

    have missing values. You can replace missing values for interval

    variables with the mean, median, midrange, mid-minimum spacing,

    distribution-based replacement. Alternatively, you can use a

    replacement M-estimator such as Tukeys biweight, Hubers, or

    Andrews Wave. You can also estimate the replacement values for

    each interval input by using a tree-based imputation method.

    Missing values for class variables can be replaced with the most

    frequently occurring value, distribution-based replacement,tree-based imputation, or a constant. See Chapter 5.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    19/185

    14 Modify Nodes 4 Chapter 1

    Node Name Description

    Interactive Binning The Interactive Binning node is an interactive grouping tool that you

    use to model nonlinear functions of multiple modes of continuous

    distributions. The interactive tool computes initial bins by quantiles;

    then you can interactively split and combine the initial bins.You use

    the Interactive Binning node to create bins or buckets or classes of

    all input variables. You can create bins in order to reduce the

    number of unique levels as well as attempt to improve the predictive

    power of each input. The Interactive Binning node enables you to

    select strong characteristics based on the Gini statistic and to group

    the selected characteristics based on business considerations. The

    node is helpful in shaping the data to represent risk ranking trends

    rather than modeling quirks, which might lead to overfitting.

    Principal Components Use the Principal Components node to perform a principal

    components analysis for data interpretation and dimension

    reduction. The node generates principal components that are

    uncorrelated linear combinations of the original input variables and

    that depend on the covariance matrix or correlation matrix of the

    input variables. In data mining, principal components are usually

    used as the new set of input variables for subsequent analysis bymodeling nodes.

    Replacement Use the Replacement node to impute (fill in) values for observations

    that have missing values and to replace specified non-missing values

    for class variables in data sets. You can replace missing values for

    interval variables with the mean, median, midrange, or

    mid-minimum spacing, or with a distribution-based replacement.

    Alternatively, you can use a replacement M-estimator such as

    Tukeys biweight, Hubers, or Andrews Wave. You can also estimate

    the replacement values for each interval input by using a tree-based

    imputation method. Missing values for class variables can be

    replaced with the most frequently occurring value,

    distribution-based replacement, tree-based imputation, or aconstant. See Chapters 3, 4, and 5.

    Rules Builder The Rules Builder node accesses the Rules Builder window so you

    can create ad hoc sets of rules with user-definable outcomes. You can

    interactively define the values of the outcome variable and the paths

    to the outcome. This is useful in ad hoc rule creation such as

    applying logic for posterior probabilities and scorecard values. Any

    Input Data Source data set can be used as an input to the Rules

    Builder node. Rules are defined using charts and histograms based

    on a sample of the data.

    Transform Variables Use the Transform Variables node to create new variables that are

    transformations of existing variables in your data. Transformations

    are useful when you want to improve the fit of a model to the data.For example, transformations can be used to stabilize variances,

    remove nonlinearity, improve additivity, and correct nonnormality in

    variables. In Enterprise Miner, the Transform Variables node also

    enables you to transform class variables and to create interaction

    variables. See Chapter 5.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    20/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 M ode l N od es 15

    Model Nodes

    Node Name Description

    AutoNeural Use the AutoNeural node to automatically configure a neural

    network. It conducts limited searches for a better network

    configuration. See Chapters 5 and 6.

    Decision Tree Use the Decision Tree node to fit decision tree models to your data.

    The implementation includes features that are found in a variety of

    popular decision tree algorithms such as CHAID, CART, and C4.5.

    The node supports both automatic and interactive training. When

    you run the Decision Tree node in automatic mode, it automatically

    ranks the input variables, based on the strength of their

    contribution to the tree. This ranking can be used to select variables

    for use in subsequent modeling. You can override any automatic step

    with the option to define a splitting rule and prune explicit tools or

    subtrees. Interactive training enables you to explore and evaluate alarge set of trees as you develop them. See Chapters 4 and 6.

    Dmine Regression Use the Dmine Regression node to compute a forward stepwise

    least-squares regression model. In each step, an independent

    variable is selected that contributes maximally to the model

    R-square value.

    DMNeural Use DMNeural node to fit an additive nonlinear model. The additive

    nonlinear model uses bucketed principal components as inputs to

    predict a binary or an interval target variable.

    Ensemble Use the Ensemble node to create new models by combining the

    posterior probabilities (for class targets) or the predicted values (for

    interval targets) from multiple predecessor models.

    Gradient Boosting Gradient boosting is a boosting approach that creates a series of

    simple decision trees that together form a single predictive model.

    Each tree in the series is fit to the residual of the prediction from the

    earlier trees in the series. Each time the data is used to grow a tree,

    the accuracy of the tree is computed. The successive samples are

    adjusted to accommodate previously computed inaccuracies. Because

    each successive sample is weighted according to the classification

    accuracy of previous models, this approach is sometimes called

    stochastic gradient boosting. Boosting is defined for binary, nominal,

    and interval targets.

    MBR (Memory-Based

    Reasoning)

    Use the MBR (Memory-Based Reasoning) node to identify similar

    cases and to apply information that is obtained from these cases to a

    new record. The MBR node uses k-nearest neighbor algorithms to

    categorize or predict observations.

    Model Import Use the Model Import node to import and assess a model that was

    not created by one of the Enterprise Miner modeling nodes. You can

    then use the Model Comparison node to compare the user-defined

    model with one or more models that you developed with an

    Enterprise Miner modeling node. This process is called integrated

    assessment.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    21/185

    16 Mo de l No de s4 Chapter 1

    Node Name Description

    Neural Network Use the Neural Network node to construct, train, and validate

    multilayer feedforward neural networks. By default, the Neural

    Network node automatically constructs a multilayer feedforward

    network that has one hidden layer consisting of three neurons. In

    general, each input is fully connected to the first hidden layer, each

    hidden layer is fully connected to the next hidden layer, and the last

    hidden layer is fully connected to the output. The Neural Network

    node supports many variations of this general form. See Chapters 5

    and 6.

    Partial Least Squares The Partial Least Squares node is a tool for modeling continuous

    and binary targets that are based on SAS/STAT PROC PLS. Partial

    least squares regression produces factor scores that are linear

    combinations of the original predictor variables. As a result, no

    correlation exists between the factor score variables that are used in

    the predictive regression model. Consider a data set that has a

    matrix of response variables Y and a matrix with a large number of

    predictor variables X. Some of the predictor variables are highly

    correlated. A regression model that uses factor extraction for the

    data computes the factor score matrix T=XW, where W is the weightmatrix. Next, the model considers the linear regression model

    Y=TQ+E, where Q is a matrix of regression coefficients for the factor

    score matrix T, and where E is the noise term. After computing the

    regression coefficients, the regression model becomes equivalent to

    Y=XB+E, where B=WQ, which can be used as a predictive regression

    model.

    Regression Use the Regression node to fit both linear and logistic regression

    models to your data. You can use continuous, ordinal, and binary

    target variables. You can use both continuous and discrete variables

    as inputs. The node supports the stepwise, forward, and backward

    selection methods. A point-and-click term editor enables you to

    customize your model by specifying interaction terms and theordering of the model terms. See Chapters 5 and 6.

    Rule Induction Use the Rule Induction node to improve the classification of rare

    events in your modeling data. The Rule Induction node creates a

    Rule Induction model that uses split techniques to remove the

    largest pure split node from the data. Rule Induction also creates

    binary models for each level of a target variable and ranks the levels

    from the most rare event to the most common. After all levels of the

    target variable are modeled, the score code is combined into a SAS

    DATA step.

    Support Vector Machines

    (Experimental)

    Support Vector Machines are used for classification. They use a

    hyperplane to separate points mapped on a higher dimensional

    space. The data points used to build this hyperplane are calledsupport vectors.

    TwoStage Use the TwoStage node to compute a two-stage model for predicting

    a class and an interval target variables at the same time. The

    interval target variable is usually a value that is associated with a

    level of the class target.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    22/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 A ss es s N od es 17

    Note: These modeling nodes use a directory table facility, called the Model Manager,in which you can store and access models on demand. The modeling nodes also enable

    you to modify the target profile or profiles for a target variable. 4

    Assess Nodes

    Node Name Description

    Cutoff The Cutoff node provides tabular and graphical information to assist

    users in determining an appropriate probability cutoff point for

    decision making with binary target models. The establishment of a

    cutoff decision point entails the risk of generating false positives and

    false negatives, but an appropriate use of the Cutoff node can help

    minimize those risks.

    You will typically run the node at least twice. In the first run, you

    obtain all the plots and tables. In subsequent runs, you can change

    the values of the Cutoff Method and Cutoff User Input properties,

    customizing the plots, until an optimal cutoff value is obtained.

    Decisions Use the Decisions node to define target profiles for a target that

    produces optimal decisions. The decisions are made using a

    user-specified decision matrix and output from a subsequent

    modeling procedure.

    Model Comparison Use the Model Comparison node to use a common framework for

    comparing models and predictions from any of the modeling tools

    (such as Regression, Decision Tree, and Neural Network tools). The

    comparison is based on the expected and actual profits or losses that

    would result from implementing the model. The node produces the

    following charts that help to describe the usefulness of the model:

    lift, profit, return on investment, receiver operating curves,

    diagnostic charts, and threshold-based charts. See Chapter 6.

    Segment Profile Use the Segment Profile node to assess and explore segmented data

    sets. Segmented data is created from data BY-values, clustering, or

    applied business rules. The Segment Profile node facilitates data

    exploration to identify factors that differentiate individual segments

    from the population, and to compare the distribution of key factors

    between individual segments and the population. The Segment

    Profile node outputs a Profile plot of variable distributions across

    segments and population, a Segment Size pie chart, a Variable

    Worth plot that ranks factor importance within each segment, and

    summary statistics for the segmentation results. The Segment

    Profile node does not generate score code or modify metadata.

    Score Use the Score node to manage, edit, export, and execute scoring codethat is generated from a trained model. Scoring is the generation of

    predicted values for a data set that might not contain a target

    variable. The Score node generates and manages scoring formulas in

    the form of a single SAS DATA step, which can be used in most SAS

    environments even without the presence of Enterprise Miner. See

    Chapter 6.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    23/185

    18 Utility Nodes 4 Chapter 1

    Utility Nodes

    Node Name Description

    Control Point Use the Control Point node to establish a control point to reduce thenumber of connections that are made in process flow diagrams. For

    example, suppose three Input Data nodes are to be connected to

    three modeling nodes. If no Control Point node is used, then nine

    connections are required to connect all of the Input Data nodes to all

    of the modeling nodes. However, if a Control Point node is used, only

    six connections are required.

    End Groups The End Groups node is used only in conjunction with the Start

    Groups node. The End Groups node acts as a boundary marker that

    defines the end of group processing operations in a process flow

    diagram. Group processing operations are performed on the portion

    of the process flow diagram that exists between the Start Groups

    node and the End Groups node.

    If the group processing function that is specified in the Start Groups

    node is stratified, bagging, or boosting, the End Groups node

    functions as a model node and presents the final aggregated model.

    Enterprise Miner tools that follow the End Groups node continue

    data mining processes normally.

    Start Groups The Start Groups node is useful when your data can be segmented

    or grouped, and you want to process the grouped data in different

    ways. The Start Groups node uses BY-group processing as a method

    to process observations from one or more data sources that are

    grouped or ordered by values of one or more common variables. BY

    variables identify the variable or variables by which the data source

    is indexed, and BY statements process data and order output

    according to the BY-group values.

    You can use the Enterprise Miner Start Groups node to perform

    these tasks:

    3 define group variables such as GENDER or JOB, in order to

    obtain separate analyses for each level of a group variable

    3 analyze more than one target variable in the same process flow

    3 specify index looping, or how many times the flow that follows

    the node should loop

    3 resample the data set and use unweighted sampling to create

    bagging models

    3 resample the training data set and use reweighted sampling to

    create boosting models

    Metadata Use the Metadata node to modify the columns metadata information

    at some point in your process flow diagram. You can modify

    attributes such as roles, measurement levels, and order.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    24/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Overview of the SAS Enterprise Miner 5.3 Getting Started Example 19

    Node Name Description

    Reporter The Reporter node uses SAS Output Delivery System (ODS)

    capability to create a single PDF or RTF file that contains

    information about the open process flow diagram. The PDF or RTF

    documents can be viewed and saved directly and are included in

    Enterprise Miner report package files.

    The report contains a header that shows the Enterprise Miner

    settings, process flow diagram, and detailed information for each

    node. Based on the Nodes property setting, each node that is

    included in the open process flow diagram has a header, property

    settings, and a variable summary. Moreover, the report also includes

    results such as variable selection, model diagnostic tables, and plots

    from the Results browser. Score code, log, and output listing are not

    included in the report. Those items are found in the Enterprise

    Miner package folder.

    SAS Code Use the SAS Code node to incorporate new or existing SAS code into

    process flows that you develop using Enterprise Miner. The SAS

    Code node extends the functionality of Enterprise Miner by making

    other SAS procedures available in your data mining analysis. You

    can also write a SAS DATA step to create customized scoring code, to

    conditionally process data, and to concatenate or to merge existing

    data sets. See Chapter 6.

    Usage Rules for Nodes

    Here are some general rules that govern the placement of nodes in a process flow

    diagram:

    3 The Input Data Source node cannot be preceded by any other nodes.

    3 All nodes except the Input Data Source and SAS Code nodes must be preceded by

    a node that exports a data set.

    3 The SAS Code node can be defined in any stage of the process flow diagram. Itdoes not require an input data set that is defined in the Input Data Source node.

    3 The Model Comparison node must be preceded by one or more modeling nodes.

    3 The Score node must be preceded by a node that produces score code. Forexample, the modeling nodes produce score code.

    3 The Ensemble node must be preceded by a modeling node.

    3 The Replacement node must follow a node that exports a data set, such as a Data

    Source, Sample, or Data Partition node.

    Overview of the SAS Enterprise Miner 5.3 Getting Started Example

    This book uses an extended example that is intended to familiarize you with the

    many features of Enterprise Miner. Several key components of the Enterprise Miner

    process flow diagram are covered.

    In this step-by-step example you learn to do basic tasks in Enterprise Miner: youcreate a project and build a process flow diagram. In your diagram you perform tasks

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    25/185

    20 Example Problem Description 4 Chapter 1

    such as accessing data, preparing the data, building multiple predictive models,comparing the models, selecting the best model, and applying the chosen model to new

    data (known as scoring data). You also perform tasks such as filtering data, exploring

    data, and transforming variables. The example is designed to be used in conjunction

    with Enterprise Miner software.

    Example Problem Description

    A national charitable organization seeks to better target its solicitations for

    donations. By only soliciting the most likely donors, less money will be spent onsolicitation efforts and more money will be available for charitable concerns.

    Solicitations involve sending a small gift to an individual along with a request for a

    donation. Gifts include mailing labels and greeting cards.

    The organization has more than 3.5 million individuals in its mailing database.

    These individuals have been classified by their response to previous solicitation efforts.

    Of particular interest is the class of individuals who are identified as lapsing donors.These individuals have made their most recent donation between 12 and 24 months

    ago. The organization has found that by predicting the response of this group, they can

    use the model to rank all 3.5 million individuals in their database. The campaign refersto a greeting card mailing sent in June of 1997. It is identified in the raw data as the

    97NK campaign.When the most appropriate model for maximizing solicitation profit by screening the

    most likely donors is determined, the scoring code will be used to create a new score

    data set that is named Donor.ScoreData. Scoring new data that does not contain the

    target is the end result of most data mining applications.

    When you are finished with this example, your process flow diagram will resemblethe one shown below.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    26/185

    Introduction to SAS Enterprise Miner 5.3 Software 4 Example Problem Description 21

    &ata P a 1 ~ i o n s t a t o r e f

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    27/185

    22 Software Requirements 4 Chapter 1

    Here is a preview of topics and tasks in this example:

    Chapter Task

    2 Create your project, define the data source, configure the metadata, define

    prior probabilities and profit matrix, and create an empty process flow

    diagram.

    3 Define the input data, explore your data by generating descriptivestatistics and creating exploratory plots. You will also partition the raw

    data and replace missing data.

    4 Create a decision tree and interactive decision tree models.

    5 Impute missing values and create variable transformations. You will also

    develop regression, neural network, and autoneural models. Finally, you

    will use the variable selection node.

    6 Assess and compare the models. Also, you will score new data using the

    models.

    7 Create model results packages, register your models, save and import the

    process flow diagram in XML.

    Note: This example provides an introduction to using Enterprise Miner in order to

    familiarize you with the interface and the capabilities of the software. The example is

    not meant to provide a comprehensive analysis of the sample data. 4

    Software Requirements

    In order to re-create this example, you must have access to SAS Enterprise Miner 5.3

    software, either as client/server application, or as a complete client on your local

    machine.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    28/185

    23

    C H A P T E R

    2Setting Up Your Project

    Create a New Project 23

    Example Data Description 26

    Locate and Install the Example Data 26

    Configure the Example Data 26

    Define the Donor Data Source 29

    Overview of the Enterprise Miner Data Source 29

    Specify the Data Type 30

    Select a SAS Table 31

    Configure the Metadata 33

    Define Prior Probabilities and a Profit Matrix 38

    Optional Steps 42

    Create a Diagram 43

    Other Useful Tasks and Tips 44

    Create a New Project

    In Enterprise Miner, you store your work in projects. A project can contain multipleprocess flow diagrams and information that pertains to them. It is a good idea to create

    a separate project for each major data mining problem that you want to investigate.

    This task creates a new project that you will use for this example.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    29/185

    24 Create a New Project4 Chapter 2

    1 To create a new project, click New Project in the Welcome to Enterprise Minerwindow.

    E n t e r r s e Miner - - L]I ile Edit View A..ctions Options Window HelpWelcome to Enterp s e M n e r

    w Help Topicsw New Project ..w Open Project ..w Recent P r o j e c t t ~ ...w ~

    J e l e c t File-.>New Project to cre .. s a s g u t ~ s t as SAS project open

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    30/185

    Setting Up Your Project 4 Create a New Project 25

    2 The Create New Project window opens. In the Name box, type a name for theproject, such as Getting Started Charitable Giving Example.

    3 In the Host box, select a logical workspace server from the drop-down list. The

    main SAS workspace server is named SASMain by default. Contact your system

    administrator if you are unsure of your sites configuration.

    4 In the Path box, type the path to the location on the server where you want to

    store the data that is associated with the example project. Your project pathdepends on whether you are running Enterprise Miner as a complete client on

    your local machine or as a client/server application.

    If you are running Enterprise Miner as a complete client, your local machineacts as its own server. Your Enterprise Miner projects are stored on your local

    machine, in a location that you specify, such as C:\EMProjects.

    If you are running Enterprise Miner as a client/server application, all projects

    are stored on the Enterprise Miner server. Ask your system administrator to

    configure the library location and access permission to the data source for this

    example.

    If the Path box is empty, you must enter a valid path. If you see a default path

    in the Path box, you can accept the default path, or you may be able to specifyyour own project path. If you see a default path in the Path box and the path field

    is dimmed and unavailable for editing, you must use the default path that has

    been defined by the system administrator. This example uses C:\EMProjects\.

    5 On the Start-Up Code tab, you can enter SAS code that you want SAS EnterpriseMiner to run each time you open the project. Enter the following statement.

    Similarly, you can use the Exit Code tab to enter SAS code that you want

    Enterprise Miner to run each time you exit the project.

    6 Click OK . The new project will be created and it opens automatically.

    Note: Example results might differ from your results. Enterprise Miner nodes and

    their statistical methods might incrementally change between releases. Your process

    flow diagram results might differ slightly from the results that are shown in thisexample. However, the overall scope of the analysis will be the same. 4

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    31/185

    26 Example Data Description 4 Chapter 2

    Example Data Description

    See Example Data Description for a list of variables that are used in this example.

    Locate and Install the Example DataDownload the donor_raw_data.sas7bdat and donor_score_data.sas7bdat data

    sets from http://support.sas.com/documentation/onlinedoc/miner under the

    SAS Enterprise Miner 5.3 heading.

    If you access Enterprise Miner 5.3 as a complete client, download and save the donor

    sample data source to your local machine. If you are running Enterprise Miner as aclient/server application, downloadand save the donor sample data source to the

    Enterprise Miner server

    Configure the Example Data

    The first step is to create a SAS library that is accessible by Enterprise Miner. When

    you create a library, you give SAS a shortcut name or pointer to a storage location in

    your operating environment where you store SAS files.

    To create a new SAS library for your sample donor data using Enterprise Miner 5.3,

    complete the following steps:

    1 Open the Explorer window by clicking on the Explorer icon ( ) or by selecting

    View I Explorer .

    2 Select File I New I Library. The Library Wizard will open.

    3 In the Library Wizard, click the Create New Library and then click Next .

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    32/185

    Setting Up Your Project 4 Configure the Example Data 27

    4 In the Name box of the Library Wizard, enter a library reference. The library name

    is Donor in this example.

    Note: Library names are limited to eight characters. 4

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    33/185

    28 Configure the Example Data 4 Chapter 2

    5 Select an engine type from the drop-down list. If you are not sure which engine to

    choose, use the Base SAS engine. If no data sets exist in your new library, then

    select the Base SAS engine.

    6 Type the path where your data is stored in the Path box of the LibraryInformation area. For this example, we supplied the path c:\EM53\GS\data.

    7 Enter any options that you want to specify in the Options box of the Library

    Information area. For this example, leave the Options box blank.

    8 Click Next .

    The following window will be displayed enabling you to confirm the informationthat you have entered.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    34/185

    Setting Up Your Project 4 Overview of the Enterprise Miner Data Source 29

    9 Click Finish .

    10 Click the Show Project Data check box in the Explorer window, and you will see

    the new Donor library.

    Define the Donor Data Source

    Overview of the Enterprise Miner Data Source

    In order to access the example data in Enterprise Miner, you need to define the

    imported data as an Enterprise Miner data source. An Enterprise Miner data sourcestores all of the data sets metadata. Enterprise Miner metadata includes the data sets

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    35/185

    30 Specify the Data Type 4 Chapter 2

    name, location, library path, as well as variable role assignments, measurement levels,and other attributes that guide the data mining process. The metadata is necessary in

    order to start data mining. Note that Enterprise Miner data sources are not the actual

    training data, but are the metadata that defines the data source for Enterprise Miner.

    The data source must reside in an allocated library. You assigned the libname Donor

    to the data that is found in C:\EM53\GS\Data when you created the SAS Library forthis example.

    The following tasks use the Data Source wizard in order to define the data sourcethat you will use for this example.

    Specify the Data Type

    In this task you open the Data Source wizard and identify the type of data that you

    will use.

    1 Right-click the Data Sources folder in the Project Navigator and select Create

    Data Source to open the Data Source wizard. Alternatively, you can select File I

    New I Data Source from the main menu, or you can click theCreate Data Source on the Shortcut Toolbar.

    2 In the Source box of the Data Source Wizard Metadata Source window, select SAS

    Table to tell SAS Enterprise Miner that the data is formatted as a SAS table.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    36/185

    Setting Up Your Project 4 Select a SAS Table 31

    3 Click Next . The Data Source Wizard Select a SAS Table window opens.

    Select a SAS Table

    In this task, you specify the data set that you will use, and view the table metadata.

    1 Click Browse in the Data Source Wizard Select a SAS Table window.

    The Select a SAS Table window opens.

    2 Click the SAS library named Donor in the list of libraries on the left. The Donor

    library folder expands to show all the data sets that are in the library.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    37/185

    32 Select a SAS Table4 Chapter 2

    3 Select the DONOR_RAW_DATA table and click OK . The two-level name

    DONOR.DONOR_RAW_DATAappears in the Table box of the Select a SAS Table

    window.

    4 Click Next . The Table Information window opens. Examine the metadata in the

    Table Properties section. Notice that the DONOR_RAW_DATA data set has 50 variables and 19,372 observations.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    38/185

    Setting Up Your Project 4 Configure the Metadata 33

    5 After you finish examining the table metadata, click Next . The Data Source

    Wizard Metadata Advisor Options window opens.

    Configure the Metadata

    The Metadata Configuration step activates the Metadata Advisor, which you can use

    to control how Enterprise Miner organizes metadata for the variables in your data

    source.

    In this task, you generate and examine metadata about the variables in your data set.

    1 Select Advanced and click Customize .

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    39/185

    34 Configure the Metadata 4 Chapter 2

    The Advanced Advisor Options window opens.

    In the Advanced Advisor Options window, you can view or set additional

    metadata properties. When you select a property, the property description appears

    in the bottom half of the window.

    Notice that the threshold value for class variables is 20 levels. You will see theeffects of this setting when you view the Column Metadata window in the next

    step. Click OK to use the defaults for this example.

    2 Click Next in the Data Source Wizard Metadata Advisor Options window to

    generate the metadata for the table. The Data Source Wizard Column Metadata

    window opens.

    Note: In the Column Metadata window, you can view and, if necessary, adjust the

    metadata that has been defined for the variables in your SAS table. Scroll through

    the table and examine the metadata. In this window, columns that have a white

    background are editable, and columns that have a gray background are not

    editable. 4

    3 Select the Names column header to sort the variables alphabetically.

    Note that the roles for the variables CLUSTER_CODE andCONTROL_NUMBER are set to Rejected because the variables exceed the

    maximum class count threshold of 20. This is a direct result of the threshold

    values that were set in the Data Source Wizard Metadata Advisory Options

    window in the previous step. To see all of the levels of data, select the columns of

    interest and then click Explore in the upper right-hand corner of the window.

    4 Redefine these variable roles and measurement levels:

    3 Set the role for the CONTROL_NUMBER variable to ID.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    40/185

    Setting Up Your Project 4 Configure the Metadata 35

    3

    Set these variables to the Interval measurement level:

    3 CARD_PROM_12

    3 INCOME_GROUP

    3 RECENT_CARD_RESPONSE_COUNT

    3 RECENT_RESPONSE_COUNT

    3 WEALTH_RATING

    5 Set the role for the variable TARGET_D to Rejected, since you will not model thisvariable. Note that Enterprise Miner correctly identified TARGET_D and

    TARGET_B as targets since they start with the prefix TARGET.

    6 Select the TARGET_B variable and click Explore to view the distribution of

    TARGET_B. As an exercise, select additional variables and explore their

    distributions.

    7 In the Sample Properties window, set Fetch Size to Max and then click Apply .

    8 Select the bar that corresponds to donors (TARGET_B = 1) on the TARGET_B

    histogram and note that the donors are highlighted in theDONOR.DONOR_RAW_DATA table.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    41/185

    36 Configure the Metadata 4 Chapter 2

    9 Close the Explore window.10 Sort the Metadata table by Level and check your customized metadata

    assignments.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    42/185

    Setting Up Your Project 4 Configure the Metadata 37

    11 Select the Report column and select Yes for URBANICITY and DONOR_AGE todefine them as report variables. These variables will be used as additional

    profiling variables in results such as assessment tables and cluster profiles plots.

    12 Click Next to open the Data Source Wizard Decision Configuration window.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    43/185

    38 Define Prior Probabilities and a Profit Matrix4 Chapter 2

    To end this task, select Yes and click Next in order to open the DecisionConfiguration window.

    Define Prior Probabilities and a Profit Matrix

    The Data Source Wizard Decision Configuration window enables you to define a

    target profile that produces optimal decisions from a model. You can specify target

    profile information such as the profit or loss of each possible decision, priorprobabilities, and cost functions. In order to create a target profile in the Decision

    Configuration window, you must have a variable that has a role of Target in your datasource. You cannot define decisions for an interval level target variable.

    In this task, you specify whether to implement decision processing when you build

    your models.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    44/185

    Setting Up Your Project 4 Define Prior Probabilities and a Profit Matrix 39

    1 Select the Prior Probabilities tab. Click Yes to reveal the Adjusted Prior

    column and enter the following adjusted probabilities, which are representative ofthe underlying population of donors.

    3 Level 1 = 0.05

    3 Level 0 = 0.95

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    45/185

    40 Define Prior Probabilities and a Profit Matrix4 Chapter 2

    2 Select the Decision Weights tab and specify the following weight values:

    Table 2.1 Weight Values or Profit Matrix

    Level Decision 1 Decision 2

    1 14.5 0

    0 -0.5 0

    A profit value of $14.50 is obtained after accounting for a 50cent mailing cost.The focus of this example will be to develop models that maximize profit.

    3 Click Next to open the Data Source Attributes window. In this window, you can

    specify a name, role, and segment for your data source.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    46/185

    Setting Up Your Project 4 Define Prior Probabilities and a Profit Matrix 41

    4Click Finish to add the donor table to the Data Sources folder of the ProjectNavigator.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    47/185

    42 Optional Steps 4 Chapter 2

    Optional Steps

    3 The data source can be used in other diagrams. Expand the Data Sources folder.

    Select the DONOR_RAW_DATA data source and notice that the Property panel

    now shows properties for this data source.

    i { E n t e r r i : s : e Miner - Getting Started Charitable G i n g Exa

    DONOR_RAW_DATA$ Diagrams

    Model Packages

    11D'ata Source identifier. The metadata tables are stored inSAS library, and use this identifier as its LIBREF.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    48/185

    Setting Up Your Project 4 Create a Diagram 43

    Create a Diagram

    Now that you have created a project and defined your data source, you are ready to

    begin building your process flow diagram. This task creates a new process flow diagram

    in your project.

    1 Right-click the Diagrams folder of the Project Navigator and select CreateDiagram.

    Alternatively, you can select File I New Diagram from the main menu, or you

    can click Create Diagram in the toolbar. The Create New Diagram window opens.2 Enter Donations in the Diagram Name box and click OK . The empty Donations

    diagram opens in the Diagram Workspace area.

    3 Click the diagram icon next to your newly created diagram and notice that the

    Properties panel now shows properties for the diagram.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    49/185

    44 Other Useful Tasks and Tips 4 Chapter 2

    Other Useful Tasks and Tips

    3 Explore the node tools that are organized by the SEMMA process on the toolbar.

    When you move your mouse pointer over a toolbar icon, a tooltip displays the

    name of each node tool.

    3 Explore the Toolbar Shortcut buttons that are located to the right of the node tool

    icons.

    3 Note that the Properties panel displays the properties that are associated with theproject that you just created.

    3 From the main menu, select Help I Contents or, alternatively, press the F1 key.

    Browse the Help topics.

    3 To specify model results package options or to customize the appearance of your

    Enterprise Miner GUI, select Options I Preferences from the main menu.

    3 You can also use the View menu items to open the Program Editor, Log, Output,Explorer, and Graph windows.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    50/185

    45

    C H A P T E R

    3Working with Nodes ThatSample, Explore, and Modify

    Overview of This Group of Tasks 45

    Identify Input Data 45

    Generate Descriptive Statistics 46

    Create Exploratory Plots 51

    Partition the Raw Data 54

    Replace Missing Data 55

    Overview of This Group of Tasks

    These tasks develop the process flow diagram that you created in Create a Diagram.The Input Data node is typically the first node that you use when you create a process

    flow diagram. The node represents the data source that you choose for your data mining

    analysis and provides metadata about the variables. The other nodes that you use in

    this chapter show you some typical techniques of exploring and modifying your data.

    Identify Input Data

    In this task, you add an Input Data node to your process flow diagram.

    1 Select the DONOR_RAW_DATA data source from the Data Sources list in the

    Project panel and drag the DONOR_RAW_DATA data source into the Diagram

    Workspace.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    51/185

    46 Generate Descriptive Statistics 4 Chapter 3

    Note: Although this task develops one process flow diagram, Enterprise Miner

    enables you to open multiple diagrams at one time. You can also disconnect from andreconnect to a diagram if you have also configured the Enterprise Miner application

    server. Other users can also access the same project. However, only one user can open a

    diagram at a time. 4

    Generate Descriptive Statistics

    As you begin a project, you should consider creating summary statistics for each of

    the variables, including their relationship with the target, using tools like the

    StatExplore node.

    In this task, you add a StatExplore node to your diagram.

    1 Select the Explore tab on the toolbar at the top left and select the StatExplore

    node. Drag this node into the Diagram Workspace. Alternatively, you can alsoright-click the Diagram Workspace and use the pop-up menus to add nodes to the

    workspace.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    52/185

    Working with Nodes That Sample, Explore, and Modify 4 Generate Descriptive Statistics 47

    2 Connect the DONOR_RAW_DATA Data Source node to the StatExplore node.

    3 Select the StatExplore node to view its properties. Details about the node appearin the Properties panel. By default, the StatExplore node creates Chi-Square

    statistics and correlation statistics.

    Note: An alternate way to see all of the properties for a node is to double-click

    the node in the toolbar above the diagram. 4

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    53/185

    48 Generate Descriptive Statistics 4 Chapter 3

    4 To create Chi-Square statistics for the binned interval variables in addition to theclass variables, set the Interval Variables property to Yes.

    I n t e v a l VariablesGenerates Chi-Square statistics for interval

    l wriabl.es by binning the wriables.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    54/185

    Working with Nodes That Sample, Explore, and Modify 4 Generate Descriptive Statistics 49

    5 Right-click the StatExplore node and select Run. A Confirmation window appears.Click Yes . A green border appears around each successive node in the diagram as

    Enterprise Miner runs the path to the StatExplore node.

    Note: An alternate way to run a node is to select the Run icon from the Toolbar

    Shortcut Buttons. Doing so runs the path from the Input Data node to the selectednode on the diagram.

    If there are any errors in the path that you ran, the border around the node

    that contains the error will be red rather than green, and an Error window will

    appear. The Error window tells you that the run has failed and provides

    information about what is wrong. 4

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    55/185

    50 Generate Descriptive Statistics 4 Chapter 3

    6 A Run Status window opens when the path has run. Click Results . The Resultswindow opens.

    The Chi-Square plot highlights inputs that are associated with the target. Many of

    the binned continuous inputs have the largest Cramers V values. The Pearsons

    correlation coefficients are displayed if the target is a continuous variable.

    Note: An alternate way to view results is to select the Results icon from the

    Toolbar Shortcut Buttons. 4

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    56/185

    Working with Nodes That Sample, Explore, and Modify 4 Create Exploratory Plots 51

    7 Maximize the Output window. The Output window provides distribution andsummary statistics for the class and interval inputs, including summaries that are

    relative to the target.

    8 Scroll down to the Interval Variables Summary Statistics section. The

    Non-Missing column lists the number of observations that have valid values foreach interval variable. The Missing column lists the number of observations that

    have missing values for each interval variable.

    Several variables such as DONOR_AGE, INCOME_GROUP,WEALTH_RATING, and MONTHS_SINCE_LAST_PROM_RESP have missing

    values. The entire customer case is excluded from a regression or neural network

    analysis when a variable attribute about a customer is missing. Later, you will

    impute some of these variables using the Replacement node.

    Notice that many variables have very large standard deviations. You should

    plot these variables in order to decide whether transformations are warranted.

    9 Close the Results window.

    Note: If you make changes to any of the nodes in your process flow diagramafter you have run a path, you need to rerun the path in order for the changes to

    affect later nodes. 4

    Create Exploratory Plots

    Enterprise Miner enables you to generate numerous data visualization graphics inorder to reveal extreme values in the data and to discover patterns and trends. You use

    the MultiPlot node to visualize your data from a wide range of perspectives. With

    MultiPlot you can graphically explore large volumes of data, observe data distributions,

    and examine relationships among the variables. The MultiPlot node uses all of the

    observations for plotting.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    57/185

    52 Create Exploratory Plots 4 Chapter 3

    In this task, you add a MultiPlot node to your diagram.

    1 Select the Explore tab from the node toolbar and drag a MultiPlot node into the

    Diagram Workspace. Connect the StatExplore node to the MultiPlot node.

    2 Select the MultiPlot node in the Diagram Workspace. In the Properties panel, set

    the Type of Charts property to Both in order to generate both scatter and bar

    charts.

    3 In the Diagram Workspace, right-click the MultiPlot node, and select Run.

    4 After the run is complete, select Results from the Run Status window.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    58/185

    Working with Nodes That Sample, Explore, and Modify 4 Create Exploratory Plots 53

    5 In the Results window, maximize the Train Graphs window.

    Click First , Previous , or Next at the bottom of the window to scroll through the

    graphs. You can also view a specific graph by selecting the variable on theselection box to the right of Last .

    You will notice several results in the graphs.

    3 One value for the variable DONOR_GENDER is incorrectly recorded as an A.

    3 There are several heavily skewed variables, such as FILE_AVG_GIFT,

    LAST_GIFT_AMT, LIFETIME_AVG_GIFT_AMT,

    LIFETIME_GIFT_AMOUNT, MOR_HIT_RATE, PCT_ATTRIBUTE1, and

    PCT_OWNER_OCCUPIED. You might want to consider a log transformationlater.

    3 Increasing values of LIFTIME_CARD_PROM, RECENT_RESPONSE_PROP,

    LIFETIME_GIFT_AMOUNT, LIFETIME_GIFT_COUNT ,

    MEDIAN_HOME_VALUE, MEDIAN_HOUSEHOLD_INCOME,

    PER_CAPITA_INCOME, and RECENT_STAR_STATUS tend to be moreassociated with donors and are also heavily skewed. You might want toconsider a bucket transformation that will be relative to the relationship with

    target.

    3 Other variables, such as MONTHS_SINCE_LAST_PROM_RESP and

    NUMBER_PROM_12, show some good separation of the target values at both

    tails of the distribution.

    6 Close the Results window.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    59/185

    54 Partition the Raw Data4 Chapter 3

    Partition the Raw Data

    In data mining, one strategy for assessing model generalization is to partition the

    data source. A portion of the data, called the training data, is used for preliminary

    model fitting. The rest is reserved for empirical validation. The hold-out sample itself is

    often split into two parts: validation data and test data. The validation data is used to

    prevent a modeling node from over-fitting the training data (model fine-tuning), and tocompare prediction models. The test data set is used for a final assessment of the

    chosen model.

    Enterprise Miner can partition your data in several ways. Choose one of the

    following methods.

    3 By default, Enterprise Miner uses either simple random sampling or stratified

    sampling, depending on your target. If your target is a class variable, then SASEnterprise Miner stratifies the sample on the class target. Otherwise, simple

    random sampling is used.

    3 If you specify simple random sampling, every observation in the data set has the

    same probability of being included in the sample.

    3

    If you specify simple cluster sampling, SAS Enterprise Miner samples from acluster of observations that are similar in some way.

    3 If you specify stratified sampling, you identify variables in your data set to form

    strata of the total population. SAS Enterprise Miner samples from each stratum

    so that the strata proportions of the total population are preserved in each sample.

    In this task, you use the Data Partition node to partition your data.

    1 Select the Sample tab from the node toolbar at the top left of the application. Drag

    a Data Partition node from the toolbar into the Diagram Workspace.

    2 Connect the DONOR_RAW_DATA Data Source node to the Data Partition node.

    3 Select the Data Partition node in the Diagram Workspace. Details about data

    partitioning appear in the Properties panel.

    Note: If the target variable is a class variable, the default partitioning methodthat Enterprise Miner uses is stratification. Otherwise, the default partitioningmethod is simple random.

    4

    4 In the Properties panel under the Data Set Percentages section, set the following

    values:

    3 set Training to 55

    3 set Validation to 45

    3 set Test to 0

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    60/185

    Working with Nodes That Sample, Explore, and Modify 4 Replace Missing Data 55

    In the Data Set Percentages section of the Properties panel, the values for the

    Training, Validation, and Test properties specify how you want to

    proportionally allocate the original data set into the three partitions. You canallocate the percentages for each partition by using any real number between 0

    and 100, as long as the sum of the three partitions equals 100.

    Note: By default, the Data Partition node partitions the data by stratifying onthe target variable. This is a good idea in this case, because there are few donors

    relative to non-donors. 4

    5 Run the Data Partition node.

    Replace Missing Data

    You use the Replacement node to generate score code to process unknown variable

    levels when you are scoring data, and to interactively specify replacement values for

    class levels.In this task, you add and configure a Replacement node in your process flow diagram.

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    61/185

    56 Replace Missing Data 4 Chapter 3

    1 From the Modify tab of the node toolbar, drag a Replacement node into theDiagram Workspace and connect it to the Data Partition node.

    2 Select the Data Partition node. On the Properties panel, select the ellipsis button

    to the right of the Variables property to explore any of the variables in the input

    data set. The Variables window opens.

    3 In the Variables window, sort by level and then select the variables SES and

    URBANICITY, and then click Explore . The Explore window opens.

    Note: If Explore is dimmed and unavailable, right-click the Data Partition

    node and select Run. 4

  • 8/3/2019 Sas Getting Started With Sas Enterprise Miner 5 3 9(2008)

    62/185

    Working with Nodes That Sample, Explore, and Modify 4 Replace Missing Data 57

    4 In the Explore window, notice that both the SES and U