xAI WORKBENCH TRAINING

Page 1: xAI WORKBENCH TRAINING

xAI WORKBENCH TRAINING

Page 2: xAI WORKBENCH TRAINING

Agenda | Part 1

• Introduction to xAI Workbench

• simClassify+

• Data Upload

• Data Type Specifications

• Model Tuning - Hyperparameter Selection

• Auto Tune

• Exhaustive Grid Search

• Thresholding

• Domain Property

• Weighted Recall

• Classification Analysis Reports

Confidential ∙Copyright ©2021 ∙ InRule Technology, Inc. ∙All rights reserved.

Page 3: xAI WORKBENCH TRAINING

Agenda | Part 2

• Clustering

• simCluster+

• Cluster Visualizations

• Cluster/Segment Statistical Analysis

• Classification Operational Issues

• Dataset Update & Merge

• Update Instance

• Copy Instance

• Monitoring

• Sample High-Capacity Production Setup

• Applications with the API

• simClassify

• simCluster


Page 4: xAI WORKBENCH TRAINING

PART 1

Page 5: xAI WORKBENCH TRAINING

Our Expertise: Dynamic Predictive Segmentation (DPS)

Unlike static, traditional ML segmentation, our dynamic solution can change as often as your customers change their behavior.

Our predictive technology clusters and segments based on an action, outcome or other meaningful business objective.

These outcomes drive a segmentation schema comprised of intent and action – not just descriptive statistics.

Page 6: xAI WORKBENCH TRAINING

Traditional machine learning is a black box.

[Figure: TARGET, Model, PREDICTED CLASS: WILL BUY, 97% CONFIDENCE]

Page 7: xAI WORKBENCH TRAINING

We Open the Box – to Deliver Smarter Anything

Our predictions with the WHY® make it easy for analytics and business teams to apply machine learning quickly and effectively with an easy-to-use workbench, explainable outputs, automation & RESTful APIs.

With user-controlled granularity and feature selection, our single-pass prediction and clustering deliver high precision models with dynamically weighted attributes by segment for ultimate transparency and explainability.

[Figure: TARGET with PREDICTED CLASS: WILL BUY, 97% CONFIDENCE. PREDICTIVE SEGMENTS / DIFFERENTIATING ATTRIBUTES: OUTDOOR ENTHUSIAST, MARRIED, HOUSEHOLD CHILDREN, PET OWNER, SENTIMENT INDEX]

Nearest neighbors in the database inform each prediction with rich insights. Contextual intelligence enables a more relevant decision on the offer or recommendation to present.

Page 8: xAI WORKBENCH TRAINING

HIGH LEVEL OVERVIEW

Page 9: xAI WORKBENCH TRAINING

Host Architecture

[Diagram: host architecture, with the following labeled components]

• Data: File Management, Indexed sM database, Fast sM access. Very fast random data retrieval.

• Engine: Similarity Search (simSearch), Classification (simClassify & simClassify+), Clustering (simCluster & simCluster+), Collaborative Recommendation (simRecommend). Optimized engine with incredibly fast speeds.

• Fold & Grid Cross Validation: Automated cross validation and hyperparameter optimization.

• UI: Specification/Validation Forms, Specification Analyzer, Results Visualization. Easy to use, simplified data identification. Classification and clustering exploration.

• Results

Page 10: xAI WORKBENCH TRAINING

SIMCLASSIFY+

Page 11: xAI WORKBENCH TRAINING

simClassify+

• simClassify+ is a learned relevancy function based on proprietary ML techniques.

• Can get improved classification accuracy (over simClassify) in some circumstances.

• By using this learning approach, we are able to match or outperform other ML techniques, while still providing transparency at the local, prediction level.

Page 12: xAI WORKBENCH TRAINING

Architecture - simClassify+

[Diagram: repeated stacks of layers labeled Indexed sM database, Fast sM access, Similarity Search (simSearch), Classification (simClassify), and Fold & Grid Validation]

Page 13: xAI WORKBENCH TRAINING

Parameters - simClassify+

[Diagram: the architecture stack (Indexed sM database, Fast sM access, Similarity Search (simSearch), Classification (simClassify), Fold & Grid Validation) annotated with the parameters Metric, K Nearest Neighbor, K, Threshold, and Class Weighting]

Page 14: xAI WORKBENCH TRAINING

Parameters - simClassify+

[Diagram: the same architecture stack with Domain Optimization added, annotated with the parameters Metric, K Nearest Neighbor, K, Threshold, Class Weighting, Domain (Feature), Domain Importance, and Weighted Recall (Feature)]

Page 15: xAI WORKBENCH TRAINING

Parameters - simClassify+

[Diagram: the same architecture stack with Metric Learner and Domain Optimization added, annotated with the parameters Metric, K Nearest Neighbor, K, Iterations, Threshold, Learning Rate, Feature Subsampling, Feature Focus, Class Weighting, Domain (Feature), Domain Importance, and Weighted Recall (Feature)]

Page 16: xAI WORKBENCH TRAINING

simClassify+ Parameters

• In addition to the typical simClassify settings, simClassify+ has parameters for metric learning. We highly recommend tuning parameters through auto tune or exhaustive grid search.

Parameter - Description

• Iterations - Number of iterations of the learning algorithm.

• Learning Rate - Step size of the learning algorithm. Small values can lead to longer runtime. Large values can lead to overfitting.

• Feature Subsampling - Ratio of randomly subsampled features at each iteration of the metric learning algorithm. Randomization provides diversity in the preparation of the similarity criteria.

• Feature Focus - Maximum number of dynamically selected features at any given time. This works like a localized feature selection process.

• Class Weighting - UNIFORM or NORMALIZED. Uniform gives the same weight to all classes. Normalization takes into account class imbalance.

Page 17: xAI WORKBENCH TRAINING

Feature Subsampling and Focus Drill Down

For this example, let’s assume the following:

• Feature set of training data: X = {x1, x2, ..., x10}

• Feature Subsampling = 0.5

• Feature Focus = 2

• Iterations = 3

• Learning Rate = r

At iteration 0, dX = 0.

• Iteration 1 - Feature Subsampling: x1, x2, x6, x8, x9. Feature Focus: x1:x2, x8:x9, giving d’1. Learning Rate: dX = dX + (r * d’1)

• Iteration 2 - Feature Subsampling: x3, x4, x7, x9, x10. Feature Focus: x3:x4, x9:x10, giving d’2. Learning Rate: dX = dX + (r * d’2)

• Iteration 3 - Feature Subsampling: x1, x2, x3, x4, x8. Feature Focus: x1:x2, x4:x8, giving d’3. Learning Rate: dX = dX + (r * d’3)

The output is dX, a measure of the weighted relevancy of features for predicting the result.
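The accumulation in this drill down can be sketched in Python. This is only an illustration of the dX = dX + (r * d’) update: the metric learner itself is proprietary, so the `score` function, the helper name `accumulate_relevancy`, and the simplification of Feature Focus to "keep at most N of the subsampled features" are all assumptions, and the x1:x2 pair notation from the slide is not modeled.

```python
import random


def accumulate_relevancy(features, subsample_ratio, focus, iterations, rate, score, seed=0):
    """Accumulate dX = dX + rate * d' over several iterations (illustrative only)."""
    rng = random.Random(seed)
    d_x = {f: 0.0 for f in features}  # iteration 0: dX = 0
    n_sub = max(1, round(subsample_ratio * len(features)))
    for _ in range(iterations):
        # Feature Subsampling: draw a random subset of the features.
        subsample = rng.sample(features, n_sub)
        # Feature Focus (simplified assumption): keep at most `focus` features.
        focused = sorted(subsample, key=score, reverse=True)[:focus]
        # Hypothetical per-iteration update d'; the real computation is not public.
        d_prime = {f: score(f) for f in focused}
        for f, value in d_prime.items():
            d_x[f] += rate * value  # Learning Rate step
    return d_x


features = [f"x{i}" for i in range(1, 11)]
d_x = accumulate_relevancy(features, 0.5, 2, 3, 0.1, score=lambda f: 1.0)
print(d_x)
```

With a constant score, each iteration adds r to two features, so the total accumulated weight after three iterations is 3 * 2 * r.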

Page 18: xAI WORKBENCH TRAINING

Using simClassify+

• Choose training data folder and spec file

• Next, assign hyper-parameters

• We strongly suggest using the Auto-Tune feature to find optimal hyper-parameters

• If using an exhaustive grid to tune hyper-parameters, identify the best performing model in the grid based on selected metrics (or tune further if necessary) and create the model.

Page 19: xAI WORKBENCH TRAINING

simClassify+ Results

• simClassify+ returns a prediction with its probability score and the weighted factors behind the prediction.

Special Values

• +/- represents whether that feature/value pair is in the query object (+) or not (-).

• <bias> represents how the class distribution in the underlying data set affects the model’s classification. This should be more prevalent in imbalanced data sets.

• SNLL represents a null value

• SRAR represents a rare value in nominal columns


Page 20: xAI WORKBENCH TRAINING

DATA UPLOAD

Page 21: xAI WORKBENCH TRAINING

Data Upload - GUI

The user can upload data either directly from the GUI or from the command line using a curl command.

Page 22: xAI WORKBENCH TRAINING

Data Upload - Command Line

• The curl command to load data from the command line has the following structure:

curl -u <user>:<password> -v -F folderName=<GUI_Folder_Name> -F fileSize=<File_Size> -F fileName=<File_Name> -F fileData=@<File_Path> <Cloud_API_Protocol>://<Cloud_API_IP>:<Cloud_API_Port>/cloud/uploadFile

• To get the fileSize value, run this command:

stat --printf="%s" <file-name>

• Example of data upload using model input data:

sudo curl -u user_1:password -v -F folderName=DataFolderName -F fileSize=123456 -F fileName=data.tsv -F fileData=@/path/to/data.tsv http://127.0.0.1:9090/cloud/uploadFile
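If the upload is scripted, the fileSize value can be computed instead of typed by hand. The sketch below only assembles the curl argument list shown above; the helper name `build_upload_command` is hypothetical, and the server address and credentials are the example values from this slide.

```python
import os
import tempfile


def build_upload_command(user, password, folder_name, file_path, base_url):
    """Build the curl argument list for /cloud/uploadFile (helper name is hypothetical)."""
    file_size = os.path.getsize(file_path)  # size in bytes, like stat --printf="%s"
    return [
        "curl", "-u", f"{user}:{password}", "-v",
        "-F", f"folderName={folder_name}",
        "-F", f"fileSize={file_size}",
        "-F", f"fileName={os.path.basename(file_path)}",
        "-F", f"fileData=@{file_path}",
        f"{base_url}/cloud/uploadFile",
    ]


# Demo with a throwaway file; replace with a real data file and server address.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as handle:
    handle.write(b"id\tvalue\n1\t2\n")
    demo_path = handle.name

command = build_upload_command(
    "user_1", "password", "DataFolderName", demo_path, "http://127.0.0.1:9090"
)
os.remove(demo_path)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where curl is installed and the data file still exists.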

Page 23: xAI WORKBENCH TRAINING

DATA TYPE SPECIFICATIONS

Page 24: xAI WORKBENCH TRAINING

Spec Types

• ID - Required field; unique identifier of each row

• Class - Specifies the target column in supervised learning.

• Real - Numerical values

• Nominal - Values that do not bear a quantitative relationship with each other (i.e. strings and numbers which represent non-numerical information).

• Ignore - This column will be ignored during model development

• Item_Set - A series of values with weights. Formatted as item1:weight1;item2:weight2;...;itemN:weightN.

• Multi_English - Freeform English text. Numbers and symbols can be included as well.

• Multi_Spanish - Freeform Spanish text. Numbers and symbols can be included as well.

• Multi_Japanese - Freeform Japanese text. Numbers and symbols can be included as well.

• Multi_Plain - Freeform non-language specific text. Numbers and symbols can be included as well.

• Null_Indicator - This spec type transforms the column into a binary indicator of whether each row in the selected column has data in it or not.
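As an illustration of the Item_Set format, the following sketch parses one cell into a dictionary. It assumes that weights are numeric and that item names contain no semicolons; the function name `parse_item_set` and the sample values are hypothetical.

```python
def parse_item_set(cell):
    """Parse 'item1:weight1;item2:weight2' into {item: weight}."""
    items = {}
    for pair in cell.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        # Split on the last colon so item names may themselves contain colons.
        item, _, weight = pair.rpartition(":")
        items[item] = float(weight)
    return items


basket = parse_item_set("shoes:0.7;hats:0.3")
print(basket)
```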


Page 25: xAI WORKBENCH TRAINING

Spec Generation

• Once the data is uploaded, you will have to “spec the data”, or choose the data types for each column in the dataset.

• There are multiple ways to do this within xAI Workbench:

• Manual Selection

• Automated Selection using the Spec Analyzer

• Import previously created spec files

• Navigate to the Specs page by selecting “Edit Specs for Folder” in your desired data folder

• A data file must be uploaded to the folder before the user can go to the Specs page.

Page 26: xAI WORKBENCH TRAINING

Spec Analyzer

• The Spec Analyzer will generate recommended data types for each column, along with providing some high level analysis for each column.

[Screenshot callouts: Name spec file; Choose model type; Analyze columns]

Page 27: xAI WORKBENCH TRAINING

Spec File Upload

• You can upload a previously created spec file from the “Upload Spec File” tab.

• The format of the spec file can be CSV, TSV, or JSON


Page 28: xAI WORKBENCH TRAINING

MODEL TUNING - HYPERPARAMETER SELECTION: AUTO TUNE

Page 29: xAI WORKBENCH TRAINING

Auto Tune: Setup

• Once your data is uploaded and specs are created, you are ready to begin model development.

• In Auto-Tune mode, the engine intelligently searches a large grid of experiments and only creates model experiments when the probability of successfully increasing the metric of interest is high.

• Current metrics available for Auto-Tune optimization:

• AUC - Area Under the ROC Curve (Binomial)

• Log-Loss (Binomial)

• MCC - Matthews Correlation Coefficient (Binomial)

• Accuracy (Multinomial)

• Recall (Multinomial)

• Precision (Multinomial)

• F1 Score (Multinomial)
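For reference, two of the binomial metrics listed above can be computed from their standard textbook definitions as follows. This is a sketch of the general formulas, not of how xAI Workbench computes them internally; the function names are illustrative.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


print(matthews_corrcoef(tp=5, tn=5, fp=0, fn=0))
print(log_loss([1, 0], [0.5, 0.5]))
```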


Page 30: xAI WORKBENCH TRAINING

Auto Tune: Experiment Types

• xAI Workbench offers two types of experiments when evaluating a model’s performance during grid experiments: N-Fold Experiments & Date Split Experiments

• N-Fold Experiments

• N-Fold cross validation is used to measure model performance

• Date Split Experiments

• A date column is used to split the training data into tuning, testing, and validation splits. This is useful for time-sensitive data sets.

Page 31: xAI WORKBENCH TRAINING

Auto Tune: Experiment Types - NFold

• Fold experiments are one way to measure results during grid execution.

• There are three steps.

• Step 1: You specify the number of folds and the fold seed.


Page 32: xAI WORKBENCH TRAINING

Auto Tune: Experiment Types - NFold

• Step 2: For each grid configuration, a fold experiment runs. The training data is split into the number of folds specified. In each iteration, one fold is left out of training for testing. The number of iterations is equal to the number of folds.

[Diagram: 3-Fold Experiment. Fold experiment parameters: # of Folds, Fold Seed. For each of Fold 1, Fold 2, and Fold 3, simClassify+ is trained (Train), tested (Test), and produces Results.]
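The fold procedure in Step 2 can be sketched as below. How the product actually assigns rows to folds is not documented here, so the seeded shuffle and the function name `n_fold_splits` are assumptions; the point is only that each fold is held out exactly once.

```python
import random


def n_fold_splits(row_ids, n_folds, fold_seed):
    """Return one (train, test) pair per fold, holding out each fold once."""
    ids = list(row_ids)
    random.Random(fold_seed).shuffle(ids)  # the seed makes the split repeatable
    folds = [ids[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test_fold in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        splits.append((train, test_fold))
    return splits


splits = n_fold_splits(range(9), n_folds=3, fold_seed=42)
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```

Every record appears in exactly one test fold, which is why the aggregate metrics cover each record once.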

Page 33: xAI WORKBENCH TRAINING

Auto Tune: Experiment Types - Date Split

• Select Column - date column in dataset to use for splitting

• Date Format - format of date field in selected column

• First Split Date - initial split date for testing

• Second Split Date - final split date for testing

[Diagram labels: Tuning - Train, Tuning - Test; Validation - Train, Validation - Test]

Page 34: xAI WORKBENCH TRAINING

Auto Tune: Experiment Types - Date Split

• The training data is split into three groups, in time order.
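A minimal sketch of a date split with two split dates follows. It assumes the date format is given in Python `strptime` notation and that each split date belongs to the later group; the product's own date-format syntax and boundary handling may differ, and the function name `date_split` is hypothetical.

```python
from datetime import datetime


def date_split(rows, column, date_format, first_split, second_split):
    """Split rows into three groups in time order using two split dates."""
    first = datetime.strptime(first_split, date_format)
    second = datetime.strptime(second_split, date_format)
    early, middle, late = [], [], []
    for row in rows:
        when = datetime.strptime(row[column], date_format)
        if when < first:
            early.append(row)
        elif when < second:
            middle.append(row)
        else:
            late.append(row)
    return early, middle, late


rows = [
    {"id": 1, "date": "2021-01-15"},
    {"id": 2, "date": "2021-02-15"},
    {"id": 3, "date": "2021-03-15"},
]
early, middle, late = date_split(rows, "date", "%Y-%m-%d", "2021-02-01", "2021-03-01")
print(len(early), len(middle), len(late))
```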


Page 35: xAI WORKBENCH TRAINING

Auto Tune: Experiment Types - NFold

• Step 3: Interpret Results - model metrics represent aggregate measures of each record when included in the “testing” fold during N-Fold experiment

Page 36: xAI WORKBENCH TRAINING

Auto Tune: Execution

• Once an auto-tune experiment has been executed, you can navigate to the Grid Results page to see the current best model.

• To create this model directly from the results page, select the blue “Create Model” button.

• To view all of the results from the underlying grid experiment, select the “All Results” radio button.


Page 37: xAI WORKBENCH TRAINING

Auto Tune: Validation

• For date split experiments, the user can validate multiple model configurations.

• This allows the user to test for overfitting and to see how well the model generalizes on data it hasn’t seen before.

• In the grid results table, send configurations to validation by selecting the respective “Validate” check box in the table. Once all desired configurations are chosen, hit the “Execute Validation” button on the bottom of the page.

Page 38: xAI WORKBENCH TRAINING

MODEL TUNING - HYPERPARAMETER SELECTION: EXHAUSTIVE GRID SEARCH

Page 39: xAI WORKBENCH TRAINING

Exhaustive Grid Search: Grid Creation

• If “Auto Tune” is turned off (as illustrated in the screenshot), an exhaustive grid search will be performed to test hyper parameter configurations.

• You can edit the parameters to test by selecting the “Edit Initial Parameters” slider.

• This particular grid in the screenshot will test eight distinct hyper parameter configurations
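An exhaustive grid is the Cartesian product of the candidate values for each parameter. As an illustration of how eight configurations can arise, the sketch below uses three parameters with two values each; the parameter values are hypothetical and are not the ones in the screenshot.

```python
from itertools import product

# Hypothetical candidate values: 2 x 2 x 2 = 8 configurations.
grid = {
    "k": [5, 10],
    "learning_rate": [0.1, 0.5],
    "feature_focus": [2, 4],
}

configurations = [dict(zip(grid, values)) for values in product(*grid.values())]

for config in configurations:
    print(config)
```

Adding one more value to any parameter multiplies the number of experiments, which is why exhaustive grids grow quickly.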

Page 40: xAI WORKBENCH TRAINING

Exhaustive Grid Search: Grid Results

• Each combination of parameters used in grid experiments can be created into a model from the Grid Results table.

Page 41: xAI WORKBENCH TRAINING

Exhaustive Grid Search: Validation

• Just like in “Auto-Tune” mode, the user can send model configurations to validation for date split experiments.

• This allows the user to test for overfitting, and to see how well the model generalizes on data it hasn’t seen before.

• In the grid results table, send configurations to validation by selecting the respective “Validate” check box in the table. Once all desired configurations are chosen, hit the “Execute Validation” button on the bottom of the page.

Page 42: xAI WORKBENCH TRAINING

THRESHOLDING

Page 43: xAI WORKBENCH TRAINING

Thresholding

[Figure: two axes showing Probability of Class 1 from 0 to 1, each divided into Class 0 and Class 1 by a threshold. Balanced: Threshold = 0.5. Unbalanced: Threshold = 0.2.]

• Thresholding is a feature used in simClassify and simClassify+

• It acts as a limit to split resulting predictions into a true or false category

• Thresholding only applies to:

• Binary classification

• The positive class of a prediction
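Applied to a positive-class probability, a threshold works as in the sketch below. Whether a probability exactly equal to the threshold counts as the positive class is an assumption here; the function name and sample probabilities are illustrative.

```python
def apply_threshold(prob_class_1, threshold=0.5):
    """Return 1 when the positive-class probability reaches the threshold, else 0."""
    return 1 if prob_class_1 >= threshold else 0


probabilities = [0.10, 0.25, 0.45, 0.70]
balanced = [apply_threshold(p, 0.5) for p in probabilities]
unbalanced = [apply_threshold(p, 0.2) for p in probabilities]
print(balanced)
print(unbalanced)
```

Lowering the threshold from 0.5 to 0.2 moves two of the four sample predictions into the positive class.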


Page 44: xAI WORKBENCH TRAINING

Thresholding

• Perfection

[Figure: points labeled N and F separated by a threshold line]

Page 45: xAI WORKBENCH TRAINING

Thresholding

• Reality

[Figure: points labeled N and F on both sides of a threshold line, with regions labeled True Negatives, False Negatives, False Positives, and True Positives]

Page 46: xAI WORKBENCH TRAINING

Precision

• Quality

[Figure: points labeled N and F on both sides of a threshold line, with regions labeled True Negatives, False Negatives, False Positives, and True Positives]

Page 47: xAI WORKBENCH TRAINING

Recall

• Quantity

[Figure: points labeled N and F on both sides of a threshold line, with regions labeled True Negatives, False Negatives, False Positives, and True Positives]

Page 48: xAI WORKBENCH TRAINING

Precision / Recall Tradeoff

• Quantity

[Figure: points labeled N and F on both sides of a threshold line, with regions labeled True Negatives, False Negatives, False Positives, and True Positives]

Page 49: xAI WORKBENCH TRAINING

F1 Score

• Precision

• Quality of results: How exact were they?

• Recall

• Quantity of results: How complete were they?

• Precision increases at the expense of recall and vice versa.

• F1 Score

• Balance (harmonic mean) of precision and recall
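These three metrics follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the counts in the example are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # how exact the positives were
    recall = tp / (tp + fn) if tp + fn else 0.0     # how complete the positives were
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```

Here precision is 0.8 and recall is 0.5, and the harmonic mean (8/13) sits closer to the lower of the two than an arithmetic mean would.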


Page 50: xAI WORKBENCH TRAINING

AUC

• Area Under the Receiver Operating Characteristic (ROC) Curve

• Walks through a range of thresholds and plots the Recall and FPR for each.

• Line of equality is a random model.

[Figure: ROC Curve. x-axis: False Positive Rate; y-axis: Recall. Curves: Random Model, Good Model, Better Model.]
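The threshold walk can be sketched as follows: each distinct score is used as a threshold, the (FPR, Recall) point is recorded, and the area is computed with the trapezoid rule. This is a generic illustration that assumes both classes are present in the labels; it is not the product's implementation.

```python
def roc_points(scores, labels):
    """Return (FPR, Recall) points for thresholds from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


perfect = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(perfect, mixed)
```

A model that ranks every positive above every negative reaches an area of 1.0, while a random model tends toward 0.5, the area under the line of equality.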

Page 51: xAI WORKBENCH TRAINING

DOMAIN PROPERTY

Page 52: xAI WORKBENCH TRAINING

Domain Property

• Configuration set in model training

• Only available for binomial classifiers

• User selects a “real” column from training data. Model will place more importance on this column during training. E.g. transaction dollars, person hours, etc.

Domain Column

• “Real” column to place importance on during model training.

Domain Importance Function

• Global - emphasis on both classes

• Conservative - emphasis on positive class, conservative approach

• Aggressive - emphasis on positive class, aggressive approach


Page 53: xAI WORKBENCH TRAINING

WEIGHTED RECALL

Page 54: xAI WORKBENCH TRAINING

Weighted Recall

Weighted Recall Formula (Positive Class):

Rw = ∑ Selected Metric (True Positive) / ∑ Selected Metric (Condition Positive)

• Metric used in grid results analysis

• User selects a “real” column from training data to evaluate recall from. The underlying logic is the same as standard recall, but you are evaluating the percentage of the chosen metric caught for each class (e.g. % of fraud dollars caught).

• Weighted recall metric will appear in the grid results table, just like other metrics

• Metric name in grid results table

• weighted_recall_*POS_CLASS*

• weighted_recall_*NEG_CLASS*
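As a worked illustration of the formula, the sketch below sums the selected "real" column over the correctly predicted rows of a class and divides by the sum over all rows that truly belong to that class. The function name and the dollar amounts are hypothetical.

```python
def weighted_recall(actual, predicted, weights, target_class):
    """Share of the selected metric caught for `target_class`."""
    caught = sum(
        w for a, p, w in zip(actual, predicted, weights)
        if a == target_class and p == target_class
    )
    total = sum(w for a, w in zip(actual, weights) if a == target_class)
    return caught / total if total else 0.0


actual = [1, 1, 1, 0]
predicted = [1, 0, 1, 0]
dollars = [100.0, 50.0, 50.0, 10.0]
fraud_dollars_caught = weighted_recall(actual, predicted, dollars, target_class=1)
print(fraud_dollars_caught)
```

Standard recall for the positive class in this example is 2/3, but 75% of the dollars are caught because the missed case is a small one.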


Page 55: xAI WORKBENCH TRAINING

CLASSIFICATION ANALYSIS REPORTS

Page 56: xAI WORKBENCH TRAINING

Global Feature Analysis

• This report can be accessed through the “Model Analysis” link on the “Model Actions” page

• This report will show you the feature importance at a global level for classification models that you have created.

Page 57: xAI WORKBENCH TRAINING

Batch Query Analysis

[Screenshot with numbered steps 1 to 3 and the labels: Model Menu, Batch Query, Make Batch Query, Model Analysis, Batch Query Analytics, + Report]

• This report can be accessed through the “Model Analysis” link on the “Model Actions” page

• This report will show you the cumulative results, by classification rate, for the results of a batch query file. Typically, this might be done with a separate hold out file, or periodically with a sample from production.

• In this report, the confidence level is multiplied by 100 to get a score between 0 and 100.


Page 58: xAI WORKBENCH TRAINING

Batch Query Analytics


Page 59: xAI WORKBENCH TRAINING

PART 2

Page 60: xAI WORKBENCH TRAINING

CLUSTERING

Page 61: xAI WORKBENCH TRAINING

Unsupervised Clustering

• Unsupervised Clustering identifies groups of similar data objects based on the frequency of shared variables between objects.

[Figure caption: Clustered all feature similarity]

Page 62: xAI WORKBENCH TRAINING

Supervised Clustering

• Supervised Clustering uses xAI Workbench’s simClassify+ to identify groupings of the differences in features between classes.

[Figure: Class 1 and Class 2 points grouped into Cluster 1 and Cluster 2. Caption: Clustered by differentiating features]

Page 63: xAI WORKBENCH TRAINING

Supervised Clustering (cont.)

• If the classes had been different, the clustering would have been different.

[Figure: Class 1 and Class 2 points grouped into Cluster 1 and Cluster 2. Caption: Clustered by differentiating features]

Page 64: xAI WORKBENCH TRAINING

Cluster Visualization

[Screenshot callouts:]

• Class Variable

• Why Factors - most predictive feature-value pairs driving the clusters

• Cluster Label

• Cluster Level Details

• Download Cluster

• View single cluster, or compare two clusters

• Cluster Size

Page 65: xAI WORKBENCH TRAINING

SIMCLUSTER+

Page 66: xAI WORKBENCH TRAINING

simCluster+: Overview

• simCluster+ is our K-Means clustering algorithm for supervised clustering. It can also be used in unsupervised clustering, where Euclidean, Manhattan, and our One Class distance functions are available.

• simCluster+ uses a technique that can create K-Means clustering or K-Spilling clusters, unlike simCluster which uses agglomerative clusters.

• In K-Spilling clusters, K clusters that are more tightly formed around their “mean” centroid will be returned, but if a datapoint is not close enough to the centroids, it will “spill” into secondary clusters.

• The Range Percentile parameter determines if K-Means or K-Spilling will be used.

• If Range Percentile is 1.0, then all data points will be clustered into K clusters. If the Range Percentile is less than 1.0, then dense clusters will be produced and data points that are not within the specified density will spill into clusters beyond K.

• A visual walkthrough of the K-Spilling methodology is included in your user manual.


Page 67: xAI WORKBENCH TRAINING

simCluster+: Parameters

Both Supervised and Unsupervised

• K - Minimum number of clusters. If Distance Percentile is 1.0, K clusters will be generated. If it is less than 1.0, at least K clusters will be generated.

• Feature Focus - The maximum number of dynamically selected features used in any iteration of the clustering algorithm. This works like a localized feature selection.

• Distance Percentile - The maximum distance between the cluster center and a given element of the cluster. 1.0 will produce a K-Means clustering. Values less than 1.0 will create “tighter clusters,” but elements not in the range will “spill” into new clusters. (Range greater than 0 and less than or equal to 1.0.)

• Iterations - Number of iterations of the learning algorithm.

Unsupervised Only

• Max Samples - The maximum number of data points used in each iteration.

Supervised Only

• Learning Rate - Step size of metric learning algorithm. Small values can lead to longer run times and large values can lead to overfitting.

• Feature Subsampling - Ratio of randomly subsampled features in each iteration of the metric learning algorithm. Randomization provides diversity in the resulting similarity metric.

• Class Weighting - UNIFORM gives the same weight to all classes. NORMALIZED takes into account class imbalance.


Page 68: xAI WORKBENCH TRAINING

simCluster+: Distance Percentile

• Distance Percentile sets the maximum distance between an object and the center of the cluster to which it belongs.

• Distance Percentile is a float value between 0 (exclusive) and 1 (inclusive).

• If some elements are not assigned to any cluster due to exceeding the Distance Percentile value, they will be spilled over to the next iteration of cluster assignment.

[Figure: two panels. K=2, Distance Percentile = 1.0; K=2, Distance Percentile < 1.0]
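One possible reading of the spill behavior is sketched below for a single assignment pass. The cutoff rule used here (the distance at the given percentile of all nearest-centroid distances) is an assumption, as are the function name, the fixed centroids, and the use of Euclidean distance; the actual simCluster+ algorithm is not described in this deck beyond the bullets above.

```python
import math


def assign_with_spill(points, centroids, distance_percentile):
    """One assignment pass: returns ({centroid index: points}, spilled points)."""
    nearest = []
    for point in points:
        dists = [math.dist(point, c) for c in centroids]
        idx = min(range(len(centroids)), key=dists.__getitem__)
        nearest.append((idx, dists[idx]))
    ordered = sorted(d for _, d in nearest)
    # Assumed cutoff: the distance at the requested percentile of nearest distances.
    cutoff = ordered[max(0, math.ceil(distance_percentile * len(ordered)) - 1)]
    assignments, spilled = {}, []
    for point, (idx, dist) in zip(points, nearest):
        if dist <= cutoff:
            assignments.setdefault(idx, []).append(point)
        else:
            spilled.append(point)  # would be handled in a later assignment pass
    return assignments, spilled


points = [(0, 0), (1, 0), (10, 0), (9, 0), (5, 0)]
centroids = [(0, 0), (10, 0)]
all_in, none_spilled = assign_with_spill(points, centroids, 1.0)
tight, spilled = assign_with_spill(points, centroids, 0.5)
print(none_spilled, spilled)
```

With a percentile of 1.0 every point stays in one of the K clusters; with a lower value the far-away point (5, 0) is left over for a later pass.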

Page 69: xAI WORKBENCH TRAINING

simCluster+: K

• K is an integer value that represents the minimum number of clusters to be created.

• If Distance Percentile is 1.0, exactly K clusters will be generated. If it is less than 1.0, at least K clusters will be generated.

[Figure: two panels. K=2, Distance Percentile = 1.0; K=2, Distance Percentile < 1.0]

Page 70: xAI WORKBENCH TRAINING

simCluster+: Unsupervised

• Three distance functions are available for unsupervised clustering:

• Manhattan

• Euclidean

• One Class

• The proprietary One Class distance function is useful when you know the underlying data set has a majority “common” class (e.g. fraud, customer behavior, anomaly detection, etc.)

Page 71: xAI WORKBENCH TRAINING

simCluster+: Supervised

• When creating a supervised clustering experiment, you should find the optimal hyper parameter configuration by tuning the simClassify+ classifier (Auto Tune or Exhaustive Grid Search).

• Once you find your optimal hyper parameter configuration, use it on this page to create your clustering engine.

• If you are using both simClassify+ and simCluster+ for a particular data set/use case, choosing the same hyper parameters will ensure that the learned distance function is consistent between the classification and clustering.


Page 72: xAI WORKBENCH TRAINING

Create Visualization

• Top N: Number of layers (i.e. feature-value pairs) in visualization

• Max Number of Clusters: Maximum number of slices (i.e. clusters)

• Limit: Minimum number of elements for cluster to be visualized

Page 73: xAI WORKBENCH TRAINING

CLUSTER VISUALIZATIONS

Page 74: xAI WORKBENCH TRAINING

View Results

[Screenshot callouts:]

• Cluster Label

• Save cluster label

• See as main cluster (Compare mode)

• Deselect cluster

• Download cluster elements and features

• Most predictive (or frequent if unsupervised) factors in the selected cluster

Page 75: xAI WORKBENCH TRAINING

View Details

• View the most predictive or frequent factors of each cluster, on either a local or global level.


Page 76: xAI WORKBENCH TRAINING

Cluster Comparison

• Select two clusters from the visualization page, and compare the most predictive or frequent factors between the two.

Page 77: xAI WORKBENCH TRAINING

CLUSTER/SEGMENT STATISTICAL ANALYSIS

Page 78: xAI WORKBENCH TRAINING

Cluster Statistics

• If the cluster was created with statistics turned on, the Statistics tab will take you to the cluster statistics view.

Page 79: xAI WORKBENCH TRAINING

Cluster Statistics Choice

• In the cluster statistics view you can select a dataset attribute and see the statistical properties of the attribute values in that cluster.

• You can also compare numerical value distribution to the overall dataset distribution. If there are significant differences, this may lead to useful insights.

Page 80: xAI WORKBENCH TRAINING

Cluster Statistics Choice

• Selecting the blue + button will show a list of attributes to choose from.

• Once an attribute is chosen, the left side button turns to a red - button and can be used to remove the display of that attribute’s details.

Page 81: xAI WORKBENCH TRAINING

Cluster Statistics

• Number of Laboratory Procedures was one of the most heavily weighted attributes for predictions in this cluster.

• Looking at the distribution of values for this cluster compared to the distribution in the whole dataset, we see that many of the frequently occurring values are not even in the top ten values in the overall dataset.

Page 82: xAI WORKBENCH TRAINING

Cluster Statistics

• As a comparison, gender is distributed very similarly in both this cluster and the overall dataset.

Page 83: xAI WORKBENCH TRAINING

DATASET UPDATE & MERGE

Page 84: xAI WORKBENCH TRAINING

Dataset Update & Merge

• The update and merge features give users the ability to refresh their data for model re-training.

• Files used to update the data must have exactly the same header as the original dataset.

• The user has two choices when updating a dataset:

• Appending rows to the current dataset

• Refreshing values in the current dataset (each row in the new dataset must carry a unique ID that matches a row in the original dataset)
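The two update choices can be illustrated with plain Python. This is only a sketch of the rules stated above, assuming rows are held as dictionaries; the function names and sample rows are hypothetical and do not come from the Workbench.

```python
def check_header(original_header, update_header):
    """Reject an update file whose header differs from the original dataset's."""
    if list(original_header) != list(update_header):
        raise ValueError("update header must exactly match the original header")


def append_rows(original_rows, update_rows):
    """Choice 1: add the update rows to the end of the current dataset."""
    return list(original_rows) + list(update_rows)


def refresh_rows(original_rows, update_rows, id_column):
    """Choice 2: overwrite rows whose unique ID matches a row in the original."""
    original_ids = {row[id_column] for row in original_rows}
    updates = {row[id_column]: row for row in update_rows}
    unmatched = sorted(set(updates) - original_ids)
    if unmatched:
        raise ValueError(f"no matching ID in original dataset: {unmatched}")
    return [updates.get(row[id_column], row) for row in original_rows]


# Hypothetical dataset, invented for illustration.
header = ["id", "age", "class"]
original = [{"id": "a", "age": 30, "class": "yes"},
            {"id": "b", "age": 41, "class": "no"}]

check_header(header, ["id", "age", "class"])
appended = append_rows(original, [{"id": "c", "age": 25, "class": "yes"}])
refreshed = refresh_rows(original, [{"id": "b", "age": 42, "class": "no"}], "id")
print(len(appended), refreshed[1]["age"])
```

Appending grows the dataset, while refreshing keeps the row count and changes values in place, which is why the refresh path needs the unique ID.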


Page 85: xAI WORKBENCH TRAINING

UPDATE INSTANCE

Page 86: xAI WORKBENCH TRAINING

Update Instance

• To re-train a given model on a new dataset, select “Update Instance” from the Model Actions page.

• This creates another version of the existing model; the user can choose whether or not that version receives queries.


Page 87: xAI WORKBENCH TRAINING

COPY INSTANCE

Page 88: xAI WORKBENCH TRAINING

Copy Instance

Step 1

• From the “Your Models” page, choose “Model Actions” for the new model you would like to copy. The new model must first be created as an entirely separate model.


Page 89: xAI WORKBENCH TRAINING

Copy Instance

Step 2

• Next, choose “Copy Instance” from the “Models Details” tab.


Page 90: xAI WORKBENCH TRAINING

Copy Instance

Step 3

• Next, choose the “Target Instance” you wish to update. This will be the current instance you are using in production. Then, select “Copy Instance”.


Page 91: xAI WORKBENCH TRAINING

Copy Instance

Step 4

• After choosing “Copy Instance” in the dialog box, you will be directed to the target instance’s “View Versions” page. The copied version of the instance will default to “Sleeping” mode.

• To set the copied version to “Current”, which enables querying from the API, select the “Set as Current” option in its row.


Page 92: xAI WORKBENCH TRAINING

Copy Instance

Step 5

• The updated instance is now set to current, and is ready for querying from the API.


Page 93: xAI WORKBENCH TRAINING

MONITORING

Page 94: xAI WORKBENCH TRAINING

Monitoring

• Model monitoring can be set up to notify users if a certain prediction class is experiencing outlier behavior.

• Parameters to be set:

• Class Name - Name of the class to monitor

• Packet Size - Calculations are done over packets. A packet is a set of queries.

• Sample Size - Number of predictions for the specified class that must occur before the monitor activates

• Percentage - Minimum % of packets that must be marked as outliers to trigger a warning

• Revision Frequency - How often the monitoring report will run (in milliseconds)

• zScore - Number of standard deviations the accepted range extends from the mean

• Email Address - Email address to be notified when a monitoring warning is triggered
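To show how these parameters could interact, here is a minimal sketch. It is not the Workbench's monitoring algorithm, which the slides do not describe; it assumes each packet is scored by the share of its predictions that belong to the monitored class, and that a baseline mean and standard deviation for that share are already known. The function name, `baseline_mean`, and `baseline_std` are hypothetical.

```python
def monitor_triggers(predictions, class_name, packet_size, sample_size,
                     percentage, z_score, baseline_mean, baseline_std):
    """Return True when enough packets fall outside the accepted range."""
    # Sample Size: stay inactive until enough predictions of the class exist.
    if predictions.count(class_name) < sample_size:
        return False

    # Packet Size: split the predictions into full packets of queries.
    packets = [predictions[i:i + packet_size]
               for i in range(0, len(predictions) - packet_size + 1, packet_size)]
    if not packets:
        return False

    # zScore: accepted range is the mean plus or minus z standard deviations.
    low = baseline_mean - z_score * baseline_std
    high = baseline_mean + z_score * baseline_std
    outliers = sum(1 for packet in packets
                   if not low <= packet.count(class_name) / packet_size <= high)

    # Percentage: warn when the share of outlier packets reaches the minimum.
    return 100.0 * outliers / len(packets) >= percentage


# Hypothetical prediction stream: three packets of ten queries each.
stream = (["fraud"] * 2 + ["ok"] * 8
          + ["fraud"] * 9 + ["ok"] * 1
          + ["fraud"] * 8 + ["ok"] * 2)

print(monitor_triggers(stream, "fraud", packet_size=10, sample_size=10,
                       percentage=50, z_score=3,
                       baseline_mean=0.2, baseline_std=0.05))
```

In this invented stream, two of the three packets fall outside the accepted range, so a 50% threshold triggers a warning while an 80% threshold would not.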


Page 95: xAI WORKBENCH TRAINING

Monitoring


Page 96: xAI WORKBENCH TRAINING

SAMPLE HIGH-CAPACITY PRODUCTION SET-UP

Page 97: xAI WORKBENCH TRAINING

Sample High Capacity Production Set-up

• One server is used for ‘Training’, in other words running grids.

• The other four servers are divided into a Master and three query Slaves

• The three slaves are connected to a load balancer, but the master could be added to the load balancer as needed.

• Updated data is merged into the Training server, and grids are run to determine whether the difference is significant enough to update the query models.

• If there is sufficient change, the hyperparameters and files (updated data) for the model are copied over to the Master and a new version of the model is created.

• Either a new model is created in the Master and then Copy Model is used to create a new version,

• Or, Update Model is used to create a new version of the model.

• This new model version will be replicated to the slaves.

• The switchover to the new model can then happen by changing the ‘Current Version’ for the model in the Master, which will replicate to the Slaves.


Page 98: xAI WORKBENCH TRAINING

Sample High Capacity Production Set-up

[Diagram. Client Operations send Queries through a Load Balancer to three Query Servers, each running a Production Fraud Detect model on xAI Workbench Software. Updates are merged (Merge) into the Training Dataset on the Training Server, where Training runs. Update Models: hyper-parameters and updated files are copied from the Training Server to the Master Server to update models. The latest version of the models is rsynced to the Query Servers. A Monitor sends an Alert Email.]


Page 99: xAI WORKBENCH TRAINING

Sample High-Capacity Production Throughput

• Server configuration

• 16 cores

• 128 GB RAM

• ~2 milliseconds per query

• ~20 milliseconds per query with the Whys (dynamic weighted factors)

• Throughput per query server: ~30,000 queries per minute

• Sample system throughput (three servers): ~90,000 queries per minute

• Sample system high-load throughput (three servers plus master): ~120,000 queries per minute
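The throughput figures above follow from the per-query latency. The arithmetic below is a sketch under one assumption that the slide does not state: each server answers queries one after another with no parallelism. The last line applies the same assumption to the ~20 millisecond figure and is a derived estimate, not a number from the slide.

```python
MS_PER_MINUTE = 60_000


def queries_per_minute(ms_per_query, servers=1):
    """Queries per minute if each server handles queries strictly in sequence."""
    return servers * MS_PER_MINUTE // ms_per_query


print(queries_per_minute(2))             # one query server
print(queries_per_minute(2, servers=3))  # three query servers
print(queries_per_minute(2, servers=4))  # three query servers plus the master
print(queries_per_minute(20))            # one server, queries with the Whys
```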


Page 100: xAI WORKBENCH TRAINING

APPLICATIONS WITH THE API

Page 101: xAI WORKBENCH TRAINING

Select all API text and Copy.

API Usage

• Command Line using curl as displayed in form:


Page 102: xAI WORKBENCH TRAINING

Remove:[PASSWORD]

Enter password when prompted.

API Usage

• Command Line using curl as displayed in form:


Page 103: xAI WORKBENCH TRAINING

API Usage

• Command Line using curl as displayed in form:


Page 104: xAI WORKBENCH TRAINING

APPENDIX

Page 105: xAI WORKBENCH TRAINING

Avoid Overfitting

• Overfitting - a model that makes highly accurate predictions, but only for the specific dataset it was trained on. An overfit model does not generalize to new data.

• Three-part approach to avoid overfitting:

• Training Dataset - a set of examples used for learning.

• Validation Dataset - a set of examples used to tune the parameters of a model. Usually these examples are a separate subset of the training dataset. Choose the best model based on the validation dataset metrics.

• Holdout Dataset - a set of examples used only to assess the performance of a fully trained model. These examples never appear in the training or validation datasets. Use them to test the best model chosen above and see whether its performance holds.
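The three-part approach can be sketched as a simple random split. This is a generic illustration rather than the Workbench's own procedure; the function name, the 20% fractions, and the fixed seed are assumptions chosen for the example.

```python
import random


def three_way_split(rows, validation_fraction=0.2, holdout_fraction=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_holdout = int(len(shuffled) * holdout_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    # Holdout is set aside first so it is never seen during training or tuning.
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    training = shuffled[n_holdout + n_validation:]
    return training, validation, holdout


training, validation, holdout = three_way_split(range(100))
print(len(training), len(validation), len(holdout))
```

Models are tuned against `validation` only; `holdout` is used once, at the end, to check whether the chosen model's performance holds.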


Page 106: xAI WORKBENCH TRAINING

SIMCLASSIFY

Page 107: xAI WORKBENCH TRAINING

simClassify

• simClassify is one of xAI Workbench’s classifiers.

• It accepts queries in the form of a data object with an unknown Class column.

• simClassify uses our similarity engine to identify the nearest neighbors to a queried object and uses the Class field from those objects to predict the Class field for the queried object.
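The nearest-neighbor idea described above can be sketched in a few lines. This is a generic k-nearest-neighbors vote, not simClassify itself: the similarity engine is not described in the slides, so a simple stand-in distance (the fraction of fields that differ) is used, and all names and sample rows are hypothetical.

```python
from collections import Counter


def mismatch_distance(query, row):
    """Stand-in distance: fraction of the query's fields whose values differ."""
    return sum(1 for key in query if query[key] != row.get(key)) / len(query)


def classify(query, training, k=3, class_field="Class"):
    """Predict the Class of a query from the Class of its k nearest neighbors."""
    neighbors = sorted(training, key=lambda row: mismatch_distance(query, row))[:k]
    votes = Counter(row[class_field] for row in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)


# Hypothetical training objects, invented for illustration.
training_rows = [
    {"shape": "circle", "size": "medium", "color": "yellow", "Class": "Class 1"},
    {"shape": "circle", "size": "medium", "color": "red", "Class": "Class 1"},
    {"shape": "circle", "size": "large", "color": "yellow", "Class": "Class 1"},
    {"shape": "square", "size": "small", "color": "blue", "Class": "Class 2"},
    {"shape": "square", "size": "large", "color": "blue", "Class": "Class 2"},
]

# The query is a data object with an unknown Class.
query = {"shape": "circle", "size": "medium", "color": "yellow"}
print(classify(query, training_rows, k=3))
```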


Page 108: xAI WORKBENCH TRAINING

simClassify Settings

• Bins

• Determines how many “buckets” fields with numbers as values will be split into.

• e.g. if you have values from 1-100, 5 bins would give you splits of 20. Higher values can increase accuracy at the risk of overfitting.

• Top Columns

• The number of columns to consider when making the prediction.

• Fields with strings may be broken into multiple columns.

• Higher values can increase accuracy at the cost of speed.

• Classification K

• The number of nearest neighbors to use when making the classification. We recommend the default, CK, which auto-detects the proper value.

• Energy Weight

• Used if one class is expected to be significantly more frequent than others.

• Dense Mode

• The distance function being used by the engine.
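As an illustration of the Bins setting, the sketch below splits a numeric range into equal-width buckets. Equal-width binning is an assumption here; the slides do not say how simClassify places its bin boundaries. The function name is hypothetical, and the range 0 to 100 is used so that 5 bins give a width of exactly 20.

```python
def bin_index(value, low, high, bins):
    """Return the zero-based index of the equal-width bin that holds value."""
    width = (high - low) / bins
    index = int((value - low) / width)
    # Clamp so the top edge (and any out-of-range value) lands in a valid bin.
    return min(max(index, 0), bins - 1)


for value in (1, 19, 20, 55, 100):
    print(value, bin_index(value, low=0, high=100, bins=5))
```

More bins mean narrower buckets, which keeps more detail from the original values and is why higher settings can raise accuracy while increasing the risk of overfitting.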


Page 109: xAI WORKBENCH TRAINING

simClassify Distance Functions

• simClassify can accept any distance function.

• We typically recommend using the SMART distance function.

• SMART learns the relationships between objects based on their class. This means that the dataset is clustered based on outcome, resulting in very clean clusters for predictions to be made from.

• Along with all other settings, the accuracy of various distance functions can be tested in Fold Experiments.


Page 110: xAI WORKBENCH TRAINING

Interpreting simClassify Results

• simClassify returns results as a confidence value based on the distance between the queried object and its neighbors.

• The closer an object is to its neighbors, the more confident the algorithm is that it has the correct classification.

[Figure: queried objects plotted in a Class 1 Region and a Class 2 Region. One has very close neighbors (high confidence); another has more distant neighbors (lower confidence).]
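One simple way to turn neighbor distances into a confidence value is shown below. The formula is purely an assumption for illustration (one minus the mean distance, with distances scaled to the range 0 to 1); the slides do not give simClassify's actual calculation, and the function name and distances are hypothetical.

```python
def confidence_from_distances(neighbor_distances):
    """Map distances in [0, 1] to a confidence in [0, 1]; closer means higher."""
    if not neighbor_distances:
        raise ValueError("need at least one neighbor distance")
    return 1.0 - sum(neighbor_distances) / len(neighbor_distances)


# Hypothetical distances from two queried objects to their nearest neighbors.
close = confidence_from_distances([0.02, 0.05, 0.08])
distant = confidence_from_distances([0.40, 0.50, 0.60])
print(round(close, 2), round(distant, 2))
```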


Page 111: xAI WORKBENCH TRAINING

Interpreting simClassify Results (cont.)

• Along with the prediction, simClassify provides the weighted factors which support that prediction.

• In this example, the result and factors would be something like:

Class 1, 0.95 confidence

Factor        Weight
Circle        1.5
Medium Size   0.7
Yellow        0.2

[Figure: a queried object plotted in the Class 1 Region, next to the Class 2 Region.]


Page 112: xAI WORKBENCH TRAINING

SIMCLUSTER

Page 113: xAI WORKBENCH TRAINING

simCluster

• simCluster is the xAI Workbench clustering engine.

• Clusters will be different depending on parameters and the distance function used.

• It can produce either supervised or unsupervised clusters.

• Unsupervised clustering can be used for data analysis and exploration. It can reveal complex patterns and relationships in data.

• Supervised clustering is clustering based on classifications. It will identify the features that differentiate classes from each other. This is a very powerful way to visualize what a classification engine is doing and can be used to identify groups and subgroups in data.

• Application examples: anomaly detection, customer segmentation


Page 114: xAI WORKBENCH TRAINING

simCluster Parameters

• Processing Recipe - The distance function to be used for clustering.

• Sim Cluster Range - The maximum distance between the center of a cluster and an object on its edge.

• Sim Cluster Iterations - The number of passes made by the algorithm to identify cluster centers.

• Sim Cluster Percentage - The percentage of data to use for identifying new cluster centers during each iteration.


Page 115: xAI WORKBENCH TRAINING

Processing Recipe

• The Processing Recipe is the distance function that is used to determine the relationship between objects.

• simCluster has access to two distance functions by default (on the platform):

• Universal is the unsupervised function. It clusters based on the frequency of shared variables between objects.

• Dense is the supervised function. It detects the variables that are most critical for differentiating classes and clusters based on those.

• Additional distance functions can be used through the API or can be added by request.


Page 116: xAI WORKBENCH TRAINING

simCluster Range

• simCluster works by identifying data objects near the center of clusters and then measuring the distance from other data objects to those centers.

• simCluster Range sets the maximum distance between an object and the center of the cluster to which it belongs.

• simCluster Range is a float value between 0 (exclusive) and 1 (inclusive).

[Figure: the centroid of a cluster with two circles drawn around it, Range = 0.4 and Range = 0.5.]

• At range 0.4, only the green objects will be in the cluster.

• At 0.5, the blue objects will be, too.

• Neither setting will add the red object to the cluster.
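The effect of the range setting can be reproduced with a small sketch. The distances below are invented to mirror the green, blue, and red objects in the figure; the function name is hypothetical and the code only applies the rule stated above (an object belongs to the cluster when its distance to the center does not exceed the range).

```python
def members_within_range(distances_to_center, cluster_range):
    """Return the objects whose distance to the center is within the range."""
    if not 0 < cluster_range <= 1:
        raise ValueError("range must be greater than 0 and at most 1")
    return sorted(name for name, distance in distances_to_center.items()
                  if distance <= cluster_range)


# Hypothetical distances from each object to the cluster centroid.
distances = {"green": 0.30, "blue": 0.45, "red": 0.70}

print(members_within_range(distances, 0.4))
print(members_within_range(distances, 0.5))
```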


Page 117: xAI WORKBENCH TRAINING

simCluster Iterations and Percentage

• To boost performance, simCluster creates and populates clusters in multiple iterations.

• In the first iteration, simCluster takes an amount of data equal to the simCluster Percentage and identifies the center of any clusters in that subset. It then attempts to populate those clusters with all of the data.

• Next, it takes any data that cannot be placed in those clusters (distance exceeds simCluster Range) and attempts to identify new clusters.

• The number of times this process is repeated is the simCluster Iterations parameter.

• simCluster Percentage is a float value between 0 (exclusive) and 1 (inclusive).

• simCluster Iterations is an integer value greater than or equal to 1.
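The iterate-sample-populate loop described above can be sketched as follows. This is a simplified illustration, not simCluster's implementation: it works on one-dimensional points with absolute difference as a stand-in distance, picks centers greedily from a random sample, and uses hypothetical names throughout.

```python
import random


def sim_cluster_sketch(points, cluster_range, iterations, percentage, seed=0):
    """Return (centers, assignment, unplaced) for a list of 1-D points."""
    rng = random.Random(seed)
    centers = []
    assignment = {}                       # point index -> center value
    unplaced = list(range(len(points)))   # indices not yet in any cluster

    for _ in range(iterations):
        if not unplaced:
            break
        # Take a share of the unplaced data equal to the percentage.
        sample_size = max(1, int(len(unplaced) * percentage))
        for i in rng.sample(unplaced, sample_size):
            # A sampled point becomes a center if no center is within range.
            if all(abs(points[i] - c) > cluster_range for c in centers):
                centers.append(points[i])

        # Try to place every unplaced point into its nearest cluster.
        still_unplaced = []
        for i in unplaced:
            nearest = min(centers, key=lambda c: abs(points[i] - c))
            if abs(points[i] - nearest) <= cluster_range:
                assignment[i] = nearest
            else:
                still_unplaced.append(i)
        unplaced = still_unplaced

    return centers, assignment, unplaced


# Hypothetical data, invented for illustration.
data = [0.10, 0.12, 0.15, 0.50, 0.52, 0.90]
centers, assignment, unplaced = sim_cluster_sketch(
    data, cluster_range=0.1, iterations=3, percentage=0.5)
print(centers, assignment, unplaced)
```

Points left in `unplaced` after the final iteration are those farther than the range from every center found, which is how a low percentage or too few iterations can leave data unclustered.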
