SciKit Learn: How to Standardize Your Data

How to

Standardize Your Data:

An ML Recipe

DAMIAN MINGLE, CHIEF DATA SCIENTIST, WPC Healthcare

@DamianMingle

GET THE FULL STORY

http://bit.ly/UseSciKitNow

What’s Standardization Anyway?

• A preprocessing step; preprocessing is often described as “functions and transformers that change raw feature vectors into a representation that is more suitable for the downstream estimator”

• Shifting and rescaling each attribute so that its distribution has a mean of 0 and a standard deviation of 1.
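
The definition above amounts to computing z = (x - mean) / std for every value of an attribute. A minimal sketch in plain Python, assuming a single made-up attribute; the `standardize` helper and the `heights_cm` values are illustrative and not from the slides:

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mu = fmean(values)
    # Population standard deviation (divides by n), the same convention
    # that preprocessing.scale uses.
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical attribute
z = standardize(heights_cm)

print(z)
print(fmean(z), pstdev(z))  # mean and standard deviation after standardizing
```

This sketch does not guard against a constant attribute, where the standard deviation is 0 and the division would fail.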

Why Standardization Matters

• It’s a common requirement for many machine learning models

• Models may behave badly when attributes are far from zero mean and unit variance

• It’s useful for models that rely on the distribution of attributes, such as Gaussian processes.
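
One way the "behave badly" point can show up is in distance-based methods, where the attribute with the largest numeric range dominates. The sketch below is an illustration under assumed data, not an example from the slides: the three `samples` (income in dollars, age in years) and the `standardize_columns` helper are hypothetical.

```python
from math import dist
from statistics import fmean, pstdev

# Hypothetical samples: (annual income in dollars, age in years)
samples = [(30000.0, 25.0), (32000.0, 60.0), (90000.0, 26.0)]

def standardize_columns(rows):
    """Standardize each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    stats = [(fmean(c), pstdev(c)) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in rows]

# Raw distances: the income column swamps the 35-year age gap.
raw_d01 = dist(samples[0], samples[1])
raw_d02 = dist(samples[0], samples[2])

# After standardizing, both attributes contribute on a comparable scale.
scaled = standardize_columns(samples)
scaled_d01 = dist(scaled[0], scaled[1])
scaled_d02 = dist(scaled[0], scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```

On the raw data, sample 1 looks far closer to sample 0 than sample 2 does, even though it differs by 35 years in age; after standardizing, the two distances are of similar size.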

Power in SciKit Learn

• Preprocessing

• Clustering

• Regression

• Classification

• Dimensionality Reduction

• Model Selection

Let’s Look at an ML Recipe

Standardization

The Imports

from sklearn.datasets import load_iris

from sklearn import preprocessing

Separate Features from Target

iris = load_iris()

print(iris.data.shape)

X = iris.data

y = iris.target

Standardize the Features

normalized_X = preprocessing.scale(X)
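
A quick way to confirm the step above did what the definition promises is to check the per-column mean and standard deviation of the result. A minimal sketch, assuming NumPy is available (it is a dependency of scikit-learn); the `col_means` and `col_stds` names are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing

X = load_iris().data
normalized_X = preprocessing.scale(X)

# Each of the four columns should now have mean ~0 and standard deviation ~1.
col_means = normalized_X.mean(axis=0)
col_stds = normalized_X.std(axis=0)

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```

The means come out as very small floating-point values rather than exact zeros, which is why the printout is rounded.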

Standardization Recipe

# Standardize the data attributes for the Iris dataset.

from sklearn.datasets import load_iris

from sklearn import preprocessing

# load the iris dataset

iris = load_iris()

print(iris.data.shape)

# separate the data from the target attributes

X = iris.data

y = iris.target

# standardize the data attributes

normalized_X = preprocessing.scale(X)
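
The recipe standardizes the whole dataset in one call. When a model is evaluated on held-out data, a common variation is to learn the mean and standard deviation on the training rows only and reuse them on the test rows. The sketch below is not part of the recipe; it assumes scikit-learn's `StandardScaler` and `train_test_split`, and the 25% split and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Hypothetical split: the hold-out size and seed are arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn per-column mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)

# Apply the same statistics to both splits.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(np.round(X_train_std.mean(axis=0), 6))  # ~0 by construction
print(np.round(X_test_std.mean(axis=0), 6))   # close to 0, but not exactly
```

The test split will generally not have exactly zero mean, because it is scaled with the training statistics rather than its own.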

Resources

• Society of Data Scientists: http://societyofdatascientists.com/

• SciKit Learn: http://scikit-learn.org/stable/

• Also:

  • Scaling features to a range (MinMaxScaler or MaxAbsScaler)

  • Scaling sparse data (MaxAbsScaler, or StandardScaler with with_mean=False)

  • Scaling data with outliers (RobustScaler)
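
For a feel of how the other scalers named above differ from standardization, here is a minimal sketch that applies three of them to the Iris features with their default settings. The variable names are illustrative, and Iris is used only because the recipe already loads it; it has no sparse data or notable outliers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler

X = load_iris().data

# MinMaxScaler: rescale each column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# MaxAbsScaler: divide each column by its maximum absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X)

# RobustScaler: subtract the median and divide by the interquartile range.
X_robust = RobustScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(np.abs(X_maxabs).max(axis=0))
print(np.median(X_robust, axis=0))
```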
