© 2016 IBM Corporation
Data Processing in Data Science
Mokhtar ([email protected]), Technical Sales Team, July 27, 2016
Data Science Methodology (John B. Rollins, [email protected])
- Business understanding
- Analytic approach
- Data requirements
- Data collection
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
- Feedback
Roles across the methodology: Data Scientist / Data Engineer, Statistician, Domain Expert
Agenda
Stepping through a data-centric view of the data science methodology
- Highly iterative process with three pillars:
  • Data requirements and collection
  • Data pre-processing
  • Data analysis
Example of text classification
- Naïve Bayes
- TF-IDF encoding
Data centric view of methodology
The methodology, seen as a data pipeline:
- Data req/collection (databases, data warehouses, Hadoop): Data sources → Data
- Data "pre"-processing (noisy/missing data, data corruption, dim. reduction): Data → Data ready
- Data analysis (analytical, statistical, visual): Data ready → Models
- Model/pattern evaluation (quantitative, qualitative): Models → "Good" model
- Model/pattern deployment (prediction, decision support): "Good" model applied to new data
Data Collection
Access data which can originate from one or several repositories:
- Relational RDBMS:
  • SQL, JDBC, Sqoop, Export, High Perf Unload
- Real-time streaming:
  • Kafka, Streams, custom code
  • RSS feeds
  • Etc…
- HDFS:
  • Files (CSV, Parquet, etc…)
  • Hive
  • HBase
- Others:
  • S3, AWS, etc…
Data Collection
Apache Spark can read a multitude of data sources.
- Extensive set of data source APIs
- Can help avoid costly operations:
  • Parsing text or JSON records
  • Deserializing data stored in binary formats
  • Converting Java objects to JSON/Parquet/Avro records
  • Etc…
(Figure: built-in data source support plus external connector libraries, and more…)
Data Collection
If relevant data sets reside in different data sources (e.g. legacy data on mainframes), it is important to decide whether to:
- Centralize the data in one single location (HDFS)
- Use a federation model approach

Centralized vs. Federated:
- Performance: data extraction is usually faster | subject to longer delays in data delivery
- Data availability: better availability | can be an issue if data cannot be queried all the time
- Implementation cost: the cost of building the centralized repository | the cost of leveraging federation
- Time constraints: is there time to build a centralized repository? | usually less time consuming
- Data quality: easier to ensure data quality, even if source quality is low | more subject to the quality of the data provider
Data centric view of methodology
(Recap: Data sources → Data req/collection → Data "pre"-processing → Data ready)
Data pre-processing (preparation)
Data preparation can be very time consuming depending on:
- The state of the original data
  • Data is typically collected in a "human"-friendly format
- The desired final state of the data (as required by the machine learning models and algorithms)
  • The desired final state is typically some "algorithm"-friendly format
- There may be a need for a (long) pipeline of transformations before the data is ready to be consumed by a model:
  • These transformations can be done manually (write code)
  • These transformations can be done through tools
Data pre-processing (preparation)
Data can be missing values:
- Blank fields
- Fields with dummy values (9999)
- Fields with "U" or "Unknown"
Data can be corrupt or incoherent:
- Data fields can be in the wrong place (strings where numbers are expected)
- Spurious "End of Line" characters can chop original lines of data into several lines and leave data fields in the wrong place
- Data entered in different formats: USA / US / United States
These aspects of data preparation are often referred to as data cleansing / wrangling.
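As a minimal sketch, cleansing rules like the ones above can be expressed as a small normalization function. The sentinel values and country aliases below are hypothetical; adapt them to the data at hand:

```python
# Hypothetical sentinel values and country spellings; adjust to your data.
MISSING_SENTINELS = {"", "U", "Unknown", "9999"}
COUNTRY_ALIASES = {"usa": "United States", "us": "United States",
                   "united states": "United States"}

def clean_field(value):
    """Map blanks and dummy values to None; normalize known aliases."""
    v = value.strip()
    if v in MISSING_SENTINELS:
        return None
    return COUNTRY_ALIASES.get(v.lower(), v)

def clean_record(raw_line):
    """Split a CSV-ish line and clean each field."""
    return [clean_field(f) for f in raw_line.split(",")]
```

For example, `clean_record("US,9999,Unknown")` yields `["United States", None, None]`.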
Data pre-processing (preparation)
Data dimensionality may need to be reduced:
- The idea behind reducing data dimensionality is that raw data tends to have two subcomponents:
  • "Useful features" (aka structure)
  • Noise (random and irrelevant)
  • Extracting the structure makes for better models
- Examples of applications of dimensionality reduction:
  • Extracting the important features in face/pattern recognition
  • Removing stop words when working on text classification
  • Stemming: fishing, fished, fisher → fish
- Example methods of dimensionality reduction:
  • Principal Component Analysis
  • Singular Value Decomposition
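For the text case, here is a minimal sketch of stop-word removal plus a deliberately crude suffix-stripping "stemmer" (real work would use a proper stemmer such as Porter's; the word and suffix lists are purely illustrative):

```python
# Illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "an", "of", "for", "and", "a"}

def crude_stem(token):
    """Strip a few common suffixes: fishing, fished, fisher -> fish."""
    for suffix in ("ing", "ed", "er"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def reduce_tokens(tokens):
    """Drop stop words, then stem: a simple dimensionality reduction."""
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```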
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Tokenizing (typical in text processing)
- Vectorizing (several algorithms in Spark MLlib require this)
  • Transform data into Vector arrays
  • Can be done manually (write Python or Scala code)
  • Can be done using tools (VectorAssembler in the new ML package)
  • TF-IDF (in text processing)
  • Word2Vec
- Bucketizing
  • Transform a range of continuous values into a set of buckets
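A minimal bucketizing sketch (the split points are illustrative; Spark's Bucketizer is the tool you would use on DataFrame columns):

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Assign a continuous value to a bucket index using sorted splits.
    Values below the first split fall in bucket 0; values at or above
    the last split fall in the final bucket (half-open ranges)."""
    return bisect_right(splits, value)

# Buckets: (-inf, 0), [0, 10), [10, 100), [100, +inf)
splits = [0.0, 10.0, 100.0]
```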
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Standardization
  • Transform data to a set of Vectors with zero mean and unit standard deviation
  • Linear Regression with SGD in Spark MLlib requires this
(Figure: house price vs. square footage, before and after standardizing square footage)
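A minimal standardization sketch: rescale a feature column to zero mean and unit standard deviation (at scale, MLlib's StandardScaler does this for you):

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the sample stdev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```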
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Normalization
  • Transform data so that each Vector has a unit norm
- Categorical values need to be converted to numbers
  • This is required by Spark MLlib classification trees
  • Marital Status: {"Widowed", "Married", "Divorced", "Single"}
  • Marital Status: {0, 1, 2, 3}
  • You cannot do this if the algorithm could infer something like Single = 3 × Married
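Minimal sketches of the two transformations above (MLlib's Normalizer and StringIndexer are the real tools; the category list is just the slide's example):

```python
from math import sqrt

def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def index_categories(values):
    """Map each distinct category to a small integer, in first-seen order."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return mapping
```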
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Dummy encoding
  • When categorical values cannot safely be converted to consecutive numbers
  • Marital Status: {"Single", "Married", "Divorced", "Widowed"}
  • Marital Status: {"0001", "0010", "0100", "1000"}
  • This is necessary if the algorithm could make some wrong inference from the number-based categorical encoding:
    Single = 3, Married = 2, Divorced = 1, Widowed = 0
    Single = Married + Divorced, or Single = Divorced x 3
  • (This is a contrived example, but you get the idea; replace marital status with colors…)
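A minimal one-hot (dummy) encoding sketch for the marital-status example; each category becomes its own binary indicator, so no arithmetic relation between categories can be inferred:

```python
def one_hot(value, categories):
    """Return a binary indicator vector with a single 1 at the
    position of the matching category."""
    return [1 if value == c else 0 for c in categories]

categories = ["Single", "Married", "Divorced", "Widowed"]
```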
Data centric view of methodology
(Recap: Data sources → req/collection → "pre"-processing → Data ready → analysis → Models)
Data analysis:
The data analysis phase is very likely to be intertwined with the data preparation phase:
- It is not uncommon to use data analysis results to refine or drive the data preparation
  • In text classification: group the tokens at hand and calculate the counts in each group
  • You most likely want to remove the most common and the least common tokens as part of the analysis phase
The data analysis phase should use:
- Mathematical tools (statistics, correlations, Chi-square test of independence, etc…)
- Visualizations
Data analysis: visualizations
Anscombe's quartet
- The four datasets have nearly identical statistical properties (mean, variance, correlation), yet the differences are striking when looking at a simple visualization
- The four data sets share these statistical properties:
  • The mean of x is 9
  • The variance of x is 11
  • The mean of y is approx. 7.50
  • The variance of y is approx. 4.12
  • The correlation is 0.816
  • The linear regression lines are approx. y = 3.00 + 0.500x
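These numbers are easy to verify with the Python stdlib; here is dataset I of Anscombe's quartet (the published values), with a hand-rolled correlation to avoid depending on newer stdlib additions:

```python
from statistics import mean, variance, stdev

# Anscombe's quartet, dataset I (published values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def correlation(xs, ys):
    """Pearson correlation via sample covariance over the stdevs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))
```

`mean(x)` is 9, `variance(x)` is 11, and `correlation(x, y)` is approximately 0.816, matching the summary above; only plotting reveals how differently the four datasets behave.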
Data analysis: visualizations
Ocean temperature and salinity data in an archipelago in the area of Vancouver (British Columbia, Canada)
- Goal: show the added value of tools such as notebooks and how scientists can get quick insights
- Answer questions such as: identify the various regimes of temperature and salinity in the archipelago
Data analysis: visualizations
Verifying the geographical positioning of the archipelago on a Python map in the Zeppelin Notebook…
Data analysis: visualizations Flipping the sign of longitude values…
Data analysis: visualizations Summer surface (depth < 10 meters) average temperature
evolution in the archipelago…
Data analysis: visualizations Summer surface (depth < 10 meters) average temperature
evolution in the archipelago + average salinity…
Data analysis:
The data analysis phase can GUIDE you towards better and more relevant modeling:
- When you are building a classifier, it is important to understand the PREVALENCE of the condition that you are building a model for, i.e. how common or uncommon this condition effectively is…
- Imagine you are working towards building a classifier for some medical condition, and your training and testing data sets yield the following model:

                   Test positive        Test negative
Disease (100)      95 (True Positive)   5 (False Negative)
Normal (100)       5 (False Positive)   95 (True Negative)

With 95% sensitivity & specificity, this sounds like a great test…
Data analysis:
What truly matters to the users of your new model / test (doctors, bankers, practitioners) is the PREDICTIVE VALUE of the test:
- If the test is positive, then what is the actual chance of being sick?
- If the test is positive, then what is the actual chance of defaulting on the loan?
- Is it 95%?
Let's run the test on a population of 1,000,000 where 1% of individuals (10,000) actually suffer from this condition:

                     Test positive                Test negative
Disease (10,000)     9,500 (95% True Positive)    500 (5% False Negative)
Normal (990,000)     49,500 (5% False Positive)   940,500 (95% True Negative)
Data analysis:
Probability of being sick if the test is positive:
- (# of people who are sick AND test positive) / (# of positive tests)
- 9,500 / (9,500 + 49,500) ≈ 16.1%
- What is happening here:
  • The condition is RARE, and the 5% FALSE POSITIVES still far outnumber the true positives.
Data analysis of the prevalence of the condition tells us that a test with 99% or higher sensitivity / specificity would be needed.
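The arithmetic above is a direct application of Bayes' rule; a minimal sketch of the positive predictive value as a function of sensitivity, specificity, and prevalence:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(sick | test positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)
```

`ppv(0.95, 0.95, 0.01)` reproduces the 1,000,000-person example: 9,500 / 59,000, roughly 16%; pushing both sensitivity and specificity to 0.99 raises it to 50% at this prevalence.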
Text Classification & Analytics Deep Dive: Naïve Bayes classifier
Text Classification Deep Dive Example:
We will start with a set of text documents (files => Corpus) representing news articles (in the area of Information Technology) already categorized in classes (big data, operating systems, cloud computing, etc…)
The set of documents is split in two parts (~65% / ~35%):
- A training set. Given the text documents in this set, we will prepare the data and then train a Naïve Bayes classification model, which will learn how to associate the "features" of a text with the class to which it is assigned.
- A test set. Using the test set, we will verify the accuracy of the model trained above.
Data centric view of methodology
(Recap: Data sources → Data req/collection: databases, data warehouses, Hadoop)
Data requirements/collection:
Requirements:
─ Collect a sufficient number of pre-classified articles
─ Split the set of files into training and test sets (using the aforementioned proportions)
─ There should not be too much overlap of files between the categories
─ The files will be organized into 10 sub-folders (big-data, networking, operating-systems, etc...)
─ Tooling: work with Apache Spark
Collection:
─ The files are collected (from the feed) in HTML format
─ Write a Java program to read the RSS feed (provided as part of the lab)
─ Run the Java program over enough time and collect a few hundred documents
─ Organize the collected HTML files, as mentioned above, into 10 sub-folders where each sub-folder corresponds to one category in the classification scheme
Data requirements/collection:
At the conclusion of the collection process, you should have:
Top Folder:
─ big-data
  |-- File1.html
  |-- File2.html
  |-- File3.html
─ cloud-computing
  |-- File4.html
  |-- File5.html
  |-- File6.html
─ networking
  |-- File7.html
  |-- File8.html
  |-- File2.html
Data centric view of methodology
(Recap: Data sources → Data req/collection → Data "pre"-processing → Data ready)
Data pre-processing / preparation
Data understanding / preparation:
- Extract the TEXT portions from the HTML documents which were collected
  • HTML documents contain a lot of information (tags, metadata, etc…) which is irrelevant to the target classification
- Write some code to parse the HTML documents and extract the relevant text portions (in the context of the lab, this is provided)
- A simple script has to be written to invoke the above program logic and create a "mirror" set of folders and sub-folders where all files are pure text
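The lab provides its own parser; as a minimal stdlib-only sketch of the idea, an `html.parser` subclass can collect text nodes while skipping script/style content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```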
Data pre-processing / preparation
At the end of the text extraction phase you should have:
Top Folder:
─ big-data
  |-- File1.txt
  |-- File2.txt
  |-- File3.txt
─ cloud-computing
  |-- File4.txt
  |-- File5.txt
  |-- File6.txt
─ networking
  |-- File7.txt
  |-- File8.txt
  |-- File2.txt
─ etc…
Data centric view of methodology
(Recap: Data sources → req/collection → "pre"-processing → Data ready → analysis → Models)
Data analysis
Examine the distribution of files among classes
Examine the amount of overlap of text files between categories
─ Do we have a large proportion of files which belong to more than one category? (this can hurt Naïve Bayes classification accuracy)
Find out how many tokens compose the given corpus
Find out how many UNIQUE tokens compose the given corpus (all lowercase)
Eliminate "useless" tokens such as common stop words
─ "The", "An", "Of", "For"
Eliminate very rare words
Estimate the dimensionality of the problem → the size of vectors to use
Perform stemming and check the new dimensionality
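The token-counting steps of this checklist can be sketched in a few lines (the stop-word list and `min_count` threshold are illustrative):

```python
from collections import Counter

# Illustrative stop-word list; extend as the analysis suggests.
STOP_WORDS = {"the", "an", "of", "for"}

def corpus_stats(docs, min_count=2):
    """Return (total tokens, unique tokens, vocabulary size after
    dropping stop words and tokens rarer than min_count)."""
    counts = Counter(t.lower() for doc in docs for t in doc)
    total = sum(counts.values())
    kept = {t for t, c in counts.items()
            if t not in STOP_WORDS and c >= min_count}
    return total, len(counts), len(kept)
```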
Data analysis
Load text files in a Pair RDD
/home/…/big-data/File1.txt
Visual analytics and data science are hot. As we create more data than ever before, we seek newer, better, faster ways to make sense of it and to know what actions we should take to improve
/home/…/big-data/File2.txt
Hey, Andy, check out this data I have. What's the best chart to show it?" I get asked this question a lot. My answer is always the same: It depends. This is partly because it depends …
/home/…/big-data/File3.txt
Nearly every CEO will tell you human talent is the reason their business is successful: A great sales hire can ….
/home/…/cloud-computing/File4.txt
Enterprise users who rely on the web interface for Microsoft Outlook email will be getting an upgrade with new features…
/home/…/cloud-computing/File5.txt
Immediately after Hurricane Sandy tore through New York City in October 2012, city officials needed a quick way to show the damage that had been done to streets and infrastructure….
/home/…/networking/File6.txt
As Google prepares to launch a subscription version of YouTube, the move has been endorsed by at least one interested party: YouTube cofounder Chad Hurley. YouTube has grown massively
/home/…/networking/File7.txt
While service providers such as AT&T, Sprint, and Verizon have long touted the capabilities of their networks, many of the large Web-scale providers are now openly disclosing their internally
Etc….
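In Spark, loading the corpus as (path, text) pairs is typically done with `sc.wholeTextFiles`, which yields exactly this pairing as an RDD; a plain-Python sketch of the same idea over the folder tree described earlier:

```python
from pathlib import Path

def load_pairs(top_folder):
    """Return (path, file contents) pairs for every .txt file under
    top_folder, mirroring wholeTextFiles' (filename, content) RDD."""
    return [(str(p), p.read_text(encoding="utf-8"))
            for p in sorted(Path(top_folder).rglob("*.txt"))]
```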
Data analysis
How much duplication is in the dataset?
- Extract the filename from the first column of the RDD
- Count the total number of entries (filenames): 608
- Count the number of UNIQUE entries: 534
We have 534 UNIQUE filenames over 608 entries.
- There is about 12% duplication in the dataset
- Doesn't sound too bad… (but there might be surprises!)
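The duplication check reduces to stripping paths down to bare filenames and comparing total vs. unique counts; a minimal sketch:

```python
import os

def duplication_ratio(paths):
    """Fraction of entries whose filename also appears elsewhere."""
    names = [os.path.basename(p) for p in paths]
    return (len(names) - len(set(names))) / len(names)
```

With the lab's counts, (608 − 534) / 608 is about 0.12, i.e. roughly 12% duplication.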
Data analysis
Distribution of files among classes (using Spark SQL)
… the distribution is not very even.
With 10 classes, an even distribution of files would be around 10% per class.
Let's keep this information handy for future analysis…
Data analysis
Tokenizing:
- How should you approach it? Use white space to split the sentences into words?
- Use some analysis to drive the prep!
- Split your text using white space and take a look at the results:
Data analysis
Tokenizing:
- We have about 370,000 tokens in the whole corpus
- We have about 35,000 unique tokens in the corpus
- A small sample of the tokens:
  • "Setup;"  "gigahertz)"  "park,"  "having)"  "signal."  "Ignore,"
Conclusion:
- Non-word characters such as punctuation and brackets are being attached to tokens, unnecessarily creating different unique tokens which shouldn't exist
- We will therefore try splitting on non-word characters…
Data analysis
Tokenizing using non-word characters:
- 380,000 total tokens
- 17,000 unique tokens (versus 35,000)
- Quality of tokens looks much higher!
Data analysis
Tokenizing final thoughts:
- The number of unique tokens can be further reduced using the data preparation techniques mentioned earlier:
  • Remove common stop words
  • Remove very rare tokens
  • Possibly perform stemming
- Instead of 17,000 unique tokens, you may end up with 10k to 12k or even fewer
- You are effectively reducing the dimensionality of your space
- This will help determine an appropriate vector size when using TF-IDF encoding
Data preparation / transformation
Tokenizing your text files would return the following:
big-data [visual, analytics, and, data, science, are, hot, as, we, create, ….]
big-data [hey, andy, check, out, this, data, i, have, what, the, best, ….]
big-data [nearly, every, ceo, will, tell, you, human, talent, is, the, ….]
cloud-computing [enterprise, users, who, rely, on, the, web, interface, for, microsoft, outlook, email,…]
cloud-computing [immediately, after, hurricane, sandy, tore, through, new, york, city, in, october, …]
networking [as, google, prepares, to, launch, a, subscription, version, of, youtube, ….]
networking [while, service, providers, such, as, at&t, sprint, and, verizon, have, long, …]
Etc…. […., ….., ……, ……, ……]
Data preparation / transformation
Term Frequency, Inverse Document Frequency (TF-IDF):
- Hashing with weights
- Suppose the text file is: "The little brown fox jumped over the lazy dog"
- After removing stop words: "little, brown, fox, jumped, lazy, dog"
(Figure: each token is hashed to a position of a vector of size N, positions 0, 1, 2, 3, …, 1000, …, 5000, …, N-1)
Data preparation / transformation
Term Frequency, Inverse Document Frequency (TF-IDF):
- Due to hashing, two different tokens could collide in the same position
- This is why it is important to size the vector properly: large enough to minimize collisions, given the dimensionality of the space
(Figure: the same vector of size N, with two different tokens possibly hashing to the same position)
Data preparation / transformation
Term Frequency, Inverse Document Frequency (TF-IDF):
- TF-IDF will actually put weights wherever there is a populated entry in the vector
- Term frequency alone is based on the frequency of a token (word) in the text. However, it is possible for a token to appear heavily in ONE text file but little (or not at all) in the rest of the corpus. Such a token may not be of great significance for classification purposes.
- The IDF transformation will "smooth out" the frequency numbers by applying a weight which takes into account the frequency of the token across the whole corpus
(Figure: weights W1, Wj, Wk, W1000, W5000, … at the populated positions of the vector of size N)
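A minimal sketch of the hashing trick plus IDF weighting (Spark's HashingTF and IDF do this at scale; `zlib.crc32` here is just a stable stand-in hash, and the smoothed IDF formula mirrors the common log((m+1)/(df+1)) form):

```python
import math
import zlib

def hash_tf(tokens, n):
    """Hashing trick: accumulate term counts at hashed positions
    of a fixed-size vector. Collisions are possible by design."""
    vec = [0.0] * n
    for t in tokens:
        vec[zlib.crc32(t.encode()) % n] += 1.0
    return vec

def idf_weights(doc_token_sets, vocab):
    """Smoothed inverse document frequency per token: tokens that
    appear in every document get weight 0."""
    n_docs = len(doc_token_sets)
    return {t: math.log((n_docs + 1) / (sum(t in d for d in doc_token_sets) + 1))
            for t in vocab}
```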
Data preparation / transformation

big-data [w1, 0, w2, 0, 0, 0, w3, w4, 0, 0,…… ….]
big-data [w1, w2, 0, 0, w5, 0, 0, 0, 0, w7, w8, ….]
big-data [w5, 0, w2, w100, 0, 0, 0, w89, 0, w34,..]
cloud-computing [w56, 0, 0, 0, 0, w143, 0, 0, w65, 0, 0, …]
cloud-computing [w99, 0, w55, w167, 0, 0, 0, w213, 0, 0…]
networking [w13, w33, 0, 0, 0, 0, 0, w45, w89, 0, 0…]
networking [0, 0, 0, 0, w465, 0, w765, w999, 0, 0…..]
Etc…. […., ….., ……, ……, ……]
Using TF-IDF
0 [w1, 0, w2, 0, 0, 0, w3, w4, 0, 0,…… ….]
0 [w1, w2, 0, 0, w5, 0, 0, 0, 0, w7, w8, ….]
0 [w5, 0, w2, w100, 0, 0, 0, w89, 0, w34,..]
1 [w56, 0, 0, 0, 0, w143, 0, 0, w65, 0, 0, …]
1 [w99, 0, w55, w167, 0, 0, 0, w213, 0, 0…]
2 [w13, w33, 0, 0, 0, 0, 0, w45, w89, 0, 0…]
2 [0, 0, 0, 0, w465, 0, w765, w999, 0, 0…..]
Etc… […., ….., ……, ……, ……]
Encoding classes
Data preparation / transformation
Naïve Bayes (in Spark MLlib) trains on a LabeledPoint data structure: a label plus a feature vector.
labeledPoint (0, [w1, 0, w2, 0, 0, 0, w3, w4, 0, 0,…… ….])
labeledPoint (0, [w1, w2, 0, 0, w5, 0, 0, 0, 0, w7, w8, ….])
labeledPoint (0, [w5, 0, w2, w100, 0, 0, 0, w89, 0, w34,..])
labeledPoint (1, [w56, 0, 0, 0, 0, w143, 0, 0, w65, 0, 0, …])
labeledPoint (1, [w99, 0, w55, w167, 0, 0, 0, w213, 0, 0…])
labeledPoint (2, [w13, w33, 0, 0, 0, 0, 0, w45, w89, 0, 0…])
labeledPoint (2, [0, 0, 0, 0, w465, 0, w765, w999, 0, 0…..])
Etc…
Build labeledPoints
Modeling
The training flow:
- The Training RDD of labeledPoints is used to train the Naïve Bayes model
- The trained model is then run against the Test RDD to produce predictions
- Comparing each test label with its prediction yields the model's accuracy

Label vs. Prediction on the test set:
  0 → 0
  0 → 0
  1 → 1
  1 → 3
  2 → 2
  Etc…

ACCURACY: X%

(The Training and Test RDDs hold the labeledPoints built on the previous slide.)
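The lab uses Spark MLlib's NaiveBayes; to make the underlying math concrete, here is a from-scratch multinomial Naïve Bayes sketch with Laplace smoothing (the toy documents and class names are illustrative):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naïve Bayes over token lists, with add-alpha smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.vocab = {t for d in docs for t in d}
        self.classes = sorted(set(labels))
        counts = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            counts[y].update(d)
        n = len(labels)
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.log_like = {}
        for c in self.classes:
            total = sum(counts[c].values()) + alpha * len(self.vocab)
            self.log_like[c] = {t: math.log((counts[c][t] + alpha) / total)
                                for t in self.vocab}
        return self

    def predict(self, doc):
        def score(c):
            return self.log_prior[c] + sum(self.log_like[c][t]
                                           for t in doc if t in self.vocab)
        return max(self.classes, key=score)
```

Training on a couple of toy "articles" per class is enough to see it pick the expected category for obvious inputs.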
(back to) Data analysis
Take a look at the confusion matrix
Quite clearly, not all classes performed the same…
For example, OS and Vertical-IT seem to be the best and the worst performers…
Remember this slide from early data analysis…
Distribution of files among classes (using Spark SQL)
Alright, so the amount of training data seems to matter (collect more?)
So are "emerging-technology" and "mobile-wireless" going to be number 2 and number 3 on the podium?
(back to) Data analysis
Write a couple of SQL statements to analyze:
─ Accuracy (and misses) for each class
─ Proportion of files in training & test sets, versus the overall data set
Further analysis
The overall amount of overlap in the whole dataset was reasonably low… but what about the amount of overlap for each class?