© 2016 IBM Corporation
Data Processing in Data Science
Mokhtar ([email protected]), Technical Sales Team, July 27, 2016
Data Science Methodology (John B. Rollins, [email protected])
- Business understanding
- Analytic approach
- Data requirements
- Data collection
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
- Feedback
Roles across the methodology: Data Scientist / Data Engineer, Statistician, Domain Expert
Agenda
Stepping through a data-centric view of the data science methodology
- Highly iterative process with three pillars:
  • Data requirements and collection
  • Data pre-processing
  • Data analysis
Example of text classification
- Naïve Bayes
- TF-IDF encoding
Data centric view of methodology
The methodology, seen as a data pipeline:
- Data req/collection (databases, data warehouses, Hadoop): Data sources → Data
- Data "pre"-processing (noisy/missing data, data corruption, dim. reduction): Data → Data ready
- Data analysis (analytical, statistical, visual): Data ready → Models
- Model/pattern evaluation (quantitative, qualitative): Models → "Good" model
- Model/pattern deployment (prediction, decision support): "Good" model applied to new data
Data Collection
Access data which can originate from one or several repositories:
- Relational RDBMS:
  • SQL, JDBC, Sqoop, Export, High Perf Unload
- Real-time streaming:
  • Kafka, Streams, custom code
  • RSS feeds
  • Etc…
- HDFS:
  • Files (CSV, Parquet, etc…)
  • Hive
  • HBase
- Others:
  • S3, AWS, etc…
Data Collection
Apache Spark can read a multitude of data sources.
- Extensive set of data source APIs
- Can help avoid costly operations:
  • Parsing text or JSON records
  • Deserializing data stored in binary formats
  • Converting Java objects to JSON/Parquet/Avro records
  • Etc…
(Figure: built-in data source support plus external connector libraries, and more…)
Data Collection
If relevant data sets reside in different data sources (e.g. legacy data on mainframes), it is important to decide whether to:
- Centralize the data in one single location (HDFS)
- Use a federation model approach

Centralized vs. Federated:
- Performance: data extraction is usually faster | subject to longer delays in data delivery
- Data availability: better availability | can be an issue if data cannot be queried all the time
- Implementation cost: the cost of building the centralized repository | the cost of leveraging federation
- Time constraints: is there time to build a centralized repository? | usually less time consuming
- Data quality: easier to ensure data quality, even if source quality is low | more subject to the quality of the data provider
Data centric view of methodology
(Recap: Data sources → Data req/collection → Data "pre"-processing → Data ready)
Data pre-processing (preparation)
Data preparation can be very time consuming depending on:
- The state of the original data
  • Data is typically collected in a "human"-friendly format
- The desired final state of the data (as required by the machine learning models and algorithms)
  • The desired final state is typically some "algorithm"-friendly format
- There may be a need for a (long) pipeline of transformations before the data is ready to be consumed by a model:
  • These transformations can be done manually (write code)
  • These transformations can be done through tools
Data pre-processing (preparation)
Data can be missing values:
- Blank fields
- Fields with dummy values (9999)
- Fields with "U" or "Unknown"
Data can be corrupt or incoherent:
- Data fields can be in the wrong place (strings where numbers are expected)
- Spurious "End of Line" characters can chop original lines of data into several lines and leave data fields in the wrong place
- Data entered in different formats: USA / US / United States
These aspects of data preparation are often referred to as data cleansing / wrangling.
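As a minimal sketch, cleansing rules like the ones above can be expressed as a small normalization function. The sentinel values and country aliases below are hypothetical; adapt them to the data at hand:

```python
# Hypothetical sentinel values and country spellings; adjust to your data.
MISSING_SENTINELS = {"", "U", "Unknown", "9999"}
COUNTRY_ALIASES = {"usa": "United States", "us": "United States",
                   "united states": "United States"}

def clean_field(value):
    """Map blanks and dummy values to None; normalize known aliases."""
    v = value.strip()
    if v in MISSING_SENTINELS:
        return None
    return COUNTRY_ALIASES.get(v.lower(), v)

def clean_record(raw_line):
    """Split a CSV-ish line and clean each field."""
    return [clean_field(f) for f in raw_line.split(",")]
```

For example, `clean_record("US,9999,Unknown")` yields `["United States", None, None]`.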
Data pre-processing (preparation)
Data dimensionality may need to be reduced:
- The idea behind reducing data dimensionality is that raw data tends to have two subcomponents:
  • "Useful features" (aka structure)
  • Noise (random and irrelevant)
  • Extracting the structure makes for better models
- Examples of applications of dimensionality reduction:
  • Extracting the important features in face/pattern recognition
  • Removing stop words when working on text classification
  • Stemming: fishing, fished, fisher → fish
- Example methods of dimensionality reduction:
  • Principal Component Analysis
  • Singular Value Decomposition
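For the text case, here is a minimal sketch of stop-word removal plus a deliberately crude suffix-stripping "stemmer" (real work would use a proper stemmer such as Porter's; the word and suffix lists are purely illustrative):

```python
# Illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "an", "of", "for", "and", "a"}

def crude_stem(token):
    """Strip a few common suffixes: fishing, fished, fisher -> fish."""
    for suffix in ("ing", "ed", "er"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def reduce_tokens(tokens):
    """Drop stop words, then stem: a simple dimensionality reduction."""
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```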
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Tokenizing (typical in text processing)
- Vectorizing (several algorithms in Spark MLlib require this)
  • Transform data into Vector arrays
  • Can be done manually (write Python or Scala code)
  • Can be done using tools (VectorAssembler in the new ML package)
  • TF-IDF (in text processing)
  • Word2Vec
- Bucketizing
  • Transform a range of continuous values into a set of buckets
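A minimal bucketizing sketch (the split points are illustrative; Spark's Bucketizer is the tool you would use on DataFrame columns):

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Assign a continuous value to a bucket index using sorted splits.
    Values below the first split fall in bucket 0; values at or above
    the last split fall in the final bucket (half-open ranges)."""
    return bisect_right(splits, value)

# Buckets: (-inf, 0), [0, 10), [10, 100), [100, +inf)
splits = [0.0, 10.0, 100.0]
```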
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Standardization
  • Transform data to a set of Vectors with zero mean and unit standard deviation
  • Linear Regression with SGD in Spark MLlib requires this
(Figure: house price vs. square footage, before and after standardizing square footage)
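A minimal standardization sketch: rescale a feature column to zero mean and unit standard deviation (at scale, MLlib's StandardScaler does this for you):

```python
from statistics import mean, stdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the sample stdev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```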
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Normalization
  • Transform data so that each Vector has a unit norm
- Categorical values need to be converted to numbers
  • This is required by Spark MLlib classification trees
  • Marital Status: {"Widowed", "Married", "Divorced", "Single"}
  • Marital Status: {0, 1, 2, 3}
  • You cannot do this if the algorithm could infer something like Single = 3 × Married
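Minimal sketches of the two transformations above (MLlib's Normalizer and StringIndexer are the real tools; the category list is just the slide's example):

```python
from math import sqrt

def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def index_categories(values):
    """Map each distinct category to a small integer, in first-seen order."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return mapping
```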
Data pre-processing (preparation)
Data may need to be transformed to match the algorithms' requirements:
- Dummy encoding
  • When categorical values cannot safely be converted to consecutive numbers
  • Marital Status: {"Single", "Married", "Divorced", "Widowed"}
  • Marital Status: {"0001", "0010", "0100", "1000"}
  • This is necessary if the algorithm could make some wrong inference from the number-based categorical encoding:
    Single = 3, Married = 2, Divorced = 1, Widowed = 0
    Single = Married + Divorced, or Single = Divorced x 3
  • (This is a contrived example, but you get the idea; replace marital status with colors…)
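A minimal one-hot (dummy) encoding sketch for the marital-status example; each category becomes its own binary indicator, so no arithmetic relation between categories can be inferred:

```python
def one_hot(value, categories):
    """Return a binary indicator vector with a single 1 at the
    position of the matching category."""
    return [1 if value == c else 0 for c in categories]

categories = ["Single", "Married", "Divorced", "Widowed"]
```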
Data centric view of methodology
(Recap: Data sources → req/collection → "pre"-processing → Data ready → analysis → Models)
Data analysis:
The data analysis phase is very likely to be intertwined with the data preparation phase:
- It is not uncommon to use data analysis results to refine or drive the data preparation
  • In text classification: group the tokens at hand and calculate the counts in each group
  • You most likely want to remove the most common and the least common tokens as part of the analysis phase
The data analysis phase should use:
- Mathematical tools (statistics, correlations, Chi-square test of independence, etc…)
- Visualizations
Data analysis: visualizations
Anscombe's quartet
- The four datasets have nearly identical statistical properties (mean, variance, correlation), yet the differences are striking when looking at a simple visualization
- The four data sets share these statistical properties:
  • The mean of x is 9
  • The variance of x is 11
  • The mean of y is approx. 7.50
  • The variance of y is approx. 4.12
  • The correlation is 0.816
  • The linear regression lines are approx. y = 3.00 + 0.500x
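These numbers are easy to verify with the Python stdlib; here is dataset I of Anscombe's quartet (the published values), with a hand-rolled correlation to avoid depending on newer stdlib additions:

```python
from statistics import mean, variance, stdev

# Anscombe's quartet, dataset I (published values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def correlation(xs, ys):
    """Pearson correlation via sample covariance over the stdevs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))
```

`mean(x)` is 9, `variance(x)` is 11, and `correlation(x, y)` is approximately 0.816, matching the summary above; only plotting reveals how differently the four datasets behave.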
Data analysis: visualizations
Ocean temperature and salinity data in an archipelago in the area of Vancouver (British Columbia, Canada)
- Goal: show the added value of tools such as notebooks and how scientists can get quick insights
- Answer questions such as: identify the various regimes of temperature and salinity in the archipelago
Data analysis: visualizations
Verifying the geographical positioning of the archipelago on a Python map in the Zeppelin Notebook…
Data analysis: visualizations Flipping the sign of longitude values…
Data analysis: visualizations Summer surface (depth < 10 meters) average temperature
evolution in the archipelago…
Data analysis: visualizations Summer surface (depth < 10 meters) average temperature
evolution in the archipelago + average salinity…
Data analysis:
The data analysis phase can GUIDE you towards better and more relevant modeling:
- When you are building a classifier, it is important to understand the PREVALENCE of the condition that you are building a model for, i.e. how common or uncommon this condition effectively is…
- Imagine you are working towards building a classifier for some medical condition, and your training and testing data sets yield the following model:

                   Test positive        Test negative
Disease (100)      95 (True Positive)   5 (False Negative)
Normal (100)       5 (False Positive)   95 (True Negative)

With 95% sensitivity & specificity, this sounds like a great test…
Data analysis:
What truly matters to the users of your new model / test (doctors, bankers, practitioners) is the PREDICTIVE VALUE of the test:
- If the test is positive, then what is the actual chance of being sick?
- If the test is positive, then what is the actual chance of defaulting on the loan?
- Is it 95%?
Let's run the test on a population of 1,000,000 where 1% of individuals (10,000) actually suffer from this condition:

                     Test positive                Test negative
Disease (10,000)     9,500 (95% True Positive)    500 (5% False Negative)
Normal (990,000)     49,500 (5% False Positive)   940,500 (95% True Negative)
Data analysis:
Probability of being sick if the test is positive:
- (# of people who are sick AND test positive) / (# of positive tests)
- 9,500 / (9,500 + 49,500) ≈ 16.1%
- What is happening here:
  • The condition is RARE, and the 5% FALSE POSITIVES still far outnumber the true positives.
Data analysis of the prevalence of the condition tells us that a test with 99% or higher sensitivity / specificity would be needed.
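The arithmetic above is a direct application of Bayes' rule; a minimal sketch of the positive predictive value as a function of sensitivity, specificity, and prevalence:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value: P(sick | test positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)
```

`ppv(0.95, 0.95, 0.01)` reproduces the 1,000,000-person example: 9,500 / 59,000, roughly 16%; pushing both sensitivity and specificity to 0.99 raises it to 50% at this prevalence.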
Text Classification & Analytics Deep Dive: Naïve Bayes classifier
Text Classification Deep Dive Example:
We will start with a set of text documents (files => Corpus) representing news articles (in the area of Information Technology) already categorized in classes (big data, operating systems, cloud computing, etc…)
The set of documents is split in two parts (~65% / ~35%):
- A training set. Given the text documents in this set, we will prepare the data and then train a Naïve Bayes classification model, which will learn how to associate the "features" of a text with the class to which it is assigned.
- A test set. Using the test set, we will verify the accuracy of the model trained above.
Data centric view of methodology
(Recap: Data sources → Data req/collection: databases, data warehouses, Hadoop)
Data requirements/collection:
Requirements:
─ Collect a sufficient number of pre-classified articles
─ Split the set of files into training and test sets (using the aforementioned proportions)
─ There should not be too much overlap of files between the categories
─ The files will be organized into 10 sub-folders (big-data, networking, operating-systems, etc...)
─ Tooling: work with Apache Spark
Collection:
─ The files are collected (from the feed) in HTML format
─ Write a Java program to read the RSS feed (provided as part of the lab)
─ Run the Java program over enough time and collect a few hundred documents
─ Organize the collected HTML files, as mentioned above, into 10 sub-folders where each sub-folder corresponds to one category in the classification scheme
Data requirements/collection:
At the conclusion of the collection process, you should have:
Top Folder:
─ big-data
  |-- File1.html
  |-- File2.html
  |-- File3.html
─ cloud-computing
  |-- File4.html
  |-- File5.html
  |-- File6.html
─ networking
  |-- File7.html
  |-- File8.html
  |-- File2.html
Data centric view of methodology
(Recap: Data sources → Data req/collection → Data "pre"-processing → Data ready)
Data pre-processing / preparation
Data understanding / preparation:
- Extract the TEXT portions from the HTML documents which were collected
  • HTML documents contain a lot of information (tags, metadata, etc…) which is irrelevant to the target classification
- Write some code to parse the HTML documents and extract the relevant text portions (in the context of the lab, this is provided)
- A simple script has to be written to invoke the above program logic and create a "mirror" set of folders and sub-folders where all files are pure text
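The lab provides its own parser; as a minimal stdlib-only sketch of the idea, an `html.parser` subclass can collect text nodes while skipping script/style content:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts, self._skip_depth = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```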
Data pre-processing / preparation
At the end of the text extraction phase you should have:
Top Folder:
─ big-data
  |-- File1.txt
  |-- File2.txt
  |-- File3.txt
─ cloud-computing
  |-- File4.txt
  |-- File5.txt
  |-- File6.txt
─ networking
  |-- File7.txt
  |-- File8.txt
  |-- File2.txt
─ etc…
Data centric view of methodology
(Recap: Data sources → req/collection → "pre"-processing → Data ready → analysis → Models)
Data analysis
Examine the distribution of files among classes
Examine the amount of overlap of text files between categories
─ Do we have a large proportion of files which belong to more than one category? (this can hurt Naïve Bayes classification accuracy)
Find out how many tokens compose the given corpus
Find out how many UNIQUE tokens compose the given corpus (all lowercase)
Eliminate "useless" tokens such as common stop words
─ "The", "An", "Of", "For"
Eliminate very rare words
Estimate the dimensionality of the problem → the size of vectors to use
Perform stemming and check the new dimensionality
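The token-counting steps of this checklist can be sketched in a few lines (the stop-word list and `min_count` threshold are illustrative):

```python
from collections import Counter

# Illustrative stop-word list; extend as the analysis suggests.
STOP_WORDS = {"the", "an", "of", "for"}

def corpus_stats(docs, min_count=2):
    """Return (total tokens, unique tokens, vocabulary size after
    dropping stop words and tokens rarer than min_count)."""
    counts = Counter(t.lower() for doc in docs for t in doc)
    total = sum(counts.values())
    kept = {t for t, c in counts.items()
            if t not in STOP_WORDS and c >= min_count}
    return total, len(counts), len(kept)
```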
Data analysis
Load text files in a Pair RDD
/home/…/big-data/File1.txt
Visual analytics and data science are hot. As we create more data than ever before, we seek newer, better, faster ways to make sense of it and to know what actions we should take to improve
/home/…/big-data/File2.txt
Hey, Andy, check out this data I have. What's the best chart to show it?" I get asked this question a lot. My answer is always the same: It depends. This is partly because it depends …
/home/…/big-data/File3.txt
Nearly every CEO will tell you human talent is the reason their business is successful: A great sales hire can ….
/home/…/cloud-computing/File4.txt
Enterprise users who rely on the web interface for Microsoft Outlook email will be getting an upgrade with new features…
/home/…/cloud-computing/File5.txt
Immediately after Hurricane Sandy tore through New York City in October 2012, city officials needed a quick way to show the damage that had been done to streets and infrastructure….
/home/…/networking/File6.txt
As Google prepares to launch a subscription version of YouTube, the move has been endorsed by at least one interested party: YouTube cofounder Chad Hurley. YouTube has grown massively
/home/…/networking/File7.txt
While service providers such as AT&T, Sprint, and Verizon have long touted the capabilities of their networks, many of the large Web-scale providers are now openly disclosing their internally
Etc….
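In Spark, loading the corpus as (path, text) pairs is typically done with `sc.wholeTextFiles`, which yields exactly this pairing as an RDD; a plain-Python sketch of the same idea over the folder tree described earlier:

```python
from pathlib import Path

def load_pairs(top_folder):
    """Return (path, file contents) pairs for every .txt file under
    top_folder, mirroring wholeTextFiles' (filename, content) RDD."""
    return [(str(p), p.read_text(encoding="utf-8"))
            for p in sorted(Path(top_folder).rglob("*.txt"))]
```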
Data analysis
How much duplication is in the dataset?
- Extract the filename from the first column of the RDD
- Count the total number of entries (filenames): 608
- Count the number of UNIQUE entries: 534
We have 534 UNIQUE filenames over 608 entries.
- There is about 12% duplication in the dataset
- Doesn't sound too bad… (but there might be surprises!)
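The duplication check reduces to stripping paths down to bare filenames and comparing total vs. unique counts; a minimal sketch:

```python
import os

def duplication_ratio(paths):
    """Fraction of entries whose filename also appears elsewhere."""
    names = [os.path.basename(p) for p in paths]
    return (len(names) - len(set(names))) / len(names)
```

With the lab's counts, (608 − 534) / 608 is about 0.12, i.e. roughly 12% duplication.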
Data analysis
Distribution of files among classes (using Spark SQL)
… the distribution is not very even.
With 10 classes, an even distribution of files would be around 10% per class.
Let's keep this information handy for future analysis…
Data analysis
Tokenizing:
- How should you approach it? Use white space to split the sentences into words?
- Use some analysis to drive the prep!
- Split your text using white space and take a look at the results:
Data analysis
Tokenizing:
- We have about 370,000 tokens in the whole corpus
- We have about 35,000 unique tokens in the corpus
- A small sample of the tokens:
  • "Setup;"  "gigahertz)"  "park,"  "having)"  "signal."  "Ignore,"
Conclusion:
- Non-word characters such as punctuation and brackets are being attached to tokens, unnecessarily creating different unique tokens which shouldn't exist
- We will therefore try splitting on non-word characters…
Data analysis
Tokenizing using non-word characters:
- 380,000 total tokens
- 17,000 unique tokens (versus 35,000)
- Quality of tokens looks much higher!
Data analysis
Tokenizing final thoughts:
- The number of unique tokens can be further reduced using the data preparation techniques mentioned earlier:
  • Remove common stop words
  • Remove very rare tokens
  • Possibly perform stemming
- Instead of 17,000 unique tokens, you may end up with 10k to 12k or even fewer
- You are effectively reducing the dimensionality of your space
- This will help determine an appropriate vector size when using TF-IDF encoding
Data preparation / transformation
Tokenizing your text files would return the following:
big-data [visual, analytics, and, data, science, are, hot, as, we, create, ….]
big-data [hey, andy, check, out, this, data, i, have, what, the, best, ….]
big-data [nearly, every, ceo, will, tell, you, human, talent, is, the, ….]
cloud-computing [enterprise, users, who, rely, on, the, web, interface, for, microsoft, outlook, email,…]
cloud-computing [immediately, after, hurricane, sandy, tore, through, new, york, city, in, october, …]
networking [as, google, prepares, to, launch, a, subscription, version, of, youtube, ….]
networking [while, service, providers, such, as, at&t, sprint, and, verizon, have, long, …]
Etc…. […., ….., ……, ……, ……]
Data preparation / transformation
Term Frequency, Inverse Document Frequency (TF-IDF):
- Hashing with weights
- Suppose the text file is: "The little brown fox jumped over the lazy dog"
- After removing stop words: "little, brown, fox, jumped, lazy, dog"
(Figure: each token is hashed to a position of a vector of size N, positions 0, 1, 2, 3, …, 1000, …, 5000, …, N-1)
Data preparation / transformation
Term Frequency, Inverse Document Frequency (TF-IDF):
- Due to hashing, two different tokens could collide in the same position
- This is why it is important to size the vector properly: large enough to minimize collisions, given the dimensionality of the space
(Figure: the same vector of size N, with two different tokens possibly hashing to the same position)
Data preparation / transformation
Term Frequency, Inverse Document Frequency (TF-IDF):
- TF-IDF will actually put weights wherever there is a populated entry in the vector
- Term frequency alone is based on the frequency of a token (word) in the text. However, it is possible for a token to appear heavily in ONE text file but little (or not at all) in the rest of the corpus. Such a token may not be of great significance for classification purposes.
- The IDF transformation will "smooth out" the frequency numbers by applying a weight which takes into account the frequency of the token across the whole corpus
(Figure: weights W1, Wj, Wk, W1000, W5000, … at the populated positions of the vector of size N)
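A minimal sketch of the hashing trick plus IDF weighting (Spark's HashingTF and IDF do this at scale; `zlib.crc32` here is just a stable stand-in hash, and the smoothed IDF formula mirrors the common log((m+1)/(df+1)) form):

```python
import math
import zlib

def hash_tf(tokens, n):
    """Hashing trick: accumulate term counts at hashed positions
    of a fixed-size vector. Collisions are possible by design."""
    vec = [0.0] * n
    for t in tokens:
        vec[zlib.crc32(t.encode()) % n] += 1.0
    return vec

def idf_weights(doc_token_sets, vocab):
    """Smoothed inverse document frequency per token: tokens that
    appear in every document get weight 0."""
    n_docs = len(doc_token_sets)
    return {t: math.log((n_docs + 1) / (sum(t in d for d in doc_token_sets) + 1))
            for t in vocab}
```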
Data preparation / transformation

big-data [w1, 0, w2, 0, 0, 0, w3, w4, 0, 0,…… ….]
big-data [w1, w2, 0, 0, w5, 0, 0, 0, 0, w7, w8, ….]
big-data [w5, 0, w2, w100, 0, 0, 0, w89, 0, w34,..]
cloud-computing [w56, 0, 0, 0, 0, w143, 0, 0, w65, 0, 0, …]
cloud-computing [w99, 0, w55, w167, 0, 0, 0, w213, 0, 0…]
networking [w13, w33, 0, 0, 0, 0, 0, w45, w89, 0, 0…]
networking [0, 0, 0, 0, w465, 0, w765, w999, 0, 0…..]
Etc…. […., ….., ……, ……, ……]
Using TF-IDF
0 [w1, 0, w2, 0, 0, 0, w3, w4, 0, 0,…… ….]
0 [w1, w2, 0, 0, w5, 0, 0, 0, 0, w7, w8, ….]
0 [w5, 0, w2, w100, 0, 0, 0, w89, 0, w34,..]
1 [w56, 0, 0, 0, 0, w143, 0, 0, w65, 0, 0, …]
1 [w99, 0, w55, w167, 0, 0, 0, w213, 0, 0…]
2 [w13, w33, 0, 0, 0, 0, 0, w45, w89, 0, 0…]
2 [0, 0, 0, 0, w465, 0, w765, w999, 0, 0…..]
Etc… […., ….., ……, ……, ……]
Encoding classes
Data preparation / transformation
Naïve Bayes (in Spark MLlib) trains on a LabeledPoint data structure: a label plus a feature vector.
labeledPoint (0, [w1, 0, w2, 0, 0, 0, w3, w4, 0, 0,…… ….])
labeledPoint (0, [w1, w2, 0, 0, w5, 0, 0, 0, 0, w7, w8, ….])
labeledPoint (0, [w5, 0, w2, w100, 0, 0, 0, w89, 0, w34,..])
labeledPoint (1, [w56, 0, 0, 0, 0, w143, 0, 0, w65, 0, 0, …])
labeledPoint (1, [w99, 0, w55, w167, 0, 0, 0, w213, 0, 0…])
labeledPoint (2, [w13, w33, 0, 0, 0, 0, 0, w45, w89, 0, 0…])
labeledPoint (2, [0, 0, 0, 0, w465, 0, w765, w999, 0, 0…..])
Etc…
Build labeledPoints
Modeling
The training flow:
- The Training RDD of labeledPoints is used to train the Naïve Bayes model
- The trained model is then run against the Test RDD to produce predictions
- Comparing each test label with its prediction yields the model's accuracy

Label vs. Prediction on the test set:
  0 → 0
  0 → 0
  1 → 1
  1 → 3
  2 → 2
  Etc…

ACCURACY: X%

(The Training and Test RDDs hold the labeledPoints built on the previous slide.)
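The lab uses Spark MLlib's NaiveBayes; to make the underlying math concrete, here is a from-scratch multinomial Naïve Bayes sketch with Laplace smoothing (the toy documents and class names are illustrative):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naïve Bayes over token lists, with add-alpha smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.vocab = {t for d in docs for t in d}
        self.classes = sorted(set(labels))
        counts = {c: Counter() for c in self.classes}
        for d, y in zip(docs, labels):
            counts[y].update(d)
        n = len(labels)
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.log_like = {}
        for c in self.classes:
            total = sum(counts[c].values()) + alpha * len(self.vocab)
            self.log_like[c] = {t: math.log((counts[c][t] + alpha) / total)
                                for t in self.vocab}
        return self

    def predict(self, doc):
        def score(c):
            return self.log_prior[c] + sum(self.log_like[c][t]
                                           for t in doc if t in self.vocab)
        return max(self.classes, key=score)
```

Training on a couple of toy "articles" per class is enough to see it pick the expected category for obvious inputs.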
(back to) Data analysis
Take a look at the confusion matrix
Quite clearly, not all classes performed the same…
For example, OS and Vertical-IT seem to be the best and the worst performers…
Remember this slide from early data analysis…
Distribution of files among classes (using Spark SQL)
Alright, so the amount of training data seems to matter (collect more?)
So are "emerging-technology" and "mobile-wireless" going to be number 2 and number 3 on the podium?
(back to) Data analysis
Write a couple of SQL statements to analyze:
─ Accuracy (and misses) for each class
─ Proportion of files in training & test sets, versus the overall data set
Further analysis
The overall amount of overlap in the whole dataset was reasonably low… but what about the amount of overlap for each class?