DESCRIPTION
Slide deck from our seminar about Machine Learning (07/11/2014). Topics covered:
• What is Machine Learning?
• Techniques (clustering, classification, ...)
• Tools (Mahout, R, Spark MLlib, Weka, ...)
• Practical examples of Machine Learning applications
• How to embed Machine Learning in software development
• Demos
Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be
Data Science Company
Machine Learning in Practice
An InfoFarm Seminar
Data Science
Big Data
Identifying, extracting and using data of all types
and origins; exploring, correlating and using it in new
and innovative ways in order to extract meaning
and business value from it.
• 2 Data Scientists
• 4 Big Data Consultants
• 1 Infrastructure Specialist
Java
PHP
E-Commerce
Mobile
Web
Development
Agenda
• 13:00 What is Machine Learning?
• 13:30 Techniques
• 14:30 Tools
• 15:00 Practical examples
• 16:00 Wrap up
What is Machine Learning?
Magic?
Machine Learning is a subfield of
computer science and statistics that deals
with systems that can learn from data,
instead of following explicitly programmed
instructions.
Machine Learning vs Data Science vs Big Data
• You don’t need Big Data to leverage the
benefits of machine learning, but more
training data generally makes a better model
• Data Science can help you to get the most
out of Machine Learning
• Machine Learning can help you to get the
most out of Data Science
Terminology
Weight (g)  Wingspan (cm)  Webbed feet?  Back color  Species
1000.1      125.0          No            Brown       Buteo jamaicensis
3000.7      200.0          No            Gray        Sagittarius serpentarius
3300.0      220.3          No            Gray        Sagittarius serpentarius
4100.0      136.0          Yes           Black       Gavia immer
3.0         11.0           No            Green       Calothorax lucifer
570.0       75.0           No            Black       Campephilus principalis
• Features / attributes: the measured columns (weight, wingspan, webbed feet, back color)
• Instance / data point: one row of the table (one bird)
• Label / target variable: the column to predict (species)
• Categorical (factor) versus numeric versus binary data: back color is categorical, weight and wingspan are numeric, webbed feet is binary
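To make the terminology concrete, here is a minimal sketch of the bird table as plain Python data. The names `FEATURE_NAMES`, `instances`, `labels` and `feature_kind` are illustrative assumptions and do not come from the seminar material.

```python
# Each feature (attribute) is one column of the table.
FEATURE_NAMES = ["weight_g", "wingspan_cm", "webbed_feet", "back_color"]

# Each instance (data point) is one row: one observed bird.
instances = [
    (1000.1, 125.0, False, "Brown"),
    (3000.7, 200.0, False, "Gray"),
    (3300.0, 220.3, False, "Gray"),
    (4100.0, 136.0, True, "Black"),
    (3.0, 11.0, False, "Green"),
    (570.0, 75.0, False, "Black"),
]

# The label (target variable) is the value we want to predict: the species.
labels = [
    "Buteo jamaicensis",
    "Sagittarius serpentarius",
    "Sagittarius serpentarius",
    "Gavia immer",
    "Calothorax lucifer",
    "Campephilus principalis",
]


def feature_kind(value):
    """Classify a feature value as binary, numeric or categorical."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "binary"
    if isinstance(value, (int, float)):
        return "numeric"
    return "categorical"


kinds = dict(zip(FEATURE_NAMES, (feature_kind(v) for v in instances[0])))
```

Most algorithms expect numbers, so categorical and binary features usually have to be encoded numerically before training.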
Learning
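• Supervised Learning: learn from instances whose label is known, then predict the label of new instances
• Unsupervised Learning: find structure in instances that have no label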
Techniques
Machine Learning
• Classification
• Clustering
• Association Rules
• Regression
• Information extraction
Classification
• Predict a category for a given instance
• Mostly supervised learning.
• Algorithms
– Naïve Bayes
– Support Vector Machine
– Decision Trees
– Neural Networks
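As an illustration of supervised classification, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with Laplace smoothing, written with the standard library only. The training mails, the categories and the function names are hypothetical; a real project would more likely rely on one of the libraries discussed under Tools.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Count documents per category and words per category.

    documents: list of (text, category) pairs.
    """
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in documents:
        category_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    category_counts, word_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, float("-inf")
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen during training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            likelihood = (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled training data.
training_mails = [
    ("invoice payment overdue", "finance"),
    ("payment received for invoice", "finance"),
    ("password reset not working", "support"),
    ("cannot log in reset password", "support"),
]
model = train_naive_bayes(training_mails)
predicted = classify(model, "question about my invoice")
```

With this toy data, a mail mentioning an invoice is assigned to the finance category because "invoice" only occurs in finance training mails.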
Classification: Use Cases
• Incoming mail redirection
• Sentiment analysis
• Order picking optimization
Clustering
• Find groups (clusters) of similar instances in unlabeled data
• Unsupervised learning
• Algorithms: K-Means
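A minimal sketch of K-Means in plain Python may help to show what the algorithm does: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The customer data and the function name `k_means` are assumptions made for illustration.

```python
import random


def k_means(points, k, max_iterations=20, seed=0):
    """Cluster points (tuples of numbers) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (orders per year, average basket value).
customers = [(1, 10), (2, 12), (1, 11), (20, 200), (22, 210), (21, 190)]
centroids, clusters = k_means(customers, k=2)
```

On this well-separated toy data the occasional and the frequent customers end up in separate clusters. Real data usually needs feature scaling first, and the result can depend on the initial centroids.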
Clustering: Use cases
• Customer profiling
• Grouping of shopping items
• Recommendation systems
• Fraud detection
Association Rule Learning
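• Find interesting relations
• Find frequently occurring patterns
• Algorithms
– Apriori
– Singular Value Decomposition (a matrix factorization technique, typically used for recommendations rather than for mining rules)
– FP-growth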
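The core idea behind Apriori can be sketched in a few lines: only items that are frequent on their own can be part of a frequent pair. The sketch below covers just the first two levels (single items and pairs) and is not a full Apriori implementation; the shopping baskets and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs that meet the minimum support."""
    n = len(transactions)
    # Level 1: count single items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {item for item, count in item_counts.items() if count / n >= min_support}
    # Level 2: only build candidate pairs from frequent single items.
    pair_counts = Counter()
    for t in transactions:
        candidates = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(candidates, 2))
    return {pair: count / n for pair, count in pair_counts.items() if count / n >= min_support}


def rule_confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)


# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
confidence = rule_confidence(baskets, "bread", "butter")
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain butter, which is the kind of relation a recommendation feature could use.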
Association Rule Learning: Use Cases
• Recommendations
• Data exploration
• Find connections between unrelated
events
• Frequent pattern mining
Regression
• Prediction of a quantity
• Algorithms:
– Linear regression
– Logistic regression (predicts a probability, so in practice it is mostly used for classification)
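Predicting a quantity with linear regression can be sketched as an ordinary least squares fit of a straight line. The weekly order numbers below are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: order quantity per week.
weeks = [1, 2, 3, 4, 5]
orders = [12, 14, 16, 18, 20]

slope, intercept = fit_line(weeks, orders)
forecast_week_6 = slope * 6 + intercept  # predicted quantity for the next week
```

The fitted slope is the estimated trend per week, which is also what a trend estimation use case would report.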
Regression: Use Cases
• Order Quantity Prediction
• Lag analysis
• Trend estimation
Information Extraction
• Extract variables out of unstructured data
like text.
• Named Entity Extraction
Tools
Apache Mahout
Pro
• Relatively stable
• Built on Hadoop, scales well
• Command-line access for most algorithms
• All important algorithms are available
Contra
• Poor documentation
• Mahout is currently migrating from Apache Hadoop to Apache Spark. Development is slow, and Apache Spark already has a machine learning library of its own. Instant legacy?
• Kind of slow for smaller use cases
Weka
Pro
• A lot of algorithms are available
• Graphical user interface for prototyping and experimenting
• Available as a Java library
Contra
• Not ‘Big Data’ ready
• Requires a custom data format: ARFF files
• Optimized for academic use cases
Apache Spark: MLlib
Pro
• Based on Apache Spark: very fast and scalable
• Very fast development cycle, new features are rolling out every couple of months
• New and refreshing API, easy integration with other components of Apache Spark
Contra
• Based on Apache Spark: requires knowledge of Spark and Scala
• Relatively new, so a small choice of algorithms, but the essential ones are there
R
Pro
• A lot of algorithms are available
• Well documented
• Lots of existing packages that are easily available
Contra
• Can run on Hadoop/Spark, but requires a lot of knowledge of both platforms
• Must learn a new language
Noteworthy
Java
• DeepLearning4J
• Mallet
• MOA
Python
• NLTK
• Theano
• PyBrain
• SciKit-Learn
Lua
• Torch
General
• LibSVM
• LibLinear
Integration with Software Development
Development Cycle
Collect → Analyze → Extract → Train → Test → Use
Feature extraction
• Describe an instance to be
used in an algorithm
• Recognize hand-written digits
by converting the images to
lines of 1’s and 0’s
0000000000000100000000000000000000000000000001110000000000000000000000000000111100000000000000000000000000111110000000000000000000000000001111000000000000000000000000000001111000000000000000000000000000111110000000000000000000000000011111000000000000000000000000000111100000000000000000000000000011111000000000000000000000000000011111000000000000000000000000001111110000000000000000000000000011111000000000000000000000000000111100000000000000000000000000000111100000000000000000000000000011111000011100000000000000000001111111111111111100000000000000011111111111111111100000000000000111111111111111111000000000000000111111111111111111100000000000011111111100000111111000000000000111110000000000011110000000000000111100000000000111100000000000000111100000000000111100000000000001111100000000001111000000000000011111100000001111110000000000000111111111111111111100000000000001111111111111111110000000000000000111111111111111100000000000000000111111111111111000000000000000000011111111100000000000000000000000001111110000000000
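The conversion from a black-and-white image to one long line of 1’s and 0’s can be sketched as follows. The 4x4 image and the function name `image_to_vector` are illustrative assumptions; the image is given as rows of "0"/"1" characters.

```python
def image_to_vector(rows):
    """Flatten rows of '0'/'1' characters into one list of integers."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    return [int(pixel) for row in rows for pixel in row]


# Hypothetical 4x4 image of the digit 0.
tiny_image = [
    "0110",
    "1001",
    "1001",
    "0110",
]
vector = image_to_vector(tiny_image)
```

Every image then becomes an instance with one binary feature per pixel, which is a form that most classification algorithms can work with.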
Training an algorithm
1. Collect your data as a collection of instances
2. Split your data set into a training set and a testing set
3. Train the algorithm with the training set
4. Validate the results using the test set
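The four steps can be sketched in plain Python. The data set, the deliberately trivial threshold "algorithm" and all function names below are assumptions chosen to keep the example short; they only illustrate the collect, split, train and validate flow.

```python
import random


def split_data(instances, labels, test_fraction=0.25, seed=42):
    """Shuffle the data and split it into a training set and a test set."""
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    test_size = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return (
        [instances[i] for i in train_idx], [labels[i] for i in train_idx],
        [instances[i] for i in test_idx], [labels[i] for i in test_idx],
    )


def train_threshold(train_x, train_y):
    """Toy algorithm: put a threshold halfway between the two class means."""
    small = [x for x, y in zip(train_x, train_y) if y == "small"]
    large = [x for x, y in zip(train_x, train_y) if y == "large"]
    threshold = (sum(small) / len(small) + sum(large) / len(large)) / 2
    return lambda x: "large" if x > threshold else "small"


def accuracy(predict, test_x, test_y):
    """Fraction of test instances that are predicted correctly."""
    hits = sum(1 for x, y in zip(test_x, test_y) if predict(x) == y)
    return hits / len(test_y)


# 1. Collect the data as instances with labels (hypothetical values).
values = list(range(1, 11)) + list(range(101, 111))
labels = ["small"] * 10 + ["large"] * 10
# 2. Split into a training set and a test set.
train_x, train_y, test_x, test_y = split_data(values, labels)
# 3. Train on the training set only.
predict = train_threshold(train_x, train_y)
# 4. Validate on the test set, which the algorithm has never seen.
score = accuracy(predict, test_x, test_y)
```

Keeping the test set out of training is what makes the score an honest estimate of how the model behaves on new data.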
Runtime model
• During training most algorithms generate a
mathematical model that is used at runtime.
• The model should be retrained on a regular
basis as new data comes in
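One possible way to handle such a model in an application is to store its parameters together with the training date, load them at runtime, and check their age. The sketch below assumes a simple linear model with a slope and an intercept; the JSON layout, the 30-day limit and the function names are hypothetical.

```python
import json
from datetime import date


def export_model(slope, intercept, trained_on):
    """Serialize the trained parameters so the application can load them later."""
    return json.dumps({
        "slope": slope,
        "intercept": intercept,
        "trained_on": trained_on.isoformat(),
    })


def load_model(serialized):
    """Restore the parameters produced by export_model."""
    model = json.loads(serialized)
    model["trained_on"] = date.fromisoformat(model["trained_on"])
    return model


def predict(model, x):
    """Apply the model at runtime; no training data is needed here."""
    return model["slope"] * x + model["intercept"]


def needs_retraining(model, today, max_age_days=30):
    """Flag models that are older than the agreed maximum age."""
    return (today - model["trained_on"]).days > max_age_days


serialized = export_model(2.0, 10.0, trained_on=date(2014, 11, 7))
model = load_model(serialized)
prediction = predict(model, 6)
stale = needs_retraining(model, today=date(2015, 1, 1))
```

Separating training from prediction this way means the production system only needs the small serialized model, while retraining can run as a scheduled job.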
A / B Testing
• Integrate gradually into the main system.
• Once the machine is certain (enough), it
can take over
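A gradual hand-over can be sketched as a confidence threshold: the machine decides only when it is certain enough, and a person decides otherwise. The threshold of 0.9, the department names and the function `route_decision` are assumptions for illustration.

```python
def route_decision(predicted_label, confidence, human_decide, threshold=0.9):
    """Use the machine's prediction only when its confidence is high enough.

    human_decide is a callable that asks a person for the decision.
    Returns the decision and who made it.
    """
    if confidence >= threshold:
        return predicted_label, "machine"
    return human_decide(), "human"


# Hypothetical examples: a confident and an uncertain prediction.
confident = route_decision("finance", 0.97, human_decide=lambda: "support")
uncertain = route_decision("finance", 0.55, human_decide=lambda: "support")
```

Logging who made each decision makes it possible to compare both groups and to lower the threshold step by step as trust in the model grows.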
Hands-on
Demo
• K-Nearest Neighbour Classifier
• Clustering using Weka
• Named-Entity Extraction
• Classification of tweets
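The demo code itself is not part of these slides. As a rough idea of what a K-Nearest Neighbour classifier does, here is a minimal sketch that uses two numeric features (weight and wingspan) of the bird table from the Terminology slide. The min-max scaling, the choice of k and the function name are assumptions.

```python
from collections import Counter


def knn_classify(query, instances, labels, k=3):
    """Predict the label of query by majority vote among its k nearest neighbours."""
    # Scale every feature to the range 0..1 so that no feature dominates the distance.
    columns = list(zip(*instances))
    lows = [min(column) for column in columns]
    spans = [(max(column) - min(column)) or 1.0 for column in columns]

    def scale(point):
        return [(value - low) / span for value, low, span in zip(point, lows, spans)]

    scaled_query = scale(query)
    distances = []
    for instance, label in zip(instances, labels):
        squared = sum((a - b) ** 2 for a, b in zip(scale(instance), scaled_query))
        distances.append((squared, label))
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Weight (g) and wingspan (cm) from the Terminology table.
birds = [(1000.1, 125.0), (3000.7, 200.0), (3300.0, 220.3),
         (4100.0, 136.0), (3.0, 11.0), (570.0, 75.0)]
species = ["Buteo jamaicensis", "Sagittarius serpentarius", "Sagittarius serpentarius",
           "Gavia immer", "Calothorax lucifer", "Campephilus principalis"]

prediction = knn_classify((3100.0, 210.0), birds, species, k=3)
```

K-Nearest Neighbour has no real training phase: the stored instances are the model, which keeps it simple but makes prediction slower on large data sets.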
What’s in it for you?
Benefits of using machine learning
• Automate repetitive tasks
• Can be a solution for problems that are difficult to automate
• Gain insights about your business
• Optimize business decisions by taking the computer’s predictions into account
Questions?
Wrap-up