Veldkant 33A, Kontich [email protected] www.infofarm.be Data Science Company Machine Learning in Practice An InfoFarm Seminar

Machine learning

DESCRIPTION

Slide deck from our seminar about Machine Learning (07/11/2014). Topics covered: - What is Machine Learning? - Techniques (clustering, classification, ...) - Tools (Mahout, R, Spark MLlib, Weka, ...) - Practical examples of Machine Learning applications - How to embed Machine Learning in software development - Demos

Page 1: Machine learning

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Machine Learning in Practice

An InfoFarm Seminar

Page 2: Machine learning

Data Science

Big Data

Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.

Page 3: Machine learning

• 2 Data Scientists

• 4 Big Data Consultants

• 1 Infrastructure Specialist

Page 4: Machine learning

Java

PHP

E-Commerce

Mobile

Web

Development

Page 6: Machine learning

Agenda

• 13:00 What is Machine Learning?

• 13:30 Techniques

• 14:30 Tools

• 15:00 Practical examples

• 16:00 Wrap up

Page 7: Machine learning

What is Machine Learning?

Page 8: Machine learning

Magic?

Page 9: Machine learning

Machine Learning is a subfield of computer science and statistics that deals with systems that can learn from data instead of following explicitly programmed instructions.

Page 10: Machine learning

Machine Learning vs Data Science vs Big Data

• You don’t need Big Data to benefit from machine learning, but more training data generally makes a better machine

• Data Science can help you get the most out of Machine Learning

• Machine Learning can help you get the most out of Data Science

Page 11: Machine learning

Terminology

Page 12: Machine learning

Terminology

Weight (g)   Wingspan (cm)   Webbed feet?   Back color   Species
1000.1       125.0           No             Brown        Buteo jamaicensis
3000.7       200.0           No             Gray         Sagittarius serpentarius
3300.0       220.3           No             Gray         Sagittarius serpentarius
4100.0       136.0           Yes            Black        Gavia immer
3.0          11.0            No             Green        Calothorax lucifer
570.0        75.0            No             Black        Campephilus principalis

• Features / attributes (the first four columns)

• Instance / data point (one row)

• Label / target variable (the Species column)

• Categorical (factor) versus numeric versus binary data
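As a hedged illustration of the terminology on this slide, the sketch below stores each row of the bird table as one instance, with the four measured columns as features and the species as the label. The class name `Bird`, its field names and the helper `features` are assumptions made for this example; they are not part of the seminar material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bird:
    weight_g: float      # numeric feature
    wingspan_cm: float   # numeric feature
    webbed_feet: bool    # binary feature
    back_color: str      # categorical feature
    species: str         # label / target variable


# Each Bird is one instance (data point) from the table.
birds = [
    Bird(1000.1, 125.0, False, "Brown", "Buteo jamaicensis"),
    Bird(3000.7, 200.0, False, "Gray", "Sagittarius serpentarius"),
    Bird(3300.0, 220.3, False, "Gray", "Sagittarius serpentarius"),
    Bird(4100.0, 136.0, True, "Black", "Gavia immer"),
    Bird(3.0, 11.0, False, "Green", "Calothorax lucifer"),
    Bird(570.0, 75.0, False, "Black", "Campephilus principalis"),
]


def features(bird):
    """Return only the feature values of an instance, without the label."""
    return (bird.weight_g, bird.wingspan_cm, bird.webbed_feet, bird.back_color)


X = [features(b) for b in birds]   # feature vectors
y = [b.species for b in birds]     # labels
```

Splitting the data into `X` and `y` like this is a common convention: a supervised algorithm learns from both, while an unsupervised one only sees `X`.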

Page 13: Machine learning

Learning

• Supervised Learning

• Unsupervised Learning

Page 14: Machine learning

Techniques

Page 15: Machine learning

Machine Learning

Classification

Clustering

Association Rules

Regression

Information extraction

Page 16: Machine learning

Classification

• Predict a category for a given instance

• Mostly supervised learning.

• Algorithms

– Naïve Bayes

– Support Vector Machine

– Decision Trees

– Neural Networks
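To make the idea of predicting a category concrete, here is a minimal sketch of a Naïve Bayes classifier for categorical features, written in plain Python with add-one (Laplace) smoothing. The toy data, loosely modelled on a mail-routing scenario, and all function names are assumptions for illustration only; a real project would more likely use one of the libraries discussed in the Tools section.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count how often each feature value occurs per class."""
    class_counts = Counter(labels)
    # value_counts[label][feature_index][value] = number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen per feature index
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the class with the highest (log) posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for i, value in enumerate(row):
            count = value_counts[label][i][value]
            # Add-one smoothing avoids zero probabilities for unseen values.
            score += math.log((count + 1) / (n + len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (mentions an invoice?, sender type) -> department
rows = [
    ("yes", "customer"), ("yes", "supplier"), ("no", "customer"),
    ("no", "customer"), ("yes", "supplier"), ("no", "supplier"),
]
labels = ["billing", "billing", "support", "support", "billing", "support"]

model = train_naive_bayes(rows, labels)
department = predict(model, ("yes", "customer"))
```

This is supervised learning: the model only works because every training row comes with a known label.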

Page 17: Machine learning

Classification: Use Cases

• Incoming mail redirection

• Sentiment analysis

• Order picking optimization

Page 18: Machine learning

Clustering

• Try to find clusters (groups of similar instances) in unlabeled data

• Unsupervised learning

• Algorithms: K-Means
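The following is a minimal sketch of K-Means in plain Python, included to show the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its points). Taking the first `k` points as initial centroids is an assumption made to keep the example deterministic; practical implementations usually initialise randomly or with a scheme such as k-means++. The function names and sample points are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, iterations=100):
    centroids = list(points[:k])  # assumption: first k points as start centroids
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment


points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, assignment = kmeans(points, k=2)
```

No labels are involved anywhere, which is what makes this unsupervised: the algorithm only groups points that lie close together.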

Page 19: Machine learning

Clustering: Use cases

• Customer profiling

• Grouping of shopping items

• Recommendation systems

• Fraud detection

Page 20: Machine learning

Association Rule Learning

• Find interesting relations

• Find frequently occurring patterns

• Algorithms

– Apriori

– Singular Value Decomposition

– FP-growth
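As a rough sketch of what "interesting relations" means in practice, the code below computes support and confidence for rules between pairs of items in a list of shopping baskets. It covers only the pair level and is not a full Apriori or FP-growth implementation; the thresholds, the function name and the baskets are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) for frequent item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets that contain both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets containing lhs, how many also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "bread"},
    {"milk"},
]
rules = frequent_pair_rules(baskets)
```

With these baskets, only the bread and butter pair is frequent enough, and it yields a rule in each direction with different confidence values.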

Page 21: Machine learning

Association Rule Learning: Use Cases

• Recommendations

• Data exploration

• Find connections between seemingly unrelated events

• Frequent pattern mining

Page 22: Machine learning

Regression

• Prediction of a quantity

• Algorithms:

– Linear regression

– Logistic regression (predicts a probability; commonly used for classification)
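For the "prediction of a quantity" case, a minimal sketch of simple linear regression with ordinary least squares is shown below. The weekly order numbers are invented for the example, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: ordered quantity per week.
weeks = [1, 2, 3, 4, 5]
orders = [12, 14, 16, 18, 20]

slope, intercept = fit_line(weeks, orders)
forecast_week_6 = slope * 6 + intercept
```

The fitted slope doubles as a simple trend estimate: it says by how much the quantity changes per week.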

Page 23: Machine learning

Regression: Use Cases

• Order Quantity Prediction

• Lag analysis

• Trend estimation

Page 24: Machine learning

Information Extraction

• Extract variables from unstructured data such as text.

• Named Entity Extraction
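To illustrate what extracting variables from text can look like, here is a small rule-based sketch using regular expressions. It is a stand-in for the idea only: real named entity extraction typically relies on trained models rather than hand-written patterns. The patterns, field names and sample sentence are all assumptions.

```python
import re

ORDER_RE = re.compile(r"\border\s+#?(\d+)\b", re.IGNORECASE)
AMOUNT_RE = re.compile(r"(?:EUR|€)\s?(\d+(?:[.,]\d{2})?)")
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")  # assumes dd/mm/yyyy


def extract(text):
    """Turn a free-text sentence into a dict of structured variables."""
    order = ORDER_RE.search(text)
    amount = AMOUNT_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "order_id": int(order.group(1)) if order else None,
        "amount": float(amount.group(1).replace(",", ".")) if amount else None,
        # Reorder day/month/year into an ISO-style yyyy-mm-dd string.
        "date": "-".join(reversed(date.groups())) if date else None,
    }


record = extract("Order #4711 of 07/11/2014 was invoiced for EUR 129,50.")
```

The resulting dictionary can then serve as features for any of the other techniques.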

Page 25: Machine learning

Tools

Page 27: Machine learning

Apache Mahout

Pro

• Relatively stable

• Built on Hadoop – scales well

• Command-line access for most algorithms

• All important algorithms are available

Contra

• Poor documentation

• Mahout is currently migrating from Apache Hadoop to Apache Spark. Development is slow and Apache Spark has already built a machine learning library of its own… Instant legacy?

• Kind of slow for smaller use cases

Page 29: Machine learning

Weka

Pro

• A lot of algorithms are available

• Graphical user interface for prototyping and experimenting

• Available as a Java library

Contra

• Not ‘Big Data’ ready

• Requires custom data format – ARFF files

• Optimized for academic use cases

Page 31: Machine learning

Apache Spark: MLlib

Pro

• Based on Apache Spark – very fast and scalable

• Very fast development cycle, new features are rolling out every couple of months

• New and refreshing API, easy integration with other components of Apache Spark

Contra

• Based on Apache Spark – requires knowledge of Spark and Scala

• Relatively new, so a small choice of algorithms, but the essential ones are there

Page 33: Machine learning

R

Pro

• A lot of algorithms are available

• Well documented

• Lots of existing packages that are easily available

Contra

• Can run on Hadoop/Spark, but requires a lot of knowledge of both platforms

• Must learn a new language

Page 34: Machine learning

Noteworthy

Java

• DeepLearning4J

• Mallet

• MOA

Python

• NLTK

• Theano

• PyBrain

• scikit-learn

Lua

• Torch

General

• LibSVM

• LibLinear

Page 35: Machine learning

Integration with Software Development

Page 36: Machine learning

Development Cycle

Collect → Analyze → Extract → Train → Test → Use

Page 37: Machine learning

Feature extraction

• Describe an instance to be used in an algorithm

• Recognize hand-written digits by converting the images to lines of 1’s and 0’s

0000000000000100000000000000000000000000000001110000000000000000000000000000111100000000000000000000000000111110000000000000000000000000001111000000000000000000000000000001111000000000000000000000000000111110000000000000000000000000011111000000000000000000000000000111100000000000000000000000000011111000000000000000000000000000011111000000000000000000000000001111110000000000000000000000000011111000000000000000000000000000111100000000000000000000000000000111100000000000000000000000000011111000011100000000000000000001111111111111111100000000000000011111111111111111100000000000000111111111111111111000000000000000111111111111111111100000000000011111111100000111111000000000000111110000000000011110000000000000111100000000000111100000000000000111100000000000111100000000000001111100000000001111000000000000011111100000001111110000000000000111111111111111111100000000000001111111111111111110000000000000000111111111111111100000000000000000111111111111111000000000000000000011111111100000000000000000000000001111110000000000
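A minimal sketch of this kind of feature extraction: a black-and-white image stored as lines of 1's and 0's is flattened into one numeric feature vector that an algorithm can consume. The 4×4 image below is a made-up miniature, far smaller than the digit bitmap shown on the slide, and `image_to_features` is a hypothetical helper name.

```python
def image_to_features(rows):
    """Flatten equally long lines of '0'/'1' characters into a list of ints."""
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            raise ValueError("all image rows must have the same width")
        if set(row) - {"0", "1"}:
            raise ValueError("image rows may only contain 0 and 1")
    # Row-major order: first all pixels of row 0, then row 1, and so on.
    return [int(ch) for row in rows for ch in row]


# Hypothetical 4x4 image of a ring shape.
tiny_image = [
    "0110",
    "1001",
    "1001",
    "0110",
]
features = image_to_features(tiny_image)
```

Every image of the same size yields a vector of the same length, which is what most algorithms require of their input.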

Page 38: Machine learning

Training an algorithm

1. Collect your data as a collection of instances

2. Split your data set into a training set and a testing set

3. Train the algorithm with the training set

4. Validate the results using the test set
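The four steps can be sketched in plain Python as follows. A k-nearest-neighbour classifier is used here only because it is short; for k-NN, "training" simply means keeping the training instances around. The split ratio, the fixed random seed, the helper names and the toy data are all assumptions for illustration.

```python
import math
import random
from collections import Counter


def train_test_split(instances, test_fraction=0.25, seed=42):
    """Step 2: shuffle and split into (training set, test set)."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(training, features, k=3):
    """Step 3 (use of the trained model): majority vote of the k nearest instances."""
    nearest = sorted(training, key=lambda inst: math.dist(inst[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(training, test, k=3):
    """Step 4: share of test instances whose label is predicted correctly."""
    hits = sum(knn_predict(training, f, k) == label for f, label in test)
    return hits / len(test)


# Step 1: collect instances as (features, label) pairs (hypothetical data).
instances = [
    ((1.0, 1.0), "small"), ((1.5, 2.0), "small"),
    ((2.0, 1.0), "small"), ((1.2, 1.8), "small"),
    ((8.0, 8.0), "large"), ((9.0, 8.5), "large"),
    ((8.5, 9.5), "large"), ((9.2, 9.0), "large"),
]

training_set, test_set = train_test_split(instances)
score = accuracy(training_set, test_set)
```

Keeping the test instances out of training is the point of the split: the accuracy then estimates how the model behaves on data it has not seen.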

Page 39: Machine learning

Runtime model

• During training most algorithms generate a mathematical runtime model.

• Model should be updated on a regular basis
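One way to picture a runtime model is as a small set of learned parameters that is stored after training and loaded by the application that makes predictions. The sketch below assumes a linear model with two parameters and uses JSON as the storage format; the file name, the `version` field and the helper names are hypothetical choices for this example.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Persist the learned parameters so another process can use them."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(model, fh)


def load_model(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def predict(model, x):
    """Apply the stored linear model: y = slope * x + intercept."""
    return model["slope"] * x + model["intercept"]


# Parameters as they might come out of a training run (assumed values).
trained = {"slope": 2.0, "intercept": 10.0, "version": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_model(trained, path)
    restored = load_model(path)

prediction = predict(restored, 6)
```

Updating the model on a regular basis then amounts to retraining on newer data and replacing the stored file, for example with an incremented `version`.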

Page 40: Machine learning

A / B Testing

• Integrate gradually into the main system.

• If the machine is certain (enough), the machine can take over
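A minimal sketch of this gradual hand-over, assuming the model returns a confidence value between 0 and 1 alongside each prediction: predictions above a threshold are applied automatically, the rest go to a person, and an agreement rate is tracked while the model runs next to the existing process. The threshold of 0.9, the function names and the log entries are assumptions.

```python
def route(prediction, confidence, threshold=0.9):
    """Let the model decide on its own only when it is confident enough."""
    if confidence >= threshold:
        return ("automatic", prediction)
    return ("manual", prediction)


def agreement_rate(decisions):
    """Share of cases in which the model matched the human decision."""
    return sum(machine == human for machine, human in decisions) / len(decisions)


# Hypothetical log of (model decision, human decision) pairs collected
# while the model runs alongside the existing process.
shadow_log = [
    ("billing", "billing"),
    ("support", "support"),
    ("billing", "support"),
    ("support", "support"),
]

rate = agreement_rate(shadow_log)
confident_case = route("billing", 0.95)
uncertain_case = route("billing", 0.60)
```

The threshold can be lowered step by step as the agreement rate gives more trust in the model.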

Page 41: Machine learning

Hands-on

Page 42: Machine learning

Demo

• K-Nearest Neighbour Classifier

• Clustering using Weka

• Named-Entity Extraction

• Classification of tweets

Page 43: Machine learning

What’s in it for you?

Page 44: Machine learning

Benefits of using machine learning

• Automate repetitive tasks

• Can be a solution for problems that are difficult to automate

• Gain insights about your business

• Optimize business decisions by using the opinion of the computer

Page 45: Machine learning

Questions?

Page 46: Machine learning

Wrap-up