Machine learning

Veldkant 33A, Kontich ● [email protected] ● www.infofarm.be

Data Science Company

Machine Learning in Practice

An InfoFarm Seminar


Data Science

Big Data

Identifying, extracting and using data of all types and origins; exploring, correlating and using it in new and innovative ways in order to extract meaning and business value from it.


• 2 Data Scientists

• 4 Big Data Consultants

• 1 Infrastructure Specialist


Java

PHP

E-Commerce

Mobile

Web

Development



Agenda

• 13:00 What is Machine Learning?

• 13:30 Techniques

• 14:30 Tools

• 15:00 Practical examples

• 16:00 Wrap up


What is Machine Learning?


Magic?


Machine Learning is a subfield of computer science and statistics that deals with systems that can learn from data, instead of following explicitly programmed instructions.


Machine Learning vs Data Science vs Big Data

• You don’t need Big Data to leverage the benefits of machine learning, but more training data generally makes for a better model

• Data Science can help you to get the most out of Machine Learning

• Machine Learning can help you to get the most out of Data Science


Terminology



Weight (g)   Wingspan (cm)   Webbed feet?   Back color   Species
1000.1       125.0           No             Brown        Buteo jamaicensis
3000.7       200.0           No             Gray         Sagittarius serpentarius
3300.0       220.3           No             Gray         Sagittarius serpentarius
4100.0       136.0           Yes            Black        Gavia immer
3.0          11.0            No             Green        Calothorax lucifer
570.0        75.0            No             Black        Campephilus principalis

• Features / attributes

• Instance / data point

• Label / target variable

• Categorical (factor) versus numeric versus binary data
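To make the terms concrete, here is a minimal sketch of how the bird table could be held in code. The variable names and the `feature_kind` helper are illustrative assumptions, not part of the seminar material.

```python
# Each instance (data point) is one row: a bird described by its features.
# The names below are hypothetical; the values come from the bird table.
FEATURE_NAMES = ["weight_g", "wingspan_cm", "webbed_feet", "back_color"]

instances = [
    (1000.1, 125.0, False, "Brown"),
    (3000.7, 200.0, False, "Gray"),
    (3300.0, 220.3, False, "Gray"),
    (4100.0, 136.0, True, "Black"),
    (3.0, 11.0, False, "Green"),
    (570.0, 75.0, False, "Black"),
]

# The label (target variable) is what a classifier would learn to predict.
labels = [
    "Buteo jamaicensis",
    "Sagittarius serpentarius",
    "Sagittarius serpentarius",
    "Gavia immer",
    "Calothorax lucifer",
    "Campephilus principalis",
]


def feature_kind(value):
    """Return the kind of data a single feature value holds."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "binary"
    if isinstance(value, (int, float)):
        return "numeric"
    return "categorical"


kinds = [feature_kind(value) for value in instances[0]]
```

Here `kinds` describes the four columns in order: two numeric features, one binary feature and one categorical feature.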


Learning

• Supervised Learning

• Unsupervised Learning


Techniques


Machine Learning

Classification

Clustering

Association Rules

Regression

Information extraction


Classification

• Predict a category for a given instance

• Mostly supervised learning.

• Algorithms

– Naïve Bayes

– Support Vector Machine

– Decision Trees

– Neural Networks
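As an illustration of the first algorithm in the list, the following is a minimal sketch of a Naïve Bayes text classifier in plain Python. The training sentences, the function names and the use of Laplace smoothing are assumptions made for this example; a real project would more likely use one of the libraries covered under Tools.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Count words per category from a list of (text, category) pairs."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in documents:
        category_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_counts, word_counts, vocabulary


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    category_counts, word_counts, vocabulary = model
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, doc_count in category_counts.items():
        score = math.log(doc_count / total_docs)  # prior of the category
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace smoothing: add 1 so unseen words do not zero the score
            smoothed = word_counts[category][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up training data for a sentiment analysis use case.
training_documents = [
    ("great product works well", "positive"),
    ("love it great value", "positive"),
    ("terrible quality broke fast", "negative"),
    ("bad service terrible support", "negative"),
]
model = train_naive_bayes(training_documents)
```

Calling `classify(model, "great value")` assigns the text to the category whose training words it resembles most.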


Classification: Use Cases

• Incoming mail redirection

• Sentiment analysis

• Order picking optimization


Clustering

• Try to find groups (clusters) of similar instances in unlabeled data

• Unsupervised learning

• Algorithms: K-Means
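A minimal sketch of K-Means in plain Python, assuming squared Euclidean distance, a fixed number of iterations and random initial centroids. The sample points and parameter names are invented for the example.

```python
import random


def k_means(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids, clusters


# Two visibly separate groups of hypothetical customers (two features each).
group_a = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5)]
group_b = [(8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(group_a + group_b, k=2)
```

Note that no labels are passed in: the algorithm only sees the points and has to discover the two groups itself, which is what makes it unsupervised.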


Clustering: Use cases

• Customer profiling

• Grouping of shopping items

• Recommendation systems

• Fraud detection


Association Rule Learning

• Find interesting relations

• Find frequently occurring patterns

• Algorithms

– Apriori

– Singular Value Decomposition (strictly a matrix factorization technique, often used for recommendations)

– FP-growth
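The core idea behind these algorithms can be sketched with two measures: support (how often items occur together) and confidence (how often a rule holds). The code below is a brute-force simplification for item pairs only, not a full Apriori or FP-growth implementation. The shopping baskets and function names are made up.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {pair: support} for item pairs that occur often enough."""
    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    total = len(transactions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


def confidence(transactions, antecedent, consequent):
    """Fraction of baskets containing the antecedent that also hold the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)


baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
rule_confidence = confidence(baskets, "bread", "butter")
```

With these baskets only bread and butter appear together in at least half of the transactions, and the rule "bread implies butter" holds in three of the four baskets that contain bread.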


Association Rule Learning: Use Cases

• Recommendations

• Data exploration

• Find connections between seemingly unrelated events

• Frequent pattern mining


Regression

• Prediction of a quantity

• Algorithms:

– Linear regression

– Logistic regression (predicts a probability, so it is mostly used for classification)
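A minimal sketch of simple linear regression using the ordinary least squares formulas for one input variable. The weekly order figures are invented to illustrate an order quantity prediction.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict the quantity for a new input value."""
    return slope * x + intercept


# Hypothetical history: week number versus number of orders.
weeks = [1, 2, 3, 4, 5]
orders = [12, 14, 16, 18, 20]
slope, intercept = fit_line(weeks, orders)
forecast_week_6 = predict(slope, intercept, 6)
```

Because the sample data grows by exactly two orders per week, the fitted line has a slope of 2 and the forecast for week 6 is 22 orders.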


Regression: Use Cases

• Order Quantity Prediction

• Lag analysis

• Trend estimation


Information Extraction

• Extract variables out of unstructured data like text.

• Named Entity Extraction
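A minimal sketch of pulling variables out of free text with regular expressions. This is a rule-based stand-in, not a learned named entity extractor. The message, field names and patterns are assumptions for the example.

```python
import re

# Hypothetical patterns: one capture group per variable we want to extract.
PATTERNS = {
    "order_id": re.compile(r"[Oo]rder (\d+)"),
    "ship_date": re.compile(r"(\d{4}-\d{2}-\d{2})"),
    "amount_eur": re.compile(r"(\d+\.\d{2}) EUR"),
}


def extract_variables(text):
    """Return a dict with the first match for each pattern, or None."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


message = "Order 12345 was shipped on 2015-03-02 to Antwerp for 49.95 EUR."
variables = extract_variables(message)
```

Fixed patterns like these only work for predictable wording. Recognizing names such as "Antwerp" in arbitrary text is where a trained named entity extraction model becomes useful.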


Tools



Apache Mahout

Pro

• Relatively stable

• Built on Hadoop: scales well

• Command-line access for most algorithms

• All important algorithms are available

Contra

• Poor documentation

• Mahout is currently migrating from Apache Hadoop to Apache Spark. Development is slow and Apache Spark has already built a machine learning library of its own… Instant legacy?

• Kind of slow for smaller use cases



Weka

Pro

• A lot of algorithms are available

• Graphical user interface for prototyping and experimenting

• Available as a Java library

Contra

• Not ‘Big Data’ ready

• Requires a custom data format: ARFF files

• Optimized for academic use cases



Apache Spark: MLlib

Pro

• Based on Apache Spark: very fast and scalable

• Very fast development cycle, new features are rolling out every couple of months

• New and refreshing API, easy integration with other components of Apache Spark

Contra

• Based on Apache Spark: requires knowledge of Spark and Scala

• Relatively new, so a small choice of algorithms, but the essential ones are there



R

Pro

• A lot of algorithms are available

• Well documented

• Lots of existing packages that are easily available

Contra

• Can run on Hadoop/Spark, but requires a lot of knowledge of both platforms

• Must learn a new language


Noteworthy

Java

• DeepLearning4J

• Mallet

• MOA

Python

• NLTK

• Theano

• PyBrain

• scikit-learn

Lua

• Torch

General

• LibSVM

• LibLinear


Integration with Software Development


Development Cycle

Collect → Analyze → Extract → Train → Test → Use


Feature extraction

• Describe an instance to be used in an algorithm

• Recognize hand-written digits by converting the images to lines of 1’s and 0’s

0000000000000100000000000000000000000000000001110000000000000000000000000000111100000000000000000000000000111110000000000000000000000000001111000000000000000000000000000001111000000000000000000000000000111110000000000000000000000000011111000000000000000000000000000111100000000000000000000000000011111000000000000000000000000000011111000000000000000000000000001111110000000000000000000000000011111000000000000000000000000000111100000000000000000000000000000111100000000000000000000000000011111000011100000000000000000001111111111111111100000000000000011111111111111111100000000000000111111111111111111000000000000000111111111111111111100000000000011111111100000111111000000000000111110000000000011110000000000000111100000000000111100000000000000111100000000000111100000000000001111100000000001111000000000000011111100000001111110000000000000111111111111111111100000000000001111111111111111110000000000000000111111111111111100000000000000000111111111111111000000000000000000011111111100000000000000000000000001111110000000000
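A minimal sketch of this kind of feature extraction, assuming the image is already available as rows of "0" and "1" characters. The 5x5 bitmap below is a made-up miniature, not the digit shown above.

```python
def image_to_features(rows):
    """Flatten a black-and-white bitmap into one list of 0/1 features."""
    return [int(pixel) for row in rows for pixel in row]


# Hypothetical 5x5 bitmap that roughly resembles the digit 1.
tiny_digit = [
    "00100",
    "01100",
    "00100",
    "00100",
    "01110",
]
features = image_to_features(tiny_digit)
```

Every pixel becomes one feature, so a 5x5 image yields 25 features and a 32x32 image would yield 1024. The resulting list is the instance that is handed to the learning algorithm.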


Training an algorithm

1. Collect your data as a collection of instances

2. Split your data set into a training set and a testing set

3. Train the algorithm with the training set

4. Validate the results using the test set
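The four steps above can be sketched as follows. A K-Nearest Neighbour classifier is used here only as an example algorithm. The synthetic data, the 75/25 split and all function names are assumptions.

```python
import random
from collections import Counter


def split(instances, test_fraction=0.25, seed=42):
    """Step 2: shuffle the instances and split them into train and test sets."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def knn_predict(training_set, features, k=3):
    """Predict the majority label among the k nearest training instances."""
    nearest = sorted(
        training_set,
        key=lambda inst: sum((a - b) ** 2 for a, b in zip(inst[0], features)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(training_set, test_set, k=3):
    """Step 4: fraction of test instances that are predicted correctly."""
    correct = sum(
        knn_predict(training_set, features, k) == label
        for features, label in test_set
    )
    return correct / len(test_set)


# Step 1: collect instances as (features, label) pairs (synthetic here).
data = [((i, i + 1), "a") for i in range(10)]
data += [((i + 20, i + 21), "b") for i in range(10)]

training_set, test_set = split(data)
# Step 3: K-Nearest Neighbour has no separate training phase; keeping the
# training set around is the "training".
score = accuracy(training_set, test_set)
```

The test set is never shown to the algorithm before validation, so the score estimates how well the classifier handles data it has not seen.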


Runtime model

• During training, most algorithms generate a mathematical runtime model.

• The model should be retrained on a regular basis as new data comes in
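A minimal sketch of what a runtime model can look like in practice: the numbers learned during training are stored, then loaded again when predictions are needed. The linear model, its parameter values and the JSON format are assumptions for the example.

```python
import json


def save_model(model):
    """Serialize the learned parameters so another process can use them."""
    return json.dumps(model)


def load_model(serialized):
    """Restore the parameters produced by an earlier training run."""
    return json.loads(serialized)


def predict(model, x):
    """Apply the stored linear model to a new input value."""
    return model["slope"] * x + model["intercept"]


# Hypothetical outcome of a training run.
trained_model = {"slope": 2.0, "intercept": 10.0}

stored = save_model(trained_model)
runtime_model = load_model(stored)
prediction = predict(runtime_model, 6)
```

Retraining on a regular basis then amounts to producing a new set of parameters and replacing the stored ones, without changing the code that makes predictions.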


A / B Testing

• Integrate slowly into the main system.

• If the machine is certain (enough), the machine can take over
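A minimal sketch of letting the machine take over only when it is certain enough. The threshold of 0.9, the function name and the example predictions are hypothetical.

```python
def route(prediction, confidence, threshold=0.9):
    """Use the machine's answer only when its confidence reaches the threshold."""
    if confidence >= threshold:
        return ("machine", prediction)
    # Not certain enough: leave the decision to the existing manual process.
    return ("human", None)


# Hypothetical classifier outputs for incoming mail redirection.
confident_case = route("sales", confidence=0.97)
uncertain_case = route("support", confidence=0.55)
```

Starting with a high threshold keeps most decisions with people. The threshold can be lowered step by step as the results of the machine prove reliable.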


Hands-on


Demo

• K-Nearest Neighbour Classifier

• Clustering using Weka

• Named-Entity Extraction

• Classification of tweets


What’s in it for you?


Benefits of using machine learning

• Automate repetitive tasks

• Can be a solution for problems that are difficult to automate

• Gain insights about your business

• Optimize business decisions by using the opinion of the computer


Questions?


Wrap-up
