Open Data, Big Data and Machine Learning Steven Van Vaerenbergh Universidad de Cantabria May 31, 2016 #EMWeek16 - Santander


Page 2: Open Data, Big Data and Machine Learning

About me

Researcher in machine learning

gtas.unican.es/people/steven

twitter.com/steven2358

Page 3: Open Data, Big Data and Machine Learning

1. Open Data


Page 4: Open Data, Big Data and Machine Learning

Denmark’s Open Address Data Set

• Making public data “free of charge”

Period                         Benefits   Costs    Return on investment
2004-2009 (including setup)    >€60M      ~€2M     22:1
2010 (steady state)            ~€14M      €0.2M    70:1

Source: http://odimpact.org/static/files/case-study-denmark.pdf


Page 7: Open Data, Big Data and Machine Learning

Open Data in Santander

• Santander Datos Abiertos: http://datos.santander.es/
• FIWARE lab: https://www.fiware.org/lab/
• FIWARE Academy: http://edu.fiware.org

Page 8: Open Data, Big Data and Machine Learning

Open data

• “A data set is open if it is available under a free license to everyone”.

• Providers: Governments, public services, companies, individuals.

• Trend: Many data providers stop building apps themselves and leave this to third parties.

Open data improves transparency.
Not all data should be open, though (privacy).

Page 9: Open Data, Big Data and Machine Learning

2. Big Data


Page 10: Open Data, Big Data and Machine Learning

Big Data

• Scientific definition: “Data sets that are so large that traditional data processing techniques cannot be applied to them”.

• Terabytes, Petabytes, Exabytes, etc.

• “Big Data” is also used to refer to novel analysis techniques for such data.

• Typically not open data.

Page 11: Open Data, Big Data and Machine Learning

Big Data = Data Science with Lots of Data

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram


Page 12: Open Data, Big Data and Machine Learning

Big Data

• Many frameworks are being developed:
  • Apache Hadoop
  • Apache Mahout
  • NoSQL

• Caution: The science behind big data is in its infancy. E.g. most methods are not able to produce error bars, which are essential in many applications.
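As an illustration of what producing “error bars” can look like on a small data set, here is a minimal sketch of a percentile bootstrap interval for a sample mean. The slides do not prescribe this method; the function name `bootstrap_mean_interval`, its parameters, and the measurements are hypothetical and chosen only for the example.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, confidence=0.95, seed=0):
    """Return (mean, lower, upper): a percentile bootstrap interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample the data with replacement and record the mean of each resample.
    resampled_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Cut off both tails to obtain the requested confidence level.
    tail = (1.0 - confidence) / 2.0
    lower = resampled_means[int(tail * n_resamples)]
    upper = resampled_means[int((1.0 - tail) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper


# Hypothetical measurements, invented for this example.
measurements = [12.1, 9.8, 11.4, 10.7, 13.2, 10.1, 11.9, 9.5, 12.6, 10.9]
mean, lower, upper = bootstrap_mean_interval(measurements)
print(f"mean = {mean:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

The interval is the error bar: it says how much the estimate would be expected to move if the data were collected again. Resampling like this is cheap for ten values but becomes costly when a single pass over the data is already expensive.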


Page 13: Open Data, Big Data and Machine Learning

Big Data

• Media and press often use “big data” to refer to data science even if the amount of data is relatively small. “Big data” is often simply a marketing term.


Page 14: Open Data, Big Data and Machine Learning

Big Data = Data Science with Lots of Data

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

?

Page 15: Open Data, Big Data and Machine Learning

3. Machine Learning


Page 16: Open Data, Big Data and Machine Learning

Traditional Machine Intelligence

• Example: decision tree for determining access

Input: age, gender, occupation, …
Output: permission to enter Juanito’s tree house? (leaves of the tree: Yes, No, No)

Program consists of a set of rules (logic).
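A hand-written rule set of this kind can be sketched as a plain function. The actual rules of the tree on the slide are not recoverable from the text, so the conditions and thresholds below are hypothetical; only the inputs (age, gender, occupation) and the Yes/No/No outcomes come from the slide.

```python
def may_enter_tree_house(age, gender, occupation):
    """Hand-coded decision tree; every rule here is an invented example."""
    # gender is accepted as an input, as on the slide, but these
    # hypothetical rules do not branch on it.
    if age < 12:
        if occupation == "student":
            return True   # leaf "Yes": young students may enter
        return False      # leaf "No": other young visitors are refused
    return False          # leaf "No": too old for the tree house


print(may_enter_tree_house(age=9, gender="f", occupation="student"))   # True
print(may_enter_tree_house(age=35, gender="m", occupation="teacher"))  # False
```

Every branch was decided by the programmer. Changing the access policy means editing the code.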


Page 17: Open Data, Big Data and Machine Learning

Traditional Machine Intelligence

• Example: decision tree for digit recognition

Input: images (MNIST). Which digit is represented?

Set of rules is very hard to design by hand.

Page 18: Open Data, Big Data and Machine Learning

Traditional Machine Intelligence

• Example: decision tree for image recognition

Input: images (CIFAR10). What does the image represent?

(Figure labels: “Correct answer”, “?”)

Set of rules is impossible to design by hand.

Page 19: Open Data, Big Data and Machine Learning

Machine Learning

• Solution: Let the program itself determine its internal set of rules.

• Provide the program with inputs and the correct answers for those inputs, and let it “learn”.

“Machine Learning is the study of computer algorithms that improve their performance on a task automatically through experience.”
- Tom Mitchell
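As a minimal sketch of this idea, the program below is given example inputs with their correct answers and picks its own rule: a single threshold on one numeric input (a one-level decision tree). The data, the function name `learn_threshold_rule`, and the restriction to a threshold rule are assumptions made for illustration.

```python
def learn_threshold_rule(inputs, answers):
    """Pick the rule 'x >= t' or 'x < t' that makes the fewest training errors."""
    values = sorted(set(inputs))
    # Candidate thresholds: one below all data, then midpoints between neighbours.
    candidates = [values[0] - 1.0] + [
        (a + b) / 2.0 for a, b in zip(values, values[1:])
    ]
    best = None
    for threshold in candidates:
        for positive_above in (True, False):
            # Count the examples on which this candidate rule gives the wrong answer.
            errors = sum(
                ((x >= threshold) == positive_above) != y
                for x, y in zip(inputs, answers)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, positive_above)
    _, threshold, positive_above = best

    def rule(x):
        # The learned "program": nobody wrote this threshold by hand.
        return (x >= threshold) == positive_above

    return rule, threshold


# Hypothetical training examples: ages and whether entry was allowed.
ages = [4, 6, 7, 9, 11, 14, 18, 30, 45]
allowed = [True, True, True, True, True, False, False, False, False]

rule, threshold = learn_threshold_rule(ages, allowed)
print(threshold)          # 12.5
print(rule(8), rule(20))  # True False
```

Given different examples, the same code would produce a different rule. That is the sense in which the program “learns” from experience.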


Page 20: Open Data, Big Data and Machine Learning

Traditional Machine Intelligence

Input + Program → Computer → Output

Machine Learning (ML)

Input + Output → ML algorithm → Program

Page 21: Open Data, Big Data and Machine Learning

Machine Learning Applications

• Spam filters detect unsolicited emails
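One simple way to build such a filter is a naive Bayes classifier over word counts, sketched below. The slide does not say which technique is meant, so this choice, the tiny training set, and all names are assumptions for illustration only.

```python
import math
from collections import Counter

# Hypothetical labelled training messages, invented for this example.
training = [
    ("win money now", "spam"),
    ("cheap pills win prize", "spam"),
    ("claim your free prize now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("project report attached", "ham"),
    ("lunch tomorrow at noon", "ham"),
]

# Count how often each word occurs in spam and in legitimate ("ham") messages.
word_counts = {"spam": Counter(), "ham": Counter()}
message_counts = Counter()
for text, label in training:
    message_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])


def log_score(text, label):
    """Log of P(label) times the product of P(word | label), add-one smoothed."""
    total_words = sum(word_counts[label].values())
    score = math.log(message_counts[label] / len(training))
    for word in text.split():
        count = word_counts[label][word] + 1  # add-one (Laplace) smoothing
        score += math.log(count / (total_words + len(vocabulary)))
    return score


def classify(text):
    """Return the label with the higher score for this message."""
    return max(("spam", "ham"), key=lambda label: log_score(text, label))


print(classify("free money prize"))        # spam
print(classify("report for the meeting"))  # ham
```

The filter contains no hand-written rule such as “messages containing 'prize' are spam”. The word statistics come entirely from the labelled examples.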


Page 22: Open Data, Big Data and Machine Learning

Machine Learning Applications

• Biomedicine: pattern detection in images


Page 23: Open Data, Big Data and Machine Learning

Machine Learning Applications

• Computer Vision: Kinect body tracking


Page 24: Open Data, Big Data and Machine Learning

Machine Learning Applications

• Natural Language Processing (NLP)


Page 25: Open Data, Big Data and Machine Learning

Machine Learning Applications

1996: IBM’s Deep Blue (Chess)
• Intelligence based on manually-entered rules

2016: Google DeepMind’s AlphaGo (Go)
• Program learns autonomously

Page 26: Open Data, Big Data and Machine Learning

Machine Learning Applications

• Human activity recognition (example classes: Running, Walking)
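A minimal sketch of such a recognizer, assuming each recording has already been reduced to two numeric features (hypothetically, step frequency in Hz and mean acceleration magnitude in m/s²). The feature values and the nearest-centroid method are assumptions chosen for brevity; the slide does not say which method the pictured system uses.

```python
import math

# Hypothetical training data: (step frequency in Hz, acceleration magnitude in m/s^2).
examples = {
    "Walking": [(1.6, 2.1), (1.8, 2.4), (1.7, 2.0), (1.9, 2.6)],
    "Running": [(2.7, 6.0), (2.9, 6.8), (3.1, 7.1), (2.8, 6.4)],
}


def centroid(points):
    """Average each feature over all points of one activity."""
    return tuple(sum(column) / len(points) for column in zip(*points))


# "Training" here is just computing one average point per activity.
centroids = {label: centroid(points) for label, points in examples.items()}


def recognize(features):
    """Return the activity whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


print(recognize((1.75, 2.2)))  # Walking
print(recognize((3.0, 6.5)))   # Running
```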


Page 27: Open Data, Big Data and Machine Learning

Internal representation

How to represent the function from input to output?
• Neural networks
• Support vector machines
• Sets of rules / Logic programs
• Bayes/Markov nets
• Model ensembles
• Decision trees
• Etc.

Neural net demo: http://playground.tensorflow.org/

Page 28: Open Data, Big Data and Machine Learning

Tools and Frameworks

• Machine learning toolkits:
  • scikit-learn (Python) http://scikit-learn.org/
  • Weka (Java) http://www.cs.waikato.ac.nz/ml/weka/
  • Shogun http://www.shogun-toolbox.org/

• Cloud-based machine learning:
  • IBM Watson https://developer.ibm.com/watson/
  • Amazon ML https://aws.amazon.com/machine-learning/
  • Microsoft Azure ML https://azure.microsoft.com/en-us/services/machine-learning/
  • Google Cloud ML https://cloud.google.com/products/machine-learning/

Page 29: Open Data, Big Data and Machine Learning

Takeaways

• Open data, big data and machine learning are components of the current technological wave that resembles an industrial revolution.

• Big data requires a rigorous scientific engineering framework that is currently unfinished.

• Machine learning algorithms create intelligent programs by automatically learning from example data.

Page 30: Open Data, Big Data and Machine Learning

Join us on Meetup

Meetup group for people in Santander & Cantabria interested in everything related to data science

www.meetup.com/Data-Science-Santander