Thinking Big with Big Data

Thinking Big
An Introduction to Big Data

About Me
Shawn Hermans
● Data Engineer/Scientist
● Technology consultant
● Physics, math, data geek

About this Talk
● Non-technical introduction to Big Data
● Not focused on any technology or platform
● Focus on concepts

Should you believe the hype?

Big Data Promises
● No need for the scientific method
● Predict disease outbreaks before the CDC
● Cure cancer
● Innovate healthcare
● Solve world hunger
● Bring about world peace

http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory

http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/?_php=true&_type=blogs&_r=0

http://online.wsj.com/articles/ad-tech-entrepreneurs-build-cancer-database-1403134613

http://www.mckinsey.com/insights/health_systems_and_services/the_big-data_revolution_in_us_health_care

http://www.bbc.com/news/business-26424338

http://www.foreignpolicy.com/articles/2014/04/25/can_big_data_stop_wars_preemptive_peace_technology_conflict

Big Data Criticism
● Garbage in, garbage out
● Ignores the role of the scientific method
● Lots of questions don’t require large amounts of data to get good stats
● Privacy issues

Big Data is just another way to think about data

Mental Models
“A mental model is simply a representation of an external reality inside your head. Mental models are concerned with understanding knowledge about the world.”

- Farnam Street Blog

http://www.farnamstreetblog.com/mental-models/


Examples
● Occam's razor
● Mind maps
● Law of supply and demand
● Never get in a land war in Asia

All models are wrong, but some are useful

Relational Resistance
Resistance to big data concepts, technologies, and techniques because of the belief that the relational model is the only way to think about data.

See also: Theory-induced blindness

Data Mental Models
● Relational
● Linked
● Object Oriented
● Geospatial
● Temporal
● Semantic
● Event Based
● Data as Code
● Bayesian
● Unstructured

What is Big Data?

According to Gartner
“Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

According to Me

Big data is the Bazaar to traditional data’s Cathedral

Cathedral and Bazaar
Traditional Data
● Clean
● Top down
● Carefully collected
● Scales vertically
● One true way
Big Data
● Disorderly
● Bottom up
● Randomly collected
● Scales horizontally
● More than one way

Big Data Differences
Relational
● Normalization
● ACID
● SQL/Query
● Structured/Schema
Big Data
● Denormalization
● BASE
● MapReduce/Other
● Loosely Structured
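To make the "SQL/Query" versus "MapReduce/Other" contrast concrete, here is a minimal single-machine sketch of the MapReduce pattern in Python. It is not taken from the talk and does not use Hadoop or any real cluster framework; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample records are illustrative assumptions. In a relational system the same question would roughly be a GROUP BY with a COUNT.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for loosely structured text.
    print(word_count(["big data is big", "data is data"]))
```

Because each record is mapped independently and each key is reduced independently, both steps could in principle be spread across many machines, which is the idea behind scaling horizontally.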

Integrating all available data is the promise of Big Data

Why should you care?

Information as an Asset
● Target specific customers' needs rather than broad segments
● Just-in-time inventory management
● Evaluate demand for a product
● Predict and track traffic patterns

Big Data and You
● What information do you have that no one else has?
● Can you easily integrate your data or is it locked in silos?
● What data don’t you collect?
● What data don’t you archive?

Big Data Technology

Big Data Platforms
Cloud
● AWS
● Google
● Microsoft
Hadoop
● Cloudera
● MapR
● Hortonworks

This isn’t an all-inclusive list, just a sample of the big players in the space.

Big Data Stack
● Batch Processing
● Data Collection
● SQL/Query
● Search
● Machine Learning
● Serialization
● Security
● Stream Processing
● File Storage
● Resource management
● Online NoSQL
● Data Pipeline

What about data science?

What IS Data Science?
● Data science is statistics on a Mac
● A data scientist is a statistician who lives in San Francisco
● Person who is better at statistics than any software engineer and better at software engineering than any statistician.

The need for Data Science
● There is a LOT of data
● Too much data for people to look at it all
● Probabilistic models help extract signal from the noise
● Need to automate the analysis and exploitation of data
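As one small illustration of how a probabilistic model separates signal from noise, the sketch below applies Bayes' rule to a noisy alert. It is not from the talk; the function name `posterior` and the rates used (a 1% base rate, a 90% true positive rate, a 5% false positive rate) are made-up assumptions chosen only to show the arithmetic.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability the event is real given that an alert fired."""
    # Total probability of seeing an alert, real or false.
    evidence = (true_positive_rate * prior
                + false_positive_rate * (1.0 - prior))
    return true_positive_rate * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: rare event, fairly accurate but noisy detector.
    after_one_alert = posterior(0.01, 0.9, 0.05)
    # A second independent alert uses the first posterior as the new prior.
    after_two_alerts = posterior(after_one_alert, 0.9, 0.05)
    print(f"after one alert:  {after_one_alert:.3f}")
    print(f"after two alerts: {after_two_alerts:.3f}")
```

With these assumed rates a single alert still leaves the event unlikely, because false alarms on the common case outnumber true alarms on the rare one. Repeated independent evidence is what pulls the signal out, and this kind of update is easy to automate across more data than people could review by hand.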

Big Data has its limits

Black Swans and Big Data
● There are fundamental limits to prediction
● Hard to predict rare events where no prior data exists (i.e., Black Swans)
● Complex systems often have feedback loops (e.g., stock market)

What’s next?

Getting Started
Business
● Identify some unresolved questions
● Figure out what data could answer those questions
● Pick the easiest and test out your hypothesis
Technology
● Pick a technology you know or want to learn
● Pick a platform
● Pick a data set and identify some basic problems to solve

My Info
Twitter: @shawnhermans
Github: github.com/shawnhermans
Blog: http://shawnhermans.github.io/ (In Progress)
Slideshare: www.slideshare.net/shawnhermans/
Quora: http://www.quora.com/Shawn-Hermans


Backup Slides

The Fourth Quadrant and the Failure of Statistics

Soothsayer
● Simple HTTP/JSON API for training/classifying data
● Lots of built in classifier statistics

https://github.com/shawnhermans/soothsayer

