A non-technical introduction to Big Data that conveys the core concepts and ideas of big data without giving in to the hype.
Thinking Big: An Introduction to Big Data
About Me: Shawn Hermans; Data Engineer/Scientist; technology consultant; physics, math, data geek
About this Talk: Non-technical introduction to Big Data; not focused on any technology or platform; focus on concepts
Should you believe the hype?
Big Data Promises: No need for the scientific method; predict disease outbreaks before the CDC; cure cancer; innovate healthcare; solve world hunger; bring about world peace
Big Data Criticism: Garbage in, garbage out; ignores the role of the scientific method; lots of questions don't require large amounts of data to get good stats; privacy issues
Big Data is just another way to think about data
Mental Models: "A mental model is simply a representation of an external reality inside your head. Mental models are concerned with understanding knowledge about the world." - Farnam Street Blog
Examples: Occam's razor; mind maps; law of supply and demand; "Never get in a land war in Asia"
All models are wrong, but some are useful
Relational Resistance: Resistance to big data concepts, technologies, and techniques because of the belief that the relational model is the only way to think about data. See also: Theory induced
Data Mental Models: Relational; Linked; Object Oriented; Geospatial; Temporal; Semantic; Event Based; Data as Code; Bayesian
What is Big Data?
According to Gartner: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
According to Me: Big data is the Bazaar to traditional data's Cathedral.
Cathedral and Bazaar
Traditional Data      Big Data
Clean                 Disorderly
Top down              Bottom up
Carefully collected   Randomly collected
Scales vertically     Scales horizontally
One true way          More than one way
Big Data Differences
Relational            Big Data
Normalization         Denormalization
ACID                  BASE
SQL/Query             MapReduce/Other
Structured/Schema
Integrating all available data is the promise of Big Data
Why should you care?
Information as an Asset: Target specific customers' needs rather than broad segments; just-in-time inventory management; evaluate demand for a product; predict and track traffic patterns
Big Data and You: What information do you have that no one else has? Can you easily integrate your data, or is it locked in silos? What data don't you collect? What data don't you archive?
Big Data Technology
Big Data Platforms: Cloud (AWS, Google, Microsoft); Hadoop (Cloudera, MapR, Hortonworks). This isn't an all-inclusive list, but a sample of the big players in the space.
Big Data Stack: Batch Processing; Data Collection; SQL/Query; Search; Machine Learning; Serialization; Security; Stream Processing; File Storage; Resource Management; Online NoSQL; Data Pipeline
What about data science?
What IS Data Science? "Data science is statistics on a Mac." "A data scientist is a statistician who lives in San Francisco." "Person who is better at statistics than any software engineer and better at software engineering than any statistician."
The Need for Data Science: There is a LOT of data; too much data for people to look at it all; probabilistic models help extract signal from the noise; need to automate the analysis and exploitation of data
Big Data has its limits
Black Swans and Big Data: There are fundamental limits to prediction; it is hard to predict rare events where no prior data exists (i.e. Black Swans); complex systems often have feedback loops (e.g.
Getting Started
Business: Identify some unresolved questions; figure out what data could answer those questions; pick the easiest and test out your hypothesis
Technology: Pick a technology you know or want to learn; pick a platform; pick a data set and identify some basic problems to solve
My Info
Twitter: @shawnhermans
Github: github.com/shawnhermans
Blog: http://shawnhermans.github.io/ (In Progress)
Slideshare:
The Fourth Quadrant and the Failure of Statistics
Soothsayer: Simple HTTP/JSON API for training/classifying data; lots of built-in classifier statistics