NoSQL for the DBA
Lynn Langit
April 2013 – Big Data Tech Con
Data Expertise / Lynn Langit
• Industry awards
  – Microsoft – MVP for SQL Server
  – Google – GDE for Cloud Platform
  – 10Gen – Master for MongoDB
• Practicing Architect
• Technical author / trainer
  – Pluralsight – Google Cloud Series
  – DevelopMentor – SQL Server Series
  – 2 books on SQL Server BI
  – Cloudera trainer (certified)
• Former MSFT FTE
  – 4 years
but first…
Business Intelligence to BigData
What is the relationship?
Business Intelligence NoSQL ????
“The Past” BI = Effective Reports
Data optimized for Static READING
BI = Optimized RDBMS
SQL queries & Data Stored on disk
BI = OLAP Cubes storage
BI = OLAP Cubes clients
BI = Transactional Data
• What happened?
• Why did that happen?
• Decision Support Systems
Collecting Transactional data
So Why Change?
Enter Big Data
Q: What is it?
A: Your Data, plus more data….
BigData Pipeline - STEP 1 – Acquire
Acquire → Process → Store → Query & Mine → Visualize
Big Data – an example from weather
• Source Data
  – National weather data
  – Satellite data
  – Airplanes with sensors
  – Sensors on boats
  – Sensors in the ocean
  – Sensors on the ground
  – Historical Data
  – Social Media
• Results
  – More accurate predictions
    • Tsunami
    • Tornado
Big Data – an example from health care
• Medical records
  – Regular
  – Emergency
• Genetic data – 23andMe
• Food data
  – SparkPeople
• Purchasing
  – Grocery card
  – Credit card
• Search – Google
• Social media
  – Twitter
  – Facebook
• Exercise
  – Nike Fuel Band
  – Kinect
• Location – phone
BigData = ‘Next State’ Questions
• What could happen?
• Why didn’t this happen?
• When will the next new thing happen?
• What will the next new thing be?
• What happens?
Collecting Behavioral data
[Chart: readings from 12:00 to 2:30 on a 0–2500 scale; series: Key Monitoring, Sensor Readings, Other Behavioral data]
What is the reality of personalized medicine?
BigData and Verticals
• Retail
• Manufacturing
• Health Care
• Banking
• Education
Collecting BigData
• Sensors everywhere
• Structured, Semi-structured, Unstructured vs. Data Standards
• M2M
• Public Datasets
  – Freebase
  – Azure DataMarket
  – Hilary Mason’s list
DEMO – Hilary Mason’s Datasets
• Who is Hilary Mason and why do you care about her datasets?
• How do you get her datasets?
• What do you do with her datasets?
Collecting Data – a note about Faces
• Facial recognition
• Voice recognition
• Gesture capture and analysis
Petabytes of Big Data
Big Data at Apple
Big Data in India
Update: “The total number of AADHAARs issued as of 24-Mar-2013 is over 304 million. This is more than 25% of the population of India.”
BigData Pipeline – STEP 5 - Visualize
DEMO - Visualizing Big Data: Wind Map
Demo - Visualizing Big Data – D3
BigData Pipeline – STEP 2 - Process
How do you clean up the mess?
• Data Hygiene
• Data Scrubbing
• Data Sprawl
• The true cost of data
• …and what about data integrity?
• …and security?
• …should your data be in the cloud?
Is NoSQL just Hadoop?
HUGE Hype factor since 2011
Apache Hadoop
• a software framework that supports data-intensive distributed applications under a free license
• enables applications to work with thousands of nodes and petabytes of data
• was inspired by Google's MapReduce and Google File System (GFS) papers
What is the relationship?
NoSQL Hadoop ??? BigData
Hadoop in the Enterprise
How you ‘get’ Hadoop
• Open source
  – roll your own
• Commercial distribution
  – Cloudera
  – MapR
  – Hortonworks
  – More…
• Rent it via the cloud
  – AWS
Demo – Get and Use Cloudera CDH4 VM
Working with Hadoop
About Hadoop MapReduce
Image from - https://developers.google.com/appengine/docs/python/images/mapreduce_mapshuffle.png
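The map, shuffle and reduce stages in the diagram can be illustrated with a minimal single-process sketch. This is plain Python, not the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["big data big pipeline", "data pipeline"]))
```

In a real cluster the map and reduce calls would run on many nodes in parallel, with the shuffle moving data between them; here everything runs in one process so the data flow is easy to follow.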
Demo - HDInsight – MapReduce w/Java
Demo - HDInsight – MapReduce w/ Hive
Example Comparison: RDBMS vs. Hadoop

                      Traditional RDBMS          Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)      Petabytes and greater
Access                Interactive and Batch      Batch – NOT Interactive
Updates               Read / Write many times    Write once, Read many times
Structure             Static Schema              Dynamic Schema
Integrity             High (ACID)                Low
Scaling               Nonlinear                  Linear
Query Response Time   Can be near immediate      Has latency (due to batch processing)
BigData Pipeline STEP 3 – Store
“Small” BigData vs. “Big” BigData
Hadoop
NoSQL
RDBMS
The reality…two pivots
Storage Methods
• SQL (RDBMS)
• NoSQL or Hadoop
Storage Locations
• On premises
• Cloud-hosted
Cloud-hosted NoSQL up to 50x CHEAPER
So many NoSQL options
• More than just the Elephant in the room
• More than 120 types of NoSQL databases
Flavors of NoSQL
• Key/Value – Volatile
• Key/Value – Persistent
• Wide-Column
• Document
• Graph
Key / Value Database
• Just keys and values
  – No schema
• Persistent or Volatile
• Examples
  – AWS DynamoDB
  – Riak
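To make "just keys and values, persistent or volatile" concrete, here is a minimal sketch in plain Python. It does not use DynamoDB or Riak; the class names `VolatileStore` and `PersistentStore` and the JSON-file persistence are assumptions made purely for illustration.

```python
import json
import os
import tempfile

class VolatileStore:
    """Keys and values held only in memory; contents vanish with the process."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

class PersistentStore(VolatileStore):
    """Same interface, but every write is also flushed to a JSON file on disk."""

    def __init__(self, path):
        super().__init__()
        self._path = path
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)

    def put(self, key, value):
        super().put(key, value)
        with open(self._path, "w") as f:
            json.dump(self._data, f)

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "kv.json")
    PersistentStore(path).put("user:1", {"name": "Ana"})
    # A brand-new store object reading the same file still sees the value.
    print(PersistentStore(path).get("user:1"))
```

Note that no schema is declared anywhere: a value can be a string, a number or a nested structure, and the store never inspects it.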
DEMO - AWS DynamoDB
• Key/Value store on the AWS cloud
NoSQL BLOB Storage Buckets in the Cloud
• Amazon – S3 or Glacier
• Google – Cloud Storage
• Microsoft Azure BLOBS
• Others
  – Dropbox
  – Box
  – More…
DEMO - Battle of the Buckets
• Google Cloud Storage VS.
• Windows Azure BLOBS VS.
• AWS S3 / Glacier
Column Database
• Wide, sparse column sets
• Schema-light
• Examples:
  – Cassandra
  – HBase w/Hadoop
  – BigTable
  – GAE HR DS (Google App Engine High Replication Datastore)
Types of Column Databases
• Column-families
  – Non-relational
  – Sparse
  – Examples:
    • HBase
    • Cassandra
    • xVelocity (SQL 2012 Tabular)
• Column-stores
  – Relational
  – Dense
  – Example:
    • SQL Server 2012 – Columnstore index
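The sparse versus dense distinction can be sketched with ordinary Python dictionaries and lists. This is only a toy model of the two layouts, not how HBase, Cassandra or SQL Server store data internally; all names and sample values below are made up for illustration.

```python
# Column-family layout (sparse): each row key holds only the columns it actually has.
column_family = {
    "user:1": {"name": "Ana", "email": "ana@example.com"},
    "user:2": {"name": "Raj", "last_login": "2013-04-01", "city": "Pune"},
}

# Column-store layout (dense): each column is one full list, aligned by row position.
column_store = {
    "name": ["Ana", "Raj", "Mei"],
    "visits": [12, 7, 30],
}

def columns_used(family):
    """List every column name that appears in at least one row."""
    return sorted({column for row in family.values() for column in row})

def column_total(store, column):
    """Aggregate a single column without touching any other column."""
    return sum(store[column])

if __name__ == "__main__":
    print(columns_used(column_family))
    print(column_total(column_store, "visits"))
```

The column-family rows can each carry different columns with no wasted space for missing ones, while the column-store layout makes scanning or aggregating one column cheap because its values sit together.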
DEMO – SQL Server ‘NoSQL’
• SQL Server 2012 Columnstore Index
• SQL Server 2012 Tabular Model (SSAS)
Document Database (MongoDB)
• document-oriented (collection of JSON documents) w/semi-structured data
  – Encodings include BSON, JSON, XML…
• binary forms (PDF, Microsoft Office documents such as Word, Excel…)
• Examples:
  – MongoDB
  – Couchbase
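A collection of semi-structured JSON documents can be sketched in plain Python as a list of dictionaries. This is not the MongoDB API; the `find` helper, the `documents` sample data and its field names are hypothetical and exist only to show that documents in one collection need not share a schema.

```python
import json

# One collection, three documents, three different shapes.
documents = [
    {"_id": 1, "type": "customer", "name": "Ana", "tags": ["retail"]},
    {"_id": 2, "type": "customer", "name": "Raj", "address": {"city": "Pune"}},
    {"_id": 3, "type": "order", "customer_id": 1, "total": 42.5},
]

def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

if __name__ == "__main__":
    for doc in find(documents, type="customer"):
        # Each document serializes to JSON on its own, nested fields included.
        print(json.dumps(doc))
```

A document that lacks a queried field is simply skipped rather than causing an error, which is the practical meaning of "semi-structured" here.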
Demo - MongoDB
Graph Databases
• a lot of many-to-many relationships
• recursive self-joins
• when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
• Examples:
  – Neo4j
  – Google Freebase
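"Quickly finding connections" is, at its simplest, a graph traversal. The sketch below uses a breadth-first search over an adjacency list in plain Python; it is not Neo4j or its query language, and the `shortest_path` function and the sample `friends` graph are assumptions for illustration.

```python
from collections import deque

# Adjacency list: each person maps to the people they are directly connected to.
friends = {
    "ana": ["raj", "mei"],
    "raj": ["ana", "lee"],
    "mei": ["ana"],
    "lee": ["raj", "kim"],
    "kim": ["lee"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of connections, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None

if __name__ == "__main__":
    print(shortest_path(friends, "ana", "kim"))
```

Expressing the same question in a relational database would take one self-join per hop, which is why the recursive self-join case is listed as a reason to consider a graph store.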
DEMO – Neo4j
CAP Theorem applied = ‘how big is it?’
• CA = RDBMS
  – Highly-available consistency
  – Ex. SQL Server
• CP = NoSQL
  – Enforced consistency
  – Ex. Hadoop
• AP = NoSQL
  – Eventual consistency
  – Ex. MongoDB
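Eventual consistency can be illustrated with a deliberately simplified two-replica model. This is a toy sketch, not how any product listed above replicates data; the `Replica` class, the `sync` function and the highest-version-wins rule are all assumptions made for illustration.

```python
class Replica:
    """One node that accepts reads and writes without contacting its peers."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))[1]

def sync(a, b):
    """Background reconciliation: for each key, both replicas keep the higher version."""
    for key in set(a.data) | set(b.data):
        newest = max(
            a.data.get(key, (0, None)),
            b.data.get(key, (0, None)),
            key=lambda entry: entry[0],
        )
        a.data[key] = newest
        b.data[key] = newest

if __name__ == "__main__":
    east, west = Replica(), Replica()
    east.write("cart", "1 item")
    print(west.read("cart"))  # stale read: west has not heard about the write yet
    sync(east, west)
    print(west.read("cart"))  # after reconciliation both replicas agree
```

Both replicas stay available for reads and writes the whole time; the cost is the window in which a reader can see stale data.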
“Small” BigData vs. “Big” BigData
Hadoop
Key/Value or Column
Document or Graph
RDBMS
Cloud-hosted RDBMS
• AWS RDS – SQL Server, MySQL, Oracle
  – Medium cost
  – Solid feature set, i.e. backup, snapshot
  – Use existing tooling
• Google – MySQL
  – Lowest cost
  – Most limited RDBMS functionality
• Microsoft – SQL Azure
  – Highest cost
DEMO - AWS RDS
• SQL Server, MySQL or Oracle
• Essential to understand pricing models
Image - http://blog.outsourcing-partners.com/wp-content/uploads/2012/10/performance.png
NoSQL Applied
[Figure: use cases (Social Games, Product Catalogs, Social Aggregators, Log Files, Line-of-Business) set against store types: Columnstore (HBase), Key/Value (DynamoDB), Document (MongoDB), Graph (Neo4j), RDBMS (SQL Server)]
Cloud Offerings – RDBMS AND NoSQL

                                 AWS                                Google                                Microsoft
RDBMS                            RDS – all major                    MySQL                                 SQL Azure
NoSQL buckets                    S3 or Glacier                      Cloud Storage                         Azure Blobs
NoSQL Key-Value                  DynamoDB                           H/R Data on GAE                       Azure Tables
Streaming ML or (Mahout)         Custom EC2                         Prospective Search & Prediction API   StreamInsight
NoSQL Document or Graph          MongoDB on EC2                     Freebase                              MongoDB on Windows Azure
NoSQL – Column Hadoop (HBase)    Elastic MapReduce using S3 & EC2   none                                  HDInsight
Dremel/Warehousing               RedShift                           BigQuery                              none
BigData Pipeline STEP 4 – Query
Always MapReduce?
Data Scientists and Languages
Karmasphere Studio for AWS
Can Excel help?
• Connector to Hadoop
• Data Explorer
• Data Quality Services
• Master Data Services
• Integration with Azure Data Market
• Visualize with PowerView
• Data Mining w/Predixion
Demo - Hadoop Connector to Excel
Google BigQuery w/Excel
• Hadoop-like service based on Dremel
• For massive amounts of data
• SQL-like query language
DEMO - Google BigQuery
Dremel Realized => Impala
• Interactive Hadoop?
Other types of cloud data services
Hosting public datasets
• Pay to read
• Earn revenue by offering for read
Cleaning / matching (your) data
• ETL – Microsoft Data Explorer, Google Refine
• Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com
NoSQL To-Do List
Understand CAP & types of NoSQL databases
• Use NoSQL when business needs dictate
• Use the right type of NoSQL for your business problem
Try out NoSQL on the cloud
• Quick and cheap for behavioral data
• Mashup cloud datasets
• Good for specialized use cases, i.e. dev, test, training environments
Learn NoSQL access technologies
• New query languages, i.e. MapReduce, R, Infer.NET
• New query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc…
The Changing Data Landscape
[Figure: NoSQL, RDBMS, Other Services]
www.TeachingKidsProgramming.org
• Free Courseware (recipes)
• Do a Recipe, Teach a Kid (Ages 10+)
• Java or Microsoft SmallBasic via the TKP site
• C# via Pluralsight
Toward Data Craftsmanship…
Follow me @LynnLangit
RSS my blog www.LynnLangit.com
Hire me
• To help build your BI/Big Data solution
• To teach your team next gen BI
• To learn more about using NoSQL solutions