Data Analytics with NOSQL
Mukundan Agaram, Chris Weiss
Some initial thoughts about data...
Continual issues with large scale web apps
– Data growth + query response time
    ● Data growth => performance degradation
    ● Explosion of big data “analytics” use cases
– Increase in unstructured data
    ● More interconnectivity, more formats, lack of structure...
    ● Document-oriented data (XML/JSON) is difficult to manage and search
– Distributed server configurations
    ● Large systems, more distribution and HA
Cloud services have aggravated these issues
Agenda for the night
● What is NOSQL?
● Varieties of NOSQL
● Key Industry Use Cases
● Applications for Data Analytics
● Landscape
● Demos/Walkthroughs
● Closing Discussions
What is NOSQL?
● “...mechanism for storage and retrieval of data that is modeled in means other than tabular relations used in relational databases.” (Wikipedia)
● Non SQL or Non-relational
● Not Only SQL
● Technically around since the late 1960s...
  – E.g. IDMS, IMS, MUMPS, Cache, BerkeleyDB
What is NOSQL?
● Drivers for modern day NOSQL
  – Web 2.0
  – Big Data
  – Facebook, Google, Amazon, Expedia etc.
  – Horizontal scaling to clusters of computers
      ● Achilles heel for RDBMS
  – Cost
  – Provide
      ● HA
      ● Partitioning (a.k.a. sharding) and partition tolerance
      ● Speed
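The horizontal scaling and sharding mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the three in-memory dictionaries stand in for cluster nodes, and `shard_for`, `put` and `get` are hypothetical helper names, not the API of any product named in this deck.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical cluster: three "nodes", each holding its own slice of the data.
shards = [dict() for _ in range(3)]

def put(key, value):
    """Store the value on whichever shard owns the key."""
    shards[shard_for(key, len(shards))][key] = value

def get(key):
    """Read from the single shard that owns the key."""
    return shards[shard_for(key, len(shards))].get(key)

for i in range(1000):
    put(f"user:{i}", {"id": i})
```

Each key lives on exactly one shard, so reads and writes touch a single node. Note that plain modulo hashing moves most keys when a node is added; real systems often use consistent hashing or range partitioning instead.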
NOSQL - Drawbacks and Barriers
● Compromise on consistency (CAP Theorem)
● Custom query languages vs. SQL
● Lack of standardized interfaces
● Existing investments in RDBMS
● Most lack true ACID transactions
  – Use an “eventually” consistent model
  – Data is replicated with a conflict resolution algorithm
  – Methods for conflict resolution and distribution vary significantly
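As one example of the conflict resolution mentioned above, here is a minimal last-write-wins sketch. It is an assumption for illustration, not how any particular product resolves conflicts: timestamps are assumed comparable across nodes, and `Versioned`, `resolve` and `merge` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # time of the write (assumed comparable across nodes)
    node: str       # tie-breaker so every replica picks the same winner

def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last write wins; ties are broken by node name."""
    return max(a, b, key=lambda v: (v.timestamp, v.node))

def merge(local: dict, remote: dict) -> dict:
    """Merge a remote replica's state into a copy of the local state."""
    merged = dict(local)
    for key, incoming in remote.items():
        merged[key] = resolve(merged[key], incoming) if key in merged else incoming
    return merged

replica_1 = {"cart:7": Versioned("2 items", 10, "n1")}
replica_2 = {
    "cart:7": Versioned("3 items", 12, "n2"),
    "cart:9": Versioned("1 item", 5, "n2"),
}

# Exchanging state in either direction converges to the same result.
converged_1 = merge(replica_1, replica_2)
converged_2 = merge(replica_2, replica_1)
```

Last-write-wins silently discards the losing write, which is one reason products differ so much here; vector clocks and mergeable data types are common alternatives.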
CAP Theorem
● a.k.a. Brewer's theorem
● Impossible for a distributed computer system to simultaneously provide all three of:
  – Consistency
      ● All nodes see the same data at the same time
  – Availability
      ● Every request receives a response
  – Partition Tolerance
      ● The system keeps operating despite partitioning caused by network failures
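The trade-off can be shown with a toy two-replica store. This is a deliberately simplified sketch, not a real replication protocol: `TwoNodeStore` and its `"CP"`/`"AP"` modes are hypothetical, and the `partitioned` flag simulates a network failure between the replicas.

```python
class TwoNodeStore:
    """Two replicas, "a" and "b". The mode decides what happens during a partition."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels used only in this sketch)
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False    # simulated network partition

    def write(self, node, key, value):
        peer = "b" if node == "a" else "a"
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the write: availability is lost.
                raise RuntimeError("unavailable: cannot reach peer replica")
            # AP: accept the write locally; the replicas now diverge.
            self.replicas[node][key] = value
            return
        self.replicas[node][key] = value
        self.replicas[peer][key] = value

    def read(self, node, key):
        return self.replicas[node].get(key)

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.write("a", "x", 1)
    store.partitioned = True

try:
    cp.write("a", "x", 2)
    cp_available = True
except RuntimeError:
    cp_available = False

ap.write("a", "x", 2)
```

During the partition the CP store rejects the write but both replicas still agree, while the AP store accepts it and the two replicas return different values until they are reconciled.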
CAP alignment for NOSQL
Source: http://blog.nahurst.com/visual-guide-to-nosql-systems
NOSQL direction
The landscape is morphing...
● Current NOSQL industry focus
  – Address large distributed systems, in reaction to the CAP theorem
● The newer breed of NOSQL addresses important aspects such as ACID
● There is a new buzzword …
  – NewSQL
Database Evolution
NOSQL Model Classification
● Key Value Stores & Caches
  – Data is represented as a collection of (K,V) pairs. In-memory, persistent or eventually persistent.
● Document Databases
  – Data is stored in JSON document structures.
● RDF, OWL & Triple Stores
  – Meaningful way to connect information. Can infer over triples (S,P,O). Can be represented graphically. Queried with SPARQL.
● Wide Column Databases
  – Extensible record set. Stores data tables as sections of columns. Great for EDW.
● Graph Databases
  – Stores data as a graph G(V,E). Great for correlation analysis, recommendation engines and fraud detection.
● Multi-model Databases
  – Combination of two or more of the above.
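To make the classification concrete, here is one invented fact ("alice follows bob") laid out in the shape each model suggests. Plain Python structures stand in for the stores; the key formats and field names are assumptions for illustration, not any product's storage format.

```python
# Key-value: an opaque value looked up by key.
key_value = {"user:alice:follows": ["bob"]}

# Document: a nested, JSON-like record.
document = {"_id": "alice", "name": "Alice", "follows": [{"id": "bob", "since": 2014}]}

# Triple store: (subject, predicate, object) statements.
triples = [("alice", "follows", "bob"), ("alice", "type", "Person"), ("bob", "type", "Person")]

# Wide column: row key -> sparse set of "family:qualifier" columns.
wide_column = {"alice": {"profile:name": "Alice", "follows:bob": 2014}}

# Graph: vertices V and labelled edges E.
graph = {"vertices": {"alice", "bob"}, "edges": {("alice", "follows", "bob")}}

def followed_by(user):
    """Answer "whom does this user follow?" against each representation."""
    return {
        "key_value": key_value[f"user:{user}:follows"],
        "document": [f["id"] for f in document["follows"]] if document["_id"] == user else [],
        "triples": [o for s, p, o in triples if s == user and p == "follows"],
        "wide_column": [c.split(":", 1)[1] for c in wide_column[user] if c.startswith("follows:")],
        "graph": [dst for src, label, dst in graph["edges"] if src == user and label == "follows"],
    }

answers = followed_by("alice")
```

The answer is the same in every case; what differs is which access paths are cheap, which is what drives the choice of model.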
NOSQL Models
● Key-Value
  – Cache (EHCache, BigMemory, Coherence, Memcached)
  – Store (Redis, Riak, AeroSpike, Oracle NoSQL)
● Document (MongoDB, CouchDB, Amazon DynamoDB)
● Wide Column (Cassandra, HBase, Vertica)
● Graph (Neo4j, Titan, Giraph)
● Multi-model (OrientDB, ArangoDB, Sqrrl)
Source: www.db-engines.com
Consider NOSQL for...
● Enabling “big data” and “web” scale
  – Massive distribution through horizontal scaling
● Performant queries (alternatives to RDBMS)
  – Denormalization and large horizontal scalability
● Massive write volumes (Facebook, Twitter)
● Fast and dynamic access to key data
● Flexible schemas and data types
● Data/Schema Migration
● Developer centric environments
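The "flexible schemas" point can be illustrated with a small schema-on-read sketch: records are stored as they arrive, and structure is applied only when they are read. The event records, the `read_purchase` helper and the default currency are all invented for this example.

```python
import json

# Raw events stored as-is; each record may carry different fields and types.
raw_events = [
    '{"user": "u1", "action": "login", "ts": 1}',
    '{"user": "u2", "action": "purchase", "amount": "19.99", "ts": 2}',
    '{"user": "u1", "action": "purchase", "amount": 5, "currency": "USD", "ts": 3}',
]

def read_purchase(line):
    """Apply the reader's schema at query time; return None for other events."""
    record = json.loads(line)
    if record.get("action") != "purchase":
        return None
    return {
        "user": record["user"],
        "amount": float(record.get("amount", 0)),     # tolerate string or number
        "currency": record.get("currency", "USD"),    # assumed default, illustration only
    }

purchases = [p for p in map(read_purchase, raw_events) if p is not None]
total = sum(p["amount"] for p in purchases)
```

Nothing had to be migrated when the `currency` field appeared; the cost is that every reader must cope with missing or inconsistent fields.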
Consider NOSQL for...
● Diverse data organization options
  – Hierarchical correlation
  – Graph correlation
  – Semantic relationships
  – Set based analytics
● Caching in end usage format
● Data Archival
● Big Data Analytics
  – Cumulative metrics and insights
  – Correlation
Where RDBMS/SQL is better...
● OLTP
● Data Integrity
● SQL centricity
● Complex relationships
  – With the exception of graph NOSQL
● Maturity, stability and standardization
Use Cases
● Log management (unstructured data)
● Data synchronization (online vs. offline sources)
  – Shopping cart, Field sales/services, PoS, Gaming, Transportation/telemetry
● User profile management
● Customer 360 degree view
● Fraud detection
● Medical/Healthcare diagnosis
● Data Archival
● Recommendation Engines
Applications for Data Analytics
● Complements (part of) Hadoop and Big Data
● Acts as the persistence infrastructure for larger machine learning use cases
  – Predictive Analytics
  – Fraud/Anomaly/Outlier Detection
  – Recommendation engines
● Provides a backdrop for interesting data visualization initiatives
  – Integrate with visualization packages such as Tableau
Interesting links
● Redis in Practice: Who's online?
  www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
● Inventory list of NOSQL systems
  www.nosql-database.org
● Database Engine ranking and analytics
  www.db-engines.com
● Visual guide to NOSQL systems
  www.blog.nahurst.com/visual-guide-to-nosql-systems
Case Studies / Demos
● Retail fraud detection
  – Neo4j
  – Contrasting with OrientDB
  – TinkerPop/Gremlin/Blueprints
● 360 degree single view of voter information
  – MongoDB
● Schema on read
  – Hadoop
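For the fraud detection case study, the underlying graph idea can be sketched without any graph database: accounts that share a card or an address are linked, and following those links exposes a possible ring. The data is invented, and this plain-Python traversal is only an assumption about the general approach, not the Neo4j or Gremlin demo itself.

```python
from collections import defaultdict

# Hypothetical (account, attribute) relationships.
uses = [
    ("acct1", "card:1111"), ("acct2", "card:1111"),
    ("acct2", "addr:9 Elm"), ("acct3", "addr:9 Elm"),
    ("acct4", "card:2222"),
]

# Undirected graph: accounts and attributes are both vertices.
graph = defaultdict(set)
for account, attribute in uses:
    graph[account].add(attribute)
    graph[attribute].add(account)

def component(start):
    """Depth-first traversal returning every vertex reachable from start."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen

def linked_accounts(account):
    """Other accounts connected to this one through shared attributes."""
    return sorted(n for n in component(account) if n.startswith("acct") and n != account)

ring = linked_accounts("acct1")
```

Here `acct3` is tied to `acct1` only indirectly, through `acct2`. In a relational schema each extra hop is another self-join, whereas a graph store treats it as a traversal.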
Gremlin Blueprints Architecture
Neo4j, OrientDB, TitanGraph, ArangoDB
Qualified Voter – Use Case
● Tracks registration information for all voters in Michigan
● Uses a tabular geography model
● Highly normalized schema
  – Data partitioned into subsets
      ● Enable local application instances and row level security
● Expensive queries when doing reporting
● Expensive queries for performing “single view” of voter
● Several tables with tens of millions of records
Voter Schema
Find the first 100 voters in Ingham county with status and school district
SELECT V.VOTER_IDENTIFICATION_NUMBER, V.FIRST_NAME, V.LAST_NAME, G.CODE AS GENDER,
       IDS.NAME AS ID_STATUS, UST.NAME AS UOCAVA_STATUS,
       VA.ADDRESS_LINE_ONE, VA.CITY, VA.ZIP_CODE,
       DIS.NAME AS SCHOOL_DISTRICT
FROM   VOTER V, VOTER_ADDRESS VA, GENDER G,
       IDENTIFICATION_STATUS IDS, UOCAVA_STATUS UST, VOTER_STATUS_TYPE VST,
       STREET_RANGE SI, DISTINCT_POLITICAL_AREA DPA, DISTINCT_POLITICAL_AREA_DIS DPAD,
       DISTRICT DIS, DISTRICT_TYPE DT, COUNTY CO
WHERE  V.ID = VA.VOTER_ID AND V.GENDER_ID = G.ID AND V.IDENTIFICATION_STATUS_ID = IDS.ID
  AND  V.UOCAVA_STATUS_ID = UST.ID AND V.VOTER_STATUS_TYPE_ID = VST.ID AND VST.NAME = 'Active'
  AND  VA.STREET_RANGE_ID = SI.ID AND SI.DISTINCT_POLITICAL_AREA_ID = DPA.ID
  AND  VA.IS_ACTIVE = 'Y'
  AND  DPA.COUNTY_ID = CO.ID AND CO.NAME = 'Ingham'
  AND  DPA.ID = DPAD.DISTINCT_POLITICAL_AREA_ID AND DPAD.DISTRICT_ID = DIS.ID
  AND  DIS.DISTRICT_TYPE_ID = DT.ID AND DT.NAME = 'School'
  AND  ROWNUM <= 100;
Expensive in terms of IO
● Multiple objects read
● Two stage IO:
    ● Read index
    ● Read entire table row
● Selected and WHERE clause columns assembled and then filtered
● Resources for larger volume query would be high: memory, CPU, fast disk
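For contrast with the twelve-table join, here is a hedged sketch of the "single view of voter" as one denormalized document per voter. The document layout is an assumption for illustration, not the presenters' actual model: field names loosely mirror the SQL columns, all values are invented, and a Python list stands in for a document collection.

```python
from itertools import islice

# Hypothetical voter documents with address and districts embedded.
voters = [
    {
        "voter_identification_number": "V-0001",
        "first_name": "Ann", "last_name": "Lee", "gender": "F",
        "voter_status": "Active",
        "address": {"address_line_one": "1 Main St", "city": "Lansing",
                    "zip_code": "48933", "county": "Ingham", "is_active": True},
        "districts": {"School": "District A"},
    },
    {
        "voter_identification_number": "V-0002",
        "first_name": "Bob", "last_name": "Ray", "gender": "M",
        "voter_status": "Active",
        "address": {"address_line_one": "2 Oak Ave", "city": "Ann Arbor",
                    "zip_code": "48104", "county": "Washtenaw", "is_active": True},
        "districts": {"School": "District B"},
    },
]

def first_voters_in_county(documents, county, limit=100):
    """Same question as the SQL above, answered from one document per voter."""
    matches = (
        doc for doc in documents
        if doc["voter_status"] == "Active"
        and doc["address"]["is_active"]
        and doc["address"]["county"] == county
        and "School" in doc["districts"]
    )
    return [
        {
            "voter_identification_number": doc["voter_identification_number"],
            "last_name": doc["last_name"],
            "school_district": doc["districts"]["School"],
        }
        for doc in islice(matches, limit)
    ]

ingham = first_voters_in_county(voters, "Ingham")
```

Everything needed for the single view is read in one pass over one structure. The trade-off is duplication: a district rename now has to be applied to every voter document that embeds it.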
Parting conclusions
● NOSQL is a mixed bag of fruit
● This space is growing
● There are hundreds of products
● Best value is realized from identifying the correct use case
  – Functional requirements
  – Non-functional requirements
Finally you can use NOSQL for...
Thank You!!
Questions?