SQL vs NoSQL: Why you’ll never dump your relations
17th March 2015
© 2015 EXASOL AG
BCS Data Management Specialist Group
Dave Shuttleworth – Principal Consultant, Exasol UK
email: [email protected]
Twitter: @EXA_DaveS
Introduction & background
SQL vs NoSQL - observations
Case study King – online gaming
What’s hot?
Q & A
Agenda
2014-2015 – EXASOL UK – Principal Consultant
Introducing EXASOL DBMS technology into UK
2003 - 2014 – Intelligent Edge Group – Principal Consultant
Data Warehouse design and migration from older technologies to new MPP DBMS
Business Intelligence infrastructure architect
New DBMS technology assessment
1992 - 2003 – WhiteCross Systems (now Kognitio) – Principal Consultant
Pre-sales and post-sales technical support
1989 - 1992 – Teradata – Consultant
Pre-sales and post-sales technical support
1980 - 1989 – Data General (now part of EMC) – Systems engineer
Pre-sales and post-sales technical support
1975 - 1980 – UK retailer – Analyst programmer
Applications design and implementation, system management and tuning
My background
a column store, in-memory, massively parallel processing (MPP) database
modern software designed for analytics
runs on standard x86 hardware
Uses standard SQL language (with optional extensions)
suitable for any scale of data & any number of users
mature, proven & very cost effective
quick to implement & easy to operate
The World’s Fastest Analytic Database
What is Exasol?
We are the benchmark leader
On 1 Terabyte of data - an order of magnitude faster than its closest rival
[Bar chart: TPC-H queries per hour (QphH@1000 GB), axis from 1,000,000 to 4,000,000; result dates shown: Sept ´14, April ´14, June ´12, Feb ´14, Dec ´13, Aug ´11, Sept ´11, Oct ´11, Dec ´11]
EXASOL 5,246,338
Microsoft 134,117
Oracle 201,487
Oracle 209,533
Microsoft 219,887
Sybase IQ 258,474
Oracle 326,454
Vectorwise 445,529
Microsoft 519,976
Source: www.tpc.org / Sept 22, 2015
Introduction & background
SQL vs NoSQL - observations
Case study King – online gaming
What’s hot?
Q & A
Agenda
• Databases and Data Warehouses have evolved to meet the needs of business (over many years…!)
• Generally using some form of Relational Database (SQL based)
• Originally tightly structured data, now expanding to include unstructured data
• Ever increasing data volumes and complexity
• New technologies have emerged to address (and extend) the storage and management requirements
• Fast cheap network connectivity
• Cloud services for cheaper and more flexible implementation
• Wider acceptance of open source software for production systems
• Hadoop parallel processing platform – often in a ‘hybrid’ environment
• Alternative database technologies (e.g. document stores, graph databases)
• Publicly accessible data sources (e.g. weather history, flight data, Google searches, Twitter feeds, census data, mapping data)
• More complex analytics needed to stay competitive
SQL vs NoSQL - background
• Proliferation of NoSQL (‘not only SQL’) databases – over 150 listed on nosql-database.org – classified by type:
• Wide Column Stores
• E.g. Hadoop, MapR, Cassandra, MonetDB
• Document stores
• Elasticsearch, MongoDB, Couchbase, MarkLogic
• Key value/tuple store
• DynamoDB, Azure Table Storage, Oracle NoSQL, MemcacheDB
• Graph databases
• Neo4j, YarcData, GraphBase
• Multimodal databases
• Object databases
• etc, etc..
SQL vs NoSQL - background
• The inherent restrictions of relational databases are addressed by NoSQL implementations:
• More flexible data model – ‘schemaless’ or ‘schema on read’
• ‘Schemaless’ can mean very fast write performance – useful for streaming data
• Simplifies handling of unstructured and semi-structured data such as logfiles, other machine generated data and text
• Designed for easy scale-up (and scale down) to handle seasonal workloads
• High levels of concurrency can be achieved via distributed processing
• High availability via replication is built in to some NoSQL databases
• Maps well to cloud based infrastructure and capabilities (if done well!)
SQL vs NoSQL - background
Hadoop today is …
Still Open Source !
Began with HDFS and Map/Reduce
Now comprises a number of additional technologies
File systems
(e.g. Tachyon)
Cluster Managers
(e.g. YARN + Mesos)
Execution Engines
(e.g. Tez, Spark etc.)
Analytical Layer and Applications
(e.g. Hive, Pig, various SQL on Hadoop)
Hadoop With Everything?
Hadoop was invented to more easily distribute the Nutch web search engine across a cluster of machines.
Map/Reduce – distributed processing
HDFS – distributed file system
Began to be used for …. just about everything.
But not all processing tasks are like indexing the Internet
Hadoop started to attract criticism
But usually when it was being used for something it wasn’t designed for
Definitely NOT jobs for Hadoop
Word processing
Payroll system
Anything on a single computer
Anything with “small” data
Analytical Queries
“GROUP BY“ logic
i.e. not concerned with individual data items
Analytical Functions
MAX, MEDIAN, MIN, SUM, COUNT, STANDARD DEVIATION …
Table joins, nested subqueries
Usually short-running, ad-hoc and submitted many at a time.
Map/Reduce and HDFS : the wrong tools for Analytics ?
Queries tend to be short : fault tolerance is less important
If chance of failure in a 5 hour batch is 1 in 300
Chance of failure in a 5 second query is 1 in 1,000,000
Queries tend to be short : start-up time is significant
a 20 second start-up time is NOT OK on a 5 second query
A number of projects started to address these issues
e.g. “Hot containers” in Hive on Tez to reduce start-up time
Also Pushdown via Hive partitions or ORC predicate pushdown
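The two figures on this slide can be checked with a little arithmetic. The sketch below assumes failures occur at a constant rate, so the chance of hitting one scales with run time; that assumption and all names are illustrative, not from the talk.

```python
# Assumption: failures arrive at a constant rate, so the chance of a failure
# is roughly proportional to how long a job runs.
BATCH_SECONDS = 5 * 60 * 60      # the 5 hour batch from the slide
QUERY_SECONDS = 5                # the 5 second query from the slide
BATCH_FAILURE_CHANCE = 1 / 300   # figure quoted on the slide
STARTUP_SECONDS = 20             # start-up time quoted on the slide


def failure_chance(run_seconds: float) -> float:
    """Scale the batch failure chance linearly to a shorter run time."""
    return BATCH_FAILURE_CHANCE * run_seconds / BATCH_SECONDS


# Chance that a single short query is hit by a failure.
query_chance = failure_chance(QUERY_SECONDS)
one_in = round(1 / query_chance)

# Share of elapsed time spent on start-up rather than on the query itself.
overhead_share = STARTUP_SECONDS / (STARTUP_SECONDS + QUERY_SECONDS)

print(f"A 5 second query fails about 1 time in {one_in:,}")
print(f"Start-up accounts for {overhead_share:.0%} of elapsed time")
```

Under these assumptions the query fails roughly once in a million runs, which matches the slide, while start-up takes four fifths of the elapsed time.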
Example taken from Reynold Xin’s 2012 “Shark: Hive (SQL) on Spark” presentation
Map/Reduce: the wrong language for Analytics ?
Stage 0: Map-Shuffle-Reduce
Mapper(row) {
fields = row.split("\t")
emit(fields[0], fields[1]);
}
Reducer(key, values) {
sum = 0;
for (value in values) {
sum += value;
}
emit(key, sum);
}
Stage 1: Map-Shuffle
Mapper(row) {
...
emit(page_views, page_name);
}
... shuffle
Stage 2: Local
data = open("stage1.out")
for (i in 0 to 10) {
print(data.getNext())
}
Equivalent in SQL
SELECT
page_name,
SUM(page_views) views
FROM wikistats
GROUP BY page_name
ORDER BY views DESC
LIMIT 10;
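To make the comparison concrete, here is a minimal runnable sketch that computes the same top-10 result two ways: once in a hand-written map/shuffle/reduce style and once with the SQL query above. SQLite is only a convenient stand-in for a SQL engine, and the sample rows and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical tab-separated sample rows: page_name <TAB> page_views.
ROWS = [
    "Main_Page\t120",
    "SQL\t40",
    "Hadoop\t35",
    "Main_Page\t80",
    "SQL\t15",
]


def top_pages_mapreduce(rows, limit=10):
    """Hand-written equivalent: map to (key, value), sum per key, sort, cut."""
    totals = defaultdict(int)
    for row in rows:
        fields = row.split("\t")               # map step
        totals[fields[0]] += int(fields[1])    # shuffle + reduce step
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


def top_pages_sql(rows, limit=10):
    """The same result from the declarative query, run on in-memory SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wikistats (page_name TEXT, page_views INTEGER)")
    parsed = [(name, int(views)) for name, views in (r.split("\t") for r in rows)]
    conn.executemany("INSERT INTO wikistats VALUES (?, ?)", parsed)
    result = conn.execute(
        "SELECT page_name, SUM(page_views) AS views "
        "FROM wikistats GROUP BY page_name ORDER BY views DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result


print(top_pages_mapreduce(ROWS))
print(top_pages_sql(ROWS))
```

Both functions return the same ranking; the SQL version leaves grouping, sorting and execution strategy to the engine.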
The SQL language
Portable
Well-defined standards exist
No detailed knowledge of the platform required
e.g. you don’t need to manage memory
SQL is assumed by a lot of reporting tools
Widely used and understood even by non-technical people
I’m not saying that SQL is perfect
• Try writing the simple Hadoop “Word Count” example in pure SQL
• Or try to “sessionise” weblog data
• Or anything with data that is not structured
• “Which part of STRUCTURED Query Language don’t you understand …?!”
• All I’m saying is that it is an excellent language for analytical queries.
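As an illustration of why sessionising is more natural in procedural code, the sketch below groups weblog events into sessions by walking each user's events in time order. The 30 minute inactivity gap, the event layout and all names are assumptions made for the example, not details from the talk.

```python
from collections import defaultdict

# Assumed inactivity timeout that ends a session (illustrative value).
SESSION_GAP_SECONDS = 30 * 60


def sessionise(events, gap=SESSION_GAP_SECONDS):
    """Tag each (user, timestamp) event with a per-user session number.

    A new session starts whenever the time since the user's previous event
    exceeds `gap`. Returns (user, timestamp, session_no) tuples ordered by
    user and then by time.
    """
    by_user = defaultdict(list)
    for user, timestamp in events:
        by_user[user].append(timestamp)

    tagged = []
    for user in sorted(by_user):
        session_no = 0
        previous = None
        for timestamp in sorted(by_user[user]):
            if previous is None or timestamp - previous > gap:
                session_no += 1          # gap too long: open a new session
            tagged.append((user, timestamp, session_no))
            previous = timestamp
    return tagged


# Hypothetical events: (user, seconds since some start time).
EVENTS = [("alice", 0), ("alice", 600), ("alice", 4000), ("bob", 100), ("bob", 200)]

for row in sessionise(EVENTS):
    print(row)
```

The logic depends on comparing each row with the previous one for the same user, which is the kind of ordered, stateful step that plain GROUP BY queries do not express directly.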
Hadoop could handle SQL (via Hive), but historically …
High Latency
Restricted SQL options
All but simple table joins were difficult
Little support for compression & indexing
Merv Adrian (Gartner Research - 2014)
“What is remarkable is that Hadoop does SQL. Just don’t expect it to do it well”
Result : EVERYTHING looked good compared to Hive
Everyone still likes to compare themselves to Hive
EXASOL being no exception !
Hive continues to be improved …
Completed
Views (HIVE-1143)
Partitioned Views (HIVE-1941)
Storage Handlers (HIVE-705)
HBase Integration
HBase Bulk Load
Locking (HIVE-1293)
Indexes (HIVE-417)
Bitmap Indexes (HIVE-1803)
Filter Pushdown (HIVE-279)
Table-level Statistics (HIVE-1361)
Dynamic Partitions
Binary Data Type (HIVE-2380)
Decimal Precision and Scale Support
HCatalog
HiveServer2 (HIVE-2935)
Column Statistics in Hive (HIVE-1362)
List Bucketing (HIVE-3026)
Group By With Rollup (HIVE-2397)
Enhanced Aggregation, Cube, Grouping and Rollup (HIVE-3433)
Optimizing Skewed Joins (HIVE-3086)
Correlation Optimizer (HIVE-2206)
Hive on Tez (HIVE-4660)
Vectorized Query Execution (HIVE-4160)
In Progress
Atomic Insert/Update/Delete (HIVE-5317)
Transaction Manager (HIVE-5843)
Cost Based Optimizer in Hive (HIVE-5775)
Proposed
Spatial Queries
Theta Join (HIVE-556)
JDBC Storage Handler
MapJoin Optimization
Proposal to standardize and expand Authorization in Hive
Dependent Tables (HIVE-3466)
AccessServer
Type Qualifiers in Hive
MapJoin & Partition Pruning (HIVE-5119)
SQL Standard based secure authorization (HIVE-5837)
Updatable Views (HIVE-1143)
Hive on Spark (HIVE-7292)
The dream data architecture for analytics …
Based on the SQL language
but leverages Hadoop’s extreme scalability
and Hadoop’s fault tolerance
while not compromising on speed.
Could it please also have some maturity ?
And be easy to use ?
The current reality
SQL on SQL, which is arguably
Less scalable
Less fault tolerant
Less good with unstructured data
SQL on Hadoop, which is arguably
Less mature
Less easy to use
Slower
Choices for SQL and Hadoop
SQL AND HADOOP
A Connector
HADOOP ON SQL
User Defined Functions
SQL ON HADOOP
Something like Hive, but better
Option 1 – SQL AND HADOOP
Run SQL on SQL and Hadoop on Hadoop and use a connector to join the two systems
Pros
Minimal impact (SQL and Hadoop worlds can function as before)
Easier to implement
Cons
Network !
Challenge of optimising across two technologies
Option 2 – HADOOP ON SQL
Bring Map/Reduce into the Parallel database
For example using Java User Defined Functions
select my_java_map_function(words) a_word,
count(*) word_count
from DOCUMENTS
group by 1
Doesn’t benefit from Hadoop’s storage advantages
Option 3 - SQL ON HADOOP
Build a relational database on Hadoop storage
Impala (Cloudera)
Stinger (Hortonworks)
Presto (Facebook)
SparkSQL (UC Berkeley)
HAWQ (Pivotal)
BigSQL (IBM)
Apache Phoenix (for HBase)
Apache Tajo
Apache Drill
etc etc etc ….
AND DON’T FORGET HIVE !
Four possible market outcomes…
Hadoop and SQL databases are on a collision course – only one will survive
No sign of that so far
They are complementary – both will survive
Probably - the challenge is how to make them work together
They will merge and become one
Some indications this is already starting to happen
Something even more amazing will come along and replace them both
Sometimes this happens – Spark ?
What do the pundits say?
Martin Fowler – Thoughtworks
The rise of NoSQL databases marks the end of the era of relational database dominance
But NoSQL databases will not become the new dominators. Relational will still be popular, and used in the majority of situations. They, however, will no longer be the automatic choice.
The era of Polyglot Persistence has begun - where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data
Emil Eifrem – Neo Technology
When evaluating a NoSQL database, it is critical to demand enterprise-readiness. An enterprise delivering modern applications needs a NoSQL database that can manage today's complex and connected data while still delivering the enterprise strength, transactions and durability that IT departments have relied on for years.
Introduction & background
SQL vs NoSQL - observations
Case study King – online gaming
What’s hot?
Q & A
Agenda
King in numbers
• 100 million daily active users
• 1 billion game plays per day
• 8 offices
And lots and lots of data...
• 14 billion rows per day
• 500 GB of new data per day
• 700 TB stored
Case Study - King
King - Getting to know 500 million players
Objectives in game analytics
• Metrics and KPIs
• Measure and understand player behaviour
• Player segmentation
• Improve player experience
• Forecasting
• Predictive modelling
Challenges at King
• Extreme scale
• Rate of growth
• Speed of innovation
• Cross platform
• Virtual economies
King - Getting to know 500 million players
The King formula
• Data driven culture
• Engaged business
• Talented embedded data scientists
• AB testing
• Right technology platform
• Right data model
King - Getting to know 500 million players
System architecture
How King does data
Game servers
Log server
Reports
Data scientists
Data Warehouse
TSV log files
Dimensional model
Raw data
ETL
Our data keeps growing...
How King does data
King launches on mobile...
…our technology has to keep up
How King does data
Qlikview says no
Infobright CE says no
10 node Hadoop
80 nodes
40 nodes
20 nodes
InfiniDB
Exasol
Data platform 1.0
How King does data
Games
Event data
Hive
Reports
Data scientists
ETL
Data platform 1.5
How King does data
Games
Event data
Hive DB
Reports
Data scientists
ETL
Why ExaSolution?
• Speed
• Efficiency
• Tuning free
• Scaling (150 TB and counting...)
• ExaDudes
How King does data
Performance
How King does data
Games
Event data
Hive Exasol
Reports
Data scientists
ETL
Data platform 2.0
How King does data
Benefits
• ETL times slashed
• Cost saving
• Tuning free
• Scaling
How King does data
Data platform 3.0
Where next?
Games
Event data
Exasol Hive
Reports
Data scientists
ETL
Future challenges
• Keep on scaling
• Closer Hadoop integration
• Evolving data model
• Microbatch ETL
• Real(er) time…
Where next?
Introduction & background
SQL vs NoSQL - observations
Case study King – online gaming
What’s hot?
Q & A
Agenda
What’s hot?
• A definition:
• The Internet of Things (IoT) is a scenario in which objects, animals or people are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction
• Basic concept has been around for decades – now accepted into the mainstream
• Wide range of potential uses:
• Environmental monitoring
• Infrastructure management
• Manufacturing
• Energy management
• Medical and healthcare systems
• Building and home automation
• Transport systems
Internet of Things
• Wearable technologies – e.g. smart watches, Google Glass
• Bio sensors for humans (and other animals)
• Health monitoring
• Already in use on some dairy farms – optimise milk yields and give early warning for possible disease
• Location based data
• All modern phones provide location data (either GPS or cell based)
• ‘crowd sourcing’ – e.g. traffic flow based on cellphone signals
• Beacons – e.g. Regent Street in London
• Location-based special offers and advertisements
• Facial recognition
• To drive targeted advertisements
Other emerging technologies which produce data
• Cloud being used for evaluation of new technologies and also as a platform for dev/test (and even DR) environments
• In-database analytics using UDFs in languages such as R, Lua and Python
• Move the processing closer to the data
• Run analytics on full data volumes (no sampling/extract required)
• Get improved performance due to parallelism (where possible)
• Lots of freely available R code on the web
• Automated conversion of analytical results to text (NLG) is emerging
• AI rule-based generation of natural language output
• Readable summaries and recommendations
• Yseop, NarrativeScience, Automated Insights, Arria NLG
Other emerging trends
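The idea of moving processing closer to the data can be sketched with a Python user-defined aggregate registered inside a SQL engine. SQLite is used here purely as a small stand-in; the table, the data and the `py_median` name are hypothetical, and this is not how any particular MPP database registers its UDFs.

```python
import sqlite3
import statistics


class Median:
    """Python aggregate UDF: collect values per group, return their median."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once per group after the last row.
        return statistics.median(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
# Register the Python class so it can be called from SQL as py_median(x).
conn.create_aggregate("py_median", 1, Median)

conn.execute("CREATE TABLE plays (game TEXT, score REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("a", 10.0), ("b", 2.0), ("b", 4.0)],
)

# The analytic function runs inside the query, over the full data set.
rows = conn.execute(
    "SELECT game, py_median(score) FROM plays GROUP BY game ORDER BY game"
).fetchall()
print(rows)
conn.close()
```

The query applies the Python function to every group in place, so no extract or sample has to leave the database before the analysis runs.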
• Data and database technology isn’t going away!
• New database approaches are being developed to address the requirements of flexibility, scalability etc
• These technologies drive an increasing need for more analysts, database designers, data scientists
• Hybrid systems are becoming the norm, with companies mixing ‘best of breed’ technologies (possibly open source) to get the best and most cost-effective results – use ‘the right tool for the job’
• SQL databases will continue to be widely utilised – but alongside other technologies and integration will become tighter
Summary
Introduction & background
SQL vs NoSQL - observations
Case study King – online gaming
What’s hot?
Q & A
Agenda