Structured Data Insights:The Vertica Architecture Advantage
#SeizeTheData #HighPerformanceAnalytics
Ben Vandiver, Architect, HPE SW Big Data
Agenda
– What Big Data is
– What Vertica is
– How Vertica works
– When not to use Vertica
– Questions & answers
What Big Data is…
“It’s everything”: volume, velocity, variety
Mentality
Evolution
Traditional retail data
–Sales transactions
– Tied together by customer loyalty card or website login (maybe)
–Customers
–Products
–Inventories
–Suppliers
–Transaction processing
– “System of Record”
– Integrity Matters
– Performance in TPS
Traditional retail analytics
– Market basket analysis (what is bought together)
– Loss analysis (shoplifting)
– Regional differences
– Seasonal patterns
– Average discounts
– Employee productivity
– Customer-targeted promotions
– Fraud model backtesting
– Return analysis
– Demand forecasting
Retail “Big Data” analytics
– Clickstream
– High Volume: Many product views per sale
– What was viewed? What was purchased instead?
– Website experience
– Site optimization (A/B testing)
– Ad impressions
– Higher volume: Each page has multiple ads/links
– Customer profile & targeting by machine learning
– Product sentiment from social media
Traditional CDRs (call detail records)
–Log of all phone calls made
–Run through mediation, rating, and billing every month
–Batch processing
–Generally pre-aggregated for analytics
Big Data xDRs
–IM and Data Services have increased volumes
–Legal requirements require higher velocity
– E.g. without notification within 15 minutes, large roaming charges are uncollectible
–Analytics on detail data is possible
– Tower placement (slow)
– RAN optimization (medium)
– Geofencing (fast)
Big Data in call centers
–Know your customer’s problem before it is presented, from network data
–Know your customer’s value, from business data
–Know your customer’s mood and influence, from interactions and social media
–Provide the right level of service to keep the best customers the happiest
Big Data is a mentality
–Analytics driven vs. analytically challenged
–Data is a core asset
– Store first, ask questions later
– In God We Trust – all others bring data
–Data science
– Asking (guessing) the right questions vs.
– Doing the right experiment, perhaps by accident
–There are still domain experts, but the data drives things
What Vertica is…
SQL relational database...
– Structured data
– Tables consisting of rows and columns
– Standard Query Language
– Finding
– Aggregating
– Analyzing
– Joining data from multiple tables
– …
– Ecosystem
– ODBC/JDBC, etc.
– BI, reporting, ETL, etc.
But big and fast! And designed from scratch for analytics applications
–Tens of trillions of records (thousands per man, woman, and child in the world)
–Terabytes to petabytes of storage
–Hundreds of computers with tens of thousands of CPUs to crunch the data
Leading customers across industries finding answers
– Promotional testing
– Claims analyses
– Patient records analyses
– Clinical data analyses
– Fraud monitoring
– Financial tracking
– Tick data back-testing
– Behavior analytics
– Clickstream analyses
– Network analyses
– Customer analytics
– Compliance testing
– Loyalty analysis
– Campaign management
Zynga: Winning analytics in a data-driven culture
Challenge
– Provide near real-time analysis on 40-60 billion rows of data ingested per day for 1,000+ employees
Solution
– HPE Vertica Analytics Platform
Result
– Ability to proactively determine what is analyzable, then structure collected data for fast results from HPE Vertica
– Analytics cluster scales 70 times for both Poker and Words With Friends in their fifth year
– 400-600 A/B tests running concurrently with clear metrics
Cerner: Accelerating health information with an analytics platform
– Used by an IT healthcare provider’s platform to detect how long certain application functions take to run
– 6,000% improvement in how long it took to analyze a single client’s timers; with HPE Vertica it now takes only 20 seconds
– Greater scale: prior to HPE, Cerner was collecting 6 billion timers a month; now it’s 10 billion
– 2,000 timers
How Vertica works…and why it is fast
Design goals/basic architecture
– SQL, for the ecosystem and knowledge pool
– Clusters of commodity hardware
– Linux, x86, Ethernet
– Software-only solution (for flexibility)
– Special-purpose hardware has poor track record in databases
– Shared-Nothing MPP
– Cheaper, but puts more complexity in the software
– Run large queries many times faster than a legacy DB, load as fast, but feel free to snarl and growl at UPDATEs and DELETEs
– Sorted, compressed column store for cost and speed, no in-place updates
– Smart algorithms, query optimizer, etc.
Start from how data is stored on disk…
SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '5/13/2011'
Symbol Date Time Price Volume Etc.
… … … … … …
HPQ 05/13/11 01:02:02 PM 40.01 100 …
IBM 05/13/11 01:02:03 PM 171.22 10 …
AAPL 05/13/11 01:02:03 PM 338.02 5 …
GOOG 05/13/11 01:02:04 PM 524.03 150 …
HPQ 05/13/11 01:02:05 PM 39.97 40 …
AAPL 05/13/11 01:02:07 PM 338.02 20 …
GOOG 05/13/11 01:02:07 PM 524.02 40 …
… … … … … …
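The example query can be tried end-to-end on these sample rows. The sketch below uses SQLite as a convenient stand-in for Vertica; table and column names follow the slide, and the date format is simplified to match the inserted rows.

```python
import sqlite3

# Toy version of the trades table from the slide, in SQLite (not Vertica);
# the query has the same shape as the SELECT SUM(volume) example above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (symbol TEXT, date TEXT, time TEXT, price REAL, volume INTEGER)"
)
rows = [
    ("HPQ",  "05/13/11", "01:02:02 PM",  40.01, 100),
    ("IBM",  "05/13/11", "01:02:03 PM", 171.22,  10),
    ("AAPL", "05/13/11", "01:02:03 PM", 338.02,   5),
    ("GOOG", "05/13/11", "01:02:04 PM", 524.03, 150),
    ("HPQ",  "05/13/11", "01:02:05 PM",  39.97,  40),
    ("AAPL", "05/13/11", "01:02:07 PM", 338.02,  20),
    ("GOOG", "05/13/11", "01:02:07 PM", 524.02,  40),
]
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?, ?)", rows)
total, = conn.execute(
    "SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '05/13/11'"
).fetchone()
print(total)  # 140
```

A row store has to scan whole rows to answer this; the rest of the section shows how sorting and columnar layout make the same query much cheaper.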
Sorted data: sort by symbol, date, and time
Symbol Date Time Price Volume Etc.
… … … … … …
AAPL 05/13/11 01:02:03 PM 338.02 5 …
AAPL 05/13/11 01:02:07 PM 338.02 20 …
… … … … … …
GOOG 05/13/11 01:02:04 PM 524.03 150 …
GOOG 05/13/11 01:02:07 PM 524.02 40 …
… … … … … …
HPQ 05/13/11 01:02:02 PM 40.01 100 …
HPQ 05/13/11 01:02:05 PM 39.97 40 …
… … … … … …
IBM 05/13/11 01:02:03 PM 171.22 10 …
… … … … … …
Column files: split into columns, one file per column
Symbol: …, AAPL, AAPL, …, GOOG, GOOG, …, HPQ, HPQ, …, IBM, …
Date: …, 05/13/11 (every row), …
Time: …, 01:02:03 PM, 01:02:07 PM, …, 01:02:04 PM, 01:02:07 PM, …, 01:02:02 PM, 01:02:05 PM, …, 01:02:03 PM, …
Price: …, 338.02, 338.02, …, 524.03, 524.02, …, 40.01, 39.97, …, 171.22, …
Volume: …, 5, 20, …, 150, 40, …, 100, 40, …, 10, …
Etc.: …
A query reads only the files for the columns it references.
Compression + RLE
Symbol (8K distinct values): …, GOOG (×18M), HPQ (×22M), IBM (×19M), …
Date (250 values/yr): …, 05/13/2011 (×150K), …, 05/13/2011 (×220K), …, 05/13/2011 (×150K), …
Volume: …, 22, 150, 40, …, 99, 100, 40, …, 200, 10, 18, …
Long runs of repeated values in the sorted columns collapse to (value, count) pairs.
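The run-length encoding idea can be sketched in a few lines. This is an illustrative model, not Vertica's actual on-disk format: a sorted column collapses each run of equal values to a (value, count) pair.

```python
# Minimal run-length encoding sketch (illustrative, not Vertica's format).
def rle_encode(column):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

symbols = ["AAPL", "AAPL", "GOOG", "GOOG", "HPQ", "HPQ", "IBM"]
encoded = rle_encode(symbols)
print(encoded)  # [('AAPL', 2), ('GOOG', 2), ('HPQ', 2), ('IBM', 1)]
assert rle_decode(encoded) == symbols
```

On real data the effect is dramatic: a 22-million-row run of 'HPQ' is stored once with a count, and aggregates can often operate on the counts directly instead of the expanded values.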
Clustering/MPP/scale-out
– Parallel design enables distributed storage and workload
– “Active” redundancy
– Automatic replication, failover, and recovery
– Shared-nothing database architecture provides high scalability on clusters of commodity hardware
– Add nodes to achieve optimal capacity and performance
– Lower data center costs, higher density, scale-out
– No specialized nodes
– All nodes are peers
– Query/load to any node
– Continuous/ real-time load and query
Client network
Private data network (IP)
Nodes 1–3, each: 2 hex-core CPUs, 96+ GB RAM, 5+ TB storage
Nodes are peers
Distributed query execution
– Client connects to a node and issues a query
– Node the client is connected to becomes the initiator node
– Other nodes in the cluster become executor nodes
– Initiator node parses the query and picks an execution plan
– Initiator node distributes query plan to executor nodes
select sum(volume) from fact;
(Diagram: the initiator node flanked by executor nodes.)
Distributed query execution
– All nodes execute the query plan locally
– Nodes exchange data during aggregation and joins
– Executor nodes send partial query results back to initiator node
– Initiator node aggregates results from all nodes
– Initiator node returns final result to the user
select sum(volume) from trades;
(Diagram: executor nodes send partial sums to the initiator, which combines them into the final result.)
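The initiator/executor flow above reduces to scatter, local aggregation, and gather. A minimal sketch, with made-up node names and shard contents:

```python
# Sketch of distributed aggregation: each "node" sums its local shard of the
# volume column, then the initiator combines the partial sums. Node names and
# data placement are illustrative, not Vertica's segmentation scheme.
shards = {
    "node1": [100, 40],    # e.g. the HPQ rows landed here
    "node2": [5, 20, 10],  # AAPL and IBM rows
    "node3": [150, 40],    # GOOG rows
}

def executor_partial_sum(shard):
    # Runs locally on each executor; only one number crosses the network.
    return sum(shard)

partials = {node: executor_partial_sum(rows) for node, rows in shards.items()}
final = sum(partials.values())  # the initiator's final aggregation step
print(partials, final)
```

The key property is that each executor ships back a single partial result rather than its raw rows, so network traffic stays tiny even when the shards hold billions of rows.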
Transactions
– Vertica offers full ACID (just at low TPS)
– Queries take a snapshot of the relevant list of files, and need no locks at READ COMMITTED isolation
– Loads do not conflict with each other
– COMMIT – keep the new files
– ROLLBACK – discard them
– Table-level locks for SERIALIZABLE
– Database is essentially its own undo/redo log
–Recovery can be as simple as file copies
*All operations are online
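The snapshot-of-files idea behind lock-free reads can be sketched with a toy model. This is an illustrative simplification, not Vertica internals: a query captures the list of committed files when it starts, so a concurrent load's new files only become visible to queries that begin after COMMIT.

```python
# Sketch of snapshot isolation via immutable files (illustrative model).
committed_files = {"f1": [10, 20], "f2": [30]}  # file name -> its values

def begin_query():
    # A snapshot is just the list of committed file names at query start.
    return list(committed_files)

def query_sum(snapshot):
    # The query only ever reads files that were in its snapshot.
    return sum(sum(committed_files[f]) for f in snapshot if f in committed_files)

snap = begin_query()              # reader starts: sees f1 and f2 only
committed_files["f3"] = [100]     # concurrent load COMMITs a new file
print(query_sum(snap))            # 60: the running query never sees f3
print(query_sum(begin_query()))   # 160: a new query sees the loaded data
```

Because files are never updated in place, COMMIT is "keep the new files," ROLLBACK is "discard them," and recovery can be as simple as copying files, exactly as the slide says.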
(Diagram: data segments A, B, C, D, each stored on two nodes, with changes flowing to both copies.)
Simple query processing
–Optimal data storage and physical schema
– True columnar, sorted, compressed + encoded
– Segmented, cosegmented, and replicated
– Partitioning with partition elimination
– Large I/O reads + writes
–Lock-free queries
–Optimized, vectorized, JIT-compiled code
– Fast data types designed for modern CPUs
–Fast predicate application
– Expression Analysis for sorted/partitioned data
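Fast predicate application on sorted data has a simple core idea: an equality predicate on the sort column becomes two binary searches that bound the matching run, so only that run is touched. A sketch, reusing the trades columns from earlier (Python's `bisect`, not Vertica's actual expression analysis):

```python
import bisect

# Predicate application on a sorted column: symbol = 'HPQ' becomes two
# binary searches bounding the matching run, instead of a full scan.
symbols = ["AAPL", "AAPL", "GOOG", "GOOG", "HPQ", "HPQ", "IBM"]  # sorted
volumes = [5, 20, 150, 40, 100, 40, 10]  # co-sorted companion column

lo = bisect.bisect_left(symbols, "HPQ")
hi = bisect.bisect_right(symbols, "HPQ")
total = sum(volumes[lo:hi])   # only the HPQ run's volumes are read
print(lo, hi, total)          # 4 6 140
```

The same bounds prune at a coarser grain too: whole files or partitions whose min/max range excludes 'HPQ' never need to be opened at all.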
Complex query processing
– Sort, segmentation, and RLE Optimizations for expressions, predicates, aggregation, and joins
– Sophisticated query optimizer designed for columnar query execution
– Subqueries flattened into joins
– Segment data around cluster nodes and CPUs for parallelism
– Two-pass algorithms that are skew-tolerant and reduce reliance on optimizer decisions
– Passes of aggregation and joining are interleaved by the planner/executor, so the most effective strategy is chosen at run time
– Special join implementations for “late materialization,” range lookups, and event series
– Detection and optimization of “Top K” queries
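The "Top K" optimization mentioned above amounts to replacing a full ORDER BY sort with a bounded heap when a LIMIT is present. A sketch using Python's `heapq` as a stand-in for the executor's strategy:

```python
import heapq

# "Top K" sketch: ORDER BY volume DESC LIMIT 3 does not need a full sort.
# A K-element heap gives O(n log k) work instead of O(n log n).
volumes = [100, 10, 5, 150, 40, 20, 40]
top3 = heapq.nlargest(3, volumes)
print(top3)  # [150, 100, 40]
```

On billions of rows the difference between sorting everything and tracking only the current top K is substantial, which is why the optimizer detects this pattern rather than leaving it to the user.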
Automatic database physical design: Vertica Database Designer (DBD)
– Inputs: schema, data, queries/DML
– DBD (“magic”)
– Outputs: segmentation, sort order, compression
Workload management
– Don't want reports to take over the entire system, preventing loads or tactical queries
– Keep some resources (e.g. memory) reserved so that high-priority queries can always begin
– Apply run-time prioritization to manage CPU and I/O
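The reserved-resource idea can be sketched as a tiny admission controller. Pool names and sizes here are invented for illustration; they are not Vertica's resource pool API.

```python
# Sketch of reserved-memory admission control (illustrative; pool names and
# sizes are made up). High-priority work keeps memory set aside so tactical
# queries can always begin, while oversized general queries must queue.
POOLS = {"tactical": 4, "general": 12}   # GB of memory reserved per pool
in_use = {"tactical": 0, "general": 0}

def admit(pool, need_gb):
    """Admit the query if its pool has room, else make it wait."""
    if in_use[pool] + need_gb <= POOLS[pool]:
        in_use[pool] += need_gb
        return True
    return False  # query queues until memory in its pool frees up

assert admit("general", 10)      # big report grabs most of its pool
assert not admit("general", 4)   # a second big report must wait its turn
assert admit("tactical", 1)      # tactical query still starts immediately
```

Run-time CPU and I/O prioritization then layers on top of admission, which is what the chart below is about.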
Resource pools: System, Loader, Web refresh, General
Dynamic prioritization
Q: Are optimizer cost model estimates really that bad?
A: Doesn’t matter!
(Chart: cumulative completion (%) vs. time (s); dynamically prioritized queries complete well ahead of unprioritized ones.)
Analytics platform extensions
–Event series extensions
– Sessionization
– Pattern matching
– Gap filling and interpolation
– Event series joins
–User-defined extensions
– Load source, stream filtering, and parsing
– Scalar functions, aggregates, transforms
– Growing variety of languages to choose from
–Packs/examples for
– Geospatial
– Sentiment
– Data mining, logistic regression, etc.
–Data variety: Flex Tables, files, integration
–Analytics packs
When not to use Vertica
Vertica is NOT an OLTP system
–Single/few record retrievals are, in theory and in practice, far slower in column stores
–While Vertica is ACID compliant, transaction throughput is in the 10s-100s of TPS
– INSERTs must be batched, or use the COPY command
– UPDATEs and DELETEs are run serially within a table
–Referential integrity constraints are not enforced
–Instead, use Vertica in conjunction…
– Keep a log of what happened in the OLTP DBMS, or in a NoSQL “eventually consistent” system
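The "batch your INSERTs" advice above looks like this in practice. SQLite stands in for a real Vertica client connection here; in Vertica the same pattern is a single bulk COPY or one large batched statement rather than thousands of single-row round trips.

```python
import sqlite3

# Sketch of batched loading: one executemany call (analogous to a bulk COPY)
# instead of 10,000 individual single-row INSERT round trips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

batch = [(i, f"event-{i}") for i in range(10_000)]
conn.executemany("INSERT INTO events VALUES (?, ?)", batch)  # one batched call
conn.commit()

count, = conn.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 10000
```

Batching matters doubly in Vertica's case: each load commit produces new immutable files, so many tiny transactions mean many tiny files and far more overhead than one large load.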
Vertica is not for huge numbers of small queries
–Data sets much less than a terabyte may not warrant an analytic database
–Use an in-memory database/tool (Membase, Memcached, etc.) with Vertica to handle large numbers of tiny point queries
Keep the environment simple
– Linux x86 64-bit only
– While they “should work,” shared storage, filers, etc. add cost, add potential bottlenecks, and perplex our support department when anything goes wrong
– It is a bit silly to break machines into VMs only to stitch them back together with an MPP database, so virtualization is not recommended
– Reasonable network performance is essential
– Loads and some queries may use all-to-all bandwidth
– Do not attempt to span WANs