Structured Data Insights:The Vertica Architecture Advantage
#SeizeTheData #HighPerformanceAnalytics
Ben Vandiver, Architect, HPE SW Big Data
Agenda
– What Big Data is
– What Vertica is
– How Vertica works
– When not to use Vertica
– Questions & answers
What Big Data is…
“It’s everything”: volume, velocity, variety
Mentality
Evolution
Traditional retail data
–Sales transactions
– Tied together by customer loyalty card or website login (maybe)
–Customers
–Products
–Inventories
–Suppliers
–Transaction processing
– “System of Record”
– Integrity Matters
– Performance in TPS
Traditional retail analytics
– Market basket analysis (what is bought together)
– Loss analysis (shoplifting)
– Regional differences
– Seasonal patterns
– Average discounts
– Employee productivity
– Customer-targeted promotions
– Fraud model backtesting
– Return analysis
– Demand forecasting
Retail “Big Data” analytics
– Clickstream
– High Volume: Many product views per sale
– What was viewed? What was purchased instead?
– Website experience
– Site optimization (A/B testing)
– Ad impressions
– Higher volume: Each page has multiple ads/links
– Customer profile & targeting by machine learning
– Product sentiment from social media
Traditional CDRs (call detail records)
–Log of all phone calls made
–Run through mediation, rating, and billing every month
–Batch processing
–Generally pre-aggregated for analytics
Big Data xDRs
–IM and Data Services have increased volumes
–Legal requirements require higher velocity
– E.g. without notification within 15 minutes, large roaming charges are uncollectible
–Analytics on detail data is possible
– Tower placement (slow)
– RAN optimization (medium)
– Geofencing (fast)
Big Data in call centers
–Know your customer’s problem before it is presented, from network data
–Know your customer’s value, from business data
–Know your customer’s mood and influence, from interactions and social media
–Provide the right level of service to keep the best customers the happiest
Big Data is a mentality
–Analytics driven vs. analytically challenged
–Data is a core asset
– Store first, ask questions later
– In God We Trust – all others bring data
–Data science
– Asking (guessing) the right questions vs.
– Doing the right experiment, perhaps by accident
–There are still domain experts, but the data drives things
What Vertica is…
SQL relational database...
– Structured data
– Tables consisting of rows and columns
– Standard Query Language
– Finding
– Aggregating
– Analyzing
– Joining data from multiple tables
– …
– Ecosystem
– ODBC/JDBC, etc.
– BI, reporting, ETL, etc.
But big and fast! And designed from scratch for analytics applications
–Tens of trillions of records (thousands per man, woman, and child in the world)
–Terabytes to petabytes of storage
–Hundreds of computers with tens of thousands of CPUs to crunch the data
Leading customers across industries finding answers
– Promotional testing
– Claims analyses
– Patient records analyses
– Clinical data analyses
– Fraud monitoring
– Financial tracking
– Tick data back-testing
– Behavior analytics
– Clickstream analyses
– Network analyses
– Customer analytics
– Compliance testing
– Loyalty analysis
– Campaign management
Zynga: Winning analytics in a data-driven culture
Challenge
– Provide near real-time analysis on 40-60 billion rows of data ingested per day for 1,000+ employees
Solution
– HPE Vertica Analytics Platform
Result
– Ability to proactively determine what is analyzable, then structure collected data for fast results from HPE Vertica
– Analytics cluster scales 70 times for both Poker and Words With Friends in their fifth year
– 400-600 A/B tests running concurrently with clear metrics
Cerner: Accelerating health information with an analytics platform
– Used by an IT healthcare provider’s platform to detect how long certain application functions take to run
– 6,000% improvement in how long it took to analyze a single client’s timers; with HPE Vertica it now takes only 20 seconds
– Greater scale: prior to HPE, Cerner was collecting 6 billion timers a month; now it’s 10 billion
– 2,000 timers
How Vertica works…and why it is fast
Design goals/basic architecture
– SQL, for the ecosystem and knowledge pool
– Clusters of commodity hardware
– Linux, x86, Ethernet
– Software-only solution (for flexibility)
– Special-purpose hardware has poor track record in databases
– Shared-Nothing MPP
– Cheaper, but puts more complexity in the software
– Run large queries many times faster than a legacy DB, load as fast, but feel free to snarl and growl at UPDATEs and DELETEs
– Sorted, compressed column store for cost and speed, no in-place updates
– Smart algorithms, query optimizer, etc.
Start from how data is stored on disk…
SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '5/13/2011'
Symbol Date Time Price Volume Etc.
… … … … … …
HPQ 05/13/11 01:02:02 PM 40.01 100 …
IBM 05/13/11 01:02:03 PM 171.22 10 …
AAPL 05/13/11 01:02:03 PM 338.02 5 …
GOOG 05/13/11 01:02:04 PM 524.03 150 …
HPQ 05/13/11 01:02:05 PM 39.97 40 …
AAPL 05/13/11 01:02:07 PM 338.02 20 …
GOOG 05/13/11 01:02:07 PM 524.02 40 …
… … … … … …
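The example query can be tried end-to-end on these sample rows. The sketch below uses SQLite as a convenient stand-in for Vertica; table and column names follow the slide, and the date format is simplified to match the inserted rows.

```python
import sqlite3

# Toy version of the trades table from the slide, in SQLite (not Vertica);
# the query has the same shape as the SELECT SUM(volume) example above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (symbol TEXT, date TEXT, time TEXT, price REAL, volume INTEGER)"
)
rows = [
    ("HPQ",  "05/13/11", "01:02:02 PM",  40.01, 100),
    ("IBM",  "05/13/11", "01:02:03 PM", 171.22,  10),
    ("AAPL", "05/13/11", "01:02:03 PM", 338.02,   5),
    ("GOOG", "05/13/11", "01:02:04 PM", 524.03, 150),
    ("HPQ",  "05/13/11", "01:02:05 PM",  39.97,  40),
    ("AAPL", "05/13/11", "01:02:07 PM", 338.02,  20),
    ("GOOG", "05/13/11", "01:02:07 PM", 524.02,  40),
]
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?, ?)", rows)
total, = conn.execute(
    "SELECT SUM(volume) FROM trades WHERE symbol = 'HPQ' AND date = '05/13/11'"
).fetchone()
print(total)  # 140
```

A row store has to scan whole rows to answer this; the rest of the section shows how sorting and columnar layout make the same query much cheaper.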
Sorted data: sort by symbol, date, and time
Symbol Date Time Price Volume Etc.
… … … … … …
AAPL 05/13/11 01:02:03 PM 338.02 5 …
AAPL 05/13/11 01:02:07 PM 338.02 20 …
… … … … … …
GOOG 05/13/11 01:02:04 PM 524.03 150 …
GOOG 05/13/11 01:02:07 PM 524.02 40 …
… … … … … …
HPQ 05/13/11 01:02:02 PM 40.01 100 …
HPQ 05/13/11 01:02:05 PM 39.97 40 …
… … … … … …
IBM 05/13/11 01:02:03 PM 171.22 10 …
… … … … … …
Column files: split into columns, one file per column
Symbol: …, AAPL, AAPL, …, GOOG, GOOG, …, HPQ, HPQ, …, IBM, …
Date: …, 05/13/11 (every row), …
Time: …, 01:02:03 PM, 01:02:07 PM, …, 01:02:04 PM, 01:02:07 PM, …, 01:02:02 PM, 01:02:05 PM, …, 01:02:03 PM, …
Price: …, 338.02, 338.02, …, 524.03, 524.02, …, 40.01, 39.97, …, 171.22, …
Volume: …, 5, 20, …, 150, 40, …, 100, 40, …, 10, …
Etc.: …
A query reads only the files for the columns it references.
Compression + RLE
Symbol (8K distinct values): …, GOOG (×18M), HPQ (×22M), IBM (×19M), …
Date (250 values/yr): …, 05/13/2011 (×150K), …, 05/13/2011 (×220K), …, 05/13/2011 (×150K), …
Volume: …, 22, 150, 40, …, 99, 100, 40, …, 200, 10, 18, …
Long runs of repeated values in the sorted columns collapse to (value, count) pairs.
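The run-length encoding idea can be sketched in a few lines. This is an illustrative model, not Vertica's actual on-disk format: a sorted column collapses each run of equal values to a (value, count) pair.

```python
# Minimal run-length encoding sketch (illustrative, not Vertica's format).
def rle_encode(column):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, count) pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

symbols = ["AAPL", "AAPL", "GOOG", "GOOG", "HPQ", "HPQ", "IBM"]
encoded = rle_encode(symbols)
print(encoded)  # [('AAPL', 2), ('GOOG', 2), ('HPQ', 2), ('IBM', 1)]
assert rle_decode(encoded) == symbols
```

On real data the effect is dramatic: a 22-million-row run of 'HPQ' is stored once with a count, and aggregates can often operate on the counts directly instead of the expanded values.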
Clustering/MPP/scale-out
– Parallel design enables distributed storage and workload
– “Active” redundancy
– Automatic replication, failover, and recovery
– Shared-nothing database architecture provides high scalability on clusters of commodity hardware
– Add nodes to achieve optimal capacity and performance
– Lower data center costs, higher density, scale-out
– No specialized nodes
– All nodes are peers
– Query/load to any node
– Continuous/ real-time load and query
Client network
Private data network (IP)
Nodes 1–3, each: 2 hex-core CPUs, 96+ GB RAM, 5+ TB storage
Nodes are peers
Distributed query execution
– Client connects to a node and issues a query
– Node the client is connected to becomes the initiator node
– Other nodes in the cluster become executor nodes
– Initiator node parses the query and picks an execution plan
– Initiator node distributes query plan to executor nodes
select sum(volume) from fact;
(Diagram: the initiator node flanked by executor nodes.)
Distributed query execution
– All nodes execute the query plan locally
– Nodes exchange data during aggregation and joins
– Executor nodes send partial query results back to initiator node
– Initiator node aggregates results from all nodes
– Initiator node returns final result to the user
select sum(volume) from trades;
(Diagram: executor nodes send partial sums to the initiator, which combines them into the final result.)
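The initiator/executor flow above reduces to scatter, local aggregation, and gather. A minimal sketch, with made-up node names and shard contents:

```python
# Sketch of distributed aggregation: each "node" sums its local shard of the
# volume column, then the initiator combines the partial sums. Node names and
# data placement are illustrative, not Vertica's segmentation scheme.
shards = {
    "node1": [100, 40],    # e.g. the HPQ rows landed here
    "node2": [5, 20, 10],  # AAPL and IBM rows
    "node3": [150, 40],    # GOOG rows
}

def executor_partial_sum(shard):
    # Runs locally on each executor; only one number crosses the network.
    return sum(shard)

partials = {node: executor_partial_sum(rows) for node, rows in shards.items()}
final = sum(partials.values())  # the initiator's final aggregation step
print(partials, final)
```

The key property is that each executor ships back a single partial result rather than its raw rows, so network traffic stays tiny even when the shards hold billions of rows.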
Transactions
– Vertica offers full ACID (just at low TPS)
– Queries take a snapshot of the relevant list of files, and need no locks at READ COMMITTED isolation
– Loads do not conflict with each other
– COMMIT – keep the new files
– ROLLBACK – discard them
– Table-level locks for SERIALIZABLE
– Database is essentially its own undo/redo log
–Recovery can be as simple as file copies
*All operations are online
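The snapshot-of-files idea behind lock-free reads can be sketched with a toy model. This is an illustrative simplification, not Vertica internals: a query captures the list of committed files when it starts, so a concurrent load's new files only become visible to queries that begin after COMMIT.

```python
# Sketch of snapshot isolation via immutable files (illustrative model).
committed_files = {"f1": [10, 20], "f2": [30]}  # file name -> its values

def begin_query():
    # A snapshot is just the list of committed file names at query start.
    return list(committed_files)

def query_sum(snapshot):
    # The query only ever reads files that were in its snapshot.
    return sum(sum(committed_files[f]) for f in snapshot if f in committed_files)

snap = begin_query()              # reader starts: sees f1 and f2 only
committed_files["f3"] = [100]     # concurrent load COMMITs a new file
print(query_sum(snap))            # 60: the running query never sees f3
print(query_sum(begin_query()))   # 160: a new query sees the loaded data
```

Because files are never updated in place, COMMIT is "keep the new files," ROLLBACK is "discard them," and recovery can be as simple as copying files, exactly as the slide says.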
(Diagram: data segments A, B, C, D, each stored on two nodes, with changes flowing to both copies.)
Simple query processing
–Optimal data storage and physical schema
– True columnar, sorted, compressed + encoded
– Segmented, cosegmented, and replicated
– Partitioning with partition elimination
– Large I/O reads + writes
–Lock-free queries
–Optimized, vectorized, JIT-compiled code
– Fast data types designed for modern CPUs
–Fast predicate application
– Expression Analysis for sorted/partitioned data
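Fast predicate application on sorted data has a simple core idea: an equality predicate on the sort column becomes two binary searches that bound the matching run, so only that run is touched. A sketch, reusing the trades columns from earlier (Python's `bisect`, not Vertica's actual expression analysis):

```python
import bisect

# Predicate application on a sorted column: symbol = 'HPQ' becomes two
# binary searches bounding the matching run, instead of a full scan.
symbols = ["AAPL", "AAPL", "GOOG", "GOOG", "HPQ", "HPQ", "IBM"]  # sorted
volumes = [5, 20, 150, 40, 100, 40, 10]  # co-sorted companion column

lo = bisect.bisect_left(symbols, "HPQ")
hi = bisect.bisect_right(symbols, "HPQ")
total = sum(volumes[lo:hi])   # only the HPQ run's volumes are read
print(lo, hi, total)          # 4 6 140
```

The same bounds prune at a coarser grain too: whole files or partitions whose min/max range excludes 'HPQ' never need to be opened at all.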
Complex query processing
– Sort, segmentation, and RLE Optimizations for expressions, predicates, aggregation, and joins
– Sophisticated query optimizer designed for columnar query execution
– Subqueries flattened into joins
– Segment data around cluster nodes and CPUs for parallelism
– Two-pass algorithms that are skew-tolerant and reduce reliance on optimizer decisions
– Passes of aggregation and joining are interleaved by the planner/executor, so the most effective strategy is chosen at run time
– Special join implementations for “late materialization,” range lookups, and event series
– Detection and optimization of “Top K” queries
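The "Top K" optimization mentioned above amounts to replacing a full ORDER BY sort with a bounded heap when a LIMIT is present. A sketch using Python's `heapq` as a stand-in for the executor's strategy:

```python
import heapq

# "Top K" sketch: ORDER BY volume DESC LIMIT 3 does not need a full sort.
# A K-element heap gives O(n log k) work instead of O(n log n).
volumes = [100, 10, 5, 150, 40, 20, 40]
top3 = heapq.nlargest(3, volumes)
print(top3)  # [150, 100, 40]
```

On billions of rows the difference between sorting everything and tracking only the current top K is substantial, which is why the optimizer detects this pattern rather than leaving it to the user.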
Automatic database physical design: Vertica Database Designer (DBD)
– Inputs: schema, data, queries/DML
– DBD (“magic”)
– Outputs: segmentation, sort order, compression
Workload management
– Don't want reports to take over the entire system, preventing loads or tactical queries
– Keep some resources (e.g. memory) reserved so that high-priority queries can always begin
– Apply run-time prioritization to manage CPU and I/O
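The reserved-resource idea can be sketched as a tiny admission controller. Pool names and sizes here are invented for illustration; they are not Vertica's resource pool API.

```python
# Sketch of reserved-memory admission control (illustrative; pool names and
# sizes are made up). High-priority work keeps memory set aside so tactical
# queries can always begin, while oversized general queries must queue.
POOLS = {"tactical": 4, "general": 12}   # GB of memory reserved per pool
in_use = {"tactical": 0, "general": 0}

def admit(pool, need_gb):
    """Admit the query if its pool has room, else make it wait."""
    if in_use[pool] + need_gb <= POOLS[pool]:
        in_use[pool] += need_gb
        return True
    return False  # query queues until memory in its pool frees up

assert admit("general", 10)      # big report grabs most of its pool
assert not admit("general", 4)   # a second big report must wait its turn
assert admit("tactical", 1)      # tactical query still starts immediately
```

Run-time CPU and I/O prioritization then layers on top of admission, which is what the chart below is about.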
Resource pools: System, Loader, Web refresh, General
Dynamic prioritization
Q: Are optimizer cost model estimates really that bad?
A: Doesn’t matter!
(Chart: cumulative completion (%) vs. time (s); dynamically prioritized queries complete well ahead of unprioritized ones.)
Analytics platform extensions
–Event series extensions
– Sessionization
– Pattern matching
– Gap filling and interpolation
– Event series joins
–User-defined extensions
– Load source, stream filtering, and parsing
– Scalar functions, aggregates, transforms
– Growing variety of languages to choose from
–Packs/examples for
– Geospatial
– Sentiment
– Data mining, logistic regression, etc.
–Data variety: Flex Tables, files, integration
–Analytics packs
When not to use Vertica
Vertica is NOT an OLTP system
–Single/few record retrievals are, in theory and in practice, far slower in column stores
–While Vertica is ACID compliant, transaction throughput is in the 10s-100s of TPS
– INSERTs must be batched, or use the COPY command
– UPDATEs and DELETEs are run serially within a table
–Referential integrity constraints are not enforced
–Instead, use Vertica in conjunction…
– Keep a log of what happened in the OLTP DBMS, or in a NoSQL “eventually consistent” system
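The "batch your INSERTs" advice above looks like this in practice. SQLite stands in for a real Vertica client connection here; in Vertica the same pattern is a single bulk COPY or one large batched statement rather than thousands of single-row round trips.

```python
import sqlite3

# Sketch of batched loading: one executemany call (analogous to a bulk COPY)
# instead of 10,000 individual single-row INSERT round trips.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

batch = [(i, f"event-{i}") for i in range(10_000)]
conn.executemany("INSERT INTO events VALUES (?, ?)", batch)  # one batched call
conn.commit()

count, = conn.execute("SELECT COUNT(*) FROM events").fetchone()
print(count)  # 10000
```

Batching matters doubly in Vertica's case: each load commit produces new immutable files, so many tiny transactions mean many tiny files and far more overhead than one large load.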
Vertica is not for huge numbers of small queries
–Data sets much less than a terabyte may not warrant an analytic database
–Use an in-memory database/tool (Membase, Memcached, etc.) with Vertica to handle large numbers of tiny point queries
Keep the environment simple
– Linux x86 64-bit only
– While they “should work,” shared storage, filers, etc. add cost, add potential bottlenecks, and perplex our support department when anything goes wrong
– It is a bit silly to break machines into VMs only to stitch them back together with an MPP database, so virtualization is not recommended
– Reasonable network performance is essential
– Loads and some queries may use all-to-all bandwidth
– Do not attempt to span WANs