The modern analytics architecture


The Modern Analytics Architecture

Making Big Data Useful

Joseph D’Antoni, Solutions Architect

Anexinet

May 7-9, 2014 | San Jose, CA

Please silence cell phones

Joey D’Antoni

• Joey has over 15 years of experience with a wide variety of data platforms, in both Fortune 50 companies and smaller organizations
• He is a frequent speaker on database administration, big data, and career management
• He is the co-president of the Philadelphia SQL Server User’s Group
• He wants you to make sure you can restore your data

Agenda

• Data Warehouses: how did we get here?
• Big Data: Hadoop and more
• Modern Analytic Tools
• Building Our New Architecture


Data Warehouses—A History

• Data Warehousing had its origins in the 1970s: A.C. Nielsen provided clients with data marts

• In 1988, Barry Devlin and Paul Murphy (IBM) published “An Architecture for a Business and Information System”

• In 1996—Ralph Kimball published “The Data Warehouse Toolkit” which showcased models for OLAP style modelling


Data Warehouse Models

• Star Schema

• Advantage is that the DW is easier to use

• Facts and dimensions allow queries to perform faster

• Loading and ETL become more complicated

• Structure changes are very expensive

Dimensional Model
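The star schema bullets above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows (fact_sales, dim_product, dim_date) are hypothetical and do not come from the slides. The sketch shows the typical query shape: one fact table joined to its dimensions, then aggregated by dimension attributes.

```python
import sqlite3

# Hypothetical star schema: one fact table and two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO dim_date    VALUES (10, 2013), (11, 2014);
    INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")

# Typical star-schema query: join the fact table to each dimension,
# then aggregate the measure by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()

for category, year, total in rows:
    print(category, year, total)
```

Every query follows the same fact-to-dimension join pattern, which is what makes the model easy to use; the cost is that the load process has to populate the dimensions and resolve their keys before facts can be inserted.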


Data Warehouse Model

• Tables are grouped by subject area (consumer, finance, products)

• Tables are linked by joins

• Very easy to add information into the database

• Queries are harder to write, and joins can be very expensive performance-wise

Normalization


Data Warehousing Challenges

• Data Quality
• ETL
• Performance and Scalability
• Costs: Licensing and Hardware


Data Quality


Extract, Transform, Load (ETL) Process

[Diagram labels: “Some Database”, “Some Process Your Business Doesn’t Care About”]

Credit—Buck Woody, Microsoft


Performance and Scalability

• Given the volume of data, DW queries can be very slow
• We use techniques like data compression to make them faster
• CPU was the older bottleneck; now it tends to be storage
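As a rough illustration of why compression helps when storage is the bottleneck, here is a minimal sketch of run-length encoding, one simple compression technique. It is not necessarily what any particular database product uses, and the sample column of state codes is hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# Hypothetical sorted column with many repeated values.
column = ["CA"] * 5 + ["NY"] * 3 + ["PA"] * 4
encoded = rle_encode(column)

print(len(column), "values stored as", len(encoded), "runs")
print(encoded)
```

Fewer stored values means fewer bytes read from disk, trading a small amount of CPU for less I/O.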


Costs

• Data Warehouses need large servers
• Database systems are licensed by the size of the server (per core)
• Data Warehouses need a whole lot of fast storage
• Large volumes of fast storage (SANs) are expensive


Traditional Solutions

Classic Data Analysis

Data Warehouse & BI Solutions

ETL

…Uses Just a Subset

Common Technical Themes

There are a lot of “big data” solutions, but most of them have a lot in common

• Built-in HA/DR through multiple copies of the data
• Designed for analytics processing more than OLTP
• Derived from open source solutions
• Designed around local storage and commodity hardware

Components of Modern Architecture

• Hadoop (and its ecosystem)
• EDW
• Analytics Engine
• Visualization Engine

Big Data Workflow for Combined Data and Analytics

[Figure: workflow stages Data, Acquire, Organize, Analyze, Decide]

• Data: Structured (Master and Reference, Transactions); Semi-Structured (Machine Generated (Logs), Web); Unstructured (Text, Image, Audio, Video)
• Acquire: DBMS (OLTP), Files, NoSQL (Key Value Data Store), HDFS
• Organize: ETL/ELT, Change Data Capture, Real-Time, Message-Based, Hadoop MR
• Analyze: ODS, Data Warehouse, Streaming (CEP Engine), In-Database Analytics, Analytics
• Decide: Reporting and dashboards; Alerting and recommendations; EPM, Social Apps; Text analytics and search; Advanced analytics; Interactive discovery
• Hardware: Big Data Cluster, High Speed Network, RDBMS Cluster, In-Memory Analytics

Source—Gartner, Credit Suisse, 8/12

Are We Leaving the RDBMS?


[Chart: CPUs, annotated with “Hadoop Project Starts” and “Exadata Launched”]


Costs—Big Data versus Data Warehouse

[Chart: Hadoop and Data Warehouse Costs. Categories: Server, Storage, Licensing, Total. Series: Hadoop, Data Warehouse. Axis: $0 to $350,000]

• For the same cost you can build a 15-node Hadoop cluster

• The Hadoop cluster would have 3840 GB of RAM versus the 1024 GB in the DW server
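The memory comparison above can be worked through in a few lines. This sketch assumes the RAM is split evenly across the 15 nodes and that the data warehouse figure of 1024 is in GB; the slide states neither explicitly.

```python
# Figures from the slide.
hadoop_nodes = 15
hadoop_total_ram_gb = 3840
dw_ram_gb = 1024  # assumption: the slide's "1024" is in GB

# Assumption: RAM is spread evenly across the Hadoop nodes.
ram_per_node_gb = hadoop_total_ram_gb / hadoop_nodes

# How much more memory the cluster has than the single DW server.
ram_ratio = hadoop_total_ram_gb / dw_ram_gb

print(f"RAM per Hadoop node: {ram_per_node_gb:.0f} GB")
print(f"Cluster RAM vs DW server: {ram_ratio:.2f}x")
```

Under those assumptions each node carries 256 GB, and the cluster has 3.75 times the memory of the DW server for the same spend.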

Enter the Yellow Elephant


Hadoop

Hadoop is the leading Big Data platform (ecosystem)

• Created by Doug Cutting and Mike Cafarella, with much of its early development at Yahoo
• Scales horizontally (2-socket x86 servers in massive clusters)
• Uses big, slow, local storage
• Extremely fault-tolerant
• In a nutshell, it is a Distributed File System (3 copies of data in the cluster) and a programming framework called MapReduce


Introducing Hadoop

• Host 1: Name Node
• Host 2: Secondary Name Node
• Host 3: Data Node
• Host 4: Data Node
• Host 5: Data Node
• Host 6: Data Node


How Map Reduce Works

• Automatic parallelism

• Fault tolerance

Map Phase

Input File: foo.log, stored as HDFS Block 1, HDFS Block 19, HDFS Block 105

1) Read splits into records
• Split 1: K:0 V…
• Split 2: K:123 V…
• Split 3: K:332 V…, K:368 V…

2) Run Map
• Map Task 1: K:INFO V…
• Map Task 2: K:INFO V:1, K:WARN V:1
• Map Task 3: K:Debug V:1, K:INFO V:1

3) Write and Sort Output
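The map phase above can be sketched in plain Python, with no Hadoop involved. The log lines are invented for illustration, and the reduce step is added to complete the standard MapReduce pattern; the slide itself only shows the map phase. The function names are hypothetical.

```python
from collections import defaultdict

LEVELS = ("INFO", "WARN", "DEBUG")


def map_task(split):
    """Map: emit a (log_level, 1) pair for each record in a split."""
    for line in split:
        for level in LEVELS:
            if level in line:
                yield (level, 1)
                break


def shuffle_and_sort(mapped):
    """Group mapper output by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_task(key, values):
    """Reduce: sum the counts emitted for one key."""
    return key, sum(values)


# Three splits of a hypothetical foo.log, one per map task.
splits = [
    ["INFO service started"],
    ["INFO request ok", "WARN slow response"],
    ["DEBUG cache miss", "INFO request ok"],
]

# Each split could be mapped on a different node in parallel.
mapped = [pair for split in splits for pair in map_task(split)]
counts = dict(reduce_task(key, values) for key, values in shuffle_and_sort(mapped))
print(counts)
```

Each map task only sees its own split, which is why the framework can parallelize the work automatically and rerun a single failed task on another node.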

Hadoop Ecosystem

HDFS

MapReduce

Note: This is only a subset of the ecosystem!

YARN


Spark and Shark

• Hadoop 2 Enhancements

• Spark is in-memory

• Shark integrates Spark with Hive

Hadoop Architectural Decisions

• Distribution
• Components
• Support
• Cloud vs On-Premises

Choosing Your Hadoop Distribution

Hadoop Vendors

Technology | Vendor | Description
Hadoop Distributions | Apache | Completely open source software for distributed clusters and map/reduce
Hadoop Distributions | Cloudera | Industry-leading commercial distribution, good management tools
Hadoop Distributions | Hortonworks | Open source distribution, Apache compatible
Hadoop Distributions | MapR | Multiple enhancements to Apache Hadoop (rewrite of HDFS), high performance, enterprise ready
Hadoop Distributions | Pivotal HD | EMC spinoff with strong financial backing; a full high-performance RDBMS (with BI connectors) on top of Hadoop


Cloud vs On-Premises

Cloud

• Short term use
• Rapid scale
• Test use cases
• Pay as you go
• Internet data source

On-Premises

• Large long term implementations
• Well known workloads
• Shared clusters
• Large initial investment

Analytics Engine

Analytics

• Hadoop was not fast
• Full scans of files
• So how do we rapidly analyze data?


Columnar Databases

• Microsoft SQL Server (2012 & 2014)
• PDW
• HP Vertica
• HBase
• ParAccel
• InfiniDB
• EMC Greenplum
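As a rough sketch of the idea behind columnar storage, the example below holds the same hypothetical order data in a row-oriented and a column-oriented layout. The field names and values are invented, and real columnar engines add compression and batch processing on top of this layout.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "East", "amount": 20.0},
    {"order_id": 2, "region": "West", "amount": 30.0},
    {"order_id": 3, "region": "East", "amount": 50.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row store: the aggregate has to visit every full record.
total_row_store = sum(row["amount"] for row in rows)

# Column store: the aggregate reads only the one column it needs.
total_column_store = sum(columns["amount"])

print(total_row_store, total_column_store)
```

Both layouts give the same answer; the difference is that an analytic query over one column of a wide table reads far less data in the columnar layout.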


In-Memory Databases

• SQL Server 2014
• SAP HANA
• Oracle TimesTen
• VoltDB
• Apache Spark


Analytics Tools Past and Present


Data Visualization

Tools for Data Visualization

• Excel (Power View and Power Map)
• Tableau
• Qlik
• Platfora
• Pentaho


Bringing This All Together

Power Query (Excel)

[Diagram labels: “Some Database”, “Some Process Your Business Doesn’t Care About”]

Q & A ?


Thank you for attending this session and the PASS Business Analytics Conference 2014
