The Modern Analytics Architecture Making Big Data Useful Joseph D’Antoni, Solutions Architect Anexinet May 7-9, 2014 | San Jose, CA

Page 1: The modern analytics architecture

The Modern Analytics Architecture
Making Big Data Useful
Joseph D’Antoni, Solutions Architect, Anexinet
May 7-9, 2014 | San Jose, CA

Page 2: The modern analytics architecture

Please silence cell phones

Page 3: The modern analytics architecture

Joey D’Antoni

• Joey has over 15 years of experience with a wide variety of data platforms, in both Fortune 50 companies and smaller organizations
• He is a frequent speaker on database administration, big data, and career management
• He is the co-president of the Philadelphia SQL Server User’s Group
• He wants you to make sure you can restore your data

Page 4: The modern analytics architecture

Agenda

• Data Warehouses: how did we get here?
• Big Data: Hadoop and more
• Modern Analytic Tools
• Building Our New Architecture

Page 5: The modern analytics architecture

Data Warehouses—A History

• Data warehousing had its origins in the 1970s, when A.C. Nielsen provided clients with data marts

• In 1988, Barry Devlin and Paul Murphy (IBM) published “An Architecture for a Business and Information System”

• In 1996, Ralph Kimball published “The Data Warehouse Toolkit”, which showcased dimensional models for OLAP-style analysis

Page 6: The modern analytics architecture

Data Warehouse Models

• Star Schema

• Advantage is that the DW is easier to use

• Facts and dimensions allow queries to perform faster

• Loading and ETL become more complicated

• Structure changes are very expensive

Dimensional Model
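The facts-and-dimensions layout described on this slide can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration only: the table names (`fact_sales`, `dim_product`, `dim_date`), the columns, and the sample rows are all hypothetical and do not come from the talk.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables hold descriptive attributes (hypothetical schema).
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- The fact table holds measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER,
    amount REAL
);

INSERT INTO dim_product VALUES
    (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
INSERT INTO dim_date VALUES (20140507, 2014, 5), (20140601, 2014, 6);
INSERT INTO fact_sales VALUES
    (1, 20140507, 2, 20.0), (2, 20140507, 1, 15.0),
    (3, 20140601, 4, 40.0), (1, 20140601, 1, 10.0);
""")


def sales_by_category(connection):
    """Aggregate the fact table, joining only the one dimension the question needs."""
    return connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


for category, total in sales_by_category(conn):
    print(category, total)
```

A typical question touches the fact table plus one join per dimension involved, which is why the model is easy to query. The cost shows up on the loading side, where the ETL has to resolve every fact row to its dimension keys.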

Page 7: The modern analytics architecture

Data Warehouse Model

• Tables are grouped by subject area (consumer, finance, products)

• Tables are linked by joins

• Very easy to add information into the database

• Queries are harder to write, and joins can be very expensive performance-wise

Normalization
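For contrast, here is a minimal sketch of the normalized approach, again using `sqlite3`. The tables (`customer`, `product`, `orders`, `order_line`) and their contents are hypothetical. The point is that adding data is simple, while even a basic revenue question needs three joins.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Each entity lives in its own table (hypothetical schema).
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id)
);
CREATE TABLE order_line (
    order_id INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES product(product_id),
    quantity INTEGER
);

INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO product VALUES (1, 'Widget', 10.0), (2, 'Gadget', 15.0);
INSERT INTO orders VALUES (100, 1), (101, 2), (102, 1);
INSERT INTO order_line VALUES (100, 1, 2), (100, 2, 1), (101, 2, 2), (102, 1, 1);
""")


def revenue_by_customer(connection):
    """Revenue per customer: the answer is spread over four tables."""
    return connection.execute("""
        SELECT c.name, SUM(l.quantity * p.unit_price)
        FROM customer AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        JOIN order_line AS l ON l.order_id = o.order_id
        JOIN product AS p ON p.product_id = l.product_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()


for name, revenue in revenue_by_customer(db):
    print(name, revenue)
```

On a toy data set the joins are free. On warehouse-sized tables, each additional join is where the performance cost mentioned above appears.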

Page 8: The modern analytics architecture

Data Warehousing Challenges

• Data Quality
• ETL
• Performance and Scalability
• Costs: Licensing and Hardware

Page 9: The modern analytics architecture

Data Quality

Page 10: The modern analytics architecture

Extract, Transform, Load (ETL) Process

[Diagram: ETL process. Labels: Some Database; Some Process; Your; Business Doesn’t Care About]

Credit: Buck Woody, Microsoft

Page 11: The modern analytics architecture

Performance and Scalability

• Given the volume of data, DW queries can be very slow
• We use techniques like data compression to make them faster
• CPU was the older bottleneck; now it tends to be storage

Page 12: The modern analytics architecture

Costs

• Data warehouses need large servers
• Database systems are licensed by the size of the server (per core)
• Data warehouses need a lot of fast storage
• Large volumes of fast storage (SANs) are expensive

Page 13: The modern analytics architecture

Traditional Solutions

Page 14: The modern analytics architecture

Classic Data Analysis

Data Warehouse & BI Solutions

ETL

…Uses Just a Subset

Page 15: The modern analytics architecture

Common Technical Themes

There are a lot of “big data” solutions, but most of them have a lot in common

• Built-in HA/DR through multiple copies of the data
• Designed for analytics processing more than OLTP
• Derived from open source solutions
• Designed around local storage and commodity hardware

Page 16: The modern analytics architecture

Components Of Modern Architecture

• Hadoop (and its ecosystem)
• EDW
• Analytics Engine
• Visualization Engine

Page 17: The modern analytics architecture

Big Data Workflow for Combined Data and Analytics

[Diagram: workflow stages, left to right: Data, Acquire, Organize, Analyze, Decide]

Data types: Structured; Semi-Structured; Unstructured

Data sources: Master and Reference; Transactions; Machine Generated (Logs); Web; Text, Image, Audio, Video

Stores and processing: DBMS (OLTP); Files; NoSQL (Key Value Data Store); HDFS; ETL/ELT; Change Data Capture; Real-Time; Message-Based; Hadoop MR; ODS; Data Warehouse; Streaming (CEP Engine); In-Database Analytics

Analytics

• Reporting and dashboards
• Alerting and recommendations
• EPM, Social Apps
• Text analytics and search
• Advanced analytics
• Interactive discovery

Hardware: Big Data Cluster; High Speed Network; RDBMS Cluster; In-Memory Analytics

Source: Gartner, Credit Suisse, 8/12

Page 18: The modern analytics architecture

Are We Leaving the RDBMS?

Page 19: The modern analytics architecture

[Chart: CPUs, annotated with “Hadoop Project Starts” and “Exadata Launched”]

Page 20: The modern analytics architecture

Costs—Big Data versus Data Warehouse

[Chart: Hadoop and Data Warehouse Costs. Series: Hadoop, Data Warehouse. Categories: Server, Storage, Licensing, Total. Vertical axis: $0 to $350,000]

• For the same cost you can build a 15-node Hadoop cluster

• The Hadoop cluster would have 3840 GB of RAM versus the 1024 GB in the DW server
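The memory comparison above can be checked with a few lines of arithmetic. The three input figures come from the slide. The per-node figure is derived and assumes RAM is spread evenly across the nodes, which the slide does not state.

```python
# Figures quoted on the slide.
HADOOP_NODES = 15
HADOOP_TOTAL_RAM_GB = 3840
DW_RAM_GB = 1024

# Derived values (even distribution across nodes is an assumption).
ram_per_node_gb = HADOOP_TOTAL_RAM_GB / HADOOP_NODES
ram_ratio = HADOOP_TOTAL_RAM_GB / DW_RAM_GB

print(f"RAM per Hadoop node: {ram_per_node_gb:.0f} GB")
print(f"Cluster RAM vs. DW server RAM: {ram_ratio:.2f}x")
```

Under that assumption, each node carries 256 GB and the cluster has 3.75 times the memory of the single data warehouse server for the same spend.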

Page 21: The modern analytics architecture

Enter the Yellow Elephant

Page 22: The modern analytics architecture

Hadoop

Hadoop is the leading Big Data platform (ecosystem)

• Created by Doug Cutting and Mike Cafarella; developed largely at Yahoo
• Scales horizontally (2-socket x86 servers in massive clusters)
• Uses big, slow, local storage
• Extremely fault-tolerant
• In a nutshell: it’s a distributed file system (3 copies of data in the cluster) and a programming framework called MapReduce
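To illustrate why keeping three copies of each block makes the file system fault-tolerant, here is a small simulation. It is a sketch only: it uses simple round-robin placement, not HDFS's actual rack-aware placement policy, and the host names and function names are hypothetical.

```python
REPLICATION_FACTOR = 3  # the "3 copies of data" from the slide


def place_blocks(num_blocks, data_nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes (round-robin, not HDFS's real policy)."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(data_nodes)
        placement[block_id] = [
            data_nodes[(start + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """A block stays readable while at least one of its replicas is on a live node."""
    return {
        block_id
        for block_id, nodes in placement.items()
        if any(node not in failed_nodes for node in nodes)
    }


nodes = ["host3", "host4", "host5", "host6"]  # hypothetical data nodes
placement = place_blocks(8, nodes)
survivors = readable_blocks(placement, failed_nodes={"host3", "host4"})
print(f"{len(survivors)} of {len(placement)} blocks readable after losing two nodes")
```

With three replicas, any two node failures leave at least one copy of every block. Data is lost only when all three nodes holding a block fail before it is re-replicated.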

Page 23: The modern analytics architecture

Introducing Hadoop

Host 1: Name Node
Host 2: Secondary Name Node
Host 3: Data Node
Host 4: Data Node
Host 5: Data Node
Host 6: Data Node

Page 24: The modern analytics architecture

How Map Reduce Works

• Automatic parallelism

• Fault tolerance

Page 25: The modern analytics architecture

Map Phase

Input File: foo.log (HDFS Block 1, HDFS Block 19, HDFS Block 105)

1) Read splits into records

Split 1: K:0 V…
Split 2: K:123 V…
Split 3: K:332 V…, K:368 V…

2) Run Map

Map Task 1: K:INFO V…
Map Task 2: K:INFO V:1, K:WARN V:1
Map Task 3: K:Debug V:1, K:INFO V:1

3) Write and Sort Output
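The three steps above can be imitated in plain Python to show the data flow. This is a single-process sketch, not Hadoop code. The log lines are invented, the record index stands in for the byte-offset key, and a shuffle and reduce step is added after the map phase so the counts can be seen.

```python
from collections import defaultdict

# Invented sample records standing in for foo.log.
LOG_LINES = [
    "2014-05-07 10:00:01 INFO service started",
    "2014-05-07 10:00:02 WARN disk usage high",
    "2014-05-07 10:00:03 INFO request served",
    "2014-05-07 10:00:04 DEBUG cache miss",
    "2014-05-07 10:00:05 INFO request served",
]


def map_record(offset, line):
    """Map step: emit (log level, 1) for one input record."""
    level = line.split()[2]
    yield level, 1


def run_map_phase(splits):
    """Run one map task per split and sort each task's output by key."""
    map_outputs = []
    for split in splits:
        emitted = []
        for offset, line in split:
            emitted.extend(map_record(offset, line))
        map_outputs.append(sorted(emitted))
    return map_outputs


def shuffle(map_outputs):
    """Group values from all map tasks by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values for each key."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


# 1) Read splits into (key, value) records; the index plays the role of the offset.
records = list(enumerate(LOG_LINES))
splits = [records[0:2], records[2:4], records[4:5]]

# 2) Run map and 3) sort output, then shuffle and reduce.
counts = reduce_counts(shuffle(run_map_phase(splits)))
print(counts)
```

Each map task only ever sees its own split, which is what lets the framework run the tasks in parallel on different nodes and rerun a single task if a node fails.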

Page 26: The modern analytics architecture

Hadoop Ecosystem

HDFS

MapReduce

Note: This is only a subset of ecosystem!

Page 27: The modern analytics architecture

YARN

Page 28: The modern analytics architecture

Spark and Shark

• Hadoop 2 Enhancements
• Spark is in-memory
• Shark integrates Spark with Hive

Page 29: The modern analytics architecture

Hadoop Architectural Decisions

• Distribution
• Components
• Support
• Cloud vs On-Premises

Page 30: The modern analytics architecture

Choosing Your Hadoop Distribution

Page 31: The modern analytics architecture

Hadoop Vendors

Technology: Hadoop Distributions

Vendor: Description

• Apache: Completely open source software for distributed clusters and map/reduce
• Cloudera: Industry-leading commercial distribution, good management tools
• Hortonworks: Open source distribution, Apache compatible
• MapR: Multiple enhancements to Apache Hadoop (rewrite of HDFS), high performance, enterprise ready
• Pivotal HD: EMC spinoff with strong financial backing; a full high-performance RDBMS (with BI connectors) on top of Hadoop

Page 32: The modern analytics architecture

Cloud vs On-Premises

Cloud

• Short-term use
• Rapid scale
• Test use cases
• Pay as you go
• Internet data source

On-Premises

• Large long-term implementations
• Well-known workloads
• Shared clusters
• Large initial investment

Page 33: The modern analytics architecture

Analytics Engine

Page 34: The modern analytics architecture

Analytics

• Hadoop was not fast
• Full scans of files
• So how do we rapidly analyze data?

Page 35: The modern analytics architecture

Columnar Databases

• Microsoft SQL Server (2012 & 2014)
• PDW
• HP Vertica
• HBase
• ParAccel
• InfiniDB
• EMC Greenplum
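As a rough illustration of why column-oriented storage helps analytics, the sketch below pivots a few rows into per-column lists. It is a toy model, not how any of the products listed above are implemented, and the data and function names are hypothetical. An aggregate then touches a single column, and a sorted low-cardinality column compresses well with run-length encoding.

```python
# Hypothetical row-oriented data: one dict per record.
ROWS = [
    {"order_id": 1, "region": "East", "amount": 20.0},
    {"order_id": 2, "region": "West", "amount": 15.0},
    {"order_id": 3, "region": "East", "amount": 40.0},
    {"order_id": 4, "region": "East", "amount": 10.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


columns = to_columns(ROWS)

# The aggregate reads only the "amount" column, not whole rows.
total_amount = sum(columns["amount"])

# Sorting first makes equal values adjacent, so the column compresses well.
encoded_region = run_length_encode(sorted(columns["region"]))

print(total_amount)
print(encoded_region)
```

Reading less data per query and compressing what is stored both reduce pressure on storage, which the earlier slides identify as the current bottleneck.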

Page 36: The modern analytics architecture

In-Memory Databases

• SQL Server 2014
• SAP HANA
• Oracle TimesTen
• VoltDB
• Apache Spark

Page 37: The modern analytics architecture

Analytics Tools Past and Present

Page 38: The modern analytics architecture

Data Visualization

Page 39: The modern analytics architecture

Tools for Data Visualization

• Excel (Power View and Power Map)
• Tableau
• Qlik
• Platfora
• Pentaho

Page 40: The modern analytics architecture

Bringing This All Together

[Diagram: the ETL process diagram again, with Power Query (Excel). Labels: Some Database; Some Process; Your; Business Doesn’t Care About]

Page 41: The modern analytics architecture

Q & A ?

Page 42: The modern analytics architecture

Session Evaluations

Submit by 5pm Friday May 9 to WIN prizes

Your feedback is important and valuable.

Ways to access:

• Go to passbac2014/evals
• Download the PASS EVENT App from your App Store and search: PASS BAC 2014
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide

Page 43: The modern analytics architecture

Thank You for attending this session and the PASS Business Analytics Conference 2014

May 7-9, 2014 | San Jose, CA