20130117 - Big Data Architectures

DESCRIPTION

Presented at the Northeast briefing "Big Data Made Real", 17 January 2013, at Microsoft, Cambridge MA

Page 1: 20130117 - Big Data Architectures
Page 2: 20130117 - Big Data Architectures

What is Big Data?

A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis

Page 3: 20130117 - Big Data Architectures

Big Data’s impact can be expressed by The Five V’s

VELOCITY

VARIETY

VOLUME

VALUE

+ VISUALIZATION

Page 4: 20130117 - Big Data Architectures

Demo Context

E-Commerce Site fed by outsourced Ad Servers

Ads appear on a wide range of sites with various offers

A massive amount of data is generated by these servers:

• Web logs and click stream data from the E-Commerce Site

• Ad logs and click stream data from the Ad Servers

• Relational transactions that result on the site

Goal: Maximize Traffic Analysis for Business Value

• Velocity Demo: Pinpoint activity in real time and react

• Variety Demo: Examine historical trends across sources

• Visualization Demo: Enable ad-hoc data analysis for insights

Page 5: 20130117 - Big Data Architectures

Velocity Architecture

How to identify when Ad clicks result in Site Traffic?

High-volume stream of log activity coming in:

• Web logs and Ad Server logs

Real-time stream analysis pinpoints events in the data as they happen

Simultaneously join structured and unstructured data in a persistent query

Can be used for A/B testing, offer improvement, dynamic site behavior, or fraud detection

[Diagram labels: LOG FILES, WEB SERVERS, AD SERVERS]
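The persistent query described on this slide can be pictured as a time-windowed join between the two log streams. The sketch below is plain Python, not StreamInsight code. It assumes both logs carry a shared `cookie` value and arrive in timestamp order. The `Event` type, the `match_clicks_to_visits` function, and the 30-second window are hypothetical names and values chosen for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, NamedTuple, Tuple


class Event(NamedTuple):
    ts: float     # event time in seconds
    source: str   # "ad" for an ad-server click, "web" for a site page view
    cookie: str   # visitor identifier assumed to appear in both logs


def match_clicks_to_visits(
    events: Iterable[Event], window: float = 30.0
) -> Iterator[Tuple[Event, Event]]:
    """Yield (ad_click, site_visit) pairs that share a cookie and occur
    within `window` seconds of each other. Events must be time-ordered."""
    recent_clicks: Deque[Event] = deque()
    for ev in events:
        # Expire ad clicks that have fallen out of the time window.
        while recent_clicks and ev.ts - recent_clicks[0].ts > window:
            recent_clicks.popleft()
        if ev.source == "ad":
            recent_clicks.append(ev)
        elif ev.source == "web":
            for click in recent_clicks:
                if click.cookie == ev.cookie:
                    yield click, ev


stream = [
    Event(0.0, "ad", "c1"),
    Event(5.0, "web", "c1"),    # follows the c1 ad click within the window
    Event(10.0, "ad", "c2"),
    Event(50.0, "web", "c2"),   # arrives after the c2 click has expired
]
matches = list(match_clicks_to_visits(stream))
```

Only the `c1` click is matched to a site visit here. A real streaming engine would also have to handle out-of-order events and state that outgrows memory, which this sketch ignores.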

Page 6: 20130117 - Big Data Architectures

DEMO: StreamInsight

Page 7: 20130117 - Big Data Architectures

Variety Architecture

How to do historical analysis on unstructured data?

Ad Servers and Web Servers generate different log files with different formats, making them hard to analyze

Map/Reduce processing allows us to execute a query across varied data formats stored in Hadoop

Hive provides a traditional query interface to Map/Reduce

Correlate and connect high-variety data for trend analysis

[Diagram labels: LOG FILES, WEB SERVERS, AD SERVERS, M/R]
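One way to picture querying across different log formats is a parsing step that maps each format onto a common record before any correlation happens. The Python sketch below assumes a space-delimited web log with the five fields used in the deck's Hive table (date, time, action, page URI, cookie). The ad-server format of semicolon-separated `key=value` pairs is purely hypothetical, as are the function names and the output field names.

```python
from typing import Dict, Optional


def parse_web_line(line: str) -> Optional[Dict[str, str]]:
    """Web log, space-delimited: date time action page_uri cookie."""
    parts = line.split(" ")
    if len(parts) != 5:
        return None  # skip malformed lines rather than fail the whole job
    date1, time1, action, page_uri, cookie = parts
    return {
        "source": "web",
        "when": f"{date1} {time1}",
        "action": action.lower(),
        "page_uri": page_uri,
        "cookie": cookie,
    }


def parse_ad_line(line: str) -> Optional[Dict[str, str]]:
    """Hypothetical ad-server log: semicolon-separated key=value pairs."""
    fields = dict(pair.split("=", 1) for pair in line.split(";") if "=" in pair)
    if not all(key in fields for key in ("ts", "event", "landing", "cookie")):
        return None
    return {
        "source": "ad",
        "when": fields["ts"],
        "action": fields["event"].lower(),
        "page_uri": fields["landing"],
        "cookie": fields["cookie"],
    }


web = parse_web_line("2013-01-17 10:15:00 view /offers/1 abc-123")
ad = parse_ad_line("ts=2013-01-17 10:14:30;event=CLICK;landing=/offers/1;cookie=abc-123")
```

Once both sources share one record shape, the same grouping or join logic can run over either of them.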

Page 8: 20130117 - Big Data Architectures

Hive HQL Queries

-- External table over the raw, space-delimited log files in blob storage
CREATE EXTERNAL TABLE logs (
    date1 STRING,
    time1 STRING,
    action STRING,
    page_uri STRING,
    cookie STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'asv://logs/logs/';

-- One summary row per cookie and page: latest click, earliest view
CREATE TABLE log_summary AS
SELECT
    l.cookie
    ,MAX(regexp_replace(cookie, '[-]', '') % 36) AS geo_hash
    ,MAX(l.time1) AS time1
    ,l.page_uri
    ,MAX(CASE LOWER(action) WHEN 'click' THEN concat(l.date1, ' ', l.time1) ELSE NULL END) AS click_time
    ,MIN(CASE LOWER(action) WHEN 'view' THEN concat(l.date1, ' ', l.time1) ELSE NULL END) AS view_time
    ,MAX(l.date1) AS date1
FROM logs l
GROUP BY l.cookie, l.page_uri;

Access Azure blob storage via a Hive “view” (an external table over the log files) and aggregate session data per cookie and page
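To make the aggregation easier to follow, here is a rough Python equivalent of the `click_time` and `view_time` columns of `log_summary`. It is only a sketch: it leaves out `geo_hash`, `time1`, and `date1`, and it assumes the date and time strings are fixed-width so that string comparison orders them chronologically. The `summarize` function and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Tuple[str, str, str, str, str]  # date1, time1, action, page_uri, cookie
Summary = Dict[Tuple[str, str], Dict[str, Optional[str]]]


def summarize(rows: Iterable[Row]) -> Summary:
    """Group by (cookie, page_uri); keep the latest click and earliest view."""
    summary: Summary = {}
    for date1, time1, action, page_uri, cookie in rows:
        entry = summary.setdefault(
            (cookie, page_uri), {"click_time": None, "view_time": None}
        )
        stamp = f"{date1} {time1}"
        kind = action.lower()
        if kind == "click":
            # Mirrors MAX(CASE ... 'click' ...): keep the latest click.
            if entry["click_time"] is None or stamp > entry["click_time"]:
                entry["click_time"] = stamp
        elif kind == "view":
            # Mirrors MIN(CASE ... 'view' ...): keep the earliest view.
            if entry["view_time"] is None or stamp < entry["view_time"]:
                entry["view_time"] = stamp
    return summary


rows = [
    ("2013-01-17", "10:00:00", "view", "/a", "c1"),
    ("2013-01-17", "10:00:05", "CLICK", "/a", "c1"),
    ("2013-01-17", "10:02:00", "view", "/a", "c1"),
    ("2013-01-17", "10:03:00", "view", "/b", "c2"),
]
result = summarize(rows)
```

As in the Hive query, a group with no click keeps an empty `click_time` (NULL in Hive, `None` here).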

Page 9: 20130117 - Big Data Architectures

DEMO: Azure HDInsight

Page 10: 20130117 - Big Data Architectures

Hadoop Ecosystem Overview

• Hadoop is HDFS, the kernel & M/R

• MapReduce brings the code to the data

• An open set of tools exists to extend its functional uses and representations

Hadoop is an open source framework for building large-scale, distributed, data-intensive applications

Page 11: 20130117 - Big Data Architectures

Map/Reduce Distributes Processing of Operations

The "Map" step

The mappers are responsible for reading the input data and emitting key/value pairs. The input file can be CSV, XML, or any format as long as it can be converted into k/v pairs.

The "Reduce" step

Each reducer executes a function on all values for a given key. The framework ensures that all values for the same key are sent to the same reducer.
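The two steps can be imitated in a few lines of single-process Python. This is a toy sketch of the programming model, not Hadoop's API. The `shuffle` function stands in for the framework's guarantee that all values for a key reach the same reducer. All function names are hypothetical, and the input lines are assumed to use the five-field, space-delimited log layout from the Hive slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map: read one log line and emit a (page_uri, 1) pair."""
    parts = line.split(" ")
    if len(parts) == 5:  # date time action page_uri cookie
        yield parts[3], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for the framework's routing."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: run one function over all values for a single key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


lines = [
    "2013-01-17 10:00:00 view /a c1",
    "2013-01-17 10:00:05 click /a c1",
    "2013-01-17 10:03:00 view /b c2",
    "malformed line",
]
hits_per_page = run_job(lines)
```

In a real cluster the map calls run in parallel next to the data blocks and the shuffle moves pairs across the network, but the contract between the two steps is the same.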

Page 12: 20130117 - Big Data Architectures

Visualization Architecture

How to do ad-hoc data discovery and visualizations?

Ad Servers and Web Servers generate different log files with different formats, making them hard to analyze

Map/Reduce processing allows us to execute a query across varied data formats stored in Hadoop

Hive provides a traditional query interface to Map/Reduce

Correlate and connect high-variety data for trend analysis

[Diagram labels: LOG FILES, WEB SERVERS, AD SERVERS, M/R]
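Ad-hoc discovery usually means pivoting already-queried data along whichever dimension the analyst picks. As a rough Python illustration of one such pivot, the sketch below turns view and click records into a per-page table with a click-through rate. The record shape, the `pivot_by_page` function, and the sample data are assumptions for illustration and do not come from the demo.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Union

Cell = Union[int, Optional[float]]


def pivot_by_page(records: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, Cell]]:
    """Pivot view/click records into one row per page with a click-through rate."""
    views: Counter = Counter()
    clicks: Counter = Counter()
    for rec in records:
        if rec["action"] == "view":
            views[rec["page_uri"]] += 1
        elif rec["action"] == "click":
            clicks[rec["page_uri"]] += 1
    table: Dict[str, Dict[str, Cell]] = {}
    for page in sorted(set(views) | set(clicks)):
        v, c = views[page], clicks[page]
        # Leave the rate undefined when a page has clicks but no recorded views.
        table[page] = {"views": v, "clicks": c, "ctr": c / v if v else None}
    return table


records = (
    [{"action": "view", "page_uri": "/a"}] * 4
    + [{"action": "click", "page_uri": "/a"}]
    + [{"action": "view", "page_uri": "/b"}]
)
table = pivot_by_page(records)
```

A table of this shape is the kind of result that is straightforward to chart once it is loaded into a spreadsheet.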

Page 13: 20130117 - Big Data Architectures

DEMO: Excel & Hive Adapter

Page 14: 20130117 - Big Data Architectures

Summary

Big Data & Analytics Projects are often Additive

• New Capabilities layered on top of existing data & apps

• Analytics can drive Applications in new ways

Visualizations put Big Data in the hands of the Business

Page 15: 20130117 - Big Data Architectures

We are BlueMetal Architects

Page 16: 20130117 - Big Data Architectures

Take the next steps – Imagine, Define, Build

Page 17: 20130117 - Big Data Architectures

Take the next steps - our offerings

Envisioning & Strategy Briefing: Big Data, Analytics & Collaboration

Envisioning Session: Data is the App – Envisioning the Next Generation, Data Driven Enterprise

Architecture Design Session: Big Data & Analytics

Healthcare / Life Sciences: Strategy Briefing or Architecture Design Session – Big Data Architecture, Cloud & Use Case Driven Analytics and applications, Portal, M-Health and UX design for Providers, Patients, Pharma & Biotechnology

Financial Services: Strategy Briefing or Architecture Design Session – Big Data & Analytics for Banking, Capital Markets, Retail Brokerage or Insurance

Page 18: 20130117 - Big Data Architectures

Thank You

Page 19: 20130117 - Big Data Architectures

Who We Are

[Diagram labels: Differentiation, Specialization, Foundation; UX, DATA, SOCIAL; DESIGN, CODE]

Page 20: 20130117 - Big Data Architectures

Who We Are

[Diagram labels: Differentiation, Specialization, Foundation; UX, DATA, SOCIAL; DESIGN, SERVICES; Desktop, Mobile, Web Client; Analytics, Big Data, Core SQL; Web Content, Intranets, Collaboration; .NET, Java, On-Premise, Cloud PPP; Strategy, Creative, Analysis]