bluemetalinc
Presented at the Northeast briefing "Big Data Made Real", 17 January 2013, at Microsoft, Cambridge MA
What is Big Data?
A new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis.
Big Data’s impact can be expressed by The Five V’s
VELOCITY
VARIETY
VOLUME
VALUE
+ VISUALIZATION
E-Commerce Site fed by outsourced Ad Servers
Ads appear on a wide range of sites with various offers
A massive amount of data is generated by these servers:
• Web logs and click stream data from the E-Commerce Site
• Ad logs and click stream data from the Ad Servers
• Results in relational transactions on the site
Goal: Maximize Traffic Analysis for Business Value
• Velocity Demo: Pinpoint activity in real-time & react
• Variety Demo: Examine historical trends across sources
• Visualization Demo: Enable ad-hoc data analysis for insights
Demo Context
High volume stream of log activity coming in:
• Web logs and Ad Server logs
Real-time stream analysis pinpoints activity as it happens
Simultaneously join structured and unstructured data in a persistent query
Can be used for A/B testing, offer improvement, dynamic site behavior, or fraud detection
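The persistent-query idea above can be sketched in a few lines of Python. This is only an illustration of a time-windowed join, not the StreamInsight query used in the demo: the function name `correlate_clicks`, the tuple layout, the five-minute window, and the sample events are all assumptions made for the example.

```python
from collections import deque
from datetime import datetime, timedelta


def correlate_clicks(events, window=timedelta(minutes=5)):
    """Yield (cookie, click_time, visit_time) when a site visit follows an
    ad click from the same cookie within `window`.

    `events` is an iterable of (timestamp, source, cookie) tuples in time
    order; source is "ad" (ad server click) or "web" (site page view).
    The layout is hypothetical, chosen only for this sketch.
    """
    recent_clicks = deque()  # (timestamp, cookie), oldest first
    for ts, source, cookie in events:
        # Evict ad clicks that have fallen out of the time window.
        while recent_clicks and ts - recent_clicks[0][0] > window:
            recent_clicks.popleft()
        if source == "ad":
            recent_clicks.append((ts, cookie))
        elif source == "web":
            # Join the visit against every click still inside the window.
            for click_ts, click_cookie in recent_clicks:
                if click_cookie == cookie:
                    yield cookie, click_ts, ts


# Made-up sample stream for illustration.
t0 = datetime(2013, 1, 17, 9, 0)
events = [
    (t0, "ad", "a"),
    (t0 + timedelta(minutes=1), "web", "a"),   # follows the click: match
    (t0 + timedelta(minutes=2), "web", "b"),   # no click for this cookie
    (t0 + timedelta(minutes=10), "web", "a"),  # click has expired by now
]
matches = list(correlate_clicks(events))
```

The query is "persistent" in the sense that it keeps running over the stream and only holds the clicks inside the current window in memory.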
Velocity Architecture
How to identify when Ad clicks result in Site Traffic?
[Diagram labels: Log Files, Web Servers, Ad Servers]
DEMO: StreamInsight
Variety Architecture
Ad Servers and Web Servers generate different log files with different formats, making them hard to analyze together
Map/Reduce processing lets us execute a single query across the varied data formats stored in Hadoop
Hive provides a traditional query interface to Map/Reduce
Correlate and connect high variety data for trend analysis
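As a rough illustration of connecting high-variety data, the sketch below parses two differently formatted log lines into one common record shape so they can be correlated by cookie. The web log layout is assumed to follow the space-delimited columns of the Hive `logs` table in this deck; the ad server CSV layout, the function names, and the sample lines are hypothetical.

```python
import csv
from io import StringIO


def parse_web_line(line):
    # Assumed space-delimited layout: date time action page_uri cookie
    date1, time1, action, page_uri, cookie = line.split(" ")
    return {
        "source": "web",
        "timestamp": f"{date1} {time1}",
        "action": action.lower(),
        "page_uri": page_uri,
        "cookie": cookie,
    }


def parse_ad_line(line):
    # Hypothetical CSV layout: cookie,timestamp,event,landing_page
    cookie, timestamp, event, landing_page = next(csv.reader(StringIO(line)))
    return {
        "source": "ad",
        "timestamp": timestamp,
        "action": event.lower(),
        "page_uri": landing_page,
        "cookie": cookie,
    }


# Made-up sample lines for illustration.
web_line = "2013-01-17 09:00:05 view /products/42 abc-123"
ad_line = "abc-123,2013-01-17 08:59:40,Click,/products/42"
records = [parse_web_line(web_line), parse_ad_line(ad_line)]
```

Once both sources share the same keys, the records can be grouped or joined on `cookie` and `page_uri` regardless of which server produced them.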
How to do historical analysis on unstructured data?
[Diagram labels: Log Files, Web Servers, Ad Servers, M/R]
Hive HQL Queries
CREATE EXTERNAL TABLE logs (
    date1    STRING,
    time1    STRING,
    action   STRING,
    page_uri STRING,
    cookie   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'asv://logs/logs/';

CREATE TABLE log_summary AS
SELECT
    l.cookie,
    MAX(regexp_replace(cookie, '[-]', '') % 36) AS geo_hash,
    MAX(l.time1) AS time1,
    l.page_uri,
    MAX(CASE LOWER(action) WHEN 'click' THEN concat(l.date1, ' ', l.time1) ELSE NULL END) AS click_time,
    MIN(CASE LOWER(action) WHEN 'view' THEN concat(l.date1, ' ', l.time1) ELSE NULL END) AS view_time,
    MAX(l.date1) AS date1
FROM logs l
GROUP BY l.cookie, l.page_uri;
Access Azure blob storage via a Hive “view” and aggregate session data
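To make the aggregation in `log_summary` easier to follow, here is a small Python sketch of the same grouping logic: one row per (cookie, page_uri), with the latest click time and the earliest view time. It is an approximation for illustration only; it omits the `geo_hash`, `time1`, and `date1` columns, assumes timestamps sort correctly as strings, and uses made-up sample rows and a hypothetical `summarize` helper.

```python
def summarize(rows):
    """Group log rows by (cookie, page_uri).

    `rows` is an iterable of dicts with the keys date1, time1, action,
    page_uri, and cookie, mirroring the columns of the `logs` table.
    """
    summary = {}
    for r in rows:
        key = (r["cookie"], r["page_uri"])
        stamp = f'{r["date1"]} {r["time1"]}'
        s = summary.setdefault(key, {"click_time": None, "view_time": None})
        action = r["action"].lower()
        if action == "click":
            # MAX over click rows: keep the latest click.
            if s["click_time"] is None or stamp > s["click_time"]:
                s["click_time"] = stamp
        elif action == "view":
            # MIN over view rows: keep the earliest view.
            if s["view_time"] is None or stamp < s["view_time"]:
                s["view_time"] = stamp
    return summary


# Made-up sample rows for illustration.
columns = ("date1", "time1", "action", "page_uri", "cookie")
rows = [
    dict(zip(columns, values))
    for values in [
        ("2013-01-17", "09:00:00", "view", "/home", "c1"),
        ("2013-01-17", "09:00:30", "Click", "/home", "c1"),
        ("2013-01-17", "09:02:00", "view", "/home", "c1"),
        ("2013-01-17", "09:05:00", "view", "/offers", "c2"),
    ]
]
summary = summarize(rows)
```

A group with no click keeps `click_time` as `None`, which corresponds to the `NULL` the Hive `CASE` expression produces when no row matches.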
DEMO: Azure HDInsight
Hadoop Ecosystem Overview
• The Hadoop kernel is HDFS plus M/R
• MapReduce brings the code to the data
• An open set of tools exists to extend its functional uses and representations
Hadoop is an open source framework for building large-scale, distributed, data-intensive applications
Map/Reduce Distributes Processing of Operations
The "Map" step
The mappers are responsible for reading the input data and emitting key/value pairs. The input file can be CSV, XML, or any format as long as it can be converted into k/v pairs.
The "Reduce" step
Each reducer executes a function on all values for a given key. The framework ensures that all values for the same key are sent to the same reducer.
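The two steps can be mimicked in a single Python process to show how the pieces fit together. This is a toy sketch, not Hadoop code: the function names `map_step`, `shuffle`, `reduce_step`, and `run_job` are invented for the example, the in-memory `shuffle` stands in for the framework's routing of keys to reducers, and the log lines are made up in the space-delimited layout of the `logs` table.

```python
from collections import defaultdict


def map_step(line):
    # Read one input record and emit key/value pairs.
    date1, time1, action, page_uri, cookie = line.split(" ")
    yield (page_uri, action.lower()), 1


def shuffle(pairs):
    # Stand-in for the framework: collect all values for the same key
    # so that a single reducer call sees them together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    # Execute a function (here, a sum) on all values for a given key.
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_step(line))
    return dict(reduce_step(k, vs) for k, vs in shuffle(pairs).items())


# Made-up sample input for illustration.
lines = [
    "2013-01-17 09:00:00 view /home c1",
    "2013-01-17 09:00:30 click /home c1",
    "2013-01-17 09:01:00 view /home c2",
    "2013-01-17 09:05:00 view /offers c2",
]
counts = run_job(lines)
```

In a real cluster the map calls run in parallel near the data and the grouping happens across machines, but the contract between the steps is the same.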
Visualization Architecture
Ad Servers and Web Servers generate different log files with different formats, making them hard to analyze together
Map/Reduce processing lets us execute a single query across the varied data formats stored in Hadoop
Hive provides a traditional query interface to Map/Reduce
Correlate and connect high variety data for trend analysis
How to do ad-hoc data discovery and visualizations?
[Diagram labels: Log Files, Web Servers, Ad Servers, M/R]
DEMO: Excel & Hive Adapter
Big Data & Analytics Projects are often Additive
• New Capabilities layered on top of existing data & apps
• Analytics can drive Applications in new ways
Visualizations put Big Data in the hands of the Business
Summary
We are BlueMetal Architects
Take the next steps – Imagine, Define, Build
Take the next steps - our offerings
Envisioning & Strategy Briefing: Big Data, Analytics & Collaboration
Envisioning Session: Data is the App – Envisioning the Next Generation, Data Driven Enterprise
Architecture Design Session: Big Data & Analytics
Healthcare / Life Sciences: Strategy Briefing or Architecture Design Session – Big Data Architecture, Cloud & Use Case Driven Analytics and applications, Portal, M-Health and UX design for Providers, Patients, Pharma & Biotechnology
Financial Services: Strategy Briefing or Architecture Design Session – Big Data & Analytics for Banking, Capital Markets, Retail Brokerage or Insurance
Thank You
Who We Are
[Diagram: three tiers labeled Differentiation, Specialization, and Foundation. Labels shown: UX, Data, Social; Design, Code, Services; Desktop, Mobile, Web Client; Analytics, Big Data, Core SQL; Web Content, Intranets, Collaboration; .NET, Java; On-Premise, Cloud PPP; Strategy, Creative, Analysis]