Big Data Meetup: Machine Data Analytics
Raghuram Velega
IBM Software Architect, Big Data Analytics
© 2013 IBM Corporation
Relevant Operations Data is Huge
A typical enterprise of 5,000 servers with 125 applications across 2 or 3 data centers generates in excess of 1.4 TB of data per day.
• 9 GB of storage data per day: 175K fiber ports
  – 10 metrics per port, collected every 5 minutes, 0.5 KB per port
  – 25K volumes, 10 metrics per volume, 0.5 KB per volume
  – 0.5 KB × (65K ports and volumes) × 12 × 24 ≈ 9.3 GB/day
• 2 GB of network performance data per day for data center networks (not access networks)
  – 180 64-port switches and 4 routers to manage the physical network
Data flow of approximately 1 TB of unstructured data and 0.4 TB of metric data per day; scaled to 20K servers, approximately 4 TB unstructured and 1.6 TB metric data.
Daily Metric Output:
• 250 MB of event data from 125,000 events
• 125 MB of endpoint management data from 5K servers
• 12 GB of performance data for 5,000 servers
• 1 GB of performance data for 5,000 virtual machines
• 8 GB of application middleware data
  – Assumptions: 40% of servers running monitored middleware
  – Average 60 metrics each, collected every 15 minutes
  – Average PMDB insert 1,000 bytes, 40 inserts/server
• 500 MB of application transaction tracking data for 125 applications
• 1 TB of log file data per day
  – 200 MB average per server (some will be smaller, some larger)
  – Example: WAS instances typically produce 400-750 MB of logs per day
• 0.35 TB of security data collected per day
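The 9.3 GB/day storage figure above follows from simple arithmetic; the sketch below reproduces it (the 65K entity count, 0.5 KB sample size, and 5-minute interval are taken from the slide):

```python
# Reproduce the ~9.3 GB/day storage-metric estimate from the slide.
ENTITIES = 65_000          # fiber ports + volumes combined (slide figure)
KB_PER_SAMPLE = 0.5        # payload per entity per collection
SAMPLES_PER_DAY = 12 * 24  # every 5 minutes -> 12 per hour, 24 hours

kb_per_day = ENTITIES * KB_PER_SAMPLE * SAMPLES_PER_DAY
gb_per_day = kb_per_day * 1_000 / 1_000_000_000  # decimal KB -> GB

print(f"{gb_per_day:.2f} GB/day")  # ~9.36 GB/day, the slide's "9.3 GB"
```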
Operational data is growing 15-20% per year.
Shifting market for IT Operations
• APM Digest survey* of senior IT Ops leaders at Fortune 500 companies
  – 50% report growing dissatisfaction with traditional performance management solutions for production IT
  – Inability to adapt to rapidly changing applications and workloads
  – 30% believe they do not have a way to proactively detect problems
  – Looking to operate on raw data and gain actionable insights
• IT analytics solutions can predict, detect, and help solve problems by churning through piles of data and translating it into understandable, relevant information and actionable insights.
* Source: APMdigest: http://apmdigest.com/it-analytics-emerging-as-dissatisfaction-grows-with-apm-and-bsm-tools
Operational Visibility
IT Overwhelmed by data
Simple ad hoc and scheduled reporting to enable comparison of multiple metrics and data sources.
Streaming data analytics to provide real-time information and process big data volumes easily.
Predictive analytics enables forecasting and trending to provide foresight into resource demand, capacity, and availability, and to clarify potential risks.
Holistic and accurate diagnosis using guiding technology with behavioral learning capabilities.
Advanced correlation and pattern recognition to identify and resolve complex, otherwise undetectable events in real time.
• Performance trending to plan for growth
• Self-learning capabilities to automatically adapt to change
• Detect capacity issues prior to business impact
• Notice problems sooner and more accurately
• Reduce false alerts to lower management costs
• Automated threshold setting for quicker deployment
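The "automated threshold setting" capability can be illustrated with a minimal baseline model: learn a band from recent samples and flag values outside it (a toy sketch, not the product's algorithm):

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    """Return (lower, upper) bounds derived from recent samples."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, history, k=3.0):
    """True when a new sample falls outside the learned band."""
    lo, hi = dynamic_threshold(history, k)
    return not (lo <= value <= hi)

cpu_history = [41, 43, 40, 44, 42, 43, 41, 42]  # % utilization samples
print(is_anomalous(88, cpu_history))  # spike well outside the band
```

No manually configured static threshold is needed: the band adapts as the history window moves, which is the point of "automated threshold setting for quicker deployment."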
Leveraging analytics for IT Operations
Exploiting IBM's breadth of analytics initiatives: proactively mitigate risk, attain insights to optimize actions, and reduce cost of ownership across Business, IT Operations, Asset Management, and more.
InfoSphere BigInsights
• How should I plan maintenance to efficiently keep my assets operational, given what I know today about my six-month resource availability?
• "What if" we change our preventive maintenance strategy?
• Help me track capacity and performance of applications and services in cloud/virtual environments. When do I need to add more capacity?
• Show me how to reduce the cost of running my virtual infrastructure and make it more compliant with best practices.
Search
• What is driving my high maintenance costs and what can I do to address this?
• How can I reduce reserved material inventory due to work order backlog?
• How do we make sense out of the terabytes of metric and log data that is generated by our applications and the infrastructure on which they run to isolate problems and reduce downtime?
• Can I analyze my channel traffic to achieve improved customer insight and intelligence?
Optimize
IT Operations needs analytics to predict, to search and to optimize
Predict
• Can we predict/project failure occurrences for specific asset types?
• How can we get early warning of failures in my critical retail applications?
• Can I predict which KPIs are going to cause application issues without manually configuring thresholds? I have 100s of thousands of KPIs.
• I want to predict my online banking outages and take corrective actions before customers hit them.
How Can the Big Data Platform Help?
• Assemble and combine a relevant mix of information
• Discover and explore with smart visualizations
• Analyze, predict, and automate for more accurate answers
• Take action and automate processes
• Optimize analytical performance and IT costs
• Reduce infrastructure complexity and cost
• Manage, govern, and secure information
Enabling organizations to
Performance Management
Content Analytics
Decision Management
Risk Analytics
Business Intelligence and Predictive Analytics
Information Integration and Governance
BIG DATA PLATFORM
SECURITY, SYSTEMS, STORAGE AND CLOUD
Sales | Marketing | Finance | Operations | IT | Risk | HR
ANALYTICS
SOLUTIONS
Industry
CONSULTING and IMPLEMENTATION SERVICES
Content Management
Data Warehouse
Stream Computing
Hadoop System
IBM Provides a Holistic and Integrated Approach to Big Data and Analytics
Accelerators
Information Integration & Governance
Data Warehouse
Stream Computing
Hadoop System
Discovery | Application Development | Systems Management
Data | Media | Content | Machine | Social
BIG DATA PLATFORM
The Platform for New Insight and Applications
InfoSphere Streams: Analyze streaming data and large data bursts for real-time insights
InfoSphere BigInsights: Cost-effectively analyze petabytes of unstructured and structured data
InfoSphere Data Explorer: Discover, understand, search, and navigate federated sources of big data
Big Data Exploration: Find, visualize, and understand all big data to improve business knowledge
Enhanced 360° View of the Customer: Achieve a true unified view, incorporating internal and external sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud, and monitor cyber security in real time
The 5 High Value Big Data Use Cases
Observed Big Data Use Cases
12/11/2013
Machine Data Analysis: 197
Customer behavior / Social analysis: 143
Database offload, reporting, mining: 139
Text analytics: 71
Telco apps: 32
Audio, video, image analysis: 29
Analytic apps: 24
Cyber security: 23
Geospatial location / Space exploration: 22
Statistical / predictive analysis: 20
Financial apps, algo trading: 19
Fraud / Risk: 18
Real-time processing: 14
Environmental sensor apps: 13
Smart grid apps: 13
Event processing: 10
File storage or ECM offload: 8
Medical / Transcriptional profiling: 8
Transportation / SCM: 5
BigInsights as NoSQL store: 4
Source: multiple websites; n=933, with data available for n=812; count of use cases is not mutually exclusive.
Big Data Creates A Challenge – And an Opportunity
What If You Could...
Traditional Big Data Approach
Leverage All of the Data Captured
Reduce Effort Required to Leverage Data
Let Data Lead The Way, and continuously explore
Leverage data as it is captured – In Motion
IBM InfoSphere BigInsights: Machine Data Analytics
Machine Data Analytics: Customer Example
• Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management
• Optimized building energy consumption with centralized monitoring; Automated preventive and corrective maintenance
• Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos
• Do you deal with large volumes of machine data?
• How do you access and search that data?
• How do you perform root cause analysis?
• How do you perform complex real-time analysis to correlate across different data sets?
• How do you monitor and visualize streaming data in real time and generate alerts?
Would Operations Analysis benefit you?
Product Starting Point: InfoSphere BigInsights, InfoSphere Streams
Raw logs and machine data feed: indexing and search, statistical modeling, root cause analysis, federated navigation and discovery, and real-time analysis. Only store what is needed.
BigInsights : Machine Data Analytics
Machine Data Accelerator
Taking Full Advantage of Machine Data Requires New Thinking
Machine Data Characteristics
• From a variety of complex systems with complex formats; no standards
• May not always have context
• Structured and unstructured data
• Extremely large volumes of data
• Streaming data as well as data at rest
• Time-sensitive: requires agility in interpretation and the ability to respond
• Requires sophisticated text analysis
• Adaptive/dynamic algorithms to efficiently process data
• Large-scale indexing
Taking Full Advantage of Machine Data Requires New Thinking
• Correlation across different data sets and/or different environments
• Data may need to be enriched or transformed to provide proper context
• Causal analysis (if there is a problem on Tuesday, what happened on Monday to cause it?)
• Pattern analysis
• Time- and spatial-based analysis
• Unique visualization/UI needs based on data type and industry/application
• Sophisticated search capabilities
Customer Usage Pattern of Log Analysis with MDA
• Step 1:
  – "What is happening in my systems?"
• Step 2:
  – "Let me try to use my experience to correlate the events and sequences."
• Step 3:
  – "I need a tool to do Step 2; I have too many systems and too many logs."
• Step 4:
  – "I need to combine this with my system KPI data and monitor/report in a dashboard. Provide possible solutions to the problem/anomaly."
• Step 5:
  – "I need to predict the behavior when I make changes, add error codes, or add new systems."
Step 1: What is happening in my system?
• This is accomplished by collecting all the log data, then extracting, parsing, indexing, and searching through a faceted interface.
• This is also the phase where basic event-level metrics are desired and tested: max, min, counts, built-in range metrics, and alerts when KPIs are out of range.
• Dashboards that are dynamic and actionable, in sync with the searches, are highly desirable.
• The MDA provides the faceted search interface.
• KEY TECHNOLOGIES: Text Analytics, Faceted Search, BI
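A faceted search over extracted log records can be sketched in a few lines of Python (the record fields here are hypothetical examples, not the accelerator's schema):

```python
from collections import Counter

# Toy extracted log records; real fields come from the extract stage.
records = [
    {"severity": "ERROR", "host": "app01", "code": "HTTP500"},
    {"severity": "WARN",  "host": "app01", "code": "HTTP404"},
    {"severity": "ERROR", "host": "app02", "code": "HTTP500"},
    {"severity": "INFO",  "host": "app02", "code": "HTTP200"},
]

def facet_counts(recs, field):
    """Count how many records carry each value of a field."""
    return Counter(r[field] for r in recs if field in r)

def drill_down(recs, field, value):
    """Narrow the record set, as clicking a facet does in the UI."""
    return [r for r in recs if r.get(field) == value]

print(facet_counts(records, "severity"))
print(drill_down(records, "host", "app01"))
```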
Step 2: Let me correlate events
• In this phase, the customer performs searches and endeavors to make sense of the events and sequences.
  – We usually work side by side with the customer in this stage.
  – We extract the vital tribal knowledge about the applications and the domain.
  – We log their "experiential" notions of event sequences and correlations; this is essential for verifying results when the user wants to go to Step 3.
• KEY TECHNOLOGIES: BigSheets
Step 3: I have too many systems and logs to correlate
• In this phase, the customer essentially wants to find relationships and patterns of occurrence between log events across systems and applications.
• The MDA uses sessionization and sequence-mining capabilities to accomplish this step.
• KEY TECHNOLOGIES: Text Analytics, Machine Learning
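Sessionization plus sequence mining can be approximated as: split a system's time-ordered event stream on large gaps, then count recurring adjacent event pairs. This is a simplification of the accelerator's capability, with invented event codes:

```python
from collections import Counter

def sessionize(events, gap=300):
    """Split a time-ordered (timestamp, code) stream on gaps > `gap` seconds."""
    sessions, current, last = [], [], None
    for ts, code in events:
        if last is not None and ts - last > gap:
            sessions.append(current)
            current = []
        current.append(code)
        last = ts
    if current:
        sessions.append(current)
    return sessions

def frequent_pairs(sessions):
    """Count adjacent event pairs across all sessions."""
    return Counter(p for s in sessions for p in zip(s, s[1:]))

events = [(0, "DISK_WARN"), (60, "IO_RETRY"), (120, "DISK_WARN"),
          (180, "IO_RETRY"), (4000, "BOOT")]
sessions = sessionize(events)
print(frequent_pairs(sessions).most_common(1))
```

A pair that recurs often across sessions ("DISK_WARN followed by IO_RETRY") is exactly the kind of cross-log relationship the customer is looking for in this step.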
Step 4: Combine with my KPI, Topology data
• Once Step 3 is completed, integration with the KPI, topology, and monitoring data is possible.
• This step allows us to expose the capabilities to the network operator and end user.
• KEY TECHNOLOGIES: data joins, SQL/JAQL, BigSheets, reporting dashboards
Step 5: Predict events based on patterns
• The more advanced customers and network operators would like to build predictive models based on the patterns they see in log-data events.
• Customers want to build models that help with meeting enterprise SLAs for systems.
• Downtime scheduling for systems is a complex problem for most data centers.
• KEY TECHNOLOGIES: Machine Learning (R, SPSS, System ML)
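A minimal version of such a predictive model is a least-squares trend over per-interval error counts, extrapolated one interval ahead (purely illustrative; real deployments would use R, SPSS, or System ML as noted above):

```python
def linear_trend(ys):
    """Ordinary least squares fit y = a + b*x for x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def forecast_next(ys):
    """Project the fitted trend one interval past the observed data."""
    a, b = linear_trend(ys)
    return a + b * len(ys)

errors_per_hour = [2, 3, 5, 6, 8, 9]   # steadily climbing error rate
print(forecast_next(errors_per_hour))  # projected count for the next hour
```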
High-Level Workflow
Apply Adapter
• What: copy the logs from the machines where they are generated into HDFS.
• How: BigInsights Distributed Copy app + MDA extensions.
• Advantages:
  – Uses FTP/SFTP protocols supported by the Distributed Copy app
  – MDA extensions allow batch incremental processing and batch replacement
  – MDA extensions associate metadata (such as server names) with the data, making it available to downstream analysis
Import
• What:
  – Identify log record boundaries
  – Extract information from log records in text and XML
• How: BigInsights Text Analytics.
• Advantages:
  – Robust text extraction using an SQL-like language; avoids brittle custom parsers
  – Library of extractors for common log files: syslog, WebSphere, web access, DataPower, CSV, generic
  – Extensive tooling for custom extractor development and app customization (Eclipse-based IDE)
Extract
The Extract Stage: Text analytics applied to log files
Pipeline: raw log files (HDFS/GPFS) → record splitting (AQL) → log records (text) → field and entity extraction (AQL) → semi-structured data (JSON) → to the Transform stage.
AQL extractors are available for many common formats (syslog, WebSphere, CSV, ...). BigInsights ships with tools for creating new extractors.
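In spirit, an extractor turns raw lines into semi-structured records. The Python sketch below mimics that with a regular expression over a hypothetical syslog-like line; the product's AQL extractors are far more robust than a single regex:

```python
import re
import json

# Hypothetical syslog-like line layout; real AQL extractors cover many formats.
LOG_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "   # timestamp
    r"(?P<host>\S+) "                           # originating host
    r"(?P<proc>\w+)(\[\d+\])?: "                # process name, optional pid
    r"(?P<msg>.*)"                              # free-text message
)

def extract(line):
    """Turn one raw log line into a semi-structured record, or None."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = "Mar  4 10:15:32 app01 sshd[212]: Failed password for root"
record = extract(line)
print(json.dumps(record, indent=2))  # JSON record, ready for the next stage
```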
Index
• What
  – Index and facet extracted records and fields so they are available for searching via the faceted search user interface
• How
  – BigInsights BigIndex
• Advantages
  – Find correlated log entries based on time through an interactive UI
  – Add/inject other data (e.g., Excel) to enrich log context
  – Allow operations staff to quickly find log entries based on search terms such as web service name, server name, exception code, transaction ID, etc.
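The indexing step can be pictured as a tiny inverted index mapping (field, value) pairs to record IDs, which is what makes faceted drill-down fast. This is illustrative only, not BigIndex itself:

```python
from collections import defaultdict

def build_index(records, fields):
    """Map (field, value) -> set of record ids: a tiny inverted index."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for f in fields:
            if f in rec:
                index[(f, rec[f])].add(rid)
    return index

def search(index, **criteria):
    """Intersect posting sets: records matching every field=value criterion."""
    sets = [index.get((f, v), set()) for f, v in criteria.items()]
    return set.intersection(*sets) if sets else set()

records = [
    {"server": "app01", "code": "HTTP500"},
    {"server": "app02", "code": "HTTP500"},
    {"server": "app01", "code": "HTTP200"},
]
idx = build_index(records, ["server", "code"])
print(search(idx, server="app01", code="HTTP500"))
```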
Transform
• What
  – Link and enrich log information from different entities
    • Find relationships between log records
    • Integrate structured data (network configuration, user account information, ...) with log data
• How: JAQL
• Advantages
  – High-level language that is big-data aware
  – Out-of-the-box transformers
  – Extensive tooling for application customization (Eclipse IDE)
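The enrichment this stage performs amounts to a join between parsed log records and structured non-log data (JAQL in the product; plain Python below, with hypothetical field names):

```python
# Structured, non-log data: a server inventory keyed by hostname.
inventory = {
    "app01": {"datacenter": "DC1", "owner": "payments"},
    "app02": {"datacenter": "DC2", "owner": "retail"},
}

# Parsed log records from the extract stage.
log_records = [
    {"host": "app01", "code": "HTTP500"},
    {"host": "app02", "code": "HTTP200"},
]

def enrich(records, lookup):
    """Left-join log records with inventory context by host."""
    return [{**r, **lookup.get(r["host"], {})} for r in records]

enriched = enrich(log_records, inventory)
print(enriched[0])  # now carries datacenter and owner alongside the event
```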
The Transform Stage: Linking logs and other information from varied sources
• Input: parsed log records, additional structured data
• Output: individual log records from different IT entities, linked and enriched
Example sources linked: fault data, network logs, web logs, MQ logs, server logs, transaction logs, performance data, performance and fault data, raw logs (HDFS/GPFS), text files, and structured data from non-log sources.
1. IT logs of a single business activity or transaction
   – Up and down the IT stack
2. Logs of activity across one layer of the IT stack (e.g., the OS layer)
   – Messages flowing through a sequence of routers
3. …
The linked logs feed outlier detection, correlations, and predictive models.
Analyze
• What
  – Correlate across fields
  – Find frequently occurring sequences and combinations of events
  – Potential for predictive modeling in the future
• How: System ML
• Advantages
  – Scalable enough to perform analytics on big data
  – Flexible and customizable
  – Easy to plug into applications via a JAQL/Java interface
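"Correlate across fields" can be illustrated by computing the Pearson correlation between two per-interval series, say error counts versus response time. System ML does this at scale; this is a hand-rolled sketch with invented sample data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

errors_per_min = [1, 2, 4, 8, 9, 12]          # extracted from logs
resp_time_ms = [110, 130, 180, 300, 320, 400]  # from performance data
r = pearson(errors_per_min, resp_time_ms)
print(f"correlation = {r:.2f}")  # near 1.0: the fields move together
```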
Agenda
• Introduction
• High-Level Workflow
• Some Highlights
• Demo
Machine Data Adapters
• What are adapters?
  – They adapt a variety of inputs to a standard output.
• Why do we need machine data adapters?
  – To handle different machine-data formats.
Adapters in High-Level Workflow
Apply Adapter
Adapter Functions
• Create
  – Enter the adapter name, log type, a sample of machine data, and the first timestamp in the sample.
  – Check the recommended 'DataTime Format' and 'preTimeStamp Regex', and select defaults such as 'timezone', 'year', and 'month'.
  – Verify the extracted output and save it if it looks good.
  – If the extracted output is bad, go back and edit the 'Data Source Type', 'DataTime Format', and 'preTimeStamp Regex' parameters.
• Edit
• View
• Apply
• Delete
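The role of the pre-timestamp regex and date-time format settings can be sketched as follows; the regex, format string, and sample line are illustrative, not the accelerator's defaults:

```python
import re
from datetime import datetime

# Hypothetical settings a user might enter when creating an adapter.
PRE_TIMESTAMP_REGEX = r"^\["           # text preceding the timestamp
DATETIME_FORMAT = "%d/%m/%Y %H:%M:%S"  # layout of the timestamp itself

def extract_timestamp(line):
    """Strip the pre-timestamp text, then parse the leading timestamp."""
    rest = re.sub(PRE_TIMESTAMP_REGEX, "", line, count=1)
    stamp = rest[:19]                  # fixed width for this format
    return datetime.strptime(stamp, DATETIME_FORMAT)

sample = "[04/03/2013 10:15:32] WebContainer: connection pool exhausted"
print(extract_timestamp(sample))
```

When the extracted output looks wrong, it is usually one of these two settings that needs editing, which is exactly the Create-flow correction step described above.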
Create Machine Data Adapter – Step-1
Create Machine Data Adapter – Step-2
Create Machine Data Adapter – Step-3
Display Machine Data Adapter
Edit Machine Data Adapter – Step-1
Edit Machine Data Adapter – Step-2
Edit Machine Data Adapter – Step-3
Display Machine Data Adapter
Apply Machine Data Adapter
Verify the Adapter (metadata.json)
Delete
Data Explorer for Indexing Application
• Data Explorer index configuration file to support a generic schema for extracted machine data
• Parallelized data pushing to the Data Explorer indexer
• Run the Data Explorer index application
Data Explorer Index Configuration File
• The Data Explorer index config file specifies which fields to index, which field contains the record ID, and the Data Explorer index field definitions: field name, type, searchable, retrievable, filterable, and sortable.
Example:
{
  "source": {
    "dateFormat": "MMM dd yyyy HH:mm:ss.SSS Z",
    "fieldName": "LogDatetime[].normalized_text",
    "suppress": false
  },
  "target": {
    "deFieldName": "LogDatetime",
    "filterable": true,
    "isRecordID": false,
    "retrievable": true,
    "searchable": true,
    "sortable": true,
    "type": "Date"
  }
}
• A default index configuration file is provided.
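A client script consuming one such configuration entry might look like this; the JSON shape mirrors the example above, while the surrounding code and variable names are our own illustration:

```python
import json

# One entry of a Data Explorer index configuration, as in the example above.
config = json.loads("""
{ "source": { "dateFormat": "MMM dd yyyy HH:mm:ss.SSS Z",
              "fieldName": "LogDatetime[].normalized_text",
              "suppress": false },
  "target": { "deFieldName": "LogDatetime", "filterable": true,
              "isRecordID": false, "retrievable": true,
              "searchable": true, "sortable": true, "type": "Date" } }
""")

target = config["target"]
# Collect the behaviors enabled for this field.
flags = [k for k in ("searchable", "retrievable", "filterable", "sortable")
         if target[k]]
print(target["deFieldName"], "->", target["type"], flags)
```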
Parallelizing data pushing to Data Explorer Indexer
• The application uses an Oozie Jaql action to parallelize the job into multiple tasks.
Diagram: the indexing app runs from HDFS on the BI platform/IDE and fans out into Jaql Hadoop tasks 1..M; each task uses BigSearch to locate shards via a ZooKeeper cluster and pushes data to Data Explorer backend shards 1..N.
Run Data Explorer index Application
Basic Facet Search UI on Application Builder
BI Log Monitoring and Analysis
• Ingest BigInsights logs into HBase in real time.
• Create a Log Monitoring Extraction application that extracts log records from HBase.
• Create an Index Management application to delete old index log records from DFS.
• Embed the MDA Search UI within the BigInsights Dashboard for BigInsights log search.
Ingesting BigInsights Logs into HBase
• Chukwa agents are set up on the name node and each of the data nodes.
• Adapters are programmatically installed and removed depending on user configuration.
• A custom Chukwa writer class was created to add logs into HBase in real time.
• A Log4j interface streams logs to the adapters, which stream the logs on to HBase.
• Different log types are concurrently recorded in HBase in a single table.
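Recording several log types concurrently in one HBase table typically comes down to row-key design. The sketch below shows one plausible scheme (our illustration, not the accelerator's actual schema): prefix the key with log type and a zero-padded timestamp so rows sort by type, then time:

```python
def hbase_row_key(log_type, epoch_ms, seq):
    """Compose 'type!timestamp!seq' so rows sort by type, then time."""
    return f"{log_type}!{epoch_ms:013d}!{seq:06d}".encode()

def hbase_cells(line):
    """Hypothetical column layout: one 'log' family, raw line plus level."""
    level = "ERROR" if "ERROR" in line else "INFO"
    return {b"log:raw": line.encode(), b"log:level": level.encode()}

key = hbase_row_key("namenode", 1386720000000, 1)
print(key)
print(hbase_cells("2013-12-11 00:00:00 ERROR NameNode: checkpoint failed"))
```

With keys shaped this way, a scan bounded by a `namenode!` prefix reads one log type in time order without touching the others sharing the table.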
Data Collection Diagram
Name Node: Hadoop Secondary Name Node, Hadoop JobTracker, Hadoop Name Node. Data Nodes 1-3: each runs a Hadoop Data Node, Hadoop Task Tracker, and Hadoop Task Attempt. All stream logs into HBase.
• For HDFS with Symphony MapReduce installations: Hadoop Data Node, Hadoop Name Node, and Hadoop Secondary Name Node logs are supported.
• For GPFS with Apache MapReduce installations: Hadoop JobTracker, Hadoop Task Tracker, and Hadoop Task Attempt logs are supported.
• For GPFS with Symphony MapReduce installations: only Hadoop Task Attempt logs are supported.
HDFS with Apache MapReduce
BigInsights Dashboard
• The user starts BigInsights log collection from the LogCollection app.
• The user can stop BigInsights log collection from the LogCollection app, or by turning off Monitoring.
• The MDA Search UI is wrapped by a frame in the BigInsights Dashboard.
BigInsights Log Monitoring Application
• It is a BigInsights chained application.
• It contains the Log Monitoring Extraction application and the Index application.
• It assumes that the Log Monitoring Extraction application is running in scheduled mode.
• The BigInsights Logs workflow is assumed to be selected for the Index application.
• Any configuration files are assumed to be the default configuration files installed with MDA.
• The "Index Only New Logs" check box in the Index application is assumed to be unchecked.
BigInsights Log Monitoring Application
Agenda
• Introduction
• High-Level Workflow
• New Features in MDA 2.1
• Demo