Big Data Meetup: Machine Data Analytics
Raghuram Velega
IBM Software Architect, Big Data Analytics
© 2013 IBM Corporation
Relevant Operations Data is Huge
A typical enterprise of 5,000 servers with 125 applications across 2 or 3 data centers generates in excess of 1.4 TB of data per day.
• 9 GB of storage data per day: 175K fiber ports
  – 10 metrics per port, collected every 5 minutes, 0.5 KB per port
  – 25K volumes, 10 metrics per volume, 0.5 KB per volume
  – 0.5 KB × (65K ports and volumes) × 12 × 24 ≈ 9.3 GB/day
• 2 GB of network performance data per day for data center networks (not access networks)
  – 180 64-port switches and 4 routers to manage the physical network
Data flow of approximately 1 TB of unstructured data and 0.4 TB of metric data per day; scaled to 20K servers, approximately 4 TB unstructured and 1.6 TB metric data.
Daily Metric Output:
• 250 MB of event data from 125,000 events
• 125 MB of endpoint management data from 5K servers
• 12 GB of performance data for 5,000 servers
• 1 GB of performance data for 5,000 virtual machines
• 8 GB of application middleware data
  – Assumptions: 40% of servers running monitored middleware
  – Average 60 metrics each, collected every 15 minutes
  – Average PMDB insert 1,000 bytes, 40 inserts/server
• 500 MB of application transaction tracking data for 125 applications
• 1 TB of log file data per day
  – 200 MB average per server (some will be smaller, some larger)
  – Example: WAS instances typically produce 400-750 MB of logs per day
• 0.35 TB of security data collected per day
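The 9.3 GB/day storage figure above follows from simple arithmetic; the sketch below reproduces it (the 65K entity count, 0.5 KB sample size, and 5-minute interval are taken from the slide):

```python
# Reproduce the ~9.3 GB/day storage-metric estimate from the slide.
ENTITIES = 65_000          # fiber ports + volumes combined (slide figure)
KB_PER_SAMPLE = 0.5        # payload per entity per collection
SAMPLES_PER_DAY = 12 * 24  # every 5 minutes -> 12 per hour, 24 hours

kb_per_day = ENTITIES * KB_PER_SAMPLE * SAMPLES_PER_DAY
gb_per_day = kb_per_day * 1_000 / 1_000_000_000  # decimal KB -> GB

print(f"{gb_per_day:.2f} GB/day")  # ~9.36 GB/day, the slide's "9.3 GB"
```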
Operational data is growing 15-20% per year.
Shifting market for IT Operations
• APM Digest survey* of senior IT Ops leaders at Fortune 500 companies
  – 50% report growing dissatisfaction with traditional performance management solutions for production IT
  – Inability to adapt to rapidly changing applications and workloads
  – 30% believe they do not have a way to proactively detect problems
  – Looking to operate on raw data and gain actionable insights
• IT analytics solutions can predict, detect, and help solve problems by churning through piles of data and translating it into understandable, relevant information and actionable insights.
* Source: APMdigest: http://apmdigest.com/it-analytics-emerging-as-dissatisfaction-grows-with-apm-and-bsm-tools
Operational Visibility
IT Overwhelmed by data
Simple ad hoc and scheduled reporting to enable comparison of multiple metrics and data sources.
Streaming data analytics to provide real-time information and process big data volumes easily.
Predictive analytics enables forecasting and trending to provide foresight into resource demand, capacity, and availability, and to clarify potential risks.
Holistic and accurate diagnosis using guiding technology with behavioral learning capabilities.
Advanced correlation and pattern recognition to identify and resolve complex, otherwise undetectable events in real time.
• Performance trending to plan for growth
• Self-learning capabilities to automatically adapt to change
• Detect capacity issues prior to business impact
• Notice problems sooner and more accurately
• Reduce false alerts to lower management costs
• Automated threshold setting for quicker deployment
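The "automated threshold setting" capability can be illustrated with a minimal baseline model: learn a band from recent samples and flag values outside it (a toy sketch, not the product's algorithm):

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    """Return (lower, upper) bounds derived from recent samples."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(value, history, k=3.0):
    """True when a new sample falls outside the learned band."""
    lo, hi = dynamic_threshold(history, k)
    return not (lo <= value <= hi)

cpu_history = [41, 43, 40, 44, 42, 43, 41, 42]  # % utilization samples
print(is_anomalous(88, cpu_history))  # spike well outside the band
```

No manually configured static threshold is needed: the band adapts as the history window moves, which is the point of "automated threshold setting for quicker deployment."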
Leveraging analytics for IT Operations
Exploiting IBM's breadth of analytics initiatives: proactively mitigate risk, attain insights to optimize actions, and reduce cost of ownership across Business, IT Operations, Asset Management, and more.
InfoSphere BigInsights
• How should I plan maintenance to efficiently keep my assets operational, given what I know today about my six-month resource availability?
• "What if" we change our preventive maintenance strategy?
• Help me track capacity and performance of applications and services in cloud/virtual environments. When do I need to add more capacity?
• Show me how to reduce the cost of running my virtual infrastructure and make it more compliant with best practices.
Search
• What is driving my high maintenance costs and what can I do to address this?
• How can I reduce reserved material inventory due to work order backlog?
• How do we make sense out of the terabytes of metric and log data that is generated by our applications and the infrastructure on which they run to isolate problems and reduce downtime?
• Can I analyze my channel traffic to achieve improved customer insight and intelligence?
Optimize
IT Operations needs analytics to predict, to search and to optimize
Predict
• Can we predict/project failure occurrences for specific asset types?
• How can we get early warning of failures in my critical retail applications?
• Can I predict which KPIs are going to cause application issues without manually configuring thresholds? I have 100s of thousands of KPIs.
• I want to predict my online banking outages and take corrective actions before customers hit them.
How Can the Big Data Platform Help?
• Assemble and combine a relevant mix of information
• Discover and explore with smart visualizations
• Analyze, predict, and automate for more accurate answers
• Take action and automate processes
• Optimize analytical performance and IT costs
• Reduce infrastructure complexity and cost
• Manage, govern, and secure information
Enabling organizations to
Performance Management
Content Analytics
Decision Management
Risk Analytics
Business Intelligence and Predictive Analytics
Information Integration and Governance
BIG DATA PLATFORM
SECURITY, SYSTEMS, STORAGE AND CLOUD
Sales | Marketing | Finance | Operations | IT | Risk | HR
ANALYTICS
SOLUTIONS
Industry
CONSULTING and IMPLEMENTATION SERVICES
Content Management
Data Warehouse
Stream Computing
Hadoop System
IBM Provides a Holistic and Integrated Approach to Big Data and Analytics
Accelerators
Information Integration & Governance
Data Warehouse
Stream Computing
Hadoop System
Discovery | Application Development | Systems Management
Data | Media | Content | Machine | Social
BIG DATA PLATFORM
The Platform for New Insight and Applications
InfoSphere Streams: Analyze streaming data and large data bursts for real-time insights
InfoSphere BigInsights: Cost-effectively analyze petabytes of unstructured and structured data
InfoSphere Data Explorer: Discover, understand, search, and navigate federated sources of big data
Big Data Exploration: Find, visualize, and understand all big data to improve business knowledge
Enhanced 360° View of the Customer: Achieve a true unified view, incorporating internal and external sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud, and monitor cyber security in real time
The 5 High Value Big Data Use Cases
Observed Big Data Use Cases
12/11/2013
Machine Data Analysis: 197
Customer behavior / Social analysis: 143
Database offload, reporting, mining: 139
Text analytics: 71
Telco apps: 32
Audio, video, image analysis: 29
Analytic apps: 24
Cyber security: 23
Geospatial location / Space exploration: 22
Statistical / predictive analysis: 20
Financial apps, algo trading: 19
Fraud / Risk: 18
Real-time processing: 14
Environmental sensor apps: 13
Smart grid apps: 13
Event processing: 10
File storage or ECM offload: 8
Medical / Transcriptional profiling: 8
Transportation / SCM: 5
BigInsights as NoSQL store: 4
Source: multiple websites; n=933, with data available for n=812; count of use cases is not mutually exclusive.
Big Data Creates A Challenge – And an Opportunity
What If You Could...
Traditional Big Data Approach
Leverage All of the Data Captured
Reduce Effort Required to Leverage Data
Let Data Lead The Way, and continuously explore
Leverage data as it is captured – In Motion
IBM InfoSphere BigInsights: Machine Data Analytics
Machine Data Analytics: Customer Example
• Intelligent Infrastructure Management: log analytics, energy bill forecasting, energy consumption optimization, anomalous energy usage detection, presence-aware energy management
• Optimized building energy consumption with centralized monitoring; Automated preventive and corrective maintenance
• Utilized InfoSphere Streams, InfoSphere BigInsights, IBM Cognos
• Do you deal with large volumes of machine data?
• How do you access and search that data?
• How do you perform root cause analysis?
• How do you perform complex real-time analysis to correlate across different data sets?
• How do you monitor and visualize streaming data in real time and generate alerts?
Would Operations Analysis benefit you?
Product Starting Point: InfoSphere BigInsights, InfoSphere Streams
Raw logs and machine data feed: indexing and search, statistical modeling, root cause analysis, federated navigation and discovery, and real-time analysis. Only store what is needed.
BigInsights : Machine Data Analytics
Machine Data Accelerator
Taking Full Advantage of Machine Data Requires New Thinking
Machine Data Characteristics
• From a variety of complex systems with complex formats; no standards
• May not always have context
• Structured and unstructured data
• Extremely large volumes of data
• Streaming data as well as data at rest
• Time-sensitive: requires agility in interpretation and the ability to respond
• Requires sophisticated text analysis
• Adaptive/dynamic algorithms to efficiently process data
• Large-scale indexing
Taking Full Advantage of Machine Data Requires New Thinking
• Correlation across different data sets and/or different environments
• Data may need to be enriched or transformed to provide proper context
• Causal analysis (if there is a problem on Tuesday, what happened on Monday to cause it?)
• Pattern analysis
• Time- and spatial-based analysis
• Unique visualization/UI needs based on data type and industry/application
• Sophisticated search capabilities
Customer Usage Pattern of Log Analysis with MDA
• Step 1:
  – "What is happening in my systems?"
• Step 2:
  – "Let me try to use my experience to correlate the events and sequences."
• Step 3:
  – "I need a tool to do Step 2; I have too many systems and too many logs."
• Step 4:
  – "I need to combine this with my system KPI data and monitor/report in a dashboard. Provide possible solutions to the problem/anomaly."
• Step 5:
  – "I need to predict the behavior when I make changes, add error codes, or add new systems."
Step 1: What is happening in my system?
• This is accomplished by collecting all the log data, then extracting, parsing, indexing, and searching through a faceted interface.
• This is also the phase where basic event-level metrics are desired and tested: max, min, counts, built-in range metrics, and alerts when KPIs are out of range.
• Dashboards that are dynamic and actionable, in sync with the searches, are highly desirable.
• The MDA provides the faceted search interface.
• KEY TECHNOLOGIES: Text Analytics, Faceted Search, BI
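A faceted search over extracted log records can be sketched in a few lines of Python (the record fields here are hypothetical examples, not the accelerator's schema):

```python
from collections import Counter

# Toy extracted log records; real fields come from the extract stage.
records = [
    {"severity": "ERROR", "host": "app01", "code": "HTTP500"},
    {"severity": "WARN",  "host": "app01", "code": "HTTP404"},
    {"severity": "ERROR", "host": "app02", "code": "HTTP500"},
    {"severity": "INFO",  "host": "app02", "code": "HTTP200"},
]

def facet_counts(recs, field):
    """Count how many records carry each value of a field."""
    return Counter(r[field] for r in recs if field in r)

def drill_down(recs, field, value):
    """Narrow the record set, as clicking a facet does in the UI."""
    return [r for r in recs if r.get(field) == value]

print(facet_counts(records, "severity"))
print(drill_down(records, "host", "app01"))
```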
Step 2: Let me correlate events
• In this phase, the customer performs searches and endeavors to make sense of the events and sequences.
  – We usually work side by side with the customer in this stage.
  – We extract the vital tribal knowledge about the applications and the domain.
  – We log their "experiential" notions of event sequences and correlations; this is essential for verifying results when the user wants to go to Step 3.
• KEY TECHNOLOGIES: BigSheets
Step 3: I have too many systems and logs to correlate
• In this phase, the customer essentially wants to find relationships and patterns of occurrence between log events across systems and applications.
• The MDA uses sessionization and sequence-mining capabilities to accomplish this step.
• KEY TECHNOLOGIES: Text Analytics, Machine Learning
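Sessionization plus sequence mining can be approximated as: split a system's time-ordered event stream on large gaps, then count recurring adjacent event pairs. This is a simplification of the accelerator's capability, with invented event codes:

```python
from collections import Counter

def sessionize(events, gap=300):
    """Split a time-ordered (timestamp, code) stream on gaps > `gap` seconds."""
    sessions, current, last = [], [], None
    for ts, code in events:
        if last is not None and ts - last > gap:
            sessions.append(current)
            current = []
        current.append(code)
        last = ts
    if current:
        sessions.append(current)
    return sessions

def frequent_pairs(sessions):
    """Count adjacent event pairs across all sessions."""
    return Counter(p for s in sessions for p in zip(s, s[1:]))

events = [(0, "DISK_WARN"), (60, "IO_RETRY"), (120, "DISK_WARN"),
          (180, "IO_RETRY"), (4000, "BOOT")]
sessions = sessionize(events)
print(frequent_pairs(sessions).most_common(1))
```

A pair that recurs often across sessions ("DISK_WARN followed by IO_RETRY") is exactly the kind of cross-log relationship the customer is looking for in this step.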
Step 4: Combine with my KPI, Topology data
• Once Step 3 is completed, integration with the KPI, topology, and monitoring data is possible.
• This step allows us to expose the capabilities to the network operator and end user.
• KEY TECHNOLOGIES: data joins, SQL/JAQL, BigSheets, reporting dashboards
Step 5: Predict events based on patterns
• The more advanced customers and network operators would like to build predictive models based on the patterns they see in log-data events.
• Customers want to build models that help with meeting enterprise SLAs for systems.
• Downtime scheduling for systems is a complex problem for most data centers.
• KEY TECHNOLOGIES: Machine Learning (R, SPSS, System ML)
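A minimal version of such a predictive model is a least-squares trend over per-interval error counts, extrapolated one interval ahead (purely illustrative; real deployments would use R, SPSS, or System ML as noted above):

```python
def linear_trend(ys):
    """Ordinary least squares fit y = a + b*x for x = 0..n-1."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def forecast_next(ys):
    """Project the fitted trend one interval past the observed data."""
    a, b = linear_trend(ys)
    return a + b * len(ys)

errors_per_hour = [2, 3, 5, 6, 8, 9]   # steadily climbing error rate
print(forecast_next(errors_per_hour))  # projected count for the next hour
```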
High-Level Workflow
Apply Adapter
• What: copy the logs from the machines where they are generated into HDFS.
• How: BigInsights Distributed Copy app + MDA extensions.
• Advantages:
  – Uses FTP/SFTP protocols supported by the Distributed Copy app
  – MDA extensions allow batch incremental processing and batch replacement
  – MDA extensions associate metadata (such as server names) with the data, making it available to downstream analysis
Import
• What:
  – Identify log record boundaries
  – Extract information from log records in text and XML
• How: BigInsights Text Analytics.
• Advantages:
  – Robust text extraction using an SQL-like language; avoids brittle custom parsers
  – Library of extractors for common log files: syslog, WebSphere, web access, DataPower, CSV, generic
  – Extensive tooling for custom extractor development and app customization (Eclipse-based IDE)
Extract
The Extract Stage: Text analytics applied to log files
Pipeline: raw log files (HDFS/GPFS) → record splitting (AQL) → log records (text) → field and entity extraction (AQL) → semi-structured data (JSON) → to the Transform stage.
AQL extractors are available for many common formats (syslog, WebSphere, CSV, ...). BigInsights ships with tools for creating new extractors.
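In spirit, an extractor turns raw lines into semi-structured records. The Python sketch below mimics that with a regular expression over a hypothetical syslog-like line; the product's AQL extractors are far more robust than a single regex:

```python
import re
import json

# Hypothetical syslog-like line layout; real AQL extractors cover many formats.
LOG_RE = re.compile(
    r"(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "   # timestamp
    r"(?P<host>\S+) "                           # originating host
    r"(?P<proc>\w+)(\[\d+\])?: "                # process name, optional pid
    r"(?P<msg>.*)"                              # free-text message
)

def extract(line):
    """Turn one raw log line into a semi-structured record, or None."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = "Mar  4 10:15:32 app01 sshd[212]: Failed password for root"
record = extract(line)
print(json.dumps(record, indent=2))  # JSON record, ready for the next stage
```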
Index
• What
  – Index and facet extracted records and fields so they are available for searching via the faceted search user interface
• How
  – BigInsights BigIndex
• Advantages
  – Find correlated log entries based on time through an interactive UI
  – Add/inject other data (e.g., Excel) to enrich log context
  – Allow operations staff to quickly find log entries based on search terms such as web service name, server name, exception code, transaction ID, etc.
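The indexing step can be pictured as a tiny inverted index mapping (field, value) pairs to record IDs, which is what makes faceted drill-down fast. This is illustrative only, not BigIndex itself:

```python
from collections import defaultdict

def build_index(records, fields):
    """Map (field, value) -> set of record ids: a tiny inverted index."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for f in fields:
            if f in rec:
                index[(f, rec[f])].add(rid)
    return index

def search(index, **criteria):
    """Intersect posting sets: records matching every field=value criterion."""
    sets = [index.get((f, v), set()) for f, v in criteria.items()]
    return set.intersection(*sets) if sets else set()

records = [
    {"server": "app01", "code": "HTTP500"},
    {"server": "app02", "code": "HTTP500"},
    {"server": "app01", "code": "HTTP200"},
]
idx = build_index(records, ["server", "code"])
print(search(idx, server="app01", code="HTTP500"))
```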
Transform
• What
  – Link and enrich log information from different entities
    • Find relationships between log records
    • Integrate structured data (network configuration, user account information, ...) with log data
• How: JAQL
• Advantages
  – High-level language that is big-data aware
  – Out-of-the-box transformers
  – Extensive tooling for application customization (Eclipse IDE)
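The enrichment this stage performs amounts to a join between parsed log records and structured non-log data (JAQL in the product; plain Python below, with hypothetical field names):

```python
# Structured, non-log data: a server inventory keyed by hostname.
inventory = {
    "app01": {"datacenter": "DC1", "owner": "payments"},
    "app02": {"datacenter": "DC2", "owner": "retail"},
}

# Parsed log records from the extract stage.
log_records = [
    {"host": "app01", "code": "HTTP500"},
    {"host": "app02", "code": "HTTP200"},
]

def enrich(records, lookup):
    """Left-join log records with inventory context by host."""
    return [{**r, **lookup.get(r["host"], {})} for r in records]

enriched = enrich(log_records, inventory)
print(enriched[0])  # now carries datacenter and owner alongside the event
```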
The Transform Stage: Linking logs and other information from varied sources
• Input: parsed log records, additional structured data
• Output: individual log records from different IT entities, linked and enriched
Example sources linked: fault data, network logs, web logs, MQ logs, server logs, transaction logs, performance data, performance and fault data, raw logs (HDFS/GPFS), text files, and structured data from non-log sources.
1. IT logs of a single business activity or transaction
   – Up and down the IT stack
2. Logs of activity across one layer of the IT stack (e.g., the OS layer)
   – Messages flowing through a sequence of routers
3. …
The linked logs feed outlier detection, correlations, and predictive models.
Analyze
• What
  – Correlate across fields
  – Find frequently occurring sequences and combinations of events
  – Potential for predictive modeling in the future
• How: System ML
• Advantages
  – Scalable enough to perform analytics on big data
  – Flexible and customizable
  – Easy to plug into applications via a JAQL/Java interface
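"Correlate across fields" can be illustrated by computing the Pearson correlation between two per-interval series, say error counts versus response time. System ML does this at scale; this is a hand-rolled sketch with invented sample data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

errors_per_min = [1, 2, 4, 8, 9, 12]          # extracted from logs
resp_time_ms = [110, 130, 180, 300, 320, 400]  # from performance data
r = pearson(errors_per_min, resp_time_ms)
print(f"correlation = {r:.2f}")  # near 1.0: the fields move together
```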
Agenda
• Introduction
• High-Level Workflow
• Some Highlights
• Demo
Machine Data Adapters
• What are adapters?
  – They adapt a variety of inputs to a standard output.
• Why do we need machine data adapters?
  – To handle different machine-data formats.
Adapters in High-Level Workflow
Apply Adapter
Adapter Functions
• Create
  – Enter the adapter name, log type, a sample of machine data, and the first timestamp in the sample.
  – Check the recommended 'DataTime Format' and 'preTimeStamp Regex', and select defaults such as 'timezone', 'year', and 'month'.
  – Verify the extracted output and save it if it looks good.
  – If the extracted output is bad, go back and edit the 'Data Source Type', 'DataTime Format', and 'preTimeStamp Regex' parameters.
• Edit
• View
• Apply
• Delete
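The role of the pre-timestamp regex and date-time format settings can be sketched as follows; the regex, format string, and sample line are illustrative, not the accelerator's defaults:

```python
import re
from datetime import datetime

# Hypothetical settings a user might enter when creating an adapter.
PRE_TIMESTAMP_REGEX = r"^\["           # text preceding the timestamp
DATETIME_FORMAT = "%d/%m/%Y %H:%M:%S"  # layout of the timestamp itself

def extract_timestamp(line):
    """Strip the pre-timestamp text, then parse the leading timestamp."""
    rest = re.sub(PRE_TIMESTAMP_REGEX, "", line, count=1)
    stamp = rest[:19]                  # fixed width for this format
    return datetime.strptime(stamp, DATETIME_FORMAT)

sample = "[04/03/2013 10:15:32] WebContainer: connection pool exhausted"
print(extract_timestamp(sample))
```

When the extracted output looks wrong, it is usually one of these two settings that needs editing, which is exactly the Create-flow correction step described above.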
Create Machine Data Adapter – Step-1
Create Machine Data Adapter – Step-2
Create Machine Data Adapter – Step-3
Display Machine Data Adapter
Edit Machine Data Adapter – Step-1
Edit Machine Data Adapter – Step-2
Edit Machine Data Adapter – Step-3
Display Machine Data Adapter
Apply Machine Data Adapter
Verify the Adapter (metadata.json)
Delete
Data Explorer for Indexing Application
• Data Explorer index configuration file to support a generic schema for extracted machine data
• Parallelized data pushing to the Data Explorer indexer
• Run the Data Explorer index application
Data Explorer Index Configuration File
• The Data Explorer index config file specifies which fields to index, which field contains the record ID, and the Data Explorer index field definitions: field name, type, searchable, retrievable, filterable, and sortable.
Example:
{
  "source": {
    "dateFormat": "MMM dd yyyy HH:mm:ss.SSS Z",
    "fieldName": "LogDatetime[].normalized_text",
    "suppress": false
  },
  "target": {
    "deFieldName": "LogDatetime",
    "filterable": true,
    "isRecordID": false,
    "retrievable": true,
    "searchable": true,
    "sortable": true,
    "type": "Date"
  }
}
• A default index configuration file is provided.
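A client script consuming one such configuration entry might look like this; the JSON shape mirrors the example above, while the surrounding code and variable names are our own illustration:

```python
import json

# One entry of a Data Explorer index configuration, as in the example above.
config = json.loads("""
{ "source": { "dateFormat": "MMM dd yyyy HH:mm:ss.SSS Z",
              "fieldName": "LogDatetime[].normalized_text",
              "suppress": false },
  "target": { "deFieldName": "LogDatetime", "filterable": true,
              "isRecordID": false, "retrievable": true,
              "searchable": true, "sortable": true, "type": "Date" } }
""")

target = config["target"]
# Collect the behaviors enabled for this field.
flags = [k for k in ("searchable", "retrievable", "filterable", "sortable")
         if target[k]]
print(target["deFieldName"], "->", target["type"], flags)
```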
Parallelizing data pushing to Data Explorer Indexer
• The application uses an Oozie Jaql action to parallelize the job into multiple tasks.
Diagram: the indexing app runs from HDFS on the BI platform/IDE and fans out into Jaql Hadoop tasks 1..M; each task uses BigSearch to locate shards via a ZooKeeper cluster and pushes data to Data Explorer backend shards 1..N.
Run Data Explorer index Application
Basic Facet Search UI on Application Builder
BI Log Monitoring and Analysis
• Ingest BigInsights logs into HBase in real time.
• Create a Log Monitoring Extraction application that extracts log records from HBase.
• Create an Index Management application to delete old index log records from DFS.
• Embed the MDA Search UI within the BigInsights Dashboard for BigInsights log search.
Ingesting BigInsights Logs into HBase
• Chukwa agents are set up on the name node and each of the data nodes.
• Adapters are programmatically installed and removed depending on user configuration.
• A custom Chukwa writer class was created to add logs into HBase in real time.
• A Log4j interface streams logs to the adapters, which stream the logs on to HBase.
• Different log types are concurrently recorded in HBase in a single table.
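Recording several log types concurrently in one HBase table typically comes down to row-key design. The sketch below shows one plausible scheme (our illustration, not the accelerator's actual schema): prefix the key with log type and a zero-padded timestamp so rows sort by type, then time:

```python
def hbase_row_key(log_type, epoch_ms, seq):
    """Compose 'type!timestamp!seq' so rows sort by type, then time."""
    return f"{log_type}!{epoch_ms:013d}!{seq:06d}".encode()

def hbase_cells(line):
    """Hypothetical column layout: one 'log' family, raw line plus level."""
    level = "ERROR" if "ERROR" in line else "INFO"
    return {b"log:raw": line.encode(), b"log:level": level.encode()}

key = hbase_row_key("namenode", 1386720000000, 1)
print(key)
print(hbase_cells("2013-12-11 00:00:00 ERROR NameNode: checkpoint failed"))
```

With keys shaped this way, a scan bounded by a `namenode!` prefix reads one log type in time order without touching the others sharing the table.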
Data Collection Diagram
Name Node: Hadoop Secondary Name Node, Hadoop JobTracker, Hadoop Name Node. Data Nodes 1-3: each runs a Hadoop Data Node, Hadoop Task Tracker, and Hadoop Task Attempt. All stream logs into HBase.
• For HDFS with Symphony MapReduce installations: Hadoop Data Node, Hadoop Name Node, and Hadoop Secondary Name Node logs are supported.
• For GPFS with Apache MapReduce installations: Hadoop JobTracker, Hadoop Task Tracker, and Hadoop Task Attempt logs are supported.
• For GPFS with Symphony MapReduce installations: only Hadoop Task Attempt logs are supported.
HDFS with Apache MapReduce
BigInsights Dashboard
• The user starts BigInsights log collection from the LogCollection app.
• The user can stop BigInsights log collection from the LogCollection app, or by turning off Monitoring.
• The MDA Search UI is wrapped by a frame in the BigInsights Dashboard.
BigInsights Log Monitoring Application
• It is a BigInsights chained application.
• It contains the Log Monitoring Extraction application and the Index application.
• It assumes that the Log Monitoring Extraction application is running in scheduled mode.
• The BigInsights Logs workflow is assumed to be selected for the Index application.
• Any configuration files are assumed to be the default configuration files installed with MDA.
• The "Index Only New Logs" check box in the Index application is assumed to be unchecked.
BigInsights Log Monitoring Application
Agenda
• Introduction
• High-Level Workflow
• New Features in MDA 2.1
• Demo