Copyright © 2014 Splunk Inc.
Sagi Zelnick Principal Architect, Yahoo
Exploratory Analytics for Shared-service Hadoop Clusters
Disclaimer
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Overview
• Hadoop @ Yahoo: 8+ years of innovation
• Hunk @ Yahoo: organization-wide investment for next 3+ years
• Yahoo providing Hunk as a self-service to explore, analyze & visualize data in HDFS
  – Hunk allows for visually browsing very complex tables (250+ fields)
  – Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
  – Cuts down on the development cycles by faster interaction with results
  – Built-in graphs/charts make for a powerful solution for many situations
History of Hadoop Innovation @ Yahoo
Over 600PB of Hadoop Storage (Over Half an Exabyte)
• Very large clusters used by many groups across the enterprise
• More than 35,000 individual datanodes
• Hadoop is provided as a service
• Multiple cluster types such as research, dev, sandbox and production
• Services such as HBase, Hive, Oozie, etc…
• Users are free to run jobs, but have resource constraints
• Maintained by the Grid Operations Group
Integrated Analytics Platform for Diverse Data Stores
Full-featured, Integrated Product
Fast Insights for Everyone
Works with What You Have Today
Explore Visualize Dashboards
Share Analyze
Hadoop Clusters NoSQL and Other Data Stores
Hadoop Client Libraries Streaming Resource Libraries
Improving Operational Visibility with Hunk
• We pointed Hunk at many operational logs and event data we already had on the grid
• This includes system metrics, HDFS ops, JVM stats and YARN metrics
• Created instrumentation to measure usage per user and job
• Analyzed terabytes of NameNode audit logs
• Job history leveraged for visualizing usage/growth and historical views
• Custom events for HBase statistics
Tracking Hadoop Performance & Metrics in Hunk
Use Case | Customer | Benefits
System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers | Track organic growth
Job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs
Measuring NameNode Performance Pre & Post Upgrades
• Historical visualizations of all operations
• Search data in Hunk from billions of NameNode events
• Measure JVM and memory usage
• Insights into operational performance
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*
Last 7 days
✓ 10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)
[Chart: timechart of the num_* series over _time, span=1h, last 7 days]
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
2014-05-15 02:00 | 1115496.290492 | 53597000.262295 | 298717.637049 | 10402176.717213 | 0.000000 | 94109944.655738 | 93916552.393443 | 35459.288689
2014-05-15 03:00 | 1110372.4173 | 56566721.704918 | 428494.9449 | 13296385.590164 | 0.000000 | 94141430.295082 | 97353478.229508 | 20307.549344
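The search behind this table combines two SPL features: `timechart span=...` snaps each event to a fixed-width time bucket, and `avg(number*) as num_*` averages every matching field per bucket and renames it. A minimal Python sketch of that idea is shown here. It illustrates the bucketing logic only and is not how Hunk executes the search; the field name `numberHeartBeats` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def timechart_avg(events, span, prefix="number", out_prefix="num_"):
    """Average every field starting with `prefix` per bucket of width `span`."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ev in events:
        # Snap the event time down to the start of its bucket.
        bucket = EPOCH + (ev["_time"] - EPOCH) // span * span
        for key, value in ev.items():
            if key.startswith(prefix):
                # Rename e.g. numberHeartBeats -> num_HeartBeats.
                name = out_prefix + key[len(prefix):]
                sums[bucket][name] += value
                counts[bucket][name] += 1
    return {
        bucket: {n: sums[bucket][n] / counts[bucket][n] for n in fields}
        for bucket, fields in sorted(sums.items())
    }


# Hypothetical events: two fall in the 01:00 bucket, one in the 02:00 bucket.
events = [
    {"_time": datetime(2014, 5, 15, 1, 5), "numberHeartBeats": 100.0},
    {"_time": datetime(2014, 5, 15, 1, 50), "numberHeartBeats": 300.0},
    {"_time": datetime(2014, 5, 15, 2, 10), "numberHeartBeats": 50.0},
]
result = timechart_avg(events, timedelta(hours=1))
for bucket, fields in result.items():
    print(bucket, fields)
```

Passing `timedelta(minutes=5)` instead gives the finer-grained view that a `span=5m` search would produce.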
Visualization Using Hunk
New Search
index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=5m avg(number*) as num_*
Last 2 days
✓ 2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)
[Chart: timechart of the num_* series over _time, span=5m, last 2 days]
_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-20 01:15:00 | 1056047.024000 | 34677652.000000 | 124121.264000 | 26242490.800000 | 0.000000 | 88112292.800000 | 126478486.400000 | 51405.346000
2014-05-20 01:20:00 | 1055517.924000 | 30920700.800000 | 1065390.086000 | 22756041.800000 | 0.000000 | 87745422.400000 | 92323387.200000 | 32070.482000
2014-05-20 01:25:00 | 1055457.2000 | 33068504.400000 | 27622.56200 | 11396610.700000 | 0.000000 | 88569211.200000 | 94593716.800000 | 28873.618000
Sample Troubleshooting in Hunk of 750 Million Events
Big Picture Plus Granular Details
Analyzing NameNode RPC Calls (Troubleshooting)
• Who is making which RPC call (open, listStatus, create, etc.)
• How often they are making these RPC calls
• Which IP/host the calls are coming from
• Search and visualize historical data from billions of events
• Prevent NameNode abuse/misuse
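The questions above (who, which call, how often, from where) amount to a grouped count over NameNode audit events. The sketch below shows that grouping in Python. It assumes the usual HDFS audit-log layout of tab-separated `key=value` pairs (`ugi=`, `ip=`, `cmd=`); the sample lines, user names and addresses are hypothetical, and this is not the search used in the talk.

```python
import re
from collections import Counter

# key=value pairs as they appear in an HDFS audit line (assumed layout).
FIELD = re.compile(r"(\w+)=(\S+)")


def rpc_counts(lines):
    """Count audit events per (user, command, client IP)."""
    counts = Counter()
    for line in lines:
        fields = dict(FIELD.findall(line))
        if "cmd" not in fields:
            continue  # not an audit event
        user = fields.get("ugi", "unknown")
        ip = fields.get("ip", "unknown").lstrip("/")  # audit logs write ip=/a.b.c.d
        counts[(user, fields["cmd"], ip)] += 1
    return counts


# Hypothetical sample lines in the assumed audit format.
sample = [
    "2014-05-20 01:15:02,101 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.1.2.3\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-20 01:15:02,350 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:KERBEROS)\tip=/10.1.2.3\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-20 01:15:03,007 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:KERBEROS)\tip=/10.1.2.9\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
]
counts = rpc_counts(sample)
for (user, cmd, ip), n in counts.most_common():
    print(user, cmd, ip, n)
```

Sorting by count puts the heaviest callers first, which is the view needed to spot NameNode abuse or misuse.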
Visualizing 834 Million Discrete Events …
Continued
Queue Insights (Capacity & Provisioning)
• Each Hadoop job runs in a specific queue
• We track every aspect of the YARN framework
• Immediate queue performance and configuration profiling via job history server
• Historical views and trends that enable better capacity management
• Improved queue utilization and allocation management
New Search
index="jobsummary_logs_all_red" cluster="dilithium*"
| eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds)
| eval gb_hours=((total_slot_seconds * 0.5) / 3600)
| eval gb_hours=round(gb_hours)
| timechart span=6h sum(gb_hours) as gb_hours by queue
Last 7 days
✓ 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Chart: gb_hours by queue over _time, span=6h, last 7 days]
_time | OTHER | apg_dailyhigh_p3 | apg_dailymedium_p5 | apg_hourlyhigh_p1 | apg_hourlylow_p4 | apg_hourlymedium_p2 | apg_p7 | curveball_large | curveball_med | slingshot | slingstone
2014-05-20 18:00 | 4154 | 45512 | 7071 | 25643 | 12111 | 29664 | 3473 | 26547 | 14192 | 60875 | 45376
2014-05-21 00:00 | 19341 | 92661 | 18005 | 41008 | 22944 | 88115 | 10896 | 38648 | 8693 | 48186 | 87670
2014-05-21 06:00 | 21160 | 108137 | 38398 | 35627 | 14934 | 101925 | 24458 | 29269 | 14066 | 24344 | 47831
2014-05-21 12:00 | 24238 | 74849 | 22695 | 47431 | 17731 | 53673 | 17332 | 37079 | 14479 | 44873 | 96909
2014-05-21 18:00 | 5792 | 95449 | 2737 | 44214 | 20325 | 48339 | 10222 | 34390 | 4605 | 168593 | 24298
2014-05-22 00:00 | 10177 | 68048 | 12853 | 36921 | 23248 | 57740 | 16005 | 44138 | 9142 | 88121 | 34544
2014-05-22 06:00 | 12720 | 85048 | 21977 | 35870 | 15503 | 100364 | 7823 | 35179 | 8086 | 33973 | 19802
2014-05-22 12:00 | 5459 | 76489 | 13154 | 34703 | 11204 | 34877 | 20178 | 22631 | 40567 | 98 | 24250
2014-05-22 18:00 | 8169 | 38394 | 2211 | 49840 | 19977 | 52438 | 4050 | 38066 | 27973 | 49333 | 31312
2014-05-23 00:00 | 12898 | 117518 | 7354 | 36422 | 16426 | 52918 | 8179 | 28202 | 21798 | 79808 | 37078
2014-05-23 06:00 | 6572 | 105431 | 26941 | 48614 | 29159 | 120424 | 14317 | 26011 | 12433 | 16745 | 35928
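The gb_hours figures in this table come from simple arithmetic in the search: add map and reduce slot-seconds, multiply by 0.5, divide by 3600, round, then sum per queue. A short Python sketch of that calculation follows. Reading the 0.5 factor as gigabytes of memory per slot is an assumption (the slides do not say), and the job records are hypothetical.

```python
from collections import defaultdict

# Factor taken from the search; interpreting it as GB per slot is an assumption.
GB_PER_SLOT = 0.5
SECONDS_PER_HOUR = 3600


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """Mirror the eval steps: (map + reduce) * 0.5 / 3600, rounded to an integer."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    # Python's round() rounds exact halves to even; SPL's round() may differ there.
    return round(total_slot_seconds * GB_PER_SLOT / SECONDS_PER_HOUR)


def gb_hours_by_queue(jobs):
    """Sum the per-job gb_hours for each queue."""
    totals = defaultdict(int)
    for job in jobs:
        totals[job["queue"]] += gb_hours(job["mapSlotSeconds"], job["reduceSlotSeconds"])
    return dict(totals)


# Hypothetical job summary records.
jobs = [
    {"queue": "slingshot", "mapSlotSeconds": 36000, "reduceSlotSeconds": 36000},
    {"queue": "slingshot", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"queue": "apg_p7", "mapSlotSeconds": 14400, "reduceSlotSeconds": 0},
]
totals = gb_hours_by_queue(jobs)
print(totals)
```

Because rounding happens per job before the sum, as in the search, queues with many small jobs can drift slightly from a total computed on unrounded values.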
Visualizing Queues
Self-Service Job Reports
• Each job is unique and so are the map and reduce elements
• How to start analyzing jobs?
• Historical job performance and profiling enables in-depth performance tuning
• Long-term historical views and trending of growth
cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00 | 37.37
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00 | 29.83
cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49
cobalt | gmon | gridops | SPLK_spbl103n01.blue.ygrid.yahoo.com_1401125953.2076_0 | job_1398982765383_313570 | SUCCEEDED | 25.00 | 14.26
cobalt | gmon | gridops | nnaudit_DR_2014_05_25 | job_1398982765383_308938 | SUCCEEDED | 25.00 | 15.43
cob | g | grid | nnaudit_DB_2014_05_25 | job_1398982765 | SUCCE | 24.00 | 18.07
New Search
index="jobsummary_logs_all_blue" cluster="*" user="gmon"
| eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds)
| eval gb_hours=((total_slot_seconds * 0.5) / 3600)
| eval gb_hours=round(gb_hours,2)
| eval runtime=(finishTime-submitTime)/1000
| stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status
| eval run_mins=round(run_mins/60,2)
| sort -gb-hours
Yesterday
✓ 4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)
Statistics (4,871)
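The per-job report adds a runtime column to the gb-hours figure: the difference between finishTime and submitTime is divided by 1000 and then by 60, which suggests millisecond timestamps converted to minutes, and rows are sorted with the most expensive job first. A minimal Python sketch of those steps is given here, assuming one summary record per job and millisecond timestamps; the job IDs and values are hypothetical.

```python
def job_report(jobs):
    """Build one row per job and sort by gb_hours, largest first."""
    rows = []
    for job in jobs:
        slot_seconds = job["mapSlotSeconds"] + job["reduceSlotSeconds"]
        # Timestamps are assumed to be epoch milliseconds, as the /1000 implies.
        runtime_seconds = (job["finishTime"] - job["submitTime"]) / 1000
        rows.append({
            "jobId": job["jobId"],
            "gb_hours": round(slot_seconds * 0.5 / 3600, 2),
            "run_mins": round(runtime_seconds / 60, 2),
        })
    return sorted(rows, key=lambda row: row["gb_hours"], reverse=True)


# Hypothetical job summary records.
jobs = [
    {"jobId": "job_A", "mapSlotSeconds": 360000, "reduceSlotSeconds": 0,
     "submitTime": 0, "finishTime": 1_800_000},
    {"jobId": "job_B", "mapSlotSeconds": 720000, "reduceSlotSeconds": 0,
     "submitTime": 0, "finishTime": 600_000},
]
report = job_report(jobs)
for row in report:
    print(row)
```

Seeing cost and run time side by side makes it easy to tell a job that is short but resource-heavy from one that is long but cheap.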
...It’s Not Just Logs We’re Looking At
• Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
• Visualize very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Built-in aggregates and graphs/charts
• Accelerates development workflow by providing faster interaction with data
More data to tap into with the metastore / Hive sources
THANK YOU