Application Timeline Server - Past, Present & Future
Naganarasimha G R & Varun Saxena
Agenda
Who we are?
Why do we need a History Server?
Application History Server
Timeline Server V1
Timeline Server V2
Who we are?
Naganarasimha G R
- Senior Technical Lead @ Huawei
- Active Apache Hadoop contributor
- Currently working in the Hadoop Platform Dev team
- Earlier worked in the Reporting domain
Varun Saxena
- Technical Lead @ Huawei
- Active Apache Hadoop contributor
- Currently working in the Hadoop Platform Dev team
- Earlier worked in the Telecom Data Network domain
Both of us are currently participating in ATS V2 development.
Agenda
Who we are?
Why do we need a History Server?
Application History Server
Timeline Server V1
Timeline Server V2
Need for a new History Server
Job History Server (JHS) is only for MapReduce apps, while YARN supports many application types.
YARN-level events and metrics are not captured.
Storage is HDFS only, which is not good for ad hoc analysis.
JHS serves only historical (completed) jobs.
On failure of the Application Master, data for the currently running application is lost.
Storage is very MR-specific (Counters, Mappers and Reducers).
Agenda
Who we are?
Why do we need a History Server?
Application History Server
Timeline Server V1 & V1.5
Timeline Server V2
Application History Server
Separate process.
Resource Manager directly writes to storage (HDFS).
Aggregated logs.
Separate UI, CLI and REST end point.
Data stored:
- Application-level data (queue, user, etc.)
- List of ApplicationAttempts
- Information about each ApplicationAttempt
- List of containers for each ApplicationAttempt
- Generic information about each container
CLI and REST query interfaces were supported.
Drawbacks:
- Storing application-specific custom data is not supported.
- If the RM crashes, the HDFS files are not readable.
- Hard limit on the number of files.
- Upgrades / updates are difficult to handle.
- Supports only completed jobs.
Agenda
Who we are?
Why do we need a History Server?
Application History Server
Timeline Server V1
Timeline Server V2
Application Timeline Service
Motivation:
- YARN takes care of it, relieving the application from running its own monitoring service.
- Application diversity: framework-specific metadata/metrics.
ATS V1 : Data Model
Timeline Domain
- Namespace for the Timeline Server which supports isolation of users and applications.
- Timeline Server security is defined at this level.
Timeline Entity
- An abstract concept of anything: can be an application, an application attempt, a container or any user-defined object.
- Defines the relationships between entities.
- Contains primary filters which are used to index the entities in the Timeline Store.
- Uniquely identified by an EntityId and EntityType.
Timeline Event
- An event related to a specific Timeline Entity of an application.
- Users are free to define what an event means, such as starting an application or getting a container allocated.
ATS V1 : Architecture
Separate Process
Pluggable store – defaults to LevelDB
REST Interfaces
ATS V1 : LevelDB
Key-value store.
Lightweight.
Open source, compatible license.
Used to store:
- TimelineStore: domains, entities, events and metrics
- TimelineStateStore: security tokens
Supports data retention.
ATS V1 : Client & API
Timeline Client
- Wrapper over the REST POST method.
- POJO objects: TimelineEntity, TimelineEvent.
- Runs in Client/AM/Container.
REST APIs, JSON as the media:
- Get timeline entities: http://localhost:8188/ws/v1/timeline/{entityType}
- Get timeline entity: http://localhost:8188/ws/v1/timeline/{entityType}/{entityId}
- Get timeline events: http://localhost:8188/ws/v1/timeline/{entityType}/events
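Below is a minimal sketch of how an application might publish an entity through the Timeline Client, assuming the stock YARN v1 client API (org.apache.hadoop.yarn.client.api.TimelineClient); the entity type, id, primary filter and event name are illustrative values, not ones from the talk.

```java
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelinePublisher {
    public static void main(String[] args) throws Exception {
        TimelineClient client = TimelineClient.createTimelineClient();
        client.init(new YarnConfiguration());   // picks up yarn-site.xml settings
        client.start();
        try {
            // POJO entity identified by entity type + id, indexed by primary filters.
            TimelineEntity entity = new TimelineEntity();
            entity.setEntityType("MY_APP_ENTITY");       // illustrative type
            entity.setEntityId("entity_1");              // illustrative id
            entity.setStartTime(System.currentTimeMillis());
            entity.addPrimaryFilter("user", "joe");

            // User-defined event attached to the entity.
            TimelineEvent event = new TimelineEvent();
            event.setEventType("APP_STARTED");
            event.setTimestamp(System.currentTimeMillis());
            entity.addEvent(event);

            // Wrapped REST POST to the Timeline Server.
            TimelinePutResponse response = client.putEntities(entity);
            System.out.println("Errors: " + response.getErrors());
        } finally {
            client.stop();
        }
    }
}
```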
ATS V1 : Security
HTTP SPNEGO
Kerberos authentication
Delegation tokens
- For performance.
- AM/Container have no Kerberos credentials.
Access control
- Admin/owner.
- At the timeline entity level.
ATS V1 : Use cases
Agenda
Who we are?
Why do we need a History Server?
Application History Server
Timeline Server V1
Timeline Server V2
Why ATSv2?
Scalability
- Single global instance of writer/reader.
- ATSv1 uses local-disk-based LevelDB storage.
Usability
- Handle flows as first-class concepts and model aggregation.
- Elevate configurations and metrics to first-class members.
- Better support for queries.
Reliability
- Data is stored only on a local disk.
- Single daemon, so a single point of failure.
Existing external tooling: hRaven, Finch, Dr. Elephant, etc. As new Hadoop versions are rolled out, maintenance of these tools becomes an issue.
Key Design Points
Distributed writers (per app and per node)
- Per-app writer/collector launched as part of the RM.
- Per-node collector/writer launched as an auxiliary service in the NM.
- In future, standalone writers will be supported.
Scalable and reliable backend storage (HBase).
A new object model API with flows built into it.
Separate reader instance(s); currently a single reader instance.
Aggregation, i.e. rolling up metric values to the parent.
- Online aggregation for apps and flow runs.
- Offline aggregation for users, flows and queues.
ATSv2 Components
[Architecture diagram: the Application Master, Node Managers and Resource Manager each embed a Timeline Writer that pushes app and container events/metrics to the backend storage; a Timeline Reader pool serves user queries against that storage.]
Distributed Writers / Collectors
[Diagram: app collector setup and registration flow]
1. User submits an app.
2. RMApp launches a companion app collector on new app submission.
3. The App Master is launched.
4. The NM Aux Service is notified to bind the new collector.
5. The new collector is bound.
6. The new collector is registered with the NM Collector Service.
7. The new collector info (IP + port) is reported back.
The RM then distributes the list of app collectors ({app_1_collector_info, app_2_collector_info, ...}) via heartbeats: the AM reports app events to the app collector notified in the heartbeat by the RM, and NMs report container events to the same collector. RM events and the collected app/container events and metrics are written to HBase.
Flow
[Diagram: a flow corresponds to a script or program (e.g. a Hive query or Pig script) run by a user, here Joe. Each execution of the flow is a flow run (e.g. "Run at 7:30 pm", "Run at 9:00 pm"), and each flow run is made up of multiple YARN applications (App 1 ... App 4).]
Data Model
Entity: ID + type, configurations, metadata (info), parent-child relationships, metrics, events.
Entity types in the hierarchy:
- Cluster: cluster attributes.
- Flow: user, flow runs, flow attributes.
- Flow Run: user, running apps, flow run attributes.
- Application: user, flow + run, queue, attempts.
- Attempt: application, queue, containers.
- Container: attempt, attributes.
First-class citizen entities (produced by aggregation):
- User: username (ID), aggregated metrics.
- Queue: queue (ID), sub-queues, aggregated metrics.
Event: ID, metadata, timestamp.
Metric: ID, metadata, single value or time series (with timestamps).
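As a rough illustration of this model, the sketch below populates a single entity with info, a config, a metric and an event using the ATSv2 object model classes (org.apache.hadoop.yarn.api.records.timelineservice). The type, ids and metric/event names are illustrative, and exact method names were still evolving while ATSv2 was under development.

```java
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEvent;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineMetric;

public class EntityExample {
    public static void main(String[] args) {
        // An entity is identified by id + type and carries configs, info, metrics and events.
        TimelineEntity entity = new TimelineEntity();
        entity.setType("YARN_CONTAINER");                     // illustrative type
        entity.setId("container_1334432321_0002_01_000001");  // illustrative id
        entity.setCreatedTime(System.currentTimeMillis());
        entity.addInfo("user", "joe");                        // metadata (info)
        entity.addConfig("mapreduce.map.memory.mb", "1024");  // configuration

        // A metric can hold a single value or a time series (values with timestamps).
        TimelineMetric metric = new TimelineMetric();
        metric.setId("CPU");
        metric.addValue(System.currentTimeMillis(), 4);
        entity.addMetric(metric);

        // An event has an id, a timestamp and optional metadata.
        TimelineEvent event = new TimelineEvent();
        event.setId("CONTAINER_LAUNCHED");
        event.setTimestamp(System.currentTimeMillis());
        entity.addEvent(event);
    }
}
```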
HBase vs Phoenix evaluation
Based on the evaluation of both HBase and Phoenix, it was decided that HBase will be used on the write path. With HBase, much higher throughput, lower I/O wait and far lower CPU load were observed.
| Test description | Map tasks | Entities per mapper | Total entities written | Phoenix transaction rate (per mapper, ops/sec) | HBase transaction rate (per mapper, ops/sec) | Phoenix write time (job counter TIMELINE_SERVICE_WRITE_TIME) | HBase write time (job counter TIMELINE_SERVICE_WRITE_TIME) |
|---|---|---|---|---|---|---|---|
| Synthetic data | 170 | 1k | 170k | 112.83 | 2285.13 | 1506704 | 74394 |
| Synthetic data | 170 | 10k | 1.7M | 53.029 | 636.41 | 32057957 | 2671241 |
| Synthetic data | 1 | 50k | 50k | 196.67 | 19770.66 | 254225 | 2529 |
| 9 history files | 33 | - | 85k | 319.19 (write errors) | 962.32 | 265460 | 88049 |
| 555 history files | 33 | - | 810k | 206.25 (write errors) | 927.62 | 4102364 | 874151 |
Aggregation
Aggregation basically means rolling up metrics from child entities to parent entities. Different operations such as SUM, AVG, etc. can be performed while rolling them up, and the result is stored in the parent (see the example and sketch below).
App-level aggregation is done by the app collector as and when it receives metrics.
Online (real-time) aggregation for apps is a simple SUM of the metrics of the child entities. Additional metrics are also stored which indicate AVG, MAX, AREA (time integral), etc. More on this in the next slide.
App-to-flow-run aggregation is done via an HBase coprocessor on the read path; cell tags are used to achieve this.
For users and flows, aggregation happens periodically (offline, not in real time). For this, Phoenix tables are used: an MR job reads the application table and writes to the user and flow aggregation tables.
Example (SUM roll-up):
- Container A1 (CPUCoresMillis = 400) and Container A2 (CPUCoresMillis = 300) roll up to App A (CPUCoresMillis = 700).
- Container B1 (CPUCoresMillis = 200) rolls up to App B (CPUCoresMillis = 200).
- App A and App B roll up to the Flow (CPUCoresMillis = 900).
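A minimal, self-contained sketch of this SUM roll-up is shown below; the MetricRollup class and the plain Map representation of metrics are hypothetical and are not the ATSv2 API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetricRollup {
    /** Sums each metric across the child entities and stores the result on the parent. */
    static Map<String, Long> rollUp(List<Map<String, Long>> childMetrics) {
        Map<String, Long> parent = new HashMap<>();
        for (Map<String, Long> child : childMetrics) {
            child.forEach((metric, value) -> parent.merge(metric, value, Long::sum));
        }
        return parent;
    }

    public static void main(String[] args) {
        // Containers A1 and A2 roll up to App A; apps roll up to the flow.
        Map<String, Long> appA = rollUp(List.of(
                Map.of("CPUCoresMillis", 400L),   // Container A1
                Map.of("CPUCoresMillis", 300L))); // Container A2
        Map<String, Long> appB = rollUp(List.of(
                Map.of("CPUCoresMillis", 200L))); // Container B1
        Map<String, Long> flow = rollUp(List.of(appA, appB));
        System.out.println("App A = " + appA + ", App B = " + appB + ", Flow = " + flow);
    }
}
```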
Accumulation
While aggregating, metric values are also accumulated along the time dimension. This is especially useful for gauges. Consider the table below, which shows the CPU utilization (in cores) of the containers belonging to an app; t1...t16 represent points in time 10 ms apart. The table shows how values are aggregated for the app at each instant, and how they are accumulated and averaged over time. The trapezoidal rule is used to calculate the area under the curve between two samples: area += ((value_t1 + value_t2) / 2) * Δt.
CPU utilization (cores) of each container over its lifetime, sampled 10 ms apart (containers start and finish at different times, so their samples cover different sub-ranges of t1...t16):
- Container 1: 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0
- Container 2: 0.5, 1, 1, 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0
- Container 3: 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0
- Container 4: 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0
- Container 5: 0.5, 0.5, 1, 0

Aggregated application values, accumulated area and running average at t1...t16:

| | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 | t11 | t12 | t13 | t14 | t15 | t16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Application | 1 | 2.5 | 4 | 4 | 4 | 3.5 | 3 | 3 | 3 | 3 | 2 | 1.5 | 1 | 1 | 1 | 0 |
| Area (CoreMillis) | - | 15 | 42 | 82 | 122 | 160 | 192 | 222 | 252 | 282 | 307 | 325 | 335 | 345 | 355 | 360 |
| Average | - | 1.5 | 2.1 | 2.7 | 3.1 | 3.2 | 3.2 | 3.1 | 3.1 | 3.1 | 3.1 | 3 | 2.8 | 2.6 | 2.5 | 2.4 |
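The sketch below illustrates the accumulation step: it computes the area under the curve with the trapezoidal rule and the resulting running average. The MetricAccumulator class and the sample values are illustrative, not ATSv2 code.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class MetricAccumulator {

    /** Area under the curve (value * millis, e.g. CoreMillis) using the trapezoidal rule. */
    static double areaUnderCurve(NavigableMap<Long, Double> samples) {
        double area = 0.0;
        Map.Entry<Long, Double> prev = null;
        for (Map.Entry<Long, Double> cur : samples.entrySet()) {
            if (prev != null) {
                long dt = cur.getKey() - prev.getKey();                  // time between samples
                area += ((prev.getValue() + cur.getValue()) / 2.0) * dt; // trapezoid
            }
            prev = cur;
        }
        return area;
    }

    public static void main(String[] args) {
        // Aggregated application values sampled 10 ms apart (first points of the table above).
        NavigableMap<Long, Double> app = new TreeMap<>();
        app.put(0L, 1.0);
        app.put(10L, 2.5);
        app.put(20L, 4.0);
        app.put(30L, 4.0);
        double area = areaUnderCurve(app);
        double average = area / (app.lastKey() - app.firstKey());  // running average so far
        System.out.println("Area = " + area + " CoreMillis, Average = " + average + " cores");
    }
}
```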
[Chart: CPU cores for the app over time, with the running average.]
HBase Table Schema
Entity Table - Used for storing Timeline Entity objects. Contains configs, metrics and other info (events, parent-child relationships, etc.).
Row Key: clusterId!user!flowId!flowRunId!appId!entityType!entityId
Application Table - Used for storing YARN application entities. Contains configs, metrics and other info. Same as the entity table, but added for better performance.
Row Key: clusterId!user!flowId!flowRunId!appId
App To Flow Table - Used for getting flowId and flowRunId based on cluster and app. This is helpful for querying the entity table with just the cluster and app information.
Row Key: clusterId!appId
Flow Run Table - Stores flow run information aggregated across apps.
Row Key: clusterId!user!flowId!flowRunId

HBase Table Schema (Contd.)
Flow Activity Table - Used for storing daily activity records for a flow. For quick lookup of flow-level info.
Row Key: clusterId!inverted top-of-the-day timestamp!user!flowId
Phoenix Tables for Offline Aggregation:
Flow Aggregation Table - Stores aggregated metrics at the flow level. Metrics are aggregated from the application table.
Primary Key: user, cluster, flowId
User Aggregation Table - Stores aggregated metrics at the user level. Metrics are aggregated from the application table.
Primary Key: user, cluster
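To make the row key layout concrete, the sketch below assembles an entity-table key from its components. The EntityRowKey class and the example values are hypothetical; the real ATSv2 writer applies additional encoding (for example the inverted top-of-the-day timestamp mentioned above), so this is only a structural illustration.

```java
// Composes an entity-table row key following the layout
// clusterId!user!flowId!flowRunId!appId!entityType!entityId described above.
public final class EntityRowKey {
    private static final String SEP = "!";

    static String rowKey(String clusterId, String user, String flowId,
                         long flowRunId, String appId,
                         String entityType, String entityId) {
        return String.join(SEP, clusterId, user, flowId,
                Long.toString(flowRunId), appId, entityType, entityId);
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(rowKey("cluster1", "joe", "hive_query_17",
                1453234723789L, "application_1334432321_0002",
                "YARN_CONTAINER", "container_1334432321_0002_01_000001"));
    }
}
```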
Querying ATSv2
ATSv2 offers a major enhancement over ATSv1 in terms of the queries supported. Efficient queries around flows, flow runs, apps, etc. are possible. Moreover, ATSv2 can support complex queries to filter the results.
ATSv1 offered only primary and secondary filters for filtering entities. ATSv2 adds the ability to filter entities based on config values, metric values, entity parent-child relationships and events. It also supports returning only selected configurations and metrics in the result.
ATSv1 queries supported only "equal to" matches for primary and secondary filters, which does not quite make sense for metrics: a user filtering on metric values is more likely to use relational operators such as >=, <=, !=, etc. All of these relational operators are supported in ATSv2 for metrics. In addition, different predicates in filters can be combined using "AND" and "OR" operators.
All in all, this gives ATSv2 a very powerful query interface.
Querying ATSv2 (Contd.)
ATSv2, like ATSv1, supports a REST API with JSON as the media. Some examples are given below.
Get Entities - Returns a set of TimelineEntity objects based on cluster, app and entity type. The query also supports multiple optional parameters, such as a limit on the number of entities returned, the configurations and metrics to return, filters on the created/modified time window, and config, metric and event filters.
http://localhost:8188/entities/{clusterId}/{appId}/{entityType}
Example: http://localhost:8188/entities/cluster1/application_1334432321_0002/YARN_CONTAINER?limit=5&metrics=memory,cpu
Get Entity - Returns a TimelineEntity object based on cluster, app, entity type and entityId.
http://localhost:8188/entity/{clusterId}/{appId}/{entityType}/{entityId}
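For completeness, here is a small sketch of issuing the Get Entities call above from Java 11+ using java.net.http; the host, port and query parameters come from the example URL and are illustrative.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TimelineReaderQuery {
    public static void main(String[] args) throws Exception {
        // Example query: at most 5 YARN_CONTAINER entities with their memory and cpu metrics.
        String url = "http://localhost:8188/entities/cluster1/"
                + "application_1334432321_0002/YARN_CONTAINER?limit=5&metrics=memory,cpu";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")  // JSON is the media type
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());  // JSON list of matching timeline entities
    }
}
```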
Possible use cases
Cluster utilization and inputs for capacity planning; the cluster can learn from flows'/applications' historical data.
Mapper/reducer optimizations.
Application performance over time.
Identifying job bottlenecks.
Ad-hoc troubleshooting and identification of problems in the cluster.
Complex queries are possible at flow, user and queue level; for instance, the percentage of applications that ran more than 10000 containers.
The full DAG from flow to flow run to application to container level can be seen.
Team Members Sangjin Lee, Vrushali C and Joep Rottinghuis (Twitter)
Junping Du, Li Lu and Vinod Kumar Vavillapalli (Hortonworks)
Zhijie Shen (formerly Hortonworks)
Varun Saxena and Naganarasimha G R (Huawei)
Robert Kanter and Karthik Kambatla (Cloudera)
Inputs from LinkedIn, Yahoo! and Altiscale.
Feature Status
Distributed per-app and per-node writers (as Aux Service)
RM Companion writer
NM, RM and AM writing events and metrics to ATS
File based readers and writers for test
HBase and Phoenix writer implementations
Performance evaluation of these writers
HBase based reader implementation
Support for flows
App and flow run level online Aggregation
Offline Aggregation
Query Interface
Feature Status (Contd.)
Standalone timeline writer
Distributed timeline readers and a reader pool
ATSv2 UI
Security
Support for migration
Thank You!