2017 February 07
Hieu LE ([email protected])
Fujitsu Vietnam Limited
PODC (Platform Offshore Development Center)
Vietnam OpenStack Community - VFOSSA
Logging/Request Tracing in Distributed Environment
Copyright 2017 Fujitsu Vietnam Limited
/me
APRICOT 2017
Hieu LE
Vietnam Official OpenStack Community Organizer
VFOSSA Executive Member
OpenStack Project leader @ Fujitsu
OpenStack ATC/AUC
Email: [email protected]
Outline
1. Intro
2. Current Logging solution
Pros
Cons
3. Tracing requirements
4. Request tracing
Demo with OpenStack
Intro
Distributed Environment:
Cloud Computing – Fog Computing.
IoT environment.
Micro-services architecture.
IoT – Fog – Cloud
[Figure: IoT, fog and cloud layers. Labels: (Virtual) Storage Services/Servers; Virtual Compute Resources; Virtual Network; Platforms (O2M2, Thingworx, DeviceHive, Other); Multiple Clouds; Routing, Optimizing paths, Data pre-processing]
• What if something goes wrong in our system?
• How can we resolve the problem as quickly as possible?
Current Logging solution (1)
ELK, Graylog:
Collecting logs from systems and appliances.
Indexing and filtering for root cause analysis (RCA).
Multiple Alert/Notify mechanisms.
Visualization based on user’s needs.
Current Logging solution (2)
Pros:
• Quickly troubleshoot problems in systems/appliances.
• Reduce the cost of storing logs as required by PCI DSS or HIPAA.
Cons:
• Mostly depends on the logs of systems/appliances.
• Requires more effort for sizing, deploying, maintaining and operating the logging solution.
• Eats up resources (mostly storage).
• May not be suitable for small sensors.
Current Logging solution (3)
Example 01:
A single request to launch one VM in an OpenStack cloud can go through at least four micro-services.
• INFO-level logs sometimes contain misleading information, or not enough information for troubleshooting.
• Turning on the DEBUG log level:
– Produces too much information and eats up storage.
– Makes the overhead threshold hard to control.
Current Logging solution (4)
Example 02:
ELK/Graylog requires tweaks and extra effort for visualization, collection, profiling and RCA in a distributed environment.
Consider the following queries in an environment with more than 10 services:
“Find me the root cause of all failed requests that handle business process X.”
“Find me requests where the user was logged in and the request took more than two seconds and a DB transaction was held open for more than 500 ms.”
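The second query becomes simple once every record carries a request-wide identifier. The sketch below is a minimal illustration, not the API of ELK, Graylog or any tracing system: the `Span` record, its field names, the `user.logged_in` tag and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed step of a request (hypothetical record layout)."""
    trace_id: str        # shared by every span belonging to one request
    service: str
    operation: str
    duration_ms: float
    tags: dict = field(default_factory=dict)


def slow_logged_in_requests(spans, min_request_ms=2000, min_db_tx_ms=500):
    """Return trace ids where the user was logged in, the request took
    longer than min_request_ms, and a DB transaction stayed open longer
    than min_db_tx_ms."""
    by_trace = {}
    for span in spans:
        by_trace.setdefault(span.trace_id, []).append(span)

    matches = []
    for trace_id, group in by_trace.items():
        logged_in = any(s.tags.get("user.logged_in") for s in group)
        request_ms = max(
            (s.duration_ms for s in group if s.operation == "request"),
            default=0,
        )
        long_db_tx = any(
            s.operation == "db.transaction" and s.duration_ms > min_db_tx_ms
            for s in group
        )
        if logged_in and request_ms > min_request_ms and long_db_tx:
            matches.append(trace_id)
    return sorted(matches)


# Illustrative data only: four requests, one of which satisfies all conditions.
SPANS = [
    Span("t1", "api", "request", 2500, {"user.logged_in": True}),
    Span("t1", "db", "db.transaction", 700),
    Span("t2", "api", "request", 2500, {"user.logged_in": True}),
    Span("t2", "db", "db.transaction", 100),
    Span("t3", "api", "request", 500, {"user.logged_in": True}),
    Span("t3", "db", "db.transaction", 900),
    Span("t4", "api", "request", 3000, {"user.logged_in": False}),
    Span("t4", "db", "db.transaction", 800),
]
```

With plain per-service logs there is no shared `trace_id` to group on, which is why the same question needs extra correlation work in a log-centric stack.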
Tracing Requirements
Operator Needs:
• Address the Data Explosion: logs, metrics, events, active/passive checks, …
• End-to-End Debugging: understand what the real issue is and what is affected when errors occur
• Visibility: deliver centralized intelligence for cloud operations at scale
• Resource Utilization: understand resource availability and utilization
Solution Requirements:
• Able to collect, store and access all types of data in one place
• Highly performant and scalable platform
• Flexible processing pipeline that can support multiple use cases: diagnostics, root cause analysis, SLA calculations, utilization reporting, …
• Extensible platform that can be extended to support new types of data and processing
Tracing Requirements
• Users need a centralized solution that provides enough information from both the machine-centric (monitoring) and workflow-centric (tracing) views.
– Provide a general picture of every workflow: the communication steps and the request/response time of each step, for performance review.
– Show the monitoring metrics of hardware/services for each step at the time of investigation.
– Provide a general-purpose RCA method for quick troubleshooting.
Workflow Centric solution quick survey
Many solutions aim at workflow-centric tracing. They fall into three categories [1]:
1. Explicit metadata propagation: inject tracing metadata into the current system (Zipkin, Kieker, X-Trace, Tracelytics, Cloudera HTrace, ExplorViz, OpenTracing - CNCF).
2. Schema-based: rely on the event semantics of the system and use a temporal schema of custom log messages for tracing (Magpie).
3. Black-box tracing: rely on log analysis to infer the relationships among events (FChain, NetMedic).
[1] HANSEL: Diagnosing Faults in OpenStack, IBM Research
Workflow centric solutions (1)
• [Figure: traditional workflow. A request (Req) passes through Service A, Service B, Service C and Service D.]
Workflow centric solutions (2)
• Explicit metadata propagation
[Figure: explicit metadata tracing workflow. A request (Req) passes through Service A, Service B, Service C and Service D; metadata is injected into each request/response and sent to a Tracing Mechanism (Zipkin, Dapper, ...).]
Workflow centric solutions (3)
• Explicit metadata propagation
Pros:
• Gives enough detail for tracing problems.
• High scalability.
Cons:
• Must modify the code base to inject metadata into the header of each request and response.
• Increases network packet size (only slightly; for Zipkin, around 500 bytes).
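A minimal sketch of the idea, using plain function calls in place of real network hops. The header names `X-Trace-Id` and `X-Parent-Span-Id`, the `Collector` class and the `handle` function are hypothetical names chosen for the illustration; they are not the header set or API of Zipkin, Dapper or OSProfiler.

```python
import uuid

# Hypothetical header names, chosen only for this illustration.
TRACE_HEADER = "X-Trace-Id"
PARENT_HEADER = "X-Parent-Span-Id"


class Collector:
    """Stands in for the tracing mechanism that receives span reports."""

    def __init__(self):
        self.spans = []

    def report(self, trace_id, span_id, parent_id, service):
        self.spans.append(
            {
                "trace_id": trace_id,
                "span_id": span_id,
                "parent_id": parent_id,
                "service": service,
            }
        )


def handle(service, headers, collector, downstream=()):
    """Handle a request in `service`, then call the next downstream service.

    The trace id is reused if the caller sent one, otherwise created here.
    Each hop reports its own span and forwards the metadata in the headers.
    """
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    parent_id = headers.get(PARENT_HEADER)
    span_id = uuid.uuid4().hex[:16]
    collector.report(trace_id, span_id, parent_id, service)

    if downstream:
        # This is the code-base change listed under "Cons": every outgoing
        # call must carry the tracing headers.
        next_headers = dict(headers)
        next_headers[TRACE_HEADER] = trace_id
        next_headers[PARENT_HEADER] = span_id
        handle(downstream[0], next_headers, collector, downstream[1:])
    return trace_id
```

Calling `handle("Service A", {}, Collector(), ("Service B", "Service C", "Service D"))` yields four spans that share one trace id, each pointing at its caller through `parent_id`, so the tracing mechanism can rebuild the whole request path.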
Workflow centric solutions (4)
• Schema-based: relies on the semantics of events generated by the system (including OS, services and applications), then joins all related event schemas for the final inference.
[Figure: schema-based tracing. A request (Req) passes through Service A, Service B, Service C and Service D; an Event Listener collects events such as Authenticate, Get Image, "Create port, IP and attach" and DB Read/Write.]
Workflow centric solutions (5)
• Schema-based
Pros:
• Fewer modifications to the code base.
Cons:
• Low scalability (the result is delayed until all events are collected).
• Less detail than explicit metadata propagation. The semantics of the events, the event list and the way schemas are joined determine the success of this approach, so we need to build a warehouse of event semantics.
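The join step can be sketched as follows. This is a simplified illustration of the general idea, not Magpie's actual algorithm: the event types, the attribute names in `JOIN_ATTRS` and the sample events are assumptions. Events carry no trace id; they are linked only because the schema says which attributes tie them together.

```python
# Hypothetical schema: attributes whose shared values link two events.
JOIN_ATTRS = ("request_id", "thread", "conn")

# Illustrative events as an event listener might collect them.
EVENTS = [
    {"type": "http_request", "ts": 1.0, "request_id": "r1", "thread": 7},
    {"type": "db_query", "ts": 1.2, "thread": 7, "conn": 42},
    {"type": "disk_write", "ts": 1.3, "conn": 42},
    {"type": "http_request", "ts": 1.1, "request_id": "r2", "thread": 8},
    {"type": "db_query", "ts": 1.4, "thread": 8, "conn": 43},
]


def group_events(events, join_attrs=JOIN_ATTRS):
    """Group events into requests by joining on shared attribute values.

    Returns a list of groups, each a list of event types in time order.
    All events must be available before the result is final, which is
    the scalability limit noted above.
    """
    groups = []  # each group: {"keys": set of (attr, value), "events": [...]}
    for event in sorted(events, key=lambda e: e["ts"]):
        keys = {(a, event[a]) for a in join_attrs if a in event}
        linked = [g for g in groups if g["keys"] & keys]
        groups = [g for g in groups if not (g["keys"] & keys)]
        merged = {"keys": set(keys), "events": [event]}
        for g in linked:
            merged["keys"] |= g["keys"]
            merged["events"].extend(g["events"])
        merged["events"].sort(key=lambda e: e["ts"])
        groups.append(merged)
    groups.sort(key=lambda g: g["events"][0]["ts"])
    return [[e["type"] for e in g["events"]] for g in groups]
```

On the sample data the disk write is attached to request `r1` only through the chain thread 7, then connection 42; if the schema omitted `conn`, that event would be left on its own, which shows how much the result depends on the event semantics being complete.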
Workflow centric solutions (6)
• Black-box tracing: collects the logs of all services, then analyzes them to infer the root cause of a problem.
[Figure: black-box tracing. Service A, Service B, Service C, Service D and the DB each send Logs to a Log Collector and Analyzer.]
Workflow centric solutions (7)
• Black-box tracing:
Pros:
• No modification to the code base.
Cons:
• High error rate (most approaches are probabilistic data mining).
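To make the error-rate point concrete, here is a deliberately naive sketch of inference from log timing alone. It is not the method used by FChain or NetMedic; the `infer_edges` function, the time window, the confidence threshold and the sample logs are assumptions for the illustration.

```python
from collections import Counter


def infer_edges(logs, window=0.5, min_confidence=0.6):
    """Guess "service A calls service B" edges from log timestamps only.

    logs: list of (timestamp_seconds, service_name) tuples.
    An edge (a, b) is kept when a log line of b follows a log line of a
    within `window` seconds in at least `min_confidence` of a's lines.
    """
    logs = sorted(logs)
    seen = Counter()      # log lines per service
    follows = Counter()   # (a, b) -> number of a lines followed by b
    for i, (ts, svc) in enumerate(logs):
        seen[svc] += 1
        followers = set()
        for later_ts, later_svc in logs[i + 1:]:
            if later_ts - ts > window:
                break
            if later_svc != svc:
                followers.add(later_svc)
        for follower in followers:
            follows[(svc, follower)] += 1
    return {
        edge: count / seen[edge[0]]
        for edge, count in follows.items()
        if count / seen[edge[0]] >= min_confidence
    }


# Illustrative logs: A really calls B; C is an unrelated job that
# happened to log once just before B did.
LOGS = [
    (0.0, "A"), (0.1, "B"),
    (1.0, "A"), (1.2, "B"),
    (2.0, "A"), (2.1, "C"), (2.3, "B"),
]
```

On this data the true edge `("A", "B")` is found, but so is `("C", "B")`, purely because of one coincidence in timing. No code change was needed to get a result, yet the result contains a false dependency.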
Example (1)
Magpie: Schema-based
Example (2)
Zipkin: Explicit metadata propagation
Demo with OpenStack
OSProfiler: a small library for explicit metadata propagation
Q & A
THANK YOU!