
Slides for my talk at the Austrian Perl Workshop in Salzburg on October 10th, 2014. A video of the talk can be found at https://www.youtube.com/watch?v=4Qj-_eimGuE


Application Logging in the 21st Century

Austrian Perl Workshop – Oct 2014

1

Logging is Like Lego

Not the focus of this talk

Many Interchangeable Options

2

• Almost no logging when I joined in 2008

• Incremental improvements as a background project over years

• Currently capturing 600-900 logs / minute from ~200 machines

• Not claiming "best practice", just some hopefully useful tips from our long journey

Our Journey

3

• Adopted Log::Log4perl

• Wrote a utility function to add a log file (a sketch follows this slide)

• Intercept warnings and fatal exceptions

• Simple layout with timestamp and severity

Log file per-application

4
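
The utility function itself isn't on the slide; as a rough sketch of the idea (the name add_app_log_file and the log path are invented for illustration, not the actual TigerLead code), it amounts to attaching a file appender to the root logger:

use Log::Log4perl qw(get_logger);
use Log::Log4perl::Layout::PatternLayout;

# Hypothetical helper: attach a per-application log file to the root logger.
sub add_app_log_file {
    my ($app_name) = @_;

    my $appender = Log::Log4perl::Appender->new(
        'Log::Log4perl::Appender::File',
        name     => "${app_name}_file",
        filename => "/var/log/tiger/$app_name.log",  # path is an assumption
        mode     => 'append',
    );
    $appender->layout(
        Log::Log4perl::Layout::PatternLayout->new(
            '%d{yyMMdd HH:mm:ss} %.1p> %m{chomp}%n'
        )
    );
    get_logger('')->add_appender($appender);  # '' is the root logger
}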

Log4perl Layout

Config file

log4perl.rootLogger = INFO, TLScreen

log4perl.appender.TLScreen        = Log::Log4perl::Appender::Screen
log4perl.appender.TLScreen.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.TLScreen.layout.ConversionPattern = %d{yyMMdd HH:mm:ss} %.1p> %m{chomp} [@%F{1}:%L %M{1}()]%n

Example output

140929 14:06:25 I> some info message [@Broker.pm:221 process()]
140929 14:06:27 W> a warning [@BlackOakClientRole.pm:296 get_runner_for_class()]

5

Capture Warnings

$SIG{__WARN__} = sub {
    # protect against infinite recursion
    return warn @_  ## no critic (RequireCarping)
        if $within_log_sig
        or not defined $Log::Log4perl::Logger::ROOT_LOGGER;
    local $within_log_sig = 1;

    local $Log::Log4perl::caller_depth = $Log::Log4perl::caller_depth + 1;

    chomp(my $msg = shift);
    get_logger()->warn($msg);
};

6

Capture Fatal Exceptions

$SIG{__DIE__} = sub {
    return if $^S;             # we're in an eval, so ignore it
    die @_ if not defined $^S; # parsing module/eval

    # protect against infinite recursion
    die @_  ## no critic (RequireCarping)
        if $within_log_sig
        or not defined $Log::Log4perl::Logger::ROOT_LOGGER;
    local $within_log_sig = 1;

    local $Log::Log4perl::caller_depth = $Log::Log4perl::caller_depth + 1;

    chomp(my $msg = shift);
    get_logger()->fatal($msg);

    die "$msg\n"; # may duplicate the message, but that's better than losing it
};

7

Were there any errors?

log4perl.rootLogger = INFO, TLScreen, TLErrorBuffer

log4perl.appender.TLErrorBuffer              = TigerLead::Log::Appender::RecentSummaryBuffer
log4perl.appender.TLErrorBuffer.Threshold    = ERROR
log4perl.appender.TLErrorBuffer.max_messages = 10
log4perl.appender.TLErrorBuffer.layout       = Log::Log4perl::Layout::PatternLayout
log4perl.appender.TLErrorBuffer.layout.ConversionPattern = %m{chomp}

A ring buffer for log messages, used at the end of old batch-job code to decide if something went wrong. (A sketch of the idea follows this slide.)

8
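
TigerLead::Log::Appender::RecentSummaryBuffer is an in-house class, but the idea fits Log::Log4perl's simple custom-appender API (an object with new() and log() methods). A minimal sketch, with invented class and method names:

package My::RecentSummaryBuffer;
use strict;
use warnings;

sub new {
    my ($class, %options) = @_;
    return bless {
        max_messages => $options{max_messages} || 10,
        buffer       => [],
    }, $class;
}

# Log::Log4perl calls log() for each message that passes the
# appender's Threshold (ERROR in the config above).
sub log {
    my ($self, %params) = @_;
    my $buf = $self->{buffer};
    push @$buf, $params{message};
    shift @$buf while @$buf > $self->{max_messages};  # keep only the newest N
}

sub recent_messages { return @{ $_[0]{buffer} } }

1;

At the end of a batch job the code can then fetch the appender (e.g. via Log::Log4perl->appender_by_name) and treat any buffered messages as "something went wrong".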

State of play

• Timestamped log message with severity etc

• Per-app log files

• Can tell if warnings or errors were produced

But:

• Not capturing stdout/stderr & non-perl apps

9

Flow of log messages

[Diagram: apps emitting output with no log capture (marked X)]

10

Flow of log messages

[Diagram: apps -> per-app log files; stdout/stderr still uncaptured (marked X)]

11

setsid $start_daemons_command 2>&1 \
    | setsid $capture_logs_command &

setsid puts the daemons into a separate process group, isolated from the terminal.
We capture stdout/stderr from all child processes and pipe them to the logger process.
The logger process is also in a separate, isolated process group.

We use daemontools, so for us:
    start_daemons_command="svscan $supervise_dir"
    capture_logs_command="multilog t s1000000 n100 $logdir"

multilog t prepends high-resolution timestamps to log messages
(the accuracy depends on when the log was flushed).
multilog s1000000 n100 dir does log rotation for us.
The logger exits only when all child processes have closed stdout/stderr,
even if they've become daemons, forked more child processes and died.

Capturing stdout/stderr

12

Flow of log messages

[Diagram: apps -> per-app log files, plus a common log file capturing stdout/stderr]

13

State of play

• Capturing stdout/stderr & non-perl apps

But:

• We had to log in to see what was happening

• No single place to watch errors and warnings across the systems

• Wanted to parse log messages to extract more useful info

14

Stream: Logstash – collect, edit, and forward logs

Store: Elasticsearch – real-time distributed search and analytics engine. JSON REST over Lucene

View: Kibana – browser based analytics and search dashboard for Elasticsearch

Log Stream-Store-View

15

Inputs: collectd drupal_dblog elasticsearch eventlog exec file ganglia gelf gemfire generator graphite heroku imap invalid_input irc jmx log4j lumberjack pipe puppet_facter rabbitmq rackspace redis relp s3 snmptrap sqlite sqs stdin stomp syslog tcp twitter udp unix varnishlog websocket wmi xmpp zenoss zeromq

Codecs: cloudtrail collectd compress_spooler dots edn edn_lines fluent graphite json json_lines json_spooler line msgpack multiline netflow noop oldlogstashjson plain rubydebug spool

Filters: advisor alter anonymize checksum cidr cipher clone collate csv date dns drop elapsed elasticsearch environment extractnumbers fingerprint gelfify geoip grep grok grokdiscovery i18n json json_encode kv metaevent metrics multiline mutate noop prune punct railsparallelrequest range ruby sleep split sumnumbers syslog_pri throttle translate unique urldecode useragent uuid wms wmts xml zeromq

Outputs: boundary circonus cloudwatch csv datadog datadog_metrics elasticsearch elasticsearch_http elasticsearch_river email exec file ganglia gelf gemfire google_bigquery google_cloud_storage graphite graphtastic hipchat http irc jira juggernaut librato loggly lumberjack metriccatcher mongodb nagios nagios_nsca null opentsdb pagerduty pipe rabbitmq rackspace redis redmine riak riemann s3 sns solr_http sqs statsd stdout stomp syslog tcp udp websocket xmpp zabbix zeromq

Logstash Stream Processing

16

Logstash Configuration

input {
  stdin { }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}

output {
  elasticsearch { host => localhost }
  stdout { codec => rubydebug }
}

17

• Document oriented. Schema free.

• JSON in and out. RESTful API. (a query example follows this slide)

• Powerful indexing and search via Lucene.

• Distributed and massively scalable.

• Big community, rapid growth.

• Generally awesome.

Elasticsearch Buzzwords

18
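
A quick illustration of "JSON in and out" from Perl, using the Search::Elasticsearch CPAN client; a sketch only, and the index pattern and field names are assumptions based on logstash defaults:

use strict;
use warnings;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new(nodes => ['localhost:9200']);

# Fetch the ten most recent error-level events from the daily
# logstash-YYYY.MM.DD indices.
my $results = $es->search(
    index => 'logstash-*',
    body  => {
        query => { match => { severity_label => 'error' } },
        sort  => [ { '@timestamp' => { order => 'desc' } } ],
        size  => 10,
    },
);

printf "%s %s\n", $_->{_source}{'@timestamp'}, $_->{_source}{message}
    for @{ $results->{hits}{hits} };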

Kibana

19

• Started with single machine

• Now using three machines

• Logstash, Elasticsearch and Kibana on each

• Elasticsearch cluster across all three

• HAProxy load balancer in front of all three

Our ELK setup

20

Flow of log messages

[Diagram: apps -> log files -> logstash -> Elasticsearch (ES) -> Kibana]

21

• Forwarding the system syslog was an easy first step

• We're using CentOS6 with rsyslog v7.6

• Started forwarding notice+ severity messages but now forward info+

syslog forwarding

22

Rsyslog forwarding

# buffering config
$WorkDirectory /var/lib/rsyslog   # where to place spool files
$ActionQueueFileName logstash     # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g       # 1gb space limit
$ActionQueueSaveOnShutdown on     # save messages to disk on shutdown
$ActionQueueType LinkedList       # run asynchronously
$ActionResumeRetryCount -1        # infinite retries if host is down

# forward info+ level logs from all facilities to logstash
*.info @@logstash-app-stag.tigerlead.local:5544;RSYSLOG_ForwardFormat

# RSYSLOG_ForwardFormat gives us a high-resolution timestamp and timezone.
# We use TCP (not UDP) for reliability; may switch to RELP later.

23

Flow of log messages

[Diagram: apps -> log files -> logstash -> ES -> Kibana; the system rsyslog (with its disk queue) also forwards to logstash]

24

25

• Wanted to parse messages but didn't want to do that on the central logstash server

• Started with a Message::Passing utility to tail and parse specific log files and ship them as JSON (the idea is sketched after this slide)

• Turned out we don't need much parsing

• Now using an extra rsyslogd that follows log files and forwards to the local root rsyslogd

Ship our logs to logstash

26
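
Not the Message::Passing utility itself (that code is internal), but the tail-and-ship idea looks roughly like this; the host, port and file name below are made up:

use strict;
use warnings;
use File::Tail;
use IO::Socket::INET;
use JSON::PP qw(encode_json);

my $sock = IO::Socket::INET->new(
    PeerAddr => 'logstash.example.local',  # assumption
    PeerPort => 5545,                      # assumption
    Proto    => 'tcp',
) or die "connect failed: $!";

my $tail = File::Tail->new(name => '/var/log/tiger/myapp.log');
while (defined(my $line = $tail->read)) {
    chomp $line;
    # one JSON document per line, ready for a logstash json codec
    print $sock encode_json({ message => $line, type => 'app' }), "\n";
}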

Flow of log messages

[Diagram: apps -> log files -> shipper -> logstash -> ES -> Kibana; system rsyslog (with queue) -> logstash]

27

Flow of log messages

[Diagram: as above, but with rsyslog as the shipper: apps -> log files -> rsyslog -> logstash -> ES -> Kibana]

28

• Still have our 'app log files' separate from the 'system log files' in /var/log/*

• Harder to correlate events between them

• Experiment: use syslog for more/everything?

• Want: per-app log files, high-res timestamp with lexical ordering (sort -m *.log | ...)

• Let the system look after log rotation etc

Eradicating 'our' log files

29

Send app logs to syslog

log4perl.rootLogger = INFO, TLScreen, TLErrorBuffer, TLSyslog

log4perl.appender.TLSyslog        = TigerLead::Log::Appender::Syslog
log4perl.appender.TLSyslog.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.TLSyslog.layout.ConversionPattern = %m{chomp} [@%F{1}:%L %M{1}()]%n

The syslog format provides program name, severity and pid.

30

Eradicating 'our' log files

template(name="sortable_log_format" type="string"
    # format for log lines
    # e.g. "2014-06-28 17:47:11.636078 $facility.$severity $program: $message"
    string="%TIMESTAMP:::date-pgsql%.%TIMESTAMP:::date-subseconds% %PRI-TEXT% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n")

template(name="file_per_programname" type="string"
    # format for log file names
    # e.g. program="run-parts(/etc/cron.hourly)"
    # becomes "/var/log/tiger/run-parts" using the 'leading safe characters'
    string="/var/log/tiger/%programname:R,ERE,0,ZERO:^[-_a-zA-Z0-9]+--end%.log")

ruleset(name="write_tiger_progname_log_files") {
    action(Type="omfile"
           Template="sortable_log_format"
           DynaFile="file_per_programname")
}

if ( ($syslogseverity <= 5) or not ($programname == [ ... ]) ) then {
    call write_tiger_progname_log_files
}

31

Flow of log messages

[Diagram: apps -> rsyslog -> logstash -> ES -> Kibana; rsyslog also writes the per-program log files]

32

Logstash Enrichment #1

hostgroup - first word of server name

• handy to focus in on a group of servers related to a particular service

punct - just the punctuation chars

• handy to focus on, or exclude, a particular 'shape' of message (both fields are illustrated after this slide)

33
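
As a rough Perl illustration of what these two enrichment fields amount to (the real work is done by logstash filters, which aren't shown on this slide):

my $host = 'apex-fe-stag-ddc-01';
my ($hostgroup) = $host =~ /^([^-]+)/;   # first word -> "apex"

# keep only the punctuation: differently-worded messages with the
# same 'shape' end up with the same punct value
my $message = 'GET /a/sa/search?rgu=0 took 12ms';
(my $punct = $message) =~ s/[\w\s]+//g;  # -> "///?="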

Quick Demo

• Overview

• Drill-down

• Time ranges

• Multiple queries

• Share URL

34

State of play

• No longer had to log in to multiple machines to see what was happening

• Can easily drill-down to explore the logs from multiple machines and systems

• Can share a URL to that view - very handy

But now:

• Want to be able to live-stream errors

35

• Separate production and staging channels

• Currently just error severity or higher

• Messages with 'alert' or 'emergency' severity are also sent to main developer channel

• Proven to be very useful

Live-stream to IRC

36

But:

• occasionally have floods of messages

• logstash irc rate limiting behaviour is dumb

• want to rate-limit only 'repeated' messages

• 'repeated' should allow for minor differences

• logstash can help...

Live-stream to IRC

37

Enrichment: message_gist

mutate {
  add_field => [ "message_gist", "%{message}" ] # copy to edit
}
mutate {
  # normalize numbers
  gsub => [ "message_gist", "[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?", "N" ]
  # normalize double quoted strings
  gsub => [ "message_gist", "\"[^\"]*\"", "S" ]
  # normalize single quoted strings, but try to avoid matching apostrophes
  gsub => [ "message_gist", "(\A|\W)'[^']*'(?!\w)", "\1S" ]
  # truncate urls to remove the query/fragment part
  gsub => [ "message_gist", "(\w:/[^?\#\s]*)\S*", "\1" ]
}
fingerprint {
  # convert the normalized string into an integer hash
  source => "message_gist"
  target => "message_gist"
  method => "MURMUR3"
}

38

Enrichment: repeat tag

if [severity] and [severity] =~ /0|1|2|3|4/ {

  throttle {
    period => 60          # seconds
    before_count => -1
    after_count => 2      # allow N within period before throttling
    key => "%{hostgroup}%{severity}%{program}%{message_gist}"
    max_counters => 10000 # track this many variants
    add_tag => "repeat"
  }

  # may add a more strict 'duplicate' tag here in future
  # using period=>5, after_count=>1, and %{message} not %{message_gist}
}

39

Enrichment: late tag

# flooding may cause a backlog that delays messages reaching logstash,
# so tag messages that arrive 'late'
ruby {
  code => "
    msg_age = Time.now - event['@timestamp']

    if    msg_age >= +60 then msg_tag = 'late'  # delayed
    elsif msg_age <= -60 then msg_tag = 'early' # craziness
    end

    if msg_tag then
      event.tag msg_tag
      event['message_delay'] = msg_age.to_i # age
    end
  "
}

40

Better IRC live-stream

if [severity] and [severity] =~ /0|1|2|3|4/
   and "repeat" not in [tags]
   and (![message_delay] or [message_delay] < 600) # not too 'late'
{
  if [severity] =~ /0|1|2|3/ { # 4 (warning) is currently too noisy
    irc {
      channels => [ "#logprod" ]
      messages_per_second => 10
      format => "%{severity_label} %{host} %{program}: %{message}"
    }
  }
  if [severity] =~ /0|1/ { # emergency and alert only
    irc {
      channels => [ "#l2dev" ]
      messages_per_second => 5
      format => "%{severity_label} %{host} %{program}: %{message}"
    }
  }
}

41

Flow of log messages

[Diagram: apps -> rsyslog -> logstash -> ES -> Kibana, with logstash also live-streaming to IRC]

42

State of play

• Live-stream to IRC, promotes awareness

• Developers work to reduce spurious noise

But now we want more context:

• "what was the app working on when that warning or error was triggered?"

• "what was the web request URL?" or "what were the async job parameters?"

43

• Add more info into every log message text, then parse it out again? Not ideal.

• Start by capturing all the HTTP access logs

• Could do log-shipping for each access log file

• But all traffic passes through HAProxy

• So HAProxy logging can give us everything

How to get context?

44

• already had haproxy notice+ messages

• now added haproxy traffic logs, first HTTP then TCP as well

• can include one request and response cookie

• plus multiple request and response headers

HAProxy logs

45

HAProxy Configuration

defaults mode tcp
    log-format %ci\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %U\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq

defaults mode http
    log-format %ci\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %U\ %B\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %ID\ %{+Q}r\ %ST\ %Tq/%Tr\ %{+Q}CC\ %{+Q}hr\ %{+Q}CS\ %{+Q}hs

frontend stripes-prod-frontend 108.168.241.12:80 # example service
    capture request header Referer len 200
    capture request header User-agent len 300
    capture response header Location len 300
    capture cookie _session= len 63

46

HAProxy Logs

Example TCP log:

10.60.201.12 [09/Oct/2014:22:29:45.317] carbon-stag-frontend carbon-stag-backend/carbon-app-stag-ddc-01 1/0/2 3040 0 -- 57/45/45/45/0 0/0

Example HTTP log:

10.60.199.78 [09/Oct/2014:21:34:04.361] apex-fe-stag-frontend apex-fe-stag-backend/apex-fe-stag-ddc-01 0/0/2594 956 86661 ---- 63/1/0/0/0 0/0 0A3CC74E:CC62_0A3CC933:0050_5436FF4C_462C7E:696C "GET /a/sa/search?rgu=0&domain_id=10366 HTTP/1.1" 200 337/2256 "_session=4889b2859286db6511f2e9e9b33cdbe37f5b43ab" "{|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36}" "f_session=4889b2859286db6511f2e9e9b33cdbe37f5b43ab" "{}"

47

• change the host field (and thus hostgroup) to the backend machine name, so the logs from haproxy appear to be coming from the appropriate machine

• parse out request URL parameters

• decode URL parameters

Logstash for HAProxy

48

Logstash for HAProxy

# extract the request url params into a 'params' hash
mutate { gsub => [ "request", "#.*", "" ] } # remove fragment, if any, first
kv { source => "request" field_split => "&?" target => "params" }

# XXX disabled re https://github.com/elasticsearch/logstash/issues/1695
# urldecode { field => "params" all_fields => true }

if [response] >= 500 {
  mutate { replace => [ "severity", "4", "severity_label", "warn" ] }
}
else if [response] >= 400 {
  mutate { replace => [ "severity", "5", "severity_label", "notice" ] }
}

mutate {
  # replace raw message with a human friendly version to view/search on
  gsub => [ "request", "\?.*", "" ] # remove params now we've extracted them
  replace => [ "message", "%{be_host} %{client_ip} %{Tw}/%{Tc}/%{Tt}ms %{bytes_in}b %{bytes_out}b %{response} %{verb} %{request}" ]
}

(Abridged!)

49

State of play

• now have detailed TCP and HTTP traffic logs

But:

• still parsing textual messages

• still hard to handle multi-line messages

• still don't have contextual data for logs

• still can't correlate http to application logs

50

• Parsing textual log messages to extract data that your own code put there is a bit dumb

• Log as JSON lines instead (jsonlines.org; a tiny example follows this slide)

• Opens the door to logging extra information

• Bonus: solves the multi-line message problem, at least for perl apps

Log as JSON from app

51
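
A tiny illustration of the JSON-lines idea (not the actual appender output):

use strict;
use warnings;
use JSON::PP qw(encode_json);

# One self-contained JSON object per line. The embedded newline stays
# escaped as \n inside the JSON string, so the multi-line message
# arrives as a single log event.
print encode_json({
    message  => "line one\nline two of the same event",
    severity => 'warn',
}), "\n";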

Log::Log4perl::Layout::JSON

log4perl.rootLogger = INFO, TLScreen, TLFile, TLErrorBuffer, TLSyslogJSON

log4perl.appender.TLSyslogJSON           = TigerLead::Log::Appender::Syslog
log4perl.appender.TLSyslogJSON.Threshold = INFO
log4perl.appender.TLSyslogJSON.layout    = Log::Log4perl::Layout::JSON
log4perl.appender.TLSyslogJSON.layout.prefix = @cee: # used as tag
log4perl.appender.TLSyslogJSON.layout.field.message  = %m
log4perl.appender.TLSyslogJSON.layout.field.src_file = %F{1}
log4perl.appender.TLSyslogJSON.layout.field.src_sub  = %M{1}
log4perl.appender.TLSyslogJSON.layout.field.src_line = %L

Example output (spaces and line breaks added for clarity):

2014-10-08 12:56:28.641086 local0.info 70-lead-basic-t[13374]: @cee:{
  "message":"...\n...\n...",
  "src_file":"Foo.pm", "src_sub":"frobnicate", "src_line":"18" }

Note that src_file, src_sub and src_line used to be appended to the message text.

52

Decoding JSON in logstash

grok {
  # @cee: is the syslog 'CEE Event Flag' per https://cee.mitre.org/
  match => { message => "^@cee: ?%{GREEDYDATA:cee_data}" }
  add_tag => [ "cee" ]
  tag_on_failure => []
}

if ("cee" in [tags]) {
  json {
    source => "cee_data"
    remove_field => [ "cee_data" ]
  }
}

53

State of play

• now have rich JSON formatted log messages

• multi-line messages are no longer a problem

But:

• still only very basic contextual data for logs

• still can't correlate http to application logs

54

• Significant items of 'ambient information'

• The current 'things being worked on'

• Would like that info added to any log msgs

• Including warnings and fatal exceptions (e.g. if hooked via $SIG{__WARN__})

"Context Data"

55

Context Data

for my $foo_id (@list_of_foo_ids) {

    # we want the current $foo_id value to be included
    # in any log messages in this scope

    do_something_useful($foo_id);
}

# we DON'T want $foo_id to be included in any future log messages

56

• Put the 'ambient information' in a hash

• Add the contents of the hash to the JSON

• Use local to limit the scope

Context Data

57

Context Data

for my $foo_id (@list_of_foo_ids) {

    local log_context->{foo_id} = $foo_id; # simple!

    do_something_useful($foo_id);
}

The imported log_context utility:

sub log_context { return \%Log::Log4perl::MDC::MDC_HASH }

The Log::Log4perl::Layout::JSON config line:

log4perl.appender.TLSyslogJSON.layout.include_mdc = 1

58
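
Why local works here: Perl restores the localized hash entry when the enclosing scope exits, so the context value can't leak into later log messages. A small demonstration using the same log_context helper (and Log4perl's get_logger):

{
    local log_context->{foo_id} = 42;
    get_logger()->info('inside');   # JSON output includes "foo_id":42
}
get_logger()->info('outside');      # foo_id has been removed again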

Context Data

Context added to root hash by default:

2014-10-08 12:56:28.641086 local0.info 70-lead-basic-t[13374]: @cee:{
  "message":"...\n...\n...",
  "src_file":"Foo.pm", "src_sub":"frobnicate", "src_line":"18",
  "foo_id":42 }

Optionally put context data items into a nested hash:

log4perl.appender.TLSyslogJSON.layout.name_for_mdc = extra_stuff

2014-10-08 12:56:28.641086 local0.info 70-lead-basic-t[13374]: @cee:{
  "message":"...\n...\n...",
  "src_file":"Foo.pm", "src_sub":"frobnicate", "src_line":"18",
  "extra_stuff":{ "foo_id":42 } }

59

State of play

• now have easy way to add contextual data

• array and hash refs work (keep it small)

But:

• what contextual data should we include?

• request URL? decoded parameters?

• expensive to include in every message

60

• We have a stream of haproxy logs

• We have a stream of application logs

• Want to be able to correlate them

"what HTTP request caused this warning?"

• Add unique-id to HTTP log & HTTP header

HAProxy Correlation

61

HAProxy Configuration

defaults mode http
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
    unique-id-header X-TLXID
    log-format %ci\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %U\ %B\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %ID\ %{+Q}r\ %ST\ %Tq/%Tr\ %{+Q}CC\ %{+Q}hr\ %{+Q}CS\ %{+Q}hs

• HAProxy now generates a unique-id for each HTTP request

• Adds it to the HTTP request as an X-TLXID header

• Includes the unique-id value in the syslog message

62

Capture X-TLXID

package TigerLead::Plack::Middleware::SetUpLogContext;
use strict;
use warnings;
use parent qw(Plack::Middleware);

use Plack::Request;
use TigerLead::Log qw(log_context);

sub call {
    my ($self, $env) = @_;

    my $req = Plack::Request->new($env);

    # reset the log context at the start of each new request
    %{ log_context() } = (tlxid => scalar $req->header('X-TLXID'));

    return $self->app->($env);
}

63
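
Wiring the middleware in is then one line in app.psgi; a sketch (the '+' prefix tells Plack::Builder to use the class name as-is rather than prepending Plack::Middleware::):

use Plack::Builder;

builder {
    enable '+TigerLead::Plack::Middleware::SetUpLogContext';
    $app;  # the PSGI application built above
};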

• Given any log message from a web app we can now find the HTTP request that was being processed at the time

• That includes the session cookie, so we can view the stream of requests for that session

• Demo...

Correlation

64

Questions?

65
