Inside Azure Diagnostics (DevLink 2014)

Discussion of diagnostic/troubleshooting options with the Azure diagnostic agent in Cloud Services.

Inside Azure Diagnostics

Michael S. Collier
Principal Cloud Architect

michaelc@aditi.com
@MichaelCollier
www.MichaelSCollier.com

COLUMBUS, OH OCTOBER 17, 2014 CLOUDDEVELOP.ORG

Today’s Agenda
1. The need for diagnostic data in cloud applications
2. Data we can monitor
3. Using the Azure Diagnostic Agent
4. Real-world guidance for troubleshooting Azure apps

Successful projects share at least one common trait . . .

Success vs. Failure

node.js, C#, Java

Agile vs. Waterfall

Successful projects share at least one common trait . . .

Success vs. Failure

Diagnostics Data / Telemetry

A True Story

Scenario: 1 week before date of production launch. “Am I ready?”

“Let’s run some tests and look at your logs.”
“Logs? Yeah . . . we really don’t have logs.”
“OH . . .”
“Well, we eventually log any fatal errors, but that’s all.”
“I guess that’s better than nothing.”
“We looked at Azure diagnostic logging but didn’t see much value in it.”
“You’re kidding? Right?”

A True Story
Scenario
o Determine if solution is production ready
o Deployed as an Azure Cloud Service
o No load tests
o No performance tests
o No unit tests
o Very little instrumentation

We have a problem
(Image: http://www.cutedaily.com/wp-content/uploads/2011/11/shockedbaby.jpg)

A True Story
Resolution
1. Enable Azure diagnostics
   – Set key performance counters
2. Add logging statements around key functionality
   – Especially external services
3. Test, test, test
4. Analyze
5. Fix it

Instrumentation more important in “the cloud”
o Need to have good instrumentation for on-premises applications
o Cloud – it matters more!
o Distributed environments and services
o Composite applications
o Reliance on 3rd party vendors . . . such as Microsoft for Azure
o Highly automated environments
o Scale out model
o Massive amounts of data

The Cloud Scales

worker roles

web roles

The Cloud Scales . . . You Do Not

worker roles

web roles

Diagnostic Data – 4x

Diagnostic Data
What data do you gather today?

Performance Counters
Custom Logs (NLog, log4net, etc.)
IIS Logs
Windows Event Logs
Crash Dumps

Diagnostic Data – Azure Not so Different

Performance Counters
Custom Logs (NLog, log4net, etc.)
IIS Logs
Windows Event Logs
Crash Dumps

Azure Storage

Diagnostic Data Storage

Diagnostic Item                          Table Name                             Blob Container Name
Windows Event Logs                       WADWindowsEventLogsTable
Performance Counters                     WADPerformanceCountersTable
Trace Log Statements                     WADLogsTable
Azure Diagnostic Infrastructure Logs     WADDiagnosticInfrastructureLogsTable
Custom Logs (i.e. log4net, NLog, etc.)                                          <custom>
IIS Logs                                 WADDirectoriesTable*                   wad-iis-logfiles
IIS Failed Request Logs                  WADDirectoriesTable*                   wad-iis-failedreqlogfiles
Crash Dumps                              WADDirectoriesTable*

* Location of the blob log file is specified in the Container field and name of the blob in the RelativePath field. The AbsolutePath field contains the name of the file as it existed on the role instance.
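
To make the footnote concrete, here is a minimal Python sketch of how a WADDirectoriesTable row could be turned into the address of the blob it points at. The field names Container and RelativePath come from the slide; the account name, the sample row, and the helper name blob_url are hypothetical, and the URL shape assumes the standard public-cloud blob endpoint.

```python
from typing import NamedTuple
from urllib.parse import quote


class DirectoryEntry(NamedTuple):
    """The three WADDirectoriesTable fields mentioned on the slide."""
    container: str       # Container field: blob container holding the log file
    relative_path: str   # RelativePath field: name of the blob in that container
    absolute_path: str   # AbsolutePath field: file name as it was on the role instance


def blob_url(account_name: str, entry: DirectoryEntry) -> str:
    """Build the blob address for a directory entry.

    Assumes the usual https://<account>.blob.core.windows.net/<container>/<blob>
    layout; sovereign clouds and custom domains use a different host.
    """
    blob_name = quote(entry.relative_path.lstrip("/"))  # keeps "/" separators intact
    return f"https://{account_name}.blob.core.windows.net/{entry.container}/{blob_name}"


# Hypothetical row, for illustration only.
sample = DirectoryEntry(
    container="wad-iis-logfiles",
    relative_path="dep/Web/Web_IN_0/u_ex14101712.log",
    absolute_path=r"C:\Resources\directory\u_ex14101712.log",
)
print(blob_url("mydiag", sample))
```

Reading the blob itself still needs credentials for the storage account; the sketch only shows how the two table fields combine.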

Diagnostic Monitor Agent
1. Role starts
2. Diagnostic monitor agent starts
3. Diagnostics configured
4. Data buffered locally
5. Data transferred to storage

wad-control-container
o Container in Azure blob storage

Configuration Options

Default Configuration
o Trace logs
o IIS logs
o Infrastructure logs
o No transfer

Imperative Configuration
o OnStart()
o Overrides default

Declarative Configuration
o diagnostics.wadcfg
o Root of worker or \bin of web

Imperative

public override bool OnStart()
{
    // Create the DiagnosticMonitorConfiguration object to use for configuring the monitoring agent.
    DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

    // Performance Counter configuration
    config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
    {
        CounterSpecifier = @"\Processor(_Total)\% Processor Time",
        SampleRate = TimeSpan.FromSeconds(30)
    });
    config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

    // Log configuration
    config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Information;
    config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

    // Event Log configuration
    config.WindowsEventLog.DataSources.Add("Application!*");
    config.WindowsEventLog.DataSources.Add("System!*");
    config.WindowsEventLog.ScheduledTransferLogLevelFilter = LogLevel.Warning;
    config.WindowsEventLog.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

    // Start the diagnostic monitor with the new configuration
    DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

    return base.OnStart();
}

Impacts local agent only!

Imperative

Deployment ID

Declarative Configuration using Visual Studio

demo

There’s a Precedence

1. wad-control-container
   a. Created for each role instance
2. Imperative code
   a. RoleInstanceDiagnosticManager.SetCurrentConfiguration() – update instance’s diagnostics.wadcfg only
   b. DiagnosticMonitor.Start() – impacts current instance only; will not update diagnostics.wadcfg
3. Declarative configuration
   a. Root of worker role or bin of web role
   b. Updates to diagnostics.wadcfg take effect only if the wad-control-container blob has never been updated programmatically.
4. Default configuration
   a. Last resort
   b. Collects, but doesn’t transfer to Azure storage
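
One way to read the precedence list is as a simple fall-through. The Python sketch below is only a simplified model of that reading, not the agent's actual logic; the enum, the function name effective_source, and the three boolean inputs are all hypothetical.

```python
from enum import Enum


class ConfigSource(Enum):
    CONTROL_CONTAINER = "wad-control-container blob"
    IMPERATIVE = "imperative code in OnStart()"
    DECLARATIVE = "diagnostics.wadcfg"
    DEFAULT = "default configuration (collects, no transfer)"


def effective_source(control_blob_updated_programmatically: bool,
                     imperative_code_present: bool,
                     wadcfg_present: bool) -> ConfigSource:
    """Return the configuration source that wins, highest precedence first."""
    if control_blob_updated_programmatically:
        # Once the control blob has been changed in code, it keeps winning.
        return ConfigSource.CONTROL_CONTAINER
    if imperative_code_present:
        return ConfigSource.IMPERATIVE
    if wadcfg_present:
        return ConfigSource.DECLARATIVE
    return ConfigSource.DEFAULT  # last resort


# A redeployed diagnostics.wadcfg is ignored if the blob was ever updated in code.
print(effective_source(True, False, True).value)
```

The last line is the surprise the slide warns about: editing diagnostics.wadcfg and redeploying appears to do nothing once the control blob has been touched programmatically.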

Problem?

Update Diagnostic Configuration
o Deployment Update
  o Change configuration and redeploy package
o Remotely
  o Visual Studio
  o API
  o Cerebrata Azure Management Studio

On-Demand Transfer
o Instruct WAD to transfer specific data sources to storage
o Specify which data sources
o Specify time range to transfer
o Specify a notification queue
o Code or API (or tool)

Overwrites current diagnostic configuration
Use sparingly . . . with caution

For more info, see http://mcollier.net/DiagOnDemand

Bonus: Verbose Logging
o Additional host-level data – not DiagnosticAgent.exe
o WAD*deploymentID*PT*aggregation_interval*[R|RI]Table
o Aggregation at 5 minute, 1 hour, and 12 hour intervals
o 10 day retention period
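
A small Python sketch of how the table-name pattern above might expand. The pattern itself is from the slide; the interval tokens (5M, 1H, 12H), the reading of R and RI as role-level and role-instance-level tables, and the deployment ID are assumptions for illustration, so check the names that actually appear in your storage account.

```python
# Assumed tokens for the three aggregation intervals named on the slide.
INTERVAL_TOKENS = {"5 minutes": "5M", "1 hour": "1H", "12 hours": "12H"}

# Assumed meaning of the [R|RI] suffix: role vs. role instance.
SCOPES = {"role": "R", "role_instance": "RI"}


def verbose_table_name(deployment_id: str, interval: str, scope: str) -> str:
    """Expand WAD*deploymentID*PT*aggregation_interval*[R|RI]Table."""
    if interval not in INTERVAL_TOKENS:
        raise ValueError(f"unknown interval: {interval!r}")
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    return f"WAD{deployment_id}PT{INTERVAL_TOKENS[interval]}{SCOPES[scope]}Table"


# Hypothetical deployment ID, for illustration only.
for interval in INTERVAL_TOKENS:
    print(verbose_table_name("0123abcd", interval, "role_instance"))
```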

Let’s Get Real
o Sample every 1-2 minutes*
o Transfer every 5 minutes*
o Transfer only what is needed
o Azure Diagnostics writes data in 60-second-wide partitions
o Too much data could overwhelm the partition

* Don’t take my word for it. You don’t know me. Test and validate for your situation.
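
The 60-second partitions matter when querying, too. The sketch below assumes the commonly reported PartitionKey layout for the WAD tables: a "0" followed by the .NET tick count of the UTC timestamp rounded down to the minute. That layout is not stated on the slides, and the function names are hypothetical, so compare the output against real rows before relying on it.

```python
from datetime import datetime

TICKS_PER_SECOND = 10_000_000          # .NET ticks are 100-nanosecond units
TICKS_PER_MINUTE = 60 * TICKS_PER_SECOND
DOTNET_EPOCH = datetime(1, 1, 1)       # tick zero


def to_ticks(moment: datetime) -> int:
    """Convert a naive UTC datetime to .NET ticks using integer arithmetic."""
    delta = moment - DOTNET_EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * TICKS_PER_SECOND + delta.microseconds * 10


def partition_key(moment: datetime) -> str:
    """Assumed WAD PartitionKey: '0' + ticks at the start of the minute."""
    ticks = to_ticks(moment)
    minute_start = ticks - (ticks % TICKS_PER_MINUTE)
    return str(minute_start).zfill(19)


def time_range_filter(start: datetime, end: datetime) -> str:
    """Table query filter that stays on PartitionKey, avoiding a full table scan."""
    return (f"PartitionKey ge '{partition_key(start)}' "
            f"and PartitionKey lt '{partition_key(end)}'")


print(time_range_filter(datetime(2014, 10, 17, 12, 0), datetime(2014, 10, 17, 12, 5)))
```

Filtering on PartitionKey rather than on the Timestamp property lets the storage service seek straight to the matching partitions.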

Query Azure Diagnostic Data

demo

Set Priorities
o Two separate channels for telemetry data
  o Vital information
    o Application or service failures. Higher level of alerting.
    o Fix and return to “normal” as soon as possible
    o Alert now – email, SMS, dashboard, ninjas from ceiling, etc.
  o Day-to-day operational data
    o Root cause analysis
    o How to prevent in the future
    o Azure diagnostics
o Fine tune the alerts – reduce false alarms and noise
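
The two-channel idea can be sketched in a few lines of Python. Everything here is illustrative: the severity names, the TelemetryEvent shape, and the channel labels are hypothetical, and a real system would hand the alert channel to email, SMS, or a dashboard.

```python
from dataclasses import dataclass

# Tunable: which severities interrupt a human right now.
ALERT_SEVERITIES = frozenset({"critical", "error"})


@dataclass(frozen=True)
class TelemetryEvent:
    source: str
    severity: str   # e.g. "critical", "error", "warning", "info"
    message: str


def route(event: TelemetryEvent, alert_severities=ALERT_SEVERITIES) -> str:
    """Pick a channel: 'alert' for vital information, 'operational' otherwise."""
    if event.severity.lower() in alert_severities:
        return "alert"        # fix and return to normal as soon as possible
    return "operational"      # kept for root cause analysis later


events = [
    TelemetryEvent("orders-api", "critical", "payment gateway unreachable"),
    TelemetryEvent("orders-api", "info", "request served in 120 ms"),
]
for event in events:
    print(route(event), "-", event.message)
```

Fine tuning the alerts then amounts to adjusting the alert_severities set (or a richer rule) until false alarms and noise drop.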

Define Key Metrics
o Compute node resource usage
o Windows Event logs
o Database query response times
o Application specific exceptions
o Database connection & cmd failures
o Microsoft Azure Storage Analytics

Process for Azure hosted solutions is not that different from traditional, on-premises solutions.

o Log all calls to external services. Challenge an SLA?

o Log details of transient faults

o Partition telemetry data by date (or hour) – reduce impact of data aggregation or reporting

o Use a different storage account!

o Remove old / non-relevant telemetry data

o Apply to development, test, and QA versions – validate performance & ensure telemetry systems operating correctly

Considerations
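
As a sketch of the first two considerations, the Python decorator below logs every call to an external service with its duration, and logs each transient fault before retrying. The decorator name, the logger name, the retry count, and the choice of TimeoutError and ConnectionError as "transient" are assumptions for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("external_calls")


def logged_external_call(service_name, retries=2,
                         transient=(TimeoutError, ConnectionError)):
    """Log duration and outcome of each attempt; retry transient faults."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            total_attempts = retries + 1
            for attempt in range(1, total_attempts + 1):
                started = time.perf_counter()
                try:
                    result = func(*args, **kwargs)
                except transient as exc:
                    elapsed_ms = (time.perf_counter() - started) * 1000
                    # Record the details of the transient fault, not just "failed".
                    logger.warning("%s attempt %d/%d failed after %.1f ms: %r",
                                   service_name, attempt, total_attempts,
                                   elapsed_ms, exc)
                    if attempt == total_attempts:
                        raise
                else:
                    elapsed_ms = (time.perf_counter() - started) * 1000
                    logger.info("%s attempt %d succeeded in %.1f ms",
                                service_name, attempt, elapsed_ms)
                    return result
        return wrapper
    return decorator


@logged_external_call("inventory-service")
def fetch_stock_level(sku):
    # Stand-in for a real HTTP or database call.
    return {"sku": sku, "on_hand": 7}


logging.basicConfig(level=logging.INFO)
print(fetch_stock_level("ABC-123"))
```

Timings recorded this way are the kind of evidence needed to challenge an SLA. A production version would usually add a back-off delay between attempts.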

o Use declarative configuration (diagnostics.wadcfg) exclusively.

o Bring Azure diagnostic data into relational database
  o Easier reporting
  o Periodically fetch from Azure table and insert into SQL Database table. Use PK and keep most recent.
  o Custom code

o Supplement Azure diagnostics with other tools
  o New Relic or AppDynamics
  o Cerebrata Azure Management Studio
  o AzureWatch (Paraleap)

Considerations (cont.)
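
A rough Python sketch of the "use PK and keep most recent" idea. SQLite stands in for SQL Database so the example runs anywhere, the rows are hard-coded rather than fetched from the Azure table, and the column names are only loosely modeled on WADPerformanceCountersTable; all of that is assumption.

```python
import sqlite3


def create_store(conn):
    """Primary key on (partition_key, row_key) makes repeated fetches idempotent."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS perf_counters (
            partition_key TEXT NOT NULL,
            row_key       TEXT NOT NULL,
            role_instance TEXT,
            counter_name  TEXT,
            counter_value REAL,
            PRIMARY KEY (partition_key, row_key)
        )""")


def upsert_rows(conn, rows):
    """Insert new rows; a row with an existing key replaces the older copy."""
    conn.executemany(
        "INSERT OR REPLACE INTO perf_counters VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
create_store(conn)

# Two overlapping fetch windows: the second batch repeats the first row.
cpu = r"\Processor(_Total)\% Processor Time"
upsert_rows(conn, [("0635491440000000000", "r1", "Web_IN_0", cpu, 41.0)])
upsert_rows(conn, [("0635491440000000000", "r1", "Web_IN_0", cpu, 42.5),
                   ("0635491440600000000", "r2", "Web_IN_0", cpu, 38.0)])

print(conn.execute("SELECT COUNT(*) FROM perf_counters").fetchone()[0])
```

Because overlapping windows are harmless, the periodic job can re-read the last few minutes on every run instead of tracking an exact high-water mark.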

o Instrumentation and telemetry are key to successful projects

o Cloud metrics similar to metrics for traditional applications

o Be realistic and set priorities

o 3rd party tools can be essential for troubleshooting

Summary

o Diagnostics Configuration Order of Precedence – http://bit.ly/1eomek9

o Use the Azure Diagnostic Configuration File – http://bit.ly/1mVHN3u

o Cloud Service Fundamentals (wiki) – http://bit.ly/1k1YkjI

o Failsafe: Guidance for Resilient Cloud Architectures – http://bit.ly/Q33mkU

o Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services – http://bit.ly/1qp4omC

Resources

o Multi-part series on Azure diagnostics

o Many other fantastic articles:
  o Getting Started with Azure Search
  o Azure storage queues
  o Cloud Services
  o Automated testing in Azure

Just Azure

www.JustAzure.com

Questions?

Thank You!
