Using Machine Learning to Optimize DevOps Practices
Building Learning into Monitoring and Feedback
Peter Varhol
About me
• International speaker and writer
• Degrees in Math, CS, Psychology
• Technology communicator
• Former university professor, tech journalist
• Cat owner and distance runner
• [email protected]
Agenda
• What is machine learning?
• How is machine learning applied to DevOps?
• Challenges in training these systems
• What constitutes an issue?
• Summary and conclusions
What is Machine Learning?
• Layered algorithms that change parameters based on feedback from known data
  • Can be linear or nonlinear
• Algorithms can be fixed in production or adaptive
  • Fixed – algorithms do not adjust once deployed
  • Adaptive – algorithms continually adjust to new data
• Usually part of a larger system
Adaptive Systems
• Airline pricing
  • Ticket prices change three times a day based on demand
  • It can cost less to go farther
  • It can cost less later
• Ecommerce systems
  • Recommendations try to discern what else you might want
  • Can I incentivize you to fill up the plane?
Why Use Adaptive?
• The “right” result will vary over time
• Trying to optimize a particular result
  • Revenue
• The problem domain is not static
Confidential, Dynatrace LLC
How Are Fixed Systems Used?
• Transportation
  • Self-driving cars
  • Aircraft/Drones
• Ecommerce
  • Recommendation engines
• Medical
  • Diagnosis systems
Why Use Fixed Machine Learning Systems?
• The problem domain is static
• The expectations remain constant
• The right answer is known under most conditions
• The original algorithms remain valid over a long period of time
DevOps Practices Generate Data
• During development
  • Agile metrics, JIRA issues, test case metrics
• During continuous integration
  • System test metrics
• During continuous deployment
  • Quality metrics for deployments
• After deployment and into production
  • Application availability and performance
  • Usage log files
Focus on Monitoring
• Ongoing data on availability and performance
  • RUM (real user monitoring)
  • Synthetic tests
  • Application monitoring
• Monitoring tackles the back end of DevOps
  • Identifies unhealthy trends
  • Diagnoses failures and poor performance
  • Recommends action
• Fixed or adaptive depends on your goals
Where Do Predictive Analytics Come In?
• Big data makes it possible to predict future events
  • Are we going to fail?
  • How will we perform with traffic surges?
• As well as to analyze past events
  • What went wrong and how do we fix it?
• We can rely on past data
  • Adaptive systems may not perform as well
• Clear goals needed
What Technologies Are Involved?
• Neural networks
• Genetic algorithms
• Rules engines
Neural Networks
• Set of layered algorithms whose variables can be adjusted via a learning process
• The learning process involves training with known inputs and outputs
• The algorithms adjust coefficients to converge on the correct answer (or not)
• You freeze the algorithms and coefficients, and deploy
  • Or you optimize on a particular set of characteristics
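The train-then-freeze cycle described on this slide can be sketched with the smallest possible case: a single neuron rather than a layered network. This is an illustrative sketch, not the approach of any particular monitoring product. The feature names (CPU utilization, normalized latency), the training data, and the function names are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, epochs=2000, lr=0.5):
    """Adjust coefficients by gradient descent on known input/output pairs."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # gradient of the log loss at this sample
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    # Returning plain values "freezes" the coefficients for deployment.
    return weights, bias

def predict(model, x):
    """Return True if the frozen model classifies x as unhealthy."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5

# Hypothetical training data: (cpu_utilization, normalized_latency) -> 1 if unhealthy
samples = [(0.2, 0.1), (0.3, 0.2), (0.4, 0.3), (0.8, 0.9), (0.9, 0.7), (0.95, 0.95)]
labels = [0, 0, 0, 1, 1, 1]

frozen_model = train(samples, labels)
```

Once trained, `predict(frozen_model, (0.9, 0.9))` classifies a new reading without changing the coefficients, which is what makes the deployed system "fixed".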
A Sample Neural Network
[Figure: diagram of a sample neural network, not reproduced here]
Genetic Algorithms
• Use the principle of natural selection
• Create a range of possible solutions
• Try out each of them
• Choose and combine two of the better alternatives
• Rinse and repeat as necessary
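The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the candidates are integer worker-pool sizes, and `throughput_score` is a made-up stand-in for a real measurement (it peaks at 24 only because the example says so). The function names and parameters are hypothetical.

```python
import random

def evolve(fitness, low, high, rng, generations=40, pop_size=10):
    """Search for the integer in [low, high] that maximizes fitness."""
    # Create a range of possible solutions.
    population = [rng.randint(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        # Try out each of them and rank by fitness.
        ranked = sorted(population, key=fitness, reverse=True)
        # Choose two of the better alternatives.
        parent_a, parent_b = ranked[0], ranked[1]
        # Combine them (midpoint) and mutate slightly to form children.
        children = []
        while len(children) < pop_size - 2:
            child = (parent_a + parent_b) // 2 + rng.randint(-3, 3)
            children.append(min(high, max(low, child)))
        # Keep the parents so the best candidate is never lost.
        population = [parent_a, parent_b] + children
    return max(population, key=fitness)

def throughput_score(pool_size):
    """Hypothetical fitness: pretend throughput is best at 24 workers."""
    return -(pool_size - 24) ** 2

best = evolve(throughput_score, 1, 64, random.Random(0))
```

Keeping both parents in the next generation is a design choice (elitism): the best score can only stay the same or improve from one generation to the next.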
Bringing in DevOps
• DevOps has data that can be used to train neural networks
  • Health of the application
  • Trends in application traffic and responsiveness
  • Application failure
Machine Learning Helps DevOps
• Decisions are complex
  • Why is the CPU maxed?
  • What is causing disk thrashing?
  • Why did the network slow?
  • Why did the application fail?
• Data is massive
  • Potentially thousands of data points a day
How Good Are Decisions?
• Expert versus machine
• Given the same data
  • In many domains they tie
• With additional data, the human can be better
• But machine learning will get better
• But it is only as good as its data
We Want to Do Two Things
• Identify trends that may indicate future problems
  • Increasing response times
  • More page errors
• Diagnose faults once they have happened
  • Why did the application fail?
  • How can we fix it as quickly as possible?
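As one simple illustration of trend identification, the sketch below fits a least-squares line to a series of response times and flags the series when the slope exceeds a limit. The function names, the sample values, and the limit are assumptions made for the example. A real system would likely use a richer model.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (needs 2+ points)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def is_degrading(values, max_slope):
    """Flag a series whose per-sample increase exceeds max_slope."""
    return trend_slope(values) > max_slope

# Hypothetical response times in milliseconds, one per monitoring interval
response_times_ms = [100, 110, 120, 130]
slope = trend_slope(response_times_ms)  # 10.0 ms per interval
```

Here `is_degrading(response_times_ms, max_slope=5)` returns True, while a flat series such as `[100, 100, 100]` has a slope of zero and is not flagged.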
Fixed Algorithms Work for Some Problems
• Immediate performance and failure identification
• Diagnosis of failures and performance issues
• These are readily identifiable from known data
Adaptive Systems Supplement These Tools
• Predictions of future events
  • Performance
  • Availability
• The target is moving
  • So we need current data to adjust the algorithms
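A minimal sketch of what "adjusting to current data" can mean: a baseline that keeps an exponentially weighted mean and variance, and updates both with every new reading. The class name, the parameters (`alpha`, `tolerance`, `warmup`), and the readings are hypothetical. This is one simple adaptive technique, not the only one.

```python
import math

class AdaptiveBaseline:
    """Flags readings far from a baseline that keeps moving with the data."""

    def __init__(self, alpha=0.2, tolerance=3.0, warmup=5):
        self.alpha = alpha          # weight given to the newest reading
        self.tolerance = tolerance  # allowed deviation, in standard deviations
        self.warmup = warmup        # readings to absorb before flagging
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def observe(self, value):
        """Return True if value is anomalous, then fold it into the baseline."""
        if self.count == 0:
            self.mean = float(value)
            self.count = 1
            return False
        anomalous = (
            self.count >= self.warmup
            and abs(value - self.mean) > self.tolerance * math.sqrt(self.var)
        )
        # Update the baseline even for anomalies, so a lasting shift in
        # behavior becomes the new normal.
        diff = value - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        self.count += 1
        return anomalous

baseline = AdaptiveBaseline()
readings = [100, 102, 98, 101, 99, 100, 250, 250]  # hypothetical response times
flags = [baseline.observe(r) for r in readings]
```

With these readings, the first jump to 250 is flagged, and the repeated 250 is not, because the baseline has already widened to absorb the new level. Whether that quick adaptation is desirable depends on the goal; a smaller `alpha` adapts more slowly.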
The Machine Helps the DevOps Expert
• The machine learning app provides:
  • Early warning on possible performance issues and failures
  • Immediate notification of failure or impending failure
  • Trend analysis of data to predict unhealthy outcomes
• The machine learning is an assistant
  • It can’t fix anything
  • It can’t necessarily identify the root cause
What is the Goal?
• We have many ways of monitoring
  • Many of them are represented at this conference
• Each measures something a little different
  • Latency, response time, availability, network, DNS . . .
• Too much data can be no better than no data at all
• Machine learning can correlate across measurements
  • Focus to eliminate false positives
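One very simple way to use agreement across measurements to cut false positives is to alert only when several independent signals are anomalous in the same time window. The sketch below assumes each signal has already been reduced to a per-window anomaly flag. The signal names, the flags, and the `min_signals` parameter are hypothetical, and a learned correlation model would be more sophisticated than this voting rule.

```python
def correlated_alert(anomalies_by_signal, min_signals=2):
    """Return one alert flag per time window.

    anomalies_by_signal maps a signal name to a list of booleans, one per
    time window, all lists having the same length.
    """
    windows = zip(*anomalies_by_signal.values())
    return [sum(flags) >= min_signals for flags in windows]

# Hypothetical per-window anomaly flags from three separate monitors
anomalies = {
    "latency": [False, True, True],
    "page_errors": [False, False, True],
    "dns": [True, False, True],
}
alerts = correlated_alert(anomalies)
```

In this example only the third window raises an alert, since the isolated DNS and latency blips in the first two windows are not confirmed by any other signal.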
Intelligent Systems Are Sometimes Wrong
• The problem domain is ambiguous
• There is no single “right” answer
  • “Close enough” is good
• We don’t know quite why the software responds as it does
  • We can’t easily trace code paths
Testing Machine Learning Systems
• Have objective acceptance criteria
• Test with new data
• Don’t count on all results being accurate
• Understand the architecture of the network as a part of the testing process
• Communicate the level of confidence you have in the results to management and users
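A sketch of how the first two points might look in practice: score a model on data it has not seen, and compare the result with an acceptance threshold agreed on in advance. The `evaluate` function, the toy classifier, the holdout data, and the thresholds are all assumptions made for illustration.

```python
def evaluate(predict, holdout, min_accuracy):
    """Score predict on unseen (features, expected) pairs.

    Returns the accuracy and whether it meets the acceptance criterion.
    """
    correct = sum(1 for features, expected in holdout if predict(features) == expected)
    accuracy = correct / len(holdout)
    return accuracy, accuracy >= min_accuracy

def slow_request(latency_ms):
    """Hypothetical model under test: flags any request slower than 500 ms."""
    return latency_ms > 500

# Hypothetical holdout set, kept separate from any training data
holdout = [(120, False), (480, False), (510, True), (900, True), (530, False)]

accuracy, accepted = evaluate(slow_request, holdout, min_accuracy=0.75)
```

Here the model gets four of five cases right (accuracy 0.8), so it passes a 0.75 criterion and would fail a 0.9 one. The accuracy figure itself is the kind of confidence level worth reporting to management and users.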
A Cautionary Tale
• Not all events are created equal
• AI systems treat events equally
  • A failure of a system during busy season is the same as any other
• DevOps pros know otherwise
  • And can exert additional effort in response
  • And actually fix the problem
• We can’t automate what we don’t understand
• You need the human in the loop
Conclusions
• DevOps is a natural environment for machine learning systems
  • Any activity that generates data and requires a decision is fair game
• Monitoring is low-hanging fruit
• Fixed systems for failure and diagnosis, adaptive for trend analysis
References
• https://qz.com/989137/when-a-robot-ai-doctor-misdiagnoses-you-whos-to-blame/
• https://pvarhol.wordpress.com/2017/07/22/what-brought-about-our-ai-revolution/
• https://pvarhol.wordpress.com/2017/06/21/analytics-dont-apply-in-the-clutch/