Vayacondios: Divine into Complex Systems
Huston Hoburg & Flip Kromer
Infochimps, a CSC Company
MongoDB Austin 2014 March 24th
Infochimps
• Big Data Platform for Large Companies
• Cloud::Queries (ElasticSearch, MongoDB, HBase)
• Cloud::Hadoop (Dynamic Hadoop)
• Cloud::Streams (Storm+Trident)
• Managed Service, Enterprise Features
• Recently sold to CSC, and it’s quite awesome
• We’re Hiring (natch)
Vayacondios
• Built for our Visibility Stack…
• … but we think it has wider use
• “Data Goes In, the Right Thing Happens”
• Prompt, Comprehensive and Faithful
Circulatory
Immune
Clotting
OK, Glass
“OK Glass, Show me a skeuomorphism”
Immune
Circulatory
Digestive
Respiratory
Non-Numeric Metrics
Target INR = 2-3
Low Platelets = H.I.T. (bad)
Heparin (Blood Thinner)
Low Platelets
• Folic Acid, Vitamin B12
• Medication (Valproic Acid, Singulair, Heparin)
• Sepsis
• HIV
• (about three dozen others)
Systems
• Anatomical Systems: Circulatory, Immune, etc
• Interventions: Drugs, Surgeries, …
• Course of Treatment: topline progress indicators
• Diagnosis
• Practitioner
• Medical Devices
ICU
• Model the patient, not the data source
• Highlight Interactions among systems
• Highlight Interactions among numbers
• Broaden your view of “systems”
Monitoring Sucks
Operations
System != Machine
• Whole-System MongoDB:
• Machines it runs on, Volumes it uses
• Systems writing to it
• Applications and Collections
• Data Files, Logs, Repl Sets, Oplog, Arbiters
• Codebase repo, Cookbooks, Configuration
• Issue Tracker Tickets, Change Events
Operations
• Cognitive model for Humans, not from Robots
• Go beyond the Time-series Graph
• Highlight Interactions
• Link to Systems that write to this DB
• Link to Github for Repos & Cookbooks
• Drill into System
• Issues in Issue Tracker
• Broaden your view of “systems”
Our Challenge
• 15 clients, 15 architectures
• < 1 operator per client, 2 continents
• 1500 machines in 150 clusters
• 30+ technologies (HBase, MongoDB, Storm, …)
• 4 Providers (AWS, Metal, VCE, OpenStack)
• 3 Virtualizations (AWS, VMWare, OpenStack)
• Max 21 minutes downtime / month (99.95% SLA)
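The 21-minute figure can be checked from the SLA itself. A quick sketch, assuming a 30-day month (the slide does not say which month length it uses); `downtime_budget_minutes` is a name made up for this example:

```ruby
# Downtime budget implied by an availability SLA.
# Assumption: a 30-day month unless told otherwise.
def downtime_budget_minutes(sla_percent, days_in_month: 30)
  minutes_in_month = days_in_month * 24 * 60
  # Go through Rational so 99.95 is treated as exactly 1999/20.
  unavailable_fraction = (100 - Rational(sla_percent.to_s)) / 100
  (minutes_in_month * unavailable_fraction).to_f
end

puts downtime_budget_minutes(99.95) # 21.6 minutes in a 30-day month
```

Rounded down, that is the "max 21 minutes" on the slide.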
Systems to Instrument
• WholeSystems: ZookeeperSystem, ElasticsearchSystem, HbaseSystem, HadoopMapredSystem, HadoopHdfsSystem, KafkaSystem, MysqlSystem, MysqlClientSystem, ListenerSetSystem, StormTridentSystem, MongodbSystem, NfsSystem, VayacondiosSystem, TachyonSystem, SplunkSystem, S3System, RdsSystem, PigSystem, HiveSystem, HueSystem
• Machines: ZookeeperMachine, ElasticsearchDatanodeMachine, HBaseRegionserverMachine, HBaseMasterMachine, HadoopDnttMachine, HadoopTtonlyMachine, HadoopNamenodeMachine, HadoopJobtrackerMachine, HadoopSecondaryNamenodeMachine, HadoopFailoverMonitorMachine, MysqlServerMachine, KafkaBrokerMachine, PlatformListenerMachine, StormBolterMachine, StormMasterMachine, MongodbMachine, NfsServerMachine, VayacondiosServerMachine, PlatformApiMachine, TachyonServerMachine, HueMachine
• Daemons: n, ElasticsearchDaemon, HbaseRegionserverDaemon, HbaseMasterDaemon, HadoopDatanodeDaemon, HadoopTasktrackerDaemon, HadoopNamenodeDaemon, HadoopJobtrackerDaemon, HadoopSecondaryNamenodeDaemon, HadoopFailoverDaemon, KafkaBrokerDaemon, MysqlDaemon, PlatformListenerDaemon, StormNimbusDaemon, StormUiDaemon, StormSupervisorDaemon, MongodbDatanodeDaemon, NfsServerDaemon, NtpDaemon, NfsClientDaemon, VayacondiosServerDaemon, TachyonServerDaemon, PlatformApiServerDaemon, HueBeeswaxDaemon
• Providers: AwsProvider, CloudTrailProvider, OpenstackProvider, VceProvider, ChefServerProvider, Route53Provider, ElbProvider
• Manifests: most of the above have a planned version and the realized version
• Events: MachineLifecycle, CronJobLifecycle, ChefClientLifecycle
• Build Artifacts: FitDeployArtifact, DebArtifact, RpmArtifact, GemArtifact, AmiArtifact, OpenstackImageArtifact, VceTemplateArtifact, NpmArtifact, TarballArtifact
• PlatformApps: HadoopJobLifecycle (Hive, Pig, Wukong), TridentJobLifecycle, MountweaselLifecycle
• OpsProcesses: IncidentLifecycle, ChangeRequestLifecycle, FiredrillLifecycle, GitCommitLifecycle, ProblemLifecycle (JIRA), LunchladyLifecycle
Vayacondios
• Visibility Stack for our operations team
• Open-sourcing this summer
• Internals in Ruby
• Access anywhere (HTTP or log file)
• MongoDB (but now please forget that fact)
Cognitive Model
• MongoDB:
• is_a Data store
• has_many Network Services
• has_many Daemons
• has_many Machines
• has_many Volumes
• has_many Collections
• …etc
Model DSL (domain-specific language)
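The code from these slides did not survive extraction. As a stand-in, here is a minimal sketch of how an `is_a` / `has_many` model DSL could be written in Ruby. `SystemModel`, `kinds` and `relations` are hypothetical names for illustration, not the Vayacondios API:

```ruby
# Hypothetical mini-DSL for describing a system the way people talk about it.
class SystemModel
  class << self
    # Records what generic kind of system this is (e.g. :data_store).
    def is_a(kind)
      kinds << kind
    end

    # Records a one-to-many relationship and defines a reader for it.
    def has_many(name)
      relations << name
      define_method(name) { @members[name] }
    end

    def kinds
      @kinds ||= []
    end

    def relations
      @relations ||= []
    end
  end

  def initialize
    # Each relationship starts out as an empty list.
    @members = Hash.new { |hash, key| hash[key] = [] }
  end
end

# The cognitive model from the previous slide, written in that DSL.
class MongodbSystem < SystemModel
  is_a     :data_store
  has_many :network_services
  has_many :daemons
  has_many :machines
  has_many :volumes
  has_many :collections
end

mongo = MongodbSystem.new
mongo.machines << "mongo-0" << "mongo-1" # machine names are made up
```

Because the declarations are plain data on the class, other tooling (dashboards, checks) can read `MongodbSystem.relations` without knowing anything MongoDB-specific.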
Faithful
• Whiteboard rule: how do folks talk about the system?
• If you need it, it’s in the system
Prompt
• As fast as the joint laws of Economics & Physics allow
Comprehensive
Biographizing Isn’t Pretty
Faithful to Source
• crap data => well-formed data
• uniform JSON-ready hash
• syntax cleaned up
• semantically unchanged
• encouraged to model it, but let Wookiee win
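One way to read "syntax cleaned up, semantically unchanged" is sketched below: keys are normalized and stray whitespace is trimmed, but values are never cast or reinterpreted. `biographize` and `clean_key` are hypothetical helpers, and the sample input is invented:

```ruby
require "json"

# Hypothetical helper: normalize a key to snake_case without changing its meaning.
def clean_key(key)
  text = key.to_s.strip
  text = text.gsub(/([a-z\d])([A-Z])/, '\1_\2') # camelCase -> camel_Case
  text = text.gsub(/[^A-Za-z0-9]+/, "_")        # spaces, dashes, parens -> underscore
  text = text.gsub(/\A_+|_+\z/, "")             # drop leading/trailing underscores
  text.downcase
end

# Hypothetical helper: turn messy source data into a uniform JSON-ready hash.
# Strings are trimmed but not cast to numbers; that would be interpretation.
def biographize(raw)
  case raw
  when Hash   then raw.each_with_object({}) { |(k, v), out| out[clean_key(k)] = biographize(v) }
  when Array  then raw.map { |item| biographize(item) }
  when String then raw.strip
  when Symbol then raw.to_s
  else raw
  end
end

raw = { "Mem Used (MB) " => " 2048 ", :connCount => 17, "repl-set" => { "Name" => "rs0" } }
puts JSON.generate(biographize(raw))
```

Note that `" 2048 "` comes out as the string `"2048"`, not the integer: deciding it is a number is the reporter's job, not the biographer's.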
Write Contract
• Vaya Con Dios, “Go with God”. As the kingdom of heaven is unknowable, so is further fate of data:
• How used
• By Whom
• How Processed
• Where Stored
Reporters/Reports
• Assemble Biographies into Reports
• Faithful to application
• Don’t know when they will be run, why, etc
Presentation
Dashboarding
[Figure: dashboard mockup laid out as a grid of text metrics]
Model-Driven Templates
Repeatable Partials
Model/Presenter/View
• Report == Model
• Reporter == Presenter
• Dashboard .xml == View
Model/Presenter/View
• More targets than just the dashboard!
• Splunk+PagerDuty Alerts
• Cucumber tests
• Auditing reports (Security, Good Manners)
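A rough sketch of the Report / Reporter split with more than one target hanging off the same report. All names (`Report`, `VolumeReporter`, the field names, the 75% threshold) are assumptions made for illustration, and the dashboard "view" is reduced to plain text lines:

```ruby
# Model: a view-ready summary. Hypothetical structure.
Report = Struct.new(:system, :metrics)

# Presenter: assembles raw biographies into a Report.
class VolumeReporter
  def call(biographies)
    used  = biographies.sum { |b| b.fetch("used_gb") }
    total = biographies.sum { |b| b.fetch("size_gb") }
    Report.new("volumes", { "used_pct" => (100.0 * used / total).round(1) })
  end
end

# Two targets consuming the same Report: a dashboard line and an alert.
dashboard = ->(report) { report.metrics.map { |name, value| "#{report.system}.#{name}: #{value}" } }
alert     = ->(report) { report.metrics["used_pct"] >= 75 ? "ALERT #{report.system} over capacity" : nil }

report = VolumeReporter.new.call([
  { "used_gb" => 60,  "size_gb" => 100 },
  { "used_gb" => 100, "size_gb" => 100 }
])
puts dashboard.call(report)
puts alert.call(report)
```

Neither target knows where the numbers came from; adding a Cucumber or audit target means adding another consumer of `Report`, not another parser.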
System Checks
• Correctness, Consistency
• Attached Directly to the Model
• No worthwhile distinction between QA (integration tests) and live Alerts
• Drive Splunk+PagerDuty for Alerts
• Author Cucumber specs(!) for QA tests
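To illustrate "no worthwhile distinction between QA and live alerts", here is a sketch in which one check definition feeds both paths. `Check`, `alerts_for` and `assert_checks!` are hypothetical names; the real system drives Splunk+PagerDuty and Cucumber, which are not modeled here:

```ruby
# Hypothetical Check: a named predicate over a report hash.
Check = Struct.new(:name, :predicate) do
  def passes?(report)
    predicate.call(report)
  end
end

CAPACITY_CHECK = Check.new("capacity < 75%", ->(r) { r.fetch("used_pct") < 75 })

# Live alerting: return a message for every check failing right now.
def alerts_for(checks, report)
  checks.reject { |check| check.passes?(report) }
        .map { |check| "ALERT: #{check.name} failed" }
end

# QA: the very same checks, surfaced as a test failure instead of a page.
def assert_checks!(checks, report)
  failed = checks.reject { |check| check.passes?(report) }
  raise "failed checks: #{failed.map(&:name).join(', ')}" unless failed.empty?
  true
end

puts alerts_for([CAPACITY_CHECK], { "used_pct" => 91 })
```

The only difference between the two paths is what happens on failure, which is why the check can live on the model rather than in the alerting or test tooling.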
Safe Systems
System Drift
• Cognitive Model
• Discoverable Interface
• Testable Contract
Inevitability
• If configured and reported, consistency checks
• If reported, dashboard exists
• If is_a generic system (eg filesystem), gets correctness tests (eg “capacity < 75%”)
• If system A discovers system B:
• dashboard has link from A to B
• connectivity & security checks from A to B
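The rules above can be sketched as a pure function from a model description to the artifacts it implies. The hash layout (`:name`, `:is_a`, `:discovers`) and the `GENERIC_CHECKS` registry are assumptions for this example only:

```ruby
# Hypothetical registry of correctness checks that come with a generic kind.
GENERIC_CHECKS = { filesystem: ["capacity < 75%"] }.freeze

# Derive dashboard links and checks purely from what the model declares.
def derived_artifacts(system)
  name   = system.fetch(:name)
  others = system.fetch(:discovers, [])

  # is_a a generic system => inherits that kind's correctness checks.
  checks = system.fetch(:is_a, []).flat_map { |kind| GENERIC_CHECKS.fetch(kind, []) }

  # A discovers B => connectivity and security checks from A to B.
  checks += others.flat_map do |other|
    ["connectivity #{name} -> #{other}", "security #{name} -> #{other}"]
  end

  # A discovers B => dashboard link from A to B.
  { dashboard_links: others.map { |other| "#{name} -> #{other}" }, checks: checks }
end

p derived_artifacts({ name: "nfs", is_a: [:filesystem] })
p derived_artifacts({ name: "storm_trident", discovers: ["mongodb"] })
```

Nothing here is written per system: declaring the relationship is what makes the link and the checks exist.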
Interaction
• Monitoring systems do a terrible job here
• Hard sources of failure:
• Drift: conceived != realized
• Interaction: unexpected consequences
• Change: oops
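Drift in particular is easy to make concrete: compare the planned manifest with the realized one and report every key that differs. A minimal sketch, with a made-up manifest layout and values:

```ruby
# Report every key whose planned and realized values differ.
# Caveat: a key that is absent and a key explicitly set to nil look the same here.
def drift(planned, realized)
  (planned.keys | realized.keys).each_with_object({}) do |key, out|
    next if planned[key] == realized[key]
    out[key] = { planned: planned[key], realized: realized[key] }
  end
end

planned  = { "replicas" => 3, "port" => 27017 }
realized = { "replicas" => 2, "port" => 27017, "arbiter" => true }
p drift(planned, realized)
```

An empty result means the system as realized matches the system as conceived.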
Application Design
Application Design
• Visibility into complex systems:
• Biography of raw parts (raw Model) => Reporter (Presenter) => Summary of Systems (View-ready Model)
• Database-driven Application:
• Model => Presenter => View
Simple Blog
Blog: Views
Author Page
Post Page
Index Page
Blog: Views
Author Page
Post Page
Index Page
PostSynopsisReport
PostReport
UserReport
CommentReport
“Query on the way In”!
• New/Updated Post: Update Post triggers…
• Update PostReport
• Update SynopsisReport
• Update UserReport
“Query on the way In”!
• User fullname changes: Update User triggers…
• Update UserReport
• Update their SynopsisReports
• Update their PostReports
• Update their CommentReports
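A small in-memory sketch of "query on the way in" for the blog example: every write refreshes the reports that depend on it, so a page view is a single lookup. `BlogStore` and its field names are hypothetical, and comments are left out to keep it short:

```ruby
# Hypothetical store: writes fan out to precomputed reports.
class BlogStore
  attr_reader :reports

  def initialize
    @users = {}
    @posts = {}
    @reports = { user: {}, post: {}, synopsis: {} }
  end

  # User change => refresh UserReport and every report that embeds the name.
  def save_user(id, fullname)
    @users[id] = { fullname: fullname }
    refresh_user(id)
    posts_by(id).each_key { |post_id| refresh_post(post_id) }
  end

  # New/updated post => refresh PostReport, SynopsisReport and UserReport.
  def save_post(id, author_id, title, body)
    @posts[id] = { author_id: author_id, title: title, body: body }
    refresh_post(id)
    refresh_user(author_id)
  end

  private

  def posts_by(user_id)
    @posts.select { |_, post| post[:author_id] == user_id }
  end

  def refresh_user(id)
    @reports[:user][id] = { fullname: @users.fetch(id)[:fullname], post_count: posts_by(id).size }
  end

  def refresh_post(id)
    post   = @posts.fetch(id)
    author = @users.fetch(post[:author_id])[:fullname]
    @reports[:post][id]     = { title: post[:title], body: post[:body], author: author }
    @reports[:synopsis][id] = { title: post[:title], author: author }
  end
end

store = BlogStore.new
store.save_user(1, "Ada L.")
store.save_post(10, 1, "Hello", "First post")
store.save_user(1, "Ada Lovelace") # rename fans out to existing reports
p store.reports[:synopsis][10]
```

The cost moves to write time, which is the trade the slides argue for: reads become uniform lookups with no joins.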
Vayacondios Contract
Faithful
• Whiteboard rule: how do folks talk about the system?
• If you need it, it’s in the system
Prompt
• As fast as the joint laws of Economics & Physics allow
Comprehensive
Faithful
• Single concern: subject of the biography
• look at what’s offered, look at what reports need
Prompt
• Run as often as needed (not your concern)
Comprehensive
Faithful
• One Reporter per Application (*) & Topic
• USCE Method: Utiliz’n, Saturat’n, Connections, Errors
Prompt
• Run as often as needed (not your concern)
Comprehensive
Benefits
• Separation of concerns:
• Source complexity (API, parsing, translation)
• Timing
• Transport
• Individual Applications
• Reliability
Benefits
• Separation of concerns: Source, Timing, Transport, Individual Applications, Reliability
• No external libraries in application
• Uniform access times
• Reduce risk from multiple dependencies
So What?
• There’s not much to it: shims and conventions
• VCD is not MongoDB
• just like MongoDB is not mmap tables
• Power through constraint:
We’re [email protected]
github.com/infochimps-labs