ai-config-team report
28/08/2014
Puppet run incident
• What we know:
– Puppet runs start to fail when a puppetdb query in base starts timing out
– Puppetdb postgres backend maxes out CPU, with this one class of query responsible for the majority of the load
– Load balancers become overloaded with queue
– Spiral of death: LB stops responding to lbd, DNS entry removed, ENC not reachable, comes back, puppetdb replace_facts storm, PDB slows to a crawl, repeat
ai-pdb raw /v3/facts --query '["and", ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_2"], ["=", "value", "adm"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_0"], ["=", "value", "bi"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_1"], ["=", "value", "inter"]]]]]]'
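The query above has a regular shape: one in/extract/select-facts subquery per hostgroup fact, all joined by a top-level "and". As a minimal sketch of that structure (the helper name hostgroup_query is hypothetical and is not part of ai-pdb or PuppetDB), the same JSON could be assembled like this:

```python
import json


def hostgroup_query(conditions):
    """Build a PuppetDB v3 facts query from (fact_name, value) pairs.

    Hypothetical helper for illustration only: it emits one
    "select-facts" subquery per condition and ANDs them together,
    mirroring the shape of the query shown in the report.
    """
    clauses = []
    for name, value in conditions:
        # Subquery selecting fact rows where this fact has this value.
        subquery = [
            "select-facts",
            ["and", ["=", "name", name], ["=", "value", value]],
        ]
        # Restrict certname to the nodes returned by that subquery.
        clauses.append(["in", "certname", ["extract", "certname", subquery]])
    return ["and"] + clauses


query = hostgroup_query(
    [("hostgroup_2", "adm"), ("hostgroup_0", "bi"), ("hostgroup_1", "inter")]
)
print(json.dumps(query))
```

Each extra hostgroup level in the filter adds another select-facts subquery, so a three-level match such as bi/inter/adm carries three of them in a single request.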
Actions and plans
• CRM-623: remove the "allow ssh from aiadm" rule, which included: $aiadm_nodes = query_nodes('hostgroup_0="bi" and hostgroup_1="inter" and hostgroup_2="adm"', ipaddress)
• Reduced number of fact-names (thanks Dan), cleaned up foreman (thanks Nacho)
• Longer term: reduce amplification effect from load balancers
• Read only puppetdb for API access
Things we don’t know
• What triggers the problem? Normally load on db is minimal
• Perhaps updating new facts? New fact name across lots of plant this week. Looking at previous events
• Will engage upstream, but we are behind on puppetdb versions due to dependencies
Other activity
• Postgres dbod slave for puppetdb: so far no stable replication
• Updates to puppetdb & postgres modules to support r/o puppetdb
• Raising foreman issues with upstream about hostgroup filtering in the new version
• New teigi::secret::sub_file type: testers required
In QA
• CRM-401 add an option to enable UDT for gridftp servers
• CRM-567 Smartd Puppet Module
• CRM-575 Add smartd to the base pluginsync whitelist
• CRM-576 Including the smartd module into the hardware module
• CRM-577 Deploy blockdevice driver monitoring in QA for EL5 and EL6
• CRM-611 Update of site.pp to support 10-deep hostgroup
• CRM-613 Drop alarmed fact from sapp_puppetmaster
• CRM-615 Removing megacli and adding storcli for vendors transtec and viglen
• CRM-620 New cern_hwcontract function to extract contractid from hwdb cache
• CRM-622 New 'ssds' facts
• CRM-623 Emergency backout of allow ssh from aiadm
QA-Prod
• CRM-591 Do not clobber ADFS-metadata.xml with puppet.
• CRM-595 Enable buildMap="1" for new (3.5) shibboleths when memcache is enabled.
• CRM-604 facter 1.7.4 -> 1.7.6 upgrade.
• CRM-605 Upgrade mcollective filemgr, package and service plugins
• CRM-606 Add fact to expose the tenant name
• CRM-607 Drop active installation nrpe mcollective plugin
• CRM-608 New Redhat/7.yaml hiera file.
• CRM-609 Add CentOS (7) support to osrepos.
• CRM-610 CentOS as valid OS name
• CRM-612 Update of hiera config to support 10-deep hostgroups
• CRM-616 ai-tools 8.2-1
• CRM-617 Update module to upstream version 1.7.9
• CRM-618 RHEL5 repo fixes for osrepos