ai-config-team report 28/08/2014




Puppet run incident

• What we know:
– Puppet runs start to fail when a puppetdb query in base starts timing out
– Puppetdb postgres backend maxes out cpu, with this one class of query responsible for the majority of load
– Load balancers become overloaded with queue
– Spiral of death: LB stops responding to lbd, DNS entry removed, ENC not reachable, comes back, puppetdb replace_facts storm, PDB slows to crawl, repeat


ai-pdb raw /v3/facts --query '["and", ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_2"], ["=", "value", "adm"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_0"], ["=", "value", "bi"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_1"], ["=", "value", "inter"]]]]]]'
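The nesting in the query above is easier to read when built programmatically. The following is a minimal sketch, assuming only the query shape shown on this slide: one "in certname / extract / select-facts" clause per fact, combined with "and". The helper names fact_subquery and hostgroup_query are hypothetical and are not part of ai-pdb or puppetdb.

```python
import json


def fact_subquery(name, value):
    """Build one clause: certname is in the set of nodes whose fact `name` equals `value`."""
    return ["in", "certname",
            ["extract", "certname",
             ["select-facts",
              ["and", ["=", "name", name], ["=", "value", value]]]]]


def hostgroup_query(facts):
    """AND together one subquery per (fact name, value) pair, keeping the given order."""
    return ["and"] + [fact_subquery(name, value) for name, value in facts]


# Same three fact/value pairs, in the same order, as the query on the slide.
query = hostgroup_query([
    ("hostgroup_2", "adm"),
    ("hostgroup_0", "bi"),
    ("hostgroup_1", "inter"),
])

# The JSON text is what would be passed as the --query argument.
print(json.dumps(query))
```

Written this way, it is visible that a single request carries three separate select-facts subqueries, one per hostgroup level.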


Actions and plans

• CRM-623: remove the allow ssh from aiadm rule which included “$aiadm_nodes = query_nodes('hostgroup_0="bi" and hostgroup_1="inter" and hostgroup_2="adm"', ipaddress)”

• Reduced number of fact-names (thanks Dan), cleaned up foreman (thanks Nacho)

• Longer term: reduce amplification effect from load balancers

• Read-only puppetdb for API access


Things we don’t know

• What triggers the problem? Normally load on db is minimal

• Perhaps updating new facts? New fact name across lots of plant this week. Looking at previous events

• Will engage upstream, but we are behind on puppetdb versions due to dependencies


Other activity

• Postgres dbod slave for puppetdb – so far no stable replication

• Updates to puppetdb & postgres modules to support r/o puppetdb

• Raising foreman issues with upstream about hostgroup filtering in the new version

• New teigi::secret::sub_file type: testers required


In QA

• CRM-401 add an option to enable UDT for gridftp servers
• CRM-567 Smartd Puppet Module
• CRM-575 Add smartd to the base pluginsync whitelist
• CRM-576 Including the smartd module into the hardware module
• CRM-577 Deploy blockdevice driver monitoring in QA for EL5 and EL6
• CRM-611 Update of site.pp to support 10-deep hostgroup
• CRM-613 Drop alarmed fact from sapp_puppetmaster
• CRM-615 Removing megacli and adding storcli for vendors transtec and viglen
• CRM-620 New cern_hwcontract function to extract contractid from hwdb cache
• CRM-622 New 'ssds' facts
• CRM-623 Emergency backout of allow ssh from aiadm


QA-Prod

• CRM-591 Do not clobber ADFS-metadata.xml with puppet.
• CRM-595 Enable buildMap="1" for new (3.5) shibboleths when memcache is enabled.
• CRM-604 facter 1.7.4 -> 1.7.6 upgrade.
• CRM-605 Upgrade mcollective filemgr, package and service plugins
• CRM-606 Add fact to expose the tenant name
• CRM-607 Drop active installation nrpe mcollective plugin
• CRM-608 New Redhat/7.yaml hiera file.
• CRM-609 Add CentOS (7) support to osrepos.
• CRM-610 CentOS as valid OS name
• CRM-612 Update of hiera config to support 10-deep hostgroups
• CRM-616 ai-tools 8.2-1
• CRM-617 Update module to upstream version 1.7.9
• CRM-618 RHEL5 repo fixes for osrepos