Director of Cloud Operations
Ben Rockwood
SmartOS Operations
Tuesday, October 2, 12
Agenda
• Principles of Operation
• Provisioning
• Monitoring
• Configuration Management
• Orchestration
• Authentication
• Access Control
• Auditing
• Logging
• Metrics
• Tips & Tools
Obligatory DevOps Pitch
• DevOps is about 3 things:
• The Collaboration of People
• The Convergence of Process
• The Creation & Exploitation of Tools
• Its primary goal is providing quality & value to customers
• It is concerned with flow and encourages systems thinking
• Born from TPS/LEAN, TOC, Agile, & classical Operations Management
Principles of Operation
• Goals:
• Omnipotence: All Powerful
• Omnipresence: All Seeing
• Omniscience: All Knowing
• ... Since we can’t really do that...
• Make change control simple and standardized
• Monitor as deeply as possible and alert a human as needed
• Leverage a suite of tools to help us analyze problems quickly
Principles of Operation (cont’d)
• Man is mortal
• Follow the “no snowflake” rule; minimize variation to maximize maintainability and predictability
• Maintain a set of standard operating procedures (SOPs) to ensure quality across the organization
• Make tools as simple and productive as possible to avoid ad hoc (rogue) administration
• Keep it simple, stupid (KISS); cleverness is temporary but grok’ability is forever
• Leverage industry-standard tools and stock-supplied facilities; avoid excessive customization
Provisioning
• Booting from USB key, CD/DVD/ISO, or PXE is possible
• PXE is preferred for all serious production deployments
• PXE greatly simplifies upgrade/downgrade; just change the TFTP image to boot and reboot the machine.
• Much faster and more controllable than re-imaging USB keys in place
• Shameless Smart Data Center Plug
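The "change the TFTP image and reboot" step can be pictured with a small sketch. The directory layout here (one `platform-<version>` directory per image plus a `current` symlink that the PXE configuration boots from) and the function name are assumptions made for illustration; the talk does not describe how the TFTP root is organized.

```shell
#!/bin/bash
# Sketch only: the platform-<version> directories and the "current"
# symlink are a hypothetical layout, not something shown in the talk.

# Repoint the "current" symlink at the requested platform version.
switch_platform() {
    local tftp_root="$1" version="$2"
    if [ ! -d "$tftp_root/platform-$version" ]; then
        echo "no such platform image: $version" >&2
        return 1
    fi
    # Remove and recreate the link rather than relying on "ln -n",
    # whose meaning differs between platforms.
    rm -f "$tftp_root/current"
    ln -s "platform-$version" "$tftp_root/current"
}

# Usage (hypothetical path): switch_platform /tftpboot/smartos NEW_VERSION
```

Rolling back is the same call with the previous version, followed by a reboot of the node.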
Monitoring
• JPC uses Zabbix
• Free & Open Source
• Proxy architecture makes it easy to monitor multiple data centers
• Agent or Agent-less Operation
• Agent-less: Supports IPMI & SNMP
• Agent is tiny, written in C, and can be compiled statically for easy binary installation without dependencies
• Extremely easy to customize and add custom metrics
• Dashboard provides a “single pane of glass” view of your entire infrastructure
• Includes historical graphing of all metrics
• Caution: Use Percona as the backend database
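Adding a custom metric to the Zabbix agent usually comes down to one `UserParameter=<key>,<command>` line in the agent configuration. The sketch below only prints such a line; the item key `smartos.zones.running` and the helper name are illustrative assumptions, not items from the JPC setup.

```shell
#!/bin/bash
# Sketch only: the item key and helper name are made up for illustration.

# Print a Zabbix agent UserParameter line: UserParameter=<key>,<command>
user_parameter() {
    local key="$1" command="$2"
    printf 'UserParameter=%s,%s\n' "$key" "$command"
}

# Count running non-global zones ("zoneadm list" shows running zones).
user_parameter smartos.zones.running \
    "zoneadm list | grep -v global | wc -l"
```

The printed line would be appended to the agent's configuration file, after which the agent needs a restart to pick it up.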
Zabbix System Status
Monitoring: Completing the Solution
• Zabbix Agents installed by Chef
• Statically compiled binaries distributed in Cookbook
• Zabbix Alerts
• All alerts sent to Ops Staff Jabber & Email directly
• “Disaster” alerts sent to PagerDuty (SMS Ops)
• Pingdom used as backup/redundant solution, alerts sent to PagerDuty
Configuration Management
• In SmartOS, CM is mandatory (imho)
• JPC uses Chef-Solo for Configuration Management
• Bootstrap script is curled and piped to bash, which:
• Downloads Chef “Fat Client”
• Creates Chef-Solo Configuration
• Creates SMF Service and Runs Chef
• Each data center has its own “attributes file” which specifies Zabbix Server, LDAP servers, SSH Keys, etc.
• One set of Cookbooks is used for all DCs
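The "Creates Chef-Solo Configuration" step might look roughly like the sketch below. Every path and file name is an assumption for illustration, since the actual JPC bootstrap script is not shown in the talk; the download of the fat client and the SMF service creation are left out.

```shell
#!/bin/bash
# Sketch only: all paths and file names are hypothetical.

# Write a minimal chef-solo configuration that points at a cookbook
# directory and a per-data-center attributes file.
write_solo_config() {
    local chef_dir="$1" attributes="$2"
    mkdir -p "$chef_dir/cache" "$chef_dir/cookbooks"
    cat > "$chef_dir/solo.rb" <<EOF
file_cache_path "$chef_dir/cache"
cookbook_path   "$chef_dir/cookbooks"
json_attribs    "$attributes"
EOF
}

# Usage (hypothetical paths):
#   write_solo_config /var/chef /var/chef/attributes-dc1.json
#   chef-solo -c /var/chef/solo.rb
```

Keeping the per-DC differences in the attributes file is what lets one set of cookbooks serve every data center.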
Config Management w/Chef
• JPC Chef Cookbooks include:
• “joyent”: Base cookbook run on all nodes, installs basic tools, fixes anything undesirable in SmartOS, adds BMC driver, adds MegaSAS tools, etc.
• “computenode”: Modifications specific to general purpose compute nodes (currently empty)
• “ldap”: Configures LDAP client, modifies PAM for netgroups support, creates user directories, configures ZFS for delegated administration, etc.
• “zabbix”: Installs and configures Zabbix
• ... others, including “northstar”, “bart”, “logging”, etc.
SmartOS Cookbooks & Tools: github.com/joyent/smartos_cookbooks
Orchestration
• Orchestration layer is required for ad-hoc mass control of nodes, for:
• Re-running Chef
• Mass service control (“svcadm disable zones” on all nodes)
• Auditing
• ... things you can’t foresee
• Several options exist: MCollective, pssh, mussh, etc.
• SDC includes an MCollective-like solution (sdc-oneachnode)
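If none of those tools is at hand, the basic idea of ad-hoc mass control can be sketched with a plain ssh loop. This is not how mussh or sdc-oneachnode work internally; the function name and the `REMOTE_SHELL` override (added so the loop can be tried without real hosts) are assumptions for illustration.

```shell
#!/bin/bash
# Sketch only: REMOTE_SHELL is a hypothetical override; it defaults to ssh.

# Run a command on every host listed (one per line) in a file and prefix
# each line of output with the host it came from.
run_on_each() {
    local hosts_file="$1" host
    shift
    while read -r host; do
        [ -z "$host" ] && continue
        # </dev/null keeps the remote shell from consuming the host list.
        "${REMOTE_SHELL:-ssh}" "$host" "$@" </dev/null 2>&1 |
            sed "s/^/$host: /"
    done < "$hosts_file"
}

# Usage: run_on_each compute-nodes.txt svcs -H zones
```

The loop is serial, which is fine for a handful of nodes; the dedicated tools earn their keep once you need parallelism and error handling.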
Orchestration w/ Mussh
$ p mussh -H compute-nodes -c 'svcs -H zones'
10.0.96.22: online Feb_17 svc:/system/zones:default
10.0.96.23: online Feb_17 svc:/system/zones:default
10.0.96.24: online Feb_17 svc:/system/zones:default
10.0.96.25: online Feb_17 svc:/system/zones:default

$ cat zonecount.mussh
RUNNING=`zoneadm list -vc | grep running | grep -v global | wc -l`
INSTALLED=`zoneadm list -vc | grep installed | wc -l`
echo "Node ${HOSTNAME}: ${RUNNING} Zones Running - ${INSTALLED} Zones in Installed State"

$ p mussh -H nodes -C zonecount.mussh
10.0.96.22: Node 45SY9R1: 9 Zones Running - 2 Zones in Installed State
10.0.96.23: Node 8X7Y9R1: 7 Zones Running - 1 Zones in Installed State
10.0.96.25: Node 8BZY9R1: 8 Zones Running - 1 Zones in Installed State
10.0.96.26: Node 5MPY9R1: 18 Zones Running - 4 Zones in Installed State
10.0.96.28: Node H4FY9R1: 5 Zones Running - 1 Zones in Installed State
Other Orchestration Tools to Consider
• ClusterSSH (cssh): http://sourceforge.net/projects/clusterssh/
• RunDeck (formerly ControlTier): http://rundeck.org
User Management & Authentication
• Use LDAP!
• JPC uses OpenLDAP
• Easy to manage; lots of resources
• Flexible replication schemes
• Flat text file configuration makes change control easier
• Client Access via Simple-SSL (636)
• Don’t enable anonymous access; you do NOT need it
• Firewalled legacy 389 access provided for some appliances
• Perform daily management via Apache Directory Studio
• Generate User Passwords using apg (20 char len)
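Generating a 20-character password with apg might look like the sketch below. The fallback branch for systems without apg is an addition for illustration and not part of the talk; the function name is made up.

```shell
#!/bin/bash
# Sketch only: the /dev/urandom fallback is an assumption added so the
# function still works where apg is not installed.

# Print one random 20-character password.
generate_password() {
    if command -v apg >/dev/null 2>&1; then
        # -n: number of passwords, -m/-x: minimum/maximum length
        apg -n 1 -m 20 -x 20
    else
        # 512 random bytes leave far more than 20 alphanumerics.
        dd if=/dev/urandom bs=512 count=1 2>/dev/null |
            LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-20
    fi
}

generate_password
```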
LDAP Considerations
• The “Hard Part” is creating the Schema & seeding the DIT; JPC’s “ldap_kit” will be open sourced soon
• Always deploy LDAP Servers in pairs
• Use MirrorMode replication
• Enforce auth for all users (no anon) and use only SSL if you can
• Don’t mess around with anything other than OpenLDAP & the standard Illumos LDAP Client (i.e., don’t go chasing Linux PAM projects; you don’t need them)
• When configuring clients via CM, modify files directly. Trying to exec ldapclient init may have mixed results.
A Word About Kerberos
• It’s not worth the administrative overhead (imho)
• I don’t believe in SSO for administration in production environments (password entry encourages boundary awareness)
• Keep an eye on ApacheDS (directory.apache.org) & FreeIPA (freeipa.org) Projects
Access Control
• Use Role Based Access Control (RBAC)
• It’s not hard... really!
• Manage RBAC in LDAP, if possible
• Create abstraction profiles, ex:
• Joyent Level D: Normal user + DTrace
• Joyent Level 1: Normal user + Zone/VM Management
• Joyent Level 2: Admin, All but security
• Joyent Level 3: “Primary Administrator” (uid=0)
RBAC: Simple Example
[root@smartos01 ~]# zfs create -o mountpoint=/home zones/home
[root@smartos01 ~]# useradd -s /bin/bash -m -d /home/benr -P "Primary Administrator" benr
80 blocks
[root@smartos01 ~]# passwd benr
New Password:
Re-enter new Password:
passwd: password successfully changed for benr
benr@smartos01:~$ grep benr /etc/user_attr
benr::::type=normal;profiles=Primary Administrator
=====
$ ssh benr@xxxxxxxxx
Password:
benr@smartos01:~$ cat /etc/shadow
cat: cannot open /etc/shadow: Permission denied
benr@smartos01:~$ pfexec cat /etc/shadow
root:$5$YB.Wp7J7$3iLhl.ivH4TCCAFoih6oXCqGIF0SMAjws3w4xwxwOZ4:14897::::::
daemon:NP:6445::::::
bin:NP:6445::::::
RBAC: Learning More
• Authorizations are in /etc/security/auth_attr
• Execs are in /etc/security/exec_attr
• Profiles associate auths and execs for easy reference in /etc/security/prof_attr
• They are associated with users in /etc/user_attr
$ grep ZFS /etc/security/prof_attr
Software Installation:::Add application software to the system:profiles=ZFS File System Management;help=RtSoftwareInstall.html
ZFS File System Management:::Create and Manage ZFS File Systems:help=RtZFSFileSysMngmnt.html
ZFS Storage Management:::Create and Manage ZFS Storage Pools:help=RtZFSStorageMngmnt.html

$ grep ZFS /etc/security/exec_attr
ZFS File System Management:solaris:cmd:::/sbin/zfs:euid=0
ZFS Storage Management:solaris:cmd:::/sbin/zpool:uid=0
ZFS Storage Management:solaris:cmd:::/usr/lib/zfs/availdevs:uid=0
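To see what a given profile actually lets a user run, the exec_attr entries can be filtered by profile name. The sketch below does this with awk on a small sample copied from the exec_attr output above; the function name is made up, and on a real system you would point it at /etc/security/exec_attr.

```shell
#!/bin/bash
# Sketch only: the sample data is copied from the exec_attr output above.

# List the commands (and the IDs they run with) granted by one profile.
# exec_attr fields: name:policy:type:res1:res2:id:attr
profile_commands() {
    local profile="$1" file="$2"
    awk -F: -v p="$profile" '$1 == p && $3 == "cmd" { print $6, $7 }' "$file"
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
ZFS File System Management:solaris:cmd:::/sbin/zfs:euid=0
ZFS Storage Management:solaris:cmd:::/sbin/zpool:uid=0
ZFS Storage Management:solaris:cmd:::/usr/lib/zfs/availdevs:uid=0
EOF

profile_commands "ZFS Storage Management" "$sample"
rm -f "$sample"
```

For the sample data this prints the two "ZFS Storage Management" commands, each followed by `uid=0`.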
RBAC Example in LDAP
RBAC Shells
• pfbash, pfcsh, pfsh, etc.
• Avoid them; intended for roles, not users.
Auditing
• Basic Security Module (BSM) Lives!
• BSM Auditing is enabled by Default
• Audit trails in /var/audit
• Make sure to add a cron job to rotate audit trails (“audit -n”) daily or weekly; they are not rotated by default.
• Print audit trails using “praudit -ls <trail>”; example:
# praudit -ls 20120830170449.20120930090225.78-2b-cb-47-af-7d | grep EXECVE | \
> awk -F, '{ print $12 " (" $7 "): " $19 " " $20}'
root (2012-08-30 17:08:58.862 +00:00): /usr/bin/zonename
root (2012-08-30 17:08:58.879 +00:00): /usr/sbin/zoneadm list
root (2012-08-30 17:08:58.897 +00:00): /usr/sbin/zfs list
root (2012-08-30 17:08:58.934 +00:00): /bin/bash /usr/bin/sysinfo
root (2012-08-30 17:08:58.938 +00:00): uname -s
root (2012-08-30 17:08:58.940 +00:00): zonename
root (2012-08-30 17:08:58.950 +00:00): cat /tmp/.sysinfo.json
root (2012-08-30 17:09:04.155 +00:00): /usr/node/bin/node /usr/sbin/vmadm
root (2012-08-30 17:09:04.273 +00:00): /usr/bin/zonename subject
root (2012-08-30 17:09:04.287 +00:00): /usr/sbin/zoneadm -z
root (2012-08-30 17:09:04.306 +00:00): /usr/sbin/zfs list
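The rotation cron job could be added with a small filter like the one sketched here. The schedule (daily at midnight), the variable name, and the function name are assumptions; the filter reads an existing crontab and adds the entry only when no "audit -n" job is already there.

```shell
#!/bin/bash
# Sketch only: the daily-at-midnight schedule is an assumption; pick
# whatever rotation interval suits you.

AUDIT_ROTATE_ENTRY='0 0 * * * /usr/sbin/audit -n'

# Read a crontab on stdin and write it to stdout, appending the rotation
# entry only if no "audit -n" job is already present.
ensure_audit_rotation() {
    local current
    current=$(cat)
    [ -n "$current" ] && printf '%s\n' "$current"
    if ! printf '%s\n' "$current" | grep -q 'audit -n'; then
        printf '%s\n' "$AUDIT_ROTATE_ENTRY"
    fi
}

# Usage (as root, hypothetical temp file):
#   crontab -l | ensure_audit_rotation > /tmp/crontab.new
#   crontab /tmp/crontab.new
```

Because the filter is idempotent, it is safe to run from configuration management on every pass.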
Auditing with BART
• BART == Basic Audit Reporting Tool
• Similar to Tripwire
• Consider using “BARTlog”
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/chef/lib/ruby/gems/1.9.1/gems/chef-10.14.2/lib/chef/provider/package/smartos.rb size 2811 3401 mtime 5064a239 5066c229 contents 93d30d38740082bce6529a24ee1024bf 54b367a570a1bc273237add5628b12b4
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include size 7 8
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11 add
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11/DECkeysym.h add
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11/HPkeysym.h add
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11/Sunkeysym.h add
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11/X.h add
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11/XF86keysym.h add
2012-09-30T00:00:01+00:00 78-2b-cb-47-af-7d root: [ID 702911 audit.error] BART Reports Change: /opt/local/include/X11/XWDFile.h add
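Once these reports land in syslog, a quick tally of what changed is often enough for a first look. The sketch below assumes log lines shaped like the ones above and paths without spaces; the function name and the log path in the usage note are made up for illustration.

```shell
#!/bin/bash
# Sketch only: assumes log lines shaped like the BARTlog output above and
# paths that contain no spaces.

# Count changes per kind: "add"/"delete" lines carry a single keyword
# after the path; anything longer is treated as a modification.
summarize_bart_changes() {
    awk -F'BART Reports Change: ' '
        NF == 2 {
            n = split($2, field, " ")
            kind = (n == 2) ? field[2] : "modify"
            count[kind]++
        }
        END { for (kind in count) print kind, count[kind] }
    ' | sort
}

# Usage (hypothetical log path): summarize_bart_changes < /var/log/bart.log
```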
Logging
• SmartOS ships with Rsyslog; you can fall back to stock syslogd if you wish
• Rsyslog is a syslog server for this century, includes TCP support, TLS, filtering, compression, database support, etc.
• SMF Services log to /var/svc/log
• System logs found in /var/adm & /var/log
Logging Tips
• Enable BSM Syslog Plugin
• Sadly, command executions do not include ARGV today :(
• Use logger(1) in your scripts to write syslog messages
• Centralize Syslog
• Leverage Rsyslog’s TCP capabilities for clients
• Leverage Rsyslog’s filtering capabilities for building centralized syslog servers
• ... if you can afford it, buy Splunk or SumoLogic
• ... if you can’t, consider Graylog2 and/or Logstash
• If you have too much time on your hands, go Hadoop
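Centralizing over TCP comes down to one forwarding rule on each client. The sketch below prints such a rule in Rsyslog's classic syntax; the host name, tag, and facility are placeholders rather than values from the talk, and the function name is made up.

```shell
#!/bin/bash
# Sketch only: "loghost.example.com", the tag, and the facility are
# placeholders.

# Print an rsyslog rule that forwards every message to a central server
# over TCP ("@@" means TCP; a single "@" would mean UDP).
forward_rule() {
    local host="$1" port="${2:-514}"
    printf '*.* @@%s:%s\n' "$host" "$port"
}

forward_rule loghost.example.com

# In a script, logger(1) writes a message with a chosen tag and priority:
#   logger -t backup -p daemon.notice "nightly backup finished"
```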
Metrics
• “If it moves, graph it. If it’s important, alert on it.”
• Kstats are your friend (See all available: “kstat -p”)
• For everything else, there is DTrace
Metrics: Kstats
$ kstat -p | wc -l
33461
$ kstat -p bnx:0:mac:rbytes && sleep 60 && !kstat
kstat -p bnx:0:mac:rbytes && sleep 60 && kstat -p bnx:0:mac:rbytes
bnx:0:mac:rbytes 614389071
bnx:0:mac:rbytes 614419131
30,060 Bytes Recv’d on bnx0
• A “registry” of kernel statistics
• Most stats are counters; to calculate activity find the delta
• Most common tools use Kstats as their source data, ex:
• vmstat
• iostat
• fsstat
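Because most kstats are counters, turning two readings into a rate is a subtraction and a division. The sketch below uses the two rbytes readings shown above; the helper name is made up, and counter wraparound is not handled.

```shell
#!/bin/bash
# Sketch only: the helper name is made up; the sample values are the two
# rbytes readings shown above. Counter wraparound is not handled.

# Convert two readings of a kstat counter into a per-second rate
# (integer division).
counter_rate() {
    local first="$1" second="$2" interval="$3"
    echo $(( (second - first) / interval ))
}

counter_rate 614389071 614419131 60   # bytes per second over 60 s

# Live usage on a SmartOS host (reads the value column of "kstat -p"):
#   a=$(kstat -p bnx:0:mac:rbytes | awk '{ print $2 }'); sleep 60
#   b=$(kstat -p bnx:0:mac:rbytes | awk '{ print $2 }')
#   counter_rate "$a" "$b" 60
```

For the readings above, the delta of 30,060 bytes over 60 seconds works out to 501 bytes per second.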
Metrics Graphing Solutions
• RRDtool: All-in-One database and graphing solution; local
• Ganglia: Flexible cluster graphing solution, based on RRDtool (agent-based)
• Graphite: Modern alternative to RRDtool; network based graphing and “rrd” data storage. (agent-less)
• In the end, data is fed into nearly all tools as key/value pairs with a timestamp.
Northstar RRDtool Example
Graphite In Use
Feed Graphite Data via netcat:
echo "test.cpu 20 $(date +%s)" | nc graphite-server 2003
View the graph via the “URL API”:
http://graphite-server:8888/render/?width=400&height=250&target=dtrace.newton.syscall.read.entry&from=-1hours
Examples of DTrace & Graphite at: https://github.com/benr/graphite-dtrace
API also supports CSV, JSON, and XML output!
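The netcat one-liner generalizes into a pair of small helpers. In this sketch, "graphite-server" and port 2003 come from the example above, while the function names are made up for illustration.

```shell
#!/bin/bash
# Sketch only: "graphite-server" and port 2003 come from the example
# above; the function names are made up for illustration.

# Format one sample in Graphite's plaintext form: "<path> <value> <epoch>".
# The timestamp defaults to the current time.
graphite_line() {
    local metric="$1" value="$2" timestamp="${3:-$(date +%s)}"
    printf '%s %s %s\n' "$metric" "$value" "$timestamp"
}

# Send one sample to the Graphite listener with netcat.
send_metric() {
    graphite_line "$1" "$2" | nc graphite-server 2003
}

# Print a sample line with a fixed timestamp instead of sending it.
graphite_line test.cpu 20 1349136000
```

Keeping the formatting separate from the sending makes it easy to batch many lines into a single connection later.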
Other Tools & Tips
• Use the Ptools to observe processes
• pfiles: List open file descriptors of a process
• pargs: List arguments & env vars on a process
• pmap: Show memory allocation of a process
• and... pldd, pflags, pcred, pstack, pstop, prun, pwait, etc.
• Monitor per mount file system activity with fsstat
• SmartOS includes ziostat and zmemstat
Other Tools & Tips (cont’d)
• Use IPMI if you’ve got it! IPMI goodies include:
• Sensor Data Repository (sdr)
• System Event Log (sel)
• Serial Console Redirection Over LAN (sol)
• FRU Inventory (fru)
• Know your place! Use LLDP if your network provides it.
• ‘getldp.pl’ uses snoop to listen for LLDP packets
$ p ./getldp.pl -lx -i bnx0
Watching for LLDP packet on bnx0 for 60 seconds...
device-id: 00:1c:73:XX:XX:XX
platform: Arista Networks EOS version 4.7.8 running on an Arista Networks DCS-7048T-A
port-id: Ethernet37
sysName: XXX-AR48-TOR-3-XX
native-vlan: 998
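The IPMI facilities listed above can be pulled together in one pass. This sketch assumes the ipmitool CLI is installed, which the talk does not state; `IPMI_DRY_RUN` and the function name are hypothetical, added so the commands can be previewed before anything touches the BMC.

```shell
#!/bin/bash
# Sketch only: assumes ipmitool is installed; IPMI_DRY_RUN is a
# hypothetical switch for previewing the commands.

# Run (or, in dry-run mode, just print) the ipmitool commands for the
# sensor repository, event log, and FRU inventory.
ipmi_report() {
    local sub
    for sub in "sdr list" "sel list" "fru print"; do
        if [ -n "${IPMI_DRY_RUN:-}" ]; then
            echo "ipmitool $sub"
        else
            # Unquoted on purpose so "sdr list" splits into two arguments.
            ipmitool $sub
        fi
    done
}

IPMI_DRY_RUN=1 ipmi_report
# Serial-over-LAN is interactive, so it is left out: ipmitool sol activate
```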
... now go forth and operate that thing!
Thank You.