Persistent, stateful services with docker clusters, namespaces and docker volume magic
Michael Neale
Co-founder, CloudBees (that Jenkins company)
Agenda
Background
Use-case for stateful services
Docker volumes
Quick namespaces revision
nsenter

Mounts and Volumes
It’s all files (part 1)
the mount namespace
creating bind mounts
docker volume api (use it!)

Supercontainers and storage
Privileges
It’s all files (part 2)
Controlling the host and peer containers
Storage engines

Stateful docker clusters
“off the shelf” cluster scheduling
The solution chosen
Other tools out there
Credits…
Background
The Need for Stateful Services
Basis of this presentation:

“.. was learned while building an elastic and scalable Jenkins based product for multiple cloud environments, on docker”
—Michael Neale

“No containers were hurt as part of this production.”
My history with docker
Ex Red Hat, where I heard about “control groups”
Starting CloudBees, looking at ways to do fair multi-tenancy
Later would discover (and with much help) use LXC
Saw a video of Solomon demoing docker and didn’t believe it
Still didn’t believe it
For the longest time didn’t believe it
CloudBees & Docker
Actually spoke about this at DockerCon 2014 (the first one!)
cgroups -> LXC -> LXC + ZFS copy-on-write
Like dotCloud, ran a PaaS (as well as a CI/CD toolchain)
In 2014 moved to focus on CI/CD (dotCloud focused on docker)
In 2014 moved to adopt docker over LXC (and ZFS)
Using: Docker Hub (private repos), private registry
Many of our customers are commercial users of docker
Docker Jenkins plugins: docker hub, build and publish, and many more
Put all the things (OSS and commercial) on docker hub
I started the “official” jenkins image early on
updated now ~weekly (with LTS images also)
one MEEELION ??
A stateless cluster of apps is the dream
But the reality is, many apps still need state, a disk
Databases for example
Hands up who would run Oracle on NFS?
Reality: local disk
Network filesystems are great*
But sometimes you need the data close to the processing: EBS, HDFS, GCP, OpenStack block storage…
BUT: how to balance this need for local state with “ephemeral” servers?
Servers come and go; need to restore the data (fast)
Need to back up the data (delta/snapshots - fast)
Alternatives: SANs (reattach volumes to replacement nodes; some clouds also support this)
Reason for backups: resilience. Volumes can disappear too.
Current product
Years of experience with containers: EC2, ZFS, EBS, LXC
Learn from it to build something new and “turn key” installable, powered by docker
I accidentally created a cluster scheduler (it happens.. please don’t)
An evolved “pre-docker” system
Aim: a new product
A distributed Jenkins cluster
10000s of “masters”, 100000s of elastic build workers
Utilise “Off The Shelf” expertise based around docker: Mesos, Docker Swarm, Kubernetes
Work within the existing constraints of a lively and evolving open source project
(this means accepting local disk state… for now)
Additional Constraints
Only want to depend on docker being present on “worker nodes”
Off the shelf cluster scheduler
Use local disk*
Multiple target clouds to be supported
Multiple storage “engines” to be supported
* Would love to refactor to DB backed
“Storage engines?”
“The thing that backs up and restores local disks”
eg: EBS (snapshots), rsync, NFS, ZFS send …
Same cluster management, same api, different storage tech for different clouds/needs.
Ensures volumes are backed up in a consistent state (using LVM snapshot, xfs_freeze, as needed)
Docker volumes
Docker helpfully lets you bind mount to the host
Giving you a choice of ways to get data to the host
Containers can remain ephemeral
However, you need to manage those underlying volumes
Note: you shouldn’t need to do what I did. Use something off the shelf if you can. If you must, there is an excellent docker plugin api and volume plugin api.
Solving local disk with docker
[Sequence diagram: client → cluster sched. → docker host → storage. Client requests the app; scheduler finds a free slot; host asks for data; storage provides data; container ends up fully running with data.]
Using “trickery”
[Sequence diagram: client → cluster sched. → docker host → storage. Client requests the app; scheduler finds a free slot; the container starts, asks for a dynamic bind mount, and waits; data is requested, provided, and bind mounted in.]
With docker volume plugin api
[Sequence diagram: client → cluster sched. → docker host → storage. Client requests the app; scheduler finds a free slot; docker calls the volume plugin (json) BEFORE the container starts; data is provided, and the container launches with the bind mount.]
However: Docker plugin api did not exist yet!
I had to make do with “trickery”
Other choices like powerstrip existed, but wanted “standard” docker
And you are here for namespace trickery
So let’s learn from it…
“Hard work pays off eventually, but laziness pays off right now.”
—Unknown
Namespaces - really quick…
Along with cgroups, are “foundational tech” for containers
6 types: Mount, UTS, IPC, PID, Network and User
My favourites: Mount: filesystem stuff (that I used)
PID, Network and the exciting User namespaces!
https://lwn.net/Articles/531114/
How do we access these namespaces?
nsenter - command line tool
nsenter allows you to “enter” a namespace and do something in the context of it
Available out of the box in many linux distros now
https://github.com/karelzak/util-linux/blob/master/sys-utils/nsenter.c
https://blog.docker.com/tag/nsenter/
Mounts and Volumes
It’s all files in Linux - part 1
Mount namespace
Containers don’t see all mount points, all devices; just their own
Allows docker’s “bind mount” to work
A “bind mount” in linux is really an “alternative view of an existing directory tree”
A docker bind mount takes that “alternative view” and makes it visible to the container (via its mount namespace)
Magic? No. Linux.
It’s all files, part 1
Start any container. Then access the docker host and run this to get the pid of the whole container:
docker inspect --format {{.State.Pid}} <container id>
You can then see the 6 namespaces in /proc/<PID>/ns:
ls /proc/7865/ns/
ipc mnt net pid user uts
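These namespace entries can also be inspected programmatically. A minimal Python sketch (illustrative only; it uses its own PID as a stand-in for a container’s PID):

```python
import os

def list_namespaces(pid):
    """List the namespace entries visible under /proc/<pid>/ns."""
    return sorted(os.listdir(f"/proc/{pid}/ns"))

# Using our own PID as a stand-in for a container's PID:
print(list_namespaces(os.getpid()))
```

Note: on newer kernels you may see more than the classic 6 entries (e.g. cgroup, pid_for_children).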
/proc virtual filesystem and nsenter
/proc is a virtual filesystem (http://www.tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html)
Run a command inside a given container’s namespace:
nsenter --mount=/proc/$PID/ns/mnt -- /usr/bin/command param
RUN A COMMAND FROM HOST AS IF YOU ARE IN THAT CONTAINER
“With great nsenter power, comes great responsibility”
—Spiderman’s Uncle
Creating a bind mount on a running container (-v /var/foo:/var/bar)
High level steps:
1. Get the underlying device from the host, into the container
2. Mount the device in the container
3. Bind mount in the container to the “directory you want”
4. Unmount the device in the container
5. Remove the initial mount
What you are left with: a bind mount to the volume on the host you wanted in the first place, and only that path. Not the whole device/volume on host.
You don’t need to do all this yourself, ever!
# Using a device’s numbers we can create the same device in the container
# use nsenter to create a device file IN the container (using its $PID):
nsenter --mount=/proc/$PID/ns/mnt -- mknod --mode 0600 /dev/sda1 b 8 0

# Now we have the device ALSO in the container!
# We can mount it (normal linux)
# bind mount to the desired directory (also normal linux)!
# all from the host
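The whole five-step dance can be sketched as plain command construction. This is only illustrative: the device path, staging directory and target are made-up names, and each command would actually be run from the host against the container’s mount namespace.

```python
def bind_mount_commands(pid, major, minor, dev="/dev/xvdf",
                        staging="/mnt/stage", target="/var/bar"):
    """Build the command sequence for a dynamic bind mount.

    dev, staging and target are illustrative names; major/minor are the
    device numbers taken from the host (8, 0 for /dev/sda1 in the slide).
    """
    ns = f"nsenter --mount=/proc/{pid}/ns/mnt --"
    return [
        f"{ns} mknod --mode 0600 {dev} b {major} {minor}",  # 1. recreate the device in the container
        f"{ns} mount {dev} {staging}",                      # 2. mount the device
        f"{ns} mount --bind {staging} {target}",            # 3. bind mount the directory you want
        f"{ns} umount {staging}",                           # 4. unmount the device
        f"{ns} rm {dev}",                                   # 5. remove the device node
    ]

for cmd in bind_mount_commands(7865, 8, 0):
    print(cmd)
```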
I told you not to panic!
Now we have a dynamic bind mount
As if we used -v /var/foo:/var/bar on startup
Remember: DON’T DO THIS!
Really: you shouldn’t need to do this yourself. Use the docker volume plugin api! (if you must)
Docker plugin API
Out of process JSON based api (but running on the same host)
Plugins are installed by putting a file in a directory, and referred to by name (minus the extension)
Well defined JSON protocol
https://docs.docker.com/extend/plugin_api/
Docker volume plugin API
docker run -v volumename:/data --volume-driver=mydriver ..
“volumename” is passed to the registered volume-driver (which is listening on http)
The volume-driver then prepares the data somewhere on the host, and returns where it lives (via json)
… docker then bind mounts it in as /data
All happens BEFORE the container starts
https://docs.docker.com/extend/plugins_volume/
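At its core the protocol is a couple of JSON round trips. A hedged sketch of the Mount half (the volume name and mountpoint below are invented; a real driver would prepare the data, e.g. restore a snapshot or attach a volume, before answering):

```python
import json

# Hypothetical mount table a real driver would populate by actually
# preparing the data (restore from snapshot, attach EBS, etc.):
VOLUMES = {"volumename": "/var/lib/mydriver/volumename"}

def handle_mount(request_body):
    """Handle a /VolumeDriver.Mount call: docker sends {"Name": ...};
    the driver answers with where the data lives on the host."""
    name = json.loads(request_body)["Name"]
    if name not in VOLUMES:
        return json.dumps({"Err": f"no such volume: {name}"})
    return json.dumps({"Mountpoint": VOLUMES[name], "Err": ""})

print(handle_mount('{"Name": "volumename"}'))
```

Docker then takes the returned Mountpoint and bind mounts it into the container itself.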
Docker volume plugin API
Would not require messing with namespaces
Still allows an out of process “volume service” to take care of messy volume details
However - DOES require you to register the plugin with docker on the host
And less terrifying fun than nsenter and namespaces
If you really must
https://github.com/michaelneale/bind-mount-supercontainer
Sample python code that I prototyped this with. Use with care!
Supercontainers and storage engines
Like containers, only more… uh, super…
Supercontainers - concept
Term came from Red Hat: http://developerblog.redhat.com/2014/11/06/introducing-a-super-privileged-container-concept/
You have heard of privileged containers?
docker run --privileged ..
Drops the usual security restrictions (capabilities, device access)
“Super privileged containers” add in more access to the underlying host…
It’s all files (part 2)
Add in the host root filesystem, docker daemon, and all the rest:
docker run -v /var/run/docker.sock:/var/run/docker.sock \
  --privileged \
  -v /:/media/host \
  my-super-container
Brings in the docker socket, and root as /media/host
/media/host then contains ALL devices, virtual files, /proc etc
It’s all files (part 2)
Why? We can do everything we did with nsenter before but from WITHIN a “peer container”
Remember requirements: vanilla docker, only docker installed on the host
Use the super-container as an “agent” container; do all the automation you could want
No need for extra bits on the host box
Allows using “off the shelf” cluster scheduling (only docker need be installed)
Controlling the host
Host can be accessed from super-container via nsenter
PID of host is 1!
eg, from super-container, get all mounts:
nsenter --mount=/media/host/proc/1/ns/mnt -- cat /proc/mounts
Run a command, from the container, on the host (stuff after “--”)
/media/host lets us get to the host. Even devices.
Controlling the host
Do all the steps as before, but with “nsenter --mount=/media/host/proc/1/ns/mnt” prefixed
Controlling peer containers from supercontainer
Peers are other “ordinary” containers on the same host as the super container
Peers can be accessed from the super-container, also via nsenter
Just like before, we use nsenter, with the peer container’s $PID
But prefix it with the host’s filesystem:
nsenter --mount=/proc/$PID/ns/mnt -- ..
becomes:
nsenter --mount=/media/host/proc/$PID/ns/mnt -- ..
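The only difference between entering the host and entering a peer is which /proc the path starts from. A small helper sketch (the /media/host path is the convention assumed above, not anything docker mandates):

```python
def nsenter_prefix(pid, from_supercontainer=False, host_root="/media/host"):
    """Build the nsenter prefix for entering a mount namespace.

    From inside the super-container, the host's /proc is only reachable
    via the bind-mounted host root (assumed at /media/host here).
    """
    proc = f"{host_root}/proc" if from_supercontainer else "/proc"
    return f"nsenter --mount={proc}/{pid}/ns/mnt --"

print(nsenter_prefix(1, from_supercontainer=True))     # the host itself (PID 1)
print(nsenter_prefix(7865, from_supercontainer=True))  # a peer container
```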
Controlling peer containers
Why?
Once again, use the super-container as the controlling agent on a host
Fewer bits to install on the host
Storage engines
My requirement: multiple implementations for different clouds
Different clouds have different storage engines
The super container is a great place to host the volume service
Different implementations of the service depending on what is on offer: EBS, NFS, openstack, rsync and more
This “volume service super-container” is responsible for backup/restore
Storage engines - eg an AWS region
[Diagram: an AWS region with zone-1 and zone-2; servers A and B in each zone with attached volumes (vol-1, vol-2); the volume service requests backups, and snapshots are kept at the region level.]
Snapshots/backups
Snapshots are cheap and quick
Zone resilience
Volumes (ie: disks) are not as durable as snapshots/backups
Similar in other platforms: GCP, OpenStack, Azure
Google compute persistent disks do allow read-only extra mounts across instances, for redundancy of compute nodes
In our case: failing over is “restoring from backup” - always test your backups!
Supercontainers - summary
A useful tool for low level control
No need to install bits on the host
Can control peers directly
Could be a great place to host a docker volume plugin implementation
(not currently recommended in Docker plugin api docs)
Stateful clusters
Everyone wants to be stateless…
What we built…
.. an elastic and scalable Jenkins based product for multiple cloud environments, on docker
Cluster schedulers/managers
Remember: I have built schedulers before, would rather not again
Docker Swarm, Mesos/Marathon, Kubernetes etc
Some have concepts of volumes
All can schedule “plain” docker containers
Super containers can give you a way to get lower level access
What we settled on
Super containers to implement the volume service
Support for multiple storage engines for different clouds
Scheduled via mesos+marathon
Only docker (+ mesos in this case) required on the hosts
Why mesos: a practical choice for us, but not a tight coupling
(could mesos be in a super container? probably)
Using containers for all the things: elastic search nodes, builds, even haproxy
For us, 5 minute or event based backups/snapshots are fine
Running supercontainers
Eg. marathon: schedule a super container to run on each host. Constraint on the volume service: one per host; size: number of servers in the cluster (3 in this case):

[Diagram: three hosts, each running a vol service container alongside masters, elastic search, and haproxy, with one slot free.]
Working with EBS (an example)
[Sequence diagram: client container → volume service → EBS api. The client requests a backup; the volume service freezes the filesystem for the snapshot, initiates the snapshot, unfreezes, then backs up the delta and copies it to s3.]
optimisation: use LVM snapshot instead of freeze
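The freeze/snapshot/unfreeze sequence can be sketched as a command list. The aws CLI usage, mountpoint and volume id here are illustrative, not the product’s actual implementation:

```python
def consistent_snapshot_commands(mountpoint, volume_id):
    """Command sequence for a crash-consistent EBS snapshot: freeze the
    filesystem, initiate the snapshot, unfreeze as soon as the snapshot
    call returns (EBS finishes the copy in the background)."""
    return [
        f"xfs_freeze -f {mountpoint}",                           # flush and block writes
        f"aws ec2 create-snapshot --volume-id {volume_id}",      # point-in-time snapshot
        f"xfs_freeze -u {mountpoint}",                           # resume writes
    ]

for cmd in consistent_snapshot_commands("/data", "vol-0abc"):
    print(cmd)
```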
Backups, backups
Servers are ephemeral
Servers come and go
Disks are fallible (even if cloud platforms call them “volumes”)
Workload moves around
Restore data when workload is moved to a new location
Delta backups are used to avoid full copies each time
Cluster schedulers/managers
Storage awareness is being built in increasingly (Kubernetes volumes, mesos storage awareness)
Ideal world: your cluster manager will do all this for you. If you live in that world: congrats. Make yourself a cocktail:
My recipe for a no-sugar old fashioned: https://gist.github.com/michaelneale/6034145
“off the shelf” stateful volume tools
Rexray: uses the volume plugin api for Amazon EBS, Rackspace and more
Flocker from ClusterHQ
Kubernetes volume support
Apache “Mysos”: MySQL service backed up to HDFS on mesos
Tutum from Docker! Has support for persistent volumes
Watch this space… (changing constantly)
https://docs.clusterhq.com/en/1.4.0/labs/docker-plugin.html
https://github.com/emccode/rexray
Stateful volumes summary
It is possible with docker
Avoid doing it yourself if someone else already has
Using the local filesystem directly does feel a bit “legacy”
But it is a reality for some apps (especially database services)
Lovely to port everything to be stateless, database backed, blobstore backed, but it takes time
Lean on the capabilities of the underlying platform where you can
Credits
Jérôme Petazzoni (@jpetazzo) - years of inspirational blog posts and hacks on linux/docker/volumes. And great hair. http://jpetazzo.github.io/2015/01/13/docker-mount-dynamic-volumes/ - BTW Jerome - it works for real!
Red Hat for Super Container concepts - Daniel Walsh: http://developerblog.redhat.com/2014/11/06/introducing-a-super-privileged-container-concept/
Trevor Jay from Red Hat for some final namespace tips: https://securityblog.redhat.com/author/tjay/
I really just mashed up the above concepts: https://michaelneale.blogspot.com.au/2015/02/mounting-devices-host-from-super.html
@jpetazzo’s hair - imminent singularity?
Thank you!
Michael Neale
@[email protected]