
The Going On-Premise Survival Handbook

HOW TO:

➡ Avoid deployment fragmentation.

➡ Simplify installation complexity.

➡ Handle application upgrades.

➡ Enforce resource requirements.

➡ Manage releases and versioning.

brought to you by Gravitational

version 0.2.0


Table of Contents

INTRODUCTION

SECTION 1: LIFECYCLE MANAGEMENT AND OPERATIONS

  AVOID DEPLOYMENT FRAGMENTATION
  SIMPLIFY INSTALLATION COMPLEXITY
  HANDLE INCOMPATIBLE RESOURCES
  SIMPLIFY UPGRADE COMPLEXITY
  HANDLE APPLICATION UPGRADE FAILURES
  CREATE CONSISTENT APPLICATION ENVIRONMENTS

SECTION 2: MANAGING RELEASE AND UPGRADE CYCLES

  REDUCE UPGRADE CYCLES
  MANAGE RELEASES AND VERSIONING
  MANAGE EXTERNAL DEPENDENCIES
  PUBLISH INSTALLABLE SOFTWARE

SECTION 3: THE ROAD TO PRODUCTION

  MANAGE OPEN SOURCE SOFTWARE DEPENDENCIES
  PASS SECURITY AUDITS
  OFFER AND ENFORCE EVALUATIONS
  MANAGE HA DATABASE DEPLOYMENTS
  SIMPLIFY THE OPERATIONS OF KUBERNETES
  RECOVER FROM FAILURES
  MONITOR AND TROUBLESHOOT DEPLOYMENTS
  SET UP DATA STORAGE
  MAINTAIN DATA INTEGRITY
  ACCESS DEPLOYMENTS REMOTELY
  CONFIGURE NETWORKING

SECTION 4: MANAGING ORGANIZATIONAL ISSUES

  SET EXPECTATIONS WITH CUSTOMER IT
  PREVENT INTERNAL TEAM FRAGMENTATION

CONCLUSION

Copyright © 2018 Gravitational, Inc. All rights reserved.


INTRODUCTION

Almost every successful B2B SaaS vendor eventually receives a request from a large, security-minded customer to deliver their offering on-premise. There are several key reasons why customers may require an on-premise installation, but these can be generalized into two major categories:

• Security and regulation: Sensitive information cannot leave the premises, with privileged access to that data allowed only from within the company or through vetted service providers.

• Data locality and latency: The solution is meant to be run on-premise (e.g., network and security monitors, load balancers and web application firewalls are intended to run in the data center) or it is easier to run the application where the data already is located (data processing and machine learning services).

After several years of deploying and running complex applications in some of the most secure, air-gapped data centers in the world, we put together this survival handbook for our customers (or potential customers) to help them evaluate, prepare and survive going on-prem.

In this guide, we share some of the technical and organizational challenges we have seen when going on-premise. We also present some of the solutions we have researched and developed, and how we have productized some of them through our Gravity platform. We believe that Kubernetes offers a lot of advantages when delivering complex applications on-premise, so many of our solutions focus on how to leverage Kubernetes to overcome the challenges described.

We also have a series of workshops that focus on the more technical aspects of the technologies we use, namely Docker and Kubernetes.

Before we dive in, we should mention an important caveat - you should only offer private installations if customers are ready to pay a large premium for them. It will require a significant investment, so make sure there is significant and repeatable demand for your efforts.


If you are ready to forge ahead, the challenges that you will face can be grouped into the following categories:

• Lifecycle management and operations: installing and upgrading applications.

• Release cycle management: packaging, publishing and versioning releases.

• Production readiness: security, licensing, monitoring and high availability.

Finally, there are also challenges to consider that are not directly related to the technical implementation details. We will touch upon some of these organizational challenges.

Good luck. We hope this handbook helps.

- The Gravitons


SECTION 1: LIFECYCLE MANAGEMENT AND OPERATIONS


HOW TO AVOID DEPLOYMENT FRAGMENTATION

Challenge

Delivering an on-premise offering in addition to an existing hosted offering may result in two different ways to deploy the application. This leads to a bifurcation of team responsibilities and a doubling of the amount of work.

Solution

It is possible to unify deployments by migrating to Kubernetes as the primary platform for both deployments.

Kubernetes provides a way to abstract away the details of underlying infrastructure like disks, load balancers and network security rules. You can read more about Kubernetes in its documentation.

In addition to adopting Kubernetes, Helm (the Kubernetes native package manager) should be used to split components into independent packages.

Once the migration to Kubernetes and Helm is complete, the on-premise edition becomes just another deployment target alongside cloud deployments.

For example, many of our customers use Gravity’s supported upstream Kubernetes as a deployment target for on-premise and a managed Kubernetes service like GKE or AKS for their cloud deployments.


HOW TO SIMPLIFY INSTALLATION COMPLEXITY

Challenge

Installing a highly available, distributed system is difficult on infrastructure you control, not to mention infrastructure that you don't. Setting up dozens of components and dependencies leads to a multi-step installation process that is very hard for untrained, on-site personnel to debug and execute.

Some installations will fail and it will take many hours to troubleshoot the root cause while going back and forth with the customer. Completing an installation may take days and numerous attempts to get right. Eventually, customers may entirely abandon the idea in frustration, which can damage your reputation with your customers.

Solution

Automating as many steps as possible removes the human factor from the installation process. However, it can be difficult to automate the installation of complex applications for every environment. It is a good idea to limit the types of supported environments and to only support specific components so that you can safely implement automation (see the section on “How to handle incompatible resources”). There also needs to be an easy way to log and share information externally for debugging if something goes wrong.

Speaking of installation failures, support teams should have the ability to resume the installation from the point of failure rather than starting over. This saves hours by not having to restart the installation from scratch when failures occur late in the installation process.

Gravity automates the installation and reduces the number of installation steps to one command, which installs Kubernetes alongside all dependencies and application containers. It also has a simple way to collect operational reports that capture all possible information, and it allows for manually overriding the automated installation if a failure occurs.


HOW TO HANDLE INCOMPATIBLE RESOURCES

Challenge

When you don’t control the infrastructure, installation problems can be caused by a variety of things outside of your control - slow disks, slow networks or an old OS distribution provided by the customer. This can cause hours of troubleshooting, and it will be unclear why the installation failed with a seemingly correct setup and configuration.

Solution

Always specify and enforce system requirements for disk space and speed, network requirements, and OS distribution with every installation. The system should refuse to install unless the requirements are met and should report clear errors for the requirements that are not met. It is usually not enough to provide guidance in the form of documentation, because customers often ignore it. We use a set of pre-checks that cover the list of supported operating systems, disk speed and capacity, network bandwidth and open ports.

Also, equip your services teams with lightweight tools to pre-check system readiness (like our gravity status tool) that can run even before the installation has begun, to make sure that basic requirements are satisfied. A minimal sketch of such a pre-check follows the list below.

Here is some advice on more specific requirements to consider:

• When using Kubernetes, require a separate disk for etcd (the internal Kubernetes database) and any other database that you ship. This requirement can be lifted for trial deploys, but make sure to include it in production specifications.

• Isolate slow network-attached storage by setting a minimum performance requirement for storage volumes. Even a bar as low as 20 MB/s will eliminate completely incompatible or broken storage.

• Always set up capacity requirements for temporary and root partitions and database partitions. You will be surprised how often you will get VMs with minimal disk space available if you don’t.

• Apply baseline network throughput requirements. Setting something as low as 5MB/s will spare you from troubleshooting congested networks.


• Specify and encode all networking and port requirements needed for the application to run.

• Start with one or two of the most popular supported OS distributions. Typically, larger customers have RHEL available. This will spare you from troubleshooting a range of 5 different distros and kernels. Here are our guidelines on supported distributions, for reference.
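
To make these pre-checks concrete, here is a minimal sketch of a readiness check a services team could run on each target node before the installer. It is not Gravity's actual pre-check implementation; the 50 GiB threshold, the example ports and the distro list are illustrative assumptions, and it is Linux-only since it reads /etc/os-release and uses Statfs.

```go
// precheck.go - a minimal sketch of an installer pre-check (illustrative thresholds).
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
	"syscall"
)

// freeDiskGB returns the free space of the filesystem holding path, in GiB.
func freeDiskGB(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return float64(st.Bavail) * float64(st.Bsize) / (1 << 30), nil
}

// portFree reports whether we can bind the given TCP port.
func portFree(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
	if err != nil {
		return false
	}
	l.Close()
	return true
}

// supportedOS reports whether /etc/os-release mentions a distro we support.
func supportedOS() bool {
	data, err := os.ReadFile("/etc/os-release")
	if err != nil {
		return false
	}
	s := strings.ToLower(string(data))
	return strings.Contains(s, "rhel") || strings.Contains(s, "centos") ||
		strings.Contains(s, "ubuntu")
}

func main() {
	failed := false

	gb, err := freeDiskGB("/var/lib")
	if err != nil {
		fmt.Println("FAIL: cannot check disk space:", err)
		failed = true
	} else if gb < 50 {
		fmt.Printf("FAIL: need at least 50 GiB free under /var/lib, have %.1f GiB\n", gb)
		failed = true
	}
	for _, p := range []int{2379, 6443, 10250} { // example ports: etcd, kube-apiserver, kubelet
		if !portFree(p) {
			fmt.Printf("FAIL: required port %d is already in use\n", p)
			failed = true
		}
	}
	if !supportedOS() {
		fmt.Println("FAIL: unsupported OS distribution (see supported list)")
		failed = true
	}
	if failed {
		os.Exit(1) // refuse to install until requirements are met
	}
	fmt.Println("OK: basic requirements satisfied")
}
```

A non-zero exit code means the node is not ready, which is much cheaper to discover before the installation starts than halfway through it.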


HOW TO SIMPLIFY UPGRADE COMPLEXITY

Challenge

Installing distributed systems is hard, but upgrading them is an order of magnitude more complex. Sometimes only certain components of the system need to be upgraded, but it may be difficult to upgrade only those components in a safe way.

Upgrade failures can turn into quagmires. During an on-premise upgrade operation, there is no easy way to reinstall the OS or add new nodes to the rotation. Any part of the upgrade can fail at any time due to known or unknown circumstances like power outages, the system running out of disk space or simply containers hanging because of older kernels. Complex updates will contribute to longer upgrade cycles, as customers will be wary of the risk and of spending 2-3 days upgrading the system.

Solution

Our upgrade process consists of a single command that launches a full cluster and application upgrade. However, if the upgrade fails, it can be easily resumed from the stage it last completed, instead of starting from the beginning.

This approach makes it possible to continue the upgrade even in the face of unexpected failures and it keeps the cluster running during failures. This also leaves a good impression on the customer, as you know at which stage the upgrade failed and can provide insight as to why it happened, which is difficult with a black box upgrade procedure.
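
The following is a simplified sketch of how a resumable, staged upgrade can be structured: each completed stage is recorded in a checkpoint file, and a rerun skips everything up to and including the last completed stage. The stage names and checkpoint path are hypothetical and do not reflect Gravity's actual upgrade phases.

```go
// A simplified sketch of a resumable, staged upgrade (hypothetical stages).
package main

import (
	"fmt"
	"os"
)

var stages = []struct {
	name string
	run  func() error
}{
	{"preflight-checks", func() error { return nil }},
	{"backup-etcd", func() error { return nil }},
	{"upgrade-runtime", func() error { return nil }},
	{"upgrade-application", func() error { return nil }},
	{"post-checks", func() error { return nil }},
}

const checkpoint = "/var/lib/upgrade/last-completed-stage"

func lastCompleted() string {
	data, err := os.ReadFile(checkpoint)
	if err != nil {
		return "" // no checkpoint: start from the beginning
	}
	return string(data)
}

func main() {
	resumeAfter := lastCompleted()
	skipping := resumeAfter != ""
	for _, s := range stages {
		if skipping {
			if s.name == resumeAfter {
				skipping = false // resume right after the last completed stage
			}
			fmt.Println("skipping already completed stage:", s.name)
			continue
		}
		fmt.Println("running stage:", s.name)
		if err := s.run(); err != nil {
			// Leave the checkpoint in place so the next run resumes here,
			// and keep the cluster running instead of tearing everything down.
			fmt.Println("stage failed:", s.name, err)
			os.Exit(1)
		}
		if err := os.MkdirAll("/var/lib/upgrade", 0o755); err == nil {
			os.WriteFile(checkpoint, []byte(s.name), 0o644)
		}
	}
	fmt.Println("upgrade complete")
	os.Remove(checkpoint)
}
```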


HOW TO HANDLE APPLICATION UPGRADE FAILURES

Challenge

Having a platform like Gravity is helpful but it’s not a magic bullet. If the application is not architected correctly, an upgrade can lead to failed database migrations and lost data, which can lead to many hours of troubleshooting and rollbacks.

Solution

We offer some guides and training on implementing proper application upgrade procedures. Here is just one example:

During the upgrade process, we strongly advise taking automatic backups of the system and draining write traffic off the database to avoid conflicts during the migration. In addition, we recommend using a test suite like robotest to run automated regression and upgrade testing with every code and deployment change.

We also offer upgrade hooks that your application can use with Gravity. Here is a sample application upgrade process that can be automated with the upgrade hook (a minimal orchestration sketch follows the list):

✓ Run migrations as a separate process for the cluster instead of running them as part of individual service startup.

✓ Switch the product landing page and API endpoints to show an “upgrade page” to prevent writes to the database during the migration process.

✓ Drain off the traffic to the databases.

✓ Take a backup of the data.

✓ Run migrations on the database.

✓ Check that the migrations ran safely by using a simple sanity test.

✓ Upgrade services.

✓ Switch the traffic back from the landing page to the services.
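
Below is a minimal orchestration sketch of the steps above. Every function is a hypothetical placeholder for your own logic; the point it illustrates is the ordering and failing fast before any destructive step runs.

```go
// A minimal orchestration sketch of an application upgrade hook.
package main

import (
	"fmt"
	"os"
)

func enableMaintenancePage() error { return nil } // point landing page/API at an "upgrade page"
func drainDatabaseTraffic() error  { return nil } // stop writes before the migration
func backupData() error            { return nil } // external backup, outside the cluster
func runMigrations() error         { return nil } // run as a dedicated job, not on service startup
func sanityCheck() error           { return nil } // quick read-after-migrate test
func upgradeServices() error       { return nil } // roll out the new service versions
func restoreTraffic() error        { return nil } // switch traffic back from the upgrade page

func main() {
	steps := []struct {
		name string
		fn   func() error
	}{
		{"enable maintenance page", enableMaintenancePage},
		{"drain database traffic", drainDatabaseTraffic},
		{"back up data", backupData},
		{"run migrations", runMigrations},
		{"sanity check", sanityCheck},
		{"upgrade services", upgradeServices},
		{"restore traffic", restoreTraffic},
	}
	for _, s := range steps {
		fmt.Println("upgrade hook:", s.name)
		if err := s.fn(); err != nil {
			// Stop immediately: nothing after a failed step should run, and the
			// maintenance page stays up until an operator intervenes.
			fmt.Fprintln(os.Stderr, "step failed:", s.name, err)
			os.Exit(1)
		}
	}
	fmt.Println("upgrade hook finished")
}
```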


HOW TO CREATE CONSISTENT APPLICATION ENVIRONMENTS

Challenge

Even though Kubernetes and Docker can abstract away infrastructure differences, you now have to maintain Docker and Kubernetes and make sure they are consistent across deployments. Each version is slightly different, and you will encounter slightly different behavior with various combinations of software, OS distributions and storage engines. This will introduce fragmentation, and your ops team will be constantly asking customers questions about component versions and their respective configurations.

Solution

We create a “bubble of consistency” by using the following methodologies:

We package Kubernetes and all of its dependencies, including etcd, docker, dnsmasq, systemd, etc. We test to make sure these dependencies are compatible before installation. This helps to ensure conflicting software is not running on the host during the install.

We isolate the processes by running them in a special Linux container, which minimizes interaction with distribution packages.

The runtime section of our application manifest sets up approved Docker storage drivers that are production ready and work reliably without losing data.

We only support the most popular OS distributions and specify other requirements. Components are tested before each release.

Doing these things means the support and services teams will never have to ask questions like what Docker or dnsmasq version is installed, because the packages are predetermined and tested for reliability and supportability.


SECTION 2: MANAGING RELEASE AND UPGRADE CYCLES


HOW TO REDUCE UPGRADE CYCLES

Challenge

One of the biggest shocks to SaaS companies delivering software on-premise is the longer release cycles. In addition, customers may not keep versions up to date, with upgrade cycles of up to one year. This puts a lot of strain on the team that has to support older versions of the software.

Solution

Many customers are wary of upgrading complex systems because they often break and require full reinstalls and/or lead to an outage. If you can provide simple and stable upgrades, teams are usually more open to more frequent upgrades and it is possible to get down to bi-weekly upgrade frequency with most of your customer base.


HOW TO MANAGE RELEASES AND VERSIONING

Challenge

SaaS businesses are used to multiple-times-a-day release cycles. Shipping versions on-prem, even with bi-weekly updates, poses a challenge for them, especially if the system is a mix of microservices with loosely coupled release cycles.

Solution

Again, it is important to use the same platform for your on-premise and cloud deployments. We recommend using Kubernetes for both.

Packaging

Use Helm, the Kubernetes package manager, and its best practices to transition microservices releases to a package-style approach with clear dependencies. We have a first-class integration with Helm to simplify the build and deployment process.

Versioning

Picking the right versioning scheme is mission critical for on-premise deployments. Unlike in SaaS deployments, versioning plays a very important role, as it is used to inform customers about the frequency of software release cycles and the risks associated with upgrades.

Adopt semantic versioning and set up clear dependencies between components.

Signal upgrade risk clearly to the customer through the major, minor and patch versions of the software.

For example, with semantic versioning, customers would expect an upgrade between patch versions 2.5.1 and 2.5.2 to be trivial and backwards compatible, an upgrade from 2.6.3 to 2.7.3 to be possible but a bit more risky and potentially involving migrations, and an upgrade from 2.0.0 to 3.0.0 to be a major undertaking.
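
As a small illustration, here is a sketch of how an installer or release tool could classify upgrade risk from two semantic versions and signal it to the customer. The parsing is hand-rolled to keep the example self-contained, and the messages are illustrative.

```go
// A small sketch of classifying upgrade risk from two semantic versions.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type version struct{ major, minor, patch int }

func parse(v string) (version, error) {
	parts := strings.SplitN(strings.TrimPrefix(v, "v"), ".", 3)
	if len(parts) != 3 {
		return version{}, fmt.Errorf("expected major.minor.patch, got %q", v)
	}
	var out version
	var err error
	if out.major, err = strconv.Atoi(parts[0]); err != nil {
		return version{}, err
	}
	if out.minor, err = strconv.Atoi(parts[1]); err != nil {
		return version{}, err
	}
	if out.patch, err = strconv.Atoi(parts[2]); err != nil {
		return version{}, err
	}
	return out, nil
}

// upgradeRisk signals what the customer should expect from an upgrade.
func upgradeRisk(from, to string) (string, error) {
	a, err := parse(from)
	if err != nil {
		return "", err
	}
	b, err := parse(to)
	if err != nil {
		return "", err
	}
	switch {
	case b.major != a.major:
		return "major: plan a maintenance window, expect breaking changes", nil
	case b.minor != a.minor:
		return "minor: possible migrations, review release notes", nil
	case b.patch != a.patch:
		return "patch: trivial, backwards compatible", nil
	default:
		return "no change", nil
	}
}

func main() {
	for _, pair := range [][2]string{{"2.5.1", "2.5.2"}, {"2.6.3", "2.7.3"}, {"2.0.0", "3.0.0"}} {
		risk, _ := upgradeRisk(pair[0], pair[1])
		fmt.Printf("%s -> %s: %s\n", pair[0], pair[1], risk)
	}
}
```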


Upgrades

Make sure customers are used to a stable upgrade schedule; this allows them to plan on their side and include upgrades in their development milestones. Here are a couple of recommendations on upgrade schedules:

Publish bi-weekly upgrades for stateless services, so most of the customer’s deploys are up to date. These upgrades should not run any migrations or perform any dangerous or risky operations.

On a monthly schedule, provide more complex upgrades that involve database schema migrations.

We upgrade the platform (Kubernetes and dependencies) approximately every two months using Gravity’s LTS release upgrade schedules.

The Golang programming language is a very good example of a team publishing upgrades on a predictable schedule (in this case, every 6 months).


HOW TO MANAGE EXTERNAL DEPENDENCIES

Challenge

Many on-prem deploys are air-gapped, which means they cannot make any outbound internet calls to function or update. This makes it impossible to provide installations, patches or updates that pull dependencies from external resources.

Solution

We designed our deployments to be entirely self-sufficient using the following methodologies:

The build process scans all Kubernetes Docker image dependencies and packages them with every install. See the documentation on tele build.

We ship a self-contained Docker registry that hosts the images, so a cluster remains highly available and can pull images from local registries instead of pulling from the internet.


HOW TO PUBLISH INSTALLABLE SOFTWARE

Challenge

SaaS companies are usually not familiar with the process of publishing downloadable software. Sending out binaries without an official, centralized way to download and validate software appears unprofessional and results in a bad user experience. Customers end up sharing password-protected FTP endpoints and sending passwords over email. In addition, you need a seamless process for sending out updates and patches and for monitoring the status of each download.

Solution

We built a way for our customers to publish applications so their users can install, download and pull updates manually (for offline situations) or automatically, depending on their security and deployment practices.


SECTION 3: THE ROAD TO PRODUCTION


HOW TO MANAGE OPEN SOURCE SOFTWARE DEPENDENCIES

Challenge

Many times you will be asked to provide a full list of the third-party software used, with all the versions and dependencies shipped with the product. This is to make sure there is no copyright infringement and to reduce the likelihood of vulnerabilities. It requires scanning the product for licenses, collecting all the software versions and assessing the license dependencies. This can take some time and can block a deal until it is completed.

Solution

We recommend using Fossa to set up ongoing scans for every pull request. If you do come across a restrictive license, we recommend checking in with a copyright lawyer (we use Silicon Legal), who can provide guidance and assistance with your questions. You can also reference TLDR to educate yourself on the most common licenses.

For Docker containers, use private registries with security scanning capabilities that can show the software and all reported Common Vulnerabilities and Exposures (CVEs). Quay.io is a good example.


HOW TO PASS SECURITY AUDITS

Challenge

It is likely that one of the major reasons your customer requested an on-prem install is tight security requirements. This will usually lead to a full security audit to get a green light on the production deploy, especially if the customer is a regulated entity like a bank or government agency. You may need to redesign a deployment on short notice if vulnerabilities are discovered.

Solution

On the application level, here are some important steps to take to make sure a Kubernetes application is ready for a customer-driven external audit:

Application security

Infrastructure security audits vary in the level of thoroughness, but usually they all consist of network security scans and application black box scans.

A network security scanner will find any ports that respond with plain-text HTTP or use weak ciphers and older protocols like SSLv2. An application security scanner will find basic vulnerabilities, for example whether the server discloses its version to unauthenticated clients or contains dependencies with versions known to be vulnerable to CSRF attacks. In addition, a security auditor can conduct a more advanced review by trying to find hard-coded secrets in the code or break into the application.

Here are some guidelines on how to get the application ready for the audit:

• Set up mutual TLS in your application using sidecar patterns. As a rule of thumb, there should be no unencrypted data flowing between servers.

• Do not use the same static passwords/API keys for every install; generate them on the fly during the installation process (see the sketch after this list).

• Disable weak ciphers; use Mozilla’s recommendations as a starting point.

• A common gotcha with TLS: if the web page or endpoint is external (customer facing), make sure TLS ciphers and certificates are configurable, as all large customers have their own guidelines and requirements.

• Focus on common web security issues by going through OWASP Top 10.
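
Here is a minimal sketch covering two of the points above: generating per-install credentials instead of shipping static secrets, and a hardened TLS configuration with old protocols and weak ciphers disabled. The function names are illustrative, and the cipher list is only a starting point to adapt to your customers' requirements.

```go
// A minimal sketch of per-install credential generation and a hardened TLS config.
package main

import (
	"crypto/rand"
	"crypto/tls"
	"encoding/base64"
	"fmt"
)

// newAPIKey returns a random, URL-safe API key generated at install time.
func newAPIKey() (string, error) {
	buf := make([]byte, 32) // 256 bits of entropy
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// hardenedTLS disables old protocols and weak ciphers; keep the list
// configurable so large customers can apply their own requirements.
func hardenedTLS() *tls.Config {
	return &tls.Config{
		MinVersion: tls.VersionTLS12,
		CipherSuites: []uint16{
			tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
			tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
			tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
			tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
		},
	}
}

func main() {
	key, err := newAPIKey()
	if err != nil {
		panic(err)
	}
	fmt.Println("generated per-install API key:", key)
	_ = hardenedTLS() // pass this to your http.Server or listener configuration
}
```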


Once your product is ready to go post-POC, it is helpful to engage a third-party security review agency to conduct an external review. We recommend Cure53, as we have had positive experience working with them over the past years and they will publish their work upon request.

Kubernetes Deployment Security

Kubernetes deployments have their own security gotchas that will be important at the time of the audit.

Set up a restrictive Kubernetes deployment by following fine-grained security policies. For example, make sure that containers are not privileged and not running as root if they don’t need to be.

Use Kubernetes secrets to store infrastructure secrets like API keys and database passwords.

If the application is not ready to set up and handle TLS in a scalable way on its own (for example Python or Node.js services), it is helpful to set up a proxy sidecar container that terminates TLS and sends traffic to the local app. Read more on sidecar containers here.
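
As an illustration, here is a minimal sketch of such a TLS-terminating sidecar: it accepts HTTPS and proxies plain HTTP to the application on localhost. The certificate paths and ports are assumptions; in Kubernetes the certificate would typically be mounted from a secret.

```go
// A minimal sketch of a TLS-terminating sidecar proxy.
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// The local Python/Node.js service that does not handle TLS itself.
	app, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(app)

	srv := &http.Server{
		Addr:      ":8443",
		Handler:   proxy,
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
	}
	// /var/run/tls/{tls.crt,tls.key} are placeholders for a mounted secret.
	log.Fatal(srv.ListenAndServeTLS("/var/run/tls/tls.crt", "/var/run/tls/tls.key"))
}
```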

Gravity itself is reasonably audit-ready by using mutual TLS on the control plane and following the security best practices for Kubernetes deployments.


HOW TO OFFER AND ENFORCE EVALUATIONS

Challenge

It’s much easier to monitor usage of hosted software than installable software. With installable software, you need to figure out a way to monitor and enforce usage according to the license. In addition, evaluation or POC periods can extend beyond the intended time frame without enforcement, which leads to longer sales cycles.

Solution

Selling downloadable software requires a certain level of trust. Our position is that if someone really wants to pirate your software, they will likely succeed. Instead of spending expensive engineering cycles creating “unhackable” software, we recommend limiting your dealings to reputable customers who would not risk their reputation by knowingly using your software illegally.

Many customers will not want to report usage automatically back to you, given one of the reasons for running the application on-prem may be data privacy. So you’ll need to allow for some other reporting mechanism in the contract. Many customers will send quarterly summary reports and, in general, usage is usually bucketed into tiers or plans so that fine-grained usage reporting is not necessary.

There are also several third-party vendors that take care of license enforcement. In our experience they are either too complex or designed for legacy software, so adoption for SaaS offerings is a challenge. Initially, you may not need license enforcement to cover all use cases, but a time-based “reminder” flow for trials is a good minimal implementation.
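
A minimal sketch of such a time-based reminder flow, assuming an expiration date is stamped into the environment at build or install time (the variable name and date format are illustrative):

```go
// A minimal sketch of a time-based trial flow: warn as expiration approaches,
// refuse to start once it has passed.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	raw := os.Getenv("TRIAL_EXPIRES") // e.g. "2018-12-31", stamped at build/install time
	if raw == "" {
		return // not a trial build, nothing to enforce
	}
	expires, err := time.Parse("2006-01-02", raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid trial expiration date:", err)
		os.Exit(1)
	}
	switch remaining := time.Until(expires); {
	case remaining < 0:
		fmt.Fprintln(os.Stderr, "trial period has ended - contact sales to obtain a license")
		os.Exit(1)
	case remaining < 7*24*time.Hour:
		fmt.Printf("reminder: trial expires in %d day(s)\n", int(remaining.Hours()/24))
	}
	// ... continue normal startup
}
```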

Gravity does have a way to define a limited trial license in the application manifest. This will shut down the software or limit the number of servers it is used on during the trial period to motivate the customer to close faster.


HOW TO MANAGE HA DATABASE DEPLOYMENTS

Challenge

It is very difficult to deploy a traditional database on-premise in a highly available manner without risking data loss. Unfortunately, Kubernetes does not bring an out-of-the-box solution to the problem.

Solution

There are entire books written about this. To keep it short, if you don’t have significant in-house expertise with a database, find a good partner that will provide a production-ready deployment of the database on Kubernetes that they will support. For example, we partner with Citus Data to deliver production-ready HA Postgres for on-premise deployments.


HOW TO SIMPLIFY THE OPERATIONS OF KUBERNETES

Challenge

Kubernetes is a complex system that consists of a distributed database (etcd), an overlay network (VXLAN), a container engine (Docker), a Docker registry and many other components, such as iptables rules, to keep in mind.

A successful install is just the beginning. The platform will degrade over time. Here are just some of the problems we have encountered in the past:

• Security team automation blocking ports and stopping services without warning.

• Monitoring daemons set up by the customer consuming all RAM on the host.

• Running out of disk space.

• The customer’s DNS server blocking queries.

What happens if the platform fails, and how do you troubleshoot in a scenario where you don’t have access to the infrastructure? How does the customer even know if the platform is in a degraded state?

Solution

There is no easy solution to this problem; however, here are some steps we have taken to help manage Kubernetes:

Our tool, gravity status, helps to diagnose the most common reasons for cluster failure, reducing time to resolution. The tool provides fast checks for some common outages that we have seen in the past. Gravity uses our monitoring system, satellite, which constantly checks the parameters of the system, not only during the install but after the platform has been set up.

Gravity provides integrated alerting.

We offer training for field teams to help them understand Kubernetes and Docker architecture so they can become more efficient during troubleshooting sessions with customers.


HOW TO RECOVER FROM FAILURES

Challenge

Recovering a partially failed system can be harder than setting up a new one, as you don’t have fresh hardware to begin with and have to repair the system in place. In the absence of published runbooks, services teams will struggle to provide fast assistance to the customer.

Solution

We have published a series of runbooks targeting the most common cluster failure and recovery scenarios. We review the runbooks with customers, breaking clusters and recovering them, so services teams are comfortable providing assistance on the spot.


HOW TO MONITOR AND TROUBLESHOOT DEPLOYMENTS

Challenge

Actively monitoring a multitude of on-premise deployments is difficult. You may not even have access to the deployments. In order to provide proper support you need a consistent and scalable way to assess the situation and troubleshoot your deployments when issues arise.

Solution

Every install of your application should ship with pre-built alerts, application specific metrics and a dashboard so services teams and customers will get the same monitoring and visibility no matter which environment they are in.

Metrics and Alerts

Metrics and alerts go side by side - anomalies in metrics trigger alerts. Gravity integrates with the TICK stack and Grafana to create built-in application dashboards and alerts. The Google SRE book has great advice on setting up proper alerting and monitoring in the application.

Here are some tips on how to set it up with Gravity:

• Use the TICK stack integration to ship pre-built dashboards.

• Set up built-in alerts using the Kapacitor integration.

• Set up retention policies and rollups for application metrics, or use the ones shipped with Gravity by default.

Logging

The 12-Factor App manifesto provides good guidance on setting up logs as structured event streams. Docker and Kubernetes make it easy to collect logs for every application by capturing anything sent to stdout and stderr.
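
As a minimal illustration of that pattern, the sketch below writes structured events to stdout and leaves collection and forwarding entirely to the platform:

```go
// A minimal sketch of 12-factor style logging: structured events on stdout,
// so Docker/Kubernetes and any configured log forwarder can capture them.
package main

import (
	"encoding/json"
	"os"
	"time"
)

type event struct {
	Time    time.Time         `json:"time"`
	Level   string            `json:"level"`
	Message string            `json:"msg"`
	Fields  map[string]string `json:"fields,omitempty"`
}

func logEvent(level, msg string, fields map[string]string) {
	_ = json.NewEncoder(os.Stdout).Encode(event{
		Time:    time.Now().UTC(),
		Level:   level,
		Message: msg,
		Fields:  fields,
	})
}

func main() {
	logEvent("info", "service started", map[string]string{"version": "2.5.1"})
	logEvent("warn", "cache miss rate high", map[string]string{"rate": "0.42"})
}
```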


We use a 12-factor setup when deploying applications with Kubernetes, so the logs can be captured later. We can forward logs to the endpoint of the customer’s choice using a log forwarder configuration.

Status checks

Application metrics and alerts are great for debugging; however, most of the time customers only need an answer to one question: “Is everything up and running?” That’s why it’s important to provide “self checkers” or “smoke tests” - programs running in the cluster that make sure everything is in a good state. Once the checkers detect a failure, they communicate to the customer that the system is in a degraded state via the UI and alerts.

Our customers write application-specific “smoke test” programs and integrate them with status hooks to give their users a clear, visible notification if the application has degraded.
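
Here is a minimal sketch of such a smoke test: it probes a few health endpoints inside the cluster and exits non-zero if anything looks degraded, which a status hook can then surface in the UI and alerts. The endpoint URLs are placeholders for your own services.

```go
// A minimal sketch of a "smoke test" status check.
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	endpoints := []string{
		"http://api.default.svc.cluster.local:8080/healthz",
		"http://worker.default.svc.cluster.local:8080/healthz",
	}
	client := &http.Client{Timeout: 5 * time.Second}
	degraded := false
	for _, url := range endpoints {
		resp, err := client.Get(url)
		if err != nil {
			fmt.Printf("DEGRADED: %s (%v)\n", url, err)
			degraded = true
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			fmt.Printf("DEGRADED: %s (status %d)\n", url, resp.StatusCode)
			degraded = true
			continue
		}
		fmt.Printf("OK: %s\n", url)
	}
	if degraded {
		os.Exit(1) // a status hook can surface this in the UI and alerts
	}
}
```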

Sending reports

In most cases there is no easy way to access on-prem deployments, so we use our cluster management tooling to take a snapshot of all system logs and metrics and ship it to the development and support teams for inspection.


HOW TO SET UP DATA STORAGE

Challenge

There may not be any block storage available to use. Even if it is available, it may not be possible to integrate with it in a timely manner.

Solution

Here are some general storage recommendations:

• You may need to rely on local disk as the only available storage solution. Use Kubernetes host volumes to consume the local disk, and state clear disk requirements in Gravity’s application manifest.

• If the customer has an external NFS server, provide NFS integration endpoints powered by Kubernetes pluggable volumes to connect your application to it.

• Use clustered database deployments that are designed to work on bad hardware, like the Cassandra-powered S3 storage system, Pithos. Avoid unproven and experimental storage systems designed to work with Kubernetes, as well as systems with a large operational footprint like Ceph, unless you have an in-house Ceph team to handle the support load.

• For services doing simple metadata storage, consider using custom resources provided by Kubernetes. Custom resources provide a powerful abstraction, generating a versioned, secure API with RBAC that uses etcd as the backing storage.

• Avoid deploying risky storage methods, unless you have seasoned data storage expertise on the team. As a rule of thumb, try not to experiment with data storage combinations as a part of the on-prem release. Make sure that any deployment with mission critical data is vetted with a storage expert.

• Consider the operational costs of any database. For example, the ELK stack is easy to deploy, but extremely hard and expensive to manage.


HOW TO MAINTAIN DATA INTEGRITY

Challenge

Elastic block storage solutions at cloud providers hide the frequency of data corruption by using software- and hardware-powered data replication strategies. When going on-prem, this often won’t be available, and as a result you will encounter data corruption much more often.

Solution

We use the gravity backup subsystem to provide a way to back up and restore the important application state. We set up alerts to detect the absence of backups for a period of time, to make sure they are actually happening.
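
As an illustration of the "missing backups" alert, here is a minimal sketch that checks the age of the newest file in a backup directory and fails if it is older than the allowed window. The directory path and the 30-minute window are illustrative assumptions.

```go
// A minimal sketch of a backup freshness check.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	const backupDir = "/mnt/external-backups"
	const maxAge = 30 * time.Minute

	var newest time.Time
	err := filepath.WalkDir(backupDir, func(path string, d os.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.ModTime().After(newest) {
			newest = info.ModTime()
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot inspect backups:", err)
		os.Exit(1)
	}
	if newest.IsZero() || time.Since(newest) > maxAge {
		fmt.Fprintf(os.Stderr, "ALERT: no backup completed in the last %s\n", maxAge)
		os.Exit(1) // wire this exit code into your alerting integration
	}
	fmt.Println("latest backup:", newest.Format(time.RFC3339))
}
```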

Backups should be external - stored outside of the cluster’s storage. This makes it possible to quickly recover the system in case of data corruption. Solutions like zfs-snapshotting on the same disk won’t work when the disk is corrupted.

We test backup and restore functionality for every release in an automated way with the robotest suite to make sure that backups work as intended; otherwise, every release could introduce regressions.


HOW TO ACCESS DEPLOYMENTS REMOTELY

Challenge

Many customers will not allow remote access to their infrastructure. The ones that do will require robust security measures to be in place. So how do you set up a secure way to get access, or get the data you need to troubleshoot problems if you can’t get access?

Solution

In cases where remote access to the customer’s infrastructure is not possible, we rely on automated cluster management tools to get a snapshot view of the customer’s infrastructure. In addition, we provide them with training using runbooks to solve problems, with escalation available.

In cases where access is possible, you usually need to meet some restrictive requirements:

• Ability to limit the time and duration of access.

• Limit access privileges using role-based access controls.

• Never open any inbound internet-accessible ports.

• Audit and record every action performed.

• Use second-factor authentication and have the ability to revoke access completely.

• Use approved crypto standards and protocols and turn off weak ciphers.

We built Teleport to meet these requirements. Gravity fully integrates with Teleport and adds the ability to fine-tune remote access management.


HOW TO CONFIGURE NETWORKING

Challenge

Kubernetes ships with a specific set of requirements for overlay networks. Customers can face problems, especially in cases of complex network topologies. For example, they could have trouble making custom subnet ranges routable within their data center.

Solution

Gravity uses the simplest possible overlay network for Kubernetes, VXLAN, which encapsulates all traffic in UDP packets, does not need any special routing and only needs basic connectivity between machines. You can read more about VXLAN here.


SECTION 4: MANAGING ORGANIZATIONAL ISSUES


HOW TO SET EXPECTATIONS WITH CUSTOMER IT

Challenge

In the case of an on-premise deployment, operating the application is more akin to a partnership between vendor and customer. Many times, the customer will attempt to solve initial issues before escalating. In addition, it may not be clear whether a problem is due to customer infrastructure or the application. Most customers’ IT departments have been through frustrating episodes of trying to support vendors’ applications. There are some things you can do to be proactive in alleviating their fears of supporting yours.

Solution

The answer is highly dependent on the service and access level involved, but in any case, we highly recommend having a checklist that services teams can go over and share with the customer. Sharing this information will arm the customer’s IT people responsible for running the application with a clear production roadmap and give them peace of mind when going to production.

Here is a sample production checklist that you can use as a starting point:

System access

✓ If external SSO is necessary/available, it should be set up in advance. In practice, many customers delay this step and only contact support when something needs troubleshooting. You want to prevent this by having access set up beforehand.

Backups and alerting

✓ Platform alerting is integrated with the customer’s alerting system. Usually, every customer has email integration that can trigger alerts for their team. Make sure the integration actually works by triggering an alert.

✓ Backups are configured to external devices outside of the cluster. If backups are local, disk corruption will bring down both the local data and the backups.

✓ Alerts are sent if no backups have been made for 30 minutes or more. Many customers set up backups and forget about them until a problem occurs. Unfortunately, this is usually too late, as the backup script on the customer’s side could be broken. We recommend testing by breaking the backup script and waiting for the alert to fire right on the customer site.


✓ Procedures are clear for how to back up and restore the application from scratch. Customers will get peace of mind if they know how to completely recover the platform in the worst-case scenario, so it is helpful to review this process with them. This will also reduce the support load, as they can recover the platform themselves when they detect a problem.

Monitoring and troubleshooting

✓ Logs are being forwarded to the customer’s infrastructure logger of choice. Security and ops teams on the customer’s side will want to capture the most important logs from the application. Make sure they can do that by setting up logging and confirming that logs show up.

✓ The customer is aware of the simple recovery and troubleshooting runbooks for the cluster. Try out a couple of simple scenarios with customers - e.g., how to check that everything is running and how to interpret the output of basic commands.

✓ The customer knows how to check that the application is up and running using the status hook. Show the ops teams where to look first to get the application status in the user interface and via console commands.

✓ The customer has been walked through all built-in dashboards and charts and knows how to interpret the system’s health dashboard. Explain the meaning and significance of the built-in monitoring dashboards, for example how to read the memory and CPU utilization of the cluster.

Availability expectations

✓ System resilience expectations are communicated to the customer (e.g., the system can lose 1 node out of 3). As a rule of thumb, follow the optimal cluster size guidelines from the etcd admin guide. Customers usually don’t have a clear understanding of HA concepts, so make sure to communicate that they can only lose 1 node out of a 3-node cluster to keep the system running and recoverable.

✓ Cases in which a restore will be required are communicated to the customer (e.g., the majority of the servers holding a database are lost). As a follow-up to the previous point, make sure customers know when the system has to be recovered from backup, e.g. if 2 out of 3 nodes are lost on a 3-node cluster. This will help set the right expectations before the first support call happens.


✓ The application support and EOL cycle for every version is clearly communicated to the customer. Make sure customers know when to upgrade and when releases will no longer be supported. Publish a web or wiki page with the release schedule and EOL dates.

Advanced: Fire Drill exercises

This is a more advanced, but highly recommended, section of the checklist. The services team should conduct basic fire-drill exercises with the customer, if they are sharing operational responsibility, walking through basic failure/recovery scenarios:

✓ Perform a system reboot and check that the system is up and running. Check the system health after a reboot/OS upgrade.

✓ Recover a failed hardware node on an n >= 3 node cluster. Make sure the team knows how to remove the faulty node from the cluster and add a replacement.

✓ Run a basic disk/CPU pressure fire-drill exercise. It is very easy to simulate CPU pressure by running a CPU-consuming process as a container and on the host (a minimal sketch follows this list). Make sure the team can spot the process quickly.

✓ Show the customer how to troubleshoot basic networking problems by turning firewall rules on and walking through basic diagnostic tooling output. Make sure the customer can run the gravity status tool to see that there is a network problem.
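
For the CPU-pressure drill, almost any CPU-consuming process will do; here is a minimal sketch that burns all available cores for a few minutes so the team can practice spotting the culprit with top, kubectl top and the built-in dashboards.

```go
// A minimal CPU-pressure generator for fire-drill exercises.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	cores := runtime.NumCPU()
	fmt.Printf("burning %d cores for 5 minutes...\n", cores)
	done := time.After(5 * time.Minute)
	for i := 0; i < cores; i++ {
		go func() {
			for { // busy loop to generate CPU pressure
			}
		}()
	}
	<-done // the program exits here, releasing the CPU
}
```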


HOW TO PREVENT INTERNAL TEAM FRAGMENTATION

Challenge

Hopefully, your on-premise offering is successful. If it is, be prepared for organizational issues to emerge as the offering matures. The most common is fragmentation into two separate teams, the “cloud team” and the “on-prem team”.

Solution

We recommend avoiding this, if at all possible. Instead, train your existing teams to fully migrate to Kubernetes as the only deployment platform supported in the company. Rotate ops and services engineers between cloud and on-prem support on a bi-weekly or monthly schedule. Set up a rule for the teams to use the exact same deployments for the on-prem and cloud application. As a result, there will be no fragmentation, since the same deployment is used in the cloud and on-prem. In our experience, this is important for maintaining team morale.


CONCLUSION

The road to a successful and efficient on-premise offering is hard but can be worth it. You should do your due diligence and make sure you are aware of the steps that should be taken, the investment required and the appropriate amount to charge in order to recover that investment.

We have seen companies significantly increase their revenue and establish deeper relationships with their key customers by offering on-premise editions of their cloud software. It also signals maturity and leadership among competitors.

There is no silver bullet for solving all these challenges, and no platform can claim to be a complete out-of-the-box solution, but platforms can help alleviate many of the common challenges you will encounter.

When done correctly, the end result is a highly trained and motivated team, an improved deployment process, higher revenues and a successful extension of the business line.

We wish you the best in your endeavors!

- The Gravitons
