Static provisioning – Resources & People

It takes a team to manage servers, networks, databases, and infrastructure, and performing maintenance and changes can cost a company in both overtime pay and employee morale. Even the best provisioning of static resources leaves some 80 percent of available computing power unused. A few years ago, YP began working on an elastic compute solution, a concept that enables computing resources (such as CPU processing power, memory, bandwidth, and disk usage) to be easily scaled up and down as needed. Using cutting-edge technologies and some in-house ingenuity, we created a solution that has enabled us to overcome two primary issues and manage our workload dynamically.

Elastic Compute can help companies overcome two main problems caused by static provisioning. First, static provisioning of computing resources causes over-provisioning and ridiculously low utilization of available computing power.

For most IT organizations across the spectrum, the typical situation looks like this: companies provision static resources for their applications, but when those applications are actually profiled, it becomes clear they are over-provisioned. On average, they use just 20 percent or less of the available computing power.

Organizations have been well aware of the problem, but what can they do about it? Buy cheaper hardware, try to optimize applications to consume fewer resources, or pack as many applications as they can onto a single machine.

Second, static provisioning of people who manage that computing infrastructure requires a team and can be costly.

Organizations without dynamic elastic compute technology need a dedicated team to manage servers, networks, databases, and the rest of the infrastructure, watching over it and performing changes and maintenance. Any change requires a long maintenance window involving all stakeholders and lots of overtime pay, and results in increased team member frustration and a less than favorable work/life balance.

The solution, incorporating new technologies, solves both problems by putting systems on auto-pilot

With the advent of Mesos, Docker, and related projects, the ecosystem has unleashed a wide range of technologies that enable you to put systems and infrastructure on auto-pilot, and in doing so they give IT the tools to add value to a business.

Enterprise-level Solution

Understanding the potential of these technologies, YP embarked on a journey a couple of years ago to bring effective change to the organization. We quickly realized that while the new technologies are genuinely cutting-edge, with a lot of work we could turn them into a real enterprise-level offering to support production workloads.

Because the workload was to run in Docker containers, we needed several core capabilities the ecosystem lacked at the time, such as centralized logging, provisioning application secrets in the containers, persistent storage, and application configuration management. We didn't want to wait for these features to become available in Docker and Mesos, so our talented engineering team took the bull by the horns and developed those solutions in-house. We made some of those contributions open source, so you can check them out on the YP Engineering GitHub account. With those solutions in place, we are running a heterogeneous, containerized workload in production.

Through this experience, we learned valuable lessons about sustaining and scaling that workload dynamically. We incorporated several key components on top of Mesos and Docker that work together to make Elastic Compute an enterprise-level solution.

Our engineering team has been invited to share its expertise at a number of conferences, including USENIX, SCALE (Southern California Linux Expo), and MesosCon, where we presented "Lessons Learned from Running Heterogeneous Workloads on Mesos." A video of the talk is available online.

Application Configuration Management with Mesos

Configuration management has always been a big-ticket item, and it is a much bigger one now that we are running applications at scale with orchestration systems like Mesos. Everyone has their own opinion and their own way to specify and retrieve configuration for an application: some use a key-value store, some use a homegrown solution, and some tie it to their version control repositories. We want a way to systematically specify, control, and account for changes to an application running through an orchestration system, and to maintain software integrity and traceability throughout the lifecycle of that application.

With all the effort invested in creating multiple frameworks for Mesos, adding new features to the Mesos kernel, and building nice frontends like DCOS from Mesosphere, very little effort has been spent on the configuration management of an application running through Mesos. As things stand right now, if you run a containerized workload, you bake the configuration of your application into the image itself. You can make it somewhat more dynamic by passing ENV values through the framework to the containers. But then again, how many ENV variables will you expose if there are multiple configurations for different environments? That is the first part of the issue.
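To make this concrete, here is a minimal sketch of that ENV-through-framework approach, assuming a Marathon app definition (the image name and variables are hypothetical):

    {
      "id": "/myapp",
      "container": {
        "type": "DOCKER",
        "docker": { "image": "registry.example.com/myapp:1.0" }
      },
      "env": {
        "DB_HOST": "db.prod.example.com",
        "LOG_LEVEL": "info"
      },
      "instances": 2,
      "cpus": 0.5,
      "mem": 256
    }

Every new environment or endpoint means threading yet another variable through the framework, which does not scale past a handful of configurations.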

The second part is how you would store those configurations externally and how they would interact with Mesos's orchestration of containers. Let's say you are able to retrofit your existing configuration management system onto Mesos; how would you build in an access control mechanism when the containers are ephemeral and each has a unique name?

That brings us to the third problem. What if you want your applications to automatically adapt when the infrastructure changes or one of your endpoints moves? How would the applications pick up those changes, reconfigure themselves, and adapt without having to be relaunched through the framework (for example, Marathon)?
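As a rough sketch of what adapting in place could look like, assuming a hypothetical key-value store endpoint and an app that re-reads its config on SIGHUP, a sidecar loop might do something like this:

    #!/bin/bash
    # Hypothetical sidecar: poll a config store and signal the app to reload.
    OLD=""
    while true; do
      NEW=$(curl -s http://config-store.example.com/v1/kv/myapp/endpoint?raw)
      if [ -n "$NEW" ] && [ "$NEW" != "$OLD" ]; then
        echo "$NEW" > /etc/myapp/endpoint        # rewrite the config file
        kill -HUP "$(cat /var/run/myapp.pid)"    # app re-reads config on SIGHUP
        OLD="$NEW"
      fi
      sleep 5
    done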

These are burning questions that one needs to address before running a heterogeneous workload on Mesos. Things would be simple if you were running a simple workload with fairly static configuration; with Mesos, you can scale such applications to millions of copies (provided you have the resources). But things are tremendously different when you have a heterogeneous workload with a myriad of configurations.

We at YP are one of the big Mesos shops. We have spent a significant amount of time and energy running heterogeneous workloads on Mesos and have written tools and technologies around it. Since we are leveraging open source technology for our benefit, we feel we should share our findings and solutions to help build a healthy Mesos ecosystem.

I would love to talk about this at #MesosCon2016 if the proposal is accepted. Register for the conference if you haven't.

IT shouldn’t be a cost-center with Docker

IT is critical for every organization, big and small. It touches all the business processes and data in every company. Every year, organizations allocate budget to keep the "lights on": the #1 priority of IT is to make sure the business runs as usual. That reinforces the view of IT as a cost center, where every effort is focused on minimizing downtime rather than providing value to the organization. That is how things are and how most companies are run, and the "C-Suite" bosses are happy to cut the fat checks and be content with it. There is nothing wrong with that, except that you can't improve on it, and you can't convert that cost center into a profit center.

If you are running a data center of your own, outsourcing to cloud services, or running a hybrid cloud, you need a whole ensemble of a team to manage your systems, networks, databases, infrastructure, and so on. They babysit the infrastructure, perform changes, and do maintenance. Any change to the state of production requires a long maintenance window involving all the stakeholders, tons of overtime pay, increased frustration, and no family life. It is not only people and processes: you also accumulate computing power along the way, piling up racks in the data center. You end up with tons of wasted resources, hardly using even 20% of the computing power you have available. Unfortunately, this is the sorry state of affairs for IT departments in the majority of companies.

At last, the IT gods have smiled upon us. Things don't have to stay the way they were, thanks to the advent of technologies like docker. Docker and its ecosystem have unleashed a myriad of technologies that let you put your systems and infrastructure on auto-pilot, and they provide the tools that allow your IT to add value to your business. I am not claiming we can solve every issue with old-school IT, but there are already tools at our disposal which, when combined with docker, can help us solve the genuinely complex IT problems we have been dreading all these years. Much has been said about docker being a developer's friend, but it can just as truly alleviate IT Ops problems and help us run a real lean DevOps IT organization.

Using docker successfully to manage and run part of our IT operations at YellowPages, we have combined its capabilities with technologies like push-based metrics, service discovery in an ephemeral environment, orchestration tools, build pipelines, and message queues to solve real-world IT problems. We overcame a lot of new challenges along the way, and there is still plenty of work to be done before docker becomes an IT department's best friend. In short, things are looking up: there is quite a lot we can do with docker to evolve the ecosystem and make IT a true profit center.

Build time secrets with docker containers

Check out the first part of this blog post if you are looking to provide secrets at container runtime.
Part 1: Runtime Secrets with Docker Containers

So far we have talked about runtime secrets, but you also need secrets during the build time of docker images. For example, during your build process you may need:

  • To hit a private repo to pull a dependency.
  • Private keys for a remote SSH connection to pull code.
  • Proxy settings, because the build environment sits behind a different proxy than the deploy environment.
  • Tons of other use cases.

There isn't a single solution that addresses all the use cases listed above. For some, there are features available in the latest docker version (1.9); for some, there are third-party solutions; and for the rest, there are custom hacks. I will go through them by use case.

Solution 1: Dockerfile ENV variable

You can use the ENV directive in a Dockerfile to define variables and use them with other primitives in the Dockerfile. In my view, this just helps keep your Dockerfile tidy; it shouldn't be used for passing secrets or any sensitive information during the build process (a concrete sketch follows the list below). Here's why:

Issues:

  • ENV variables are preserved in the final image as well as in all the intermediate layers. Anyone can run "docker inspect" and see their values, so if you use this approach to pass secrets for your build-time dependencies, you risk exposing them to anyone who pulls the image.
  • Your Dockerfile is static with this approach. You cannot override ENV values if you are building the image from various hosts that require different proxies.
  • Be mindful that the variables (secrets) are written to disk in every intermediate and final layer of the image.
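To make the first issue concrete, here is a minimal sketch (the token, URL, and image name are made up):

    # Dockerfile: secret baked in via ENV (anti-pattern)
    FROM ubuntu:14.04
    RUN apt-get update && apt-get install -y curl
    ENV REPO_TOKEN=s3cr3t-token
    RUN curl -H "Authorization: token $REPO_TOKEN" \
        https://repo.example.com/dep.tar.gz -o /tmp/dep.tar.gz

    # Anyone who pulls the image can recover the value:
    docker inspect --format '{{.Config.Env}}' myimage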

Solution 2: Docker build time variables

To overcome the issue of variables persisting in the intermediate and final images, Docker introduced the --build-arg flag in its 1.9 release. It allows you to pass build-time variables that are accessed like regular environment variables in the RUN directive, and their values don't persist in the intermediate or final images the way ENV values do.

A good example is http_proxy, or source versions for pulling intermediate files. The ARG instruction lets Dockerfile authors define values that users can set at build time using the --build-arg flag:

Example: docker build --build-arg HTTP_PROXY=http://10.20.30.2:1234 .
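On the Dockerfile side, this might look like the following sketch:

    # Declare a build-time variable; it is available to RUN below,
    # but does not persist into the image the way an ENV value does
    ARG HTTP_PROXY
    RUN http_proxy=$HTTP_PROXY apt-get update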

Issues:

mapuri mentions that it shouldn't be used when image caching is turned on.

Solution 3: Flattening Images

For fear that their variables are visible in the intermediate layers of the image, people have started flattening their images. This also reduces the overall size of the image and can save on image upload/deploy time.

There are quite a few projects that let you do that (a usage sketch follows the list):

  • docker-squash from jwilder
  • Or you can do it yourself by exporting and importing the image:
    ID=$(docker run -d image-name /bin/bash)
    docker export $ID | docker import - flat-image-name
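For instance, with docker-squash the flattening step can be wired into the save/load flow roughly like this (a sketch of typical usage; check the project's README for the exact flags):

    # Squash all layers of an image into one before pushing it anywhere
    docker save myimage | sudo docker-squash -t myimage:squashed | docker load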

Issues:

  • You have to add flattening (a hack) to your build and deployment workflow.
  • It removes the secrets from the intermediate layers, but they may still end up in the build cache.
  • It adds unnecessary complexity when you are building images automatically in a CI/CD pipeline.
  • Be mindful that the variables (secrets) are still written to disk in every intermediate layer and in the flattened final layer of the image.

Solution 4: Hosting secrets on a server

To serve secrets or SSH keys to the build process, there are various tools available, like vault from dockito, which runs in its own container and serves the key over HTTP. During the build, it's invoked from the Dockerfile using a special RUN directive, where you just pass it the command that requires the secret; it fetches the key from the server and executes the command.

docker-ssh-exec is a bit more advanced than vault from dockito: it fetches the key over the network from the server container, writes it to disk, executes the desired command, and then removes the key. Since all of this occurs within the scope of a single RUN directive, the key data is never written into the resulting filesystem layer.
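The general pattern is a single RUN directive that fetches the key, uses it, and deletes it before the layer is committed (the key server and repo below are hypothetical):

    RUN mkdir -p /root/.ssh \
        && curl -s http://keyserver.example.com/deploy_key > /root/.ssh/id_rsa \
        && chmod 600 /root/.ssh/id_rsa \
        && ssh-keyscan repo.example.com >> /root/.ssh/known_hosts \
        && git clone git@repo.example.com:org/private-repo.git /app \
        && rm -f /root/.ssh/id_rsa   # the key never survives into the layer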

Conclusion

There isn't a single solution that satisfies every use case for build-time secrets. People have grown restless about the lack of features in docker, so they have gone ahead and added their own custom hacks.

As mentioned by thaJeztah, there are lots of PRs (pull requests) at Docker still pending:

  • Add private files support #5836
  • Add secret store #6075
  • Continuation of the docker secret storage feature #6697
  • "The Docker Vault" #10310
  • Provide roadmap / design for officially handling secrets. Make injecting secrets pluggable, so that they use existing offerings in this area, for example: Vault, Keywhiz, Sneaker

Runtime secrets with docker containers

We at YP have been using docker containers for quite some time now. Onboarding onto docker wasn't always easy: there are lots of things to account for before running a docker container in production, and one of them is how to deal with secrets at runtime.

We have done significant work on that front. Over multiple blog posts, I will discuss the problem and potential solutions for injecting secrets into docker containers. In this post, I will talk about how people use secrets with docker containers today, and the issues with each approach.

Why are secrets important?

Secrets are important for every application. Some of the application secrets that you may need are:

    • database credentials
    • api tokens
    • ssh keys
    • TLS certificates
    • GPG keys etc.

Traditionally, we have stored these secrets in encrypted packages, in a "secrets store," or simply as part of the source code. That was all well and good, but we cannot use the same solutions with docker images. So how do we use secrets with docker containers?

Solution 1: Baking it in the image

This one is straightforward: you just put the secret in as part of the image. It is the first thing you will do when onboarding your app onto docker. Maybe you will put it under some dot file, chown it to root, and think that everything is fine. This is the most prevalent anti-pattern in security.
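In Dockerfile terms, the anti-pattern looks something like this (the file names are made up):

    # Anti-pattern: the credentials file becomes a permanent layer of the image
    FROM ubuntu:14.04
    COPY .db_credentials /root/.db_credentials
    RUN chown root:root /root/.db_credentials && chmod 600 /root/.db_credentials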

Issues:

  • When the image is published to any registry, anyone can pull it and the secrets are at their disposal.
  • Neither GitHub nor Docker Hub nor your own registry is designed to securely store or distribute secrets.
  • Updating a secret is a tedious job: every image has to be updated.
  • This might still be OK if you have a small number of images, but once a CI/CD pipeline is tied into your image build process, you are managing tons of images.
  • Accounting for certificate expiration becomes difficult.
  • Old, EOL/EOS, or decommissioned hardware can leak secrets.

Solution 2: Put it under ENV variables

This is the most common way to pass secrets to applications (more than 90% of people do it). It is widely used in part because the 12-factor app guidelines recommend storing configuration in the environment.
Example: docker run -it -e "DBUSER=dbuser" -e "DBPASSWD=dbpasswd" myimage /bin/bash

Issues:

thaJeztah and diogomonica have written in detail about best practices for using secrets; I am just summarizing the issues with this solution here:

  • Values are kept in intermediate layers of the image and can easily be viewed using "docker inspect".
  • They are accessible to all processes in the container and can easily be leaked.
  • They are shared with any linked container.
  • It is incredibly common for an app to grab the whole environment and print it out, or even send it as part of an error report or a PagerDuty alert.
  • Environment variables are passed down to child processes. Imagine you call a third-party tool to perform some action: all of a sudden, that third party has access to your environment.
  • It is very common for apps that crash to dump environment variables into log files for debugging.

Solution 3: Volume Mounts

This is as straightforward as passing ENV variables: you put your secrets in some directory structure on the docker hosts, which can live on a local filesystem, on NFS, or on a distributed filesystem like Ceph. You then mount the right directory inside the container for that particular app.
Example: docker run -i -t -v /mnt/app1/secrets:/secrets myimage /bin/bash

Issues:

  • Putting all the secrets for all the images on a single machine is bad design.
  • The secrets sit unencrypted, in plain text.

Solution 4: Secrets encryption

Some people are paranoid about keeping their secrets in plain text, and even more paranoid about pushing an image with plaintext secrets to a private or public docker registry. So they encrypt the secrets with public-key (elliptic curve) cryptography using tools like "ejson" from Shopify. To decrypt, private keys are hosted on the docker hosts, and those production machines are locked down. At least this way your image is safe from snooping.
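For example, with ejson the workflow is roughly as follows (keys abbreviated; see the project docs for specifics). You keep an encrypted JSON file alongside the app, and only the locked-down docker hosts hold the private key needed to decrypt it:

    # secrets.ejson: values are encrypted against the embedded public key
    { "_public_key": "6ff0...", "db_password": "EJ[1:...ciphertext...]" }

    ejson encrypt secrets.ejson   # re-encrypt after editing a value
    ejson decrypt secrets.ejson   # on the docker host, emits plaintext JSON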

Issues:

  • To update secrets, you need to create new images.
  • The solution is fairly static.
  • You can still see which private keys are used for decryption using "docker inspect".

Solution 5: Secrets store

There are secrets management and distribution services like HashiCorp's Vault, Square's Keywhiz, and Sneaker (for AWS) that help you generate and distribute secrets for services. The main benefit of this approach is that secrets are centrally managed in a secure manner, and access to them is auditable. Almost all of these solutions are API-based and mostly reliable.
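As an illustration of that API-based access, reading a secret from HashiCorp's Vault is a single authenticated HTTP call (the token and path below are placeholders):

    # Read a secret from Vault's KV backend over its HTTP API;
    # the response is JSON like {"data": {"password": "..."}}
    curl -s -H "X-Vault-Token: $VAULT_TOKEN" \
        http://vault.example.com:8200/v1/secret/myapp/db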

There is already an integration of the Keywhiz secrets store with docker as a volume-driver plugin. It is the most robust solution of the bunch and is already integrated with docker, but it only helps if docker (or docker swarm) is the only way you manage and run your containers; the plugin doesn't extend well if you are using orchestration tools like Mesos or Kubernetes.

If you orchestrate containers through Mesos or Kubernetes, watch for my next series of posts covering the solution.

Persisting solutions for the ephemeral containers

With the advent of Docker, the container scene has exploded and is taking everyone by storm. It has gotten everyone excited, from developers to QA engineers, from deploy managers to system administrators. Everyone wants to adopt it and start incorporating it into their workflow.

However, there are inherent challenges to managing, running, and operating such systems at scale. The first challenge is trying to run ephemeral containers in a static world of hardware. Most of us are trying to retrofit existing solutions and mindsets onto the new way of doing things. Others are focused on building orchestration tools like Kubernetes, Mesos, or Cloud Foundry's Lattice. I think of these orchestration tools as the kernel of a data center operating system, but very little focus has been given to building the tooling around that kernel to make a truly distributed, GNU-like operating system for the data center. Centralized logging, monitoring and alerting, metrics collection, persistent storage, service discovery: these are the pieces that still need to solidify in the container ecosystem.

For comparison, we need only go back a few decades to see how the GNU operating system evolved. We should put our GNU hat on and see how a collection of applications, libraries, developer tools, and even games came together around a solid Linux kernel.

We at YP have devised some solutions to these problems. You can check out some of the work we have made open source.

Sysdig has also compiled "The Container Ecosystem Project." Please check it out; there are some really interesting technologies mentioned there.

As I mentioned, there is a lot of movement, and everybody is trying to get a head start by developing technologies that work for them. I feel the GNU god has to come down once again to show us the right way to consolidate all these disjointed systems into a true data center operating system.