Static provisioning – Resources & People

It takes a team to manage servers, networks, databases, and infrastructure, and performing maintenance and changes can cost a company in both overtime pay and employee morale. Even the best provisioning of static resources leaves some 80 percent of available computing power unused. A few years ago, YP began working on an elastic compute solution, a concept that enables computing resources (such as CPU, memory, bandwidth, and disk) to be easily scaled up and down as needed. Using cutting-edge technologies and some in-house ingenuity, we created a solution that has enabled us to overcome two primary issues and manage our workload dynamically.

Elastic Compute can help companies overcome two main problems caused by static provisioning. First, static provisioning of computing resources leads to over-provisioning and remarkably low use of available computing power.

For most IT organizations, the typical situation looks like this: companies provision static resources for their applications, but when those applications are actually profiled, it becomes clear they are over-provisioned, using just 20 percent or less of available computing power on average.

Organizations have been well aware of the problem, but what can they do about it? Buy cheaper hardware, optimize the application to consume fewer resources, or pack as many applications as they can onto a single machine.

Second, static provisioning of people who manage that computing infrastructure requires a team and can be costly.

Organizations that don’t have dynamic elastic compute technology need a dedicated team to manage servers, networks, databases, and the rest of the infrastructure, watching over it and performing changes and maintenance. Any change to that infrastructure requires a long maintenance window involving all stakeholders and plenty of overtime pay, and it results in increased team frustration and a less-than-favorable work/life balance.

The solution, incorporating new technologies, solves both problems by putting systems on auto-pilot

With the advent of Mesos, Docker, and related projects, the ecosystem has unleashed a wide range of technologies that enable you to put systems and infrastructure on auto-pilot, freeing IT to add value to the business instead of babysitting machines.

Enterprise-level Solution

Understanding the potential of these technologies, YP embarked on a journey a couple of years ago to bring effective change to the organization. We quickly realized that while the new technologies are genuinely cutting-edge, it would take a lot of work to turn them into a real enterprise-level offering that could support a production workload.

Because the workload was going to run in Docker containers, we found that the ecosystem still lacked several core features, such as centralized logging, provisioning application secrets into containers, persistent storage, and application configuration management. We didn’t want to wait for these features to become available in Docker and Mesos, so our talented engineering team took the bull by the horns and developed those solutions in-house. We have open-sourced some of them, so you can check them out on the YP Engineering GitHub account. With those solutions in place, we are running a heterogeneous, containerized workload in production.

Through this experience, we learned valuable lessons about sustaining and scaling that workload dynamically. We incorporated several key components on top of Mesos and Docker that work together to make Elastic Compute an enterprise-level solution.

Our engineering team has been invited to share its expertise at a number of conferences, including USENIX, SCALE (Southern California Linux Expo), and MesosCon, where we presented “Lessons Learned from Running Heterogeneous Workload on Mesos.” A video recording of the talk is available online.

Application Configuration Management with Mesos

Configuration management has always been a big-ticket item, and it is a much bigger one now that we are running applications at scale with orchestration systems like Mesos. Every team has its own opinion and its own way to specify and retrieve configuration for an application: some use a key-value store, some a home-grown solution, and some tie configuration to their version control repositories. We want a way to systematically specify, control, and account for changes to an application running through an orchestration system, and to maintain software integrity and traceability throughout that application’s lifecycle.

With all the effort invested in creating multiple frameworks for Mesos, adding new features to the Mesos kernel, and building a nice frontend like DCOS from Mesosphere, very little effort has been spent on configuration management for applications running through Mesos. As things stand right now, if you run a containerized workload, you bake your application’s configuration into the image itself. You can make it somewhat more dynamic by passing ENV values through the framework to the containers, but how many ENV variables are you willing to expose when there are multiple configurations for different environments? That is the first part of the issue. Both approaches are sketched below.
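To make the trade-off concrete, here is a minimal sketch of the two options described above; the image name and variables are illustrative, not taken from our production setup:

    # Option A: configuration baked into the image at build time; every
    # config change means rebuilding and republishing the image
    docker build -t myapp:prod .

    # Option B: one generic image, with environment-specific values
    # injected at launch; workable, but every new setting becomes yet
    # another ENV variable exposed through the framework
    docker run -e APP_ENV=prod -e DB_HOST=db.prod.example.com myapp:latest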

The second part is how you would store those configurations externally, and how they would interact with Mesos’s orchestration of containers. Let’s say you are able to retrofit your existing configuration management system onto Mesos: how would you build in an access control mechanism when the containers are ephemeral and each has a unique name?

That brings us to the third problem. What if you want your applications to automatically adapt when the infrastructure changes or one of your endpoints moves? How would the applications pick up those changes, reconfigure themselves, and adapt without having to be relaunched through a framework (for example, Marathon)?
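As a rough illustration of the kind of behavior we are after (and not a description of our in-house tooling), a sidecar watcher such as consul-template can re-render a configuration file and signal the application whenever a watched value changes; the paths and the reload command below are hypothetical:

    # Re-render /etc/myapp/myapp.conf from a template whenever the
    # watched keys change in the store, then HUP the app so it reloads
    consul-template \
      -template "/etc/myapp/myapp.conf.tpl:/etc/myapp/myapp.conf:kill -HUP 1"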

These are real, burning questions that need to be addressed before running a heterogeneous workload on Mesos. Things would be simple if you were running a simple workload whose configuration is fairly static; with Mesos, you can scale such applications to millions of copies (provided you have the resources). But things are tremendously different when you have a heterogeneous workload with a myriad of configurations.

We at YP are one of the bigger Mesos shops. We have spent a significant amount of time and energy running a heterogeneous workload on Mesos and have written tools and technologies around it. Since we are leveraging open source technology for our benefit, we feel we should share our findings and solutions to help build a healthy Mesos ecosystem.

I would love to talk about this at MesosCon 2016 if the proposal is accepted. Register for the conference if you haven’t already.

Runtime secrets with Docker containers

We at YP have been using Docker containers for quite some time now. Onboarding onto Docker wasn’t always easy; there are lots of things to account for before running a Docker container in production. One of the things to address is how to deal with secrets at runtime.

We have done significant work on that front. Over multiple blog posts, I will discuss the problem and potential solutions for injecting secrets into Docker containers. In this post, I will talk about how people currently use secrets with Docker containers and the issues with each approach.

Why are secrets important?

Secrets are important for every application. Some of the application secrets that you may need are:

    • Database credentials
    • API tokens
    • SSH keys
    • TLS certificates
    • GPG keys, etc.

Traditionally, we have stored these secrets in encrypted packages, in a dedicated “secrets store,” or simply as part of the source code. That was all well and good, but we cannot reuse those solutions as-is with Docker images. So how do we use secrets with Docker containers?

Solution 1: Baking it in the image

Well, this one is straightforward: you just put the secret into the image. It is the first thing you will do when onboarding your app onto Docker. Maybe you will put it under some dotfile, chown it to root, and think that everything is fine. This is the most prevalent anti-pattern in security; a sketch of it follows.
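For illustration only (the file names and image are hypothetical), this is roughly what the anti-pattern looks like, and why it leaks:

    # Anti-pattern: the secret is copied into the image at build time
    echo 's3cr3t' > .dbpasswd
    cat > Dockerfile <<'EOF'
    FROM ubuntu:14.04
    COPY .dbpasswd /root/.dbpasswd
    RUN chmod 600 /root/.dbpasswd
    EOF
    docker build -t myapp .

    # Anyone who can pull this image can read the secret right back out:
    docker run --rm myapp cat /root/.dbpasswd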

Issues:

  • When the image is published to any registry, anyone who can pull it has the secrets at their disposal.
  • Neither GitHub nor Docker Hub nor your own repository is designed to securely store or distribute secrets.
  • Updating secrets is a tedious job: every image has to be rebuilt.
  • This might still be manageable with a small number of images, but once you tie a CI/CD pipeline into your image build process, you are managing tons of images.
  • Accounting for certificate expiration becomes difficult.
  • Old, EOL/EOS, or decommissioned hardware can leak secrets.

Solution 2: Put it under ENV variables

This is by far the most common way to pass secrets to applications. It is widely used because the twelve-factor app guidelines recommend storing configuration, including credentials, in the environment.
Example: docker run -it -e "DBUSER=dbuser" -e "DBPASSWD=dbpasswd" myimage /bin/bash

Issues:

thaJeztah and diogomonica have captured the best practices around secrets in detail; here I am just summarizing the issues with this solution:

  • The variables are recorded in the container’s configuration and can be easily viewed using “docker inspect” (see the example after this list).
  • They are accessible by all processes in the container and can be easily leaked.
  • They are shared with any linked container.
  • It is incredibly common for an app to grab the whole environment and print it out, or even send it as part of an error report or a PagerDuty alert.
  • Environment variables are passed down to child processes. Imagine that you call a third-party tool to perform some action: all of a sudden, that third party has access to your environment.
  • It is very common for apps that crash to dump environment variables into log files for debugging.
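To see the first issue for yourself: with the example run command above, anyone with access to the Docker API on that host can recover the credentials in one line:

    # Prints the container's full environment, DBUSER and DBPASSWD included
    docker inspect -f '{{ .Config.Env }}' <container-id>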

Solution 3: Volume Mounts

This is again as straightforward as passing ENV variables. You put your secrets in some directory structure on the Docker hosts; that directory structure can live on the local file system, on NFS, or on a distributed file system like Ceph. You then mount the right directory inside the container for that particular app.
Example: docker run -i -t -v /mnt/app1/secrets:/secrets myimage /bin/bash

Issues:

  • Putting all the secrets for all the images on a single machine is bad design (see the example after this list).
  • The secrets sit unencrypted, in plain text.
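With the mount layout from the example above, the exposure is easy to see; the path and file name are just illustrative:

    # Anyone with shell access to the Docker host can read every app's secrets:
    cat /mnt/app1/secrets/db_password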

Solution 4: Secrets encryption

Some people are (rightly) paranoid about keeping their secrets in plain text, and even more paranoid about pushing an image with plaintext secrets to a private or public Docker registry. So they encrypt the secrets using public-key, elliptic-curve cryptography with tools like “ejson” from Shopify. To decrypt, private keys are hosted on the Docker hosts, and those production machines are locked down. At least this way, your image is safe from snooping. The workflow looks roughly like the sketch below.
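As a rough sketch of the ejson workflow (the file name and key handling are illustrative; consult the ejson documentation for exact invocations):

    # Generate a keypair; the public key goes into the secrets file,
    # the private key is installed on the locked-down Docker hosts
    ejson keygen

    # Values in the JSON file are encrypted in place against _public_key
    ejson encrypt secrets.production.ejson

    # Decryption succeeds only on hosts that hold the private key
    ejson decrypt secrets.production.ejson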

Issues:

  • To update secrets, you need to create new images.
  • The solution is fairly static.
  • You can still see which private keys are used for decryption using “docker inspect”.

Solution 5: Secrets store

There are secrets management and distribution services such as HashiCorp’s Vault, Square’s Keywhiz, and Sneaker (for AWS) that help you generate and distribute secrets for services. The main benefit of this approach is that secrets are centrally managed in a secure manner, and access to them is auditable. Almost all of these solutions are API-based and quite reliable. A minimal sketch with Vault follows.
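As a minimal sketch using Vault’s generic secret backend (the paths and field names are made up for the example; a real deployment also needs token policies and TLS):

    # An operator stores the credentials centrally, once:
    vault write secret/myapp/db username=dbuser password=dbpasswd

    # At startup, an entrypoint script with a valid token fetches them:
    vault read -field=password secret/myapp/db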

There is already an integration of the Keywhiz secrets store with Docker as a volume-driver plugin. This solution is the most robust of all and is already integrated with Docker. However, it only fits if Docker (or Docker Swarm) is the only way you manage and run your containers; the plugin doesn’t extend well if you are using orchestration tools like Mesos or Kubernetes.

If you orchestrate containers through Mesos or Kubernetes, watch for my next series of posts covering the solution.

Persistent solutions for ephemeral containers

With the advent of Docker, the container scene has exploded and is taking everyone by storm. It has gotten everyone excited: from developers to QA engineers, from deploy managers to system administrators. Everyone wants to adopt it and start incorporating it into their workflow.

However, there are inherent challenges in managing, running, and operating such systems at scale. The first challenge is trying to run ephemeral containers in a static world of hardware. Most of us are trying to retrofit existing solutions, and our existing mindsets, onto the new way of doing things. Others are focused on building orchestration tools like Kubernetes, Mesos, or Cloud Foundry’s Lattice. I think of these orchestration tools as the kernel of a data center operating system. Very little focus has been given to building the tools around that kernel that would make a truly distributed, data-center-specific, GNU-like operating system. Centralized logging, monitoring and alerting, metrics collection, persistent storage, service discovery, and the like are the pieces that still need solidification in the container ecosystem.

For comparison, we only need to go back a few decades to see how the GNU operating system evolved. We need to put our GNU hats on and see how they made it possible: a collection of applications, libraries, developer tools, and even games on top of a solid Linux-based kernel.

We at YP have devised solutions to some of these problems. You can check out some of the work that we have made open source.

Sysdig has also compiled “The Container Ecosystem Project.” Please check it out; there are some really interesting technologies mentioned there.

As I mentioned, there is a lot of movement, and everybody is trying to get a head start by developing technologies that work for them. I feel the GNU god has to come down once again to show us the right way to consolidate all these disjointed systems into a true data center operating system.