Last week I had the opportunity to attend the Large Scale Production Engineering meetup group at Yahoo’s office in Sunnyvale. The group meets monthly to discuss technologies related to running large production environments. This past week the topic was dynamic scaling, with presentations from Coburn Watson of Netflix, Aren Sandersen of Pinterest, and Sebastian Stadil of Scalr.

Over the past few years, it’s been exciting to see how many companies have started using cloud providers as a virtual extension of their own networks. Most of the presentations discussed Amazon Web Services and how companies can spin up instances on the spot based on thresholds tied to their current production load. For webapp companies, this has become very convenient: it allows them to scale quickly without building out a significant amount of their own infrastructure. Pinterest is a great example of this model, a young startup that automates its entire system configuration with Puppet/Chef and other configuration tools. A company that sees exponential growth, as Pinterest has over the past few years, can dynamically add new infrastructure without the upfront capital costs of hardware, datacenter space, and bandwidth.
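The core of that threshold-based scaling idea is simple enough to sketch in a few lines. This is a hypothetical illustration of the decision logic only, not any presenter’s actual configuration; the thresholds and instance limits here are made-up values:

```python
# Minimal sketch of threshold-based dynamic scaling. All numbers are
# illustrative, not any company's real autoscaling configuration.

def desired_instance_count(current: int, load_pct: float,
                           scale_up_at: float = 75.0,
                           scale_down_at: float = 25.0,
                           min_count: int = 2, max_count: int = 20) -> int:
    """Return the new instance count given average load across the fleet."""
    if load_pct > scale_up_at:
        current += 1      # add an instance when load crosses the upper threshold
    elif load_pct < scale_down_at:
        current -= 1      # retire one when load falls below the lower threshold
    return max(min_count, min(max_count, current))

print(desired_instance_count(4, 82.0))  # high load -> 5
print(desired_instance_count(4, 10.0))  # low load  -> 3
```

In practice a provider like AWS evaluates rules of this shape against monitoring metrics and launches or terminates instances for you; the min/max clamp is what keeps a noisy metric from scaling the fleet to zero or to something unaffordable.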


At OpenDNS, we have automated the entire configuration of our production sites, which we run on our own hardware spread across 14 global datacenters (sneak peek: an additional six are coming online in the next few weeks). Our service is a bit unique given our requirements for advanced anycast routing and extensive peering at internet exchanges. We use Puppet and other tools to quickly spin up new sites, and we’ve historically added 2-3 locations annually.

Our typical process is to rack the bare-metal hardware and simply turn it on. The machines PXE-boot into an automated install using a custom kernel image we’ve set up for that purpose. Early in userspace during the boot process, a small piece of code we’ve written fetches a pre-generated configuration describing the machine’s profile. From that profile, the machine detects all aspects of its configuration and builds itself into a DNS server, web server, database server, etc. Once the machine has been fully built, Puppet submits the hardware information to an open-source inventory management system called RackTables.
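To make the profile step concrete, here is a hypothetical sketch of what mapping a fetched profile to an install plan might look like. The profile format, field names, and role-to-service mapping are all illustrative assumptions, not OpenDNS’s actual implementation:

```python
# Hypothetical early-boot step: parse a pre-generated machine profile and
# decide which services to build on this host. The JSON schema and the
# role-to-service table below are invented for illustration.
import json

ROLE_SERVICES = {
    "dns": ["dnscache", "resolver-stats"],
    "web": ["nginx", "php-fpm"],
    "db":  ["mysql"],
}

def plan_install(profile_json: str) -> dict:
    """Parse the fetched profile and return the install plan for this host."""
    profile = json.loads(profile_json)
    role = profile["role"]
    return {
        "hostname": profile["hostname"],
        "role": role,
        "services": ROLE_SERVICES.get(role, []),  # unknown role -> empty plan
    }

# In the real flow the profile would be fetched over the network during
# early userspace; here it is inlined so the sketch is self-contained.
example = '{"hostname": "dns1.example.net", "role": "dns"}'
print(plan_install(example)["services"])
```

The useful property of this shape is that the machine itself carries no identity at install time; everything it becomes is driven by the centrally pre-generated profile, which is what makes racking-and-powering-on the whole manual step.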

We’ve done quite a bit of in-house customization to the software to fit our needs. Having the inventory process automated and included in the install process is very important as you scale to thousands of machines. Any member of the operations or engineering staff can easily look up all the details of a particular machine: rack location, facility contact information, service profile, warranty information, vendor information, and so on. This process helps things run smoothly, self-documents, and helps us avoid stupid human mistakes. It also creates a great database of information about our infrastructure that we can easily query. For example, we can generate a list of the model and serial number of every hard drive on our network. For an operator of a large-scale environment, this information is invaluable. We also use this database for further automation and configuration of additional services such as monitoring. The accounting and finance folks really appreciate having clear documentation of every machine that has ever been purchased, which over the long term adds up to a substantial amount of money.
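A query like the hard-drive example above is a one-liner once the inventory is in a database. The sketch below uses an in-memory SQLite table with a deliberately simplified schema; the real RackTables schema is considerably more involved, and the machine names and serials here are made up:

```python
# Illustrative inventory query: list the model and serial number of every
# hard drive. The single-table schema is a simplification of what a real
# inventory system like RackTables stores.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hardware (
    machine TEXT, component TEXT, model TEXT, serial TEXT)""")
conn.executemany(
    "INSERT INTO hardware VALUES (?, ?, ?, ?)",
    [("dns1", "disk", "ST31000524NS", "9VPC1234"),
     ("dns1", "nic",  "82574L",       "A1B2C3"),
     ("web1", "disk", "WD1003FBYX",   "WCAW5678")],
)

# Every disk on the network, one row per drive.
drives = conn.execute(
    "SELECT machine, model, serial FROM hardware "
    "WHERE component = 'disk' ORDER BY machine").fetchall()
for machine, model, serial in drives:
    print(machine, model, serial)
```

Because the install process populates this table automatically, the answer to "which machines have drive model X?" is always current, which is exactly what you want when a vendor announces a bad batch of disks.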

In the near term, we have plans to start using a few different cloud providers to dynamically spin up different environments and leverage the power of cloud infrastructure. Being a provider of cloud-delivered services ourselves, we’ll always have a substantial amount of infrastructure to run, but that doesn’t mean we shouldn’t take advantage of the other providers out there.

The next LSPE meetup, taking place on Thursday, March 21, is focused on workload distribution. The meetup already has a waiting list of 49 people as of this writing, but I’d recommend trying to go anyway if you’re responsible for running a large infrastructure. If you do attend and happen to see me, feel free to grab my ear for a few moments and introduce yourself. We’re hiring here, and I’m always happy to meet other engineers and operators in similar roles.
