Last week the engineering team started migrating our internal Docker registry from the ancient, outdated and “so 2013” v1 to the newer and amazing v2. We planned the migration because v2 showed some impressive performance improvements in our early tests and, well, because we were tired of seeing deprecation warnings whenever we pushed or pulled images from our repositories, and we assumed those pushes and pulls would eventually turn into failures.
One of the biggest problems we needed to tackle was migrating our users’ thousands of images. Thankfully, the Docker migrator was created for this exact purpose. The migrator pulls images from the source registry, set through the V1_REGISTRY environment variable, into a local Docker cache, then tags the images and pushes them to the new Docker registry, set through the V2_REGISTRY environment variable. We spun up an EC2 instance and quickly ran a few tests, using repository filters to limit the number of images that would be migrated. To avoid adding load to our infrastructure, we launched copies of our v1 and v2 registries on the same instance as the migrator and configured them to use the same S3 storage backend as our existing registries.
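To make the pull, tag and push flow concrete, here is a minimal Python sketch of what happens for a single image. It is not the migrator's actual code (the migrator is a shell script); the function names migration_commands and migrate are hypothetical, and only the V1_REGISTRY and V2_REGISTRY variable names come from the tool itself.

```python
import os
import subprocess


def migration_commands(image, v1_registry, v2_registry):
    """Return the docker CLI calls that move one image from v1 to v2.

    `image` is a "repository:tag" string; the registries are host[:port].
    """
    source = f"{v1_registry}/{image}"
    target = f"{v2_registry}/{image}"
    return [
        ["docker", "pull", source],         # fetch into the local cache
        ["docker", "tag", source, target],  # re-tag for the new registry
        ["docker", "push", target],         # upload to the v2 registry
    ]


def migrate(image, env=os.environ, run=subprocess.check_call):
    """Run the three steps for one image (hypothetical helper).

    Reads the registry addresses from V1_REGISTRY and V2_REGISTRY and
    stops at the first command that exits with a non-zero status.
    """
    for command in migration_commands(
        image, env["V1_REGISTRY"], env["V2_REGISTRY"]
    ):
        run(command)
```

Because every image passes through the local Docker cache, both the disk size and the pull time grow with the number of images being migrated.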
The first few tests with the migrator looked good. So we gave the EC2 instance a massive disk, configured a Jenkins job to run the migration and launched the job.
Almost four hours later, our staging environment v1 registry was migrated, or so we thought. Looking through the logs we found out that quite a few of our pushes failed with a somewhat cryptic error message:
time="2016-01-19T02:12:54Z" level=fatal msg="open /var/lib/docker/aufs/mnt/de5e040c21448c28ad2ee750e264ff63223b5062459cb396da32d10f8f73b241/.wh..wh.plnk/345.7528401: operation not permitted"
A bit of googling turned up this issue, which was reported to Docker in August, along with a resulting fix in the 1.10 release. As a workaround, we downgraded Docker to version 1.7.1 and blew away the /var/lib/docker/aufs directory, as suggested in that issue. We ran the job again and it finished happily in just under seven hours. We ran the job a second time and realized that the entire operation still took way too long, because it had to pull down all the images before pushing them to the new registry.
We added an environment flag to migrator.sh to keep the locally cached Docker images rather than clean them up at the end of a migration. With this change, we could run the migration multiple times without having to download each image every time. In addition, we put together a simple Python script to determine which images were in the old registry but not in the new one, which reduces the number of pulls and pushes needed. This took our migration time down to 19 minutes on subsequent runs. Considering that the initial migration on our production system took more than two days, reducing the runtime to less than a fifth is huge.
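The core of such a comparison is a set difference. The sketch below is an illustration rather than our actual script: the function name missing_images is hypothetical, and it assumes the tag lists for both registries have already been fetched into dictionaries that map a repository name to its tags.

```python
def missing_images(old_tags, new_tags):
    """List "repository:tag" entries present in the old registry only.

    Both arguments map a repository name to an iterable of tag names.
    """
    missing = []
    for repo in sorted(old_tags):
        # Tags the new registry does not have yet for this repository.
        absent = set(old_tags[repo]) - set(new_tags.get(repo, ()))
        missing.extend(f"{repo}:{tag}" for tag in sorted(absent))
    return missing


if __name__ == "__main__":
    old = {"team/app": ["1.0", "1.1"], "team/db": ["latest"]}
    new = {"team/app": ["1.0"]}
    for image in missing_images(old, new):
        print(image)
```

Only the images it reports need to be pulled and pushed again. Note that this version compares tag names only, so a tag that was re-pushed with different content to the old registry after the first migration would not be detected.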
The updated migrator and registry comparison scripts are available in our GitHub repo here.