Extending build containers: Generic caching

Declaring your build environment in Dockerfiles and creating a separate container for each and every build is good practice. But starting in a fresh environment each time prevents various caching mechanisms from having any effect. The result is an increase in build time as well as network traffic, and especially the latter can become a bottleneck. This mainly affects git repositories and build tools like Maven and npm, but also Docker builds themselves: the git repository has to be cloned from scratch, dependencies for Maven and npm have to be fetched from their official repositories on every build even if the majority did not change, and in the case of Docker builds, at least the base image must be pulled while the Docker build cache starts out empty.

This post covers a couple of methods to overcome these limitations while still keeping isolation and reproducibility guarantees.

tl;dr: We create a Docker image that contains a generic template for building caches for git repositories, Maven artifacts and npm dependencies by (a) cloning the relevant git repositories into a single bare reference repository and (b) searching for pom.xml and package.json files and pre-fetching their dependencies, thus presenting dependent builds with an immediately available cache. That image must be built either on a regular basis or on demand in order to keep the cache up to date.

The details and full implementation can be found on GitHub.

Building a generic caching solution for each tool

The following sections show the approach taken for git, Maven and npm. The caching of Docker images and builds will be discussed in a future blog post.

Setup and strategy

Let’s make some assumptions before we begin. Assume we have a Docker image hub.gee-whiz.de/build-env:latest in which builds are executed. It contains our default build environment with a set of standard tools and would also be the starting point for new users who want to use a build container in general.

There’s also a central git server where multiple projects, each with multiple repositories, reside. Those are stored in hierarchical form like git.gee-whiz.de/project/repository. We’re aiming to build a generic caching solution per project covering all of its repositories. To do so, we create a separate, so-called project-specific Docker image with an embedded cache by extending our default build image and creating the actual cache as part of the Docker build process. Say we have a project named website; the resulting image will be named hub.gee-whiz.de/build-env-website:latest. That image will contain a cache for every repository of that project.

The cache is built as follows. Depending on your environment and requirements, you may of course deviate in certain steps or add additional mechanisms:

  1. Select a project we want to build an image with a suitable cache for.
  2. Determine the available git repositories of said project and clone them all into a single bare reference (git doc) repository. This will act as the git cache.
  3. Temporarily check out the default branch of every repository. On each we
    1. Search for Maven projects (pom.xml) and build them with --fail-never and the dependency:go-offline goal. The fetched artifacts will be stored in the local .m2 folder.
    2. Search for npm projects (package.json) and install them. The fetched dependencies will be stored in the local npm cache.

We now dive into the implementation details of each step involved. Note that repositories can optionally be excluded by setting the environment variable IGNORE_REPOS. There are also a bunch of other variables in use that have to be set appropriately; we come back to those later.

Git

Determine the available git repositories for the project website from a central Atlassian Bitbucket instance:

FILTER="$(echo "(\"${IGNORE_REPOS}\")" | sed 's/,/"|"/g')" \
&& REPOS=$(curl -S -u ${BITBUCKET_HTTP_CREDENTIALS} https://bitbucket.gee-whiz.de/rest/api/1.0/projects/${PROJECT}/repos?limit=100 \
    | jq -cM '.values[] | {name: .name, url: .links.clone[].href} | select(.url | contains("ssh://"))' \
    | grep -Ev "${FILTER}") \
&& echo -n "Considering $(echo ${REPOS} | jq -cMs '. | length') repositories for buildcontainer creation process: " \
&& echo "$(echo ${REPOS} | jq -cM '.name ' | jq -cMs '.')"

Clone all git repositories into a single bare reference repository:

mkdir -p /var/tmp/cache/git \
&& cd /var/tmp/cache/git \
&& git init --bare \
&& for REPO in ${REPOS}; do \
    REPO_NAME=$(echo ${REPO} | jq -cMr '.name') \
    && REPO_URL=$(echo ${REPO} | jq -cMr '.url') \
    && git remote add ${REPO_NAME} ${REPO_URL} \
    ; done \
&& git fetch --all --quiet \
&& cd /

Temporarily check out the default branch of every repository:

mkdir -p /tmp/git \
&& for REPO in ${REPOS}; do \
    REPO_NAME=$(echo ${REPO} | jq -cMr '.name') \
    && REPO_URL=$(echo ${REPO} | jq -cMr '.url') \
    && git clone --reference /var/tmp/cache/git ${REPO_URL} /tmp/git/${REPO_NAME} \
    ; done

Prefetch Maven dependencies

for REPO_DIR in /tmp/git/*; do \
    echo "Scanning for Maven project in ${REPO_DIR}" \
    && if [ -f ${REPO_DIR}/pom.xml ]; then \
        echo "Found pom.xml in ${REPO_DIR}" \
        && cd ${REPO_DIR} \
        && JAVA_HOME=${JDK_HOME} ${MAVEN_HOME}/bin/mvn -gs /tmp/build/maven-global-settings.xml -B -V -q --fail-never org.apache.maven.plugins:maven-dependency-plugin:3.0.2:go-offline \
        ; fi \
    ; done
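
The global settings file /tmp/build/maven-global-settings.xml referenced above is injected at build time (more on that in the Jenkins section below). Its contents depend on your infrastructure; as a purely hypothetical example, a minimal file routing all requests through an internal mirror could be generated like this:

cat > /tmp/build/maven-global-settings.xml <<'EOF'
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <mirrors>
    <!-- hypothetical repository manager; replace with your own -->
    <mirror>
      <id>internal</id>
      <mirrorOf>*</mirrorOf>
      <url>https://nexus.gee-whiz.de/repository/maven-public/</url>
    </mirror>
  </mirrors>
</settings>
EOF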

Prefetch npm dependencies

PATH=${PATH}:${NODEJS_HOME}/bin \
&& for REPO_DIR in /tmp/git/*; do \
    echo "Scanning for npm projects in ${REPO_DIR}/*" \
    && for NODE_PACKAGE in $(find ${REPO_DIR} -maxdepth 2 -iname package.json); do \
        NODE_PACKAGE_DIR=$(dirname ${NODE_PACKAGE}) \
        && echo "Found package.json in ${NODE_PACKAGE_DIR}" \
        && cd ${NODE_PACKAGE_DIR} \
        && npm --globalconfig /tmp/build/npm-global-rc install \
        ; done \
    ; done
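
Analogously, /tmp/build/npm-global-rc is injected at build time. A hypothetical minimal variant could pin the npm cache location and an internal registry:

cat > /tmp/build/npm-global-rc <<'EOF'
; hypothetical values; adjust to your infrastructure
cache=/var/tmp/cache/npm
registry=https://nexus.gee-whiz.de/repository/npm-proxy/
EOF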

Using ONBUILD

Instead of copying the above bash lines into each and every project-specific Dockerfile, in practice we use Docker’s ONBUILD instruction: We create a separate template image called hub.gee-whiz.de/build-env-template:latest which inherits from the default build image and contains instructions of the form ONBUILD RUN build_cache.sh. Those instructions are executed when a dependent child image is built. This moves the generic caching implementation, as a kind of template, into a central place and thereby improves maintainability. The project the cache should be built for must be given as a build parameter.
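
A minimal sketch of what such a template Dockerfile might look like is shown below; the PROJECT and IGNORE_REPOS build arguments are assumptions on our part, and build_cache.sh stands in for the snippets shown above:

# Hypothetical Dockerfile for hub.gee-whiz.de/build-env-template:latest
FROM hub.gee-whiz.de/build-env:latest

# Nothing below runs while building the template itself; the instructions
# are recorded as triggers and only executed when a child image is built.
ONBUILD ARG PROJECT
ONBUILD ARG IGNORE_REPOS
# build_cache.sh wraps the snippets shown above: it enumerates the
# project's repositories, creates the bare reference repository and
# prefetches the Maven and npm dependencies. Further build arguments
# (credentials, tool locations) are omitted here.
ONBUILD RUN build_cache.sh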

We’re now able to create an arbitrary number of project-specific images by just inheriting from that single template image. If no further customization is needed, the Dockerfile for the image hub.gee-whiz.de/build-env-website:latest would just be:

FROM hub.gee-whiz.de/build-env-template:latest
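
Building the project-specific image then boils down to a single docker build invocation, again assuming the hypothetical build arguments from the sketch above:

docker build \
    --build-arg PROJECT=website \
    --build-arg IGNORE_REPOS=testing,old \
    -t hub.gee-whiz.de/build-env-website:latest .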

Creating the project specific image on Jenkins

The previous bash snippets used a couple of environment variables we’re now going to address. Included were the locations of specific tools, configuration files, credentials and of course the project’s name. As these kinds of artifacts should not be included directly within the Docker image (to maintain a single location for configurations and for security reasons), we inject them at build time with the help of Jenkins.

We also implement the pipeline as a shared library, which allows simple usage and, again, improves maintainability. It’s implemented as a function named buildProjectSpecificDockerImage and is used like this:

buildProjectSpecificDockerImage {
    triggeredBy = '../build-env-template/master'
    project = 'website'
    ignoreRepos = [
        'testing',
        'old'
    ]
}

The build is triggered whenever the template image we extend has been rebuilt. We pass the name of the project the cache should be built for, but also the repositories to ignore. Optionally, the function also allows specifying custom configuration files, credentials or specific tools to use for building the image. You can see the full implementation on GitHub.
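
For orientation, a heavily simplified sketch of such a shared library function follows; trigger handling and the optional parameters are omitted, and the credentialsId is hypothetical:

// vars/buildProjectSpecificDockerImage.groovy -- simplified sketch
def call(Closure body) {
    // Collect the configuration from the closure shown above
    def config = [:]
    body.resolveStrategy = Closure.DELEGATE_FIRST
    body.delegate = config
    body()

    node {
        stage('Build project-specific image') {
            checkout scm
            // Inject credentials at build time instead of baking them into the image
            withCredentials([usernameColonPassword(credentialsId: 'bitbucket-http',
                                                   variable: 'BITBUCKET_HTTP_CREDENTIALS')]) {
                sh """
                    docker build \\
                        --build-arg PROJECT=${config.project} \\
                        --build-arg IGNORE_REPOS=${config.ignoreRepos.join(',')} \\
                        --build-arg BITBUCKET_HTTP_CREDENTIALS=\${BITBUCKET_HTTP_CREDENTIALS} \\
                        -t hub.gee-whiz.de/build-env-${config.project}:latest .
                """
            }
        }
    }
}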

Conclusion

On the premise that these project-specific build images are built either on a regular basis or on demand, builds running inside those containers have immediate access to the cache for git, Maven and npm. If the image is not quite up to date, for instance when a dependency was bumped recently, only that specific artifact must be fetched. The same principle applies to the git repositories.

Thanks to Docker’s image layering, the cache can be shared by multiple builds running in parallel on the same host while still being isolated. To further improve build time, the build image might also be distributed to the build nodes in advance, right after it has been created. This avoids possibly time-consuming on-demand image pulls.
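
A hypothetical distribution step, assuming SSH access to the build nodes, might look like this:

for NODE in build-node-1 build-node-2; do
    ssh ${NODE} docker pull hub.gee-whiz.de/build-env-website:latest
done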
