Partial project builds in a CI/CD system can be a surprisingly complex problem. It touches several areas, but the most important ones are the time spent rebuilding the project when changes are detected and the impact that redeploying artifacts has on production and staging systems.
The problem is easily addressed when build systems such as GNU Make, Maven, NPM, or others are in use, because they manage dependencies automatically for certain types of projects. But when the build system does not support automated dependency management for your project, implementing an efficient and performant build becomes a real challenge. Otherwise, builds are slow, artifacts are duplicated in artifact repositories, and connected services may be restarted even when there are no changed modules to deploy.
There are several approaches to reduce the impact of such excessive builds, and all of them are based on some form of dependency management:
- manual dependencies tracking;
- Dockerfile-based build and deploy;
- automated checksum-based dependency management.
All of these approaches have pros and cons and should be analyzed carefully before implementation. Let's look at each of them briefly.
Manual Dependencies Tracking
This is a straightforward way which assumes that engineers carefully manage every dependency in the build file, helping the build system avoid excessive builds. This approach works well for projects with a small dependency list. When there are dozens or hundreds of dependencies, it takes enormous effort to manage them.
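As a rough illustration, here is a minimal shell sketch of manual dependency tracking; the paths, the app.jar target, and the Maven command are hypothetical. The dependency list is maintained by hand, and the target is rebuilt only when one of the listed inputs is newer than it.
# Hand-maintained list of inputs the artifact depends on (hypothetical paths).
DEPS="src/main pom.xml config/app.properties"
TARGET="target/app.jar"

# Rebuild only if the target is missing or any listed input is newer than it.
if [ ! -f "$TARGET" ] || [ -n "$(find $DEPS -newer "$TARGET" 2>/dev/null)" ]; then
    mvn package -DskipTests
fi
Every new dependency has to be added to DEPS by hand; with dozens or hundreds of entries, this list quickly becomes the weakest point of the build.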
Dockerfile-based Build and Deploy
Docker implements a very efficient layered design in which each upper layer is built on top of the previous, lower layer, and the layers are cached.
This design helps a lot if the project build scenario can be represented as an acyclic component graph: for example, there are components which change very rarely, they are needed to build other components which also change rarely, and those, in turn, are used to build frequently changing components.
The Dockerfile ADD, COPY, and RUN instructions are cached by default, which means that a layer is reused and the command is skipped when Docker decides that the cache is still in sync with the command.
The most important requirement for such a Dockerfile-based build scenario to be applicable is a limited number of components and simple dependencies between them. The limiting factor is the number of layers Docker supports when building images, which depends on the underlying storage driver.
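To see how close an image is to that limit, the number of layers in an already built image can be inspected with standard Docker commands (the image name below is just an example):
# Count the layers of a built image (hypothetical image name).
docker image inspect --format '{{len .RootFS.Layers}}' my-registry/cloudstack-sim:latest

# Or list the layers together with the commands that produced them.
docker history my-registry/cloudstack-sim:latest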
This approach can lead to monstrously large Dockerfiles which are difficult to maintain. Let's take a look at the reorganized Dockerfile for the Apache CloudStack Simulator which efficiently builds E2E tests for QA and testing purposes:
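# Rarely changing layers: the base image, system packages, and MySQL setup.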
FROM ubuntu:16.04
RUN apt-get -y update && apt-get install -y \
genisoimage libffi-dev libssl-dev git sudo ipmitool \
maven openjdk-8-jdk python-dev python-setuptools \
python-pip python-mysql.connector supervisor \
python-crypto python-openssl
RUN echo 'mysql-server mysql-server/root_password password root' | debconf-set-selections; \
echo 'mysql-server mysql-server/root_password_again password root' | debconf-set-selections;
RUN apt-get install -qqy mysql-server && \
apt-get clean all && \
mkdir /var/run/mysqld; \
chown mysql /var/run/mysqld
RUN pip install pyOpenSSL
RUN echo '''sql_mode = "STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"''' >> /etc/mysql/mysql.conf.d/mysqld.cnf
RUN (/usr/bin/mysqld_safe &); sleep 5; mysqladmin -u root -proot password ''
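# Project sources and build descriptors, copied as separate layers so that Docker can cache them individually.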
COPY agent /root/agent
COPY api /root/api
COPY build /root/build
COPY client /root/client
COPY cloud-cli /root/cloud-cli
COPY cloudstack.iml /root/cloudstack.iml
COPY configure-info.in /root/configure-info.in
COPY core /root/core
COPY debian /root/debian
COPY deps /root/deps
COPY developer /root/developer
COPY engine /root/engine
COPY framework /root/framework
COPY LICENSE.header /root/LICENSE.header
COPY LICENSE /root/LICENSE
COPY maven-standard /root/maven-standard
COPY NOTICE /root/NOTICE
COPY packaging /root/packaging
COPY plugins /root/plugins
COPY pom.xml /root/pom.xml
COPY python /root/python
COPY quickcloud /root/quickcloud
COPY requirements.txt /root/requirements.txt
COPY scripts /root/scripts
COPY server /root/server
COPY services /root/services
COPY setup /root/setup
COPY systemvm /root/systemvm
COPY target /root/target
COPY test/bindirbak /root/test/bindirbak
COPY test/conf /root/test/conf
COPY test/metadata /root/test/metadata
COPY test/pom.xml /root/test/pom.xml
COPY test/scripts /root/test/scripts
COPY test/selenium /root/test/selenium
COPY test/src /root/test/src
COPY test/systemvm /root/test/systemvm
COPY test/target /root/test/target
COPY tools/pom.xml /root/tools/pom.xml
COPY tools/apidoc /root/tools/apidoc
COPY tools/checkstyle /root/tools/checkstyle
COPY tools/devcloud4/pom.xml /root/tools/devcloud4/pom.xml
COPY tools/devcloud-kvm/pom.xml /root/tools/devcloud-kvm/pom.xml
COPY tools/marvin/pom.xml /root/tools/marvin/pom.xml
COPY tools/pom.xml /root/tools/pom.xml
COPY tools/wix-cloudstack-maven-plugin/pom.xml /root/tools/wix-cloudstack-maven-plugin/pom.xml
COPY ui /root/ui
COPY usage /root/usage
COPY utils /root/utils
COPY version-info.in /root/version-info.in
COPY vmware-base /root/vmware-base
RUN cd /root && mvn -Pdeveloper -Dsimulator -DskipTests -pl "!:cloud-marvin" install
RUN (/usr/bin/mysqld_safe &) && \
sleep 5 && \
cd /root && \
mvn -Pdeveloper -pl developer -Ddeploydb && \
mvn -Pdeveloper -pl developer -Ddeploydb-simulator
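# More frequently changing parts: the Marvin test framework, supervisor configuration, and the test runner script.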
COPY tools/marvin /root/tools/marvin
COPY tools/docker/supervisord.conf /etc/supervisor/conf.d/supervisord.conf
COPY tools/docker/docker_run_tests.sh /root
RUN cd /root && mvn -Pdeveloper -Dsimulator -DskipTests -pl ":cloud-marvin"
RUN MARVIN_FILE=`find /root/tools/marvin/dist/ -name "Marvin*.tar.gz"` && pip install $MARVIN_FILE
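# The most frequently changing parts are copied last: integration tests and the remaining tools.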
COPY test/integration /root/test/integration
COPY tools /root/tools
RUN pip install --upgrade pyOpenSSL
EXPOSE 8080 8096
WORKDIR /root
CMD ["/usr/bin/supervisord"]
In this example, the component separation led to a significant decrease in build time when no changes happen in the underlying layers, but readability and manageability suffered a lot. So this approach does not work for every project, but it can work for projects with weakly coupled components organized into layers.
Automated Checksum-based Dependency Management
This approach assumes that the project consists of weakly coupled components which can change arbitrarily, but one would like to build and deploy a component only if there are changes in its codebase. Implementing this approach can be quite easy in many cases, but, again, it is not a silver bullet.
Let's prototype a shell script which gathers the state of the project as a single checksum. Assume one has a directory project with a file a whose textual content is test:
$ mkdir project
$ echo "test" > project/a
$ find project
project
project/a
Let's develop a script which computes a single MD5 sum for the project directory:
$ PRJ=project; RES=$(tar -cO $PRJ | md5sum)
$ echo $PRJ $RES
project d971868dc52bbbb689e7935d0b851503 -
Now, let's change project/a and recalculate the MD5 sum:
$ echo "test1" > project/a
$ PRJ=project; RES=$(tar -cO $PRJ | md5sum)
$ echo $PRJ $RES
project 7cc28a6a4aaad798cdc2e96a867f5c53 -
As one can see, the checksum changed. Now, let’s create a subdirectory inside and recalculate MD5 sum:
$ mkdir project/subdir
$ PRJ=project; RES=$(tar -cO $PRJ | md5sum)
$ echo $PRJ $RES
project 03e26dd8d8883927740ac71e8595818a -
As can be seen, any change inside the project directory is reflected in the checksum.
So, the idea for the CI/CD optimization is to calculate a checksum for every project and then build and deploy it only when the checksum differs from the previously calculated one.
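Outside of any particular CI system, the idea can be expressed as a small shell function. This is a minimal sketch; the registry file location and the make -C build command are just placeholders for whatever your project uses.
REGISTRY=build_cache/registry

build_if_changed() {
    prj=$1
    # Checksum the whole component tree, exactly as in the prototype above.
    sum=$(tar -cO "$prj" | md5sum)
    if grep -q "^$prj $sum" "$REGISTRY" 2>/dev/null; then
        echo "$prj is up to date, skipping build"
        return 0
    fi
    make -C "$prj"                      # placeholder for the real build/deploy command
    # Replace the old registry entry (if any) with the new checksum.
    grep -v "^$prj " "$REGISTRY" 2>/dev/null > "$REGISTRY.tmp" || true
    echo "$prj $sum" >> "$REGISTRY.tmp"
    mv "$REGISTRY.tmp" "$REGISTRY"
}

mkdir -p build_cache && touch "$REGISTRY"
build_if_changed project1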
Next, let's see how the mentioned approaches can be implemented with GitLab CI.
Manual Dependencies Tracking Optimizations With GitLab CI
The build optimization process is pretty straightforward when manual dependency management is used. The only thing one must take care of is to keep the final and intermediate build artifacts in the GitLab CI cache and to configure the build system to use the cache directory (or to import those artifacts back into the project tree). The deployment step is just a stage in the build script, so the deployment is called only when it is needed.
An example of cache usage is presented below:
#
# https://gitlab.com/gitlab-org/gitlab-ce/tree/master/lib/gitlab/ci/templates/Nodejs.gitlab-ci.yml
#
image: node:latest
# Cache modules in between jobs
cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - node_modules/

before_script:
  - npm install

test_async:
  script:
    - node ./specs/start.js ./specs/async.spec.js
The cache in the example is used to cache the external dependencies of a Node project. More sophisticated setups can cache intermediate and final artifacts as well.
Dockerfile-based Build and Deploy With GitLab CI
The Dockerfile-based build approach is described very thoroughly in the GitLab CI documentation and is widely used to build and test Docker images. Docker's cache is used to speed up the building of layers, and the Docker registry offers a space-efficient way to keep Docker images.
Automated Checksum-based Dependency Management With GitLab CI
This approach is not self-explanatory, so let's design a small .gitlab-ci.yml which utilizes the script designed previously and the GitLab CI cache to keep track of the parts which should be rebuilt when changed.
image: ubuntu:latest

# Cache modules in between jobs.
# Every git branch has a separate cache.
#
cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - build_cache/

# Create a registry which will be used to keep track of changes.
# The registry is created in the cache, so it persists between jobs.
# We create the registry initially, or just touch it if it already exists.
#
before_script:
  - mkdir -p build_cache
  - touch build_cache/registry

# The build process for project1 inside the git repo:
# 1. Gather a checksum for the current tree 'project1/*'.
# 2. Compare the checksum with the already known one.
#    2.1 If it differs, then rebuild.
#    2.2 Otherwise, skip the rebuild.
# 3. Update the registry.
#
project1:
  variables:
    PRJ: project1
  stage: build
  script:
    - cat build_cache/registry
    - RES=$(tar -cO $PRJ | md5sum)
    - echo "$PRJ $RES"
    - |
      echo "------------------------------------------------"
      if ! grep -P "^$PRJ $RES" build_cache/registry
      then
        echo "XXXXXX BUILD HAPPENS - '$PRJ $RES' not found"
      else
        echo "XXXXXX BUILD CACHED - '$PRJ $RES' found"
      fi
      echo "------------------------------------------------"
    - |
      if ! grep -P "^$PRJ " build_cache/registry
      then
        echo "$PRJ $RES" >> build_cache/registry
      else
        sed -i "s/$PRJ .*\$/$PRJ $RES/g" build_cache/registry || true
      fi
    - cat build_cache/registry
This approach can be used if the Git repository includes independent projects which should be rebuilt only when they change. If your project uses external artifacts or includes shared directories, the script should be adjusted to reflect those additional needs, as sketched below.
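For instance, if project1 also depends on a shared directory and on a file pinning external artifacts, the checksum can simply be computed over all of those inputs together (the shared-lib and versions.lock names below are hypothetical):
# Include everything the component really depends on in the checksum:
# its own tree, a shared library directory, and a file pinning external artifacts.
RES=$(tar -cO project1 shared-lib versions.lock | md5sum)
echo "project1 $RES"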
Conclusion
We have looked at three different approaches which can be used to manage dependencies in projects. These approaches are beneficial when used either in CI/CD pipelines or locally, because they can significantly decrease build time and positively affect the stable operation of production servers. There is no single approach that fits all cases; you have to decide what to choose and when. Where possible, it is better to use standard build systems which automatically track dependencies for you, such as Maven, GNU Make, Gradle, NPM, Yarn, or any other available to you.
Sometimes the methods listed above are handy; sometimes they may turn out to be a bigger evil than simply running a complete rebuild every time. But you need to consider the approaches which increase team productivity, because otherwise your CI/CD is not efficient and engineers may even try to sabotage it.
If you like this post and find it useful, please share it with friends.