[repost] Continuous Integration: automating development processes and best practices
To illustrate where Continuous Integration fits into modern environments, let’s briefly look at a typical software development workflow. Most modern software projects, whether they be websites, smartphone apps, or desktop applications, generally follow a high-level process like the following:
- A developer writes some code, often referred to as a changeset or patch, which represents a change to the project’s codebase (for example, to add a new feature or fix a bug).
- They integrate (or merge) their change into a centralized authoritative code repository for that project (for example, a repository on GitHub).
- If relevant to the programming language or application, the project source code is compiled and then built into a deployable version (often referred to as artifacts or packages).
The steps above are a simplified view of many real-life setups, omitting considerations like branching strategies. When looking at this process and thinking about the responsibilities at each stage, two key questions arise:
- How can we know that a developer’s changeset (from step 1) is acceptable for integration into the project?
- Changes must not break the existing codebase, e.g. by introducing new bugs.
- Changes need to be of good enough quality. The exact level of “good enough” is context-dependent: applications responsible for human life, such as medical apps, don’t have the same expectations as games.
- Who (or what) is responsible for performing steps 2-3?
In a non-CI development model, the answer to the first question might range from “click merge and hope for the best” to “hope the developer runs tests”.
As for question 2, these steps tended to be performed manually by developers or operations staff, or perhaps partially automated with scripts.
Continuous Integration takes a different approach. It seeks to automate the answer to both these questions.
Looking at our high-level workflow above, Continuous Integration focuses on steps 2 and 3. It validates that a developer’s change is acceptable for integration into the main codebase and that a team can still successfully build the project and run the associated tests.
In an ideal CI environment, every code change is integrated as it’s developed. Generally speaking, it’s advisable to integrate several times per day, or even better, at every commit.
What are Continuous Delivery and Continuous Deployment?
Continuous Delivery and Continuous Deployment bring automation a step further, up to having your latest commit automatically distribute an entire new version of your software.
Continuous Delivery means that artifacts are built and made ready to be deployed. But they will not be deployed without a manual decision by a human being.
Continuous Deployment implies all processes are automated, and a single commit triggers an automated pipeline that will eventually bring a new version of your application to the production environment without any human intervention.
While many companies practice Continuous Delivery, few embrace Continuous Deployment. Continuous Deployment is risky, because anyone could introduce a bug into production with a simple commit, and you need to introduce processes to reduce this risk.
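To make the difference concrete, here is a minimal sketch of the decision that separates the two models. The `PipelineRun` class, the `should_deploy` function, and the mode names are hypothetical, introduced only for illustration; real CI/CD tools express this gate in their own configuration.

```python
from dataclasses import dataclass


@dataclass
class PipelineRun:
    """Outcome of one hypothetical pipeline run for a single commit."""
    commit: str
    build_passed: bool
    approved_by_human: bool = False


def should_deploy(run: PipelineRun, mode: str) -> bool:
    """Decide whether a run reaches production.

    mode is "delivery" (a human must approve) or "deployment" (fully automated).
    """
    if not run.build_passed:
        return False  # a broken build never ships in either model
    if mode == "deployment":
        return True  # no human in the loop
    if mode == "delivery":
        return run.approved_by_human  # artifact is ready but waits for a decision
    raise ValueError(f"unknown mode: {mode!r}")
```

With a passing build, `should_deploy(run, "deployment")` is true immediately, while `should_deploy(run, "delivery")` stays false until `approved_by_human` is set.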
The benefits of Continuous Integration
This emphasis on automated integration provides another sizeable benefit over traditional development workflows.
Most software projects using modern version control use a trunk branch (also known as mainline or master).
In non-CI environments, developers often work on features over a long period on branches from this trunk. Over time these branches tend to diverge more and more from the mainline as other developers integrate their changes.
Integrating the feature branch can be a laborious process to ensure that all the changes are still compatible. It’s a process that developers dread enough that they coined the phrase “integration hell.” CI workflows help you to avoid this problem with their emphasis on easy and regular integration.
Not only does Continuous Integration save developers time by sparing them from integrating changes manually, but it also increases the reliability of the software. Teams can add new features with confidence and release them to users automatically, just by writing code (and associated tests).
Requirements of Continuous Integration practices
There are some hard requirements for a Continuous Integration workflow to take place.
- A Version Control System tool
- A Build Tool
- An Artifacts Repository Manager
Continuous Integration relies on Version Control System
The most important requirement is that the codebase must be under version control. Every change applied to the codebase must be safely stored in a dedicated Version Control System (VCS). Once the code is version controlled, it can be accessed by the CI tool.
There are several tools available on the market. Git is probably the most widespread VCS tool nowadays, so it deserves a short description.
Git, originally created by Linus Torvalds for development of the Linux kernel, is based on 3 main precepts:
- Take Concurrent Versions System (CVS) as an example of what not to do
- Support a distributed workflow
- Include very strong safeguards against corruption, either accidental or malicious
Main Git characteristics include:
- Support for non-linear development: Branching (and merging) is very fast compared to CVS. While CVS branching is done server-side, branching in Git takes place on the developer's machine.
- Distributed development: Every developer has a local copy of the whole repository history.
- Efficient handling of large projects: Performance tests demonstrated that Git is fast, even when handling large codebases.
- User Authentication: commits can be cryptographically signed, ensuring the author of the commit is the person it claims to be.
- Cryptographic authentication of history: Just as with blockchain, the ID of a specific commit depends on the content of its ancestors. Changing the history will change the commit ID.
- Garbage collection: Objects rendered unnecessary will be automatically garbage collected at some point. It’s also possible to explicitly call the garbage collection to pack a Git repository when space is required.
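The "cryptographic authentication of history" point can be illustrated with a simplified hash chain. This is only a sketch of the idea: it is not Git's actual object format, and the `commit_id` and `history_ids` functions are hypothetical names.

```python
import hashlib
from typing import List, Optional


def commit_id(content: str, parent_id: Optional[str]) -> str:
    """Hash the commit content together with its parent's ID (simplified)."""
    payload = f"parent {parent_id or ''}\n{content}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def history_ids(contents: List[str]) -> List[str]:
    """Return the chain of IDs for a linear history, oldest first."""
    ids: List[str] = []
    parent: Optional[str] = None
    for content in contents:
        parent = commit_id(content, parent)  # each ID depends on its ancestor
        ids.append(parent)
    return ids
```

Because each ID includes the parent's ID, altering the first commit of a history changes the ID of every commit after it, while altering only the last commit leaves the earlier IDs untouched.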
A build tool to implement Continuous Integration
The second requirement for CI is a Build Tool: such a tool will handle the sources of the application and will produce the desired software in an automated way.
The build steps of a piece of software, as well as the build tool, depend on the chosen technology stack. As an illustration, here’s a list of build steps for a Java application:
- If necessary, generate .java files from configuration
- Compile the source code (.java files) to bytecode (.class files)
- Compile test code to bytecode
- Execute unit tests
- If there are any, execute integration tests
- Package the .class files into a JAR archive
- If necessary, store the JAR in an Artifact Repository Manager (see below)
- If necessary, tag the code accordingly in the Version Control System
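The build steps above form an ordered chain in which a failure at any step stops the rest. A minimal sketch of that behaviour follows; the `run_build` function and the placeholder step actions are assumptions for illustration, since a real build would invoke the compiler, the test runner, and so on.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_build(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (passed, names of the steps that were executed).
    """
    executed: List[str] = []
    for name, action in steps:
        executed.append(name)
        if not action():
            return False, executed  # later steps are skipped
    return True, executed


# Placeholder actions standing in for real compiler and test invocations.
example_steps: List[Step] = [
    ("compile sources", lambda: True),
    ("compile tests", lambda: True),
    ("unit tests", lambda: False),  # simulate a failing test
    ("package JAR", lambda: True),
]
```

Running `run_build(example_steps)` reports a failed build and never reaches the packaging step, which is the behaviour you want: no artifact is produced from code whose tests fail.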
To implement the action chain of our example, there are several build tools available, e.g.:
- Ant, the cross-platform XML-based ancestor of all Java build tools
- Maven, a widespread declarative XML-based tool biased toward convention over configuration
A build tool and process, whatever they are, allow for reproducible builds.
The idea of reproducible builds is that the same source code should always produce an identical set of output artifacts, whether the codebase is built on a developer’s laptop or on the CI system. This provides several benefits:
- First, parity between the developer’s environment and what runs in production reduces unexpected problems when code that works on someone’s laptop ends up on a server in a data center. This is the situation behind the popular phrase you might have heard or said: “It works on my machine!”
- Similarly, it minimizes the chance that production breaks when the tests in the CI system pass, because you can be confident that they are running the same code.
- Finally, it allows for efficient caching of artifacts and sharing of binaries between stages, since they are built in a consistent and reproducible way.
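One way to check reproducibility is to compare a digest of the artifacts produced by two builds. The sketch below packs files into a ZIP archive with sorted names and a fixed timestamp so that the same inputs yield byte-identical output. The `build_archive` and `digest` helpers are hypothetical, and real-world reproducibility also depends on pinned toolchains and dependencies, which this example does not cover.

```python
import hashlib
import io
import zipfile
from typing import Dict

FIXED_TIMESTAMP = (1980, 1, 1, 0, 0, 0)  # earliest date the ZIP format can store


def build_archive(files: Dict[str, bytes]) -> bytes:
    """Pack files into a ZIP with sorted names and a fixed timestamp."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):  # stable ordering, independent of input order
            info = zipfile.ZipInfo(name, date_time=FIXED_TIMESTAMP)
            archive.writestr(info, files[name])
    return buffer.getvalue()


def digest(artifact: bytes) -> str:
    """SHA-256 fingerprint used to compare two builds."""
    return hashlib.sha256(artifact).hexdigest()
```

Two calls to `build_archive` with the same files, even when supplied in a different order, produce the same digest, whereas any change in content produces a different one.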
An Artifacts Repository Manager to store the results of your Continuous Integration process
Just as source code needs to be stored in a VCS, artifacts that result from the build process need to be stored as well. Such artifacts could be stored on a remote file system, but as with VCS, dedicated software to manage artifacts offers much more added value: this is the role of a binary repository manager, or, in a less limiting definition, an Artifact Repository Manager.
Wikipedia offers the following definition:
A binary repository manager is a software tool designed to optimize the download and storage of binary files used and produced in software development. It centralizes the management of all the binary artifacts generated and used by the organization to overcome the complexity arising from the diversity of binary artifact types, their position in the overall workflow and the dependencies between them.
Artifacts Repository Managers offer the following key features:
- Caching: Because the repository manager is installed inside the company’s boundaries, developers access it faster than a remote one. By using it as a proxy, it’s able to cache downloaded third-party artifacts and speed up access to them.
- Retention policies: A repo manager can automatically purge unused artifacts and reclaim valuable space.
- High availability: A repo manager can be set up in a cluster, so that developers as well as the CI tool may access it at any time. Downtime of a repo manager can definitely impact the smooth running of all enterprise builds.
- User restrictions: Last but not least, repo managers make it possible to restrict access rights to specific artifacts or groups of them depending on the user.
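As an illustration of a retention policy, the sketch below selects artifacts that have not been accessed for longer than a given period. The `unused_artifacts` function and the idea of tracking a last-access time per artifact are assumptions; actual repository managers define their own policy rules.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def unused_artifacts(
    last_access: Dict[str, datetime],
    now: datetime,
    max_idle: timedelta,
) -> List[str]:
    """Return the names of artifacts idle for longer than max_idle, sorted."""
    return sorted(
        name for name, seen in last_access.items() if now - seen > max_idle
    )
```

A purge job could call this periodically and delete the returned artifacts to reclaim space.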
A simple CI workflow, from development up to the real build
A CI workflow is tightly related to development best practices. There is potentially a multitude of possible CI workflows out there, depending on your software, stack, and use case. Let’s see a simplified workflow as an example, starting from development up to the real build automation.
Get the latest copy of the codebase. There are two possibilities here: if it’s the first time it’s accessed, one needs to “download” it. Using Git, this is achieved with the git clone command, which copies the remote codebase locally.
Alternatively, if the codebase is already present locally, it just needs to be synchronized with the remote repository, for example with the git pull command.
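The choice between the two commands can be sketched as a small helper that returns the Git command line to run. The `sync_command` function and the example URL are hypothetical; the sketch only builds the command and does not execute it.

```python
from pathlib import Path
from typing import List


def sync_command(repo_url: str, workdir: Path) -> List[str]:
    """Return the Git command line that brings workdir up to date."""
    if (workdir / ".git").exists():
        # Existing clone: synchronize with the remote repository
        return ["git", "-C", str(workdir), "pull"]
    # First access: copy the remote repository locally
    return ["git", "clone", repo_url, str(workdir)]
```

The returned list could be passed to `subprocess.run` by a script that prepares a working copy before a build.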
In Version Control Systems, there’s a dedicated branch pointing to the latest stable version of the software (usually master), and this is what should get released to production.
In order to protect this golden standard from as many bugs as possible, it shouldn’t be possible to directly write anything on it. For this reason, every development should start by creating a dedicated branch from master.
To keep things organized, it is possible to adopt naming schemes for branches: popular ones rely on prefixes.
Proper development can now start, whether it’s a full-fledged feature development spanning one or more sprints or a quick production bugfix.
Depending on one’s context, tests can be written before the code (this is Test-Driven Development) or after it. Either way, tests need to be written to ensure that the code works and that possible future regressions will be caught by the test harness.
Coverage of code by tests is dependent on one’s context too: for software that is responsible for human lives, such as plane navigation or assisted surgery, every line of code needs to be checked (or even double- or triple-checked). In other cases, the return on investment of tests may be lower.
Remember that changes were not made directly on master, but on a dedicated branch. Once the development is finished, it’s time to ask team members if those changes can be merged into the master branch.
This is the goal of a Pull Request: you’re basically asking your team to accept changes in the golden standard, and open your patch to peer review.
Once the PR has been opened, the branch can be automatically built using the project’s build tool, to ensure the modifications we made do not break our master branch.
Frequently, other steps will also take place. One such step is an automated review of the committed code: the review scope might be around security, code quality, documentation standards, etc.
In the realm of code quality, it would be hard not to mention SonarQube, one of the leading open source platforms in this area. SonarQube integrates with major CI tools to execute configured checks on one’s codebase. This is what Continuous Inspection is all about:
SonarQube provides the capability to not only show health of an application but also to highlight issues newly introduced. With a Quality Gate in place, you can fix the leak and therefore improve code quality systematically.
The automated build starts automatically once the PR is opened, using one of the available CI tools that will go through all build steps: compilation, testing, packaging, etc. If one (or more) of the automated build steps triggers a failure, we say the build is broken.
In most CI tools, a broken build is displayed in red, while a passing build is shown in green. Hence, you may hear people referring to a passing build as a “green build”. If the build is broken, whatever the reason, it’s up to the developer at the origin of the PR to fix it. At this stage, the build should be passing.
Automation is great, and developers wouldn’t be able to reach what they do today without it. However, it’s not without its limits.
While a tool such as SonarQube can detect a simple bug pattern (e.g. Double-Checked Locking), it cannot detect the infinite variety of possible bugs. Only another human mind can do so.
For this reason, the last step before a code change can be merged into master is a manual code review by fellow team members (this is what the PR was for!).
There are probably as many ways to do a code review as there are developers! Suffice it to say you’ll probably need to find a base one, such as the excellent What to Look for in a Code Review, and adapt it to your own needs.
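Taken together, the automated build and the manual review act as two gates in front of master. A minimal sketch of that rule follows; the `PullRequest` class, the `can_merge` function, and the choice not to count the author's own approval are assumptions for illustration, not the behaviour of any particular hosting platform.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PullRequest:
    """Hypothetical view of a PR: its author, build status, and approvers."""
    author: str
    build_passed: bool
    approvals: Set[str] = field(default_factory=set)


def can_merge(pr: PullRequest, required_approvals: int = 1) -> bool:
    """A PR is mergeable when the build is green and enough peers approved."""
    peer_approvals = pr.approvals - {pr.author}  # assumption: self-approval does not count
    return pr.build_passed and len(peer_approvals) >= required_approvals
```

A green build without a peer approval, or an approval on a red build, both leave the PR unmergeable.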
Tagging, Versioning, and storing the built artifact
At this stage, changes can (finally) be merged into master.
In general, this signals either a release or a production hot-fix. To make sure everything is fine, the CI tool should run the build again, this time on the master branch with the merged changes.
There are additional actions performed, though.
First, the VCS needs to tag the version accordingly, so that it’s marked as such.
Naming conventions for tagging and versioning also exist, as for branches, but frequently with a more creative twist: projects often pick themes such as mountain names, lake names, or cake names, to name a few that Exoscale uses internally.
But while “Precise Pangolin” (Ubuntu) or “Oreo” (Android) are great marketing names worth remembering, software developers should use a standard numeric versioning scheme in parallel. Following the rules of semantic versioning for the major, minor, and patch versions is strongly advised. More info can be found on semver.org.
Second, the artifact resulting from the build needs to be stored in the Artifacts Repository Manager. This way, if something unexpected happens and a rollback needs to be performed, the previous working version is available without the need to build from sources again.
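The semantic versioning rules mentioned above can be sketched in a few lines. The `parse` and `bump` helpers are hypothetical and handle only plain MAJOR.MINOR.PATCH strings; pre-release and build metadata, which semver.org also defines, are left out.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(version: str) -> Version:
    """Split "MAJOR.MINOR.PATCH" into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(version: str, change: str) -> str:
    """Return the next version for a "major", "minor", or "patch" change."""
    major, minor, patch = parse(version)
    if change == "major":
        return f"{major + 1}.0.0"  # breaking change: reset minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"  # new backward-compatible feature: reset patch
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"  # backward-compatible bug fix
    raise ValueError(f"unknown change type: {change!r}")
```

Parsing into integer tuples also gives correct ordering, so `parse("1.10.0")` compares as newer than `parse("1.9.3")`, which a plain string comparison would get wrong.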
Continuous Integration Tools: an overview
With Continuous Integration ever more widely used, there’s a maturing ecosystem of tools to get started. Below we look at some of the most popular and common CI/CD systems used today, in environments ranging from start-ups operating entirely in the cloud to large enterprise organizations running their own complex CI platforms internally.
Jenkins is one of the oldest open source projects in the Continuous Integration space and still one of the most widely used.
Such a long legacy has upsides and downsides. The core architecture has been battle-tested over the years in production environments ranging from small-scale deployments to some of the largest companies in the world, and there is a vibrant online community of Jenkins users to help with solutions for problems you may encounter.
However, the large legacy code base and backward compatibility requirements mean that its internal abstractions are often dated, and these frequently leak through to users in different scenarios.
Furthermore, while Jenkins has an extensive plug-in ecosystem that provides many modern features, these plugins are usually community developed and can vary in quality and reliability.
In recent years Jenkins has gained a new language for describing Continuous Integration workflows known as pipelines. These allow developers to declare and describe the build and deployment process. Jenkins also allows you to create modules that you can reuse across different projects to standardize and streamline common processes.
In short, Jenkins has a long history of development and use, a large and active community, and is highly customizable. Perhaps it’s for these reasons that you could say, “nobody ever got fired for choosing Jenkins.”
Travis CI is nearly as venerable as Jenkins, and while many of its components are open source, it’s not possible to self-host without an enterprise account. However, running Travis with any open source projects is free.
Every task you want Travis to run is contained within a .travis.yml file that sits alongside your code. This also means you can run different tasks from different branches of a repository.
Travis aims for simplicity, working directly from GitHub hosted repositories, and maintaining a library of services common in many applications. But if you’re not using GitHub or need more control, then other options may suit you better.
A useful feature of Travis is the ability to run on multiple operating systems, meaning you can test your code on different targets without needing to maintain machines or virtual images.
GitLab Continuous Integration/Continuous Delivery
GitLab started as a source code hosting service, similar to GitHub but also available in an open source version. Unlike GitHub, GitLab now includes an advanced CI/CD implementation (called Auto DevOps) built right into its platform.
For those already using GitLab to store source code, this tight integration is one of the most helpful aspects of GitLab’s CI/CD offering. You can enable it by adding a .gitlab-ci.yml configuration file to the root directory of a source code repository.
You can integrate GitLab CI/CD with GitHub repositories.
Bamboo is a Continuous Integration/Continuous Delivery offering from Atlassian, a company well-known in most software environments for their JIRA bug tracking software.
One of the key advantages of Bamboo is its tight integration with other Atlassian products (like JIRA and Bitbucket) for those already operating these systems. It also features a large marketplace of add-ons.
On the downside, Bamboo has a smaller community of users, so users may be more reliant on support from Atlassian.
CircleCI is a modern online service (also available as a self-hosted version) geared at providing a robust CI platform. CircleCI centers its platform around containers and offers fast spin-up times for tests. The Workflows feature allows users to define sequences of CI and CD jobs even for the most complex projects.
CircleCI’s key advantage is that it is a fully managed CI solution, which reduces the amount of time end users have to put into system maintenance.
We’ve published a nice article on how to build a Continuous Delivery Pipeline using CircleCI and Docker where you can get a more detailed idea of it.
While Continuous Integration best practices and tools are important to get right, they are often not enough to take an organization down the CI path. For many traditional software organizations, switching to CI processes from the traditional manual steps requires deep change in the ways software teams work together.
To successfully integrate changesets into a codebase, the team must agree on a set of work patterns and norms and stick to them. Getting to a reliable build step can sometimes require serious refactoring, and continuously deploying to production opens an entirely new horizon of issues that teams need to design around.
When considered together, however, the benefits of adopting the practices of Continuous Integration for software organizations are undeniable. This is now a new norm in the software world, and the growth in the adoption of the CI practices is only going to accelerate going forward.