This is a guest post by R. Tyler Croy, who is a long-time contributor to Jenkins and the primary contact for Jenkins project infrastructure. He is also a Jenkins Evangelist at CloudBees, Inc.
For ages I have used the "Build After" feature in Jenkins to cobble together what one might refer to as a "pipeline" of sorts. The Jenkins project itself, a major consumer of Jenkins, has used these daisy-chained Freestyle jobs to drive a myriad of delivery pipelines in our infrastructure.
One such "pipeline" helped drive the complex process of generating the pretty blue charts on stats.jenkins-ci.org. This statistics generation process primarily performs two major tasks on rather large sets of data:
Generate aggregate monthly "census data."
Process the census data and create trend charts
The chained jobs allowed us to resume the independent stages of the pipeline, and allowed us to run different stages on different hardware (different capabilities) as needed. Below is a diagram of what this looked like:
The infra_generate_monthly_json job would run periodically, creating the aggregated census data, which would then be picked up by infra_census_push, whose sole responsibility was to take census data and publish it to the necessary hosts inside the project’s infrastructure.
The second, semi-independent, "pipeline" would also run periodically. The infra_statistics job’s responsibility was to use the census data, pushed earlier by infra_census_push, to generate the myriad of pretty blue charts before triggering the infra_checkout_stats job, which would make sure stats.jenkins-ci.org was properly updated.
Suffice it to say, this "pipeline" had grown organically over a period of time when more advanced tools weren’t quite available.
When we migrated to newer infrastructure for ci.jenkins.io earlier this year, I took the opportunity to do some cleaning up. Instead of migrating jobs verbatim, I pruned stale jobs and refactored a number of others into proper Pipelines, statistics generation being an obvious target!
Our requirements for statistics generation, in their most basic form, are:
Enable a sequence of dependent tasks to be executed as a logical group (a pipeline)
Enable executing those dependent tasks on various pieces of infrastructure which support different requirements
Actually generate those pretty blue charts
If you wish to skip ahead, you can jump straight to the Jenkinsfile which implements our new Pipeline.
The first iteration of the Jenkinsfile simply defined the conceptual stages we would need:
node {
stage 'Sync raw data and census files'
stage 'Process raw logs'
stage 'Generate census data'
stage 'Generate stats'
stage 'Publish census'
stage 'Publish stats'
}
How exciting! Although not terrifically useful. When I began actually implementing the first couple of stages, I noticed that the Pipeline might sync dozens of gigabytes of data every time it ran on a new agent in the cluster. This problem will soon be solved by the External Workspace Manager plugin, which is currently being developed. Until it’s ready, I chose to mitigate the issue by pinning the execution to a consistent agent.
/* `census` is a node label for a single machine, ideally, which will be
* consistently used for processing usage statistics and generating census data
*/
node('census && docker') {
    /* .. */
}
Restricting a workload which previously used multiple agents to a single one introduced the next challenge. As an infrastructure administrator, technically speaking, I could just install all the system dependencies that I want on this one special Jenkins agent. But what kind of example would that be setting!
The statistics generation process requires JDK8, Groovy, and a running MongoDB instance.
Fortunately, with Pipeline we have a couple of useful features at our disposal: tool auto-installers and the CloudBees Docker Pipeline plugin.
Tool Auto-Installers
Tool Auto-Installers are exposed in Pipeline through the tool step, and on ci.jenkins.io we already had JDK8 and Groovy available. This meant that the Jenkinsfile would invoke tool and Pipeline would automatically install the desired tool on the agent executing the current Pipeline steps.
The tool step does not modify the PATH environment variable, so it’s usually used in conjunction with the withEnv step, for example:
node('census && docker') {
    /* .. */

    def javaHome = tool(name: 'jdk8')
    def groovyHome = tool(name: 'groovy')

    /* Set up environment variables for re-using our auto-installed tools */
    def customEnv = [
        "PATH+JDK=${javaHome}/bin",
        "PATH+GROOVY=${groovyHome}/bin",
        "JAVA_HOME=${javaHome}",
    ]

    /* use our auto-installed tools */
    withEnv(customEnv) {
        /* JDK 8 only accepts the single-dash form of this flag */
        sh 'java -version'
    }

    /* .. */
}
CloudBees Docker Pipeline plugin
Satisfying the MongoDB dependency would still be tricky. If I caved in and installed MongoDB on a single unicorn agent in the cluster, what could I say the next time somebody asked for a special, one-off, piece of software installed on our Jenkins build agents?
After doing my usual complaining and whining, I discovered that the CloudBees Docker Pipeline plugin provides the ability to run containers inside of a Jenkinsfile. To make things even better, there are official MongoDB Docker images readily available on DockerHub!
This feature requires that the machine has a running Docker daemon which is accessible to the user running the Jenkins agent. After that, running a container in the background is easy, for example:
node('census && docker') {
    /* .. */

    /* Run MongoDB in the background, mapping its port 27017 to our host's port
     * 27017 so our script can talk to it, then execute our Groovy script with
     * tools from our `customEnv`
     */
    docker.image('mongo:2').withRun('-p 27017:27017') { container ->
        withEnv(customEnv) {
            sh "groovy parseUsage.groovy --logs ${usagestats_dir} --output ${census_dir} --incremental"
        }
    }

    /* .. */
}
The beauty, to me, of this example is that you can pass a closure to withRun which will execute while the container is running. When the closure is finished executing, just the sh step in this case, the container is destroyed.
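As a minimal sketch of that lifecycle (not part of our actual Jenkinsfile), the closure's container argument can also be used while the container is alive. The `docker info` preflight check, the placeholder `echo` command, and the `docker logs` call are illustrative assumptions on my part; `container.id` is the ID that the Docker Pipeline plugin exposes for the running container:

```groovy
node('census && docker') {
    /* Fail fast if this agent's user cannot reach the Docker daemon */
    sh 'docker info'

    docker.image('mongo:2').withRun('-p 27017:27017') { container ->
        /* Inside the closure the container is running in the background */
        echo "MongoDB container ${container.id} is up"

        try {
            /* Placeholder for the real work, e.g. the Groovy script above */
            sh 'echo "talking to MongoDB on localhost:27017"'
        } finally {
            /* Capture the container's output before it is torn down */
            sh "docker logs ${container.id}"
        }
    }

    /* Once the closure returns, the plugin stops and removes the container */
}
```

Wrapping the work in try/finally means the MongoDB logs are printed even when the script fails, which can help with debugging since the container no longer exists after the closure returns.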
With that system requirement satisfied, the rest of the stages of the Pipeline fell into place. We now have a single source of truth, the Jenkinsfile, for the sequence of dependent tasks which need to be executed, accounting for variations in system requirements, and it actually generates those pretty blue charts!
Of course, a nice added bonus is the beautiful visualization of our new Pipeline!