They're using cache:key:files, so it will reinstall all of the packages each time yarn.lock changes. When a build is triggered and yarn.lock hasn't changed, it does the build without downloading all the packages again.
Come to think of it, builds don't always run in chronological order, so the cache could wind up with extra packages. Yarn has autoclean, but its docs say to avoid using it. npm seems to be quite OK with pruning, though: https://docs.npmjs.com/cli/v7/commands/npm-prune
I think caching two folders - one that contains the downloads and one that contains the installed packages - could be the way to go. Yarn and npm have caches to prevent downloading files. And maybe only cache the downloads on the main branch.
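Roughly something like this - a minimal sketch where the folder names, the image tag, and the yarn flags are my own assumptions, not taken from the blog post:

    build:
      image: node:16
      cache:
        key:
          files:
            - yarn.lock
        paths:
          - .yarn-cache/      # downloaded package archives
          - node_modules/     # installed packages
      script:
        - yarn install --frozen-lockfile --cache-folder .yarn-cache
        - yarn build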
> And maybe only cache the downloads on the main branch.
$CI_COMMIT_REF_SLUG resolves to the branch name when the pipeline runs. Using it as the value for the cache key, Git branches (and related MRs) use different caches. It can be one way to avoid collisions, but requires more storage with multiple caches. https://docs.gitlab.com/ee/ci/variables/predefined_variables...
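For illustration only (the path is a made-up example for a Node.js job), a per-branch cache looks like:

    cache:
      key: "$CI_COMMIT_REF_SLUG"
      paths:
        - node_modules/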
In general, I agree: the more caches and parallel execution you add, the more complex and error-prone it can get. Simulating a pipeline with runtime requirements like network & caches needs its own "staging" env for developing pipelines. That's a scenario not many have, or are willing to assign resources to. Static simulation, where you predict the building blocks from the YAML config, is something GitLab's pipeline authoring team is working on in https://gitlab.com/groups/gitlab-org/-/epics/6498
It is also a matter of insights and observability - when the critical path in the pipeline has a long max duration, where do you start analysing, and how do you prevent this scenario from happening again? Monitoring with the GitLab CI Pipeline Exporter for Prometheus is great; another way of looking into CI/CD pipelines can be tracing.
CI/CD tracing with OpenTelemetry is discussed in https://gitlab.com/gitlab-org/gitlab/-/issues/338943 to learn about user experiences and define the next steps. Imho a very hot topic, with more awareness of metrics and traces from everyone. For example, seeing the full trace for a pipeline from start to end with different spans inside, and learning that the container image pull takes a long time - that can be the entry point into deeper analysis.
Another idea is to make app instrumentation easier for developers, providing tips for e.g. adding /metrics as an HTTP endpoint using Prometheus and OpenTelemetry client libraries. That way you not only see the CI/CD infrastructure & pipelines, but also user-side application performance monitoring and beyond in distributed environments. I'm collecting ideas for blog posts in https://gitlab.com/gitlab-com/marketing/corporate_marketing/...
For someone starting with pipeline efficiency tasks, I'd recommend setting a goal - like in the blog post, X minutes down to Y - and then starting with analysis to get an idea of the blocking parts. Evaluate and test solutions for each part; e.g. a terraform apply might depend on AWS APIs, whereas a Docker pull could be switched to the Dependency Proxy in GitLab for caching.
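As a hedged sketch of the Dependency Proxy switch (the image tag and job script are just placeholders), a job only needs to prefix its image with the predefined group variable:

    build:
      image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/node:16
      script:
        - yarn build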
Each environment has different requirements - collect helpful resources from howtos, blog posts, docs, HN threads, etc. and also ask the community about their experience. https://forum.gitlab.com/ is a good spot too. I'd also recommend creating an example project highlighting the pipeline, allowing everyone to fork, analyse, and add suggestions.
I think it would be amazing if GitLab CI allowed sending pipeline traces to an OTLP endpoint; I could then decide via the OTel Collector where to send the trace spans, e.g. Google Cloud Trace or Jaeger.
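Something like this OpenTelemetry Collector sketch is what I have in mind - receive OTLP and fan the spans out to Jaeger; the endpoint and the exporter choice are assumptions:

    receivers:
      otlp:
        protocols:
          grpc:
          http:

    exporters:
      jaeger:
        endpoint: jaeger-collector:14250   # Jaeger gRPC collector
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [jaeger]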
https://docs.gitlab.com/ee/ci/yaml/index.html#cache
https://docs.gitlab.com/ee/ci/yaml/index.html#cachekey
The key identifier can also be the job name, the branch (commit ref), or something else unique to this job or project pipeline. The example in the blog post could also use
key: yarn-cache-$CI_COMMIT_REF_SLUG
to better reflect its purpose.
GitLab 13.11 added support for multiple cache keys per job: https://about.gitlab.com/releases/2021/04/22/gitlab-13-11-re...
If you are using a monorepo, or work with submodules and different package systems, you may have Python, Ruby, and NodeJS in the same CI job. Previously, a job could only define a single cache, so all 'path' entries went into one list sharing the same global cache.
Specifying multiple keys with different path locations allows keeping caches separated and, as such, gives better performance for each specific job. Some jobs may not need NodeJS and can specify only the Ruby cache key, for example.
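A minimal sketch of separate caches in one job (the job name, keys, paths, and install commands are assumptions):

    test:
      cache:
        - key: ruby-gems-$CI_COMMIT_REF_SLUG
          paths:
            - vendor/ruby/
        - key: node-modules-$CI_COMMIT_REF_SLUG
          paths:
            - node_modules/
      script:
        - bundle install --path vendor/ruby
        - yarn install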
If you'd like to invalidate the cache every time a specific file (yarn.lock, go.sum, etc.) changes, you can explicitly configure this behavior using cache:key:files https://docs.gitlab.com/ee/ci/yaml/index.html#cachekeyfiles
This can help prevent corrupted caches, e.g. previously downloaded packages which are stale and not used by the current dependency tree. Your code may still optionally import them, and jobs fail because of the old dependency. You cannot reproduce that problem in your dev environment though, since you start with a fresh container and no caches. I have debugged these things before; it takes a while to identify job caches as the culprit. That said, I'd suggest accepting a slightly smaller performance gain and invalidating caches when dependencies change - if that makes sense for the package manager, especially with frequently changing recursive dependencies. I've seen it with Python.
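For Python/pip, a hedged sketch that ties the cache key to requirements.txt and redirects pip's download cache into the project directory (the job name and the test command are placeholders):

    variables:
      PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

    test:
      cache:
        key:
          files:
            - requirements.txt
        paths:
          - .cache/pip
      script:
        - pip install -r requirements.txt
        - pytest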
Tip for failing jobs - by default, caches are only saved when the job succeeds, meaning that a large pip install is lost even if only the user-defined unit test command failed afterwards.
To avoid that slowdown in the pipeline, you can use cache:when:always to always save the cache. https://docs.gitlab.com/ee/ci/yaml/index.html#cachewhen
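Added to the Python sketch above, it is one extra attribute:

    test:
      cache:
        key:
          files:
            - requirements.txt
        paths:
          - .cache/pip
        when: always   # save the cache even if the unit tests fail
      script:
        - pip install -r requirements.txt
        - pytest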
Exercises to learn with Python are at slide 109 https://docs.google.com/presentation/d/12ifd_w7G492FHRaS9CXA...