New feature request: Selectively disable caching for specific RUN commands in Dockerfile #1996

Closed
mohanraj-r opened this issue Sep 24, 2013 · 271 comments
Labels
area/builder kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Comments

@mohanraj-r

Branching off the discussion from #1384:

I understand that -no-cache disables caching for the entire Dockerfile. But it would be useful to be able to disable the cache for a specific RUN command, for example one that updates repos or downloads a remote file. From my understanding, a cached RUN apt-get update currently wouldn't actually update the repo, so the results could differ from what you would get on a VM.

If disabling caching for specific commands in the Dockerfile is made possible, would the subsequent commands in the file then not use the cache? Or would they do something a bit more intelligent, e.g. use the cache if the previous command produced the same results (fs layer) as in a previous run?

@tianon
Member

tianon commented Sep 24, 2013

I think the way to combat this is to build the Dockerfile up to the point you want cached, tag that as an image, and use it in the FROM of your future Dockerfile. That Dockerfile can then be built with -no-cache without consequence, since the base image would not be rebuilt.

@mohanraj-r
Author

But wouldn't this make it hard to interleave cached and non-cached commands?

For example, let's say I want to update my repo, wget files from a server, and perform a bunch of steps in between, e.g. install software from the repo (which could have been updated) or perform operations on the downloaded file (which could have changed on the server).

What would be ideal is a way to tell docker, in the Dockerfile, to run specific commands without the cache every time, and to reuse the previous image only if there is no change (e.g. no update in the repo).

Wouldn't this be useful to have?

@joelreymont

What about CACHE ON and CACHE OFF in the Dockerfile? Each instruction would affect subsequent commands.

@konklone

Yeah, I'm using git clone commands in my Dockerfile, and if I want it to re-clone with updates, I need to, like, add a comment at the end of the line to trigger a rebuild from that line. I shouldn't need to create a whole new base container for this step.

@githart

githart commented Nov 6, 2013

Can a container ID be passed to 'docker build' as a "do not cache past this ID" instruction? Similar to the way in which 'docker build' will cache all steps up to a changed line in a Dockerfile?

@shykes
Contributor

shykes commented Jan 6, 2014

I agree we need more powerful and fine-grained control over the build cache. Currently I'm not sure exactly how to expose this to the user.

I think this will become easier with the upcoming API extensions, specifically naming and introspection.

@timruffles
Contributor

Would be a great feature. Currently I'm using silly things like RUN a=a some-command, then RUN a=b some-command to break the cache

@rogernolan

Getting better control over the cache would make using docker from CI a lot happier.

@crosbymichael
Contributor

@shykes

What about changing --no-cache from a bool to a string and having it take a regex for where in the Dockerfile we want to bust the cache?

docker build --no-cache "apt-get install" .

@shykes
Contributor

shykes commented Feb 7, 2014

I agree and suggested this exact feature on IRC.

Except I think to preserve reverse compatibility we should create a new flag (say "--uncache") so we can keep --no-cache as a (deprecated) bool flag that resolves to "--uncache .*"


@crosbymichael
Contributor

What does everyone else think about this? Anyone up for implementing the feature?

@timruffles
Contributor

I'm up for having a stab at implementing this today if nobody else has started?

@timruffles
Contributor

I've started work on it - wanted to validate the approach looks good.

  • The noCache field of buildfile becomes a *regexp.Regexp.
    • A nil value there means what utilizeCache = true used to.
  • Passing a string to docker build --no-cache now sends a validated regex string to the server.
  • Just calling --no-cache results in a default of .*
  • The regex is then used in a new method buildfile.utilizeCache(cmd []string) bool to check commands that ignore cache

One thing: as far as I can see, the flag/mflag package doesn't support string flags without a value, so I'll need to do some extra fiddling to support both --no-cache and --no-cache some-regex
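
For readers following the proposal, here is a minimal sketch of the matching logic in the bullets above. It is written in Python rather than the builder's Go, and `compile_no_cache` and `utilize_cache` are hypothetical names chosen for illustration, not the PR's actual code. It assumes an empty flag value stands in for a bare --no-cache.

```python
import re
from typing import Optional, Sequence


def compile_no_cache(value: Optional[str]) -> Optional[re.Pattern]:
    """Turn the flag value into a pattern; None means the flag was not given."""
    if value is None:
        return None  # flag absent: every step may use the cache
    if value == "":
        value = ".*"  # bare --no-cache: bust the cache for every step
    return re.compile(value)  # raises re.error on an invalid regex


def utilize_cache(no_cache: Optional[re.Pattern], cmd: Sequence[str]) -> bool:
    """Return True when the cached layer may be reused for this instruction."""
    if no_cache is None:
        return True
    # A match anywhere in the instruction text means "do not use the cache".
    return no_cache.search(" ".join(cmd)) is None
```

With this shape, a pattern such as "apt-get install" only affects the instructions whose text it matches, and everything else keeps using the cache.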

@tianon
Member

tianon commented Feb 25, 2014

I really think this ought to be a separate new flag. The behavior and syntax of --no-cache is already well defined and used in many, many places by many different people. I'd vote for --break-cache or something similar, and have --no-cache do exactly what it does today (since that's very useful behavior that many people rely on and still want).

Anyways, IANTM (I am not the maintainer) so these are just my personal thoughts. :)

@timruffles
Contributor

@tianon --no-cache is currently bool, so this simply extends the existing behaviour.

  • docker build --no-cache - same behaviour as before: ignores cache
  • docker build --no-cache someRegex - ignores any RUN or ADD commands that match someRegex

@tianon
Member

tianon commented Feb 25, 2014

Right, that's all fine. The problem is that --no-cache is a bool, so the existing behavior is actually:

  • --no-cache=true - explicitly disable cache
  • --no-cache=false - explicitly enable cache
  • --no-cache - shorthand for --no-cache=true

I also think we'd be doing ourselves a disservice by making "true" and "false" special case regex strings to solve this, since that will create potentially surprising behavior for our users in the future. ("When I use --no-cache with a regex of either 'true' or 'false', it doesn't work like it's supposed to!")

@timruffles
Contributor

@tianon yes you're right. Had a quick look and people are using =true/false.

Happy to modify the PR to add new flag as you suggest, what do the maintainers think (@crosbymichael, @shykes)? This would also mean I could remove the code added to mflag to allow string/bool flags.

@crazyscience

+1 for @wagerlabs approach

@marcuslinke
Contributor

@crosbymichael, @timruffles Wouldn't it be better if the author of the Dockerfile decided which build steps should be cached and which should not? The person who creates the Dockerfile is not necessarily the one who builds the image. Moving the decision to the docker build command demands detailed knowledge from a person who just wants to use a specific Dockerfile.

Consider a corporate environment where someone just wants to rebuild an existing image hierarchy to update some dependencies. The existing Dockerfile tree may have been created years ago by someone else.

@hunterloftis

+1 for @wagerlabs approach

@cressie176
Contributor

+1 for @wagerlabs approach although it would be even nicer if there was a way to cache bust on a time interval too, e.g.

CACHE [interval | OFF]
RUN apt-get update
CACHE ON

I appreciate this might fly in the face of the idea of containers being deterministic; however, it's exactly the sort of thing you want to do in a continuous deployment scenario where your pipeline has good automated testing.

As a workaround, I'm currently generating cache busters in the script I use to run docker build and adding them in the Dockerfile to force a cache bust:

FROM ubuntu:13.10
ADD ./files/cachebusters/per-day /root/cachebuster
...
ADD ./files/cachebusters/per-build /root/cachebuster
RUN git clone git@github.com:cressie176/my-project.git /root/my-project
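
The generating script is not shown in the thread, so the following is only a sketch of what it might look like, assuming the two files are compared by content when ADD decides whether to reuse a layer. `cachebuster_values` and `write_cachebusters` are hypothetical helpers, and the choice of an ISO date and a random hex string as file contents is an assumption.

```python
import uuid
from datetime import date
from pathlib import Path
from typing import Dict


def cachebuster_values(today: date) -> Dict[str, str]:
    """Contents for each cachebuster file, keyed by file name."""
    return {
        "per-day": today.isoformat(),   # changes once per calendar day
        "per-build": uuid.uuid4().hex,  # changes on every invocation
    }


def write_cachebusters(root: Path, values: Dict[str, str]) -> None:
    """Write one file per entry under root, creating the directory if needed."""
    root.mkdir(parents=True, exist_ok=True)
    for name, value in values.items():
        (root / name).write_text(value + "\n")


# Typical use before calling docker build:
# write_cachebusters(Path("files/cachebusters"), cachebuster_values(date.today()))
```

Steps above the per-day ADD stay cached indefinitely, steps between the two ADD lines are refreshed daily, and steps below the per-build ADD run on every build.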

@tfoote

tfoote commented Apr 19, 2014

I'm looking to use containers for continuous integration and the ability to set timeouts on specific elements in the cache would be really valuable. Without this I cannot deploy. Forcing a full rebuild every time is much too slow.

My current plan to work around this is to dynamically inject commands such as RUN echo 2014-04-17-00:15:00, with the generated timestamp rounded down to the last 15 minutes, so that cache elements are invalidated when the rounded number jumps, i.e. every 15 minutes. This works for me because I have a script generating the Dockerfile every time, but it won't work without that script.
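
A rough sketch of the rounding step, assuming the generator is a Python script. `cache_bust_line` is a hypothetical helper, and the timestamp format simply mirrors the example in the comment.

```python
from datetime import datetime, timedelta


def cache_bust_line(now: datetime, interval_minutes: int = 15) -> str:
    """Return a RUN line whose text changes only once per interval."""
    if 60 % interval_minutes != 0:
        raise ValueError("interval_minutes must divide 60 evenly")
    # Drop everything below the start of the current interval.
    floored = now - timedelta(
        minutes=now.minute % interval_minutes,
        seconds=now.second,
        microseconds=now.microsecond,
    )
    return "RUN echo " + floored.strftime("%Y-%m-%d-%H:%M:%S")
```

Because the instruction text is identical for every build within the same 15-minute window, the cache is reused inside the window and missed once the window rolls over.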

@amarnus

amarnus commented May 2, 2014

+1 for the feature.

@hiroprotagonist

I also want to vote for this feature. The cache is annoying when building parts of a container from git repositories which update only on the master branch.
👍

@amarnus

amarnus commented May 7, 2014

@hiroprotagonist Having a git pull in your ENTRYPOINT might help?

@hiroprotagonist

@amarnus I've solved it in a way similar to the idea @tfoote had. I am running the build from a Jenkins job, and instead of running the docker build command directly, the job starts a build script which generates the Dockerfile from a template and adds the line 'RUN echo currentsMillies' above the git commands. Thanks to sed and pipes this was a matter of minutes. Anyway, I still favor this feature as part of the Dockerfile itself.

@olavurmortensen

I agree that this feature would be very helpful.

At the moment, I use the solution suggested above using the ARG command, to increment a build number. As shown below.

FROM something
ARG build=1
RUN some-non-deterministic-command

This works fine. But the problem with this solution is that it requires you to remember to increment the build variable. It is certain that one will, at some point, forget this, and spend two days figuring out what is going wrong.

@thomas10-10

Why was this issue closed?

@caio-vinicius

I agree that this feature would be very helpful.

At the moment, I use the solution suggested above using the ARG command, to increment a build number. As shown below.

FROM something
ARG build=1
RUN some-non-deterministic-command

This works fine. But the problem with this solution is that it requires you to remember to increment the build variable. It is certain that one will, at some point, forget this, and spend two days figuring out what is going wrong.

You can use the $RANDOM env variable.
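
Note that $RANDOM is a shell variable, so as far as I know the value has to be supplied by whatever invokes the build rather than written into the Dockerfile itself. A minimal sketch of a wrapper that does this, assuming the Dockerfile declares `ARG build` above the step to refresh as in the quoted snippet; `random_build_command` is a hypothetical helper.

```python
import secrets
from typing import List


def random_build_command(tag: str, context: str = ".") -> List[str]:
    """Build a docker build argv that passes a fresh random value for ARG build."""
    token = secrets.token_hex(8)  # 16 hex characters, new on every call
    return ["docker", "build", "--build-arg", f"build={token}", "-t", tag, context]


# Typical use:
# import subprocess
# subprocess.run(random_build_command("my-image"), check=True)
```

This removes the need to remember to increment the number by hand, at the cost of never reusing the cache for the steps that depend on the argument.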

@caio-vinicius

Would love this feature.

@zkscpqm

zkscpqm commented Dec 23, 2021

For anyone that has the luxury to automate their builds, this is what I like to do:

I put a placeholder in the Dockerfile template like: https://github.com/zkscpqm/Car-Zix/blob/master/Dockerfile_template#L9 which dictates where my cache ends.

I then spawn a unique Dockerfile each time I do a build and I replace the placeholder with some hash:
https://github.com/zkscpqm/Car-Zix/blob/master/run_tests_docker.py#L38-L44

Hope this helps))
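
For anyone who does not want to follow the links, the general shape of this approach can be sketched as below. This is not the linked script: the `{{CACHE_BUST}}` marker, the `render_dockerfile` helper, and the use of a `RUN echo` line are all assumptions made for illustration.

```python
import hashlib

# Hypothetical marker placed in the Dockerfile template where the cache should end.
PLACEHOLDER = "{{CACHE_BUST}}"


def render_dockerfile(template: str, seed: str) -> str:
    """Replace the marker with an instruction whose text depends on seed."""
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:12]
    return template.replace(PLACEHOLDER, f"RUN echo cache-bust-{digest}")


# Typical use, with a seed that differs per build (for example a timestamp):
# import time
# text = render_dockerfile(open("Dockerfile_template").read(), str(time.time()))
# open("Dockerfile", "w").write(text)
```

Everything above the marker keeps its cached layers, and everything from the marker down is rebuilt whenever the seed changes.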

TheTeXnician pushed a commit to islandoftex/texlive that referenced this issue May 19, 2022
@koplenov

koplenov commented Jun 9, 2022

Nine years on, have there been any changes?

In 2013, the CACHE ON and CACHE OFF commands were proposed.
Each instruction will affect subsequent commands.

How is it now?

@HariSekhon

HariSekhon commented Jun 10, 2022

Solutions I came up with in the interim:

  1. To invalidate the cache at a specific step every time:
ADD http://date.jsontest.com /etc/builddate

or

ADD http://worldclockapi.com/api/json/utc/now /etc/builddate
  2. For GitHub repos, only invalidate the cache at this step if the repo, in this case HariSekhon/DevOps-Python-tools, has had new commits since the last build:
ADD https://api.github.com/repos/HariSekhon/DevOps-Python-tools/git/refs/heads/master /.git-hashref

I use these sorts of tricks a lot in my large Dockerfiles repo containing lots of different apps and builds, including packaging my GitHub repos tools, scripts and dependencies:

https://github.com/HariSekhon/Dockerfiles

These and other tricks are most succinctly shown in my master Dockerfile template in my Templates repo which has templates for lots of the most popular DevOps technologies like Make, Jenkins, GitHub Actions, Docker, Kubernetes etc...:

https://github.com/HariSekhon/Templates/blob/master/Dockerfile
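
If fetching the URL with ADD is not an option, the same "only rebuild on new commits" idea could be moved into a build wrapper. The sketch below assumes the refs endpoint returns JSON with the commit hash under `object.sha`, that the payload has already been downloaded, and that the Dockerfile declares a hypothetical `ARG GIT_SHA` above the clone step. `ref_sha` and `sha_build_arg` are illustrative names.

```python
import json
from typing import List


def ref_sha(payload: str) -> str:
    """Extract the commit hash from a refs API response body."""
    return json.loads(payload)["object"]["sha"]


def sha_build_arg(payload: str, arg_name: str = "GIT_SHA") -> List[str]:
    """Return the docker build arguments that pin the build to that commit."""
    return ["--build-arg", f"{arg_name}={ref_sha(payload)}"]
```

The argument value only changes when the branch head moves, so builds between commits can keep using the cached clone.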

@MaxTranced

  1. To invalidate the cache at a specific step every time:

Neat!

Does the worldclockapi.com service have a documentation page? (I could not find it.) Or do you know of any way to invalidate the cache once every Monday? I'm guessing an API endpoint that returns the current week number would achieve that...

Thank you so much!

@HariSekhon

@MaxTranced I've used date.jsontest.com more than worldclockapi.com (which is giving me a 503 error right now), but there are some others that should work, found at the top of a Google search:

https://timeapi.io/swagger/index.html

http://worldtimeapi.org/pages/examples

The latter seems like it can do week of the year as week_number according to its schema documentation:

http://worldtimeapi.org/pages/schema

This one has an API to return just the week:

https://timezoneapi.io

eg.

https://timezoneapi.io/api/ip/?token=<YOUR_TOKEN>&only=datetime(week)

but it unfortunately also returns the execution time which would bust the cache on every request because the millisecond timing would be different. You might want to contact them and see if there is an option to not do that and point them to this thread as the Dockerfile use case.

Another solution is to wrap your docker build in a Makefile or CI/CD step which does a date '+%W' > week_of_year.txt before the docker build step and then have your Dockerfile COPY week_of_year.txt /etc/ or similar to break the cache once a week.
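
If the wrapper is a script rather than a Makefile, the same step can be sketched in Python. `week_stamp` and `write_week_file` are hypothetical helpers; `%W` is the same strftime directive used by the date command above, counting weeks from the first Monday of the year.

```python
from datetime import date
from pathlib import Path


def week_stamp(day: date) -> str:
    """Week of the year as a zero-padded string, with Monday as the first day."""
    return day.strftime("%W")


def write_week_file(path: Path, day: date) -> None:
    """Write the week number to the file that the Dockerfile will COPY."""
    path.write_text(week_stamp(day) + "\n")


# Typical use before docker build:
# write_week_file(Path("week_of_year.txt"), date.today())
```

The file content changes every Monday, so the COPY step and everything after it are rebuilt once a week.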

@MaxTranced

@MaxTranced I've used date.jsontest.com more than worldclockapi.com (which is giving me a 503 error right now), but there are some others that should work, found at the top of a Google search:
[...]
Another solution is to wrap your docker build in a Makefile or CI/CD step which does a date '+%W' > week_of_year.txt before the docker build step and then have your Dockerfile COPY week_of_year.txt /etc/ or similar to break the cache once a week.

Thank you so much for the suggestions! I did search for a while but did not find the timezoneapi.io service. I will put the advice to good use!

@douglasg14b

Using the ARG with a random value doesn't seem to prevent COPY commands from caching what they are copying....

How can I prevent caching of certain COPY commands in my Dockerfile?

@ThomasParistech

I tried using the same trick as
ADD https://api.github.com/repos/HariSekhon/DevOps-Python-tools/git/refs/heads/master /.git-hashref

but got the following error when building the image for the second time.
failed to load cache key: invalid not-modified ETag: "5366c7b8ba2a8e3a77f127e5cf2839fcf610582492997674ea17ab659df1cce3"

Any clue?

@HariSekhon

HariSekhon commented Nov 15, 2022

Using the ARG with a random value doesn't seem to prevent COPY commands from caching what they are copying....

How can I prevent caching of certain COPY commands in my Dockerfile?

@douglasg14b

Is the COPY command definitely below the ARG command in the Dockerfile in that case? If so then perhaps that's an optimization that Docker has made more recently... on which version of Docker do you see that behaviour?

@HariSekhon

HariSekhon commented Nov 15, 2022

@ThomasParistech

Which version of Docker is that happening for you?

I've definitely used that before... perhaps the behaviour has been changed to reference a cache key load but I'm unsure how that could be interpreted that way given this is the current output of the sample URL I gave above:

$ curl https://api.github.com/repos/HariSekhon/DevOps-Python-tools/git/refs/heads/master
{
  "ref": "refs/heads/master",
  "node_id": "MDM6UmVmNDUwNDkwMjY6cmVmcy9oZWFkcy9tYXN0ZXI=",
  "url": "https://api.github.com/repos/HariSekhon/DevOps-Python-tools/git/refs/heads/master",
  "object": {
    "sha": "dc4b1ce2b2fbee3797b66501ba3918a900a79769",
    "type": "commit",
    "url": "https://api.github.com/repos/HariSekhon/DevOps-Python-tools/git/commits/dc4b1ce2b2fbee3797b66501ba3918a900a79769"
  }
}

Are you querying a different URL that is returning only a hashref that Docker is interpreting differently or are you targeting a /git/refs/heads/master github URL which is returning JSON as shown above?

@ThomasParistech

@HariSekhon
I'm using the following config:
Docker version 20.10.12, build 20.10.12-0ubuntu2~20.04.1
And got the same kind of JSON as you but I run my command

I'm also using BuildKit, # syntax=docker/dockerfile:1.3

This is a private repo, so I pass my GitHub personal access token as well, but I don't think this explains the difference

@ImanolSantiago

Nine years on, have there been any changes?

In 2013, the CACHE ON and CACHE OFF commands were proposed. Each instruction will affect subsequent commands.

How is it now?

greetings from the future
Bad news... we still don't have a practical solution

I'm surprised they can't or don't want to implement it

punchagan added a commit to punchagan/current-bench that referenced this issue Apr 19, 2023
Simply running git clone as a layer meant that the cached repository is always
used, even when the bench_repo has been changed. To work around this, we use
the GitHub refs API to see if the repo has changed, to decide whether to use
the cached bench_repo or make a new clone of it.  The trick here is from this
GitHub [issue
comment](moby/moby#1996 (comment))

Also, this commit adds support for specifying a specific branch of the
bench_repo to use for running the benchmarks.  The branch can be specified
using the `/tree/<branch-name>` suffix in the bench_repo URL.
ElectreAAS pushed a commit to ocurrent/current-bench that referenced this issue Apr 21, 2023
* Skip installing opam dependencies when using separate bench_repo

* Ensure cached bench_repo is used only when repo has no changes

@Aiosa

Aiosa commented Apr 29, 2023

if a new commit is pushed to that repository, the step won't be executed again, because the RUN itself didn't change. Docker has no way to determine "what" is executed in a RUN instruction, other than what's in the Dockerfile. Generally, recommendation for that is to make the Dockerfile deterministic, e.g. using a build-arg;

This is all nice in theory, but then you hit real-world use cases, for example in Kubernetes, where you want to run the same image as both a job and a service. Nothing like this works well there, since you are forced to keep a bunch of variables and arguments up to date in various configuration files (e.g., YAML). If you have multiple repositories and change things frequently (development in the cloud with containers with 100GB+ RAM), you realize that theory and practice are two different things. And the only thing you wanted was an up-to-date git repository clone.

@HariSekhon

@Aiosa I agree that with git repo clones you want to get an up-to-date clone... Did you see my solution for that above? I thought it was quite novel:

#1996 (comment)

@Edwardius

Sad that there's no current solution :(. I love using dev containers, but I have a couple of build commands in the Dockerfile that I wish not to be cached whenever I change my code. It's annoying to find that Docker has cached a past build.

If only there were a way to specify to Docker only to try caching up to a certain build stage.

@tonistiigi
Member

Disabling the cache for specific commands can be done with the --no-cache-filter option. You can put the commands you want to control under a named stage and then list those stages with the flag.

FROM alpine AS base
RUN cmd-that-keeps-cache

FROM base AS pkgs
RUN install-cmd-every-time

build --no-cache-filter=pkgs
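
For builds driven from a script, a small sketch of assembling that command line follows. `filtered_build_command` is a hypothetical helper. It spells the command as `docker buildx build` and repeats the flag once per stage, both of which are assumptions about the local setup rather than something stated in the comment above.

```python
from typing import List, Sequence


def filtered_build_command(stages: Sequence[str], context: str = ".") -> List[str]:
    """Build an argv that skips the cache only for the named stages."""
    cmd = ["docker", "buildx", "build"]
    for stage in stages:
        cmd.append(f"--no-cache-filter={stage}")
    cmd.append(context)
    return cmd


# Typical use with the Dockerfile above:
# import subprocess
# subprocess.run(filtered_build_command(["pkgs"]), check=True)
```

With the two-stage Dockerfile shown, the base stage keeps its cached layers while the pkgs stage is rebuilt every time.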
