
Upgrade daemon without restarting containers #2658

Closed
shykes opened this issue Nov 11, 2013 · 30 comments · Fixed by #20662
Assignees
Labels
exp/expert kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. roadmap
Milestone

Comments

@shykes
Contributor

shykes commented Nov 11, 2013

Docker needs a way to upgrade itself gracefully, with minimal interruption of service. Ideally, all containers would continue to function with zero downtime and zero behavior change. This may not always be possible, for example if the upgrade introduces significant changes to the container's runtime environment itself. In that case, Docker should give the sysadmin maximum flexibility: in a perfect world, the upgrade could be rolled out separately for each container.

@jpetazzo
Contributor

Some notes about this.

There are two ways to handle upgrades:

  • handle upgrades and crashes the same way (i.e., Docker can fall back on its feet if it crashes, so it doesn't need a specific upgrade path)
  • handle upgrades and crashes differently (because some things cannot be recovered anyway).

Then there are multiple areas to consider:

  • logging (Docker receives output+error of containers through pipes)
  • process monitoring (in particular, exit status of container)
  • operations graph layers (commit, push, pull...)
  • networking (userland proxy, iptables, others)
  • connected clients (in-flight API requests + attached containers)
  • ptys (the pty is currently held by Docker, not by the container)

Each area deserves a whole discussion of its own. Let me know where the best place for that is :-)

@discordianfish
Contributor

If we "handle upgrades and crashes the same way" (as I would prefer), the focus should be on crash recovery rather than on planned upgrades. Crash recovery was the biggest concern we had when thinking about introducing Docker. Logging was a very important topic for us, and this is definitely something we can improve on.

@erikh
Contributor

erikh commented Jul 9, 2014

I'd like to tackle this, but need a bit more context. My guess is that we could do something like this with container launches:

  • ensure FD_CLOEXEC is not set
  • setsid
  • exec the container

This should allow containers to run standalone. If we don't blow away the iptables rules on docker -d shutdown, gather and track pids on startup, and integrate the linksv2 work to replace the userspace proxy, I think this will work.

@tianon
Member

tianon commented Jul 9, 2014

Isn't this essentially #2733? They're related, at least.

@erikh
Contributor

erikh commented Jul 9, 2014

Not exactly. This is replacing docker -d with little or no downtime.


@tianon
Member

tianon commented Jul 9, 2014

Ah, cheers - thanks @erikh

@bgrant0607

+1. If Docker is killed (e.g., via SIGTERM) or dies and restarts for ~any reason, it should not disrupt running containers.

@kuon
Contributor

kuon commented Jul 31, 2014

+1, this is tightly related to #6851

@bobrik
Contributor

bobrik commented Dec 9, 2014

👍 for this; starting heavy services that need to check terabytes of data on startup is a pain with the current behavior. I've seen 30% of a Mesos cluster go down because Docker hit a nil pointer dereference caused by a very slow or hung layer download.

@kim0
Contributor

kim0 commented Feb 10, 2015

Now that 1.5 is out, can someone please hack on this so the world can update peacefully? :)

@jessfraz jessfraz added kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny kind/enhancement Enhancements are not bugs or new features but can improve usability or performance. exp/expert and removed kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny labels Feb 26, 2015
@voycey

voycey commented Mar 3, 2015

This is the one thing currently that is preventing me from moving everything over 👍

@bobrik
Contributor

bobrik commented Jun 5, 2015

Any update on this? Restarting all containers on a huge machine because docker pull hangs on a small monitoring image doesn't look like a good solution. Especially when you have to do it several times.

@mudverma

Hi All,

I have put out a proposal for Docker's hot-upgrade and SPoF issue. Please take a look.

#13884

Thanks
Mudit

@bfirsh bfirsh changed the title Hot upgrades Restart daemon without restarting containers Dec 11, 2015
@bfirsh bfirsh added the roadmap label Dec 11, 2015
@bfirsh bfirsh changed the title Restart daemon without restarting containers Upgrade daemon without restarting containers Dec 11, 2015
@claytono

@thaJeztah Can you explain at a high-level how this will work when it's in place?

@thaJeztah
Member

thaJeztah commented Apr 14, 2016

@Dvorak basically, the containerd-shim allows both the docker daemon and the containerd daemon to be restarted without killing the container. Upon restart, the daemons can re-connect to the containers via the containerd-shims.

You can already test the basics by installing an experimental version (https://experimental.docker.com) and running the daemon manually (we're investigating an issue when it's started through systemd, #21933).

Basically:

```
# manually start the daemon
docker daemon -D &
```

Then start a container, and `kill -9` the daemon. The container should remain running, but `docker ps` should give an error, because the daemon isn't running.

After that, starting the daemon again (`docker daemon -D &`) should re-attach it to the containers, and `docker ps` should show the container.

So, basically, that functionality allows stopping the daemon, upgrading it, and starting it again after upgrading.

@IX-Erich

IX-Erich commented Apr 23, 2016

Just issued a docker-machine upgrade on my Mac without shutting down Docker first - that destroyed all of the containers, which represented about twelve hours of work. Is that supposed to happen? To the OP's point - not very graceful behavior.

@mbentley
Contributor

That's not really relevant to this issue. I would suggest reporting that here: https://github.com/docker/machine/issues

@IX-Erich

Sorry! Thought this was the closest hit… Appreciate the reply.


@crosbymichael crosbymichael added this to the 1.12.0 milestone Apr 29, 2016
@crosbymichael crosbymichael self-assigned this Apr 29, 2016
@crosbymichael
Contributor

Added this to the 1.12 milestone and assigned to myself.

What we currently have left to do is:

  • Persist graph driver reference counting
  • Persist network information for restore

@BSWANG
Contributor

BSWANG commented May 17, 2016

/var/run/docker.sock mount in containers refuse connection after daemon hot upgrade #22789

@KramKroc

Hi, I took the experimental release today to test this out. I found a few things that I thought would be worth flagging (just to note that my containers were initially started with docker-compose):

  1. After the daemon was restarted, the docker-proxy processes were killed and so could no longer serve traffic to the exposed ports. The container itself seems to be fine, but without the port it's hard to send traffic to it to verify it's operating as expected.

  2. Containers that I had previously stopped prior to killing and restarting the docker daemon were restarted. Will this be the expected behaviour in the final solution, or will the container state be honoured after a daemon restart?

$ docker info
Containers: 15
 Running: 13
 Paused: 0
 Stopped: 2
Images: 16
Server Version: 1.12.0-dev
Storage Driver: devicemapper
 Pool Name: docker-253:0-1567975-pool
 Pool Blocksize: 65.54 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 3.6 GB
 Data Space Total: 107.4 GB
 Data Space Available: 28.54 GB
 Metadata Space Used: 7.041 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.14 GB
 Thin Pool Minimum Free Space: 10.74 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Deferred Deletion Enabled: false
 Deferred Deleted Device Count: 0
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 WARNING: Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.107-RHEL7 (2015-10-14)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: null host bridge
Kernel Version: 3.10.0-327.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 7.64 GiB
Name: localhost.localdomain
ID: E33U:MJER:UUZR:LFBN:RIFS:KAP4:FP5F:OAQF:73AR:3GWU:LX5I:BC76
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 85
 Goroutines: 123
 System Time: 2016-05-24T12:13:22.032865741+01:00
 EventsListeners: 0
Username: 
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Experimental: true
Insecure Registries:
 127.0.0.0/8

@aluminous

Hi, sorry to drag up an old issue, but did this make it into the 1.12.0 release or was it pushed back?

We recently upgraded, but restarting dockerd still stops running containers. If I understood correctly, killing dockerd shouldn't kill running containers anymore, right?

@thaJeztah
Member

@aluminous yes, it was implemented in #23213, which is in 1.12.0, but it's not enabled by default. Documentation was added through #24970 and can be found here: https://docs.docker.com/engine/admin/live-restore/
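For readers landing here later: since the feature is opt-in, it has to be enabled explicitly. Per the documentation linked above, this is done either with the `--live-restore` daemon flag or persistently in `/etc/docker/daemon.json`:

```json
{
  "live-restore": true
}
```

After adding this, the daemon must be reloaded or restarted once for the setting to take effect.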

@aluminous

Apologies, I missed that section of the documentation. Thank you!

@thaJeztah
Member

No worries, I noticed it wasn't referred to from this issue, so was good to add some links here for reference anyway 😄

lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 29, 2016
On a percentage of DC/OS agents (~5%) with Docker 1.11.2, the Daemon
will fail to start up with the following error:

> Error starting daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to allocate gateway
> (172.17.0.1): Address already in use

This seems to be related to a Docker bug around the network controller
initialization, where the controller has allocated an ip pool and
persisted some state but not all of it. See:

* moby/moby#22834
* moby/moby#23078

This fix simply removes the docker0 interface if it exists before
starting the Docker daemon. This fix will need to be re-evaluated if we
want to enable the 1.12+ containerd live-restore like Docker options as
discussed in:

* https://docs.docker.com/engine/admin/live-restore/
* moby/moby#2658
lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 29, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 29, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 30, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Aug 31, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Sep 1, 2016
lingmann pushed a commit to lingmann/dcos that referenced this issue Sep 1, 2016
mellenburg pushed a commit to mellenburg/dcos that referenced this issue Sep 2, 2016