Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker 1.8 corrupts /etc/hosts files in containers #17190

Closed
thockin opened this issue Oct 20, 2015 · 86 comments · Fixed by #17325
Closed

Docker 1.8 corrupts /etc/hosts files in containers #17190

thockin opened this issue Oct 20, 2015 · 86 comments · Fixed by #17325
Labels
area/networking kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.
Milestone

Comments

@thockin
Copy link
Contributor

thockin commented Oct 20, 2015

related: #16619

@mavenugo - We have a problem.

We've been experiencing periodic end-to-end test failures where some containers can not resolve "localhost". Killing the container and letting it restart usually fixes the problem. Today we got it into this state and pinned down the following:

The container in question has a /etc/hosts file that is several hundred lines long. That /etc/hosts file does NOT have localhost or the container's own IP. If I add those 2 lines the tests pass.

Of course I thought of #16619 . So we looked at /etc/hosts of every container on this machine - 84 in total. Those files range in size from 175 lines (correct: 7 boilerplate + 2 per container) to 1700+ lines. Looking at the offending files, there are a lot of duplicate IPs. It looks like old containers have not been removed, but localhost, "self", and the ipv6 boilerplate has been removed.

Even looking at my local test machine (far less churn) I see the same. I have 4 containers with network namespace running, I should have 8+7=15 lines, right?

# wc -l /var/lib/docker/containers/*/hosts
  15 /var/lib/docker/containers/3fc23782814513d1934ce21c0f1b096263df1fcacee6787208c3cda715822356/hosts
  15 /var/lib/docker/containers/4b7e8432a368eeb7f1dc49230f16c1d7c788b471e4dfa7d8fbf709c3495f430f/hosts
  19 /var/lib/docker/containers/5715aa09fc091ee0ff3d4dcf2ee300d8e52603dafcde8f24aaf1def960cf4a28/hosts
  15 /var/lib/docker/containers/6e7fe564dece829de3c335400357b2ef327a1a44c638045350f0b49e5324a189/hosts
  15 /var/lib/docker/containers/7845351acbacf0588142a1ebfa3fa481c3191da48a188b20c5d5fdae689cef1e/hosts
  17 /var/lib/docker/containers/ba23cbabfa90700f4d9ba80c5723afb1d7479a08b12efdded74e5874991ce76b/hosts
  17 /var/lib/docker/containers/c1e64038b35a0527b6aef911dad002278569724263cc922efd8e715747e83538/hosts
  19 /var/lib/docker/containers/fc7c5475fd923b75a0595431dcaf6b6db5fa8a18409d0df4e6282e3c0bd8aedf/hosts

Diffing a 15 and 19 line file shows:

# diff -u -U 100 <(sort -n /var/lib/docker/containers/3fc23782814513d1934ce21c0f1b096263df1fcacee6787208c3cda715822356/hosts) <(sort -n /var/lib/docker/containers/5715aa09fc091ee0ff3d4dcf2ee300d8e52603dafcde8f24aaf1def960cf4a28/hosts)
--- /dev/fd/63  2015-10-19 23:59:20.618879795 +0000
+++ /dev/fd/62  2015-10-19 23:59:20.618879795 +0000
@@ -1,15 +1,19 @@
 ::1    localhost ip6-localhost ip6-loopback
 fe00::0    ip6-localnet
 ff00::0    ip6-mcastprefix
 ff02::1    ip6-allnodes
 ff02::2    ip6-allrouters
+10.244.0.10    distracted_ritchie
+10.244.0.10    distracted_ritchie.bridge
 10.244.0.2 k8s_POD-7be6d81d_fluentd-cloud-logging-kubernetes-minion-isfq_kube-system_dcfde077753b8b6908952cfea6a14200_c4f00d1d
 10.244.0.2 k8s_POD-7be6d81d_fluentd-cloud-logging-kubernetes-minion-isfq_kube-system_dcfde077753b8b6908952cfea6a14200_c4f00d1d.bridge
 10.244.0.3 k8s_POD-7be6d81d_heapster-v10-lx8q7_kube-system_2a3fc6d6-6ddb-11e5-9249-42010af00002_cbe79e10
 10.244.0.3 k8s_POD-7be6d81d_heapster-v10-lx8q7_kube-system_2a3fc6d6-6ddb-11e5-9249-42010af00002_cbe79e10.bridge
 10.244.0.4 k8s_POD-c5371ceb_monitoring-influxdb-grafana-v2-koh3j_kube-system_2a43f315-6ddb-11e5-9249-42010af00002_76d3cb92
 10.244.0.4 k8s_POD-c5371ceb_monitoring-influxdb-grafana-v2-koh3j_kube-system_2a43f315-6ddb-11e5-9249-42010af00002_76d3cb92.bridge
 10.244.0.5 k8s_POD-9db2f941_kube-ui-v2-v4yd6_kube-system_2a42e6da-6ddb-11e5-9249-42010af00002_69a0a0c7
 10.244.0.5 k8s_POD-9db2f941_kube-ui-v2-v4yd6_kube-system_2a42e6da-6ddb-11e5-9249-42010af00002_69a0a0c7.bridge
-10.244.0.5 kube-ui-v2-v4yd6
+10.244.0.6 k8s_POD-6e934112_kube-dns-v9-ee38a_kube-system_2a4c2763-6ddb-11e5-9249-42010af00002_86e49fbc
+10.244.0.6 k8s_POD-6e934112_kube-dns-v9-ee38a_kube-system_2a4c2763-6ddb-11e5-9249-42010af00002_86e49fbc.bridge
+10.244.0.6 kube-dns-v9-ee38a
 127.0.0.1  localhost

Is there any reason the hosts files would be different for different containers on the same node?

I know docker is live-editing the hosts file (which is absolutely unsafe, but should only be unsafe for the users in the container). Any clues on what's going on?

@ArtfulCoder

@phemmer
Copy link
Contributor

phemmer commented Oct 20, 2015

Yeah, modification of this file is rather racy. I can corrupt the file very easily with the following:

docker run -d --name=test-master busybox sleep 99999999

for i in {00..09}; do docker create --name test-$i --link test-master busybox sleep 60; done

for i in {00..09}; do ( for (( j=0; j < 20; j++ )); do docker start test-$i; docker kill test-$i; done ) & done; wait

docker exec test-master cat /etc/hosts

@brendandburns
Copy link

The basic problem is that Docker is using ioutil.WriteFile from golang:

https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/etchosts/etchosts.go#L66
https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/etchosts/etchosts.go#L93

which uses truncate and write:
https://golang.org/pkg/io/ioutil/#WriteFile

Truncate and write is inherently non-atomic and racy.

The only safe-ish thing to do here is to use file rename instead of truncate and write, this still isn't 100% atomic, but it is dramatically safer (eg. http://stackoverflow.com/questions/7054844/is-rename-atomic for details)

Honestly, the only truly safe thing to do here is to use an actual DNS server like we do in Kubernetes, instead of mucking with /etc/hosts

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

rename should be atomic, BUT it doesn't work across a bind mount.

On Mon, Oct 19, 2015 at 10:38 PM, Brendan Burns notifications@github.com
wrote:

The basic problem is that Docker is using ioutil.WriteFile from golang:

https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/etchosts/etchosts.go#L66

https://github.com/docker/docker/blob/master/vendor/src/github.com/docker/libnetwork/etchosts/etchosts.go#L93

which uses truncate and write:
https://golang.org/pkg/io/ioutil/#WriteFile

Truncate and write is inherently non-atomic and racy.

The only safe-ish thing to do here is to use file rename instead of
truncate and write, this still isn't 100% atomic, but it is dramatically
safer (eg. http://stackoverflow.com/questions/7054844/is-rename-atomic
for details)

Honestly, the only truly safe thing to do here is to use an actual DNS
server like we do in Kubernetes, instead of mucking with /etc/hosts


Reply to this email directly or view it on GitHub
#17190 (comment).

@LK4D4
Copy link
Contributor

LK4D4 commented Oct 20, 2015

I agree that DNS would be cool. Messing with hosts also slow (in addition to races) if you have a lot of containers.

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

All that aside - docker 1.8 seems to be totally broken in this regard. Can
we just get this turned off for a 1.8.4?

On Mon, Oct 19, 2015 at 10:43 PM, Alexander Morozov <
notifications@github.com> wrote:

I agree that DNS would be cool. Messing with hosts also slow (in addition
to races) if you have a lot of containers.


Reply to this email directly or view it on GitHub
#17190 (comment).

@brendandburns
Copy link

yes please, second the request for a flag so that we can disable /etc/hosts writing.

@phemmer
Copy link
Contributor

phemmer commented Oct 20, 2015

The basic problem is that Docker is using ioutil.WriteFile from golang

I think it's more than that.
Docker is preserving custom entries in /etc/hosts. It doesn't rebuild it from scratch using only docker's own data. This means that you've got docker reading the file, making changes, then writing it back out. If you get 2 things doing this at the same time, one of them is going to discard the changes of the other.

@rade
Copy link

rade commented Oct 20, 2015

request for a flag so that we can disable /etc/hosts writing

You are not the first person to ask: moby/libnetwork#519

In weave we introduced a hack that prevents docker from updating /etc/hosts. We are not proud of it, but it's the only option available to us.

@mrjana
Copy link
Contributor

mrjana commented Oct 20, 2015

@thockin The first order of business is to fix the racy behavior of writing to /etc/hosts, which btw has existed since --links existed. Just that with CNM we are updating /etc/hosts of all containers in a given network. That makes live updating more common.

Disabling /etc/hosts update is a UI discussion and has implications on the --links option and should be discussed separately as apart of #16619

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

@mrjana 2 thoughts.

  1. I don't know of any way to update a file across a bind-mount atomically. The usual rename() technique simply doesn't work. I don't think this can be made safe.
thockin@freakshow:/tmp/test-bind-rename$ echo first > hosts

thockin@freakshow:/tmp/test-bind-rename$ echo nothing > container.hosts

thockin@freakshow:/tmp/test-bind-rename$ sudo mount --bind hosts container.hosts 

thockin@freakshow:/tmp/test-bind-rename$ cat hosts 
first

thockin@freakshow:/tmp/test-bind-rename$ cat container.hosts 
first

thockin@freakshow:/tmp/test-bind-rename$ echo second > container.hosts 

thockin@freakshow:/tmp/test-bind-rename$ cat hosts 
second

thockin@freakshow:/tmp/test-bind-rename$ cat container.hosts 
second

thockin@freakshow:/tmp/test-bind-rename$ echo third > new

thockin@freakshow:/tmp/test-bind-rename$ mv new hosts

thockin@freakshow:/tmp/test-bind-rename$ cat hosts 
third

thockin@freakshow:/tmp/test-bind-rename$ cat container.hosts 
second

Even hardlinks don't work.

thockin@freakshow:/tmp/test-bind-rename$ echo first > hosts

thockin@freakshow:/tmp/test-bind-rename$ ln hosts container.hosts

thockin@freakshow:/tmp/test-bind-rename$ cat hosts 
first

thockin@freakshow:/tmp/test-bind-rename$ cat container.hosts 
first

thockin@freakshow:/tmp/test-bind-rename$ echo second > container.hosts 

thockin@freakshow:/tmp/test-bind-rename$ cat hosts 
second

thockin@freakshow:/tmp/test-bind-rename$ cat container.hosts 
second

thockin@freakshow:/tmp/test-bind-rename$ echo third > new

thockin@freakshow:/tmp/test-bind-rename$ mv new hosts

thockin@freakshow:/tmp/test-bind-rename$ cat hosts 
third

thockin@freakshow:/tmp/test-bind-rename$ cat container.hosts 
second
  1. At least with plain docker links you edit once at startup (I think), so it's far less dangerous (but maybe I am wrong). Either way, it's a much reduced exposure than updating every container for every other container. I think this needs an ASAP mititgation.

@mrjana
Copy link
Contributor

mrjana commented Oct 20, 2015

@thockin

For 1) We are going to experiment with acquiring a write lease(using F_SETLEASE) during the short period when live updating the /etc/hosts from docker. When we have write lease all other opens for writing and truncate will block for more than a few seconds(/proc/sys/fs/lease-break-time) and that is sufficient enough for us to complete the update.

For 2) even with links it wasn't just during startup. everytime the child container restarted the parent container's hosts file gets live updated with the child container's new IP

Anyway, I do agree we need to mitigate this race quickly.

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

I did not know about SETLEASE - I learned something. I also tried it, and
while F_SETLEASE seems to work on its own, it does NOT seem to work across
bind mounts.

thockin@freakshow:/tmp/test-bind-rename$ cat lease.c
#define _GNU_SOURCE 1
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <signal.h>

int main(int argc, char *argv[]) {
    signal(SIGIO, SIG_IGN);
    int fd = open("hosts", O_RDWR);
    fcntl(fd, F_SETLEASE, F_WRLCK);
    sleep(10);
    exit(0);
}

thockin@freakshow:/tmp/test-bind-rename$ make lease
cc     lease.c   -o lease

thockin@freakshow:/tmp/test-bind-rename$ echo first > hosts

thockin@freakshow:/tmp/test-bind-rename$ echo nothng > container.hosts

thockin@freakshow:/tmp/test-bind-rename$ ./lease & time cat hosts
[1] 20737
first
[1]+  Done                    ./lease

real 0m10.001s
user 0m0.000s
sys 0m0.116s

thockin@freakshow:/tmp/test-bind-rename$ sudo mount --bind hosts
container.hosts

thockin@freakshow:/tmp/test-bind-rename$ cat hosts container.hosts
first
first

thockin@freakshow:/tmp/test-bind-rename$ ./lease & time cat hosts[1] 20953
first

real 0m0.001s
user 0m0.000s
sys 0m0.001s

thockin@freakshow:/tmp/test-bind-rename$ ./lease & time cat container.hosts
[2] 20977
first

real 0m0.008s
user 0m0.000s
sys 0m0.001s

Same behavior with hard links. My guess is that Linux refuses to promise a
lease on dirents with a link count > 1.

On Tue, Oct 20, 2015 at 8:50 AM, Jana Radhakrishnan <
notifications@github.com> wrote:

@thockin https://github.com/thockin

For 1) We are going to experiment with acquiring a write lease(using
F_SETLEASE) during the short period when live updating the /etc/hosts
from docker. When we have write lease all other opens for writing and
truncate will block for more than a few
seconds(/proc/sys/fs/lease-break-time) and that is sufficient enough for us
to complete the update.

For 2) even with links it wasn't just during startup. everytime the child
container restarted the parent container's hosts file gets live updated
with the child container's new IP

Anyway, I do agree we need to mitigate this race quickly.


Reply to this email directly or view it on GitHub
#17190 (comment).

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

http://lxr.free-electrons.com/source/fs/locks.c#L1576

1587         if ((arg == F_WRLCK) && ((d_count(dentry) > 1) ||
1588             (atomic_read(&inode->i_count) > 1)))
1589                 ret = -EAGAIN;

On Tue, Oct 20, 2015 at 9:19 AM, Tim Hockin thockin@google.com wrote:

I did not know about SETLEASE - I learned something. I also tried it, and
while F_SETLEASE seems to work on its own, it does NOT seem to work across
bind mounts.

thockin@freakshow:/tmp/test-bind-rename$ cat lease.c
#define _GNU_SOURCE 1
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <signal.h>

int main(int argc, char *argv[]) {
    signal(SIGIO, SIG_IGN);
    int fd = open("hosts", O_RDWR);
    fcntl(fd, F_SETLEASE, F_WRLCK);
    sleep(10);
    exit(0);
}

thockin@freakshow:/tmp/test-bind-rename$ make lease
cc     lease.c   -o lease

thockin@freakshow:/tmp/test-bind-rename$ echo first > hosts

thockin@freakshow:/tmp/test-bind-rename$ echo nothng > container.hosts

thockin@freakshow:/tmp/test-bind-rename$ ./lease & time cat hosts
[1] 20737
first
[1]+  Done                    ./lease

real 0m10.001s
user 0m0.000s
sys 0m0.116s

thockin@freakshow:/tmp/test-bind-rename$ sudo mount --bind hosts
container.hosts

thockin@freakshow:/tmp/test-bind-rename$ cat hosts container.hosts
first
first

thockin@freakshow:/tmp/test-bind-rename$ ./lease & time cat hosts[1] 20953
first

real 0m0.001s
user 0m0.000s
sys 0m0.001s

thockin@freakshow:/tmp/test-bind-rename$ ./lease & time cat
container.hosts
[2] 20977
first

real 0m0.008s
user 0m0.000s
sys 0m0.001s

Same behavior with hard links. My guess is that Linux refuses to promise
a lease on dirents with a link count > 1.

On Tue, Oct 20, 2015 at 8:50 AM, Jana Radhakrishnan <
notifications@github.com> wrote:

@thockin https://github.com/thockin

For 1) We are going to experiment with acquiring a write lease(using
F_SETLEASE) during the short period when live updating the /etc/hosts
from docker. When we have write lease all other opens for writing and
truncate will block for more than a few
seconds(/proc/sys/fs/lease-break-time) and that is sufficient enough for us
to complete the update.

For 2) even with links it wasn't just during startup. everytime the child
container restarted the parent container's hosts file gets live updated
with the child container's new IP

Anyway, I do agree we need to mitigate this race quickly.


Reply to this email directly or view it on GitHub
#17190 (comment).

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

@mavenugo @mrjana @icecrime

Options that I see:

  1. Fix the merge logic to produce correct results, don't bind-mount /etc/hosts, use something like nsenter to do a proper atomic rename in each container's mount namespace, cut 1.8.4

  2. Fix the merge logic to produce correct results, bind-mount /etc/hosts.$version, symlink /etc/hosts -> /etc/hosts.$version, use something like nsenter to do a proper atomic rename of a new symlink in each container's mount namespace, cut 1.8.4

  3. Implement a flag to turn this whole function off, cut 1.8.4, fix it better in 1.9

Option 3 seems like the least risky, most likely to succeed plan. We can help produce the patch, if needed, though it will be faster if you guys do it (knowing where to look in the code is 75% of the problem).

What do you guys think?

@rade
Copy link

rade commented Oct 20, 2015

@thockin I suspect 1 and 2 would break weave's hack that prevents Docker from updating /etc/hosts.

Option 3, please.

@thockin
Copy link
Contributor Author

thockin commented Oct 20, 2015

Option 3 is our preferred plan, too, and that's the patch we'll carry if we have to.

I know 1.9.x is at RC, but this bug nullifies everything we have done to qual 1.8.x so we're begging for a 1.8.4 patch.

@brendandburns
Copy link

I'm personally willing to develop the patch for 1.8.4, I've dug through the code a bit, and I'm pretty sure I know where to insert the right flag points.

We're likely to develop this patch one way or the other, and we'll just carry a patched 1.8.4 in Kubernetes if we have to, but we would obviously very much rather not.

@mrjana
Copy link
Contributor

mrjana commented Oct 20, 2015

@thockin Getting back to fixing the racy /etc/hosts update here's something that might be the best possible solution:

#define _GNU_SOURCE 1
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>

int main(int argc, char *argv[]) {
  signal(SIGIO, SIG_IGN);
  int fd = open("hosts", O_RDWR);
  int rc;
  struct flock fl;

  memset(&fl, 0, sizeof fl);
  fl.l_type = F_WRLCK;
  fl.l_whence = SEEK_SET;
  fl.l_start = 0;
  fl.l_len = 0;

  rc = fcntl(fd, F_SETLKW, &fl);
  if (rc == -1) {
    perror("fcntl");
  }
  sleep(10);
  rc = fcntl(fd, F_UNLCK, &fl);
  if (rc == -1) {
    perror("fcntl unlock");
  }
  exit(0);
}

The point is to employ mandatory locking and confirmed with the following that it works across bind mounts:

mrjana@dev-1:/tmp/lease$ make lease
cc     lease.c   -o lease
mrjana@dev-1:/tmp/lease$
mrjana@dev-1:/tmp/lease$ echo first > hosts
mrjana@dev-1:/tmp/lease$ touch container.hosts
mrjana@dev-1:/tmp/lease$
mrjana@dev-1:/tmp/lease$ sudo mount -o bind,mand hosts container.hosts
mrjana@dev-1:/tmp/lease$ chmod g+s,g-x container.hosts
mrjana@dev-1:/tmp/lease$ ./lease & sleep 0.1 && time cat container.hosts
[1] 11287
first
[1]+  Done                    ./lease

real    0m10.777s
user    0m0.000s
sys     0m0.000s

Works across bind mounts(I acquire the lock in hosts while I try to read the bind mounted container.hosts)

But there is a catch. In some rare cases when the write system call is well past the check for lock and we try to acquire the lock, the lock will still be sucessfull. So it's not 100% race free but very close.

@thockin
Copy link
Contributor Author

thockin commented Oct 21, 2015

I just tried this. It's clever. PROBLEM: The host FS needs to be mounted
-o mand or the lock is just ignored. You can't safely assume that every
FS is mounted with -o mand. I'm not even sure every FS supports it.

Maybe you can mount all of /var/lib/docker/containers as a tmpfs with -o
mand? That's a little sketchy.

That tiny bit of race is still pretty messy, but obviously better. I still
think a flag to disable the whole thing would be less risky.

On Tue, Oct 20, 2015 at 4:45 PM, Jana Radhakrishnan <
notifications@github.com> wrote:

@thockin https://github.com/thockin Getting back to fixing the racy
/etc/hosts update here's something that might be the best possible solution:

#define _GNU_SOURCE 1
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <signal.h>
#include <string.h>

int main(int argc, char *argv[]) {
signal(SIGIO, SIG_IGN);
int fd = open("hosts", O_RDWR);
int rc;
struct flock fl;

memset(&fl, 0, sizeof fl);
fl.l_type = F_WRLCK;
fl.l_whence = SEEK_SET;
fl.l_start = 0;
fl.l_len = 0;

rc = fcntl(fd, F_SETLKW, &fl);
if (rc == -1) {
perror("fcntl");
}
sleep(10);
rc = fcntl(fd, F_UNLCK, &fl);
if (rc == -1) {
perror("fcntl unlock");
}
exit(0);
}

This point is to employ mandatory locking and confirmed with the following
that it works across bind mounts:

mrjana@dev-1:/tmp/lease$ make lease
cc lease.c -o lease
mrjana@dev-1:/tmp/lease$
mrjana@dev-1:/tmp/lease$ echo first > hosts
mrjana@dev-1:/tmp/lease$ touch container.hosts
mrjana@dev-1:/tmp/lease$
mrjana@dev-1:/tmp/lease$ sudo mount -o bind,mand hosts container.hosts
mrjana@dev-1:/tmp/lease$ chmod g+s,g-x container.hosts
mrjana@dev-1:/tmp/lease$ ./lease & sleep 0.1 && time cat container.hosts
[1] 11287
first
[1]+ Done ./lease

real 0m10.777s
user 0m0.000s
sys 0m0.000s

Works across bind mounts(I acquire the lock in hosts while I try to read
the bind mounted container.hosts)

But there is a catch. In some rare cases when the write system call is
well past the check for lock and we try to acquire the lock, the lock will
still be sucessfull. So it's not 100% race free but very close.


Reply to this email directly or view it on GitHub
#17190 (comment).

@mrjana
Copy link
Contributor

mrjana commented Oct 21, 2015

I have to characterize the host FS mount because in my test I never remounted my rootfs to mand but when I remounted my container.hosts file to mand it somehow propogated to the rootfs as well. That may be required for this to work.

But we don't need /var/lib/docker/containers to be in tmpfs. It may just have to be remounted to carry the mand option so it's not that sketchy.

@thockin
Copy link
Contributor Author

thockin commented Oct 21, 2015

At least on my system, you can not remount a directory with mand unless
the root FS is also mounted mand. Mounting tmpfs with mand DOES work,
though this means none of your state survives reboot (and is suddenly
orders of magnitude more expensive).

It's disappointing that this is still racy - if I already have the file
open()ed while docker tries to write it, it WILL get corrupted. I just
wrote a bunch of tests that convinced me of it. in process A: open, seek
to non-zero, sleep(5), write. In process B: open, take lock, write,
sleep(10), exit. Process B will write its contents, the lock will prevent
process A from write()ing until the lock is released, but eventually A will
write at whatever file offset it was at - corruption.

I still think the only complete answer is to nsenter the mount namespace of
every container, write a tmp file and rename() it to /etc/hosts.

More importantly: I think we're focusing on the race, but I am not
convinced that the race is what is causing the corruption - we don't write
to /etc/hosts in our tests. I think there are bugs in the content-merging
logic that is forgetting to remove some lines and occasionally removing
lines it shouldn't. I think THAT is far more important.

On Tue, Oct 20, 2015 at 5:48 PM, Jana Radhakrishnan <
notifications@github.com> wrote:

I have to characterize the host FS mount because in my test I never
remounted by rootfs to mand but when I remounted my container.hosts file
to mand it somehow propogated to the rootfs as well. That may be required
for this to work.

But we don't need /var/lib/docker/containers' to be in tmpfs. It may just
have to be remounted to carry themand` option so it's not that sketchy.


Reply to this email directly or view it on GitHub
#17190 (comment).

@rade
Copy link

rade commented Oct 21, 2015

I still think the only complete answer is to nsenter the mount namespace of
every container, write a tmp file and rename() it to /etc/hosts.

Please, please, please do not do this. At least not without also providing a way to disable the entire /etc/hosts updating.

@brendandburns
Copy link

Yes, please provide the ability to disable /etc/hosts updating. Users should be allowed to choose whether or not they want this feature. Perhaps the right approach is a per-container option in the API?

@rade
Copy link

rade commented Oct 21, 2015

per-container option in the API

+1

@thockin
Copy link
Contributor Author

thockin commented Oct 21, 2015

@rade why is that worse than what's happening today (or the proposed 90% fix to the race)? At least the atomic replacement will not corrupt the file.

@rade
Copy link

rade commented Oct 21, 2015

it is worse because it breaks the existing hacks for disabling /etc/hosts updating.

@mavenugo
Copy link
Contributor

@thockin I hope we all understand release-cycles and issue management :) My responses to @mfischer-zd is to the technical aspect of issue management and the fact that this issue was filed on 1.8.3 and is applicable to the default bridge network. And hence technically #17325 resolved this. Wont you agree to that ?
And with #17278 we have resolved the internal race issues as well.

With a healthy upstream/downstream relationship with Kubernetes and Docker, I hope this time around, you would be able to try out RCs early in the release cycle and not wait for x.x.3 release to give feedback.

@mfischer-zd
Copy link

Ultimately I leave the decision to you, @mavenugo - my only concern is that /etc/hosts management is in a bit of a quagmire at the moment (though it looks like you're making good progress on fixing them) and there is a large number of issues to sift through to figure out just which issue affects a particular user.

I don't mean to suggest that this is the appropriate issue to track those with, though. What I was trying to say is:

@thockin
Copy link
Contributor Author

thockin commented Oct 26, 2015

Until there's an open issue specifically about not corrupting reads of
/etc/hosts I think we DON'T all understand issue management. It's
attractive to close this one out, since it got some spotlights on it, but
let's not pretend it's entirely fixed for 1.9.

#17278 sounds like it fixes the biggest problem, though my casual reading
suggests that a hosts file for a shared net namespace (--net=container)
will be processed more than once.

As for zinging us on filing bugs late - I said it before, this bug took is
a LONG time to pin down (though it wasn't the first related bug I filed).
We will do everything we can to file bugs early and often, but we have our
own software to debug, too.
On Oct 25, 2015 7:35 PM, "Madhu Venugopal" notifications@github.com wrote:

@thockin https://github.com/thockin I hope we all understand
release-cycles and issue management :) My responses to @mfischer-zd
https://github.com/mfischer-zd is to the technical aspect of issue
management and the fact that this issue was filed on 1.8.3 and is
applicable to the default bridge network. And hence technically #17325
#17325 resolved this. Wont you
agree to that ?
And with #17278 #17278 we have
resolved the internal race issues as well.

With a healthy upstream/downstream relationship with Kubernetes and
Docker, I hope this time around, you would be able to try out RCs early in
the release cycle and not wait for x.x.3 release to give feedback.


Reply to this email directly or view it on GitHub
#17190 (comment).

@mrjana
Copy link
Contributor

mrjana commented Oct 26, 2015

@thockin As I said before, the racy behavior existed since --link existed so let's not think that this racy behavior was introduced in 1.9. I agree it's not fixed in 1.9 but it's not fixed in 1.8, 1.7, 1.6 or 1.5 and so on as well. It just became more common in 1.8.3 because docker itself was contributing for the race and was reported by you via this issue and we fixed it. Since this issue was reported for that specific problem it is appropriate to close the issue. I am perfectly fine to track the racy behavior issue as a separate issue. But let's not mix a lot of things and confuse users.

Also @mfischer-zd now that the default bridge network does not do automatic service discovery please retest to see if issue number 3 (about /etc/hosts not getting updated if a linked container is restarted) is still seen. If seen it would be helpful to file a separate issue for that because it is definitely not the same as racy behavior issue.

@mrjana
Copy link
Contributor

mrjana commented Oct 26, 2015

@thockin

#17278 sounds like it fixes the biggest problem, though my casual reading
suggests that a hosts file for a shared net namespace (--net=container)
will be processed more than once.

A shared net namespace (--net=container) will not even invoke libnetwork for the child containers. Once the parent container is setup that is all there is to it. All the children simply bind mount the the hosts file created for parent container. So I don't know what in that #17278 gave you the impression that it is getting processed more than once in the shared net namespace case.

@thockin
Copy link
Contributor Author

thockin commented Oct 26, 2015 via email

@thockin
Copy link
Contributor Author

thockin commented Oct 26, 2015

Maybe I misunderstood then. I understood that the corruption was parallel
goroutines accessing the file and I assumed that was once per
namespace-user. If not that, then what parallel goroutines? Just multiple
starts/stops in parallel?

Thanks, sorry for any FUD - honest misunderstanding.

On Sun, Oct 25, 2015 at 10:07 PM, Jana Radhakrishnan <
notifications@github.com> wrote:

@thockin https://github.com/thockin

#17278 #17278 sounds like it fixes
the biggest problem, though my casual reading
suggests that a hosts file for a shared net namespace (--net=container)
will be processed more than once.

A shared net namespace (--net=container) will not even invoke libnetwork
for the child containers. Once the parent container is setup that is all
there is to it. All the children simply bind mount the the hosts file
created for parent container. So I don't know what in that #17278
#17278 gave you the impression
that it is getting processed more than once in the shared net namespace
case.


Reply to this email directly or view it on GitHub
#17190 (comment).

@mrjana
Copy link
Contributor

mrjana commented Oct 26, 2015

And let's not be disingenuous - yes it existed forever, but with 1.9
you're encouraging people to create Network objects and as soon as
they do, Docker is back to "contributing for the race".

I am not sure which race you are talking about. But if you are talking about Docker contributing to the race (oh I just corrected my grammatical mistake that I made when I typed by message in a hurry and I am so glad how you pointed out my mistake so subtly by repeating the same mistake in quotes. That was very kind of you) by even writing to /etc/hosts file then you don't have to create Network objects and you don't need 1.9. You just need to use links in any previous release it is supported. I don't know where I am being disingenuous here.

If you are talking about Docker contributing to the race by having multiple threads of its own trying to write to the same file, then that is fixed whether or not you use 1.9's new Network objects.

@mrjana
Copy link
Contributor

mrjana commented Oct 26, 2015

Maybe I misunderstood then. I understood that the corruption was parallel
goroutines accessing the file and I assumed that was once per
namespace-user. If not that, then what parallel goroutines? Just multiple
starts/stops in parallel?

Yep, just single start running in parallel with a lot of stops.

@thockin
Copy link
Contributor Author

thockin commented Oct 26, 2015

On Sun, Oct 25, 2015 at 10:24 PM, Jana Radhakrishnan
notifications@github.com wrote:

Yep, just single start running in parallel with a lot of stops.

Got it, thanks. Sorry for the confusion.

@thockin
Copy link
Contributor Author

thockin commented Oct 26, 2015

On Sun, Oct 25, 2015 at 10:23 PM, Jana Radhakrishnan
notifications@github.com wrote:

And let's not be disingenuous - yes it existed forever, but with 1.9
you're encouraging people to create Network objects and as soon as
they do, Docker is back to "contributing for the race".

I am not sure which race you are talking about. But if you are talking about Docker
contributing to the race (oh I just corrected my grammatical mistake that I made when
I typed by message in a hurry and I am so glad how you pointed out my mistake so
subtly by repeating the same mistake in quotes. That was very kind of you) by even

Sorry, wasn't meant to be a jab, I just thought it was more context.
It came across badly.

writing to /etc/hosts file then you don't have to create Network objects and you don't
need 1.9. You just need to use links in any previous release it is supported. I don't know
where I am being disingenuous here.

One of the big features of 1.9 (as I understand the plan) is networks.
Using networks brings up the live-editing-causes-read-corruption
problem which is NOT solved. I expect that using networks, the
live-edit operation will happen far more frequently than with links.
Ergo, users will experience problems at a far higher rate.

Yes, "don't use networks" is a valid workaround, but it's not one I
expect you to advocate too strongly for :)

The disingenuity is in closing this bug without filing another issue
that details the still-existing read-corruption in 1.9.x. That's all
I am agitating for.

@mrjana
Copy link
Contributor

mrjana commented Oct 26, 2015

The disingenuity is in closing this bug without filing another issue
that details the still-existing read-corruption in 1.9.x. That's all
I am agitating for.

Oh, we don't mean to hide it under the carpet. We will properly document it in the Caveats section of 1.9 release(this is all already being planned for and I am not making this up on the fly) and we will file a proper issue describing the problem in greater detail so everybody is aware. This bug was closed by automatic github workflow because we marked this issue as "Fixes" in #17325. Also it is not going to be an effective way to use this issue to track and educate users with the racy behavior issue. It may confuse the users than providing them clarity. That is why we need a separate issue.

You can rest assured that we are not dropping the ball on this.

@thockin
Copy link
Contributor Author

thockin commented Oct 26, 2015

Sounds fair to me. Thanks.

On Sun, Oct 25, 2015 at 10:49 PM, Jana Radhakrishnan <
notifications@github.com> wrote:

The disingenuity is in closing this bug without filing another issue
that details the still-existing read-corruption in 1.9.x. That's all
I am agitating for.

Oh, we don't mean to hide it under the carpet. We will properly document
it in the Caveats section of 1.9 release(this is all already being planned
for and I am not making this up on the fly) and we will file a proper issue
describing the problem in greater detail so everybody is aware. This bug
was closed by automatic github workflow because we marked this issue as
"Fixes" in #17325 #17325. Also it
is not going to be an effective way to use this issue to track and educate
users with the racy behavior issue. It may confuse the users than providing
them clarity. That is why we need a separate issue.

You can rest assured that we are not dropping the ball on this.


Reply to this email directly or view it on GitHub
#17190 (comment).

@aidanhs
Copy link
Contributor

aidanhs commented Oct 26, 2015

Hang on, am I to understand that you're removing a feature people are relying on (in my case, relying on it extensively) between rc2 and the official release?

I can see that having this blindly enabled all the time is problematic and I'm definitely in favour of having opt-out, but it looks like there's a workaround upthread and I've been relying on this behaviour for two months now, since 1.8.0 came out. My impression from #16619 (comment) was that this would just be a flag to disable the behaviour (even if it's on the daemon level).

Make no mistake, this is a breaking change and will cause significant misery for a number of users - the feature is an incredible convenience for a number of reasons, and in some cases it enables use-cases that would otherwise be impossible (for example, exposing an SSH agent inside docker build).

@mfischer-zd
Copy link

@aidanhs I've been looking at the documentation and release notes for Docker and I can't seem to find any mention of the (now reverted) behavior in question being documented anywhere.

Bugs and mistakes happen; relying on undocumented behavior is going to cause problems for you.

As far as I can tell, the only documented way to get Docker to automatically allow containers to address each other is to specify the --link option to docker run. If this changes, it needs to be published; only then can one reasonably rely on it (at least until the next major revision is released).

@aidanhs
Copy link
Contributor

aidanhs commented Oct 26, 2015

@mfischer-zd I'd see this as equivalent to a new behaviour being added to the VOLUME command in dockerfiles such that if you do VOLUME /etc:/outeretc, it mounts `/etc/ from the host into the container.

That functionality would be fairly nasty, but if it had lasted 2.5 months, spanning 3 official releases and two release candidates, I'd understand if the people who did like that functionality at least wanted it available as an argument to docker build given that they may well have built some kind of process around it.

There are 4 different ways the functionality could be made available (excluding reverting the revert) listed at #17325 (comment), but none are available right now.

I'd understand if this was a minor convenience or clearly a hack, but saying 'tough luck' about the removal a feature which feels intentional and seems to align with "batteries included" is fairly uncharitable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed.
Projects
None yet
Development

Successfully merging a pull request may close this issue.