Getting Started with Docker (serversforhackers.com)
296 points by fideloper on March 21, 2014 | hide | past | favorite | 75 comments


This skips over the hard part: managing docker containers. Poking a hole directly to the container is a leaky abstraction. A reverse proxy like HAProxy or Varnish should be sitting in front of the container.

Once you have the reverse proxy set up, the next problem that arises is routing to containers based on the domain. Now your HAProxy or Varnish config is going to get bloated, and every time you deploy a container the config needs to be modified and reloaded. By this time you might be looking at Chef or Puppet for automating the config generation.

Chef and Puppet are not simple to learn. They have their own set of quirks (like unreliable tooling support on Windows). I'm in the process of conquering this, but I hope one day there will be a simpler way.


This is a great point. The initial Docker examples make everything seem easy, but we blew way past our time estimate when integrating Docker into our workflow because of the points you mention. I am still happy with the choice to use Docker though, and our team will be better at server administration in the future.

One thing about this getting started guide is that it recommends the Phusion base image which boots init. That seems to go against the best practices outlined in a recent article by Michael Crosby - http://crosbymichael.com/dockerfile-best-practices-take-2.ht...


I'm one of the authors behind Phusion's baseimage.

Phusion's baseimage does not go against Michael Crosby's best practices. His best practices say not to boot init, and by that he means the normal init. He says that not because the init process itself is a bad idea, but because the normal init performs all kinds of work that is either unnecessary or harmful inside a Docker container. In fact, this is exactly what the Baseimage-docker documentation also states: don't use the normal init.

The Phusion baseimage does not contain a normal init, but a special init that is specifically designed for use in Docker.


Nice, hadn't read those. Thanks! I was wondering about that, but still need a solution for logging, cron jobs, and similar (perhaps running those on the host machine is the answer).


I am still looking for good solutions for those too, and trying to add some concepts to my toolbox like orchestration, service discovery, proxies, data containers, ambassador containers, and so on. It's hard for me to wrap my head around the different recommended ways to use Docker compared to my initial expectations.


The comment below by user pg_fukd_mydog pointed out that FreeBSD jails do this right.


What is that username, something from reddit? I really have no interest in FreeBSD.


To be clear - I think FreeBSD is an awesome project and the jails look solid, but it is really the Docker community and exploding Docker ecosystem that makes Docker appealing to me. For example, Red Hat embracing Docker means that I can build Docker containers for enterprises that embrace CentOS and Red Hat. None of the corporations or organizations I have worked with use or support FreeBSD, but they all support CentOS. Reference: http://www.infoworld.com/t/application-virtualization/red-ha...


Update etcd with connection details on container start/stop. Then use a script to watch the appropriate directory in etcd for changes and regenerate the config.

Look at "fleet" from CoreOS, and especially their "sidekick" example that uses systemd dependencies to trigger etcd updates: https://coreos.com/docs/launching-containers/launching/launc... though you can certainly do this without fleet too.

Then on the haproxy/varnish box (or put them in a container), run something like "etcdctl exec-watch /services/website -- updateconfig.sh", where updateconfig.sh would be a script that regenerates the config and reloads.
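The regeneration step can be tiny. A minimal sketch of what updateconfig.sh might produce, where the key layout under /services/website and all names are assumptions, and a pure rendering function stands in for the part that etcdctl would drive:

```python
# Sketch: turn a dict of service entries (as you might read them from
# etcd under /services/website) into an HAProxy backend section.
def render_backend(name, services):
    """services maps container id -> "host:port"."""
    lines = ["backend %s" % name]
    for cid in sorted(services):
        lines.append("    server %s %s check" % (cid, services[cid]))
    return "\n".join(lines)

# In the real script you'd read the entries from etcd, write the result
# into haproxy.cfg, and reload HAProxy.
print(render_backend("website", {"web1": "10.0.0.5:8080", "web2": "10.0.0.6:8080"}))
```

The point is that the config is derived entirely from what's registered in etcd, so deploys never touch the proxy config by hand.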

I don't see how your config will get "bloated" any more than it would otherwise - presumably your number of domains won't increase.


This is what DNS is for. If you specify your backends by hostname instead of IP, then each time the load-balancer tries to connect to the backend, it'll get a list of A records from its DNS resolver and pick one in a round-robin fashion. Thus, if you have a dynamic DNS server that queries your presence service, it can return exactly the hosts that are up right now as the round-robin set.
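A sketch of the client side of that idea (the hostname and the injected resolver are hypothetical, and a real load balancer would re-resolve when the record's TTL expires rather than cache the cycle forever):

```python
import itertools
import socket

def backend_cycle(hostname, port, resolve=socket.getaddrinfo):
    """Resolve a backend hostname and round-robin over its A records."""
    infos = resolve(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addrs = sorted({info[4][0] for info in infos})  # unique IPs
    return itertools.cycle(addrs)

# Usage: picker = backend_cycle("web.internal", 8080); addr = next(picker)
```

If the DNS server only returns hosts that are currently up, the round-robin set shrinks and grows automatically as containers come and go.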

Right now, if you use SkyDNS[1] as your DNS server and attach Skydock to the Docker host, this all Just Works. Most people want to use etcd instead of Skydock, though, so support for that is coming soon[2] too.

[1] https://github.com/skynetservices/skydns

[2] https://groups.google.com/forum/#!topic/coreos-dev/iklEYHh5J...


SkyDNS looks interesting, but it doesn't appear to do any health checks on the endpoint. I don't want clients to receive an answer to an A record query that contains the IP address of an endpoint that is down.

Am I mistaken?


You probably don't want a healthcheck done for every DNS request.

Better to have a health-check service doing the health checking, and have it update SkyDNS (along with whatever else needs to happen) when a service goes down.


In order to be robust in the event of a network partition, the client should perform the health check itself. This can be done in a background thread; it doesn't have to be synchronous with the DNS lookup (and that would be very bad for performance anyway).


Doesn't it mean that every tcp connection would have to query dns at the beginning? What about client dns cache? Also, isn't that expensive in terms of connection time?


> Poking a hole directly to the container is a leaky abstraction. A reverse proxy like HAProxy or Varnish should be sitting in front of the container.

It might be a stupid question but I wonder what's considered a leaky abstraction in this case.

By the way, I'm not sure I fully understand your concerns over reverse proxy routing, but I recall that Ambassador pattern linking[0] is a suggested way of tying Docker containers over network. Also, these slides by dotCloud[1] may be helpful as well (I'm not sure if approaches described are up-to-date, though).

[0] http://docs.docker.io/en/latest/use/ambassador_pattern_linki...

[1] http://www.slideshare.net/dotCloud/deploying-containers-and-...


>It might be a stupid question but I wonder what's considered a leaky abstraction in this case.

I consider poking a hole a leaky abstraction because you are exposing the internals of your stack. The consumer should not know or care that you are using Docker containers to serve the application. From a security perspective, directly exposing a container may lead to potential exploits of Docker itself.


Exposing a container would not expose potential exploits of Docker. Docker is not running in the container (and the container doesn't know anything about Docker).

Exposing a container is a hell of a lot safer than exposing a service on the host OS.

Also, I'm not sure I follow how a consumer would know whether you are using Docker or not.


I basically came here to say the same, which I guess is the question shared by 80% of Docker's target market.

I have a box sitting somewhere which, like virtually any dedicated machine, is wildly overprovisioned for its current usage patterns.

I would like to virtualize my services so that I can one day, when my needs outgrow my box, scale out without having to rewrite any code.

My box has limited IPs available, so I'll need the network between services to be private/internal.

How do I set that up with Docker?

I think it won't be until you can truly easily answer that question that Docker will really take off.


I'm probably just going to show my ignorance, here...but why doesn't container linking solve this problem?

Could you not run multiple docker containers/services behind a single nginx or apache container on a production server? Then the nginx container basically gets one of your public IP addresses, and you use linking to that container to provide it with knowledge of the other running processes' IP addresses (each within their own container, of course). In that way, you have one public-facing container which has knowledge of the other containers and can use the information provided through -link to configure the nginx server to route requests appropriately. This requires a bit of bash script / sed command-line hackery to update your nginx configuration to accommodate the changing IP addresses of the other containers on restart (unless you can set them by hand using Docker now; we still don't), but once you get it set up you never have to think about it again.
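The sed hackery mentioned above can be quite small. A sketch, assuming the nginx container is linked with alias `app` to a container exposing port 8080, so Docker injects `APP_PORT_8080_TCP_ADDR` (the template file and default address here are illustrative):

```shell
# Fall back to a made-up address when run outside Docker linking.
APP_PORT_8080_TCP_ADDR="${APP_PORT_8080_TCP_ADDR:-172.17.0.5}"

# A config template with a placeholder for the linked container's IP.
cat > app.conf.tpl <<'EOF'
upstream app {
    server {{APP_ADDR}}:8080;
}
EOF

# Substitute the real address on container start, then reload nginx.
sed "s/{{APP_ADDR}}/$APP_PORT_8080_TCP_ADDR/" app.conf.tpl > app.conf
cat app.conf
```

Running this in the nginx container's entrypoint regenerates the upstream block each time the linked containers restart with new addresses.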

Like I said, maybe I'm just showing my ignorance, but something like the above scenario is how we get around hosting multiple services with limited public IP addresses available.


I don't like container linking because it is basically tied to one server unless you add extra complexity.

CoreOS' "fleet" ( https://coreos.com/docs/launching-containers/launching/launc... ) gives a cluster-wide solution.

But even without using fleet, the overall mechanism is fairly easy to adapt: Either use systemd dependencies like in their example, or have a script that queries docker on each host to spot changes in running containers, and update an etcd instance (or whichever your preferred config server is).


The "extra complexity" needed for multi-machine setups over linking containers is actually pretty minor.

Your services need to read something to get the IP, which ultimately comes from an ENV variable. In the linked container scenario Docker sets that variable. Otherwise you set it manually. That's the only extra complexity.

I was worried about this too, so I tried it out[1]. In this case I have a YAML config file, which can be overridden by ENV variables (which may come from Docker).

This isn't as automatic as CoreOS (eg, no failover etc), but it is a lot less complex.

[1] https://github.com/nlothian/Acuitra/blob/master/services/que...
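The precedence mechanism described above is tiny. A sketch (names are hypothetical, and a plain dict stands in for the YAML file):

```python
import os

# Static defaults, standing in for the values parsed from a YAML config
# file; env vars (which Docker sets for linked containers) take priority.
DEFAULTS = {"QUERY_HOST": "10.0.0.20", "QUERY_PORT": "8081"}

def service_setting(name, defaults=DEFAULTS):
    """Environment wins; the config file value is the fallback."""
    return os.environ.get(name, defaults[name])
```

In the linked-container case Docker populates the env var for you; on a multi-machine setup you export it yourself when starting the container. The app code is identical either way.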


Why not use ipv6 for the network between services?


I'd love to see an article on a setup using ipv6, that'd be cool.


I've been thinking along these lines recently, specifically service discovery for front-end load-balancers.

Most (all?) of the available reverse proxies will stop sending traffic to a server that is offline, but won't discover new ones. There are solutions such as etcd which you can hook into, or you can write a toy application that uses UDP broadcasts to advertise "Hey, I'm http://dev.local.com/ on port 4444", but there isn't a lot beyond that.

Templating configuration files and running "haproxy reload" is a common enough middle-ground, but I've seen it fail often. (Specifically keepalived not reloading correctly and still sending traffic to old nodes.)

ObRelated: Varnish is a beast that few people can configure easily. I'd love to work on a caching reverse proxy that was simple, extensible, and fast.


> ObRelated: Varnish is a beast that few people can configure easily. I'd love to work on a caching reverse proxy that was simple, extensible, and fast.

It doesn't have as many caching-specific bells and whistles as Varnish, but nginx is an excellent reverse proxy with some caching abilities (and simple configuration).


The biggest problem is you cannot use `proxy_cache_purge` unless you pay for the commercial version/fork of nginx.

That means you can't expire the cached content by URL.


Doesn't this open-source module do the same thing: http://labs.frickle.com/nginx_ngx_cache_purge/

It seems odd for nginx to try to commercialise such basic parts of the stack where 3rd parties can easily write such functionality.


A nice feature from the commercial nginx purge package is that it lets you purge by prefix. That's a feature that I've not seen in any of the open source purge modules.

If you are hosting data for several users on the same nginx cache and you want to purge only one of them, your only options are to scan the full cache on disk and delete the files that have a key with your prefix, or fork over >$1K/year per nginx box for the commercial license.


Synapse has the ability to automatically discover Docker containers and configure HAProxy: https://github.com/airbnb/synapse#docker


Airbnb wrote a set of apps they dub "SmartStack" that does the HAProxy config for you.

http://nerds.airbnb.com/smartstack-service-discovery-cloud/


> Now your HAProxy or Varnish config is going to get bloated and every time you deploy a container the config needs to be modified and reloaded. By this time you might be looking at chef or puppet for automating the config generation.

Varnish at least can route using DNS [0] - You do need a nameserver or two to handle the internal domain of course, but they're reasonably easy to set up using powerdns for example.

[0] https://www.varnish-cache.org/docs/3.0/reference/vcl.html#th...


A colleague of mine recently wrote a post about automating an Nginx reverse proxy for Docker containers:

http://jasonwilder.com/blog/2014/03/25/automated-nginx-rever...


I think you can define the IP address assigned to a container via something like `-p 127.18.0.10:80:80`, if that helps with your HAProxy config (but that assumes your host machine isn't changing as well).

Definitely an interesting issue. Have you seen etcd from CoreOS? Useful for service discovery.


That doesn't assign an IP address to the container; it just determines which of the host's addresses are forwarded to the container. But you can certainly use that to keep a fixed IP, if you use a suitable script to update iptables. You can use that even if the host machine changes, as long as an IP address is only ever used by containers that will be on the same server. Just add the IP on the new host, remove it from the old, and use arpsend -U -i [ip] [interface] to speed up the IP takeover. I use it fairly regularly to live-migrate services.


What are you finding unreliable in terms of tooling for Puppet on Windows?


For non-PaaS use cases (for example, a development server with a bunch of projects) I find schroot (1) simpler and more productive. For example, you can use the normal `service stop / service start` instead of manually writing init scripts, and you don't get stuck with sharing directories, which I found extremely tricky with Docker (for example, I couldn't correctly start MySQL with supervisor while sharing the MySQL db directory). But Docker is in early development, so I think it will become easier in the future.

1: https://wiki.debian.org/Schroot


Here's one benefit you get with Docker: speed of rebuilds, and ensuring that your build instructions and list of dependencies accurately reflect your actual environment. I basically have a "basic dev setup" container for all my projects now, and each new project sits as a sub-directory of a directory on the host that I bind mount into the docker containers. Each project then also has a Dockerfile which adds any project-specific dependencies.

Building a fresh container then takes a couple of seconds. And the projects run within those containers only. Every time I restart the apps in my development environment, I rebuild the container, because it is so cheap. Which means I know at any time that the container can be rebuilt to a state the app will run in. I know when I want to deploy that the Dockerfile accurately reflects the dependencies, because otherwise my app wouldn't be running in the dev environment, as any and all changes to anything outside of the application repository are only applied through changes to the Dockerfile.
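A sketch of what such a project Dockerfile might look like (the `devbase` image name, package, and paths are illustrative, not from the comment above):

```dockerfile
# Shared dev base image holding everything generic (editors, compilers, ...)
FROM devbase:latest

# Project-specific dependencies only; anything else belongs in devbase.
RUN apt-get update && apt-get install -y libxml2-dev

# The project source itself is bind-mounted at run time, e.g.:
#   docker run -v /home/me/projects/myapp:/src -t -i myapp-dev /bin/bash
WORKDIR /src
```

Because the base layers are cached, only the project-specific RUN steps are re-executed on rebuild, which is what keeps the cycle down to seconds.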


Sure, but speed of rebuilds is important when you rebuild often, which is not my case.

In my case the Dockerfile is easily replaced with a bash `set -e` script, which has never given me problems: debootstrap, share volumes, aptitude install apache/nginx, mysql/postgresql, php/ruby, copy files, pre-deployment commands like e.g. bundle install, start services.

And you have the advantage of starting services with `service ... start`, instead of reinventing init scripts.


I've had the same issue with MySQL. It's an issue of timing - You install MySQL, and the MySQL data directory has its data/default databases. Then you share the directory with Docker, and the data lib directory is wiped out (the files don't exist on the host machine, after all). Getting it right in an automated way is a Hard Problem™.

As of now, I'm keeping data persisted within the Container, which I don't necessarily like. I would love to hear a good solution on that.


Here's a dockerfile setup I wrote for Postgres which uses a 'data container' for the entire Postgres database: https://github.com/codelittinc/dockerfiles/tree/master/postg...

The gist of it is that you explicitly tell your DB container that there will be a shared directory on the container at runtime. This allows you to chown the directory before the data container is added.

Then, when you're running, use --volumes-from `$data-container-name` and it'll work. Want an article on it?
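In outline, the pattern might look like this (image and path names are illustrative, not taken from the linked repo):

```shell
# Create a data container whose only job is to own the volume.
docker run -v /var/lib/postgresql/data --name pg-data busybox true

# Run the DB container with the data container's volumes attached;
# its startup script chowns the directory before postgres starts.
docker run -d --volumes-from pg-data --name pg my-postgres-image

# The data now survives `docker rm pg`; a replacement container
# started with --volumes-from pg-data picks it up again.
```

The key detail is the ordering: the volume is declared on the data container first, so the DB container can fix ownership at runtime instead of baking it into the image.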


Awesome, thanks. I can figure it out from that example most likely, but an article would never hurt :D


An article on that would be great!


As a note, for Debian/Ubuntu, you can recreate the default DBs on a shell with mysql_install_db followed by mysql_secure_installation .


That's actually not true at this time. When you turn a directory into a volume, Docker will copy the contents of that directory into the volume directory... but you need to make the volume after the directory is populated.


Since I already use schroot and I'm happy with it (it's a mature project), I'll continue to use it while keeping an eye on future Docker developments.


CoreOS experience designer here. I'm looking for testers to check out the general platform and test some of our new features. All skill levels are fine – new to docker & CoreOS, new to CoreOS only, etc. I'm happy to work with your schedule and make it as quick or involved as you're comfortable with. Anything from emailing a few thoughts to Skype to hanging out in our office in SF for the day.

Email: rob.szumski@coreos.com


I've been using docker for a couple of months, but we have only just begun experimenting with actual deployment in a test environment on EC2. Right now we use it primarily as configuration/dependency management. We're a small team and it seems to make setup easier, at least so far.

Two examples: the first is a log sink container, in which we run redis + logstash. The container exposes the redis and es/kibana ports, and the run command maps these to the host instance. Setting up a new log server means launching an instance, and then pulling and starting the container.

The second example is elasticsearch. We have a container set up to have cluster and host data injected into it by the run command, so we pull the container, start it, and it joins the designated cluster.

The thing I like about this is the declarative specification of the dependencies, and the ease of spinning up a new instance. As I say, just experimenting so far, and I don't know how optimal all of this is yet, so would love any feedback.

One last quick thought on internal discovery. A method we're playing with on ec2 is to use tags. On startup a container can use a python script and boto to pull the list of running instances within a region that have certain tags and tag values. So we can tag an instance as an es cluster member, for example, and our indexer script can find all the running es nodes and choose one to connect to. We can use other tags to specify exposed ports and other information. Again, just messing around and still not sure of the optimal approach for our small group, but these are some interesting possibilities.
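The testable core of that discovery script is just tag filtering. A sketch with the boto call stubbed out (all names and tag values are hypothetical):

```python
# Sketch of tag-based discovery: pick running instances whose tags
# match all the required key/value pairs.
def filter_by_tags(instances, required):
    """instances: list of dicts with a 'tags' dict; required: tag -> value."""
    return [
        i for i in instances
        if all(i["tags"].get(k) == v for k, v in required.items())
    ]

# With boto you'd build `instances` from the reservation list returned by
# something like conn.get_all_instances() instead of a literal.
nodes = filter_by_tags(
    [
        {"ip": "10.0.1.4", "tags": {"role": "es-node", "cluster": "logs"}},
        {"ip": "10.0.1.9", "tags": {"role": "web"}},
    ],
    {"role": "es-node"},
)
print([n["ip"] for n in nodes])  # -> ['10.0.1.4']
```

The indexer script would then pick one of the matching IPs (randomly, or closest first) to connect to.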


This is a copy and improvement of the article I wrote last month, even down to the breakdown of "What's that command doing?" with `docker run -t -i ubuntu /bin/bash`.

Glad it was useful enough to spur an improved article, at least.

http://tonyhb.com/unsuck-your-vagrant-developing-in-one-vm-w...


I've been sitting on my original since December. https://www.dropbox.com/s/lf0qi70vlglasgv/Screenshot%202014-...


Hey, no problem at all! Just surprised to see how similar we were in our styles. Happy that there are more resources popping up now!


I never saw your article before. Sorry, dude. Maybe great minds think alike tho!


Can someone tell me what's the point of this? (I'd seriously love to know; not criticizing it.) Why would I need to have docker containers to install stuff on them instead of just installing stuff directly on the host?

Let's say I develop a new web app, I would install NodeJS, PostgreSQL and such on my machine. Before I deploy the app for the first time, I'll install them in the necessary servers. Now, it looks like I would need to do the same, except adding the step of building Docker containers.

I think I must miss something important here because the number of GitHub stars for Docker is impressive and this is usually a good indication of the usefulness of the project.


Docker containers let you isolate the entire environment for your app. Let's say you're running an app on CoreOS in a container that needs python 1.2.3.

On your laptop you can build and test the new version of the app that needs python 1.2.4. Once you decide that's ready to go, you can push the new container onto the same CoreOS machine, so it's running both containers. Without the containers, running two versions of python side by side on the same box is painful. If you had a chef script that updated to 1.2.4, you'd possibly break every other app on the box.

Containers also let you do some cool things like sign and verify a container before it's launched on the box. It should be bit for bit the same on your laptop as it is on the remote machine. Containers also boot within seconds, much faster than a VM. There have been a few tech demos running around that actually spin up a new container with a web server to service every web request, just to show how fast you can boot them. 300ms is pretty long for a web request, but it's the idea that counts.


Thanks, this is good info. For NodeJS and Rails, I use nvm and rvm, so didn't have any problem with multiple environments. But yeah, I see your point that Docker can help in such scenario or when there's no equivalent of nvm/rvm.


Having a dependency on a conflicting libc is a favourite problem that can be difficult to solve without some form of container (VM, chroot, or something "in between" like lxc/jails). Another is a dependency on a different kernel (either a different major version of the same OS, or a different kernel entirely, like FreeBSD's), which Docker (by design) doesn't solve.

I don't use ruby much, but it doesn't strike me as very easy to work with/very reliable for production deployment. But that might be me. How well does it handle dependencies on conflicting modules with parts written in C?

Perhaps the most important point is that (when it makes sense) a docker setup might allow easier horizontal scale-out, and or redundancy.

All that said, keeping things simple is generally a good thing. But sometimes adding complexity in one area makes the overall system less complex.


Imagine you wanted to run two apps on the same host, and they depended on different versions of those components, and you wanted to be able to install the app and its dependencies on a new machine really quickly for scaling reasons, and you wanted an exploit on one of the apps to be difficult to escalate over to the other app from, and you'll start understanding some of the pain points Docker solves. :)


Thank you. Do you normally set up your dev environment inside a container? Or do you do development as usual and then, when it comes to deployment, deploy your app into the container?


> with Macintosh's kernel

I misread that as "Microsoft's..." and got excited, since I run a build farm that's 70% Windows and wish I could use Docker, but it's not worth having two systems (containers and VMs).

Also, isn't that completely wrong? Macintosh is not an OS or a company. It was one of Apple's product lines, long ago.


VMs CAN share binaries/libs/etc (otherwise called files).

Also, VMs CAN "share" memory, i.e. they can dedup memory between themselves. On Linux at least.

Not saying docker/lxc and all things namespaces are bad at all, but setting things straight: VMs can do this :)

Check out KSM for memory "sharing", and any overlay-style filesystem that is mounted by VMs (that one works exactly the same as when you use namespaces/docker/lxc, in fact).


Shouldn't "setting up a correct init process" be part of every "getting started with docker?" http://phusion.github.io/baseimage-docker/


No. That guide assumes you're running a bunch of processes in the container (or even a full system). That's not the case at all when you're doing an "application container" that doesn't need its own cron daemon, ssh daemon, etc.

Containers can be much leaner than the kind discussed there.


I am one of the authors behind that guide. That guide does not assume that you're running "a bunch of processes" or "even a full system" in the container. Even if you're only running 1 process, it's still highly relevant.

The main point is the Unix process model and how zombie processes work. Things are just not set up properly if your system doesn't handle that properly. And except by using specialized apps (e.g. the my_init system used in baseimage-docker), it just isn't set up properly.

One of the Docker authors, shykes, stated: "In short, regular applications don't expect to be pid 1, and generally speaking they shouldn't."

The other point is to be able to login to the container to perform one-off sysadmin and debugging work. There was lxc-attach for that, but now that Docker supports multiple backends, SSH is the only portable solution that works no matter which Docker backend you use.
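The zombie-reaping part of such an init boils down to a waitpid loop. A minimal sketch of just that piece (a real init like my_init also forwards signals and supervises services, which this deliberately omits):

```python
import os
import time

def reap_zombies():
    """Collect every child that has exited, as a PID-1 init must do,
    so no exited process lingers as a zombie."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break            # no children at all
        if pid == 0:
            break            # children exist, but none have exited yet
        reaped.append(pid)
    return reaped

# Demo: fork a child that exits immediately, then reap it.
child = os.fork()
if child == 0:
    os._exit(0)
time.sleep(0.1)              # give the child time to exit
print(child in reap_zombies())
```

When a regular application runs as PID 1 it never makes these waitpid calls, which is exactly why orphaned grandchildren accumulate as zombies inside the container.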


Wow thanks! I didn't know about the init process and the zombies.


Agree.


I wish that people would stop writing tutorials on "getting started" with Docker, and actually start writing up examples of how to work with multiple containers, hosts, and linking.

That's the part that I (and I'm sure other beginners) get totally stuck on. Anyone can do docker commit/pull.


I actually haven't found many getting started articles, which is why I wrote this. However I fully plan on writing up more interesting stuff.


I found this to be interesting; it specifically walks through a 2-container deployment: http://blog.tutum.co/2014/02/06/how-to-build-a-2-container-a...


Also, airbnb's synapse looks pretty amazing for linking and service discovery as mentioned in this comment https://news.ycombinator.com/item?id=7443803


I have been enjoying some of the tutorials on the Century Link Labs blog: http://www.centurylinklabs.com/category/docker-posts/


This is the first time I have heard of CoreOS. It seems to be custom built for containers like Docker. Are there downsides to doing system updates this way and not having a package manager, just relying on containers for everything? Seems great in concept.


I don't know how well this works as soon as you have a single file that needs multiple edits to support multiple images.


Well good morning hackers.. This has been around for ages...

http://www.xenproject.org/


Would it be better to use FreeBSD and their Jails mechanism for all of this?


Joyent would probably claim that KVM+ZFS would be best. But if you don't have kernel support for jails, then no, using jails isn't better; it's not an option. Oracle would probably claim Solaris zones are better (and arguably, they'd be right).

Jails are (as far as I can tell) great -- but not so great that freebsd didn't include a new hvm assisted hypervisor in freebsd 10 (BSD Hypervisor (bhyve)).

LXC is in many ways *BSD jails for Linux.


Or schroot on Linux, as I wrote in another comment



