Taming the Docker Swarm - Part 2

In Taming the Docker Swarm - Part 1 I showed how I created a three-node Docker Swarm cluster, added a load balancer with support for Let’s Encrypt SSL certificates and deployed a service to monitor other services in the cluster.

In this post I will take this further by deploying more services and adding shared storage using NFS from my FreeNAS machine. To make things easier for myself I created a wrapper cookbook that uses my docker-swarm cookbook so that Chef could manage the deployment of services for me.

Overview

After creating the Docker Swarm cluster on my three machines I wanted to be able to run some services. Some of them I already had running on virtual or physical machines, so it was a case of replicating them. One of the nice things was that I was able to scale up easily. For example I run ElasticSearch, but it was only ever a development system of one machine; using a Docker Swarm service I was able to create a proper ES cluster :-).

One of the things that still eludes me is a solid way of having shared storage. At the moment I have an NFS share off my FreeNAS server, which helps, but it means that I now have a single point of failure - less than ideal!

I created a new wrapper cookbook called swarm-deploy that is responsible for deploying all the services that I wanted. Not only does it contain recipes for the services, but also one for mounting the shared storage on each node.

Shared Storage

When running services in a Docker Swarm cluster, a service can be scheduled on any node unless modified by constraints. This means that if a service requires persistent data, that data must be available on all the hosts so that when the service moves it can still find it. It also means that the mount point must be the same on each node; in my case I chose to use /data/dockerswarm with a sub-directory for each service running in the cluster.

The following diagram shows how the storage node is connected in my network. Note that it is not part of the network overlay as it is the hosts that require the mount and not the services themselves.

This is an update to the previous post's diagram, from which I have removed the Windows host. This is not because it is no longer part of the network, but because it is not required for this scenario.

In my cookbook I have a recipe called storage_home.rb, which is used in my home/office network. It is run on every node that could potentially run a service.

# Add the necessary packages
package "nfs-common"

# ensure the directory point exists
directory "/data/dockerswarm" do
  recursive true
end

mount "/data/dockerswarm" do
  device "192.168.36.10:/mnt/storage01/dockerswarm"
  fstype "nfs"
  options "rw"
  action [:mount, :enable]
end
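The :enable action also persists the mount across reboots. The resulting entry in /etc/fstab should look something like the following (a sketch, matching the resource above):

```
192.168.36.10:/mnt/storage01/dockerswarm  /data/dockerswarm  nfs  rw  0 0
```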

It is worth noting that being able to run Chef on each of the nodes is very beneficial. In my home/office network I control all the nodes and can SSH into them; however, if using the Azure Container Service (ACS) for example, it is not possible to log in to the worker nodes. But it is possible to modify the ARM template so that each of the nodes installs Chef, which means that the nodes can be managed and tested for compliance if required.

Services

I run a number of services in my cluster, all of which can be seen in the Portainer service deployed in the previous part.

Docker Registry

I have a pipeline that builds my Docker images and deploys them to my private Docker registry. In fact I have two, one for home and one for Internet deployment. In both cases I am running the Docker registry service. This is deployed along with an Nginx container so that I can add authentication to the registry. As I am running this behind Traefik, which I have already enabled for SSL certificates, I terminate SSL at the load balancer so Nginx only has to deal with plain HTTP.

Request -> Traefik -> Nginx -> Registry
               |          |       
          SSL Terminate   |
                          |
                      Basic Auth    

Service

The following command deploys the registry service:

$> docker service create --name registry \
                         --network traefik-net \
                         --mount "type=bind,src=/data/dockerswarm/registry,dst=/var/lib/registry" \
                         registry:2

Reverse Proxy

As can be seen from the deployment, no ports are exposed. This is because the next service to be deployed is the Nginx proxy.

$> docker service create --name registry_proxy \
                         --network traefik-net \
                         --mount "type=bind,src=/data/dockerswarm/registry_proxy/nginx.conf,dst=/etc/nginx/nginx.conf,readonly" \
                         --mount "type=bind,src=/data/dockerswarm/registry_proxy/htpasswd,dst=/etc/nginx/.htpasswd,readonly" \
                         --label "traefik.enable=true" \
                         --label "traefik.backend=docker_registry" \
                         --label "traefik.port=8080" \
                         --label "traefik.frontend.rule=Host:registry.home.turtlesystems.co.uk" \
                         nginx:1.13

Nginx is configured to listen on port 8080 within the overlay network, and this is where Traefik will send traffic when requests arrive for registry.home.turtlesystems.co.uk. The labels inform Traefik how the service should be exposed.

The nginx.conf and .htpasswd files are generated from my registry.rb recipe.

user  nginx;
worker_processes  1;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;


events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;
    keepalive_timeout  65;
    server {
        listen 8080;

        server_name registry.example.com;

        # disable any limits to avoid HTTP 413 for large image uploads
        client_max_body_size 0;

        # required to avoid HTTP 411
        chunked_transfer_encoding on;

        location / {

            auth_basic "Docker Registry";
            auth_basic_user_file /etc/nginx/.htpasswd;

            add_header 'Docker-Distribution-Api-Version' 'registry/2.0' always;

            proxy_pass                      http://registry:5000/;

        }
    }
}

The registry service name is how Nginx communicates with the Registry service on port 5000. Notice that I do not have to deal with IP addresses here to get this working.
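To convince yourself of this service discovery, the service name can be resolved from inside the proxy task. This assumes you are on a node that is running the registry_proxy container:

```
$> docker exec $(docker ps -q --filter name=registry_proxy) getent hosts registry
```

The lookup returns the virtual IP that the overlay network has assigned to the registry service.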

The .htpasswd file is generated using the usernames and passwords from an encrypted data bag.

  names = data_bag(node["swarm_deploy"]["apps"]["registry"]["data_bag"])
  names.each do |name|
    account = data_bag_item(node["swarm_deploy"]["apps"]["registry"]["data_bag"], name)
    htpasswd "/data/dockerswarm/registry_proxy/htpasswd" do
      user account["username"]
      password account["password"]
    end
  end
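For reference, the same entries can be created without Chef. This is a sketch using the openssl utility, writing to a local htpasswd file rather than the shared storage path; "alice" and "secret" are placeholder credentials:

```shell
# Create an htpasswd-style entry using openssl's apr1 (MD5) scheme,
# which nginx's auth_basic understands. Credentials are placeholders.
printf 'alice:%s\n' "$(openssl passwd -apr1 'secret')" > htpasswd
cat htpasswd
```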

On my list of things to do is run Hashicorp Vault as a service so that my recipes and services can access a central store of passwords.

Now it should be possible to log in to the newly created Docker registry.
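For example, from any machine that trusts the certificate (myimage is a placeholder image name; the credentials come from the htpasswd file above):

```
$> docker login registry.home.turtlesystems.co.uk
$> docker tag myimage:latest registry.home.turtlesystems.co.uk/myimage:latest
$> docker push registry.home.turtlesystems.co.uk/myimage:latest
```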

Elastic Search

I use ES a lot. I use it for my websites, for Logstash and now for a Raspberry Pi project I am working on, so I need to be able to run a stable service. The configuration I outline here creates a global service which is constrained by labels and has an Nginx proxy service to accept requests.

The proxy serves two functions. The first is to provide basic authentication, so I do not need to use X-Pack security. It also provides access to the ES cluster. This might seem obvious, but because the ES service is configured to use DNS round robin in the Docker Swarm cluster there is no ingress routing, hence Nginx.

As I have Traefik configured I will get an SSL certificate for the service to boot!

Much of the information I used to get this working was from this page Elasticsearch Cluster with Docker Engine Swarm Mode by Brian Mancini.

Service

The following command deploys ES as a global service in my Docker Swarm cluster:

$> docker service create --name elasticsearch \
                         --hostname "{{.Node.ID}}-{{.Service.Name}}" \
                         --network traefik-net \
                         --mode global \
                         --constraint "node.labels.app_role == elasticsearch" \
                         --mount "type=bind,src=/data/dockerswarm/elasticsearch,dst=/usr/share/elasticsearch/data" \
                         --endpoint-mode dnsrr \
                         docker.elastic.co/elasticsearch/elasticsearch:5.4.2 \
                         elasticsearch \
                         -Enetwork.host=0.0.0.0 \
                         -Ediscovery.zen.ping.unicast.hosts=elasticsearch \
                         -Ediscovery.zen.minimum_master_nodes=1 \
                         -Epath.data=/usr/share/elasticsearch/data/${HOSTNAME}/data \
                         -Epath.logs=/usr/share/elasticsearch/data/${HOSTNAME}/logs \
                         -Expack.security.enabled=false

There is a lot going on here; the following shows what specific lines are doing:

  • Line 2 sets the hostname of the container. This uses a Docker template to generate the name from the node and the service
  • Line 4 sets the service as a global one. This means that it will run one task on each node in the cluster unless constrained
  • Line 5 sets the constraint for the service. In this case tasks will only run on nodes that have the app_role of “elasticsearch”
  • Line 6 mounts the shared storage so that ES writes its indexes and logs to it
  • Line 7 tells the service to use DNS round robin rather than a VIP
  • Lines 10-15 set the configuration options for the elasticsearch command
    • Lines 13 and 14 use the HOSTNAME environment variable in the path so that the data is assigned to a per-host folder in the shared storage
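For the constraint to match anything, the chosen nodes need the app_role label. The label is applied with docker node update, along these lines (the node names here are placeholders):

```
$> docker node update --label-add app_role=elasticsearch node01
$> docker node update --label-add app_role=elasticsearch node02
```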

As I have mentioned before I am not convinced this is the correct way to handle the storage, but it does work. The other way would be to use volume mounts, but I am not sure how to make these move with the service if it had to be relocated to another node in the Docker Swarm cluster.
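One option I have not tried would be NFS-backed named volumes: the local volume driver can mount NFS directly, and because the volume is created from the same definition on whichever node the task lands on, it should follow the service. A sketch, reusing the FreeNAS export from earlier (untested):

```
$> docker service create --name elasticsearch \
                         --mount "type=volume,source=esdata,dst=/usr/share/elasticsearch/data,volume-driver=local,volume-opt=type=nfs,volume-opt=device=:/mnt/storage01/dockerswarm/elasticsearch,volume-opt=o=addr=192.168.36.10" \
                         docker.elastic.co/elasticsearch/elasticsearch:5.4.2
```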

At this point Traefik will not expose the service as there are no configuration labels, nor is it accessible from outside the Docker Swarm environment.

Reverse Proxy

To make the service accessible, an Nginx service is deployed. This connects to the ES service inside the cluster and, as it is configured for use with Traefik, it will be exposed on the named host, with requests answered over HTTPS on port 443.

$> docker service create --name elasticsearch_proxy \
                         --network traefik-net \
                         --mount "type=bind,src=/data/dockerswarm/elasticsearch_proxy/nginx.conf,dst=/etc/nginx/nginx.conf,readonly" \
                         --mount "type=bind,src=/data/dockerswarm/elasticsearch_proxy/htpasswd,dst=/etc/nginx/.htpasswd,readonly" \
                         --label "traefik.enable=true" \
                         --label "traefik.backend=elasticsearch_proxy" \
                         --label "traefik.port=9200" \
                         --label "traefik.frontend.rule=Host:elasticsearch.home.turtlesystems.co.uk" \
                         nginx:1.13                     

The configuration for Nginx is as follows:

user  nginx;  
worker_processes  1;

error_log  /var/log/nginx/error.log warn;  
pid        /var/run/nginx.pid;


events {  
    worker_connections  1024;
}

http {  
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;
    keepalive_timeout  65;
    server {
        listen 9200;

        add_header X-Frame-Options "SAMEORIGIN";

        location / {

            auth_basic "Elasticsearch";
            auth_basic_user_file /etc/nginx/.htpasswd;

            proxy_pass http://elasticsearch:9200;
            proxy_http_version 1.1;
            proxy_set_header Connection keep-alive;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_cache_bypass $http_upgrade;

        }
    }
}

Slightly confusingly, Nginx listens on port 9200, the same port as Elasticsearch. I expected this to cause a conflict, but it works fine. I suspect this is because neither port is published on the hosts; each container has its own network namespace within the traefik-net overlay, so the two never clash.

Using the HTTP endpoint for ES it is possible to see that the cluster status is “green”.
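The check itself is just the cluster health endpoint through the proxy (the credentials are whatever is in the htpasswd file; "alice:secret" here is a placeholder):

```
$> curl -u alice:secret https://elasticsearch.home.turtlesystems.co.uk/_cluster/health?pretty
```

The response is a JSON document whose status field should read "green" when all shards are allocated.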

Kibana

Kibana is the tool used to visualise data that is sent to ES. This too can be added to the swarm cluster:

$> docker service create --name kibana \
                         --network traefik-net \
                         --label "traefik.enable=true" \
                         --label "traefik.port=5601" \
                         --label "traefik.docker.network=traefik-net" \
                         --label "traefik.backend=kibana" \
                         --label "traefik.frontend.rule=Host:kibana.home.turtlesystems.co.uk" \
                         --env "XPACK_SECURITY_ENABLED=false" \
                         --env "ELASTICSEARCH_URL=http://elasticsearch:9200" \
                         docker.elastic.co/kibana/kibana:5.4.2

There are two environment variables being passed to this service. XPACK_SECURITY_ENABLED is set to false; for the moment I have no authentication for Kibana, but it could be placed behind another Nginx reverse proxy to provide that. ELASTICSEARCH_URL informs Kibana how to connect to the ES cluster. As the service is running on the same overlay network it is able to reach the ES cluster on port 9200.

As is the pattern I am publishing this through Traefik so it will be accessible on https://kibana.home.turtlesystems.co.uk.

Logstash

So now I have an ES cluster and a way to view the data, but no way to ingest data. To do this I deploy a Logstash service. The key thing to remember here is that Logstash does not use the HTTP protocol, so it cannot be load balanced by Traefik.

$> docker service create --name logstash \
                         --network traefik-net \
                         --publish 5044:5044 \
                         --mount "type=bind,src=/data/dockerswarm/logstash/pipeline,dst=/usr/share/logstash/pipeline" \
                         --mount "type=bind,src=/data/dockerswarm/logstash/data,dst=/usr/share/logstash/data" \
                         docker.elastic.co/logstash/logstash:5.4.2

The pipeline files, for input and output, are placed in the pipeline directory. These files are as follows:

input.beats.conf

input {
  beats {
    port => 5044
  }
}

output.elasticsearch.conf

output {
  elasticsearch {
    hosts => "elasticsearch:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}" 
    document_type => "%{[@metadata][type]}" 
  }
}

As this service is not being load balanced by Traefik, it can be reached by targeting any host in the Docker Swarm cluster on port 5044; the routing mesh forwards the traffic to the Logstash task. This makes it easy to send data to Logstash.
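As a sketch, a Filebeat 5.x configuration on another machine only needs to point at one of the swarm nodes (the hostname and log path here are assumptions):

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/syslog

output.logstash:
  hosts: ["node01.home.turtlesystems.co.uk:5044"]
```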

VSTS Agent

I use VSTS a lot and I need to be able to run my own agents for building. This is not always necessary, but it is very useful when I am working on new VSTS extensions as I can get into the agent and see why things are not working.

Again I was running these on physical machines, which was becoming an issue as build agents tend to be quite ephemeral; as I was migrating things to my swarm cluster I wanted to see if I could get this to work as well.

$> docker service create --name vsts-agent \
                         --env VSTS_ACCOUNT=turtlesystems \
                         --env VSTS_TOKEN=1234567890abcdefghijklmnopqrstuvwxyz \
                         --env VSTS_AGENT=swarm-agent-1 \
                         --env VSTS_POOL=default \
                         microsoft/vsts-agent

As can be seen, there is no data to persist and the service is not being load balanced by Traefik. This is because the agent connects out to VSTS and is never connected to directly.

Deployment with Chef

Using the docker-swarm cookbook's resources it is possible to deploy the services from a recipe. The following code shows how I deploy the Nginx reverse proxy for the Docker registry using the docker_swarm_service resource.

# Only attempt to deploy the services if this is running on the manager node
if node["expanded_run_list"].include?("swarm_deploy::manager_node")

  reverse_proxy_config = "/data/dockerswarm/registry_proxy/nginx.conf"
  htpasswd_file = "/data/dockerswarm/registry_proxy/htpasswd"

  # Deploy the NGinx reverse proxy
  docker_swarm_service "registry_proxy" do
    networks [
      "traefik-net",
    ]
    mounts [
      format("type=bind,src=%s,dst=/etc/nginx/nginx.conf,readonly", reverse_proxy_config),
      format("type=bind,src=%s,dst=/etc/nginx/.htpasswd,readonly", htpasswd_file),
    ]
    labels [
      "traefik.enable=true",
      "traefik.backend=docker_registry",
      "traefik.port=8080",
      format("traefik.frontend.rule=Host:%s", node["swarm_deploy"]["apps"]["registry"]["fqdn"]),
    ]
    image "nginx:1.13"
  end
end

As services can only be deployed from manager nodes, the recipe checks to see if it is running on a node that has the swarm_deploy::manager_node recipe assigned to it. My pattern is that I will always use this cookbook to deploy services, thus I can apply the recipe to all nodes for configuration purposes, safe in the knowledge that services will only be deployed on a manager node.

Summary

I am really pleased with the way in which I have been able to migrate things to Docker Swarm and have them automatically get an SSL certificate. I have started to build out a system on the Internet with an extra twist, but that is another post.

I am sure that I will be deploying more services to the cluster as time goes on, but the ones I have already done have made life much easier. There are still things that are not so good, namely the shared storage. I am on the lookout for ideas that can help fix this issue.

My ES cluster works and I am pleased with the simplicity of its deployment. I am sure that people reading this will ask questions about its stability and say that the cluster should not be run like this. I have not found any issues yet, apart from not easily being able to load in new plugins, but I am sure I will come across some.
