On-disk files in a container are ephemeral, as we saw in the last lesson. That presents a problem for applications that need to keep long-lived data across restarts, like user data in a database.

The Kubernetes volume abstraction solves two primary problems:

- Data is lost when a container crashes or restarts
- Containers running together in a pod often need to share files

As it turns out, there are a lot of different types of "volumes" in Kubernetes. Some are ephemeral too, just like a container's standard filesystem. The primary reason for using an ephemeral volume is to share data between containers in a pod.
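To make the sharing use case concrete, here's a minimal sketch of the pattern: a throwaway pod (the names, images, and commands are just for illustration, not part of the crawler setup) in which two containers mount the same emptyDir volume.

apiVersion: v1
kind: Pod
metadata:
  name: shared-data-demo
spec:
  volumes:
    # An ephemeral volume that lives exactly as long as the pod does
    - name: shared
      emptyDir: {}
  containers:
    - name: writer
      image: busybox
      # Writes a file into the shared volume, then idles
      command: ["sh", "-c", "echo hello > /data/hello.txt && sleep 3600"]
      volumeMounts:
        - name: shared
          mountPath: /data
    - name: reader
      image: busybox
      # Idles; you could exec in and read /data/hello.txt written by the other container
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: shared
          mountPath: /data

Anything the writer puts in /data is immediately visible to the reader, but the whole volume disappears when the pod does.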
It's time to shift our focus back to the crawler service. The crawler service continuously crawls Project Gutenberg and exposes the information that it finds via a JSON API. That data is then made available via slash commands in the chat application.
The crawler is pretty slow by default. Each instance only crawls one book every 30 seconds.
To see what I mean, run:
kubectl logs <crawler-podname>
You should see some logs with timestamps that show you the crawler's progress.
We can speed it up by increasing the number of concurrent crawlers. The trouble with scaling beyond one instance is that each crawler currently stores its data in memory. We need all the crawler instances to share the same data so they can each add their findings to the same database.
Let's update the crawler deployment to use a volume that will be shared across all containers in the crawler pod, and scale up the number of containers in the pod.
Add a volumes section to spec/template/spec:

volumes:
  - name: cache-volume
    emptyDir: {}
Add a volumeMounts section to the container entry. This will mount the volume we just created at the /cache path:

volumeMounts:
  - name: cache-volume
    mountPath: /cache
Copy the container entry in the containers list twice (you should now have 3 total containers). Update the name of each:

synergychat-crawler-1
synergychat-crawler-2
synergychat-crawler-3

Now all the containers in the pod will share the same volume at /cache. It's just an empty directory, but the crawler will use it to store its data.
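For a sanity check, the pod spec portion of the crawler deployment might now look roughly like this. The image value is a placeholder, and fields your containers already had (like envFrom) are elided here; keep those as they are:

spec:
  template:
    spec:
      volumes:
        - name: cache-volume
          emptyDir: {}
      containers:
        - name: synergychat-crawler-1
          image: <your-crawler-image>  # keep the image your deployment already uses
          volumeMounts:
            - name: cache-volume
              mountPath: /cache
          # ...envFrom and any other existing fields stay as they were
        - name: synergychat-crawler-2
          image: <your-crawler-image>
          volumeMounts:
            - name: cache-volume
              mountPath: /cache
        - name: synergychat-crawler-3
          image: <your-crawler-image>
          volumeMounts:
            - name: cache-volume
              mountPath: /cache

Each container mounts the same directory, so anything one writes to /cache is visible to the other two.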
Add a CRAWLER_DB_PATH environment variable to the crawler's ConfigMap. Set it to /cache/db. The crawler will use a directory called db inside the volume to store its data.

Apply your changes, then run kubectl get pod to see the status of your new pod. You should notice that there's a problem with the pod! Only 1/3 of the containers should be "ready". Use the logs command to get the logs for all 3 containers:
kubectl logs <podname> --all-containers
You should see something like this:
listen tcp :8080: bind: address already in use
Because containers in a pod share the same network namespace, they can't all bind to the same port! Hmm... let's put a band-aid on this by binding each container to a different port. 8080 is the only one that will be exposed via the service, but that's okay for now. We can add redundancy later.
Add two new environment variables to the ConfigMap:

CRAWLER_PORT_2: "8081"
CRAWLER_PORT_3: "8082"

Then change the second and third containers to map CRAWLER_PORT_2 -> CRAWLER_PORT and CRAWLER_PORT_3 -> CRAWLER_PORT respectively (the Docker image expects a variable named "CRAWLER_PORT"). I'm not going to give you the code, but know that it's going to be a bit tedious because you need to use env: instead of envFrom: for the second and third containers. Don't forget to continue exposing the CRAWLER_KEYWORDS and CRAWLER_DB_PATH environment variables for all containers.
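If you get stuck on the syntax itself, here's a rough sketch of the two pieces involved: the new ConfigMap keys, and the env: + configMapKeyRef pattern for remapping one of them inside a container. It deliberately shows only one variable, and the ConfigMap name synergychat-crawler-configmap is a placeholder for whatever yours is actually called:

# In the ConfigMap's data (alongside the keys you already have):
data:
  CRAWLER_PORT_2: "8081"
  CRAWLER_PORT_3: "8082"

# In the second container's spec, env: replaces envFrom: and remaps the key:
env:
  - name: CRAWLER_PORT
    valueFrom:
      configMapKeyRef:
        name: synergychat-crawler-configmap  # placeholder; use your ConfigMap's real name
        key: CRAWLER_PORT_2
  # ...CRAWLER_KEYWORDS and CRAWLER_DB_PATH follow the same pattern, keeping their own names

The third container does the same thing with CRAWLER_PORT_3.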
When you're done, apply the changes and run kubectl get pods again. All three containers should be ready, each serving on a different port, with only the first exposed via the service.
Run:
kubectl proxy
Run and submit the CLI tests.