Alpenhorn Demo

Introduction

This is a short demonstration of using alpenhorn intended to show off most of the major features of the system.

Demo set-up

Because alpenhorn is designed to run as a distributed system, with both data and software present at multiple dispersed, independent places, this demo uses docker to run several virtual images simulating independent hosts. Additionally, Docker Compose is used to manage the multi-container set-up for this demo.

Installing docker itself is beyond the scope of this demo. The Docker install documentation may help. You may also be able to get help from your friendly neighbourhood sysadmin. Once docker is properly installed you can test the installation by running the hello-world container:

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
e6590344b1a5: Pull complete
Digest: sha256:d715f14f9eca81473d9112df50457893aa4d099adeb4729f679006bf5ea12407
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Hint

If you get a permission error: “permission denied while trying to connect to the Docker daemon socket”, this means you lack the proper permissions to run docker. The best solution in this case is to have your sysadmin add you to the docker group. See the Docker documentation on running docker as a non-root user.

Once docker itself is running, you’ll also need to install the docker compose plugin. After installing docker compose, you should be able to run

  docker compose version

and see the version of the plugin you just installed:

$ docker compose version
Docker Compose version v2.31.0

If that’s all working, you should be able to proceed with the alpenhorn demo itself!

Starting the demo

There are five docker containers that comprise this demo:

  • a database container (alpendb), which runs the MySQL server containing the alpenhorn Data Index

  • a container providing a root shell (alpenshell) used to interact with the alpenhorn CLI

  • three containers (alpenhost1 through alpenhost3) implementing the separate alpenhorn hosts, each containing a StorageNode and running an instance of the alpenhorn daemon.

These containers will be automatically built when we first start the demo.

The demo must be run from the /demo/ subdirectory of the alpenhorn repository. If you don’t already have a clone of the alpenhorn repository, the first step will be to clone the repository from GitHub:

git clone https://github.com/radiocosmology/alpenhorn.git

Once you’ve cloned the repository, you should change directory into the /demo/ subdirectory of the newly-cloned repository (the directory containing Dockerfile.alpenhorn):

$ git clone https://github.com/radiocosmology/alpenhorn.git
Cloning into 'alpenhorn'...
remote: Enumerating objects: 3764, done.
remote: Counting objects: 100% (574/574), done.
remote: Compressing objects: 100% (158/158), done.
remote: Total 3764 (delta 444), reused 451 (delta 413), pack-reused 3190 (from 2)
Receiving objects: 100% (3764/3764), 1.35 MiB | 1.35 MiB/s, done.
Resolving deltas: 100% (2678/2678), done.
$ cd alpenhorn/demo
$ ls
Dockerfile.alpenhorn  alpenhorn.conf  docker-compose.yaml

Once you’re in the demo subdirectory, we can begin the demo.

Let’s begin by starting the database container in the background. Because alpenhorn is a distributed system, the database itself is not expected to run on an alpenhorn node. We simulate this in the demo by running the database out of a standard mysql container.

To start the database container, run the following from the /demo subdirectory:

    docker compose up --detach alpendb

Hint

If you get a “no configuration file provided: not found” error, you’re not in the right directory. Run the command from the /demo/ directory within the alpenhorn repository.

The first time you do this, docker will probably download the latest MySQL image and create both the virtual demo network and the demo_db_vol volume, which contains the persistent database for the demo:

$ docker compose up --detach alpendb
[+] Running 11/11
 ✔ alpendb Pulled                                                           15.9s
 ✔ 1d19e87a21f5 Pull complete                                                3.0s
 ✔ 16ec22ff04f9 Pull complete                                                3.1s
 ✔ 9f789b8d2675 Pull complete                                                3.1s
 ✔ 96f4da41c548 Pull complete                                                3.5s
 ✔ fb087646189b Pull complete                                                3.5s
 ✔ 023374826adc Pull complete                                                3.5s
 ✔ 8293a632aa25 Pull complete                                                4.6s
 ✔ c3947540e0c6 Pull complete                                                4.7s
 ✔ c38bed95fb4b Pull complete                                               14.5s
 ✔ 712eb897f1e5 Pull complete                                               14.5s
[+] Running 3/3
 ✔ Network demo_default      Created                                         0.1s
 ✔ Volume "demo_db_vol"      Created                                         0.0s
 ✔ Container demo-alpendb-1  Started                                         2.6s

You can use docker stats or docker container ls to verify that the alpendb container is running:

$ docker container ls
CONTAINER ID   IMAGE          COMMAND         CREATED         STATUS         PORTS                 NAMES
7e19895eb701   mysql:latest   "docker-ent…"   2 minutes ago   Up 2 minutes   3306/tcp, 33060/tcp   demo-alpendb-1

Stopping and resetting the demo

Tip

Before we continue, a few words about stopping and resetting this demo.

You can stop the docker containers running this demo at any time by executing:

docker compose stop

This will stop all running containers. To restart the demo, run the appropriate docker compose up commands. Stopping the demo does not delete the containers or volumes containing the database and the storage node data.

If you want to also remove the demo containers:

docker compose down --remove-orphans

To remove the containers and the volumes containing the database and the storage node data:

docker compose down --remove-orphans --volumes

Warning

Removing the volumes will delete the demo’s alpenhorn data index. After doing this, you’ll need to rebuild the demo database from scratch as described below.

Deleting the volumes will also delete all files in the StorageNodes which you create over the course of this demo.

Finally, to remove the alpenhorn container image, which gets built the first time the image is needed, run:

docker rmi alpenhorn:latest

You should do this if you want to update the version of alpenhorn used by the demo, or if you’ve made changes to the demo’s Dockerfile.alpenhorn or docker-compose.yaml files.
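
The reset levels described in this section can be summarised in a small lookup table. The following Python sketch only builds and prints the command lines given above; it does not run docker, and the level names (stop, containers, volumes, image) are labels invented for this illustration:

```python
import shlex

# Reset levels from this section, ordered from least to most destructive.
# The dictionary keys are made-up labels; the commands are the ones shown above.
RESET_COMMANDS = {
    "stop": ["docker", "compose", "stop"],
    "containers": ["docker", "compose", "down", "--remove-orphans"],
    "volumes": ["docker", "compose", "down", "--remove-orphans", "--volumes"],
    "image": ["docker", "rmi", "alpenhorn:latest"],
}


def reset_command(level):
    """Return the shell command line for the given reset level."""
    try:
        return shlex.join(RESET_COMMANDS[level])
    except KeyError:
        raise ValueError(f"unknown reset level: {level!r}") from None


# Print a short cheat sheet; nothing is executed.
for level in RESET_COMMANDS:
    print(f"{level:<12}{reset_command(level)}")
```

Only the "volumes" level deletes the data index and the files stored on the StorageNodes.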

Tip

You can also remove the mysql:latest image if you want to run a newer version of the database container.

Conventions used in this demo

To follow along with this demo, you will be executing commands in three different places:

  • the docker host (the real machine on which you’ve cloned the alpenhorn repository)

  • the alpenshell container, where you’ll be issuing alpenhorn commands

  • the alpenhost1 container, where you’ll be interacting with data files

To aid in distinguishing these three places, we’ve tried to indicate them by using different highlights.

Commands you should execute on the docker host will look like this:

echo "This is a command on the demo host."

and command output will look like this:

$ echo "This is a command on the demo host."
This is a command on the demo host.

Commands meant to be run in the alpenshell container will look like this:

echo "This is a command in the alpenshell container."

and command output will look like this:

root@alpenshell:/# echo "This is a command in the alpenshell container."
This is a command in the alpenshell container.

Finally, commands that need to be run in the alpenhost1 container will look like this:

echo "This is a command in the alpenhost1 container."

and command output will look like this:

root@alpenhost1:/# echo "This is a command in the alpenhost1 container."
This is a command in the alpenhost1 container.

Hint

How to access a shell in these containers is explained later on, when access to them is first needed.

Initialising the database

Now we need to use some alpenhorn commands to create the Data Index (the alpenhorn database) and then define the start of our storage infrastructure in it. The data index must exist before we can start the first alpenhorn daemon.

To create the data index we’ll need access to the MySQL database housing it. This can’t be done from the docker host, so we’ll create a separate docker container (called alpenshell) which we’ll use for the duration of this demo to interact with alpenhorn.

To build the container and start a bash session in it, run:

docker compose run --rm alpenshell

Note

The --rm option here means docker will delete the container when you exit it, preventing “orphan” containers. If you forget this option and end up with warnings about orphan containers as a result, you can always add --remove-orphans to the command to remove the old containers.

Running this the first time will cause docker compose to build the alpenhorn container image. This may take some time. Eventually you should be presented with a bash prompt as root inside the alpenshell container:

$ docker compose run --rm alpenshell
[+] Creating 1/1
 ✔ Container demo-alpendb-1  Running                                                                                     0.0s
[+] Running 1/1
 ! alpenshell Warning pull access denied for alpenhorn, repository does not exist or may require ...                     1.1s
[+] Building 13.4s (4/15)                                                                                      docker:default
[+] Building 79.8s (17/17) FINISHED                                                                            docker:default
 => [alpenshell internal] load build definition from Dockerfile.alpenhorn                                                0.0s
 => => transferring dockerfile: 1.20kB                                                                                   0.0s
 => [alpenshell internal] load metadata for docker.io/library/python:latest                                              1.2s
 => [alpenshell internal] load .dockerignore                                                                             0.0s
 => => transferring context: 2B                                                                                          0.0s
 => [alpenshell internal] load build context                                                                             0.9s
 => => transferring context: 5.97MB                                                                                      0.9s
 => [alpenshell  1/11] FROM docker.io/library/python:latest@sha256:c33390eacee652aecb774f9606c263b4f76415bc83926a6777e  18.8s
 => => resolve docker.io/library/python:latest@sha256:c33390eacee652aecb774f9606c263b4f76415bc83926a6777ede0f853c6bc19   0.0s
 => => sha256:ca513cad200b13ead2c745498459eed58a6db3480e3ba6117f854da097262526 64.39MB / 64.39MB                         1.8s
 => => sha256:c33390eacee652aecb774f9606c263b4f76415bc83926a6777ede0f853c6bc19 10.04kB / 10.04kB                         0.0s
 => => sha256:1dc5d6fc8bbd1dd9e0f4a202e99e03fe9575010057e730426c379da106ad446b 6.26kB / 6.26kB                           0.0s
 => => sha256:cf05a52c02353f0b2b6f9be0549ac916c3fb1dc8d4bacd405eac7f28562ec9f2 48.49MB / 48.49MB                         1.5s
 => => sha256:63964a8518f54dc31f8df89d7f06714c7a793aa1aa08a64ae3d7f4f4f30b4ac8 24.01MB / 24.01MB                         0.9s
 => => sha256:9ceebdae2d382eb0a06dfb69d15f21a14cb8dd4e369cc93df299fb4fd9c6183b 2.32kB / 2.32kB                           0.0s
 => => sha256:c187b51b626e1d60ab369727b81f440adea9d45e97a45e137fc318be0bb7f09f 211.36MB / 211.36MB                       4.7s
 => => sha256:776493ee5e4c0d0be79a520728d8e75ad7875d3d0a20c559719ce4bdbfd1135a 6.16MB / 6.16MB                           1.8s
 => => extracting sha256:cf05a52c02353f0b2b6f9be0549ac916c3fb1dc8d4bacd405eac7f28562ec9f2                                2.8s
 => => sha256:39ca2d92e12971b595d75bc8a5333312290333b9697057fbc650aa59b5e0d79f 27.38MB / 27.38MB                         2.6s
 => => sha256:ab89b311642188180787ced631a8b087ec24cc326cc76f84a4c2cd9cf30170a1 250B / 250B                               2.0s
 => => extracting sha256:63964a8518f54dc31f8df89d7f06714c7a793aa1aa08a64ae3d7f4f4f30b4ac8                                0.7s
 => => extracting sha256:ca513cad200b13ead2c745498459eed58a6db3480e3ba6117f854da097262526                                3.2s
 => => extracting sha256:c187b51b626e1d60ab369727b81f440adea9d45e97a45e137fc318be0bb7f09f                                7.8s
 => => extracting sha256:776493ee5e4c0d0be79a520728d8e75ad7875d3d0a20c559719ce4bdbfd1135a                                0.4s
 => => extracting sha256:39ca2d92e12971b595d75bc8a5333312290333b9697057fbc650aa59b5e0d79f                                1.0s
 => => extracting sha256:ab89b311642188180787ced631a8b087ec24cc326cc76f84a4c2cd9cf30170a1                                0.0
 => [alpenshell  2/11] RUN apt-get update && apt-get install --no-install-recommends -y     vim     ssh     rsync       14.3s
 => [alpenshell  3/11] RUN pip install --no-cache-dir mysqlclient                                                        8.0s
 => [alpenshell  4/11] RUN ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa                                                  1.2s
 => [alpenshell  5/11] RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys                                           0.5s
 => [alpenshell  6/11] RUN echo 'Host *\n    StrictHostKeyChecking no\n' > /root/.ssh/config                             0.6s
 => [alpenshell  7/11] COPY demo/alpenhorn.conf /etc/alpenhorn/alpenhorn.conf                                            0.1s
 => [alpenshell  8/11] RUN mkdir /var/log/alpenhorn                                                                      0.4s
 => [alpenshell  9/11] COPY examples/pattern_importer.py /root/python/pattern_importer.py                                0.1s
 => [alpenshell 10/11] ADD . /build                                                                                      0.4s
 => [alpenshell 11/11] RUN cd /build && pip install .                                                                   32.7s
 => [alpenshell] exporting to image                                                                                      1.2s
 => => exporting layers                                                                                                  1.2s
 => => writing image sha256:fd14160332396a1c20e3fc322dfa041887d0df81d362664be82fc2637df0e57c                             0.0s
 => => naming to docker.io/library/alpenhorn                                                                             0.0s
 => [alpenshell] resolving provenance for metadata file
 root@alpenshell:/#

Once at the root prompt, we can build the data index and start populating it.

Tip

You can log out of this alpenshell container at any time during the demo. To later re-enter it, simply run the docker compose run --rm alpenshell command again.

Setting up the data index

Creating the data index is simple, and can be accomplished by running the following command with the alpenhorn CLI utility:

alpenhorn db init

Hint

Remember that all these alpenhorn commands need to be run inside the alpenshell container that we started in the last section.

On successful completion, the db init command will report the version of the database schema used to create the Data Index:

root@alpenshell:/# alpenhorn db init
Data Index version 2 initialised.

Tip

It’s worth pointing out at this point that the alpenhorn CLI can be run from anywhere that has access to the alpenhorn database. It’s explicitly not necessary to run the CLI on a host which contains a StorageNode (or is running the daemon), even when using the CLI to run commands which affect that StorageNode or daemon.

Setting up the import extension

Because alpenhorn is data agnostic, it doesn’t have any facilities out-of-the-box to import files. To be able to import files, alpenhorn needs one or more “import-detect extensions” to be loaded. For the purposes of this demo, we’ll use the simple pattern_importer example extension provided in the /examples directory. This extension has already been incorporated into the alpenhorn container image that we’re running, and alpenhorn has been set up to use it.

Hint

Alpenhorn is aware of the pattern_importer extension because it is listed as an extension to load in the alpenhorn config file, which is available in the alpenshell container at /etc/alpenhorn/alpenhorn.conf.

You can also take a look at it on the docker host, in the /demo/ subdirectory out of which you’re running this demo.

As explained in the documentation for the pattern_importer example, the extension adds four new tables to the alpenhorn Data Index: AcqData, AcqType, FileData, and FileType. Adding extra tables to the Data Index is permitted, but caution must be used to prevent name clashes with alpenhorn’s own tables, and tables from other potential extensions. Fortunately, for the simple case in this demo, we don’t have to worry about that.

To initialise the database for the extension, run the demo_init function provided by the extension:

python -c 'import pattern_importer; pattern_importer.demo_init()'

If you get a ModuleNotFoundError: No module named 'pattern_importer' error, you’re probably not executing this command in the root-shell in the alpenshell container.

You should see a success message:

root@alpenshell:/# python -c 'import pattern_importer; pattern_importer.demo_init()'
Plugin init complete.

Create the first StorageNode

We need to start with a place to put some files. We’ll create the first StorageNode, which will be hosted on alpenhost1. Before we can do that, though, we first need to create a StorageGroup to house the node. Every StorageNode needs to be contained in a StorageGroup. Typically each group contains only a single node, but certain group classes support or require multiple nodes (such as the transport group that we’ll create later).

To create the group, which we’ll call demo_storage1, run:

alpenhorn group create demo_storage1

Tip

You’re encouraged to explore the commands you’re running in this demo. Every command (or partial command) accepts the --help flag, which will show you all the possible options for the command, and provide information on usage, with important caveats. If you’re curious, try all of these and see what they tell you: alpenhorn --help, alpenhorn group --help, and alpenhorn group create --help.

This should create the group:

root@alpenshell:/# alpenhorn group create demo_storage1
Created storage group "demo_storage1".

Hint

If you instead get the error “Error: Group "demo_storage1" already exists.”, you’re likely trying to run this demo using an old instance of the database. In that case, if you want to start with a clean demo, you can stop the demo and delete the old database volume as explained above.

Now that the group is created, we can create a node within it. We’ll also call the node demo_storage1. (By convention, when a StorageGroup contains only one StorageNode, the node and group have the same name, though that’s not required.)

alpenhorn node create demo_storage1 --group=demo_storage1 --auto-import --root=/data --host=alpenhost1

This command will create a new StorageNode called demo_storage1 and put it in the identically-named group. Auto-import (automatic monitoring for new files) will be turned on; the mount point in the filesystem will be set to /data and we declare it to be available on host alpenhost1:

root@alpenshell:/# alpenhorn node create demo_storage1 --group=demo_storage1 --auto-import
              --root=/data --host=alpenhost1
Created storage node "demo_storage1".

That’s enough to get us started.

Tip

You will be issuing a lot of alpenhorn commands over the course of this demo. We suggest leaving the alpenshell prompt open to make it more convenient to issue them. If you ever need to re-open the shell, remember you can run docker compose run --rm alpenshell again to re-enter it.

Start the first daemon

Now it’s time to start the first daemon. The alpenhorn container image is designed to run the alpenhorn daemon automatically. Start the first host container by running the docker compose up command:

docker compose up --detach alpenhost1

Note: if you’re following along with this demo, the database container should already be running:

$ docker compose up --detach alpenhost1
[+] Running 2/2
 ✔ Container demo-alpendb-1   Running                                                         0.0s
 ✔ Container demo-alpenhost1-1  Started                                                       0.4s

(If the database container is not running, docker compose will start it first.)

You should now check the logs for the daemon:

docker compose logs alpenhost1

(You can add --follow if you wish to have the logs continuously update.) You’ll see the alpenhorn daemon start up:

alpenhost1-1  | Feb 21 00:38:32 INFO >> [MainThread] Alpenhorn start.
alpenhost1-1  | Feb 21 00:38:32 INFO >> [MainThread] Loading config file /etc/alpenhorn/alpenhorn.conf
alpenhost1-1  | Feb 21 00:38:32 INFO >> [MainThread] Loading extension pattern_importer
alpenhost1-1  | Feb 21 00:38:32 INFO >> [Worker#1] Started.
alpenhost1-1  | Feb 21 00:38:32 INFO >> [Worker#2] Started.

Two worker threads are started because that’s what’s specified in the alpenhorn.conf file. The daemon has also loaded the pattern_importer extension, since that’s also specified in the config file.

Almost immediately, the daemon will notice that there are no active nodes on alpenhost1. It will perform this check roughly every ten seconds, which is the update interval time set in the alpenhorn.conf file.

alpenhost1-1  | Feb 21 00:38:32 WARNING >> [MainThread] No active nodes on host (alpenhost1)!
alpenhost1-1  | Feb 21 00:38:32 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Feb 21 00:38:32 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers
alpenhost1-1  | Feb 21 00:38:42 WARNING >> [MainThread] No active nodes on host (alpenhost1)!
alpenhost1-1  | Feb 21 00:38:42 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Feb 21 00:38:42 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers
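
If you want to watch for messages like these from a script, the log lines can be picked apart with a regular expression. The following is a minimal Python sketch; the line layout is inferred only from the sample output shown in this demo, and the names LOG_LINE, TASKS, parse_line and task_counts are made up for this example, not part of alpenhorn:

```python
import re

# Layout of a line as printed by "docker compose logs", for example:
# alpenhost1-1  | Feb 21 00:38:32 INFO >> [MainThread] Alpenhorn start.
LOG_LINE = re.compile(
    r"^(?P<container>\S+)\s+\|\s+"
    r"(?P<timestamp>\w{3} +\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) >> "
    r"\[(?P<thread>[^\]]+)\] "
    r"(?P<message>.*)$"
)

# The periodic task summary emitted by the main loop.
TASKS = re.compile(
    r"Tasks: (?P<queued>\d+) queued, (?P<deferred>\d+) deferred, "
    r"(?P<in_progress>\d+) in-progress on (?P<workers>\d+) workers"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def task_counts(message):
    """Extract the integer counters from a task summary message, or None."""
    match = TASKS.search(message)
    return {key: int(value) for key, value in match.groupdict().items()} if match else None


sample = (
    "alpenhost1-1  | Feb 21 00:38:42 INFO >> [MainThread] "
    "Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers"
)
record = parse_line(sample)
print(record["level"], record["thread"])
print(task_counts(record["message"]))
```

A script could, for instance, flag every record whose level is WARNING.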

We can fix this by activating the node we created. To do this, in the alpenshell container, we can use the node activate command:

alpenhorn node activate demo_storage1

Alpenhorn will acknowledge the command:

root@alpenshell:/# alpenhorn node activate demo_storage1
Storage node "demo_storage1" activated.

Now the daemon will find the active node, but there’s still a problem:

alpenhost1-1  | Feb 21 00:40:22 INFO >> [MainThread] Node "demo_storage1" now available.
alpenhost1-1  | Feb 21 00:40:22 WARNING >> [MainThread] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost1-1  | Feb 21 00:40:22 WARNING >> [MainThread] Ignoring node "demo_storage1": not initialised.
alpenhost1-1  | Feb 21 00:40:22 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Feb 21 00:40:22 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers

We need to initialise the node so alpenhorn can use it. In this case, we could do this by manually creating the /data/ALPENHORN_NODE file that it can’t find. But, generally, it’s easier to get alpenhorn to initialise the node for us:

alpenhorn node init demo_storage1

The initialisation is not performed by the alpenhorn CLI. Instead the CLI will create a request in the database to initialise the node:

root@alpenshell:/# alpenhorn node init demo_storage1
Requested initialisation of Node "demo_storage1".

Tip

A node only ever needs to be initialised once, when it is first created, but it’s always safe to run this command: a request to initialise an already-initialised node is simply ignored.

The daemon on alpenhost1 will notice this request and you should see the node being initialised by one of the daemon workers:

alpenhost1-1  | Feb 21 00:40:52 INFO >> [MainThread] Node "demo_storage1" now available.
alpenhost1-1  | Feb 21 00:40:52 WARNING >> [MainThread] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost1-1  | Feb 21 00:40:52 INFO >> [MainThread] Requesting init of node "demo_storage1".
alpenhost1-1  | Feb 21 00:40:52 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Feb 21 00:40:52 INFO >> [Worker#1] Beginning task Init Node "demo_storage1"
alpenhost1-1  | Feb 21 00:40:52 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | Feb 21 00:40:52 WARNING >> [Worker#1] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost1-1  | Feb 21 00:40:52 WARNING >> [Worker#1] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost1-1  | Feb 21 00:40:52 INFO >> [Worker#1] Node "demo_storage1" initialised.
alpenhost1-1  | Feb 21 00:40:52 INFO >> [Worker#1] Finished task: Init Node "demo_storage1"
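
The split between the CLI, which only records a request, and the daemon, which carries it out, can be illustrated with a toy request queue. This Python sketch uses an in-memory SQLite database and an invented init_request table; alpenhorn’s real Data Index schema and daemon logic are different, so treat every name here as hypothetical:

```python
import sqlite3

# Toy stand-in for the Data Index. The table and column names are invented.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE init_request ("
    " id INTEGER PRIMARY KEY,"
    " node TEXT NOT NULL,"
    " completed INTEGER NOT NULL DEFAULT 0)"
)

# Stands in for the ALPENHORN_NODE files present on node filesystems.
initialised_nodes = set()


def cli_request_init(node):
    """CLI side: record the request in the database and return at once."""
    db.execute("INSERT INTO init_request (node) VALUES (?)", (node,))
    db.commit()


def daemon_poll():
    """Daemon side: one pass of the main loop over pending requests."""
    newly_initialised = []
    pending = db.execute(
        "SELECT id, node FROM init_request WHERE completed = 0 ORDER BY id"
    ).fetchall()
    for request_id, node in pending:
        if node not in initialised_nodes:  # already initialised: nothing to do
            initialised_nodes.add(node)
            newly_initialised.append(node)
        db.execute(
            "UPDATE init_request SET completed = 1 WHERE id = ?", (request_id,)
        )
    db.commit()
    return newly_initialised


cli_request_init("demo_storage1")
print(daemon_poll())  # ['demo_storage1']
print(daemon_poll())  # []
```

Because the two sides communicate only through the database, the CLI never needs access to the node filesystem, which is why it can run in the alpenshell container.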

After initialisation is complete, the daemon will finally be happy with the StorageNode and start the auto-import monitor. The start of auto-import triggers a “catch-up” job which searches for unknown, pre-existing files that need import. As this is an empty node, though, it won’t find anything:

alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Node "demo_storage1" now available.
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Group "demo_storage1" now available.
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Watching node "demo_storage1" root "/data" for auto import.
alpenhost1-1  | Feb 21 00:41:02 INFO >> [Worker#1] Beginning task Catch-up on demo_storage1
alpenhost1-1  | Feb 21 00:41:02 INFO >> [Worker#1] Scanning "." on "demo_storage1" for new files.
alpenhost1-1  | Feb 21 00:41:02 INFO >> [Worker#1] Scanning ".".
alpenhost1-1  | Feb 21 00:41:02 INFO >> [Worker#1] Finished task: Catch-up on demo_storage1
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Node demo_storage1: 46.77 GiB available.
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Updating node "demo_storage1".
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Updating group "demo_storage1".
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Feb 21 00:41:02 INFO >> [MainThread] Tasks: 1 queued, 0 deferred, 0 in-progress on 2 workers

It will also run a job to see if there’s anything needing clean-up on the node. This “tidy up” job helps the alpenhorn daemon recover from unexpected crashes by looking for and removing temporary files which the alpenhorn daemon may not have been able to clean up the last time it ran. The job is generally run when a node first becomes available to the daemon, and then periodically after that. Again, because this is a brand-new node, there isn’t anything needing tidying:

alpenhost1-1  | Feb 21 00:41:02 INFO >> [Worker#2] Beginning task Tidy up demo_storage1
alpenhost1-1  | Feb 21 00:41:02 INFO >> [Worker#2] Finished task: Tidy up demo_storage1
alpenhost1-1  | Feb 21 00:41:12 INFO >> [MainThread] Node demo_storage1: 46.77 GiB available.
alpenhost1-1  | Feb 21 00:41:12 INFO >> [MainThread] Updating node "demo_storage1".
alpenhost1-1  | Feb 21 00:41:12 INFO >> [MainThread] Updating group "demo_storage1".
alpenhost1-1  | Feb 21 00:41:12 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Feb 21 00:41:12 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers

Importing files

Let’s experiment now with importing files into alpenhorn, using both the auto-import system and manually importing them.

What kind of files can be imported?

As mentioned before, alpenhorn itself is agnostic to data file contents. All decisions on which files are imported into the data index are made by the import detect extensions, which can be tailored to the specific data being managed. For this demo, the only import detect function we’re using is the example pattern_importer extension. This extension uses regular expressions to match against the pathnames of candidate files to determine whether they should be imported or not.

The demo_init function that we called earlier to initialise the database for this demo added one allowed ArchiveAcq name pattern, consisting of a nested directory tree with the date (YYYY/MM/DD), and two allowed ArchiveFile name patterns. The first of these matches a file called “meta.txt” in the top acquisition directory (i.e. YYYY/MM/DD/meta.txt), which provides metadata for our notional acquisition. The second matches data files named with the time of day, sorted further into hourly directories (i.e. YYYY/MM/DD/hh/mmss.dat).
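
To make the naming scheme concrete, here is a rough Python sketch of how paths could be matched against patterns like these. The regular expressions below are written from the description above and are not copied from pattern_importer; the function name classify and the labels "meta" and "data" are likewise assumptions for illustration:

```python
import re

# Acquisition directory: YYYY/MM/DD (assumed to be all digits).
ACQ = r"(?P<acq>\d{4}/\d{2}/\d{2})"

# Hypothetical file patterns, relative to the node root.
META = re.compile(rf"^{ACQ}/meta\.txt$")           # YYYY/MM/DD/meta.txt
DATA = re.compile(rf"^{ACQ}/\d{{2}}/\d{{4}}\.dat$")  # YYYY/MM/DD/hh/mmss.dat


def classify(path):
    """Return (acquisition, kind) for a matching path, or None to skip it."""
    for kind, pattern in (("meta", META), ("data", DATA)):
        match = pattern.match(path)
        if match:
            return match.group("acq"), kind
    return None


for candidate in (
    "2025/02/21/meta.txt",
    "2025/02/21/23/1324.dat",
    "2025/02/21/notes.txt",
):
    print(candidate, "->", classify(candidate))
```

Paths that match neither pattern are simply not imported, which is how an import detect extension keeps unrelated files out of the data index.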

It bears repeating: the contents of these files are not interesting to alpenhorn per se, but an import detect extension may be implemented which inspects the data of the files being imported, if desired.

We’ll continue this demo by creating files with the above-mentioned naming conventions, without much concern about the file contents.

Auto-importing files and lock files

Let’s start with auto-importing files. When auto-import is turned on for a node, as it has been for our demo_storage1 node, files will automatically be discovered by alpenhorn as they are added to the node filesystem.

Care must be taken when writing files to a node filesystem when auto-import is turned on to prevent alpenhorn from trying to import a file before it is fully written. To prevent this from happening, before creating a file on the node filesystem, we can create a lock file.

For a file at the path AAA/BBB/name.ext, the corresponding lock file will be called AAA/BBB/.name.ext.lock (i.e. the name of a lock file is the name of the file it’s locking plus a leading . and a .lock suffix).
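
The naming rule can be expressed as a one-line helper. This is a Python sketch of the rule as stated above, not a function provided by alpenhorn; the name lock_path is hypothetical:

```python
from pathlib import PurePosixPath


def lock_path(path):
    """Return the lock file path for `path`: same directory, '.' + name + '.lock'."""
    path = PurePosixPath(path)
    return path.with_name(f".{path.name}.lock")


print(lock_path("AAA/BBB/name.ext"))     # AAA/BBB/.name.ext.lock
print(lock_path("2025/02/21/meta.txt"))  # 2025/02/21/.meta.txt.lock
```

Note that only the final path component changes; the lock file always lives in the same directory as the file it protects.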

Let’s create the first file we want to import into alpenhorn, first creating its lock file. To do this, we’ll have to log into the alpenhost1 container to gain access to the demo_storage1 filesystem. We can start a shell in the running container using docker compose exec:

docker compose exec alpenhost1 bash -l

Once in this root shell on alpenhost1, we can create the first of our files:

cd /data
mkdir -p 2025/02/21
touch 2025/02/21/.meta.txt.lock
echo "This is the first acquisition in the alpenhorn demo" > 2025/02/21/meta.txt

Hint

If the cd command returns a “No such file or directory” error, then you’re probably trying to create the file in the alpenshell container. That container doesn’t have access to the demo_storage1 filesystem. You need to create the files inside the alpenhost1 container, which you can access using the docker compose exec command provided above.

When creating the file in this last step, you’ll see alpenhorn notice the file, but skip it because it’s locked:

alpenhost1-1  | Feb 21 23:04:21 INFO >> [Worker#1] Beginning task Import 2025/02/21/meta.txt on demo_storage1
alpenhost1-1  | Feb 21 23:04:21 INFO >> [Worker#1] Skipping "2025/02/21/meta.txt": locked.
alpenhost1-1  | Feb 21 23:04:21 INFO >> [Worker#1] Finished task: Import 2025/02/21/meta.txt on demo_storage1

Note

In some cases file creation can cause multiple import requests to be scheduled. This is harmless: alpenhorn is prepared to handle multiple simultaneous attempts to import the same file and will only ever import a file once.

Once the file has been created, the lock file can be deleted, to trigger import of the file:

rm -f 2025/02/21/.meta.txt.lock

This will cause alpenhorn to finally import the file:

alpenhost1-1  | Feb 21 23:07:07 INFO >> [Worker#1] Beginning task Import 2025/02/21/meta.txt on demo_storage1
alpenhost1-1  | Feb 21 23:07:07 INFO >> [Worker#1] Acquisition "2025/02/21" added to DB.
alpenhost1-1  | Feb 21 23:07:07 INFO >> [Worker#1] File "2025/02/21/meta.txt" added to DB.
alpenhost1-1  | Feb 21 23:07:07 INFO >> [Worker#1] Imported file copy "2025/02/21/meta.txt" on node "demo_storage1".
alpenhost1-1  | Feb 21 23:07:07 INFO >> [Worker#1] Finished task: Import 2025/02/21/meta.txt on demo_storage1

Note here that the three lines in the middle of the daemon output above indicate that the daemon has created three new records in the database:

  • an ArchiveAcq record for the new acquisition, with name 2025/02/21

  • an ArchiveFile record for the new file, with name meta.txt in the new acquisition

  • an ArchiveFileCopy record recording that a copy of the newly-created ArchiveFile exists on demo_storage1

You can use the alpenhorn CLI to see that this file is now present on the demo_storage1 node:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             1          52 B         -
root@alpenshell:/# alpenhorn file list --node=demo_storage1 --details
File                 Size    MD5 Hash                          Registration Time             State    Size on Node
-------------------  ------  --------------------------------  ----------------------------  -------  --------------
2025/02/21/meta.txt  52 B    4f2a66c1ff5eb90a5013522d53ea2e91  Fri Feb 21 23:07:08 2025 UTC  Healthy  4.000 kiB
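The lock-file sequence used in this section (create the lock, write the file, remove the lock) can be wrapped in a helper. The following is a minimal sketch, not something shipped with alpenhorn: write_locked is a hypothetical name, and cleaning up a partially written file on failure is a design choice made here.

```shell
# Hypothetical helper: write stdin to "$1" while a lock file exists,
# mirroring the manual steps above (create lock, write file, remove lock).
write_locked() {
    target=$1
    lock="$(dirname "$target")/.$(basename "$target").lock"
    mkdir -p "$(dirname "$target")" || return 1
    touch "$lock" || return 1        # create the lock before the file appears
    if cat > "$target"; then
        rm -f "$lock"                # file is complete: removing the lock allows import
    else
        rm -f "$target" "$lock"      # write failed: remove the partial file, then the lock
        return 1
    fi
}
```

Inside the alpenhost1 container it could be used as: echo "some text" | write_locked /data/2025/02/21/meta.txt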

Auto-importing files and temporary names

Another option for writing files to a node filesystem when auto-import is turned on is to use a temporary name for the file, one which will cause alpenhorn to decline to import it. The import extensions you’re using may provide a namespace for such files, as is the case in this demo with the pattern_importer which has been configured: any filename which does not match the patterns defined by the pattern_importer.demo_init function would work.

Whether or not your import extensions have provisions for omitting files based on pathname, another option is to use a leading dot in the name of the file you’re creating: alpenhorn will never import a file whose name begins with a . (dot). Note: this is only true of file names: alpenhorn is still willing to import paths which contain directories with leading dots in their names, assuming such names are acceptable to at least one of your import extensions.
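The distinction between a dot in the file name and a dot in a directory name can be illustrated with a short shell check. This only mirrors the rule as described here; it is not alpenhorn’s own code, and is_dot_file is a hypothetical name.

```shell
# Illustrative check only (not alpenhorn's code): succeed if the final
# component of a path starts with a dot.
is_dot_file() {
    case "$(basename "$1")" in
        .*) return 0 ;;
        *)  return 1 ;;
    esac
}

# A dot at the start of the file name: covered by the rule.
is_dot_file 2025/02/21/23/.1324.dat && echo "file name starts with a dot"
# A dot at the start of a directory name only: not covered by the rule.
is_dot_file .hidden/2025/02/21/meta.txt || echo "not excluded by the dot rule"
```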

As an example, let’s create a .dat file with a temporary name by appending, say, .temp to the name of the file we want to create. In the alpenhost1 container:

cd /data
mkdir 2025/02/21/23
echo "0 1 2 3 4 5" > 2025/02/21/23/1324.dat.temp

This file creation will be noticed by alpenhorn, but no import will occur, because the pattern_importer won’t accept the name as valid:

alpenhost1-1  | Feb 21 23:51:59 INFO >> [Worker#1] Beginning task Import 2025/02/21/23/1324.dat.temp on demo_storage1
alpenhost1-1  | Feb 21 23:51:59 INFO >> [Worker#1] Not importing non-acquisition path: 2025/02/21/23/1324.dat.temp
alpenhost1-1  | Feb 21 23:51:59 INFO >> [Worker#1] Finished task: Import 2025/02/21/23/1324.dat.temp on demo_storage1

The message “Not importing non-acquisition path” means no import extension indicated to alpenhorn that the file should be imported. If, instead, we had used a temporary filename with a leading dot, say, /data/2025/02/21/23/.1324.dat, an import task wouldn’t have even been made, since alpenhorn would have rejected the file name earlier, before it got around to attempting to import the file.

After the file is fully written, it can be moved to the correct name. On most filesystems, this is an atomic operation:

mv 2025/02/21/23/1324.dat.temp 2025/02/21/23/1324.dat

Hint

By “atomic operation” we mean: on most filesystems there is never a time during execution of the mv command when the destination filename 2025/02/21/23/1324.dat refers to a partial file. Either the destination file doesn’t exist, or it exists and is complete.

This will trigger import of the file:

alpenhost1-1  | Feb 21 23:52:20 INFO >> [Worker#2] Beginning task Import 2025/02/21/23/1324.dat on demo_storage1
alpenhost1-1  | Feb 21 23:52:20 INFO >> [Worker#2] File "2025/02/21/23/1324.dat" added to DB.
alpenhost1-1  | Feb 21 23:52:20 INFO >> [Worker#2] Imported file copy "2025/02/21/23/1324.dat" on node "demo_storage1".
alpenhost1-1  | Feb 21 23:52:20 INFO >> [Worker#2] Finished task: Import 2025/02/21/23/1324.dat on demo_storage1

Unlike when we imported the first file, now only two new records are created in the database, because the ArchiveAcq record already exists:

  • an ArchiveFile for the new file

  • an ArchiveFileCopy for the copy of the new file on demo_storage1

Now there are two files on the node:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             2          64 B         -
root@alpenshell:/# alpenhorn file list --node=demo_storage1 --details
File                    Size    MD5 Hash                          Registration Time             State    Size on Node
----------------------  ------  --------------------------------  ----------------------------  -------  --------------
2025/02/21/23/1324.dat  12 B    4c79018e00ddef11af0b9cfc14dd3261  Fri Feb 21 23:52:21 2025 UTC  Healthy  4.000 kiB
2025/02/21/meta.txt     52 B    4f2a66c1ff5eb90a5013522d53ea2e91  Fri Feb 21 23:07:08 2025 UTC  Healthy  4.000 kiB
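The temporary-name approach from this section can also be wrapped in a helper. This is a sketch under the assumption, true for this demo, that names ending in .temp are rejected by the configured import extensions; write_via_temp is a hypothetical name.

```shell
# Hypothetical helper: write stdin to "$1.temp", then rename it into place.
# Assumes ".temp" names are not accepted by any import extension in use.
write_via_temp() {
    target=$1
    temp_name="$target.temp"
    mkdir -p "$(dirname "$target")" || return 1
    if cat > "$temp_name"; then
        # Same directory, so the rename is atomic on most filesystems.
        mv "$temp_name" "$target"
    else
        rm -f "$temp_name"
        return 1
    fi
}
```

Inside the alpenhost1 container it could be used as: echo "0 1 2 3 4 5" | write_via_temp /data/2025/02/21/23/1324.dat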

Manually importing files

Let’s now turn to the case where we don’t have auto-import turned on for a node. In this case there’s no difficulty writing to the node, since filesystem events won’t trigger automatic attempts to import files.

First, turn off auto-import on the node by modifying its properties:

alpenhorn node modify demo_storage1 --no-auto-import

If you want, you can verify that auto-import has been turned off for the node by checking its metadata after the modify command:

root@alpenshell:/# alpenhorn node modify demo_storage1 --no-auto-import
Node updated.
root@alpenshell:/# alpenhorn node show demo_storage1
   Storage Node: demo_storage1
  Storage Group: demo_storage1
         Active: Yes
           Type: -
          Notes:
      I/O Class: Default

    Daemon Host: alpenhost1
 Log-in Address:
Log-in Username:

    Auto-Import: Off
    Auto-Verify: Off
      Max Total: -
      Available: 46.47 GiB
  Min Available: -
   Last Checked: Sat Feb 22 00:03:36 2025 UTC

I/O Config:

  none

With that done, let’s create some more data files:

cd /data
echo "0 1 2 3 4 5" > 2025/02/21/23/1330.dat
echo "3 4 5 6 7 8" > 2025/02/21/23/1342.dat
echo "9 10 11 12 13" > 2025/02/21/23/1349.dat

None of these files have been added to the database. We can use the alpenhorn CLI to see this: as far as alpenhorn is concerned, there are still only two files on the node.

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             2          64 B         -

But, now that we’ve finished writing these files, we can tell alpenhorn to import them. This can be done for an individual file:

alpenhorn file import --register-new 2025/02/21/23/1330.dat demo_storage1

Hint

The --register-new flag tells alpenhorn that it is allowed to create a new ArchiveFile (and, were it necessary, an ArchiveAcq record, too) for newly discovered files. Without this flag, alpenhorn will only import files which are already represented by an existing ArchiveFile. This second mode is more appropriate in cases where a node should not be receiving new files.

The CLI will create an import request for this file:

root@alpenshell:/# alpenhorn file import --register-new 2025/02/21/23/1330.dat demo_storage1
Added new import request.

The import request should shortly be handled by the daemon:

alpenhost1-1  | Feb 22 00:09:36 INFO >> [Worker#1] Beginning task Import 2025/02/21/23/1330.dat on demo_storage1
alpenhost1-1  | Feb 22 00:09:36 INFO >> [Worker#1] File "2025/02/21/23/1330.dat" added to DB.
alpenhost1-1  | Feb 22 00:09:36 INFO >> [Worker#1] Imported file copy "2025/02/21/23/1330.dat" on node "demo_storage1".
alpenhost1-1  | Feb 22 00:09:36 INFO >> [Worker#1] Completed import request #2.
alpenhost1-1  | Feb 22 00:09:36 INFO >> [Worker#1] Finished task: Import 2025/02/21/23/1330.dat on demo_storage1

It’s also possible to tell alpenhorn to scan an entire directory for new files:

alpenhorn node scan demo_storage1 --register-new 2025/02/21

Which will add another import request:

root@alpenshell:/# alpenhorn node scan demo_storage1 --register-new 2025/02/21
Added request for scan of "2025/02/21" on Node "demo_storage1".

Now alpenhorn will scan the requested path and find the other two files we just created:

alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Beginning task Scan "2025/02/21" on demo_storage1
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Scanning "2025/02/21" on "demo_storage1" for new files.
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Scanning "2025/02/21".
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Scanning "2025/02/21/23".
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#1] Beginning task Import 2025/02/21/23/1349.dat on demo_storage1
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Completed import request #4.
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Finished task: Scan "2025/02/21" on demo_storage1
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Beginning task Import 2025/02/21/23/1342.dat on demo_storage1
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#1] File "2025/02/21/23/1349.dat" added to DB.
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#1] Imported file copy "2025/02/21/23/1349.dat" on node "demo_storage1".
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] File "2025/02/21/23/1342.dat" added to DB.
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Imported file copy "2025/02/21/23/1342.dat" on node "demo_storage1".
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#1] Finished task: Import 2025/02/21/23/1349.dat on demo_storage1
alpenhost1-1  | Feb 22 00:12:56 INFO >> [Worker#2] Finished task: Import 2025/02/21/23/1342.dat on demo_storage1

Now there are five files on the storage node:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             5         102 B         -
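The 102 B total is simply the sum of the sizes of the five files created so far. As a cross-check, the sketch below recreates the same contents in a scratch directory (standing in for /data, so it does not touch alpenhorn or the demo containers) and adds up the byte counts.

```shell
# Recreate the five demo files in a scratch directory and add up their sizes.
scratch=$(mktemp -d)
mkdir -p "$scratch/2025/02/21/23"
echo "This is the first acquisition in the alpenhorn demo" > "$scratch/2025/02/21/meta.txt"
echo "0 1 2 3 4 5" > "$scratch/2025/02/21/23/1324.dat"
echo "0 1 2 3 4 5" > "$scratch/2025/02/21/23/1330.dat"
echo "3 4 5 6 7 8" > "$scratch/2025/02/21/23/1342.dat"
echo "9 10 11 12 13" > "$scratch/2025/02/21/23/1349.dat"

# Count the bytes in every regular file under the scratch directory.
total=$(find "$scratch" -type f -exec cat {} + | wc -c)
total=$((total))   # normalise any padding in the wc output
echo "$total B"    # 52 + 12 + 12 + 12 + 14 = 102
rm -rf "$scratch"
```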

Syncing files between nodes

Let’s now move on to syncing, or transferring, files between different hosts.

Starting up the second and third nodes

Before being able to transfer files, we need to create somewhere to transfer them to. We’ll start by creating the second storage node on the second host:

alpenhorn node create demo_storage2 --create-group --root=/data --host=alpenhost2

Hint

The --create-group option to node create tells alpenhorn to also create a StorageGroup for the new node with the same name (i.e. the same thing we did manually for demo_storage1 above).

This will create the second node:

root@alpenshell:/# alpenhorn node create demo_storage2 --create-group --root=/data --host=alpenhost2
Created storage group "demo_storage2".
Created storage node "demo_storage2".

Let’s also make sure this node gets initialised, though this won’t happen immediately, since we haven’t activated the Storage Node, nor are we running the second daemon yet.

alpenhorn node init demo_storage2

Hint

Requests created by the alpenhorn CLI, be they initialisation requests, import requests, or transfer requests, do not require the target node to be active, nor do they require an alpenhorn daemon to be managing them. Requests made on inactive nodes will remain pending in the database until they can be handled by an alpenhorn daemon instance.

You can see pending requests, including this init request, using the alpenhorn CLI:

root@alpenshell:/# alpenhorn node show demo_storage2 --all
   Storage Node: demo_storage2
  Storage Group: demo_storage2
         Active: No
           Type: -
          Notes:
      I/O Class: Default

    Daemon Host: alpenhost2
 Log-in Address:
Log-in Username:

    Auto-Import: Off
    Auto-Verify: Off
      Max Total: -
      Available: -
  Min Available: -
   Last Checked: -

I/O Config:

  none

Stats:

    Total Files: 0
     Total Size: -
          Usage: -%

Pending import requests:

Path         Scan    Register New    Request Time
-----------  ------  --------------  -------------------
[Node Init]  -       -               2025-02-26 22:54:14

Pending outbound transfers:

Dest. Group    Request Count    Total Size
-------------  ---------------  ------------

Auto-actions:

  none

Note

Node init requests are handled, under the hood, as a special kind of import request, which is why the Node Init request appears in the import request table.

This node is initially empty:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             5         102 B         -
demo_storage2             0             -         -

Before starting transfers we have to record log-in details for the hosts containing the nodes. alpenhorn uses SSH to log in to remote nodes when performing transfers, meaning we need to specify a username and log-in address for the node. For demo_storage1, which is already active, we can do this by modifying the node record:

alpenhorn node modify demo_storage1 --username root --address alpenhost1

For the second node, we can do it when we activate it. We could have also specified these values when we created the node:

alpenhorn node activate demo_storage2 --username root --address alpenhost2

Tip

It’s very important to distinguish between the name used for a node’s host (where the daemon managing the node is running) and the node’s address (the name or IP address used by remote daemons to access the node via SSH). Often these two fields have the same value, but there’s no requirement that they do.

Let’s start up the second alpenhorn container to get the second node running:

docker compose up --detach alpenhost2

You can monitor this node in the same way you did with alpenhost1:

docker compose logs alpenhost2

but it’s also possible to monitor all nodes at once:

docker compose logs --follow

For now, the new node should initialise itself, and then idle: there are no pending requests:

alpenhost2-1  | Feb 26 23:05:02 INFO >> [MainThread] Node "demo_storage2" now available.
alpenhost2-1  | Feb 26 23:05:02 WARNING >> [MainThread] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost2-1  | Feb 26 23:05:02 INFO >> [MainThread] Requesting init of node "demo_storage2".
alpenhost2-1  | Feb 26 23:05:02 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost2-1  | Feb 26 23:05:02 INFO >> [MainThread] Tasks: 1 queued, 0 deferred, 0 in-progress on 2 workers
alpenhost2-1  | Feb 26 23:05:02 INFO >> [Worker#1] Beginning task Init Node "demo_storage2"
alpenhost2-1  | Feb 26 23:05:02 WARNING >> [Worker#1] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost2-1  | Feb 26 23:05:02 WARNING >> [Worker#1] Node file "/data/ALPENHORN_NODE" could not be read.
alpenhost2-1  | Feb 26 23:05:02 INFO >> [Worker#1] Node "demo_storage2" initialised.
alpenhost2-1  | Feb 26 23:05:02 INFO >> [Worker#1] Finished task: Init Node "demo_storage2"
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Node "demo_storage2" now available.
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Group "demo_storage2" now available.
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Node demo_storage2: 45.51 GiB available.
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Updating node "demo_storage2".
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Updating group "demo_storage2".
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost2-1  | Feb 26 23:05:12 INFO >> [MainThread] Tasks: 1 queued, 0 deferred, 0 in-progress on 2 workers
alpenhost2-1  | Feb 26 23:05:12 INFO >> [Worker#1] Beginning task Tidy up demo_storage2
alpenhost2-1  | Feb 26 23:05:12 INFO >> [Worker#1] Finished task: Tidy up demo_storage2

Transferring a file

The alpenhorn daemon has the ability to transfer files between Storage Nodes. To trigger file movement, we need to issue sync or transfer requests. Transfer requests always request movement of a file from a Storage Node into a Storage Group. Because all the groups we have for now have a single node in them, this distinction isn’t terribly important, but we’ll revisit this later, when we experiment with multi-node groups.

We can transfer any existing file explicitly by issuing a transfer request for it:

alpenhorn file sync --from demo_storage1 --to demo_storage2 2025/02/21/meta.txt

This will submit a new transfer request:

root@alpenshell:/# alpenhorn file sync --from demo_storage1 --to demo_storage2 2025/02/21/meta.txt
Request submitted.

Transfers are always handled on the receiving side (that is: by the daemon which considers the destination StorageGroup to be available). After a short while, the daemon on alpenhost2 will notice this request. First, it will look at the local filesystem to see if the requested file already exists. If it did, there would be no need for a transfer:

alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#2] Beginning task Pre-pull search for 2025/02/21/meta.txt in demo_storage2
alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#2] Finished task: Pre-pull search for 2025/02/21/meta.txt in demo_storage2

But, in this case, the search will fail to find an existing copy of the file, so then a file transfer will be started:

alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#1] Beginning task AFCR#1: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#1] Creating directory "/data/2025/02/21".
alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#1] Pulling remote file 2025/02/21/meta.txt using rsync
alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#1] Pull of 2025/02/21/meta.txt complete. Transferred 52 B in 0.4s [139 B/s]
alpenhost2-1  | Feb 26 23:18:52 INFO >> [Worker#1] Finished task: AFCR#1: demo_storage1 -> demo_storage2

Note

The default tool for remote transfers is rsync, but alpenhorn will also try to use bbcp, a point-to-point network file copy tool, which may allow for higher-rate transfers, if it is available to the daemon.

Now there is one file on demo_storage2:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             5         102 B         -
demo_storage2             1          52 B         -

You can check the filesystem on alpenhost2 (by, say, running a find command) to see that this file now exists on that node:

$ docker compose exec alpenhost2 find /data
/data
/data/ALPENHORN_NODE
/data/2025
/data/2025/02
/data/2025/02/21
/data/2025/02/21/meta.txt

Bulk transfers

Rather than the tedious operation of requesting individual files to be transferred, it is more typical to request that all files present on a source node and absent from a destination group be transferred:

alpenhorn node sync demo_storage1 demo_storage2 --show-files

This will cause the alpenhorn CLI to create transfer requests for all files which are present on demo_storage1 but not present on demo_storage2.

This command will require confirmation:

root@alpenshell:/# alpenhorn node sync demo_storage1 demo_storage2 --show-files
Would sync 4 files (50 B) from Node "demo_storage1" to Group "demo_storage2":

2025/02/21/23/1324.dat
2025/02/21/23/1330.dat
2025/02/21/23/1342.dat
2025/02/21/23/1349.dat

Continue? [y/N]: y

Syncing 4 files (50 B) from Node "demo_storage1" to Group "demo_storage2".

Added 4 new copy requests.

Hint

Although there are five files on the node, only four of them will be transferred, because the first file we transferred is already on demo_storage2.
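The selection amounts to a set difference: paths present on the source node minus paths already present in the destination group. The sketch below illustrates that idea with plain text lists and standard tools. It is not how the alpenhorn CLI is implemented (the CLI works from the database), and files_to_sync is a hypothetical name.

```shell
# Illustration only: print paths listed in file "$1" (source) that are not
# listed in file "$2" (destination). Each file holds one path per line.
files_to_sync() {
    src_sorted=$(mktemp) && dst_sorted=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$src_sorted"
    LC_ALL=C sort -u "$2" > "$dst_sorted"
    # comm -23 keeps only the lines unique to the first (source) list.
    LC_ALL=C comm -23 "$src_sorted" "$dst_sorted"
    rm -f "$src_sorted" "$dst_sorted"
}

lists=$(mktemp -d)
printf '%s\n' 2025/02/21/meta.txt 2025/02/21/23/1324.dat 2025/02/21/23/1330.dat \
    2025/02/21/23/1342.dat 2025/02/21/23/1349.dat > "$lists/demo_storage1"
printf '%s\n' 2025/02/21/meta.txt > "$lists/demo_storage2"

# Prints the four .dat paths, matching the list shown by --show-files.
pending=$(files_to_sync "$lists/demo_storage1" "$lists/demo_storage2")
echo "$pending"
rm -rf "$lists"
```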

The daemon on alpenhost2 will churn through these requests:

alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Beginning task Pre-pull search for 2025/02/21/23/1330.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Finished task: Pre-pull search for 2025/02/21/23/1330.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Beginning task AFCR#2: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Beginning task Pre-pull search for 2025/02/21/23/1324.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Creating directory "/data/2025/02/21/23".
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Finished task: Pre-pull search for 2025/02/21/23/1324.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Beginning task AFCR#3: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Pulling remote file 2025/02/21/23/1324.dat using rsync
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Pulling remote file 2025/02/21/23/1330.dat using rsync
alpenhost2-1  | Feb 26 23:34:32 INFO >> [MainThread] Main loop execution was 0.1s.
alpenhost2-1  | Feb 26 23:34:32 INFO >> [MainThread] Tasks: 2 queued, 0 deferred, 2 in-progress on 2 workers
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Pull of 2025/02/21/23/1324.dat complete. Transferred 12 B in 0.3s [36 B/s]
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Pull of 2025/02/21/23/1330.dat complete. Transferred 12 B in 0.3s [36 B/s]
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Finished task: AFCR#3: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Finished task: AFCR#2: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Beginning task Pre-pull search for 2025/02/21/23/1349.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Beginning task Pre-pull search for 2025/02/21/23/1342.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Finished task: Pre-pull search for 2025/02/21/23/1349.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Beginning task AFCR#4: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Finished task: Pre-pull search for 2025/02/21/23/1342.dat in demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Beginning task AFCR#5: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Pulling remote file 2025/02/21/23/1349.dat using rsync
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Pulling remote file 2025/02/21/23/1342.dat using rsync
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Pull of 2025/02/21/23/1342.dat complete. Transferred 12 B in 0.4s [32 B/s]
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Pull of 2025/02/21/23/1349.dat complete. Transferred 14 B in 0.4s [37 B/s]
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#1] Finished task: AFCR#5: demo_storage1 -> demo_storage2
alpenhost2-1  | Feb 26 23:34:32 INFO >> [Worker#2] Finished task: AFCR#4: demo_storage1 -> demo_storage2

And eventually all files will be transferred to alpenhost2:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             5         102 B         -
demo_storage2             5         102 B         -

Hint

If you were to try the identical sync request a second time, after alpenhost2 has finished all the transfers, alpenhorn would decide that nothing needs transferring and respond with “No files to sync”.

One last note on the node sync command: if you prefer thinking about the destination side of transfers, you can use group sync to perform the same task.

The command

alpenhorn node sync demo_storage1 demo_storage2 --show-files

is equivalent to

alpenhorn group sync demo_storage2 demo_storage1 --show-files

though note that with node sync the arguments are source node and then destination group but with group sync these are reversed: the first argument is the destination group and the second argument the source node.

Dealing with corruption

More than just helping you copy files around, alpenhorn can monitor your files for corruption.

MD5 Digest Hashes

Although, as mentioned earlier, alpenhorn doesn’t really know what’s in the files it’s managing, whenever it registers a new file, it computes the MD5 digest hash for the file. This means that, if a file is changed after registration, alpenhorn can detect this change by re-computing the MD5 hash and comparing it to the hash value it recorded when first registering the file.

You can see the stored hash value for a file using the alpenhorn CLI:

root@alpenshell:/# alpenhorn file show 2025/02/21/23/1324.dat
       Name: 23/1324.dat
Acquisition: 2025/02/21
       Path: 2025/02/21/23/1324.dat

       Size: 12 B
   MD5 Hash: 4c79018e00ddef11af0b9cfc14dd3261
 Registered: Thu Mar  6 22:54:37 2025 UTC

If we were to manually compute the MD5 digest for this file (in, say, the alpenhost1 container) we would get the same result:

root@alpenhost1:/data# md5sum 2025/02/21/23/1324.dat
4c79018e00ddef11af0b9cfc14dd3261  2025/02/21/23/1324.dat
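The same comparison can be scripted. The following is a minimal sketch assuming md5sum is available, as it is in the alpenhost1 container; md5_matches is a hypothetical helper name and is not part of alpenhorn.

```shell
# Hypothetical helper: succeed if the MD5 digest of file "$1" equals "$2".
md5_matches() {
    # Reading from stdin makes md5sum print "<digest>  -"; keep the digest only.
    actual=$(md5sum < "$1" | cut -d ' ' -f 1) || return 1
    [ "$actual" = "$2" ]
}
```

In /data on alpenhost1, running md5_matches 2025/02/21/23/1324.dat 4c79018e00ddef11af0b9cfc14dd3261 should succeed for as long as the file is unchanged.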

Let’s corrupt a file by changing its contents on alpenhost1:

cd /data
echo "bad data" > 2025/02/21/23/1324.dat

Now if we manually compute the MD5 hash, we can see that it’s different from what alpenhorn has recorded:

root@alpenhost1:/data# md5sum 2025/02/21/23/1324.dat
3412f7b66a30b90ae3d3085c96615f00  2025/02/21/23/1324.dat

However, alpenhorn hasn’t noticed this:

root@alpenshell:/# alpenhorn node stats --extra-stats
Name             File Count    Total Size    % Full    Corrupt Files    Suspect Files    Missing Files
-------------  ------------  ------------  --------  ---------------  ---------------  ---------------
demo_storage1             5         102 B         -                -                -                -
demo_storage2             5         102 B         -                -                -                -

It still lists no corrupt files on demo_storage1. This is because alpenhorn doesn’t normally automatically detect corruption to files it is managing. You can turn on “auto-verify” on a node, but that won’t result in instantaneous detection of corruption either, and it can be I/O expensive, so it should be used with caution.

In some cases, file corruption will be detected by alpenhorn when copying an unexpectedly corrupt file from one node to another. For now, we can manually request a verification of the file. We’ll do this by requesting verification for the entire acquisition, even though we’ve only corrupted one of the files.

To request verification of all files in the acquisition on the node demo_storage1, run:

alpenhorn node verify --all --acq=2025/02/21 demo_storage1

You will have to confirm this request:

root@alpenshell:/# alpenhorn node verify --all --acq=2025/02/21 demo_storage1
Would request verification of 5 files (102 B).

Continue? [y/N]: y

Requesting verification of 5 files (102 B).
Updated 5 files.

The daemon on alpenhost1 will respond to this command by re-verifying all files in that acquisition:

alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Checking copy "2025/02/21/meta.txt" on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Checking copy "2025/02/21/23/1324.dat" on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Checking copy "2025/02/21/23/1330.dat" on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Checking copy "2025/02/21/23/1349.dat" on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Checking copy "2025/02/21/23/1342.dat" on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 ERROR >> [Worker#2] File 2025/02/21/23/1324.dat on node demo_storage1 is corrupt! Size: 9; expected: 12
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Updating file copy #2 for file 2025/02/21/23/1324.dat on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Updating group "demo_storage1".
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Finished task: Check file 2025/02/21/23/1324.dat on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Beginning task Check file 2025/02/21/23/1330.dat on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] File 2025/02/21/meta.txt on node demo_storage1 is A-OK!
alpenhost1-1  | Mar 07 01:48:25 INFO >> [MainThread] Tasks: 2 queued, 0 deferred, 2 in-progress on 2 workers
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] Updating file copy #1 for file 2025/02/21/meta.txt on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] Finished task: Check file 2025/02/21/meta.txt on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] Beginning task Check file 2025/02/21/23/1349.dat on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] File 2025/02/21/23/1330.dat on node demo_storage1 is A-OK!
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Updating file copy #3 for file 2025/02/21/23/1330.dat on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Finished task: Check file 2025/02/21/23/1330.dat on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Beginning task Check file 2025/02/21/23/1342.dat on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] File 2025/02/21/23/1349.dat on node demo_storage1 is A-OK!
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] Updating file copy #4 for file 2025/02/21/23/1349.dat on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] File 2025/02/21/23/1342.dat on node demo_storage1 is A-OK!
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Updating file copy #5 for file 2025/02/21/23/1342.dat on node demo_storage1.
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#1] Finished task: Check file 2025/02/21/23/1349.dat on demo_storage1
alpenhost1-1  | Mar 07 01:48:25 INFO >> [Worker#2] Finished task: Check file 2025/02/21/23/1342.dat on demo_storage1

As you can see, it has discovered our corruption of 2025/02/21/23/1324.dat, and also verified that the other files are not corrupt.
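The ERROR line above reports a size mismatch (9 bytes found, 12 expected). A check of this general kind can be sketched in shell as follows. This is not alpenhorn’s implementation: check_copy is a hypothetical name, and comparing sizes before hashing is an assumption made here because it avoids reading the whole file when the size already differs.

```shell
# Illustrative check (not alpenhorn's implementation): compare a file against
# a recorded size and MD5 digest, printing "ok" or the reason for failure.
check_copy() {
    file_path=$1 expected_size=$2 expected_md5=$3
    size=$(wc -c < "$file_path") || return 2
    size=$((size))   # normalise any padding in the wc output
    if [ "$size" -ne "$expected_size" ]; then
        echo "corrupt: size $size; expected $expected_size"
        return 1
    fi
    digest=$(md5sum < "$file_path" | cut -d ' ' -f 1)
    if [ "$digest" != "$expected_md5" ]; then
        echo "corrupt: MD5 mismatch"
        return 1
    fi
    echo "ok"
}
```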

Now if we check the node stats, we can see one corrupt file on this node.

root@alpenshell:/# alpenhorn node stats --extra-stats
Name             File Count    Total Size    % Full    Corrupt Files    Suspect Files    Missing Files
-------------  ------------  ------------  --------  ---------------  ---------------  ---------------
demo_storage1             4          90 B         -                1                -                -
demo_storage2             5         102 B         -                -                -                -
root@alpenshell:/# alpenhorn file state 2025/02/21/23/1324.dat demo_storage1
Corrupt Ready

Also note that the file count for demo_storage1 is down to four: a known corrupt file is not considered “present” on a node, since it doesn’t provide the expected data.

Recovering corrupt files

The standard way to recover a corrupt file copy is to re-transfer a known-good copy of the file over top of the corrupt version. We can do this by syncing the file back from alpenhost2:

alpenhorn node sync demo_storage2 demo_storage1

It will tell you there is only one file to transfer (the corrupt file) and ask for confirmation:

root@alpenshell:/# alpenhorn node sync demo_storage2 demo_storage1
Would sync 1 file (12 B) from Node "demo_storage2" to Group "demo_storage1".

Continue? [y/N]: y

Syncing 1 file (12 B) from Node "demo_storage2" to Group "demo_storage1".

Added 1 new copy request.

Wait for the daemon on alpenhost1 to pull the file from alpenhost2:

alpenhost1-1  | Mar 07 01:52:15 INFO >> [Worker#1] Beginning task AFCR#6: demo_storage2 -> demo_storage1
alpenhost1-1  | Mar 07 01:52:15 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | Mar 07 01:52:15 INFO >> [Worker#1] Pulling remote file 2025/02/21/23/1324.dat using rsync
alpenhost1-1  | Mar 07 01:52:15 INFO >> [Worker#1] Pull of 2025/02/21/23/1324.dat complete. Transferred 12 B in 0.4s [32 B/s]
alpenhost1-1  | Mar 07 01:52:15 INFO >> [Worker#1] Finished task: AFCR#6: demo_storage2 -> demo_storage1

After transferring the file back, alpenhorn now considers the file healthy again:

root@alpenshell:/# alpenhorn node stats --extra-stats
Name             File Count    Total Size    % Full    Corrupt Files    Suspect Files    Missing Files
-------------  ------------  ------------  --------  ---------------  ---------------  ---------------
demo_storage1             5         102 B         -                -                -                -
demo_storage2             5         102 B         -                -                -                -

Deleting files

Typically you’ll want to delete files off your acquisition nodes once they’ve been transferred off-site. File deletion can be accomplished with the clean command.

Since we’ve copied some files from alpenhost1 to alpenhost2, let’s try deleting one of the files from alpenhost1:

alpenhorn file clean --now --node=demo_storage1 2025/02/21/meta.txt

The CLI should release the file immediately:

root@alpenshell:/# alpenhorn file clean --now --node=demo_storage1 2025/02/21/meta.txt
Released "2025/02/21/meta.txt" for immediate removal on Node "demo_storage1".

Hint

The --now flag tells alpenhorn to delete the file as soon as possible. Without that flag, the file is not released for removal but is instead marked for “discretionary cleaning”, which lets alpenhorn delete the file if it needs to clear space on the node. In this demo alpenhorn would never decide to do that, so we’ll opt for immediate removal.

Despite our request, if you look at the daemon log on alpenhost1, you’ll see that it’s refused to delete the file:

alpenhost1-1  | Mar 07 02:21:25 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | Mar 07 02:21:25 WARNING >> [Worker#1] Too few archive copies (0) to delete 2025/02/21/meta.txt on demo_storage1.
alpenhost1-1  | Mar 07 02:21:25 INFO >> [Worker#1] Finished task: Delete copies [1] from demo_storage1

To prevent data loss, alpenhorn will only delete file copies from a node if at least two other copies of the file exist on other archive nodes. Currently we have no archive nodes, so we can’t delete files.

Let’s fix that. While we do, the alpenhost1 daemon will keep checking whether it can delete that file.
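
The deletion rule described above can be sketched in Python as follows. This is a simplified model, not alpenhorn’s implementation: the holdings dictionary and the function names are hypothetical, and only the threshold of two archive copies comes from this demo.

```python
REQUIRED_ARCHIVE_COPIES = 2  # threshold stated in this demo


def archive_copies_elsewhere(path, node, holdings, archive_nodes):
    """Count archive nodes other than `node` that hold `path`.

    holdings: dict mapping node name -> set of file paths (hypothetical model)
    archive_nodes: set of node names with the "archive" storage type
    """
    return sum(
        1
        for other in archive_nodes
        if other != node and path in holdings.get(other, set())
    )


def may_delete(path, node, holdings, archive_nodes):
    """True only when enough archive copies exist on other nodes."""
    count = archive_copies_elsewhere(path, node, holdings, archive_nodes)
    return count >= REQUIRED_ARCHIVE_COPIES


holdings = {
    "demo_storage1": {"2025/02/21/meta.txt"},
    "demo_storage2": {"2025/02/21/meta.txt"},
}
# No archive nodes yet: zero archive copies, so the delete is refused.
print(may_delete("2025/02/21/meta.txt", "demo_storage1", holdings, set()))
# False
```

In this model, marking demo_storage2 as an archive node raises the count to one, which is still below the threshold. The delete is only allowed once a second archive node also holds the file.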

Archive nodes

An archive node is any storage node with the “archive” storage type. Let’s change demo_storage2 into an archive node. We do that by modifying its metadata:

alpenhorn node modify --archive demo_storage2

After running this command, you can look at the node metadata to see that it now has the “archive” storage type:

root@alpenshell:/# alpenhorn node modify --archive demo_storage2
Node updated.
root@alpenshell:/# alpenhorn node show demo_storage2
   Storage Node: demo_storage2
  Storage Group: demo_storage2
         Active: Yes
           Type: Archive
          Notes:
      I/O Class: Default

    Daemon Host: alpenhost2
 Log-in Address: alpenhost2
Log-in Username: root

    Auto-Import: Off
    Auto-Verify: Off
      Max Total: -
      Available: 45.38 GiB
  Min Available: -
   Last Checked: Fri Mar  7 02:27:47 2025 UTC

I/O Config:

  none

Now if we look at the alpenhost1 daemon log, the file it’s trying to delete is now found on one archive node (out of the two needed):

alpenhost1-1  | Mar 07 02:28:55 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | Mar 07 02:28:55 WARNING >> [Worker#1] Too few archive copies (1) to delete 2025/02/21/meta.txt on demo_storage1.
alpenhost1-1  | Mar 07 02:28:55 INFO >> [Worker#1] Finished task: Delete copies [1] from demo_storage1

We’ll need another archive node with this file on it if we want the deletion to happen. So, let’s set up the final storage host, alpenhost3.

First let’s create the storage node in the database. We’ll make this one an archive node when we create it:

alpenhorn node create demo_storage3 --create-group --archive --root=/data --host=alpenhost3 \
                                    --username root --address alpenhost3 --init --activate

Tip

The --init and --activate flags save us from having to run those commands on the new node later.

Now let’s start the third docker container and take a look at its logs:

docker compose up --detach alpenhost3
docker compose logs --follow alpenhost3

Sync everything on demo_storage2 to demo_storage3:

alpenhorn node sync --force demo_storage2 demo_storage3

Caution

Using --force here skips the confirmation step. You can use --force with any alpenhorn command that would ask for confirmation, but you should be careful when using it.

As soon as the file is transferred to demo_storage3, the daemon on alpenhost1 will finally delete the file:

alpenhost1-1  | Mar 07 02:38:45 INFO >> [Worker#1] Beginning task Delete copies [1] from demo_storage1
alpenhost1-1  | Mar 07 02:38:45 INFO >> [Worker#1] Removed file copy 2025/02/21/meta.txt on demo_storage1
alpenhost1-1  | Mar 07 02:38:45 INFO >> [Worker#1] Finished task: Delete copies [1] from demo_storage1
alpenhost1-1  | Mar 07 02:38:45 INFO >> [MainThread] Main loop execution was 0.1s.
alpenhost1-1  | Mar 07 02:38:45 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers

Now there are only four files on demo_storage1:

root@alpenshell:/# alpenhorn node stats --extra-stats
Name             File Count    Total Size    % Full    Corrupt Files    Suspect Files    Missing Files
-------------  ------------  ------------  --------  ---------------  ---------------  ---------------
demo_storage1             4          50 B         -                -                -                -
demo_storage2             5         102 B         -                -                -                -
demo_storage3             5         102 B         -                -                -                -

As with sync requests, rather than cleaning individual files, we can do bulk operations. To tell alpenhorn to delete everything from demo_storage1 that already exists on demo_storage3:

alpenhorn node clean demo_storage1 --now --target demo_storage3

It will find four files to clean, which you’ll have to confirm:

root@alpenshell:/# alpenhorn node clean demo_storage1 --now --target demo_storage3
Would release 4 files (50 B).

Continue? [y/N]: y

Releasing 4 files (50 B).
Updated 4 files.

The files will be removed from demo_storage1 by the daemon:

alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Beginning task Delete copies [2, 3, 4, 5] from demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed file copy 2025/02/21/23/1324.dat on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed file copy 2025/02/21/23/1330.dat on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed file copy 2025/02/21/23/1349.dat on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed file copy 2025/02/21/23/1342.dat on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed directory /data/2025/02/21/23 on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed directory /data/2025/02/21 on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed directory /data/2025/02 on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Removed directory /data/2025 on demo_storage1
alpenhost1-1  | Mar 07 02:43:05 INFO >> [Worker#1] Finished task: Delete copies [2, 3, 4, 5] from demo_storage1

Note that the daemon will also delete directories on the node which end up empty after file deletion to keep the storage node directory tree tidy.

Now demo_storage1 is empty:

root@alpenshell:/# alpenhorn node stats --extra-stats
Name             File Count    Total Size    % Full    Corrupt Files    Suspect Files    Missing Files
-------------  ------------  ------------  --------  ---------------  ---------------  ---------------
demo_storage1             0             -         -                -                -                -
demo_storage2             5         102 B         -                -                -                -
demo_storage3             5         102 B         -                -                -                -

You can also inspect the filesystem on alpenhost1 to see that it is now empty:

root@alpenhost1:/# find /data
/data
/data/ALPENHORN_NODE
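
The directory tidying seen in the daemon log can be approximated with a few lines of Python. This is only a sketch of the general technique (remove the file, then climb towards the node root removing directories that have become empty). The function name is hypothetical and alpenhorn’s own implementation may differ.

```python
from pathlib import Path


def remove_file_and_prune(root, relpath):
    """Delete root/relpath, then remove now-empty parent directories.

    Stops at `root`, which is never removed. Returns the directories
    that were removed, deepest first.
    """
    root = Path(root)
    target = root / relpath
    target.unlink()

    removed = []
    parent = target.parent
    while parent != root:
        try:
            parent.rmdir()  # only succeeds if the directory is empty
        except OSError:
            break           # not empty (or not removable): stop climbing
        removed.append(parent)
        parent = parent.parent
    return removed
```

Because the climb stops at the root, a file stored directly in the root directory (such as ALPENHORN_NODE above) is left in place.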

Auto-actions

Up till now, we’ve been moving files around manually. However, you can configure alpenhorn to automate the movement of your files through the storage graph using auto-actions.

There are two auto-actions. Each one always connects a StorageGroup with a StorageNode not in that group:

  • Auto-sync: triggers when a file is added to a StorageNode (via either import or sync) and tells alpenhorn to create a new copy request to a downstream StorageGroup from this node, if it is not already in that group.

  • Auto-clean: triggers when a file is added to a StorageGroup (via either import or sync) and tells alpenhorn to delete the file from an upstream StorageNode, if it exists on that node.
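
As a rough mental model, the two triggers can be written as event handlers in Python. Everything here (the AutoActions class, the handler names and the tuple formats) is a hypothetical illustration rather than alpenhorn’s internal API:

```python
from dataclasses import dataclass, field


@dataclass
class AutoActions:
    """Hypothetical in-memory model of auto-action edges (not alpenhorn's schema)."""
    # auto-sync edges: source node -> list of downstream groups
    autosync: dict = field(default_factory=dict)
    # auto-clean edges: downstream group -> list of upstream nodes to clean
    autoclean: dict = field(default_factory=dict)


def on_file_added_to_node(actions, node, path, group_contents):
    """Auto-sync: copy requests (path, node, group) for groups lacking the file."""
    return [
        (path, node, group)
        for group in actions.autosync.get(node, [])
        if path not in group_contents.get(group, set())
    ]


def on_file_added_to_group(actions, group, path, node_contents):
    """Auto-clean: (path, node) releases for upstream nodes holding the file."""
    return [
        (path, node)
        for node in actions.autoclean.get(group, [])
        if path in node_contents.get(node, set())
    ]
```

Note that both handlers only run when a file is added. Nothing in this model revisits files that were already present when an edge was created.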

The first auto-sync

Let’s set up auto-actions to automatically transfer data from demo_storage1 to the archives in demo_storage2 and demo_storage3. We’ll start with an auto-sync action which tells alpenhorn to create transfer requests for new files which appear on demo_storage1 to have them transferred to the demo_storage2 group. Note: auto-sync actions are managed using the downstream target group:

alpenhorn group autosync demo_storage2 demo_storage1

After adding the action, you can see it in the group details:

root@alpenshell:/# alpenhorn group autosync demo_storage2 demo_storage1
Auto-sync from "demo_storage1" started
root@alpenshell:/# alpenhorn group show demo_storage2 --actions
Storage Group: demo_storage2
Notes:
I/O Class: Default

I/O Config:

  none

Nodes:

  demo_storage2

Auto-actions:

Node           Action     Trigger
-------------  ---------  -----------------------
demo_storage1  Auto-sync  File added to that node

Let’s create a second acquisition now on alpenhost1 and see if it will get automatically transferred:

mkdir -p /data/2025/02/26
echo "This is the second acquisition in the alpenhorn demo" > /data/2025/02/26/meta.txt

After creating this new file, request a scan of demo_storage1:

alpenhorn node scan demo_storage1 --register-new

After the scan completes on demo_storage1 you should almost immediately see the transfer happen on demo_storage2:

alpenhost1-1  | May 07 20:37:45 INFO >> [MainThread] Node demo_storage1: 43.09 GiB available.
alpenhost1-1  | May 07 20:37:45 INFO >> [MainThread] Updating node "demo_storage1".
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#2] Beginning task Scan "." on demo_storage1
alpenhost1-1  | May 07 20:37:45 INFO >> [MainThread] Updating group "demo_storage1".
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#2] Scanning "." on "demo_storage1" for new files.
alpenhost1-1  | May 07 20:37:45 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | May 07 20:37:45 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#2] Scanning ".".
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#2] Scanning "2025/02/26".
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#1] Beginning task Import 2025/02/26/meta.txt on demo_storage1
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#2] Completed import request #5.
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#2] Finished task: Scan "." on demo_storage1
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#1] Acquisition "2025/02/26" added to DB.
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#1] File "2025/02/26/meta.txt" added to DB.
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#1] Imported file copy "2025/02/26/meta.txt" on node "demo_storage1".
alpenhost1-1  | May 07 20:37:45 INFO >> [Worker#1] Finished task: Import 2025/02/26/meta.txt on demo_storage1

alpenhost2-1  | May 07 20:37:55 INFO >> [MainThread] Node demo_storage2: 43.09 GiB available.
alpenhost2-1  | May 07 20:37:55 INFO >> [MainThread] Updating node "demo_storage2".
alpenhost2-1  | May 07 20:37:55 INFO >> [MainThread] Updating group "demo_storage2".
alpenhost2-1  | May 07 20:37:55 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#2] Beginning task Pre-pull search for 2025/02/26/meta.txt in demo_storage2
alpenhost2-1  | May 07 20:37:55 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#2] Finished task: Pre-pull search for 2025/02/26/meta.txt in demo_storage2
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#1] Beginning task AFCR#11: demo_storage1 -> demo_storage2
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#1] Creating directory "/data/2025/02/26".
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#1] Pulling remote file 2025/02/26/meta.txt using rsync
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#1] Pull of 2025/02/26/meta.txt complete. Transferred 52 B in 0.4s [142 B/s]
alpenhost2-1  | May 07 20:37:55 INFO >> [Worker#1] Finished task: AFCR#11: demo_storage1 -> demo_storage2

The second auto-sync

Let’s set up the second part of our transfer by requesting files be automatically moved to demo_storage3 from demo_storage2:

alpenhorn group autosync demo_storage3 demo_storage2

Now let’s create another file to check that it will travel all the way to demo_storage3:

mkdir /data/2025/02/26/01
echo "14 15 16 17 18" > /data/2025/02/26/01/0529.dat

And scan the node again to import the file:

alpenhorn node scan demo_storage1 --register-new

Hint

If you don’t want to run all the node scan commands in this section, you can always turn auto-import back on for demo_storage1. Use the node modify command to do that. (We’ll keep doing it manually here, though, since that allows you to control when the auto-actions fire.)

After a few update loops, the new file should be successfully synced all the way to demo_storage3, but there’s a subtlety we shouldn’t forget about. If you compare demo_storage2 and demo_storage3 you’ll notice that one file is missing from the latter node:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             2          67 B         -
demo_storage2             7         169 B         -
demo_storage3             6         117 B         -

The missing file is 2025/02/26/meta.txt, which was copied onto demo_storage2 before we created the second auto-sync action. Auto-actions only trigger when new files are added to a node or group, so they never apply retroactively. To fix this, we’ll need to perform a manual sync:

alpenhorn node sync demo_storage2 demo_storage3

Once that completes, the file counts should be consistent:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             2          67 B         -
demo_storage2             7         169 B         -
demo_storage3             7         169 B         -
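
Conceptually, the manual sync closes the gap by copying whatever the source node has that the destination group lacks. A minimal Python sketch of that set difference, using the file paths from this demo (the function name is hypothetical):

```python
def files_to_sync(source_files, dest_group_files):
    """Files present on the source node but absent from the destination group."""
    return sorted(set(source_files) - set(dest_group_files))


demo_storage2 = {
    "2025/02/21/meta.txt",
    "2025/02/21/23/1324.dat",
    "2025/02/21/23/1330.dat",
    "2025/02/21/23/1342.dat",
    "2025/02/21/23/1349.dat",
    "2025/02/26/meta.txt",
    "2025/02/26/01/0529.dat",
}
# demo_storage3 before the manual sync: everything except the file that
# arrived on demo_storage2 before the second auto-sync existed.
demo_storage3 = demo_storage2 - {"2025/02/26/meta.txt"}

print(files_to_sync(demo_storage2, demo_storage3))
# ['2025/02/26/meta.txt']
```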

Tip

When automating movement of data through a storage graph, auto-actions are not a replacement for periodic (e.g. cron-based) invocation of sync and other alpenhorn commands. A robust transfer system will combine auto-actions with alpenhorn commands.

The primary benefit of auto-actions is lower latency of transfers over automated, scheduled sync commands.
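
One way to arrange such a periodic invocation is a small Python wrapper that a scheduler (for example cron) calls at whatever interval suits. This is a sketch under assumptions: the list of node pairs and the function names are hypothetical, and it relies only on the node sync command and the --force flag as used earlier in this demo.

```python
import subprocess

# Source node / destination group pairs to reconcile; names from this demo.
SYNC_PAIRS = [
    ("demo_storage1", "demo_storage2"),
    ("demo_storage2", "demo_storage3"),
]


def sync_command(source_node, dest_group):
    """Build the CLI call; --force skips the confirmation prompt."""
    return ["alpenhorn", "node", "sync", "--force", source_node, dest_group]


def run_once(pairs=SYNC_PAIRS, runner=subprocess.run):
    """Issue one sync per pair and return the commands that were run.

    `runner` is injectable so the function can be exercised without
    alpenhorn installed.
    """
    commands = [sync_command(src, dst) for src, dst in pairs]
    for cmd in commands:
        runner(cmd, check=False)
    return commands
```

Since --force removes the confirmation step, a wrapper like this should only list node pairs you are sure you want synced unattended.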

The auto-clean action

The final piece of our automated transfer mechanism is to delete files from demo_storage1 once they’re on demo_storage3. As with the second auto-sync we created, the auto-clean action won’t retroactively trigger on files which already exist, so when creating the action, we’ll also clean up demo_storage1 manually, for consistency:

alpenhorn node clean --now demo_storage1 --target=demo_storage3
alpenhorn node autoclean demo_storage1 demo_storage3

Auto-clean actions can be seen in the metadata for the node or group:

root@alpenshell:/# alpenhorn node clean --now demo_storage1 --target=demo_storage3
Would release 2 files (67 B).

Continue? [y/N]: y

Releasing 2 files (67 B).
Updated 2 files.
root@alpenshell:/# alpenhorn node autoclean demo_storage1 demo_storage3
Auto-clean trigger: Group "demo_storage3" added
root@alpenshell:/# alpenhorn node show demo_storage1 --actions
   Storage Node: demo_storage1
  Storage Group: demo_storage1
         Active: Yes
           Type: -
          Notes:
      I/O Class: Default

    Daemon Host: alpenhost1
 Log-in Address: alpenhost1
Log-in Username: root

    Auto-Import: Off
    Auto-Verify: Off
      Max Total: -
      Available: 43.09 GiB
  Min Available: -
   Last Checked: Wed May  7 21:00:55 2025 UTC

I/O Config:

  none

Auto-actions:

Group          Action      Trigger
-------------  ----------  ------------------------
demo_storage2  Auto-sync   File added to this node
demo_storage3  Auto-clean  File added to that group

Let’s create yet another file to test this:

mkdir -p /data/2025/02/26/02
echo "14 15 16 17 18" > /data/2025/02/26/02/0011.dat

And scan again:

alpenhorn node scan demo_storage1 --register-new

You should see the whole transfer. The scan:

alpenhost1-1  | May 07 21:13:15 INFO >> [MainThread] Node demo_storage1: 43.09 GiB available.
alpenhost1-1  | May 07 21:13:15 INFO >> [MainThread] Updating node "demo_storage1".
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#2] Beginning task Scan "." on demo_storage1
alpenhost1-1  | May 07 21:13:15 INFO >> [MainThread] Updating group "demo_storage1".
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#2] Scanning "." on "demo_storage1" for new files.
alpenhost1-1  | May 07 21:13:15 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#2] Scanning ".".
alpenhost1-1  | May 07 21:13:15 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#2] Scanning "2025/02/26/02".
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#1] Beginning task Import 2025/02/26/02/0011.dat on demo_storage1
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#2] Completed import request #7.
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#2] Finished task: Scan "." on demo_storage1
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#1] File "2025/02/26/02/0011.dat" added to DB.
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#1] Imported file copy "2025/02/26/02/0011.dat" on node "demo_storage1".
alpenhost1-1  | May 07 21:13:15 INFO >> [Worker#1] Finished task: Import 2025/02/26/02/0011.dat on demo_storage1

The first auto-sync:

alpenhost2-1  | May 07 21:13:25 INFO >> [MainThread] Node demo_storage2: 43.09 GiB available.
alpenhost2-1  | May 07 21:13:25 INFO >> [MainThread] Updating node "demo_storage2".
alpenhost2-1  | May 07 21:13:25 INFO >> [MainThread] Updating group "demo_storage2".
alpenhost2-1  | May 07 21:13:25 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#2] Beginning task Pre-pull search for 2025/02/26/02/0011.dat in demo_storage2
alpenhost2-1  | May 07 21:13:25 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#2] Finished task: Pre-pull search for 2025/02/26/02/0011.dat in demo_storage2
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#1] Beginning task AFCR#15: demo_storage1 -> demo_storage2
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#1] Creating directory "/data/2025/02/26/02".
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#1] Pulling remote file 2025/02/26/02/0011.dat using rsync
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#1] Pull of 2025/02/26/02/0011.dat complete. Transferred 15 B in 0.4s [40 B/s]
alpenhost2-1  | May 07 21:13:25 INFO >> [Worker#1] Finished task: AFCR#15: demo_storage1 -> demo_storage2
alpenhost2-1  | May 07 21:13:35 INFO >> [MainThread] Node demo_storage2: 43.09 GiB available.
alpenhost2-1  | May 07 21:13:35 INFO >> [MainThread] Updating node "demo_storage2".
alpenhost2-1  | May 07 21:13:35 INFO >> [MainThread] Updating group "demo_storage2".
alpenhost2-1  | May 07 21:13:35 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost2-1  | May 07 21:13:35 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 0 in-progress on 2 workers

The second auto-sync:

alpenhost3-1  | May 07 21:13:35 INFO >> [MainThread] Updating group "demo_storage3".
alpenhost3-1  | May 07 21:13:35 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost3-1  | May 07 21:13:35 INFO >> [MainThread] Tasks: 1 queued, 0 deferred, 0 in-progress on 2 workers
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#1] Beginning task Pre-pull search for 2025/02/26/02/0011.dat in demo_storage3
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#1] Finished task: Pre-pull search for 2025/02/26/02/0011.dat in demo_storage3
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#2] Beginning task AFCR#16: demo_storage2 -> demo_storage3
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#2] Creating directory "/data/2025/02/26/02".
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#2] Pulling remote file 2025/02/26/02/0011.dat using rsync
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#2] Pull of 2025/02/26/02/0011.dat complete. Transferred 15 B in 0.4s [41 B/s]
alpenhost3-1  | May 07 21:13:35 INFO >> [Worker#2] Finished task: AFCR#16: demo_storage2 -> demo_storage3

The auto-clean:

alpenhost1-1  | May 07 21:13:45 INFO >> [MainThread] Node demo_storage1: 43.09 GiB available.
alpenhost1-1  | May 07 21:13:45 INFO >> [MainThread] Updating node "demo_storage1".
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Beginning task Delete copies [22] from demo_storage1
alpenhost1-1  | May 07 21:13:45 INFO >> [MainThread] Updating group "demo_storage1".
alpenhost1-1  | May 07 21:13:45 INFO >> [MainThread] Main loop execution was 0.0s.
alpenhost1-1  | May 07 21:13:45 INFO >> [MainThread] Tasks: 0 queued, 0 deferred, 1 in-progress on 2 workers
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Removed file copy 2025/02/26/02/0011.dat on demo_storage1
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Removed directory /data/2025/02/26/02 on demo_storage1
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Removed directory /data/2025/02/26 on demo_storage1
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Removed directory /data/2025/02 on demo_storage1
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Removed directory /data/2025 on demo_storage1
alpenhost1-1  | May 07 21:13:45 INFO >> [Worker#2] Finished task: Delete copies [22] from demo_storage1

After that, everything should be on the two archive nodes, and cleaned off of demo_storage1:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             0             -         -
demo_storage2             8         184 B         -
demo_storage3             8         184 B         -

Transport disks and the Sneakernet

Alpenhorn has been designed to work with instruments in remote locations where network transport of data may be difficult or impossible to accomplish. To help with this situation, alpenhorn can be used to manage transfer of data via physically moving storage media from site to site. (This is known as the Sneakernet.)

Alpenhorn can be configured to copy data onto a set of physical media at one location where data are produced and then, later, copy data off those media once they have been transported to a data ingest site.

To demonstrate this, we’ll use a transport device to simulate transferring data back from demo_storage3 to demo_storage1, as if these two nodes were unable to communicate directly over the network.

The Transport Group and Transport Nodes

In alpenhorn, each individual physical device holding data to transfer is represented by its own StorageNode which has the “transport” storage type. All the transport nodes are collected into a StorageGroup which has the “Transport” I/O class.

Our first job, then, is to create a transport group:

alpenhorn group create --class=Transport transport_group

This has I/O class “Transport” (the capital “T” is important). Typically you only ever need one transport group, and you put all your transport nodes in the single group. Normal logistics of the Sneakernet mean that typically different member nodes of this group will be located at different sites and/or be in-transit at any given time, and the locations of the nodes will change over time. Alpenhorn never requires, nor expects, multiple nodes in the transport group to be accessible to a single daemon.

Now that we have the transport group, we can create storage nodes to put in it. As mentioned above, each node is a single physical device (disk, tape, etc.) which is transferred through the Sneakernet. Multiple nodes in the group can be available at a particular site, but we’ll just create a single node for the purpose of this demo.

When we create the new node, we’ll tell alpenhorn that it’s initially available on alpenhost3:

alpenhorn node create transport1 --transport --group transport_group --host=alpenhost3 \
                                 --root=/mnt/transport --init --activate

Note the use of the --transport flag to set the node’s storage type to “transport”. The “Transport” Group I/O class allows StorageNodes of any class to be added to the group, but requires all such nodes to have the “transport” storage type.

The filesystem has already been made available in the alpenhost3 container, so wait for the daemon on alpenhost3 to initialise the node:

alpenhost3-1  | Mar 07 09:21:03 INFO >> [MainThread] Node "transport1" now available.
alpenhost3-1  | Mar 07 09:21:03 WARNING >> [MainThread] Node file "/mnt/transport/ALPENHORN_NODE" could not be read.
alpenhost3-1  | Mar 07 09:21:03 INFO >> [MainThread] Requesting init of node "transport1".
alpenhost3-1  | Mar 07 09:21:03 INFO >> [Worker#1] Beginning task Init Node "transport1"
alpenhost3-1  | Mar 07 09:21:03 WARNING >> [Worker#1] Node file "/mnt/transport/ALPENHORN_NODE" could not be read.
alpenhost3-1  | Mar 07 09:21:03 WARNING >> [Worker#1] Node file "/mnt/transport/ALPENHORN_NODE" could not be read.
alpenhost3-1  | Mar 07 09:21:03 INFO >> [Worker#1] Node "transport1" initialised.

Copying Data to the Transport Group

Remember that, when copying files, data always flows from a node to a group. To get data onto the transport node, we need to transfer data into the transport group. Logic defined by the Transport I/O class then determines which of the available transport nodes the transferred files will be written to.

Briefly, the Transport logic works like this:

  • Only local transfers are allowed into the group (i.e. syncing into the transport group will only ever copy data onto transport nodes at the same location as the source node).

  • The transport group will try to fill up one transport node before putting data onto another.

  • All other things being equal, all transport nodes have the same priority for accepting data.

In our case we only have a single transport node, so it’s easy to figure out which node the data will end up on.
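
With several transport nodes, the three rules above could be read as the following selection procedure. This Python sketch is one plausible interpretation, not the actual Transport I/O class: the TransportNode fields, the “fullest node that still fits the file” heuristic and the name-based tie-break are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class TransportNode:
    """Hypothetical view of a transport node (fields are assumptions)."""
    name: str
    host: str
    free_bytes: int


def pick_transport_node(nodes, source_host, file_size):
    """Pick a destination node for one file, or None if none qualifies."""
    # Rule 1: only nodes local to the source are eligible.
    candidates = [
        n for n in nodes
        if n.host == source_host and n.free_bytes >= file_size
    ]
    if not candidates:
        return None
    # Rule 2: prefer the fullest node, so one device fills up first.
    # Rule 3: otherwise equal priority; break ties by name (assumption).
    return min(candidates, key=lambda n: (n.free_bytes, n.name))
```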

Let’s transfer data out of demo_storage3 into the transport group with the intent of transferring data, ultimately, to demo_storage1:

alpenhorn node sync demo_storage3 transport_group --target=demo_storage1

The --target option indicates to alpenhorn the ultimate destination for the data we’re syncing to transport. This prevents alpenhorn from trying to transfer data already present on demo_storage1 (though in our case, that’s nothing).

This should sync all eight files we have:

root@alpenshell:/# alpenhorn node sync demo_storage3 transport_group --target=demo_storage1
Would sync 8 files (184 B) from Node "demo_storage3" to Group "transport_group".

Continue? [y/N]: y

Syncing 8 files (184 B) from Node "demo_storage3" to Group "transport_group".

Added 8 new copy requests.

It may also be good to point out here that even though both demo_storage3 and the transport node are on alpenhost3, and all the resulting transfers are local, we still run this command in the alpenshell container. CLI commands never need to be run on the host where the referenced storage nodes are located. Any machine that can access the alpenhorn database can be used to run any alpenhorn command.

After waiting for the daemon to process these requests, a look at the transport group should show us that they’ve all ended up on the transport node:

root@alpenshell:/# alpenhorn group show --node-stats transport_group
Storage Group: transport_group
        Notes:
    I/O Class: Transport

I/O Config:

  none

Nodes:

Name          File Count  Total Size    % Full
----------  ------------  ------------  --------
transport1             8  184 B         -

Transporting the transport node

Now let’s simulate what would happen if we wanted to move this transport node from alpenhost3 to alpenhost1 over the Sneakernet. (Normally, to increase throughput of the Sneakernet, we would wait for the node to fill up, but we’re not going to wait for that in this demo.)

The first step is to deactivate the alpenhorn node to tell alpenhorn to stop managing it:

alpenhorn node deactivate transport1

The daemon on alpenhost3 will notice this, and stop updating the transport node:

alpenhost3-1  | Mar 09 04:10:07 INFO >> [MainThread] Node "transport1" no longer available.
alpenhost3-1  | Mar 09 04:10:07 INFO >> [MainThread] Group "transport_group" no longer available.

Hint

A StorageGroup is only available to a daemon if at least one of its nodes is available. When we deactivate the transport1 node, resulting in it no longer being available, the transport_group also becomes unavailable because it has no other active nodes on alpenhost3.
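
The availability rule in this hint can be modelled in a few lines of Python. The tuple layout and function name are hypothetical. The sketch only captures the rule that a group is available to a daemon when at least one of its nodes is active on that daemon’s host.

```python
def available_groups(nodes, daemon_host):
    """Return the names of groups with at least one active node on `daemon_host`.

    nodes: iterable of (node_name, group_name, host, active) tuples
           (a hypothetical model, not alpenhorn's schema)
    """
    return {
        group
        for _name, group, host, active in nodes
        if active and host == daemon_host
    }


before = [
    ("demo_storage3", "demo_storage3", "alpenhost3", True),
    ("transport1", "transport_group", "alpenhost3", True),
]
after = [
    ("demo_storage3", "demo_storage3", "alpenhost3", True),
    ("transport1", "transport_group", "alpenhost3", False),  # deactivated
]
print(sorted(available_groups(before, "alpenhost3")))
# ['demo_storage3', 'transport_group']
print(sorted(available_groups(after, "alpenhost3")))
# ['demo_storage3']
```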

If this were a real transport device, the next steps would be to:

  • unmount the filesystem

  • eject the media

  • remove the physical storage device from the machine

After this, the transport device would need to travel (via Sneakernet) from the site containing alpenhost3 to the site containing alpenhost1 where we would do the reverse procedure, installing the device in the alpenhost1 machine and mounting the device’s filesystem.

For the purposes of this demo, however, we don’t have to do any of that: the docker volume we’re using to simulate the transport device has already been made available in the alpenhost1 container, so let’s proceed with the last step of the transport process, which is to update the alpenhorn data index to record the movement of the transport device.

In addition to activating the node to tell alpenhorn to start managing it again, there are, generally, four fields we may need to update for a transport device after it’s been moved:

  • its host, to let alpenhorn know which daemon should now be able to access the disk

  • its username and address, to set the log-in details for remote access to the device. If remote access to the transport node isn’t needed, these may not need updating.

  • its root, to tell alpenhorn where we have mounted the transport device’s filesystem.

We can do all of this with the node activate command, which has been designed with this use case in mind:

alpenhorn node activate transport1 --host=alpenhost1 --username=root \
                        --address=alpenhost1 --root=/mnt/transport

The node (and also the group) will now appear to the daemon on alpenhost1:

alpenhost1-1  | Mar 09 04:21:44 INFO >> [MainThread] Node "transport1" now available.
alpenhost1-1  | Mar 09 04:21:44 INFO >> [MainThread] Group "transport_group" now available.
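To picture what the node activate command above changed in the data index, here is a toy model of the node record. It is a sketch under stated assumptions and not how alpenhorn stores or updates nodes: `NodeRecord` and `activate` are hypothetical names, and the pre-move values for root, username and address are placeholders, since this demo doesn't show them at this point.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class NodeRecord:
    """Hypothetical model of the fields discussed above (not alpenhorn's schema)."""

    name: str
    active: bool
    host: Optional[str] = None
    username: Optional[str] = None
    address: Optional[str] = None
    root: Optional[str] = None


def activate(node, host=None, username=None, address=None, root=None):
    """Return an activated copy of `node`, updating only the fields given."""
    supplied = {"host": host, "username": username, "address": address, "root": root}
    changes = {key: value for key, value in supplied.items() if value is not None}
    return replace(node, active=True, **changes)


# Placeholder state after deactivation on alpenhost3 (the old root is made up).
parked = NodeRecord("transport1", active=False, host="alpenhost3", root="/placeholder/old/root")

# Same values as the node activate command shown above.
moved = activate(
    parked,
    host="alpenhost1",
    username="root",
    address="alpenhost1",
    root="/mnt/transport",
)
```

Fields that aren't supplied keep their previous values, which matches the idea that the log-in details only need updating when remote access is wanted.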

Now let’s copy all the data off the transport media onto the demo_storage1 node to complete our long-distance transfer:

alpenhorn node sync transport1 demo_storage1

Once the transfers are complete, the data is back on demo_storage1, delivered via our transport media:

root@alpenshell:/# alpenhorn node stats
Name             File Count    Total Size    % Full
-------------  ------------  ------------  --------
demo_storage1             8         184 B         -
demo_storage2             8         184 B         -
demo_storage3             8         184 B         -
transport1                8         184 B         -
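Conceptually, the sync request above only needs to transfer files that the destination doesn't already have. The sketch below models that selection as a set difference. It assumes this is a fair description of the behaviour and is not alpenhorn's code: `plan_sync` and the file names are hypothetical.

```python
def plan_sync(source_files, target_files):
    """Return the files on the source that are missing from the target."""
    return sorted(set(source_files) - set(target_files))


# Hypothetical file names standing in for files on the transport node.
on_transport = ["2025/02/21/meta.txt", "2025/02/21/23/1324.dat", "2025/02/21/23/1330.dat"]

# An empty destination needs everything copied over.
first_pass = plan_sync(on_transport, [])

# Once everything has arrived, there is nothing left to copy.
second_pass = plan_sync(on_transport, on_transport)
```

Under this model, repeating the sync after the transfers finish asks for no new copies.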

Once we’re happy with the transfer off the transport device, we’ll want to clear it out so we can ship it back to alpenhost3 to be used to transfer more data in the future:

alpenhorn node clean --now --force transport1 --target=demo_storage1

Hint

The --target option ensures we only delete files from transport1 which are present on demo_storage1.
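The safety check described in this hint amounts to a set intersection: a file is only eligible for deletion from the transport node if a copy also exists on the target. Here is a minimal sketch of that idea. `plan_clean` and the file names are hypothetical, and this is a model of the rule rather than alpenhorn's implementation.

```python
def plan_clean(node_files, target_files):
    """Split the node's files into (safe to delete, must keep).

    A file is safe to delete only if it is also present on the target.
    """
    on_target = set(target_files)
    delete = sorted(f for f in set(node_files) if f in on_target)
    keep = sorted(f for f in set(node_files) if f not in on_target)
    return delete, keep


# Hypothetical contents: one file has not yet reached demo_storage1.
on_transport1 = ["a.dat", "b.dat", "c.dat"]
on_demo_storage1 = ["a.dat", "b.dat"]

delete, keep = plan_clean(on_transport1, on_demo_storage1)
```

In this example `c.dat` stays on the transport node, because removing it would destroy the only copy this model knows about.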

Next steps

This is the end of the curated part of the alpenhorn demo, but you can use this demo system to experiment with running alpenhorn. Remember: you can always reset this demo to its initial state.