Cluster

Usage

hyper-shell cluster [-h] [FILE | --restart | --forever] [-N NUM] [-t CMD] [-b SIZE] [-w SEC]
                    [-p PORT] [-r NUM [--eager]] [-f PATH] [--capture | [-o PATH] [-e PATH]]
                    [--no-db | --initdb] [--no-confirm] [--delay-start SEC] [-T SEC] [-W SEC]
                    [--ssh [HOST... | --ssh-group NAME] [--env] | --mpi | --launcher=ARGS...]
                    [--autoscaling [MODE] [-P SEC] [-F VALUE] [-I NUM] [-X NUM] [-Y NUM]]

Description

Start the cluster either locally or with remote clients over SSH or a custom launcher. This mode is the most common entry point for general usage; it encompasses all of the agents in the system in one concise workflow.

The input source for tasks is file-like: either a local path, or stdin if no FILE argument is given. Task command lines are read in and either published directly to the distributed queue (see --no-db) or first committed to a database and scheduled later.

For large, long-running workflows, it is a good idea to configure a database, run an initial submit job to populate it, and then run the cluster with --restart and no input FILE. If the cluster is interrupted for any reason, it can gracefully restart where it left off.

Use --autoscaling with either fixed or dynamic to run a persistent, elastically scalable cluster using an external --launcher to bring up clients as needed.
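
For example, a minimal sketch of a purely local cluster, assuming a hypothetical tasks.txt with one command line per task (and an equivalent form reading from stdin, where ./process.sh is also hypothetical):

    hyper-shell cluster tasks.txt -N4

    # equivalently, pipe task command lines over stdin
    seq 100 | awk '{print "./process.sh " $1}' | hyper-shell cluster -N4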


Arguments

FILE

Path to input task file (default: <stdin>).

Modes

--ssh HOST

Launch directly with SSH host(s). This can be a single host, a comma-separated list of hosts, or an expandable pattern, e.g., “cluster-a[00-04].xyz”.

See also --ssh-group and --ssh-args.
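
For example, a sketch launching two executors on each of five hosts matched by a pattern (host names and input file are hypothetical):

    hyper-shell cluster tasks.txt -N2 --ssh 'cluster-a[00-04].xyz'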

--mpi

Same as --launcher=mpirun.

--launcher ARGS

Use a specific launch interface. This can be any program that handles process management on a distributed system. For example, on a SLURM cluster one might want to use srun. In this case you would specify --launcher=srun; note, however, that ARGS is not merely the executable but the full command-line prefix, e.g., --launcher='srun --mpi=pmi2'.
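
For example, a sketch run from within an existing SLURM allocation (tasks.txt is hypothetical; srun arguments will vary by site):

    hyper-shell cluster tasks.txt --launcher='srun --mpi=pmi2'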

Options

-N, --num-tasks NUM

Number of task executors per client (default: 1).

For example, -N4 creates four task executors, but -N4 --ssh 'cluster-a[00-01].xyz' creates two clients and a total of eight executors.

-t, --template CMD

Command-line template pattern (default: “{}”).

This is expanded by the client just before execution. With the default “{}” the input command-line is run verbatim. Specifying a template pattern allows simple input arguments (e.g., file paths) to be transformed into some common form, such as -t './some_command.py {} >outputs/{/-}.out'.

See section on templates.
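
To illustrate, assuming {/-} expands to the basename of the input path without its file extension (as the example above suggests), a hypothetical input line would expand as follows:

    # input line:  /data/sample1.fits
    # expands to:
    ./some_command.py /data/sample1.fits >outputs/sample1.out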

-p, --port NUM

Port number (default: 50001).

This is an arbitrary choice and simply must be an available port. The default is typically open on most platforms and is not known to be claimed by any major software.

-b, --bundlesize SIZE

Size of task bundle (default: 1).

The default value allows for greater concurrency and responsiveness at small scales. This value is used by the submit thread to accumulate bundles for database commits and/or publishing to the queue. If a database is in use, the scheduler thread selects tasks from the database in batches of this size.

Using larger bundles is a good idea for large distributed workflows; specifically, it is best to coordinate bundle size with the number of executors in use by each client.

See also --num-tasks and --bundlewait.
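
For example, a sketch coordinating bundle size with the executor count on each client (host names and input file are hypothetical):

    hyper-shell cluster tasks.txt -N16 -b16 --ssh 'cluster-a[00-07].xyz'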

-w, --bundlewait SEC

Seconds to wait before flushing tasks (default: 5).

This is used by the submit thread and is also forwarded to each client. The client collector thread, which accumulates finished task bundles to return to the server, will push out a bundle after this period of time regardless of whether it has reached the preferred bundle size.

See also --bundlesize.

-r, --max-retries NUM

Auto-retry failed tasks (default: 0).

If a database is in use, there is an opportunity to automatically retry failed tasks. A task is considered to have failed if it has a non-zero exit status. Setting this value greater than zero allows that many retry attempts per task. The original task is not overwritten; a new task is submitted and scheduled later.

See also --eager.

--eager

Schedule failed tasks before new tasks. If --max-retries is greater than zero, this option defines the appetite for re-submitting failed tasks. By default, failed tasks are only scheduled once no novel tasks remain.
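
For example, a sketch allowing up to two retries and preferring failed tasks over novel ones (tasks.txt is hypothetical):

    hyper-shell cluster tasks.txt -r2 --eager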

--no-db

Disable database (submit directly to clients).

By default, a scheduler thread selects tasks from a database that were previously submitted. With --no-db enabled, there is no scheduler and instead the submit thread publishes bundles directly to the queue.

--initdb

Auto-initialize database.

If a database is configured for use with the workflow (e.g., PostgreSQL), auto-initialize tables if they don’t already exist. This is a shorthand for pre-creating tables with the hyper-shell initdb command. This happens by default with SQLite databases.

Mutually exclusive with --no-db. See the hyper-shell initdb command.
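
For example, a sketch of the two roughly equivalent approaches (tasks.txt is hypothetical):

    # pre-create tables explicitly
    hyper-shell initdb
    hyper-shell cluster tasks.txt

    # or auto-initialize as part of the cluster run
    hyper-shell cluster tasks.txt --initdb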

--no-confirm

Disable client confirmation of task bundle received.

To achieve even higher throughput at large scales, optionally disable confirmation payloads from clients. Consider using this option when also using --no-db.

--forever

Schedule forever.

Typically, the cluster will process some finite set of submitted tasks. When there are no more tasks left to schedule, the cluster will begin its shutdown procedure. With --forever enabled, the scheduler will continue to wait for new tasks indefinitely.

Conflicts with --no-db and mutually exclusive with --restart.

--restart

Start scheduling from last completed task.

Instead of pulling a new list of tasks from some input FILE, with --restart enabled the cluster will restart scheduling tasks where it left off. Any task in the database that was previously scheduled but not completed will be reverted.

For very large workflows, an effective strategy is to first use the submit workflow to populate the database, and then use --restart so that if the cluster is interrupted, it can easily continue where it left off, halting when there is nothing left to do.

Conflicts with --no-db and mutually exclusive with --forever.
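
For example, a sketch of the populate-then-restart workflow described above, assuming the separate hyper-shell submit command accepts the same FILE argument (tasks.txt is hypothetical):

    hyper-shell submit tasks.txt
    hyper-shell cluster --restart -N4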

--ssh-args ARGS

Command-line arguments for SSH. For example, --ssh-args '-i ~/.ssh/my_key'.

--ssh-group NAME

SSH nodelist group in config.

Your configuration may define one or more named lists under [ssh.nodelist]. Each list contains the host names to associate with that group name.

See configuration section.

-E, --env

Send environment variables. Only valid in --ssh mode; all HYPERSHELL_-prefixed environment variables are exported to the remote clients.

-d, --delay-start SEC

Delay time in seconds for launching clients (default: 0).

At larger scales it can be advantageous to uniformly delay the client launch sequence. Hundreds or thousands of clients connecting to the server all at once is a challenge; even if the server could handle the load, task throughput would be unbalanced, coming in waves.

Use --delay-start with a negative number to impose a uniform random delay up to the magnitude specified (e.g., --delay-start=-600 would delay the client up to ten minutes). This also has the effect of staggering the workload. If your tasks take on the order of 30 minutes and you have 1000 nodes, choose --delay-start=-1800.

-c, --capture

Capture individual task <stdout> and <stderr>.

By default, the stdout and stderr streams of all tasks are fused with those of the client, and in turn the cluster. If tasks produce output that needs to be isolated, the tasks can manage their own output, you can specify a redirect as part of a --template, or you can use --capture to record these streams as individual .out and .err files.

These files are stored locally on the client. Task outputs can be automatically retrieved via SFTP; see task usage.

-o, --output PATH

File path for task outputs (default: <stdout>).

If running locally (not --ssh, --mpi, or --launcher), the client redirects the combined stdout of all tasks to the file PATH.

-e, --errors PATH

File path for task errors (default: <stderr>).

If running locally (not --ssh, --mpi, or --launcher), the client redirects the combined stderr of all tasks to the file PATH.

-f, --failures PATH

File path to write failed task args (default: <none>).

The server acts like a sieve: it reads task args from stdin and redirects the original args to stdout for any task with a non-zero exit status. The cluster runs the server for you; if --failures is enabled, these task args are written to the local file PATH.
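
For example, a sketch collecting failed task args for a later re-run (file names are hypothetical):

    hyper-shell cluster tasks.txt -f failed.txt
    # later, re-run only the failures
    hyper-shell cluster failed.txt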

-T, --timeout SEC

Timeout in seconds for clients. Automatically shut down when no tasks are received (default: never).

This option is only valid for an --autoscaling cluster. It allows the cluster to scale down gracefully when task throughput subsides.

-W, --task-timeout SEC

Task-level walltime limit (default: none).

Tasks that exceed this limit receive a progression of SIGINT, SIGTERM, and SIGKILL from their executor. If the process still persists, the executor itself will shut down.

-A, --autoscaling [MODE]

Enable autoscaling (default: disabled). Used with --launcher.

Specifying this option on its own triggers the use of the autoscaler, with the default policy or the configured policy. The policy can be specified directly here as either fixed or dynamic (e.g., --autoscaling=dynamic). The default is fixed.

The specified --launcher is used to bring up each individual instance of the client as a discrete scaling unit. This is different from using --launcher on its own, where a single invocation launches all clients (e.g., mpirun). Without this option, clients are simply run locally.

A fixed policy will seek to maintain a definite size and allows for recovery in the event that clients halt for some reason (e.g., due to expected faults or timeouts).

A dynamic policy maintains a --min-size (default: 0) and grows up to some --max-size depending on the observed task pressure given the specified scaling --factor.

See also --factor, --period, --init-size, --min-size, and --max-size.
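
For example, a sketch of a dynamic autoscaling cluster where each scaling unit is a single-node SLURM step (launcher arguments are illustrative and will vary by site; tasks.txt is hypothetical):

    hyper-shell cluster tasks.txt --launcher='srun -N1 -n1' \
        --autoscaling=dynamic -F2 -P120 -X0 -Y20 --timeout=600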

-F, --factor VALUE

Scaling factor (default: 1).

A configurable, dimensionless quantity used by the --autoscaling=dynamic policy. This value expresses some multiple of the average task duration in seconds.

The autoscaler periodically checks the ratio toc / (factor × avg_duration), where toc is the estimated time of completion for all remaining tasks given the current throughput of active clients. This ratio is referred to as the task pressure; if it exceeds 1, the pressure is considered high and another client is added, unless the cluster is already at the given --max-size.

For example, if the average task length is 30 minutes, and we set --factor=2, then if the estimated time of completion of remaining tasks given currently connected executors exceeds 1 hour, we will scale up by one unit.

See also --period. Only valid with --autoscaling.

-P, --period SEC

Scaling period in seconds (default: 60).

The autoscaler waits for this period of time between checks and scaling events. A shorter period makes the scaling behavior more responsive but can affect database performance if checks happen too rapidly.

Only valid with --autoscaling.

-I, --init-size SIZE

Initial size of cluster (default: 1).

When the cluster starts, this number of clients will be launched. For a fixed policy cluster, this should be given along with --min-size, likely with the same value.

Only valid with --autoscaling.

-X, --min-size SIZE

Minimum size of cluster (default: 0).

Regardless of autoscaling policy, if the number of launched clients drops below this value we will scale up by one. Allowing --min-size=0 is an important feature for efficient use of computing resources in the absence of tasks.

Only valid with --autoscaling.

-Y, --max-size SIZE

Maximum size of cluster (default: 2).

For a dynamic autoscaling policy, this sets an upper limit on the number of launched clients. When this number is reached, scaling stops regardless of task pressure.

Only valid with --autoscaling.