Configuration#

Most of the choices that HyperShell makes about timing, task bundling, coordination, logging, and so on are configurable by the user. This configuration is loaded when HyperShell starts and is constructed from several sources, including an ordered merge of files, environment variables, and command-line options.

In order of precedence (lowest to highest), three files are loaded:


Level    | BSD/Linux                  | Windows
-------- | -------------------------- | -------------------------------------
System   | /etc/hypershell.toml       | %ProgramData%\HyperShell\Config.toml
User     | ~/.hypershell/config.toml  | %AppData%\HyperShell\Config.toml
Local    | ./.hypershell/config.toml  | .\.hypershell\Config.toml


Configuration files use the TOML format, which is modern and minimal.

Every configurable option can be set in one of these files. Further, every option can also be set by an environment variable whose name is the HYPERSHELL_ prefix followed by the path to that option, with components delimited by underscores.

For example, set the logging level at the user level with a command:

Set user-level configuration option

hyper-shell config set logging.level info --user

The file should now look something like this:

~/.hypershell/config.toml

# File automatically created on 2022-07-02 11:57:29.332993
# Settings here are merged automatically with defaults and environment variables

[logging]
level = "info"

Alternatively, you can set an environment variable and the resulting runtime configuration is equivalent:

Define environment variable

export HYPERSHELL_LOGGING_LEVEL=INFO
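
The same naming rule applies to any option, including those in the nested sections documented below; for example, the database host could be set with (the value is only illustrative):

Define environment variable

export HYPERSHELL_DATABASE_HOST=my.instance.university.edu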

Finally, any option defined within a configuration file whose name ends with _env or _eval is automatically expanded from the named environment variable or shell expression, respectively. This is useful both as a dynamic feature and as a means of keeping sensitive information, such as database connection details, out of the configuration file itself.

~/.hypershell/config.toml

# File automatically created on 2022-07-02 11:57:29.332993
# Settings here are merged automatically with defaults and environment variables

[logging]
level = "info"

[database]
provider = "postgres"
database = "hypershell"
host = "my.instance.university.edu"
user = "me"
password_eval = "pass hypershell/database/password"  # Decrypt using GNU Pass


Parameter Reference#

[logging]

Logging configuration. See also logging section.

.level

One of DEVEL, TRACE, DEBUG, INFO, WARNING, ERROR, or CRITICAL (default: WARNING)

.datefmt

Date/time format; standard strftime codes apply (default: '%Y-%m-%d %H:%M:%S')

.format

Log message format. The default is determined by logging.style. See the available attributes defined by the underlying Python logging interface.

.style

Preset for logging.format, which can be difficult to define correctly. Options are default, detailed, and system.
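
For example, a logging section combining these options might look like this (the values are illustrative, not recommendations):

~/.hypershell/config.toml

[logging]
level = "debug"
style = "detailed"
datefmt = "%Y-%m-%d %H:%M:%S"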

[database]

Database configuration and connection details. See also database section.

.provider

Database provider (default: 'sqlite'). Supported alternatives include 'postgres' (or compatible). Support for other providers may be considered in the future.

.file

Path to the database file. Only applicable to the SQLite provider; SQLite does not accept any other connection details.
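
For example, a file-backed SQLite configuration could be as simple as this (the path is only illustrative):

~/.hypershell/config.toml

[database]
provider = "sqlite"
file = "/home/me/.hypershell/tasks.db"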

.database

Name for database. Not applicable for SQLite.

.schema

Not applicable to all RDBMS providers. For PostgreSQL the default schema is public. Specifying a schema may be useful for running multiple instances within the same database.

.host

Hostname or address of database server (default: localhost).

.port

Port number to connect with database server. The default value depends on the provider, e.g., 5432 for PostgreSQL.

.user

Username for database server account. If provided, a password must also be provided. Default is the local account.

.password

Password for database server account. If provided, a user must also be provided. Default is the local account.

See also note on _env and _eval.

.echo

Special parameter that enables verbose logging of all database transactions.

[connection_args]

Specify additional connection details for the underlying SQL dialect provider, e.g., sqlite3 or psycopg2.

*

Any additional arguments are forwarded to the provider, e.g., encoding = 'utf-8'.

[server]

Section for server workflow parameters.

.bind

Bind address (default: localhost).

When running locally, the default is recommended. To allow remote clients to connect over the network, bind the server to 0.0.0.0.

.port

Port number (default: 50001).

This is an arbitrary choice and simply must be an available port. The default chosen here is typically available on most platforms and is not known to be claimed by any major software.

.auth

Cryptographic authorization key to connect with server (default: <not secure>).

The default key used by the server and client is not secure and is only a placeholder. Users are expected to choose a secure key. The cluster automatically generates a secure one-time key.
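
For example, a server section that accepts remote clients might look like this (the key shown is a placeholder and should be replaced with a secure value):

~/.hypershell/config.toml

[server]
bind = "0.0.0.0"
port = 50001
auth = "REPLACE-WITH-A-SECURE-KEY"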

.queuesize

Maximum number of task bundles on the shared queue (default: 1).

This blocks the next bundle from being published by the scheduler until a client has taken the current prepared bundle. On smaller scales this is probably best and has only a modest performance impact, as it keeps the scheduler from getting too far ahead of the currently running tasks.

On large scale workflows with many clients (e.g., 100) it may be advantageous to allow the scheduler to work ahead in selecting new tasks.

.bundlesize

Size of task bundle (default: 1).

The default value allows for greater concurrency and responsiveness on small scales. This is used by the submit thread to accumulate bundles for either database commits and/or publishing to the queue. If a database is in use, the scheduler thread selects tasks from the database in batches of this size.

Using larger bundles is a good idea for large distributed workflows; specifically, it is best to coordinate bundle size with the number of executors in use by each client.

See also -b/--bundlesize command-line option.

.attempts

Attempts for auto-retry on failed tasks (default: 1).

If a database is in use, then there is an opportunity to automatically retry failed tasks. A task is considered to have failed if it has a non-zero exit status. The original task is not overwritten; a new task is submitted and scheduled later.

Counterpart to the -r/--max-retries command-line option. Setting --max-retries 1 is equivalent to setting .attempts to 2.

See also .eager.

.eager

Schedule failed tasks before new tasks (default: false).

If .attempts is greater than one, this option determines how eagerly failed tasks are re-submitted. By default, failed tasks are only scheduled once no novel tasks remain.

.wait

Polling interval in seconds for database queries during scheduling (default: 5). This waiting only occurs when no tasks are returned by the query.

.evict

Eviction period in seconds for clients (default: 600).

If a client fails to register a heartbeat after this period of time it is considered defunct and is evicted. When there are no more tasks to schedule the server sends a disconnect request to all registered clients, and waits until a confirmation is returned for each. If a client is defunct, this will hang the shutdown process.
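
Taken together, a server section tuned for a larger distributed workflow might look something like this (all values are illustrative rather than recommendations):

~/.hypershell/config.toml

[server]
queuesize = 4
bundlesize = 64
attempts = 2
eager = false
wait = 5
evict = 600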

[client]

Section for client workflow parameters.

.bundlesize

Size of task bundle (default: 1).

The default value allows for greater concurrency and responsiveness on small scales.

Using larger bundles is a good idea for larger distributed workflows; specifically, it is best to coordinate bundle size with the number of executors in use by each client. It is also a good idea to coordinate bundle size between the client and server so that the client returns bundles of the same size it receives (see the sketch below).

See also -b/--bundlesize command-line option.

.bundlewait

Seconds to wait before flushing task bundle (default: 5).

If this period of time expires since the previous bundle was returned to the server, the current group of finished tasks will be pushed regardless of bundlesize.

For larger distributed workflows it is a good idea to make this waiting period sufficiently long so that most bundles are returned whole.

See also -w/--bundlewait command-line option.
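
For instance, coordinated bundling between the server and client might be configured like this (values are illustrative; match bundlesize to the number of executors per client):

~/.hypershell/config.toml

[server]
bundlesize = 64

[client]
bundlesize = 64
bundlewait = 30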

.heartrate

Interval in seconds between heartbeats sent to the server (default: 10).

Even on the largest scales the default interval should be fine.

[submit]

Section for submit workflow parameters.

.bundlesize

Size of task bundle (default: 1).

The default value allows for greater concurrency and responsiveness on small scales. Using larger bundles is a good idea for large distributed workflows; specifically, it is best to coordinate bundle size with the number of executors in use by each client.

See also -b/--bundlesize command-line option.

.bundlewait

Seconds to wait before flushing tasks (default: 5).

If this period of time expires since the previous bundle was pushed to the database, the current bundle will be pushed regardless of how many tasks have been accumulated.

See also -w/--bundlewait command-line option.

[task]

Section for task runtime settings.

.cwd

Explicitly set the working directory for all tasks.
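
For example (the directory is only illustrative):

~/.hypershell/config.toml

[task]
cwd = "/scratch/my-project"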

[ssh]

SSH configuration section.

.args

SSH connection arguments; e.g., -i ~/.ssh/some.key. It is preferable, however, to configure SSH directly in ~/.ssh/config.

[group]

Setting .group to a list provides a global list of available client hosts. Alternatively, set one or more named groups and reference them by name with --ssh-group.

.<name> = ['host-01', 'host-02', 'host-03']



Task Environment#

A few common environment variables are defined for every task.


TASK_ID

Universal identifier (UUID) for the current task.

TASK_ARGS

Original input command-line argument line. Equivalent to {}, see templates section.

TASK_SUBMIT_ID

Universal identifier (UUID) for submitting application instance.

TASK_SUBMIT_HOST

Hostname of submitting application instance.

TASK_SUBMIT_TIME

Timestamp task was submitted.

TASK_SERVER_ID

Universal identifier (UUID) for server application instance.

TASK_SERVER_HOST

Hostname of server application instance.

TASK_SCHEDULE_TIME

Timestamp task was scheduled by server.

TASK_CLIENT_ID

Universal identifier (UUID) for client application instance.

TASK_CLIENT_HOST

Hostname of client application instance.

TASK_COMMAND

Final command line for task.

TASK_ATTEMPT

Integer number of attempts for current task (starts at 1).

TASK_PREVIOUS_ID

Universal identifier (UUID) for previous attempt (if any).

TASK_CWD

Current working directory for the current task.

TASK_OUTPATH

Absolute file path where standard output is directed (if defined).

TASK_ERRPATH

Absolute file path where standard error is directed (if defined).
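
These variables can be referenced by the task command like any other environment variable; for example (the command itself is arbitrary):

Example task command

echo "task ${TASK_ID} (attempt ${TASK_ATTEMPT}) running on ${TASK_CLIENT_HOST}"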


Further, any environment variable starting with HYPERSHELL_EXPORT_ will be injected into the task environment sans prefix; e.g., HYPERSHELL_EXPORT_FOO would define FOO in the task environment. You can also define such variables in the export section of your configuration file(s); e.g.,

~/.hypershell/config.toml

# File automatically created on 2022-07-02 11:57:29.332993
# Settings here are merged automatically with defaults and environment variables

[logging]
level = "info"

# Options defined as a list will be joined with a ":" on BSD/Linux or ";" on Windows
# Environment variables will be in all-caps (e.g., FOO and PATH).
[export]
foo = "value"
path = ["/some/bin", "/some/other/bin"]