Running Experiments#

Run an experiment#

Launch Autosubmit with the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID>

In the previous command output <EXPID> is the experiment identifier. The command exits with 0 when the workflow finishes with no failed jobs, and with 1 otherwise.

Options:

$ autosubmit run -h

Output:
usage: autosubmit run [-h] [-nt] [-v] [-st START_TIME] [-sa START_AFTER]
                      [-rom RUN_ONLY_MEMBERS] [-p [MAX_ITER]] [-t]
                      expid

runs specified experiment

positional arguments:
  expid                 experiment identifier

optional arguments:
  -h, --help            show this help message and exit
  -nt, --notransitive   Disable transitive reduction
  -v, --update_version  Update experiment version
  -st START_TIME, --start_time START_TIME
                        Sets the starting time for this experiment
  -sa START_AFTER, --start_after START_AFTER
                        Sets a experiment expid which completion will trigger
                        the start of this experiment.
  -rom RUN_ONLY_MEMBERS, --run_only_members RUN_ONLY_MEMBERS
                        Sets members allowed on this run.
  -p [MAX_ITER], --profile [MAX_ITER]
                        Enables profiling. Optionally pass the maximum number
                        of iterations (0 = no cap).
  -t, --trace           Enables trace output for profiling (requires
                        --profile).

Example:

Important

If the autosubmit version is set in autosubmit_<EXPID>.yml, it must match the actual autosubmit version.

Hint

It is recommended to launch it in background and with nohup (continue running although the user who launched the process logs out).

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
nohup autosubmit run <EXPID> &

Important

Before launching Autosubmit, check that password-less ssh is feasible (HPCName is the hostname).

Important

Add encryption key to ssh agent for each session (if your ssh key is encrypted).

Important

The host machine has to be able to access HPC’s/Clusters via password-less ssh. Make sure that the ssh key is in PEM format ssh-keygen -t rsa -b 4096 -C “email@email.com” -m PEM.

ssh HPCName

More info on password-less ssh can be found at: http://www.linuxproblem.org/art_9.html

Caution

After launching Autosubmit, one must be aware of login expiry limit and policy (if applicable for any HPC) and renew the login access accordingly (by using token/key etc) before expiry.

When running operational experiments (i.e. an experiment whose EXPID starts with 'o', e.g. o001), and that have a Git project, Autosubmit checks if there is any code that was not committed or not pushed to the remote Git repository.

If there are local changes not committed and pushed, Autosubmit will fail to run the experiment, print an error message, and exit with an exit code different than zero.

This can be disabled by setting the property CONFIG.GIT_OPERATIONAL_CHECK_ENABLED to False (it is True by default). Note, however, that this is discouraged as it would affect the traceability of operational experiments.

How to run an experiment that was created with another version#

Important

First of all you have to stop your Autosubmit instance related with the experiment

Once you’ve already loaded / installed the Autosubmit version do you want:

autosubmit create <EXPID>
autosubmit recovery <EXPID> -s --all -f
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -v
or
autosubmit updateversion <EXPID>
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -v

EXPID is the experiment identifier. The most common problem when you change your Autosubmit version is the apparition of several Python errors. This is due to how Autosubmit saves internally the data, which can be incompatible between versions. The steps above represent the process to re-create (1) these internal data structures and to recover (2) the previous status of your experiment.

How to run an experiment that was created with version <= 4.0.0#

Important

First of all you have to stop your Autosubmit instance related with the experiment.

Once you’ve already loaded / installed the Autosubmit version do you want:

autosubmit upgrade <EXPID>
autosubmit create <EXPID>
autosubmit recovery <EXPID> -s --all -f
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -v
or
autosubmit updateversion <EXPID>
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -v

<EXPID> is the experiment identifier. The most common problem when you upgrade an experiment with INI configuration to YAML is that some variables may be not automatically translated. Ensure that all your <EXPID>/conf/*.yml files are correct and also revise the templates in <EXPID>/proj/$proj_name.

How to run only selected members#

To run only a subset of selected members you can execute the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -rom MEMBERS

<EXPID> is the experiment identifier, the experiment you want to run.

MEMBERS is the selected subset of members. Format “member1 member2 member2”, example: “fc0 fc1 fc2”.

Then, your experiment will start running jobs belonging to those members only. If the experiment was previously running and autosubmit was stopped when some jobs belonging to other members (not the ones from your input) where running, those jobs will be tracked and finished in the new exclusive run.

Furthermore, if you wish to run a sequence of only members execution, then instead of running autosubmit run -rom “member_1”autosubmit run -rom “member_n”, you can make a bash file with that sequence and run the bash file. Example:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -rom MEMBER_1
autosubmit run <EXPID> -rom MEMBER_2
autosubmit run <EXPID> -rom MEMBER_3
...
autosubmit run <EXPID> -rom MEMBER_N

How to start an experiment at a given time#

To start an experiment at a given time, use the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -st INPUT

<EXPID> is the experiment identifier

INPUT is the time when your experiment will start. You can provide two formats:
  • H:M:S: For example, 15:30:00 will start your experiment at 15:30 in the afternoon of the present day.

  • yyyy-mm-dd H:M:S: For example, 2021-02-15 15:30:00 will start your experiment at 15:30 in the afternoon on February 15th.

Then, your terminal will show a countdown for your experiment start.

This functionality can be used together with other options supplied by the run command.

The -st command has a long version –start_time.

How to start an experiment after another experiment is finished#

To start an experiment after another experiment is finished, use the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run <EXPID> -sa <EXPIDB>

<EXPID> is the experiment identifier, the experiment you want to start.

<EXPIDB> is the experiment identifier of the experiment you are waiting for before your experiment starts.

Warning

Both experiments must be using Autosubmit version 3.13.0 or later.

Then, your terminal will show the current status of the experiment you are waiting for. The status format is COMPLETED/QUEUING/RUNNING/SUSPENDED/FAILED.

This functionality can be used together with other options supplied by the run command.

The -sa command has a long version –start_after.

How to profile Autosubmit while running an experiment#

Autosubmit offers the possibility to profile an experiment execution. To enable the profiler, just add the --profile (or -p) flag to your autosubmit run command, as in the following example:

autosubmit run --profile <EXPID>

Note

Remember that the purpose of this profiler is to measure the performance of Autosubmit, not the jobs it runs.

This profiler uses Python’s cProfile and psutil modules to generate a report with simple CPU and memory metrics which will be displayed in your console after the command finishes, as in the example below:

Screenshot of the header of the profiler's output

The profiler output is also saved in <EXPID>/tmp/profile. There you will find two files, the report in plain text format and a .prof binary which contains the CPU metrics. We highly recommend using SnakeViz to visualize this file, as follows:

The .prof file represented by the graphical library SnakeViz

For more detailed documentation about the profiler, please visit this page.

How to prepare an experiment to run in two independent job_list. (Priority jobs, Two-step-run) (OLD METHOD)#

This feature allows to run an experiment in two separated steps without the need of do anything manually.

To achieve this, you will have to use an special parameter called TWO_STEP_START in which you will put the list of the jobs that you want to run in an exclusive mode. These jobs will run until all of them finishes and once it finishes, the rest of the jobs will begun the execution.

It can be activated through TWO_STEP_START and it is set on expdef_<EXPID>.yml, under the experiment: section.

experiment:
    DATELIST: 20120101 20120201
    MEMBERS: fc00[0-3]
    CHUNKSIZEUNIT: day
    CHUNKSIZE: 1
    NUMCHUNKS: 10
    CHUNKINI :
    CALENDAR: standard
    # To run before the rest of experiment:
    TWO_STEP_START: <job_names&section,dates,member_or_chunk(M/C),chunk_or_member(C/M)>

In order to be easier to use, there are Three modes for use this feature: job_names and section,dates,member_or_chunk(M/C),chunk_or_member(C/M).

  • By using job_names alone, you will need to put all jobs names one by one divided by the char , .

  • By using section,dates,member_or_chunk(M/C),chunk_or_member(C/M). You will be able to select multiple jobs at once combining these filters.

  • Use both options, job_names and section,dates,member_or_chunk(M/C),chunk_or_member(C/M). You will have to put & between the two modes.

There are 5 fields on TWO_STEP_START, all of them are optional but there are certain limitations:

  • Job_name: [Independent] List of job names, separated by ‘,’ char. Optional, doesn’t depend on any field. Separated from the rest of fields by ‘&’ must be the first field if specified

  • Section: [Independent] List of sections, separated by ‘,’ char. Optional, can be used alone. Separated from the rest of fields by ‘;’

  • Dates: [Depends on section] List of dates, separated by ‘,’ char. Optional, but depends on Section field. Separated from the rest of fields by ‘;’

  • member_or_chunk: [Depends on Dates(OR)] List of chunk or member, must start with C or M to indicate the filter type. Jobs are selected by [1,2,3..] or by a range [0-9] Optional, but depends on Dates field. Separated from the rest of fields by ‘;’

  • chunk_or_member: [Depends on Dates(OR)] List of member or chunk, must start with M or C to indicate the filter type. Jobs are selected by [1,2,3..] or by a range [0-9] Optional, but depends on Dates field. Separated from the rest of fields by ‘;’

Example using the old method#

Guess the expdef configuration as follow:

experiment:
    DATELIST: 20120101
    MEMBERS: 00[0-1]
    CHUNKSIZEUNIT: day
    CHUNKSIZE: 1
    NUMCHUNKS: 2
    TWO_STEP_START: a02n_20120101_000_1_REDUCE&COMPILE_DA,SIM;20120101;c[1]

Given this job_list ( jobs_conf has REMOTE_COMPILE(once),DA,SIM,REDUCE)

[‘a02n_REMOTE_COMPILE’, ‘a02n_20120101_000_1_SIM’, ‘a02n_20120101_000_2_SIM’, ‘a02n_20120101_001_1_SIM’, ‘a02n_20120101_001_2_SIM’, ‘a02n_COMPILE_DA’, ‘a02n_20120101_1_DA’, ‘a02n_20120101_2_DA’, ‘a02n_20120101_000_1_REDUCE’, ‘a02n_20120101_000_2_REDUCE’, ‘a02n_20120101_001_1_REDUCE’, ‘a02n_20120101_001_2_REDUCE’]

The priority jobs will be ( check TWO_STEP_START from expdef conf):

[‘a02n_20120101_000_1_SIM’, ‘a02n_20120101_001_1_SIM’, ‘a02n_COMPILE_DA’, ‘a02n_20120101_000_1_REDUCE’]

How to prepare an experiment to run in two independent job_list. (New method)#

From AS4, TWO_STEP_START is not longer needed since the users can now specify exactly which tasks of a job are needed to run the current task in the DEPENDENCIES parameter.

Simplified example using the new method#

This example is based on the previous one, but using the new method and without the reduce job.

experiment:
    DATELIST: 20120101
    MEMBERS: "00[0-1]"
    CHUNKSIZEUNIT: day
    CHUNKSIZE: 1
    NUMCHUNKS: 2
JOBS:
    REMOTE_COMPILE:
        FILE: remote_compile.sh
        RUNNING: once
    DA:
        FILE: da.sh
        DEPENDENCIES:
            SIM:
            DA:
                DATES_FROM:
                 "20120201":
                   CHUNKS_FROM:
                    1:
                     DATES_TO: "20120101"
                     CHUNKS_TO: "1"
    SIM:
        FILE: sim.sh
        DEPENDENCIES:
            LOCAL_SEND_STATIC:
            REMOTE_COMPILE:
            SIM-1:
            DA-1:

Example 2: Crossdate wrappers using the the new dependencies#

experiment:
  DATELIST: 20120101 20120201
  MEMBERS: "000 001"
  CHUNKSIZEUNIT: day
  CHUNKSIZE: '1'
  NUMCHUNKS: '3'
wrappers:
    wrapper_simda:
        TYPE: "horizontal-vertical"
        JOBS_IN_WRAPPER: "SIM DA"

JOBS:
  LOCAL_SETUP:
    FILE: templates/local_setup.sh
    PLATFORM: marenostrum_archive
    RUNNING: once
    NOTIFY_ON: COMPLETED
  LOCAL_SEND_SOURCE:
    FILE: templates/01_local_send_source.sh
    PLATFORM: marenostrum_archive
    DEPENDENCIES: LOCAL_SETUP
    RUNNING: once
    NOTIFY_ON: FAILED
  LOCAL_SEND_STATIC:
    FILE: templates/01b_local_send_static.sh
    PLATFORM: marenostrum_archive
    DEPENDENCIES: LOCAL_SETUP
    RUNNING: once
    NOTIFY_ON: FAILED
  REMOTE_COMPILE:
    FILE: templates/02_compile.sh
    DEPENDENCIES: LOCAL_SEND_SOURCE
    RUNNING: once
    PROCESSORS: '4'
    WALLCLOCK: 00:50
    NOTIFY_ON: COMPLETED
  SIM:
    FILE: templates/05b_sim.sh
    DEPENDENCIES:
      LOCAL_SEND_STATIC:
      REMOTE_COMPILE:
      SIM-1:
      DA-1:
    RUNNING: chunk
    PROCESSORS: '68'
    WALLCLOCK: 00:12
    NOTIFY_ON: FAILED
  LOCAL_SEND_INITIAL_DA:
    FILE: templates/00b_local_send_initial_DA.sh
    PLATFORM: marenostrum_archive
    DEPENDENCIES: LOCAL_SETUP LOCAL_SEND_INITIAL_DA-1
    RUNNING: chunk
    SYNCHRONIZE: member
    DELAY: '0'
  COMPILE_DA:
    FILE: templates/02b_compile_da.sh
    DEPENDENCIES: LOCAL_SEND_SOURCE
    RUNNING: once
    WALLCLOCK: 00:20
    NOTIFY_ON: FAILED
  DA:
    FILE: templates/05c_da.sh
    DEPENDENCIES:
      SIM:
      LOCAL_SEND_INITIAL_DA:
        CHUNKS_TO: "all"
        DATES_TO: "all"
        MEMBERS_TO: "all"
      COMPILE_DA:
      DA:
        DATES_FROM:
         "20120201":
           CHUNKS_FROM:
            1:
             DATES_TO: "20120101"
             CHUNKS_TO: "1"
    RUNNING: chunk
    SYNCHRONIZE: member
    DELAY: '0'
    WALLCLOCK: 00:12
    PROCESSORS: '256'
    NOTIFY_ON: FAILED
crossdate-example

Finally, you can launch Autosubmit run in background and with nohup (continue running although the user who launched the process logs out).

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
nohup autosubmit run <EXPID> &

How to stop the experiment#

From Autosubmit 4.1.6, you can stop an experiment using the command autosubmit stop

Options:

$ autosubmit stop -h

Output:
usage: autosubmit stop [-h] [-f] [-a] [-fa] [-y] [-c] [-fs FILTER_STATUS]
                       [-t STATUS]
                       [expid]

Completely stops an autosubmit run process

positional arguments:
  expid                 experiment identifiers separated by commas

optional arguments:
  -h, --help            show this help message and exit
  -f, --force           Forces to stop autosubmit process, equivalent to kill
                        -9
  -a, --all             Stop all current running autosubmit processes, will
                        ask for confirmation unless -y is used
  -fa, --force_all      Stop all current running autosubmit processes
  -y, --yes             Automatically answer yes to prompts
  -c, --cancel          Orders to the schedulers to stop active jobs.
  -fs FILTER_STATUS, --filter_status FILTER_STATUS
                        Select the status (one or more) to filter the list of
                        jobs.
  -t STATUS, --target STATUS
                        Final status of killed jobs. Default is FAILED.

Examples:#

autosubmit stop <EXPID>
autosubmit stop <EXPID>, <EXPID>
autosubmit stop -a
autosubmit stop -a -f
autosubmit stop -a -c
autosubmit stop -fa --cancel -fs "SUBMITTED, QUEUING, RUNNING" -t "FAILED"

You can stop Autosubmit by sending a signal to the process. To get the process identifier (PID) you can use the ps command on a shell interpreter/terminal.

ps -ef | grep autosubmit
dbeltran  22835     1  1 May04 ?        00:45:35 autosubmit run <EXPID>
dbeltran  25783     1  1 May04 ?        00:42:25 autosubmit run <EXPID>

To send a signal to a process you can use kill also on a terminal.

To stop immediately experiment <EXPID>:

kill -9 22835

Important

In case you want to restart the experiment, you must follow the How to restart the experiment procedure, explained below, in order to properly resynchronize all completed jobs.

Retries#

For remote platforms, there are at least two parts where retries happen (if you use wrappers you may have others), when Autosubmit connects to the remote platform, and when Autosubmit executes a command.

When Autosubmit connects to a remote platform, it will use the host value of the platform configuration. This value can contain a single host name, or a list of host names using commas (,) as separators.

Right now Autosubmit has a hard-coded number of retries for connecting to remote platforms. It will try to connect to the platform, without interval, retrying connecting twice (``2``). It will write to logs in INFO and WARNING levels information about the retries, like whether it is retrying to connect, and what is the current retry number.

When Autosubmit retries connecting to a platform with multiple hosts separated by comma, the first connection uses the first host name. If it retries the connection, the next executions will exclude the first host, and then randomly select one of the remaining host names.

For executing commands on remote platforms, Autosubmit uses another hard-coded value of 3 retries, without interval between each retry. Autosubmit will submit the command to be executed via SSH. If the command fails on the remote platform, Autosubmit will not retry the command.

As an example, if you try to run an executable such as Rscript, but this executable does not exist on the remote platform, Autosubmit will log the error, and mark the job as FAILED.

However, if you have a networking issue between Autosubmit and your remote platform, then Autosubmit will log in INFO and WARNING and will retry executing the command up to hard-coded ``3`` retries.

Note

We already have an issue created to make these retry settings configurable by users and site admins.