Running Experiments#

Run an experiment#

Launch Autosubmit with the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run EXPID

In the command above, EXPID is the experiment identifier. The command exits with status 0 when the workflow finishes with no failed jobs, and with status 1 otherwise.
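
Because the exit status separates a clean run from one with failed jobs, a wrapper script can act on it. The following is a minimal Python sketch, assuming the autosubmit executable is on the PATH. The names build_run_command and run_experiment, and the injectable runner argument, are illustrative only and are not part of Autosubmit.

```python
import subprocess


def build_run_command(expid):
    """Return the argument list for `autosubmit run EXPID`."""
    return ["autosubmit", "run", expid]


def run_experiment(expid, runner=subprocess.run):
    """Run the experiment and return True if the workflow had no failed jobs.

    `runner` is a hypothetical hook so the sketch can be tried without
    Autosubmit installed; by default it is subprocess.run.
    """
    completed = runner(build_run_command(expid))
    # Documented exit status: 0 = finished with no failed jobs, 1 = otherwise.
    return completed.returncode == 0
```

For example, run_experiment("cxxx") returns True only when autosubmit run cxxx exits with status 0.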

Options:

usage: autosubmit run [-h] expid

  expid       experiment identifier
  -nt                   --notransitive
                            prevents doing the transitive reduction when plotting the workflow
  -v                    --update_version
                            update the experiment version to match the actual autosubmit version
  -st                   --start_time
                            Sets the starting time for the experiment. Accepted format: 'yyyy-mm-dd HH:MM:SS' or 'HH:MM:SS' (defaults to current day).
  -sa                   --start_after
                            Sets an experiment expid that will be tracked for completion. When that experiment is completed, the current instance of Autosubmit run will start.
  -rom,--run_only_members  --run_members
                            Sets a list of members allowed to run. The list must have the format '### ###' where '###' represents the name of the member as set in the conf files.
  -h, --help  show this help message and exit

Example:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run cxxx

Important

If the Autosubmit version is set in autosubmit.yml, it must match the installed Autosubmit version.

Hint

We recommend launching it in the background with nohup, so that it keeps running even if the user who launched the process logs out.

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
nohup autosubmit run cxxx &

Important

Before launching Autosubmit, check that password-less ssh works (HPCName is the hostname):

ssh HPCName

More info on password-less ssh can be found at: http://www.linuxproblem.org/art_9.html

Important

Add your encryption key to the ssh agent in each session (if your ssh key is encrypted).

Important

The host machine must be able to access the HPCs/clusters via password-less ssh. Make sure that the ssh key is in PEM format: ssh-keygen -t rsa -b 4096 -C "email@email.com" -m PEM.
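
The password-less ssh check described above can also be scripted, so that it fails instead of silently waiting at a password prompt. This is a rough sketch, assuming an OpenSSH client. It relies on the standard BatchMode and ConnectTimeout client options. The function names and the 10-second timeout are illustrative choices, not something Autosubmit prescribes.

```python
import subprocess


def ssh_check_command(host, timeout_seconds=10):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never ask for a password or passphrase
        "-o", f"ConnectTimeout={timeout_seconds}",  # give up on unreachable hosts
        host,
        "true",  # remote no-op; only the exit status matters
    ]


def passwordless_ssh_works(host, runner=subprocess.run):
    """Return True if `host` accepts a login without any interaction.

    `runner` is a hypothetical hook for trying the sketch offline.
    """
    return runner(ssh_check_command(host)).returncode == 0
```

Calling passwordless_ssh_works("HPCName") before autosubmit run gives an early, explicit failure if the key is missing from the agent or not authorized on the remote side.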

Caution

After launching Autosubmit, keep track of the login expiry limit and policy of each HPC (where applicable) and renew your login access (for example, with a token or key) before it expires.

How to run an experiment that was created with another version#

Important

First, stop the Autosubmit instance associated with the experiment.

Once you have loaded or installed the Autosubmit version you want, run:

autosubmit create $EXPID -np
autosubmit recovery $EXPID -s --all -f -np
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run $EXPID -v
or
autosubmit updateversion $EXPID
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run $EXPID -v

EXPID is the experiment identifier. The most common problem when you change your Autosubmit version is the appearance of several Python errors. This is due to how Autosubmit stores its data internally, which can be incompatible between versions. The steps above re-create these internal data structures (1) and recover the previous status of your experiment (2).

How to run an experiment that was created with version <= 4.0.0#

Important

First, stop the Autosubmit instance associated with the experiment.

Once you have loaded or installed the Autosubmit version you want, run:

autosubmit upgrade $EXPID
autosubmit create $EXPID -np
autosubmit recovery $EXPID -s --all -f -np
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run $EXPID -v
or
autosubmit updateversion $EXPID
# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run $EXPID -v

EXPID is the experiment identifier. The most common problem when you upgrade an experiment from INI configuration to YAML is that some variables may not be translated automatically. Ensure that all your $EXPID/conf/*.yml files are correct, and also review the templates in $EXPID/proj/$proj_name.

How to run only selected members#

To run only a subset of selected members you can execute the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run EXPID -rom MEMBERS

EXPID is the experiment identifier, the experiment you want to run.

MEMBERS is the selected subset of members, in the format "member1 member2 member3", for example "fc0 fc1 fc2".

Then, your experiment will start running jobs belonging to those members only. If the experiment was previously running and Autosubmit was stopped while some jobs belonging to other members (not the ones from your input) were running, those jobs will be tracked and finished in the new exclusive run.

Furthermore, if you want to run a sequence of member-only executions, then instead of typing autosubmit run EXPID -rom "member_1" ... autosubmit run EXPID -rom "member_n" one command at a time, you can put that sequence in a bash file and run the file. Example:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run EXPID -rom MEMBER_1
autosubmit run EXPID -rom MEMBER_2
autosubmit run EXPID -rom MEMBER_3
...
autosubmit run EXPID -rom MEMBER_N
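
The same sequence can be generated from a list of members rather than written out by hand. The sketch below is one possible Python equivalent of the bash file, assuming autosubmit is on the PATH. The function names and the runner hook are hypothetical. Like the plain bash file, it moves on to the next member regardless of the previous exit status.

```python
import subprocess


def member_run_commands(expid, members):
    """One `autosubmit run EXPID -rom MEMBER` command per member, in order."""
    return [["autosubmit", "run", expid, "-rom", member] for member in members]


def run_members_in_sequence(expid, members, runner=subprocess.run):
    """Run each member-only execution in turn.

    Returns a list of (member, exit status) pairs. `runner` is a
    hypothetical hook so the sketch can be tried without Autosubmit.
    """
    statuses = []
    for member, command in zip(members, member_run_commands(expid, members)):
        # Each call blocks until that member-only run has finished.
        statuses.append((member, runner(command).returncode))
    return statuses
```

For instance, run_members_in_sequence("cxxx", ["fc0", "fc1", "fc2"]) runs the three member-only executions one after another and reports the exit status of each.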

How to start an experiment at a given time#

To start an experiment at a given time, use the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run EXPID -st INPUT

EXPID is the experiment identifier.

INPUT is the time when your experiment will start. You can provide two formats:
  • H:M:S: For example 15:30:00 will start your experiment at 15:30 in the afternoon of the present day.

  • yyyy-mm-dd H:M:S: For example 2021-02-15 15:30:00 will start your experiment at 15:30 in the afternoon on February 15th.

Then, your terminal will show a countdown for your experiment start.

This functionality can be used together with other options supplied by the run command.

The -st option has a long version: --start_time.
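
To see how the two accepted formats map to a concrete start moment, the following Python sketch interprets a start-time value the way this section describes: a full date and time is used as given, and a time alone is taken to mean the present day. How Autosubmit validates the value internally is not shown here, so parse_start_time and its today parameter are assumptions made for illustration.

```python
from datetime import datetime


def parse_start_time(value, today=None):
    """Interpret a start-time string using the two documented formats.

    `today` lets the caller fix the reference day; it defaults to the
    current date. Raises ValueError if neither format matches.
    """
    today = today or datetime.now().date()
    try:
        # Full form: yyyy-mm-dd H:M:S
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        pass
    # Short form: H:M:S, applied to the reference day.
    time_only = datetime.strptime(value, "%H:%M:%S").time()
    return datetime.combine(today, time_only)
```

Checking the value with a helper like this before launching avoids starting a long-lived process with a mistyped time.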

How to start an experiment after another experiment is finished#

To start an experiment after another experiment is finished, use the command:

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
autosubmit run EXPID -sa EXPIDB

EXPID is the experiment identifier, the experiment you want to start.

EXPIDB is the experiment identifier of the experiment you are waiting for before your experiment starts.

Warning

Both experiments must be using Autosubmit version 3.13.0 or later.

Then, your terminal will show the current status of the experiment you are waiting for. The status format is COMPLETED/QUEUING/RUNNING/SUSPENDED/FAILED.

This functionality can be used together with other options supplied by the run command.

The -sa option has a long version: --start_after.

How to profile Autosubmit while running an experiment#

Autosubmit offers the possibility to profile an experiment execution. To enable the profiler, just add the --profile (or -p) flag to your autosubmit run command, as in the following example:

autosubmit run --profile EXPID

Note

Remember that the purpose of this profiler is to measure the performance of Autosubmit, not the jobs it runs.

This profiler uses Python’s cProfile and psutil modules to generate a report with simple CPU and memory metrics which will be displayed in your console after the command finishes, as in the example below:

Screenshot of the header of the profiler's output

The profiler output is also saved in <EXPID>/tmp/profile. There you will find two files, the report in plain text format and a .prof binary which contains the CPU metrics. We highly recommend using SnakeViz to visualize this file, as follows:

The .prof file represented by the graphical library SnakeViz
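
If a graphical viewer is not available, the .prof file can also be inspected with Python's standard pstats module, since cProfile output is stored in that format. The sketch below is only an illustration: summarize_profile is a hypothetical helper, and the path of the .prof file inside <EXPID>/tmp/profile must be supplied by the caller.

```python
import io
import pstats


def summarize_profile(prof_path, limit=10):
    """Return a text table of the `limit` costliest functions by cumulative time."""
    buffer = io.StringIO()
    stats = pstats.Stats(prof_path, stream=buffer)
    # strip_dirs() shortens file paths; "cumulative" sorts by time spent in a
    # function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()
```

Printing the returned string gives a quick view of where the run spent its time, using the same data that SnakeViz visualizes.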

For more detailed documentation about the profiler, please visit this page.

How to prepare an experiment to run in two independent job_list. (Priority jobs, Two-step-run) (OLD METHOD)#

This feature lets you run an experiment in two separate steps without any manual intervention.

To achieve this, use the special parameter TWO_STEP_START, in which you list the jobs that you want to run in exclusive mode. These jobs run until all of them have finished; only then do the rest of the jobs begin execution.

TWO_STEP_START is set in expdef_a02n.yml, under the experiment: section.

experiment:
    DATELIST: 20120101 20120201
    MEMBERS: fc00[0-3]
    CHUNKSIZEUNIT: day
    CHUNKSIZE: 1
    NUMCHUNKS: 10
    CHUNKINI :
    CALENDAR: standard
    # To run before the rest of experiment:
    TWO_STEP_START: <job_names&section,dates,member_or_chunk(M/C),chunk_or_member(C/M)>

To make it easier to use, this feature has three modes:

  • job_names alone: list all job names one by one, separated by the , character.

  • section,dates,member_or_chunk(M/C),chunk_or_member(C/M): select multiple jobs at once by combining these filters.

  • Both options together, job_names and section,dates,member_or_chunk(M/C),chunk_or_member(C/M): put & between the two modes.

TWO_STEP_START has five fields. All of them are optional, but some restrictions apply:

  • Job_name: [Independent] List of job names, separated by the ',' character. Optional; does not depend on any other field. If specified, it must be the first field, separated from the rest of the fields by '&'.

  • Section: [Independent] List of sections, separated by the ',' character. Optional; can be used alone. Separated from the rest of the fields by ';'.

  • Dates: [Depends on Section] List of dates, separated by the ',' character. Optional, but requires the Section field. Separated from the rest of the fields by ';'.

  • member_or_chunk: [Depends on Dates(OR)] List of chunks or members; must start with C or M to indicate the filter type. Jobs are selected by a list [1,2,3..] or by a range [0-9]. Optional, but requires the Dates field. Separated from the rest of the fields by ';'.

  • chunk_or_member: [Depends on Dates(OR)] List of members or chunks; must start with M or C to indicate the filter type. Jobs are selected by a list [1,2,3..] or by a range [0-9]. Optional, but requires the Dates field. Separated from the rest of the fields by ';'.
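
The field layout above can be made concrete with a small parser. The Python sketch below is not Autosubmit's own parsing code; it only follows the separators and dependencies listed above. It assumes that, when no '&' is present, the whole value is the section/dates filter part, and it keeps the member and chunk filters verbatim under a hypothetical "selectors" key.

```python
def _split_commas(text):
    """Split a comma-separated field, dropping empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_two_step_start(value):
    """Split a TWO_STEP_START string into job names and filter fields."""
    names_part, ampersand, filter_part = value.partition("&")
    if not ampersand:
        # Assumption of this sketch: without '&' the value holds only filters.
        names_part, filter_part = "", value

    parts = [part.strip() for part in filter_part.split(";")]
    if len(parts) > 4:
        raise ValueError("at most four ';'-separated filter fields are expected")
    sections = _split_commas(parts[0])
    dates = _split_commas(parts[1]) if len(parts) > 1 else []
    # member_or_chunk / chunk_or_member are kept as written, e.g. "c[1]".
    selectors = [part for part in parts[2:4] if part]

    if dates and not sections:
        raise ValueError("Dates requires the Section field")
    if selectors and not dates:
        raise ValueError("member/chunk filters require the Dates field")
    for selector in selectors:
        if selector[0].upper() not in ("C", "M"):
            raise ValueError("member/chunk filters must start with C or M")

    return {
        "job_names": _split_commas(names_part),
        "sections": sections,
        "dates": dates,
        "selectors": selectors,
    }
```

With the value a02n_20120101_000_1_REDUCE&COMPILE_DA,SIM;20120101;c[1], this yields one job name, the sections COMPILE_DA and SIM, the date 20120101, and the chunk filter c[1].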

Example using the old method#

Assume the following expdef configuration:

experiment:
    DATELIST: 20120101
    MEMBERS: 00[0-1]
    CHUNKSIZEUNIT: day
    CHUNKSIZE: 1
    NUMCHUNKS: 2
    TWO_STEP_START: a02n_20120101_000_1_REDUCE&COMPILE_DA,SIM;20120101;c[1]

Given this job_list (jobs_conf has REMOTE_COMPILE(once), DA, SIM, REDUCE):

['a02n_REMOTE_COMPILE', 'a02n_20120101_000_1_SIM', 'a02n_20120101_000_2_SIM', 'a02n_20120101_001_1_SIM', 'a02n_20120101_001_2_SIM', 'a02n_COMPILE_DA', 'a02n_20120101_1_DA', 'a02n_20120101_2_DA', 'a02n_20120101_000_1_REDUCE', 'a02n_20120101_000_2_REDUCE', 'a02n_20120101_001_1_REDUCE', 'a02n_20120101_001_2_REDUCE']

The priority jobs will be (check TWO_STEP_START in the expdef conf):

['a02n_20120101_000_1_SIM', 'a02n_20120101_001_1_SIM', 'a02n_COMPILE_DA', 'a02n_20120101_000_1_REDUCE']

How to prepare an experiment to run in two independent job_list. (New method)#

From AS4 onwards, TWO_STEP_START is no longer needed, since users can now specify in the DEPENDENCIES parameter exactly which tasks of a job are required to run the current task.

Simplified example using the new method#

This example is based on the previous one, but using the new method and without the reduce job.

experiment:
    DATELIST: 20120101
    MEMBERS: "00[0-1]"
    CHUNKSIZEUNIT: day
    CHUNKSIZE: 1
    NUMCHUNKS: 2
JOBS:
    REMOTE_COMPILE:
        FILE: remote_compile.sh
        RUNNING: once
    DA:
        FILE: da.sh
        DEPENDENCIES:
            SIM:
            DA:
                DATES_FROM:
                 "20120201":
                   CHUNKS_FROM:
                    1:
                     DATES_TO: "20120101"
                     CHUNKS_TO: "1"
    SIM:
        FILE: sim.sh
        DEPENDENCIES:
            LOCAL_SEND_STATIC:
            REMOTE_COMPILE:
            SIM-1:
            DA-1:

Example 2: Crossdate wrappers using the new dependencies#

experiment:
  DATELIST: 20120101 20120201
  MEMBERS: "000 001"
  CHUNKSIZEUNIT: day
  CHUNKSIZE: '1'
  NUMCHUNKS: '3'
wrappers:
    wrapper_simda:
        TYPE: "horizontal-vertical"
        JOBS_IN_WRAPPER: "SIM DA"

JOBS:
  LOCAL_SETUP:
    FILE: templates/local_setup.sh
    PLATFORM: marenostrum_archive
    RUNNING: once
    NOTIFY_ON: COMPLETED
  LOCAL_SEND_SOURCE:
    FILE: templates/01_local_send_source.sh
    PLATFORM: marenostrum_archive
    DEPENDENCIES: LOCAL_SETUP
    RUNNING: once
    NOTIFY_ON: FAILED
  LOCAL_SEND_STATIC:
    FILE: templates/01b_local_send_static.sh
    PLATFORM: marenostrum_archive
    DEPENDENCIES: LOCAL_SETUP
    RUNNING: once
    NOTIFY_ON: FAILED
  REMOTE_COMPILE:
    FILE: templates/02_compile.sh
    DEPENDENCIES: LOCAL_SEND_SOURCE
    RUNNING: once
    PROCESSORS: '4'
    WALLCLOCK: 00:50
    NOTIFY_ON: COMPLETED
  SIM:
    FILE: templates/05b_sim.sh
    DEPENDENCIES:
      LOCAL_SEND_STATIC:
      REMOTE_COMPILE:
      SIM-1:
      DA-1:
    RUNNING: chunk
    PROCESSORS: '68'
    WALLCLOCK: 00:12
    NOTIFY_ON: FAILED
  LOCAL_SEND_INITIAL_DA:
    FILE: templates/00b_local_send_initial_DA.sh
    PLATFORM: marenostrum_archive
    DEPENDENCIES: LOCAL_SETUP LOCAL_SEND_INITIAL_DA-1
    RUNNING: chunk
    SYNCHRONIZE: member
    DELAY: '0'
  COMPILE_DA:
    FILE: templates/02b_compile_da.sh
    DEPENDENCIES: LOCAL_SEND_SOURCE
    RUNNING: once
    WALLCLOCK: 00:20
    NOTIFY_ON: FAILED
  DA:
    FILE: templates/05c_da.sh
    DEPENDENCIES:
      SIM:
      LOCAL_SEND_INITIAL_DA:
        CHUNKS_TO: "all"
        DATES_TO: "all"
        MEMBERS_TO: "all"
      COMPILE_DA:
      DA:
        DATES_FROM:
         "20120201":
           CHUNKS_FROM:
            1:
             DATES_TO: "20120101"
             CHUNKS_TO: "1"
    RUNNING: chunk
    SYNCHRONIZE: member
    DELAY: '0'
    WALLCLOCK: 00:12
    PROCESSORS: '256'
    NOTIFY_ON: FAILED

[Figure: crossdate-example]

Finally, you can launch autosubmit run in the background with nohup, so that it keeps running even if the user who launched the process logs out.

# Add your key to ssh agent ( if encrypted )
ssh-add ~/.ssh/id_rsa
nohup autosubmit run cxxx &

How to stop the experiment#

You can stop Autosubmit by sending a signal to the process. To get the process identifier (PID) you can use the ps command on a shell interpreter/terminal.

ps -ef | grep autosubmit
dbeltran  22835     1  1 May04 ?        00:45:35 autosubmit run cxxy
dbeltran  25783     1  1 May04 ?        00:42:25 autosubmit run cxxx

To send a signal to a process, you can use the kill command, also from a terminal.

To stop experiment cxxx immediately:

kill -9 25783
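
Picking the PID by eye from the ps listing is error-prone when several experiments are running. As a hedged illustration, the Python sketch below extracts the PID for a given experiment from ps -ef output and sends the same signal as kill -9. It assumes the usual ps -ef column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD). Both function names are hypothetical.

```python
import os
import signal


def find_autosubmit_pid(ps_output, expid):
    """Return the PID of the `autosubmit run EXPID` process, or None."""
    for line in ps_output.splitlines():
        columns = line.split()
        if len(columns) < 8:
            continue
        command = columns[7:]  # everything from the CMD column onwards
        runs_autosubmit = any(token.endswith("autosubmit") for token in command)
        # Requiring "run" skips the `grep autosubmit` process itself.
        if runs_autosubmit and "run" in command and expid in command:
            return int(columns[1])  # second column of ps -ef is the PID
    return None


def kill_immediately(pid):
    """Equivalent of `kill -9 PID` (Unix only)."""
    os.kill(pid, signal.SIGKILL)
```

Feeding the function the captured output of ps -ef and the experiment identifier returns the PID to pass to kill_immediately, or None if no matching process is found.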

Important

In case you want to restart the experiment, you must follow the How to restart the experiment procedure, explained below, in order to properly resynchronize all completed jobs.