12. API Reference
12.1. Sessions and Security Contexts
12.1.1. Sessions
-
class radical.pilot.Session(database_url=None, database_name='radicalpilot', session_uid=None)
A Session encapsulates a RADICAL-Pilot instance and is the root object for all other RADICAL-Pilot objects.
A Session holds radical.pilot.PilotManager and radical.pilot.UnitManager instances, which in turn hold radical.pilot.Pilot and radical.pilot.ComputeUnit instances.
Each Session has a unique identifier, radical.pilot.Session.uid, that can be used to re-connect to a RADICAL-Pilot instance in the database.
Example:
    s1 = radical.pilot.Session(database_url=DBURL)
    s2 = radical.pilot.Session(database_url=DBURL, session_uid=s1.uid)
    # s1 and s2 are pointing to the same session
    assert s1.uid == s2.uid
-
__init__(database_url=None, database_name='radicalpilot', session_uid=None)
Creates a new session or reconnects to an existing one.
If called without a session_uid, a new Session instance is created and stored in the database. If session_uid is set, an existing session is retrieved from the database.
- Arguments:
- database_url (string): The MongoDB URL. If none is given, RP uses the environment variable RADICAL_PILOT_DBURL. If that is not set, an error is raised.
- database_name (string): An alternative database name (default: 'radicalpilot').
- session_uid (string): If session_uid is set, RP tries to re-connect to an existing session instead of creating a new one.
- Returns:
- A new Session instance.
- Raises:
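The lookup order documented for database_url (explicit argument first, then the RADICAL_PILOT_DBURL environment variable, otherwise an error) can be sketched in plain Python. This is only an illustration: resolve_database_url is a hypothetical helper, not part of RADICAL-Pilot, and ValueError stands in for whatever error type the library actually raises.

```python
import os


def resolve_database_url(database_url=None, environ=None):
    """Hypothetical helper mirroring the documented lookup order."""
    # Allow a custom mapping so the function can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    if database_url:
        return database_url
    url = environ.get("RADICAL_PILOT_DBURL")
    if not url:
        # Assumption: the real error type is not named in this reference.
        raise ValueError("no database URL given and RADICAL_PILOT_DBURL is not set")
    return url
```

An explicit database_url always wins over the environment variable in this sketch, which matches the wording "If none is given".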
-
close(cleanup=True, terminate=True, delete=None)
Closes the session.
All subsequent attempts to access objects attached to the session will result in an error. If cleanup is set to True (default), the session data is removed from the database.
- Arguments:
- cleanup (bool): Remove the session from MongoDB (implies terminate).
- terminate (bool): Shut down all pilots associated with the session.
- Raises:
radical.pilot.IncorrectState if the session is closed or doesn’t exist.
-
created
Returns the UTC date and time the session was created.
-
last_reconnect
Returns the most recent UTC date and time the session was reconnected to.
-
list_pilot_managers()
Lists the unique identifiers of all radical.pilot.PilotManager instances associated with this session.
Example:
    s = radical.pilot.Session(database_url=DBURL)
    for pm_uid in s.list_pilot_managers():
        pm = s.get_pilot_managers(pilot_manager_ids=pm_uid)
- Returns:
- A list of radical.pilot.PilotManager uids (list of strings).
- Raises:
radical.pilot.IncorrectState if the session is closed or doesn’t exist.
-
get_pilot_managers(pilot_manager_ids=None)
Re-connects to and returns one or more existing PilotManager(s).
Arguments:
- session [radical.pilot.Session]: The session instance to use.
- pilot_manager_ids [string]: The unique identifier of the PilotManager we want to re-connect to.
Returns:
- One or more new [radical.pilot.PilotManager] objects.
Raises:
radical.pilot.PilotException if a PilotManager with the given pilot_manager_ids doesn’t exist in the database.
-
list_unit_managers()
Lists the unique identifiers of all radical.pilot.UnitManager instances associated with this session.
Example:
    s = radical.pilot.Session(database_url=DBURL)
    for um_uid in s.list_unit_managers():
        um = s.get_unit_managers(unit_manager_ids=um_uid)
- Returns:
- A list of radical.pilot.UnitManager uids (list of strings).
- Raises:
radical.pilot.IncorrectState if the session is closed or doesn’t exist.
-
get_unit_managers(unit_manager_ids=None)
Re-connects to and returns one or more existing UnitManager(s).
Arguments:
- session [radical.pilot.Session]: The session instance to use.
- unit_manager_ids [string]: The unique identifier of the UnitManager we want to re-connect to.
Returns:
- One or more new [radical.pilot.UnitManager] objects.
Raises:
radical.pilot.PilotException if a UnitManager with the given unit_manager_ids doesn’t exist in the database.
-
add_resource_config(resource_config)
Adds a new radical.pilot.ResourceConfig to the PilotManager’s dictionary of known resources, or accepts a string which points to a configuration file.
For example:
    rc = radical.pilot.ResourceConfig()
    rc.name = "mycluster"
    rc.job_manager_endpoint = "ssh+pbs://mycluster"
    rc.filesystem_endpoint = "sftp://mycluster"
    rc.default_queue = "private"
    rc.bootstrapper = "default_bootstrapper.sh"

    pm = radical.pilot.PilotManager(session=s)
    pm.add_resource_config(rc)

    pd = radical.pilot.ComputePilotDescription()
    pd.resource = "mycluster"
    pd.cores = 16
    pd.runtime = 5  # minutes

    pilot = pm.submit_pilots(pd)
-
12.2. Pilots and PilotManagers
12.2.1. PilotManagers
-
class radical.pilot.PilotManager(session, pilot_launcher_workers=1, _reconnect=False)
A PilotManager holds radical.pilot.ComputePilot instances that are submitted via the radical.pilot.PilotManager.submit_pilots() method.
It is possible to attach one or more resource configurations (see Using Local and Remote HPC Resources) to a PilotManager to outsource machine-specific configuration parameters to an external configuration file.
Each PilotManager has a unique identifier, radical.pilot.PilotManager.uid, that can be used to re-connect to a previously created PilotManager in a given radical.pilot.Session.
Example:
    s = radical.pilot.Session(database_url=dbURL)
    pm1 = radical.pilot.PilotManager(session=s, resource_configurations=RESCONF)
    # Re-connect via the 'get()' method.
    pm2 = radical.pilot.PilotManager.get(session=s, pilot_manager_id=pm1.uid)
    # pm1 and pm2 are pointing to the same PilotManager
    assert pm1.uid == pm2.uid
-
__init__(session, pilot_launcher_workers=1, _reconnect=False)
Creates a new PilotManager and attaches it to the session.
Note
The resource_configurations parameter (see Using Local and Remote HPC Resources) is currently mandatory for creating a new PilotManager instance.
Arguments:
session [radical.pilot.Session]: The session instance to use.
resource_configurations [string or list of strings]: A list of URLs pointing to resource configuration files (see Using Local and Remote HPC Resources). Currently file://, http:// and https:// URLs are supported.
If one or more resource_configurations are provided, Pilots submitted via this PilotManager can access the configuration entries in the files via the ComputePilotDescription. For example:
    pm = radical.pilot.PilotManager(session=s)
    pd = radical.pilot.ComputePilotDescription()
    pd.resource = "futuregrid.india"  # defined in futuregrid.json
    pd.cores = 16
    pd.runtime = 5  # minutes
    pilot = pm.submit_pilots(pd)
pilot_launcher_workers (int): The number of pilot launcher worker processes to start in the background.
Note
pilot_launcher_workers can be used to tune RADICAL-Pilot’s performance. However, you should only change the default values if you know what you are doing.
Returns:
- A new PilotManager object [radical.pilot.PilotManager].
- Raises:
-
close(terminate=True)
Shuts down the PilotManager and its background workers in a coordinated fashion.
Arguments:
- terminate [bool]: If set to True, all active pilots will get canceled (default: True).
-
submit_pilots(pilot_descriptions)
Submits one or more new radical.pilot.ComputePilot instances to a resource.
Returns:
- One or more radical.pilot.ComputePilot instances [list of radical.pilot.ComputePilot].
Raises:
-
list_pilots()
Lists the unique identifiers of all radical.pilot.ComputePilot instances associated with this PilotManager.
Returns:
- A list of radical.pilot.ComputePilot uids [string].
Raises:
-
get_pilots(pilot_ids=None)
Returns one or more radical.pilot.ComputePilot instances.
Arguments:
- pilot_ids [list of strings]: If pilot_ids is set, only the Pilots with the specified uids are returned. If pilot_ids is None, all Pilots are returned.
Returns:
- A list of radical.pilot.ComputePilot objects [list of radical.pilot.ComputePilot].
Raises:
-
wait_pilots(pilot_ids=None, state=['Done', 'Failed', 'Canceled'], timeout=None)
Returns when one or more radical.pilot.ComputePilots reach a specific state or when an optional timeout is reached.
If pilot_ids is None, wait_pilots returns when all Pilots reach the state defined in state.
Arguments:
pilot_ids [string or list of strings]: If pilot_ids is set, only the Pilots with the specified uids are considered. If pilot_ids is None (default), all Pilots are considered.
state [list of strings]: The state(s) that Pilots have to reach in order for the call to return.
By default wait_pilots waits for the Pilots to reach a terminal state, which can be one of the following:
radical.pilot.DONE
radical.pilot.FAILED
radical.pilot.CANCELED
timeout [float]: Optional timeout in seconds before the call returns regardless of whether the Pilots have reached the desired state or not. The default value None never times out.
Raises:
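The wait-until-state-or-timeout behaviour described for wait_pilots can be illustrated with a small polling loop. This is a sketch under assumptions, not RADICAL-Pilot's implementation: wait_for_state, get_state and poll_interval are hypothetical names, the state strings other than the documented defaults ('Done', 'Failed', 'Canceled') are made up for the demo, and returning the last observed state is a choice of this sketch since the reference does not document a return value.

```python
import time

# Default terminal states, taken from the documented signature.
TERMINAL_STATES = ("Done", "Failed", "Canceled")


def wait_for_state(get_state, state=TERMINAL_STATES, timeout=None, poll_interval=0.01):
    """Poll get_state() until it returns one of `state` or `timeout` seconds pass."""
    # timeout=None means "never time out", as in the documented default.
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        current = get_state()
        if current in state:
            return current
        if deadline is not None and time.monotonic() >= deadline:
            # Timed out: report whatever state was seen last.
            return current
        time.sleep(poll_interval)


# Demo: a fake pilot that walks through three states.
fake_states = iter(["Launching", "Active", "Done"])
final_state = wait_for_state(lambda: next(fake_states))
```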
-
cancel_pilots(pilot_ids=None)
Cancels one or more ComputePilots.
Arguments:
- pilot_ids [string or list of strings]: If pilot_ids is set, only the Pilots with the specified uids are canceled. If pilot_ids is None, all Pilots are canceled.
Raises:
-
register_callback(callback_function, callback_data=None)
Registers a new callback function with the PilotManager. Manager-level callbacks get called if any of the ComputePilots managed by the PilotManager change their state.
All callback functions need to have the same signature:
    def callback_func(obj, state, data)
where obj is a handle to the object that triggered the callback, state is the new state of that object, and data is the data passed on callback registration.
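To make the callback contract concrete, the following sketch registers a function with the documented (obj, state, data) signature against a stand-in object. CallbackRegistry and its notify method are hypothetical and exist only so the example runs without RADICAL-Pilot; the pilot id and state strings are illustrative values.

```python
class CallbackRegistry:
    """Hypothetical stand-in for a manager that stores and fires callbacks."""

    def __init__(self):
        self._callbacks = []

    def register_callback(self, callback_function, callback_data=None):
        # Keep the user data next to the function so it can be passed back later.
        self._callbacks.append((callback_function, callback_data))

    def notify(self, obj, state):
        # Invoke every callback with the documented (obj, state, data) signature.
        for func, data in self._callbacks:
            func(obj, state, data)


def record_state(obj, state, data):
    # `data` is whatever was handed over at registration time; here, a list.
    data.append((obj, state))


seen = []
registry = CallbackRegistry()
registry.register_callback(record_state, callback_data=seen)
registry.notify("pilot.0000", "Active")
registry.notify("pilot.0000", "Done")
```

Passing a mutable object as callback_data, as done here, is one simple way to collect state changes without global variables.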
-
12.2.2. ComputePilotDescription
-
class radical.pilot.ComputePilotDescription
A ComputePilotDescription object describes the requirements and properties of a radical.pilot.Pilot and is passed as a parameter to radical.pilot.PilotManager.submit_pilots() to instantiate a new pilot.
Note
A ComputePilotDescription MUST define at least resource and the number of cores to allocate on the target resource.
Example:
    pm = radical.pilot.PilotManager(session=s)
    pd = radical.pilot.ComputePilotDescription()
    pd.resource = "local.localhost"  # defined in futuregrid.json
    pd.cores = 16
    pd.runtime = 5  # minutes
    pilot = pm.submit_pilots(pd)
-
resource
[Type: string] [mandatory] The key of a resource configuration entry (see Using Local and Remote HPC Resources). If the key exists, the machine-specific configuration is loaded from the configuration once the ComputePilotDescription is passed to radical.pilot.PilotManager.submit_pilots(). If the key doesn’t exist, a radical.pilot.PilotException is raised.
-
access_schema
[Type: string] [optional] The key of an access mechanism to use. The valid access mechanisms are defined in the resource configurations, see Using Local and Remote HPC Resources. If none is specified, the first one defined there is used.
-
runtime
[Type: int] [mandatory] The maximum run time (wall-clock time) of the ComputePilot, in minutes.
-
sandbox
[Type: string] [optional] The working (“sandbox”) directory of the ComputePilot agent. If not set, it defaults to radical.pilot.sandbox in your home or login directory.
Warning
If you define a ComputePilot on an HPC cluster and you want to set sandbox manually, make sure that it points to a directory on a shared filesystem that can be reached from all compute nodes.
-
cores
[Type: int] [mandatory] The number of cores the pilot should allocate on the target resource.
-
memory
[Type: int] [optional] The amount of memory (in MB) the pilot should allocate on the target resource.
-
queue
[Type: string] [optional] The name of the job queue the pilot should get submitted to. If queue is also defined in the resource configuration (resource), the value set here overrides it.
-
project
[Type: string] [optional] The name of the project / allocation to charge for used CPU time. If project is also defined in the machine configuration (resource), the value set here overrides it.
-
cleanup
[Type: bool] [optional] If cleanup is set to True, the pilot will delete its entire sandbox upon termination. This includes individual ComputeUnit sandboxes and all generated output data. Only log files will remain in the sandbox directory.
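The attributes marked mandatory above (resource, runtime, cores) lend themselves to a quick pre-submission check. The sketch below uses a plain dict as a stand-in for a ComputePilotDescription; missing_mandatory is a hypothetical helper, and whether RADICAL-Pilot itself performs a check like this is not stated in this reference.

```python
# Attribute names marked [mandatory] in the attribute list above.
MANDATORY = ("resource", "runtime", "cores")


def missing_mandatory(description):
    """Return the mandatory attribute names that are unset in a plain dict."""
    return [key for key in MANDATORY if description.get(key) is None]


# A description without runtime is flagged; a complete one is not.
incomplete = {"resource": "local.localhost", "cores": 16}
complete = {"resource": "local.localhost", "cores": 2, "runtime": 5, "queue": "private"}
problems = missing_mandatory(incomplete)
```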
-
12.2.3. Pilots
-
class radical.pilot.ComputePilot
A ComputePilot represents a resource overlay on a local or remote resource.
Note
A ComputePilot cannot be created directly. The factory method radical.pilot.PilotManager.submit_pilots() has to be used instead.
Example:
    pm = radical.pilot.PilotManager(session=s)
    pd = radical.pilot.ComputePilotDescription()
    pd.resource = "local.localhost"
    pd.cores = 2
    pd.runtime = 5  # minutes
    pilot = pm.submit_pilots(pd)
-
uid
Returns the Pilot’s unique identifier.
The uid identifies the Pilot within the PilotManager and can be used to retrieve an existing Pilot.
- Returns:
- A unique identifier (string).
-
description
Returns the pilot description the pilot was started with.
-
sandbox
Returns the Pilot’s ‘sandbox’ / working directory URL.
- Returns:
- A URL string.
-
state
Returns the current state of the pilot.
-
state_history
Returns the complete state history of the pilot.
-
stdout
Returns the stdout of the pilot.
-
stderr
Returns the stderr of the pilot.
-
logfile
Returns the logfile of the pilot.
-
log
Returns the log of the pilot.
-
resource_detail
Returns the names of the nodes managed by the pilot.
-
pilot_manager
Returns the pilot manager object for this pilot.
-
unit_managers
Returns the unit manager object UIDs for this pilot.
-
units
Returns the units scheduled for this pilot.
-
submission_time
Returns the time the pilot was submitted.
-
start_time
Returns the time the pilot was started on the backend.
-
stop_time
Returns the time the pilot was stopped.
-
resource
Returns the resource.
-
register_callback(callback_func, callback_data=None)
Registers a callback function that is triggered every time the ComputePilot’s state changes.
All callback functions need to have the same signature:
    def callback_func(obj, state, data)
where obj is a handle to the object that triggered the callback, state is the new state of that object, and data is the data passed on callback registration.
-
wait(state=['Done', 'Failed', 'Canceled'], timeout=None)
Returns when the pilot reaches a specific state or when an optional timeout is reached.
Arguments:
state [list of strings]: The state(s) that the Pilot has to reach in order for the call to return.
By default wait waits for the Pilot to reach a terminal state, which can be one of the following:
radical.pilot.states.DONE
radical.pilot.states.FAILED
radical.pilot.states.CANCELED
timeout [float]: Optional timeout in seconds before the call returns regardless of whether the Pilot has reached the desired state or not. The default value None never times out.
Raises:
radical.pilot.PilotException if the state of the pilot cannot be determined.
12.3. ComputeUnits and UnitManagers
12.3.1. UnitManager
-
class radical.pilot.UnitManager(session, scheduler=None, input_transfer_workers=2, output_transfer_workers=2, _reconnect=False)
A UnitManager manages radical.pilot.ComputeUnit instances which represent the executable workload in RADICAL-Pilot. A UnitManager connects the ComputeUnits with one or more Pilot instances (which represent the workload executors in RADICAL-Pilot) and a scheduler which determines which ComputeUnit gets executed on which Pilot.
Each UnitManager has a unique identifier, radical.pilot.UnitManager.uid, that can be used to re-connect to a previously created UnitManager in a given radical.pilot.Session.
Example:
    s = radical.pilot.Session(database_url=DBURL)

    pm = radical.pilot.PilotManager(session=s)

    pd = radical.pilot.ComputePilotDescription()
    pd.resource = "futuregrid.alamo"
    pd.cores = 16

    p1 = pm.submit_pilots(pd)  # create first pilot with 16 cores
    p2 = pm.submit_pilots(pd)  # create second pilot with 16 cores

    # Create a workload of 128 '/bin/sleep' compute units
    compute_units = []
    for unit_count in range(0, 128):
        cu = radical.pilot.ComputeUnitDescription()
        cu.executable = "/bin/sleep"
        cu.arguments = ['60']
        compute_units.append(cu)

    # Combine the two pilots, the workload and a scheduler via
    # a UnitManager.
    um = radical.pilot.UnitManager(session=s, scheduler=radical.pilot.SCHED_ROUND_ROBIN)
    um.add_pilots([p1, p2])
    um.submit_units(compute_units)
-
__init__(session, scheduler=None, input_transfer_workers=2, output_transfer_workers=2, _reconnect=False)
Creates a new UnitManager and attaches it to the session.
Args:
- session (radical.pilot.Session): The session instance to use.
- scheduler (string): The name of the scheduler plug-in to use.
- input_transfer_workers (int): The number of input file transfer worker processes to launch in the background.
- output_transfer_workers (int): The number of output file transfer worker processes to launch in the background.
Note
input_transfer_workers and output_transfer_workers can be used to tune RADICAL-Pilot’s file transfer performance. However, you should only change the default values if you know what you are doing.
- Raises:
-
uid
Returns the unique id.
-
scheduler
Returns the scheduler name.
-
scheduler_details
Returns the scheduler logs.
-
add_pilots(pilots)
Associates one or more pilots with the unit manager.
Arguments:
- pilots [radical.pilot.ComputePilot or list of radical.pilot.ComputePilot]: The pilot objects that will be added to the unit manager.
Raises:
-
list_pilots()
Lists the UIDs of the pilots currently associated with the unit manager.
Returns:
- A list of radical.pilot.ComputePilot UIDs [string].
Raises:
-
get_pilots()
Gets the pilot instances currently associated with the unit manager.
Returns:
- A list of radical.pilot.ComputePilot instances.
Raises:
-
remove_pilots(pilot_ids, drain=True)
Disassociates one or more pilots from the unit manager.
TODO: Implement ‘drain’.
After a pilot has been removed from a unit manager, it won’t process any of the unit manager’s units anymore. Calling remove_pilots doesn’t stop the pilot itself.
Arguments:
- drain [boolean]: Determines what happens to the units which are managed by the removed pilot(s). If True (the default), all units currently assigned to the pilot are allowed to finish execution. If False, ACTIVE units will be canceled.
Raises:
-
list_units()
Returns the UIDs of the radical.pilot.ComputeUnit instances managed by this unit manager.
Returns:
- A list of radical.pilot.ComputeUnit UIDs [string].
-
submit_units(unit_descriptions)
Submits one or more radical.pilot.ComputeUnit instances to the unit manager.
Arguments:
- unit_descriptions [radical.pilot.ComputeUnitDescription or list of radical.pilot.ComputeUnitDescription]: The description of the compute unit instance(s) to create.
Returns:
- A list of radical.pilot.ComputeUnit objects.
Raises:
-
get_units(unit_ids=None)
Returns one or more compute units identified by their IDs.
Arguments:
- unit_ids [string or list of strings]: The IDs of the compute unit objects to return.
Returns:
- A list of radical.pilot.ComputeUnit objects.
Raises:
-
wait_units(unit_ids=None, state=['Done', 'Failed', 'Canceled'], timeout=None)
Returns when one or more radical.pilot.ComputeUnits reach a specific state.
If unit_ids is None, wait_units returns when all ComputeUnits reach the state defined in state.
Example:
    # TODO -- add example
Arguments:
unit_ids [string or list of strings]: If unit_ids is set, only the ComputeUnits with the specified uids are considered. If unit_ids is None (default), all ComputeUnits are considered.
state [list of strings]: The state(s) that ComputeUnits have to reach in order for the call to return.
By default wait_units waits for the ComputeUnits to reach a terminal state, which can be one of the following:
radical.pilot.DONE
radical.pilot.FAILED
radical.pilot.CANCELED
timeout [float]: Timeout in seconds before the call returns regardless of ComputeUnit state changes. The default value None waits forever.
Raises:
-
cancel_units(unit_ids=None)
Cancels one or more radical.pilot.ComputeUnits.
Arguments:
- unit_ids [string or list of strings]: The IDs of the compute unit objects to cancel.
Raises:
-
register_callback(callback_function, metric='UNIT_STATE', callback_data=None)
Registers a new callback function with the UnitManager. Manager-level callbacks get called if the specified metric changes. The default metric UNIT_STATE fires the callback if any of the ComputeUnits managed by the UnitManager change their state.
All callback functions need to have the same signature:
    def callback_func(obj, value, data)
where obj is a handle to the object that triggered the callback, value is the new value of the metric, and data is the data provided on callback registration. In the example of UNIT_STATE above, the object would be the unit in question, and the value would be the new state of the unit.
Available metrics are:
- UNIT_STATE: fires when the state of any of the units which are managed by this unit manager instance changes. It communicates the unit object instance and the unit’s new state.
- WAIT_QUEUE_SIZE: fires when the number of unscheduled units (i.e. units which have not been assigned to a pilot for execution) changes.
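Metric-based registration can be pictured as one callback list per metric. The sketch below is a stand-in written for illustration only: MetricCallbacks and fire are hypothetical, the object handles ("unit.0001", "umgr.0000") are made-up values, and what RADICAL-Pilot passes as obj for WAIT_QUEUE_SIZE is an assumption since the reference does not say.

```python
from collections import defaultdict

# Metric names as listed in the reference.
UNIT_STATE = "UNIT_STATE"
WAIT_QUEUE_SIZE = "WAIT_QUEUE_SIZE"


class MetricCallbacks:
    """Hypothetical stand-in that keeps one callback list per metric."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register_callback(self, callback_function, metric=UNIT_STATE, callback_data=None):
        if metric not in (UNIT_STATE, WAIT_QUEUE_SIZE):
            raise ValueError("unknown metric: %r" % (metric,))
        self._callbacks[metric].append((callback_function, callback_data))

    def fire(self, metric, obj, value):
        # Only callbacks registered for this metric are invoked,
        # using the documented (obj, value, data) signature.
        for func, data in self._callbacks[metric]:
            func(obj, value, data)


events = []


def on_change(obj, value, data):
    events.append((data, obj, value))


cbs = MetricCallbacks()
cbs.register_callback(on_change, callback_data="state")
cbs.register_callback(on_change, metric=WAIT_QUEUE_SIZE, callback_data="queue")
cbs.fire(UNIT_STATE, "unit.0001", "Done")
cbs.fire(WAIT_QUEUE_SIZE, "umgr.0000", 3)
```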
-
12.3.2. ComputeUnitDescription
-
class radical.pilot.ComputeUnitDescription
A ComputeUnitDescription object describes the requirements and properties of a radical.pilot.ComputeUnit and is passed as a parameter to radical.pilot.UnitManager.submit_units() to instantiate and run a new ComputeUnit.
Note
A ComputeUnitDescription MUST define at least an executable.
Example:
    # TODO
-
executable
(Attribute) The executable to launch (string) [mandatory].
-
cores
(Attribute) The number of cores required by the executable (int) [mandatory].
-
mpi
(Attribute) Set to true if the task is an MPI task (bool) [optional].
-
name
(Attribute) A descriptive name for the compute unit (string) [optional].
-
arguments
(Attribute) The arguments for executable (list of strings) [optional].
-
environment
(Attribute) Environment variables to set in the execution environment (dict) [optional].
-
stdout
(Attribute) The name of the file to store stdout in.
-
stderr
(Attribute) The name of the file to store stderr in.
-
input_staging
(Attribute) The files that need to be staged before execution (list of staging directives) [optional].
Note
TODO: Explain input staging.
-
output_staging
(Attribute) The files that need to be staged after execution (list of staging directives) [optional].
Note
TODO: Explain output staging.
-
pre_exec
(Attribute) Actions to perform before this task starts (list of strings) [optional].
-
post_exec
(Attribute) Actions to perform after this task finishes (list of strings) [optional].
Note
Before the BigBang, there was nothing ...
-
kernel
(Attribute) Name of a simulation kernel which expands to description attributes once the unit is scheduled to a pilot (and resource).
Note
TODO: explain in detail, reference ENMDTK.
-
restartable
(Attribute) If the unit starts to execute on a pilot, but cannot finish because the pilot fails or is canceled, can the unit be restarted on a different pilot / resource? (default: False)
Note
TODO: explain in detail, reference ENMDTK.
-
cleanup
[Type: bool] [optional] If cleanup is set to True, the pilot will delete the entire unit sandbox upon termination. This includes all generated output data in that sandbox. Output staging will be performed before cleanup.
-
12.3.3. ComputeUnit
-
class radical.pilot.ComputeUnit
A ComputeUnit represents a ‘task’ that is executed on a ComputePilot. ComputeUnits allow you to control and query the state of this task.
Note
A ComputeUnit cannot be created directly. The factory method radical.pilot.UnitManager.submit_units() has to be used instead.
Example:
    umgr = radical.pilot.UnitManager(session=s)
    ud = radical.pilot.ComputeUnitDescription()
    ud.executable = "/bin/date"
    ud.cores = 1
    unit = umgr.submit_units(ud)
-
uid
Returns the unit’s unique identifier.
The uid identifies the ComputeUnit within a UnitManager and can be used to retrieve an existing ComputeUnit.
- Returns:
- A unique identifier (string).
-
name
Returns the unit’s application-specified name.
- Returns:
- A name (string).
-
working_directory
Returns the full working directory URL of this ComputeUnit.
-
pilot_id
Returns the pilot_id of this ComputeUnit.
-
stdout
Returns a snapshot of the executable’s STDOUT stream.
If this property is queried before the ComputeUnit has reached ‘DONE’ or ‘FAILED’ state it will return None.
-
stderr
Returns a snapshot of the executable’s STDERR stream.
If this property is queried before the ComputeUnit has reached ‘DONE’ or ‘FAILED’ state it will return None.
-
description
Returns the ComputeUnitDescription the ComputeUnit was started with.
-
state
Returns the current state of the ComputeUnit.
-
state_history
Returns the complete state history of the ComputeUnit.
-
exit_code
Returns the exit code of the ComputeUnit.
If this property is queried before the ComputeUnit has reached ‘DONE’ or ‘FAILED’ state it will return None.
-
log
Returns the logs of the ComputeUnit.
-
execution_details
Returns the execution location(s) of the ComputeUnit.
-
execution_locations
Returns the execution location(s) of the ComputeUnit. This is just an alias for execution_details.
-
submission_time
Returns the time the ComputeUnit was submitted.
-
start_time
Returns the time the ComputeUnit was started on the backend.
-
stop_time
Returns the time the ComputeUnit was stopped.
-
register_callback(callback_func, callback_data=None)
Registers a callback function that is triggered every time the ComputeUnit’s state changes.
All callback functions need to have the same signature:
    def callback_func(obj, state)
where obj is a handle to the object that triggered the callback and state is the new state of that object.
-
wait(state=['Done', 'Failed', 'Canceled'], timeout=None)
Returns when the ComputeUnit reaches a specific state or when an optional timeout is reached.
Arguments:
state [list of strings]: The state(s) that the compute unit has to reach in order for the call to return.
By default wait waits for the compute unit to reach a terminal state, which can be one of the following:
radical.pilot.states.DONE
radical.pilot.states.FAILED
radical.pilot.states.CANCELED
timeout [float]: Optional timeout in seconds before the call returns regardless of whether the compute unit has reached the desired state or not. The default value None never times out.
Raises:
-
12.4. Exceptions
-
class radical.pilot.PilotException(msg, obj=None)
The base class for all RADICAL-Pilot Exception classes. This exception type is never raised directly, but can be used to catch all RADICAL-Pilot exceptions within a single except clause.
The exception message and originating object are also accessible as class attributes (e.object() and e.message()). The __str__() operator redirects to get_message().
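Catching every RADICAL-Pilot error through the base class can be sketched with stand-in classes. Everything here is an assumption made so the example runs on its own: the two classes only borrow the names PilotException and IncorrectState from this reference, the subclass relationship is inferred from the "base class for all" wording, and message and object are modelled as plain attributes rather than the accessor calls mentioned above.

```python
class PilotException(Exception):
    """Stand-in base class, modelled on the documented (msg, obj=None) signature."""

    def __init__(self, msg, obj=None):
        super().__init__(msg)
        self.message = msg  # assumption: plain attribute in this sketch
        self.object = obj   # assumption: plain attribute in this sketch


class IncorrectState(PilotException):
    """Stand-in subclass named after the documented radical.pilot.IncorrectState."""


def close_closed_session():
    # Hypothetical failure: acting on a session that is already closed.
    raise IncorrectState("session is closed", obj="session.0001")


try:
    close_closed_session()
except PilotException as e:
    # A single except clause on the base class catches the subclass too.
    caught = (type(e).__name__, str(e), e.object)
```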
12.5. State Models
12.5.1. ComputeUnit State Model
12.5.2. ComputePilot State Model
- A new compute pilot is launched via radical.pilot.PilotManager.submit_pilots().
- The pilot is submitted to the remote resource and enters LAUNCHING state.
- The pilot has been successfully launched on the remote machine and is now waiting to become ACTIVE.
- The pilot has been launched by the queueing system and is now in ACTIVE state.
- The pilot has finished execution regularly and enters DONE state.
- An error has occurred during preparation for pilot launching and the pilot enters FAILED state.
- An error has occurred during pilot launching and the pilot enters FAILED state.
- An error has occurred on the backend, the pilot couldn’t become active, and it enters FAILED state.
- An error has occurred during pilot runtime and the pilot enters FAILED state.
- The active pilot has been canceled via the radical.pilot.ComputePilot.cancel() call and enters CANCELED state.