.. _chapter_example_gettinstarted:

***************
Getting Started
***************

**This is where you should start if you are new to RADICAL-Pilot. It is highly
recommended that you carefully read and understand all of this before you go
off and start developing your own applications.**

In this chapter we explain the main components of RADICAL-Pilot and the
foundations of their function and their interplay. For your convenience, you
can find a fully working example at the end of this page.

After you have worked through this chapter, you will understand how to launch
a local ComputePilot and use a UnitManager to schedule and run ComputeUnits
(tasks) on it. Throughout this chapter you will also find links to more
advanced topics like launching ComputePilots on remote HPC clusters and
scheduling.

.. note:: This chapter assumes that you have successfully installed
          RADICAL-Pilot (see chapter :ref:`chapter_installation`).

Loading the Module
------------------

In order to use RADICAL-Pilot in your Python application, you need to import
the ``radical.pilot`` module.

.. code-block:: python

    import radical.pilot

You can check the version of your RADICAL-Pilot installation via the
``version`` property.

.. code-block:: python

    print(radical.pilot.version)

Creating a Session
------------------

A :class:`radical.pilot.Session` is the root object for all other objects in
RADICAL-Pilot. You can think of it as a *tree* or a *directory structure* with
a Session as root. Each Session can have zero or more
:class:`radical.pilot.Context`, :class:`radical.pilot.PilotManager` and
:class:`radical.pilot.UnitManager` objects attached to it.

.. code-block:: text

     (~~~~~~~~~)
     (         ) <---- [Session]
     ( MongoDB )       |
     (         )       |---- Context
     (_________)       |---- ....
                       |
                       |---- [PilotManager]
                       |     |
                       |     |---- ComputePilot
                       |     |---- ComputePilot
                       |
                       |---- [UnitManager]
                       |     |
                       |     |---- ComputeUnit
                       |     |---- ComputeUnit
                       |     |....
                       |
                       |---- [UnitManager]
                       |     |
                       |     |....
                       |
                       |....
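
To make the tree picture concrete, the following toy model expresses the same
hierarchy with plain Python dictionaries and lists. It is only an illustration:
it does not use any RADICAL-Pilot classes, and the ``walk`` helper and the
dictionary layout are assumptions made for this sketch.

.. code-block:: python

    # Toy model of the hierarchy above, using plain dictionaries and lists.
    # None of these names are RADICAL-Pilot classes or calls.
    session = {
        "Session": {
            "PilotManager": ["ComputePilot", "ComputePilot"],
            "UnitManager": ["ComputeUnit", "ComputeUnit"],
        }
    }

    def walk(node, depth=0):
        """Return one indented line per entry, parents before children."""
        lines = []
        if isinstance(node, dict):
            for name, children in node.items():
                lines.append("    " * depth + name)
                lines.extend(walk(children, depth + 1))
        else:
            # A list holds leaf entries (pilots or units).
            for name in node:
                lines.append("    " * depth + name)
        return lines

    tree_lines = walk(session)
    print("\n".join(tree_lines))

Traversing the real objects works along the same lines: you start at the
Session and descend into its managers and their members.
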
A Session also encapsulates the connection(s) to a back-end MongoDB server,
which is the *brain* and *central nervous system* of RADICAL-Pilot. More
information about how RADICAL-Pilot uses MongoDB can be found in the
:ref:`chapter_intro` section.

To create a new Session, the only thing you need to provide is the URL of a
MongoDB server:

.. code-block:: python

    session = radical.pilot.Session(database_url="mongodb://my-mongodb-server.edu:27017")

Each Session has a unique identifier (``uid``) and methods to traverse its
members. The Session ``uid`` can be used to disconnect from and reconnect to a
Session as required. This is covered in
:ref:`chapter_example_disconnect_reconnect`.

.. code-block:: python

    print("UID           : %s" % session.uid)
    print("Contexts      : %s" % session.list_contexts())
    print("UnitManagers  : %s" % session.list_unit_managers())
    print("PilotManagers : %s" % session.list_pilot_managers())

.. warning:: Always call :func:`radical.pilot.Session.close` before your
             application terminates. This will ensure that RADICAL-Pilot
             shuts down properly.

Creating a ComputePilot
-----------------------

A :class:`radical.pilot.ComputePilot` is responsible for ComputeUnit (task)
execution. ComputePilots can be launched either locally or remotely, on a
single machine or on one or more HPC clusters. In this example we only use a
local ComputePilot. Remote ComputePilots and how to launch them on HPC
clusters are covered in :ref:`chapter_example_remote_and_hpc_pilots`.

As shown in the hierarchy above, ComputePilots are grouped in
:class:`radical.pilot.PilotManager` *containers*, so before you can launch a
ComputePilot, you need to add a PilotManager to your Session. Just like a
Session, a PilotManager has a unique id (``uid``) as well as a traversal
method (``list_pilots``).

.. code-block:: python

    pmgr = radical.pilot.PilotManager(session=session)
    print("PM UID        : %s" % pmgr.uid)
    print("Pilots        : %s" % pmgr.list_pilots())

In order to create a new ComputePilot, you first need to describe its
requirements and properties. This is done with the help of a
:class:`radical.pilot.ComputePilotDescription` object. The mandatory
properties that you need to define are:

* ``resource`` - The name (hostname) of the target system or ``localhost`` to
  launch a local ComputePilot.
* ``runtime`` - The runtime (in minutes) of the ComputePilot agent.
* ``cores`` - The number of cores the ComputePilot agent will try to allocate.

You can define a 2-core local pilot that runs for 5 minutes like this:

.. code-block:: python

    pdesc = radical.pilot.ComputePilotDescription()
    pdesc.resource = "local.localhost"
    pdesc.runtime  = 5 # minutes
    pdesc.cores    = 2

A ComputePilot is launched by passing the ComputePilotDescription to the
``submit_pilots()`` method of the PilotManager. This automatically adds the
ComputePilot to the PilotManager. Like any other object in RADICAL-Pilot, a
ComputePilot also has a unique identifier (``uid``).

.. code-block:: python

    pilot = pmgr.submit_pilots(pdesc)
    print("Pilot UID     : %s" % pilot.uid)

.. warning:: Note that ``submit_pilots()`` is a non-blocking call and that
             the submitted ComputePilot agent **will not terminate** when
             your Python script finishes. ComputePilot agents terminate only
             after they have reached their ``runtime`` limit or if you call
             :func:`radical.pilot.PilotManager.cancel_pilots` or
             :func:`radical.pilot.ComputePilot.cancel`.

.. note:: You can change to the ComputePilot sandbox directory
          (``/tmp/radical.pilot.sandbox`` in the above example) to see the
          raw logs and output files of the ComputePilot agent(s)
          ``[pilot-]`` as well as the working directories and output of the
          individual ComputeUnits (``[task-]``).

.. code-block:: text

    [//]
    |
    |----[pilot-/]
    |    |
    |    |---- STDERR
    |    |---- STDOUT
    |    |---- AGENT.LOG
    |    |---- [task-/]
    |    |---- [task-/]
    |    |....
    |
    |....

*Knowing where to find these files might come in handy for debugging purposes
but it is not required for regular RADICAL-Pilot usage.*

Creating ComputeUnits (Tasks)
-----------------------------

After you have launched a ComputePilot, you can generate a few
:class:`radical.pilot.ComputeUnit` objects for the ComputePilot to execute.
You can think of a ComputeUnit as something very similar to an operating
system process that consists of an ``executable``, a list of ``arguments``,
and an ``environment`` along with some runtime requirements.

Analogous to ComputePilots, a ComputeUnit is described via a
:class:`radical.pilot.ComputeUnitDescription` object. The mandatory
properties that you need to define are:

* ``executable`` - The executable to launch.
* ``arguments`` - The arguments to pass to the executable.
* ``cores`` - The number of cores required by the executable.

For example, you can create a workload of 8 ``/bin/sleep`` ComputeUnits like
this:

.. code-block:: python

    compute_units = []

    for unit_count in range(0, 8):
        cu = radical.pilot.ComputeUnitDescription()
        cu.environment = {"SLEEP_TIME" : "10"}
        cu.executable  = "/bin/sleep"
        cu.arguments   = ["$SLEEP_TIME"]
        cu.cores       = 1

        compute_units.append(cu)

.. note:: The example above uses a single executable that requires only one
          core. It is however possible to run multiple commands in one
          ComputeUnit. This is described in
          :ref:`chapter_example_multiple_commands`. If you want to run
          multi-core executables, for example MPI programs, check out
          :ref:`chapter_example_multicore`.

Input- / Output-File Transfer
-----------------------------

Often, a computational task doesn't just consist of an executable with some
arguments but also needs some input data.
For this reason, a :class:`radical.pilot.ComputeUnitDescription` allows the
definition of ``input_staging`` and ``output_staging``:

* ``input_staging`` defines a list of local files that need to be transferred
  to the execution resource before a ComputeUnit can start running.
* ``output_staging`` defines a list of remote files that need to be
  transferred back to the local machine after a ComputeUnit has finished
  execution.

See :ref:`chapter_data_staging` for more information on data staging.

Furthermore, a ComputeUnit provides two properties,
:data:`radical.pilot.ComputeUnit.stdout` and
:data:`radical.pilot.ComputeUnit.stderr`, that can be used to access a
ComputeUnit's STDOUT and STDERR files after it has finished execution.

Example:

.. code-block:: python

    cu = radical.pilot.ComputeUnitDescription()
    cu.executable    = "/bin/cat"
    cu.arguments     = ["file1.dat", "file2.dat"]
    cu.cores         = 1
    cu.input_staging = ["./file1.dat", "./file2.dat"]

Adding Callbacks
----------------

Events in RADICAL-Pilot are mostly asynchronous as they happen at one or more
distributed components, namely the ComputePilot agents. At any time during
the execution of a workload, ComputePilots and ComputeUnits can begin or
finish execution or fail with an error.

RADICAL-Pilot provides callbacks as a method to react to these events
asynchronously when they occur. ComputePilots, PilotManagers, ComputeUnits
and UnitManagers all have a ``register_callback`` method:

* :func:`radical.pilot.UnitManager.register_callback`
* :func:`radical.pilot.PilotManager.register_callback`
* :func:`radical.pilot.ComputePilot.register_callback`
* :func:`radical.pilot.ComputeUnit.register_callback`

A simple callback that prints the state of all pilots would look something
like this:

.. code-block:: python

    def pilot_state_cb(pilot, state):
        print("[Callback]: ComputePilot '%s' state changed to '%s'." % (pilot.uid, state))

    pmgr = radical.pilot.PilotManager(session=session)
    pmgr.register_callback(pilot_state_cb)

.. note:: Using callbacks can greatly improve the performance of an
          application since it removes the need for global / blocking
          ``wait()`` calls and state polling. More about callbacks can be
          found in :ref:`chapter_programming_with_callbacks`.

Scheduling ComputeUnits
-----------------------

In the previous steps we have created and launched a ComputePilot (via a
PilotManager) and created a list of ComputeUnitDescriptions. In order to put
it all together and execute the ComputeUnits on the ComputePilot, we need to
create a :class:`radical.pilot.UnitManager` instance.

As shown in the diagram below, a UnitManager combines three things: the
ComputeUnits, added via :func:`radical.pilot.UnitManager.submit_units`, one
or more ComputePilots, added via :func:`radical.pilot.UnitManager.add_pilots`,
and a scheduler (see :ref:`chapter_schedulers`). Once instantiated, a
UnitManager assigns each submitted CU to one of its ComputePilots based on
the selected scheduling algorithm.

.. code-block:: text

      +----+  +----+  +----+  +----+       +----+
      | CU |  | CU |  | CU |  | CU |  ...  | CU |
      +----+  +----+  +----+  +----+       +----+
         |       |       |       |            |
         |_______|_______|_______|____________|
                         |
                         v submit_units()
                 +---------------+
                 |  UnitManager  |
                 |---------------|
                 |               |
                 |               |
                 +---------------+
                         ^ add_pilots()
                         |
               __________|___________
               |         |          |
             +~~~~+    +~~~~+     +~~~~+
             | CP |    | CP | ... | CP |
             +~~~~+    +~~~~+     +~~~~+

Since we have only one ComputePilot, we don't need any specific scheduling
algorithm for our example. We choose ``SCHED_DIRECT_SUBMISSION``, which simply
passes the ComputeUnits on to the ComputePilot.

.. code-block:: python

    umgr = radical.pilot.UnitManager(session=session,
                                     scheduler=radical.pilot.SCHED_DIRECT_SUBMISSION)

    umgr.add_pilots(pilot)
    umgr.submit_units(compute_units)

    umgr.wait_units()

The :func:`radical.pilot.UnitManager.wait_units` call blocks until all
ComputeUnits have been executed by the UnitManager.
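
The idea of direct submission, where every unit is handed to the one available
pilot, can be pictured with a small toy model. This is a sketch only, not
RADICAL-Pilot's implementation: ``ToyPilot`` and ``direct_submission`` are
hypothetical names, units are represented by plain strings, and raising
``ValueError`` when the number of pilots is not exactly one is an assumption
of this sketch.

.. code-block:: python

    class ToyPilot(object):
        """Hypothetical stand-in for a pilot; it only records assigned units."""

        def __init__(self, uid):
            self.uid = uid
            self.units = []

    def direct_submission(units, pilots):
        """Assign every unit to the single pilot and return unit -> pilot uid."""
        if len(pilots) != 1:
            # Assumption of this sketch: direct submission needs exactly one pilot.
            raise ValueError("direct submission expects exactly one pilot")
        pilot = pilots[0]
        pilot.units.extend(units)
        return dict((unit, pilot.uid) for unit in units)

    toy_pilot = ToyPilot("pilot-0")
    assignment = direct_submission(["cu-0", "cu-1", "cu-2"], [toy_pilot])
    print(assignment)

A scheduler that serves several pilots would replace the body of
``direct_submission`` with a policy that picks a pilot per unit.
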
Simple control flows / dependencies can be realized with ``wait_units()``.
For more complex control flows, however, it can become inefficient due to its
blocking nature. To address this, RADICAL-Pilot also provides mechanisms for
asynchronous notifications and callbacks. This is discussed in more detail in
:ref:`chapter_example_async`.

.. note:: The ``SCHED_DIRECT_SUBMISSION`` scheduler only works with a single
          ComputePilot. If you add more than one ComputePilot to a
          UnitManager, you will end up with an error. If you want to use
          RADICAL-Pilot to run multiple ComputePilots concurrently, possibly
          on different machines, check out
          :ref:`chapter_example_remote_and_hpc_pilots`.

Results and Inspection
----------------------

Once ``wait_units()`` has returned, you can inspect the state and the state
history of each ComputeUnit:

.. code-block:: python

    for unit in umgr.get_units():
        print("unit id  : %s" % unit.uid)
        print("  state  : %s" % unit.state)
        print("  history:")
        for entry in unit.state_history:
            print("    %s : %s" % (entry.timestamp, entry.state))

Cleanup and Shutdown
--------------------

When your application has finished executing all ComputeUnits, it should make
an attempt to cancel the ComputePilot. If a ComputePilot is not canceled, it
will continue running until it reaches its ``runtime`` limit, even if the
application has terminated.

An individual ComputePilot is canceled by calling
:func:`radical.pilot.ComputePilot.cancel`. Alternatively, all ComputePilots
of a PilotManager can be canceled by calling
:func:`radical.pilot.PilotManager.cancel_pilots`.

.. code-block:: python

    pmgr.cancel_pilots()

Before your application terminates, you should always call
:func:`radical.pilot.Session.close` to ensure that your RADICAL-Pilot session
terminates properly. If you have not explicitly canceled the pilots
beforehand, ``close()`` will take care of that implicitly (control this via
the ``terminate`` parameter). ``close()`` will also delete all traces of the
session from the database (control this with the ``cleanup`` parameter).

.. code-block:: python

    session.close(cleanup=True, terminate=True)

What's Next?
------------

Now that you understand the basic mechanics of RADICAL-Pilot, it's time to
dive into some of the more advanced topics. We suggest that you check out the
following chapters next:

* :ref:`chapter_example_errorhandling`. Error handling is crucial for any
  RADICAL-Pilot application! This chapter captures everything from exception
  handling to state callbacks.
* :ref:`chapter_example_remote_and_hpc_pilots`. In this chapter we explain
  how to launch ComputePilots on remote HPC clusters, something you most
  definitely want to do.
* :ref:`chapter_example_disconnect_reconnect`. This chapter is very useful,
  for example, if you work with long-running tasks that don't need continuous
  supervision.

The Complete Example
--------------------

Below is a complete and working example that puts together everything we
discussed in this section. You can download the sources from
:download:`here <../../../examples/getting_started_local.py>`.

.. literalinclude:: ../../../examples/getting_started_local.py