.. _chapter_data_staging:

************
Data Staging
************

.. note:: Currently RADICAL-Pilot only supports data on file abstraction
          level, so `data == files` at this moment.

Many, if not all, programs require input data to operate and create output
data as a result in some form or shape.
RADICAL-Pilot has a set of constructs that allows the user to specify the
required staging of input and output files for a Compute Unit.

The primary constructs are on the level of the Compute Unit (Description)
which are discussed in the next section.
For more elaborate use-cases we also have constructs on the Compute Pilot
level, which are discussed later in this chapter.

.. note:: RADICAL-Pilot uses system calls for local file operations and SAGA for
          remote transfers and URL specification.


Compute Unit I/O
================

To instruct RADICAL-Pilot to handle files for you, there are two things to
take care of.
First you need to specify the respective input and output files for the
Compute Unit in so called `staging directives`.
Additionally you need to associate these `staging directives` to the the
Compute Unit by means of the ``input_staging`` and ``output_staging`` members.

What it looks like
------------------

The following code snippet shows this in action:

.. code-block:: python

    INPUT_FILE_NAME  = "INPUT_FILE.TXT"
    OUTPUT_FILE_NAME = "OUTPUT_FILE.TXT"

    # This executes: "/usr/bin/sort -o OUTPUT_FILE.TXT INPUT_FILE.TXT"
    cud = radical.pilot.ComputeUnitDescription()
    cud.executable = "/usr/bin/sort"
    cud.arguments = ["-o", OUTPUT_FILE_NAME, INPUT_FILE_NAME]
    cud.input_staging  = INPUT_FILE_NAME
    cud.output_staging = OUTPUT_FILE_NAME

Here the `staging directives` ``INPUT_FILE_NAME`` and ``OUTPUT_FILE_NAME``
are simple strings that both specify a single filename and are associated to
the Compute Unit Description ``cud`` for input and output respectively.

What this does is that the file `INPUT_FILE.TXT` is transferred from the local
directory to the directory where the task is executed. After the task has run,
the file `OUTPUT_FILE.TXT` that has been created by the task, will be
transferred back to the local directory.

The :ref:`example-string` example demonstrates this in full glory.

Staging Directives
------------------

The format of the `staging directives` can either be a string as above or a
dict of the following structure:

.. code-block:: python

    staging_directive = {
        'source':   source,   # radical.pilot.Url() or string (MANDATORY).
        'target':   target,   # radical.pilot.Url() or string (OPTIONAL).
        'action':   action,   # One of COPY, LINK, MOVE, TRANSFER or TARBALL (OPTIONAL).
        'flags':    flags,    # Zero or more of CREATE_PARENTS or SKIP_FAILED (OPTIONAL).
        'priority': priority  # A number to instruct ordering (OPTIONAL).
    }

The semantics of the keys from the dict are as follows:

- ``source`` (default: None) and ``target`` (default: os.path.basename(source)):
    In case of the `staging directive` being used for *input*,
    then the ``source`` refers to the location to get the input files
    from,  e.g. the local working directory on your laptop or a remote
    data repository, and ``target`` refers to the working directory of
    the ComputeUnit. Alternatively, in case of the `staging directive`
    being used for *output*, then the ``source`` refers to the output
    files being generated by the ComputeUnit in the working directory
    and ``target`` refers to the location where you need to store the
    output data, e.g. back to your laptop or some remote data repository.

- ``action`` (default: TRANSFER):
    The *ultimate* goal is to make data available to the application
    kernel in the ComputeUnit and to be able to make the results
    available for further use.
    Depending on the relative location of the working directory of the
    ``source`` to the ``target`` location, the action can be ``COPY``
    (local resource), ``LINK`` (same file system),
    ``MOVE`` (local resource), ``TRANSFER`` (to a remote resource), or 
    ``TARBALL`` (transfer to a remote resource after tarring files).

- ``flags`` (default: [CREATE_PARENTS, SKIP_FAILED]):
    By passing certain flags we can influence the behavior of the action.
    Available flags are:

    - ``CREATE_PARENTS``: Create parent directories while writing file.
    - ``SKIP_FAILED``: Don't stage out files if tasks failed.

    In case of multiple values these can be passed as a list.

- ``priority`` (default: 0):
    This optional field can be used to instruct the backend to priority
    the actions on the ``staging directives``. E.g. to first stage the
    output that is required for immediate further analysis and afterwards
    some output files that are of secondary concern.

The :ref:`example-dict` example demonstrates this in full glory.

When the `staging directives` are specified as a string as we did earlier,
that implies a `staging directive` where the ``source`` and the ``target``
are equal to the content of the string, the ``action`` is set to the default
action ``TRANSFER``, the ``flags`` are set to the default flags
``CREATE_PARENTS`` and ``SKIP_FAILED``, and the ``priority`` is set to the
default value ``0``:

.. code-block:: python

    'INPUT_FILE.TXT' == {
        'source':   'INPUT_FILE.TXT',
        'target':   'INPUT_FILE.TXT',
        'action':   TRANSFER,
        'flags':    [CREATE_PARENTS, SKIP_FAILED],
        'priority': 0
    }


.. _staging-area:

Staging Area
------------

As the pilot job creates an abstraction for a computational resource,
the user does not necessarily know where the working directory of the Compute
Pilot or the Compute Unit is.
Even if he knows, the user might not have direct access to it.
For this situation we have the staging area, which is a special construct so
that the user can specify files relative to or in the working directory
without knowing the exact location. This can be done using the following
URL format:

.. code-block:: python

    'staging:///INPUT_FILE.TXT'

The :ref:`example-pipeline` example demonstrates this in full glory.


Compute Pilot I/O
=================

As mentioned earlier, in addition to the constructs on Compute Unit-level
RADICAL-Pilot also has constructs on Compute Pilot-level.
The main rationale for this is that often there is (input) data to be shared
between multiple Compute Units.
Instead of transferring the same files for every Compute Unit,
we can transfer the data once to the Pilot, and then make it available to
every Compute Unit that needs it.

This works in a similar way as the Compute Unit-IO, where we use also use
the Staging Directive to specify the I/O transaction
The difference is that in this case, the Staging Directive is not associated
to the Description, but used in a direct method call ``pilot.stage_in(sd_pilot)``.

.. code-block:: python

    # Configure the staging directive for to insert the shared file into
    # the pilot staging directory.
    sd_pilot = {'source': shared_input_file_url,
                'target': os.path.join(MY_STAGING_AREA, SHARED_INPUT_FILE),
                'action': radical.pilot.TRANSFER
    }
    # Synchronously stage the data to the pilot
    pilot.stage_in(sd_pilot)

The :ref:`example-shared` example demonstrates this in full glory.

.. note:: The call to ``stage_in()`` is synchronous and will return once the
          transfer is complete.


Examples
========

.. note:: All of the following examples are configured to run on localhost,
          but they can be easily changed to run on a remote resource by
          modifying the resource specification in the Compute Pilot Description.
          Also note the comments in :ref:`staging-area` when changing the
          examples to a remote target.

These examples require an installation of RADICAL-Pilot of course.
There are download links for each of the examples.


.. _example-string:

String-Based Input and Output Transfer
--------------------------------------

This example demonstrates the simplest form of the data staging capabilities.
The example demonstrates how a local input file is staged through
RADICAL-Pilot, processed by the Compute Unit and the resulting output file
is staged back to the local environment.

.. note::
    Download the example:

    ``curl -O https://raw.githubusercontent.com/radical-cybertools/radical.pilot/readthedocs/examples/io_staging_simple.py``

.. literalinclude:: ../../examples/io_staging_simple.py


.. _example-dict:

Dictionary-Based Input and Output Transfer
------------------------------------------

This example demonstrates the use of the staging directives structure to
have more control over the staging behavior.
The flow of the example is similar to that of the previous example,
but here we show that by using the dict-based Staging Directive,
one can specify different names and paths for the local and remote files,
a feature that is often required in real-world applications.

.. note::
    Download the example:

    ``curl -O https://raw.githubusercontent.com/radical-cybertools/radical.pilot/readthedocs/examples/io_staging_dict.py``

.. literalinclude:: ../../examples/io_staging_dict.py


.. _example-shared:

Shared Input Files
------------------

This example demonstrates the staging of a shared input file by means of
the stage_in() method of the pilot and consequently making that available to
all compute units.

.. note::
    Download the example:

    ``curl -O https://raw.githubusercontent.com/radical-cybertools/radical.pilot/readthedocs/examples/io_staging_shared.py``

.. literalinclude:: ../../examples/io_staging_shared.py


.. _example-pipeline:

Pipeline
--------

This example demonstrates a two-step pipeline that makes use of a remote pilot
staging area, where the first step of the pipeline copies the intermediate
output into and that is picked up by the second step in the pipeline.

.. note::
    Download the example:

    ``curl -O https://raw.githubusercontent.com/radical-cybertools/radical.pilot/readthedocs/examples/io_staging_pipeline.py``

.. literalinclude:: ../../examples/io_staging_pipeline.py