4.8. Sharing Unit Input Data

RP aims to support the concurrent execution of many tasks, and in many such workloads the tasks share some or all of their input data. We have seen earlier that input staging can incur a significant runtime overhead, but much of that overhead can be avoided by eliminating redundant staging operations.

For this purpose, each RP pilot manages a space of shared data, and any data the application puts into that space can later be symlinked into a unit’s workdir for consumption:

import os
import radical.pilot as rp

# stage shared data from `pwd` to the pilot's shared staging space
pilot.stage_in({'source': 'file://%s/input.dat' % os.getcwd(),
                'target': 'staging:///input.dat',
                'action': rp.TRANSFER})

[...]

for i in range(0, n):

    cud = rp.ComputeUnitDescription()

    cud.executable     = '/usr/bin/wc'
    cud.arguments      = ['-c', 'input.dat']
    cud.input_staging  = {'source': 'staging:///input.dat',
                          'target': 'input.dat',
                          'action': rp.LINK
                         }

The rp.LINK staging action requests that RP create a symlink instead of performing the copy operation used by the default rp.TRANSFER action. The full example can be found here: 07_shared_unit_data.py.
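
To illustrate why linking is cheaper than copying, the following is a minimal standalone sketch that uses only the Python standard library. It is not RP's implementation: the directory names (`staging_area`, `unit.000000`, ...) and the helper functions are hypothetical and chosen purely for illustration. The shared file is written once, and each hypothetical unit workdir receives only a symlink to it, so the data itself is never duplicated.

```python
import os
import tempfile


def stage_shared(staging_dir, name, payload):
    """Write the shared input file once into the (hypothetical) staging dir."""
    path = os.path.join(staging_dir, name)
    with open(path, "wb") as fh:
        fh.write(payload)
    return path


def link_into_workdirs(shared_path, sandbox, n):
    """Create n hypothetical unit workdirs, each holding a symlink to the shared file."""
    links = []
    for i in range(n):
        workdir = os.path.join(sandbox, "unit.%06d" % i)
        os.makedirs(workdir)
        link = os.path.join(workdir, os.path.basename(shared_path))
        os.symlink(shared_path, link)  # no data is copied here
        links.append(link)
    return links


def count_bytes(path):
    """Rough equivalent of `wc -c`: opening the path follows the symlink."""
    with open(path, "rb") as fh:
        return len(fh.read())


def demo(n=3, payload=b"shared input data\n"):
    """Return (is_symlink, byte_count) for the input file seen by each unit."""
    with tempfile.TemporaryDirectory() as sandbox:
        staging = os.path.join(sandbox, "staging_area")
        os.makedirs(staging)
        shared = stage_shared(staging, "input.dat", payload)
        links = link_into_workdirs(shared, sandbox, n)
        return [(os.path.islink(link), count_bytes(link)) for link in links]
```

Calling `demo(n=3)` yields one entry per hypothetical unit. Each unit sees the full file content through its symlink, while the payload exists only once on disk.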

Note

Unlike many other methods in RP, the pilot.stage_in method is synchronous, i.e., it will only return once the transfer has completed. These semantics may change in a future version of RP.

4.8.1. Running the Example

The result of this example’s execution is the same as in the previous example, but it will now run significantly faster because the redundant staging operations have been removed (at least for non-local pilots):

../_images/07_shared_unit_data.png

4.8.2. What’s Next?

This completes the discussion of data staging. The next sections go into more detail on unit execution: environment setup, pre- and post-execution, and MPI applications.