.. _residual-memcpy:

.. include:: benchmarks/residual/README.rst

----

Implementation details of tensor streaming
------------------------------------------

With ``SdkRuntime``, the user can copy or stream data into and out of the WSE and launch kernels in any order.
This example uses a ``portmap`` and a symbol to copy a tensor into the corresponding device memory.
The symbol can be extracted via ``SdkRuntime.get_id()``.
``runtime_utils`` converts the user's tensor to an architecture-independent form based on the ``portmap``, because
``SdkRuntime`` no longer accepts a ``portmap`` parameter.

For example, the following ``portmap`` describes a block distribution of tensor ``A`` over the core rectangle, with one block of ``LOCAL_IN_SZ`` by ``LOCAL_OUT_SZ`` elements per PE.

.. code-block:: python

    iportmap_A = f"{{ A[j=0:{M-1}][i=0:{N-1}] -> [PE[i//{LOCAL_IN_SZ}, j//{LOCAL_OUT_SZ}] -> \
        index[i%{LOCAL_IN_SZ}, j%{LOCAL_OUT_SZ}]] }}"
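
The mapping expressed by this ``portmap`` can be checked on the host with a few lines of plain Python.
The following is only a sketch of the index arithmetic: the helper ``locate_A`` and the sizes used here are hypothetical and are not part of ``runtime_utils``.
It assumes the range ``0:{M-1}`` is inclusive, so ``j`` runs over ``M`` rows and ``i`` over ``N`` columns.

.. code-block:: python

    def locate_A(j, i, local_in_sz, local_out_sz):
        """Return ((px, py), (local_i, local_j)) for element A[j][i]."""
        # PE coordinates: the block that contains column i and row j.
        pe = (i // local_in_sz, j // local_out_sz)
        # Position of the element inside that PE's local block.
        index = (i % local_in_sz, j % local_out_sz)
        return pe, index

    # Hypothetical sizes: a 6-by-4 matrix split into blocks of 3 rows by 2 columns.
    M, N = 6, 4
    LOCAL_OUT_SZ, LOCAL_IN_SZ = 3, 2

    print(locate_A(5, 2, LOCAL_IN_SZ, LOCAL_OUT_SZ))  # ((1, 1), (0, 2))

With these sizes the core rectangle would be ``N // LOCAL_IN_SZ = 2`` PEs wide and ``M // LOCAL_OUT_SZ = 2`` PEs high.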

The symbol of tensor ``A`` defined in the kernel is extracted by:

.. code-block:: python

    symbol_A = simulator.get_id("A")

Then ``runtime_utils.convert_input_tensor()`` converts the user's tensor ``A`` based on ``iportmap_A``, and ``SdkRuntime.memcpy_h2d()`` copies the converted data to the
device memory pointed to by ``symbol_A``.

.. code-block:: python

    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
    simulator.memcpy_h2d(symbol_A, data, False, px, py, w, h, l, 0, False)


The following code copies the vector ``b`` into the device array ``y`` on the first column of PEs in the core rectangle:

.. code-block:: python

    iportmap_b = f"{{ b[i=0:{M-1}][j=0] -> [PE[0, i//{LOCAL_OUT_SZ}] -> \
        index[i%{LOCAL_OUT_SZ}]] }}"
    symbol_y = simulator.get_id("y")
    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_b, b)
    simulator.memcpy_h2d(symbol_y, data, False, px, py, w, h, l, 0, False)

The same idea holds for vector ``x``, which is distributed over the first row of PEs in the core rectangle with the following sequence:

.. code-block:: python

    iportmap_x = f"{{ x[i=0:{N-1}][j=0] -> [PE[i//{LOCAL_IN_SZ}, 0] ->  \
        index[i%{LOCAL_IN_SZ}]] }}"
    symbol_x = simulator.get_id("x")
    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_x, x)
    simulator.memcpy_h2d(symbol_x, data, False, px, py, w, h, l, 0, False)

The output norm ``|b-A*x|`` is likewise copied back from device memory directly, via the following sequence:

.. code-block:: python

    oportmap_nrm_r = "{ nrm_r[i=0:0][j=0] -> [PE[1, 0] -> index[i]] }"
    symbol_nrm = simulator.get_id("nrm")
    (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
    simulator.memcpy_d2h(data, symbol_nrm, False, px, py, w, h, l, 0, False)
    nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)
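
To check the value read back from the device, a host-side reference can be computed from the same ``A``, ``x`` and ``b``.
The sketch below assumes the norm is the infinity norm (suggested by the file name ``nrminf.csl``); the function ``residual_inf_norm`` and the sample data are hypothetical and use plain Python lists instead of NumPy arrays.

.. code-block:: python

    def residual_inf_norm(A, x, b):
        """Return max_j |b[j] - sum_i A[j][i] * x[i]|."""
        worst = 0.0
        for row, b_j in zip(A, b):
            # One entry of the residual r = b - A*x.
            r_j = b_j - sum(a * x_i for a, x_i in zip(row, x))
            worst = max(worst, abs(r_j))
        return worst

    # Small hypothetical example: A*x = [3, 7], so r = [0, -1].
    A_ref = [[1.0, 2.0], [3.0, 4.0]]
    x_ref = [1.0, 1.0]
    b_ref = [3.0, 6.0]
    print(residual_inf_norm(A_ref, x_ref, b_ref))  # 1.0

Because the device computes in ``float32``, a comparison against such a reference should allow a small tolerance rather than test for exact equality.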


In this example, the user does not specify colors to stream in tensors ``A``, ``x`` and ``b`` or to stream out the scalar ``|b-A*x|``.
The runtime implicitly uses internal colors to stream the tensors in and out.
The user only specifies the color ``LAUNCH`` to launch a kernel function.
This example launches the kernel function ``bcast_x()`` without any arguments,
so the first argument of ``call()`` is ``bcast_x``, which is defined in ``residual_memcpy.csl``,
and the second argument is an empty array.

.. code-block:: python

    simulator.call("bcast_x", [], nonblock=False)

The user calls ``SdkRuntime.stop()`` when all operations are done.

The kernel (``residual_memcpy.csl``) does not define three WTTs (wavelet-triggered tasks) to receive ``A``, ``x`` and ``b``.
Instead, the user defines pointers to ``A``, ``x``, ``y`` and ``nrm``, and exports those pointers via ``@export_symbol``.

.. code-block:: csl

    var ptr_A : [*]f32 = &A;
    var ptr_x : [*]f32 = &x;
    var ptr_y : [*]f32 = &y;
    var ptr_nrm : [*]f32 = &nrm;
    ...
    comptime {
        @export_symbol(ptr_A, "A");
        @export_symbol(ptr_x, "x");
        @export_symbol(ptr_y, "y");
        @export_symbol(ptr_nrm, "nrm");
    }

In the layout file, ``layout_memcpy.csl``, the user must also export symbol names that match the names used in ``residual_memcpy.csl``.

.. code-block:: csl

    @export_name("A", [*]f32, true);
    @export_name("x", [*]f32, true);
    @export_name("y", [*]f32, true);
    @export_name("nrm", [*]f32, true);


The host runtime copies ``A``, ``x`` and ``b`` to the device via three calls to ``simulator.memcpy_h2d``.

.. code-block:: python

    simulator.memcpy_h2d(symbol_A, data, False, px, py, w, h, l, 0, False)
    simulator.memcpy_h2d(symbol_x, data, False, px, py, w, h, l, 0, False)
    simulator.memcpy_h2d(symbol_y, data, False, px, py, w, h, l, 0, False)

After that, the user can launch the kernel function ``bcast_x()`` to broadcast ``x`` from the first row to the other rows, to compute the local ``y=A*x`` via ``f_comp()``,
and to reduce the partial results via ``f_reduce()``.
At the end of ``bcast_x()``, the user must call ``sys_mod.unblock_cmd_stream()`` so that the next command, ``simulator.memcpy_d2h``, can be processed.
If the user does not call ``sys_mod.unblock_cmd_stream()``, the program hangs because ``simulator.memcpy_d2h`` is never processed.
To launch a kernel function on the device, the user must export the color ``LAUNCH`` and the host-callable function ``bcast_x()`` via

.. code-block:: csl

    comptime {
       @export_symbol(bcast_x);
       @rpc(LAUNCH);
    }
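
The broadcast, local compute, and reduce steps described above can be mimicked on the host to see how the blocks fit together.
This is a sketch under stated assumptions, not the kernel's implementation: the function ``simulate_residual`` is hypothetical, the sizes are assumed to divide evenly, partial products are assumed to be summed along each row of PEs, and the result is assumed to be the infinity norm of ``b - A*x``.

.. code-block:: python

    def simulate_residual(A, x, b, local_in_sz, local_out_sz):
        """Mimic the PE-grid dataflow on the host and return max |b - A*x|."""
        width = len(x) // local_in_sz    # PEs per row (assumed to divide evenly)
        height = len(b) // local_out_sz  # PEs per column (assumed to divide evenly)

        # Broadcast: every PE in column px ends up with the slice of x
        # that initially lives only on PE (px, 0).
        x_slices = [x[px * local_in_sz:(px + 1) * local_in_sz]
                    for px in range(width)]

        residual = []
        for py in range(height):
            row0 = py * local_out_sz
            # Local compute: PE (px, py) multiplies its block of A by its slice of x.
            partials = []
            for px in range(width):
                col0 = px * local_in_sz
                partials.append([
                    sum(A[row0 + r][col0 + c] * x_slices[px][c]
                        for c in range(local_in_sz))
                    for r in range(local_out_sz)
                ])
            # Reduce: sum the partial products along this row of PEs, then
            # subtract from the matching slice of b (assumed combination order).
            for r in range(local_out_sz):
                y_r = sum(p[r] for p in partials)
                residual.append(b[row0 + r] - y_r)
        return max(abs(v) for v in residual)

    # Hypothetical 4-by-4 problem split into 2-by-2 blocks.
    A_sim = [[1.0, 2.0, 3.0, 4.0],
             [5.0, 6.0, 7.0, 8.0],
             [9.0, 10.0, 11.0, 12.0],
             [13.0, 14.0, 15.0, 16.0]]
    x_sim = [1.0, 0.0, -1.0, 2.0]
    b_sim = [6.0, 14.0, 20.0, 30.0]
    print(simulate_residual(A_sim, x_sim, b_sim, 2, 2))  # 2.0

The result does not depend on the block sizes, which makes this a convenient cross-check when changing ``LOCAL_IN_SZ`` or ``LOCAL_OUT_SZ``.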

Finally, to use the new runtime, the user must compile the kernel with the flag ``--fabric-offsets=4,1`` and the following
additional parameters:

-
    ``--memcpy`` to compile the infrastructure (a.k.a. halo) that routes the data between the host and the device.

-
    ``LAUNCH_ID`` to pass the ``LAUNCH`` color to ``cslc``.

-
    ``--channels=k`` where k is an integer between 1 and 16.

-
    ``--width-west-buf=p --width-east-buf=q`` where p and q are non-negative numbers.
    In this example, no additional buffers are inserted in the framework, so both parameters are zero.

::

    cslc LAUNCH_ID:4 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0 
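
When the compile command is generated from a script, the constraints listed above can be validated before ``cslc`` is invoked.
The helper ``memcpy_options`` below is hypothetical and only assembles the options shown in this section; the kernel file and any other arguments still have to be added by the caller.

.. code-block:: python

    def memcpy_options(launch_id, channels=1, west_buf=0, east_buf=0):
        """Return the memcpy-related cslc options as a list of strings."""
        if not 1 <= channels <= 16:
            raise ValueError("channels must be between 1 and 16")
        if west_buf < 0 or east_buf < 0:
            raise ValueError("buffer widths must be non-negative")
        return [
            f"LAUNCH_ID:{launch_id}",
            "--memcpy",
            f"--channels={channels}",
            f"--width-west-buf={west_buf}",
            f"--width-east-buf={east_buf}",
        ]

    # Reproduces the option string shown above.
    print("cslc " + " ".join(memcpy_options(4)))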


----

.. _residual-code:

layout.csl
----------

.. literalinclude:: benchmarks/residual/layout.csl
    :language: csl

residual.csl
------------

.. literalinclude:: benchmarks/residual/residual.csl
    :language: csl

gemv.csl
--------

.. literalinclude:: benchmarks/residual/gemv.csl
    :language: csl

axpy.csl
--------

.. literalinclude:: benchmarks/residual/axpy.csl
    :language: csl

nrminf.csl
----------

.. literalinclude:: benchmarks/residual/nrminf.csl
    :language: csl

run.py
------

.. literalinclude:: benchmarks/residual/run.py
    :language: python

commands.sh
-----------

.. literalinclude:: benchmarks/residual/commands.sh
    :language: shell