.. _residual-memcpy:

.. include:: benchmarks/residual/README.rst

----

Implementation details of tensor streaming
------------------------------------------

With ``SdkRuntime``, the user can copy or stream data into and out of the WSE
and launch kernels in any order. This example uses a ``portmap`` and a symbol
to copy a tensor into the corresponding device memory. The symbol is obtained
via ``SdkRuntime.get_id()``. Because ``SdkRuntime`` no longer accepts a
``portmap`` parameter, ``runtime_utils`` converts the user's tensor to an
architecture-independent form based on the ``portmap``.

For example, the following ``portmap`` describes a block distribution of
tensor ``A`` over the core rectangle, with a block of ``LOCAL_IN_SZ`` by
``LOCAL_OUT_SZ`` per PE.

.. code-block:: python

    iportmap_A = f"{{ A[j=0:{M-1}][i=0:{N-1}] -> [PE[i//{LOCAL_IN_SZ}, j//{LOCAL_OUT_SZ}] -> \
        index[i%{LOCAL_IN_SZ}, j%{LOCAL_OUT_SZ}]] }}"

The symbol of tensor ``A`` defined in the kernel is extracted by:

.. code-block:: python

    symbol_A = simulator.get_id("A")

Then ``runtime_utils.convert_input_tensor()`` converts the user's tensor ``A``
based on ``iportmap_A``, and ``SdkRuntime.memcpy_h2d()`` copies the converted
data to the device memory pointed to by ``symbol_A``.

.. code-block:: python

    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
    simulator.memcpy_h2d(symbol_A, data, False, px, py, w, h, l, 0, False)

The following code copies the vector ``b`` into the first column of PEs in the
core rectangle. Note that ``b`` is written to the device symbol ``y``.

.. code-block:: python

    iportmap_b = f"{{ b[i=0:{M-1}][j=0] -> [PE[0, i//{LOCAL_OUT_SZ}] -> \
        index[i%{LOCAL_OUT_SZ}]] }}"
    symbol_y = simulator.get_id("y")
    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_b, b)
    simulator.memcpy_h2d(symbol_y, data, False, px, py, w, h, l, 0, False)

The same idea holds for the vector ``x``, which is distributed over the first
row of PEs in the core rectangle with the following sequence:

.. code-block:: python

    iportmap_x = f"{{ x[i=0:{N-1}][j=0] -> [PE[i//{LOCAL_IN_SZ}, 0] -> \
        index[i%{LOCAL_IN_SZ}]] }}"
    symbol_x = simulator.get_id("x")
    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_x, x)
    simulator.memcpy_h2d(symbol_x, data, False, px, py, w, h, l, 0, False)

The output norm ``|b-A*x|`` is also copied directly from device memory via the
following sequence:

.. code-block:: python

    oportmap_nrm_r = "{ nrm_r[i=0:0][j=0] -> [PE[1, 0] -> index[i]] }"
    symbol_nrm = simulator.get_id("nrm")
    (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
    simulator.memcpy_d2h(data, symbol_nrm, False, px, py, w, h, l, 0, False)
    nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)

In this example, the user does not specify colors to stream in the tensors
``A``, ``x`` and ``b`` or to stream out the scalar ``|b-A*x|``. The runtime
implicitly uses internal colors to stream the tensors in and out. The user
only specifies the color ``LAUNCH`` to launch a kernel function.

In this example, we launch the kernel function ``bcast_x()`` without any
arguments. The first argument of ``call()`` is therefore the name ``bcast_x``
of the function defined in ``residual_memcpy.csl``, and the second argument is
an empty array.

.. code-block:: python

    simulator.call("bcast_x", [], nonblock=False)

The user calls ``SdkRuntime.stop()`` once all operations are done.

The kernel (``residual_memcpy.csl``) does not define three wavelet-triggered
tasks (WTTs) to receive ``A``, ``x`` and ``b``. Instead, the user defines
pointers to ``A``, ``x``, ``y`` and ``nrm``, and exports those pointers via
``export_symbol``.

.. code-block:: csl

    var ptr_A : [*]f32 = &A;
    var ptr_x : [*]f32 = &x;
    var ptr_y : [*]f32 = &y;
    var ptr_nrm : [*]f32 = &nrm;
    ...
    comptime {
        @export_symbol(ptr_A, "A");
        @export_symbol(ptr_x, "x");
        @export_symbol(ptr_y, "y");
        @export_symbol(ptr_nrm, "nrm");
    }

In the layout file, ``layout_memcpy.csl``, the user must also export symbol
names that match the names in ``residual_memcpy.csl``.

.. code-block:: csl

    @export_name("A", [*]f32, true);
    @export_name("x", [*]f32, true);
    @export_name("y", [*]f32, true);
    @export_name("nrm", [*]f32, true);

The host runtime copies ``A``, ``x`` and ``b`` to the device via three calls
to ``simulator.memcpy_h2d``.

.. code-block:: python

    simulator.memcpy_h2d(symbol_A, data, False, px, py, w, h, l, 0, False)
    simulator.memcpy_h2d(symbol_x, data, False, px, py, w, h, l, 0, False)
    simulator.memcpy_h2d(symbol_y, data, False, px, py, w, h, l, 0, False)

After that, the user can launch the kernel function ``bcast_x()`` to broadcast
``x`` from the first row to the other rows, to compute the local ``y=A*x`` via
``f_comp()``, and to reduce the partial results via ``f_reduce()``. At the end
of ``bcast_x()``, the user must call ``sys_mod.unblock_cmd_stream()`` so that
the next command, ``simulator.memcpy_d2h``, can be processed. If the user does
not call ``sys_mod.unblock_cmd_stream()``, the program hangs because
``simulator.memcpy_d2h`` is never processed.

To launch a kernel function on the device, the user must export the color
``LAUNCH`` and the host-callable function ``bcast_x()`` via

.. code-block:: csl

    comptime {
        @export_symbol(bcast_x);
        @rpc(LAUNCH);
    }

Finally, the user has to compile the kernel with the flag
``--fabric-offsets=4,1`` and several additional parameters to use the new
runtime.

- ``--memcpy`` to compile the infrastructure (a.k.a. halo) that routes the
  data between the host and the device.
- pass the ``LAUNCH`` color to ``cslc`` via ``LAUNCH_ID``.
- ``--channels=k`` where ``k`` is an integer between 1 and 16.
- ``--width-west-buf=p --width-east-buf=q`` where ``p`` and ``q`` are
  non-negative integers. In this example, no additional buffers are inserted
  in the framework, so both parameters are zero.

::

    cslc LAUNCH_ID:4 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0

----

.. _residual-code:

layout.csl
----------

.. literalinclude:: benchmarks/residual/layout.csl
    :language: csl

residual.csl
------------

.. literalinclude:: benchmarks/residual/residual.csl
    :language: csl

gemv.csl
--------

.. literalinclude:: benchmarks/residual/gemv.csl
    :language: csl

axpy.csl
--------

.. literalinclude:: benchmarks/residual/axpy.csl
    :language: csl

nrminf.csl
----------

.. literalinclude:: benchmarks/residual/nrminf.csl
    :language: csl

run.py
------

.. literalinclude:: benchmarks/residual/run.py
    :language: python

commands.sh
-----------

.. literalinclude:: benchmarks/residual/commands.sh
    :language: shell
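
Block distribution sketch
-------------------------

The index arithmetic in ``iportmap_A`` (``PE[i//LOCAL_IN_SZ, j//LOCAL_OUT_SZ]``
and ``index[i%LOCAL_IN_SZ, j%LOCAL_OUT_SZ]``) can be checked on the host with
a few lines of plain Python. The sketch below is only an illustration of that
arithmetic. It does not call ``SdkRuntime`` or ``runtime_utils``. The helper
names ``block_owner`` and ``split_into_blocks`` are hypothetical, and the
nested-list layout of each block is an assumption that says nothing about how
the data is stored on the device.

.. code-block:: python

    def block_owner(j, i, local_in_sz, local_out_sz):
        """Map element A[j][i] to its PE coordinate and local index.

        Mirrors the portmap: PE[i//LOCAL_IN_SZ, j//LOCAL_OUT_SZ] and
        index[i%LOCAL_IN_SZ, j%LOCAL_OUT_SZ].
        """
        pe = (i // local_in_sz, j // local_out_sz)
        local = (i % local_in_sz, j % local_out_sz)
        return pe, local


    def split_into_blocks(A, local_in_sz, local_out_sz):
        """Group the entries of a row-major nested list A by owning PE.

        Returns a dict keyed by (px, py). Each value is a LOCAL_OUT_SZ by
        LOCAL_IN_SZ nested list (an illustrative layout, not the device one).
        """
        M, N = len(A), len(A[0])
        if M % local_out_sz or N % local_in_sz:
            raise ValueError("tensor shape must be a multiple of the block size")
        blocks = {}
        for j in range(M):
            for i in range(N):
                pe, (li, lj) = block_owner(j, i, local_in_sz, local_out_sz)
                block = blocks.setdefault(
                    pe, [[None] * local_in_sz for _ in range(local_out_sz)]
                )
                block[lj][li] = A[j][i]
        return blocks


    # Example values chosen for illustration only.
    M, N = 4, 6
    LOCAL_OUT_SZ, LOCAL_IN_SZ = 2, 3
    A = [[j * N + i for i in range(N)] for j in range(M)]
    blocks = split_into_blocks(A, LOCAL_IN_SZ, LOCAL_OUT_SZ)

With these example sizes the tensor is spread over a 2-by-2 rectangle of PEs,
and element ``A[3][4]`` belongs to ``PE[1, 1]`` at local index ``[1, 1]``.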