.. _stencil-memcpy:

25-Point Stencil
================

.. attention::

   All the code for this stencil is in early BETA, and is provided here as
   advance information only.

The code for this 3D 25-point stencil was inspired by the proprietary code of
TotalEnergies EP Research & Technology US.

Implementation details of tensor streaming
------------------------------------------

The stencil code is a time-marching app. It requires the following three
inputs:

- scalar ``iterations``: number of time steps
- tensor ``vp``: velocity field
- tensor ``source``: source term

and produces the following three outputs:

- maximum and minimum value of the vector field at the last time step, two f32 per PE
- timestamps of the time marching, three uint32 per PE
- vector field ``z`` of the last time step, ``zDim`` f32 per PE

The stencil code uses 21 colors for its communication patterns, and
"new-style" tensor streaming reserves 6 colors, so only 4 colors are left for
H2D/D2H transfers and a few entrypoints for control flow. We use one color
(color 0) to launch kernel functions and one entrypoint (color 2) to trigger
the time marching. The ``copy mode`` of memcpy is used for the two inputs and
the two outputs.

First of all, ``run.py`` instantiates the new runtime ``SdkRuntime``, which
provides the ``copy mode`` of memcpy and kernel launches.

.. code-block:: python

   simulator = SdkRuntime(dirname, cmaddr=args.cmaddr)
   simulator.load()
   simulator.run()

After the simulator (or WSE) is up via ``run()``, we send the input tensors
``vp`` and ``source`` to the device. The third argument (``streaming``) of
``memcpy_h2d()`` is ``False``, indicating ``copy mode``, so we must pass the
corresponding symbol as the first argument of ``memcpy_h2d()``.

.. code-block:: python

   symbol_vp = simulator.get_id("vp")
   symbol_source = simulator.get_id("source")

   # H2D vp[h][w][zDim]
   iportmap_vp = f"{{ vp[j=0:{height-1}][i=0:{width-1}][k=0:{zDim-1}] \
       -> [PE[i, j] -> index[k]] }}"

   # H2D source[h][w][zDim]
   iportmap_source = f"{{ source[j=0:{height-1}][i=0:{width-1}][k=0:{zDim-1}] \
       -> [PE[i, j] -> index[k]] }}"

   # use the runtime_utils library to calculate memcpy args and shuffle data
   (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_vp, vp)
   simulator.memcpy_h2d(symbol_vp, data, False, px, py, w, h, l, 0, False)

   (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_source, source_all)
   simulator.memcpy_h2d(symbol_source, data, False, px, py, w, h, l, 0, False)

Second, we launch the time marching with the argument ``iterations``. In this
example, we have two kernel launches: one performs the time marching after
``vp`` and ``source`` are received, and the other prepares the output data
``zValues``. The former has the function symbol ``f_activate_comp`` and the
latter has the function symbol ``f_prepare_zout``; both are exported in the
kernel (we will explain this later). Here ``SdkRuntime.call()`` triggers a
host-callable function: the first argument is the function symbol
``f_activate_comp``, which starts the time marching, and the second argument
is ``iterations``, a u32 array of size 1.

.. code-block:: python

   simulator.call("f_activate_comp", [cast_uint32(iterations)], nonblock=False)

At the end of the time marching, ``f_checkpoint()`` in ``task_memcpy.csl``
records the maximum and minimum value of the vector field and the timing info
in an array ``d2h_buf_f32``. The host can call ``memcpy_d2h()`` to receive
the data in ``d2h_buf_f32``.

.. code-block:: python

   # task_memcpy.csl binds the symbol "maxmin_time" to the array d2h_buf_f32
   symbol_maxmin_time = simulator.get_id("maxmin_time")

   # D2H [h][w][5]
   oportmap1 = f"{{ maxmin_time[j=0:{height-1}][i=0:{width-1}][k=0:{5-1}] \
       -> [PE[i, j] -> index[k]] }}"

   # use the runtime_utils library to calculate memcpy args and manage output data
   (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap1, np.float32)
   simulator.memcpy_d2h(data, symbol_maxmin_time, False, px, py, w, h, l, 0, False)
   maxmin_time_hwl = runtime_utils.format_output_tensor(oportmap1, np.float32, data)
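Each PE thus contributes five f32 words to ``maxmin_time_hwl``: two carry the
max/min of the vector field and three carry timestamp words, i.e. uint32 bit
patterns stored in f32 slots. Below is a minimal host-side sketch of one way
to unpack them; the word order (timestamps first, then max/min) follows the
comment in ``f_checkpoint()`` but is an assumption, so verify it against
``task_memcpy.csl`` before relying on it.

.. code-block:: python

   import numpy as np

   # maxmin_time_hwl has shape (height, width, 5) and dtype float32.
   # Assumed per-PE word layout: words 0..2 = timestamp words (uint32 bit
   # patterns), word 3 = max, word 4 = min. Verify against task_memcpy.csl.
   words_u32 = np.ascontiguousarray(maxmin_time_hwl).view(np.uint32)

   timestamps = words_u32[:, :, 0:3]    # raw 32-bit timestamp words per PE
   zmax = maxmin_time_hwl[:, :, 3]      # per-PE maximum of the vector field
   zmin = maxmin_time_hwl[:, :, 4]      # per-PE minimum of the vector field

   # reduce across the whole PE rectangle for the global extrema
   print("global max =", zmax.max(), ", global min =", zmin.min())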
To receive the vector field of the last time step, the function
``f_prepare_zout()`` is called via ``SdkRuntime.call()`` to gather this data
into a temporary array ``zout``, because the result resides in either
``zValues[0, :]`` or ``zValues[1, :]``.

.. code-block:: python

   simulator.call("f_prepare_zout", [], nonblock=False)

The last operation, ``memcpy_d2h()``, sends the array ``zout`` back to the
host.

.. code-block:: python

   symbol_zout = simulator.get_id("zout")

   # D2H [h][w][zDim]
   oportmap2 = f"{{ z[j=0:{height-1}][i=0:{width-1}][k=0:{zDim-1}] -> [PE[i, j] -> index[k]] }}"

   (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap2, np.float32)
   simulator.memcpy_d2h(data, symbol_zout, False, px, py, w, h, l, 0, False)
   z_hwl = runtime_utils.format_output_tensor(oportmap2, np.float32, data)

The kernel (``task_memcpy.csl``) does not use ``streaming mode``, so it does
not define a wavelet-triggered task (WTT). To support kernel launches, we
pass the color ``LAUNCH`` to the memcpy module. ``copy mode`` is supported
automatically in SDK 0.9.0, so there is no need to define
``.data_type = f32`` (it is left commented out below).

.. code-block:: csl

   const sys_mod = @import_module( "", @concat_structs(memcpyParams, .{
     .LAUNCH = LAUNCH
     // .data_type = f32
   }));

Additionally, we need to define symbols for the input/output tensors and
export them.

.. code-block:: csl

   var ptr_vp : [*]f32 = &vp;
   var ptr_source : [*]f32 = &source;
   var ptr_d2h_buf_f32 : [*]f32 = &d2h_buf_f32;
   var ptr_zout : [*]f32 = &zout;

   comptime {
     @export_symbol(ptr_vp, "vp");
     @export_symbol(ptr_source, "source");
     @export_symbol(ptr_d2h_buf_f32, "maxmin_time");
     @export_symbol(ptr_zout, "zout");
   }

The layout file (``code_memcpy.csl``) needs to export the names of the
tensors as well.

.. code-block:: csl

   layout {
     @export_name("vp", [*]f32, true);
     @export_name("source", [*]f32, true);
     @export_name("maxmin_time", [*]f32, true);
     @export_name("zout", [*]f32, true);
   }

To launch a kernel, we need to export the color ``LAUNCH`` by
``@rpc(LAUNCH)``. The host-side ``SdkRuntime.call(sym, params)`` triggers the
function named by its first argument ``sym``: if the function symbol is
``f_activate_comp``, the time marching is started via ``@activate(COMP)``; if
the function symbol is ``f_prepare_zout``, ``f_prepare_zout()`` is called.
Both functions and the RPC color ``LAUNCH`` are exported via

.. code-block:: csl

   comptime {
     @export_symbol(f_activate_comp);
     @export_symbol(f_prepare_zout);
     @rpc(LAUNCH);
   }

The first ``SdkRuntime.call()`` invokes ``f_activate_comp``, which triggers
the entrypoint ``f_comp()`` to start the time marching and to record the
starting time.

.. code-block:: csl

   fn f_activate_comp(iter_cnt: u32) void {
     iterations = iter_cnt;
     @activate(COMP);
   }

   task f_comp() void {
     timestamp.enable_tsc();
     timestamp.get_timestamp(&tscStartBuffer);
     @activate(send);
   }

At the end of the time marching, the function ``epilog()`` checks
``iterationCount``: if it has reached the given ``iterations``, ``epilog()``
triggers the entrypoint ``CHECKPOINT`` to prepare the data for the first
``memcpy_d2h()``.

.. code-block:: csl

   fn epilog() void {
     ...
     if (iterationCount < iterations) {
       @activate(send);
     } else {
       // we've finished executing the program, and need to:
       // 1. record the value of the timestamp counter
       // 2. compute the minimum and maximum value of the wavefield
       // 3. send the timestamp values
       @activate(CHECKPOINT);
     }
   }

The function ``f_checkpoint()`` calls ``unblock_cmd_stream()`` to process the
next operation, which is the first ``memcpy_d2h()``. Without
``unblock_cmd_stream()``, the program stalls because the ``memcpy_d2h()`` is
never scheduled.

.. code-block:: csl

   fn f_checkpoint() void {
     // compute max/min of zValues
     // record timestamps
     // d2h_buf_f32 = { timestamps, max/min }
     sys_mod.unblock_cmd_stream();
   }

The second ``SdkRuntime.call()`` calls the function ``f_prepare_zout()`` to
gather the vector field into ``zout``. It also calls ``unblock_cmd_stream()``
to process the next operation, which is the second ``memcpy_d2h()``.

.. code-block:: csl

   fn f_prepare_zout() void {
     // toggle = 1 - (iterations % 2)
     var toggle: i32 = 1 - (@as(i32, iterations) % 2);
     if (0 == toggle) {
       mem_z_buf_dsd = @set_dsd_base_addr(mem_z_buf_dsd,
         @ptrcast([*]f32, &(zValues[0, zOffset])));
     } else {
       mem_z_buf_dsd = @set_dsd_base_addr(mem_z_buf_dsd,
         @ptrcast([*]f32, &(zValues[1, zOffset])));
     }
     @mov32(mem_zout_buf_dsd, mem_z_buf_dsd);
     sys_mod.unblock_cmd_stream();
   }
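The buffer selection in ``f_prepare_zout()`` reflects the ping-pong between
the two time levels of ``zValues``. The following host-side sketch mirrors
the same toggle arithmetic, which can be handy when validating results
offline; the helper function is hypothetical and not part of the example
code.

.. code-block:: python

   def final_z_row(iterations: int) -> int:
       """Mirror the toggle arithmetic of f_prepare_zout():
       toggle = 1 - (iterations % 2); toggle == 0 selects zValues[0, :],
       otherwise zValues[1, :]."""
       toggle = 1 - (iterations % 2)
       return 0 if toggle == 0 else 1

   # after an odd number of steps the result is in zValues[0, :],
   # after an even number it is in zValues[1, :]
   assert final_z_row(3) == 0
   assert final_z_row(4) == 1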
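Before turning to compilation, it is worth recapping the host-side ordering
that these device functions assume: the two H2Ds must land before the time
marching starts, and each ``memcpy_d2h()`` is released only by the matching
``unblock_cmd_stream()`` on the device. The condensed sketch below restates
the calls shown earlier; the per-transfer arguments (``px``, ``py``, ``w``,
``h``, ``l`` and the ``data_*`` buffers) are placeholders for the values
computed by ``runtime_utils``, and the final ``stop()`` is an assumption, so
check ``run.py`` for the actual teardown.

.. code-block:: python

   # 1. load the ELFs and start the program
   simulator.load()
   simulator.run()

   # 2. copy-mode H2D: send vp, then source
   simulator.memcpy_h2d(symbol_vp, data_vp, False, px, py, w, h, l, 0, False)
   simulator.memcpy_h2d(symbol_source, data_source, False, px, py, w, h, l, 0, False)

   # 3. launch the time marching; f_checkpoint() unblocks the next D2H
   simulator.call("f_activate_comp", [cast_uint32(iterations)], nonblock=False)
   simulator.memcpy_d2h(data_mm, symbol_maxmin_time, False, px, py, w, h, l, 0, False)

   # 4. gather the final wavefield into zout, then fetch it
   simulator.call("f_prepare_zout", [], nonblock=False)
   simulator.memcpy_d2h(data_z, symbol_zout, False, px, py, w, h, l, 0, False)

   # 5. shut down (assumed; check run.py for the actual teardown)
   simulator.stop()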
Finally, the user has to compile the kernel with the flag
``--fabric-offsets=4,1`` and four additional parameter sets to enable the
"new-style" tensor streaming:

- ``--memcpy``, to compile the infrastructure (a.k.a. the halo) that routes
  the data between the host and the device.
- ``--params=LAUNCH_ID:0``, to pass the ``LAUNCH`` color to ``cslc``.
- ``--channels=1``, to run ``SdkRuntime``, which supports ``copy mode`` and
  kernel launches.
- ``--width-west-buf=0`` and ``--width-east-buf=0``, because no buffers are
  inserted. If either of these two is nonzero, we must adjust
  ``--fabric-offsets`` and/or ``--fabric-dims``.

Here is the command to compile:

.. code-block:: bash

   cslc --params=LAUNCH_ID:0 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0 ...

code_memcpy.csl
---------------

.. literalinclude:: benchmarks/stencil-v2/code_memcpy.csl
   :language: csl

commands.sh
-----------

.. literalinclude:: benchmarks/stencil-v2/commands.sh
   :language: shell

run.py
------

.. literalinclude:: benchmarks/stencil-v2/run.py
   :language: python

task_memcpy.csl
---------------

.. literalinclude:: benchmarks/stencil-v2/task_memcpy.csl
   :language: csl