.. _stencil-memcpy:

25-Point Stencil
================

.. attention::

   All the code for this stencil is in early BETA, and is provided here as
   advance information only.

The code for this 3D 25-point stencil was inspired by the proprietary code of
TotalEnergies EP Research & Technology US.

Implementation details of tensor streaming
------------------------------------------

The stencil code is a time-marching app. It requires the following three
inputs:

- scalar ``iterations``: number of time steps
- tensor ``vp``: velocity field
- tensor ``source``: source term

and produces the following three outputs:

- maximum and minimum value of the vector field at the last time step, two f32 per PE
- timestamps of the time marching, three uint32 per PE
- vector field ``z`` of the last time step, ``zDim`` f32 per PE

The stencil code uses 21 colors for its communication patterns, and
"new-style" tensor streaming reserves 6 colors, so only 4 colors are left for
H2D/D2H transfers and a few entrypoints for control flow. We use one color
(color 0) to launch kernel functions and one entrypoint (color 2) to trigger
the time marching. The ``copy mode`` of memcpy is used for the two inputs and
the two outputs.

First of all, ``run.py`` instantiates the new runtime ``SdkRuntime``, which
provides the ``copy mode`` of memcpy and kernel launches.

.. code-block:: python

   simulator = SdkRuntime(dirname, cmaddr=args.cmaddr)
   simulator.load()
   simulator.run()

After the simulator (or WSE) is up via ``run()``, we send the input tensors
``vp`` and ``source`` to the device. The third argument (``streaming``) of
``memcpy_h2d()`` is ``False``, indicating ``copy mode``, so we must pass the
corresponding symbol as the first argument of ``memcpy_h2d()``.

.. code-block:: python

   symbol_vp = simulator.get_id("vp")
   symbol_source = simulator.get_id("source")

   # H2D vp[h][w][zDim]
   iportmap_vp = f"{{ vp[j=0:{height-1}][i=0:{width-1}][k=0:{zDim-1}] \
       -> [PE[i, j] -> index[k]] }}"

   # H2D source[h][w][zDim]
   iportmap_source = f"{{ source[j=0:{height-1}][i=0:{width-1}][k=0:{zDim-1}] \
       -> [PE[i, j] -> index[k]] }}"

   # use the runtime_utils library to calculate memcpy args and shuffle data
   (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_vp, vp)
   simulator.memcpy_h2d(symbol_vp, data, False, px, py, w, h, l, 0, False)

   (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_source, source_all)
   simulator.memcpy_h2d(symbol_source, data, False, px, py, w, h, l, 0, False)

Second, we launch the time marching with the argument ``iterations``. In this
example, we have two kernel launches: one performs the time marching after
``vp`` and ``source`` are received, and the other prepares the output data
``zValues``. The former has the function symbol ``f_activate_comp`` and the
latter has the function symbol ``f_prepare_zout``; both are exported in the
kernel (we will explain this later). Here ``SdkRuntime.call()`` triggers a
host-callable function: the first argument is the function symbol
``f_activate_comp``, which starts the time marching, and the second argument
is ``iterations``, a u32 array of size 1.

.. code-block:: python

   simulator.call("f_activate_comp", [cast_uint32(iterations)], nonblock=False)

At the end of the time marching, ``f_checkpoint()`` in ``task_memcpy.csl``
records the maximum and minimum value of the vector field and the timing info
in an array ``d2h_buf_f32``. The host can call ``memcpy_d2h()`` to receive
the data in ``d2h_buf_f32``.

.. code-block:: python

   # task_memcpy.csl binds the symbol "maxmin_time" to the array d2h_buf_f32
   symbol_maxmin_time = simulator.get_id("maxmin_time")

   # D2H [h][w][5]
   oportmap1 = f"{{ maxmin_time[j=0:{height-1}][i=0:{width-1}][k=0:{5-1}] \
       -> [PE[i, j] -> index[k]] }}"

   # use the runtime_utils library to calculate memcpy args and manage output data
   (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap1, np.float32)
   simulator.memcpy_d2h(data, symbol_maxmin_time, False, px, py, w, h, l, 0, False)
   maxmin_time_hwl = runtime_utils.format_output_tensor(oportmap1, np.float32, data)
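Each PE thus contributes five f32 words to ``maxmin_time_hwl``: two carry the
max/min of the vector field and three carry timestamp words, i.e. uint32 bit
patterns stored in f32 slots. Below is a minimal host-side sketch of one way
to unpack them; the word order (timestamps first, then max/min) follows the
comment in ``f_checkpoint()`` but is an assumption, so verify it against
``task_memcpy.csl`` before relying on it.

.. code-block:: python

   import numpy as np

   # maxmin_time_hwl has shape (height, width, 5) and dtype float32.
   # Assumed per-PE word layout: words 0..2 = timestamp words (uint32 bit
   # patterns), word 3 = max, word 4 = min. Verify against task_memcpy.csl.
   words_u32 = np.ascontiguousarray(maxmin_time_hwl).view(np.uint32)

   timestamps = words_u32[:, :, 0:3]    # raw 32-bit timestamp words per PE
   zmax = maxmin_time_hwl[:, :, 3]      # per-PE maximum of the vector field
   zmin = maxmin_time_hwl[:, :, 4]      # per-PE minimum of the vector field

   # reduce across the whole PE rectangle for the global extrema
   print("global max =", zmax.max(), ", global min =", zmin.min())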
To receive the vector field of the last time step, the function
``f_prepare_zout()`` is called via ``SdkRuntime.call()`` to gather this data
into a temporary array ``zout``, because the result resides in either
``zValues[0, :]`` or ``zValues[1, :]``.

.. code-block:: python

   simulator.call("f_prepare_zout", [], nonblock=False)

The last operation, ``memcpy_d2h()``, sends the array ``zout`` back to the
host.

.. code-block:: python

   symbol_zout = simulator.get_id("zout")

   # D2H [h][w][zDim]
   oportmap2 = f"{{ z[j=0:{height-1}][i=0:{width-1}][k=0:{zDim-1}] -> [PE[i, j] -> index[k]] }}"

   (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap2, np.float32)
   simulator.memcpy_d2h(data, symbol_zout, False, px, py, w, h, l, 0, False)
   z_hwl = runtime_utils.format_output_tensor(oportmap2, np.float32, data)

The kernel (``task_memcpy.csl``) does not use ``streaming mode``, so it does
not define a wavelet-triggered task (WTT). To support kernel launches, we
pass the color ``LAUNCH`` to the memcpy module. ``copy mode`` is supported
automatically in SDK 0.9.0, so there is no need to define
``.data_type = f32`` (it is left commented out below).

.. code-block:: csl

   const sys_mod = @import_module( "", @concat_structs(memcpyParams, .{
     .LAUNCH = LAUNCH
     // .data_type = f32
   }));

Additionally, we need to define symbols for the input/output tensors and
export them.

.. code-block:: csl

   var ptr_vp : [*]f32 = &vp;
   var ptr_source : [*]f32 = &source;
   var ptr_d2h_buf_f32 : [*]f32 = &d2h_buf_f32;
   var ptr_zout : [*]f32 = &zout;

   comptime {
     @export_symbol(ptr_vp, "vp");
     @export_symbol(ptr_source, "source");
     @export_symbol(ptr_d2h_buf_f32, "maxmin_time");
     @export_symbol(ptr_zout, "zout");
   }

The layout file (``code_memcpy.csl``) needs to export the names of the
tensors as well.

.. code-block:: csl

   layout {
     @export_name("vp", [*]f32, true);
     @export_name("source", [*]f32, true);
     @export_name("maxmin_time", [*]f32, true);
     @export_name("zout", [*]f32, true);
   }

To launch a kernel, we need to export the color ``LAUNCH`` by
``@rpc(LAUNCH)``. The host-side ``SdkRuntime.call(sym, params)`` triggers the
function named by its first argument ``sym``: if the function symbol is
``f_activate_comp``, the time marching is started via ``@activate(COMP)``; if
the function symbol is ``f_prepare_zout``, ``f_prepare_zout()`` is called.
Both functions and the RPC color ``LAUNCH`` are exported via

.. code-block:: csl

   comptime {
     @export_symbol(f_activate_comp);
     @export_symbol(f_prepare_zout);
     @rpc(LAUNCH);
   }

The first ``SdkRuntime.call()`` invokes ``f_activate_comp``, which triggers
the entrypoint ``f_comp()`` to start the time marching and to record the
starting time.

.. code-block:: csl

   fn f_activate_comp(iter_cnt: u32) void {
     iterations = iter_cnt;
     @activate(COMP);
   }

   task f_comp() void {
     timestamp.enable_tsc();
     timestamp.get_timestamp(&tscStartBuffer);
     @activate(send);
   }

At the end of the time marching, the function ``epilog()`` checks
``iterationCount``: if it has reached the given ``iterations``, ``epilog()``
triggers the entrypoint ``CHECKPOINT`` to prepare the data for the first
``memcpy_d2h()``.

.. code-block:: csl

   fn epilog() void {
     ...
     if (iterationCount < iterations) {
       @activate(send);
     } else {
       // we've finished executing the program, and need to:
       // 1. record the value of the timestamp counter
       // 2. compute the minimum and maximum value of the wavefield
       // 3. send the timestamp values
       @activate(CHECKPOINT);
     }
   }

The function ``f_checkpoint()`` calls ``unblock_cmd_stream()`` to process the
next operation, which is the first ``memcpy_d2h()``. Without
``unblock_cmd_stream()``, the program stalls because the ``memcpy_d2h()`` is
never scheduled.

.. code-block:: csl

   fn f_checkpoint() void {
     // compute max/min of zValues
     // record timestamps
     // d2h_buf_f32 = { timestamps, max/min }
     sys_mod.unblock_cmd_stream();
   }

The second ``SdkRuntime.call()`` calls the function ``f_prepare_zout()`` to
gather the vector field into ``zout``. It also calls ``unblock_cmd_stream()``
to process the next operation, which is the second ``memcpy_d2h()``.

.. code-block:: csl

   fn f_prepare_zout() void {
     // toggle = 1 - (iterations % 2)
     var toggle: i32 = 1 - (@as(i32, iterations) % 2);
     if (0 == toggle) {
       mem_z_buf_dsd = @set_dsd_base_addr(mem_z_buf_dsd,
         @ptrcast([*]f32, &(zValues[0, zOffset])));
     } else {
       mem_z_buf_dsd = @set_dsd_base_addr(mem_z_buf_dsd,
         @ptrcast([*]f32, &(zValues[1, zOffset])));
     }
     @mov32(mem_zout_buf_dsd, mem_z_buf_dsd);
     sys_mod.unblock_cmd_stream();
   }
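The buffer selection in ``f_prepare_zout()`` reflects the ping-pong between
the two time levels of ``zValues``. The following host-side sketch mirrors
the same toggle arithmetic, which can be handy when validating results
offline; the helper function is hypothetical and not part of the example
code.

.. code-block:: python

   def final_z_row(iterations: int) -> int:
       """Mirror the toggle arithmetic of f_prepare_zout():
       toggle = 1 - (iterations % 2); toggle == 0 selects zValues[0, :],
       otherwise zValues[1, :]."""
       toggle = 1 - (iterations % 2)
       return 0 if toggle == 0 else 1

   # after an odd number of steps the result is in zValues[0, :],
   # after an even number it is in zValues[1, :]
   assert final_z_row(3) == 0
   assert final_z_row(4) == 1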
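Before turning to compilation, it is worth recapping the host-side ordering
that these device functions assume: the two H2Ds must land before the time
marching starts, and each ``memcpy_d2h()`` is released only by the matching
``unblock_cmd_stream()`` on the device. The condensed sketch below restates
the calls shown earlier; the per-transfer arguments (``px``, ``py``, ``w``,
``h``, ``l`` and the ``data_*`` buffers) are placeholders for the values
computed by ``runtime_utils``, and the final ``stop()`` is an assumption, so
check ``run.py`` for the actual teardown.

.. code-block:: python

   # 1. load the ELFs and start the program
   simulator.load()
   simulator.run()

   # 2. copy-mode H2D: send vp, then source
   simulator.memcpy_h2d(symbol_vp, data_vp, False, px, py, w, h, l, 0, False)
   simulator.memcpy_h2d(symbol_source, data_source, False, px, py, w, h, l, 0, False)

   # 3. launch the time marching; f_checkpoint() unblocks the next D2H
   simulator.call("f_activate_comp", [cast_uint32(iterations)], nonblock=False)
   simulator.memcpy_d2h(data_mm, symbol_maxmin_time, False, px, py, w, h, l, 0, False)

   # 4. gather the final wavefield into zout, then fetch it
   simulator.call("f_prepare_zout", [], nonblock=False)
   simulator.memcpy_d2h(data_z, symbol_zout, False, px, py, w, h, l, 0, False)

   # 5. shut down (assumed; check run.py for the actual teardown)
   simulator.stop()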
Finally, the user has to compile the kernel with the flag
``--fabric-offsets=4,1`` and four additional parameter sets to enable the
"new-style" tensor streaming:

- ``--memcpy``, to compile the infrastructure (a.k.a. the halo) that routes
  the data between the host and the device.
- ``--params=LAUNCH_ID:0``, to pass the ``LAUNCH`` color to ``cslc``.
- ``--channels=1``, to run ``SdkRuntime``, which supports ``copy mode`` and
  kernel launches.
- ``--width-west-buf=0`` and ``--width-east-buf=0``, because no buffers are
  inserted. If either of these two is nonzero, we must adjust
  ``--fabric-offsets`` and/or ``--fabric-dims``.

Here is the command to compile:

.. code-block:: bash

   cslc --params=LAUNCH_ID:0 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0 ...

code_memcpy.csl
---------------

.. literalinclude:: benchmarks/stencil-v2/code_memcpy.csl
   :language: csl

commands.sh
-----------

.. literalinclude:: benchmarks/stencil-v2/commands.sh
   :language: shell

run.py
------

.. literalinclude:: benchmarks/stencil-v2/run.py
   :language: python

task_memcpy.csl
---------------

.. literalinclude:: benchmarks/stencil-v2/task_memcpy.csl
   :language: csl