.. _tensor-streaming:

Tensor Streaming Implementations
================================

Tensor streaming through ``CSELFRunner``'s ``add_input_tensor()`` and
``add_output_tensor()`` methods is the primary way to transfer data to and from
the WSE with the legacy host runtime. These functions manage the data transfer
to and from the host's filesystem or memory, through the host and WSE network
interfaces, and finally route the data into your kernel. This last step is
implemented on the WSE itself in order to connect the I/O channel entry-points,
which are in fixed locations at the edges of the WSE, to each kernel, which has
a variable size and location. These I/O channels are connected to the fabric
routers at the east and west edge PEs, spaced roughly 16 rows apart, giving a
total of 66 channels at each edge:

.. _fig-lvds:

.. figure:: ./images/lvds.svg
   :align: center
   :scale: 25%

The SDK 0.5.1 release extends the optional new tensor streaming implementation
by introducing `batched execution mode` and `selective batched execution mode`.
These execution modes (only available with new tensor streaming) allow
``CSELFRunner`` to send its input and output tensors (with the same or
different numpy array data) to the fabric multiple times per fabric startup
operation, avoiding the fabric load and reset overhead on subsequent runs.

The SDK 0.6.0 release adds a couple of features on top of SDK 0.5.1, including
direct memory copy and kernel launches. The new runtime ``SdkRuntime`` loads
the binaries and starts the simulator or WSE first; the user can then stream or
copy tensors at any time. There is no need to declare H2D/D2H signatures the
way ``CSELFRunner`` requires. The user can also transform the batched
operations in SDK 0.5.1 into a calling sequence in SDK 0.6.0 without describing
the batched pattern in advance. The user can run existing code in SDK 0.6.0
without any modification; however, the code will not take advantage of the new
runtime features in SDK 0.6.0.

The SDK 0.7.0 release adds three features on top of SDK 0.6.0:

- ``copy mode`` supports 16-bit data.
- The input/output tensors can be formed in column-major order to improve the
  bandwidth.
- The kernel launch is simplified because the compiler generates the RPC code.
  The user only needs to export the host-callable functions.

Old-style Tensor Streaming
--------------------------

In "old-style" tensor streaming, the only implementation before the SDK 0.5.0
release, ``CSELFRunner`` adds a 1-PE wide box or "halo" on all four sides of
the wafer. The halo sends the correct tensors towards the correct edge of the
kernel. Up to 12 total input and output tensors can be used. Each tensor is
assigned to a unique I/O channel, resulting in these properties:

- All tensors are sent and received concurrently.
- Each tensor transfer is limited to the maximum bandwidth of a single I/O
  channel.
- Each I/O channel has an independent data path to and from the WSE, so the
  total application I/O bandwidth scales up with the number of tensors used.

In this implementation, the halo only delivers tensor data to a PE (or
row/column of PEs) at the outer edge of the kernel. The ``portmap`` argument
must specify the source or destination edge. The kernel itself must manage the
internal distribution of the tensor data. No extra code is added to the kernel
in this tensor streaming implementation.

Old-style streaming is the default, and existing applications will continue to
default to this implementation.
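
For orientation, a minimal old-style host script has the following shape. This
is a sketch only: the ELF path, color numbers, portmap strings, and the
``out_tensor`` name are illustrative placeholders rather than values from a
shipped example, and ``CSELFRunner`` is assumed to be imported from the SDK
host runtime package of the installed release.

.. code-block:: python

    import numpy as np
    # CSELFRunner is assumed to be importable from the SDK host runtime
    # package; the exact import path depends on the installed release.

    elf_list = ["out/kernel.elf"]   # placeholder kernel binaries
    in_color, out_color = 1, 2      # placeholder I/O tensor colors
    iportmap_in = "..."             # input portmap naming the source edge
    oportmap_out = "..."            # output portmap naming the destination edge

    A = np.arange(16, dtype=np.float32)  # example input data

    runner = CSELFRunner(elf_list)
    # Declare one H2D and one D2H tensor; each gets its own I/O channel.
    runner.add_input_tensor(in_color, iportmap_in, A)
    runner.add_output_tensor(out_color, oportmap_out, np.float32)
    # Load the binaries, run, and stream all declared tensors.
    runner.connect_and_run()
    result = runner.out_tensor_dict["out_tensor"]  # placeholder output name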

New-style Tensor Streaming
--------------------------

The "new-style" tensor streaming uses additional PEs around the user kernel to
route the tensor data and also adds a small executable component to the kernel
PEs. The additional support PEs consume three columns on the west of the kernel
and two columns on the east.

.. _fig-components:

.. figure:: ./images/memcpy-components.svg
   :align: center

This implementation uses three routable colors, leaving 21 routable colors
(color 0 to color 20) available. It only supports up to four input and up to
four output tensor colors. Each tensor's data is sent over a single shared
input or output I/O channel, resulting in these properties:

- Tensors are sent/received in the order they were created with
  ``memcpy_h2d()`` and ``memcpy_d2h()``.
- Total application I/O bandwidth in one direction is limited to the bandwidth
  of a single I/O channel.

In this implementation, tensor data can be delivered directly to each kernel
PE, without requiring additional routing/distribution code in the kernel.
(Tensor data can also be delivered only to the edges, as in old-style
streaming.) The ``portmap`` argument must specify a rectangle of PEs as the
source/destination.

SDK 0.7.0 has a different runtime, ``SdkRuntime``, which supports up to 16 I/O
channels, while ``CSELFRunner`` in SDK 0.5.1 only supports a single I/O
channel. In addition to the bandwidth improvement, SDK 0.7.0 can also reduce
the I/O latency by inserting buffers on either side of the core kernel.

Enabling New-Style Tensor Streaming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the kernel, SDK 0.7.0 can stream the data into the device or copy the data
directly into a given device memory location. The former is called
``streaming mode`` and the latter is called ``copy mode``. ``streaming mode``
requires the user to count the number of received wavelets in order to process
the next task. ``copy mode`` copies the data into memory directly without
notifying the user. To compute ``f(A)`` in the kernel, the user has two
options:

1. ``streaming mode``: define a WTT to receive the tensor ``A`` and call a
   function to compute ``f(A)`` after all of ``A`` has been received.
2. ``copy mode``: send tensor ``A`` first, then launch a kernel function to
   compute ``f(A)``.

Here are some highlights; the details can be found in
``SdkRuntime/residual-memcpy`` and ``SdkRuntime/stencil-memcpy``.

- SDK 0.7.0 introduces a new runtime ``SdkRuntime`` which is different from
  ``CSELFRunner`` in SDK 0.5.0. However, SDK 0.7.0 supports ``CSELFRunner`` as
  well, so the user can run existing code without any change. To distinguish
  ``SdkRuntime`` from ``CSELFRunner`` on the kernel side, the compiler
  introduces a new flag, called ``--channels``. If the user passes
  ``--channels=0`` to ``cslc``, the code must instantiate the memcpy module by
  ``<memcpy/memcpy>``, and it is compiled to run with ``CSELFRunner``. If the
  user passes ``--channels=k`` (where k is positive) to ``cslc``, the code must
  instantiate the memcpy module by ``<memcpy_multi/memcpy>``, and it is
  compiled to run with ``SdkRuntime``. If the user does not specify
  ``--channels``, the default value is zero. SDK 0.7.0 can activate more I/O
  channels than SDK 0.5.1: the number of I/O channels is the number ``k`` in
  ``--channels=k``. The largest value of ``--channels`` is 16 in SDK 0.7.0, so
  SDK 0.7.0 can deliver up to 16x more bandwidth than SDK 0.5.1.

- The user must instantiate the memcpy module with an ``@import_module()``
  statement.

  The parameters passed to the memcpy module include the colors used to stream
  data, a data type for copying data, and a color used to launch kernel
  functions.

- To stream data in/out, the user must pass the colors into the memcpy module.
  Input tensor parameters are prefixed with ``MEMCPYH2D_`` and output tensor
  parameters are prefixed with ``MEMCPYD2H_``, followed by the tensor ID, an
  integer in the range 1-4. Unused colors should be omitted, and only four
  colors per direction are allowed. The supplied color values should match
  those used in the host API, either
  ``add_input_tensor()``/``add_output_tensor()`` calls to ``CSELFRunner`` or
  ``memcpy_h2d()``/``memcpy_d2h()`` calls to ``SdkRuntime``.

  The parameter ``px`` is the x-coordinate of the core rectangle. SDK 0.5.1
  does not use ``memcpyParams``; instead, the user has to define
  ``.first_pe=...`` and ``.last_pe=...``. SDK 0.7.0 hides the parameter ``px``
  under ``memcpyParams``; usually the layout file passes ``memcpyParams`` to
  the kernel file. Here is one example for SDK 0.7.0:

  .. code-block:: csl

      const memcpy = @import_module( "<memcpy_multi/get_params>", .{
          .width = width,
          .height = height
          });

      const memcpyParams = memcpy.get_params(px);

      const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
          .MEMCPYH2D_1 = first_input_color,
          .MEMCPYH2D_2 = second_input_color,
          ...,
          .MEMCPYD2H_1 = first_output_color,
          ...
          }));

  To run with SDK 0.5.1, the user can replace ``memcpy_multi/memcpy`` by
  ``memcpy/memcpy``, remove ``memcpyParams``, and add ``.first_pe=...`` and
  ``.last_pe=...`` to the memcpy module.

- The user must pass input/output color ID and value pairs to ``cslc`` as
  parameters, where ``x`` is the tensor ID:

  .. code-block:: shell

      --params=MEMCPYH2D_DATA_x_ID:<color value>

- To stream the data into the device, the user can either use a WTT
  (wavelet-triggered task) to read the data from an input tensor color or use
  a microthread to read the data. To bind a WTT task to an input color, call
  ``@bind_task``:

  .. code-block:: csl

      // read data on color MEMCPYH2D_DATA_1
      @bind_task(f_memcpyh2d_data_1, MEMCPYH2D_DATA_1);

- The user can send data to an output tensor color using a microthread. To
  write the data on P1.0 via the microthread with output queue 1:

  .. code-block:: csl

      @mov32(fab_trans_nrm_wdsd_p10, mem_nrm_buf_dsd, .{ .async = true });

- To copy data to/from the device, the user has to pass the data type to the
  memcpy module and define the symbols for the tensors to be copied. For
  example, the following code defines a pointer ``ptr_A`` pointing to tensor
  ``A``, and exports it. The type of ``A`` must be the same as ``data_type``;
  in other words, the user cannot copy two tensors of different types. See the
  complete example in ``SdkRuntime/residual-memcpy/residual_memcpy.csl``.

  .. code-block:: csl

      const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
          ...,
          .data_type = f32
          }));
      ...
      var A = @zeros([4]f32);
      var ptr_A : [*]f32 = &A;
      ...
      comptime {
          @export_symbol(ptr_A, "A");
      }

- SDK 0.7.0 supports both 16-bit and 32-bit data transfer via ``copy mode`` or
  ``streaming mode``. If the user does not use ``runtime_utils`` to prepare the
  input tensors for a 16-bit tensor but calls the low-level ``memcpy_h2d()``
  directly, zero extension from 16-bit to 32-bit must be performed. For
  example, if the tensor is ``[10]u16``, the user has to prepare a ``[10]u32``
  and fill the lower 16 bits of each u32 with the input tensor. The same holds
  for ``memcpy_d2h()``, which returns 32-bit data where the upper 16 bits are
  zero; the user has to strip out the upper 16 bits.

- SDK 0.7.0 can launch a kernel function.

  The user has to specify a color ``LAUNCH`` to the memcpy module and define a
  function to be launched.

  .. code-block:: csl

      const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
          ...,
          .LAUNCH = LAUNCH
          }));

  The following is an example of the kernel launching protocol. It calls
  different kernel functions, ``f1()`` or ``f2()``. See the complete example in
  ``SdkRuntime/residual-memcpy/residual_memcpy.csl``.

  .. code-block:: csl

      fn f1(...) void {
          ...
      }

      fn f2(...) void {
          ...
      }

      comptime {
          @export_symbol(f1);
          @export_symbol(f2);
          @rpc(LAUNCH);
      }

- SDK 0.7.0 can insert buffers in the infrastructure to reduce the latency of
  I/O. The buffer stores the wavelets from the I/O for one row of PEs while the
  core is busy and cannot process the wavelets from the I/O. In other words,
  the buffer acts like a ``prefetch`` from the computation point of view. There
  are two kinds of buffers: one stores the data for host-to-device, and the
  other stores the data for device-to-host. The width of the former is
  configured by ``--width-west-buf`` and the width of the latter is configured
  by ``--width-east-buf``. By default, ``--width-west-buf=0`` and
  ``--width-east-buf=0``, i.e., no buffers are inserted. ``--width-west-buf=k``
  means k columns of PEs are inserted at the west side of the core kernel, and
  each such PE has 46 KB of storage. If the user has 500 PEs in a row, then
  46 KB can buffer 23 wavelets per PE (46 KB divided by 4 bytes per wavelet and
  500 PEs). If the user wants to stream/copy a tensor of size 100 per PE, then
  ``--width-west-buf=5`` can hold the whole tensor.

- The user can block/unblock input or output tensor colors. For example, the
  user can overlap computation and communication by unblocking/blocking the
  color.

- The user must not set or modify the routing of an input tensor color, output
  tensor color, or kernel launch color. The routing pattern is configured
  implicitly by the compiler. If the user modifies those routing patterns, the
  behavior is undefined.

- The user must add ``--memcpy`` to the ``cslc`` command.

- The user must pass input/output tensor colors to ``cslc`` via
  ``MEMCPYH2D_DATA_x_ID`` and ``MEMCPYD2H_DATA_y_ID``, where ``x`` and ``y``
  are integers starting from 1.

- The user must use ``--fabric-offsets=x,y`` to compile the kernel. SDK 0.5.1
  requires ``--fabric-offsets=4,1``; the runtime will emit an error if
  ``--fabric-offsets`` is not 4,1. SDK 0.7.0 requires
  ``x >= 4 + width-west-buf`` and ``y >= 1``.

- The user must use ``--fabric-dims=dim_x,dim_y`` to compile the kernel. SDK
  0.5.1 requires ``dim_x >= x + width + 3`` and ``dim_y >= y + height + 1``.
  SDK 0.7.0 requires ``dim_x >= x + width + 3 + width-east-buf`` and
  ``dim_y >= y + height + 1``.

- The user must not use reserved resources: colors 21, 22, 23, 27, 28, 29, 30,
  and input/output queue 0. The compiler and runtime cannot detect a resource
  conflict and cannot emit any error messages if one occurs.

For the host (SDK 0.5.1):

- Pass the kernel height and width, and the input/output tensor colors, to
  ``CSELFRunner()``. For reference, see the following example, or the complete
  example in SDK 0.5.1 ``residual-memcpy/run_memcpy.py``:

  .. code-block:: python

      simulator = CSELFRunner(elf_list, debug_mode=True, cmaddr=args.cmaddr,
                              height=height, width=width,
                              input_colors=set(c_h2d),
                              output_colors=set(c_d2h))

- The user must use a topological ordering of the data graph to issue the
  input/output tensor commands. For example, the residual example receives
  ``A``, ``x`` and ``b``, then computes ``|b-A*x|``. The result ``|b-A*x|`` can
  be sent out only after all inputs are received and computed, so the output
  tensor must be added last.

  The following host code sends three input tensors from the host, in order,
  before initiating the output tensor receive on the host.

  .. code-block:: python

      simulator.add_input_tensor(c_h2d[0], iportmap_A, A)
      simulator.add_input_tensor(c_h2d[1], iportmap_x, x)
      simulator.add_input_tensor(c_h2d[2], iportmap_b, b)
      simulator.add_output_tensor(c_d2h[0], oportmap_nrm_r, np.float32)

  The kernel defines a WTT to read ``A``, ``x`` and ``b``, and it does not
  assume any ordering among the three tensors. Instead, it uses boolean
  variables ``x_is_ready``, ``A_is_ready`` and ``Ax_is_ready`` to define the
  order. This allows the ``add_input_tensor()`` calls to occur in any order.

For the host (SDK 0.7.0):

SDK 0.7.0 introduces a new runtime, ``SdkRuntime``, which supports memory
transfers and kernel launches via ``memcpy_h2d()``, ``memcpy_d2h()`` and
``call()``. Each API can be a blocking call or a nonblocking call, depending on
the parameter ``nonblock`` of the API. If blocking mode (``nonblock=False``) is
selected, the API waits until the operation is done; otherwise, the API returns
before the operation even starts. ``SdkRuntime`` can aggregate multiple
nonblocking operations together to reduce the latency. However, the user must
take care to avoid race conditions in nonblocking mode. For example, if the
user has two ``memcpy_d2h()`` calls to the same destination, the content of the
destination is undefined if both operations are nonblocking.

- Pass the artifact which contains the ELFs, and the IP address of the WSE, to
  ``SdkRuntime()``. Then the user can load the ELFs by ``load()`` and start the
  simulator or WSE by ``run()``. After that, the user can do any operation,
  either memory transfers or kernel launches. Finally, the user calls
  ``stop()`` to shut down the simulator or WSE. For reference, see the
  following example, or the complete example in
  ``SdkRuntime/residual-memcpy/run_memcpy.py``:

  .. code-block:: python

      simulator = SdkRuntime(args.name, cmaddr=args.cmaddr)
      simulator.load()
      simulator.run()
      # a sequence of operations
      simulator.stop()

- Import ``MemcpyDataType`` and ``MemcpyOrder`` in order to specify the data
  type and ordering of the tensor:

  .. code-block:: python

      from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime
      from cerebras.sdk.runtime.sdkruntimepybind import MemcpyDataType
      from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder

- The API ``memcpy_h2d()`` transfers a tensor from host to device with two
  modes: one is ``streaming mode`` and the other is ``copy mode``. The former
  corresponds to ``streaming=True`` and the latter corresponds to
  ``streaming=False``. The region of interest is a 4-tuple ``(x, y, w, h)``,
  which indicates a subrectangle starting at ``(x, y)`` with (width, height) =
  ``(w, h)``. The parameter ``l`` indicates the number of elements (wavelets)
  per PE. The parameter ``data_type`` specifies either 16-bit
  (``MemcpyDataType.MEMCPY_16BIT``) or 32-bit (``MemcpyDataType.MEMCPY_32BIT``)
  for copy mode; this parameter is ignored in streaming mode. The parameter
  ``order`` specifies row-major order (``MemcpyOrder.ROW_MAJOR``) or
  column-major order (``MemcpyOrder.COL_MAJOR``) for an input/output tensor of
  the form ``A[h][w][l]``. The parameter ``nonblock`` indicates whether the
  operation is blocking or nonblocking. The parameter ``dest`` is the color of
  the H2D stream if ``streaming=True`` (streaming mode) and is the symbol if
  ``streaming=False`` (copy mode).

  .. code-block:: python

      memcpy_h2d(dest, src, x, y, w, h, l, streaming, data_type, order, nonblock)

  The following example shows a ``copy mode`` transfer that sends a tensor
  ``A``. The util function converts the tensor ``A`` based on the portmap
  ``iportmap_A``. For reference, see the complete example in
  ``SdkRuntime/residual-memcpy/run_memcpy.py``.

  .. code-block:: python

      symbol_A = simulator.get_id("A")
      (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
      simulator.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                           streaming=False,
                           data_type=MemcpyDataType.MEMCPY_32BIT,
                           order=MemcpyOrder.ROW_MAJOR,
                           nonblock=False)

- The API ``memcpy_d2h()`` transfers a tensor from device to host with the same
  two modes as ``memcpy_h2d()``. The first parameter ``dest`` is the host
  tensor that receives the data from the device. The second parameter ``src``
  is the color of the D2H stream if ``streaming=True`` (streaming mode) and is
  the symbol if ``streaming=False`` (copy mode). The other parameters are the
  same as for ``memcpy_h2d()``.

  .. code-block:: python

      memcpy_d2h(dest, src, x, y, w, h, l, streaming, data_type, order, nonblock)

  The following example shows a ``copy mode`` transfer that receives a scalar
  ``|b - A*x|``. The util function allocates a workspace ``data`` to receive
  the wavelets from the device, then converts ``data`` back to ``nrm_r_cs``
  based on the portmap ``oportmap_nrm_r``. For reference, see the complete
  example in ``SdkRuntime/residual-memcpy/run_memcpy.py``.

  .. code-block:: python

      symbol_nrm = simulator.get_id("nrm")
      (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
      simulator.memcpy_d2h(data, symbol_nrm, px, py, w, h, l,
                           streaming=False,
                           data_type=MemcpyDataType.MEMCPY_32BIT,
                           order=MemcpyOrder.ROW_MAJOR,
                           nonblock=False)
      nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)

- SDK 0.7.0 adds support for column-major ordering of the input/output tensor.
  The parameter ``order`` of ``memcpy_h2d()`` and ``memcpy_d2h()`` specifies
  either row-major or column-major. The column-major version delivers better
  bandwidth than row-major. ``runtime_utils`` only supports row-major; to use
  column-major, the user must prepare the column-major host tensor and call the
  low-level ``memcpy_h2d()`` and ``memcpy_d2h()`` directly. For reference, see
  the complete example in ``SdkRuntime/bandwidthTest/run.py``.

- SDK 0.7.0 introduces a new API, ``call()``, to support kernel launches.
  ``call()`` is simpler than ``memcpy_launch()``, but the kernel has to export
  the color ``LAUNCH`` and the host-callable functions. The first parameter
  ``sym`` is the symbol of the host-callable function. The second parameter
  ``params`` is an array of the parameters of the host-callable function; the
  size of each parameter is 4 bytes (a wavelet, or the size of a u32). The
  third parameter ``nonblock`` selects blocking or nonblocking mode.

  .. code-block:: python

      params = np.zeros(K).astype(np.uint32)
      # format of params
      # +---------------------+
      # |       param 1       |  1st wavelet
      # +---------------------+
      # |       param 2       |  2nd wavelet
      # +---------------------+
      # |       param 3       |  3rd wavelet
      # +---------------------+
      # |         ...         |
      # +---------------------+
      # params has K wavelets, where K = number of parameters

      simulator.call(sym, params, nonblock)

  The advantage of ``call()`` is twofold:

  - The user does not need to encode the function ID.
  - The user does not need to implement the WTT of color ``LAUNCH``.

Batched execution mode (SDK 0.5.1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In batched execution mode, the host script should replace the call to
``CSELFRunner.connect_and_run()`` with calls to:

.. code-block:: python

    simulator = CSELFRunner(elf_list, ..., input_colors, ...)
    simulator.add_input_tensor(color, iportmap, np_arr=None, sentinel=None, reusable=True)
    simulator.add_output_tensor(outColor, oportmap, np.float32)

    simulator.load()
    simulator.start()

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_1)
    simulator.run_batch()
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_2)
    simulator.run_batch()
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.stop()

Note these requirements:

- The ``CSELFRunner`` constructor is passed input and output color arguments,
  so that the system uses new tensor streaming.
- The ``add_input_tensor()`` call sets ``np_arr`` to None. If an actual
  ``np_arr`` value is passed instead, the subsequent initial call to
  ``prepare_input_tensor()`` can be omitted.
- The ``add_input_tensor()`` call sets ``reusable`` to True. This enables
  batched mode for this tensor. All created input and output tensors must take
  the same ``reusable`` value.
- The ``prepare_input_tensor()`` call replaces any existing prepared array data
  within ``CSELFRunner`` with the new values passed in through ``np_arr``.
- The ``run_batch()`` call sends and receives all declared tensors to and from
  the fabric, and waits until all bytes are sent and received, as if
  ``connect_and_run()`` had been called. However, the execution or simulation
  does not stop at this point.
- The output tensor from this run can be copied from ``out_tensor_dict`` as
  usual. The data in this dict will be lost once ``run_batch()`` is called
  again.
- The ``stop()`` function must be called to stop the simulator after the last
  batch. A running program on the wafer will not be stopped or reset after each
  batch. The kernel itself is responsible for any program state init/reset that
  may be needed after each batch.

Selective batched execution mode (SDK 0.5.1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Selective batched execution mode extends batched execution mode by allowing a
list of input and output tensors to be sent, instead of sending all declared
tensors every time ``run_batch()`` is called. With this extension, a program
could have distinct init and termination phases. For example:

.. code-block:: python

    simulator = CSELFRunner(elf_list, ..., input_colors, ...)
    simulator.add_input_tensor(initColor, init_iportmap, np_arr=None, sentinel=None, reusable=True)
    simulator.add_input_tensor(color, iportmap, np_arr=None, sentinel=None, reusable=True)
    simulator.add_output_tensor(outColor, oportmap, np.float32)
    simulator.add_output_tensor(finalOutColor, final_oportmap, np.float32)

    simulator.load()
    simulator.start()

    # run a one-time init step
    simulator.prepare_input_tensor(initColor, init_iportmap, np_arr=init_data)
    simulator.run_batch([initColor])

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_1)
    simulator.run_batch([color, outColor])
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_2)
    simulator.run_batch([color, outColor])
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.run_batch([finalOutColor])
    final_result_tensor = simulator.out_tensor_dict["final_out_tensor"]

    simulator.stop()

Note these requirements:

- The ``run_batch()`` call now takes an optional list of colors. If no list is
  given, all declared tensors are sent and received. Otherwise, only the listed
  colors are sent.

See the :ref:`hadamard-product` example for examples of using these batched
execution features.

Upcoming improvements
~~~~~~~~~~~~~~~~~~~~~

Although SDK 0.7.0 can activate up to 16 I/O channels, it still does not
saturate the I/O bandwidth. A future release will lift this restriction and
bring more I/O channels to increase the raw bandwidth.