Tensor Streaming Implementations

Tensor streaming through CSELFRunner’s add_input_tensor() and add_output_tensor() methods is the primary way to transfer data to and from the WSE with the legacy host runtime. These functions manage the data transfer to and from the host’s filesystem or memory, through the host and WSE network interfaces, and finally route the data into your kernel. This last step is implemented on the WSE itself in order to connect the I/O channel entry points, which are in fixed locations at the edges of the WSE, to each kernel, which has a variable size and location. These I/O channels are connected to the fabric routers at the east and west edge PEs, spaced roughly 16 rows apart, giving a total of 66 channels at each edge:

[Figure: lvds.svg — I/O channel connections at the east and west edges of the WSE]

The SDK 0.5.1 release extends the optional new tensor streaming implementation by introducing batched execution mode and selective batched execution mode. These execution modes (only available with new tensor streaming) allow CSELFRunner to send its input and output tensors (with the same or different numpy array data) to the fabric multiple times per fabric startup operation, avoiding the fabric load and reset overhead on subsequent runs.

The SDK 0.6.0 release adds a couple of features on top of SDK 0.5.1, including direct memory copy and kernel launches. The new runtime, SdkRuntime, loads the binaries and starts the simulator or WSE first; the user can then stream or copy tensors at any time. There is no need to declare H2D/D2H signatures the way CSELFRunner requires. The user can also transform the batched operations of SDK 0.5.1 into a calling sequence in SDK 0.6.0 without describing the batched pattern in advance. Existing code runs in SDK 0.6.0 without any modification; however, it will not take advantage of the new runtime features in SDK 0.6.0.

The SDK 0.7.0 release adds three features on top of SDK 0.6.0:

  • Copy mode supports 16-bit data.

  • The input/output tensors can be formed in column-major order to improve the bandwidth.

  • The kernel launch is simplified because the compiler generates the RPC code. The user only needs to export the host-callable functions.

Old-style Tensor Streaming

In “old-style” tensor streaming, the only implementation before the SDK 0.5.0 release, CSELFRunner adds a 1-PE wide box or “halo” on all four sides of the wafer. The halo sends the correct tensors towards the correct edge of the kernel. Up to 12 total input and output tensors can be used. Each tensor is assigned to a unique I/O channel, resulting in these properties:

  • All tensors are sent and received concurrently.

  • Each tensor transfer is limited to the maximum bandwidth of a single I/O channel.

  • Each I/O channel has an independent data path to and from the WSE, so the total application I/O bandwidth scales up with the number of tensors used.

In this implementation, the halo only delivers tensor data to a PE (or row/column of PEs) at the outer edge of the kernel. The portmap argument must specify the source or destination edge. The kernel itself must manage the internal distribution of the tensor data. No extra code is added to the kernel in this tensor streaming implementation. Old-style streaming is the default, and existing applications will continue to default to this implementation.
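
For reference, the host-side flow with old-style streaming looks roughly like the following. This is a minimal sketch only: the ELF list, portmap strings, colors, and tensor names are hypothetical placeholders, and the calls shown (CSELFRunner(), add_input_tensor(), add_output_tensor(), connect_and_run(), out_tensor_dict) are the ones discussed elsewhere in this document.

    # Old-style streaming: no input_colors/output_colors are passed to the
    # constructor, so CSELFRunner defaults to the halo-based implementation.
    simulator = CSELFRunner(elf_list, cmaddr=args.cmaddr)

    # Each tensor gets its own I/O channel; the portmaps name the kernel edge.
    simulator.add_input_tensor(in_color, iportmap_A, A)
    simulator.add_output_tensor(out_color, oportmap_y, np.float32)

    # Load the binaries, run, and stream all declared tensors.
    simulator.connect_and_run()
    y = simulator.out_tensor_dict["y"]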

New-style Tensor Streaming

The “new-style” tensor streaming uses additional PEs around the user kernel to route the tensor data and also adds a small executable component to the kernel PEs. The additional support PEs consume three columns on the west of the kernel and two columns on the east.

[Figure: memcpy-components.svg — support PE columns added around the user kernel for new-style tensor streaming]

This implementation uses three routable colors, leaving routable colors 0 through 20 available. It supports up to four input and up to four output tensor colors. Each tensor’s data is sent over a single shared input or output I/O channel, resulting in these properties:

  • Tensors are sent/received in the order they were created with memcpy_h2d() and memcpy_d2h().

  • Total application I/O bandwidth in one direction is limited to the bandwidth of a single I/O channel.

In this implementation, tensor data can be delivered directly to each kernel PE, without requiring additional routing/distribution code in the kernel. (Tensor data can also be delivered only to the edges, as in old-style streaming.) The portmap argument must specify a rectangle of PEs as the source/destination.

SDK 0.7.0 has a different runtime, SdkRuntime, which supports up to 16 I/O channels, while CSELFRunner in SDK 0.5.1 only supports a single I/O channel. In addition to the bandwidth improvement, SDK 0.7.0 can also reduce I/O latency by inserting buffers on either side of the core kernel.

Enabling New-Style Tensor Streaming

For the kernel,

SDK 0.7.0 can stream data into the device or copy data directly into a given device memory location. The former is called streaming mode and the latter is called copy mode. Streaming mode requires the user to count the number of received wavelets in order to trigger the next task. Copy mode copies the data into memory directly, without notifying the user. To compute f(A) in the kernel, the user has two options (a host-side sketch follows the list):

  1. streaming mode: define a WTT to receive the tensor A and call a function to compute f(A) after all of A has been received.

  2. copy mode: send tensor A first, then launch a kernel function to compute f(A).
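
From the host side, the two options look roughly as follows. This is a minimal sketch only: the color, symbol, array, and size names are hypothetical, it assumes a function symbol can be obtained with get_id() in the same way as a tensor symbol, and the APIs used (memcpy_h2d() and call()) are described in detail under “For the host (SDK 0.7.0)” below.

    # Option 1, streaming mode: stream A on an H2D color; the kernel's WTT
    # counts the incoming wavelets and computes f(A) once all of A has arrived.
    simulator.memcpy_h2d(h2d_color_1, A_host, 0, 0, w, h, l, streaming=True,
                         data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)

    # Option 2, copy mode: copy A into the exported device symbol, then launch
    # the host-callable function that computes f(A).
    simulator.memcpy_h2d(symbol_A, A_host, 0, 0, w, h, l, streaming=False,
                         data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)
    simulator.call(simulator.get_id("f"), f_params, nonblock=False)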

Here are some highlights; the details can be found in SdkRuntime/residual-memcpy and SdkRuntime/stencil-memcpy.

  • SDK 0.7.0 introduces a new runtime, SdkRuntime, which is different from CSELFRunner in SDK 0.5.0. However, SDK 0.7.0 supports CSELFRunner as well, so the user can run existing code without any change. To distinguish SdkRuntime from CSELFRunner on the kernel side, the compiler introduces a new flag, --channels. If the user passes --channels=0 to cslc, the code must instantiate the memcpy module via <memcpy/memcpy>, and it is compiled to run with CSELFRunner. If the user passes --channels=k (k positive) to cslc, the code must instantiate the memcpy module via <memcpy_multi/memcpy>, and it is compiled to run with SdkRuntime. If the user does not specify --channels, the default value is zero.

    SDK 0.7.0 can activate more I/O channels than SDK 0.5.1. The number of I/O channels is the number k in --channels=k. The largest value of --channels is 16 in SDK 0.7.0, so SDK 0.7.0 can deliver up to 16x the bandwidth of SDK 0.5.1.

  • The user must instantiate the memcpy module with an @import_module() statement. The parameters passed to the memcpy module include the colors used to stream data, the data type used to copy data, and a color used to launch kernel functions.

  • To stream data in/out, the user must pass the colors to the memcpy module. Input tensor colors are prefixed with MEMCPYH2D_ and output tensor colors with MEMCPYD2H_, followed by the tensor ID, an integer in the range 1-4. Unused colors should be omitted, and only four colors per direction are allowed. The supplied color values should match those used in the host API, either the add_input_tensor()/add_output_tensor() calls to CSELFRunner or the memcpy_h2d()/memcpy_d2h() calls to SdkRuntime.

    The parameter px is the x-coordinate (column) of the PE within the core rectangle. SDK 0.5.1 does not use memcpyParams; instead, the user has to define .first_pe=... and .last_pe=.... SDK 0.7.0 hides the parameter px inside memcpyParams; usually the layout file passes memcpyParams to the kernel file. Here is one example for SDK 0.7.0:

    const memcpy = @import_module( "<memcpy_multi/get_params>", .{
         .width = width,
         .height = height
    });
    const memcpyParams = memcpy.get_params(px);
    const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
         .MEMCPYH2D_1=first_input_color,
         .MEMCPYH2D_2=second_input_color,
         ...,
         .MEMCPYD2H_1=first_output_color,
         ...
    }));
    

    To run with SDK 0.5.1, the user can replace memcpy_multi/memcpy with memcpy/memcpy, remove memcpyParams, and add .first_pe=... and .last_pe=... to the memcpy module parameters.

  • The user must pass input/output color ID and value pairs to cslc as parameters, where x is the tensor ID:

    --params=MEMCPYH2D_DATA_<x>_ID:<input_color>
    
  • To stream data into the device, the user can either use a WTT (wavelet-triggered task) to read the data from an input tensor color or use a microthread to read the data. To bind a WTT to an input color, call @bind_task:

    // read data on color MEMCPYH2D_DATA_1
    @bind_task(f_memcpyh2d_data_1, MEMCPYH2D_DATA_1 );
    
  • The user can send data on an output tensor color using a microthread. To write the data on PE P1.0 via a microthread with output queue 1:

    // Asynchronously copy the result from the memory DSD to the fabric output DSD (DSD names from the residual example)
    @mov32(fab_trans_nrm_wdsd_p10, mem_nrm_buf_dsd, .{.async=true});
    
  • To copy data to/from the device, the user has to pass the data type to the memcpy module and define symbols for the tensors to be copied.

    For example, the following code defines a pointer ptr_A pointing to tensor A and exports it. The type of A must be the same as data_type; in other words, the user cannot copy two tensors of different types. See the complete example in SdkRuntime/residual-memcpy/residual_memcpy.csl.

    const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
        ...,
        .data_type=f32
    }));
    ...
    var A = @zeros([4]f32);
    var ptr_A : [*]f32 = &A;
    ...
    comptime {
        @export_symbol(ptr_A, "A");
    }
    
  • SDK 0.7.0 supports both 16-bit and 32-bit data transfer via copy mode or streaming mode.

    If the user does not use runtime_utils to prepare the input tensors for a 16-bit tensor but calls the low-level memcpy_h2d() directly, zero extension from 16-bit to 32-bit must be performed. For example, if the tensor is [10]u16, the user has to prepare a [10]u32 array and fill the lower 16 bits of each u32 with the input data. The same holds for memcpy_d2h(), which returns 32-bit data whose upper 16 bits are zero; the user has to strip out the upper 16 bits.
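
    For example, a minimal host-side sketch of the zero extension for a [10]u16 tensor, with hypothetical array names (the low-level memcpy_h2d()/memcpy_d2h() calls themselves are omitted):

    import numpy as np

    x_u16 = np.arange(10, dtype=np.uint16)
    x_u32 = x_u16.astype(np.uint32)      # zero-extend: data sits in the lower 16 bits
    # ... pass x_u32 to the low-level memcpy_h2d() ...

    y_u32 = np.zeros(10, dtype=np.uint32)
    # ... memcpy_d2h() fills y_u32; the upper 16 bits of each element are zero ...
    y_u16 = y_u32.astype(np.uint16)      # strip the upper 16 bits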

  • SDK 0.7.0 can launch a kernel function. The user has to specify a color LAUNCH to the memcpy module and define the functions to be launched.

    const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
        ...,
        .LAUNCH = LAUNCH
    }));
    

    The following is an example of the kernel launch protocol. It calls different kernel functions, f1() or f2(). See the complete example in SdkRuntime/residual-memcpy/residual_memcpy.csl.

    fn f1(...) void {
       ...
    }
    fn f2(...) void {
       ...
    }
    comptime {
       @export_symbol(f1);
       @export_symbol(f2);
       @rpc(LAUNCH);
    }
    
  • SDK 0.7.0 can insert buffers in the infrastructure to reduce I/O latency. A buffer stores the wavelets arriving from the I/O channels for one row of PEs while the core is busy and cannot process them; in other words, the buffer acts like a prefetch from the computation’s point of view. There are two kinds of buffers: one stores data for host-to-device transfers, and the other stores data for device-to-host transfers. The width of the former is configured by --width-west-buf and the width of the latter by --width-east-buf. By default, --width-west-buf=0 and --width-east-buf=0, i.e. no buffers are inserted.

    --width-west-buf=k means k columns of PEs are inserted on the west side of the core kernel, and each buffer PE has 46 KB of storage. If the user has 500 PEs in a row, then 46 KB can buffer 23 wavelets per PE. If the user wants to stream/copy a tensor of size 100 per PE, then --width-west-buf=5 can hold the whole tensor.
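
    The sizing arithmetic from this example, written out (numbers taken from the paragraph above):

    buf_per_pe = 46 * 1024                   # bytes of storage per buffer PE
    row_pes = 500                            # core PEs in one row
    wavelet_bytes = 4                        # one 32-bit wavelet
    per_core_pe = buf_per_pe // (row_pes * wavelet_bytes)  # 23 wavelets per core PE
    columns = 5                              # --width-west-buf=5
    print(per_core_pe, columns * per_core_pe)  # 23, 115 >= 100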

  • The user can block/unblock input or output tensor colors.

    For example, the user can overlap computation and communication by unblocking/blocking the color.

  • The user must not set or modify the routing of an input tensor color, output tensor color, or kernel launch color.

    The routing pattern is configured implicitly by the compiler. If the user modifies those routing patterns, the behavior is undefined.

  • The user must add --memcpy to the cslc command.

  • The user must pass input/output tensor colors to cslc via MEMCPYH2D_DATA_<x>_ID and MEMCPYD2H_DATA_<y>_ID, where x and y are integers, starting from 1.

  • The user must use --fabric-offsets=x,y to compile the kernel.

    SDK 0.5.1 requires --fabric-offsets=4,1. The runtime will emit an error if --fabric-offsets is not 4,1.

    SDK 0.7.0 requires x >= 4 + width-west-buf and y >= 1.

  • The user must use --fabric-dims=dim_x,dim_y to compile the kernel.

    SDK 0.5.1 requires dim_x >= x + width + 3, dim_y >= y + height + 1.

    SDK 0.7.0 requires dim_x >= x + width + 3 + width-east-buf, dim_y >= y + height + 1.
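
    As a sketch, the SDK 0.7.0 constraints from this item and the previous one can be checked with a small host-side helper (hypothetical, based only on the formulas above):

    def min_fabric_dims(width, height, width_west_buf=0, width_east_buf=0):
        # SDK 0.7.0: x >= 4 + width-west-buf, y >= 1
        x, y = 4 + width_west_buf, 1
        # SDK 0.7.0: dim_x >= x + width + 3 + width-east-buf, dim_y >= y + height + 1
        return (x + width + 3 + width_east_buf, y + height + 1)

    # e.g. a 10 x 4 kernel with no extra buffers needs at least a 17 x 6 fabric
    print(min_fabric_dims(10, 4))  # (17, 6)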

  • The user must not use reserved resources: colors 21, 22, 23, 27, 28, 29, 30, and input/output queue 0. The compiler and runtime cannot detect if there is a resource conflict and cannot emit any error messages if it occurs.

For the host (SDK 0.5.1),

  • Pass the kernel height and width, and the input/output tensor colors to CSELFRunner(). For reference, see the following example, or the complete example in SDK 0.5.1 residual-memcpy/run_memcpy.py:

    simulator = CSELFRunner(elf_list, debug_mode=True, cmaddr=args.cmaddr,
                          height=height, width=width, input_colors=set(c_h2d),
                          output_colors=set(c_d2h))
    
  • The user must issue the input/output tensor commands in topological order of the data graph. For example, the residual example receives A, x and b, then computes |b-A*x|. The result |b-A*x| can be sent out only after all inputs are received and the computation is done, so the output tensor must be added last. The following host code sends three input tensors from the host, in order, before initiating the output tensor receive on the host.

    simulator.add_input_tensor(c_h2d[0], iportmap_A, A)
    simulator.add_input_tensor(c_h2d[1], iportmap_x, x)
    simulator.add_input_tensor(c_h2d[2], iportmap_b, b)
    simulator.add_output_tensor(c_d2h[0], oportmap_nrm_r, np.float32)
    

    The kernel defines a WTT to read A, x and b, and it does not assume any ordering among the three tensors. Instead, it uses the boolean variables x_is_ready, A_is_ready and Ax_is_ready to define the order. This allows the add_input_tensor() calls to occur in any order.

For the host (SDK 0.7.0),

SDK 0.7.0 introduces a new runtime, SdkRuntime, which supports memory transfers and kernel launches via memcpy_h2d(), memcpy_d2h() and call(). Each API can be a blocking call or a nonblocking call, depending on the API’s nonblock parameter. If blocking mode (nonblock=False) is selected, the API waits until the operation is done; otherwise, the API returns before the operation even starts. SdkRuntime can aggregate multiple nonblocking operations together to reduce latency. However, the user must take care to avoid race conditions in nonblocking mode.

For example, if the user issues two memcpy_d2h() calls to the same destination and both operations are nonblocking, the content of the destination is undefined.
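
A minimal sketch of avoiding that race (hypothetical symbol and size names; the memcpy_d2h() parameters are described below): either make one of the calls blocking, or give each nonblocking transfer its own destination buffer.

    # Distinct destination buffers: the two nonblocking transfers cannot
    # overwrite each other's host data.
    y0 = np.zeros(w * h * l, dtype=np.float32)
    y1 = np.zeros(w * h * l, dtype=np.float32)
    simulator.memcpy_d2h(y0, sym_r, 0, 0, w, h, l, streaming=False,
                         data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=True)
    simulator.memcpy_d2h(y1, sym_r, 0, 0, w, h, l, streaming=False,
                         data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=True)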

  • Pass the artifact that contains the ELFs and the IP address of the WSE to SdkRuntime(). Then the user can load the ELFs with load() and start the simulator or WSE with run(). After that, the user can perform any operation, either memory transfers or kernel launches. Finally, the user calls stop() to shut down the simulator or WSE. For reference, see the following example, or the complete example in SdkRuntime/residual-memcpy/run_memcpy.py:

    simulator = SdkRuntime(args.name, cmaddr=args.cmaddr)
    simulator.load()
    simulator.run()
    # a sequence of operations
    simulator.stop()
    
  • Import MemcpyDataType and MemcpyOrder in order to specify the data type and ordering of the tensor:

    from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime
    from cerebras.sdk.runtime.sdkruntimepybind import MemcpyDataType
    from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder
    
  • The API memcpy_h2d() transfers a tensor from host to device in one of two modes: streaming mode (streaming=True) or copy mode (streaming=False). The region of interest is the 4-tuple (x, y, w, h), which indicates a subrectangle starting at (x, y) with (width, height) = (w, h). The parameter l indicates the number of elements (wavelets) per PE. The parameter data_type specifies either 16-bit (MemcpyDataType.MEMCPY_16BIT) or 32-bit (MemcpyDataType.MEMCPY_32BIT) for copy mode; this parameter is ignored in streaming mode. The parameter order specifies row-major order (MemcpyOrder.ROW_MAJOR) or column-major order (MemcpyOrder.COL_MAJOR) for an input/output tensor of the form A[h][w][l]. The parameter nonblock indicates whether the operation is blocking or nonblocking.

    The parameter dest is the color of H2D if streaming=True (streaming mode) and is the symbol if streaming=False (copy mode).

    memcpy_h2d(dest, src, x, y, w, h, l, streaming, data_type, order, nonblock)
    

    The following example shows a copy-mode transfer of a tensor A. The utility function converts the tensor A based on the portmap iportmap_A. For reference, see the complete example in SdkRuntime/residual-memcpy/run_memcpy.py.

    symbol_A = simulator.get_id("A")
    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
    simulator.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                      streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                      order=MemcpyOrder.ROW_MAJOR, nonblock=False)
    
  • The API memcpy_d2h() transfers a tensor from device to host in the same two modes as memcpy_h2d(). The first parameter dest is the host tensor that receives the data from the device. The second parameter src is the color of the D2H stream if streaming=True (streaming mode) and the symbol if streaming=False (copy mode). The other parameters are the same as for memcpy_h2d().

    memcpy_d2h(dest, src, x, y, w, h, l, streaming, data_type, order, nonblock)
    

    The following example shows a copy-mode transfer to receive the scalar |b - A*x|. The utility function allocates a workspace, data, to receive the wavelets from the device; format_output_tensor() then converts data back to nrm_r_cs based on the portmap oportmap_nrm_r. For reference, see the complete example in SdkRuntime/residual-memcpy/run_memcpy.py.

    symbol_nrm = simulator.get_id("nrm")
    (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
    simulator.memcpy_d2h(data, symbol_nrm, px, py, w, h, l,
                      streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                      order=MemcpyOrder.ROW_MAJOR, nonblock=False)
    nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)
    
  • SDK 0.7.0 adds support for column-major ordering of the input/output tensor. The parameter order of memcpy_h2d() and memcpy_d2h() specifies either row-major or column-major. The column-major version delivers better bandwidth than row-major. runtime_utils only supports row-major, so to use column-major order the user must prepare the column-major host tensor and call the low-level memcpy_h2d() and memcpy_d2h() directly, as in the sketch below. For reference, see the complete example in SdkRuntime/bandwidthTest/run.py.
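
    For example, a minimal sketch of a column-major copy-mode transfer, assuming the host tensor has the A[h][w][l] form described above and that COL_MAJOR expects the flattened buffer with the first index varying fastest (i.e. the Fortran-order flattening of an (h, w, l) array); the symbol and size names are hypothetical:

    A = np.arange(h * w * l, dtype=np.float32).reshape(h, w, l)
    data = A.ravel(order="F")   # h varies fastest, then w, then l
    simulator.memcpy_h2d(symbol_A, data, 0, 0, w, h, l, streaming=False,
                         data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.COL_MAJOR, nonblock=False)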

  • SDK 0.7.0 introduces a new API, call(), to support kernel launches. call() is simpler than memcpy_launch(), but the kernel has to export the color LAUNCH and the host-callable functions. The first parameter sym is the symbol of the host-callable function. The second parameter params is an array of the parameters of the host-callable function. The size of each input parameter is 4 bytes (one wavelet, or the size of u32). The third parameter nonblock selects blocking or nonblocking mode.

    params = np.zeros(K).astype(np.uint32)
    # format of params
    #  +---------------------+
    #  | param 1             | 1st wavelet
    #  +---------------------+
    #  | param 2             | 2nd wavelet
    #  +---------------------+
    #  | param 3             | 3rd wavelet
    #  +---------------------+
    #  | ...                 |
    #  +---------------------+
    # params has K wavelets where K = number of parameters
    simulator.call(sym, params, nonblock)
    

    The advantage of call() is twofold:

    • The user does not need to encode the function ID

    • The user does not need to implement the WTT of color LAUNCH
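
    For example, a minimal sketch of packing the parameters for a hypothetical host-callable function f1(alpha: f32, n: u32); it assumes the function symbol can be obtained with get_id(), in the same way the copy-mode examples above obtain tensor symbols:

    params = np.zeros(2, dtype=np.uint32)
    # 1st wavelet: reinterpret the f32 bits of alpha as a u32
    params[0] = np.array([0.5], dtype=np.float32).view(np.uint32)[0]
    # 2nd wavelet: a plain u32 parameter
    params[1] = np.uint32(100)
    simulator.call(simulator.get_id("f1"), params, nonblock=False)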

Batched execution mode (SDK 0.5.1)

In batched execution mode, the host script should replace the call to CSELFRunner.connect_and_run() with calls to:

simulator = CSELFRunner(elf_list, ..., input_colors, ...)

simulator.add_input_tensor(color, iportmap, np_arr=None, sentinel=None, reusable=True)

simulator.add_output_tensor(outColor, oportmap, np.float32)

simulator.load()
simulator.start()

simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_1)
simulator.run_batch()
result_tensor = simulator.out_tensor_dict["out_tensor"]

simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_2)
simulator.run_batch()
result_tensor = simulator.out_tensor_dict["out_tensor"]

simulator.stop()

Note these requirements:

  • The CSELFRunner constructor is passed input and output color arguments, so that the system uses new tensor streaming.

  • The add_input_tensor() call sets np_arr to None. If an np_arr value is passed instead, the initial call to prepare_input_tensor() can be omitted.

  • The add_input_tensor() call sets reusable to True. This enables batched mode for this tensor. All created input and output tensors must take the same reusable value.

  • The prepare_input_tensor() call replaces any existing prepared array data within CSELFRunner with the new values passed in through np_arr.

  • The run_batch() call sends and receives all declared tensors to and from the fabric, and waits until all bytes are sent and received, as if connect_and_run() had been called. However, the execution or simulation does not stop at this point.

  • The output tensor from this run can be copied from out_tensor_dict as usual. The data in this dict will be lost once run_batch() is called again.

  • The stop() function must be called to stop the simulator after the last batch.

A running program on the wafer will not be stopped or reset after each batch. The kernel itself is responsible for any program state init/reset that may be needed after each batch.

Selective batched execution mode (SDK 0.5.1)

Selective batched execution mode extends batched execution mode by allowing a list of input and output tensors to be sent, instead of sending all declared tensors every time run_batch() is called. With this extension, a program can have distinct init and termination phases. For example:

simulator = CSELFRunner(elf_list, ..., input_colors, ...)

simulator.add_input_tensor(initColor, init_iportmap, np_arr=None, sentinel=None, reusable=True)
simulator.add_input_tensor(color, iportmap, np_arr=None, sentinel=None, reusable=True)

simulator.add_output_tensor(outColor, oportmap, np.float32)
simulator.add_output_tensor(finalOutColor, final_oportmap, np.float32)

simulator.load()
simulator.start()

# run a one-time init step
simulator.prepare_input_tensor(initColor, init_iportmap, np_arr=init_data)
simulator.run_batch([initColor])

simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_1)
simulator.run_batch([color, outColor])
result_tensor = simulator.out_tensor_dict["out_tensor"]

simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_2)
simulator.run_batch([color, outColor])
result_tensor = simulator.out_tensor_dict["out_tensor"]

simulator.run_batch([finalOutColor])
final_result_tensor = simulator.out_tensor_dict["final_out_tensor"]

simulator.stop()

Note these requirements:

  • The run_batch() call now takes an optional list of colors. If no list is given, all declared tensors are sent and received. Otherwise, only the listed colors are transferred.

See the hadamard-product example for uses of these batched execution features.

Upcoming improvements

Although SDK 0.7.0 can activate up to 16 I/O channels, it still does not saturate the I/O bandwidth. A future release will lift these restrictions and add more I/O channels to increase the raw bandwidth.