Tensor Streaming Implementations¶
Tensor streaming through CSELFRunner's add_input_tensor() and add_output_tensor() methods is the primary way to transfer data to and from the WSE with the legacy host runtime. These functions manage the data transfer to and from the host's filesystem or memory, through the host and WSE network interfaces, and finally route the data into your kernel. This last step is implemented on the WSE itself in order to connect the I/O channel entry points, which are in fixed locations at the edges of the WSE, to each kernel, which has a variable size and location. These I/O channels are connected to the fabric routers at the east and west edge PEs, spaced roughly 16 rows apart, giving a total of 66 channels at each edge.
The SDK 0.5.1 release extends the optional new tensor streaming implementation by introducing batched execution mode and selective batched execution mode. These execution modes (only available with new tensor streaming) allow CSELFRunner to send its input and output tensors (with the same or different numpy array data) to the fabric multiple times per fabric startup operation, avoiding the fabric load and reset overhead on subsequent runs.
The SDK 0.6.0 release adds a couple of features on top of SDK 0.5.1, including direct memory copy and kernel launches.
The new runtime SdkRuntime loads the binaries and starts the simulator or WSE first; the user can then stream or copy tensors at any time. There is no need to declare H2D/D2H signatures the way CSELFRunner requires. The user can also transform the batched operations of SDK 0.5.1 into a calling sequence in SDK 0.6.0 without describing the batched pattern in advance. Existing code runs in SDK 0.6.0 without any modification; however, such code will not take advantage of the new runtime features in SDK 0.6.0.
The SDK 0.7.0 release adds three features on top of SDK 0.6.0:
- copy mode supports 16-bit data.
- The input/output tensors can be formed in column-major order to improve the bandwidth.
- The kernel launch is simplified because the compiler generates the RPC code. The user only needs to export the host-callable functions.
Old-style Tensor Streaming¶
In “old-style” tensor streaming, the only implementation before the SDK 0.5.0 release, CSELFRunner adds a 1-PE-wide box or “halo” on all four sides of the wafer. The halo sends the correct tensors towards the correct edge of the kernel. Up to 12 total input and output tensors can be used. Each tensor is assigned to a unique I/O channel, resulting in these properties:
- All tensors are sent and received concurrently.
- Each tensor transfer is limited to the maximum bandwidth of a single I/O channel.
- Each I/O channel has an independent data path to and from the WSE, so the total application I/O bandwidth scales up with the number of tensors used.
In this implementation, the halo only delivers tensor data to a PE (or row/column of PEs) at the outer edge of the kernel. The portmap argument must specify the source or destination edge. The kernel itself must manage the internal distribution of the tensor data. No extra code is added to the kernel in this tensor streaming implementation. Old-style streaming is the default, and existing applications will continue to default to this implementation.
New-style Tensor Streaming¶
The “new-style” tensor streaming uses additional PEs around the user kernel to route the tensor data and also adds a small executable component to the kernel PEs. The additional support PEs consume three columns on the west of the kernel and two columns on the east.
This implementation uses three routable colors, leaving routable colors 0 through 20 available. It only supports up to four input and up to four output tensor colors. Each tensor's data is sent over a single shared input or output I/O channel, resulting in these properties:
- Tensors are sent/received in the order they were created with memcpy_h2d() and memcpy_d2h().
- Total application I/O bandwidth in one direction is limited to the bandwidth of a single I/O channel.
In this implementation, tensor data can be delivered directly to each kernel PE, without requiring additional routing/distribution code in the kernel. (Tensor data can also be delivered only to the edges, as in old-style streaming.) The portmap argument must specify a rectangle of PEs as the source/destination.
SDK 0.7.0 has a different runtime, SdkRuntime, which supports up to 16 I/O channels, while CSELFRunner in SDK 0.5.1 only supports a single I/O channel. In addition to the bandwidth improvement, SDK 0.7.0 can also reduce I/O latency by inserting buffers on either side of the core kernel.
Enabling New-Style Tensor Streaming¶
For the kernel:
SDK 0.7.0 can either stream data into the device or copy data directly into a given location in device memory. The former is called streaming mode and the latter is called copy mode. Streaming mode requires the user to count the number of received wavelets in order to process the next task. Copy mode copies the data into memory directly without notifying the user.
To compute f(A) in the kernel, the user has two options (see the host-side sketch after this list):
- streaming mode: define a WTT to receive the tensor A and call a function to compute f(A) after the whole of A is received.
- copy mode: send tensor A first, then launch a kernel function to compute f(A).
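The following minimal Python sketch shows roughly how these two options look from the host side, using the SdkRuntime APIs described later in this section. It is not taken from the shipped examples: the names h2d_color_A, symbol_A, sym_f, A_data and the shape parameters are placeholders, and the surrounding setup (imports, load(), run()) is omitted.

    # Option 1: streaming mode -- the host streams A over an H2D color;
    # the kernel's WTT counts wavelets and computes f(A) once A is complete.
    simulator.memcpy_h2d(h2d_color_A, A_data, px, py, w, h, l,
                         streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)

    # Option 2: copy mode -- the host copies A into device memory by symbol,
    # then launches the kernel function that computes f(A).
    simulator.memcpy_h2d(symbol_A, A_data, px, py, w, h, l,
                         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)
    simulator.call(sym_f, params, nonblock=False)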
Here are some highlights; the details can be found in SdkRuntime/residual-memcpy and SdkRuntime/stencil-memcpy.
- SDK 0.7.0 introduces a new runtime, SdkRuntime, which is different from CSELFRunner in SDK 0.5.0. However, SDK 0.7.0 supports CSELFRunner as well, so the user can run existing code without any change. To distinguish SdkRuntime from CSELFRunner on the kernel side, the compiler introduces a new flag, --channels. If the user passes --channels=0 to cslc, the code must instantiate the memcpy module via <memcpy/memcpy>, and it is compiled to run with CSELFRunner. If the user passes --channels=k (k positive) to cslc, the code must instantiate the memcpy module via <memcpy_multi/memcpy>, and it is compiled to run with SdkRuntime. If the user does not specify --channels, the default value is zero.
- SDK 0.7.0 can activate more I/O channels than SDK 0.5.1. The number of I/O channels is the number k in --channels=k. The largest value of --channels is 16 in SDK 0.7.0, so SDK 0.7.0 can deliver up to 16x more bandwidth than SDK 0.5.1.
- The user must instantiate the memcpy module with an @import_module() statement. The parameters passed to the memcpy module include the colors used to stream data, a data type for copying data, and a color to launch kernel functions.
- To stream data in or out, the user must pass the colors into the memcpy module. Input tensor colors are prefixed with MEMCPYH2D_ and output tensor colors are prefixed with MEMCPYD2H_, followed by the tensor ID, an integer in the range 1-4. Unused colors should be omitted, and only four colors per direction are allowed. The supplied color values should match those used in the host API, either add_input_tensor()/add_output_tensor() calls to CSELFRunner or memcpy_h2d()/memcpy_d2h() calls to SdkRuntime.
- The parameter px is the x-coordinate of the core rectangle. SDK 0.5.1 does not use memcpyParams; instead, the user has to define .first_pe=... and .last_pe=.... SDK 0.7.0 hides the parameter px under memcpyParams; usually the layout file passes memcpyParams to the kernel file. Here is one example for SDK 0.7.0:

    const memcpy = @import_module("<memcpy_multi/get_params>", .{
        .width = width,
        .height = height
    });
    const memcpyParams = memcpy.get_params(px);
    const sys_mod = @import_module("<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
        .MEMCPYH2D_1 = first_input_color,
        .MEMCPYH2D_2 = second_input_color,
        ...,
        .MEMCPYD2H_1 = first_output_color,
        ...
    }));
- To run with SDK 0.5.1, the user can replace memcpy_multi/memcpy with memcpy/memcpy, remove memcpyParams, and add .first_pe=... and .last_pe=... to the memcpy module.
- The user must pass input/output color ID and value pairs to cslc as parameters, where x is the tensor ID:

    --params=MEMCPYH2D_DATA_<x>_ID:<input_color>

- To stream data into the device, the user can either use a WTT (wavelet-triggered task) to read the data from an input tensor color or use a microthread to read the data. To bind a WTT to an input color, call @bind_task:

    // read data on color MEMCPYH2D_DATA_1
    @bind_task(f_memcpyh2d_data_1, MEMCPYH2D_DATA_1);

- The user can send data to an output tensor color using a microthread. To write the data on P1.0 via a microthread with output queue 1:

    @mov32(fab_trans_nrm_wdsd_p10, mem_nrm_buf_dsd, .{ .async = true });
- To copy data to/from the device, the user has to pass the data type to the memcpy module and define the symbols for the tensors to be copied. For example, the following code defines a pointer ptr_A pointing to tensor A, and exports it. The type of A must be the same as data_type; in other words, the user cannot copy two tensors of different types. See the complete example in SdkRuntime/residual-memcpy/residual_memcpy.csl.

    const sys_mod = @import_module("<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
        ...,
        .data_type = f32
    }));
    ...
    var A = @zeros([4]f32);
    var ptr_A : [*]f32 = &A;
    ...
    comptime {
        @export_symbol(ptr_A, "A");
    }
- SDK 0.7.0 supports both 16-bit and 32-bit data transfer via copy mode or streaming mode. If the user does not use runtime_utils to prepare the input tensors for a 16-bit tensor but calls the low-level memcpy_h2d directly, zero extension from 16-bit to 32-bit must be performed. For example, if the tensor is [10]u16, the user has to prepare [10]u32 and fill the lower 16 bits of each u32 with the input tensor. The same holds for memcpy_d2h, which returns 32-bit data where the upper 16 bits are zero; the user has to strip out the upper 16 bits. A host-side sketch of this zero extension follows.
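A minimal Python sketch of the zero extension described above when calling the low-level memcpy_h2d()/memcpy_d2h() directly; the tensor size and variable names are illustrative and the memcpy calls themselves are elided:

    import numpy as np

    # Host tensor of 10 u16 elements to be sent with the low-level memcpy_h2d().
    u16_in = np.arange(10, dtype=np.uint16)

    # Zero-extend: each u16 value occupies the lower 16 bits of a u32 wavelet.
    u32_to_send = u16_in.astype(np.uint32)

    # ... memcpy_h2d(dest, u32_to_send, ...) ...

    # memcpy_d2h() returns 32-bit data whose upper 16 bits are zero;
    # mask them off to recover the u16 values.
    u32_received = np.zeros(10, dtype=np.uint32)   # filled by memcpy_d2h()
    u16_out = (u32_received & 0xFFFF).astype(np.uint16)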
- SDK 0.7.0 can launch a kernel function. The user has to specify a color LAUNCH to the memcpy module and define a function to be launched.

    const sys_mod = @import_module("<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
        ...,
        .LAUNCH = LAUNCH
    }));

- The following is an example of the kernel launching protocol. It calls different kernel functions, f1() or f2(). See the complete example in SdkRuntime/residual-memcpy/residual_memcpy.csl.

    fn f1(...) void { ... }
    fn f2(...) void { ... }

    comptime {
        @export_symbol(f1);
        @export_symbol(f2);
        @rpc(LAUNCH);
    }
- SDK 0.7.0 can insert buffers in the infrastructure to reduce I/O latency. Each buffer stores the wavelets from the I/O for one row of PEs while the core is busy and cannot process them; in other words, the buffer acts like a prefetch from the computation point of view. There are two kinds of buffers: one stores data for host-to-device, and the other stores data for device-to-host. The width of the former is configured by --width-west-buf and the width of the latter by --width-east-buf. By default, --width-west-buf=0 and --width-east-buf=0, i.e. no buffers are inserted. --width-west-buf=k means k columns of PEs are inserted on the west side of the core kernel, each PE with 46 KB of storage. If the user has 500 PEs in a row, then 46 KB can buffer 23 wavelets per PE. If the user wants to stream/copy a tensor of size 100 per PE, then --width-west-buf=5 can hold the whole tensor; the arithmetic is sketched below.
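The buffer-sizing arithmetic above can be reproduced in a few lines of Python; the 46 KB per-PE storage and the 4-byte wavelet size come from the text, while the row width and tensor size are just the example values:

    # Buffer capacity per buffer-column PE, per core-row PE.
    buffer_bytes_per_pe = 46 * 1024   # 46 KB of storage in each buffer PE
    wavelet_bytes = 4                 # one wavelet carries 32 bits of data
    row_pes = 500                     # PEs in one row of the core kernel

    wavelets_per_core_pe = buffer_bytes_per_pe // (row_pes * wavelet_bytes)  # = 23

    # Columns of buffer PEs needed to hold a tensor of 100 wavelets per PE.
    tensor_wavelets_per_pe = 100
    width_west_buf = -(-tensor_wavelets_per_pe // wavelets_per_core_pe)      # = 5 (ceiling division)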
- The user can block/unblock input or output tensor colors. For example, the user can overlap computation and communication by unblocking/blocking the color.
- The user must not set or modify the routing of an input tensor color, output tensor color, or kernel launch color. The routing pattern is configured implicitly by the compiler. If the user modifies those routing patterns, the behavior is undefined.
- The user must add --memcpy to the cslc command.
- The user must pass input/output tensor colors to cslc via MEMCPYH2D_DATA_<x>_ID and MEMCPYD2H_DATA_<y>_ID, where x and y are integers starting from 1.
- The user must use --fabric-offsets=x,y to compile the kernel. SDK 0.5.1 requires --fabric-offsets=4,1; the runtime will emit an error if --fabric-offsets is not 4,1. SDK 0.7.0 requires x >= 4 + width-west-buf and y >= 1.
- The user must use --fabric-dims=dim_x,dim_y to compile the kernel. SDK 0.5.1 requires dim_x >= x + width + 3 and dim_y >= y + height + 1. SDK 0.7.0 requires dim_x >= x + width + 3 + width-east-buf and dim_y >= y + height + 1.
- The user must not use reserved resources: colors 21, 22, 23, 27, 28, 29, 30, and input/output queue 0. The compiler and runtime cannot detect a resource conflict and cannot emit any error message if one occurs.
For the host (SDK 0.5.1):
- Pass the kernel height and width, and the input/output tensor colors, to CSELFRunner(). For reference, see the following example, or the complete example in the SDK 0.5.1 residual-memcpy/run_memcpy.py:

    simulator = CSELFRunner(elf_list, debug_mode=True, cmaddr=args.cmaddr,
                            height=height, width=width,
                            input_colors=set(c_h2d), output_colors=set(c_d2h))

- The user must use a topological ordering of the data graph to issue the input/output tensor commands. For example, the residual example receives A, x and b, then computes |b-A*x|. The result |b-A*x| can be sent out only if all inputs are received and computed, so the output tensor must be added last. The following host code sends three input tensors from the host, in order, before initiating the output tensor receive on the host.

    simulator.add_input_tensor(c_h2d[0], iportmap_A, A)
    simulator.add_input_tensor(c_h2d[1], iportmap_x, x)
    simulator.add_input_tensor(c_h2d[2], iportmap_b, b)
    simulator.add_output_tensor(c_d2h[0], oportmap_nrm_r, np.float32)
- The kernel defines a WTT to read A, x and b, and it does not assume any ordering among the three tensors. Instead, it uses the boolean variables x_is_ready, A_is_ready and Ax_is_ready to define the order. This allows the add_input_tensor() calls to occur in any order.
For the host (SDK 0.7.0):
- SDK 0.7.0 introduces a new runtime, SdkRuntime, which supports memory transfers and kernel launches via memcpy_h2d(), memcpy_d2h() and call(). Each API can be a blocking call or a nonblocking call, depending on the parameter nonblock of the API. If blocking mode (nonblock=False) is selected, the API waits until the operation is done; otherwise, the API returns before the operation even starts. SdkRuntime can aggregate multiple nonblocking operations together to reduce latency. However, the user must take care to avoid race conditions in nonblocking mode. For example, if the user has two memcpy_d2h() calls to the same destination, the content of the destination is undefined if both operations are nonblocking. A sketch of aggregating nonblocking operations is given below.
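As an illustration of aggregating nonblocking operations, the following sketch streams the three inputs of the residual example as nonblocking transfers and then retrieves the result with a blocking transfer. The color values, host buffers and shape parameters are placeholders, and the sketch assumes the runtime carries out operations in the order they are issued; distinct destination buffers avoid the race described above.

    # Aggregate the three input transfers as nonblocking streaming operations.
    simulator.memcpy_h2d(h2d_color_A, A_data, px, py, w, h, l_A,
                         streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=True)
    simulator.memcpy_h2d(h2d_color_x, x_data, px, py, w, h, l_x,
                         streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=True)
    simulator.memcpy_h2d(h2d_color_b, b_data, px, py, w, h, l_b,
                         streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=True)

    # Retrieve the result with a blocking transfer; the host continues only
    # after the data has arrived in nrm_data.
    simulator.memcpy_d2h(nrm_data, d2h_color_nrm, px, py, w, h, 1,
                         streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)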
- Pass the artifact which contains the ELFs, and the IP address of the WSE, to SdkRuntime(). Then the user can load the ELFs with load() and start the simulator or WSE with run(). After that, the user can perform any operation, either memory transfers or kernel launches. Finally, the user calls stop() to shut down the simulator or WSE. For reference, see the following example, or the complete example in SdkRuntime/residual-memcpy/run_memcpy.py:

    simulator = SdkRuntime(args.name, cmaddr=args.cmaddr)
    simulator.load()
    simulator.run()
    # a sequence of operations
    simulator.stop()
- Import MemcpyDataType and MemcpyOrder in order to specify the data type and ordering of the tensor:

    from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime
    from cerebras.sdk.runtime.sdkruntimepybind import MemcpyDataType
    from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder
- The API memcpy_h2d() transfers a tensor from host to device in one of two modes: streaming mode, which corresponds to streaming=True, and copy mode, which corresponds to streaming=False. The region of interest is a 4-tuple (x, y, w, h), which indicates a subrectangle starting at (x, y) with (width, height) = (w, h). The parameter l indicates the number of elements (wavelets) per PE. The parameter data_type specifies either 16-bit (MemcpyDataType.MEMCPY_16BIT) or 32-bit (MemcpyDataType.MEMCPY_32BIT) for copy mode; this parameter is ignored in streaming mode. The parameter order specifies row-major order (MemcpyOrder.ROW_MAJOR) or column-major order (MemcpyOrder.COL_MAJOR) for the input/output tensor of the form A[h][w][l]. The parameter nonblock indicates whether the operation is blocking or nonblocking. The parameter dest is the color of the H2D stream if streaming=True (streaming mode) and is the symbol if streaming=False (copy mode).

    memcpy_h2d(dest, src, x, y, w, h, l, streaming, data_type, order, nonblock)
- The following example shows copy mode sending a tensor A. The util function converts the tensor A based on the portmap iportmap_A. For reference, see the complete example in SdkRuntime/residual-memcpy/run_memcpy.py.

    symbol_A = simulator.get_id("A")
    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
    simulator.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)
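For comparison, a streaming-mode transfer of the same tensor would pass the H2D color instead of a symbol. This variant is illustrative only: it assumes the value h2d_color_A matches the MEMCPYH2D_DATA_<x>_ID value given to cslc and that the same prepared data layout applies (the shipped example uses copy mode here).

    (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
    simulator.memcpy_h2d(h2d_color_A, data, px, py, w, h, l,
                         streaming=True, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)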
- The API memcpy_d2h() transfers a tensor from device to host with the same two modes as memcpy_h2d(). The first parameter dest is the host tensor that receives the data from the device. The second parameter src is the color of the D2H stream if streaming=True (streaming mode) and is the symbol if streaming=False (copy mode). The other parameters are the same as for memcpy_h2d().

    memcpy_d2h(dest, src, x, y, w, h, l, streaming, data_type, order, nonblock)
- The following example shows copy mode receiving a scalar |b - A*x|. The util function allocates a workspace data to receive the wavelets from the device. It then converts data back to nrm_r_cs based on the portmap oportmap_nrm_r. For reference, see the complete example in SdkRuntime/residual-memcpy/run_memcpy.py.

    symbol_nrm = simulator.get_id("nrm")
    (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
    simulator.memcpy_d2h(data, symbol_nrm, px, py, w, h, l,
                         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.ROW_MAJOR, nonblock=False)
    nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)
- SDK 0.7.0 adds support for column-major order for the input/output tensor. The parameter order of memcpy_h2d() and memcpy_d2h() specifies either row-major or column-major. The column-major version delivers better bandwidth than row-major. runtime_utils only supports row-major; to use column-major, the user must prepare the column-major host tensor and call the low-level memcpy_h2d() and memcpy_d2h() directly. For reference, see the complete example in SdkRuntime/bandwidthTest/run.py.
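One possible way to prepare a column-major host tensor with numpy is sketched below. It assumes memcpy_h2d() consumes the tensor as a flat buffer of the form A[h][w][l] in the chosen order, which is an assumption here rather than something taken from the examples; see SdkRuntime/bandwidthTest/run.py for the authoritative version. The names symbol_A, px, py, w, h, l follow the copy-mode example above.

    import numpy as np

    # Logical tensor A[h][w][l], stored row-major by default in numpy.
    A = np.arange(h * w * l, dtype=np.float32).reshape(h, w, l)

    # Column-major (Fortran) order: the first index, h, varies fastest.
    data_col_major = A.ravel(order='F')

    simulator.memcpy_h2d(symbol_A, data_col_major, px, py, w, h, l,
                         streaming=False, data_type=MemcpyDataType.MEMCPY_32BIT,
                         order=MemcpyOrder.COL_MAJOR, nonblock=False)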
- SDK 0.7.0 introduces a new API, call(), to support kernel launches. call() is simpler than memcpy_launch(), but the kernel has to export the color LAUNCH and the host-callable functions. The first parameter sym is the symbol of the host-callable function. The second parameter params is an array of the parameters of the host-callable function. The size of each input parameter is 4 bytes (one wavelet, the size of a u32). The third parameter nonblock selects either blocking or nonblocking mode.

    params = np.zeros(K).astype(np.uint32)
    # format of params
    # +---------------------+
    # |       param 1       |  1st wavelet
    # +---------------------+
    # |       param 2       |  2nd wavelet
    # +---------------------+
    # |       param 3       |  3rd wavelet
    # +---------------------+
    # |         ...         |
    # +---------------------+
    # params has K wavelets where K = number of parameters
    simulator.call(sym, params, nonblock)
- The advantage of call() is twofold: the user does not need to encode the function ID, and the user does not need to implement the WTT of color LAUNCH.
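For instance, packing two f32 arguments into the 4-byte wavelets of params might look like the sketch below. The function name f1 and its arguments are illustrative, and retrieving a function symbol via get_id() is an assumption here (the examples above only show get_id() for tensor symbols).

    import numpy as np

    # Each parameter occupies one 32-bit wavelet; reinterpret f32 bits as u32.
    alpha = np.float32(1.5)
    beta = np.float32(-2.0)
    params = np.array([alpha, beta], dtype=np.float32).view(np.uint32)

    sym_f1 = simulator.get_id("f1")        # assumed: function symbol via get_id()
    simulator.call(sym_f1, params, nonblock=False)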
Batched execution mode (SDK 0.5.1)¶
In batched execution mode, the host script should replace the call to CSELFRunner.connect_and_run() with calls to:

    simulator = CSELFRunner(elf_list, ..., input_colors, ...)
    simulator.add_input_tensor(color, iportmap, np_arr=None, sentinel=None, reusable=True)
    simulator.add_output_tensor(outColor, oportmap, np.float32)

    simulator.load()
    simulator.start()

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_1)
    simulator.run_batch()
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_2)
    simulator.run_batch()
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.stop()
Note these requirements:
- The CSELFRunner constructor is passed input and output color arguments, so that the system uses new tensor streaming.
- The add_input_tensor() call sets np_arr to None. If an np_arr value is still passed, the subsequent initial call to prepare_input_tensor() can be omitted.
- The add_input_tensor() call sets reusable to True. This enables batched mode for this tensor. All created input and output tensors must take the same reusable value.
- The prepare_input_tensor() call replaces any existing prepared array data within CSELFRunner with the new values passed in through np_arr.
- The run_batch() call sends and receives all declared tensors to and from the fabric, and waits until all bytes are sent and received, as if connect_and_run() had been called. However, the execution or simulation does not stop at this point.
- The output tensor from this run can be copied from out_tensor_dict as usual. The data in this dict will be lost once run_batch() is called again, as shown in the snippet below.
- The stop() function must be called to stop the simulator after the last batch.
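Because out_tensor_dict is overwritten by the next run_batch(), a host script that needs the results of several batches should copy them out first. A minimal sketch, reusing the names from the example above:

    results = []
    for batch in (my_array_1, my_array_2):
        simulator.prepare_input_tensor(color, iportmap, np_arr=batch)
        simulator.run_batch()
        # Copy the output before the next run_batch() overwrites out_tensor_dict.
        results.append(np.copy(simulator.out_tensor_dict["out_tensor"]))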
A running program on the wafer will not be stopped or reset after each batch. The kernel itself is responsible for any program state init/reset that may be needed after each batch.
Selective batched execution mode (SDK 0.5.1)¶
The selective batched execution mode extends the batched execution behavior described above by allowing a list of input and output tensors to be sent, instead of sending all declared tensors every time run_batch() is called. With this extension, a program can have distinct init and termination phases. For example:

    simulator = CSELFRunner(elf_list, ..., input_colors, ...)
    simulator.add_input_tensor(initColor, init_iportmap, np_arr=None, sentinel=None, reusable=True)
    simulator.add_input_tensor(color, iportmap, np_arr=None, sentinel=None, reusable=True)
    simulator.add_output_tensor(outColor, oportmap, np.float32)
    simulator.add_output_tensor(finalOutColor, final_oportmap, np.float32)

    simulator.load()
    simulator.start()

    # run a one-time init step
    simulator.prepare_input_tensor(initColor, init_iportmap, np_arr=init_data)
    simulator.run_batch([initColor])

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_1)
    simulator.run_batch([color, outColor])
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.prepare_input_tensor(color, iportmap, np_arr=my_array_2)
    simulator.run_batch([color, outColor])
    result_tensor = simulator.out_tensor_dict["out_tensor"]

    simulator.run_batch([finalOutColor])
    final_result_tensor = simulator.out_tensor_dict["final_out_tensor"]

    simulator.stop()
Note these requirements:
- The run_batch() call now takes an optional list of colors. If no list is given, all declared tensors are sent and received. Otherwise, only the listed colors are sent.
See the hadamard-product example for examples of these batched execution features in use.
Upcoming improvements¶
Although SDK 0.7.0 can activate up to 16 I/O channels, it still does not saturate the I/O bandwidth. A future release will remove these restrictions and bring more I/O channels to increase the raw bandwidth.