SdkRuntime API Reference
This section presents the SdkRuntime Python host API reference and associated utilities for developing kernels for the Cerebras Wafer Scale Engine.
SdkRuntime
Python API for SdkRuntime functions.
class cerebras.sdk.runtime.sdkruntimepybind.SdkRuntime(bindir: Union[pathlib.Path, str], **kwargs)
Bases: object

Manages the execution of SDK programs on the Cerebras Wafer Scale Engine (WSE) or on simfabric. The constructor analyzes the WSE ELFs in bindir and prepares the WSE or simfabric for a run. Runs on the WSE require the IP address and port of the CM.

Parameters
- bindir (Union[pathlib.Path, str]) – Path to the ELF files compiled by cslc. The runtime collects the I/O and fabric parameters automatically, including height, width, number of channels, and buffer widths.

Keyword Arguments
- cmaddr (str) – 'IP_ADDRESS:PORT' string of the CM. Omit this kwarg to run on simfabric.

Example:
In the following example, an SdkRuntime runner object is instantiated. If args.cmaddr is non-empty, the kernel code runs on the WSE at that address; otherwise, it runs on simfabric. The compiled kernel code in the directory args.name exports symbols A and B, which point to arrays on the device. After loading the code and starting the run with load() and run(), the host data in data is copied to A on the device, and then B on the device is copied back into data on the host. Finally, stop() ends the program.

```python
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# args.name: directory of ELFs compiled by cslc; args.cmaddr: 'IP_ADDRESS:PORT' or empty
runner = SdkRuntime(args.name, cmaddr=args.cmaddr)
symbol_A = runner.get_id("A")
symbol_B = runner.get_id("B")

runner.load()
runner.run()

# data: 32-bit numpy array of w*h*l elements for the w-by-h ROI starting at (px, py)
memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
memcpy_order = MemcpyOrder.ROW_MAJOR
runner.memcpy_h2d(symbol_A, data, px, py, w, h, l, streaming=False,
                  data_type=memcpy_dtype, order=memcpy_order, nonblock=False)
runner.memcpy_d2h(data, symbol_B, px, py, w, h, l, streaming=False,
                  data_type=memcpy_dtype, order=memcpy_order, nonblock=False)

# stop() must be called to end the program
runner.stop()
```
call(symbol: str, params: numpy.ndarray, **kwargs) → Task
Trigger a host-callable function defined in the kernel.

Parameters
- symbol (str) – The exported name of the symbol corresponding to a host-callable function.
- params (numpy.ndarray) – Array of parameters passed as arguments to the host-callable function. Each parameter must be 32-bit, and at most fifteen parameters are supported.

Keyword Arguments
- nonblock (bool) – Nonblocking if True, blocking otherwise.

Returns
- task_handle (Task) – Handle to the task launched by call.

Example:
Consider a kernel that defines a host-callable function fn_foo by:

```csl
comptime {
  @export_symbol(fn_foo);
  @rpc(LAUNCH);
}
```

The host calls fn_foo by:

```python
runner.call("fn_foo", [], nonblock=False)
```
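The params argument of call must hold 32-bit values, at most fifteen of them. Below is a minimal sketch of preparing such an array, under the assumption that the kernel reinterprets each 32-bit word according to its own parameter type, so a float is passed as its raw float32 bit pattern inside a uint32 array. The helper pack_call_params and the function name fn_bar are hypothetical and not part of the SDK; launch, documented below, type-checks positional arguments instead.

```python
import numpy as np

MAX_CALL_PARAMS = 15  # limit stated in the call() documentation


def pack_call_params(*values):
    """Hypothetical helper: pack Python/NumPy ints and floats into uint32 words.

    Assumption: the kernel consumes each parameter as a raw 32-bit word, so
    floats are stored as float32 bit patterns and ints as 32-bit words
    (two's complement for negative values).
    """
    if len(values) > MAX_CALL_PARAMS:
        raise ValueError(f"call() supports at most {MAX_CALL_PARAMS} parameters")
    words = []
    for v in values:
        if isinstance(v, (float, np.floating)):
            # Reinterpret the float32 bits as an unsigned 32-bit integer
            words.append(int(np.array([v], dtype=np.float32).view(np.uint32)[0]))
        elif isinstance(v, (int, np.integer)):
            if not -2**31 <= int(v) < 2**32:
                raise ValueError(f"{v} does not fit in 32 bits")
            words.append(int(v) & 0xFFFFFFFF)
        else:
            raise TypeError(f"unsupported parameter type: {type(v).__name__}")
    return np.array(words, dtype=np.uint32)


params = pack_call_params(1.5, -3, 7)
print([hex(p) for p in params])  # ['0x3fc00000', '0xfffffffd', '0x7']
# With an SdkRuntime object (hypothetical function taking f32, i32, u32):
# runner.call("fn_bar", params, nonblock=False)
```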
dump_core(corefile: str)
Dump the core of a simulator run for debugging with csdb. To use the core file with csdb, its name MUST be “corefile.cs1”. This method can only be called after calling stop().

Parameters
- corefile (str) – Name of the core file. Must be “corefile.cs1” to use with csdb.
get_id(symbol: str)
Retrieve the integer representation of a symbol exported by the kernel. Possible symbols include a data tensor or a host-callable function. The returned integer can be passed as the device tensor symbol to memcpy_h2d and memcpy_d2h.

Parameters
- symbol (str) – The exported name of the symbol.
is_task_done(task_handle: Task) → bool
Query whether the task task_handle is complete.

Parameters
- task_handle (Task) – Handle to a task previously launched by SdkRuntime.

Returns
- task_done (bool) – True if the task is done, and False otherwise.
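With nonblock=True, methods such as call, launch, memcpy_h2d, and memcpy_d2h return a Task handle immediately. The following minimal sketch waits on several handles using only is_task_done, which can be useful for overlapping host work with device work (stop() already waits for all pending commands). The helper wait_for_tasks, its poll_interval and timeout parameters, and the StandInRunner class are hypothetical; the stand-in replaces an SdkRuntime object so the sketch runs without the SDK.

```python
import time


def wait_for_tasks(runner, tasks, poll_interval=0.01, timeout=None):
    """Hypothetical helper: block until every Task in `tasks` is done.

    `runner` is any object with an is_task_done(task) -> bool method, such as
    an SdkRuntime object. Raises TimeoutError if `timeout` seconds pass first.
    """
    pending = list(tasks)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        # Drop the handles that have completed since the last poll
        pending = [t for t in pending if not runner.is_task_done(t)]
        if not pending:
            return
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError(f"{len(pending)} task(s) still pending")
        time.sleep(poll_interval)


class StandInRunner:
    """Stand-in for SdkRuntime: integer task n reports done on its n-th poll."""

    def __init__(self):
        self.polls = {}

    def is_task_done(self, task):
        self.polls[task] = self.polls.get(task, 0) + 1
        return self.polls[task] >= task


stand_in = StandInRunner()
wait_for_tasks(stand_in, [1, 2, 3], poll_interval=0)
print(stand_in.polls)  # {1: 1, 2: 2, 3: 3}
```

With an SdkRuntime object, the handles would instead come from calls such as runner.memcpy_h2d(..., nonblock=True) or runner.launch("fn_foo", nonblock=True).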
launch(symbol: str, *args, **kwargs) → Task
Trigger a host-callable function defined in the kernel, with type checking for arguments.

Parameters
- symbol (str) – The exported name of the symbol corresponding to a host-callable function.

Positional Arguments
- The remaining positional arguments match the arguments of the host-callable function. launch performs type checking on them.

Keyword Arguments
- nonblock (bool) – Nonblocking if True, blocking otherwise.

Returns
- task_handle (Task) – Handle to the task launched by launch.
load()
Load the binaries onto simfabric or the WSE. Loading the binaries onto the WSE may take 80 seconds or more.
memcpy_d2h(dest: numpy.ndarray, src: int, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → Task
Receive a device tensor into a host tensor via either copy mode or streaming mode. The data is gathered from the region of interest (ROI), a bounding box starting at coordinate (px, py) with width w and height h.

Parameters
- dest (numpy.ndarray) – A 3-D host tensor A[h][w][l] (l = elem_per_pe), flattened into a 1-D array according to the keyword argument order.
- src (int) – A user-defined color if the keyword argument streaming=True; otherwise, the symbol of a device tensor.
- px (int) – x-coordinate of the start point of the ROI.
- py (int) – y-coordinate of the start point of the ROI.
- w (int) – Width of the ROI.
- h (int) – Height of the ROI.
- elem_per_pe (int) – Number of elements per PE. Elements must be 16-bit or 32-bit. If the tensor has k elements per PE, elem_per_pe is k, even if the data type is 16-bit. For 16-bit data, the user has to extend the tensor to a 32-bit one, with zeros filled in the upper 16 bits.

Keyword Arguments
- streaming (bool) – Streaming mode if True, copy mode otherwise.
- data_type (MemcpyDataType) – 32-bit if MemcpyDataType.MEMCPY_32BIT, or 16-bit if MemcpyDataType.MEMCPY_16BIT. This argument has no effect if streaming is True; in that case, the user must handle the data format appropriately on the device side. Additionally, the underlying type of the tensor dest must be 32-bit, so 16-bit data must be extended to 32 bits with zeros filled in the upper 16 bits.
- order (MemcpyOrder) – Row-major if MemcpyOrder.ROW_MAJOR, or column-major if MemcpyOrder.COL_MAJOR.
- nonblock (bool) – Nonblocking if True, blocking otherwise.

Returns
- task_handle (Task) – Handle to the task launched by memcpy_d2h.
memcpy_h2d(dest: int, src: numpy.ndarray, px: int, py: int, w: int, h: int, elem_per_pe: int, **kwargs) → Task
Send a host tensor to the device via either copy mode or streaming mode. The data is distributed into the region of interest (ROI), a bounding box starting at coordinate (px, py) with width w and height h.

Parameters
- dest (int) – A user-defined color if the keyword argument streaming=True; otherwise, the symbol of a device tensor.
- src (numpy.ndarray) – A 3-D host tensor A[h][w][l] (l = elem_per_pe), flattened into a 1-D array according to the keyword argument order.
- px (int) – x-coordinate of the start point of the ROI.
- py (int) – y-coordinate of the start point of the ROI.
- w (int) – Width of the ROI.
- h (int) – Height of the ROI.
- elem_per_pe (int) – Number of elements per PE. Elements must be 16-bit or 32-bit. If the tensor has k elements per PE, elem_per_pe is k, even if the data type is 16-bit. For 16-bit data, the user has to extend the tensor to a 32-bit one, with zeros filled in the upper 16 bits.

Keyword Arguments
- streaming (bool) – Streaming mode if True, copy mode otherwise.
- data_type (MemcpyDataType) – 32-bit if MemcpyDataType.MEMCPY_32BIT, or 16-bit if MemcpyDataType.MEMCPY_16BIT. This argument has no effect if streaming is True; in that case, the user must handle the data appropriately in the receiving wavelet-triggered task. Additionally, the underlying type of the tensor src must be 32-bit, so 16-bit data must be extended to 32 bits with zeros filled in the upper 16 bits.
- order (MemcpyOrder) – Row-major if MemcpyOrder.ROW_MAJOR, or column-major if MemcpyOrder.COL_MAJOR.
- nonblock (bool) – Nonblocking if True, blocking otherwise.

Returns
- task_handle (Task) – Handle to the task launched by memcpy_h2d.
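For 16-bit transfers, both memcpy_h2d and memcpy_d2h expect a 32-bit host tensor in which each 16-bit value sits in the lower half of a word and the upper 16 bits are zero. The following minimal NumPy sketch builds such a container from float16 data and recovers the 16-bit values from a container, for example one filled by memcpy_d2h. The helper names widen_16_to_32 and narrow_32_to_16 are hypothetical; sdk_utils.memcpy_view, documented below, provides a view-based alternative.

```python
import numpy as np


def widen_16_to_32(values):
    """Hypothetical helper: zero-extend 16-bit values into a uint32 container.

    The 16-bit pattern lands in the lower half of each word; the upper
    16 bits are zero.
    """
    values = np.asarray(values)
    if values.dtype.itemsize != 2:
        raise ValueError("expected a 16-bit dtype such as float16 or int16")
    # view(np.uint16) reinterprets the bits; astype(np.uint32) zero-extends
    return values.view(np.uint16).astype(np.uint32)


def narrow_32_to_16(container, dtype):
    """Hypothetical helper: recover 16-bit values from a uint32 container."""
    container = np.asarray(container, dtype=np.uint32)
    # Keep only the lower 16 bits of each word, then reinterpret them
    return (container & 0xFFFF).astype(np.uint16).view(dtype)


x = np.array([0.5, -2.0, 1.0], dtype=np.float16)
container = widen_16_to_32(x)          # pass this as src to memcpy_h2d
print([hex(v) for v in container])     # ['0x3800', '0xc000', '0x3c00']
restored = narrow_32_to_16(container, np.float16)
```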
run()
Start the simfabric or WSE run and wait for commands from the host runtime.
stop()
Wait for all pending commands (data transfers and kernel function calls) to complete, and then stop simfabric or the WSE. After this call completes, the SdkRuntime object accepts no new commands. stop must be called to end a program; otherwise, the runtime will report an error.
class cerebras.sdk.runtime.sdkruntimepybind.MemcpyDataType
Bases: Enum

Specifies the data size for transfers using memcpy_d2h and memcpy_h2d in copy mode.

Values
- MEMCPY_16BIT
- MEMCPY_32BIT
class cerebras.sdk.runtime.sdkruntimepybind.MemcpyOrder
Bases: Enum

Specifies how the host tensor is mapped to the ROI for transfers using memcpy_d2h and memcpy_h2d.

Values
- ROW_MAJOR
- COL_MAJOR
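The following NumPy sketch shows where the values of one PE sit in the flattened host array for the 3-D tensor A[h][w][l] described for memcpy_h2d and memcpy_d2h. It assumes that ROW_MAJOR and COL_MAJOR correspond to NumPy's C and Fortran orderings of that tensor; treat the COL_MAJOR mapping in particular as an assumption. The helper flat_index is hypothetical, and plain strings stand in for the MemcpyOrder members so the sketch runs without the SDK.

```python
import numpy as np


def flat_index(y, x, k, w, h, l, order):
    """Hypothetical helper: position of A[y][x][k] in the flattened host array.

    y, x: PE row and column relative to the ROI corner (py, px).
    k: element index within the PE, 0 <= k < l (l = elem_per_pe).
    Assumes ROW_MAJOR matches NumPy 'C' order and COL_MAJOR matches 'F' order.
    """
    if order == "ROW_MAJOR":
        return (y * w + x) * l + k  # k varies fastest
    if order == "COL_MAJOR":
        return y + h * (x + w * k)  # y varies fastest
    raise ValueError(f"unknown order: {order}")


h, w, l = 2, 3, 4  # ROI height, ROI width, elements per PE
A = np.arange(h * w * l, dtype=np.uint32).reshape(h, w, l)  # A[h][w][l]
row_major = A.ravel(order="C")
col_major = A.ravel(order="F")

# In row-major order, the l elements of one PE are contiguous
start = flat_index(1, 2, 0, w, h, l, "ROW_MAJOR")
print(row_major[start:start + l])  # [20 21 22 23], i.e. A[1, 2, :]
```

Under the column-major assumption, the elements of that same PE are h*w positions apart (6 here), so a per-PE slice of the flat array is strided rather than contiguous.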
class cerebras.sdk.runtime.sdkruntimepybind.Task
Handle to a task launched by SdkRuntime.
runtime_utils
Utility functions for preparing input and output tensors.
cerebras.sdk.runtime.runtime_utils.convert_input_tensor(portmap: str, arr: numpy.ndarray)
Given a portmap and an array, prepare and return the arguments to pass to memcpy_h2d. This function is only compatible with order=ROW_MAJOR.

Parameters
- portmap (str) – ISL portmap giving the input mapping of the array.
- arr (numpy.ndarray) – Input array to be prepared for the input data transfer.

Returns
(px, py, w, h, elem_per_pe, mapped_arr)
- px (int) – x-coordinate of the start point of the region of interest (ROI).
- py (int) – y-coordinate of the start point of the ROI.
- w (int) – Width of the ROI.
- h (int) – Height of the ROI.
- elem_per_pe (int) – Number of elements per PE.
- mapped_arr (numpy.ndarray) – A prepared input array for use with memcpy_h2d.
cerebras.sdk.runtime.runtime_utils.format_output_tensor(portmap: str, datatype: type, flat_out_arr: numpy.ndarray) → numpy.ndarray
Given a portmap and an unshuffled array filled by a memcpy_d2h call, prepare and return the shuffled data. This function is only compatible with order=ROW_MAJOR.

Parameters
- portmap (str) – ISL portmap giving the output mapping of the array.
- datatype (type) – Type of the data to be transferred.
- flat_out_arr (numpy.ndarray) – Output array filled by memcpy_d2h.

Returns
- output_arr (numpy.ndarray) – Formatted output array with the indexing specified by portmap.
cerebras.sdk.runtime.runtime_utils.prepare_output_tensor(portmap: str, datatype: type)
Given a portmap and a data type, prepare and return the arguments to pass to memcpy_d2h. This function is only compatible with order=ROW_MAJOR.

Parameters
- portmap (str) – ISL portmap giving the output mapping of the array.
- datatype (type) – Type of the data to be transferred.

Returns
(px, py, w, h, elem_per_pe, mapped_arr)
- px (int) – x-coordinate of the start point of the region of interest (ROI).
- py (int) – y-coordinate of the start point of the ROI.
- w (int) – Width of the ROI.
- h (int) – Height of the ROI.
- elem_per_pe (int) – Number of elements per PE.
- mapped_arr (numpy.ndarray) – A prepared output array for use with memcpy_d2h.
sdk_utils
Utility functions for common operations with SdkRuntime.
cerebras.sdk.sdk_utils.memcpy_view(arr: numpy.ndarray, datatype: numpy.dtype)
Returns a 32-, 16-, or 8-bit view of a 32-bit numpy array. For the 16-bit and 8-bit views, only the lower 16 or 8 bits of each 32-bit word are visible.

Parameters
- arr (numpy.ndarray) – A numpy array with 4 bytes per element on which the view is created.
- datatype (numpy.dtype) – The numpy data type to use in the output view. Its itemsize must be 1, 2, or 4 bytes.

Returns
- output_view (numpy.ndarray.view) – Numpy view into arr with the specified data type.

Example:
memcpy_view simplifies the use of various precision data types when copying between host and device. The following Python host code creates a float16 view into a numpy array. Note that this container array must be 32-bit. The user can fill the view with float16 data and copy the container to an array on the device with CSL data type f16.

```python
import numpy as np
from cerebras.sdk import sdk_utils
from cerebras.sdk.runtime.sdkruntimepybind import MemcpyDataType, MemcpyOrder

# runner is an SdkRuntime object; the kernel exports an f16 array x of N elements
x_symbol = runner.get_id('x')

# This container array must be 32-bit
x_container = np.zeros(N, dtype=np.uint32)
x = sdk_utils.memcpy_view(x_container, np.float16)
x.fill(0.5)

# Copy the 32-bit container (not the view) to the single PE of the ROI at (0, 0)
runner.memcpy_h2d(x_symbol, x_container, 0, 0, 1, 1, N, streaming=False,
                  data_type=MemcpyDataType.MEMCPY_16BIT,
                  order=MemcpyOrder.ROW_MAJOR, nonblock=False)
```
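To make the behavior of memcpy_view concrete, here is a minimal pure-NumPy sketch of an equivalent view. It assumes a little-endian host, so the lower bits of each 32-bit word come first in memory. The name lower_bits_view is hypothetical, and this is not the SDK's implementation.

```python
import sys

import numpy as np


def lower_bits_view(arr, datatype):
    """Hypothetical equivalent of memcpy_view: a strided view selecting the
    lower 8 or 16 bits (or all 32 bits) of each 32-bit element of `arr`."""
    if arr.dtype.itemsize != 4:
        raise ValueError("arr must have 4 bytes per element")
    itemsize = np.dtype(datatype).itemsize
    if itemsize not in (1, 2, 4):
        raise ValueError("datatype itemsize must be 1, 2, or 4 bytes")
    if sys.byteorder != "little":
        raise NotImplementedError("sketch assumes a little-endian host")
    # Reinterpret the buffer along the last axis, then step over the upper
    # bytes of each word. Basic slicing returns a view, so writes reach `arr`.
    return arr.view(datatype)[..., :: 4 // itemsize]


x_container = np.zeros(4, dtype=np.uint32)
x = lower_bits_view(x_container, np.float16)
x.fill(0.5)
print([hex(v) for v in x_container])  # ['0x3800', '0x3800', '0x3800', '0x3800']
```

Because the result is a view rather than a copy, filling x updates x_container in place, which is why the container (not the view) is what gets passed to memcpy_h2d.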
debug_util
Utilities for parsing debug output and core files of a simulator run.
class cerebras.sdk.debug.debug_util.debug_util(bindir: Union[pathlib.Path, str])
Bases: object

Loads the ELF files in bindir in order to dump symbols for debugging.

The user does not need to export the symbols in the kernel. debug_util dumps the core and looks up the symbols in the ELFs. If a symbol is not found in the ELF of the PE at Px.y, debug_util emits an error.

The most common causes of this error are either 1) a wrong coordinate passed to get_symbol(), or 2) a correct coordinate, but the symbol was removed by compiler optimization. Use readelf to check whether the symbol exists. If it does not, export the symbol in the kernel to keep it in the ELF.

This class is only supported in the simulator.

Example:

```python
import numpy as np
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime
from cerebras.sdk.debug.debug_util import debug_util

# run the app; dirname is the path to the ELFs
simulator = SdkRuntime(dirname)
simulator.load()
simulator.run()
...
simulator.stop()

# retrieve symbols after the run
debug_mod = debug_util(dirname)
# assume the core rectangle starts at P4.1, its dimension is
# width-by-height, and we want to retrieve the symbol y for every PE
core_offset_x = 4
core_offset_y = 1
for py in range(height):
    for px in range(width):
        t = debug_mod.get_symbol(core_offset_x + px, core_offset_y + py, 'y', np.float32)
        print(f"At (py, px) = {py, px}, symbol y = {t}")
```
get_symbol(col: int, row: int, symbol: str, dtype: numpy.dtype) → numpy.ndarray
Read the value of symbol with the given type at the given PE coordinates. Each call to this function scans the whole fabric, so prefer get_symbol_rect over calling this function in a loop.

Parameters
- col (int) – x-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
- row (int) – y-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
- symbol (str) – Name of the symbol to be read.
- dtype (numpy.dtype) – Numpy data type of the values contained by symbol.

Returns
- output_arr (numpy.ndarray) – Numpy array of output values read at symbol.
get_symbol_rect(rectangle: Rectangle, symbol: str, dtype: numpy.dtype) → numpy.ndarray
Read the value of symbol with the given type for a rectangle of PEs.

Parameters
- rectangle (Rectangle) – Rectangle specified as ((col, row), (width, height)), indexed from the northwest corner of the entire fabric (NOT the program rectangle).
- symbol (str) – Name of the symbol to be read.
- dtype (numpy.dtype) – Numpy data type of the values contained by symbol.

Returns
- output_arr (numpy.ndarray) – Numpy array of output values read at symbol. The first two dimensions of the returned array are PE coordinates (column, row) relative to the rectangle.
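get_symbol_rect indexes its result by (column, row), while the host tensors used with memcpy_h2d and memcpy_d2h are indexed A[h][w][l], row first. A minimal sketch of converting between the two, assuming the symbol holds l elements per PE so the returned array has shape (width, height, l); rect_to_row_major is a hypothetical name. Reading the whole rectangle in one call also avoids the per-PE get_symbol loop that the note on get_symbol advises against.

```python
import numpy as np


def rect_to_row_major(rect_out):
    """Hypothetical helper: reorder a get_symbol_rect result from
    (column, row, ...) to (row, column, ...), matching A[h][w][l]."""
    rect_out = np.asarray(rect_out)
    # Swap only the two PE-coordinate axes; trailing axes are unchanged
    return np.swapaxes(rect_out, 0, 1)


# With the SDK (not run here), for a width-by-height rectangle starting at P4.1:
# rect_out = debug_mod.get_symbol_rect(((4, 1), (width, height)), 'y', np.float32)
# y_by_row = rect_to_row_major(rect_out)  # y_by_row[py][px] is PE (4 + px, 1 + py)

width, height, l = 3, 2, 4
fake_rect_out = np.arange(width * height * l).reshape(width, height, l)
y_by_row = rect_to_row_major(fake_rect_out)
print(y_by_row.shape)  # (2, 3, 4)
```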
read_trace(px: int, py: int, name: str) → list
Parse the CSL trace buffer named name at the given PE coordinates.

Parameters
- px (int) – x-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
- py (int) – y-coordinate of the PE, indexed from the northwest corner of the entire fabric (NOT the program rectangle).
- name (str) – Name of the trace buffer to be read.

Returns
- trace_output (list) – Heterogeneous list of trace values.

Example:
Consider a device kernel that initializes a trace buffer with the CSL debug library and uses it to record values:

```csl
const debug_mod = @import_module("<debug>", .{ .key = "my_trace", .buffer_size = 100 });

fn foo() void {
  debug_mod.trace_timestamp();
  debug_mod.trace_string("Bar");
  debug_mod.trace_i16(1);
}
```

Then the trace can be read in the host code, where debug_mod is a debug_util object, with:

```python
trace_output = debug_mod.read_trace(4, 1, 'my_trace')
print(trace_output)
```

If foo was executed only once, then trace_output will be a heterogeneous list containing a timestamp, the string “Bar”, and the number 1.