Residual¶

This example shows a CSL program that uses a rectangle of 2-by-2 PEs to compute |b - A * x|, i.e., the norm of the residual b - A * x.

A is an M x N matrix. Each PE computes a part of the A'*x multiplication, where A''' is a ``M/2 x N/2 matrix. In other words, each PE essentially does “a fourth” of the multiplication. It then does a row reduction, so that the last column of PEs has the result b - A*x. Finally, the PEs of the last column computes the norm, |b-A*x|, via a column reduction.

The 2-by-2 rectangle is surrounded by memcpy infrastructure which occupies five column of PEs shown below. The memcpy routes the input and output data between the host and the device.

../../_images/residual-memcpy-2-by-2.png

The matrix A, the input vectors x and b and the output scalar (the computed norm |b - A * x|) are supported by memcpy streaming.

The matrix A is distributed into the PEs. For simplicity, the matrix dimensions M x N are assumed even.
The vector x is distributed into the first row PEs. The first row receives x from the memcpy, then broadcasts x into other rows. The incoming vector x is distributed across all N = 4 PEs along the top side of the rectangle.
The vector b is distributed into rows of the first column. The vector b is distributed across all M = 6 PEs along the left side of the rectangle.
The scalar nrm_r is sent out by the PE with coordinates pe_x=1 and pe_y=0.

Three functions GEMV, AXPY, and NRMINF are defined separately, and are loaded by import_module. GEM computes y = A*x, AXPY computes y = alpha*x and NRMINF computes the norm. SIMD operations are used in GEMV and AXPY to reduce the overhead of address computation.

Implementation details of tensor streaming¶

With SdkRuntime, the user can copy/stream data in/out the WSE and/or launch kernels arbitrarily. This example uses a portmap and a symbol to copy a tensor into the corresponding device memory. The symbol can be extracted via SdkRuntime.get_id(). The runtime_utils converts the user’s tensor to an architecture-independent form based on the portmap because the SdkRuntime no longer accepts the parameter portmap.

For example, the following portmap shows a block distribution of tensor A in the core rectangle with block LOCAL_IN_SZ by LOCAL_OUT_SZ per PE.

iportmap_A = f"{{ A[j=0:{M-1}][i=0:{N-1}] -> [PE[i//{LOCAL_IN_SZ}, j//{LOCAL_OUT_SZ}] -> \
    index[i%{LOCAL_IN_SZ}, j%{LOCAL_OUT_SZ}]] }}"

The symbol of tensor A defined in the kernel is extracted by

symbol_A = simulator.get_id("A")

Then runtime_utils.convert_input_tensor() converts user’s tensor A based on the iportmap_A and calls SdkRuntime.memcpy_h2d() to copy A to device memory pointed to by symbol_A.

(px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
simulator.memcpy_h2d(symbol_A, data, False, px, py, w, h, l, 0, False)

The following code copies the vector b into the first column of PEs in the core rectangle:

iportmap_b = f"{{ b[i=0:{M-1}][j=0] -> [PE[0, i//{LOCAL_OUT_SZ}] -> \
    index[i%{LOCAL_OUT_SZ}]] }}"
symbol_y = simulator.get_id("y")
(px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_b, b)
simulator.memcpy_h2d(symbol_y, data, False, px, py, w, h, l, 0, False)

The same idea holds for vector x which is distributed over the first row of PEs in the core rectangle with the following sequence

iportmap_x = f"{{ x[i=0:{N-1}][j=0] -> [PE[i//{LOCAL_IN_SZ}, 0] ->  \
    index[i%{LOCAL_IN_SZ}]] }}"
symbol_x = simulator.get_id("x")
(px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_x, x)
simulator.memcpy_h2d(symbol_x, data, False, px, py, w, h, l, 0, False)

The output norm of |b-A*x| is also copied from device memory directly via the following sequence

oportmap_nrm_r = "{ nrm_r[i=0:0][j=0] -> [PE[1, 0] -> index[i]] }"
symbol_nrm = simulator.get_id("nrm")
(px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
simulator.memcpy_d2h(data, symbol_nrm, False, px, py, w, h, l, 0, False)
nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)

In this example, the user does not specify colors to stream in tensors A, x and b or to stream out a scalar |b-A*x|. The runtime uses internal colors to stream in/out the tensors implicitly. The user only specifies the color LAUNCH to launch a kernel function. In this example, we launch the kernel function bcast_x() without any arguments. So the first argument of call() is bcast_x() which is defined in residual_memcpy.csl, and the second argument is an empty array.

simulator.call("bcast_x", [], nonblock=False)

The user calls SdkRuntime.stop() if all operations are done.

The kernel (residual_memcpy.csl) does not define three WTT (wave-triggered task) to receive A, x and b. Instead, the user defines pointers to A, x, y and nrm, and exports those pointers via export_symbol.

var ptr_A : [*]f32 = &A;
var ptr_x : [*]f32 = &x;
var ptr_y : [*]f32 = &y;
var ptr_nrm : [*]f32 = &nrm;
...
comptime {
    @export_symbol(ptr_A, "A");
    @export_symbol(ptr_x, "x");
    @export_symbol(ptr_y, "y");
    @export_symbol(ptr_nrm, "nrm");
}

Also in the layout file, layout_memcpy.csl, the user must export symbol names corresponding to names in residual_memcpy.csl.

@export_name("A", [*]f32, true);
@export_name("x", [*]f32, true);
@export_name("y", [*]f32, true);
@export_name("nrm", [*]f32, true);

The host runtime copies A, x and b to device via three calls to simulator.memcpy_h2d.

simulator.memcpy_h2d(symbol_A, data, False, px, py, w, h, l, 0, False)
simulator.memcpy_h2d(symbol_x, data, False, px, py, w, h, l, 0, False)
simulator.memcpy_h2d(symbol_y, data, False, px, py, w, h, l, 0, False)

After that, the user can launch a kernel function bcast_x() to broadcast x from 1st row to other rows, to compute the local y=A*x via f_comp(), and to reduce the partial result via f_reduce(). In the end of function bcast_x(), the user must call sys_mod.unblock_cmd_stream() in order to process next command, which is simulator.memcpy_d2h. If the user does not call sys_mod.unblock_cmd_stream(), the program hangs because simulator.memcpy_d2h is not processed. To launch a kernel function in device, the user must export color LAUNCH and the host-callable function bcast_x() via

comptime {
   @export_symbol(bcast_x);
   @rpc(LAUNCH);
}

Finally the user has to compile the kernel with the flag --fabric-offsets=4,1 and several additional parameters to use the new runtime.

--memcpy to compile the infrastructure (a.k.a. halo) to route the data between the host and the device.
pass LAUNCH colors to cslc via LAUNCH_ID.
--channels=k where k is positive between 1 and 16.
----width-west-buf=p --width-east-buf=q where p and q are non-negative numbers. In this example, no additional buffers are inserted in the framework, so both parameters are zero.

cslc LAUNCH_ID:4 --memcpy --channels=1 --width-west-buf=0 --width-east-buf=0

layout.csl¶

// This example computes |b-A*x|_inf on a 2-by-2 rectangle which has P0.0, P0.1, P1.0 and P1.1
// The matrix A is distributed to every PE via memcpy
// The vector x is distributed to first row PEs via memcpy
// The vector b is distributed to first column PEs via memcpy
// P1.0 sends out the result |b-A*x| via memcpy
//
// Each PE receives the vector x and computes A*x locally, then performs a row reduction to finish y = b - A*x
// The last column contains the vector y, and performs a column reduction to obtain |b-A*x|
//
// internal color PSUM is used in row reduction
// internal color NRM is used in column reduction

// global routing colors
param LAUNCH_ID: i16;

// (LOCAL_OUT_SZ, LOCAL_IN_SZ) is the dimension of local tensor
//    A is LOCAL_OUT_SZ-by-LOCAL_IN_SZ
//    x is LOCAL_IN_SZ-by-1
//    y is LOCAL_OUT_SZ-by-1
//
// The unit test sets up the parameters LOCAL_OUT_SZ and LOCAL_IN_SZ via cslc
//    LOCAL_OUT_SZ = M / height
//    LOCAL_IN_SZ  = N / width
// where M, N are dimensions of global tensors A_global, x_global and y_global
//    A_global is M-by-N
//    x_global is N-by-1
//    y_global is M-by-1
param LOCAL_OUT_SZ : i16;
param LOCAL_IN_SZ  : i16;

param width: i16;
param height: i16;

const LAUNCH : color = @get_color(LAUNCH_ID);

const RXACT_X: color  = @get_color(8) ;  // receive x
const PSUM: color     = @get_color(9) ;  // row reduction
const NRM: color      = @get_color(10) ; // column reduction

// local tasks
const COMP: color     = @get_color(12) ;
const REDUCE: color   = @get_color(13) ;
const DONE: color     = @get_color(14) ;

// neither routing color nor local task
const NONE: color     = @get_color(15) ; // NONE is don't care (neither routing color nor entrypoint)
                             // the compiler emits an error for un-initialized colors or parameters
                             // binding a non-routing local color to NONE to avoid the compilation error
const EXIT: color     = @get_color(17);

const memcpy = @import_module( "<memcpy_multi/get_params>", .{
    .width = width,
    .height = height
    });


layout{

    @comptime_assert(2 == width);
    @comptime_assert(2 == height);

    // step 1: configure the rectangle which does not include halo
    @set_rectangle(width, height);

    // step 2: compile csl code for a set of PEx.y and generate out_x_y.elf
    //   format: @set_tile_code(x, y, code.csl, param_binding);

    const comm_params = .{
        .COMP=COMP,
        .REDUCE=REDUCE,
        .DONE=DONE,
        .LOCAL_OUT_SZ=LOCAL_OUT_SZ,
        .LOCAL_IN_SZ=LOCAL_IN_SZ,
        .LAUNCH = LAUNCH,
        .EXIT = EXIT
    };

    const memcpyParams0 = memcpy.get_params(0);
    const route_00 = @concat_structs(
        .{ .RXACT_X = NONE, .TXACT_X=RXACT_X, .RXACT_Y = NONE, .RXACT_NRM = NONE , .TXACT_Y = PSUM, .TXACT_NRM = NONE },
        .{ .memcpyParams = memcpyParams0 } );
    @set_tile_code(0, 0, "residual.csl", @concat_structs(route_00, comm_params) );

    const route_01 = @concat_structs(
        .{ .RXACT_X = RXACT_X, .TXACT_X=NONE, .RXACT_Y = NONE, .RXACT_NRM = NONE , .TXACT_Y = PSUM, .TXACT_NRM = NONE },
        .{ .memcpyParams = memcpyParams0 } );
    @set_tile_code(0, 1, "residual.csl", @concat_structs(route_01, comm_params) );


    const memcpyParams1 = memcpy.get_params(1);
    const route_10 = @concat_structs(
        .{ .RXACT_X = NONE, .TXACT_X=RXACT_X, .RXACT_Y = PSUM , .RXACT_NRM = NRM , .TXACT_Y = NONE, .TXACT_NRM = NONE },
        .{ .memcpyParams = memcpyParams1 } );
    @set_tile_code(1, 0, "residual.csl", @concat_structs(route_10, comm_params) );

    const route_11 = @concat_structs(
        .{ .RXACT_X = RXACT_X, .TXACT_X=NONE, .RXACT_Y = PSUM , .RXACT_NRM = NONE, .TXACT_Y = NONE, .TXACT_NRM = NRM },
        .{ .memcpyParams = memcpyParams1 } );
    @set_tile_code(1, 1, "residual.csl", @concat_structs(route_11, comm_params) );

    // step 3: global and internal routing
    //  format: @set_color_config(x, y, color, route);

    // routing of RXACT_X
    // - cliff distribution of x along columns
    // - broadcast from the north to the south
    // py = 0 receives x via H2D_2, and forwards x to south
    // py = 1 receives x from north
    @set_color_config(0, 0, RXACT_X, .{ .routes = .{ .rx = .{RAMP},  .tx = .{SOUTH} } });
    @set_color_config(0, 1, RXACT_X, .{ .routes = .{ .rx = .{NORTH}, .tx = .{RAMP} } });

    @set_color_config(1, 0, RXACT_X, .{ .routes = .{ .rx = .{RAMP}, .tx = .{SOUTH} } });
    @set_color_config(1, 1, RXACT_X, .{ .routes = .{ .rx = .{NORTH}, .tx = .{RAMP} } });

    // routing of PSUM (for row reduction)
    // P0.0, P0.1: send partial sum
    // P1.0, P1.1: receive partial sum
    @set_color_config(0, 0, PSUM, .{ .routes = .{ .rx = .{RAMP}, .tx = .{EAST} } });
    @set_color_config(0, 1, PSUM, .{ .routes = .{ .rx = .{RAMP}, .tx = .{EAST} } });
    @set_color_config(1, 0, PSUM, .{ .routes = .{ .rx = .{WEST}, .tx = .{RAMP} } });
    @set_color_config(1, 1, PSUM, .{ .routes = .{ .rx = .{WEST}, .tx = .{RAMP} } });

    // routing of NRM (for column reduction)
    // P1.0: receive local nrm from P1.1
    // P1.1: send local nrm to P1.0
    @set_color_config(1, 0, NRM, .{ .routes = .{ .rx = .{SOUTH}, .tx = .{RAMP} } });
    @set_color_config(1, 1, NRM, .{ .routes = .{ .rx = .{RAMP}, .tx = .{NORTH} } });

    // export symbol name
    @export_name("A", [*]f32, true);
    @export_name("x", [*]f32, true);
    @export_name("y", [*]f32, true);
    @export_name("nrm", [*]f32, true);
    @export_name("bcast_x", fn()void);
}

residual.csl¶

// This example computes |b-A*x|_inf on a 2-by-2 rectangle which has P0.0, P0.1, P1.0 and P1.1
// The matrix A is distributed to every PE via memcpy
// The vector x is distributed to first row PEs via memcpy
// The vector b is distributed to first column PEs via memcpy
// P1.0 sends out the result |b-A*x| via memcpy
//
// The host sends matrix A, vector x and vector b into the core rectangle, launches a RPC to
// broadcast vector x from 1st row to other rows, computes A*x locally, then performs a row
// reduction to finish y = b - A*x.
// The last column contains the vector y and performs a column reduction to compute |b-A*x|

// Notation: a PE (Px.y) is labeled as (px = x, py = y)

param memcpyParams: comptime_struct;

param LAUNCH: color; // a routable color for RPC

// local colors
param RXACT_X: color; // py = 0: don't care
                      // py > 0: receive vector x from the north
param TXACT_X: color; // py = 0: send x to the south
param RXACT_Y: color; // px = 0: forward b-A*x to the east
                      // px = 1: receive partial sum (b - A*x) from px = 0
param TXACT_Y: color;   // px = 0: send parital sum to px = 1
param RXACT_NRM: color; // P1.0: receive nrm from P1.1
param TXACT_NRM: color; // P1.1: send local nrm to P1.0

// local tasks
param COMP: color;     // compute local Ax = A*x
param REDUCE: color;   // compute either local b - A*x or local y - A*x
param DONE: color;     // compute |b-A*x| and send out the result
param EXIT: color;     // entrypoint to leave RPC


param LOCAL_OUT_SZ:i16 ;  // dimension of submatrix A is LOCAL_OUT_SZ-by-LOCAL_IN_SZ
                          // dimension of subvector y is LOCAL_OUT_SZ-by-1
param LOCAL_IN_SZ:i16  ;  // dimension of subvector x is LOCAL_IN_SZ-by-1


const fabric = @import_module("<layout>");

fn get_x_coord() u16 {
    return fabric.get_x_coord();
}
fn get_y_coord() u16 {
    return fabric.get_y_coord();
}

const sys_mod = @import_module( "<memcpy_multi/memcpy>", @concat_structs(memcpyParams, .{
     .LAUNCH = LAUNCH
    }));


////////////////////////////////////////////////////////////////////////////////
// Main memory (48KB)
////////////////////////////////////////////////////////////////////////////////

// A is LOCAL_OUT_SZ-by-LOCAL_IN_SZ with lda=LOCAL_OUT_SZ
// x is LOCAL_IN_SZ-by-1
// y is LOCAL_OUT_SZ-by-1

// Assumption
// - _MAX_SIZE_X >= LOCAL_IN_SZ
// - _MAX_SIZE_Y >= LOCAL_OUT_SZ
// - _MAX_SIZE_A >= LOCAL_OUT_SZ*LOCAL_IN_SZ
const  _MAX_SIZE_A = LOCAL_OUT_SZ * LOCAL_IN_SZ;
const  _MAX_SIZE_X = LOCAL_IN_SZ;
const  _MAX_SIZE_Y = LOCAL_OUT_SZ;

var A  = @zeros([_MAX_SIZE_A]f32);
var x  = @zeros([_MAX_SIZE_X]f32);
var y  = @zeros([_MAX_SIZE_Y]f32);

// workspace for A*x
var Ax = @zeros([_MAX_SIZE_Y]f32);

// workspace for outer-product version of GEMV
var ws = @zeros([_MAX_SIZE_Y]f32);

var nrm = @zeros([1]f32);
var local_nrm : f32 = @as(f32, 0.0);

// (_px, _py) is the coordinate of region of interest, set by the function bcast_x
// which starts the computation
var _px : i16 ;
var _py : i16 ;

// WARNING: export pointers, not arrays
var ptr_A : [*]f32 = &A;
var ptr_x : [*]f32 = &x;
var ptr_y : [*]f32 = &y;
var ptr_nrm : [*]f32 = &nrm;

////////////////////////////////////////////////////////////////////////////////
// DSDs
// data-structure descriptors (DSDs), loaded into data-structure registers (DSRs) to configure DSR
// The DSDs are typically put in their own data segment that is placed right above lo-mem.?
//
// The content of a DSR is a DSD, which is a data structure stored in memory.
// A DSR is a numbered hardware register and, like a GPR, is memory mapped.
// DSRs hold DSDs. Their numbers are stored in instruction operand fields, where the DSD held by the DSR
// serves to describe the actual data operand, which is a memory or fabric tensor.
////////////////////////////////////////////////////////////////////////////////

const mem_x_buf_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{LOCAL_IN_SZ} -> x[i] });

const fab_recv_x_wdsd = @get_dsd(fabin_dsd, .{
    .extent = LOCAL_IN_SZ,
    .fabric_color = RXACT_X,
    .input_queue = @get_input_queue(1)
});

const fab_trans_x_wdsd = @get_dsd(fabout_dsd, .{
    .extent = LOCAL_IN_SZ,
    .fabric_color = TXACT_X,
    .output_queue = @get_output_queue(1)
});

const mem_y_buf_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{LOCAL_OUT_SZ} -> y[i] });

const fab_recv_y_wdsd = @get_dsd(fabin_dsd, .{
    .extent = LOCAL_OUT_SZ,
    .fabric_color = RXACT_Y,
    .input_queue = @get_input_queue(2)
});


const fab_trans_psum_wdsd = @get_dsd(fabout_dsd, .{
    .extent = LOCAL_OUT_SZ,
    .fabric_color = TXACT_Y,
    .output_queue = @get_output_queue(2)
});


const mem_nrm_buf_dsd = @get_dsd(mem1d_dsd, .{ .tensor_access = |i|{1} -> nrm[i] });

const fab_recv_nrm_wdsd = @get_dsd(fabin_dsd, .{
    .extent = 1,
    .fabric_color = RXACT_NRM,
    .input_queue = @get_input_queue(3)
});


// only used in P1.1, send the partial nrm to P1.0
const fab_trans_nrm_wdsd_p11 = @get_dsd(fabout_dsd, .{
    .extent = 1,
    .fabric_color = TXACT_NRM,
    .output_queue = @get_output_queue(3)
});


////////////////////////////////////////////////////////////////////////////////
// Tasks
////////////////////////////////////////////////////////////////////////////////

const gemv_mod = @import_module( "gemv.csl" , .{  .sizeA = _MAX_SIZE_A, .sizeX = _MAX_SIZE_X, .sizeY = _MAX_SIZE_Y });

const axpy_mod = @import_module( "axpy.csl" , .{  .sizeXY = _MAX_SIZE_Y });

const nrminf_mod = @import_module( "nrminf.csl" , .{  .sizeX = _MAX_SIZE_Y });


// All PEs compute local A*x after A and x are received
task f_comp() void {

    //  Ax = A * x + 0*y
    var alpha: f32 = @as(f32, 1.0);
    var beta : f32 = @as(f32, 0.0);
    gemv_mod.sgemv_outer(LOCAL_OUT_SZ, LOCAL_IN_SZ, alpha, &A, LOCAL_OUT_SZ, &x, beta, &Ax, &ws);

    // px = 0: receive vector b from the host
    // px = 1: receive partial sum from the west
    if (0 == _px){
        // y = b is ready
        @activate(REDUCE);
    }else{
        // receive y from the west
        @mov32(mem_y_buf_dsd, fab_recv_y_wdsd, .{.async=true, .activate = f_reduce});
    }
}


// px = 0: compute y=b-A*x, and forward partial sum y to the east
// px = 1: compute the norm |y_recv - A*x| and perform reduction of local norm
task f_reduce() void {
    // y  = b if px = 0
    //    = partial sum if px = 1
    // Ax = local gemv

    // px = 0: y = b - A*x
    // px = 1: y = y_recv - A*x, where y_recv = b-A*x in px=0
    var alpha : f32 =  @as(f32, -1.0);
    axpy_mod.saxpy( LOCAL_OUT_SZ, alpha, &Ax, &y);

    if (0 == _px){
        // send partial sum to the east and finish (every PE must call f_exist)
        @mov32(fab_trans_psum_wdsd, mem_y_buf_dsd, .{.async=true, .activate = f_exit});
    }else{
        // px = 1: compute norm of local (b-A*x)
        nrminf_mod.snrminf(LOCAL_OUT_SZ, &y, &local_nrm);

        if (0 == _py){
            // P1.0: receive nrm from the south
            // f_done will call f_exist
            @mov32(mem_nrm_buf_dsd, fab_recv_nrm_wdsd, .{.async=true, .activate = f_done});
        }else{
            // P1.1: send local nrm to north and finish
            @fmovs(fab_trans_nrm_wdsd_p11, local_nrm);
            @activate(EXIT); // every PE must call f_exist
        }
    }
}


// Only P1.0 triggers f_done to finish the reduction of |b-A*x|
task f_done() void {
    // loc_nrm = |b - A*x| locally
    // nrm[0] = partial result of |b - A*x| from south

    if (nrm[0] < local_nrm){
        nrm[0] = local_nrm;
    }
    // nrm[0] = |b - A*x|
    @activate(EXIT); // every PE must call f_exist
}


// the calling sequence
// px = 0: bcast_x --> f_comp --> f_reduce --> f_exit
// px = 1, py = 0: bcast_x --> f_comp --> f_reduce --> f_done --> f_exit
// px = 1, py = 1: bcast_x --> f_comp --> f_reduce --> f_exit
fn bcast_x() void {

    _px = @as(i16, get_x_coord());
    _py = @as(i16, get_y_coord());

    if (0 == _py){
        // broadcast x to south PEs
        @mov32(fab_trans_x_wdsd, mem_x_buf_dsd, .{.async=true, .activate = f_comp});
    }else{
        // 0 < _py: receive x from north
        @mov32(mem_x_buf_dsd, fab_recv_x_wdsd, .{.async=true, .activate = f_comp});
    }
}

task f_exit() void {
    // the user must unblock cmd color for every PE
    sys_mod.unblock_cmd_stream();
}

comptime {

    // use microthreads to read x, b or partial sum y/nrm, so block RXACT_X, RXACT_Y and RXACT_NRM
    @block(RXACT_X);
    @block(RXACT_Y);
    @block(RXACT_NRM);

    @bind_task(f_comp, COMP);
    @bind_task(f_reduce, REDUCE);
    @bind_task(f_done, DONE);

    @bind_task(f_exit, EXIT);

    @export_symbol(ptr_A, "A");
    @export_symbol(ptr_x, "x");
    @export_symbol(ptr_y, "y");
    @export_symbol(ptr_nrm, "nrm");

    @export_symbol(bcast_x);
    @rpc(LAUNCH);
}

gemv.csl¶

// inner-product version of GEMV
//
// http://www.netlib.org/lapack/explore-html/db/d58/sgemv_8f.html
// SGEMV - perform the matrix-vector operation
//     y := alpha*A*x + beta*y
//
// @param[in] m      number of rows of the matrix A
// @param[in] n      number of columns of the matrix A
// @param[in] alpha  scalar
// @param[in] A      array of dimension (lda, n)
// @param[in] lda    leading dimension of the matrix A which is column-major
// @param[in] x      array of dimension n
// @param[in] beta   scalar
// @param[in,out] y  array of dimension m
//                   entry: if beta is zero, y can be NAN or INF


param sizeA : i16 ;  // size of A, sizeA >= lda*n
param sizeX : i16 ;  // size of x, sizeX >= n
param sizeY : i16 ;  // size of y, sizeY >= m


fn sgemv_inner( m: i16, n: i16, alpha: f32,  A: *[sizeA]f32,  lda: i16, x: *[sizeX]f32, beta: f32, y: *[sizeY]f32 ) void {

    var row : i16 = 0;
    while(row < m) : (row +=1) {
        var dot : f32 = @as(f32, 0);
        var col : i16 = 0;
        while(col < n) : (col += 1) {
            var Aij : f32 = (A.*)[row + col*lda];
            var xj  : f32 = (x.*)[col];
            dot = dot + Aij*xj;
        }

        // dot = A(row,:)*x
        // WARNING: if beta is zero, y can be NAN or INF
        var yi : f32 = @as(f32, 0);
        if (beta != @as(f32, 0)){
            yi = (y.*)[row];
        }
        yi = alpha*dot + beta*yi;
        (y.*)[row] = yi;
    }
}



// outer-product version of GEMV
//
// Ax = 0
// for col= 0:n-1
//    Ax = Ax + A(:, col) * x(col)
// end
// if beta is not zero
//    y = beta * y
// end
// y = y + alpha * Ax
//
// http://www.netlib.org/lapack/explore-html/db/d58/sgemv_8f.html
// SGEMV - perform the matrix-vector operation
//     y := alpha*A*x + beta*y
//
// @param[in] m      number of rows of the matrix A
// @param[in] n      number of columns of the matrix A
// @param[in] alpha  scalar
// @param[in] A      array of dimension (lda, n)
// @param[in] lda    leading dimension of the matrix A which is column-major
// @param[in] x      array of dimension n
// @param[in] beta   scalar
// @param[in,out] y  array of dimension m
//                   entry: if beta is zero, y can be NAN or INF
// @param[in,out] ws workspace, array of dimension m


// To change the base address and the length of a DSD, csl requires a dummy DSD.
// The type here doesn't matter
const dummy = @zeros([1]i16);
// The length doesn't matter either since csl will overwrite it
const dummy_dsd = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{42} -> dummy[i]});

fn sgemv_outer( m: i16, n: i16, alpha: f32,  A: *[sizeA]f32,  lda: i16, x: *[sizeX]f32, beta: f32, y: *[sizeY]f32 , ws:*[sizeY]f32 ) void {

    // bind vector Ax to a DSD
    var mem_Ax_buf_dsd = @set_dsd_base_addr(dummy_dsd, ws);
    mem_Ax_buf_dsd = @set_dsd_length(mem_Ax_buf_dsd, @bitcast(u16, m) );

    // bind vector y to a DSD
    // it is based on mem_Ax_buf_dsd, so no need to set the length again
    var mem_y_buf_dsd = @set_dsd_base_addr(mem_Ax_buf_dsd, y);

    // Ax = 0
    var zero : f32 = @as(f32, 0);
    @fmovs(mem_Ax_buf_dsd, zero);

    // Ax = accumulate(A(:, col) * x(col))
    var col : i16 = 0;
    while(col < n) : (col += 1) {
        var xj : f32 = (x.*)[col];
        // bind vector w = A(:,col) to a DSD
        // it is based on mem_Ax_buf_dsd, so no need to set the length again
        var mem_w_buf_dsd = @set_dsd_base_addr(mem_Ax_buf_dsd, A);
        mem_w_buf_dsd = @increment_dsd_offset(mem_w_buf_dsd, col * lda, f32);
        @fmacs(mem_Ax_buf_dsd, mem_Ax_buf_dsd, mem_w_buf_dsd, xj);
    }

    // y = beta * y
    // if beta is zero, y can be NAN or INF
    if (beta != zero ){
        @fmuls(mem_y_buf_dsd, mem_y_buf_dsd, beta);
    }

    // y = y + alpha * Ax
    @fmacs(mem_y_buf_dsd, mem_y_buf_dsd, mem_Ax_buf_dsd, alpha);
}

axpy.csl¶

// http://www.netlib.org/lapack/explore-html/d8/daf/saxpy_8f.html
// SAXPY constant times a vector plus a vector.
//     y = y + alpha*x
//
// @param[in] n      number of elements of the input vectors
// @param[in] alpha  scalar
// @param[in] x      array of dimension n
//                   x[j] can be NAN or INF if alpha is zero
// @param[in,out] y  array of dimension n


param sizeXY : i16 ;  // size of x and y, sizeXY >= n

// To change the base address and the length of a DSD, csl requires a dummy DSD.
// The type here doesn't matter
const dummy = @zeros([1]i16);
// The length doesn't matter either since csl will overwrite it
const dummy_dsd = @get_dsd(mem1d_dsd, .{.tensor_access = |i|{42} -> dummy[i]});

fn saxpy( n: i16, alpha: f32, x: *[sizeXY]f32, y: *[sizeXY]f32 ) void {

    // bind vector x to a DSD
    var mem_x_buf_dsd = @set_dsd_base_addr(dummy_dsd, x);
    mem_x_buf_dsd = @set_dsd_length(mem_x_buf_dsd, @bitcast(u16, n) );

    // bind vector y to DSD
    // it is based on mem_x_buf_dsd, so no need to set the length again
    var mem_y_buf_dsd = @set_dsd_base_addr(mem_x_buf_dsd, y);

    // fast path: if alpha is zero, no need to compute
    if (alpha == @as(f32, 0)){
        return;
    }

    // y[j] = y[j] + x[j]*alpha, j = 0,1,2,...,n-1
    // The SIMD fmacs replaces the following for-loop
    // ========
    // var row : i16 = 0;
    // while(row < n) : (row +=1) {
    //     (y.*)[row] = (y.*)[row] + alpha * (x.*)[row];
    // }
    // ========
    @fmacs(mem_y_buf_dsd, mem_y_buf_dsd, mem_x_buf_dsd, alpha);

}

nrminf.csl¶

// http://www.netlib.org/lapack/explore-html/d6/d12/snrm2_8f90.html
//  SNRMINF returns the maximum of a vector
//     SNRMINF = max(|x|)
//
// @param[in] n       number of elements of the vector x
// @param[in] x       array of dimension n
// @param[out] result scalar
//                    result = max(|x|)

param sizeX : i16 ; // size of x, sizeX >= n

fn snrminf( n: i16, x: *[sizeX]f32, result: *f32 ) void {

    var zero: f32 = @as(f32, 0.0);
    var nrm_r : f32 = zero;
    var row : i16 = 0;
    while(row < n) : (row +=1) {
        var yi : f32 = (x.*)[row];
        if ( zero > yi ) {
            yi = -yi;
        }
        if ( nrm_r < yi ) {
            nrm_r = yi;
        }
    }
    (result.*) = nrm_r;
}

run.py¶

#!/usr/bin/env cs_python

""" Compute |b-A*x| using a 2-by-2 PE rectangle

   The 2-by-2 rectangle is surrounded by a halo of size 1.
   The halo is used to route the input and output data between the host and the device.
   It does not impact the layout index of the kernel code.
   For example, the kernel has 2-by-2 PEs, with the index P0.0, P1.0, P0.1, P1.1
   in the layout/routing configuration.
   The compiler generates ELFs out_0_0.elf, out_0_1.elf, out_1_0.elf and out_1_1.elf.
   However the user needs global coordinate (including halo) for debugging, for example
   P0.0 of the kernel is P1.1 when the user calls sdk_debug_shell to dump the trace or
   landing log.

   Each PE computes A*x and does a row reduction, so last column has the result b - A*x.
   Then last column computes |b-A*x| via a column reduction.

   To simplify the example, the dimensions M and N are assumed even.
   Three functions gemv, axpy and nrminf are used to compute y=A*x, y=y+alpha*x and
   |x|_inf locally.
   Such functions are imported as modules via gemv.csl, axpy.csl and nrminf.csl.
   The arrays A, x and y are passed into the function as pointers.

   The vector x is distributed into columns. The first row receives x from the fabric,
   then broadcasts x into other rows.

   The vector b is distributed into rows of the first column.

   P1.0 computes |b-A*x| which is sent out..

   One can use the following command to check the landing log of P0.0,
    sdk_debug_shell wavelet-trace --artifact_dir . --x 1 --y 1 trace

"""


import os
import argparse
from pathlib import Path
from typing import Optional
import shutil
import subprocess
import numpy as np

from cerebras.sdk.runtime import runtime_utils # pylint: disable=no-name-in-module
from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType # pylint: disable=no-name-in-module
from cerebras.sdk.runtime.sdkruntimepybind import MemcpyOrder # pylint: disable=no-name-in-module

FILE_PATH = os.path.realpath(__file__)
RESIDUAL_DIR = os.path.dirname(FILE_PATH)
BENCHMARKS_DIR = os.path.dirname(RESIDUAL_DIR)
CSL_DIR = os.path.dirname(BENCHMARKS_DIR)
CSLC = os.path.join(CSL_DIR, "build") + "/bin/cslc"


def cast_uint32(x):
  if isinstance(x, (np.float16, np.int16, np.uint16)):
    z = x.view(np.uint16)
    return np.uint32(z)
  if isinstance(x, (np.float32, np.int32, np.uint32)):
    return x.view(np.uint32)
  if isinstance(x, int):
    return np.uint32(x)

  raise RuntimeError(f"type of x {type(x)} is not supported")



def parse_args():
  """ parse the command line """

  parser = argparse.ArgumentParser(description="residual parameters.")
  parser.add_argument("-m", type=int,
                      help="number of rows of the matrix A")
  parser.add_argument("-n", type=int,
                      help="number of columns of the matrix A. If A is square, \
                            n is the dimension of the matrix and m is not used")
  parser.add_argument(
      "--cslc",
      required=False,
      default=CSLC,
      help=f"The path to the csl compiler. Defaults to '{CSLC}'",
  )
  parser.add_argument(
      "-c", "--compile", action="store_true", help="Compile the code."
  )
  parser.add_argument(
      "--name",
      required=False,
      default="out",
      help="prefix of ELF files",
  )
  parser.add_argument("--cmaddr", help="IP:port for CS system")
  parser.add_argument(
      "--fabric-dims",
      help="Fabric dimension, i.e. <W>,<H>")

  parser.add_argument(
      "--width-west-buf",
      default=0, type=int,
      help="width of west buffer")
  parser.add_argument(
      "--width-east-buf",
      default=0, type=int,
      help="width of east buffer")
  parser.add_argument(
      "--n_channels",
      default=1, type=int,
      help="Number of memcpy \"channels\" (LVDS/streamers for both input and output)  to use \
            when memcpy support is compiled with this program. If this argument is not present, \
            or is 0, then the previous single-LVDS version is compiled.")
  parser.add_argument(
      "--arch",
      help="wse1 or wse2. Default is wse2 when not supplied.")

  args = parser.parse_args()

  return args


def csl_compile(
    cslc: str,
    width: int,
    height: int,
    file_config: str,
    elf_dir: str,
    fabric_width: int,
    fabric_height: int,
    core_fabric_offset_x: int,
    core_fabric_offset_y: int,
    compile_flag: bool,
    arch: Optional[str],
    LAUNCH: int,
    LOCAL_OUT_SZ: int,
    LOCAL_IN_SZ: int,
    n_channels: int,
    width_west_buf: int,
    width_east_buf: int
    ):
  """Generate ELFs for the layout, one ELF per PE"""

  comp_dir = elf_dir

  if compile_flag:
    args = []
    args.append(cslc) # command
    args.append(file_config) # file
    args.append(f"--fabric-dims={fabric_width},{fabric_height}") # options
    args.append(f"--fabric-offsets={core_fabric_offset_x},{core_fabric_offset_y}") # options
    args.append(f"--params=width:{width},height:{height}") # options
    args.append(f"--params=LOCAL_OUT_SZ:{LOCAL_OUT_SZ},LOCAL_IN_SZ:{LOCAL_IN_SZ}") # options

    args.append(f"--params=LAUNCH_ID:{LAUNCH}") # options

    args.append(f"-o={comp_dir}")
    if arch is not None:
      args.append(f"--arch={arch}")
    args.append("--memcpy")
    args.append(f"--channels={n_channels}")
    args.append(f"--width-west-buf={width_west_buf}")
    args.append(f"--width-east-buf={width_east_buf}")
    print(f"subprocess.check_call(args = {args}")
    subprocess.check_call(args)
  else:
    print("[csl_compile] use pre-compile ELFs")



def main():
  """Main method to run the example code."""

  args = parse_args()

#  if not, must redo the routing
  width = 2
  height = 2

  if args.m is not None:
    M = args.m
  else:
    M = 6

  if args.n is not None:
    N = args.n
  else:
    N = 4

  LOCAL_OUT_SZ = M // height
  LOCAL_IN_SZ = N // width

  assert M == (LOCAL_OUT_SZ*height), "M must be multiple of LOCAL_OUT_SZ"
  assert N == (LOCAL_IN_SZ*width), "N must be multiple of LOCAL_IN_SZ"

  print(f"M = {M}, N = {N}, width = {width}, height = {height}")

  # prepare host data and reference solution
  np.random.seed(2)
  A = np.arange(M*N).reshape(M, N).astype(np.float32)
  x = np.arange(N).reshape(N, 1).astype(np.float32) + 100
  b = np.arange(M).reshape(M, 1).astype(np.float32) + 200

  Ax = np.matmul(A, x)
  r = b - Ax

  nrm_r = np.linalg.norm(r, np.inf)

  print(f"nrm_r = |b - A*x| = {nrm_r}")

  # prepare the simulation

  # core dump after execution is complete
  # layout of a rectangle
  code_csl = "layout_memcpy.csl"

  n_channels = args.n_channels
  width_west_buf = args.width_west_buf
  width_east_buf = args.width_east_buf
  print(f"n_channels = {n_channels}")
  print(f"width_west_buf = {width_west_buf}, width_east_buf = {width_east_buf}")

  fabric_offset_x = 1
  fabric_offset_y = 1
  fabric_width = 0
  fabric_height = 0
  if args.fabric_dims:
    w_str, h_str = args.fabric_dims.split(",")
    fabric_width = int(w_str)
    fabric_height = int(h_str)

  if fabric_width == 0 or fabric_height == 0:
    fabric_width = fabric_offset_x + 3 + width + 2 + 1 + width_west_buf + width_east_buf
    fabric_height = fabric_offset_y + height + 1

  core_fabric_offset_x = fabric_offset_x + 3 + width_west_buf
  core_fabric_offset_y = fabric_offset_y

  print(f"fabric_width = {fabric_width}, fabric_height = {fabric_height}")
  print(f"core_fabric_offset_x = {core_fabric_offset_x}, ")
  print(f"core_fabric_offset_y = {core_fabric_offset_y}")

  LAUNCH = 4

  # compile csl files and generate compilation ELFs
  csl_compile(
      args.cslc,
      width,
      height,
      code_csl,
      args.name,
      fabric_width,
      fabric_height,
      core_fabric_offset_x,
      core_fabric_offset_y,
      args.compile,
      args.arch,
      LAUNCH,
      LOCAL_OUT_SZ,
      LOCAL_IN_SZ,
      n_channels,
      width_west_buf,
      width_east_buf)
  if args.compile:
    print("COMPILE ONLY: EXIT")
    return

  memcpy_dtype = MemcpyDataType.MEMCPY_32BIT
  memcpy_order = MemcpyOrder.ROW_MAJOR
  simulator = SdkRuntime(args.name, cmaddr=args.cmaddr)

  symbol_A = simulator.get_id("A")
  symbol_x = simulator.get_id("x")
  symbol_y = simulator.get_id("y")
  symbol_nrm = simulator.get_id("nrm")
  print(f"symbol_A = {symbol_A}")
  print(f"symbol_x = {symbol_x}")
  print(f"symbol_y = {symbol_y}")
  print(f"symbol_nrm = {symbol_nrm}")

  simulator.load()
  simulator.run()

  # A is M-by-N
  iportmap_A = f"{{ A[j=0:{M-1}][i=0:{N-1}] -> [PE[i//{LOCAL_IN_SZ}, j//{LOCAL_OUT_SZ}] -> \
        index[i%{LOCAL_IN_SZ}, j%{LOCAL_OUT_SZ}]] }}"
  print(f"iportmap_A = {iportmap_A}")

  # x distributes to {py = 0}
  iportmap_x = f"{{ x[i=0:{N-1}][j=0] -> [PE[i//{LOCAL_IN_SZ}, 0] ->  \
        index[i%{LOCAL_IN_SZ}]] }}"
  print(f"iportmap_x = {iportmap_x}")

  # b distributes to {px = 0}
  #  i = N*(i/N) + (i % N)  ==> PE_y = (i/N)
  iportmap_b = f"{{ b[i=0:{M-1}][j=0] -> [PE[0, i//{LOCAL_OUT_SZ}] -> \
        index[i%{LOCAL_OUT_SZ}]] }}"
  print(f"iportmap_b = {iportmap_b}")

  # |b-A*x| is from P1.0
  oportmap_nrm_r = "{ nrm_r[i=0:0][j=0] -> [PE[1, 0] -> index[i]] }"
  print(f"oportmap_nrm_r = {oportmap_nrm_r}")

  # prepare A, x and b via memcpy
  # use the runtime_utils library to calculate memcpy args and shuffle data
  (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_A, A)
  simulator.memcpy_h2d(symbol_A, data, px, py, w, h, l,
                       streaming=False, data_type=memcpy_dtype,
                       order=memcpy_order, nonblock=False)
  (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_x, x)
  simulator.memcpy_h2d(symbol_x, data, px, py, w, h, l,
                       streaming=False, data_type=memcpy_dtype,
                       order=memcpy_order, nonblock=False)
  (px, py, w, h, l, data) = runtime_utils.convert_input_tensor(iportmap_b, b)
  simulator.memcpy_h2d(symbol_y, data, px, py, w, h, l,
                       streaming=False, data_type=memcpy_dtype,
                       order=memcpy_order, nonblock=False)

  # trigger the computation
  simulator.call("bcast_x", [], nonblock=False)

  # receive |b-A*x| from P1.1
  # use the runtime_utils library to calculate memcpy args and manage output data
  (px, py, w, h, l, data) = runtime_utils.prepare_output_tensor(oportmap_nrm_r, np.float32)
  simulator.memcpy_d2h(data, symbol_nrm, px, py, w, h, l,
                       streaming=False, data_type=memcpy_dtype,
                       order=memcpy_order, nonblock=False)
  nrm_r_cs = runtime_utils.format_output_tensor(oportmap_nrm_r, np.float32, data)

  simulator.stop()

  if args.cmaddr is None:
    # move simulation log and core dump to the given folder
    dst_log = Path(f"{args.name}/sim.log")
    src_log = Path("sim.log")
    if src_log.exists():
      shutil.move(src_log, dst_log)

    dst_trace = Path(f"{args.name}/simfab_traces")
    src_trace = Path("simfab_traces")
    if dst_trace.exists():
      shutil.rmtree(dst_trace)
    if src_trace.exists():
      shutil.move(src_trace, dst_trace)

  print(f"`nrm_r`     from CPU:\n{nrm_r}")
  print(f"`nrm_r_cs`  from CS1 (1-by-1 matrix):\n{nrm_r_cs}")

  dr = abs(nrm_r - nrm_r_cs[(0, 0)])
  print(f"|nrm_r - nrm_r_cs| = {dr}")

  assert np.allclose(nrm_r, nrm_r_cs[(0, 0)], 1.e-5)
  print("\nSUCCESS!")


if __name__ == "__main__":
  main()

commands.sh¶

#!/usr/bin/env bash

set -e

cslc ./layout.csl --fabric-dims=9,4 --fabric-offsets=4,1 \
--params=width:2,height:2 \
--params=LOCAL_OUT_SZ:3,LOCAL_IN_SZ:2,LAUNCH_ID:4 -o=out --memcpy --channels=1 \
--width-west-buf=0 --width-east-buf=0
cs_python run.py --name out -m=6 -n=4

SDK Documentation (0.9.0)

Residual

Contents