stencil-3d-7pts

stencil-3d-7pts

This example evaluates the performance of 7-point stencil. The kernel records the start and end of spmv by tsc counter. In addition the tsc counters of all PEs are not sychronized in the beginning. To avoid the timing variation among those PEs, sync() synchronizes all PEs and samples the reference clock.

The kernel kernel.csl defines a couple of host-callable functions, f_sync(), f_tic() and f_toc() in order to synchronize the PEs and record the timing of spmv.

The kernel allreduce/pe.csl performs a reduction over the whole rectangle to synchronize the PEs, then the bottom-right PE sends a signal to other PEs to sample the reference clock.

The kernel stencil_3d_7pts/pe.csl performs a matrix-vector product (spmv) where the matrix has 7 diagonals corresponding to 7 point stencil. The stencil coefficients can vary per PE, but must be the same for the local vector. The user can change the coefficients based on the boundary condition or curvilinear coordinate transformation.

The script run.py has the following parameters:

  • -k=<int> specifies the maximum size of local vector.

  • --zDim=<int> specifies how many elements per PE are computed.

  • --channels=<int> specifies the number of I/O channels, no bigger than 16.

The tic() samples “time_start” and toc() samples “time_end”. The sync() samples “time_ref” which is used to adjust “time_start” and “time_end”. The elapsed time (unit: cycles) is measured by cycles_send = max(time_end) - min(time_start)

The overall runtime (us) is computed via the following formula time_send = (cycles_send / 0.85) * 1.e-3 us

The bandwidth is calculated by bandwidth = ((6*w*h*4)/time_send)