.. _benchmark-stencil-3d-7pts:

stencil-3d-7pts
===============

This example evaluates the performance of a 7-point stencil. The kernel records
the ``start`` and ``end`` of ``spmv`` using the tsc counter. The tsc counters of
the PEs are not synchronized at the beginning, so to avoid timing variation
among the PEs, ``sync()`` synchronizes all PEs and samples the reference clock.

The kernel ``kernel.csl`` defines three host-callable functions, ``f_sync()``,
``f_tic()`` and ``f_toc()``, which synchronize the PEs and record the timing of
``spmv``.

The kernel ``allreduce/pe.csl`` performs a reduction over the whole rectangle
to synchronize the PEs. The bottom-right PE then sends a signal to the other
PEs to sample the reference clock.

The kernel ``stencil_3d_7pts/pe.csl`` performs a sparse matrix-vector product
(spmv) in which the matrix has 7 diagonals corresponding to the 7-point
stencil. The stencil coefficients can vary per PE, but must be the same for
the local vector. The user can change the coefficients based on the boundary
condition or a curvilinear coordinate transformation.

The script ``run.py`` has the following parameters:

- ``-k=`` specifies the maximum size of the local vector.

- ``--zDim=`` specifies how many elements per PE are computed.

- ``--channels=`` specifies the number of I/O channels, no bigger than 16.

``tic()`` samples "time_start" and ``toc()`` samples "time_end". ``sync()``
samples "time_ref", which is used to adjust "time_start" and "time_end".

The elapsed time (unit: cycles) is measured by

``cycles_send = max(time_end) - min(time_start)``

The overall runtime (unit: us) is computed via the following formula

``time_send = (cycles_send / 0.85) * 1.e-3 us``

The bandwidth is calculated by

``bandwidth = ((6*w*h*4)/time_send)``
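The timing formulas above can be sketched in plain Python as follows. This is
a minimal illustration, not the code in ``run.py``. The helper names
(``elapsed_cycles``, ``cycles_to_us``, ``bandwidth``) and the sample values are
hypothetical, and the adjustment step assumes that each PE's "time_ref" is
simply subtracted from its own "time_start" and "time_end"; the actual
adjustment may include further corrections. The default of ``0.85`` is read as
a clock frequency in GHz, which is an interpretation of the runtime formula.

```python
def elapsed_cycles(time_start, time_end, time_ref):
    """Return max(time_end) - min(time_start) after aligning the PE clocks.

    Assumption: alignment subtracts each PE's reference sample from its
    own start and end samples.
    """
    starts = [s - r for s, r in zip(time_start, time_ref)]
    ends = [e - r for e, r in zip(time_end, time_ref)]
    return max(ends) - min(starts)


def cycles_to_us(cycles, freq_ghz=0.85):
    """time_send = (cycles_send / 0.85) * 1.e-3, in microseconds."""
    return (cycles / freq_ghz) * 1.0e-3


def bandwidth(w, h, time_send_us):
    """bandwidth = (6*w*h*4) / time_send, as given in the text."""
    return (6 * w * h * 4) / time_send_us


# Hypothetical per-PE samples for a 2-by-2 rectangle (one entry per PE).
time_ref = [1000, 1500, 900, 1200]
time_start = [1100, 1610, 1005, 1320]
time_end = [1950, 2450, 1845, 2140]

cycles_send = elapsed_cycles(time_start, time_end, time_ref)
time_send = cycles_to_us(cycles_send)
bw = bandwidth(2, 2, time_send)
print(f"cycles_send = {cycles_send}")
print(f"time_send   = {time_send:.3f} us")
print(f"bandwidth   = {bw:.1f}")
```

Without the subtraction of "time_ref", the unsynchronized counters would make
``max(time_end) - min(time_start)`` depend on the arbitrary offsets between
PEs rather than on the duration of ``spmv``.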