.. _appliance-mode:

Running SDK on a Wafer-Scale Cluster
====================================

In addition to the containerized Singularity build of the Cerebras SDK (see :ref:`install-guide` for installation information), the SDK is also supported on Cerebras Wafer-Scale Clusters running in Appliance Mode. This page documents the modifications needed to your code to run on a Wafer-Scale Cluster.

For more information about setting up and using a Wafer-Scale Cluster, see `the documentation here `_. In particular, see `here `_ for setting up your Python virtual environment and installing the Appliance Python wheel.

Summary
-------

The `Cerebras Wafer-Scale Cluster `_ is our solution to training massive neural networks with near-linear scaling. The Wafer-Scale Cluster consists of one or more CS-2 systems, together with special CPU nodes, memory servers, and interconnects, presented to the end user as a single system, or Appliance. The Appliance is responsible for job scheduling and allocation of the systems.

There are two types of SDK jobs that can run on the Appliance: compile jobs, which compile code on a worker node, and run jobs, which either run the compiled code on a worker node using the simulator, or run it on a real CS-2 within the Appliance.

We will walk through the changes necessary to compile and run your code on a Wafer-Scale Cluster. Modified code examples for supporting a Wafer-Scale Cluster can be requested from developer@cerebras.net.

Note that there are currently some limitations for running SDK jobs on a Wafer-Scale Cluster. Unlike ML jobs, SDK jobs can only use a single worker node and a single CS-2. The Wafer-Scale Cluster supports only the ``SdkRuntime`` host runtime; it does not support host code using ``CSELFRunner``.

Compiling
---------

As an example, we'll walk through porting :ref:`sdkruntime-gemv-01-complete-program`. In the containerized SDK setup, this code is compiled with the following command:

.. code-block:: bash

   cslc ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out

To compile for the Wafer-Scale Cluster, we instead use a Python script which launches a compile job:

.. code-block:: python

   import json
   from cerebras_appliance.sdk import SdkCompiler

   # Instantiate compiler
   compiler = SdkCompiler()

   # Launch compile job
   artifact_id = compiler.compile(
       ".",
       "layout.csl",
       "--fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
   )

   # Write the artifact_id to a JSON file
   with open("artifact_id.json", "w", encoding="utf8") as f:
       json.dump({"artifact_id": artifact_id,}, f)

The ``SdkCompiler::compile`` function takes three arguments:

- the directory containing the CSL code files,
- the name of the top level CSL code file that contains the layout block,
- and the compiler arguments.

The function returns an artifact ID, which is used at run time to locate the compile artifacts on the Appliance. We write this artifact ID to a JSON file which will be read by the runner object in the Python host code.

Just as before, simply pass the full dimensions of the target system to the ``--fabric-dims`` argument to compile for a real hardware run.
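As a minimal sketch, a hardware compile job differs from the one above only in the ``--fabric-dims`` value. The dimensions used below (``757,996``) are an assumption about the target CS-2 made for illustration; substitute the dimensions of your own system.

.. code-block:: python

   # Same compile job as above, but targeting the full fabric of a real system.
   # NOTE: the dimensions 757,996 are assumed for illustration only; use the
   # dimensions reported for your target system.
   artifact_id = compiler.compile(
       ".",
       "layout.csl",
       "--fabric-dims=757,996 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
   )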
Running
-------

In the containerized SDK setup, our Python host code for running is as follows:

.. code-block:: python

   import argparse
   import numpy as np

   from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

   # Read arguments
   parser = argparse.ArgumentParser()
   parser.add_argument('--name', help="the test compile output dir")
   parser.add_argument('--cmaddr', help="IP:port for CS system")
   args = parser.parse_args()

   # Matrix dimensions
   M = 4
   N = 6

   # Construct A, x, b
   A = np.arange(M*N, dtype=np.float32).reshape(M, N)
   x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
   b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

   # Calculate expected y
   y_expected = A@x + b

   # Construct a runner using SdkRuntime
   runner = SdkRuntime(args.name, cmaddr=args.cmaddr)

   # Load and run the program
   runner.load()
   runner.run()

   # Launch the init_and_compute function on device
   runner.launch('init_and_compute', nonblock=False)

   # Copy y back from device
   y_symbol = runner.get_id('y')
   y_result = np.zeros([1*1*M], dtype=np.float32)
   runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
       order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

   # Stop the program
   runner.stop()

   # Ensure that the result matches our expectation
   np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
   print("SUCCESS!")

For Appliance mode, we need a few modifications to the host code:

.. code-block:: python

   import json
   import os
   import numpy as np

   from cerebras_appliance.pb.sdk.sdk_common_pb2 import MemcpyDataType, MemcpyOrder
   from cerebras_appliance.sdk import SdkRuntime

   # Matrix dimensions
   M = 4
   N = 6

   # Construct A, x, b
   A = np.arange(M*N, dtype=np.float32).reshape(M, N)
   x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
   b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

   # Calculate expected y
   y_expected = A@x + b

   # Read the artifact_id from the JSON file
   with open("artifact_id.json", "r", encoding="utf8") as f:
       data = json.load(f)
       artifact_id = data["artifact_id"]

   # Instantiate a runner object using a context manager
   with SdkRuntime(artifact_id, simulator=True) as runner:
       # Launch the init_and_compute function on device
       runner.launch('init_and_compute', nonblock=False)

       # Copy y back from device
       y_symbol = runner.get_id('y')
       y_result = np.zeros([1*1*M], dtype=np.float32)
       runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
           order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

   # Ensure that the result matches our expectation
   np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
   print("SUCCESS!")

In particular, note that:

- The imports have changed to reflect Appliance modules.
- We no longer need to specify a compile output directory. Instead, we read our artifact ID from the JSON file generated when compiling.
- We no longer need to specify a CM address when running on real hardware. Instead, we simply pass a flag to the ``SdkRuntime`` constructor specifying whether to run in the simulator or on hardware.
- ``load()`` and ``run()`` are replaced by ``start()``.
- We can use a `context manager `_ for the runner object. If we do so, the ``start()`` and ``stop()`` functions are implicit, and we do not need to explicitly call them.
- Without a context manager, you must call ``start()`` and ``stop()`` explicitly, as shown in the sketch below.
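As a minimal sketch of the non-context-manager style, assuming the same imports, ``artifact_id``, and ``M`` defined in the Appliance host code above, the run portion of the program could be written as follows:

.. code-block:: python

   # Sketch without a context manager: start() and stop() are called explicitly.
   runner = SdkRuntime(artifact_id, simulator=True)
   runner.start()

   # Launch the init_and_compute function on device
   runner.launch('init_and_compute', nonblock=False)

   # Copy y back from device
   y_symbol = runner.get_id('y')
   y_result = np.zeros([1*1*M], dtype=np.float32)
   runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
       order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

   # Stop the program
   runner.stop()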