.. _appliance-mode:

Running SDK on a Wafer-Scale Cluster
====================================

In addition to the containerized Singularity build of the Cerebras SDK (see :ref:`install-guide` for installation information), the SDK is also supported on Cerebras Wafer-Scale Clusters running in Appliance Mode. This page documents the modifications needed to your code to run on a Wafer-Scale Cluster.

For more information about setting up and using a Wafer-Scale Cluster, see `the documentation here `_. In particular, see `here `_ for setting up your Python virtual environment and installing the Appliance Python wheel.

Summary
-------

The `Cerebras Wafer-Scale Cluster `_ is our solution to training massive neural networks with near-linear scaling. The Wafer-Scale Cluster consists of one or more CS-2 systems, together with special CPU nodes, memory servers, and interconnects, presented to the end user as a single system, or Appliance. The Appliance is responsible for job scheduling and allocation of the systems.

There are two types of SDK jobs that can run on the Appliance: compile jobs, which compile code on a worker node, and run jobs, which either run the compiled code on a worker node using the simulator, or run it on a real CS-2 within the Appliance.

We will walk through the changes necessary to compile and run your code on a Wafer-Scale Cluster. Modified code examples for supporting a Wafer-Scale Cluster can be requested from developer@cerebras.net.

Note that there are currently some limitations for running SDK jobs on a Wafer-Scale Cluster. Unlike ML jobs, SDK jobs can only use a single worker node and a single CS-2. The Wafer-Scale Cluster supports only the ``SdkRuntime`` host runtime; it does not support host code using ``CSELFRunner``.

Compiling
---------

As an example, we'll walk through porting :ref:`sdkruntime-gemv-01-complete-program`. In the containerized SDK setup, this code is compiled with the following command:

.. code-block:: bash

   cslc ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out

To compile for the Wafer-Scale Cluster, we instead use a Python script which launches a compile job:

.. code-block:: python

   import json
   from cerebras_appliance.sdk import SdkCompiler

   # Instantiate compiler
   compiler = SdkCompiler()

   # Launch compile job
   artifact_id = compiler.compile(
       ".",
       "layout.csl",
       "--fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
   )

   # Write the artifact_id to a JSON file
   with open("artifact_id.json", "w", encoding="utf8") as f:
       json.dump({"artifact_id": artifact_id,}, f)

The ``SdkCompiler::compile`` function takes three arguments:

- the directory containing the CSL code files,
- the name of the top level CSL code file that contains the layout block,
- and the compiler arguments.

The function returns an artifact ID, which is used at run time to locate the compile artifacts on the Appliance. We write this artifact ID to a JSON file which will be read by the runner object in the Python host code.

Just as before, simply pass the full dimensions of the target system to the ``--fabric-dims`` argument to compile for a real hardware run.
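As a minimal sketch, a hardware compile job differs from the one above only in the ``--fabric-dims`` value. The dimensions used below (``757,996``) are an assumption about the target CS-2 made for illustration; substitute the dimensions of your own system.

.. code-block:: python

   # Same compile job as above, but targeting the full fabric of a real system.
   # NOTE: the dimensions 757,996 are assumed for illustration only; use the
   # dimensions reported for your target system.
   artifact_id = compiler.compile(
       ".",
       "layout.csl",
       "--fabric-dims=757,996 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
   )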
Running
-------

In the containerized SDK setup, our Python host code for running is as follows:

.. code-block:: python

   import argparse
   import numpy as np

   from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

   # Read arguments
   parser = argparse.ArgumentParser()
   parser.add_argument('--name', help="the test compile output dir")
   parser.add_argument('--cmaddr', help="IP:port for CS system")
   args = parser.parse_args()

   # Matrix dimensions
   M = 4
   N = 6

   # Construct A, x, b
   A = np.arange(M*N, dtype=np.float32).reshape(M, N)
   x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
   b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

   # Calculate expected y
   y_expected = A@x + b

   # Construct a runner using SdkRuntime
   runner = SdkRuntime(args.name, cmaddr=args.cmaddr)

   # Load and run the program
   runner.load()
   runner.run()

   # Launch the init_and_compute function on device
   runner.launch('init_and_compute', nonblock=False)

   # Copy y back from device
   y_symbol = runner.get_id('y')
   y_result = np.zeros([1*1*M], dtype=np.float32)
   runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
       order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

   # Stop the program
   runner.stop()

   # Ensure that the result matches our expectation
   np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
   print("SUCCESS!")

For Appliance mode, we need a few modifications to the host code:

.. code-block:: python

   import json
   import os
   import numpy as np

   from cerebras_appliance.pb.sdk.sdk_common_pb2 import MemcpyDataType, MemcpyOrder
   from cerebras_appliance.sdk import SdkRuntime

   # Matrix dimensions
   M = 4
   N = 6

   # Construct A, x, b
   A = np.arange(M*N, dtype=np.float32).reshape(M, N)
   x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
   b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

   # Calculate expected y
   y_expected = A@x + b

   # Read the artifact_id from the JSON file
   with open("artifact_id.json", "r", encoding="utf8") as f:
       data = json.load(f)
       artifact_id = data["artifact_id"]

   # Instantiate a runner object using a context manager
   with SdkRuntime(artifact_id, simulator=True) as runner:
       # Launch the init_and_compute function on device
       runner.launch('init_and_compute', nonblock=False)

       # Copy y back from device
       y_symbol = runner.get_id('y')
       y_result = np.zeros([1*1*M], dtype=np.float32)
       runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
           order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

   # Ensure that the result matches our expectation
   np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
   print("SUCCESS!")

In particular, note that:

- The imports have changed to reflect Appliance modules.
- We no longer need to specify a compile output directory. Instead, we read our artifact ID from the JSON file generated when compiling.
- We no longer need to specify a CM address when running on real hardware. Instead, we simply pass a flag to the ``SdkRuntime`` constructor specifying whether to run in the simulator or on hardware.
- ``load()`` and ``run()`` are replaced by ``start()``.
- We can use a `context manager `_ for the runner object. If we do so, the ``start()`` and ``stop()`` functions are implicit, and we do not need to explicitly call them.
- Without a context manager, you must call ``start()`` and ``stop()`` explicitly, as shown in the sketch below.
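As a minimal sketch of the non-context-manager style, assuming the same imports, ``artifact_id``, and ``M`` defined in the Appliance host code above, the run portion of the program could be written as follows:

.. code-block:: python

   # Sketch without a context manager: start() and stop() are called explicitly.
   runner = SdkRuntime(artifact_id, simulator=True)
   runner.start()

   # Launch the init_and_compute function on device
   runner.launch('init_and_compute', nonblock=False)

   # Copy y back from device
   y_symbol = runner.get_id('y')
   y_result = np.zeros([1*1*M], dtype=np.float32)
   runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
       order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

   # Stop the program
   runner.stop()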