Running SDK on a Wafer-Scale Cluster

In addition to the containerized Singularity build of the Cerebras SDK (see Installation and Setup for installation information), the SDK is also supported on Cerebras Wafer-Scale Clusters running in Appliance Mode.

This page documents the modifications needed to run your code on a Wafer-Scale Cluster. For more information about setting up and using a Wafer-Scale Cluster, see the documentation here. In particular, see here for instructions on setting up your Python virtual environment and installing the Appliance Python wheel.

Summary

The Cerebras Wafer-Scale Cluster is our solution for training massive neural networks with near-linear scaling. The Wafer-Scale Cluster consists of one or more CS-2 systems, together with special CPU nodes, memory servers, and interconnects, presented to the end user as a single system, or Appliance. The Appliance is responsible for job scheduling and allocation of the systems.

There are two types of SDK jobs that can run on the Appliance: compile jobs, which compile code on a worker node, and run jobs, which either run the compiled code in the simulator on a worker node, or run it on a real CS-2 within the Appliance.

We will walk through the changes necessary to compile and run your code on a Wafer-Scale Cluster. Modified versions of the code examples that support the Wafer-Scale Cluster can be requested from developer@cerebras.net.

Note that there are currently some limitations when running SDK jobs on a Wafer-Scale Cluster. Unlike ML jobs, SDK jobs can use only a single worker node and a single CS-2. In addition, the Wafer-Scale Cluster supports only the SdkRuntime host runtime; host code using CSELFRunner is not supported.

Compiling

As an example, we’ll walk through porting GEMV 1: A Complete Program. In the containerized SDK setup, this code is compiled with the following command:

cslc ./layout.csl --fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out

To compile for the Wafer-Scale Cluster, we instead use a Python script that launches a compile job:

import json
from cerebras_appliance.sdk import SdkCompiler

# Instantiate compiler
compiler = SdkCompiler()

# Launch compile job
artifact_id = compiler.compile(
    ".",
    "layout.csl",
    "--fabric-dims=8,3 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
)

# Write the artifact_id to a JSON file
with open("artifact_id.json", "w", encoding="utf8") as f:
    json.dump({"artifact_id": artifact_id}, f)

The SdkCompiler::compile function takes three arguments:

  • the directory containing the CSL code files,

  • the name of the top level CSL code file that contains the layout block,

  • and the compiler arguments.

The function returns an artifact ID, which is used at run time to locate the compile artifacts on the Appliance. We write this artifact ID to a JSON file, which will be read by the runner object in the Python host code.

Just as before, simply pass the full dimensions of the target system to the --fabric-dims argument to compile for a real hardware run.
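For example, a compile job targeting real hardware might look like the following sketch. The fabric dimensions shown, 757,996 for a full CS-2 fabric, are illustrative; substitute the dimensions reported for your target system:

from cerebras_appliance.sdk import SdkCompiler

compiler = SdkCompiler()

# Compile for a real system: pass the full fabric dimensions of the
# target (757,996 here is illustrative; use your system's dimensions)
artifact_id = compiler.compile(
    ".",
    "layout.csl",
    "--fabric-dims=757,996 --fabric-offsets=4,1 --memcpy --channels=1 -o out",
)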

Running

In the containerized SDK setup, our Python host code for running is as follows:

import argparse
import numpy as np

from cerebras.sdk.runtime.sdkruntimepybind import SdkRuntime, MemcpyDataType, MemcpyOrder

# Read arguments
parser = argparse.ArgumentParser()
parser.add_argument('--name', help="the test compile output dir")
parser.add_argument('--cmaddr', help="IP:port for CS system")
args = parser.parse_args()

# Matrix dimensions
M = 4
N = 6

# Construct A, x, b
A = np.arange(M*N, dtype=np.float32).reshape(M, N)
x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

# Calculate expected y
y_expected = A@x + b

# Construct a runner using SdkRuntime
runner = SdkRuntime(args.name, cmaddr=args.cmaddr)

# Load and run the program
runner.load()
runner.run()

# Launch the init_and_compute function on device
runner.launch('init_and_compute', nonblock=False)

# Copy y back from device
y_symbol = runner.get_id('y')
y_result = np.zeros([1*1*M], dtype=np.float32)
runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# Stop the program
runner.stop()

# Ensure that the result matches our expectation
np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
print("SUCCESS!")

For Appliance mode, we need a few modifications to the host code:

import json
import os

import numpy as np

from cerebras_appliance.pb.sdk.sdk_common_pb2 import MemcpyDataType, MemcpyOrder
from cerebras_appliance.sdk import SdkRuntime

# Matrix dimensions
M = 4
N = 6

# Construct A, x, b
A = np.arange(M*N, dtype=np.float32).reshape(M, N)
x = np.full(shape=N, fill_value=1.0, dtype=np.float32)
b = np.full(shape=M, fill_value=2.0, dtype=np.float32)

# Calculate expected y
y_expected = A@x + b

# Read the artifact_id from the JSON file
with open("artifact_id.json", "r", encoding="utf8") as f:
    data = json.load(f)
    artifact_id = data["artifact_id"]

# Instantiate a runner object using a context manager
with SdkRuntime(artifact_id, simulator=True) as runner:
    # Launch the init_and_compute function on device
    runner.launch('init_and_compute', nonblock=False)

    # Copy y back from device
    y_symbol = runner.get_id('y')
    y_result = np.zeros([1*1*M], dtype=np.float32)
    runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                      order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# Ensure that the result matches our expectation
np.testing.assert_allclose(y_result, y_expected, atol=0.01, rtol=0)
print("SUCCESS!")

In particular, note that:

  • The imports have changed to reflect Appliance modules.

  • We no longer need to specify a compile output directory. Instead, we read our artifact ID from the JSON file generated when compiling.

  • We no longer need to specify a CM address when running on real hardware. Instead, we simply pass a flag to the SdkRuntime constructor specifying whether to run in the simulator or on hardware.

  • load() and run() are replaced by start().

  • We can use a context manager for the runner object. If we do so, start() and stop() are called implicitly, and we do not need to call them ourselves. Without a context manager, you must call start() and stop() explicitly, as in the sketch below.
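
As a minimal sketch (reusing artifact_id, M, y_expected, and the imports from the example above), a hardware run without a context manager might look like this; note that simulator=False requests a real CS-2 in the Appliance rather than the simulator:

# Instantiate a runner object without a context manager
runner = SdkRuntime(artifact_id, simulator=False)

# start() must be called explicitly when no context manager is used
runner.start()

# Launch the init_and_compute function on device
runner.launch('init_and_compute', nonblock=False)

# Copy y back from device
y_symbol = runner.get_id('y')
y_result = np.zeros([1*1*M], dtype=np.float32)
runner.memcpy_d2h(y_result, y_symbol, 0, 0, 1, 1, M, streaming=False,
                  order=MemcpyOrder.ROW_MAJOR, data_type=MemcpyDataType.MEMCPY_32BIT, nonblock=False)

# stop() must also be called explicitly
runner.stop()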