SDSC Expanse

SDSC Expanse cluster live AI/ML metrics

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects; see Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster:

Total Traffic: Total traffic entering fabric
Cluster Services: Traffic associated with orchestration (Slurm) and storage (Lustre, Ceph and NFS)
Core Link Traffic: Histogram of load on fabric links
Edge Link Traffic: Histogram of load on access ports
RDMA Operations: Total RDMA operations
RDMA Avg. Bytes per Operation: Average RDMA operation size
Infiniband Operations: Total RoCEv2 Infiniband operations broken out by type
Compute / Exchange Interval: Detected period of compute / exchange activity on fabric
Congestion Notification Messages: Total ECN / CNP congestion messages
Infiniband Ack. Credits: Average number of credits in RoCEv2 Infiniband acknowledgements
Packet Discards: Total ingress / egress discards
Packet Errors: Total ingress / egress errors
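
To make two of the derived metrics above concrete, the following minimal Python sketch shows one way a link load histogram and an average bytes per operation figure could be computed from raw counters. It is an illustration only and not the dashboard's actual implementation: the function names, the 10% bucket width, and the sample values are all assumptions.

```python
from collections import Counter


def load_histogram(utilizations, bucket_width=10):
    """Count links per utilization bucket (percent of link speed).

    Keys are the lower edge of each bucket, e.g. 50 means 50% to <60%.
    """
    counts = Counter()
    top_bucket = (100 // bucket_width) - 1
    for u in utilizations:
        # Clamp to [0, 100] so out-of-range samples land in the end buckets
        u = min(max(u, 0.0), 100.0)
        bucket = min(int(u // bucket_width), top_bucket)
        counts[bucket * bucket_width] += 1
    return dict(sorted(counts.items()))


def avg_bytes_per_operation(total_bytes, total_operations):
    """Average operation size; returns 0.0 when no operations were seen."""
    if total_operations == 0:
        return 0.0
    return total_bytes / total_operations


# Hypothetical sample values, not data from Expanse
link_utilization = [2.5, 7.0, 12.0, 55.0, 58.0, 100.0]
print(load_histogram(link_utilization))
print(avg_bytes_per_operation(4_194_304, 1024))
```

A histogram summarizes how load is spread across links without one chart line per link, which is why it suits a fabric with many core and edge ports.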

Launch the dashboard and explore the data:

Time Interval: Change the time interval, using the widget at the top right of the dashboard, to see trends over the last 30 days or per-second detail in the last 5 minutes
Zoom In: If you see any interesting peaks in one of the charts, drag to select a time interval and zoom in to see the details
Select Metrics: Click on items in chart legends to display selected metrics

Expanse offers an interesting variety of network traffic patterns as each scheduled task makes use of a different set of cluster resources.

How-to guide

All switches in the Expanse cluster leaf and spine fabric stream industry-standard sFlow telemetry to an instance of the sFlow-RT real-time analytics engine. A Prometheus time series database stores metrics every second and a Grafana dashboard displays cluster metrics.
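
Because the metrics are stored in Prometheus, they can also be read outside Grafana through the standard Prometheus HTTP API. The sketch below builds a `query_range` request for the last five minutes at one-second resolution and parses the response. The server address and the metric name `example_total_traffic_bps` are placeholders chosen for illustration; they are not names used by the Expanse deployment.

```python
import json
import time
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a locally reachable Prometheus
METRIC = "example_total_traffic_bps"      # hypothetical metric name


def build_range_query(base_url, query, start, end, step):
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Convert a query_range response into {labels: [(timestamp, value), ...]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for result in payload["data"]["result"]:
        # Labels become a sorted tuple so they can be used as a dict key
        key = tuple(sorted(result["metric"].items()))
        # Prometheus returns sample values as strings
        series[key] = [(float(ts), float(v)) for ts, v in result["values"]]
    return series


def fetch_last_minutes(minutes=5, step="1s"):
    """Fetch per-second samples for METRIC over the last few minutes."""
    end = time.time()
    start = end - minutes * 60
    url = build_range_query(PROMETHEUS_URL, METRIC, start, end, step)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_matrix(json.load(resp))
```

Calling `fetch_last_minutes()` returns 300 samples per series at a one-second step. For longer windows such as the last 30 days, a much larger step is needed, since Prometheus rejects range queries that would return too many points per series.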

Follow the instructions in AI Metrics with Prometheus and Grafana.