SDSC Expanse cluster live AI/ML metrics

The Expanse cluster at the San Diego Supercomputer Center is a batch-oriented science computing gateway serving thousands of users and a wide range of research projects. See Google News for examples.

The SDSC Expanse cluster live AI/ML metrics dashboard displays real-time metrics for workloads running on the cluster.

Launch the dashboard and explore the data:

Expanse produces a wide variety of network traffic patterns because each scheduled task uses a different set of cluster resources.

How-to guide

All switches in the Expanse cluster leaf and spine fabric stream industry-standard sFlow telemetry to an instance of the sFlow-RT real-time analytics engine. A Prometheus time series database stores the metrics every second, and a Grafana dashboard displays them.
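
To make the last stage of that pipeline concrete, the following is a minimal sketch of how per-second data could be read back from Prometheus using its standard HTTP API (/api/v1/query_range). It is not taken from the Expanse deployment: the server address, the metric name example_metric, the switch label, and the sample response are all hypothetical placeholders, and the one-second step simply mirrors the storage interval described above.

```python
import json
import urllib.parse
import urllib.request


def range_query_url(base_url, expr, start, end, step_seconds=1):
    """Build a Prometheus HTTP API range-query URL.

    step_seconds defaults to 1 to match per-second storage.
    """
    params = urllib.parse.urlencode(
        {"query": expr, "start": start, "end": end, "step": step_seconds}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Convert a Prometheus 'matrix' response into
    {sorted label tuple: [(timestamp, value), ...]}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        key = tuple(sorted(item["metric"].items()))
        # Prometheus returns sample values as strings; convert to float.
        series[key] = [(float(ts), float(val)) for ts, val in item["values"]]
    return series


def fetch_series(base_url, expr, start, end, step_seconds=1, timeout=10):
    """Run the range query against a live Prometheus server."""
    url = range_query_url(base_url, expr, start, end, step_seconds)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_matrix(json.load(resp))


# Made-up response in the shape Prometheus returns for a range query.
# "example_metric" and the "switch" label are hypothetical names.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "example_metric", "switch": "leaf1"},
                "values": [[0, "1.5"], [1, "2.5"]],
            }
        ],
    },
}

if __name__ == "__main__":
    print(range_query_url("http://localhost:9090", "example_metric", 0, 10))
    print(parse_matrix(SAMPLE_RESPONSE))
```

Against a real server, fetch_series would be called with the Prometheus base URL and one of the metric names the dashboard actually uses. Grafana issues the same kind of range query when it draws a panel.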

Follow the instructions in AI Metrics with Prometheus and Grafana.