System Overview
The HPC Fund Research Cloud consists of 40 high performance computing (HPC) servers attached to a unifying high-speed InfiniBand fabric supporting high-bandwidth, low-latency message passing for distributed-memory applications. Each server combines dual-socket CPUs with multiple AMD Instinct MI series accelerators. The supporting operating system is Rocky Linux. Additional details regarding the hardware configuration are summarized below.
Compute servers
Each compute server consists of two AMD EPYC™ processors with access to 512 GB (or more) of main memory. High-speed inter-node communication is provided by a ConnectX-6 MT28908 InfiniBand host channel adapter with a maximum port speed of 200 Gb/s. For accelerated analysis, each node also includes one or more AMD Instinct™ accelerators. Multiple generations of accelerators are available within the system; their key characteristics are highlighted in the table below:
| Accelerator | Peak FP64 | HBM Capacity | HBM Peak B/W | Host CPU | Host Memory |
|---|---|---|---|---|---|
| MI100 | 11.5 TFLOPs | 32 GB | 1.2 TB/s | 2 x EPYC 7V13 64-core | 512 GB |
| MI210 | 45.3 TFLOPs | 64 GB | 1.6 TB/s | 2 x EPYC 7V13 64-core | 512 GB |
| MI250 | 45.3 TFLOPs (per GCD) | 64 GB (per GCD) | 1.6 TB/s (per GCD) | 2 x EPYC 7763 64-core | 1.5 TB |
Note that one AMD MI250 accelerator provides two Graphics Compute Dies (GCDs), which the programmer can use as two separate GPUs.
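As a quick illustration of this point, the following minimal sketch (not part of the system software, and assuming the ROCm/HIP toolchain is available so it can be compiled with hipcc) uses the HIP runtime to enumerate the GPUs visible to a process. On an MI250 node, each GCD shows up as its own device, so a single physical accelerator reports two entries:

```cpp
// List the GPUs visible to this process via the HIP runtime.
// On an MI250 node, each GCD appears as a separate device.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::fprintf(stderr, "No HIP-capable devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        // prop.name is the device marketing name; prop.totalGlobalMem is in bytes
        std::printf("GPU %d: %s, %.1f GB HBM\n", i, prop.name,
                    prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```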
File systems
Multiple shared file systems are available across the cluster. These are provisioned by separate dedicated servers that aggregate a number of NVMe storage devices and run the WekaFS software stack in order to provide a POSIX-compliant file system.
Each user has access to two primary storage locations: $HOME and $WORK. These are unique directories per user with quotas applied to prevent over-allocation of the file system. Upon login, the $HOME and $WORK environment variables are set automatically to the correct paths, and users are encouraged to leverage these variables within their job scripts for convenience.
Tip

In addition to the standard cd command, a cdw convenience alias is provided in the default Bash environment that takes a user directly to their work directory.
Additional temporary storage is available during the lifetime of a job (see Running Jobs for details on interacting with the resource management system). This storage is not shared across the cluster, and any user contents will be removed at the conclusion of a user job.
A summary of the available file systems and their characteristics is provided in the table below:
| File System | Quota/Size | Type | Backups | Features |
|---|---|---|---|---|
| $HOME | 25 GB | WekaFS | Yes (daily) | Permanent storage accessible across the cluster. Intended for housing source code, job scripts, and smaller application inputs/outputs. Nightly snapshots are generated and retained for a maximum of 10 days. |
| $WORK | 2 TB (shared) | WekaFS | No | Permanent storage accessible across the cluster. Intended to provide storage for larger datasets and shared resources across a project team. Note that the storage quota is shared amongst all members of a particular project. |
| Node-local temporary | 100 GB | XFS | No | Temporary storage that is unique to each assigned host during a user job. |