Quick-Start Guide
Welcome to the AMD University Program (AUP) AI & HPC Cluster! This guide will help you get up and running quickly. For comprehensive documentation, please explore the full site.
1. Cluster Hardware
Compute Nodes
Multi-GPU Nodes
| Nodes | CPUs | GPUs | DRAM |
|---|---|---|---|
| 4 | [2x] 128-core EPYC 9755 | [8x] MI350X 288 GB | 3072 GB DDR5 |
| 1 | [2x] 128-core EPYC 9755 | [8x] MI325X 256 GB | 3072 GB DDR5 |
| 2 | [2x] 96-core EPYC 9684X | [8x] MI300X 192 GB | 2304 GB DDR5 |
| 10 | [2x] 64-core EPYC 7763 | [4x] MI250 128 GB | 1536 GB DDR4 |
| 21 | [2x] 64-core EPYC 7V13 | [4x] MI210 64 GB | 512 GB DDR4 |
Single-GPU Nodes (Virtual)
| Nodes | CPUs | GPU | DRAM |
|---|---|---|---|
| 8 | 16 cores of EPYC 9755 | [1x] MI350X 288 GB | 334 GB DDR5 |
| 8 | 16 cores of EPYC 9684X | [1x] MI300X 192 GB | 238 GB DDR5 |
| 28 | 16 cores of EPYC 7V13 | [1x] MI210 64 GB | 64 GB DDR4 |
Note
MI350X will be deployed starting in Q2 2026. The mi3508x and mi3501x partitions will have charge factors of 1.4 and 0.175, respectively.
Login Node
| Nodes | CPUs | GPUs | DRAM |
|---|---|---|---|
| 2 | [2x] 64-core EPYC 7V13 | [2x] MI210 64 GB | 512 GB DDR4 |
Warning
The login node is shared by all users. Do not run compute-intensive workloads on it. Use Slurm to submit jobs to compute nodes instead.
2. Logging In
Connect via SSH, replacing `<username>` with your assigned username:

```shell
ssh <username>@hpcfund.amd.com
```
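If you connect often, you can add a host alias to your local `~/.ssh/config` so that a short name expands to the full address. The alias `hpcfund` below is just an example name, not something the cluster requires:

```
Host hpcfund
    HostName hpcfund.amd.com
    User <username>
```

With this entry in place, `ssh hpcfund` is equivalent to the full command above.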
3. Storage Areas
You have two storage areas available:
| Variable | Path | Description | Capacity |
|---|---|---|---|
| | | Your personal home directory | 25 GB |
| | | Your directory within your project's workspace | 2 TB default (shared across project members) |
4. Software & Programming Environment
We use Lmod to manage software modules. Key commands:
```shell
module avail          # List all available packages
module list           # Show currently loaded packages
module load <pkg>     # Load a package (e.g., module load hdf5)
module unload <pkg>   # Unload a package
```
Setting Up PyTorch (Recommended for AI Workloads)
While a ROCm-enabled PyTorch module is available via Lmod, we recommend installing your own for greater flexibility. Here's how, using a Python virtual environment:

```shell
python3 -m venv <name-of-venv>
source <name-of-venv>/bin/activate
pip3 install --upgrade pip
pip3 install torch torchvision \
    --index-url https://download.pytorch.org/whl/rocm7.2
# Install any additional packages you need
# pip3 install transformers datasets accelerate ...
```
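After activating a virtual environment, it is worth confirming that the venv's interpreter is the one on your `PATH` before installing packages. A quick sanity check (the venv name `demo-venv` below is purely illustrative):

```shell
# Create and activate an example venv, then verify which python3 is active.
python3 -m venv demo-venv
source demo-venv/bin/activate
command -v python3    # should print a path ending in demo-venv/bin/python3
```

If the printed path does not point inside the venv, the activation step did not take effect in your current shell.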
Tip
Installing your own PyTorch environment gives you full control over versions and additional dependencies without waiting for system-wide module updates.
5. Running Jobs
We use Slurm for job scheduling. Two primary modes:
Batch Jobs (preferred for most workloads): Batch Job Submission Guide
Interactive Jobs (for quick testing & debugging): Interactive Usage Guide
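As a sketch, a minimal batch script might look like the following. The partition name, module name, time limit, and workload script are all placeholders; consult the Batch Job Submission Guide and `sinfo` for the partitions actually available to your project:

```shell
# Write a minimal example Slurm batch script (all names are placeholders).
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example         # job name shown in squeue
#SBATCH --partition=<partition>    # pick a GPU partition from `sinfo`
#SBATCH --nodes=1
#SBATCH --gpus=1                   # request one GPU
#SBATCH --time=00:30:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=job-%j.out        # %j expands to the job ID

module load rocm                   # assumed module name; check `module avail`
srun python3 train.py              # your workload
EOF
```

Submit with `sbatch my_job.sh` and monitor with `squeue -u $USER`.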
6. Jupyter
We provide a helper script to launch JupyterLab sessions that tunnel to your local browser.
Tip
You can replace `jupyter notebook` with `jupyter lab` in the script if you prefer the full JupyterLab interface.
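For reference, the helper script automates the standard SSH port-forwarding pattern shown below. The node name and port are placeholders, and the command is only printed here, since running it requires the real node assigned to your Jupyter job; prefer the provided script in practice:

```shell
# Sketch of the SSH tunnel the helper script sets up (placeholders throughout).
LOCAL_PORT=8888
NODE="<compute-node>"   # the node where your Jupyter job is running
echo "ssh -N -L ${LOCAL_PORT}:${NODE}:8888 <username>@hpcfund.amd.com"
# Then open http://localhost:8888 in your local browser and paste the
# token printed in the Jupyter job's log.
```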
7. ROCm Profiling & Debugging Tools
If you’re coming from the NVIDIA ecosystem, here’s a mapping of equivalent AMD ROCm tools:
| AMD Tool | NVIDIA Tool | Reference |
|---|---|---|
| ROCm Compute Profiler | Nsight Compute | |
| ROCm Systems Profiler | Nsight Systems | |
| | | Run |
| | | Run |
8. Getting Help
GitHub Issues
The primary support channel for help requests and technical issues is to open a GitHub issue on our companion GitHub site: github.com/AMDResearch/hpcfund
Tip
If you would like to receive announcements and notifications related to the cluster (e.g., system down times), go to the GitHub site, click the Watch button at the top-right, and select All Activity (or Custom → Discussions). Make sure your GitHub notification settings have email delivery enabled.
Email
For general questions about the AUP AI & HPC Cluster program or your project, please send emails to: hpc.fund@amd.com