Quick-Start Guide
Welcome to the AMD University Program (AUP) AI & HPC Cluster! This guide will help you get up and running quickly. For comprehensive documentation, please explore the full site.
1. Cluster Hardware
Compute Nodes
Multi-GPU Nodes
| Nodes | CPUs | GPUs | DRAM |
|---|---|---|---|
| 4 | [2x] 128-core EPYC 9755 | [8x] MI350X 288 GB | 3072 GB DDR5 |
| 1 | [2x] 128-core EPYC 9755 | [8x] MI325X 256 GB | 3072 GB DDR5 |
| 2 | [2x] 96-core EPYC 9684X | [8x] MI300X 192 GB | 2304 GB DDR5 |
| 10 | [2x] 64-core EPYC 7763 | [4x] MI250 128 GB | 1536 GB DDR4 |
| 21 | [2x] 64-core EPYC 7V13 | [4x] MI210 64 GB | 512 GB DDR4 |
Single-GPU Nodes (Virtual)
| Nodes | CPUs | GPU | DRAM |
|---|---|---|---|
| 8 | 16 cores of EPYC 9755 | [1x] MI350X 288 GB | 334 GB DDR5 |
| 8 | 16 cores of EPYC 9684X | [1x] MI300X 192 GB | 238 GB DDR5 |
| 28 | 16 cores of EPYC 7V13 | [1x] MI210 64 GB | 64 GB DDR4 |
Note
MI350X will be deployed starting in Q2 2026. The mi3508x and mi3501x partitions will have charge factors of 1.4 and 0.175, respectively.
Login Node
| Nodes | CPUs | GPUs | DRAM |
|---|---|---|---|
| 2 | [2x] 64-core EPYC 7V13 | [2x] MI210 64 GB | 512 GB DDR4 |
Warning
The login node is shared by all users. Do not run compute-intensive workloads on it. Use Slurm to submit jobs to compute nodes instead.
2. Logging In
Connect via SSH, replacing `<username>` with your assigned username:

```shell
ssh <username>@hpcfund.amd.com
```
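If you connect often, you can add a host alias to your local `~/.ssh/config` so that a short name expands to the full address. The alias `hpcfund` below is just an example name, not something the cluster requires:

```
Host hpcfund
    HostName hpcfund.amd.com
    User <username>
```

With this entry in place, `ssh hpcfund` is equivalent to the full command above.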
3. Storage Areas
You have two storage areas available:
| Variable | Path | Description | Capacity |
|---|---|---|---|
| | | Your personal home directory | 25 GB |
| | | Your directory within your project's workspace | 2 TB default (shared across project members) |
4. Software & Programming Environment
We use Lmod to manage software modules. Key commands:
```shell
module avail          # List all available packages
module list           # Show currently loaded packages
module load <pkg>     # Load a package (e.g., module load hdf5)
module unload <pkg>   # Unload a package
```
Setting Up PyTorch (Recommended for AI Workloads)
While a ROCm-enabled PyTorch module is available via Lmod, we recommend installing your own for greater flexibility. Here's how, using a Python virtual environment:

```shell
python3 -m venv <name-of-venv>
source <name-of-venv>/bin/activate
pip3 install --upgrade pip
pip3 install torch torchvision \
    --index-url https://download.pytorch.org/whl/rocm7.2
# Install any additional packages you need
# pip3 install transformers datasets accelerate ...
```
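After activating a virtual environment, it is worth confirming that the venv's interpreter is the one on your `PATH` before installing packages. A quick sanity check (the venv name `demo-venv` below is purely illustrative):

```shell
# Create and activate an example venv, then verify which python3 is active.
python3 -m venv demo-venv
source demo-venv/bin/activate
command -v python3    # should print a path ending in demo-venv/bin/python3
```

If the printed path does not point inside the venv, the activation step did not take effect in your current shell.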
Tip
Installing your own PyTorch environment gives you full control over versions and additional dependencies without waiting for system-wide module updates.
5. Running Jobs
We use Slurm for job scheduling. Two primary modes:
Batch Jobs (preferred for most workloads): Batch Job Submission Guide
Interactive Jobs (for quick testing & debugging): Interactive Usage Guide
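As a sketch, a minimal batch script might look like the following. The partition name, module name, time limit, and workload script are all placeholders; consult the Batch Job Submission Guide and `sinfo` for the partitions actually available to your project:

```shell
# Write a minimal example Slurm batch script (all names are placeholders).
cat > my_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example         # job name shown in squeue
#SBATCH --partition=<partition>    # pick a GPU partition from `sinfo`
#SBATCH --nodes=1
#SBATCH --gpus=1                   # request one GPU
#SBATCH --time=00:30:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=job-%j.out        # %j expands to the job ID

module load rocm                   # assumed module name; check `module avail`
srun python3 train.py              # your workload
EOF
```

Submit with `sbatch my_job.sh` and monitor with `squeue -u $USER`.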
6. Jupyter
We provide a helper script to launch JupyterLab sessions that tunnel to your local browser.
Tip
You can replace `jupyter notebook` with `jupyter lab` in the script if you prefer the full JupyterLab interface.
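For reference, the helper script automates the standard SSH port-forwarding pattern shown below. The node name and port are placeholders, and the command is only printed here, since running it requires the real node assigned to your Jupyter job; prefer the provided script in practice:

```shell
# Sketch of the SSH tunnel the helper script sets up (placeholders throughout).
LOCAL_PORT=8888
NODE="<compute-node>"   # the node where your Jupyter job is running
echo "ssh -N -L ${LOCAL_PORT}:${NODE}:8888 <username>@hpcfund.amd.com"
# Then open http://localhost:8888 in your local browser and paste the
# token printed in the Jupyter job's log.
```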
7. ROCm Profiling & Debugging Tools
If you’re coming from the NVIDIA ecosystem, here’s a mapping of equivalent AMD ROCm tools:
| AMD Tool | NVIDIA Tool | Reference |
|---|---|---|
| ROCm Compute Profiler | Nsight Compute | |
| ROCm Systems Profiler | Nsight Systems | |
| | | Run |
| | | Run |
8. Getting Help
GitHub Issues
The primary support channel for help requests and technical issues is to open a GitHub issue on our companion GitHub site: github.com/AMDResearch/hpcfund
Tip
If you would like to receive announcements and notifications related to the cluster (e.g., system down times), go to the GitHub site, click the Watch button at the top-right, and select All Activity (or Custom → Discussions). Make sure your GitHub notification settings have email delivery enabled.
Email
For general questions about the AUP AI & HPC Cluster program or your project, please send emails to: hpc.fund@amd.com