# Quick-Start Guide

Welcome to the **AMD University Program (AUP) AI & HPC Cluster**! This guide will help you get up and running quickly. For comprehensive documentation, please explore the full site.

---

## 1. Cluster Hardware

```{seealso}
[Compute Servers Section](compute-servers)
```

### Compute Nodes

#### Multi-GPU Nodes

| Nodes | CPUs                    | GPUs               | DRAM         |
|:-----:|-------------------------|--------------------|--------------|
| 4     | [2x] 128-core EPYC 9755 | [8x] MI350X 288 GB | 3072 GB DDR5 |
| 1     | [2x] 128-core EPYC 9755 | [8x] MI325X 256 GB | 3072 GB DDR5 |
| 2     | [2x] 96-core EPYC 9684X | [8x] MI300X 192 GB | 2304 GB DDR5 |
| 10    | [2x] 64-core EPYC 7763  | [4x] MI250 128 GB  | 1536 GB DDR4 |
| 21    | [2x] 64-core EPYC 7V13  | [4x] MI210 64 GB   | 512 GB DDR4  |

#### Single-GPU Nodes (Virtual)

| Nodes | CPUs                   | GPU                | DRAM        |
|:-----:|------------------------|--------------------|-------------|
| 8     | 16 cores of EPYC 9755  | [1x] MI350X 288 GB | 334 GB DDR5 |
| 8     | 16 cores of EPYC 9684X | [1x] MI300X 192 GB | 238 GB DDR5 |
| 28    | 16 cores of EPYC 7V13  | [1x] MI210 64 GB   | 64 GB DDR4  |

```{note}
**MI350X** nodes will be deployed starting in Q2 2026. The `mi3508x` and `mi3501x` partitions will have charge factors of 1.4 and 0.175, respectively.
```

### Login Node

| Nodes | CPUs                   | GPUs             | DRAM        |
|:-----:|------------------------|------------------|-------------|
| 2     | [2x] 64-core EPYC 7V13 | [2x] MI210 64 GB | 512 GB DDR4 |

```{warning}
The login node is shared by **all** users. Do **not** run compute-intensive workloads on it. Use Slurm to submit jobs to compute nodes instead.
```

---

## 2. Logging In

Connect via SSH, replacing `<username>` with your assigned username:

```bash
ssh <username>@hpcfund.amd.com
```

---

## 3. Storage Areas

```{seealso}
[File Systems Section](file-systems)
```

You have two storage areas available:

| Variable | Path                           | Description                                    | Capacity                                     |
|----------|--------------------------------|------------------------------------------------|----------------------------------------------|
| `$HOME`  | `/home1/<username>`            | Your personal home directory                   | 25 GB                                        |
| `$WORK`  | `/work/<project>/<username>`   | Your directory within your project's workspace | 2 TB default (shared across project members) |

---

## 4. Software & Programming Environment

```{seealso}
[Software Section](#software)
```

We use [Lmod](https://lmod.readthedocs.io/) to manage software modules. Key commands:

```bash
module avail             # List all available packages
module list              # Show currently loaded packages
module load <package>    # Load a package (e.g., module load hdf5)
module unload <package>  # Unload a package
```

### Setting Up PyTorch (Recommended for AI Workloads)

While a ROCm-enabled PyTorch module is available via Lmod, we recommend installing your own for greater flexibility. Here's how, using a Python virtual environment:

```bash
python3 -m venv <venv-name>
source <venv-name>/bin/activate
pip3 install --upgrade pip
pip3 install torch torchvision \
    --index-url https://download.pytorch.org/whl/rocm7.2

# Install any additional packages you need
# pip3 install transformers datasets accelerate ...
```

```{tip}
Installing your own PyTorch environment gives you full control over versions and additional dependencies without waiting for system-wide module updates.
```

---

## 5. Running Jobs

```{seealso}
[Running Jobs](#jobs)
```

We use [Slurm](https://slurm.schedmd.com/) for job scheduling. There are two primary modes:

- **Batch Jobs** *(preferred for most workloads)*: [Batch Job Submission Guide](batch-jobs)
- **Interactive Jobs** *(for quick testing & debugging)*: [Interactive Usage Guide](interactive-jobs)

---

## 6. Jupyter

```{seealso}
[Jupyter Section](jupyter)
```

We provide a helper script to launch JupyterLab sessions that tunnel to your local browser.
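For reference, such a workflow roughly amounts to the following steps. This is a minimal sketch, not the cluster's actual helper script: the partition name, port, and placeholders in angle brackets are illustrative assumptions only.

```shell
# Hypothetical sketch of a Jupyter session with SSH port forwarding.
# Partition, time limit, and port below are example values only.

# 1. On the login node: start Jupyter inside an interactive Slurm job
srun -p mi2104x -t 2:00:00 --pty \
    jupyter notebook --no-browser --port=8888 --ip=0.0.0.0

# 2. On your local machine: forward the port through the login node,
#    replacing <compute-node> with the node your job landed on
ssh -L 8888:<compute-node>:8888 <username>@hpcfund.amd.com

# 3. Open the tokenized URL that Jupyter printed in your local browser:
#    http://localhost:8888/?token=...
```

The provided helper script automates these steps for you; prefer it over typing the commands by hand.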
```{tip}
You can replace `jupyter notebook` with `jupyter lab` in the script if you prefer the full JupyterLab interface.
```

---

## 7. ROCm Profiling & Debugging Tools

If you're coming from the NVIDIA ecosystem, here's a mapping of equivalent AMD ROCm tools:

| AMD Tool               | NVIDIA Tool    | Reference                                                                          |
|------------------------|----------------|------------------------------------------------------------------------------------|
| ROCm Compute Profiler  | Nsight Compute | [Documentation](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/) |
| ROCm Systems Profiler  | Nsight Systems | [Documentation](https://rocm.docs.amd.com/projects/rocprofiler-systems/en/latest/) |
| `rocprof`              | `nvprof`       | Run `rocprof -h`                                                                   |
| `rocm-smi` / `amd-smi` | `nvidia-smi`   | Run `amd-smi -h`                                                                   |

---

## 8. Getting Help

### GitHub Issues

The primary support channel for help requests and technical issues is to submit a GitHub issue on our companion GitHub site: [github.com/AMDResearch/hpcfund](https://github.com/AMDResearch/hpcfund)

```{tip}
If you would like to receive announcements and notifications related to the cluster (e.g., system downtimes), go to the GitHub site, click the **Watch** button at the top right, and select **All Activity** (or **Custom → Discussions**). Make sure your GitHub notification settings have email delivery enabled.
```

### Email

For general questions about the AUP AI & HPC Cluster program or your project, please email [hpc.fund@amd.com](mailto:hpc.fund@amd.com).

---

## 9. Additional Resources

### Hardware

- [AMD Instinct MI350X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi350x.html)
- [AMD Instinct MI355X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html)
- [AMD Instinct MI325X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html)
- [AMD Instinct MI300X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)
- [AMD Instinct MI300A](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300a.html)

### Software & Documentation

- [ROCm for AI](https://www.amd.com/en/products/software/rocm/ai.html)
- [ROCm for HPC](https://www.amd.com/en/products/software/rocm/hpc.html)
- [ROCm Documentation](https://rocm.docs.amd.com/en/latest/)
- [ROCm API Libraries](https://rocm.docs.amd.com/en/latest/reference/api-libraries.html)
- [ROCm CMake Packages](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html)
- [Containers & Deployment Guides](https://www.amd.com/en/developer/resources/infinity-hub.html)

### Learning & Tutorials

- [AMD AI Academy](https://www.amd.com/es/developer/resources/training/amd-ai-academy.html)
- [AMD Developer Program](https://www.amd.com/en/developer/ai-dev-program.html)
- [ROCm Blog](https://rocm.blogs.amd.com/)
- [Self-Guided AI Tutorials](https://rocm.blogs.amd.com/applications-models.html)
- [ROCm AI Developer Hub](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/)
- [ROCm HPC Tutorials (AMD Lab Notes)](https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-readme/)
- [ROCm Training Videos](https://www.amd.com/en/developer/resources/rocm-hub/training-videos.html)