Multi-Node Cluster Deployment

This guide provides instructions for deploying AUP Learning Cloud on a multi-node Kubernetes cluster using Ansible. This deployment is suitable for production environments requiring high availability and scalability.

Overview

Multi-node deployment provides:

  • High Availability: Redundancy across multiple nodes

  • Scalability: Workloads are distributed across the cluster

  • Resource Isolation: Separate control plane and worker nodes

  • Production Ready: Suitable for production environments

Architecture

A typical multi-node setup consists of:

  • Control Plane Nodes (1-3 nodes): Run Kubernetes control plane components

  • Worker Nodes (2+ nodes): Run user workloads and JupyterHub services

  • Storage Node (optional): Dedicated NFS storage server

Prerequisites

Hardware Requirements

Per Node:

  • AMD Ryzen™ AI Halo Device (or compatible AMD hardware)

  • 32GB+ RAM (64GB recommended for worker nodes)

  • 500GB+ SSD (1TB+ for storage nodes)

  • 1Gbps+ network connectivity

Recommended Cluster:

  • 1 control plane node

  • 3+ worker nodes

  • 1 dedicated NFS storage node (optional)

Software Requirements

All Nodes:

  • Ubuntu 24.04.3 LTS

  • SSH access configured

  • Root/sudo privileges

Control Node (Ansible runner):

  • Ansible 2.9 or later

  • SSH key access to all nodes

  • Python 3.8+

Network Requirements

  • All nodes must be on the same network or have routed connectivity to each other

  • Static IP addresses, or DHCP with MAC-based reservations

  • Firewall rules configured for Kubernetes ports
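
For example, with ufw the core K3s ports can be opened as follows (this list assumes the default K3s networking with Flannel VXLAN; adjust it for your CNI and any additional services):

# Kubernetes API server
sudo ufw allow 6443/tcp

# Kubelet metrics
sudo ufw allow 10250/tcp

# Flannel VXLAN overlay network
sudo ufw allow 8472/udp

# etcd client and peer traffic (only with multiple control plane nodes)
sudo ufw allow 2379:2380/tcp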

Installation Steps

1. Prepare Control Node

Install Ansible on your control machine:

# Install Ansible
sudo apt update
sudo apt install ansible python3-pip

# Verify installation
ansible --version

2. Configure SSH Access

Set up passwordless SSH access to all nodes:

# Generate an SSH key (if one does not already exist)
ssh-keygen -t rsa -b 4096

# Copy SSH key to all nodes
ssh-copy-id user@node1
ssh-copy-id user@node2
ssh-copy-id user@node3
# ... for all nodes

# Test SSH access
ssh user@node1 "hostname"

3. Clone Repository

git clone https://github.com/AMDResearch/aup-learning-cloud.git
cd aup-learning-cloud/deploy/ansible

4. Configure Inventory

Edit inventory.yml with your cluster node information:

all:
  children:
    control_plane:
      hosts:
        master1:
          ansible_host: 192.168.1.10
          ansible_user: ubuntu

    workers:
      hosts:
        worker1:
          ansible_host: 192.168.1.11
          ansible_user: ubuntu
        worker2:
          ansible_host: 192.168.1.12
          ansible_user: ubuntu
        worker3:
          ansible_host: 192.168.1.13
          ansible_user: ubuntu

    storage:
      hosts:
        nfs1:
          ansible_host: 192.168.1.20
          ansible_user: ubuntu

5. Run Base Setup

Install basic requirements on all nodes:

# Update all nodes
ansible-playbook playbooks/pb-apt-upgrade.yml

# Install base packages
ansible-playbook playbooks/pb-base.yml

# Verify connectivity
ansible all -m ping
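
The commands above assume Ansible picks up inventory.yml automatically (for example via an ansible.cfg in this directory). If it does not, pass the inventory explicitly:

ansible-playbook -i inventory.yml playbooks/pb-base.yml
ansible all -i inventory.yml -m ping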

6. Install K3s Cluster

Deploy K3s across the cluster:

# Install K3s on all nodes
ansible-playbook playbooks/pb-k3s-site.yml

This playbook will:

  • Install the K3s server on the control plane node(s)

  • Install K3s agents on worker nodes

  • Configure networking

  • Set up kubectl access
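
For orientation, the manual equivalent of joining one agent to the cluster looks roughly like this (the server address comes from the example inventory above; the playbook handles these steps for you):

# On the control plane node: read the cluster join token
sudo cat /var/lib/rancher/k3s/server/node-token

# On a worker node: install the K3s agent and join the cluster
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.1.10:6443 K3S_TOKEN=<node-token> sh -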

7. Configure NFS Storage

Option A: Dedicated NFS Server

If using a dedicated NFS server:

# Install and configure NFS server
ansible-playbook playbooks/pb-nfs-server.yml
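
The playbook is expected to create the export; for reference, an NFS export serving the example cluster subnet typically looks like this in /etc/exports on the storage node (the export path is a placeholder):

/srv/nfs/exports 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)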

Option B: NFS Provisioner in Kubernetes

Deploy NFS provisioner in the cluster:

cd ../../deploy/k8s/nfs-provisioner

# Edit values.yaml with your NFS server details
nano values.yaml

# Deploy NFS provisioner
helm install nfs-provisioner . -n kube-system

See deploy/k8s/nfs-provisioner/README.md for detailed configuration.
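
As a rough illustration, the values typically identify the NFS server, the export path, and the storage class to create; treat the key names below as placeholders and confirm them against the chart's values.yaml:

nfs:
  server: 192.168.1.20        # storage node from the example inventory
  path: /srv/nfs/exports      # placeholder export path
storageClass:
  name: nfs-client
  defaultClass: true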

8. Build and Import Images

On a build machine (can be the control node):

cd ../../dockerfiles

# Build all images
make all

# Save images to tar files
docker save ghcr.io/amdresearch/auplc-dl:v1.0 -o auplc-dl.tar
docker save ghcr.io/amdresearch/auplc-llm:v1.0 -o auplc-llm.tar
# ... for all images

# Import images to cluster nodes
ansible workers -m copy -a "src=auplc-dl.tar dest=/tmp/"
ansible workers -m shell -a "docker load -i /tmp/auplc-dl.tar"
# ... for all images

Or use a container registry (recommended for production).
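
Note that K3s uses containerd rather than Docker by default, so images loaded with docker load may not be visible to the cluster. If that is the case, import the tar files into K3s's containerd image store instead (same paths as above):

ansible workers -b -m shell -a "k3s ctr images import /tmp/auplc-dl.tar"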

9. Configure JupyterHub

Create and edit runtime/values-multi-nodes.yaml for the multi-node deployment:

cd ../runtime
cp values-multi-nodes.yaml.example values-multi-nodes.yaml
nano values-multi-nodes.yaml

Key multi-node specific settings:

  • High availability: Multiple hub replicas

  • Node selectors: Schedule pods on specific nodes

  • Storage class: Use NFS storage class

  • Ingress: Configure domain-based access
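
A minimal sketch of what these settings can look like, assuming the chart follows the upstream Zero to JupyterHub value layout (key names may differ in this repository's chart):

hub:
  nodeSelector:
    node-role.kubernetes.io/control-plane: "true"   # example: keep the hub on the control plane
singleuser:
  storage:
    dynamic:
      storageClass: nfs-client    # assumption: the class created by the NFS provisioner
ingress:
  enabled: true
  hosts:
    - your-domain.com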

10. Deploy JupyterHub

# Deploy using custom values
helm upgrade --install jupyterhub ./jupyterhub \
  -n jupyterhub --create-namespace \
  -f values-multi-nodes.yaml

11. Verify Deployment

# Get kubeconfig from control plane node
scp user@master1:~/.kube/config ~/.kube/config
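
# The K3s kubeconfig usually points at 127.0.0.1; if so, replace it with the
# control plane address (192.168.1.10 in the example inventory)
sed -i 's/127.0.0.1/192.168.1.10/g' ~/.kube/config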

# Check nodes
kubectl get nodes

# Check all pods
kubectl get pods -n jupyterhub

# Check storage
kubectl get pvc -n jupyterhub
kubectl get storageclass

Access JupyterHub

With ingress configured for domain-based access (see step 9), verify the ingress and open JupyterHub in a browser:

# Check ingress
kubectl get ingress -n jupyterhub

# Access via domain
https://your-domain.com

High Availability Configuration

For production high availability:

  1. Multiple Control Plane Nodes (3 recommended):

    • Edit inventory to include multiple control plane nodes

    • K3s will configure embedded etcd for HA automatically (see the sketch after this list)

  2. Load Balancer:

    • Use external load balancer for control plane

    • Configure in K3s server installation

  3. Multiple Hub Replicas:

    hub:
      replicas: 2
      db:
        type: postgres  # Use external database
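
For reference, bringing up multiple K3s servers with embedded etcd manually looks roughly like the following; the playbook is expected to take care of this when the inventory lists several control plane hosts (addresses and token are placeholders):

# First control plane node: initialise the embedded etcd cluster
curl -sfL https://get.k3s.io | sh -s - server --cluster-init

# Additional control plane nodes: join the first server
curl -sfL https://get.k3s.io | K3S_TOKEN=<node-token> sh -s - server --server https://192.168.1.10:6443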
    

Monitoring and Maintenance

Check Cluster Health

# Node status
kubectl get nodes

# Pod status across namespaces
kubectl get pods -A

# Resource usage
kubectl top nodes
kubectl top pods -n jupyterhub
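
kubectl top relies on metrics-server, which K3s ships by default. If the commands above return errors, check that it is running:

kubectl get deployment metrics-server -n kube-system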

Upgrade Cluster

cd deploy/ansible

# Upgrade K3s
ansible-playbook playbooks/pb-k3s-upgrade.yml

# Upgrade JupyterHub
cd ../../runtime
bash scripts/helm_upgrade.bash

Backup and Restore

See the Backup Guide for cluster backup procedures.

Troubleshooting

Node Not Joining Cluster

# Check K3s service on problem node
ansible worker1 -m shell -a "systemctl status k3s-agent"

# Check K3s logs
ansible worker1 -m shell -a "journalctl -u k3s-agent -n 100"

# Verify network connectivity
ansible worker1 -m shell -a "ping master1"

Storage Issues

# Check NFS mount
kubectl exec -it <pod-name> -n jupyterhub -- df -h

# Check NFS provisioner logs
kubectl logs -n kube-system -l app=nfs-provisioner

Networking Issues

# Check CNI pods
kubectl get pods -n kube-system

# Test pod-to-pod networking
kubectl run test-pod --image=busybox --rm -it -- ping <pod-ip>
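
If pod-to-pod traffic works but services are unreachable, cluster DNS is a common culprit; a quick check using a throwaway busybox pod, as above:

kubectl run test-dns --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default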

Scaling

Add Worker Nodes

  1. Update inventory.yml with the new worker node(s), as shown in the example below

  2. Run the K3s playbook:

    ansible-playbook playbooks/pb-k3s-site.yml --limit new_worker
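
For example, the new node is added under the existing workers group in inventory.yml (the address is a placeholder), and new_worker in the command above is replaced with its host name:

workers:
  hosts:
    worker4:
      ansible_host: 192.168.1.14
      ansible_user: ubuntu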
    

Remove Worker Nodes

# Drain node
kubectl drain worker4 --ignore-daemonsets --delete-emptydir-data

# Delete from cluster
kubectl delete node worker4

# Uninstall K3s on the node
ansible worker4 -m shell -a "/usr/local/bin/k3s-agent-uninstall.sh"

Next Steps