# Multi-Node Cluster Deployment

This guide provides instructions for deploying AUP Learning Cloud on a multi-node Kubernetes cluster using Ansible. This deployment is suitable for production environments requiring high availability and scalability.

## Overview

Multi-node deployment provides:

- **High Availability**: Redundancy across multiple nodes
- **Scalability**: Distribute workloads across the cluster
- **Resource Isolation**: Separate control plane and worker nodes
- **Production Ready**: Suitable for production environments

## Architecture

A typical multi-node setup consists of:

- **Control Plane Nodes** (1-3 nodes): Run Kubernetes control plane components
- **Worker Nodes** (2+ nodes): Run user workloads and JupyterHub services
- **Storage Node** (optional): Dedicated NFS storage server

## Prerequisites

### Hardware Requirements

**Per Node**:

- AMD Ryzen™ AI Halo Device (or compatible AMD hardware)
- 32GB+ RAM (64GB recommended for worker nodes)
- 500GB+ SSD (1TB+ for storage nodes)
- 1Gbps+ network connectivity

**Recommended Cluster**:

- 1 control plane node
- 3+ worker nodes
- 1 dedicated NFS storage node (optional)

### Software Requirements

**All Nodes**:

- Ubuntu 24.04.3 LTS
- SSH access configured
- Root/sudo privileges

**Control Node** (Ansible runner):

- Ansible 2.9 or later
- SSH key access to all nodes
- Python 3.8+

### Network Requirements

- All nodes must be on the same network or have connectivity to each other
- Fixed IP addresses, or DHCP with MAC reservations
- Firewall rules configured for Kubernetes ports

## Installation Steps

### 1. Prepare Control Node

Install Ansible on your control machine:

```bash
# Install Ansible
sudo apt update
sudo apt install ansible python3-pip

# Verify installation
ansible --version
```

### 2. Configure SSH Access

Set up passwordless SSH access to all nodes:

```bash
# Generate an SSH key (if one does not already exist)
ssh-keygen -t rsa -b 4096

# Copy the SSH key to all nodes
ssh-copy-id user@node1
ssh-copy-id user@node2
ssh-copy-id user@node3
# ... for all nodes

# Test SSH access
ssh user@node1 "hostname"
```

### 3. Clone Repository

```bash
git clone https://github.com/AMDResearch/aup-learning-cloud.git
cd aup-learning-cloud/deploy/ansible
```

### 4. Configure Inventory

Edit `inventory.yml` with your cluster node information:

```yaml
all:
  children:
    control_plane:
      hosts:
        master1:
          ansible_host: 192.168.1.10
          ansible_user: ubuntu
    workers:
      hosts:
        worker1:
          ansible_host: 192.168.1.11
          ansible_user: ubuntu
        worker2:
          ansible_host: 192.168.1.12
          ansible_user: ubuntu
        worker3:
          ansible_host: 192.168.1.13
          ansible_user: ubuntu
    storage:
      hosts:
        nfs1:
          ansible_host: 192.168.1.20
          ansible_user: ubuntu
```

### 5. Run Base Setup

Install basic requirements on all nodes:

```bash
# Update all nodes
ansible-playbook playbooks/pb-apt-upgrade.yml

# Install base packages
ansible-playbook playbooks/pb-base.yml

# Verify connectivity
ansible all -m ping
```

### 6. Install K3s Cluster

Deploy K3s across the cluster:

```bash
# Install K3s on all nodes
ansible-playbook playbooks/pb-k3s-site.yml
```

This playbook will:

- Install the K3s server on the control plane node
- Install K3s agents on the worker nodes
- Configure networking
- Set up kubectl access
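Before moving on to storage, it can be worth confirming that every node registered with the cluster. The ad-hoc commands below are a minimal sketch: they assume the inventory group names shown above and the default K3s service names (`k3s` on the server, `k3s-agent` on agents); adjust them if your playbooks use different names.

```bash
# Check that the K3s services are active (service names assume the K3s defaults)
ansible control_plane -m shell -a "systemctl is-active k3s"
ansible workers -m shell -a "systemctl is-active k3s-agent"

# List registered nodes from the control plane
# (k3s kubectl reads /etc/rancher/k3s/k3s.yaml, hence --become)
ansible control_plane -m shell -a "k3s kubectl get nodes -o wide" --become
```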
### 7. Configure NFS Storage

#### Option A: Dedicated NFS Server

If using a dedicated NFS server:

```bash
# Install and configure the NFS server
ansible-playbook playbooks/pb-nfs-server.yml
```

#### Option B: NFS Provisioner in Kubernetes

Deploy the NFS provisioner in the cluster:

```bash
cd ../../deploy/k8s/nfs-provisioner

# Edit values.yaml with your NFS server details
nano values.yaml

# Deploy the NFS provisioner
helm install nfs-provisioner . -n kube-system
```

See `deploy/k8s/nfs-provisioner/README.md` for detailed configuration.

### 8. Build and Import Images

On a build machine (this can be the control node):

```bash
cd ../../dockerfiles

# Build all images
make all

# Save images to tar files
docker save ghcr.io/amdresearch/auplc-dl:v1.0 -o auplc-dl.tar
docker save ghcr.io/amdresearch/auplc-llm:v1.0 -o auplc-llm.tar
# ... for all images

# Import images to cluster nodes
ansible workers -m copy -a "src=auplc-dl.tar dest=/tmp/"
ansible workers -m shell -a "docker load -i /tmp/auplc-dl.tar"
# ... for all images
```

Or use a container registry (recommended for production).

### 9. Configure JupyterHub

Create and edit the multi-node values file under `runtime/`:

```bash
cd ../runtime
cp values-multi-nodes.yaml.example values-multi-nodes.yaml
nano values-multi-nodes.yaml
```

Key multi-node-specific settings:

- **High availability**: Multiple hub replicas
- **Node selectors**: Schedule pods on specific nodes
- **Storage class**: Use the NFS storage class
- **Ingress**: Configure domain-based access

### 10. Deploy JupyterHub

```bash
# Deploy using custom values
helm upgrade --install jupyterhub ./jupyterhub \
  -n jupyterhub --create-namespace \
  -f values-multi-nodes.yaml
```

### 11. Verify Deployment

```bash
# Get the kubeconfig from the control plane node
scp user@master1:~/.kube/config ~/.kube/config

# Check nodes
kubectl get nodes

# Check all pods
kubectl get pods -n jupyterhub

# Check storage
kubectl get pvc -n jupyterhub
kubectl get storageclass
```

## Access JupyterHub

Configure ingress for domain-based access:

```bash
# Check ingress
kubectl get ingress -n jupyterhub

# Access via your domain: https://your-domain.com
```

## High Availability Configuration

For production high availability:

1. **Multiple Control Plane Nodes** (3 recommended):
   - Edit the inventory to include multiple control plane nodes
   - K3s will automatically configure HA etcd

2. **Load Balancer**:
   - Use an external load balancer for the control plane
   - Configure it in the K3s server installation

3. **Multiple Hub Replicas**:

   ```yaml
   hub:
     replicas: 2
     db:
       type: postgres  # Use an external database
   ```

## Monitoring and Maintenance

### Check Cluster Health

```bash
# Node status
kubectl get nodes

# Pod status across namespaces
kubectl get pods -A

# Resource usage
kubectl top nodes
kubectl top pods -n jupyterhub
```

### Upgrade Cluster

```bash
cd deploy/ansible

# Upgrade K3s
ansible-playbook playbooks/pb-k3s-upgrade.yml

# Upgrade JupyterHub
cd ../../runtime
bash scripts/helm_upgrade.bash
```

### Backup and Restore

See the [Backup Guide](../operations/backup.md) for cluster backup procedures.
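As a lightweight complement to full cluster backups, the release-level state can be captured before any Helm upgrade. The commands below are a minimal sketch assuming the release name and namespace used above (`jupyterhub`); NFS-level backups of the listed PVCs are still needed for user data.

```bash
# Snapshot the deployed Helm values and rendered manifests before upgrading
helm get values jupyterhub -n jupyterhub > jupyterhub-values-backup.yaml
helm get manifest jupyterhub -n jupyterhub > jupyterhub-manifest-backup.yaml

# List the PVCs whose underlying NFS data should be included in backups
kubectl get pvc -n jupyterhub -o name
```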
## Troubleshooting

### Node Not Joining Cluster

```bash
# Check the K3s service on the problem node
ansible worker1 -m shell -a "systemctl status k3s-agent"

# Check K3s logs
ansible worker1 -m shell -a "journalctl -u k3s-agent -n 100"

# Verify network connectivity
ansible worker1 -m shell -a "ping -c 3 master1"
```

### Storage Issues

```bash
# Check the NFS mount from inside a user pod
kubectl exec -it <pod-name> -n jupyterhub -- df -h

# Check NFS provisioner logs
kubectl logs -n kube-system -l app=nfs-provisioner
```

### Networking Issues

```bash
# Check CNI pods
kubectl get pods -n kube-system

# Test pod-to-pod networking
kubectl run test-pod --image=busybox --rm -it -- ping -c 3 <target-pod-ip>
```

## Scaling

### Add Worker Nodes

1. Update `inventory.yml` with the new worker nodes
2. Run the K3s playbook:

```bash
ansible-playbook playbooks/pb-k3s-site.yml --limit new_worker
```

### Remove Worker Nodes

```bash
# Drain the node
kubectl drain worker4 --ignore-daemonsets --delete-emptydir-data

# Delete it from the cluster
kubectl delete node worker4

# Uninstall K3s on the node
ansible worker4 -m shell -a "/usr/local/bin/k3s-agent-uninstall.sh"
```

## Next Steps

- [Configure JupyterHub](../jupyterhub/index.md)
- [Set up Authentication](../jupyterhub/authentication-guide.md)
- [Configure Monitoring](../operations/monitoring.md)
- [Backup Configuration](../operations/backup.md)