Model Serving

Model Serving#

This section guides you through serving Large Language Models (LLMs) locally using three powerful and freely available tools:

Ollama – A lightweight, user-friendly framework for running LLMs with minimal setup.
Llamafile – A self-contained, portable LLM server that simplifies deployment.
OpenLLM – A flexible and scalable solution for serving LLMs in production environments.

Each of these tools has advantages depending on your use case:

Ollama is ideal for quick experimentation and interactive chat, supporting various models from Hugging Face and other sources.
Llamafile is designed for simplicity, bundling an entire LLM into a single executable for easy deployment across different platforms and operating systems.
OpenLLM provides a more scalable, API-driven approach, making it well-suited for enterprise and cloud-based applications.

In the following sections, we’ll walk you through setting up and using Ollama, Llamafile, and OpenLLM to serve LLMs on AMD hardware.

Serve LLMs with Ollama#

Install Ollama inside the Docker container:

curl -fsSL https://ollama.com/install.sh | sh

By default, models are not automatically served, so you will need to start the service and redirect the output:

ollama serve > /tmp/ollama.log 2>&1 &

Launch the 8 billion Llama3.1. You can explore more available models from the Ollama library.

ollama run llama3.1:8b

Once the model is running, you can start interacting with it.

Note

Models are stored in /ROCM_APP/models/ollama, this ensures that the model will be only downloaded once, even after stopping the Docker container.

To stop Ollama run:

killall ollama

Serve LLM(s) with Llamafile#

Inside the Docker container, download LlaVa and give executable permissions.

wget https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile?download=true -O llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile

Launch the Llamafile server.

./llava-v1.5-7b-q4.llamafile --port 8888 --nobrowser -ngl 999 --host '0.0.0.0'

In the host machine (outside Docker), open a web browser and navigate to localhost:8888. This will load the Llamafile web app, where you can experiment with Chat and Completion modes.

Note

We suggest you click on More options and increase the Show Probabilities, this will show the output tokens color coded. If you click in the token, it shows the most likely tokens and the likelihood of being placed after the previous token. By increasing the Temperature, you can get more ‘creative’ answers.

Serve with OpenLLM#

Step-by-Step Guide to Use OpenLLM on AMD GPUs

Serve with SGLang#

Step-by-Step Guide to Use SGLang on AMD GPUs

SPDX-License-Identifier: MIT