Quick Primer on Llama.cpp

Bhaskar S

06/28/2025

Overview

llama.cpp is a powerful and efficient open source inference platform that enables one to run various Large Language Models (or LLM(s) for short) on a local machine.

The llama.cpp platform comes with a built-in Web UI interface that allows one to interact with the local LLM(s) via the provided web user interface. In addition, the platform exposes a local API endpoint, which enables app developers to build AI applications/workflows to interact with the local LLM(s) via the exposed API endpoint.

Last but not the least, the llama.cpp platform efficiently leverages the underlying system resouces of the local machine, such as the CPU(s) and the GPU(s), to optimally run the LLMs for better performance.

In this primer, we will demonstrate how one can effectively setup and run the llama.cpp platform using a Docker image.

Installation and Setup

The installation and setup will can on a Ubuntu 24.04 LTS based Linux desktop. Ensure that Docker is installed and setup on the desktop (see INSTRUCTIONS). Also, ensure the Python 3.1x programming language is installed and setup on the desktop.

We will create the required models directory by executing the following command in a terminal window:

$ mkdir -p $HOME/.llama_cpp/models

From the llama.cpp docker RESPOSITORY, one can identify the current version of the docker image. At the time of this article, the latest version of the docker image ended with the version b5768.

We require the docker image with the tag word full. If the desktop has an Nvidia GPU, one can look for the docker image with the tag words full-cuda.

To pull and download the full docker image for llama.cpp with CUDA support, execute the following command in a terminal window:

$ docker pull ghcr.io/ggml-org/llama.cpp:full-cuda-b5768

The following should be the typical output:

Output.1

full-cuda-b5768: Pulling from ggml-org/llama.cpp
bccd10f490ab: Pull complete 
edd1dba56169: Pull complete 
e06eb1b5c4cc: Pull complete 
7f308a765276: Pull complete 
3af11d09e9cd: Pull complete 
42896cdfd7b6: Pull complete 
600519079558: Pull complete 
0ae42424cadf: Pull complete 
73b7968785dc: Pull complete 
9bc72242f66c: Pull complete 
52b07c56c7a9: Pull complete 
834653c7592e: Pull complete 
4f4fb700ef54: Pull complete 
2af5fefec38c: Pull complete 
Digest: sha256:add3cd8b2c53d55e342c95a966c172d00d29e0dda0eebeb3caf7c352d9f60743
Status: Downloaded newer image for ghcr.io/ggml-org/llama.cpp:full-cuda-b5768
ghcr.io/ggml-org/llama.cpp:full-cuda-b5768

To install the necessary Python packages, execute the following command:

$ pip install dotenv langchain langchain-core langchain-openai pydantic

This completes all the system installation and setup for the llama.cpp hands-on demonstration.

Hands-on with llama.cpp

Before we get started, we need to identify the Nvidia GPU device on the desktop. To identify the CUDA device, execute the following command:

$ docker run --rm --name llama_cpp --gpus all -p 8000:8000 -v $HOME/.llama_cpp/models:/models ghcr.io/ggml-org/llama.cpp:full-cuda-b5768 --server --list-devices

The following should be the typical output:

Output.2

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
Available devices:
  CUDA0: NVIDIA GeForce RTX 4060 Ti (16073 MiB, 15194 MiB free)

From the Output.2 above, the CUDA device is CUDA0.

For the hands-on demostrations, we will download the Gemma 3 4B LLM model from Huggingface. To download and serve the desired LLM model, execute the following command in the terminal window:

$ docker run --rm --name llama_cpp --gpus all -p 8000:8000 -v $HOME/.llama_cpp/models:/root/.cache/llama.cpp ghcr.io/ggml-org/llama.cpp:full-cuda-b5768 --server --hf-repo unsloth/gemma-3-4b-it-GGUF:Q4_K_XL --port 8000 --host 0.0.0.0 --device CUDA0 --temp 0.2 --log-timestamps

The following should be the typical trimmed output:

Output.3

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 5770 (b25e9277) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
-----[ TRIM ] -----
main: server is listening on http://0.0.0.0:8000 - starting the main loop
srv  update_slots: all slots are idle

Now, launch the Web Browser and open the URL http://localhost:8000. The following illustration depicts the llama.cpp user interface:

Figure.1

Enter the user prompt in the text box at the bottom and click on the circled UP arrow (indicated by the red arrow) as shown in the illustration below:

Figure.2

The response from the LLM is as shown in the illustration below:

Figure.3

For the next task, we will use the bank check as shown in the illustration below:

Figure.4

Click on the attachment icon (red number 1 arrow) to attach the above bank check image, then enter the user prompt in the text box at the bottom and click on the circled UP arrow (red number 2 arrow) as shown in the illustration below:

Figure.5

The response from the LLM after processing the bank check image is as shown in the illustration below:

Figure.6

Next, to test the llama.cpp inference platform via the API endpoint, execute the following user prompt in the terminal window:

$ curl -s -X POST "http://localhost:8000/completion" -H "Content-Type: application/json" -d '{"prompt": "Describe a GPU in less than 50 words", "max_tokens": 100}' | jq

The following should be the typical output:

Output.4

{
  "index": 0,
  "content": ".\n\nA GPU (Graphics Processing Unit) is a specialized processor designed to accelerate graphics rendering. It performs complex calculations for images, videos, and 3D graphics, significantly improving performance in gaming and other visual applications.\n",
  "tokens": [],
  "id_slot": 0,
  "stop": true,
  "model": "gpt-3.5-turbo",
  "tokens_predicted": 46,
  "tokens_evaluated": 11,
  "generation_settings": {
    "n_predict": 100,
    "seed": 4294967295,
    "temperature": 0.20000000298023224,
    "dynatemp_range": 0.0,
    "dynatemp_exponent": 1.0,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "top_n_sigma": -1.0,
    "xtc_probability": 0.0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1.0,
    "repeat_last_n": 64,
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "dry_multiplier": 0.0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": 4096,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.10000000149011612,
    "stop": [],
    "max_tokens": 100,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "logit_bias": [],
    "n_probs": 0,
    "min_keep": 0,
    "grammar": "",
    "grammar_lazy": false,
    "grammar_triggers": [],
    "preserved_tokens": [],
    "chat_format": "Content-only",
    "reasoning_format": "deepseek",
    "reasoning_in_content": false,
    "thinking_forced_open": false,
    "samplers": [
      "penalties",
      "dry",
      "top_n_sigma",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative.n_max": 16,
    "speculative.n_min": 0,
    "speculative.p_min": 0.75,
    "timings_per_token": false,
    "post_sampling_probs": false,
    "lora": []
  },
  "prompt": "Describe a GPU in less than 50 words",
  "has_new_line": true,
  "truncated": false,
  "stop_type": "eos",
  "stopping_word": "",
  "tokens_cached": 56,
  "timings": {
    "prompt_n": 10,
    "prompt_ms": 165.907,
    "prompt_per_token_ms": 16.590700000000002,
    "prompt_per_second": 60.274732229502064,
    "predicted_n": 46,
    "predicted_ms": 3128.99,
    "predicted_per_token_ms": 68.02152173913043,
    "predicted_per_second": 14.701229470212434
  }
}

We have successfully tested the exposed local API endpoint from the command-line !

Now, we will test llama.cpp using the Python Langchain API code snippets.

Create a file called .env with the following environment variables defined:

LLM_TEMPERATURE=0.2
LLAMA_CPP_BASE_URL='http://localhost:8000/v1'
LLAMA_CPP_MODEL='unsloth/gemma-3-4b-it-GGUF:Q4_K_XL'
LLAMA_CPP_API_KEY='llama_cpp'

To load the environment variables and assign them to corresponding Python variables, execute the following code snippet:

from dotenv import load_dotenv, find_dotenv

import os

load_dotenv(find_dotenv())

llm_temperature = os.getenv('LLM_TEMPERATURE')
llama_cpp_base_url = os.getenv('LLAMA_CPP_BASE_URL')
llama_cpp_model = os.getenv('LLAMA_CPP_MODEL')
llama_cpp_api_key = os.getenv('LLAMA_CPP_API_KEY')

To initialize an instance of the LLM client class for OpenAI running on the host URL, execute the following code snippet:

from langchain_openai import ChatOpenAI

llm_openai = ChatOpenAI(
  model=llama_cpp_model,
  base_url=llama_cpp_base_url,
  api_key=llama_cpp_api_key,
  temperature=float(llm_temperature)
)

To get a text response for a user prompt from the Gemma 3 4B LLM model running on the llama.cpp inference platform, execute the following code snippet:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
  ('system', 'You are a helpful assistant'),
  ('human', '{input}'),
])

chain = prompt | llm_openai

response = chain.invoke({'input': 'Compare the GDP of India vs USA in 2024 and provide the response in JSON format'})
response.content

The following should be the typical output:

Output.5

'```json\n{\n  "comparison": "GDP Comparison - India vs. USA (2024 - Estimates)",\n  "data": {\n    "country": "United States of America",\n    "gdp_2024_estimate": {\n      "value": 27.94 trillion,\n      "unit": "USD (United States Dollars)",\n      "source": "International Monetary Fund (IMF) - April 2024 Projections"\n    },\n    "country": "India",\n    "gdp_2024_estimate": {\n      "value": 13.29 trillion,\n      "unit": "USD (United States Dollars)",\n      "source": "World Bank - April 2024 Projections"\n    },\n    "comparison_summary": {\n      "usa_gdp_larger_than": true,\n      "percentage_difference": "USA\'s GDP is approximately 2.15 times larger than India\'s GDP.",\n      "notes": "These figures are estimates and projections, subject to change as more data becomes available.  Different organizations (IMF, World Bank, etc.) may have slightly varying estimates.  The IMF\'s projections are generally considered more up-to-date."\n    }\n  },\n  "disclaimer": "Data is based on current projections and estimates as of May 1, 2024.  Actual figures may differ."\n}\n```\n'

For the next task, we will attempt to present the LLM model response in a structured form using a Pydantic data class. For that, we will first define a class object by executing the following code snippet:

from pydantic import BaseModel, Field

class GeographicInfo(BaseModel):
  country: str = Field(description="Name of the Country")
  capital: str = Field(description="Name of the Capital City")
  population: int = Field(description="Population of the country in billions")
  land_area: int = Field(description="Land Area of the country in square miles")
  list_of_rivers: list = Field(description="List of top 5 rivers in the country")

To receive a LLM model response in the desired format for the specific user prompt from the llama.cpp platform, execute the following code snippet:

struct_llm_openai = llm_openai.with_structured_output(GeographicInfo)

chain = prompt | struct_llm_openai

response = chain.invoke({'input': 'Provide the geographic information of India and include the capital city, population in billions, land area in square miles, and list of rivers'})
response

The following should be the typical output:

Output.6

GeographicInfo(country='India', capital='New Delhi', population=1428, land_area=3287263, list_of_rivers=[{'river_name': 'Ganges (Ganga)', 'description': 'Considered the holiest river in Hinduism, flowing from the Himalayas to the Bay of Bengal.'}, {'river_name': 'Yamuna', 'description': 'A major tributary of the Ganges, flowing through Delhi and Agra.'}, {'river_name': 'Brahmaputra', 'description': 'Originates in Tibet and flows through India and Bangladesh before emptying into the Bay of Bengal.'}, {'river_name': 'Indus', 'description': 'Historically significant, flowing through Pakistan and India, crucial for irrigation and water supply.'}, {'river_name': 'Narmada', 'description': 'A major river in central India, considered sacred.'}, {'river_name': 'Godavari', 'description': 'One of the longest rivers in India, flowing through Maharashtra and Andhra Pradesh.'}, {'river_name': 'Krishna', 'description': 'Another major river in southern India, important for agriculture.'}, {'river_name': 'Mahanadi', 'description': 'Flows through Odisha and Chhattisgarh.'}, {'river_name': 'Kaveri (Cauvery)', 'description': 'A vital river in southern India, important for agriculture and water supply.'}, {'river_name': 'Tapti (Tapi)', 'description': 'Flows through Gujarat and Maharashtra.'}])

With this, we conclude the various demonstrations on using the llama.cpp platform for running and working with the pre-trained LLM model(s) locally !!!

References

llama.cpp

llama.cpp Docker

llama.cpp Server Options