Quick Primer on LocalAI

Bhaskar S

*UPDATED* 06/14/2026

Overview

LocalAI is an open-source platform for running various models locally. It also serves as an API shim for both the OpenAI and Anthropic APIs, so applications built against either API can be pointed at LocalAI with minimal or no code changes.

LocalAI can easily run on either a consumer-grade CPU or a consumer-grade GPU and can be used for various generative AI tasks such as text generation, text to audio generation, and text to image generation.

Behind the scenes, LocalAI integrates with the various models (for the different tasks) via Backends, which are primarily gRPC servers that manage the models.

The following illustration depicts the high-level architecture of LocalAI:

Figure.1

For the hands-on demonstration of LocalAI, we will make use of the official Docker image, which comes bundled with a default set of backends that fully enable all the features of LocalAI.

Note that LocalAI does NOT come bundled with default models and one needs to install the desired models for the various tasks.

Installation and Setup

The installation and setup will be on an Ubuntu 24.04 LTS based Linux desktop. Ensure that Docker is installed and set up on the desktop (see instructions).

Also, ensure that the Python 3.1x programming language as well as the Jupyter Notebook packages are installed. In addition, ensure the command-line utilities curl and jq are installed on the Linux desktop.

We will set up the five required directories by executing the following command in a terminal window:

$ mkdir -p $HOME/.local_ai{/backends,/configuration,/models,/data,/tmp}
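For readers who prefer to script the setup, the same directory layout can be created from Python. This is only a minimal sketch; the helper name make_local_ai_dirs is illustrative and is not part of LocalAI.

```python
from pathlib import Path

# Sub-directories taken from the mkdir command above
SUBDIRS = ("backends", "configuration", "models", "data", "tmp")


def make_local_ai_dirs(home: Path) -> list[Path]:
    """Create <home>/.local_ai/<subdir> for each sub-directory and return the paths."""
    base = home / ".local_ai"
    created = []
    for name in SUBDIRS:
        path = base / name
        # parents=True and exist_ok=True mirror the behaviour of 'mkdir -p'
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_local_ai_dirs(Path.home()) should be equivalent to the shell command, and it is safe to run more than once.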

To download the latest version (v4.4.3 at the time of this article) of the Docker image for LocalAI, execute the following command in a terminal window:

$ docker pull localai/localai:v4.4.3

The following should be the typical output:

Output.1

v4.4.3: Pulling from localai/localai
cb259a83ac3d: Pull complete 
a438421d38d8: Pull complete 
0f9c328da9c0: Pull complete 
9012256cad48: Pull complete 
4f4fb700ef54: Pull complete 
3a47bb1c2920: Pull complete 
8ce73f2ef2b7: Pull complete 
10b652323276: Pull complete 
Digest: sha256:1aeb407121b5f5a70ed33f74fdb805d86435503f59fc106c53484e08d724b841
Status: Downloaded newer image for localai/localai:v4.4.3
docker.io/localai/localai:v4.4.3

To install the necessary Python packages, execute the following command:

$ pip install openai pydantic python-dotenv

This completes all the system installation and setup for the LocalAI hands-on demonstration.

Hands-on with LocalAI

For the hands-on demonstration, we will perform four tasks: embedding, text generation, text to audio conversion, and text to image generation.

This implies we will have to install the appropriate models in LocalAI for handling the various tasks.

We will assume the IP address of the desktop to be 192.168.1.25.

To check for the available embedding models, execute the following command in the terminal window:

$ docker run --rm --name local-ai -u $(id -u $USER):$(id -g $USER) --add-host=host.docker.internal:host-gateway --network=host -p 192.168.1.25:8080:8080 -v $HOME/.local_ai/backends:/backends -v $HOME/.local_ai/configuration:/configuration -v $HOME/.local_ai/models:/models -v $HOME/.local_ai/data:/data -v $HOME/.local_ai/tmp:/tmp localai/localai:v4.4.3 models list | grep embed

The following would be the typical output:

Output.2

- localai@qwen3-vl-embedding-8b
- localai@qwen3-vl-embedding-2b
- localai@qwen3-embedding-4b
- localai@qwen3-embedding-8b
- localai@qwen3-embedding-0.6b
- localai@granite-embedding-107m-multilingual
- localai@granite-embedding-125m-english
- localai@embeddinggemma-300m
- localai@bert-embeddings
- localai@nomic-embed-text-v1.5
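Since the same long docker run command is needed for each CLI invocation, it may help to assemble it programmatically. The following is a rough sketch that reuses only the flags from the command above; the helper names localai_docker_cmd and filter_models are hypothetical, and the Python filter merely stands in for grep.

```python
import os
from pathlib import Path

IMAGE = "localai/localai:v4.4.3"
SUBDIRS = ("backends", "configuration", "models", "data", "tmp")


def localai_docker_cmd(host_ip: str, *cli_args: str) -> list[str]:
    """Return the docker run argument list from this article, followed by cli_args."""
    base = Path.home() / ".local_ai"
    cmd = [
        "docker", "run", "--rm", "--name", "local-ai",
        # Same effect as -u $(id -u $USER):$(id -g $USER) for the current user
        "-u", f"{os.getuid()}:{os.getgid()}",
        "--add-host=host.docker.internal:host-gateway",
        "--network=host",
        "-p", f"{host_ip}:8080:8080",
    ]
    # One bind mount per directory created during setup
    for name in SUBDIRS:
        cmd += ["-v", f"{base / name}:/{name}"]
    return cmd + [IMAGE, *cli_args]


def filter_models(listing: str, keyword: str) -> list[str]:
    """Keep the lines of a 'models list' output that contain keyword (like grep)."""
    return [line for line in listing.splitlines() if keyword in line]
```

With Docker available, the list can be passed to subprocess.run(localai_docker_cmd("192.168.1.25", "models", "list"), capture_output=True, text=True), and the captured stdout can then be narrowed with filter_models(stdout, "embed").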

We will choose the localai@nomic-embed-text-v1.5 embedding model. To install it, execute the earlier docker run command in the terminal window with the trailing models list | grep embed replaced by models install localai@nomic-embed-text-v1.5.

The following would be the typical trimmed output:

Output.3

...[TRIM]... 
Jun 14 11:28:00 INFO  Using forced capability run file capabilityRunFile="/run/localai/capability" capability="default\n" env="" 
Jun 14 11:28:00 INFO  installing model model="localai@nomic-embed-text-v1.5" license="" 
Jun 14 11:28:00 INFO  Downloading url="https://huggingface.co/mradermacher/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.f16.gguf" 
Jun 14 11:28:03 INFO  File downloaded and verified file="/models/nomic-embed-text-v1.5.f16.gguf"
...[TRIM]...

To check if the gemma 4 model is available for text generation, execute the earlier docker run command in the terminal window with grep embed replaced by grep gemma-4.

The following would be the typical output:

Output.4

- localai@gemma-4-26b-a4b-it-qat
- localai@gemma-4-12b-it-qat-q4_0
- localai@gemma-4-e2b-it-qat-q4_0
- localai@gemma-4-e4b-it-qat-q4_0
- localai@gemma-4-26b-a4b-it-qat-q4_0
- localai@gemma-4-31b-it-qat-q4_0
- localai@gemma-4-12b-it-qat-mtp
- localai@gemma-4-26b-a4b-it-qat-mtp
- localai@gemma-4-31b-it-qat-mtp
- localai@gemma-4-26b-a4b-it-apex
- localai@gemma-4-26b-a4b-it
- localai@gemma-4-e2b-it
- localai@gemma-4-e4b-it
- localai@gemma-4-31b-it
- localai@medgemma-4b-it
- localai@google_medgemma-4b-it
- localai@gemma-4-e2b-it:sglang-mtp
- localai@gemma-4-e4b-it:sglang-mtp

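We will choose the localai@gemma-4-e2b-it-qat-q4_0 text generation model. To install it, execute the earlier docker run command in the terminal window with the trailing models list | grep embed replaced by models install localai@gemma-4-e2b-it-qat-q4_0.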

The following would be the typical trimmed output:

Output.5

...[TRIM]... 
Jun 14 11:29:37 INFO  Using forced capability run file capabilityRunFile="/run/localai/capability" capability="default\n" env="" 
Jun 14 11:29:37 INFO  installing model model="localai@gemma-4-e2b-it-qat-q4_0" license="apache-2.0" 
Jun 14 11:29:37 INFO  Downloading url="https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B_q4_0-it.gguf" 
Jun 14 11:30:10 INFO  Downloading url="https://huggingface.co/google/gemma-4-E2B-it-qat-q4_0-gguf/resolve/main/gemma-4-E2B-it-mmproj.gguf" 
Jun 14 11:30:19 INFO  File downloaded and verified file="/models/llama-cpp/mmproj/gemma-4-E2B-it-qat-q4_0-gguf/gemma-4-E2B-it-mmproj.gguf" 
...[TRIM]...

To check for the available text to audio models, execute the earlier docker run command in the terminal window with grep embed replaced by grep tts.

The following would be the typical output:

Output.6

- localai@neutts-air
- localai@vllm-omni-qwen3-tts-custom-voice
- localai@qwen3-tts-cpp
- localai@qwen3-tts-cpp-0.6b-base-q4
- localai@qwen3-tts-cpp-1.7b-base
- localai@qwen3-tts-cpp-1.7b-base-q4
- localai@qwen3-tts-cpp-customvoice
- localai@qwen3-tts-cpp-customvoice-q4
- localai@qwen3-tts-cpp-1.7b-customvoice
- localai@qwen3-tts-cpp-1.7b-customvoice-q4
- localai@qwen3-tts-cpp-1.7b-voicedesign
- localai@qwen3-tts-cpp-1.7b-voicedesign-q4
- localai@qwen3-tts-1.7b-custom-voice
- localai@qwen3-tts-0.6b-custom-voice
- localai@lfm2.5-audio-1.5b-tts
- localai@pocket-tts
- localai@kitten-tts
- localai@outetts
- localai@parler-tts-mini-v0.1
- localai@voice-en-us-libritts-high
- localai@voice-cy_GB-bu_tts-medium
- localai@voice-en_US-libritts_r-medium
- localai@voice-id_ID-news_tts-medium
- localai@vibevoice-tts-crispasr
- localai@chatterbox-tts-crispasr
- localai@qwen3-tts-customvoice-crispasr
- localai@orpheus-tts-crispasr
- localai@piper-id_ID-news_tts-medium-crispasr

We will choose the localai@pocket-tts text to audio model. To install it, execute the earlier docker run command in the terminal window with the trailing models list | grep embed replaced by models install localai@pocket-tts.

The following would be the typical trimmed output:

Output.7

...[TRIM]... 
Jun 14 11:31:52 INFO  Using forced capability run file capabilityRunFile="/run/localai/capability" capability="default\n" env="" 
Jun 14 00:31:52 INFO  installing model model="localai@pocket-tts" license="mit" 
...[TRIM]...

To check if the flux model is available for image generation, execute the earlier docker run command in the terminal window with grep embed replaced by grep flux.

The following would be the typical output:

Output.8

- localai@flux.1-dev
- localai@flux.1-schnell
- localai@flux.1-dev-ggml
- localai@flux.1dev-abliteratedv2
- localai@flux.1-kontext-dev
- localai@flux.1-dev-ggml-q8_0
- localai@flux.1-dev-ggml-abliterated-v2-q8_0
- localai@flux.1-krea-dev-ggml
- localai@flux.1-krea-dev-ggml-q8_0
- localai@flux.2-dev
- localai@flux.2-klein-4b
- localai@flux.2-klein-9b

We will choose the localai@flux.2-klein-4b image generation model. To install it, execute the earlier docker run command in the terminal window with the trailing models list | grep embed replaced by models install localai@flux.2-klein-4b.

The following would be the typical trimmed output:

Output.9

...[TRIM]... 
Jun 14 11:32:51 INFO  Using forced capability run file capabilityRunFile="/run/localai/capability" capability="default\n" env="" 
Jun 14 11:32:51 INFO  installing model model="localai@flux.2-klein-4b" license="apache-2.0" 
Jun 14 11:32:52 INFO  Downloading url="https://huggingface.co/leejet/FLUX.2-klein-4B-GGUF/resolve/main/flux-2-klein-4b-Q4_0.gguf" 
Jun 14 11:33:48 INFO  Downloading url="https://huggingface.co/Comfy-Org/flux2-dev/resolve/main/split_files/vae/flux2-vae.safetensors" 
Jun 14 11:33:56 INFO  Downloading url="https://huggingface.co/unsloth/Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_K_M.gguf" 
Jun 14 11:35:18 INFO  File downloaded and verified file="/models/stablediffusion-cpp/models/Qwen3-4B-Q4_K_M.gguf" 
...[TRIM]...

Now that we are all ready to go, start the LocalAI platform by executing the following command in the terminal window:

The following would be the typical output:

Output.10

CPU info:
model name	: AMD Ryzen 7 5700X 8-Core Processor
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
Jun 14 00:35:02 INFO  Using forced capability run file capabilityRunFile="/run/localai/capability" capability="default\n" env="" 
Jun 14 00:35:02 INFO  Starting LocalAI threads=8 modelsPath="//models" 
Jun 14 00:35:02 INFO  LocalAI version version="v4.4.3 (4d3d54d61b083d5b435636c7638dec16b051553f)" 
Jun 14 00:35:02 INFO  LocalAI Assistant in-memory MCP server initialised tools=29 read_only=false 
Jun 14 00:35:02 INFO  stats: using in-memory ring buffer (no-auth single-user mode)
Jun 14 00:35:02 INFO  stats: fallback user wired local_user_id="31039261-15c0-46ba-9822-a993e79e3459" 
Jun 14 00:35:02 INFO  pii: filter enabled patterns=6 config_path="" persisted_overrides=0 
Jun 14 00:35:02 INFO  Loaded tasks from persister count=0 
Jun 14 00:35:02 INFO  Loaded jobs from persister count=0 
Jun 14 00:35:02 INFO  AgentJobService started retention_days=30 
Jun 14 00:35:02 INFO  Preloading models path="//models" 

  Model name: flux.2-klein-4b                                                 

  Model name: gemma-4-e2b-it-qat-q4_0                                         

  Model name: nomic-embed-text-v1.5                                           

  Model name: pocket-tts                                                      

Jun 14 00:35:02 INFO  core/startup process completed!
Jun 14 00:35:02 INFO  LocalAI is started and running address=":8080" 
Jun 14 00:35:02 INFO  Agent pool started (standalone/LocalAGI mode) stateDir="//data" apiURL="http://127.0.0.1:8080"

To test the local API endpoints, open a new terminal window to perform the various curl commands.

To list all the installed models on the running LocalAI platform, execute the following command in the terminal:

$ curl -s http://192.168.1.25:8080/v1/models | jq

The following should be the typical output:

Output.11

{
  "object": "list",
  "data": [
    {
      "id": "nomic-embed-text-v1.5",
      "object": "model"
    },
    {
      "id": "pocket-tts",
      "object": "model"
    },
    {
      "id": "flux.2-klein-4b",
      "object": "model"
    },
    {
      "id": "gemma-4-e2b-it-qat-q4_0",
      "object": "model"
    }
  ]
}
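The same check can be done from Python using only the standard library. This is a small sketch: the helper names are illustrative, and fetch_model_ids assumes a LocalAI instance is reachable at the given base URL.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract the model identifiers from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str) -> list[str]:
    """GET <base_url>/models and return the model identifiers (needs a running server)."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/models") as resp:
        return model_ids(json.load(resp))


# Response body abbreviated from Output.11
sample = {
    "object": "list",
    "data": [
        {"id": "pocket-tts", "object": "model"},
        {"id": "flux.2-klein-4b", "object": "model"},
    ],
}
print(model_ids(sample))  # → ['pocket-tts', 'flux.2-klein-4b']
```

Against the running platform, the call would be fetch_model_ids("http://192.168.1.25:8080/v1").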

Next, to send a user text to the installed text embedding model for an embedding response, execute the following command:

$ curl -s http://192.168.1.25:8080/v1/embeddings -X POST -H "Content-Type: application/json" -d '{
  "input": "LocalAI is very good!",
  "model": "nomic-embed-text-v1.5"
}' | jq "."

The following should be the typical trimmed embedding output:

Output.12

{
  "created": 1781397386,
  "object": "list",
  "id": "40ac5ff5-f8f1-473f-9a6c-c229f2b57171",
  "model": "nomic-embed-text-v1.5",
  "data": [
    {
      "embedding": [
        0.008724315,
        0.04154468,
        -0.19155857,
        -0.015543682,
        -0.02621489,
        -0.03186404,
        0.00116063,
        0.02089727,
        -0.012686667,
        -0.009345424,

        ...[ TRIM ]...

        0.09869765,
        -0.017687557,
        0.0062609147,
        -0.017183295,
        0.013117128,
        -0.011258059,
        0.003946815,
        -0.0027934387,
        -0.009456794,
        -0.030608324
      ],
      "index": 0,
      "object": "embedding"
    }
  ]
}
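An embedding vector is usually not inspected by itself but compared with other vectors. One common comparison is cosine similarity, sketched below in plain Python. The vectors used here are made-up toy values, not output from the model, and the function name is illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only (real embeddings are far longer)
same_direction = cosine_similarity([0.1, 0.3, -0.2], [0.2, 0.6, -0.4])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
```

In practice, the two arguments would be the embedding lists returned for two different input texts, with values closer to 1 indicating more similar texts.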

Moving on to the next task of text generation ...

To send a user prompt to the installed gemma-4 LLM model for a response, execute the following command in the terminal window:

$ curl -s http://192.168.1.25:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "gemma-4-e2b-it-qat-q4_0",
  "messages": [{"role": "user", "content": "Describe an llm model using less than 50 words"}],
  "temperature": 0.7
}' | jq

The following would be the typical output:

Output.13

{
  "created": 1781397510,
  "object": "chat.completion",
  "id": "62c9c917-ebc2-48a6-a29f-17c9eb29dca8",
  "model": "gemma-4-e2b-it-qat-q4_0",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "An LLM is a large neural network trained on massive text datasets. It learns patterns and context to predict the most probable next word, enabling it to generate human-like text, answer questions, summarize information, and perform complex language tasks.",
        "reasoning": "Thinking Process:\n\n1.  **Analyze the Request:** The user wants a description of an LLM (Large Language Model) that is under 50 words.\n2.  **Identify Key LLM Attributes:** What defines an LLM?\n    *   It's a neural network.\n    *   It processes massive amounts of text.\n    *   It's trained on data.\n    *   It predicts the next word/token.\n    *   It generates human-like text (answers, summaries, code).\n3.  **Draft Concise Descriptions (Iterative Refinement):**\n    *   *Draft 1 (Focus on function):* An LLM is a massive neural network trained on vast text datasets. It predicts the most probable next word in a sequence, allowing it to generate human-like text, answer questions, summarize information, and translate languages. (Too long - 39 words, but could be tighter).\n    *   *Draft 2 (Focus on core mechanism):* A Large Language Model (LLM) is a deep learning neural network trained on massive text data. It uses probabilistic patterns to understand context, generate coherent human language, answer queries, and perform various natural language tasks. (40 words).\n4.  **Final Selection and Polish (Under 50 words):** Ensure it hits the main points clearly and concisely.\n\n*Self-Correction during drafting: Make sure the definition is accurate without getting too technical (avoiding specific transformer architecture unless necessary, as the goal is a general description).*\n\n*Final Check against constraint:* Is it under 50 words? Yes. Is it descriptive? Yes. (Proceed with final output)."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 407,
    "total_tokens": 435
  }
}
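Note that the response above carries the model's answer in content and its thinking trace in a separate reasoning field. A small sketch for pulling out the useful parts of such a response follows; the helper name summarize_chat and the abbreviated sample are illustrative.

```python
def summarize_chat(payload: dict) -> dict:
    """Pick the answer, finish reason, and token usage out of a chat completion body."""
    choice = payload["choices"][0]
    message = choice["message"]
    usage = payload.get("usage", {})
    return {
        "answer": message.get("content", ""),
        # The reasoning field may be absent for models that do not emit one
        "has_reasoning": bool(message.get("reasoning")),
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Response body abbreviated from Output.13
sample = {
    "choices": [
        {
            "index": 0,
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "content": "An LLM is a large neural network trained on massive text datasets.",
                "reasoning": "Thinking Process: ...",
            },
        }
    ],
    "usage": {"prompt_tokens": 28, "completion_tokens": 407, "total_tokens": 435},
}

summary = summarize_chat(sample)
print(summary["finish_reason"], summary["total_tokens"])  # → stop 435
```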

Moving on to the next task of text to audio generation ...

To send a sample text to the installed pocket-tts audio generation model, execute the following command in the terminal window:

$ curl -s http://192.168.1.25:8080/tts -H "Content-Type: application/json" -d '{
  "input": "LocalAI seems to be good for running various models!",
  "model": "pocket-tts",
  "voice": "jane"
}'

There will be no output on the terminal!

The audio file will be saved in the directory $HOME/.local_ai/tmp/generated/content/audio.

Here is the generated audio file:

Moving on to the final task of text to image generation ...

To send a sample text to the installed flux image generation model, execute the following command in the terminal window:

$ curl -s http://192.168.1.25:8080/v1/images/generations -H 'Content-Type: application/json' -d '{
  "prompt": "A deer in a park covered with snow",
  "step": 25,
  "size": "256x256"
}' | jq

The following would be the typical output:

Output.14

{
  "created": 1781398456,
  "id": "a3903930-74ca-441a-b2dc-fa855f4c8f30",
  "data": [
    {
      "embedding": null,
      "index": 0,
      "url": "http://192.168.1.25:8080/generated-images/b643436740188.png"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0,
    "input_tokens_details": {
      "text_tokens": 0,
      "image_tokens": 0
    }
  }
}

The following is the generated image by the model:

Figure.2

Note that the image file will be saved in the directory $HOME/.local_ai/tmp/generated/content/images.
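The url field in the response can also be used to fetch the image over HTTP. Below is a minimal sketch using only the standard library; the helper names are illustrative, and download assumes the LocalAI server is still running and serving that URL.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def image_urls(payload: dict) -> list[str]:
    """Collect the image URLs from an /v1/images/generations response body."""
    return [item["url"] for item in payload.get("data", []) if item.get("url")]


def local_name(url: str) -> str:
    """Derive a local file name from the last path segment of the URL."""
    return Path(urlparse(url).path).name


def download(url: str, dest_dir: Path) -> Path:
    """Fetch the image and store it under dest_dir (needs a running server)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / local_name(url)
    with urllib.request.urlopen(url) as resp:
        target.write_bytes(resp.read())
    return target


# Response body abbreviated from Output.14
sample = {
    "data": [
        {
            "embedding": None,
            "index": 0,
            "url": "http://192.168.1.25:8080/generated-images/b643436740188.png",
        }
    ]
}
print(local_name(image_urls(sample)[0]))  # → b643436740188.png
```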

Finally, shifting gears to demonstrate the various tasks using the OpenAI SDK ...

The following are the contents of the environment configuration file .env:

EMBEDDING_MODEL='nomic-embed-text-v1.5'
LLM_MODEL='gemma-4-e2b-it-qat-q4_0'
AUDIO_MODEL='pocket-tts'
IMAGE_MODEL='flux.2-klein-4b'
API_KEY='polarsparc'
BASE_URL='http://192.168.1.25:8080/v1/'

To load the environment configuration and initialize variables, execute the following code snippet:

from dotenv import load_dotenv, find_dotenv

import os

load_dotenv(find_dotenv())

embedding_model = os.getenv('EMBEDDING_MODEL')
llm_model = os.getenv('LLM_MODEL')
audio_model = os.getenv('AUDIO_MODEL')
image_model = os.getenv('IMAGE_MODEL')
api_key = os.getenv('API_KEY')
base_url = os.getenv('BASE_URL')

There will be no output generated.
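Because os.getenv returns None when a variable is not set, a missing or misspelled entry in the .env file only shows up later as a confusing error. A small sanity check such as the following sketch can catch that early; the helper name missing_settings is illustrative.

```python
import os
from collections.abc import Mapping

# Variable names taken from the .env file above
REQUIRED = (
    "EMBEDDING_MODEL",
    "LLM_MODEL",
    "AUDIO_MODEL",
    "IMAGE_MODEL",
    "API_KEY",
    "BASE_URL",
)


def missing_settings(env: Mapping[str, str], required=REQUIRED) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


# Example with a deliberately incomplete configuration
partial = {"LLM_MODEL": "gemma-4-e2b-it-qat-q4_0", "API_KEY": ""}
print(missing_settings(partial))
# → ['EMBEDDING_MODEL', 'AUDIO_MODEL', 'IMAGE_MODEL', 'API_KEY', 'BASE_URL']
```

After load_dotenv has run, calling missing_settings(os.environ) should return an empty list when the .env file is complete.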

To get the embedding vector for a given input text from the LocalAI platform, execute the following code snippet:

from openai import OpenAI

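# Point the OpenAI client at the LocalAI endpoint
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

text = 'LocalAI is great for local testing!'

# Request the embedding vector for a single input string
response = client.embeddings.create(
    input=[text],
    model=embedding_model,
)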

print(response.data[0].embedding)

The following should be the typical trimmed output:

Output.15

[0.03261595964431763, 0.05440229922533035, -0.19516122341156006, -0.028359385207295418, 0.027675945311784744, -0.04765535891056061, 0.03257906064391136,
[... TRIM ...]
-0.026756983250379562, 0.03671528398990631, -0.004582556895911694, -0.014343341812491417, -0.06610661000013351, 0.02362898364663124, -0.027483072131872177]

To send a user prompt to the LLM model running on the LocalAI platform, execute the following code snippet:

messages = [{"role": "user", "content": "Describe llm model using less than 50 words"}]

response = client.chat.completions.create(
    messages=messages,
    model=llm_model,
    stream=False,
)

print(response.choices[0].message.content)

The following should be the typical output:

Output.16

A Large Language Model (LLM) is an AI trained on massive amounts of text data to understand, generate, and predict human language. They can answer questions, summarize content, write creatively, translate languages, and engage in complex conversations.

To generate an audio output for the user text from the text to audio model running on the LocalAI platform, execute the following code snippet:

client.audio.speech.create(
    model=audio_model,
    voice='michael',
    input=text
)

There will be output on the terminal and the audio file will be saved as a .WAV file.

Here is the generated audio file:

To generate an image corresponding to the user prompt from the image generation model running on the LocalAI platform, execute the following code snippet:

response = client.images.generate(
    prompt='a cute baby snow leopard that is growling',
    size='256x256',
)

print(response.data[0].url)

The following should be the typical output:

Output.17

http://192.168.1.25:8080/generated-images/b641126708929.png

The following is the generated image by the model:

Figure.3

This concludes the various demonstrations on using the LocalAI platform as a local instance of OpenAI for development and testing!

References

LocalAI

OpenAI Python SDK