Self-hosted AI Models for Coding and More

For an enthusiast, it’s a whole universe of exploration, privacy, anti-censorship, and more.

The most popular sources for models are:

llama.cpp
- A highly optimized C/C++ implementation designed to run LLMs locally. [Link]
Ollama [Link]
- A lightweight, extensible framework for running LLMs locally.
Hugging Face [Link]
- The primary model hub for AI enthusiasts and researchers.
LocalAI [Link]
- A containerized AI stack with a web UI, compatible with the OpenAI API.

Tip

The following websites can test and benchmark whether your computer can run an LLM, and how well.

CanIRun.ai [Link]
LLMfit [Link]

LLAMA.CPP

Download a model from HuggingFace.

llama-cli -hf Qwen/Qwen2.5-Coder-7B-Instruct-GGUF

Launch an OpenAI-compatible API server (port 8080 by default).

llama-server -hf Qwen/Qwen2.5-Coder-7B-Instruct-GGUF

llama-server -m model.gguf --host 0.0.0.0 --port 9000

Inspect and verify a model file’s metadata and architecture.

llama-cli -m model.gguf --info

Start a continuous conversation.

llama-cli -m model.gguf -cnv

OLLAMA

Installing Ollama:

curl -fsSL https://ollama.com/install.sh | sh

Pulling coder models.

ollama pull deepseek-coder:1.3b
ollama pull qwen2.5-coder:1.5b-base
ollama pull qwen2.5-coder:3b
ollama pull deepseek-coder:6.7b
ollama pull codellama:7b
ollama pull qwen2.5-coder:7b
ollama pull yi-coder:9b
ollama pull qwen2.5-coder:14b

Running a specific model.

ollama run yi-coder:9b

To exit the prompt, use Ctrl + D or type /bye.

Manage your models.

ollama help
ollama list
ollama ps
ollama stop yi-coder:9b
ollama rm yi-coder:9b

RUNNING AND TESTING

To make Ollama reachable over your network, modify the service configuration:

sudo nano /etc/systemd/system/ollama.service

Add the following environment variables to the [Service] section to bind to all network interfaces:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=default.target

Reload to apply.

sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama

From a remote host, test the connection via an HTTP request:

curl -s http://192.168.1.101:11434/api/generate -d '{
  "model": "qwen2.5-coder:1.5b-base",
  "prompt": "When was Python 3 first released?",
  "stream": false
}'

Example output.

{
  "model": "qwen2.5-coder:1.5b-base",
  "created_at": "2026-02-07T01:34:02.891211247Z",
  "response": "<redacted_for_brevity>",
  "done": true,
  "done_reason": "stop",
  "context": [
    <redacted_for_brevity>
  ],
  "total_duration": 260501788901,
  "load_duration": 2392809850,
  "prompt_eval_count": 7,
  "prompt_eval_duration": 687966008,
  "eval_count": 1065,
  "eval_duration": 253858547320
}

Success!

BENCHMARKING

The llm-benchmarking tool measures inference speed (tokens per second) on your hardware [Link].

sudo apt install python3-venv -y
python3 -m venv .venv
source .venv/bin/activate
pip install llm-benchmark
nano custom.yml

file_name: "custom.yml"
version: 2.0.custom
models:
- model: "yi-coder:9b"
- model: "codellama:7b"
- model: "qwen2.5-coder:7b"
- model: "qwen2.5-coder:3b"
- model: "qwen2.5-coder:14b"
- model: "deepseek-coder:6.7b"
- model: "qwen2.5-coder:1.5b-base"
- model: "deepseek-coder:1.3b"

llm_benchmark run --custombenchmark=custom.yml

Here are some acceptable results (summarized for brevity) using an NVIDIA P4 (PG414) 8GB VRAM [Link].

----------------------------------------
model_name =    deepseek-coder:1.3b
Average of eval rate:  125.67  tokens/s
----------------------------------------
model_name =    qwen2.5-coder:1.5b-base
Average of eval rate:  87.382  tokens/s
----------------------------------------
model_name =    qwen2.5-coder:3b
Average of eval rate:  53.126  tokens/s
----------------------------------------
model_name =    deepseek-coder:6.7b
Average of eval rate:  35.746  tokens/s
----------------------------------------
model_name =    codellama:7b
Average of eval rate:  34.978  tokens/s
----------------------------------------
model_name =    qwen2.5-coder:7b
Average of eval rate:  28.172  tokens/s
----------------------------------------
model_name =    yi-coder:9b
Average of eval rate:  26.79  tokens/s
----------------------------------------
model_name =    qwen2.5-coder:14b
Average of eval rate:  2.286  tokens/s
----------------------------------------

Note: Eval rates >30 tokens/s are excellent, while <10 tokens/s are extremely slow. The command ollama ls shows which processor is being used. In this case, the model did not fit entirely on the GPU, so the CPU is handling a few layers.

Here are slightly better results using an NVIDIA T4 (PG183) 16GB VRAM [Link].

(pending)

Here are some unacceptable results that fell back to the CPU.

----------------------------------------
model_name = deepseek-coder:1.3b
Average of eval rate: 7.364 tokens/s
----------------------------------------
model_name = qwen2.5-coder:1.5b-base
Average of eval rate: 5.154 tokens/s
----------------------------------------
model_name = deepseek-coder-v2:16b
Average of eval rate: 3.428 tokens/s
----------------------------------------
model_name = qwen2.5-coder:3b
Average of eval rate: 2.942 tokens/s
----------------------------------------
model_name = deepseek-coder:6.7b 
Average of eval rate: 1.328 tokens/s 
---------------------------------------- 
model_name = qwen2.5-coder:14b
Average of eval rate: 0.704 tokens/s
----------------------------------------

Note: Without GPU acceleration, eval rates are extremely slow.

If this is your situation (as in the example above), verify that the GPU drivers and kernel modules are properly installed and loaded. Choose the architecture carefully: AMD (Polaris, Vega, RDNA 3+) or NVIDIA (Pascal, Turing, Ampere).

As a last resort, try overwriting the service file:

nano /etc/systemd/system/ollama.service

[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=8.0.3"
#Environment="OLLAMA_VULKAN=1"

systemctl daemon-reload
systemctl restart ollama
sleep 10
journalctl -u ollama.service --since "1 minute ago"

In the following output, it was unsuccessful. Hopefully you’ll have better luck.

...
level=INFO source=runner.go:67 msg="discovering available GPUs..."
level=WARN source=runner.go:485 msg="user overrode visible devices" HSA_OVERRIDE_GFX_VERSION=8.0.3
level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
...

INTEGRATION

On VS Code, install popular extensions like Continue or Roo Code to use your local models for autocomplete and chat.

Example configuration for Continue.

nano ~/.continue/config.yaml

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: "Nerdsking Python 7B"
    provider: ollama
    model: hf.co/Nerdsking/Nerdsking-python-coder-7B-i:latest
    apiBase: http://192.168.1.101:11434
    roles: ["autocomplete", "edit", "apply", "chat"]

  - name: "Qwen2.5 Coder 7B"
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://192.168.1.101:11434
    roles: ["autocomplete", "edit", "apply", "chat"]

  - name: "YI-coder 9B"
    provider: ollama
    model: yi-coder:9b
    apiBase: http://192.168.1.101:11434
    roles: ["autocomplete", "edit", "apply", "chat"]

tabAutocompleteModel:
  name: "Nerdsking Python 3B"
  provider: ollama
  model: hf.co/Nerdsking/nerdsking-python-coder-3B-i:Q8_0
  apiBase: http://192.168.1.101:11434

Example configuration for Roo Code.

(pending)

HUGGING FACE & GGUF

Hugging Face makes it easy to use a model on virtually any framework, app, or cloud.

GGUF (GPT-Generated Unified Format)

A binary model file format
Bundles quantization and metadata in a single file
Enables fully offline inference

If the model on HF includes a .gguf file, pulling it into Ollama is as simple as:

ollama pull hf.co/Nerdsking/Nerdsking-python-coder-7B-i

ollama pull hf.co/Nerdsking/nerdsking-python-coder-3B-i:Q8_0

Alternatively, download the .gguf file manually and create a model from it.

echo 'FROM ./flux1-dev-Q4_1.gguf' > Modelfile
ollama create flux1-dev-Q4_1 -f Modelfile

Why run Hugging Face models with Ollama?
- It’s the easiest way to interact with models, especially for coding integrations.
What are Ollama’s limitations?
- For specialized models, such as image generation, it’s better to use dedicated apps or write your own code.

Downloading Models

Install the HF CLI tool.

curl -LsSf https://hf.co/cli/install.sh | bash
curl -sSfL https://hf.co/git-xet/install.sh | sudo sh

Optionally, log in to HF.

hf auth login

Preview the download, then proceed.

hf download unsloth/Z-Image-GGUF z-image-Q6_K.gguf --local-dir ./models --dry-run
hf download unsloth/Z-Image-GGUF z-image-Q6_K.gguf --local-dir ./models

Or download the entire repository.

hf download black-forest-labs/FLUX.1-dev

Coding with AI

There are multiple ways to install dependencies, but here I’ll demonstrate using pre-built container images for image generation.

docker pull huggingface/transformers-pytorch-gpu:latest
docker run -it --gpus all -v $(pwd)/models:/root/.cache/huggingface --name hf huggingface/transformers-pytorch-gpu:latest /bin/bash

python3 -c "import torch; print(f'CUDA Available: {torch.cuda.is_available()}'); print(f'Device Name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"None\"}')"

Build with CUDA support to use your GPU.

CMAKE_ARGS="-DSD_CUDA=ON" pip install stable-diffusion-cpp-python

Install common dependencies.

pip install diffusers accelerate transformers sentencepiece
pip install --root-user-action=ignore gguf

Now create and run the example script for your chosen model on HF. Here is a basic example.

from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
    "<MODEL_NAME_HERE>",
    torch_dtype=torch.float16
)
pipe.to("cuda")
prompt = "Happy family, detailed lighting, 8k"
negative_prompt = "low quality, blurry, deformed"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=18,
    guidance_scale=8
).images[0]
image.save("generated_image.png")

To return to or open a new terminal in the container:

docker start hf
docker exec -it hf /bin/bash

Alternatively, look for instructions on running interesting Spaces locally.

LOCALAI

LocalAI is a full-featured AI stack that runs in Docker, making it OS-agnostic.

curl -L https://install.localai.io | sh

docker run --gpus all -p 8080:8080 -v $(pwd)/models:/models -v $(pwd)/backends:/backends --name local-ai -itd localai/localai:latest

Models and backend engines are stored externally, keeping the container ephemeral while persisting your data.

Access the dashboard at http://localhost:8080

READ MORE

Interacting Directly with Ollama’s API [Link]