Sharat Visweswara
4 min read
  • AI
  • LLMs
  • Python
  • Developer Tools

Local LLMs Are More Accessible Than You Think

Running a language model on your own machine used to be limited to an elite few with powerful hardware and technical expertise. Not any more!

Running a language model locally used to be the kind of thing that required an afternoon of fighting CUDA drivers, unanswered questions on Stack Overflow, forum threads full of conflicting advice, and a willingness to compile from source. That era is largely over. The tooling has quietly become very good and accessible to a broader audience.

Admittedly, I’ve been writing code for a long time, so I may be overestimating how easy this is—but let’s push on.

To demonstrate this, I built a small script that watches an email inbox over IMAP and summarizes new messages using a locally running LLM. It’s extracted from a much larger email concierge app I’ve been working on, and I hope to talk more about it in subsequent posts.

I specifically use local LLMs because it’s my email, and I want it to stay private: I don’t want OpenAI, Anthropic, and their ilk to see it. The amazing part is how little ceremony it takes to get this working. A link to the script is at the end of this article for those in a hurry. I hope you enjoy the dissection below; let me know what you think.

Dependency management: uv

The script is a single Python file. No requirements.txt. No virtual environment you have to remember to activate. The dependencies live right at the top:

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "llama-cpp-python",
#   "huggingface-hub",
#   "imapclient",
# ]
# ///

uv, from Astral (the team behind Ruff), is a Python package manager and runner written in Rust. The comment block above is inline script metadata, standardised in PEP 723; the uv run --script invocation in the shebang reads it. Running the script is as simple as ./summarize_email.py: uv installs everything into an isolated environment and executes the script. No setup, no activation, no drift, and no polluting your system Python with random pip-installed packages.

This is already a meaningful quality-of-life improvement over the traditional Python workflow. Before uv, I struggled with Python dependency management and virtual environments, which reinforced my view that Python wasn’t appropriate for anything serious. Now it’s the first language I reach for. Enough about uv; the dependencies themselves are where it gets interesting.

Getting a model: Hugging Face

huggingface-hub is the official client library for Hugging Face, which at this point is the primary distribution layer for open-weights models. The script uses it to pull down a model on first run and cache it locally:

from pathlib import Path

from huggingface_hub import hf_hub_download

DEFAULT_REPO = "Qwen/Qwen2.5-0.5B-Instruct-GGUF"
DEFAULT_FILE = "qwen2.5-0.5b-instruct-q4_k_m.gguf"

def resolve_model() -> str:
    # Reuse the cached copy if a previous run already downloaded it.
    cache_dir = Path.home() / ".cache" / "email-summarizer"
    cached = cache_dir / DEFAULT_FILE
    if cached.exists():
        return str(cached)
    print(f"Downloading {DEFAULT_FILE} from {DEFAULT_REPO} ...", flush=True)
    return hf_hub_download(
        repo_id=DEFAULT_REPO,
        filename=DEFAULT_FILE,
        local_dir=str(cache_dir),
    )

One function call and that’s it! The model file downloads to ~/.cache/email-summarizer/ and subsequent runs skip the download entirely.

The model itself: Qwen 2.5 0.5B

The model chosen here is Qwen 2.5 at the 0.5 billion parameter scale, in GGUF format quantized to Q4_K_M. That’s around 400MB on disk. It runs comfortably on a laptop CPU with no GPU required.

0.5B sounds small because it is small. For general question-answering it would struggle. But for a constrained task like “summarise this email in 2–3 sentences”, it’s genuinely sufficient. The instruction-tuned variant follows the system prompt reliably, and the temperature is set low (0.2) to keep outputs predictable. The trick with small models is matching the task to what they can actually do — don’t ask them to reason, ask them to restate.
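
Nothing ties the script to this particular model, either. As a hypothetical tweak (the environment variable names here are my invention, not part of the script), you could let resolve_model() pick up any GGUF repo/file pair from the Hub:

import os

# Hypothetical overrides; swap in any GGUF repo/file pair from the Hub.
DEFAULT_REPO = os.environ.get("SUMMARIZER_REPO", "Qwen/Qwen2.5-0.5B-Instruct-GGUF")
DEFAULT_FILE = os.environ.get("SUMMARIZER_FILE", "qwen2.5-0.5b-instruct-q4_k_m.gguf")

A 3B or 7B model at the same quantisation should give noticeably better summaries, if you have the RAM to spare.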

The GGUF format and Q4_K_M quantisation are the work of llama.cpp, which is what actually runs the model.

Inference: llama-cpp-python

llama-cpp-python is a Python binding for llama.cpp. Loading and running the model looks like this:

from llama_cpp import Llama

# n_ctx sets the context window; 2048 tokens is plenty for a single email.
llm = Llama(model_path=model_path, n_ctx=2048, verbose=False)

One line to load the model (imports aside). The inference API follows the OpenAI chat completions shape, which means that if you’ve written anything against that SDK, the code will feel immediately familiar:

response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": "You are a concise email summarizer. Reply with 2-3 sentences only.",
        },
        {
            "role": "user",
            "content": f"Summarize this email.\n\nSubject: {subject}\n\n{body[:MAX_BODY_CHARS]}",
        },
    ],
    max_tokens=160,
    temperature=0.2,
)
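
The return value mirrors the OpenAI response dict as well. Extracting the text (the variable name is my own, not from the script) is one more line:

# create_chat_completion returns an OpenAI-style dict.
summary = response["choices"][0]["message"]["content"].strip()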

There is no API key. There is no network call. There is no cost per token. The model runs in-process and returns a response in a few seconds on commodity hardware.

The IMAP part

The script also connects to an email server using IMAP IDLE — a push mechanism that lets the server notify the client when new mail arrives, rather than polling. This is the least interesting part, which says something: plumbing that would have been the headline of the project a few years ago is now genuinely boring compared to the inference layer.
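
Still, for completeness, here is a minimal sketch of what the IDLE dance looks like with imapclient (which the script declares as a dependency). The host, the environment-variable credentials, and the loop structure are illustrative assumptions on my part, not lifted from the script:

import os

from imapclient import IMAPClient

# Minimal IDLE loop sketch with placeholder host and credentials.
with IMAPClient("imap.example.com", ssl=True) as client:
    client.login(os.environ["IMAP_USER"], os.environ["IMAP_PASSWORD"])
    client.select_folder("INBOX")
    while True:
        client.idle()                               # ask the server to push updates
        responses = client.idle_check(timeout=300)  # block until news (or timeout)
        client.idle_done()                          # exit IDLE before other commands
        if responses:                               # e.g. [(5, b"EXISTS")]
            for uid in client.search(["UNSEEN"]):
                raw = client.fetch([uid], ["RFC822"])[uid][b"RFC822"]
                # parse `raw` and hand it to the summarizer

Re-entering IDLE on each pass matters: RFC 2177 advises re-issuing the command at least every 29 minutes, since servers may drop idle connections.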


The full script is available as a gist if you want to run it yourself. The only prerequisites are Python 3.11+ and uv. Everything else — the model, the dependencies — sorts itself out on first run.

Local LLMs are not a research project anymore. They’re a dependency you can pip-install. What will you do with it?