Streamline Local LLM App Development with Docker Compose
Learn to set up a self-contained local environment for LLM app development using Docker Compose. Deploy vector stores, open-source models, and FastAPI for a streamlined build process.
Developing applications powered by Large Language Models (LLMs) often involves juggling multiple services: the LLM itself, a vector store for retrieval-augmented generation (RAG), and your application's backend logic. Setting up these components locally can quickly become a tangled mess of dependencies, conflicting port numbers, and environment variables. This is where Docker Compose shines.
Docker Compose allows you to define and run multi-container Docker applications. It's a fantastic developer tool for creating a self-contained, reproducible, and isolated local development environment for your LLM applications. By using Docker Compose, you can spin up your entire stack (including open-source models, vector stores, and your FastAPI backend) with a single command, freeing you to focus on building features rather than wrestling with infrastructure.
This post will guide you through setting up a robust local full-stack LLM development environment using Docker Compose, featuring:
- FastAPI: A popular Python web framework for building our application's API.
- Qdrant: A powerful vector database, perfect for storing and retrieving embeddings for RAG.
- Ollama: A flexible tool to run various open-source LLMs locally.
Why Docker Compose is Your LLM Dev Friend
Imagine a scenario where your LLM application needs a specific version of a vector store, a particular Python environment, and a dedicated GPU setup for a local model. Without containerization, this can be a headache. Docker Compose simplifies this by:
- Isolation: Each service runs in its own container, preventing dependency conflicts and ensuring a clean environment.
- Reproducibility: Your entire team (or even your future self) can spin up the exact same development environment, guaranteeing consistency. This is key for AI/ML-in-production readiness, starting from dev.
- Simplicity: Define your entire stack in a single docker-compose.yml file, then manage it with simple commands.
- Portability: Your setup works consistently across different operating systems.
Anatomy of Our Local LLM Stack
Before we dive into the docker-compose.yml, let's briefly look at the role of each service:
FastAPI Application (Our Backend)
This will be the heart of our LLM application. It exposes API endpoints for user interaction, orchestrates calls to the vector store (Qdrant) to fetch relevant context, and sends prompts to the LLM (via Ollama) to generate responses. We'll containerize this Python application using a Dockerfile.
Qdrant (Our Vector Store)
Qdrant is an open-source vector similarity search engine. For LLM applications, it's indispensable for RAG, allowing us to store high-dimensional vector embeddings of our data (documents, articles, etc.) and retrieve the most semantically similar pieces of information to augment our LLM prompts. Running Qdrant in a Docker container is straightforward and ensures it's always available.
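To make that concrete, here is a minimal sketch of what a similarity search against Qdrant looks like from Python. The collection name my_documents and the 768-dimension query vector are illustrative assumptions; in a real app the query vector would come from an embedding model.

```python
# Minimal sketch: nearest-neighbour search against a local Qdrant instance.
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

# Placeholder query vector; a real app would embed the user's query text.
query_vector = [0.1] * 768

hits = client.search(
    collection_name="my_documents",  # assumed collection name
    query_vector=query_vector,
    limit=3,  # return the 3 most similar points
)
for hit in hits:
    print(hit.score, hit.payload.get("text"))
```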
Ollama (Our Local LLM Runner)
Ollama makes it super easy to run a variety of open-source LLMs like Llama 2, Mistral, Gemma, and more, directly on your local machine. It provides a simple API endpoint that our FastAPI application can hit. Incorporating Ollama into our Docker Compose setup means our LLM is just another service, ready to go.
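As a quick illustration, once Ollama is running (and a model such as llama2 has been pulled), talking to it from Python takes only a few lines. A hedged sketch using the official ollama client:

```python
# Minimal sketch: a one-off chat call against a local Ollama server.
from ollama import Client

client = Client(host="http://localhost:11434")  # Ollama's default API port
response = client.chat(
    model="llama2",  # assumes this model has already been pulled
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```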
Setting Up Your Environment
Let's build our docker-compose.yml step-by-step.
First, create a project directory:
```bash
mkdir llm-dev-stack
cd llm-dev-stack
```
1. The docker-compose.yml File
This file will define all our services. Create docker-compose.yml in your project root:
```yaml
version: '3.8'

services:
  fastapi_app:
    build: ./fastapi_app
    ports:
      - "8000:8000"
    environment:
      # These variables will be available inside the FastAPI container
      QDRANT_HOST: qdrant
      QDRANT_PORT: 6333
      OLLAMA_HOST: ollama
      OLLAMA_PORT: 11434
    depends_on:
      - qdrant
      - ollama
    volumes:
      - ./fastapi_app:/app # Mount your app code for live reloading/easier development
    networks:
      - llm_network

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333" # Qdrant API port
      - "6334:6334" # Qdrant gRPC port
    volumes:
      - qdrant_data:/qdrant/storage
    networks:
      - llm_network

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    # Ollama can benefit from a GPU; uncomment if you have one and Docker is configured
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    networks:
      - llm_network

networks:
  llm_network:
    driver: bridge

volumes:
  qdrant_data:
  ollama_models:
```
A few important notes on this docker-compose.yml:
- fastapi_app: We use build: ./fastapi_app to build our FastAPI service from a Dockerfile located in the fastapi_app directory. depends_on ensures qdrant and ollama start before fastapi_app.
- qdrant: Uses the official Qdrant Docker image. We map port 6333 for its REST API. A named volume qdrant_data persists our vector store data.
- ollama: Uses the official Ollama Docker image. Port 11434 is mapped for its API. The ollama_models volume persists downloaded LLM models, so you don't re-download them every time. The commented-out deploy section shows how you might enable GPU access for Ollama if your Docker setup supports it, which is crucial for performance when running larger models.
- networks: All services are connected to a custom llm_network, allowing them to communicate with each other using their service names (e.g., fastapi_app can reach Qdrant at qdrant:6333) — see the quick check after this list.
- volumes: Named volumes (qdrant_data, ollama_models) ensure that the data for Qdrant and your downloaded LLM models persist even if containers are removed, making your development experience smoother.
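To see that service-name DNS in action once the stack is up, you can run a quick connectivity check from inside the fastapi_app container. This is an illustrative sketch, not part of the app; the host and port names come from the environment variables defined in docker-compose.yml:

```python
# check_services.py -- hypothetical helper, run with:
#   docker compose exec fastapi_app python check_services.py
import os
import socket

for name, default_port in [("qdrant", 6333), ("ollama", 11434)]:
    host = os.getenv(f"{name.upper()}_HOST", name)
    port = int(os.getenv(f"{name.upper()}_PORT", default_port))
    # Opens a plain TCP connection; raises if the service is unreachable.
    with socket.create_connection((host, port), timeout=5):
        print(f"{name} reachable at {host}:{port}")
```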
2. The FastAPI Application
Create a directory fastapi_app in your project root:
mkdir fastapi_app
Inside fastapi_app, create a Dockerfile:
```dockerfile
# fastapi_app/Dockerfile
FROM python:3.10-slim-buster

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Still inside fastapi_app, create requirements.txt:
```text
# fastapi_app/requirements.txt
fastapi
uvicorn
python-multipart
qdrant-client
ollama
pydantic
```
And finally, our fastapi_app/main.py – a simple example demonstrating interaction:
```python
# fastapi_app/main.py
import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from qdrant_client import QdrantClient, models
from ollama import Client as OllamaClient

app = FastAPI()

# Get Qdrant and Ollama hosts/ports from environment variables
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
OLLAMA_PORT = os.getenv("OLLAMA_PORT", "11434")

# Initialize clients (they will use the service names on the Docker Compose network)
qdrant_client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
ollama_client = OllamaClient(host=f"http://{OLLAMA_HOST}:{OLLAMA_PORT}")


class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama2"  # Default to llama2; ensure you 'ollama pull llama2'


@app.on_event("startup")
async def startup_event():
    print(f"Connecting to Qdrant at {QDRANT_HOST}:{QDRANT_PORT}")
    print(f"Connecting to Ollama at {OLLAMA_HOST}:{OLLAMA_PORT}")
    try:
        # Example: ensure a collection exists in Qdrant
        qdrant_client.recreate_collection(
            collection_name="my_documents",
            vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
        )
        print("Qdrant collection 'my_documents' ensured.")
    except Exception as e:
        print(f"Could not connect to Qdrant or create collection: {e}")

    try:
        # Example: check Ollama availability
        ollama_client.list()  # Simple call to check if Ollama is running
        print("Ollama is reachable.")
        # You might want to pull a model here, but it's better to do it manually first
        # to ensure it's downloaded into the persistent volume,
        # e.g. docker compose exec ollama ollama run llama2
    except Exception as e:
        print(f"Could not connect to Ollama: {e}. Please ensure Ollama is running and your model is pulled.")


@app.get("/")
async def root():
    return {"message": "LLM Dev Stack is running!"}


@app.post("/chat")
async def chat_with_ollama(request: ChatRequest):
    try:
        # Here you could perform RAG using qdrant_client first,
        # e.g. retrieve context based on request.prompt,
        # then augment the prompt and send it to Ollama.
        response = ollama_client.chat(model=request.model, messages=[
            {'role': 'user', 'content': request.prompt},
        ])
        return {"response": response['message']['content']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Ollama chat failed: {e}")


# Example endpoint for Qdrant (very basic, for demonstration)
@app.post("/documents")
async def add_document(text: str):
    # In a real app, you'd embed this text using an embedding model.
    # For this example, we just add a dummy point.
    try:
        qdrant_client.upsert(
            collection_name="my_documents",
            points=[
                models.PointStruct(
                    id=0,  # Use a proper ID in a real app
                    vector=[0.1] * 768,  # Placeholder vector
                    payload={"text": text},
                )
            ],
        )
        return {"message": "Document added (dummy vector)."}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Qdrant operation failed: {e}")
```
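In a real application, the placeholder vector in /documents would be replaced by an actual embedding. As a hedged sketch, here is what that step could look like using Ollama's embeddings endpoint; it assumes an embedding model such as nomic-embed-text has been pulled into the ollama service (nomic-embed-text produces 768-dimensional vectors, matching the collection's size above):

```python
# Hypothetical embedding step for /documents, using the Ollama Python client.
# Assumes 'nomic-embed-text' has been pulled into the ollama service.
embedding_response = ollama_client.embeddings(model="nomic-embed-text", prompt=text)
vector = embedding_response["embedding"]  # length must match the collection's vector size

qdrant_client.upsert(
    collection_name="my_documents",
    points=[models.PointStruct(id=1, vector=vector, payload={"text": text})],
)
```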
Running Your Stack
Now for the magic! Navigate back to your project root (llm-dev-stack) where docker-compose.yml is located.
- Start Services:

  ```bash
  docker compose up -d
  ```

  The -d flag runs the services in detached mode (in the background).

- Pull an LLM with Ollama: Before your FastAPI app can use an LLM, Ollama needs to download it. You can interact with the Ollama service directly:

  ```bash
  docker compose exec ollama ollama run llama2  # or mistral, gemma, etc.
  ```

  This command prompts Ollama to download the llama2 model. Once it's downloaded, you can exit the interactive session. The model will be persisted in your ollama_models volume.

- Verify Services:

  ```bash
  docker compose ps
  ```

  You should see fastapi_app, qdrant, and ollama all in a running state.

- Test Your FastAPI App: Open your browser to http://localhost:8000. You should see {"message": "LLM Dev Stack is running!"}. You can also test the /chat endpoint using curl or a tool like Postman/Insomnia:

  ```bash
  curl -X POST http://localhost:8000/chat \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a short story about a brave knight.", "model": "llama2"}'
  ```

  Remember to replace "llama2" with the model you pulled.

- Stop Services: When you're done, simply:

  ```bash
  docker compose down
  ```

  This stops and removes the containers and networks. Your Qdrant and Ollama data volumes will persist.
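You can also exercise the /documents endpoint. Note that FastAPI treats the bare text parameter as a query parameter, so a quick smoke test looks like this (a hedged sketch using the requests library, which you'd install separately):

```python
# Hypothetical smoke test for the /documents endpoint.
import requests  # not in requirements.txt; pip install requests

resp = requests.post(
    "http://localhost:8000/documents",
    params={"text": "Sir Cadogan was the bravest knight in the castle."},
)
print(resp.status_code, resp.json())
```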
Expanding Your Stack
This is just the starting point! Your LLM application can grow to include:
- Embeddings Service: A dedicated service for generating embeddings from text (e.g., using Hugging Face models via a transformers container).
- Monitoring & Logging: Tools like Prometheus/Grafana or the ELK stack can be easily integrated as additional Docker Compose services.
- Redis/Caching: For caching LLM responses or managing session data (see the sketch after this list).
- Other LLMs: Easily swap ollama for other model-serving solutions like vLLM or text-generation-inference for higher throughput in production-like scenarios.
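For example, adding a Redis cache is just one more service entry in docker-compose.yml. A hedged sketch (the image tag and port mapping are illustrative):

```yaml
  # Hypothetical addition under the existing 'services:' key
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    networks:
      - llm_network
```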
This setup forms a solid foundation for taking AI/ML systems to production, starting with robust local tooling.
Conclusion
Leveraging Docker Compose significantly streamlines the development workflow for LLM applications. By encapsulating your FastAPI, vector store (Qdrant), and local LLM (Ollama) services, you gain a self-contained, reproducible, and easy-to-manage development environment. This approach allows you to focus on the exciting parts (building intelligent applications) rather than getting bogged down in environment setup. Start building your next innovative full-stack LLM application with confidence and efficiency!