Streamline Local LLM App Development with Docker Compose
Learn to set up a self-contained local environment for LLM app development using Docker Compose. Deploy vector stores, open-source models, and FastAPI for a streamlined build process.
Developing applications powered by Large Language Models (LLMs) often involves juggling multiple services: the LLM itself, a vector store for retrieval-augmented generation (RAG), and your application's backend logic. Setting up these components locally can quickly become a tangled mess of dependencies, conflicting port numbers, and environment variables. This is where Docker Compose shines.
Docker Compose allows you to define and run multi-container Docker applications. It's a fantastic developer tool for creating a self-contained, reproducible, and isolated local development environment for your LLM applications. By using Docker Compose, you can spin up your entire stack (including open-source models, vector stores, and your FastAPI backend) with a single command, freeing you to focus on building features rather than wrestling with infrastructure.
This post will guide you through setting up a robust local full-stack LLM development environment using Docker Compose, featuring:
- FastAPI: A popular Python web framework for building our application's API.
- Qdrant: A powerful vector database, perfect for storing and retrieving embeddings for RAG.
- Ollama: A flexible tool to run various open-source LLMs locally.
Why Docker Compose is Your LLM Dev Friend
Imagine a scenario where your LLM application needs a specific version of a vector store, a particular Python environment, and a dedicated GPU setup for a local model. Without containerization, this can be a headache. Docker Compose simplifies this by:
- Isolation: Each service runs in its own container, preventing dependency conflicts and ensuring a clean environment.
- Reproducibility: Your entire team (or even your future self) can spin up the exact same development environment, guaranteeing consistency. This is key for AI/ML-in-production readiness, starting from dev.
- Simplicity: Define your entire stack in a single docker-compose.yml file, then manage it with simple commands.
- Portability: Your setup works consistently across different operating systems.
Anatomy of Our Local LLM Stack
Before we dive into the docker-compose.yml, let's briefly look at the role of each service:
FastAPI Application (Our Backend)
This will be the heart of our LLM application. It exposes API endpoints for user interaction, orchestrates calls to the vector store (Qdrant) to fetch relevant context, and sends prompts to the LLM (via Ollama) to generate responses. We'll containerize this Python application using a Dockerfile.
Qdrant (Our Vector Store)
Qdrant is an open-source vector similarity search engine. For LLM applications, it's indispensable for RAG, allowing us to store high-dimensional vector embeddings of our data (documents, articles, etc.) and retrieve the most semantically similar pieces of information to augment our LLM prompts. Running Qdrant in a Docker container is straightforward and ensures it's always available.
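To make that concrete, here is a minimal sketch of what a similarity search against Qdrant looks like from Python. The collection name my_documents and the 768-dimension query vector are illustrative assumptions; in a real app the query vector would come from an embedding model.

```python
# Minimal sketch: nearest-neighbour search against a local Qdrant instance.
from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

# Placeholder query vector; a real app would embed the user's query text.
query_vector = [0.1] * 768

hits = client.search(
    collection_name="my_documents",  # assumed collection name
    query_vector=query_vector,
    limit=3,  # return the 3 most similar points
)
for hit in hits:
    print(hit.score, hit.payload.get("text"))
```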
Ollama (Our Local LLM Runner)
Ollama makes it super easy to run a variety of open-source LLMs like Llama 2, Mistral, Gemma, and more, directly on your local machine. It provides a simple API endpoint that our FastAPI application can hit. Incorporating Ollama into our Docker Compose setup means our LLM is just another service, ready to go.
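As a quick illustration, once Ollama is running (and a model such as llama2 has been pulled), talking to it from Python takes only a few lines. A hedged sketch using the official ollama client:

```python
# Minimal sketch: a one-off chat call against a local Ollama server.
from ollama import Client

client = Client(host="http://localhost:11434")  # Ollama's default API port
response = client.chat(
    model="llama2",  # assumes this model has already been pulled
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])
```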
Setting Up Your Environment
Let's build our docker-compose.yml step-by-step.
First, create a project directory:
```bash
mkdir llm-dev-stack
cd llm-dev-stack
```
1. The docker-compose.yml File
This file will define all our services. Create docker-compose.yml in your project root:
```yaml
version: '3.8'

services:
  fastapi_app:
    build: ./fastapi_app
    ports:
      - "8000:8000"
    environment:
      # These variables will be available inside the FastAPI container
      QDRANT_HOST: qdrant
      QDRANT_PORT: 6333
      OLLAMA_HOST: ollama
      OLLAMA_PORT: 11434
    depends_on:
      - qdrant
      - ollama
    volumes:
      - ./fastapi_app:/app # Mount your app code for live reloading/easier development
    networks:
      - llm_network

  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333" # Qdrant API port
      - "6334:6334" # Qdrant gRPC port
    volumes:
      - qdrant_data:/qdrant/storage
    networks:
      - llm_network

  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    # Ollama can benefit from a GPU; uncomment if you have one and Docker is configured
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    networks:
      - llm_network

networks:
  llm_network:
    driver: bridge

volumes:
  qdrant_data:
  ollama_models:
```
A few important notes on this docker-compose.yml:
- fastapi_app: We use build: ./fastapi_app to build our FastAPI service from a Dockerfile located in the fastapi_app directory. depends_on ensures qdrant and ollama start before fastapi_app.
- qdrant: Uses the official Qdrant Docker image. We map port 6333 for its REST API. A named volume qdrant_data persists our vector store data.
- ollama: Uses the official Ollama Docker image. Port 11434 is mapped for its API. The ollama_models volume persists downloaded LLM models, so you don't re-download them every time. The commented-out deploy section shows how you might enable GPU access for Ollama if your Docker setup supports it, which is crucial for performance when running larger models.
- networks: All services are connected to a custom llm_network, allowing them to communicate with each other using their service names (e.g., fastapi_app can reach Qdrant at qdrant:6333) — see the quick check after this list.
- volumes: Named volumes (qdrant_data, ollama_models) ensure that the data for Qdrant and your downloaded LLM models persist even if containers are removed, making your development experience smoother.
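To see that service-name DNS in action once the stack is up, you can run a quick connectivity check from inside the fastapi_app container. This is an illustrative sketch, not part of the app; the host and port names come from the environment variables defined in docker-compose.yml:

```python
# check_services.py -- hypothetical helper, run with:
#   docker compose exec fastapi_app python check_services.py
import os
import socket

for name, default_port in [("qdrant", 6333), ("ollama", 11434)]:
    host = os.getenv(f"{name.upper()}_HOST", name)
    port = int(os.getenv(f"{name.upper()}_PORT", default_port))
    # Opens a plain TCP connection; raises if the service is unreachable.
    with socket.create_connection((host, port), timeout=5):
        print(f"{name} reachable at {host}:{port}")
```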
2. The FastAPI Application
Create a directory fastapi_app in your project root:
mkdir fastapi_app
Inside fastapi_app, create a Dockerfile:
```dockerfile
# fastapi_app/Dockerfile
FROM python:3.10-slim-buster

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Still inside fastapi_app, create requirements.txt:
```text
# fastapi_app/requirements.txt
fastapi
uvicorn
python-multipart
qdrant-client
ollama
pydantic
```
And finally, our fastapi_app/main.py – a simple example demonstrating interaction:
```python
# fastapi_app/main.py
import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from qdrant_client import QdrantClient, models
from ollama import Client as OllamaClient

app = FastAPI()

# Get Qdrant and Ollama hosts/ports from environment variables
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "localhost")
OLLAMA_PORT = os.getenv("OLLAMA_PORT", "11434")

# Initialize clients (they will use the service names on the Docker Compose network)
qdrant_client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
ollama_client = OllamaClient(host=f"http://{OLLAMA_HOST}:{OLLAMA_PORT}")


class ChatRequest(BaseModel):
    prompt: str
    model: str = "llama2"  # Default to llama2; ensure you 'ollama pull llama2'


@app.on_event("startup")
async def startup_event():
    print(f"Connecting to Qdrant at {QDRANT_HOST}:{QDRANT_PORT}")
    print(f"Connecting to Ollama at {OLLAMA_HOST}:{OLLAMA_PORT}")
    try:
        # Example: ensure a collection exists in Qdrant
        qdrant_client.recreate_collection(
            collection_name="my_documents",
            vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
        )
        print("Qdrant collection 'my_documents' ensured.")
    except Exception as e:
        print(f"Could not connect to Qdrant or create collection: {e}")

    try:
        # Example: check Ollama availability
        ollama_client.list()  # Simple call to check if Ollama is running
        print("Ollama is reachable.")
        # You might want to pull a model here, but it's better to do it manually first
        # to ensure it's downloaded into the persistent volume,
        # e.g. docker compose exec ollama ollama run llama2
    except Exception as e:
        print(f"Could not connect to Ollama: {e}. Please ensure Ollama is running and your model is pulled.")


@app.get("/")
async def root():
    return {"message": "LLM Dev Stack is running!"}


@app.post("/chat")
async def chat_with_ollama(request: ChatRequest):
    try:
        # Here you could perform RAG using qdrant_client first,
        # e.g. retrieve context based on request.prompt,
        # then augment the prompt and send it to Ollama.
        response = ollama_client.chat(model=request.model, messages=[
            {'role': 'user', 'content': request.prompt},
        ])
        return {"response": response['message']['content']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Ollama chat failed: {e}")


# Example endpoint for Qdrant (very basic, for demonstration)
@app.post("/documents")
async def add_document(text: str):
    # In a real app, you'd embed this text using an embedding model.
    # For this example, we just add a dummy point.
    try:
        qdrant_client.upsert(
            collection_name="my_documents",
            points=[
                models.PointStruct(
                    id=0,  # Use a proper ID in a real app
                    vector=[0.1] * 768,  # Placeholder vector
                    payload={"text": text},
                )
            ],
        )
        return {"message": "Document added (dummy vector)."}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Qdrant operation failed: {e}")
```
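In a real application, the placeholder vector in /documents would be replaced by an actual embedding. As a hedged sketch, here is what that step could look like using Ollama's embeddings endpoint; it assumes an embedding model such as nomic-embed-text has been pulled into the ollama service (nomic-embed-text produces 768-dimensional vectors, matching the collection's size above):

```python
# Hypothetical embedding step for /documents, using the Ollama Python client.
# Assumes 'nomic-embed-text' has been pulled into the ollama service.
embedding_response = ollama_client.embeddings(model="nomic-embed-text", prompt=text)
vector = embedding_response["embedding"]  # length must match the collection's vector size

qdrant_client.upsert(
    collection_name="my_documents",
    points=[models.PointStruct(id=1, vector=vector, payload={"text": text})],
)
```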
Running Your Stack
Now for the magic! Navigate back to your project root (llm-dev-stack) where docker-compose.yml is located.
- Start Services:

  ```bash
  docker compose up -d
  ```

  The -d flag runs the services in detached mode (in the background).

- Pull an LLM with Ollama: Before your FastAPI app can use an LLM, Ollama needs to download it. You can interact with the Ollama service directly:

  ```bash
  docker compose exec ollama ollama run llama2  # or mistral, gemma, etc.
  ```

  This command prompts Ollama to download the llama2 model. Once it's downloaded, you can exit the interactive session. The model will be persisted in your ollama_models volume.

- Verify Services:

  ```bash
  docker compose ps
  ```

  You should see fastapi_app, qdrant, and ollama all in a running state.

- Test Your FastAPI App: Open your browser to http://localhost:8000. You should see {"message": "LLM Dev Stack is running!"}. You can also test the /chat endpoint using curl or a tool like Postman/Insomnia:

  ```bash
  curl -X POST http://localhost:8000/chat \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Tell me a short story about a brave knight.", "model": "llama2"}'
  ```

  Remember to replace "llama2" with the model you pulled.

- Stop Services: When you're done, simply:

  ```bash
  docker compose down
  ```

  This stops and removes the containers and networks. Your Qdrant and Ollama data volumes will persist.
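You can also exercise the /documents endpoint. Note that FastAPI treats the bare text parameter as a query parameter, so a quick smoke test looks like this (a hedged sketch using the requests library, which you'd install separately):

```python
# Hypothetical smoke test for the /documents endpoint.
import requests  # not in requirements.txt; pip install requests

resp = requests.post(
    "http://localhost:8000/documents",
    params={"text": "Sir Cadogan was the bravest knight in the castle."},
)
print(resp.status_code, resp.json())
```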
Expanding Your Stack
This is just the starting point! Your LLM application can grow to include:
- Embeddings Service: A dedicated service for generating embeddings from text (e.g., using Hugging Face models via a transformers container).
- Monitoring & Logging: Tools like Prometheus/Grafana or the ELK stack can be easily integrated as additional Docker Compose services.
- Redis/Caching: For caching LLM responses or managing session data (see the sketch after this list).
- Other LLMs: Easily swap ollama for other model-serving solutions like vLLM or text-generation-inference for higher throughput in production-like scenarios.
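For example, adding a Redis cache is just one more service entry in docker-compose.yml. A hedged sketch (the image tag and port mapping are illustrative):

```yaml
  # Hypothetical addition under the existing 'services:' key
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    networks:
      - llm_network
```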
This setup forms a solid foundation for taking AI/ML systems to production, starting with robust local tooling.
Conclusion
Leveraging Docker Compose significantly streamlines the development workflow for LLM applications. By encapsulating your FastAPI, vector store (Qdrant), and local LLM (Ollama) services, you gain a self-contained, reproducible, and easy-to-manage development environment. This approach allows you to focus on the exciting parts (building intelligent applications) rather than getting bogged down in environment setup. Start building your next innovative full-stack LLM application with confidence and efficiency!