Lessons Learnt from Self-Hosting an AI Assistant
A practical case study on designing, deploying, and operating a self-hosted AI assistant—covering architecture decisions, scaling challenges, cost trade-offs, and hard-won operational lessons.
- Type: Technology case study
- Platform: Kubernetes-based deployment with autoscaling
- Core Ideas: LLM routing, reliability, observability, cost control
- Date: January 2025

Overview
This project explores the real-world challenges of self-hosting an AI assistant instead of relying purely on managed SaaS offerings.
The goal was not to build a demo chatbot but to build a production-grade AI assistant that could:
- Route requests across multiple LLM providers
- Support reasoning-heavy models
- Scale under bursty user traffic
- Remain cost-controlled and observable
- Run within a secure, self-managed cloud environment
Why Self-Host?
The motivation came from practical constraints: avoiding vendor lock-in, enabling reasoning models alongside faster models, controlling routing and fallback behaviour, and understanding the true operational cost of running AI systems.
Self-hosting offered flexibility—but introduced complexity that is easy to underestimate.
High-Level Architecture
- Single API endpoint exposed to clients
- LLM routing layer abstracting multiple providers
- Kubernetes-based deployment with autoscaling
- Persistent storage for embeddings and metadata
- Observability stack for latency, errors, and cost signals
Key principle: Treat LLMs as unreliable, bursty dependencies. Design accordingly.
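As a concrete illustration of that principle, the sketch below wraps a provider call with a timeout and bounded, jittered retries. It is a minimal sketch, not the project's actual code; the helper name and default limits are assumptions.

```python
import asyncio
import random

async def call_with_retries(call, *, attempts=3, timeout_s=30.0, base_delay_s=0.5):
    """Call an unreliable async LLM dependency with a timeout and bounded,
    jittered exponential backoff. Re-raises the last error if all attempts fail."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            # Jittered backoff so stacked retries don't synchronise into a new burst.
            await asyncio.sleep(base_delay_s * (2 ** attempt) * (0.5 + random.random()))
    raise last_exc
```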
Core Components
LLM Routing Layer
A unified proxy routed requests to different models based on task type (reasoning vs generation), cost constraints, latency requirements, and fallback availability—simplifying application logic and enabling controlled experimentation.
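A minimal sketch of that routing idea follows; the model names, prices, and latency figures are placeholders for illustration, not the real catalogue.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelProfile:
    name: str                  # provider/model identifier (hypothetical examples below)
    supports_reasoning: bool
    cost_per_1k_tokens: float  # blended price, assumed for illustration
    p95_latency_s: float

# Assumed catalogue; real profiles would come from measured latency and billed cost.
CATALOGUE = [
    ModelProfile("provider-a/reasoning-large", True, 0.015, 12.0),
    ModelProfile("provider-a/general-small", False, 0.002, 1.5),
    ModelProfile("provider-b/general-medium", False, 0.004, 2.5),
]

def route(task_type: str, max_cost_per_1k: float, max_latency_s: float) -> list[ModelProfile]:
    """Return an ordered fallback chain: every model that satisfies the task's
    constraints, cheapest first. The caller tries them in order."""
    needs_reasoning = task_type == "reasoning"
    candidates = [
        m for m in CATALOGUE
        if (m.supports_reasoning or not needs_reasoning)
        and m.cost_per_1k_tokens <= max_cost_per_1k
        and m.p95_latency_s <= max_latency_s
    ]
    return sorted(candidates, key=lambda m: m.cost_per_1k_tokens)
```

With this shape, `route("generation", max_cost_per_1k=0.005, max_latency_s=3.0)` yields the two cheaper general models as a fallback chain, and the application never hard-codes a provider.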
Kubernetes Deployment
Kubernetes provided HPA, resource isolation, rolling deployments, and failure recovery. It also amplified problems when traffic patterns weren’t well understood.
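For illustration, this is roughly how an HPA of the kind used here can be created with the official Kubernetes Python client. The deployment name, namespace, replica bounds, and CPU target are hypothetical; the project's actual manifests are not reproduced here.

```python
from kubernetes import client, config

def create_hpa() -> None:
    """Create a CPU-based HorizontalPodAutoscaler for a hypothetical
    'assistant-api' Deployment. All names and thresholds are illustrative."""
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    hpa = client.V2HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="assistant-api"),
        spec=client.V2HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="assistant-api"),
            min_replicas=2,
            max_replicas=20,
            metrics=[client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=60),
                ),
            )],
        ),
    )
    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa)
```

The autoscaler only reacts to the metrics it is given; if the CPU target does not reflect how reasoning-heavy requests actually load a pod, scaling lags the traffic that causes it.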
Storage Layer
Storage requirements varied (vector storage for embeddings, metadata for request tracking, and temporary object storage). A key lesson: storage decisions directly impact scaling behaviour.
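One pattern that pays off is keeping each storage concern behind a narrow interface so backends can be swapped as scaling needs change. The sketch below is illustrative only, assuming a simple vector-store interface; it is not the project's actual storage code.

```python
from typing import Protocol
import math

class VectorStore(Protocol):
    """Minimal interface the assistant depends on; concrete backends
    (in-memory, managed vector DB, etc.) can be swapped without app changes."""
    def upsert(self, key: str, embedding: list[float]) -> None: ...
    def nearest(self, query: list[float], k: int) -> list[str]: ...

class InMemoryVectorStore:
    """Toy backend for local development; a stand-in, not a production store."""
    def __init__(self) -> None:
        self._items: dict[str, list[float]] = {}

    def upsert(self, key: str, embedding: list[float]) -> None:
        self._items[key] = embedding

    def nearest(self, query: list[float], k: int) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / (norm or 1.0)
        ranked = sorted(self._items, key=lambda key_: cosine(query, self._items[key_]), reverse=True)
        return ranked[:k]
```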
Key Challenges
1) Traffic Bursts and Autoscaling
- Small increases in users caused large spikes in concurrent requests
- Reasoning models amplified CPU and memory usage
- Cold starts hurt user experience more than expected
2) Resource Exhaustion Under Load
Peak times triggered saturated CPU, memory pressure, pod restarts, and cascading failures when retries stacked up. Mitigations included concurrency controls, circuit breakers in the routing layer, and explicit back-pressure.
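The shape of those mitigations, in a minimal sketch; the semaphore size, failure threshold, and reset window are illustrative values, not the ones used in production.

```python
import asyncio
import time

class CircuitBreaker:
    """Small circuit breaker: after max_failures consecutive failures the circuit
    opens for reset_s seconds and calls fail fast instead of piling up as retries."""
    def __init__(self, max_failures: int = 5, reset_s: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True
        # Open: fail fast until reset_s has passed, then let probes through (half-open).
        return time.monotonic() - self.opened_at >= self.reset_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

# Bound concurrent upstream calls so retries cannot stack without limit (back-pressure).
MAX_IN_FLIGHT = asyncio.Semaphore(32)

async def guarded_call(call, breaker: CircuitBreaker):
    if not breaker.allow():
        raise RuntimeError("circuit open: shedding load instead of queueing")
    async with MAX_IN_FLIGHT:
        try:
            result = await call()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            raise
```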
3) Storage Doesn’t Scale Like Compute
Once HPA was introduced, persistent volumes and stateful components became bottlenecks—forcing a storage architecture rethink.
4) Cost Visibility Lag
Infrastructure costs were predictable; model usage costs were not. Without strong observability, traffic bursts translated into sudden, hard-to-attribute spikes in spend.
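Closing that lag means estimating spend from token counts at request time rather than waiting for the invoice. A minimal sketch, assuming placeholder model names and per-1k-token prices:

```python
from collections import defaultdict

# Assumed prices for illustration; real figures come from the providers' pricing pages.
PRICE_PER_1K = {
    "provider-a/reasoning-large": {"input": 0.010, "output": 0.030},
    "provider-a/general-small": {"input": 0.001, "output": 0.002},
}

class CostTracker:
    """Accumulate estimated spend per model as responses come back, so a traffic
    burst shows up in the spend estimate within seconds, not on the next bill."""
    def __init__(self) -> None:
        self.spend = defaultdict(float)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        price = PRICE_PER_1K[model]
        cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
        self.spend[model] += cost
        return cost
```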
Observability and Control
To regain control, the following became essential:
- Request-level tracing
- Per-model latency metrics
- Token usage tracking
- Error rate monitoring by provider
- Alerting on abnormal traffic patterns
AI systems without observability degrade silently.
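As one way to cover the latency, token, and error signals above, the sketch below uses the Prometheus Python client; the metric names, labels, and port are assumptions rather than the project's actual schema, and the article does not prescribe this particular stack.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric and label names, not the project's actual schema.
REQUEST_LATENCY = Histogram(
    "assistant_request_latency_seconds", "End-to-end request latency",
    ["model", "provider"],
)
TOKENS_USED = Counter(
    "assistant_tokens_total", "Tokens consumed, by model and direction",
    ["model", "direction"],
)
ERRORS = Counter(
    "assistant_errors_total", "Upstream errors, by provider and error type",
    ["provider", "error_type"],
)

def record_request(model: str, provider: str, latency_s: float,
                   input_tokens: int, output_tokens: int) -> None:
    """Call once per completed request; alert on latency, error-rate,
    and token-burn anomalies from the scraped series."""
    REQUEST_LATENCY.labels(model=model, provider=provider).observe(latency_s)
    TOKENS_USED.labels(model=model, direction="input").inc(input_tokens)
    TOKENS_USED.labels(model=model, direction="output").inc(output_tokens)

def record_error(provider: str, error_type: str) -> None:
    """Call from the error path of each provider call."""
    ERRORS.labels(provider=provider, error_type=error_type).inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping
```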
What Worked Well
- Unified LLM routing simplified experimentation
- Kubernetes enabled rapid iteration
- Autoscaling prevented total outages
- Abstracting providers reduced vendor risk
What I Would Do Differently
- Design storage for scale from day one
- Add cost and token observability earlier
- Assume traffic bursts, not linear growth
- Treat reasoning models as a separate capacity class
- Invest more in realistic load testing
Key Takeaways
- Self-hosting AI assistants is powerful, but operationally expensive
- Kubernetes solves infrastructure problems, not system design problems
- Storage and cost become first-class concerns quickly
- Observability is not optional for AI platforms
- Flexibility always trades off against simplicity
Who This Is For
- Data and platform engineers building AI systems
- Teams weighing self-hosted vs managed AI platforms
- Architects designing LLM-powered products
- Leaders evaluating cost, scale, and reliability trade-offs
Related Links
- Original article: Lessons Learnt Self-hosting an AI Assistant
- Topics: AI Platforms, LLM Infrastructure, MLOps, Kubernetes, System Design