Lessons Learnt from Self-Hosting an AI Assistant
A practical case study on designing, deploying, and operating a self-hosted AI assistant—covering architecture decisions, scaling challenges, cost trade-offs, and hard-won operational lessons.
- Type: Technology case study
- Platform: Kubernetes-based deployment with autoscaling
- Core Ideas: LLM routing, reliability, observability, cost control
- Date: January 2025

Overview
This project explores the real-world challenges of self-hosting an AI assistant instead of relying purely on managed SaaS offerings.
The goal was not to build a demo chatbot but to build a production-grade AI assistant that could:
- Route requests across multiple LLM providers
- Support reasoning-heavy models
- Scale under bursty user traffic
- Remain cost-controlled and observable
- Run within a secure, self-managed cloud environment
Why Self-Host?
The motivation came from practical constraints: avoiding vendor lock-in, enabling reasoning models alongside faster models, controlling routing and fallback behaviour, and understanding the true operational cost of running AI systems.
Self-hosting offered flexibility—but introduced complexity that is easy to underestimate.
High-Level Architecture
- Single API endpoint exposed to clients
- LLM routing layer abstracting multiple providers
- Kubernetes-based deployment with autoscaling
- Persistent storage for embeddings and metadata
- Observability stack for latency, errors, and cost signals
Key principle: Treat LLMs as unreliable, bursty dependencies. Design accordingly.
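As a concrete illustration of that principle, the sketch below wraps a provider call with a timeout and bounded, jittered retries. It is a minimal sketch, not the project's actual code; the helper name and default limits are assumptions.

```python
import asyncio
import random

async def call_with_retries(call, *, attempts=3, timeout_s=30.0, base_delay_s=0.5):
    """Call an unreliable async LLM dependency with a timeout and bounded,
    jittered exponential backoff. Re-raises the last error if all attempts fail."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_exc = exc
            # Jittered backoff so stacked retries don't synchronise into a new burst.
            await asyncio.sleep(base_delay_s * (2 ** attempt) * (0.5 + random.random()))
    raise last_exc
```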
Core Components
LLM Routing Layer
A unified proxy routed requests to different models based on task type (reasoning vs generation), cost constraints, latency requirements, and fallback availability—simplifying application logic and enabling controlled experimentation.
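A minimal sketch of that routing idea follows; the model names, prices, and latency figures are placeholders for illustration, not the real catalogue.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelProfile:
    name: str                  # provider/model identifier (hypothetical examples below)
    supports_reasoning: bool
    cost_per_1k_tokens: float  # blended price, assumed for illustration
    p95_latency_s: float

# Assumed catalogue; real profiles would come from measured latency and billed cost.
CATALOGUE = [
    ModelProfile("provider-a/reasoning-large", True, 0.015, 12.0),
    ModelProfile("provider-a/general-small", False, 0.002, 1.5),
    ModelProfile("provider-b/general-medium", False, 0.004, 2.5),
]

def route(task_type: str, max_cost_per_1k: float, max_latency_s: float) -> list[ModelProfile]:
    """Return an ordered fallback chain: every model that satisfies the task's
    constraints, cheapest first. The caller tries them in order."""
    needs_reasoning = task_type == "reasoning"
    candidates = [
        m for m in CATALOGUE
        if (m.supports_reasoning or not needs_reasoning)
        and m.cost_per_1k_tokens <= max_cost_per_1k
        and m.p95_latency_s <= max_latency_s
    ]
    return sorted(candidates, key=lambda m: m.cost_per_1k_tokens)
```

With this shape, `route("generation", max_cost_per_1k=0.005, max_latency_s=3.0)` yields the two cheaper general models as a fallback chain, and the application never hard-codes a provider.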
Kubernetes Deployment
Kubernetes provided HPA, resource isolation, rolling deployments, and failure recovery. It also amplified problems when traffic patterns weren’t well understood.
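For illustration, this is roughly how an HPA of the kind used here can be created with the official Kubernetes Python client. The deployment name, namespace, replica bounds, and CPU target are hypothetical; the project's actual manifests are not reproduced here.

```python
from kubernetes import client, config

def create_hpa() -> None:
    """Create a CPU-based HorizontalPodAutoscaler for a hypothetical
    'assistant-api' Deployment. All names and thresholds are illustrative."""
    config.load_kube_config()  # use config.load_incluster_config() inside the cluster
    hpa = client.V2HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="assistant-api"),
        spec=client.V2HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="assistant-api"),
            min_replicas=2,
            max_replicas=20,
            metrics=[client.V2MetricSpec(
                type="Resource",
                resource=client.V2ResourceMetricSource(
                    name="cpu",
                    target=client.V2MetricTarget(type="Utilization", average_utilization=60),
                ),
            )],
        ),
    )
    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
        namespace="default", body=hpa)
```

The autoscaler only reacts to the metrics it is given; if the CPU target does not reflect how reasoning-heavy requests actually load a pod, scaling lags the traffic that causes it.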
Storage Layer
Storage requirements varied (vector storage for embeddings, metadata for request tracking, and temporary object storage). A key lesson: storage decisions directly impact scaling behaviour.
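One pattern that pays off is keeping each storage concern behind a narrow interface so backends can be swapped as scaling needs change. The sketch below is illustrative only, assuming a simple vector-store interface; it is not the project's actual storage code.

```python
from typing import Protocol
import math

class VectorStore(Protocol):
    """Minimal interface the assistant depends on; concrete backends
    (in-memory, managed vector DB, etc.) can be swapped without app changes."""
    def upsert(self, key: str, embedding: list[float]) -> None: ...
    def nearest(self, query: list[float], k: int) -> list[str]: ...

class InMemoryVectorStore:
    """Toy backend for local development; a stand-in, not a production store."""
    def __init__(self) -> None:
        self._items: dict[str, list[float]] = {}

    def upsert(self, key: str, embedding: list[float]) -> None:
        self._items[key] = embedding

    def nearest(self, query: list[float], k: int) -> list[str]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / (norm or 1.0)
        ranked = sorted(self._items, key=lambda key_: cosine(query, self._items[key_]), reverse=True)
        return ranked[:k]
```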
Key Challenges
1) Traffic Bursts and Autoscaling
- Small increases in users caused large spikes in concurrent requests
- Reasoning models amplified CPU and memory usage
- Cold starts hurt user experience more than expected
2) Resource Exhaustion Under Load
Peak times triggered saturated CPU, memory pressure, pod restarts, and cascading failures when retries stacked up. Mitigations included concurrency controls, circuit breakers in the routing layer, and explicit back-pressure.
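The shape of those mitigations, in a minimal sketch; the semaphore size, failure threshold, and reset window are illustrative values, not the ones used in production.

```python
import asyncio
import time

class CircuitBreaker:
    """Small circuit breaker: after max_failures consecutive failures the circuit
    opens for reset_s seconds and calls fail fast instead of piling up as retries."""
    def __init__(self, max_failures: int = 5, reset_s: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.max_failures:
            return True
        # Open: fail fast until reset_s has passed, then let probes through (half-open).
        return time.monotonic() - self.opened_at >= self.reset_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

# Bound concurrent upstream calls so retries cannot stack without limit (back-pressure).
MAX_IN_FLIGHT = asyncio.Semaphore(32)

async def guarded_call(call, breaker: CircuitBreaker):
    if not breaker.allow():
        raise RuntimeError("circuit open: shedding load instead of queueing")
    async with MAX_IN_FLIGHT:
        try:
            result = await call()
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            raise
```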
3) Storage Doesn’t Scale Like Compute
Once HPA was introduced, persistent volumes and stateful components became bottlenecks—forcing a storage architecture rethink.
4) Cost Visibility Lag
Infrastructure costs were predictable; model usage costs were not. Without strong observability, traffic bursts translated into sudden, hard-to-attribute spikes in spend.
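Closing that lag means estimating spend from token counts at request time rather than waiting for the invoice. A minimal sketch, assuming placeholder model names and per-1k-token prices:

```python
from collections import defaultdict

# Assumed prices for illustration; real figures come from the providers' pricing pages.
PRICE_PER_1K = {
    "provider-a/reasoning-large": {"input": 0.010, "output": 0.030},
    "provider-a/general-small": {"input": 0.001, "output": 0.002},
}

class CostTracker:
    """Accumulate estimated spend per model as responses come back, so a traffic
    burst shows up in the spend estimate within seconds, not on the next bill."""
    def __init__(self) -> None:
        self.spend = defaultdict(float)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        price = PRICE_PER_1K[model]
        cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
        self.spend[model] += cost
        return cost
```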
Observability and Control
To regain control, the following became essential:
- Request-level tracing
- Per-model latency metrics
- Token usage tracking
- Error rate monitoring by provider
- Alerting on abnormal traffic patterns
AI systems without observability degrade silently.
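As one way to cover the latency, token, and error signals above, the sketch below uses the Prometheus Python client; the metric names, labels, and port are assumptions rather than the project's actual schema, and the article does not prescribe this particular stack.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric and label names, not the project's actual schema.
REQUEST_LATENCY = Histogram(
    "assistant_request_latency_seconds", "End-to-end request latency",
    ["model", "provider"],
)
TOKENS_USED = Counter(
    "assistant_tokens_total", "Tokens consumed, by model and direction",
    ["model", "direction"],
)
ERRORS = Counter(
    "assistant_errors_total", "Upstream errors, by provider and error type",
    ["provider", "error_type"],
)

def record_request(model: str, provider: str, latency_s: float,
                   input_tokens: int, output_tokens: int) -> None:
    """Call once per completed request; alert on latency, error-rate,
    and token-burn anomalies from the scraped series."""
    REQUEST_LATENCY.labels(model=model, provider=provider).observe(latency_s)
    TOKENS_USED.labels(model=model, direction="input").inc(input_tokens)
    TOKENS_USED.labels(model=model, direction="output").inc(output_tokens)

def record_error(provider: str, error_type: str) -> None:
    """Call from the error path of each provider call."""
    ERRORS.labels(provider=provider, error_type=error_type).inc()

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping
```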
What Worked Well
- Unified LLM routing simplified experimentation
- Kubernetes enabled rapid iteration
- Autoscaling prevented total outages
- Abstracting providers reduced vendor risk
What I Would Do Differently
- Design storage for scale from day one
- Add cost and token observability earlier
- Assume traffic bursts, not linear growth
- Treat reasoning models as a separate capacity class
- Invest more in realistic load testing
Key Takeaways
- Self-hosting AI assistants is powerful, but operationally expensive
- Kubernetes solves infrastructure problems, not system design problems
- Storage and cost become first-class concerns quickly
- Observability is not optional for AI platforms
- Flexibility always trades off against simplicity
Who This Is For
- Data and platform engineers building AI systems
- Teams weighing self-hosted vs managed AI platforms
- Architects designing LLM-powered products
- Leaders evaluating cost, scale, and reliability trade-offs
Related Links
- Original article: Lessons Learnt Self-hosting an AI Assistant
- Topics: AI Platforms, LLM Infrastructure, MLOps, Kubernetes, System Design