End-to-End Production Pipeline: Deploying a Scalable Deep Learning Model on Kubernetes
You’ve spent weeks training a model that finally beats the baseline. The next question is: how do you get it out of the notebook and into the hands of users without it crashing on the first request? That’s why a solid production pipeline matters – it turns a cool experiment into a reliable service.
Why Kubernetes for Deep Learning?
Kubernetes (often shortened to K8s) is a system that runs containers on a cluster of machines. Think of it as a traffic manager that decides where each container lives, how many copies to run, and what to do when something goes wrong. For deep learning models, this means you can:
- Scale up when traffic spikes and scale down when it’s quiet, saving money.
- Recover automatically if a node fails – the model keeps running.
- Separate training and serving environments, keeping the production side clean.
I first tried to serve a model on a single VM and learned the hard way that a single point of failure is a nightmare during a demo. Switching to Kubernetes saved me a lot of late‑night firefighting.
Overview of the Pipeline
- Export the model in a portable format (ONNX, TorchScript, SavedModel).
- Wrap the model in a lightweight API using FastAPI or Flask.
- Containerize the API with Docker.
- Push the image to a registry (Docker Hub, GCR, ECR).
- Write Kubernetes manifests for deployment, service, and autoscaling.
- Add monitoring with Prometheus and Grafana.
- Deploy and test.
Each step is simple enough to do on a laptop, but together they give you a production‑ready system.
Step 1: Export the Model
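Most frameworks let you save a model in a format that does not depend on the original code. For PyTorch, torch.jit.trace runs the model once on an example input, records the operations, and returns a TorchScript module that you can save to a file. Tracing does not capture data‑dependent control flow, so models with if statements or loops that depend on the input are better served by torch.jit.script. For TensorFlow, model.save('my_model') creates a SavedModel directory in TensorFlow 2.x releases that ship Keras 2 (Keras 3 uses model.export('my_model') for this). If you need to move between frameworks, ONNX is a good neutral format.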
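# PyTorch example
import torch
model = MyModel()  # MyModel is your own nn.Module class, imported from your training code
model.load_state_dict(torch.load('checkpoint.pt', map_location='cpu'))  # load onto CPU even if trained on GPU
model.eval()  # switch dropout and batch norm to inference behavior
# Trace with a dummy input that has the shape the model expects at inference time
scripted = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
scripted.save('model.pt')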
Saving the model this way means the container only needs the runtime library, not the training code.
Step 2: Build a Simple API
I like FastAPI because it’s fast, type‑safe, and produces automatic docs. The API loads the model once at startup and then handles each request in a separate thread.
# api.py
from fastapi import FastAPI
import torch
import numpy as np
app = FastAPI()
model = torch.jit.load('model.pt')
model.eval()
@app.post("/predict")
def predict(data: list):
    tensor = torch.tensor(np.array(data), dtype=torch.float32)
    with torch.no_grad():
        out = model(tensor)
    return {"prediction": out.tolist()}
Run it locally with uvicorn api:app --host 0.0.0.0 --port 8000 and you’ll see a Swagger UI at http://localhost:8000/docs. If it works here, it should behave the same way in a container, provided the container installs the same library versions.
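The endpoint above hands whatever JSON it receives straight to the model, so a malformed payload shows up as an opaque server error. Here is a minimal sketch of a shape check you could run first. It assumes the model expects batches of 3×224×224 inputs, as in the tracing example from Step 1; nested_shape and check_payload are hypothetical helper names, not part of FastAPI or PyTorch.

```python
from typing import Any, Tuple

# Assumption: taken from the tracing input torch.randn(1, 3, 224, 224) in Step 1.
EXPECTED_ITEM_SHAPE = (3, 224, 224)


def nested_shape(data: Any) -> Tuple[int, ...]:
    """Return the shape of a rectangular nested list, or raise ValueError."""
    if not isinstance(data, list):
        # Leaves must be plain numbers (bool is rejected even though it is an int).
        if isinstance(data, bool) or not isinstance(data, (int, float)):
            raise ValueError(f"expected a number, got {type(data).__name__}")
        return ()
    if not data:
        raise ValueError("empty list in payload")
    child_shapes = {nested_shape(item) for item in data}
    if len(child_shapes) != 1:
        raise ValueError("ragged payload: sub-lists differ in shape")
    return (len(data),) + child_shapes.pop()


def check_payload(data: Any) -> Tuple[int, ...]:
    """Validate that data has shape (batch, 3, 224, 224) and return that shape."""
    shape = nested_shape(data)
    if len(shape) != 4 or shape[1:] != EXPECTED_ITEM_SHAPE:
        raise ValueError(f"expected shape (N, 3, 224, 224), got {shape}")
    return shape
```

Inside predict you could call check_payload(data) before building the tensor and turn a ValueError into fastapi.HTTPException(status_code=422, detail=str(err)), so clients get a readable message instead of a stack trace.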
Step 3: Containerize with Docker
Create a Dockerfile that starts from a lightweight base image, copies the model and API code, installs the needed libraries, and sets the entry point.
# Use official Python slim image
FROM python:3.10-slim
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 && rm -rf /var/lib/apt/lists/*
# Create app directory
WORKDIR /app
# Copy requirements and install
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy code and model
COPY api.py .
COPY model.pt .
# Expose port
EXPOSE 8000
# Run the API
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
requirements.txt might contain:
fastapi
uvicorn[standard]
torch
numpy
Build and test locally:
docker build -t mymodel:latest .
docker run -p 8000:8000 mymodel:latest
If you can hit http://localhost:8000/predict with a JSON payload, you’re ready for the next step.
Step 4: Push the Image to a Registry
A registry is a place where Kubernetes can pull the image. For a quick start, Docker Hub works fine.
docker tag mymodel:latest yourdockerhubusername/mymodel:1.0
docker push yourdockerhubusername/mymodel:1.0
Make sure the repository is public or you have set up a secret for private images.
Step 5: Write Kubernetes Manifests
You need three basic objects: a Deployment, a Service, and a HorizontalPodAutoscaler (HPA).
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mymodel-deploy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mymodel
  template:
    metadata:
      labels:
        app: mymodel
    spec:
      containers:
      - name: mymodel
        image: yourdockerhubusername/mymodel:1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
Service
apiVersion: v1
kind: Service
metadata:
  name: mymodel-svc
spec:
  selector:
    app: mymodel
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer
The Service exposes the pods on a stable IP and, with LoadBalancer, gives you an external address (cloud providers create a cloud load balancer automatically).
HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mymodel-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mymodel-deploy
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
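The HPA watches CPU usage and adds or removes pods to keep average utilization around 60% of each pod’s CPU request (250m in the Deployment above). It needs a metrics source in the cluster, typically metrics-server, or it has nothing to act on. You can also use custom metrics like request latency if you install the Prometheus adapter.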
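To get a feel for how the autoscaler reacts, here is a small sketch of the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min and max. The function name desired_replicas is my own, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 60.0,
                     min_replicas: int = 2,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for one utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        wanted = current_replicas
    else:
        # Multiply before dividing to limit floating-point rounding surprises.
        wanted = math.ceil(current_replicas * current_utilization / target_utilization)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, wanted))
```

With the manifest above, two pods averaging 90% utilization give desired_replicas(2, 90), which asks for a third pod, while four pods idling at 30% scale back down to the minimum of two.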
Apply everything:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml
Step 6: Add Monitoring
Without visibility, you won’t know if the model is misbehaving. Prometheus can scrape metrics from the FastAPI app if you add a small exporter:
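# add to api.py
from prometheus_client import Counter, Histogram, start_http_server
REQUEST_COUNT = Counter('request_count', 'Total requests')
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Latency per request')
start_http_server(8001)  # serves /metrics on port 8001 from a background thread
# this definition replaces the earlier predict()
@app.post("/predict")
def predict(data: list):
    REQUEST_COUNT.inc()
    with REQUEST_LATENCY.time():
        tensor = torch.tensor(np.array(data), dtype=torch.float32)
        with torch.no_grad():
            out = model(tensor)
    return {"prediction": out.tolist()}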
Call start_http_server(8001) once at startup to expose /metrics; it serves from its own background thread, so it does not block the API. Then configure Prometheus to scrape http://<pod_ip>:8001/metrics. Grafana dashboards can plot latency, error rates, and CPU usage.
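Before wiring up Grafana, it can help to sanity‑check what the endpoint returns. The sketch below parses the plain‑text output of /metrics and computes the mean request latency from the histogram’s _sum and _count samples. It is deliberately naive: it assumes one sample per line with no timestamps and no spaces inside label values, and parse_samples and mean_latency are hypothetical helper names. Note that the Python client exposes the counter as request_count_total.

```python
from typing import Dict, Optional


def parse_samples(text: str) -> Dict[str, float]:
    """Map each sample name (including any {labels}) to its value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


def mean_latency(samples: Dict[str, float],
                 metric: str = "request_latency_seconds") -> Optional[float]:
    """Return sum/count for a histogram, or None if nothing was observed."""
    count = samples.get(f"{metric}_count", 0.0)
    if count == 0:
        return None
    return samples[f"{metric}_sum"] / count
```

To try it against a running pod, fetch the text with urllib.request.urlopen("http://<pod_ip>:8001/metrics").read().decode() and pass it to parse_samples.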
Step 7: Test the Live Service
Once the Service gets an external IP, send a request:
curl -X POST http://<external_ip>/predict -H "Content-Type: application/json" -d '[ [0.1, 0.2, ...] ]'
Check the response time and the Prometheus dashboard. If the HPA spins up more pods when you fire a burst of requests, you’ve got a scalable system.
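A burst is easier to fire from a script than from repeated curl calls. Here is a minimal sketch using only the standard library; post_json and fire_burst are hypothetical helper names, and the request count and concurrency you pick are up to you. Keep in mind that urllib raises an HTTPError for 4xx and 5xx responses, so a failing service will surface as an exception rather than a status code.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def post_json(url: str, payload: object, timeout: float = 10.0) -> int:
    """POST payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def fire_burst(send: Callable[[], int], total: int, concurrency: int) -> List[float]:
    """Call send() `total` times across `concurrency` threads.

    Returns the latency of each call in seconds.
    """
    def timed_call(_: int) -> float:
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed_call, range(total)))
```

For example, fire_burst(lambda: post_json("http://<external_ip>/predict", payload), total=500, concurrency=20) sends 500 requests, 20 at a time, where payload is a nested list in the shape your model expects. Run kubectl get hpa mymodel-hpa --watch in another terminal to see whether the replica count climbs.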
Tips for a Smooth Ride
- Keep the container small. Use python:slim or even distroless images to reduce attack surface and start‑up time.
- Separate config from code. Use ConfigMaps or environment variables for things like model path or batch size.
- Version your models. Tag Docker images with the model version (v1.2) so you can roll back easily.
- Run a health check. Add a /health endpoint that returns 200 only when the model loads correctly; once a liveness probe points at it, Kubernetes will restart unhealthy pods automatically.
- Watch out for GPU needs. If you need a GPU, use a node pool with NVIDIA drivers and add resources.limits.nvidia.com/gpu: 1 to the pod spec.
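As an illustration of the config tip, here is a minimal sketch of reading settings from environment variables with defaults. The variable names MODEL_PATH and BATCH_SIZE and the Settings class are my own choices for the example; in a cluster you would populate them from a ConfigMap through env or envFrom in the pod spec.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    batch_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from environment variables, falling back to defaults."""
    return Settings(
        # Hypothetical variable names; the defaults match the earlier steps.
        model_path=env.get("MODEL_PATH", "model.pt"),
        batch_size=int(env.get("BATCH_SIZE", "1")),
    )
```

In api.py this would let you write settings = load_settings() and then torch.jit.load(settings.model_path), so switching models becomes a config change rather than a code change.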
Deploying a deep learning model on Kubernetes may look like a lot of steps, but each piece is reusable. Once you have the pipeline in place, swapping in a new model is as easy as building a new Docker image and updating the deployment.
Happy serving, and remember: a model that can’t stay up is just a fancy spreadsheet.