End-to-End Production Pipeline: Deploying a Scalable Deep Learning Model on Kubernetes

You’ve spent weeks training a model that finally beats the baseline. The next question is: how do you get it out of the notebook and into the hands of users without it crashing on the first request? That is where a solid production pipeline comes in: it turns a promising experiment into a reliable service.

Why Kubernetes for Deep Learning?

Kubernetes (often shortened to K8s) is a system that runs containers on a cluster of machines. Think of it as a traffic manager that decides where each container lives, how many copies to run, and what to do when something goes wrong. For deep learning models, this means you can:

  • Scale up when traffic spikes and scale down when it’s quiet, saving money.
  • Recover automatically if a node fails – the model keeps running.
  • Separate training and serving environments, keeping the production side clean.

I first tried to serve a model on a single VM and learned the hard way that a single point of failure is a nightmare during a demo. Switching to Kubernetes saved me a lot of late‑night firefighting.

Overview of the Pipeline

  1. Export the model in a portable format (ONNX, TorchScript, SavedModel).
  2. Wrap the model in a lightweight API using FastAPI or Flask.
  3. Containerize the API with Docker.
  4. Push the image to a registry (Docker Hub, GCR, ECR).
  5. Write Kubernetes manifests for deployment, service, and autoscaling.
  6. Add monitoring with Prometheus and Grafana.
  7. Deploy and test.

Each step is simple enough to do on a laptop, but together they give you a production‑ready system.

Step 1: Export the Model

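Most frameworks let you save a model in a format that does not depend on the original code. For PyTorch, torch.jit.trace records the operations run on an example input and returns a TorchScript module you can save to a file (if the forward pass has data-dependent control flow, torch.jit.script is the safer choice, because tracing only captures the path the example input took). For TensorFlow 2.x, model.save('my_model') creates a SavedModel directory; newer Keras versions use model.export for this. If you need to move between frameworks, ONNX is a good neutral format.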

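# PyTorch example
import torch

# MyModel is a placeholder for your own nn.Module class
model = MyModel()
model.load_state_dict(torch.load('checkpoint.pt', map_location='cpu'))
model.eval()  # switch dropout and batch norm to inference behavior

# Trace with a dummy input of the shape the model expects at inference time
scripted = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
scripted.save('model.pt')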

Saving the model this way means the container only needs the runtime library, not the training code.

Step 2: Build a Simple API

I like FastAPI because it’s fast, validates input from type hints, and produces automatic docs. The API loads the model once at startup. Because predict is a plain def function, FastAPI runs each request in a worker thread from its thread pool.

# api.py
from fastapi import FastAPI
import torch
import numpy as np

app = FastAPI()
model = torch.jit.load('model.pt')
model.eval()

@app.post("/predict")
def predict(data: list):
    tensor = torch.tensor(np.array(data), dtype=torch.float32)
    with torch.no_grad():
        out = model(tensor)
    return {"prediction": out.tolist()}

Run it locally with uvicorn api:app --host 0.0.0.0 --port 8000 and you’ll see a Swagger UI at http://localhost:8000/docs. If it works here, you have a known-good baseline to compare against once it runs in a container.
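The endpoint above accepts any JSON list, so a malformed payload only fails once it reaches the model. Here is a minimal sketch of a shape check that could run first, using only the standard library. The helper names nested_shape and check_batch are my own, and the default 3x224x224 sample shape is an assumption taken from the tracing example in Step 1.

```python
from typing import Any, Tuple


def nested_shape(data: Any) -> Tuple[int, ...]:
    """Return the shape of a rectangular nested list of numbers.

    Raises ValueError if the nesting is ragged or a leaf is not a number.
    """
    # bool is a subclass of int, so reject it explicitly
    if isinstance(data, bool) or not isinstance(data, (list, int, float)):
        raise ValueError(f"unsupported element type: {type(data).__name__}")
    if not isinstance(data, list):
        return ()  # a scalar leaf has an empty shape
    if not data:
        return (0,)
    first = nested_shape(data[0])
    for item in data[1:]:
        if nested_shape(item) != first:
            raise ValueError("ragged input: sub-lists differ in shape")
    return (len(data),) + first


def check_batch(data: Any, sample_shape: Tuple[int, ...] = (3, 224, 224)) -> int:
    """Validate a batch payload and return the batch size N."""
    shape = nested_shape(data)
    if len(shape) != len(sample_shape) + 1 or shape[0] == 0 or shape[1:] != sample_shape:
        raise ValueError(f"expected shape (N, *{sample_shape}) with N >= 1, got {shape}")
    return shape[0]
```

Inside predict you could call check_batch(data) before building the tensor and turn a ValueError into an HTTP 422 response with FastAPI's HTTPException, so clients get a clear error instead of a stack trace.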

Step 3: Containerize with Docker

Create a Dockerfile that starts from a lightweight base image, copies the model and API code, installs the needed libraries, and sets the entry point.

# Use official Python slim image
FROM python:3.10-slim

# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 && rm -rf /var/lib/apt/lists/*

# Create app directory
WORKDIR /app

# Copy requirements and install
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy code and model
COPY api.py .
COPY model.pt .

# Expose port
EXPOSE 8000

# Run the API
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

requirements.txt might contain the following (pin exact versions for reproducible builds):

fastapi
uvicorn[standard]
torch
numpy

Build and test locally:

docker build -t mymodel:latest .
docker run -p 8000:8000 mymodel:latest

If you can hit http://localhost:8000/predict with a JSON payload, you’re ready for the next step.
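If you would rather test from Python than from a browser, a small standard-library client is enough. This is only a sketch: the function names are hypothetical, and it assumes the container from this step is listening on localhost:8000 and accepts the JSON list body that the predict endpoint expects.

```python
import json
import urllib.request
from typing import Any, List


def build_predict_request(base_url: str, batch: List[Any]) -> urllib.request.Request:
    """Build a POST request for the /predict endpoint with a JSON body."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, batch: List[Any], timeout: float = 10.0) -> Any:
    """Send the batch to the running service and return the decoded JSON reply."""
    request = build_predict_request(base_url, batch)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, predict("http://localhost:8000", batch) should return a dict with a "prediction" key, where batch is a nested list in the shape your model expects (1x3x224x224 for the traced example).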

Step 4: Push the Image to a Registry

A registry is a place where Kubernetes can pull the image. For a quick start, Docker Hub works fine.

docker tag mymodel:latest yourdockerhubusername/mymodel:1.0
docker push yourdockerhubusername/mymodel:1.0

Make sure the repository is public. For a private repository, create an image pull secret and reference it under imagePullSecrets in the pod spec.

Step 5: Write Kubernetes Manifests

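You need three basic objects: a Deployment, a Service, and a HorizontalPodAutoscaler (HPA). Treat the resource requests and limits below as placeholders: a PyTorch process can easily need more than 512Mi of memory, so measure your model's actual footprint and adjust, or the pods will be killed for exceeding the limit.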

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mymodel-deploy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mymodel
  template:
    metadata:
      labels:
        app: mymodel
    spec:
      containers:
      - name: mymodel
        image: yourdockerhubusername/mymodel:1.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

Service

apiVersion: v1
kind: Service
metadata:
  name: mymodel-svc
spec:
  selector:
    app: mymodel
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer

The Service exposes the pods on a stable IP and, with LoadBalancer, gives you an external address (cloud providers create a cloud load balancer automatically).

HorizontalPodAutoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mymodel-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mymodel-deploy
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

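The HPA watches average CPU usage, measured as a percentage of the CPU request set in the Deployment, and adds or removes pods to keep it around 60%. This relies on a metrics source such as metrics-server being installed in the cluster. You can also use custom metrics like request latency if you install the Prometheus adapter.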
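To get a feel for how the autoscaler reacts, it helps to work through the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1 and clamped to the min and max replica counts. The sketch below is a simplified model of that rule, not the controller's actual code. It ignores pods that are not ready, missing metrics, and stabilization windows, and the 0.1 tolerance is the documented default rather than something set in the manifest above.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 60.0,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA decision for a single resource metric."""
    ratio = current_utilization / target_utilization
    # Close enough to the target: leave the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * current_utilization / target_utilization)
    # Never go outside the bounds declared in the HPA spec
    return max(min_replicas, min(max_replicas, wanted))
```

For example, two pods averaging 90% CPU against a 60% target give ceil(2 × 90 / 60) = 3 replicas, while eight pods at 120% would ask for 16 and get capped at maxReplicas.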

Apply everything:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f hpa.yaml

Step 6: Add Monitoring

Without visibility, you won’t know if the model is misbehaving. Prometheus can scrape metrics from the FastAPI app if you add a small exporter:

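# add to api.py
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter('request_count', 'Total requests')
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Latency per request')

# Serve /metrics on port 8001; this starts its own background thread
start_http_server(8001)

# Replaces the earlier predict() definition
@app.post("/predict")
def predict(data: list):
    REQUEST_COUNT.inc()
    with REQUEST_LATENCY.time():  # records how long the block takes
        tensor = torch.tensor(np.array(data), dtype=torch.float32)
        with torch.no_grad():
            out = model(tensor)
    return {"prediction": out.tolist()}

start_http_server(8001), called once at startup, exposes /metrics on port 8001 from its own background thread, so no extra threading code is needed. Remember to add prometheus_client to requirements.txt. Then configure Prometheus to scrape http://<pod_ip>:8001/metrics; in a cluster this is usually done through Kubernetes service discovery rather than fixed pod IPs. Grafana dashboards can plot latency, error rates, and CPU usage.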

Step 7: Test the Live Service

Once the Service gets an external IP, send a request:

curl -X POST http://<external_ip>/predict -H "Content-Type: application/json" -d '[ [0.1, 0.2, ...] ]'

Check the response time and the Prometheus dashboard. If the HPA spins up more pods when you fire a burst of requests, you’ve got a scalable system.
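A burst does not need a dedicated load-testing tool. Below is a minimal sketch that fires concurrent requests from a thread pool and reports latency percentiles. The function names are hypothetical, and send stands for any zero-argument callable that performs one request against your service (for example, a wrapper around an HTTP POST to /predict).

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def timed_call(send: Callable[[], object]) -> float:
    """Run one request and return its latency in seconds."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start


def run_burst(send: Callable[[], object], total: int, workers: int) -> List[float]:
    """Fire `total` requests using `workers` threads and return all latencies."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: timed_call(send), range(total)))


def percentile(latencies: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

Run a few hundred requests with a worker count well above your replica count, compare the median with the 95th percentile, and keep kubectl get hpa --watch open in another terminal to see the replica count change.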

Tips for a Smooth Ride

  • Keep the container small. Use python:slim or even distroless images to reduce attack surface and start‑up time.
  • Separate config from code. Use ConfigMaps or environment variables for things like model path or batch size.
  • Version your models. Tag Docker images with the model version (v1.2) so you can roll back easily.
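  • Run a health check. Add a /health endpoint that returns 200 only once the model has loaded, and point readiness and liveness probes at it; Kubernetes then keeps traffic away from pods that are not ready and restarts containers that fail the liveness probe.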
  • Watch out for GPU needs. If you need a GPU, use a node pool with NVIDIA drivers and the NVIDIA device plugin installed, and set nvidia.com/gpu: 1 under resources.limits in the container spec.
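For the config tip, here is a minimal sketch of reading settings from environment variables, which is how ConfigMap values typically reach the container. The variable names MODEL_PATH and BATCH_SIZE and their defaults are assumptions for illustration; nothing earlier in this post defines them.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    batch_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read service settings from the environment, with safe defaults."""
    batch_size = int(env.get("BATCH_SIZE", "1"))
    if batch_size < 1:
        raise ValueError("BATCH_SIZE must be a positive integer")
    return Settings(
        model_path=env.get("MODEL_PATH", "model.pt"),
        batch_size=batch_size,
    )
```

In api.py you could then call torch.jit.load(load_settings().model_path) instead of hard-coding the file name, and a bad value fails at startup rather than on the first request.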

Deploying a deep learning model on Kubernetes may look like a lot of steps, but each piece is reusable. Once you have the pipeline in place, swapping in a new model is as easy as building a new Docker image and updating the deployment.

Happy serving, and remember: a model that can’t stay up is just a fancy spreadsheet.
