Step-by-Step Guide: Deploying a TensorFlow Model with Docker and Kubernetes

You’ve spent weeks training a TensorFlow model that finally hits the target accuracy. The next question everyone asks is “How do we get this thing into production without breaking everything?” In today’s fast‑moving ML world, a smooth deployment pipeline is as important as the model itself. Let’s walk through a practical, no‑fluff way to ship your TensorFlow model using Docker and Kubernetes.

Why Docker and Kubernetes?

Docker gives you a portable, reproducible environment. Think of it as a sealed box that contains everything your model needs: Python, TensorFlow, system libraries, and the OS userland. If the box works on your laptop, it should work on any server that runs Docker on the same CPU architecture.

Kubernetes (often shortened to K8s) is the orchestrator that runs many Docker containers at scale. It handles load balancing, health checks, rolling updates, and auto‑scaling. In short, Docker packages the model; Kubernetes runs it reliably.

Personal note: The first time I tried to deploy a model on a bare VM, I spent half a day chasing a missing lib version. After I containerized it, the same model ran on three different clouds without a hitch. That’s the power of Docker + K8s.

Prerequisites

Before we dive in, make sure you have:

A trained TensorFlow SavedModel directory (saved_model.pb + variables).
Docker installed locally (docker --version).
Access to a Kubernetes cluster (minikube works for testing, GKE/EKS/AKS for production).
kubectl configured to talk to your cluster.

Step 1: Write a Simple Flask Wrapper

Clients need a network endpoint to send prediction requests to, and Kubernetes needs one for its health probes. The easiest way to provide both is to wrap the model in a tiny Flask app that serves HTTP.

# app.py
from flask import Flask, request, jsonify
import tensorflow as tf

app = Flask(__name__)
model = tf.keras.models.load_model('model_dir')

@app.route('/predict', methods=['POST'])
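def predict():
    # silent=True returns None on malformed JSON instead of raising
    data = request.get_json(force=True, silent=True)
    if not isinstance(data, dict) or 'features' not in data:
        return jsonify({'error': 'expected a JSON object with a "features" list'}), 400
    # Assume input is a list of numbers under key "features";
    # wrap it in an outer list to form a batch of one example.
    # float32 is assumed to match the model's input dtype.
    features = tf.convert_to_tensor([data['features']], dtype=tf.float32)
    preds = model.predict(features)
    return jsonify({'prediction': preds.tolist()})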

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Keep the code short and avoid any heavy preprocessing inside the service. If you need complex pipelines, consider using TensorFlow Serving instead, but for most prototypes Flask does the job.

Step 2: Create a Dockerfile

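The Dockerfile defines the container image. Here’s a minimal version. It starts Flask’s built-in development server, which is fine for testing; for production traffic, run the app under a WSGI server such as gunicorn.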

# Use the official Python slim image
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential && \
    rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements first for caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the Flask app and the saved model
COPY app.py .
COPY model_dir ./model_dir

# Expose the port Flask will run on
EXPOSE 8080

# Run the app
CMD ["python", "app.py"]

Create a requirements.txt that lists only what you need:

flask
tensorflow==2.12.0

Build the image locally:

docker build -t tf-model-service:latest .

Run it to verify:

docker run -p 8080:8080 tf-model-service:latest

If you can curl -X POST localhost:8080/predict -d '{"features": [0.5, 1.2, 3.3]}' -H "Content-Type: application/json" and get a JSON response, you’re good to go.
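
The same smoke test can be scripted in Python, which may be handy once the check needs to run in CI. The sketch below uses only the standard library. The helper names build_predict_request and predict are illustrative and not part of the service itself, and the default URL assumes the container started with docker run above is listening on localhost:8080.

```python
import json
import urllib.request


def build_predict_request(features, url="http://localhost:8080/predict"):
    """Build a POST request carrying the JSON body the /predict route expects."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, url="http://localhost:8080/predict", timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the container running, call:
#   predict([0.5, 1.2, 3.3])
```

Keeping request construction separate from the network call makes the payload easy to check without a running container.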

Step 3: Push the Image to a Registry

Kubernetes pulls images from a registry. You can use Docker Hub, Google Artifact Registry (the successor to Google Container Registry), or any private repo.

docker tag tf-model-service:latest yourrepo/tf-model-service:1.0
docker push yourrepo/tf-model-service:1.0

Make sure the registry is accessible from your cluster. If you’re using a private repo, create a Kubernetes image pull secret with the credentials (kubectl create secret docker-registry) and reference it under imagePullSecrets in the pod spec.

Step 4: Write Kubernetes Manifests

We need two objects: a Deployment (runs the pods) and a Service (exposes them).

Deployment (deployment.yaml)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-model-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-model
  template:
    metadata:
      labels:
        app: tf-model
    spec:
      containers:
      - name: tf-model-container
        image: yourrepo/tf-model-service:1.0
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "500m"
            memory: "512Mi"
          requests:
            cpu: "250m"
            memory: "256Mi"

Service (service.yaml)

apiVersion: v1
kind: Service
metadata:
  name: tf-model-service
spec:
  type: LoadBalancer   # Use NodePort for minikube
  selector:
    app: tf-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
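
The Service finds its pods purely through labels: every key/value pair in the Service selector must appear in the pod's labels. A mismatch between the two manifests is an easy mistake to make, and it leaves the Service with no endpoints. The following sketch illustrates the matching rule in plain Python; selector_matches is a hypothetical helper written for this guide, not a Kubernetes API.

```python
def selector_matches(selector, pod_labels):
    """Return True if every key/value pair in the selector appears in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Values copied from service.yaml (selector) and deployment.yaml (template labels).
service_selector = {"app": "tf-model"}
pod_template_labels = {"app": "tf-model"}

print(selector_matches(service_selector, pod_template_labels))  # True
# Extra labels on the pod do not prevent a match.
print(selector_matches(service_selector, {"app": "tf-model", "tier": "ml"}))  # True
# A typo in the pod label means the Service routes to nothing.
print(selector_matches(service_selector, {"app": "tf-modle"}))  # False
```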

Apply them:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Check the pods:

kubectl get pods -l app=tf-model

Once the pods are Running and kubectl get service tf-model-service shows an external IP, you can hit the endpoint:

curl -X POST http://<EXTERNAL_IP>/predict -d '{"features": [0.5, 1.2, 3.3]}' -H "Content-Type: application/json"

Step 5: Enable Rolling Updates

One of the biggest benefits of Kubernetes is zero‑downtime updates. Suppose you improve the model and push a new image yourrepo/tf-model-service:1.1. Update the Deployment:

kubectl set image deployment/tf-model-deployment tf-model-container=yourrepo/tf-model-service:1.1

Kubernetes will gradually replace old pods with new ones, respecting the maxSurge and maxUnavailable defaults (25% each). You can fine‑tune these values under strategy.rollingUpdate in the Deployment spec if you need tighter control.
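
To see what the defaults mean for a small Deployment, it helps to work through the arithmetic. Both maxSurge and maxUnavailable default to 25%; Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down when converting to pod counts. The function below is a sketch of that calculation, and rollout_bounds is an illustrative name rather than anything Kubernetes provides.

```python
import math


def rollout_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return (max_total_pods, min_available_pods) during a rolling update.

    A percentage maxSurge is rounded up and a percentage maxUnavailable
    is rounded down when converted to absolute pod counts.
    """
    surge = math.ceil(replicas * max_surge_pct / 100)
    unavailable = math.floor(replicas * max_unavailable_pct / 100)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(2))   # (3, 2)
print(rollout_bounds(10))  # (13, 8)
```

With the two replicas used in this guide, the rollout may briefly run three pods and never drops below two available, so the cluster needs spare capacity for one extra pod.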

Step 6: Add Health Checks

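Kubernetes restarts a container when its liveness probe keeps failing, and stops routing traffic to a pod whose readiness probe fails. Both probes need an endpoint to call, so add a simple one to app.py: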

@app.route('/health', methods=['GET'])
def health():
    return 'OK', 200

Then add liveness and readiness probes to the container entry in deployment.yaml, at the same indentation level as ports and resources:

        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 15

These probes tell K8s when a container is ready to receive traffic and when it needs a restart.
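
The probe settings also determine how long a broken container can go unnoticed. The manifest above does not set failureThreshold, so the Kubernetes default of 3 consecutive failures applies. The sketch below estimates the detection window from periodSeconds; it ignores the per-probe timeout, so treat the numbers as approximate. detection_window is a hypothetical helper, not a Kubernetes API.

```python
def detection_window(period_seconds, failure_threshold=3):
    """Return (shortest, longest) seconds between a container starting to fail
    and the probe being declared failed, ignoring the per-probe timeout.

    Shortest case: the failure begins just before a probe fires.
    Longest case: the failure begins just after a successful probe.
    """
    shortest = (failure_threshold - 1) * period_seconds
    longest = failure_threshold * period_seconds
    return shortest, longest


print(detection_window(30))  # liveness probe: (60, 90)
print(detection_window(15))  # readiness probe: (30, 45)
```

If waiting up to 90 seconds for a restart is too slow for your service, lowering periodSeconds on the liveness probe is the most direct lever.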

Step 7: Monitoring and Logging

Even a perfect deployment can go wrong under load. At a minimum, stream the container logs with kubectl logs:

kubectl logs -f deployment/tf-model-deployment

For production, hook the pods into a logging stack (e.g., Fluentd + Elasticsearch) and expose metrics via Prometheus. The Flask wrapper exposes no metrics on its own; you can add request counters and latency histograms with prometheus_client.
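
Log collectors are generally easier to work with when each log line is a single JSON object written to stdout, which is also where kubectl logs reads from. The following is one possible setup using only the standard logging module; JsonFormatter and configure_logging are names chosen for this sketch, and the field names are an assumption rather than a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback so errors stay searchable in the log store.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level=logging.INFO):
    """Route all logging through a JSON formatter writing to stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

Calling configure_logging() once at the top of app.py, before the model is loaded, would be enough; after that, logging.getLogger("tf-model").info("prediction served") emits a JSON line.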

Step 8: Autoscaling

If your traffic spikes, let Kubernetes add more pods automatically. Create a HorizontalPodAutoscaler (HPA):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-model-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

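Save the manifest as hpa.yaml and apply it with kubectl apply -f hpa.yaml. The HPA will now add pods when average CPU utilization across the pods rises above 60% of the requested CPU (250m per pod in the Deployment above). This relies on a metrics source such as metrics-server being installed in the cluster.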
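
The scaling decision follows a simple documented rule: desired replicas = ceil(current replicas × current utilization / target utilization), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces that core rule with the min/max bounds from the manifest. It deliberately leaves out stabilization windows and other behavior settings, and desired_replicas is an illustrative function, not part of any Kubernetes client.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=60,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a CPU utilization target.

    Utilization values are percentages of the pods' CPU request.
    """
    ratio = current_utilization / target_utilization
    # Skip scaling when the metric is close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90))   # 3
print(desired_replicas(2, 63))   # 2 (within tolerance, no change)
print(desired_replicas(4, 30))   # 2
print(desired_replicas(8, 120))  # 10 (capped at maxReplicas)
```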

Wrap‑Up Thoughts

Deploying a TensorFlow model doesn’t have to be a nightmare. By containerizing the model with Docker, exposing it via a tiny Flask API, and letting Kubernetes handle scaling and health, you get a robust pipeline that can grow with your traffic. The steps above are deliberately simple—once you’re comfortable, you can replace Flask with TensorFlow Serving, add GPU nodes, or integrate CI/CD pipelines.

Remember, the goal is to keep the model serving layer as lightweight and observable as possible. If you ever find yourself chasing a mysterious “500 Internal Server Error,” check the health endpoint and the pod logs first. Most issues are just missing environment variables or a mismatched TensorFlow version.

Happy shipping, and may your models stay accurate and your pods stay healthy.