Step-by-Step Guide: Deploying a TensorFlow Model with Docker and Kubernetes
You’ve spent weeks training a TensorFlow model that finally hits the target accuracy. The next question everyone asks is “How do we get this thing into production without breaking everything?” In today’s fast‑moving ML world, a smooth deployment pipeline is as important as the model itself. Let’s walk through a practical, no‑fluff way to ship your TensorFlow model using Docker and Kubernetes.
Why Docker and Kubernetes?
Docker gives you a portable, reproducible environment. Think of it as a sealed box that contains everything your model needs—Python, TensorFlow, libraries, even the OS version. When the box works on your laptop, it will work on any server that runs Docker.
Kubernetes (often shortened to K8s) is the orchestrator that runs many Docker containers at scale. It handles load balancing, health checks, rolling updates, and auto‑scaling. In short, Docker packages the model and Kubernetes runs it reliably.
Personal note: The first time I tried to deploy a model on a bare VM, I spent half a day chasing a missing lib version. After I containerized it, the same model ran on three different clouds without a hitch. That’s the power of Docker + K8s.
Prerequisites
Before we dive in, make sure you have:
- A trained TensorFlow SavedModel directory (saved_model.pb + variables).
- Docker installed locally (docker --version).
- Access to a Kubernetes cluster (minikube works for testing, GKE/EKS/AKS for production).
- kubectl configured to talk to your cluster.
Step 1: Write a Simple Flask Wrapper
Kubernetes expects a service that can receive HTTP requests. The easiest way is to wrap the model in a tiny Flask app.
# app.py
from flask import Flask, request, jsonify
import tensorflow as tf

app = Flask(__name__)

# Load the SavedModel once at startup, not on every request
model = tf.keras.models.load_model('model_dir')


@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    # Assume input is a list of numbers under key "features"
    features = tf.convert_to_tensor([data['features']])
    preds = model.predict(features)
    return jsonify({'prediction': preds.tolist()})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Keep the code short and avoid any heavy preprocessing inside the service. If you need complex pipelines, consider using TensorFlow Serving instead, but for most prototypes Flask does the job.
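The handler above trusts whatever arrives in the request body. Here is a minimal sketch of checking the payload before it reaches the model, assuming the model takes one flat list of numbers. The helper name parse_features and its expected_len parameter are hypothetical and not part of the original app.

```python
from numbers import Real


def parse_features(payload, expected_len=None):
    """Return payload["features"] as a list of floats, or raise ValueError.

    Hypothetical helper: assumes the model takes one flat list of numbers.
    """
    if not isinstance(payload, dict) or "features" not in payload:
        raise ValueError('request body must be a JSON object with a "features" key')
    features = payload["features"]
    if not isinstance(features, list) or not features:
        raise ValueError('"features" must be a non-empty list')
    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError('"features" must contain only numbers')
    if expected_len is not None and len(features) != expected_len:
        raise ValueError(f"expected {expected_len} features, got {len(features)}")
    return [float(value) for value in features]
```

Inside predict(), one option is to wrap the call in try/except ValueError and return jsonify({'error': str(exc)}), 400, so a malformed request produces a clear 400 instead of a 500.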
Step 2: Create a Dockerfile
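The Dockerfile defines the container image. Here’s a minimal version that works as a starting point. For real production traffic, run the app under a WSGI server such as Gunicorn rather than Flask’s built‑in development server.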
# Use the official Python slim image
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential && \
    rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy requirements first for caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the Flask app and the saved model
COPY app.py .
COPY model_dir ./model_dir
# Expose the port Flask will run on
EXPOSE 8080
# Run the app
CMD ["python", "app.py"]
Create a requirements.txt that lists only what you need:
flask
tensorflow==2.12.0
Build the image locally:
docker build -t tf-model-service:latest .
Run it to verify:
docker run -p 8080:8080 tf-model-service:latest
If you can curl -X POST localhost:8080/predict -d '{"features": [0.5, 1.2, 3.3]}' -H "Content-Type: application/json" and get a JSON response, you’re good to go.
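If you would rather script this smoke test than type the curl command, here is a rough standard-library equivalent. The function names build_predict_request and call_predict are illustrative, and the default base URL assumes the container is running locally on port 8080 as above.

```python
import json
import urllib.request


def build_predict_request(features, base_url="http://localhost:8080"):
    """Build a POST request for the /predict endpoint defined in app.py."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(features, base_url="http://localhost:8080", timeout=5.0):
    """Send the request and return the decoded JSON response.

    Raises urllib.error.URLError if the service is not reachable.
    """
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, call_predict([0.5, 1.2, 3.3]) should return the same JSON body as the curl call. The same function works against the cluster later if you pass the Service address as base_url.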
Step 3: Push the Image to a Registry
Kubernetes pulls images from a registry. You can use Docker Hub, Google Container Registry, or any private repo.
docker tag tf-model-service:latest yourrepo/tf-model-service:1.0
docker push yourrepo/tf-model-service:1.0
Make sure the registry is accessible from your cluster. If you’re using a private repo, create a Kubernetes image pull secret with the credentials and reference it under imagePullSecrets in the pod spec.
Step 4: Write Kubernetes Manifests
We need two objects: a Deployment (runs the pods) and a Service (exposes them).
Deployment (deployment.yaml)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-model-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tf-model
  template:
    metadata:
      labels:
        app: tf-model
    spec:
      containers:
        - name: tf-model-container
          image: yourrepo/tf-model-service:1.0
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: "500m"
              memory: "512Mi"
            requests:
              cpu: "250m"
              memory: "256Mi"
Service (service.yaml)
apiVersion: v1
kind: Service
metadata:
  name: tf-model-service
spec:
  type: LoadBalancer  # Use NodePort for minikube
  selector:
    app: tf-model
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
Apply them:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
Check the pods:
kubectl get pods -l app=tf-model
If the pods are Running and the service has an external IP, you can hit the endpoint:
curl -X POST http://<EXTERNAL_IP>/predict -d '{"features": [0.5, 1.2, 3.3]}' -H "Content-Type: application/json"
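If the request hangs or is refused even though the pods are Running, a common cause is a Service selector that does not match the pod labels. The Service finds its pods purely by label. Here is a small sketch of that matching rule, covering only the equality-based, non-empty selector used in these manifests. The function name selector_matches is illustrative.

```python
def selector_matches(selector, pod_labels):
    """True if every key/value pair in the selector appears in the pod's labels.

    Models only a non-empty, equality-based selector like the one in service.yaml.
    """
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Values copied from service.yaml and the pod template in deployment.yaml.
service_selector = {"app": "tf-model"}
pod_template_labels = {"app": "tf-model"}

print(selector_matches(service_selector, pod_template_labels))        # True
print(selector_matches({"app": "tf-model-v2"}, pod_template_labels))  # False
```

On the cluster itself, kubectl get endpoints tf-model-service shows which pod IPs the Service has actually picked up. An empty list usually points to a label mismatch or to pods that are not yet ready.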
Step 5: Enable Rolling Updates
One of the biggest benefits of Kubernetes is zero‑downtime updates. Suppose you improve the model and push a new image yourrepo/tf-model-service:1.1. Update the Deployment:
kubectl set image deployment/tf-model-deployment tf-model-container=yourrepo/tf-model-service:1.1
Kubernetes gradually replaces old pods with new ones, respecting maxSurge and maxUnavailable, which both default to 25%. You can tune these values under strategy.rollingUpdate in the Deployment spec if you need tighter control.
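To see what those settings mean for a small Deployment, here is a sketch of the arithmetic, assuming the default 25% for both values. Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down. The function name rolling_update_bounds is illustrative.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return (max_total_pods, min_available_pods) during a rolling update.

    Percentages are converted to pod counts the way Kubernetes documents it:
    maxSurge rounds up, maxUnavailable rounds down.
    """
    surge = (replicas * max_surge_pct + 99) // 100        # integer ceiling
    unavailable = (replicas * max_unavailable_pct) // 100  # integer floor
    return replicas + surge, replicas - unavailable


print(rolling_update_bounds(2))   # (3, 2)
print(rolling_update_bounds(10))  # (13, 8)
```

With the two replicas used in this guide, the update may briefly run three pods and never drops below two available. That is why the rollout causes no downtime, provided the new pods actually report ready.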
Step 6: Add Health Checks
Kubernetes restarts a container when its liveness probe fails, and it stops routing traffic to a pod while its readiness probe fails. Add a simple endpoint to app.py:
@app.route('/health', methods=['GET'])
def health():
    return 'OK', 200
Then add liveness and readiness probes to the container entry in the Deployment, at the same level as image and ports:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 15
These probes tell K8s when a container is ready to receive traffic and when it needs a restart.
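It helps to know how quickly these settings react. The sketch below estimates how long a hung container keeps running before the liveness probe triggers a restart. It assumes the Kubernetes defaults that the manifest leaves unset (failureThreshold of 3, timeoutSeconds of 1) and ignores kubelet scheduling jitter and the termination grace period. The function name liveness_restart_window is illustrative.

```python
def liveness_restart_window(period_seconds, failure_threshold=3, timeout_seconds=1):
    """Estimate (best_case, worst_case) seconds from a hang to a restart decision.

    Best case: the container hangs just before a probe fires.
    Worst case: it hangs just after a probe succeeded.
    """
    best = (failure_threshold - 1) * period_seconds + timeout_seconds
    worst = failure_threshold * period_seconds + timeout_seconds
    return best, worst


# periodSeconds: 30, as in the livenessProbe of this guide
print(liveness_restart_window(30))  # (61, 91)
```

If a minute or more of a hung pod is too long for your service, lowering periodSeconds or failureThreshold shortens the window, at the cost of more probe traffic and a higher chance of restarting a pod that is merely slow.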
Step 7: Monitoring and Logging
Even a perfect deployment can go wrong under load. At a minimum, stream the pod logs with kubectl logs:
kubectl logs -f deployment/tf-model-deployment
For production, hook the pods into a logging stack (e.g., Fluentd + Elasticsearch) and expose metrics via Prometheus. TensorFlow Serving can export some metrics to Prometheus when configured to do so; with the Flask wrapper used here, you add your own with prometheus_client.
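Log collectors such as Fluentd typically read whatever the container writes to stdout, and they are much easier to query when each line is structured. Here is a minimal sketch using only the standard library. The JsonFormatter class, the configure_logging function, and the field names are illustrative choices, not something a particular logging stack requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(level=logging.INFO):
    """Send JSON logs to stdout, where kubectl logs and node-level collectors read them."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]  # replace any handlers configured earlier
    root.setLevel(level)
```

Calling configure_logging() once at the top of app.py, followed by something like logging.getLogger("tf-model").info("prediction served"), would make each request show up as a single JSON line in kubectl logs.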
Step 8: Autoscaling
If your traffic spikes, let Kubernetes add more pods automatically. Create a HorizontalPodAutoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tf-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tf-model-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
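Apply it with kubectl apply -f hpa.yaml. The autoscaler now adds pods when average CPU utilization across the pods rises above 60% of the requested CPU (250m per pod in the Deployment above). It depends on the cluster’s resource metrics pipeline, typically metrics-server, being installed.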
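The scaling decision follows a documented formula: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min and max. Here is a simplified sketch of that calculation with the values from hpa.yaml as defaults. It assumes the default 10% tolerance and ignores stabilization windows, pods that are not ready, and missing metrics. The function name desired_replicas is illustrative.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization=60,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Simplified version of the HPA replica calculation.

    Utilization values are percentages of the pods' CPU requests.
    """
    ratio = current_utilization / target_utilization
    # Within the tolerance band the autoscaler leaves the replica count alone
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 90))   # 3
print(desired_replicas(2, 63))   # 2
print(desired_replicas(8, 120))  # 10
```

The second call shows why small fluctuations around the target do not cause churn, and the third shows maxReplicas capping a scale-up that the formula alone would take to 16.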
Wrap‑Up Thoughts
Deploying a TensorFlow model doesn’t have to be a nightmare. By containerizing the model with Docker, exposing it via a tiny Flask API, and letting Kubernetes handle scaling and health, you get a robust pipeline that can grow with your traffic. The steps above are deliberately simple—once you’re comfortable, you can replace Flask with TensorFlow Serving, add GPU nodes, or integrate CI/CD pipelines.
Remember, the goal is to keep the model serving layer as lightweight and observable as possible. If you ever find yourself chasing a mysterious “500 Internal Server Error,” check the health endpoint and the pod logs first. Most issues are just missing environment variables or a mismatched TensorFlow version.
Happy shipping, and may your models stay accurate and your pods stay healthy.