Deploy a TensorFlow Model to a Serverless Function in 10 Minutes

You’ve probably heard that serverless is the fastest way to get code into production. The same idea works for machine learning – you can have a TensorFlow model answering requests in the time it takes to brew a coffee. In this post I’ll walk you through a real‑world example that I used last week for a quick demo. No fluff, just the steps you can copy‑paste.

What you need

Before we dive in, make sure you have these ready:

A TensorFlow model saved as a SavedModel folder (or a .h5 file you can convert).
An account on a serverless platform that supports Python – I’ll use Google Cloud Functions because its free tier covers small loads and it’s easy to set up.
The gcloud CLI installed and logged in.
A tiny Python script that loads the model and returns a prediction.

If any of these sound unfamiliar, don’t worry – I’ll explain each piece as we go.

Step 1: Pick or train a model

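For the sake of speed I grabbed a pre‑trained MNIST digit recognizer from TensorFlow Hub. You can replace this with any model you’ve trained yourself. One caveat: hub.KerasLayer loads TensorFlow SavedModel handles, and the handle below sits under a tfjs-model path, so confirm it loads in your environment or swap in your own model.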

import tensorflow as tf
import tensorflow_hub as hub

def get_model():
    # Load a small model from TF Hub
    model = tf.keras.Sequential([
        hub.KerasLayer("https://tfhub.dev/tensorflow/tfjs-model/mnist/1/default/1")
    ])
    return model

Save it locally with:

python - <<'EOF'
import tensorflow as tf
import tensorflow_hub as hub

# Wrap the Hub layer in a Keras model and export it as a SavedModel folder
model = tf.keras.Sequential([hub.KerasLayer("https://tfhub.dev/tensorflow/tfjs-model/mnist/1/default/1")])
model.save("model")  # creates the ./model folder if it does not exist
EOF

Now you have a folder called model that contains everything the function will need.
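Before moving on, it can help to confirm that the export really produced a SavedModel and to see how big it is, since the folder ships with the function source. Here is a minimal sketch using only the standard library, assuming the folder is named model as above. The helper names looks_like_saved_model and folder_size_mb are made up for this example:

```python
import os

def looks_like_saved_model(path):
    """Return True if the folder has the saved_model.pb file a SavedModel export writes."""
    return os.path.isfile(os.path.join(path, "saved_model.pb"))

def folder_size_mb(path):
    """Sum the size of every file under path, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Only report if the export step has already been run in this directory
if os.path.isdir("model"):
    print("SavedModel found:", looks_like_saved_model("model"))
    print("Size: %.1f MB" % folder_size_mb("model"))
```

If the size comes out larger than you expected, that is a hint to look at the trimming tips near the end of this post.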

Step 2: Write the handler

Create a file named main.py. This is the entry point that the serverless platform will call.

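import json
import numpy as np
import tensorflow as tf

# Load the model once per container instance, at import time
model = tf.keras.models.load_model("model")

JSON_HEADERS = {"Content-Type": "application/json"}

def predict(request):
    """
    Expects a JSON body like {"pixels": [0,0,0,...]} with 784 values in 0-255.
    Returns the digit with the highest probability.
    """
    # silent=True returns None instead of raising on a missing or malformed body
    data = request.get_json(silent=True)
    if not data or "pixels" not in data:
        return json.dumps({"error": "body must be JSON with a 'pixels' list"}), 400, JSON_HEADERS
    try:
        # Reshape the flat list to (batch, height, width, channels) and scale to [0, 1]
        pixels = np.array(data["pixels"], dtype=np.float32).reshape(1, 28, 28, 1) / 255.0
    except (TypeError, ValueError) as e:
        # Wrong length or non-numeric values are the caller's mistake, so answer 400
        return json.dumps({"error": str(e)}), 400, JSON_HEADERS
    probs = model.predict(pixels)
    digit = int(np.argmax(probs))
    return json.dumps({"prediction": digit}), 200, JSON_HEADERS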

A couple of notes:

The model is loaded outside the request function, at import time. A container instance that stays warm reuses it across calls, so the load cost is paid once per instance rather than on every request.
I keep the input format simple – a flat list of 784 pixel values. You can change it to accept base64 images if you like.
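If you want clearer error messages than a reshape failure, one option is to validate the list before it reaches NumPy. This is a sketch using only the standard library, and parse_pixels is a hypothetical helper, not part of any framework. It also accepts an optional base64 field holding 784 raw bytes, one per pixel; the key name pixels_b64 and that byte layout are assumptions I made for the example:

```python
import base64

EXPECTED = 28 * 28  # 784 values for one MNIST-sized image

def parse_pixels(data):
    """Return a list of 784 numbers in 0-255, or raise ValueError with a readable message."""
    if not isinstance(data, dict):
        raise ValueError("body must be a JSON object")
    if "pixels_b64" in data:
        # Assumed format: base64 of 784 raw bytes, one byte per pixel
        try:
            pixels = list(base64.b64decode(data["pixels_b64"], validate=True))
        except (ValueError, TypeError) as exc:
            raise ValueError("pixels_b64 is not valid base64") from exc
    else:
        pixels = data.get("pixels")
    if not isinstance(pixels, list) or len(pixels) != EXPECTED:
        raise ValueError("expected %d pixel values" % EXPECTED)
    for value in pixels:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("pixel values must be numbers")
        if not 0 <= value <= 255:
            raise ValueError("pixel values must be between 0 and 255")
    return pixels
```

In the handler you would call parse_pixels on the parsed body, catch ValueError, and return its message with a 400 status.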

Step 3: Add a requirements file

Serverless needs to know which Python packages to install. Create requirements.txt with:

tensorflow==2.13.0
tensorflow-hub
numpy

I pinned the TensorFlow version so that every build installs the same release. TensorFlow 2.13 supports Python 3.9, which is the runtime used in the deploy step. If you run into size limits, you can switch to tensorflow-cpu, which is smaller.

Step 4: Test locally (optional but helpful)

The functions-framework library lets you run the function on your laptop before deploying:

pip install functions-framework
functions-framework --target=predict --debug

Send a test request with curl:

curl -X POST -H "Content-Type: application/json" \
     -d '{"pixels": [0,0,0,... (784 numbers) ...]}' \
     http://localhost:8080/

If you see a JSON response with a digit, you’re good to go.
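Typing 784 numbers into a curl command is tedious, so a small script can build the payload and send it instead. This sketch uses only the standard library. The all‑zero image is placeholder data, and build_payload and post_json are names I made up for the example:

```python
import json
import urllib.request

def build_payload(value=0, size=28 * 28):
    """Return a JSON string with `size` identical pixel values (a placeholder image)."""
    return json.dumps({"pixels": [value] * size})

def post_json(url, body, timeout=30):
    """POST a JSON string to url and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# With functions-framework running locally:
# print(post_json("http://localhost:8080/", build_payload()))
```

Note that urlopen raises an HTTPError for 4xx and 5xx responses, so a rejected payload shows up as an exception rather than a return value.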

Step 5: Deploy to Google Cloud Functions

First, make sure main.py, requirements.txt, and the model folder sit together in one directory. gcloud packages and uploads a local source directory itself, so there is no need to zip it by hand.

Now deploy from that directory:

gcloud functions deploy tf_predict \
  --runtime python39 \
  --trigger-http \
  --allow-unauthenticated \
  --entry-point predict \
  --source . \
  --memory 512MB \
  --timeout 60s

A few things to watch:

Memory – TensorFlow and the model both have to fit in the function’s memory. 512 MB was enough for this small model; raise the limit if the logs show out-of-memory errors.
Timeout – Loading the model can take a second or two, so give it 60 seconds just in case.
Unauthenticated – For a quick demo I leave it open, but in production you’d lock it down with IAM or an API key.
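For the API‑key option, one lightweight approach is to check a shared secret in the handler before touching the model. This is a rough sketch, assuming the key is supplied through an environment variable named API_KEY and sent by the client in an X-API-Key header. Both names are my own choice, not something Cloud Functions mandates:

```python
import hmac
import json
import os

def check_api_key(headers, expected=None):
    """Return None if the key matches, else an (error body, status) pair."""
    if expected is None:
        expected = os.environ.get("API_KEY", "")
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking how many leading characters matched
    if not expected or not hmac.compare_digest(supplied.encode(), expected.encode()):
        return json.dumps({"error": "unauthorized"}), 401
    return None

# Inside predict(request), before reading the body:
#     denied = check_api_key(request.headers)
#     if denied:
#         return denied
```

An empty or missing API_KEY rejects every request, which is a safer default than letting everything through when the variable was forgotten at deploy time.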

The CLI will print a URL once the deployment finishes. That URL is your new inference endpoint.

Step 6: Call the live endpoint

Grab the URL from the previous step and fire a request:

curl -X POST -H "Content-Type: application/json" \
     -d '{"pixels": [0,0,0,... (784 numbers) ...]}' \
     https://REGION-PROJECT.cloudfunctions.net/tf_predict

You should get something like:

{"prediction": 7}

That’s it – a TensorFlow model serving behind a serverless function in under ten minutes. I was amazed at how little code was needed. The biggest surprise for me was how fast the first request was after deployment: the container started, loaded the model, and answered in about 1.2 seconds. Not bad for a hobby project.
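If you want to check latency on your own deployment, timing a handful of consecutive calls shows the gap between the first request and the ones that follow. A minimal sketch: time_calls and summarize are hypothetical helpers, and time_calls accepts any zero‑argument callable, so you can pass it whatever sends your request:

```python
import time

def time_calls(send, n=5):
    """Call send() n times and return each call's duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        send()
        durations.append(time.perf_counter() - start)
    return durations

def summarize(durations):
    """Split the first call from the rest and average the later ones."""
    rest = durations[1:]
    return {
        "first": durations[0],
        "later_avg": sum(rest) / len(rest) if rest else None,
    }
```

Your numbers will depend on region, model size, and whether an instance was already warm, so treat a single run as a rough indication only.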

Tips for keeping it fast and cheap

Trim the model – Use TensorFlow Lite or quantize the model to shrink size and speed up loading.
Cold start awareness – The first request after a period of inactivity will be slower because the container has to start and load the model. If you need sub‑second latency, consider a small VM or a managed AI endpoint.
Monitor usage – Cloud Functions charges per invocation and for compute time (memory in GB‑seconds, CPU in GHz‑seconds). A tiny model with a few hundred requests a day should stay well under the free tier.
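To put rough numbers on the billing point, memory time is just requests times duration times memory. The sketch below assumes 300 requests a day, 1.2 seconds per request, and the 512 MB setting from the deploy command. The free allowance of 400,000 GB‑seconds a month is my recollection of the published figure, so check the current pricing page before relying on it:

```python
def monthly_gb_seconds(requests_per_day, seconds_per_request, memory_mb, days=30):
    """Estimate billed memory time for a month, in GB-seconds."""
    return requests_per_day * days * seconds_per_request * (memory_mb / 1024)

# Assumed free allowance; verify against current Cloud Functions pricing
FREE_GB_SECONDS = 400_000

usage = monthly_gb_seconds(requests_per_day=300, seconds_per_request=1.2, memory_mb=512)
print("Estimated usage: %.0f GB-seconds" % usage)
print("Share of assumed free allowance: %.2f%%" % (100 * usage / FREE_GB_SECONDS))
```

Under those assumptions the estimate is about 5,400 GB‑seconds a month. It ignores CPU time and network egress, so treat it as a lower bound on what gets metered.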

When to use this pattern

Prototyping – Quickly share a model with teammates without setting up a full server.
Event‑driven inference – Trigger the function from a Pub/Sub message or a storage upload.
Low‑traffic APIs – If you expect only a few hundred calls a day, serverless is usually cheaper than a dedicated server, since you pay nothing while the function sits idle.
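For the event‑driven case, a Pub/Sub‑triggered background function receives the message body base64‑encoded in event["data"] rather than as an HTTP request. Here is a sketch of the decoding step, assuming the publisher sends the same {"pixels": [...]} JSON used earlier. pixels_from_event and on_message are hypothetical names, and the model call is left as a comment:

```python
import base64
import json

def pixels_from_event(event):
    """Decode a Pub/Sub event payload and return its 'pixels' list."""
    raw = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(raw)["pixels"]

def on_message(event, context):
    """Entry point for a Pub/Sub trigger (background function signature)."""
    pixels = pixels_from_event(event)
    # Here you would run the model on `pixels` as in main.py and store or log the result
    print("received %d pixel values" % len(pixels))
```

A function like this would be deployed with a topic trigger instead of --trigger-http, and since there is no HTTP response, the prediction has to be written somewhere such as a log, a database, or another topic.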

If your traffic grows well beyond that or you need GPU acceleration, you’ll want to move to a dedicated serving platform like Vertex AI or Amazon SageMaker. But for many side‑projects, this approach hits the sweet spot of speed, cost, and simplicity.

Happy coding, and may your models stay light and your functions stay warm!