MLOps | Medium | Asked at Google, Netflix, Lyft, Cloudflare

When would you choose gRPC over REST for model serving, and what are the practical trade-offs?

For: MLOps Engineer, ML Engineer, AI / LLM Engineer

The short answer

gRPC uses HTTP/2 and Protocol Buffers to deliver lower latency, strongly typed contracts, and built-in streaming, making it the better choice for high-throughput internal model services. REST remains the standard for public-facing APIs where broad client compatibility and human-readable payloads matter more than raw performance.

How to think about it

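REST is typically served over HTTP/1.1 with JSON bodies, which are human-readable but verbose. HTTP/1.1 reuses connections via keep-alive by default, but each connection carries only one request at a time, so concurrent calls need a pool of connections (and with keep-alive disabled, every request pays for a new TCP handshake). JSON itself enforces no schema: unless the service validates against something like an OpenAPI or JSON Schema definition, a missing field fails only at runtime. REST is universally supported by browsers, curl, and all HTTP clients, which makes it the natural choice for external APIs.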
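To make the runtime-failure point concrete, here is a minimal sketch of hand-written validation for a JSON request body. The `parse_predict_body` helper, the `BadRequest` exception, and the `features` field name are all hypothetical and do not come from any particular web framework. Nothing checks the payload shape until this code runs.

```python
import json


class BadRequest(ValueError):
    """Raised when a JSON body does not match the expected shape."""


def parse_predict_body(body: bytes) -> list[float]:
    """Parse and validate a hypothetical {"features": [...]} request body."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise BadRequest(f"invalid JSON: {exc}") from exc

    # JSON has no schema, so every structural check is manual.
    if not isinstance(payload, dict) or "features" not in payload:
        raise BadRequest("missing required field 'features'")

    features = payload["features"]
    if not isinstance(features, list) or not all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in features
    ):
        raise BadRequest("'features' must be a list of numbers")

    return [float(x) for x in features]


print(parse_predict_body(b'{"features": [0.1, 0.2, 3]}'))

try:
    # A typo in the field name is only caught when the request is handled.
    parse_predict_body(b'{"feature": [0.1, 0.2]}')
except BadRequest as exc:
    print(f"rejected at runtime: {exc}")
```

With generated gRPC stubs, the equivalent mistake (setting a field that does not exist on the message) is rejected by the generated message class or, in statically typed languages, by the compiler.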

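gRPC over HTTP/2 serialises with Protocol Buffers (binary, compact). A single connection multiplexes concurrent calls as independent streams, which removes HTTP-level head-of-line blocking (a lost TCP packet can still stall every stream on that connection). The .proto schema is the contract: client and server stubs are generated from it, so in statically typed languages many type errors surface at compile time. Latency improvements of 2–5x and payload size reductions of 60–80 % versus equivalent JSON are commonly reported, though the actual gain depends on payload shape, network conditions, and how much of the request time is spent in model compute.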

When to pick gRPC for ML serving:

Internal microservice calls where all clients can import the stub (feature store → model server → downstream service).
Low-latency inference with large tensor payloads (image, embedding, audio).
Server-streaming for returning token-by-token LLM outputs.
High fan-out deployments where connection overhead matters.
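The server-streaming case can be sketched without a running server. In the Python gRPC library, a server-streaming method is written as a generator that yields one response message per chunk, and the stream ends when the generator returns. The sketch below mimics that shape only: `GenerateRequest`, `TokenChunk`, `fake_model_tokens`, and `generate` are hypothetical stand-ins for protoc-generated messages and a real decoding loop.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class GenerateRequest:
    # Hypothetical stand-in for a protoc-generated request message.
    prompt: str


@dataclass
class TokenChunk:
    # Hypothetical stand-in for a protoc-generated response message.
    text: str


def fake_model_tokens(prompt: str) -> Iterator[str]:
    """Placeholder for a real decoder loop: echoes the prompt word by word."""
    yield from prompt.split()


def generate(request: GenerateRequest, context=None) -> Iterator[TokenChunk]:
    """Server-streaming handler shape: yield each token as soon as it exists."""
    for token in fake_model_tokens(request.prompt):
        yield TokenChunk(text=token)
    # Returning from the generator is what closes the stream.


# A client would iterate over the streaming call in the same way.
for chunk in generate(GenerateRequest(prompt="the quick brown fox")):
    print(chunk.text)
```

The design point is that the caller starts rendering output after the first token rather than waiting for the full completion, which is awkward to express with a single JSON response.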

When REST wins:

Browser clients, which cannot use gRPC natively (gRPC-Web works but needs a translating proxy and adds complexity).
Public partner APIs where documentation and interoperability outweigh performance.
Simple CRUD wrappers where JSON readability aids debugging.

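// inference.proto
syntax = "proto3";

// Unary prediction service: one request in, one response out.
service Predictor {
  rpc Predict (PredictRequest) returns (PredictResponse);
}

message PredictRequest {
  // Flat feature vector; proto3 packs repeated scalars on the wire by default.
  repeated float features = 1;
}

message PredictResponse {
  // Model output for the submitted feature vector.
  float score = 1;
}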
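To show where the compactness comes from, the sketch below hand-encodes a `PredictRequest` in the Protocol Buffers wire format and compares its size with the equivalent JSON. This is for illustration only: real services would use the protoc-generated classes, and `encode_varint` and `encode_predict_request` are hypothetical helper names. It assumes the `features` field is packed (the proto3 default for repeated scalars), so the message is one tag byte, a varint length, and 4 bytes per float.

```python
import json
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_predict_request(features: list[float]) -> bytes:
    """Hand-encode PredictRequest.features as a packed repeated float."""
    body = struct.pack(f"<{len(features)}f", *features)  # little-endian float32
    tag = (1 << 3) | 2  # field number 1, wire type 2 (length-delimited)
    return bytes([tag]) + encode_varint(len(body)) + body


# 128 illustrative feature values that are exactly representable as float32.
features = [i / 128 for i in range(128)]

proto_bytes = encode_predict_request(features)
json_bytes = json.dumps({"features": features}).encode("utf-8")

print(f"protobuf: {len(proto_bytes)} bytes, JSON: {len(json_bytes)} bytes")
```

The protobuf size is fixed at 4 bytes per value plus a few bytes of framing, while the JSON size grows with the number of decimal digits printed, so the ratio varies with the data.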

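# Triton Inference Server exposes both HTTP/REST and gRPC endpoints;
# this example uses the gRPC client (default gRPC port 8001).
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# "my_model", "INPUT0" and "OUTPUT0" are placeholders; they must match
# the names, shape and dtype declared in the model's configuration.
inp = grpcclient.InferInput("INPUT0", [1, 128], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 128).astype(np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))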
