When would you choose gRPC over REST for model serving, and what are the practical trade-offs?
gRPC uses HTTP/2 and Protocol Buffers to deliver lower latency, strongly typed contracts, and built-in streaming, making it the better choice for high-throughput internal model services. REST remains the standard for public-facing APIs where broad client compatibility and human-readable payloads matter more than raw performance.
How to think about it
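REST over HTTP/1.1 typically sends JSON, which is human-readable but verbose. Each connection carries one request at a time, so concurrent calls need a pool of connections, and without keep-alive every request also pays for a new TCP handshake. There is no built-in schema enforcement unless you add OpenAPI or JSON Schema validation, so a missing field fails only at runtime. It is universally supported by browsers, curl, and all HTTP clients, which makes it the natural choice for external APIs.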
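To make the runtime-failure point concrete, here is a minimal sketch of the hand-written validation a JSON endpoint tends to need. The helper name `parse_predict_request` and the `features` field are assumptions chosen for illustration, not part of any particular framework.

```python
import json


def parse_predict_request(body: str) -> list[float]:
    """Parse a JSON request body and validate it by hand (hypothetical helper)."""
    payload = json.loads(body)
    if "features" not in payload:
        # Nothing upstream caught this; the error only shows up per request.
        raise ValueError("missing required field: features")
    features = payload["features"]
    if not isinstance(features, list) or not all(
        isinstance(x, (int, float)) and not isinstance(x, bool) for x in features
    ):
        raise ValueError("features must be a list of numbers")
    return [float(x) for x in features]


print(parse_predict_request('{"features": [0.1, 0.2]}'))
try:
    parse_predict_request('{"feature": [0.1, 0.2]}')  # typo in the field name
except ValueError as err:
    print("rejected:", err)
```

A client that misspells the field only finds out when the server rejects the call, which is the gap a generated stub closes.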
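gRPC over HTTP/2 serialises with Protocol Buffers (binary, compact). A single connection multiplexes concurrent calls as independent streams, which removes HTTP-level head-of-line blocking (TCP-level blocking on packet loss remains). The .proto schema is the contract: client and server stubs are generated, so type errors surface at compile time in statically typed languages and at message construction in dynamic ones such as Python. Latency improvements of 2–5x and payload size reductions of 60–80% versus equivalent JSON are commonly cited, but the actual gain depends on payload shape and network conditions, so benchmark your own workload.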
When to pick gRPC for ML serving:
- Internal microservice calls where all clients can import the stub (feature store → model server → downstream service).
- Low-latency inference with large tensor payloads (image, embedding, audio).
- Server-streaming for returning token-by-token LLM outputs.
- High fan-out deployments where connection overhead matters.
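The server-streaming case can be sketched without a running server. In grpcio, a server-streaming method is written as a generator that yields response messages. The sketch below keeps that shape but yields plain dicts so it runs without generated stubs. `generate_tokens` and `predict_stream` are hypothetical names, and the upper-casing loop merely stands in for a real decode step.

```python
from typing import Iterator


def generate_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for an LLM decode loop: produces one token at a time."""
    for word in prompt.split():
        yield word.upper()


def predict_stream(prompt: str) -> Iterator[dict]:
    """Shape of a server-streaming handler: yield each chunk as soon as it exists.

    A real handler would take (request, context) and yield generated
    protobuf messages instead of dicts.
    """
    for index, token in enumerate(generate_tokens(prompt)):
        yield {"index": index, "token": token}


# The client sees chunks one by one instead of waiting for the full response.
for chunk in predict_stream("hello streaming world"):
    print(chunk)
```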
When REST wins:
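- Browser clients, which cannot use gRPC natively (gRPC-Web typically needs a translating proxy and adds complexity), and mobile clients where an extra gRPC dependency is not justified.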
- Public partner APIs where documentation and interoperability outweigh performance.
- Simple CRUD wrappers where JSON readability aids debugging.
// inference.proto
syntax = "proto3";

service Predictor {
  rpc Predict (PredictRequest) returns (PredictResponse);
}

message PredictRequest {
  repeated float features = 1;
}

message PredictResponse {
  float score = 1;
}
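To see where the payload savings come from, the sketch below hand-encodes a PredictRequest the way the protobuf wire format lays out a packed `repeated float` (one tag byte, a varint length, then little-endian float32 values) and compares it with the equivalent JSON. It is an illustration only: real code would use the generated stubs, and `encode_varint` and `encode_predict_request` are hypothetical helpers written for this example.

```python
import json
import struct


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_predict_request(features: list[float]) -> bytes:
    """Hand-encode PredictRequest: field 1 as a packed repeated float."""
    if not features:
        return b""  # proto3 omits empty repeated fields entirely
    body = struct.pack(f"<{len(features)}f", *features)  # little-endian float32
    tag = (1 << 3) | 2  # field number 1, wire type 2 (length-delimited)
    return bytes([tag]) + encode_varint(len(body)) + body


features = [i / 128 for i in range(128)]
binary = encode_predict_request(features)
text = json.dumps({"features": features}).encode("utf-8")
# 128 float32 values: 1 tag byte + 2 length bytes + 512 data bytes = 515 bytes.
print(len(binary), "bytes as protobuf,", len(text), "bytes as JSON")
```

The binary size is fixed at four bytes per value, while the JSON size grows with the number of digits printed, so the ratio depends heavily on the data.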
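# Triton Inference Server exposes a gRPC endpoint (port 8001 by default) alongside HTTP/REST
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the server's gRPC endpoint.
client = grpcclient.InferenceServerClient("localhost:8001")

# Describe the input tensor (name, shape, datatype), then attach the data.
inp = grpcclient.InferInput("INPUT0", [1, 128], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 128).astype(np.float32))

# Run inference and read the output tensor back as a NumPy array.
result = client.infer("my_model", [inp])
print(result.as_numpy("OUTPUT0"))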
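For comparison, the same call can be expressed against Triton's HTTP/REST endpoint, which follows the KServe v2 inference protocol. The sketch below only builds the request so that it runs without a server. It assumes the default HTTP port 8000 and the same model and tensor names as the gRPC example, and `build_infer_request` is a hypothetical helper.

```python
import json
import random
import urllib.request


def build_infer_request(model_name: str, values: list[float]) -> urllib.request.Request:
    """Build (but do not send) a JSON inference request for the REST endpoint."""
    body = {
        "inputs": [
            {
                "name": "INPUT0",
                "shape": [1, len(values)],
                "datatype": "FP32",
                "data": values,  # tensor contents travel as JSON numbers
            }
        ]
    }
    return urllib.request.Request(
        url=f"http://localhost:8000/v2/models/{model_name}/infer",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_infer_request("my_model", [random.random() for _ in range(128)])
print(request.get_method(), request.full_url, len(request.data), "bytes")
# Against a running server, send it with: urllib.request.urlopen(request)
```

The request is easy to inspect and reproduce with curl, which is the debugging advantage of REST. The cost is that every float is spelled out as text.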