Where the milliseconds go in a GPU inference request

We were working with a team building real-time voice agents. Their pipeline was ASR → tool call → reasoning and summarization → TTS. Every node adds latency. ASR has to transcribe before reasoning can start. The tool call blocks the summarizer. TTS can’t begin until the summary is done.

The only latency you can control is the overhead between nodes — the part the infrastructure adds. Every millisecond of platform overhead compounds across four serial hops. A voice agent that feels natural has maybe 600ms of total budget. The model inference is the expensive part. The infrastructure around it has to be nearly free.

This is a post about how we get there.

[Figure: isometric illustration of a voice pipeline, with a microphone on the left, three glowing teal processing nodes in the middle, and a speaker on the right, connected by amber lines.]

Casola handles agent workflow orchestration and GPU inference routing. Behind a single API, it manages scheduling, auto-scaling, and regional dispatch for multi-modal pipelines.

The request path

When a request arrives at the API, it doesn’t have to travel far. Our ingress sits in Cloudflare, at the edge of the network, close to the client — often before traffic reaches the open internet at all. Authentication and metadata are cached at the ingress point, so auth validation adds no remote round-trips. From there, the request routes over the backbone to the nearest queue node, or to a specific region if the workload requires it.

The queue node dispatches to a GPU worker over a persistent connection, which avoids TCP handshakes and other unnecessary round-trips. The worker runs inference and sends results back the same way.

When enabled, every response carries a Server-Timing header breaking this down:

Server-Timing: worker-rtt;dur=42, wait;dur=0, work;dur=38, queue-overhead;dur=4

worker-rtt is the round-trip to the worker. work is what the worker reported for actual inference time. wait is time spent queued before dispatch. queue-overhead is everything the queue node did that wasn’t waiting for a worker. For a warm, uncontested request, queue-overhead is typically a few milliseconds.
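To make the breakdown concrete, here is a minimal client-side sketch that parses the header shown above. The function name `parse_server_timing` is our own for illustration, and reading `worker-rtt` minus `work` as transport time between queue node and worker is an assumption about how the fields relate, not a documented guarantee.

```python
def parse_server_timing(header: str) -> dict[str, float]:
    """Parse a Server-Timing header value into {metric_name: duration_ms}."""
    timings: dict[str, float] = {}
    for entry in header.split(","):
        # Each entry looks like "name;dur=42" and may carry other params.
        name, *params = [part.strip() for part in entry.split(";")]
        for param in params:
            key, _, value = param.partition("=")
            if key.strip().lower() == "dur":
                timings[name] = float(value.strip().strip('"'))
    return timings


header = "worker-rtt;dur=42, wait;dur=0, work;dur=38, queue-overhead;dur=4"
timings = parse_server_timing(header)

# Assumption: the part of the worker round-trip that was not inference
# is time spent on the wire between queue node and worker.
transport_ms = timings["worker-rtt"] - timings["work"]

print(timings)
print(f"transport: {transport_ms} ms, queue overhead: {timings['queue-overhead']} ms")
```

Logging these four numbers per request is usually enough to tell whether a slow response came from the model, the queue, or the network.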

The sections below explain how.

Sync and async are different contracts

Not every request needs the same guarantees. We separate them explicitly.

Sync requests are designed for minimum latency. The result is returned inline, no polling required. There’s no database write on the hot path — the request lives in memory from dispatch to response. If the worker fails mid-inference, the caller retries. Same model as the OpenAI API: fast, simple, retry on failure. Good examples are chat completions or TTS, where the client is waiting for an immediate response.

Async requests are different. They’re persisted before dispatch, which means the queue can survive a worker failure without losing the job. The caller polls for results. Failures get masked via redispatch from checkpoints. Async is the right model when the job takes minutes, the client isn’t holding a connection, or you need guaranteed delivery — for example, fal.ai-compatible requests for batch image or video generation.

The overhead difference is measurable. Sync requests skip the write entirely. Async requests pay for persistence and get reliability in return. With Casola, you choose between the two explicitly, per request.
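The two contracts can be sketched with a toy in-memory model. Everything here is hypothetical (`sync_request`, `AsyncQueue`, the flaky worker), and the dict only stands in for durable storage. The sketch also redispatches a failed job from the start rather than from a checkpoint, which is a simplification.

```python
import itertools


class WorkerFailure(Exception):
    """Simulated worker crash mid-inference."""


def make_flaky_worker(fail_first: int):
    """Return a worker function that fails its first `fail_first` calls."""
    calls = itertools.count()

    def run(payload: str) -> str:
        if next(calls) < fail_first:
            raise WorkerFailure("worker died mid-inference")
        return payload.upper()

    return run


def sync_request(worker, payload: str, max_attempts: int = 3):
    """Sync contract: nothing is persisted, so the caller retries on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return worker(payload), attempt
        except WorkerFailure:
            if attempt == max_attempts:
                raise


class AsyncQueue:
    """Async contract: persist first, redispatch on failure, caller polls."""

    def __init__(self, worker):
        self.worker = worker
        self.jobs = {}  # stands in for durable storage
        self._ids = itertools.count(1)

    def submit(self, payload: str) -> int:
        job_id = next(self._ids)
        self.jobs[job_id] = {"payload": payload, "status": "queued", "result": None}
        return job_id

    def run_once(self) -> None:
        """One dispatch pass; a failed job stays queued for the next pass."""
        for job in self.jobs.values():
            if job["status"] != "queued":
                continue
            try:
                job["result"] = self.worker(job["payload"])
                job["status"] = "done"
            except WorkerFailure:
                pass  # the job is still persisted, so it is redispatched later

    def poll(self, job_id: int):
        job = self.jobs[job_id]
        return job["status"], job["result"]


# Sync: the caller sees the failure and retries.
result, attempts = sync_request(make_flaky_worker(fail_first=1), "hello")

# Async: the caller never sees the failure, it just polls.
queue = AsyncQueue(make_flaky_worker(fail_first=1))
job_id = queue.submit("hello")
queue.run_once()
first_poll = queue.poll(job_id)
queue.run_once()
second_poll = queue.poll(job_id)
print(result, attempts, first_poll, second_poll)
```

In the sync path the failure handling lives in the caller. In the async path it lives in the queue, and the price is the write in `submit`.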

Content filtering has to be fast

We get it: nobody likes content filters. But content filtering is often a legal requirement for any platform offering agentic AI services.

The obvious implementation is to classify the input, then dispatch if it passes. That puts filtering on the critical path. Every request waits for classification before a worker sees it. For low-latency pipelines, that’s unacceptable.

We do it differently.

Content classification runs in parallel with dispatch. Both start at the same time. If the classification completes first with a violation, we abort the request in flight. If dispatch wins, the response is on its way and classification catches up asynchronously.

classify(input)  ─────────────────────────────► [result]
dispatch(input)  ──────► worker ──► response
                          ↑
                     whichever arrives first

For requests that pass the filter (the vast majority), input classification costs nothing. We accept that we occasionally do wasted GPU work on requests that get aborted in flight. That’s a deliberate trade: we absorb the cost so the pipeline doesn’t slow down.
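The race between classification and dispatch can be sketched with `asyncio`. This is a minimal illustration, not our production code: `classify`, `dispatch`, and `handle` are hypothetical names, the delays are simulated with `asyncio.sleep`, and the classifier is a keyword check. What happens to a late verdict after the response has already been returned is left out.

```python
import asyncio


class ContentViolation(Exception):
    """Raised when the classifier flags the input before a response is returned."""


# Keeps late classifier tasks referenced so they are not garbage-collected.
_background: set[asyncio.Task] = set()


async def classify(text: str, delay: float) -> bool:
    """Stand-in classifier: True means the input violates policy."""
    await asyncio.sleep(delay)
    return "forbidden" in text


async def dispatch(text: str, delay: float) -> str:
    """Stand-in for sending the request to a worker and awaiting inference."""
    await asyncio.sleep(delay)
    return f"response to: {text}"


async def handle(text: str, classify_delay: float = 0.01, dispatch_delay: float = 0.05) -> str:
    # Start both at the same time; neither waits for the other.
    classify_task = asyncio.create_task(classify(text, classify_delay))
    dispatch_task = asyncio.create_task(dispatch(text, dispatch_delay))

    done, _pending = await asyncio.wait(
        {classify_task, dispatch_task}, return_when=asyncio.FIRST_COMPLETED
    )

    # Classifier finished first (or together) with a violation: abort in flight.
    if classify_task in done and classify_task.result():
        dispatch_task.cancel()
        raise ContentViolation("input rejected by classifier")

    # Dispatch won: let classification catch up in the background.
    if classify_task not in done:
        _background.add(classify_task)
        classify_task.add_done_callback(_background.discard)

    return await dispatch_task


print(asyncio.run(handle("hello")))
```

For a clean input, the total time is the dispatch time alone. The classifier's delay only matters when it reports a violation before the response is ready.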

Dedicated capacity for sync work

We maintain separate capacity pools for sync and async work. Async workloads optimize for throughput. Sync workloads optimize for latency — low wait time, high availability.

When a burst of sync requests arrives and the sync pool is at capacity, those requests don’t wait in line behind async batch jobs. They can pull from available capacity across the system. Auto-scaling handles sustained demand, but it takes time — spinning up a new worker is not instantaneous. The ability to jump the queue gives sync workloads a burst margin that’s faster than provisioning.
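One way to picture the burst margin is a toy admission function in which sync requests may borrow idle slots from the async pool, but not the other way around. The `Pool` type, the `acquire` function, and the slot counts are illustrative assumptions. The real scheduler also has to deal with releasing slots, prioritization, and auto-scaling, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    name: str
    capacity: int
    in_use: int = 0

    def free(self) -> int:
        return self.capacity - self.in_use


def acquire(kind: str, sync_pool: Pool, async_pool: Pool) -> Optional[str]:
    """Return the name of the pool that serves the request, or None if it must wait."""
    # Sync work tries its own pool first, then bursts into idle async capacity.
    # Async work is confined to its own pool.
    candidates = (sync_pool, async_pool) if kind == "sync" else (async_pool,)
    for pool in candidates:
        if pool.free() > 0:
            pool.in_use += 1
            return pool.name
    return None


sync_pool = Pool("sync", capacity=2)
async_pool = Pool("async", capacity=2)

# A burst of three sync requests: the third spills over into the async pool.
burst = [acquire("sync", sync_pool, async_pool) for _ in range(3)]

# Async jobs only get what is left in their own pool.
async_first = acquire("async", sync_pool, async_pool)
async_second = acquire("async", sync_pool, async_pool)
print(burst, async_first, async_second)
```

The asymmetry is the point: a sync burst can slow async throughput for a moment, but an async backlog can never add wait time to a sync request.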

In the voice pipeline, this matters. A spike in real-time traffic doesn’t stall because the async queue is busy. Sync capacity stays responsive at the moments it’s under the most pressure.

Binary everywhere, transformation at the worker

When a worker produces audio or image output, two things can happen. You serialize the result as base64, pass it through the queue, and decode it on the other side. Or you pass raw bytes, skip the encoding step entirely, and avoid the roughly 33% size overhead that base64 adds.

We use raw binary across every hop that supports it. No base64, no JSON-wrapping binary fields. The protocol between the queue and workers uses binary frames: a short header for metadata, followed by the raw payload bytes. The queue forwards the same bytes to the client. Nothing encodes or decodes in the middle.
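As an illustration of the framing idea, here is a minimal sketch with a fixed-size header followed by the raw payload. The header layout (request id, content-type code, payload length) is a made-up example and not our actual wire format. The last lines compare the frame size to the base64 size of the same payload.

```python
import base64
import struct

# Hypothetical header: request id (uint32), content-type code (uint8),
# payload length (uint32), all big-endian with no padding. 9 bytes total.
HEADER = struct.Struct(">IBI")


def encode_frame(request_id: int, content_type: int, payload: bytes) -> bytes:
    """Prefix the raw payload with a short binary header."""
    return HEADER.pack(request_id, content_type, len(payload)) + payload


def decode_frame(frame: bytes) -> tuple[int, int, bytes]:
    """Read the header, then slice out the payload without re-encoding it."""
    request_id, content_type, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size : HEADER.size + length]
    return request_id, content_type, payload


audio = bytes(range(256)) * 12  # 3072 bytes standing in for an audio chunk
frame = encode_frame(request_id=7, content_type=1, payload=audio)
decoded = decode_frame(frame)

b64_size = len(base64.b64encode(audio))
print(f"raw frame: {len(frame)} bytes, base64 payload alone: {b64_size} bytes")
print(f"base64 inflation: {b64_size / len(audio):.2f}x")
```

A forwarding hop only needs to read the fixed header to route the frame. The payload bytes pass through untouched.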

For the voice pipeline, there’s a second part of this. Audio transcription, format conversion, and normalization run directly on the worker node — the same machine that ran inference. The output of ASR is already normalized before it leaves the worker. The input to TTS is already in the right format before inference begins.

Doing this at the worker avoids a round-trip to a separate processing service. No extra queue hop, no extra scheduling delay. The data is already where it needs to be.

What developers see

When tracing is enabled, every response includes a Server-Timing header with a breakdown of where time was spent:

Server-Timing: worker-rtt;dur=42, wait;dur=0, work;dur=38, queue-overhead;dur=4

The numbers in the voice pipeline told us where to look: ingress latency dropped when we moved API entry points closer to the end users. Worker round-trip dropped when we matched the regional queue to the GPU workers in that region. Transformation overhead disappeared when we moved format conversion onto the worker node.

Most infrastructure overhead is invisible until you measure it. Then it’s obvious where to cut.