
One inference platform, four API surfaces

How OpenAI-, Anthropic-, and Fal.ai-compatible clients share the same dispatch backend with Casola's native API, and where they can't

Casola Team

You’ve hit a rate limit again. And the bill arrived and the math no longer works. And you’re already juggling three providers and the plumbing is showing. And your agent just feels slower than it should, and you’ve been meaning to finally streamline voice and video support. Le sigh. Fortunately, the fix is just two lines:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.casola.ai/openai/v1",  # changed: Casola's OpenAI-compatible surface
    api_key="your-casola-key",                   # changed: your Casola key
)

Everything else stays the same. We’ve built AI applications ourselves and watched migrations turn into multi-week archaeology projects: one renamed field in an error response, and suddenly every exception handler in the codebase is wrong. We didn’t want that to be the Casola migration story.

Casola is an inference platform for multi-modal AI agents: text, image, video, and voice, routed to GPU workers across regions. For most workloads, the migration is a base URL swap. This post covers what that gets you, where the rough edges are, and what you’ll need to adjust.

Migrating from OpenAI, Anthropic, and Fal.ai

The OpenAI-compatible surface at /openai/v1/ covers chat completions, embeddings, text-to-speech, speech-to-text, and image generation. Same request shapes, same response shapes, same streaming protocol over SSE.

For Python:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.casola.ai/openai/v1",
    api_key="your-casola-key",
)

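# stream=True returns an iterator of chunks delivered over SSE
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for chunk in response:
    # Some chunks (role-only or final ones) carry no text delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)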

For TypeScript:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.casola.ai/openai/v1",
  apiKey: "your-casola-key",
});

The Anthropic-compatible surface at /anthropic/v1/ accepts the same request and response shapes as api.anthropic.com: text, vision, and tool use. Point the Anthropic SDK at Casola’s base URL and nothing else changes. The base URL stops at /anthropic because the SDK appends /v1/messages itself:

For Python:

from anthropic import Anthropic

client = Anthropic(
    base_url="https://api.casola.ai/anthropic",
    api_key="your-casola-key",
)

For TypeScript:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  baseURL: "https://api.casola.ai/anthropic",
  apiKey: "your-casola-key",
});

Claude Code works against this base URL out of the box — no proxy hop to api.anthropic.com, no modified SDK. The scope is text + vision + tool use. For audio, image generation, and video, use the OpenAI or Fal.ai surfaces.

For image and video workloads built on the Fal.ai SDK, the surface at /fal/ follows Fal.ai’s queue protocol. Submit a job and you immediately get back a request ID and three URLs:

{
  "request_id": "019db38a-...",
  "response_url": "https://.../fal/fal-ai/flux/schnell/requests/019db38a-...",
  "status_url":   "https://.../fal/fal-ai/flux/schnell/requests/019db38a-.../status",
  "cancel_url":   "https://.../fal/fal-ai/flux/schnell/requests/019db38a-.../cancel"
}

Poll the status URL until it reports "COMPLETED", then fetch the result from the response URL. Point the Fal.ai client at Casola’s base URL and the SDK handles the rest.
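If you’d rather poll over plain HTTP than through the SDK, the loop is short. This is a minimal standard-library sketch. It assumes the status response carries a "status" field, as Fal.ai’s queue API does, and it assumes a Fal-style "Authorization: Key <key>" header, so check Casola’s docs for the actual auth scheme. make_fetch and wait_for_result are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

# A fetcher takes a URL and returns the decoded JSON body.
Fetch = Callable[[str], Dict[str, Any]]


def make_fetch(api_key: str) -> Fetch:
    """Build a stdlib GET-and-decode-JSON fetcher.

    ASSUMPTION: the /fal/ surface accepts a Fal-style "Authorization: Key <key>"
    header; check Casola's docs for the actual scheme.
    """
    def fetch(url: str) -> Dict[str, Any]:
        req = urllib.request.Request(url, headers={"Authorization": f"Key {api_key}"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    return fetch


def wait_for_result(
    job: Dict[str, Any],
    fetch: Fetch,
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll job["status_url"] until it reports COMPLETED, then fetch job["response_url"].

    ASSUMPTION: the status body has a "status" field, as in Fal.ai's queue API.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch(job["status_url"]).get("status")
        if status == "COMPLETED":
            return fetch(job["response_url"])
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job['request_id']} still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Pass it the JSON returned by the submit call, for example wait_for_result(job, make_fetch("your-casola-key")). The fetcher and sleep function are injected so the loop can be exercised without a network. The deadline keeps a stuck job from blocking the caller forever.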

What the error handling migration looks like

This is where migrations usually break, so we put extra effort into making sure you’re still riding comfortably when your agent veers off the happy path. Your SDK’s error handlers, retry logic, and any code that inspects error fields were written against a specific JSON shape. If the shape changes, they fail silently or raise unrecognized exceptions.

The shapes match what each SDK expects. On the OpenAI-compatible surface, a rate limit comes back as:

{ "error": { "message": "Rate limit exceeded", "type": "rate_limit_error", "param": null, "code": null } }

The SDK raises this as RateLimitError, the same class it would raise against OpenAI, so your existing except openai.RateLimitError handlers work without changes.
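Because the exception class is unchanged, a retry wrapper keyed on openai.RateLimitError keeps working after the migration. This is a minimal sketch of jittered exponential backoff. with_backoff is a hypothetical helper, and its delays are illustrative defaults, not Casola recommendations. The OpenAI SDK already retries 429 responses a few times on its own (controlled by the client’s max_retries setting), so a layer like this is for when you want your own policy on top.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on retry_on with full-jitter exponential backoff.

    Hypothetical helper; defaults are illustrative only.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    attempt = 0
    while True:
        try:
            return call()
        except retry_on:
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of attempts: re-raise the original exception
            # Sleep a random duration in [0, base * 2^(attempt - 1)].
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
```

Usage against either backend is identical: with_backoff(lambda: client.chat.completions.create(model="Qwen/Qwen3-8B", messages=[{"role": "user", "content": "Hello"}]), retry_on=(openai.RateLimitError,)). Exceptions outside retry_on, such as AuthenticationError, propagate on the first attempt.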

Auth failures follow the same pattern. A bad API key returns { "error": { "type": "authentication_error", ... } }. The SDK raises AuthenticationError, your existing handler catches it.

On the Anthropic-compatible surface, errors carry a top-level type discriminator:

{ "type": "error", "error": { "type": "rate_limit_error", "message": "Rate limit exceeded" } }

The Anthropic SDK raises its own exception classes for these responses (anthropic.RateLimitError, anthropic.AuthenticationError, and so on), so existing exception handlers work without changes. If you’re parsing the body manually, note the outer "type": "error" field: that discriminator is what distinguishes Anthropic error envelopes from OpenAI ones.

On the Fal.ai surface, errors are flat:

{ "detail": "Rate limit exceeded", "error_type": "rate_limit_exceeded" }

The Fal.ai SDK handles this shape natively. If you’re parsing Fal errors manually rather than through the SDK, verify your code is reading error_type, not type or code.

The one place to check: any code that catches a generic exception and inspects the raw JSON body directly. Those are the handlers that break across surfaces, because each surface returns a different shape for the same underlying error. If you’re catching openai.APIStatusError and reading the fields the SDK has already parsed, such as .status_code and .code, that code travels fine. If you’re catching requests.HTTPError and parsing the JSON yourself, test it.
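For handlers that catch a generic HTTP error and parse the body themselves, a small normalizer keyed on envelope shape keeps one code path working across all three surfaces. This is a minimal sketch, assuming the three error shapes shown in this post are the only ones you need to recognize. normalize_error and is_rate_limit are hypothetical helper names, not part of any SDK.

```python
from typing import Any, Dict, NamedTuple, Optional


class NormalizedError(NamedTuple):
    surface: str     # "openai", "anthropic", or "fal"
    error_type: str  # the surface's own type string, passed through unchanged
    message: Any     # human-readable message, usually a string


# Rate-limit type strings: OpenAI and Anthropic use the first, Fal.ai the second.
RATE_LIMIT_TYPES = {"rate_limit_error", "rate_limit_exceeded"}


def normalize_error(body: Dict[str, Any]) -> Optional[NormalizedError]:
    """Classify a decoded error body by envelope shape; None if unrecognized."""
    inner = body.get("error")
    # Anthropic: top-level "type": "error" discriminator wrapping an "error" object.
    if body.get("type") == "error" and isinstance(inner, dict):
        return NormalizedError("anthropic", inner.get("type") or "", inner.get("message"))
    # OpenAI: an "error" object with no top-level discriminator.
    if isinstance(inner, dict):
        return NormalizedError("openai", inner.get("type") or "", inner.get("message"))
    # Fal.ai: flat body with "detail" and "error_type".
    if "error_type" in body:
        return NormalizedError("fal", body["error_type"], body.get("detail"))
    return None


def is_rate_limit(err: Optional[NormalizedError]) -> bool:
    """True if the normalized error is a rate limit on any surface."""
    return err is not None and err.error_type in RATE_LIMIT_TYPES
```

In a requests-based handler this slots in as err = normalize_error(e.response.json()) inside except requests.HTTPError as e. The Anthropic check runs first because its envelope also contains an "error" object and would otherwise be misread as OpenAI’s.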

Extensions you can use

The OpenAI-compatible and Anthropic-compatible surfaces both accept a few extra fields that aren’t in the spec, passed via extra_body. If you don’t send them, nothing changes from your application’s perspective.

async: true returns 202 immediately with a job ID instead of waiting for completion. The job result is available at GET /api/jobs/{id}. Useful for image generation or long inference runs where you don’t want to block.

timeout_secs sets a hard deadline on the job. If the worker doesn’t finish in time, the request fails rather than hanging.

client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"async": True, "timeout_secs": 60},
)

Regional job routing uses the X-Region and X-Jurisdiction request headers, and they work the same way on the Anthropic surface as on the OpenAI one, so there’s no separate code path.

Unknown extra fields are ignored rather than rejected, so adding or removing them won’t break existing calls.
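Because the extensions are just extra request fields and headers, one small helper can build them for either SDK. Both the openai and anthropic Python SDKs accept extra_body and extra_headers keyword arguments on their create methods. This is a minimal sketch that sends only the fields you actually set. casola_options is a hypothetical helper, and region values such as "eu-west" are placeholders, not documented Casola region names.

```python
from typing import Any, Dict, Optional


def casola_options(
    *,
    run_async: bool = False,
    timeout_secs: Optional[int] = None,
    region: Optional[str] = None,
    jurisdiction: Optional[str] = None,
) -> Dict[str, Any]:
    """Build extra_body / extra_headers kwargs for Casola's optional extensions.

    Hypothetical helper. Only fields that are set are sent, so calling it with
    no arguments returns {} and the request stays a plain spec-shaped one.
    """
    body: Dict[str, Any] = {}
    if run_async:
        body["async"] = True  # "async" is a Python keyword, hence run_async
    if timeout_secs is not None:
        body["timeout_secs"] = timeout_secs
    headers: Dict[str, str] = {}
    if region is not None:
        headers["X-Region"] = region
    if jurisdiction is not None:
        headers["X-Jurisdiction"] = jurisdiction
    kwargs: Dict[str, Any] = {}
    if body:
        kwargs["extra_body"] = body
    if headers:
        kwargs["extra_headers"] = headers
    return kwargs
```

The same call works on both clients: client.chat.completions.create(model="Qwen/Qwen3-8B", messages=[...], **casola_options(timeout_secs=60, region="eu-west")) on the OpenAI client, or client.messages.create(..., **casola_options(region="eu-west")) on the Anthropic one.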

Where compatibility ends

Not everything migrates. The surfaces aren’t equivalent, and some gaps are intentional:

Feature | OpenAI-compatible | Anthropic-compatible | Fal.ai-compatible | Native
--- | --- | --- | --- | ---
Text / LLM | Yes | Yes | No | Yes
Vision (image-in) | Yes | Yes | No | Yes
Image generation | Yes | No | Yes | Yes
Video generation | Yes | No | Yes | Yes
Audio (TTS / ASR) | Yes | No | Yes | Yes
Files + Batches | Yes | Yes | No | No
Tool use | Yes | Yes | No | Yes
Agent DAGs | No | No | No | Yes
Region selection | Header | Header | Header | Header
Streaming | SSE | SSE | No | SSE / polling / webhook

Agent DAGs are native-only. If you’re building multi-step workflows with branching and state, the OpenAI, Anthropic, and Fal.ai surfaces won’t expose those primitives. That’s what the native API is for.

Files and Batches exist on the OpenAI-compatible and Anthropic-compatible surfaces only.

Anthropic-compatible covers text, vision, and tool use. For audio, image generation, and video, use the OpenAI or Fal.ai surfaces.

Region selection works on all four surfaces via the same X-Region and X-Jurisdiction headers. Check this if you’re pinning jobs to a specific region.

Taken together, the four surfaces cover a broad set of capabilities. For the common case the migration is two lines, and the limits are documented.