OpenAI Compatible API for RAG: Build Retrieval-Augmented Generation Without Vendor Lock-In

Build a RAG pipeline on an OpenAI Compatible API: route embeddings, reranking, and synthesis to different models without rewriting code. Cut inference cost with per-stage model routing.

OurToken Team/Jun 29, 2026/12 min

OpenAI Compatible API for RAG: Build Retrieval-Augmented Generation Without Vendor Lock-In

A practical, code-first guide to building RAG pipelines on an OpenAI-compatible API — switch embeddings, rerankers, and chat models without rewriting your application, and cut inference cost by routing each step to the right model.

If you are shipping a RAG (Retrieval-Augmented Generation) system in 2026, you already know the hardest part is not the retrieval algorithm — it is the infrastructure lock-in. You pick an embedding model, wire it into a vector store, call a chat model for synthesis, and six months later you discover a better model that costs half as much. Rewriting every API call, re-encoding millions of vectors, and re-testing the pipeline is expensive. The fix more teams are adopting is an OpenAI Compatible API: one request shape, many models behind it. This article shows how to architect a RAG system on top of an OpenAI-compatible endpoint so you can swap GPT-5.5, Claude Opus 4.8, GLM 5.2, and DeepSeek V4 in and out of each pipeline stage without touching application code.

The mistake most teams make is treating "which model should I use for RAG?" as a single question. It is not. RAG is a pipeline, and each stage rewards a different model. The embedding stage rewards low cost per token and throughput. The synthesis stage rewards reasoning quality and a context window large enough to hold the retrieved chunks. The reranking stage rewards a sharp relevance signal. No single model wins all three, and no single provider prices all three competitively. This is why the interface in front of your models matters more than the models themselves — and why an OpenAI-compatible endpoint, which lets you swap models per stage without changing code, has quietly become the default RAG infrastructure choice in 2026.

Why RAG Teams Hit a Wall With Single-Provider APIs

A production RAG pipeline is not one model call — it is a chain of specialized calls:

Embedding — convert documents and queries into vectors.
Retrieval — search a vector store for the top-k chunks.
Reranking — optionally re-score chunks for relevance.
Synthesis — feed the top chunks into a chat model to generate the answer.

Each stage has different requirements. Embeddings need scale and low cost per token. Synthesis needs strong reasoning and a large context window. A single provider rarely wins on every stage, and pricing varies by an order of magnitude. The OurToken model catalog lists models from OpenAI, Anthropic, Zhipu (GLM), DeepSeek, and MiniMax side by side — exactly because RAG teams want to mix them.

The problem: every provider ships a different SDK, auth flow, and request schema. Switching from an OpenAI embeddings endpoint to an Anthropic chat endpoint means maintaining two clients, two billing relationships, and two retry policies. As the number of models grows, the integration surface grows linearly. This is the lock-in tax RAG teams pay silently, every month.

The Hidden Cost of Hardcoding One Provider

Hardcoding a single provider looks simple in a tutorial but breaks in production:

Price shocks: a provider raises token prices or deprecates a model, and your monthly bill spikes overnight.
Rate limits: a single key cannot sustain bursty RAG traffic during peak hours.
Model drift: a new model generation (GPT-5.5 → GPT-5.4-mini for cheap classification) is a refactor, not a config change.
Fallback gaps: if your only provider has an outage, your RAG app goes dark.

The teams that survive these issues all converge on the same pattern: an OpenAI-compatible interface in front of many models.

How an OpenAI Compatible API Unlocks RAG Flexibility

An OpenAI Compatible API exposes the same request and response schema as OpenAI's /v1/chat/completions and /v1/embeddings endpoints. Any client written for OpenAI — the official SDK, LangChain, LlamaIndex, Vercel AI SDK — works unchanged. The difference is that behind that one endpoint you can route requests to multiple underlying models.

For RAG, this means each pipeline stage can call the same base URL with a different model field:

from openai import OpenAI

# One client, one base_url, one API key — many models behind it
client = OpenAI(
    api_key="OURTOKEN_API_KEY",
    base_url="https://api.ourtoken.ai/v1",
)

# Stage 1: cheap, fast embeddings
doc_vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=chunk_text,
).data[0].embedding

# Stage 4: strong synthesis with a large context window
answer = client.chat.completions.create(
    model="claude-opus-4-8",
    messages=[
        {"role": "system", "content": "Answer using only the retrieved context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {q}"},
    ],
).choices[0].message.content

No second SDK. No second auth header. No conditional imports. This is the same interface contract documented in the OpenAI API reference — any client that speaks it works unchanged. Swapping the synthesis model from Claude Opus 4.8 to GLM 5.2 for a 1M-context workload is a one-line change to the model string.

Routing Each RAG Stage to the Right Model

The real power of an OpenAI-compatible layer is per-stage model routing. Instead of picking one model for everything, you assign each stage the model that optimizes its specific trade-off:

RAG Stage	Requirement	Example Model	Why
Embedding	Low cost, high throughput	text-embedding-3-small	Cheapest per token, good enough for recall
Retrieval	Vector DB (not an LLM call)	pgvector / Qdrant	Latency-bound, model-independent
Reranking	Relevance scoring	Cohere-style reranker	Sharpens top-k before synthesis
Synthesis	Reasoning, large context, citations	Claude Opus 4.8	Best grounding on retrieved context
Long-doc synthesis	1M-token context	GLM 5.2	Fits entire codebase or manual
Cost-sensitive Q&A	Fast, cheap, high volume	GPT-5.4-mini / DeepSeek V4 Flash	Handles the long tail of user queries

This is the architecture that an OpenAI-compatible endpoint makes trivial: the routing logic lives in your application as a model string, not as a vendor-specific integration.

Building the RAG Pipeline: Architecture and Code

Let's make this concrete. Below is a minimal but production-shaped RAG pipeline that uses a single OpenAI-compatible client for both embeddings and synthesis, with per-stage model selection.

Architecture Overview

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Documents  │────▶│  Embeddings  │────▶│  Vector DB   │────▶│  Retrieval   │
│  (chunks)   │     │  (cheap)     │     │  pgvector    │     │  top-k       │
└─────────────┘     └──────────────┘     └──────────────┘     └──────┬───────┘
                                                                │
                                                                ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Answer    │◀────│  Synthesis   │◀────│  Reranker    │◀────│  User Query  │
│  (cited)    │     │  (Opus 4.8)  │     │  (optional)  │     │  + embed     │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

Every box that calls an LLM goes through the same base_url. Only the model field changes.

Embedding and Indexing

def embed_and_index(chunks, client, store):
    for chunk in chunks:
        vec = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk["text"],
        ).data[0].embedding
        store.add(id=chunk["id"], vector=vec, payload=chunk)

Embeddings are the highest-volume call in RAG — you embed every document once, plus every query. This is where per-token cost dominates, so route it to the cheapest capable model and never to a premium chat model.

Retrieval and Synthesis With Citations

def rag_answer(question, client, store, top_k=5):
    q_vec = client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding

    hits = store.search(q_vec, top_k=top_k)
    context = "\n\n".join(h["payload"]["text"] for h in hits)

    resp = client.chat.completions.create(
        model="claude-opus-4-8",
        messages=[
            {"role": "system", "content":
             "Answer the question using ONLY the context below. "
             "Cite chunk IDs like [chunk-12]. If the answer is not in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

Two things to notice: the embedding call and the chat call share one client and one base_url, and the synthesis model is claude-opus-4-8 — a model you can read about on the Claude Opus 4.8 model page. Want to try a cheaper model for low-stakes queries? Change the string to gpt-5-4-mini or deepseek-v4-flash. The rest of the pipeline is untouched.

Cost Analysis: Single-Model vs Per-Stage Routing

The economic case for an OpenAI-compatible, per-stage RAG setup is not subtle. Here is a realistic monthly cost profile for a knowledge-base RAG app serving 100k queries/month over a 2M-token document set.

Component	Naive (one premium model)	Per-stage routing (OpenAI-compatible)
Document embedding (one-time)	Premium chat-tier rate × 2M tok	text-embedding-3-small × 2M tok
Query embedding (100k queries)	Premium rate × short queries	text-embedding-3-small × short queries
Synthesis (100k calls, avg 4k tok in/out)	GPT-5.5 for every call	Opus 4.8 for hard 20%, GPT-5.4-mini / DeepSeek V4 Flash for 80%

Routing 80% of routine queries to a cheaper model and reserving the premium model for the 20% that need deep reasoning is the single biggest lever in RAG cost. An OpenAI-compatible endpoint makes this a config decision instead of a re-architecture.

Compared to calling each provider's native API directly, going through one OpenAI-compatible layer also typically lowers effective cost by 40%–80% versus official list pricing, because the platform aggregates volume and exposes the same models at lower per-token rates. The exact saving depends on your model mix, but the structural advantage — one billing relationship, one SDK, per-stage routing — is independent of any single price point.

A Simple Routing Policy

def pick_model(question, history_length):
    # Long context (e.g. multi-turn with big history) needs a large window
    if history_length > 200_000:
        return "glm-5-2"          # 1M context window
    # Hard questions get the strong model
    if needs_deep_reasoning(question):
        return "claude-opus-4-8"
    # Everything else: cheap and fast
    return "gpt-5-4-mini"

This is the kind of policy that is impractical with single-provider lock-in but trivial with an OpenAI-compatible router. You can read more about the underlying models in the OurToken documentation.

Production Hardening: Failover, Retries, and Observability

A RAG pipeline that works on your laptop is not the same as one that survives a Friday afternoon traffic spike. Because an OpenAI-compatible endpoint exposes many models behind one interface, the same client-side resilience patterns you already use apply uniformly across every stage.

Failover and Fallback Models

The most common production failure is not a total outage — it is one model becoming slow or returning elevated errors while others stay healthy. With a single-provider API you have no recourse. With a multi-model OpenAI-compatible endpoint, you define a fallback chain:

FALLBACK_CHAIN = {
    "synthesis": ["claude-opus-4-8", "gpt-5-5", "glm-5-2"],
    "embedding": ["text-embedding-3-small", "text-embedding-3-large"],
}

def call_with_fallback(stage, payload, client):
    for model in FALLBACK_CHAIN[stage]:
        try:
            return run_call(stage, model, payload, client)
        except (RateLimitError, APIError, Timeout):
            continue
    raise RuntimeError(f"All fallbacks exhausted for {stage}")

Because every model in the chain speaks the same schema, the fallback loop needs no provider-specific code. You get resilience for free the moment you stop hardcoding a single model.

Retries, Timeouts, and Cost Guards

Three settings matter more than any framework choice in production RAG:

Per-request timeout: retrieval and synthesis latency varies wildly by model. Set a tight timeout on the cheap fast stage (embeddings, GPT-5.4-mini, DeepSeek V4 Flash) and a generous one on the deep-reasoning stage (Claude Opus 4.8).
Exponential backoff with jitter: retry transient 5xx and 429s, but cap attempts so a degraded model does not amplify your bill.
Per-request token budget: enforce a max output token ceiling. A misbehaving prompt that asks for a 16k-token answer on a premium model is the fastest way to blow a monthly budget.

Observability: Tag Every Call

The single highest-leverage practice is tagging each request with its RAG stage and routing reason. Most OpenAI-compatible clients let you pass metadata or headers. Log at minimum:

Field	Why it matters
stage	embedding / retrieval / rerank / synthesis
model	which model actually served the call
tokens_in / out	cost attribution per stage
latency_ms	detect slow models before users do
fallback_hit	how often your primary model is failing

With this data, the per-stage routing table from the previous section stops being a guess and becomes a measured decision. You will typically find that 80% of cost lives in the synthesis stage and 80% of latency lives in retrieval — two different problems with two different fixes.

Migrating an Existing RAG App to an OpenAI-Compatible Endpoint

If you already have a RAG app hardcoded to one provider, the migration path is deliberately boring:

Change two values: point your existing OpenAI client at the new base_url and swap the API key. Every chat.completions and embeddings call still works.
Add a second model string: introduce one alternative model behind the same endpoint, initially just for a low-risk stage like cheap Q&A synthesis.
Measure, then route: collect stage/model/latency/cost data for a week, then promote the routing policy from the earlier section.
Re-embed only if you change embedding model: if you keep the same embedding model, your existing vector index is reusable as-is — no re-indexing required.

Step 4 is the one that saves teams the most pain. Switching chat models through an OpenAI-compatible layer does not touch your vector database. You only re-embed when you deliberately choose a stronger embedding model, and even then the cost is bounded by your corpus size, not your query volume.

Conclusion

RAG in 2026 is not a model problem — it is a routing and cost-optimization problem. The teams that win are the ones that can put the right model on each pipeline stage, swap models as pricing and capability shift, and never rewrite their application to do it. An OpenAI Compatible API is the interface that makes this possible: one request shape, one SDK, many models behind it.

The practical takeaway: build your RAG pipeline against an OpenAI-compatible endpoint from day one. Embed with the cheapest capable model, synthesize with the strongest, route the long tail of queries to a fast low-cost model, and keep every stage behind the same base_url. When a better model lands next quarter, you change a string — not your architecture.

If you want to lower API cost, unify multiple models behind one endpoint, and keep using the OpenAI SDK you already have, you can try OurToken. It exposes an OpenAI-compatible API across GPT-5.5, Claude Opus 4.8, GLM 5.2, DeepSeek V4, and more, with prepaid pay-as-you-go billing.

FAQ

What is an OpenAI Compatible API in the context of RAG?

An OpenAI Compatible API exposes the same /v1/chat/completions and /v1/embeddings schema as OpenAI. Your RAG code written for the OpenAI SDK works unchanged; only the base_url and model strings differ. This lets you route each RAG stage (embedding, reranking, synthesis) to a different underlying model.

Do I need to rewrite my LangChain or LlamaIndex RAG app to switch models?

No. Because the endpoint is OpenAI-compatible, LangChain's ChatOpenAI and LlamaIndex's OpenAI integrations work as-is — see the LangChain OpenAI integration docs and the Vercel AI SDK provider guide. You change the model parameter per stage, not the integration code.

Which model should I use for RAG synthesis?

For grounded answers on retrieved context, Claude Opus 4.8 is a strong default — see Anthropic's own guidance on retrieval and citations. For very long documents that must fit in one prompt, GLM 5.2's 1M context window is better. For high-volume, low-stakes Q&A, GPT-5.4-mini or DeepSeek V4 Flash cut cost sharply.

How much can I save by routing RAG stages to different models?

Most of the saving comes from moving the 80% of routine queries off the premium model and from using a cheap embedding model for the high-volume embedding calls. Through an OpenAI-compatible layer, effective cost is typically 40%–80% lower than calling each provider's official API at list price, depending on your model mix.

Do I need to re-embed my documents when I switch chat models?

No. Embeddings and chat completions are independent calls. Switching the synthesis model (for example from GPT-5.5 to Claude Opus 4.8) does not change your embedding model, so your vector index and all stored vectors remain valid. You only re-embed if you deliberately change the embedding model itself.

How do I handle failover if one model goes down?

Because every model behind an OpenAI-compatible endpoint shares the same request schema, you can define a fallback chain per RAG stage. If the primary synthesis model returns a rate-limit or timeout, the client automatically retries the next model in the chain — with no provider-specific code.

Can I keep my existing vector database?

Yes. The vector DB (pgvector, Qdrant, Pinecone, etc.) is independent of the LLM provider. An OpenAI-compatible API only replaces the embedding and chat calls; your retrieval index stays as-is.

← Back to all posts