
Stop patching every agent for local models.

Relay is a lightweight compatibility gateway for llama.cpp and local inference servers. It exposes OpenAI- and Anthropic-compatible APIs, normalizes streaming, tools, models, and errors, and lets real agents talk to your local models without custom glue code.

"Almost compatible" is where agents break.

Local model servers speak familiar protocols until the details matter: stream events, tool calls, fields, errors, and capability metadata. Relay makes the boundary actually compatible.

OpenAI-compatible endpoints differ subtly across local servers: header conventions, field presence, and error shapes don't match what SDKs expect.

Anthropic clients expect different message shapes, streaming event orders, and tool-call structures than what upstream servers return.

Mismatched tool calls, model IDs, SSE chunk framing, and capability metadata often break agent loops outright or degrade them silently.

Relay normalizes the boundary instead of forcing every client, SDK, and agent to special-case each local inference server.
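
As a minimal sketch of what that buys you, here is the official OpenAI Node SDK pointed at Relay's base URL instead of the hosted API. The port matches the quick start below; the model alias and the API key are placeholders, not fixed names.

```ts
import OpenAI from "openai";

// Assumption: Relay listens on 127.0.0.1:1234 (see the quick start below).
const client = new OpenAI({
  baseURL: "http://127.0.0.1:1234/v1",
  apiKey: "sk-local", // placeholder; a local gateway typically ignores it
});

// Stream a completion exactly as you would against the hosted API.
const stream = await client.chat.completions.create({
  model: "local-model", // hypothetical alias; list real IDs via GET /v1/models
  messages: [{ role: "user", content: "Hello from a local agent" }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```

The agent code stays unchanged; only the base URL moves from the hosted API to Relay.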

Clients (OpenAI / Anthropic): SDK requests, tools, streams
Relay (normalize the boundary): messages, SSE, tools, errors
Local upstream (llama.cpp / vLLM): native server shape
Relay sits between agent SDKs and local inference servers, translating protocol details in both directions.

A single compatibility surface across protocols

OpenAI chat completions
Anthropic messages
Streaming / SSE
Tool call shapes
Model aliases & capabilities
Upstream errors & status
Health / readiness
Deployment config
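
For the Anthropic side of that surface, a hedged sketch: this assumes Relay serves the Anthropic Messages API on the same host and port as its OpenAI endpoints, and the model alias is again hypothetical.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Assumption: Relay exposes the Anthropic Messages API on the same
// host/port as its OpenAI endpoints; "local-model" is a hypothetical alias.
const anthropic = new Anthropic({
  baseURL: "http://127.0.0.1:1234",
  apiKey: "local", // placeholder for clients that require a key
});

const msg = await anthropic.messages.create({
  model: "local-model",
  max_tokens: 256,
  messages: [{ role: "user", content: "Summarize this repo in one line." }],
});

// content is an array of blocks; text blocks carry the reply.
console.log(msg.content);
```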
Why not llama.cpp directly?

llama.cpp is excellent at inference. Relay is not trying to replace it. Relay handles the compatibility layer around it: API shapes, streaming behavior, tool semantics, model metadata, and observability. Think of it as the protocol adapter that sits between raw inference and real-world agent tooling.

```bash
# 1. Start your local model server
llama-server --model ./model.gguf --host 127.0.0.1 --port 8080

# 2. Clone and start Relay
git clone https://github.com/achuthanmukundan00/relay.git
cd relay && npm install && cp .env.example .env && npm run dev

# 3. Verify
curl http://127.0.0.1:1234/v1/models
```
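
As a further smoke test, a sketch of a non-streaming chat completion through Relay, assuming the same port as the verify step and a hypothetical `local-model` alias (Node 18+ for the built-in fetch):

```ts
// Assumption: Relay listens on 127.0.0.1:1234 as in the verify step above;
// "local-model" is a placeholder alias, not a fixed name.
const res = await fetch("http://127.0.0.1:1234/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "local-model",
    messages: [{ role: "user", content: "ping" }],
  }),
});
console.log(JSON.stringify(await res.json(), null, 2));
```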
Using Synax? Relay can act as the local model gateway underneath it. Synax is the agent UX and runtime; Relay is the compatibility boundary that makes local inference actually work with it.