How to Add LLM Guardrails to a Self-Hosted 8B Model with Forge in 2025

Tools & Libraries·Jun 22, 2026·15 min read

What Is Forge and Why Does Your Local LLM Need Guardrails

If you've spent any time running a 7B or 8B model on local hardware and asking it to call tools reliably, you've hit the wall: the model hallucinates argument names, emits malformed JSON, calls tools in the wrong order, or just never signals it's done. forge-guardrails is a reliability layer built specifically for this problem. It wraps your LLM backend with a set of composable guardrails — rescue parsing, retry nudges, response validation, and optional workflow constraints — without forcing you to rewrite your entire agent.

The project is maintained by Antoine Zambelli and lives at antoinezambelli/forge on GitHub. It's for any developer running local inference (llama.cpp, Ollama, vLLM, Llamafile) or using Anthropic's API who wants structured, reliable tool-calling without building the error-handling scaffolding from scratch.

The Tool-Calling Reliability Problem with Small Models

Large hosted models handle tool-calling reasonably well, in large part because function-calling data features heavily in their training. An 8B quantized model running on your RTX 3090 has seen far less of it. The failure modes are predictable: the model wraps tool calls in markdown code fences instead of bare JSON, misnames arguments, omits required fields, or enters an infinite loop calling the same tool repeatedly. Without a recovery layer, your agent crashes, silently produces wrong results, or spins forever burning GPU cycles.

How Forge Boosts an 8B Model from Single Digits to 84% on Agentic Tasks

The headline number from the forge README is striking: across forge's 26-scenario v0.7.0 eval suite, an 8B local model goes from single-digit pass rates to 84% with guardrails applied. Even Claude Sonnet 4.6, already a strong performer, jumps from 85% to 98% (measured in v0.6.0). These aren't cherry-picked demos — they represent a structured benchmark covering diverse tool-calling scenarios.

The gains come from three compounding mechanisms: recovering calls the model almost got right (rescue parsing), steering the model back when it drifts (retry nudges), and rejecting responses that fail validation before they corrupt downstream state (response validation).

What Forge Is Not: Scope Boundaries You Should Understand

Forge is deliberately scoped. It is not a multi-agent orchestrator, a DAG planner, or a coding harness like aider or opencode. It sits inside one agentic loop. The three usage modes cover different integration depths:

| Mode | Use Case | Setup Complexity | Best For |
|---|---|---|---|
| Proxy Server | Drop-in guardrails for existing tools | Low (one CLI command) | opencode, aider, Claude Code users |
| WorkflowRunner | Full agent loop management | Medium (Python config) | Greenfield forge-native agents |
| Guardrails Middleware | Add validation to your own loop | Medium (import and compose) | LangChain or custom agent owners |

Core Concepts Behind Forge's Reliability Stack

Forge's reliability stack is composed of four distinct mechanisms. You don't need all of them in every setup — but understanding what each does helps you configure the right combination for your workload.

Rescue Parsing: Recovering Malformed Tool Calls

When a small model almost gets a tool call right but wraps it in a markdown fence, adds a trailing comment, or slightly mangles the JSON structure, forge's rescue parser attempts to extract a valid call anyway. Instead of surfacing a parse error to your agent loop, forge tries several heuristic repairs: stripping fences, extracting embedded JSON objects, fixing common escaping issues. If rescue succeeds, the call proceeds normally. If it fails after exhausting heuristics, forge triggers a retry nudge rather than crashing.

This single mechanism accounts for a large fraction of forge's benchmark gains on 8B models — these models frequently know the right tool and arguments but output them in a format that naive parsers reject.
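
To make the idea concrete, here is a minimal, standard-library-only sketch of two of the heuristics described above: stripping a markdown fence and extracting an embedded JSON object. It is not forge's rescue parser. The function name `rescue_tool_call` and the assumed call shape (a JSON object with a `name` key) are illustrative choices made for this example.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the listing readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def rescue_tool_call(raw: str) -> Optional[dict]:
    """Try to recover a JSON tool call from messy model output.

    Hypothetical helper: returns the parsed call, or None if no heuristic works.
    """
    candidates = [raw.strip()]

    # Heuristic 1: use the body of the first markdown fence, if there is one.
    fenced = FENCE_RE.search(raw)
    if fenced:
        candidates.append(fenced.group(1).strip())

    # Heuristic 2: use the outermost {...} span embedded in surrounding prose.
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start : end + 1])

    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        # Assumed call shape: an object that at least names the tool.
        if isinstance(parsed, dict) and "name" in parsed:
            return parsed
    return None
```

Returning `None` instead of raising lets the caller decide what happens next, which is where a retry nudge would take over.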

Retry Nudges: Steering the Model Back on Track

When a response is unrecoverable (wrong tool name, missing required argument, failed rescue), forge sends a targeted correction back to the model as a follow-up message rather than resetting the entire context. The nudge is specific: it tells the model exactly what went wrong ("tool get_weather requires argument location which was missing") and asks it to retry. This preserves conversation history and is far cheaper than a full restart.
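
As a rough illustration of why a nudge is cheaper than a restart, the sketch below appends the failed reply and a correction to the existing history instead of discarding it. The helper names `build_nudge` and `apply_nudge`, the wording of the correction, and the default message role are all assumptions for this example, not forge's internals.

```python
def build_nudge(problem: str, role: str = "user") -> dict:
    """Build a correction message describing exactly what went wrong.

    The role is an assumption; use whichever role your chat template expects.
    """
    return {
        "role": role,
        "content": f"Your last tool call was rejected: {problem}. "
                   "Please retry with a corrected tool call.",
    }


def apply_nudge(history: list, failed_reply: str, problem: str) -> list:
    """Return a new history: the earlier turns, the failed reply, then the nudge."""
    return history + [
        {"role": "assistant", "content": failed_reply},  # keep the failed attempt visible
        build_nudge(problem),
    ]
```

Because the earlier turns are carried over unchanged, the model only has to produce one corrected call rather than redo the whole conversation.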

Response Validation: Ensuring Output Correctness

Every tool call response passes through forge's validator before execution. Validation checks that the called tool exists in your registered tool set, that all required arguments are present and correctly typed, and that no unexpected arguments were hallucinated. Validation failures trigger a retry nudge automatically. This layer is active even when you haven't defined any required_steps — it's always-on baseline protection.
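
The three checks can be sketched against an OpenAI-style tool spec in a few lines. This is a simplified stand-in, not forge's validator: `validate_call`, `JSON_TYPES`, and the sample `TOOLS` list are names invented for this example, and only flat, top-level argument types are checked.

```python
# Map JSON Schema type names to Python types (top-level types only).
JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}

# Sample tool spec in the OpenAI "tools" format, used for illustration.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]


def validate_call(name: str, args: dict, tools: list) -> list:
    """Return a list of human-readable problems; an empty list means the call is valid."""
    specs = {t["function"]["name"]: t["function"] for t in tools}
    if name not in specs:
        return [f"unknown tool {name}"]

    params = specs[name].get("parameters", {})
    props = params.get("properties", {})
    errors = []

    for required in params.get("required", []):
        if required not in args:
            errors.append(f"tool {name} requires argument {required} which was missing")

    for key, value in args.items():
        if key not in props:
            errors.append(f"tool {name} received unexpected argument {key}")
            continue
        type_name = props[key].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # no type declared, or one this sketch does not handle
        # bool is a subclass of int in Python, so reject it explicitly for numeric types.
        wrong = (isinstance(value, bool) and expected is not bool) or not isinstance(value, expected)
        if wrong:
            errors.append(f"argument {key} of tool {name} should be of type {type_name}")
    return errors
```

Each error string is written so it can be passed straight into a retry nudge.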

Workflow Constraints: required_steps, prerequisites, and terminal_tool

For structured workflows where tool execution order matters, forge provides three opt-in constraints:

required_steps: A list of tool names that must be called before the loop can complete. The model can call them in any order unless further constrained.
prerequisites: A dict mapping each tool to the tools that must have already succeeded before it can be called. Example: {"save_results": ["fetch_data", "validate_data"]} means save_results cannot fire until both upstream tools have returned successfully.
terminal_tool: The tool whose successful execution signals loop completion. When the model calls it and it succeeds, forge closes the loop cleanly.

With zero required_steps and no terminal_tool, forge still applies rescue parsing, retry nudges, and response validation — the constraint layer is purely additive.
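
The bookkeeping behind these three constraints is small. The sketch below tracks which tools have succeeded and answers two questions: may this tool run now, and is the loop finished? The `WorkflowState` class and its method names are hypothetical, and the completion rule (terminal tool succeeded and every required step done) is an assumption about how the constraints combine.

```python
from typing import Optional


class WorkflowState:
    """Hypothetical tracker for required_steps, prerequisites, and terminal_tool."""

    def __init__(self, required_steps=(), prerequisites=None, terminal_tool=None):
        self.required_steps = set(required_steps)
        self.prerequisites = prerequisites or {}
        self.terminal_tool = terminal_tool
        self.succeeded = set()

    def check_call(self, tool: str) -> Optional[str]:
        """Return a nudge message if the call must be withheld, else None."""
        missing = [p for p in self.prerequisites.get(tool, []) if p not in self.succeeded]
        if missing:
            return f"You must call {', '.join(missing)} before calling {tool}."
        return None

    def record_success(self, tool: str) -> None:
        self.succeeded.add(tool)

    def is_complete(self) -> bool:
        """Assumed rule: terminal tool succeeded and every required step is done."""
        terminal_done = self.terminal_tool is None or self.terminal_tool in self.succeeded
        return terminal_done and self.required_steps <= self.succeeded


state = WorkflowState(
    required_steps=["fetch_data", "validate_data", "save_results"],
    prerequisites={"save_results": ["fetch_data", "validate_data"]},
    terminal_tool="save_results",
)
print(state.check_call("save_results"))
# → You must call fetch_data, validate_data before calling save_results.
```

With no constraints configured, `check_call` never blocks anything, which matches the additive behavior described above.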

Quick Start: Installing Forge and Choosing a Backend

Installing forge-guardrails via pip (core vs Anthropic extras)

Forge requires Python 3.12 or newer. Install the core package or add the Anthropic client extras depending on your backend:

# Core install — works with Ollama, llama-server, vLLM, Llamafile
pip install forge-guardrails

# With Anthropic client support
pip install "forge-guardrails[anthropic]"

For development or to run the eval suite yourself:

git clone https://github.com/antoinezambelli/forge.git
cd forge
pip install -e ".[dev]"

Setting Up llama-server with a Ministral 8B GGUF Model

llama-server (from llama.cpp) is the recommended backend — forge's top-10 eval configs all run on it. The --jinja flag is required; it enables the Jinja-based chat template engine that correctly formats tool-calling prompts for models like Ministral.

# Download llama-server from https://github.com/ggml-org/llama.cpp/releases
# Then serve your GGUF model:
llama-server \
  -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf \
  --jinja \
  -ngl 999 \
  --port 8080

The -ngl 999 flag offloads all layers to the GPU. Lower the value if you run out of VRAM. The server will be available at http://localhost:8080 with an OpenAI-compatible /v1/chat/completions endpoint.

Pre-flight checklist before starting forge:

| Requirement | Detail |
|---|---|
| Python 3.12+ | Check with python --version |
| Backend running | Start llama-server or Ollama before forge |
| --jinja flag | Required for llama-server tool-calling |
| Quantization match | Q8_0 for 24GB VRAM; Q4_K_M for 8-12GB |
| Port not in use | Default llama-server port is 8080 |
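
Two of these checks are easy to script. The sketch below uses only the standard library; the function names `python_ok` and `port_in_use` are made up for this example and are not part of forge. Run the port check before starting llama-server to confirm the port is free, and again afterwards to confirm the backend is listening.

```python
import socket
import sys


def python_ok(minimum=(3, 12)) -> bool:
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("Python 3.12+:", python_ok())
    print("Port 8080 answering:", port_in_use(8080))
```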

Alternative: Using Ollama for Easier Local Setup

Ollama is the fastest way to get running, though forge's benchmarks show it slightly underperforms llama-server on harder multi-step workloads:

# Install Ollama from https://ollama.com/download, then:
ollama pull ministral-3:8b-instruct-2512-q4_K_M
# Ollama listens on http://localhost:11434; its OpenAI-compatible API lives under /v1

Using Anthropic as a Backend (No GPU Required)

pip install "forge-guardrails[anthropic]"
export ANTHROPIC_API_KEY=sk-ant-...

No local hardware needed. This path is also how you reproduce forge's Sonnet 4.6 benchmark numbers. vLLM and Llamafile are also supported — consult docs/BACKEND_SETUP.md in the repo for their specific configuration flags.

Use Case 1 — Proxy Server Mode for Drop-In Guardrails

How the Proxy Works: OpenAI and Anthropic API Compatibility

The forge proxy speaks two API dialects: OpenAI's /v1/chat/completions and Anthropic's /v1/messages. Any client that can set a custom base_url — which includes the OpenAI Python SDK, the Anthropic SDK, opencode, aider, Continue, and Claude Code — can be redirected to forge with zero code changes on the client side. Forge intercepts every request, applies its guardrails to the exchange with your local model, and returns a response in the format the client expects.

Architecture flow: client (opencode / aider / your script) → forge proxy (localhost:8000) → llama-server or Ollama (localhost:8080). The client never knows it's talking to an 8B local model with a guardrails shim in between.

Pointing opencode, aider, or Claude Code at the Forge Proxy

# Step 1: Start the forge proxy (assumes llama-server is on port 8080)
python -m forge.proxy
# Proxy listens on http://localhost:8000 by default

# Step 2 (Python, run separately from the shell command above): point any OpenAI-compatible client at the proxy
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # forge proxy, not api.openai.com
    api_key="not-needed-for-local",       # required by SDK but unused locally
)

response = client.chat.completions.create(
    model="ministral-3:8b-instruct-2512-q4_K_M",
    messages=[{"role": "user", "content": "What tools do you have available?"}],
    tools=[...],  # your tool definitions here
)

print(response.choices[0].message)

For CLI tools like aider, set the OPENAI_API_BASE environment variable:

export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=not-needed
# aider expects the openai/ prefix for models served from an OpenAI-compatible endpoint
aider --model openai/ministral-3:8b-instruct-2512-q4_K_M

Running the Proxy and Verifying It Intercepts Requests

The proxy logs each intercepted request to stdout. You'll see lines like [forge] rescued malformed tool call in turn 2 or [forge] retry nudge sent — missing argument: location when guardrails activate. If you see those lines, forge is working. If the proxy passes requests straight through without any forge log lines, check that your model is actually emitting tool calls and that your tools list is non-empty in the request.
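
If you capture the proxy's stdout to a file, a few lines of Python can tally how often the guardrails fired. This sketch assumes the log lines look like the two examples quoted above (a `[forge]` prefix plus the words "rescued" or "retry nudge"); the exact log format may differ between versions, and `summarize_forge_log` is a name invented here.

```python
from collections import Counter


def summarize_forge_log(lines) -> Counter:
    """Count guardrail events in captured proxy output (assumed line format)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("[forge]"):
            continue  # ignore unrelated server output
        if "rescued" in line:
            counts["rescue"] += 1
        elif "retry nudge" in line:
            counts["nudge"] += 1
        else:
            counts["other"] += 1
    return counts
```

A total of zero after a session with real tool calls is the signal to check the request's tools list, as described above.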

This is the lowest-friction entry point because it requires no rewrite of existing tooling. If you're already using aider or opencode with a paid API, switching to forge plus a local model is a two-line configuration change.

Use Case 2 — WorkflowRunner for Structured Agentic Loops

Defining Tools and Registering Them with WorkflowRunner

WorkflowRunner is forge's full agent loop manager. You define Python functions as tools, register them, configure your backend and any workflow constraints, then call runner.run() with a user prompt. Forge handles system prompt construction, the tool-call/execute/respond cycle, context compaction when the window fills, and all guardrails automatically.

from forge import WorkflowRunner
from forge.backends import OllamaBackend

# Define your tools as plain Python functions
def get_weather(location: str) -> dict:
    """Fetch current weather for a location."""
    # In production, call a real weather API
    return {"location": location, "temp_c": 22, "condition": "sunny"}

def save_report(content: str, filename: str) -> dict:
    """Save a report string to a local file."""
    with open(filename, "w") as f:
        f.write(content)
    return {"saved": True, "path": filename}

# Configure the backend
backend = OllamaBackend(
    model="ministral-3:8b-instruct-2512-q4_K_M",
    base_url="http://localhost:11434",
)

# Instantiate the runner with constraints
runner = WorkflowRunner(
    tools=[get_weather, save_report],
    backend=backend,
    required_steps=["get_weather", "save_report"],  # both must be called
    terminal_tool="save_report",                     # loop ends when this succeeds
)

# Run the agent loop
result = runner.run("Get the weather in Paris and save a report to weather.txt")
print(result)

Configuring required_steps and prerequisites for Constrained Workflows

The prerequisites dict enforces ordering. Here's how to ensure save_report cannot fire until get_weather has already succeeded:

runner = WorkflowRunner(
    tools=[get_weather, save_report],
    backend=backend,
    required_steps=["get_weather", "save_report"],
    prerequisites={
        "save_report": ["get_weather"],  # save_report requires get_weather first
    },
    terminal_tool="save_report",
)

If the model tries to call save_report before get_weather has returned a successful result, forge intercepts the call, withholds execution, and sends a retry nudge: "You must call get_weather before calling save_report." The model corrects itself without you writing a single line of order-enforcement logic.

Context Compaction and System Prompt Management

WorkflowRunner automatically manages the system prompt (injecting tool schemas, workflow constraints, and forge's reliability instructions) and handles context compaction when the message window approaches the model's context limit. You don't configure these manually — forge handles them based on your tool definitions and constraints.
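
For intuition about what compaction involves, here is a deliberately naive version: keep system messages, keep the most recent turns, and drop the oldest turns until a rough token estimate fits the budget. This is not forge's algorithm. The 4-characters-per-token estimate, the `compact` function, and its parameters are assumptions for this sketch.

```python
def estimate_tokens(message: dict) -> int:
    """Very rough estimate: about 4 characters per token, plus per-message overhead."""
    return len(message.get("content") or "") // 4 + 4


def compact(messages: list, budget: int, keep_recent: int = 4) -> list:
    """Drop the oldest non-system turns until the estimate fits the budget.

    System messages and the last `keep_recent` turns are always kept.
    A real implementation would also avoid separating a tool result
    from the assistant turn that requested it.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while len(rest) > keep_recent and sum(map(estimate_tokens, system + rest)) > budget:
        rest.pop(0)  # discard the oldest turn first
    return system + rest
```

Production compaction usually summarizes old turns rather than discarding them, but the budget-driven loop has the same shape.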

SlotWorker: Priority-Queued GPU Slot Sharing for Multi-Agent Setups

If you're running multiple specialist workflows on a single GPU — for example, a research agent and a summarization agent sharing one llama-server instance — SlotWorker adds priority-queued access with auto-preemption. High-priority workflows can interrupt lower-priority ones between tool calls. This is an advanced feature for teams building multi-agent systems on constrained hardware, not something you need for single-workflow setups.
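
The scheduling idea can be illustrated with a small simulation: each workflow gives up the slot after every model turn, and the most urgent waiting workflow takes the next one. This is a toy model built on `heapq`, not SlotWorker's implementation. The function `run_shared_slot`, the tuple layout, and the convention that lower numbers mean higher priority are all assumptions.

```python
import heapq
import itertools


def run_shared_slot(arrivals):
    """Simulate one GPU slot shared by several workflows.

    arrivals: list of (arrival_tick, priority, name, n_steps);
    a lower priority number is more urgent. Each step is one model turn.
    Returns the names in the order their steps ran.
    """
    pending = sorted(arrivals)
    heap, order, tick = [], [], 0
    counter = itertools.count()  # tie-breaker keeps equal priorities first-in, first-out

    while pending or heap:
        # Admit every workflow that has arrived by now.
        while pending and pending[0][0] <= tick:
            _, priority, name, steps = pending.pop(0)
            heapq.heappush(heap, (priority, next(counter), name, steps))
        if not heap:
            tick = pending[0][0]  # idle until the next arrival
            continue
        priority, _, name, steps = heapq.heappop(heap)
        order.append(name)  # run one step, then yield the slot
        tick += 1
        if steps > 1:
            heapq.heappush(heap, (priority, next(counter), name, steps - 1))
    return order


print(run_shared_slot([(0, 5, "summarize", 3), (1, 1, "research", 2)]))
# → ['summarize', 'research', 'research', 'summarize', 'summarize']
```

The low-priority summarizer starts first, is interrupted between steps when the research workflow arrives, and resumes once the slot is free again.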

Use Case 3 — Guardrails Middleware Inside Your Own Orchestration Loop

Composing Forge Middleware Without Giving Up Loop Control

Not everyone wants forge to own the agent loop. If you have an existing LangChain agent, a custom while-loop orchestrator, or any other framework driving your LLM calls, you can import forge's middleware composables directly and inject them into your own loop. You stay in control of iteration, state management, and tool dispatch; forge handles response validation and rescue parsing as callable middleware.

Integrating Forge Validation into an Existing LangChain or Custom Agent

The pattern mirrors the examples/foreign_loop.py in the forge repo. Here's a minimal example showing forge middleware composing into a custom while-loop agent:

import json
from openai import OpenAI
from forge.middleware import ResponseValidator, RescueParser, RetryNudge

# Your own client pointing directly at the local model
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

# Define your tools (same format as OpenAI tool spec)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

# Instantiate forge middleware components
validator = ResponseValidator(tools=tools)
rescuer = RescueParser()
nudger = RetryNudge()

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
max_turns = 10

for turn in range(max_turns):
    response = client.chat.completions.create(
        model="ministral-3:8b-instruct-2512-q4_K_M",
        messages=messages,
        tools=tools,
    )

    msg = response.choices[0].message

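    # Attempt rescue if the model put a tool call in plain content instead of tool_calls
    if msg.content and not msg.tool_calls:
        rescued = rescuer.attempt_rescue(msg.content)
        if rescued:
            print(f"[turn {turn}] Rescued malformed tool call: {rescued}")
            # Treat the rescued result as a proper tool call and dispatch it
            result = {"temperature": "18C", "condition": "cloudy"}  # mock execution
            # A rescued call has no tool_call_id, so record the model's turn and
            # return the result as a plain message rather than an orphaned "tool" message
            messages.append({"role": "assistant", "content": msg.content})
            messages.append({"role": "user", "content": f"Tool result: {json.dumps(result)}"})
            continue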

    # Validate well-formed tool calls
    if msg.tool_calls:
        valid, error = validator.validate(msg.tool_calls[0])
        if not valid:
            print(f"[turn {turn}] Validation failed: {error}. Sending retry nudge.")
            nudge_msg = nudger.build_nudge(error)
            messages.append({"role": "user", "content": nudge_msg})
            continue

        # Execute the valid tool call
        tool_name = msg.tool_calls[0].function.name
        args = json.loads(msg.tool_calls[0].function.arguments)
        print(f"[turn {turn}] Executing {tool_name}({args})")
        # ... dispatch to your actual tool implementation ...
        break

    # No tool call and no rescuable content — model gave a final answer
    print(f"Final answer: {msg.content}")
    break

Handling Rescue Parsing Events in Your Own Error Handling Logic

The key insight in the middleware pattern is that rescue parsing and validation are separate concerns. Your loop decides what to do with the results — retry, log, alert, or escalate. Forge provides the detection and correction primitives; you keep orchestration control. This is ideal when you're adding forge to an existing production agent incrementally rather than migrating wholesale.
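
Since tool dispatch stays in your hands, the "dispatch to your actual tool implementation" placeholder in the loop can be filled with a small registry. The sketch below is one possible shape, independent of forge: `TOOL_REGISTRY` and `dispatch` are hypothetical names, and the weather function is the same mock used earlier.

```python
import json


def get_weather(location: str) -> dict:
    """Mock tool; a real implementation would call a weather API."""
    return {"location": location, "temp_c": 22, "condition": "sunny"}


# Map tool names (as the model emits them) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(tool_name: str, raw_arguments: str) -> str:
    """Run a tool and return a JSON string suitable for a tool-result message.

    Failures are returned as {"error": ...} so the loop can decide whether
    to retry, log, or escalate instead of crashing.
    """
    func = TOOL_REGISTRY.get(tool_name)
    if func is None:
        return json.dumps({"error": f"unknown tool {tool_name}"})
    try:
        args = json.loads(raw_arguments)
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # TypeError covers missing or unexpected keyword arguments.
        return json.dumps({"error": str(exc)})
```

Returning errors as data keeps the policy decision (retry, log, alert, or escalate) in your loop, where this section argues it belongs.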

Limitations and When Forge Is Not the Right Tool

Forge is excellent at what it does, but its scope is intentional. Here's where it stops and what to reach for instead.

Single Agentic Loop Scope: No Multi-Agent Graph Orchestration

Forge sits inside one agentic loop. It does not coordinate multiple agents, manage message passing between agents, or build DAG-style task graphs. If your architecture involves agent A spawning agent B which reports back to a supervisor agent C, forge can harden each individual loop but won't orchestrate across them. For that layer, look at LangGraph, AutoGen, or crewAI — and use forge's middleware mode inside each individual node if you want per-node reliability.

Model and Backend Constraints: Python 3.12+ and Supported Servers Only

Forge requires Python 3.12 or newer — not 3.10, not 3.11. If your stack is pinned to an older Python for other reasons, you'll need to isolate forge in a separate venv or container. Supported backends are Ollama, llama-server, Llamafile, vLLM, and Anthropic. If you're running TGI (Text Generation Inference) from Hugging Face or a custom inference server, forge won't connect without custom backend adapter work.

Eval Coverage Gaps: What the 26-Scenario Suite Does Not Test

Forge's 84% benchmark comes from 26 scenarios. These cover a solid cross-section of tool-calling patterns but are not exhaustive. Notably, the eval suite does not heavily cover: multi-turn conversations with ambiguous user intent, tool calls requiring large payload arguments (image data, long documents), or streaming responses. If your production use case is primarily streaming or involves very large tool arguments, your real-world reliability numbers may differ from the benchmark.

Anthropic Benchmark Caveats: Version and Cost Considerations

The Sonnet 4.6 numbers (85% → 98%) were measured in forge v0.6.0 and have not been re-run against v0.7.0 because Anthropic API calls at eval scale have non-trivial cost. Treat these as directionally accurate but not pinned to current model versions. Also note that forge's eval configs favor llama-server over Ollama for harder workloads — if you're benchmarking with Ollama, expect slightly lower numbers on complex multi-step scenarios.

Decision Checklist

| Situation | Recommendation |
|---|---|
| Running a local 8B model with tool-calling | ✅ Use forge |
| Want drop-in reliability for aider/opencode | ✅ Proxy mode |
| Building a greenfield agent with structured steps | ✅ WorkflowRunner |
| Need multi-agent coordination or DAG planning | ❌ Use LangGraph/AutoGen instead |
| Pinned to Python < 3.12 | ❌ Not compatible without version upgrade |
| Using HuggingFace TGI as inference backend | ❌ No native adapter; build your own |
| Need streaming response guardrails | ⚠️ Limited coverage; test carefully |

Forge fills a specific and genuinely painful gap in the local LLM ecosystem: making small models reliable enough to do real work. The 84% benchmark is hard to dismiss, and the three usage modes mean you can adopt it incrementally without a full rewrite. Start with proxy mode if you're already using an existing tool; move to WorkflowRunner when you're ready to build natively on forge's reliability guarantees.