Adding a Free Overflow Model to Your MCP Server: Gemma via the Gemini API

Joe Provence • April 12, 2026

Most agentic workflows have a single failure mode nobody plans for: the primary LLM hits its rate limit mid-session and everything stops. You can't log a result. You can't draft the next section. The workflow is blocked until the window resets. After hitting this enough times, I started treating it as an architecture problem rather than a billing problem.


The fix turned out to be simpler than I expected.


The Insight Hidden in the Gemini Docs

While auditing our Google AI Studio integration, I noticed that Gemma — Google's open-weight model family — is served through the exact same API endpoint as Gemini. Same Python SDK, same API key, different model string. And Gemma 3 27B costs $0 per million tokens on the free tier. If you already have a Gemini API key, you already have free access to a capable open-weight model. No new credentials, no additional SDK, no separate account.


That's the whole unlock.

Registering the Tool in FastMCP

Adding query_gemma to a FastMCP server is a thin wrapper — roughly fifteen lines:

```python
import os

import google.generativeai as genai
from fastmcp import FastMCP

# The same API key that serves Gemini also serves Gemma.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

mcp = FastMCP("my-server")

@mcp.tool()
def query_gemma(prompt: str, model: str = "gemma-3-27b-it") -> str:
    """Send a prompt to Gemma. Use for generation tasks to reduce primary LLM token usage."""
    client = genai.GenerativeModel(model)
    response = client.generate_content(prompt)
    return response.text
```

The model parameter defaults to gemma-3-27b-it but accepts the full family:

| Model | Best for |
| --- | --- |
| gemma-3-1b-it | Minimal tasks, fastest |
| gemma-3-4b-it | Classification, simple formatting |
| gemma-3-12b-it | General use |
| gemma-3-27b-it | Default — best Gemma 3 quality |
| gemma-4-26b-a4b-it | Gemma 4, efficient |
| gemma-4-31b-it | Gemma 4, highest quality |
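One way to make that table actionable is to let callers name a task category instead of a model string. This helper is hypothetical — it isn't part of the server above — and its categories simply mirror the table:

```python
# Hypothetical convenience mapping from task category to Gemma model string.
# The categories and choices mirror the table above; adjust to taste.
GEMMA_BY_TASK = {
    "minimal": "gemma-3-1b-it",         # fastest, trivial tasks
    "classification": "gemma-3-4b-it",  # classification, simple formatting
    "general": "gemma-3-12b-it",        # general use
    "drafting": "gemma-3-27b-it",       # best Gemma 3 quality
}

def pick_gemma_model(task_type: str) -> str:
    """Return a Gemma model string for a task category, defaulting to 27B."""
    return GEMMA_BY_TASK.get(task_type, "gemma-3-27b-it")
```

The primary LLM can then route cheap work to a small model without anyone memorizing version strings.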

After adding the tool, reconnect your MCP connector to reload the manifest. That's the entire deployment.


The Workflow Split That Makes This Useful

The important constraint: query_gemma is text in, text out. Gemma has no access to your tool registry. It can't call other MCP tools, query your data layer, or read session state. It only knows what you explicitly pass in the prompt.


This forces a clean separation that turns out to be the right design anyway. The primary LLM handles tool calls, data retrieval, QA, and logging. Gemma handles generation-heavy tasks — drafting, summarizing, classifying, formatting. The primary LLM does less of the expensive token work. When it hits rate limits, Gemma absorbs the generation queue while the primary LLM recovers.


The split also makes each model's role legible. If something fails, you know immediately which layer to look at.


The Gap That Remains

The free tier rate limits are real. Gemma 3 models allow 5–15 requests per minute depending on model size. For interactive workflows, that's usually fine. For anything resembling batch processing, you'll hit the ceiling fast and need retry logic.
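A minimal retry sketch for that ceiling. The rate-limit predicate is injected rather than hard-coded because the exact exception varies by SDK version (in my experience the google-generativeai SDK surfaces 429s via google.api_core exceptions, but treat that as an assumption); the sleep function is injectable so the backoff is testable without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(
    fn: Callable[[], T],
    is_rate_limit: Callable[[Exception], bool],  # assumption: caller knows its SDK's 429 error
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, backing off exponentially on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that isn't a rate limit, or the final failure.
            if not is_rate_limit(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise RuntimeError("unreachable")
```

Wrapping the generate_content call inside query_gemma with something like this turns the 5–15 requests-per-minute ceiling into latency instead of hard failures.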


The deeper limitation is context. Gemma doesn't know what your other tools returned unless you tell it. Every query_gemma call needs to be self-contained — task description, relevant data, output format, all passed explicitly. That's more prompt engineering overhead than calling a context-aware primary LLM, and it matters for complex tasks.
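In practice that means every call is assembled from parts the primary LLM already holds. A hypothetical prompt builder, just to make the "self-contained" requirement concrete — the three fields are the ones listed above, not an API contract:

```python
def build_gemma_prompt(task: str, data: str, output_format: str) -> str:
    """Assemble a self-contained prompt: Gemma sees nothing but this string."""
    return (
        f"Task: {task}\n\n"
        f"Data:\n{data}\n\n"
        f"Output format: {output_format}\n"
    )
```

The discipline of always filling in all three fields is the overhead being described: nothing is implicit, so nothing is remembered for you.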

What This Is and Isn't

This isn't a replacement for your primary LLM. For tasks requiring tool calls, structured reasoning over live data, or anything where the model needs to know what happened earlier in the session — you still need the primary stack.


For pure generation tasks, it works well and it's free. The practical framing: treat it as a relief valve on your token budget, not a second brain.


Build your overflow capacity the same way you build your primary stack — thin interfaces, clear contracts, explicit failure modes. A model you can swap in when the primary one is saturated is worth more than a more powerful model you can't afford to run continuously.


If you're starting from scratch, here's how we built the underlying FastMCP server on Oracle Cloud Free Tier — no DevOps experience required.
