This page focuses on a single pattern: wiring Morphik retrieval tools into your own agent loop.

Build a Retrieval Agent with Morphik Tools

Use Morphik’s retrieval APIs as function-calling tools when you want a lightweight agent loop you fully control. The example below uses OpenAI’s Responses API, list_documents for discovery, and retrieve_chunks (with ColPali image URLs) for grounded answers.
Prerequisites: pip install morphik openai, set MORPHIK_URI and OPENAI_API_KEY, and update the metadata fields in SYSTEM_INSTRUCTIONS to match your corpus.
Giving the LLM direct access to retrieval tools lets it decide when to explore versus fetch, narrow scope with filters, and request only the pages it needs. This example exposes two tools, but you can add more (e.g., page-range or download helpers) from the API reference to expand the agent’s toolkit while keeping guardrails tight. Any major LLM with tool-calling support works (OpenAI, Anthropic, Gemini, etc.); we show OpenAI for familiarity and will add provider-specific snippets soon. The core loop stays the same regardless of provider. Imports used in the snippets below:
import argparse
import json
import os
from morphik import Morphik
from openai import OpenAI

1) System prompt

Use this to govern the agent’s behavior, metadata awareness, and citation discipline; it is the primary control surface for how the loop plans and answers.
SYSTEM_INSTRUCTIONS = """You are a helpful assistant with access to a curated knowledge base.

Available metadata fields (customize to your schema):
- doc_id (string)
- doc_type (string)
- title (string)
- category (string)
- status (string)
- effective_date (datetime)
- expires_at (datetime)
- tags (array)
- region (string)

Filter operators:
- $eq, $gt, $gte, $lt, $lte, $and, $contains, $regex

Strategy:
1) Call list_documents to see available material and metadata.
2) Apply filters before retrieval when date, status, region, or tags matter.
3) Use retrieve_chunks with use_colpali=True and output_format="url" to pull text and images.
4) Cite sources with doc_id (or your ID field), filename, and page numbers.
"""

2) Tool schemas (OpenAI Responses API)

Expose only the tools you want the model to call. The schema format may differ slightly by provider, but the tool list and arguments stay consistent.
TOOLS = [
    {
        "type": "function",
        "name": "list_documents",
        "description": "List documents in the knowledge base with optional metadata filters.",
        "parameters": {
            "type": "object",
            "properties": {
                "filters": {
                    "type": ["string", "null"],
                    "description": "Metadata filters as a JSON string. Pass null or empty for no filters."
                },
                "limit": {
                    "type": ["integer", "null"],
                    "description": "Maximum number of documents to return (default: 10)"
                }
            },
            "required": ["filters", "limit"],
            "additionalProperties": False
        },
        "strict": True
    },
    {
        "type": "function",
        "name": "retrieve_chunks",
        "description": "Retrieve relevant document chunks for a query using semantic search.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query or user question"
                },
                "filters": {
                    "type": ["string", "null"],
                    "description": "Optional metadata filters as a JSON string"
                },
                "k": {
                    "type": ["integer", "null"],
                    "description": "Number of chunks to retrieve (default: 3)"
                }
            },
            "required": ["query", "filters", "k"],
            "additionalProperties": False
        },
        "strict": True
    }
]
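Note the convention these schemas force: because strict mode requires every property to be present and typed, `filters` travels as a JSON string (or null) inside the already-JSON-encoded `call.arguments`. Decoding therefore happens in two steps, which is exactly what the tool implementations below do. A minimal sketch with a hand-written arguments payload standing in for the model's output:

```python
import json

# Simulated call.arguments string as the model would produce it.
# Note the escaped inner JSON: filters is a string, not an object.
raw_arguments = '{"query": "safety policies", "filters": "{\\"status\\": {\\"$eq\\": \\"active\\"}}", "k": 3}'

args = json.loads(raw_arguments)       # first decode: the arguments object
filters = json.loads(args["filters"])  # second decode: the nested filter string
print(filters)
```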

3) Tool implementations (Morphik Python SDK)

These Python functions bridge the LLM’s tool calls to Morphik APIs and normalize outputs (metadata, previews, image URLs) for reuse in the loop.
def list_documents(morphik: Morphik, filters=None, limit=None):
    if isinstance(filters, str) and filters.strip():
        filters = json.loads(filters)

    response = morphik.list_documents(filters=filters, limit=limit or 10)
    docs = response.documents if hasattr(response, "documents") else response

    normalized = []
    for doc in docs:
        meta = doc.metadata or {}
        normalized.append({
            "external_id": doc.external_id,
            "filename": doc.filename,
            "doc_id": meta.get("doc_id"),
            "title": meta.get("title"),
            "doc_type": meta.get("doc_type"),
            "category": meta.get("category"),
            "status": meta.get("status"),
            "effective_date": str(meta.get("effective_date")) if meta.get("effective_date") else None,
            "expires_at": str(meta.get("expires_at")) if meta.get("expires_at") else None,
            "tags": meta.get("tags") or [],
        })

    return {"documents": normalized, "count": len(normalized)}


def retrieve_chunks(morphik: Morphik, query, filters=None, k=None):
    if isinstance(filters, str) and filters.strip():
        filters = json.loads(filters)

    chunks = morphik.retrieve_chunks(
        query=query,
        filters=filters,
        k=k or 3,
        use_colpali=True,
        output_format="url"
    )

    results = []
    image_urls = []

    for chunk in chunks:
        meta = chunk.metadata or {}
        is_image_url = isinstance(chunk.content, str) and chunk.content.startswith("http")

        if is_image_url:
            image_urls.append(chunk.content)

        preview = chunk.content
        if isinstance(preview, str) and len(preview) > 300:
            preview = f"{preview[:300]}..."
        if is_image_url:
            preview = f"[Image URL: {chunk.content}]"

        results.append({
            "score": round(chunk.score, 3),
            "content_preview": preview,
            "filename": meta.get("filename") or chunk.filename or "unknown",
            "doc_id": meta.get("doc_id"),
            "doc_type": meta.get("doc_type"),
            "title": meta.get("title"),
            "expires_at": str(meta.get("expires_at")) if meta.get("expires_at") else None,
            "tags": meta.get("tags") or [],
        })

    return {
        "chunks": results,
        "count": len(results),
        "image_urls": image_urls
    }
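The preview and image-URL handling above can be exercised in isolation. This sketch uses `SimpleNamespace` stubs in place of Morphik's chunk objects (the stubs are illustrative, not the SDK's real types), showing how an image chunk becomes a placeholder and a long text chunk gets truncated to 300 characters:

```python
from types import SimpleNamespace

# Stand-ins for Morphik chunk objects: one ColPali image URL, one long text chunk.
chunks = [
    SimpleNamespace(content="https://example.com/page-3.png", score=0.91),
    SimpleNamespace(content="A" * 400, score=0.846),
]

previews = []
for chunk in chunks:
    is_image_url = isinstance(chunk.content, str) and chunk.content.startswith("http")
    preview = chunk.content
    if isinstance(preview, str) and len(preview) > 300:
        preview = f"{preview[:300]}..."   # truncate long text previews
    if is_image_url:
        preview = f"[Image URL: {chunk.content}]"  # never truncate URLs
    previews.append((round(chunk.score, 3), preview))

print(previews[0])
```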

4) Agent runner (OpenAI Responses loop)

Handles the model call, executes tool invocations, and re-feeds results until the model returns a final answer. Swap the client/model for any tool-calling LLM and keep the structure.
def run_agent(morphik: Morphik, openai_client: OpenAI, user_query: str, model: str):
    conversation = [{"role": "user", "content": user_query}]

    for _ in range(5):
        response = openai_client.responses.create(
            model=model,
            instructions=SYSTEM_INSTRUCTIONS,
            tools=TOOLS,
            input=conversation
        )

        conversation += response.output
        calls = [item for item in response.output if item.type == "function_call"]

        if not calls:
            print(response.output_text)
            return

        for call in calls:
            args = json.loads(call.arguments)

            if call.name == "list_documents":
                result = list_documents(
                    morphik,
                    filters=args.get("filters"),
                    limit=args.get("limit")
                )
                image_urls = []
            elif call.name == "retrieve_chunks":
                result = retrieve_chunks(
                    morphik,
                    query=args["query"],
                    filters=args.get("filters"),
                    k=args.get("k")
                )
                image_urls = result.get("image_urls", [])
            else:
                result = {"error": f"Unknown function: {call.name}"}
                image_urls = []

            conversation.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": json.dumps(result)
            })

            if image_urls:
                image_content = [{"type": "input_text", "text": "Here are the retrieved pages:"}]
                for url in image_urls:
                    image_content.append({"type": "input_image", "image_url": url})
                conversation.append({"role": "user", "content": image_content})

    print("[Warning] Reached maximum iterations without a final answer")

5) Optional CLI entry point

Wraps the runner in a small CLI so the URI, API key, model, and query can be supplied from flags or environment variables.

def main():
    parser = argparse.ArgumentParser(description="Retrieval agent with Morphik tools")
    parser.add_argument("--uri", default=os.getenv("MORPHIK_URI"))
    parser.add_argument("--query", default="What are the latest policy updates published this year?")
    parser.add_argument("--openai-key", default=os.getenv("OPENAI_API_KEY"))
    parser.add_argument("--model", default="gpt-4o-mini")
    args = parser.parse_args()

    if not args.uri:
        raise SystemExit("MORPHIK_URI not set. Use --uri or set MORPHIK_URI.")
    if not args.openai_key:
        raise SystemExit("OPENAI_API_KEY not set. Use --openai-key or set OPENAI_API_KEY.")

    morphik = Morphik(args.uri, timeout=30)
    openai_client = OpenAI(api_key=args.openai_key)
    run_agent(morphik, openai_client, args.query, args.model)


if __name__ == "__main__":
    main()
Run it with your own question:
export MORPHIK_URI="morphik://your-token@your-host"
export OPENAI_API_KEY="your-openai-key"
python retrieval_agent.py --query "Where can I find the most recent safety policies?"
To extend this loop, add more Morphik tools (e.g., page-range retrieval or download helpers) from the API reference into TOOLS and implement matching handlers in the runner. This pattern keeps full control over tool usage while grounding responses with Morphik. Because the agent can plan, execute tools, and carry context across turns, it suits business intelligence and research workflows that require cited, document-grounded answers.
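Each additional tool is just another schema in TOOLS plus an `elif` branch in run_agent. A sketch of what that might look like; the `get_document_pages` name and its parameters are hypothetical, so check the Morphik API reference for the real method name and signature before wiring it up:

```python
# Hypothetical extra tool schema -- the name and parameters are illustrative;
# consult the Morphik API reference for the actual endpoint and arguments.
PAGE_RANGE_TOOL = {
    "type": "function",
    "name": "get_document_pages",
    "description": "Fetch a specific page range from a document.",
    "parameters": {
        "type": "object",
        "properties": {
            "document_id": {"type": "string", "description": "Document external_id"},
            "start_page": {"type": "integer", "description": "First page (1-based)"},
            "end_page": {"type": "integer", "description": "Last page, inclusive"},
        },
        "required": ["document_id", "start_page", "end_page"],
        "additionalProperties": False,
    },
    "strict": True,
}

# In the real script: TOOLS.append(PAGE_RANGE_TOOL), then add a matching
# elif call.name == "get_document_pages" branch inside run_agent.
print(PAGE_RANGE_TOOL["name"])
```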