Advanced Usage

This guide covers advanced features of the DataBridge Python SDK, including rules for document processing and local caching.

Rules

DataBridge supports rules that can be applied during document ingestion to transform content or extract metadata.

Metadata Extraction Rules

Use MetadataExtractionRule to automatically extract structured metadata from document content:

from databridge.sync import DataBridge
from databridge.rules import MetadataExtractionRule
from pydantic import BaseModel

# Define the schema for metadata extraction
class ResearchPaperInfo(BaseModel):
    title: str
    authors: list[str]
    publication_date: str
    abstract: str

# Create the rule
extraction_rule = MetadataExtractionRule(schema=ResearchPaperInfo)

# Apply the rule during ingestion
with DataBridge() as db:
    doc = db.ingest_text(
        """Title: Deep Learning Approaches for Natural Language Processing
        Authors: John Smith, Jane Doe
        Publication Date: 2023-04-15
        Abstract: This paper explores various deep learning techniques for NLP tasks...""",
        rules=[extraction_rule]
    )
    
    # The extracted metadata will be available in the document's metadata
    print(doc.metadata)

Natural Language Rules

Use NaturalLanguageRule to transform content using natural language instructions:

from databridge.rules import NaturalLanguageRule

# Create a rule to summarize content
summarize_rule = NaturalLanguageRule(
    prompt="Summarize this content into 3-5 key bullet points"
)

# Apply the rule during ingestion
doc = db.ingest_file(
    "long_report.pdf",
    rules=[summarize_rule]
)

Caching

DataBridge supports local caching to improve performance for repeated queries.

Creating a Cache

# Create a cache for specific documents
cache = db.create_cache(
    name="research_cache",
    model="llama2",
    gguf_file="llama-2-7b-chat.Q4_K_M.gguf",
    filters={"type": "research"}
)

Using a Cache

# Get an existing cache
cache = db.get_cache("research_cache")

# Update the cache with the latest documents
cache.update()

# Add specific documents to the cache
cache.add_docs(["doc_123", "doc_456"])

# Query using the cache
response = cache.query(
    "What are the key findings in the latest research?",
    max_tokens=500,
    temperature=0.7
)

print(response.completion)

Using Filters

Filters allow you to narrow down your document search:

# Basic equality filter
docs = db.retrieve_docs(
    "machine learning",
    filters={"category": "tech"}
)

# Multiple conditions
docs = db.retrieve_docs(
    "budget analysis",
    filters={
        "department": "finance",
        "year": 2023,
        "status": "approved"
    }
)

Handling Images

DataBridge can process and retrieve images:

# Retrieve images
chunks = db.retrieve_chunks(
    "architectural diagram of the office building",
    filters={"subject": "architecture"}
)

for chunk in chunks:
    if isinstance(chunk.content, PILImage):
        # It's an image
        chunk.content.show()  # Display the image
    else:
        # It's text
        print(chunk.content)

Asynchronous Usage

For applications that benefit from asynchronous operations:

from databridge.async_ import AsyncDataBridge

async def main():
    async with AsyncDataBridge() as db:
        # Ingest a document
        doc = await db.ingest_text("Sample content")
        
        # Retrieve chunks
        chunks = await db.retrieve_chunks("sample query")
        
        # Generate a completion
        response = await db.query("Explain this concept")
        print(response.completion)

# Run with an async event loop
import asyncio
asyncio.run(main())

Error Handling

Proper error handling for robust applications:

from databridge.exceptions import DataBridgeError, AuthenticationError, ConnectionError

try:
    with DataBridge("databridge://owner_id:invalid_token@api.databridge.ai") as db:
        doc = db.ingest_text("Sample content")
except AuthenticationError:
    print("Authentication failed - check your credentials")
except ConnectionError:
    print("Connection failed - check your network or server status")
except DataBridgeError as e:
    print(f"Operation failed: {str(e)}")