ingest_file

Ingest a file into DataBridge as a document.

def ingest_file(
    file: Union[str, bytes, BinaryIO, Path],
    filename: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
    rules: Optional[List[RuleOrDict]] = None,
    use_colpali: bool = True,
) -> Document

Parameters

  • file (Union[str, bytes, BinaryIO, Path]): File to ingest (path string, bytes, file object, or Path)
  • filename (str, optional): Name of the file
  • metadata (Dict[str, Any], optional): Metadata dictionary to associate with the document
  • rules (List[RuleOrDict], optional): Rules to apply during ingestion. Each rule can be:
    • MetadataExtractionRule: Extract metadata using a schema
    • NaturalLanguageRule: Transform content using natural language
  • use_colpali (bool, optional): Whether to use a ColPali-style embedding model to ingest the file. This is slower, but gives significantly better retrieval accuracy for images. Defaults to True.
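
Because file accepts four different input types, it can help to see how such a union might be reduced to a single (filename, content) pair. The sketch below is illustrative only: normalize_file_input is a hypothetical helper, not part of DataBridge, and the rule that raw bytes and unnamed file objects need an explicit filename is an assumption, not documented behavior.

```python
from pathlib import Path
from typing import BinaryIO, Optional, Tuple, Union


def normalize_file_input(
    file: Union[str, bytes, BinaryIO, Path],
    filename: Optional[str] = None,
) -> Tuple[str, bytes]:
    """Hypothetical helper: return (filename, content) for any accepted input type."""
    if isinstance(file, (str, Path)):
        # Path-like input: read from disk; the name defaults to the path's basename.
        path = Path(file)
        return filename or path.name, path.read_bytes()
    if isinstance(file, bytes):
        # Assumption: raw bytes carry no name, so the caller must supply one.
        if filename is None:
            raise ValueError("filename is required when file is bytes")
        return filename, file
    # Anything else is treated as a binary file object.
    name = filename or Path(str(getattr(file, "name", ""))).name
    if not name:
        raise ValueError("filename is required for unnamed file objects")
    return name, file.read()
```

Under these assumptions, normalize_file_input(b"...", filename="report.pdf") returns the bytes unchanged alongside the given name, while a path string or Path is read from disk and named after its basename unless filename overrides it.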

Returns

  • Document: Metadata of the ingested document

Example

from databridge.sync import DataBridge
from databridge.rules import MetadataExtractionRule, NaturalLanguageRule
from pydantic import BaseModel

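# Schema describing the metadata fields to extract from the document
class DocumentInfo(BaseModel):
    title: str
    author: str
    department: str

db = DataBridge()

doc = db.ingest_file(
    "document.pdf",
    filename="document.pdf",
    metadata={"category": "research"},
    rules=[
        # Extract metadata matching the DocumentInfo schema
        MetadataExtractionRule(schema=DocumentInfo),
        # Transform the content using a natural language prompt
        NaturalLanguageRule(prompt="Extract key points only"),
    ],
)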