Ingestion & Parsing
Native PDF, scanned image, and DOCX ingestion. OCR for degraded scans; layout detection preserves tables, schedules, and exhibits. Headings, clause boundaries, and defined terms are identified structurally — not inferred from formatting alone.
- PyMuPDF
- Tesseract / Azure OCR
- LayoutLM
- Unstructured.io
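
The routing step implied above (native text extraction where a text layer exists, OCR where it does not) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the function names `detect_format`, `needs_ocr`, and `plan_extraction`, and the 20-character threshold, are hypothetical. The sketch uses only the standard library; in practice the per-page text might come from PyMuPDF's `page.get_text()`, and pages marked `"ocr"` would be handed to Tesseract or Azure OCR.

```python
from typing import List

# File signatures ("magic bytes") used to identify the container format.
PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # DOCX is a ZIP container
IMAGE_MAGICS = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"II*\x00",            # TIFF, little-endian
    b"MM\x00*",            # TIFF, big-endian
)


def detect_format(header: bytes, suffix: str) -> str:
    """Classify a file from its first bytes, using the suffix only to
    distinguish DOCX from other ZIP-based files (hypothetical helper)."""
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(ZIP_MAGIC) and suffix.lower() == ".docx":
        return "docx"
    if header.startswith(IMAGE_MAGICS):
        return "image"
    return "unknown"


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed threshold): a PDF page with almost no
    extractable text is probably a scan and should go to OCR."""
    return len(page_text.strip()) < min_chars


def plan_extraction(fmt: str, page_texts: List[str]) -> List[str]:
    """Return one extraction route per page: "native" or "ocr"."""
    if fmt == "image":
        return ["ocr"]      # a scanned image has no text layer
    if fmt == "docx":
        return ["native"]   # DOCX text is always directly readable
    if fmt == "pdf":
        return ["ocr" if needs_ocr(text) else "native" for text in page_texts]
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    fmt = detect_format(b"%PDF-1.7\n", ".pdf")
    pages = ["This Agreement is made between the parties.", ""]
    print(fmt, plan_extraction(fmt, pages))
```

Deciding per page rather than per file is a deliberate choice in this sketch: contracts often mix born-digital pages with scanned signature pages or exhibits, so a single PDF may need both routes.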