GenLib Architecture
End-to-end flow for PDF ingestion, chunk summarization, image extraction, manual overrides, and table snapshots.
System Overview
- Input PDFs live under _00_sources/.
- Summaries are generated per chunk and written to _01_summaries/summaries_<slug>.json.
- Rendered assets (figures and table snapshots) are written to _01_extracted_images/<slug>/.
- The web app serves books from JSON + extracted assets. Current catalog size: 35 books.
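The directory layout above can be captured in a small path helper. This is a minimal sketch: the directory and file names come from the list above, while the function name `book_paths`, the `root` argument, and the dictionary keys are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def book_paths(root, slug):
    """Return the expected locations for one book, identified by its slug.

    The helper name, the `root` argument, and the dict keys are illustrative
    assumptions; only the directory and file names come from the layout above.
    """
    root = Path(root)
    return {
        "sources_dir": root / "_00_sources",
        "summaries_json": root / "_01_summaries" / f"summaries_{slug}.json",
        "assets_dir": root / "_01_extracted_images" / slug,
    }


paths = book_paths("/data/genlib", "example-book")
print(paths["summaries_json"].as_posix())
```

Keeping the layout in one place means the ingestion steps and the web app resolve the same files for a given slug.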
Image Extraction
- Extract embedded PDF images with PyMuPDF.
- Filter out tiny or near-solid assets.
- For masked/layered assets, render clipped regions and crop to meaningful connected components.
- Distribute accepted images across chunks (max images per chunk).
For difficult PDFs, manual overrides are authoritative and can fully disable automatic image extraction.
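The filtering and distribution steps might look roughly like the sketch below. It is not the project's actual code: the thresholds (`min_side`, `max_dominant`, `max_per_chunk`), the "dominant gray value" test for near-solid images, and the page-range representation of chunks are all assumptions made for illustration. The sketch works on plain Python values so it is independent of how the pixels were obtained from PyMuPDF.

```python
from collections import Counter


def keep_image(width, height, gray_pixels, min_side=48, max_dominant=0.97):
    """Reject tiny images and images dominated by a single gray value.

    gray_pixels is a flat list of 0-255 values. Both thresholds are
    illustrative assumptions, not values taken from the project.
    """
    if min(width, height) < min_side or not gray_pixels:
        return False
    dominant_count = Counter(gray_pixels).most_common(1)[0][1]
    return dominant_count / len(gray_pixels) <= max_dominant


def distribute_images(images, chunk_page_ranges, max_per_chunk=2):
    """Assign each image to the first chunk that covers its page and has room.

    images: list of dicts with "name" and "page" keys (hypothetical shape).
    chunk_page_ranges: list of (first_page, last_page) tuples, inclusive.
    Images that fit no chunk, or whose chunk is full, are dropped.
    """
    assigned = {index: [] for index in range(len(chunk_page_ranges))}
    for image in images:
        for index, (first, last) in enumerate(chunk_page_ranges):
            if first <= image["page"] <= last and len(assigned[index]) < max_per_chunk:
                assigned[index].append(image["name"])
                break
    return assigned


images = [
    {"name": "a.png", "page": 1},
    {"name": "b.png", "page": 1},
    {"name": "c.png", "page": 1},
    {"name": "d.png", "page": 2},
]
print(distribute_images(images, [(1, 1), (2, 3)]))
```

With a cap of two images per chunk, the third image on page 1 is dropped rather than pushed into an unrelated chunk.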
Manual Overrides (Human in the Loop)
- Override file: _01_manual_image_overrides/<slug>.json.
- Each override targets page + chunk_index with normalized crop bounds rect_norm.
- manual_only: true disables automatic image extraction for that book.
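As an illustration, an override file could be read and its normalized bounds converted to page coordinates as sketched below. The field names `page`, `chunk_index`, `rect_norm`, and `manual_only` come from the list above. The top-level `images` key, the `[x0, y0, x1, y1]` ordering of `rect_norm`, and the helper name are assumptions, so treat this as one plausible shape rather than the actual schema.

```python
import json

# Hypothetical override file contents; the "images" key and the
# [x0, y0, x1, y1] ordering of rect_norm are assumptions.
EXAMPLE_OVERRIDE = """
{
  "manual_only": true,
  "images": [
    {"page": 12, "chunk_index": 3, "rect_norm": [0.25, 0.125, 0.75, 0.5]}
  ]
}
"""


def rect_norm_to_page(rect_norm, page_width, page_height):
    """Scale a normalized (0..1) rectangle to absolute page coordinates."""
    x0, y0, x1, y1 = rect_norm
    return (x0 * page_width, y0 * page_height, x1 * page_width, y1 * page_height)


override = json.loads(EXAMPLE_OVERRIDE)
run_automatic_extraction = not override.get("manual_only", False)

for entry in override["images"]:
    clip = rect_norm_to_page(entry["rect_norm"], page_width=600, page_height=800)
    print(entry["page"], entry["chunk_index"], clip)
```

Normalized bounds keep a crop valid regardless of the resolution at which the page is later rendered.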
Table Snapshot Extraction
- Detect table regions with page.find_tables().
- Filter small/line-like candidates by geometry thresholds.
- Skip table candidates that substantially overlap manual image crops (prevents duplicate visuals).
- Deduplicate overlapping table boxes on the same page.
- Render each surviving box as a PNG snapshot and attach it to the section's table_images.
We intentionally render table snapshots as images to preserve layout fidelity from the source PDF.
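The geometric filtering, manual-crop check, and deduplication steps can be sketched on plain `(x0, y0, x1, y1)` rectangles, for example the bounding boxes of the tables that `page.find_tables()` reports. The thresholds, the "fraction of the candidate covered" overlap measure, and the largest-first ordering are assumptions for illustration; the project may use different rules.

```python
def area(rect):
    """Area of an (x0, y0, x1, y1) rectangle; zero if it is degenerate."""
    return max(0.0, rect[2] - rect[0]) * max(0.0, rect[3] - rect[1])


def covered_fraction(rect, other):
    """Fraction of rect's area that lies inside other."""
    overlap = (
        max(rect[0], other[0]),
        max(rect[1], other[1]),
        min(rect[2], other[2]),
        min(rect[3], other[3]),
    )
    rect_area = area(rect)
    return area(overlap) / rect_area if rect_area else 0.0


def select_table_boxes(candidates, manual_crops,
                       min_width=80, min_height=40, max_overlap=0.5):
    """Keep table boxes that pass size, manual-crop, and duplicate checks.

    All three thresholds are illustrative assumptions.
    """
    kept = []
    # Visit larger boxes first so a nested duplicate loses to its container.
    for box in sorted(candidates, key=area, reverse=True):
        width, height = box[2] - box[0], box[3] - box[1]
        if width < min_width or height < min_height:
            continue  # too small or line-like
        if any(covered_fraction(box, crop) > max_overlap for crop in manual_crops):
            continue  # already shown as a manual image crop
        if any(covered_fraction(box, other) > max_overlap for other in kept):
            continue  # duplicate of a box kept earlier on this page
        kept.append(box)
    return kept


candidates = [
    (0, 0, 200, 100),      # real table
    (10, 10, 190, 90),     # nested duplicate of the first box
    (0, 200, 300, 202),    # line-like rule
    (300, 300, 500, 400),  # overlaps a manual crop
]
print(select_table_boxes(candidates, manual_crops=[(290, 290, 510, 410)]))
```

Each surviving box would then be rendered as a clipped page image, which is the snapshot step described above.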
Book Rendering
- /book/<slug> renders summaries, figure images, table snapshots, and original text toggles.
- /source_pdf/<slug> serves the original PDF when available.
- Book pages include a direct “View Original PDF” link at the top.