GenLib Architecture

End-to-end flow for PDF ingestion, chunk summarization, image extraction, manual overrides, and table snapshots.

System Overview

  • Input PDFs live under _00_sources/.
  • Summaries are generated per chunk and written to _01_summaries/summaries_<slug>.json.
  • Rendered assets (figures and table snapshots) are written to _01_extracted_images/<slug>/.
  • The web app serves each book from its summary JSON plus the extracted assets. Current catalog size: 35 books.
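
As a rough illustration of the layout above, the sketch below maps a book slug to its summary file and asset directory and derives the catalog from the summary files on disk. The directory names and the summaries_<slug>.json pattern come from this page; the root argument and the helper names (summary_path, assets_dir, list_slugs) are hypothetical and not taken from the GenLib code.

```python
from pathlib import Path

# Directory names are taken from the overview; everything else is illustrative.
SUMMARIES_DIR = "_01_summaries"
IMAGES_DIR = "_01_extracted_images"
SUMMARY_PREFIX = "summaries_"


def summary_path(root: Path, slug: str) -> Path:
    """Path of the per-book summary JSON."""
    return root / SUMMARIES_DIR / f"{SUMMARY_PREFIX}{slug}.json"


def assets_dir(root: Path, slug: str) -> Path:
    """Directory holding rendered figures and table snapshots for a book."""
    return root / IMAGES_DIR / slug


def list_slugs(root: Path) -> list[str]:
    """Derive the catalog from the summary files present on disk (assumed approach)."""
    pattern = f"{SUMMARY_PREFIX}*.json"
    return sorted(
        p.stem[len(SUMMARY_PREFIX):] for p in (root / SUMMARIES_DIR).glob(pattern)
    )
```

Deriving the catalog from the files on disk keeps the book count in step with what was actually summarized, although the real app may track its catalog differently.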

Image Extraction

  1. Extract embedded PDF images with PyMuPDF.
  2. Filter out tiny or near-solid assets.
  3. For masked/layered assets, render clipped regions and crop to meaningful connected components.
  4. Distribute accepted images across chunks, capped at a maximum number of images per chunk.
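
Steps 2 and 4 can be sketched in plain Python as shown below. This is a minimal sketch under stated assumptions, not the GenLib implementation: the thresholds (MIN_SIDE, MIN_STDDEV), the Candidate structure, and the rule that an image goes to the chunk whose page range contains it are all hypothetical. In practice the pixel samples would come from the images extracted with PyMuPDF.

```python
from dataclasses import dataclass

MIN_SIDE = 32      # hypothetical: reject images narrower or shorter than this (pixels)
MIN_STDDEV = 4.0   # hypothetical: reject images whose grayscale values barely vary


@dataclass
class Candidate:
    page: int
    width: int
    height: int
    pixels: list[int]  # grayscale samples in 0..255


def is_near_solid(pixels: list[int], min_stddev: float = MIN_STDDEV) -> bool:
    """Treat an image as near-solid when its grayscale standard deviation is low."""
    if not pixels:
        return True
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return variance ** 0.5 < min_stddev


def keep(candidate: Candidate) -> bool:
    """Step 2: drop tiny or near-solid assets."""
    if min(candidate.width, candidate.height) < MIN_SIDE:
        return False
    return not is_near_solid(candidate.pixels)


def distribute(images, chunk_ranges, max_per_chunk=3):
    """Step 4: assign each image to the first chunk that covers its page and has room.

    chunk_ranges is a list of (first_page, last_page) tuples, inclusive.
    Images that fit no chunk, or whose chunks are full, are left unassigned.
    """
    assigned = {i: [] for i in range(len(chunk_ranges))}
    for image in images:
        for i, (first, last) in enumerate(chunk_ranges):
            if first <= image.page <= last and len(assigned[i]) < max_per_chunk:
                assigned[i].append(image)
                break
    return assigned
```

The cap keeps a single image-heavy page from crowding out the summary text of its chunk.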

For difficult PDFs, manual overrides are authoritative and can fully disable automatic image extraction.

Manual Overrides (Human in the Loop)

  • Override file: _01_manual_image_overrides/<slug>.json.
  • Each override targets a page and a chunk_index, with normalized crop bounds given in rect_norm.
  • manual_only: true disables automatic image extraction for that book.
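
To make the override format concrete, here is a hedged sketch of what such a file might contain and how rect_norm could be turned into page coordinates. The field names page, chunk_index, rect_norm, and manual_only come from the list above; the top-level "overrides" key, the [x0, y0, x1, y1] ordering with a top-left origin, and the function name rect_from_norm are assumptions made for illustration.

```python
import json

# Hypothetical override file content; only the field names listed above are from the source.
EXAMPLE_OVERRIDE = json.loads("""
{
  "manual_only": true,
  "overrides": [
    {"page": 12, "chunk_index": 3, "rect_norm": [0.25, 0.5, 0.75, 1.0]}
  ]
}
""")


def rect_from_norm(rect_norm, page_width, page_height):
    """Scale normalized [x0, y0, x1, y1] bounds (assumed order) to page units."""
    x0, y0, x1, y1 = rect_norm
    if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
        raise ValueError(f"rect_norm out of range or inverted: {rect_norm}")
    return (x0 * page_width, y0 * page_height, x1 * page_width, y1 * page_height)


def automatic_extraction_enabled(override: dict) -> bool:
    """manual_only: true switches automatic image extraction off for the book."""
    return not override.get("manual_only", False)
```

Normalized bounds stay valid if the page is rendered at a different resolution, which is presumably why the overrides store them rather than pixel coordinates.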

Table Snapshot Extraction

  1. Detect table regions with page.find_tables().
  2. Filter small/line-like candidates by geometry thresholds.
  3. Skip table candidates that substantially overlap manual image crops (prevents duplicate visuals).
  4. Deduplicate overlapping table boxes on the same page.
  5. Render each surviving box as a PNG snapshot and attach it to the section's table_images.

We intentionally render table snapshots as images to preserve layout fidelity from the source PDF.
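
The filtering in steps 2 to 4 can be sketched as follows. This is an illustration under assumptions rather than the actual GenLib logic: the thresholds (min_width, min_height, manual_overlap, dup_iou), the use of intersection-over-union for deduplication, and the preference for larger boxes are all hypothetical choices. Boxes are (x0, y0, x1, y1) tuples, as they might be taken from the detected tables and the manual crops on one page.

```python
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: Box) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def covered_fraction(box: Box, other: Box) -> float:
    """Fraction of box that lies inside other."""
    box_area = area(box)
    return intersection_area(box, other) / box_area if box_area else 0.0


def iou(a: Box, b: Box) -> float:
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def filter_table_boxes(tables, manual_crops, min_width=40.0, min_height=20.0,
                       manual_overlap=0.5, dup_iou=0.6):
    """Apply geometry, manual-crop, and duplicate filters to one page's table boxes."""
    kept = []
    # Visit larger boxes first so a duplicate pair keeps the larger box.
    for box in sorted(tables, key=area, reverse=True):
        if box[2] - box[0] < min_width or box[3] - box[1] < min_height:
            continue  # step 2: small or line-like candidate
        if any(covered_fraction(box, crop) >= manual_overlap for crop in manual_crops):
            continue  # step 3: already covered by a manual image crop
        if any(iou(box, other) >= dup_iou for other in kept):
            continue  # step 4: duplicate of a box already kept
        kept.append(box)
    return kept
```

Checking manual crops before deduplication means a table already covered by a human-chosen crop never competes with other candidates.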

Book Rendering

  • /book/<slug> renders summaries, figure images, table snapshots, and original text toggles.
  • /source_pdf/<slug> serves the original PDF when available.
  • Book pages include a direct “View Original PDF” link at the top.