The Processing Engine
A technical walkthrough of the four-stage pipeline that runs every time you submit a job. Understanding this helps you write better prompts and debug unexpected results.
The four-stage pipeline
Stage 1: Text & image extraction
When you upload a PDF, the server renders every page at high resolution. Two things are extracted in parallel: the selectable text layer (if the page has one) and a rasterized image of the entire page.
Both are sent to the AI. The image allows the model to understand diagrams, hand-written content, scanned documents, and visual structure that plain text can't capture.
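As a rough illustration of the per-page result of this stage, the sketch below runs two extractors concurrently and bundles their results. Everything in it is an assumption made for illustration: `PagePayload`, `extract_text_layer`, and `render_page_image` are hypothetical names, and the two extractors are stand-ins rather than the engine's real PDF code.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class PagePayload:
    """Hypothetical shape of what is sent to the model for one page."""
    page_number: int
    text: Optional[str]  # None when the page has no selectable text layer
    image: bytes         # rasterized render of the whole page


def extract_text_layer(page: dict) -> Optional[str]:
    # Stand-in for a real PDF text extractor; scanned pages yield None.
    text = page.get("text", "").strip()
    return text or None


def render_page_image(page: dict) -> bytes:
    # Stand-in for a real rasterizer; returns placeholder bytes.
    return f"<render of page {page['number']}>".encode()


def extract_page(page: dict) -> PagePayload:
    # Run both extractions concurrently, then combine the results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(extract_text_layer, page)
        image_future = pool.submit(render_page_image, page)
        return PagePayload(page["number"], text_future.result(), image_future.result())


scanned = extract_page({"number": 3})  # no text layer, image only
digital = extract_page({"number": 4, "text": "Q1. Define resistance."})
```

A scanned page still produces a payload; its `text` is simply `None`, which is why the image is always included.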
Stage 2: Structural mapping
Before classification, the engine analyses the visual layout of each page. It identifies:
- Headings & section breaks — Used to understand which pages logically belong together
- Question numbering — Prevents splitting a multi-page question across different output files
- Page noise — Running headers, footers, and page numbers that should never influence classification
- Answer indicators — Patterns like "Answer:" or "[Turn over]" that reveal a page is not a question
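The signals listed above could be approximated on plain text as follows. This is a minimal sketch under stated assumptions, not the engine's actual layout analysis (which also works on the page image): the regular expressions, the 60% repetition threshold, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Assumed patterns; the real engine may use different ones.
ANSWER_PATTERN = re.compile(r"^\s*(Answer:|\[Turn over\])", re.IGNORECASE | re.MULTILINE)
QUESTION_NUMBER = re.compile(r"^\s*(\d+)[.)]\s", re.MULTILINE)
BARE_PAGE_NUMBER = re.compile(r"(Page\s+)?\d+")


def find_page_noise(pages: list[str], min_share: float = 0.6) -> set[str]:
    """Treat lines repeated on most pages as running headers or footers."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_share * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_noise(page: str, noise: set[str]) -> str:
    """Drop running headers/footers and bare page numbers."""
    kept = []
    for line in page.splitlines():
        stripped = line.strip()
        if stripped in noise or BARE_PAGE_NUMBER.fullmatch(stripped):
            continue
        kept.append(line)
    return "\n".join(kept)


def map_structure(page: str, noise: set[str]) -> dict:
    body = strip_noise(page, noise)
    return {
        "question_numbers": [int(n) for n in QUESTION_NUMBER.findall(body)],
        "has_answer_indicator": bool(ANSWER_PATTERN.search(body)),
    }


pages = [
    "Physics Paper 1\n3. Calculate the current in the circuit.\n7",
    "Physics Paper 1\n4. State Ohm's law.\n[Turn over]\n8",
    "Physics Paper 1\nAnswer: 2 A\n9",
]
noise = find_page_noise(pages)
structure = [map_structure(p, noise) for p in pages]
```

Noise is removed before the other signals are computed, so a repeated header or a page number cannot be mistaken for a question number.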
Stage 3: Semantic classification
This is the core AI step. For each page (or logical page group), the LLM receives the inputs prepared by the earlier stages, including the extracted text and the page image.
The model returns a structured verdict for each page: keep: true/false, an assigned category (for bucket jobs), a confidence-weighted reason, and a suggested order for the output.
Because the model receives both text and image, it handles mixed content correctly: a page containing a circuit diagram with a question printed below it will be classified by the question's subject, not just the presence of a diagram.
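To make the shape of the verdict concrete, here is one way it might be represented and validated on the receiving side. The JSON layout and the field names `page`, `category`, `reason`, and `order` are assumptions for illustration; only `keep: true/false` is named in the description above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageVerdict:
    page: int
    keep: bool
    category: Optional[str]  # only meaningful for bucket jobs
    reason: str
    order: int               # suggested position in the output


def parse_verdicts(raw: str) -> list[PageVerdict]:
    """Parse a JSON array of verdicts, rejecting a non-boolean 'keep'."""
    verdicts = []
    for item in json.loads(raw):
        if not isinstance(item.get("keep"), bool):
            raise ValueError(f"page {item.get('page')}: 'keep' must be true or false")
        verdicts.append(PageVerdict(
            page=int(item["page"]),
            keep=item["keep"],
            category=item.get("category"),
            reason=str(item.get("reason", "")),
            order=int(item.get("order", item["page"])),
        ))
    return verdicts


raw = """[
  {"page": 1, "keep": true, "category": "electricity",
   "reason": "Circuit question with diagram", "order": 1},
  {"page": 2, "keep": false, "reason": "Answer page"}
]"""
verdicts = parse_verdicts(raw)
```

Validating `keep` strictly is a deliberate choice in this sketch: a string such as `"false"` would otherwise be truthy in Python and silently keep a page.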
Stage 4: Output generation
With a list of approved pages and their target buckets, the engine assembles the final output:
- The original PDF pages are sliced directly from the source file (no re-rendering) and stitched into a new PDF in the classified order. Because nothing is re-rendered, each page is preserved exactly as it appears in the source.
- The AI re-generates each approved page as clean Markdown, formatting math as LaTeX with the standard $...$ (inline) and $$...$$ (display) delimiters for KaTeX rendering.
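Before any pages are copied, the approved pages have to be grouped by bucket and put in their suggested order. The sketch below shows that planning step only; the verdict fields and the default bucket name `"output"` are assumptions, and the actual page copying would be done by a PDF library that is not shown here.

```python
from collections import defaultdict


def plan_outputs(verdicts: list[dict]) -> dict[str, list[int]]:
    """Map each bucket to the source page numbers it should contain, in order."""
    buckets = defaultdict(list)
    for verdict in verdicts:
        if verdict["keep"]:
            # Jobs without buckets fall back to a single assumed output name.
            buckets[verdict.get("category") or "output"].append(verdict)
    return {
        name: [v["page"] for v in sorted(items, key=lambda v: v["order"])]
        for name, items in buckets.items()
    }


verdicts = [
    {"page": 1, "keep": True, "category": "electricity", "order": 2},
    {"page": 2, "keep": False, "category": None, "order": 0},
    {"page": 3, "keep": True, "category": "mechanics", "order": 1},
    {"page": 4, "keep": True, "category": "electricity", "order": 1},
]
plan = plan_outputs(verdicts)
```

Here `plan` maps `"electricity"` to pages 4 then 1 and `"mechanics"` to page 3; the rejected page 2 appears nowhere.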