Under the hood

The Processing Engine

A technical walkthrough of the four-stage pipeline that runs every time you submit a job. Understanding this helps you write better prompts and debug unexpected results.


The four-stage pipeline

1. Text extraction
2. Structural mapping
3. Semantic classification
4. Output generation

Stage 1: Text & image extraction

When you upload a PDF, the server renders every page at a high resolution. Two things are extracted in parallel: the selectable text layer (if present), and a rasterized image of the entire page.

Both are sent to the AI. The image allows the model to understand diagrams, hand-written content, scanned documents, and visual structure that plain text can't capture.

Scanned PDFs (where there is no text layer, only a picture) are handled via the image path. There is no OCR step — the AI reads the image directly, which means it can handle handwriting and diagrams that OCR would mangle.
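The extraction step above can be sketched as a small data structure. This is a minimal illustration, not the actual implementation: `PageInput` and `build_page_input` are hypothetical names, and the key behaviour shown is that an empty text layer simply routes the page down the image-only path, with no OCR fallback.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageInput:
    """Everything Stage 1 extracts for a single page."""
    page_number: int
    text: Optional[str]   # selectable text layer; None for scanned pages
    image_png: bytes      # rasterized render of the full page

def build_page_input(page_number: int, text_layer: Optional[str],
                     image_png: bytes) -> PageInput:
    # Treat an empty or whitespace-only text layer as "scanned":
    # the model then relies on the image alone, with no OCR step.
    text = text_layer if text_layer and text_layer.strip() else None
    return PageInput(page_number=page_number, text=text, image_png=image_png)
```

Both fields travel to the model together, so a page is never "text or image", it is always "text (maybe) plus image".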

Stage 2: Structural mapping

Before classification, the engine analyses the visual layout of each page. It identifies:

  • Headings & section breaks: used to understand which pages logically belong together
  • Question numbering: prevents splitting a multi-page question across different output files
  • Page noise: running headers, footers, and page numbers that should never influence classification
  • Answer indicators: patterns like "Answer:" or "[Turn over]" that reveal a page is not a question

If a question starts on page 12 and concludes on page 14, the engine groups those pages as a single logical unit. They are always kept or discarded together — you'll never get half a question.
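The grouping rule can be sketched as follows. Assume (hypothetically) that Stage 2 has already tagged each page with the question identifier it starts, or `None` when the page only continues the previous question; consecutive pages then fold into one logical unit:

```python
from typing import List, Optional

def group_logical_units(question_ids: List[Optional[str]]) -> List[List[int]]:
    """Group consecutive page indices into logical units.

    question_ids[i] is the question detected at the top of page i, or
    None when the page carries no new question marker (i.e. it continues
    the previous question). Each returned sublist is kept or discarded
    as a whole.
    """
    units: List[List[int]] = []
    current: List[int] = []
    last_id: Optional[str] = None
    for page, qid in enumerate(question_ids):
        # A new question id closes the previous unit.
        if qid is not None and qid != last_id and current:
            units.append(current)
            current = []
        current.append(page)
        if qid is not None:
            last_id = qid
    if current:
        units.append(current)
    return units
```

So a question that opens on one page and runs across the next two comes back as a single three-page unit, which is why you never receive half a question.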

Stage 3: Semantic classification

This is the core AI step. For each page (or logical page group), the LLM receives:

  • The extracted text of the page
  • A rasterized image of the page
  • Your exact prompt
  • The structural context from Stage 2

The model returns a structured verdict for each page: keep: true/false, an assigned category (for bucket jobs), a confidence-weighted reason, and a suggested order for the output.
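The shape of that verdict can be modelled like this. Field names here are illustrative, chosen to mirror the description above rather than the engine's real schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageVerdict:
    """One structured verdict per page (or logical page group)."""
    keep: bool                # include this page in the output?
    category: Optional[str]   # assigned bucket; None for non-bucket jobs
    reason: str               # confidence-weighted explanation
    suggested_order: int      # position in the assembled output

verdict = PageVerdict(
    keep=True,
    category="Mechanics",
    reason="Projectile-motion question; high confidence from both text and diagram.",
    suggested_order=3,
)
```

Because the verdict is structured rather than free text, Stage 4 can act on it mechanically: filter on `keep`, route on `category`, sort on `suggested_order`.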

Because the model receives both text and image, it handles mixed content correctly: a page containing a circuit diagram with a question printed below it will be classified by the question's subject, not just the presence of a diagram.

Batching for large documents. The engine processes pages in batches of approximately 400,000 tokens (roughly 1.6 million characters of text). Large documents are processed in multiple parallel batches and the results are merged. This is why you see "Analyzing content (Batch 1/2)" in the progress indicator.
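A greedy packing loop is one plausible way to realise the batching described above; the sketch below assumes the stated ~4 characters per token and packs whole pages until the budget is exceeded (the real engine's policy may differ):

```python
from typing import List

TOKEN_BUDGET = 400_000   # approximate tokens per batch
CHARS_PER_TOKEN = 4      # so roughly 1.6 million characters per batch

def split_into_batches(page_texts: List[str],
                       budget_chars: int = TOKEN_BUDGET * CHARS_PER_TOKEN
                       ) -> List[List[str]]:
    """Greedily pack whole pages into batches under a character budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in page_texts:
        # Start a new batch when adding this page would blow the budget.
        if current and used + len(text) > budget_chars:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += len(text)
    if current:
        batches.append(current)
    return batches
```

With two batches, the progress indicator would read "Analyzing content (Batch 1/2)" while the first is in flight; since batches are independent, they can be analysed in parallel and the verdicts merged afterwards.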

Stage 4: Output generation

With a list of approved pages and their target buckets, the engine assembles the final output:

PDF mode

The original raw PDF pages are sliced directly from the source file (no re-rendering) and stitched into a new PDF in the classified order. Every pixel of the original is preserved.
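The assembly step reduces to copying page references in the classified order. In this sketch the pages are stand-in objects; with a real PDF library (for example pypdf's `PdfWriter.add_page`) the same loop copies the raw page objects from the source without re-rendering them:

```python
from typing import List, Sequence

def assemble_pdf(source_pages: Sequence, keep_order: List[int]) -> list:
    """Slice approved pages out of the source, in the classified order.

    source_pages stands in for the raw page objects of the original PDF.
    keep_order lists the source page indices Stage 3 approved, already
    sorted by the model's suggested output order.
    """
    return [source_pages[i] for i in keep_order]
```

Because pages are copied by reference to the original objects, nothing is rasterized or recompressed, which is what "every pixel of the original is preserved" means in practice.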

Markdown mode

The AI re-generates each approved page as clean Markdown, correctly formatting LaTeX math equations using standard $...$ and $$...$$ syntax for KaTeX rendering.
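As a hypothetical illustration (not output from the real engine), a regenerated exam page might look like this, with inline math in `$...$` and display math in `$$...$$` so KaTeX can render both:

```markdown
## Question 4

A particle moves with velocity $v(t) = 3t^2 - 2t$. Find the displacement
between $t = 0$ and $t = 2$:

$$
s = \int_0^2 (3t^2 - 2t)\,dt = \left[t^3 - t^2\right]_0^2 = 4
$$
```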

Multi-file output. When generating bucket jobs, the engine runs the stitching step once per bucket. The browser then receives multiple file downloads simultaneously.
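Per-bucket stitching is just the single-file assembly run once per bucket. In this sketch, a hypothetical `buckets` mapping takes each bucket name to its ordered list of approved source page indices, and the result is one output "file" per bucket:

```python
from typing import Dict, List, Sequence

def stitch_buckets(source_pages: Sequence,
                   buckets: Dict[str, List[int]]) -> Dict[str, list]:
    """Run the assembly step once per bucket.

    Returns one output per bucket name; each value stands in for the
    stitched PDF that the browser receives as a separate download.
    """
    return {name: [source_pages[i] for i in order]
            for name, order in buckets.items()}
```

Each bucket's output is independent of the others, which is why the downloads can be delivered simultaneously.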