Quick Take: I built a local OCR pipeline to convert technical documents into markdown that LLMs can actually reason from. The two biggest breakthroughs were cross-AI judging to break prompt-optimization deadlocks, and blank-region detection that increased figure recovery from 4 to 73 on a 72-page document.
Recently, I started testing models and coding harnesses by giving them a technical write-up and asking them to build a learning site from it. The prompt I use:
```
Create a 3d graphics tutorial based on https://swmansion.com/blog/breaking-down-the-jelly-slider-9ab9239f6d80
# Audience
* experienced software engineers
* no background in 3d graphics
* visual learner
* learn by experimenting
# Requirements
* each lesson should build on the previous
* the key variables must be configurable to aid in `learn by experimenting`
* the core part of the code should be editable in the browser
* use a dark but not black theme
* canvas must support zoom and drag and move
```
I started using Breaking Down the Jelly Slider when it came out in March 2026. It’s a deep-dive by Konrad Reczko on building a physics simulation in TypeGPU, with interactive WGSL shaders and 3D renders throughout. As someone with limited frontend experience and no 3D graphics experience, the article was intriguing and felt like the perfect example to demonstrate my belief that AI can revolutionize educational material.
Typically I could get a reasonable lesson plan that did a decent job of teaching the core concepts, but nothing was able to actually create a full Jelly Slider in the end. There were some close approximations and even jelly blobs, but it really felt like the AI was only reading the text and couldn’t “see” the visuals. That gap between “conceptually right, visually wrong” is what motivated me to build a local OCR pipeline that could actually translate figures into something a text-only model could reason from.
## First Attempt: Autoresearch
While Gemini Pro and Claude Opus do a good job of translating images, charts, and graphs, relying on them for every page gets expensive. Inspired by Karpathy’s autoresearch, my first idea was to apply a self-improving loop to the OCR problem. Instead of tweaking prompts by hand, an LLM would analyze the scoring history and propose the next experiments.
I used Gemini to build the initial harness. datalab-to/chandra-ocr-2, a document-specialized OCR model, ran as a CLI tool with Gemma 4 serving as experimenter and judge via LM Studio. Outputs were scored against ground-truth markdown I had generated with the Gemini web app. That got a few rounds going. When I tried to take it further by loading both models directly via Hugging Face transformers for tighter control, the two together exceeded available VRAM and couldn’t be loaded at the same time. Around the same time, Qwen 3.6 was released.
## Course Correction: Simplifying the Setup
When Qwen 3.6 was released, I tried Qwen3.6-Plus on a test page through the Qwen website. The markdown it produced was better than the Gemini ground truth I had been scoring against. I can’t fully explain it, but seeing that output made it click: one capable local model could handle both the OCR and the judging. The VRAM problem disappeared along with the architecture that caused it.
The coordination overhead of wiring experiments back into the script was slowing things down anyway, and my main goal was a working pipeline. I dropped the auto part of auto-research and switched to a manual loop with cloud LLMs doing the evaluation.
The local models I focused on were:
- Qwen3.6-27B Q6_K
- Qwen3.6-35b-a3b Q4_K_M
- Gemma-4-31b Q4_K_M
## The Grading Loop
The most useful thing that came out of this phase was watching Claude change its mind.
Claude’s early prompts were elaborate: XML schemas with sections for `<symbol_table>`, `<algorithm>`, and `<edge_case>`, two quality checklists, 260 lines total. There was real thought behind it; the figure checklist alone asked “Could a downstream LLM reconstruct the figure’s claim from my description alone?” But the output, while correctly structured, was exhausting to read and expensive on local context budgets. I could see it wasn’t working as well as simpler approaches, but Claude kept generating variations on the same structure.
The cross-grading step broke the loop. The workflow was: run each prompt over each local model with the sample inputs, upload the outputs to both Gemini and Claude to grade, then share each model’s grade with the other before asking for an updated prompt. When Claude was shown Gemini’s evaluation alongside a leaner competing prompt, it shifted — the XML blocks disappeared, the format collapsed to flat markdown, and the prompt shrank from ~830 tokens to ~400. Using both models as judges, with each one seeing the other’s feedback, surfaced something neither would have arrived at independently.
The benchmark I ran before starting the loop shaped what the graders were looking for. Running the same pages through ChatGPT, Claude, and Gemini APIs had shown a consistent pattern: Gemini produced the most faithful output, Claude the most inventive. Claude would add pseudocode, edge cases, and interpretations that weren’t on the page — trying to be thorough in a way that actively undermined faithful transcription. That tendency became the main signal the graders were checking for in each iteration.
The prompt that came out of this phase (full text) covers faithful text transcription, a structured `[FIGURE: ...]` / `[/FIGURE]` block format for every visual element, rules for spatial layout in multi-panel figures, special handling of 3D scenes and annotations, and a list of things the model must not do (summarize, hedge, fabricate URLs, duplicate captions). The ~400-token result was about half the size of what Claude had been generating.
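As an invented example (the figure and its details here are made up, and the exact field layout of the real blocks may differ), one such block might look like:

```
[FIGURE: Two-panel diagram. Left panel: a satellite above a curved Earth surface with a
shaded cone showing its signal footprint. Right panel: the same scene viewed from above,
with the footprint drawn as a circle. Caption under the figure: "A satellite's coverage
area depends on its altitude."]
[/FIGURE]
```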
## Diversifying the Inputs
This workflow worked well to get the first version of the prompt, but I had only been evaluating against one input set. I asked Gemini to recommend more. There’s something a little circular about asking your judge to curate the test set, but it gave me three good ones:
- Bartosz Ciechanowski’s GPS deep-dive
- Figma’s multiplayer technology write-up
- A storage engineering post from the TigerBeetle team
The last one is mostly dense prose with minimal figures — a different kind of challenge from the other two. For each, I pulled 3-5 pages as PDFs.
The grading loop then ran sequentially through all four documents: update the prompt based on Jelly Slider output, test on GPS, update again, test on Figma, repeat until quality was high across the board.
## What Single Pages Don’t Tell You
Running the grading loop on individual pages had concealed a whole class of problems. The moment the first script ran against a full document, they surfaced immediately.
The first full run on the Jelly Slider PDF (all 36 pages) produced clean output on each individual page and a broken document overall. Three failures stood out.
Running headers. The article title appeared at the top of every page in the PDF. The model faithfully transcribed it as an H1 heading, producing seven copies of `# Breaking Down the Jelly Slider` in the output. From the model’s perspective, this was correct. From the document’s perspective, it was noise. The fix: count every heading line across all pages with a `Counter` and strip any heading that appears three or more times, keeping only the first occurrence.
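A minimal sketch of that post-processing step (the helper name is mine and details of the real script differ, but this is the shape of it):

```python
from collections import Counter

def strip_running_headers(pages: list[str], min_repeats: int = 3) -> list[str]:
    """Drop heading lines that repeat across pages, keeping only the first occurrence."""
    headings = Counter(
        line.strip()
        for page in pages
        for line in page.splitlines()
        if line.lstrip().startswith("#")
    )
    repeated = {h for h, n in headings.items() if n >= min_repeats}

    seen: set[str] = set()
    cleaned = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            key = line.strip()
            if key in repeated:
                if key in seen:
                    continue  # drop later copies of a running header
                seen.add(key)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned
```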
Unbalanced figure tags. A full run produced 39 `[FIGURE:` opens and only 20 `[/FIGURE]` closes. The model followed the format most of the time but dropped the closing tag on roughly half the figures, usually when a description ran long or the figure spanned a page boundary. The fix: scan for unclosed openers and insert `[/FIGURE]` at the next paragraph boundary.
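Roughly, as another post-processing pass (again a sketch rather than the exact script):

```python
def close_unbalanced_figures(markdown: str) -> str:
    """Insert a closing [/FIGURE] at the next paragraph break after any unclosed opener."""
    out, open_figure = [], False
    for line in markdown.splitlines():
        stripped = line.strip()
        # An unclosed figure gets closed at the next blank line or the next opener.
        if open_figure and (stripped == "" or stripped.startswith("[FIGURE:")):
            out.append("[/FIGURE]")
            open_figure = False
        if stripped.startswith("[FIGURE:"):
            open_figure = True
        elif "[/FIGURE]" in stripped:
            open_figure = False
        out.append(line)
    if open_figure:  # an opener ran to the end of the document
        out.append("[/FIGURE]")
    return "\n".join(out)
```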
Cross-page context drift. Code blocks that continued across page boundaries got re-tagged on the second page. TypeScript became `glsl` or `cpp` because the model had no memory of what language it was continuing. The fix: pass the last 800 characters of the previous page’s output as context with each request.
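The context hand-off is only a few lines; something like the following, where the preamble wording is illustrative:

```python
CONTEXT_CHARS = 800  # how much of the previous page's output to carry forward

def page_prompt(base_prompt: str, previous_output: str) -> str:
    """Prepend the tail of the previous page so continued code blocks keep their language."""
    if not previous_output:
        return base_prompt
    tail = previous_output[-CONTEXT_CHARS:]
    return (
        f"{base_prompt}\n\n"
        "Context: the document continues from the previous page, which ended with:\n"
        f"{tail}"
    )
```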
These weren’t model errors. The model was correctly transcribing what was visible on each page. The pipeline needed to understand what “page chrome” meant — the repeated headers and footers that frame each page — and strip it in post.
The GPS document added a fourth problem that dwarfed the others. Bartosz Ciechanowski’s GPS article is built around interactive JavaScript visualizations. When printed to PDF, the figures freeze as static snapshots of their initial state — but the canvas areas reserved for interactive controls (sliders, handles) render as large blank white voids. The model transcribed the frozen figures but had no way to know anything interactive had been there. Running the initial pipeline across 72 pages produced four figure blocks. The article has dozens of diagrams.
The fix was to detect blank regions before sending each page to the model. PyMuPDF’s layout analysis identifies any vertical band at least 120 points tall with no text content, and a hint is injected into the per-page prompt:
```
Note: this page appears to have {N} large blank region(s) that may contain
interactive figures not visible in the static PDF. Please include a [FIGURE:]
block for each, describing what the surrounding text suggests they demonstrate.
```
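The detection itself is a short pass over PyMuPDF’s text blocks. A minimal sketch, assuming the real script differs in details (the helper names are mine; the 120-point threshold is the one described above):

```python
import fitz  # PyMuPDF

MIN_GAP_PTS = 120  # minimum height of a blank vertical band worth flagging

def count_blank_regions(page: fitz.Page) -> int:
    """Count vertical bands at least MIN_GAP_PTS tall with no text content."""
    # Each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 means text.
    text_blocks = [b for b in page.get_text("blocks") if b[6] == 0]
    spans = sorted((b[1], b[3]) for b in text_blocks)  # (top, bottom) of each text block

    gaps = 0
    cursor = page.rect.y0
    for top, bottom in spans:
        if top - cursor >= MIN_GAP_PTS:
            gaps += 1
        cursor = max(cursor, bottom)
    if page.rect.y1 - cursor >= MIN_GAP_PTS:  # trailing blank band at the page bottom
        gaps += 1
    return gaps

def blank_region_hint(page: fitz.Page) -> str:
    """Build the per-page hint, or an empty string if nothing was detected."""
    n = count_blank_regions(page)
    if n == 0:
        return ""
    return (
        f"Note: this page appears to have {n} large blank region(s) that may contain "
        "interactive figures not visible in the static PDF. Please include a [FIGURE:] "
        "block for each, describing what the surrounding text suggests they demonstrate."
    )
```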
That single hint changed figure recovery from 4 to 73 on the GPS document. It was the largest single quality improvement in the project.
As the scripts grew through these iterations, Gemini hit a limit on how much code it could generate in one response. From that point, I continued with Claude generating the scripts, using Gemini only to judge pipeline output. By this point, I had also settled on Qwen3.6-27B. It gave better results than the 35B model at 4-bit quantization, for reasons that came down to weight precision: a smaller model at higher bit-depth follows format rules more reliably than a larger model with quantization error eroding constraint-following.
## Lessons Learned
Detect what the model can’t see. The single largest quality improvement in the entire project wasn’t a better prompt or a bigger model — it was telling the model that interactive controls existed. When the GPS article was printed to PDF, the visualizations froze as static images but the canvas areas for sliders and handles rendered as large blank voids. Without blank-region detection, the GPS pipeline produced four figure blocks across 72 pages. With it: 73. The model wasn’t failing; it was correctly transcribing what it could see. The pipeline has to know what’s missing and say so explicitly.
Cross-AI judging breaks single-model deadlocks. Claude kept generating XML-schema variations (`<symbol_table>`, `<algorithm>`, `<edge_case>`) until it was shown Gemini’s evaluation alongside a leaner competing prompt. It then abandoned the entire structure in one iteration. Neither model would have arrived at the final ~400-token flat markdown format independently. Showing each model the other’s feedback was what broke the loop, and it’s the part of the workflow I’d carry forward to any prompt optimization problem.
For format-constrained tasks, precision beats capacity. Qwen3.6-27B at Q6_K outperformed Qwen3.6-35B at Q4_K_M on strict formatting rules. The smaller model followed the `javascript` language-tag override; the larger one reverted to its training baseline and tagged code fences `typescript`. Quantization erodes precision in exactly the places that matter for constraint-following. When the task requires the model to override its defaults, bit-depth matters more than parameter count.
## What’s Next
The whole motivation for building this was the gap between what models could produce from text alone versus what they’d produce if they could actually read the figures. The pipeline exists now. The next step is feeding these outputs back into the tutorial harness and finding out whether that gap was the problem all along.