poster2json¶
Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.
Overview¶
poster2json extracts structured metadata from scientific conference posters into machine-actionable JSON conforming to the poster-json-schema.
Features¶
- PDF Processing: Layout-aware text extraction via pdfalto
- Image Processing: Vision-based OCR via Qwen2-VL-7B
- JSON Structuring: Fine-tuned Llama 3.1 8B for poster-specific metadata
- Schema Validation: Built-in validation against poster-json-schema
- CLI & Python API: Flexible usage options
Quick Start¶
Installation¶
pip install poster2json
CLI Usage¶
# Extract metadata from a poster
poster2json extract poster.pdf -o result.json
# Validate extracted JSON
poster2json validate result.json
Python API¶
from poster2json import extract_poster, validate_poster
# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])
# Validate
is_valid = validate_poster(result)
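To process several posters in one run, the two calls above can be wrapped in a small loop. The following is a minimal sketch, not part of poster2json: `extract_many` is a hypothetical helper, and it assumes that `extract_poster` returns a JSON-serializable dict and that `validate_poster` returns a truthy value for valid output. The extract and validate functions are passed in as arguments, so the helper can be tried with stand-ins on a machine without a GPU.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_many(
    poster_paths: Iterable[str],
    extract: Callable[[str], dict],
    validate: Callable[[dict], bool],
    out_dir: str,
) -> Dict[str, bool]:
    """Hypothetical helper: extract and validate each poster in turn.

    Writes one <stem>.json file per input into out_dir and returns a
    mapping of input path -> validation outcome.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    report: Dict[str, bool] = {}
    for path in poster_paths:
        result = extract(path)  # assumed to return a JSON-serializable dict
        report[path] = bool(validate(result))
        target = out / (Path(path).stem + ".json")
        target.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return report
```

With poster2json installed, a call might look like `extract_many(["a.pdf", "b.png"], extract_poster, validate_poster, "out")`. Inputs that share a file stem would overwrite each other's output, so distinct names are assumed.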
Architecture¶
The pipeline processes posters in two stages:
- Raw Text Extraction
  - PDF files → pdfalto (layout-aware XML)
  - Image files → Qwen2-VL-7B (vision OCR)
- JSON Structuring
  - Raw text → Llama 3.1 8B → Structured JSON
See Architecture for technical details.
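The two-stage flow can be illustrated with a short dispatch sketch. This is not the package's actual implementation: `choose_extractor`, `run_pipeline`, the backend keys, and the set of image suffixes are all hypothetical names chosen for illustration, and the real backends (pdfalto, Qwen2-VL-7B, Llama 3.1 8B) are represented only by callables that the caller supplies.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumed image suffixes; the formats poster2json accepts may differ.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def choose_extractor(path: str) -> str:
    """Stage 1 routing: pick a raw-text backend from the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdfalto"      # layout-aware XML extraction
    if suffix in IMAGE_SUFFIXES:
        return "vision_ocr"   # vision-model OCR
    raise ValueError(f"unsupported poster format: {suffix!r}")


def run_pipeline(
    path: str,
    extractors: Dict[str, Callable[[str], str]],
    structure: Callable[[str], dict],
) -> dict:
    """Run stage 1 (raw text extraction), then stage 2 (JSON structuring)."""
    raw_text = extractors[choose_extractor(path)](path)
    return structure(raw_text)
```

Keeping the routing separate from the backends means either stage can be replaced or tested on its own.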
Performance¶
Validated on 10 manually annotated scientific posters, with a 100% pass rate.
See Evaluation for detailed metrics.
Requirements¶
- NVIDIA GPU with ≥16GB VRAM
- Python 3.10+
- pdfalto (for PDF processing)
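A quick preflight check can catch missing prerequisites before any model is loaded. The sketch below uses only the standard library and rests on two assumptions: that the pdfalto executable is named `pdfalto` and is on `PATH`, and that the presence of `nvidia-smi` is a reasonable sign of an NVIDIA driver. It does not measure VRAM, and `check_environment` is a hypothetical helper, not part of poster2json.

```python
import shutil
import sys
from typing import List, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 10)) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Assumption: the binary is called "pdfalto" and is discoverable on PATH.
    if shutil.which("pdfalto") is None:
        problems.append("pdfalto not found on PATH (needed for PDF processing)")
    # Presence of nvidia-smi only hints at a GPU; VRAM is not checked here.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found; cannot confirm an NVIDIA GPU")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```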