poster2json¶

Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.

Overview¶

poster2json extracts structured metadata from scientific conference posters into machine-actionable JSON conforming to the poster-json-schema.

Features¶

PDF Processing: Layout-aware text extraction via pdfalto
Image Processing: Vision-based OCR via Qwen2-VL-7B
JSON Structuring: Fine-tuned Llama 3.1 8B for poster-specific metadata
Schema Validation: Built-in validation against poster-json-schema
CLI & Python API: Flexible usage options

Quick Start¶

Installation¶

pip install poster2json

CLI Usage¶

# Extract metadata from a poster
poster2json extract poster.pdf -o result.json

# Validate extracted JSON
poster2json validate result.json
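
To process a whole folder of posters, the documented `extract` command can be driven from a short script. The sketch below is only an illustration. The directory names and the helper functions `build_extract_command` and `extract_all` are hypothetical; the only part taken from the CLI itself is the `poster2json extract <input> -o <output>` form shown above.

```python
import subprocess
from pathlib import Path


def build_extract_command(poster: Path, out_dir: Path) -> list[str]:
    """Build the documented `poster2json extract <input> -o <output>` call."""
    # Hypothetical naming scheme: one JSON file per poster, named after it.
    output = out_dir / f"{poster.stem}.json"
    return ["poster2json", "extract", str(poster), "-o", str(output)]


def extract_all(poster_dir: Path, out_dir: Path) -> None:
    """Run the CLI once for every PDF found in poster_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for poster in sorted(poster_dir.glob("*.pdf")):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_extract_command(poster, out_dir), check=True)
```

Calling `extract_all(Path("posters"), Path("json"))` would then write one JSON file per PDF, assuming `poster2json` is on the `PATH`.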

Python API¶

from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate
is_valid = validate_poster(result)

Architecture¶

The pipeline processes posters in two stages:

1. Raw Text Extraction
   PDF files → pdfalto (layout-aware XML)
   Image files → Qwen2-VL-7B (vision OCR)
2. JSON Structuring
   Raw text → Llama 3.1 8B → Structured JSON

See Architecture for technical details.
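
The first-stage routing described above can be pictured as a dispatch on file type. This is a minimal sketch and not the package's actual implementation. The function name `stage_one_backend`, the returned labels, and the set of image extensions are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: which image formats are accepted is not stated in this document.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def stage_one_backend(path: str) -> str:
    """Pick the raw-text extractor for a poster file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdfalto"  # layout-aware XML extraction
    if suffix in IMAGE_SUFFIXES:
        return "qwen2-vl-7b"  # vision-based OCR
    raise ValueError(f"Unsupported poster format: {suffix or path!r}")
```

Whichever extractor is chosen, its raw text would then be passed to the second stage, where the Llama 3.1 8B model produces the structured JSON.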

Performance¶

Validated on 10 manually annotated scientific posters, with a 100% pass rate.

See Evaluation for detailed metrics.

Requirements¶

NVIDIA GPU with ≥16GB VRAM
Python 3.10+
pdfalto (for PDF processing)
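
These requirements can be checked before installation with a short script. The sketch below is not shipped with poster2json, and its function names are hypothetical. It assumes the `pdfalto` executable is on the `PATH` under that name, that `nvidia-smi` is available to report GPU memory, and that "16GB" is read as 16 × 1024 MiB.

```python
import shutil
import subprocess
import sys

MIN_PYTHON = (3, 10)
MIN_VRAM_MIB = 16 * 1024  # assumption: "16GB" interpreted as 16384 MiB


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one memory value (MiB) per GPU, skipping non-numeric lines."""
    values = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def check_environment() -> dict[str, bool]:
    """Report which of the three listed requirements appear to be met."""
    checks = {
        "python": sys.version_info >= MIN_PYTHON,
        "pdfalto": shutil.which("pdfalto") is not None,
        "gpu": False,
    }
    if shutil.which("nvidia-smi"):
        proc = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=False,
        )
        if proc.returncode == 0:
            vram = parse_vram_mib(proc.stdout)
            checks["gpu"] = any(mib >= MIN_VRAM_MIB for mib in vram)
    return checks
```

Some cards marketed as 16GB report slightly less than 16384 MiB, so the threshold may need to be loosened for a given machine.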