Architecture¶
Technical architecture and methodology for poster2json.
Pipeline Overview¶
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Input Poster   │────▶│    Raw Text     │────▶│   Structured    │
│   (PDF/Image)   │     │   Extraction    │     │   JSON Output   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                                               │
    ┌────┴──────────┐                             [Transformers]
    │               │                              Llama 3.1 8B
[PDF Files]   [Image Files]                        Section-aware
    │               │                             JSON Generation
[pdfalto]     [Qwen2-VL-7B]
XML Layout     Vision OCR
Models¶
Llama 3.1 8B Poster Extraction¶
Model: jimnoneill/Llama-3.1-8B-Poster-Extraction
Fine-tuned version of Meta's Llama 3.1 8B Instruct for scientific poster metadata extraction:
- 8B parameters
- 128K context window
- Optimized for structured JSON output
- Strong section identification
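The model can be loaded with the standard transformers API; the snippet below is a minimal sketch (the dtype and device placement are assumptions, not the pipeline's exact settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JSON_MODEL_ID = "jimnoneill/Llama-3.1-8B-Poster-Extraction"

# Load tokenizer and weights; device_map="auto" places layers on the available GPU
tokenizer = AutoTokenizer.from_pretrained(JSON_MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    JSON_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```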
Qwen2-VL-7B-Instruct¶
Model: Qwen/Qwen2-VL-7B-Instruct
Vision-language model for image-based poster OCR:
- 7B parameters
- Direct pixel-to-text extraction
- Multi-language support
- Layout-aware text recognition
Stage 1: Raw Text Extraction¶
PDF Processing (pdfalto)¶
For PDF files, the pipeline uses pdfalto to:
- Convert PDF to ALTO XML format
- Preserve layout structure and spatial coordinates
- Extract text blocks maintaining reading order
- Handle multi-column layouts
# Simplified extraction flow
pdf_path → pdfalto → ALTO XML → parse_text_blocks() → raw_text
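A minimal sketch of this flow, assuming pdfalto is invoked as `pdfalto <input.pdf> <output.xml>` and using a simplified `parse_text_blocks()` that walks ALTO `TextBlock`/`String` elements (the real parser does more, e.g. reading-order and multi-column handling):

```python
import subprocess
import xml.etree.ElementTree as ET

# ALTO namespace used in pdfalto output; the version suffix may differ per build
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

def parse_text_blocks(alto_path: str) -> str:
    """Join the words of each ALTO TextBlock, one paragraph per block."""
    root = ET.parse(alto_path).getroot()
    blocks = []
    for block in root.findall(".//alto:TextBlock", ALTO_NS):
        words = [s.get("CONTENT", "") for s in block.findall(".//alto:String", ALTO_NS)]
        blocks.append(" ".join(words))
    return "\n\n".join(blocks)

def pdf_to_raw_text(pdf_path: str, pdfalto_path: str = "pdfalto") -> str:
    alto_path = pdf_path.rsplit(".", 1)[0] + ".xml"
    # pdfalto converts the PDF to ALTO XML, preserving layout coordinates
    subprocess.run([pdfalto_path, pdf_path, alto_path], check=True)
    return parse_text_blocks(alto_path)
```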
Image Processing (Qwen2-VL)¶
For image files (JPG, PNG), the pipeline uses Qwen2-VL:
- Load image directly into vision-language model
- Generate text transcription via multimodal inference
- Preserve section headers and content structure
# Simplified extraction flow
image_path → load_image() → Qwen2-VL → raw_text
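A minimal sketch of the image path using the standard transformers interface for Qwen2-VL; the instruction text and generation budget are illustrative, not the pipeline's exact prompt:

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

VISION_MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(VISION_MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    VISION_MODEL_ID, torch_dtype="auto", device_map="auto"
)

def image_to_raw_text(image_path: str) -> str:
    """Transcribe a poster image into raw text with Qwen2-VL."""
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Transcribe all text on this poster, keeping section headers."},
    ]}]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=4096)
    # Decode only the newly generated tokens, not the prompt
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```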
Stage 2: JSON Structuring¶
Raw text is structured into JSON using Llama 3.1 8B with section-aware prompting.
Prompt Engineering¶
The prompt explicitly:
- Enumerates common poster sections (Abstract, Introduction, Methods, Results, Discussion, Conclusions, References)
- Distinguishes semantically similar sections (e.g., "Key Findings" vs "References")
- Instructs verbatim text preservation
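An illustrative prompt in this style (the wording is a sketch, not the shipped prompt):

```python
# Illustrative section-aware prompt; {raw_text} is filled with Stage 1 output
EXTRACTION_PROMPT = """You are extracting structured metadata from a scientific poster.
Return ONLY valid JSON with keys: titles, creators, posterContent, imageCaptions, tableCaptions.
Typical section titles: Abstract, Introduction, Methods, Results, Discussion, Conclusions, References.
Keep semantically similar sections separate (e.g. "Key Findings" is not "References").
Copy section text verbatim; do not paraphrase or summarize.

Poster text:
{raw_text}
"""
```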
Adaptive Token Management¶
- Initial attempt: 18,000 output tokens
- If truncated: Retry with 24,000 tokens
- If still truncated: Switch to a condensed prompt format
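Sketched as a retry loop, with the token budgets from the model configuration below; `generate_json()` is a hypothetical helper wrapping the Llama 3.1 8B call:

```python
import json

def looks_truncated(text: str) -> bool:
    """Heuristic: output that fails to parse as JSON is treated as truncated."""
    try:
        json.loads(text)
        return False
    except json.JSONDecodeError:
        return True

def extract_json_with_retry(raw_text: str) -> str:
    # generate_json() is a hypothetical wrapper around the model call
    output = generate_json(raw_text, max_new_tokens=18_000)
    if looks_truncated(output):   # retry with a larger output budget
        output = generate_json(raw_text, max_new_tokens=24_000)
    if looks_truncated(output):   # last resort: condensed prompt format
        output = generate_json(raw_text, max_new_tokens=24_000, condensed=True)
    return output
```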
JSON Repair¶
Post-processing handles common LLM output issues:
- Unescaped quotes in scientific notation
- Trailing commas in arrays/objects
- Unicode encoding errors
- Truncated JSON completion
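A simplified repair pass covering two of these cases, trailing commas and truncation (the real post-processing handles more, including unescaped quotes and Unicode issues):

```python
import json
import re

def repair_json(text: str) -> dict:
    """Best-effort repair of common LLM JSON output issues (simplified sketch)."""
    # Remove trailing commas before a closing bracket or brace
    text = re.sub(r",\s*([\]}])", r"\1", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Assume truncation: close any brackets/braces that were left open
        closers = []
        for ch in text:
            if ch in "{[":
                closers.append("}" if ch == "{" else "]")
            elif ch in "}]" and closers:
                closers.pop()
        return json.loads(text + "".join(reversed(closers)))
```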
Post-Processing¶
After JSON extraction, the pipeline applies:
- Schema validation: Ensures output matches poster-json-schema
- Caption normalization: Converts to `captions` array format
- Section deduplication: Removes duplicate content
- Unicode cleaning: Removes bidirectional characters
- Table/chart data cleaning: Removes axis labels from section content
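For example, the Unicode-cleaning step can be sketched as stripping the Unicode bidirectional control characters (the actual cleaning may cover additional characters):

```python
# Bidirectional control characters: LRM/RLM, embeddings, overrides, and isolates
BIDI_CHARS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
BIDI_TABLE = str.maketrans("", "", BIDI_CHARS)

def clean_unicode(text: str) -> str:
    """Remove bidirectional control characters from section content."""
    return text.translate(BIDI_TABLE)
```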
Memory Management¶
GPU Memory Optimization¶
- Models loaded one at a time to minimize VRAM usage
- Automatic 8-bit quantization for GPUs with <16GB VRAM
- Model unloading between stages
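A sketch of the quantized load and the unload step between stages, using bitsandbytes via transformers (the 16GB threshold comes from the list above; the other details are assumptions):

```python
import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "jimnoneill/Llama-3.1-8B-Poster-Extraction"

# Quantize to 8-bit when the selected GPU has less than 16 GB of VRAM
total_vram = torch.cuda.get_device_properties(0).total_memory
quant = BitsAndBytesConfig(load_in_8bit=True) if total_vram < 16 * 1024**3 else None
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=quant, device_map="auto"
)

# ... run the JSON-structuring stage ...

# Unload between stages: drop the reference, then release cached CUDA memory
del model
gc.collect()
torch.cuda.empty_cache()
```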
Automatic GPU Selection¶
import torch

def get_best_gpu() -> str:
    # Select the GPU with the most free memory (mem_get_info() already accounts
    # for other processes using the GPU); fall back to CPU if no GPU is available.
    if not torch.cuda.is_available():
        return "cpu"
    free = [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]
    return f"cuda:{max(range(len(free)), key=free.__getitem__)}"
Output Schema¶
Outputs conform to poster-json-schema:
{
"$schema": "https://posters.science/schema/v0.1/poster_schema.json",
"creators": [...],
"titles": [...],
"posterContent": {
"sections": [
{"sectionTitle": "...", "sectionContent": "..."}
]
},
"imageCaptions": [
{"captions": ["Figure 1.", "Description"]}
],
"tableCaptions": [
{"captions": ["Table 1.", "Description"]}
]
}
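Conformance can be checked with the jsonschema package; a minimal sketch, assuming a local copy of the schema file:

```python
import json
from jsonschema import ValidationError, validate

# Assumes the poster-json-schema file has been saved locally
with open("poster_schema.json") as f:
    POSTER_SCHEMA = json.load(f)

def validate_output(extracted: dict) -> bool:
    """Return True if the extracted JSON conforms to the poster schema."""
    try:
        validate(instance=extracted, schema=POSTER_SCHEMA)
        return True
    except ValidationError as err:
        print(f"Schema violation: {err.message}")
        return False
```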
File Structure¶
poster2json/
├── poster_extraction.py # Main pipeline
│ ├── get_raw_text() # Stage 1: Text extraction
│ ├── extract_json_with_retry() # Stage 2: JSON structuring
│ ├── postprocess_json() # Post-processing
│ └── calculate_metrics() # Evaluation
├── api.py # Flask REST API
├── Dockerfile # Container definition
└── docker-compose.yml # Orchestration
Configuration¶
Environment Variables¶
| Variable | Description |
|---|---|
| PDFALTO_PATH | Path to pdfalto binary |
| CUDA_VISIBLE_DEVICES | GPU device(s) to use |
| HF_TOKEN | HuggingFace API token |
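A minimal sketch of reading these in Python (the defaults shown are assumptions, not documented values):

```python
import os

PDFALTO_PATH = os.environ.get("PDFALTO_PATH", "pdfalto")  # assume binary on PATH if unset
HF_TOKEN = os.environ.get("HF_TOKEN")                      # needed to download gated models
# CUDA_VISIBLE_DEVICES is read by CUDA/PyTorch directly, not by the application code
```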
Model Configuration¶
JSON_MODEL_ID = "jimnoneill/Llama-3.1-8B-Poster-Extraction"
VISION_MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
MAX_JSON_TOKENS = 18000
MAX_RETRY_TOKENS = 24000
See Also¶
- Evaluation - Validation metrics and results
- API Reference - REST API documentation
- Installation - Setup instructions