Tutorial 2: PDF Documents to SFT Training Data
This tutorial walks you through building an eulerweave pipeline that turns local PDF documents into question-answer training data for LLM fine-tuning (SFT).
What you will learn in this tutorial:
- Extracting text from documents using the PDF extractor
- Automatically generating QnA pairs with the build_langextract_qna block
- FakeProvider mode for testing without an LLM
- Generating real QnA using Ollama
Prerequisites: eulerweave must be installed. PDF support requires
pip install -e ".[pdf]". See Tutorial 0: Installation.
Scenario
You have technical documents, papers, or textbooks in PDF format. You want to convert these documents into SFT training data:
data/
paper.pdf ← input: research paper PDF
manual.pdf ← input: technical manual PDF
out/
training_data.jsonl ← output: QnA data for SFT training
Step 1: Verify PDF Extractor
Verify that the PDF extractor is installed:
eulerweave plugins list
The output should show the pdf extractor:
Extractors:
...
pdf eulerweave.core.io.pdf_extractor:PdfExtractor [.pdf]
...
If it is not shown:
pip install -e ".[pdf]"
Step 2: Write the Manifest
Method A: Test with FakeProvider (No LLM Required)
First, test that the pipeline works correctly without an LLM.
build_langextract_qna uses a deterministic FakeProvider by default,
so you can run the full pipeline without Ollama.
manifest_pdf_test.yaml:
version: 1
track: sft
inputs:
- type: pdf
uri: data/paper.pdf
options:
strategy: auto # automatic fallback: text → OCR
page_range: "1-5" # first 5 pages only (for testing)
pipeline:
- id: norm1
type: normalize_text
slot: normalize
input_type: TextDocument
output_type: TextDocument
- id: filter1
type: heuristic_filter
slot: filter
input_type: TextDocument
output_type: TextDocument
params:
min_length: 50 # skip pages shorter than 50 characters
- id: qna1
type: build_langextract_qna
slot: build_task
input_type: TextDocument
output_type: SFTQnA
- id: exp1
type: export_jsonl
slot: export
input_type: SFTQnA
output_type: ExportedDataset
exports:
- type: jsonl
path: out/training_data.jsonl
profile:
cpu_only: true
allow_external_llm: false
Key points:
- type: pdf — Uses the PDF extractor to extract text page by page.
- strategy: auto — Tries text extraction first, falls back to OCR on failure.
- page_range: "1-5" — Processes only specific pages (useful for testing).
- build_langextract_qna — Generates QnA pairs from text.
- allow_external_llm: false — Uses FakeProvider (no LLM required).
Step 3: Validate and Check Execution Plan
eulerweave validate manifest_pdf_test.yaml
[eulerweave] Validating manifest: manifest_pdf_test.yaml
[eulerweave] Parsing YAML ... OK
[eulerweave] Checking inputs ... OK (1 input, pdf)
[eulerweave] Checking pipeline blocks ... OK (4 blocks)
[eulerweave] Checking type chain ... OK
[eulerweave] Checking slot ordering ... OK
✓ Manifest is valid.
Check the execution plan:
eulerweave plan manifest_pdf_test.yaml
Execution Plan
+---------+-----------------------+-----------+---------------+----------------+
| Step | Block | Slot | Input Type | Output Type |
+---------+-----------------------+-----------+---------------+----------------+
| 1 | norm1 | normalize | TextDocument | TextDocument |
| 2 | filter1 | filter | TextDocument | TextDocument |
| 3 | qna1 | build_task| TextDocument | SFTQnA |
| 4 | exp1 | export | SFTQnA | ExportedDataset|
+---------+-----------------------+-----------+---------------+----------------+
Step 4: Run with FakeProvider
eulerweave run manifest_pdf_test.yaml \
--input data/paper.pdf \
--artifacts ./artifacts
[eulerweave] Loading manifest: manifest_pdf_test.yaml
[eulerweave] Input: data/paper.pdf (5 pages → 5 records)
[eulerweave] [1/4] norm1 (normalize_text) ...
[eulerweave] Processed 5 documents
[eulerweave] [2/4] filter1 (heuristic_filter) ...
[eulerweave] Processed 5 documents, 1 filtered (min_length=50)
[eulerweave] [3/4] qna1 (build_langextract_qna) ...
[eulerweave] Generated 4 QnA pairs
[eulerweave] [4/4] exp1 (export_jsonl) ...
[eulerweave] Exported 4 records
[eulerweave] Pipeline completed successfully.
Check the output:
head -1 out/training_data.jsonl | python -m json.tool
{
"messages": [
{"role": "user", "content": "What is the main topic of: 'The paper presents ...'?"},
{"role": "assistant", "content": "The text discusses: The paper presents ... [fake-a1b2c3]"}
]
}
FakeProvider generates deterministic QnA based on MD5 hashes. This is sufficient for verifying that the pipeline works correctly.
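The deterministic-hash idea can be sketched in a few lines of Python. This is an illustration of the concept only: the function name, snippet length, and exact wording below are assumptions, not eulerweave's actual FakeProvider code.

```python
import hashlib

def fake_qna(text: str, snippet_len: int = 40) -> dict:
    """Deterministic QnA stub: the same input text always yields the same pair."""
    snippet = text[:snippet_len]
    # A short MD5-derived tag makes each fake answer reproducible and traceable.
    tag = hashlib.md5(text.encode("utf-8")).hexdigest()[:6]
    return {
        "messages": [
            {"role": "user", "content": f"What is the main topic of: '{snippet} ...'?"},
            {"role": "assistant", "content": f"The text discusses: {snippet} ... [fake-{tag}]"},
        ]
    }
```

Because the output depends only on the input text, re-running the pipeline over the same PDF produces byte-identical training data, which makes the FakeProvider run a reliable smoke test.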
Step 5: Generate Real QnA with Ollama
To generate high-quality QnA using a real LLM, set up Ollama.
5-1. Install and Start Ollama
# Install Ollama (if not already installed)
curl -fsSL https://ollama.com/install.sh | sh
# Start the Ollama server
ollama serve
# Download the model
ollama pull gpt-oss:20b
5-2. Write the Ollama Manifest
manifest_pdf_ollama.yaml:
version: 1
track: sft
inputs:
- type: pdf
uri: data/paper.pdf
options:
strategy: auto
pipeline:
- id: norm1
type: normalize_text
slot: normalize
input_type: TextDocument
output_type: TextDocument
- id: filter1
type: heuristic_filter
slot: filter
input_type: TextDocument
output_type: TextDocument
params:
min_length: 100
max_length: 50000
- id: qna1
type: build_langextract_qna
slot: build_task
input_type: TextDocument
output_type: SFTQnA
params:
model: "gpt-oss:20b"
base_url: "http://localhost:11434"
- id: exp1
type: export_jsonl
slot: export
input_type: SFTQnA
output_type: ExportedDataset
exports:
- type: jsonl
path: out/training_data.jsonl
profile:
cpu_only: false
allow_external_llm: true # ← required to permit LLM calls
Key changes:
- params.model — The Ollama model to use
- params.base_url — Ollama server address (default: http://localhost:11434)
- allow_external_llm: true — Allows LLM calls
5-3. Run
eulerweave run manifest_pdf_ollama.yaml \
--input data/paper.pdf \
--artifacts ./artifacts
5-4. Check Ollama Output
head -1 out/training_data.jsonl | python -m json.tool
{
"messages": [
{
"role": "user",
"content": "What is the main contribution of this paper?"
},
{
"role": "assistant",
"content": "The paper proposes a novel approach to transformer-based language models that reduces computational complexity while maintaining performance. The key contribution is a sparse attention mechanism that achieves O(n log n) complexity compared to the standard O(n²) self-attention."
}
]
}
PDF Extractor Options Detail
strategy
| Value | Behavior | Dependencies |
|---|---|---|
| auto (default) | Tries text extraction first, falls back to OCR on failure | pdfminer.six + (optional) ocrmypdf |
| text | Text-only extraction via pdfminer.six | pdfminer.six |
| ocr | Image-based extraction via OCRmyPDF | pdfminer.six + ocrmypdf |
page_range
Specify the page range as a string:
options:
page_range: "1-10" # pages 1 through 10
page_range: "5" # page 5 only
If omitted, all pages are extracted.
PDF to Record Mapping
The PDF extractor creates one record per page:
paper.pdf (30 pages) → 30 CanonicalRecord objects
├── record[0]: page 1 text + {page: 1, source: "paper.pdf"}
├── record[1]: page 2 text + {page: 2, source: "paper.pdf"}
└── ...
Each record includes source metadata:
- source_uri — Original file path
- source_type — "pdf"
- page — Page number
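To make the mapping concrete, here is a sketch of per-page records (the dict layout is illustrative; only the field names above come from the document) and a quick way to count pages per source when debugging multi-PDF runs:

```python
from collections import Counter

# Hypothetical per-page records as the extractor might emit them;
# field names mirror the source metadata listed above.
records = [
    {"text": "Abstract ...", "source_uri": "data/paper.pdf", "source_type": "pdf", "page": 1},
    {"text": "Introduction ...", "source_uri": "data/paper.pdf", "source_type": "pdf", "page": 2},
    {"text": "Chapter 1 ...", "source_uri": "data/manual.pdf", "source_type": "pdf", "page": 1},
]

# Count how many pages each source contributed.
pages_per_source = Counter(r["source_uri"] for r in records)
print(pages_per_source)  # Counter({'data/paper.pdf': 2, 'data/manual.pdf': 1})
```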
Production Pipeline: Multiple PDFs + Metrics
A pipeline that processes multiple PDFs and collects quality metrics:
version: 1
track: sft
inputs:
- type: pdf
uri: data/paper1.pdf
options:
strategy: text
- type: pdf
uri: data/paper2.pdf
options:
strategy: auto
page_range: "1-20"
pipeline:
- id: norm1
type: normalize_text
slot: normalize
input_type: TextDocument
output_type: TextDocument
- id: filter1
type: heuristic_filter
slot: filter
input_type: TextDocument
output_type: TextDocument
params:
min_length: 100
- id: dedup1
type: dedup_exact
slot: dedup
input_type: TextDocument
output_type: TextDocument
- id: m1
type: metrics_text_basic
slot: metrics
input_type: TextDocument
output_type: TextDocument
- id: qna1
type: build_langextract_qna
slot: build_task
input_type: TextDocument
output_type: SFTQnA
params:
model: "gpt-oss:20b"
- id: exp1
type: export_jsonl
slot: export
input_type: SFTQnA
output_type: ExportedDataset
exports:
- type: jsonl
path: out/training_data.jsonl
profile:
cpu_only: false
allow_external_llm: true
This pipeline:
- Extracts text from two PDFs (creates as many records as total pages).
- Normalizes text and filters out short pages.
- Deduplicates exactly identical pages.
- Collects basic text statistics (metrics).
- Generates QnA pairs from each page.
- Exports the final results to JSONL.
Final Training Data Format
Each line of the final output out/training_data.jsonl looks like this:
{
"messages": [
{"role": "user", "content": "What does the paper discuss about attention mechanisms?"},
{"role": "assistant", "content": "The paper discusses a novel sparse attention mechanism..."}
]
}
This format is directly compatible with OpenAI, Ollama, and most SFT training frameworks.
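A minimal, framework-agnostic reader for this format can be written in a few lines. The helper below is illustrative (not part of eulerweave); it turns each JSONL line into a (prompt, response) pair that most SFT trainers can consume.

```python
import io
import json

def load_sft_pairs(fp) -> list[tuple[str, str]]:
    """Read messages-format JSONL into (user, assistant) pairs."""
    pairs = []
    for line in fp:
        msgs = json.loads(line)["messages"]
        user = next(m["content"] for m in msgs if m["role"] == "user")
        assistant = next(m["content"] for m in msgs if m["role"] == "assistant")
        pairs.append((user, assistant))
    return pairs

# In practice you would pass open("out/training_data.jsonl") instead.
sample = io.StringIO(
    '{"messages": [{"role": "user", "content": "Q?"}, '
    '{"role": "assistant", "content": "A."}]}\n'
)
print(load_sft_pairs(sample))  # [('Q?', 'A.')]
```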
Full Process Summary
# 1. Install PDF support
pip install -e ".[pdf,llm]"
# 2. Start Ollama (when generating real QnA)
ollama serve && ollama pull gpt-oss:20b
# 3. Write the manifest
cat > manifest.yaml << 'EOF'
version: 1
track: sft
inputs:
- type: pdf
uri: data/paper.pdf
options:
strategy: auto
pipeline:
- id: norm1
type: normalize_text
slot: normalize
input_type: TextDocument
output_type: TextDocument
- id: filter1
type: heuristic_filter
slot: filter
input_type: TextDocument
output_type: TextDocument
params:
min_length: 100
- id: qna1
type: build_langextract_qna
slot: build_task
input_type: TextDocument
output_type: SFTQnA
params:
model: "gpt-oss:20b"
- id: exp1
type: export_jsonl
slot: export
input_type: SFTQnA
output_type: ExportedDataset
exports:
- type: jsonl
path: out/training_data.jsonl
profile:
cpu_only: false
allow_external_llm: true
EOF
# 4. Validate
eulerweave validate manifest.yaml
# 5. Run
eulerweave run manifest.yaml --input data/paper.pdf --artifacts ./artifacts
# 6. Check the results
wc -l out/training_data.jsonl
head -1 out/training_data.jsonl | python -m json.tool
Troubleshooting
ModuleNotFoundError: No module named 'pdfminer'
Install the PDF dependency:
pip install -e ".[pdf]"
ImportError: ocrmypdf with strategy: ocr
Install the OCR dependency:
pip install -e ".[pdf_ocr]"
No text extracted from PDF (empty records)
This is likely a scanned PDF. Use strategy: ocr:
options:
strategy: ocr
Ollama Connection Error
Verify that Ollama is running:
curl http://localhost:11434/api/tags
Next Steps
- Tutorial 3: HuggingFace to Training Data — Download and process datasets from HF
- Tutorial 4: SFT Track Deep Dive — Compare three SFT builder blocks
- Tutorial 8: Metrics — Add quality statistics to your pipeline