
Tutorial 2: PDF Documents to SFT Training Data

This tutorial walks you through processing local PDF documents with an eulerweave pipeline to generate question-answer (QnA) training data for supervised fine-tuning (SFT) of an LLM.

What you will learn in this tutorial:

  - Verifying that the PDF extractor plugin is installed
  - Testing the full pipeline with the deterministic FakeProvider (no LLM required)
  - Generating real QnA pairs with a local Ollama model
  - PDF extractor options (strategy, page_range) and the page-to-record mapping
  - Troubleshooting common PDF and Ollama issues

Prerequisites: eulerweave must be installed. PDF support requires pip install -e ".[pdf]". See Tutorial 0: Installation.


Scenario

You have technical documents, papers, or textbooks in PDF format. You want to convert these documents into SFT training data:

data/
  paper.pdf          ← input: research paper PDF
  manual.pdf         ← input: technical manual PDF

out/
  training_data.jsonl  ← output: QnA data for SFT training

Step 1: Verify PDF Extractor

Verify that the PDF extractor is installed:

eulerweave plugins list

The output should show the pdf extractor:

Extractors:
  ...
  pdf      eulerweave.core.io.pdf_extractor:PdfExtractor         [.pdf]
  ...

If it is not shown:

pip install -e ".[pdf]"

Step 2: Write the Manifest

Method A: Test with FakeProvider (No LLM Required)

First, test that the pipeline works correctly without an LLM. build_langextract_qna uses a deterministic FakeProvider by default, so you can run the full pipeline without Ollama.

manifest_pdf_test.yaml:

version: 1
track: sft

inputs:
  - type: pdf
    uri: data/paper.pdf
    options:
      strategy: auto          # automatic fallback from text to OCR
      page_range: "1-5"       # first 5 pages only (for testing)

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 50          # skip pages shorter than 50 characters

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/training_data.jsonl

profile:
  cpu_only: true
  allow_external_llm: false

Key points:

  - qna1 omits the model param, so build_langextract_qna falls back to the deterministic FakeProvider.
  - page_range: "1-5" keeps the test run small and fast.
  - allow_external_llm: false guarantees no LLM calls leave your machine.


Step 3: Validate and Check Execution Plan

eulerweave validate manifest_pdf_test.yaml
[eulerweave] Validating manifest: manifest_pdf_test.yaml
[eulerweave] Parsing YAML ...                              OK
[eulerweave] Checking inputs ...                            OK  (1 input, pdf)
[eulerweave] Checking pipeline blocks ...                   OK  (4 blocks)
[eulerweave] Checking type chain ...                        OK
[eulerweave] Checking slot ordering ...                     OK

✓ Manifest is valid.

Check the execution plan:

eulerweave plan manifest_pdf_test.yaml
 Execution Plan
+---------+-----------------------+-----------+---------------+----------------+
| Step    | Block                 | Slot      | Input Type    | Output Type    |
+---------+-----------------------+-----------+---------------+----------------+
| 1       | norm1                 | normalize | TextDocument  | TextDocument   |
| 2       | filter1               | filter    | TextDocument  | TextDocument   |
| 3       | qna1                  | build_task| TextDocument  | SFTQnA         |
| 4       | exp1                  | export    | SFTQnA        | ExportedDataset|
+---------+-----------------------+-----------+---------------+----------------+
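
The "Checking type chain" step in `validate` comes down to one rule: each block's `input_type` must equal the previous block's `output_type`. A minimal sketch of that check, using plain dicts for blocks (an illustration of the rule, not eulerweave's actual internals):

```python
def check_type_chain(blocks):
    """Return True if each block's input_type matches the previous output_type."""
    for prev, cur in zip(blocks, blocks[1:]):
        if prev["output_type"] != cur["input_type"]:
            raise TypeError(
                f"{cur['id']}: expects {cur['input_type']}, "
                f"but {prev['id']} produces {prev['output_type']}"
            )
    return True

# The execution plan shown above, as data:
plan = [
    {"id": "norm1",   "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "filter1", "input_type": "TextDocument", "output_type": "TextDocument"},
    {"id": "qna1",    "input_type": "TextDocument", "output_type": "SFTQnA"},
    {"id": "exp1",    "input_type": "SFTQnA",       "output_type": "ExportedDataset"},
]

check_type_chain(plan)  # passes: the chain above is consistent
```

If you swap two blocks or change an output_type, the check fails at the first mismatched pair, which is the same kind of error `eulerweave validate` reports.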

Step 4: Run with FakeProvider

eulerweave run manifest_pdf_test.yaml \
  --input data/paper.pdf \
  --artifacts ./artifacts
[eulerweave] Loading manifest: manifest_pdf_test.yaml
[eulerweave] Input: data/paper.pdf (5 pages → 5 records)

[eulerweave] [1/4] norm1 (normalize_text) ...
[eulerweave]   Processed 5 documents
[eulerweave] [2/4] filter1 (heuristic_filter) ...
[eulerweave]   Processed 5 documents, 1 filtered (min_length=50)
[eulerweave] [3/4] qna1 (build_langextract_qna) ...
[eulerweave]   Generated 4 QnA pairs
[eulerweave] [4/4] exp1 (export_jsonl) ...
[eulerweave]   Exported 4 records

[eulerweave] Pipeline completed successfully.

Check the output:

head -1 out/training_data.jsonl | python -m json.tool
{
  "messages": [
    {"role": "user", "content": "What is the main topic of: 'The paper presents ...'?"},
    {"role": "assistant", "content": "The text discusses: The paper presents ... [fake-a1b2c3]"}
  ]
}

FakeProvider generates deterministic QnA based on MD5 hashes. This is sufficient for verifying that the pipeline works correctly.
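
The shape of that deterministic output can be mimicked in a few lines. This is an illustrative sketch of the idea (hash the input text, embed a short hex tag), not FakeProvider's actual code:

```python
import hashlib

def fake_qna(text: str) -> dict:
    # Derive a short, stable tag from the input text so repeated runs
    # produce byte-identical output (the "[fake-...]" suffix seen above).
    tag = hashlib.md5(text.encode("utf-8")).hexdigest()[:6]
    snippet = text[:40]
    return {
        "messages": [
            {"role": "user", "content": f"What is the main topic of: '{snippet}'?"},
            {"role": "assistant", "content": f"The text discusses: {snippet} [fake-{tag}]"},
        ]
    }

# Same input → same output, every run:
assert fake_qna("The paper presents ...") == fake_qna("The paper presents ...")
```

Determinism is the point: you can diff two runs of the test pipeline and expect identical JSONL output.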


Step 5: Generate Real QnA with Ollama

To generate high-quality QnA using a real LLM, set up Ollama.

5-1. Install and Start Ollama

# Install Ollama (if you haven't already)
curl -fsSL https://ollama.com/install.sh | sh

# Start the Ollama server
ollama serve

# Download the model
ollama pull gpt-oss:20b

5-2. Write the Ollama Manifest

manifest_pdf_ollama.yaml:

version: 1
track: sft

inputs:
  - type: pdf
    uri: data/paper.pdf
    options:
      strategy: auto

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 100
      max_length: 50000

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"
      base_url: "http://localhost:11434"

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/training_data.jsonl

profile:
  cpu_only: false
  allow_external_llm: true    # ← required to allow LLM use

Key changes:

  - qna1 now passes model and base_url params, so build_langextract_qna calls Ollama instead of FakeProvider.
  - page_range was removed, so the whole document is processed.
  - min_length was raised to 100 (and max_length added) to filter low-content pages more aggressively.
  - allow_external_llm: true is required for the pipeline to call the LLM.

5-3. Run

eulerweave run manifest_pdf_ollama.yaml \
  --input data/paper.pdf \
  --artifacts ./artifacts

5-4. Check Ollama Output

head -1 out/training_data.jsonl | python -m json.tool
{
  "messages": [
    {
      "role": "user",
      "content": "What is the main contribution of this paper?"
    },
    {
      "role": "assistant",
      "content": "The paper proposes a novel approach to transformer-based language models that reduces computational complexity while maintaining performance. The key contribution is a sparse attention mechanism that achieves O(n log n) complexity compared to the standard O(n²) self-attention."
    }
  ]
}

PDF Extractor Options Detail

strategy

| Value          | Behavior                                                  | Dependencies                       |
|----------------|-----------------------------------------------------------|------------------------------------|
| auto (default) | Tries text extraction first, falls back to OCR on failure | pdfminer.six + (optional) ocrmypdf |
| text           | Text-only extraction via pdfminer.six                     | pdfminer.six                       |
| ocr            | Image-based extraction via OCRmyPDF                       | pdfminer.six + ocrmypdf            |

page_range

Specify the page range as a string:

options:
  page_range: "1-10"     # pages 1 through 10
  page_range: "5"        # page 5 only

If omitted, all pages are extracted.
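
The option's semantics can be sketched as a small parser. This is an illustration of the behavior described above, not eulerweave's actual parser:

```python
def parse_page_range(spec, total_pages):
    """Parse a page_range string like "1-10" or "5" into 1-based page numbers."""
    if spec is None:
        return list(range(1, total_pages + 1))   # omitted → all pages
    if "-" in spec:
        start, end = (int(p) for p in spec.split("-", 1))
    else:
        start = end = int(spec)                  # single page, e.g. "5"
    # Clamp to the document's actual page count.
    return list(range(max(1, start), min(end, total_pages) + 1))

parse_page_range("1-5", 30)   # → [1, 2, 3, 4, 5]
parse_page_range("5", 30)     # → [5]
parse_page_range(None, 3)     # → [1, 2, 3]
```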

PDF to Record Mapping

The PDF extractor creates one record per page:

paper.pdf (30 pages) → 30 CanonicalRecords
  ├── record[0]: page 1 text + {page: 1, source: "paper.pdf"}
  ├── record[1]: page 2 text + {page: 2, source: "paper.pdf"}
  └── ...

Each record includes source metadata:

  - page: the 1-based page number the text was extracted from
  - source: the filename of the originating PDF

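As a concrete picture of this mapping, here is a sketch that wraps per-page text in record dicts carrying that metadata (illustrative plain dicts, not eulerweave's actual CanonicalRecord class):

```python
def pages_to_records(pages, source):
    """Wrap each page's text in a record dict with page/source metadata."""
    return [
        {"text": text, "meta": {"page": i, "source": source}}
        for i, text in enumerate(pages, start=1)
    ]

records = pages_to_records(["Abstract ...", "1 Introduction ..."], "paper.pdf")
# records[0]["meta"] == {"page": 1, "source": "paper.pdf"}
```

Because the metadata travels with each record, you can trace any exported QnA pair back to the page it came from.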

Production Pipeline: Multiple PDFs + Metrics

A pipeline that processes multiple PDFs and collects quality metrics:

version: 1
track: sft

inputs:
  - type: pdf
    uri: data/paper1.pdf
    options:
      strategy: text
  - type: pdf
    uri: data/paper2.pdf
    options:
      strategy: auto
      page_range: "1-20"

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 100

  - id: dedup1
    type: dedup_exact
    slot: dedup
    input_type: TextDocument
    output_type: TextDocument

  - id: m1
    type: metrics_text_basic
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/training_data.jsonl

profile:
  cpu_only: false
  allow_external_llm: true

This pipeline:

  1. Extracts text from two PDFs (creates as many records as total pages).
  2. Normalizes text and filters out short pages.
  3. Deduplicates exactly identical pages.
  4. Collects basic text statistics (metrics).
  5. Generates QnA pairs from each page.
  6. Exports the final results to JSONL.
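
Steps 2 and 3 (length filtering and exact deduplication) are simple to picture. A minimal sketch of the idea, operating on plain strings (not eulerweave's actual block implementations):

```python
import hashlib

def heuristic_filter(texts, min_length=100):
    # Step 2: drop pages shorter than min_length characters.
    return [t for t in texts if len(t) >= min_length]

def dedup_exact(texts):
    # Step 3: keep only the first occurrence of byte-identical pages,
    # comparing SHA-256 digests rather than full texts.
    seen, out = set(), []
    for t in texts:
        h = hashlib.sha256(t.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(t)
    return out

pages = ["x" * 200, "short", "x" * 200, "y" * 150]
kept = dedup_exact(heuristic_filter(pages, min_length=100))
# kept == ["x" * 200, "y" * 150]: "short" is filtered, the duplicate page dropped
```

Deduplicating before the build_task step matters: it avoids paying LLM inference cost twice for identical pages.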

Final Training Data Format

Each line of the final output out/training_data.jsonl looks like this:

{
  "messages": [
    {"role": "user", "content": "What does the paper discuss about attention mechanisms?"},
    {"role": "assistant", "content": "The paper discusses a novel sparse attention mechanism..."}
  ]
}

This format is directly compatible with OpenAI, Ollama, and most SFT training frameworks.
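
Before handing the file to a trainer, it can be worth sanity-checking each line against this schema. A small sketch of such a check (a convenience for this tutorial, not part of eulerweave):

```python
import json

def validate_sft_line(line: str) -> bool:
    """Check one JSONL line against the messages format shown above."""
    record = json.loads(line)
    messages = record["messages"]
    roles = [m["role"] for m in messages]
    # Expect a user turn followed by an assistant turn, both non-empty.
    return roles == ["user", "assistant"] and all(m.get("content") for m in messages)

sample = '{"messages": [{"role": "user", "content": "Q?"}, {"role": "assistant", "content": "A."}]}'
assert validate_sft_line(sample)
```

Run it over every line of out/training_data.jsonl to catch malformed records early rather than mid-training.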


Full Process Summary

# 1. Install PDF support
pip install -e ".[pdf,llm]"

# 2. Start Ollama (for real QnA generation)
ollama serve && ollama pull gpt-oss:20b

# 3. Write the manifest
cat > manifest.yaml << 'EOF'
version: 1
track: sft
inputs:
  - type: pdf
    uri: data/paper.pdf
    options:
      strategy: auto
pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument
  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 100
  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"
  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset
exports:
  - type: jsonl
    path: out/training_data.jsonl
profile:
  cpu_only: false
  allow_external_llm: true
EOF

# 4. Validate
eulerweave validate manifest.yaml

# 5. Run
eulerweave run manifest.yaml --input data/paper.pdf --artifacts ./artifacts

# 6. Check the results
wc -l out/training_data.jsonl
head -1 out/training_data.jsonl | python -m json.tool

Troubleshooting

ModuleNotFoundError: No module named 'pdfminer'

Install the PDF dependency:

pip install -e ".[pdf]"

ImportError: ocrmypdf when using strategy: ocr

Install the OCR dependency:

pip install -e ".[pdf_ocr]"

No text extracted from PDF (empty records)

This is likely a scanned PDF. Use strategy: ocr:

options:
  strategy: ocr

Ollama Connection Error

Verify that Ollama is running:

curl http://localhost:11434/api/tags

Next Steps
