Tutorial 0: Installation and Development Environment Setup

This guide walks you through installing eulerweave and configuring your development environment.

Prerequisites

Requirement	Minimum Version	Notes
Python	3.11+	3.12 recommended
pip	23.0+	Bundled with recent Python releases
git	2.x	For cloning the repository

python --version
# Python 3.12.x   (3.11 or later required)
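
If you prefer to check the interpreter version from a script, the comparison can be sketched as below. This is a minimal illustration, not part of eulerweave; the `meets_minimum` helper and the `MINIMUM` constant are hypothetical names chosen for this example.

```python
import sys

# Minimum version taken from the prerequisites table above.
MINIMUM = (3, 11)


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if the (major, minor) part of `version` is at least `minimum`."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= tuple(minimum)


# Report the running interpreter and whether it satisfies the requirement.
found = ".".join(str(part) for part in sys.version_info[:3])
status = "OK" if meets_minimum() else "too old"
print(f"Python {found}: {status}")
```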

1. Clone the Repository

git clone https://github.com/baida21/eulerweave.git
cd eulerweave

2. Create a Virtual Environment

python -m venv .venv
source .venv/bin/activate    # Linux / macOS
# .venv\Scripts\activate     # Windows
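
To confirm that the activated interpreter really is the one inside `.venv`, you can compare `sys.prefix` with `sys.base_prefix`; the two differ inside an environment created by `python -m venv`. The sketch below assumes nothing about eulerweave itself, and `in_virtualenv` is a hypothetical helper name.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv-style environment."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return prefix != base_prefix


print("virtual environment active:", in_virtualenv())
```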

3. Installation

For development, we recommend installing the `dev` extra, which adds the test tools on top of the `all` group:

pip install -e ".[dev]"

This command installs:

Core dependencies: typer[all], pydantic, pyyaml, rich
[dev] extras: pytest, pytest-cov, jsonschema, plus every optional group in `all` (`pdf_ocr` is not included)

Optional Dependency Groups

If you only need a subset of features, you can install individual groups:

Group	Installed Packages	When Needed
`llm`	`httpx>=0.25`	LLM-based QnA generation via Ollama (`build_sft_qna`, `build_langextract_qna`)
`parquet`	`pyarrow>=14.0`	Parquet/MDS export (`export_parquet`, `export_mds`)
`dedup`	`datasketch>=1.6`	MinHash-based deduplication (`dedup_minhash`)
`pdf`	`pdfminer.six>=20221105`	PDF text extraction (`type: pdf` input)
`pdf_ocr`	`pdfminer.six` + `ocrmypdf>=16.0`	PDF OCR extraction (`strategy: ocr`)
`all`	All packages above (excluding pdf_ocr)	Full functionality without test tools
`dev`	`all` + `pytest` + `pytest-cov` + `jsonschema`	Full development environment

Examples:

# Core + PDF support only
pip install -e ".[pdf]"

# Core + PDF with OCR
pip install -e ".[pdf_ocr]"

# Core + LLM + PDF (PDF pipeline with QnA generation)
pip install -e ".[llm,pdf]"

# Everything (same as dev, minus the test dependencies)
pip install -e ".[all]"
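
After installing a subset of groups, it can be handy to see which optional packages are actually importable. The following sketch maps each group from the table above to the top-level module its package provides and reports what is missing. The mapping and the `missing_modules` helper are assumptions made for this illustration; eulerweave may perform its own checks differently.

```python
from importlib.util import find_spec

# Group name -> top-level import names of the packages listed in the table.
# (pdfminer.six is imported as `pdfminer`.)
GROUP_MODULES = {
    "llm": ["httpx"],
    "parquet": ["pyarrow"],
    "dedup": ["datasketch"],
    "pdf": ["pdfminer"],
    "pdf_ocr": ["pdfminer", "ocrmypdf"],
}


def missing_modules(group, is_installed=None):
    """Return the modules of `group` that cannot be imported."""
    if is_installed is None:
        # Default check: ask the import system whether the module can be found.
        def is_installed(name):
            return find_spec(name) is not None
    return [name for name in GROUP_MODULES[group] if not is_installed(name)]


for group in GROUP_MODULES:
    missing = missing_modules(group)
    print(f"{group}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

A group reported as missing can then be installed with the matching `pip install -e ".[group]"` command.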

4. Verify Installation

eulerweave --help

Expected output:

 Usage: eulerweave [OPTIONS] COMMAND [ARGS]...

 Eulerflow: Manifest-centric data pipeline for LLM training.

Options:
  --help  Show this message and exit.

Commands:
  new        Scaffold a new manifest YAML.
  validate   Validate a manifest YAML.
  plan       Show execution plan and cost estimate.
  run        Execute a manifest pipeline.
  export     Export data to streaming formats.
  plugins    List installed extractor plugins.
  dataset    Dataset-related commands.

Check the installed extractor plugins:

eulerweave plugins list

Extractors:
  txt      eulerweave.core.io.local_files:TxtExtractor          [.txt]
  jsonl    eulerweave.core.io.jsonl:JsonlExtractor               [.jsonl]
  csv      eulerweave.core.io.csv_extractor:CsvExtractor         [.csv]
  parquet  eulerweave.core.io.parquet_extractor:ParquetExtractor [.parquet]
  html     eulerweave.core.io.html_extractor:HtmlExtractor       [.html]
  pdf      eulerweave.core.io.pdf_extractor:PdfExtractor         [.pdf]

6 extractor(s) installed.
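
Each row of this listing has the shape `name  module:Class  [extensions]`. If you want to consume the listing from a script (for example in a CI check), a small parser could look like the sketch below. It assumes the text layout shown above; `ExtractorEntry` and `parse_listing` are hypothetical names, and the real CLI may offer no stability guarantee for this format.

```python
import re
from typing import List, NamedTuple


class ExtractorEntry(NamedTuple):
    name: str
    module: str
    attr: str
    extensions: List[str]


# name, whitespace, "module.path:ClassName", whitespace, "[.ext, ...]"
LINE = re.compile(r"^\s*(\S+)\s+([\w.]+):(\w+)\s+\[([^\]]*)\]\s*$")


def parse_listing(text):
    """Parse extractor rows; header and summary lines are ignored."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue
        name, module, attr, exts = match.groups()
        extensions = [ext.strip() for ext in exts.split(",") if ext.strip()]
        entries.append(ExtractorEntry(name, module, attr, extensions))
    return entries


sample = """Extractors:
  txt      eulerweave.core.io.local_files:TxtExtractor          [.txt]
  pdf      eulerweave.core.io.pdf_extractor:PdfExtractor         [.pdf]

2 extractor(s) installed.
"""
for entry in parse_listing(sample):
    print(entry.name, entry.module, entry.attr, entry.extensions)
```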

5. Run Tests

pytest tests/ -q

All tests should pass. Network/Ollama-related tests are automatically skipped if the required environment variables are not set.
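
The skip behavior described above is commonly implemented by checking for environment variables before a test runs. The sketch below shows the idea only; the variable name `OLLAMA_HOST` and the `missing_env` helper are assumptions for illustration, so check the test suite for the names it actually reads.

```python
import os

# Hypothetical variable name; the real test suite may use different ones.
REQUIRED_VARS = ("OLLAMA_HOST",)


def missing_env(required=REQUIRED_VARS, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


print("would skip network tests:", bool(missing_env()))
```

In a pytest suite, a result like this could be passed to `pytest.mark.skipif(bool(missing_env()), reason="Ollama not configured")` so the affected tests are reported as skipped rather than failed.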

To check coverage:

pytest tests/ -v --cov=eulerweave --cov-report=term-missing

6. Project Structure

eulerweave/
  eulerweave/
    cli/                    # CLI commands (new, validate, plan, run, plugins, etc.)
    core/
      io/                   # Extractors (jsonl, csv, parquet, html, pdf, txt)
      record.py             # CanonicalRecord Pydantic model
      stores/               # Storage backends
    blocks_builtin/         # 17 built-in blocks
    engine/                 # Compiler, planner, local executor
    registry/               # Plugin discovery, block catalog
    spec/                   # Manifest, types, block specs, metric schemas
    providers/              # LLM providers (Ollama, etc.)
  tests/
    unit/                   # Unit tests
    cli/                    # CLI tests
    integration/            # Integration tests
    fixtures/               # YAML test data
  docs/
    tutorials/              # This tutorial series
    architecture/           # Architecture documentation
    fixtures/               # Fixture index
  pyproject.toml            # Build config, dependencies, entry points

Built-in Blocks (17)

Slot	Block	Description
normalize	`normalize_text`	Whitespace cleanup, encoding normalization
filter	`heuristic_filter`	Length/quality-based filtering
dedup	`dedup_minhash`	MinHash near-deduplication
dedup	`dedup_exact`	SHA-256 exact deduplication
metrics	`metrics_text_basic`	Text length distribution statistics
metrics	`metrics_quality_heuristic`	URL/repetition/character ratios
metrics	`metrics_dedup_report`	Deduplication report
metrics	`metrics_task_format_validate`	SFT message format validation
metrics	`metrics_text_distribution`	Text distribution analysis
metrics	`metrics_language_heuristic`	Language ratio analysis
build_task	`build_sft_messages`	Map existing fields to SFT format
build_task	`build_sft_qna`	LLM-based QnA generation (Ollama)
build_task	`build_langextract_qna`	LangExtract-style QnA generation
export	`export_jsonl`	JSONL export
export	`export_parquet`	Parquet export
export	`export_mds`	MDS streaming export

Built-in Extractors (6)

Name	Supported Format	Options
`txt`	`.txt`	—
`jsonl`	`.jsonl`	—
`csv`	`.csv`	`text_column`, `delimiter`
`parquet`	`.parquet`	`text_column`
`html`	`.html`	—
`pdf`	`.pdf`	`strategy` (auto/text/ocr), `page_range`
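
The table above amounts to a lookup from file extension to extractor name. A minimal sketch of that lookup follows; it is not eulerweave's own dispatch code, `extractor_for` is a hypothetical helper, and the case-insensitive matching is an assumption of this example.

```python
from pathlib import Path

# Extension -> extractor name, copied from the table above.
EXTRACTOR_BY_SUFFIX = {
    ".txt": "txt",
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
    ".html": "html",
    ".pdf": "pdf",
}


def extractor_for(path):
    """Return the extractor name for `path`, or None if the suffix is unknown."""
    # Lower-casing the suffix is an assumption made for this sketch.
    return EXTRACTOR_BY_SUFFIX.get(Path(path).suffix.lower())


for example in ("corpus/part-0001.jsonl", "docs/report.PDF", "notes.md"):
    print(example, "->", extractor_for(example))
```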

Troubleshooting

`eulerweave: command not found`

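The console script is only on your PATH while the virtual environment is active. Activate the environment and reinstall the package:

source .venv/bin/activate
pip install -e ".[dev]"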
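
To see which executable, if any, the shell would resolve for `eulerweave`, you can query the PATH from Python with the standard library's `shutil.which`. This is a small diagnostic sketch; `find_command` is a hypothetical helper name.

```python
import shutil


def find_command(name="eulerweave"):
    """Return the full path of `name` on PATH, or None if it is not found."""
    return shutil.which(name)


location = find_command()
if location is None:
    print("eulerweave is not on PATH; activate .venv and reinstall")
else:
    print("eulerweave resolves to", location)
```

If the reported path lies outside `.venv`, a different installation is shadowing the development one.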

`ModuleNotFoundError: No module named 'pdfminer'`

Install the PDF support dependency:

pip install -e ".[pdf]"

`ImportError: ocrmypdf` (when using OCR)

Install the OCR dependency:

pip install -e ".[pdf_ocr]"

`pyarrow` import error

Install the Parquet dependency:

pip install -e ".[parquet]"

Next Steps

Tutorial 1: Quick Start — From creating your first manifest to JSONL output
Tutorial 2: PDF to Training Data — Convert local PDF documents into SFT training data