Tutorial 0: Installation and Development Environment Setup
This guide walks you through installing eulerweave and configuring your development environment.
Prerequisites
| Requirement | Minimum Version | Notes |
|---|---|---|
| Python | 3.11+ | 3.12 recommended |
| pip | 23.0+ | Bundled with recent Python releases |
| git | 2.x | For cloning the repository |
python --version
# Python 3.12.x (3.11 or later required)
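The prerequisite check can also be scripted. The following is a minimal sketch and not part of eulerweave: the helper names `meets_minimum` and `missing_tools` are hypothetical, and the minimum version is taken from the table above.

```python
import shutil
import sys

# Minimum interpreter version from the prerequisites table.
MIN_PYTHON = (3, 11)


def meets_minimum(version, minimum=MIN_PYTHON):
    """Return True if a (major, minor, ...) version tuple is at least the minimum."""
    return tuple(version[:2]) >= tuple(minimum)


def missing_tools(tools=("git",)):
    """Return the names of required command-line tools that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    ok = meets_minimum(sys.version_info)
    status = "ok" if ok else "too old, 3.11+ required"
    print(f"Python {sys.version.split()[0]}: {status}")
    for tool in missing_tools():
        print(f"{tool} not found on PATH")
```

Running the script with the interpreter you plan to use for the virtual environment confirms that the right Python is first on your PATH.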
1. Clone the Repository
git clone https://github.com/baida21/eulerweave.git
cd eulerweave
2. Create a Virtual Environment
python -m venv .venv
source .venv/bin/activate # Linux / macOS
# .venv\Scripts\activate # Windows
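To confirm that the environment is active before installing, you can compare the interpreter's prefixes. This is a small sketch using only the standard library; the function name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter runs inside a venv-style environment."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return prefix != base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the script reports that no environment is active, rerun the activation command for your platform in the same shell.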
3. Installation
For development, we recommend installing with all optional dependencies:
pip install -e ".[dev]"
This command installs:
- Core dependencies: typer[all], pydantic, pyyaml, rich
- [dev] extras: pytest, pytest-cov, jsonschema, and all optional groups
Optional Dependency Groups
If you only need a subset of features, you can install individual groups:
| Group | Installed Packages | When Needed |
|---|---|---|
| llm | httpx>=0.25 | LLM-based QnA generation via Ollama (build_sft_qna, build_langextract_qna) |
| parquet | pyarrow>=14.0 | Parquet/MDS export (export_parquet, export_mds) |
| dedup | datasketch>=1.6 | MinHash-based deduplication (dedup_minhash) |
| pdf | pdfminer.six>=20221105 | PDF text extraction (type: pdf input) |
| pdf_ocr | pdfminer.six + ocrmypdf>=16.0 | PDF OCR extraction (strategy: ocr) |
| all | All packages above (excluding pdf_ocr) | Full functionality without test tools |
| dev | all + pytest + pytest-cov + jsonschema | Full development environment |
Examples:
# Core + PDF support only
pip install -e ".[pdf]"
# Core + PDF with OCR
pip install -e ".[pdf_ocr]"
# Core + LLM + PDF (PDF pipeline with QnA generation)
pip install -e ".[llm,pdf]"
# Everything (same as dev, minus the test dependencies)
pip install -e ".[all]"
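To see which optional groups are usable in the current environment, you can probe for the packages each group adds. The sketch below is an illustration, not an eulerweave command: the mapping is copied from the table above, the import names (for example `pdfminer` for pdfminer.six) are the packages' top-level modules, and `missing_groups` is a hypothetical helper.

```python
from importlib.util import find_spec

# Top-level import names for the packages each optional group adds.
GROUP_MODULES = {
    "llm": ["httpx"],
    "parquet": ["pyarrow"],
    "dedup": ["datasketch"],
    "pdf": ["pdfminer"],  # pdfminer.six is imported as "pdfminer"
    "pdf_ocr": ["pdfminer", "ocrmypdf"],
}


def missing_groups(group_modules=GROUP_MODULES):
    """Map each group to the modules that cannot be found in this environment."""
    missing = {}
    for group, modules in group_modules.items():
        absent = [name for name in modules if find_spec(name) is None]
        if absent:
            missing[group] = absent
    return missing


if __name__ == "__main__":
    for group, modules in missing_groups().items():
        names = ", ".join(modules)
        print(f'{group}: missing {names} -> pip install -e ".[{group}]"')
```

The check only confirms that a module can be located; it does not verify the minimum versions listed in the table.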
4. Verify Installation
eulerweave --help
Expected output:
Usage: eulerweave [OPTIONS] COMMAND [ARGS]...
Eulerflow: Manifest-centric data pipeline for LLM training.
Options:
--help Show this message and exit.
Commands:
new Scaffold a new manifest YAML.
validate Validate a manifest YAML.
plan Show execution plan and cost estimate.
run Execute a manifest pipeline.
export Export data to streaming formats.
plugins List installed extractor plugins.
dataset Dataset-related commands.
Check the installed extractor plugins:
eulerweave plugins list
Extractors:
txt eulerweave.core.io.local_files:TxtExtractor [.txt]
jsonl eulerweave.core.io.jsonl:JsonlExtractor [.jsonl]
csv eulerweave.core.io.csv_extractor:CsvExtractor [.csv]
parquet eulerweave.core.io.parquet_extractor:ParquetExtractor [.parquet]
html eulerweave.core.io.html_extractor:HtmlExtractor [.html]
pdf eulerweave.core.io.pdf_extractor:PdfExtractor [.pdf]
6 extractor(s) installed.
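The same verification can be done from Python, which is handy in CI scripts. This is a minimal sketch under one assumption: that the distribution is registered under the name "eulerweave". The helper `installed_version` is hypothetical.

```python
import shutil
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the package is installed under the distribution name "eulerweave".
    print("package version:", installed_version("eulerweave"))
    # shutil.which returns None when the CLI entry point is not on PATH.
    print("CLI entry point:", shutil.which("eulerweave"))
```

A missing version together with a missing entry point usually means the install ran outside the virtual environment.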
5. Run Tests
pytest tests/ -q
All tests should pass. Network/Ollama-related tests are automatically skipped if the required environment variables are not set.
To check coverage:
pytest tests/ -v --cov=eulerweave --cov-report=term-missing
6. Project Structure
eulerweave/
  eulerweave/
    cli/              # CLI commands (new, validate, plan, run, plugins, etc.)
    core/
      io/             # Extractors (jsonl, csv, parquet, html, pdf, txt)
      record.py       # CanonicalRecord Pydantic model
      stores/         # Storage backends
    blocks_builtin/   # 17 built-in blocks
    engine/           # Compiler, planner, local executor
    registry/         # Plugin discovery, block catalog
    spec/             # Manifest, types, block specs, metric schemas
    providers/        # LLM providers (Ollama, etc.)
  tests/
    unit/             # Unit tests
    cli/              # CLI tests
    integration/      # Integration tests
    fixtures/         # YAML test data
  docs/
    tutorials/        # This tutorial series
    architecture/     # Architecture documentation
    fixtures/         # Fixture index
  pyproject.toml      # Build config, dependencies, entry points
Built-in Blocks (17)
| Slot | Block | Description |
|---|---|---|
| normalize | normalize_text | Whitespace cleanup, encoding normalization |
| filter | heuristic_filter | Length/quality-based filtering |
| dedup | dedup_minhash | MinHash near-deduplication |
| dedup | dedup_exact | SHA-256 exact deduplication |
| metrics | metrics_text_basic | Text length distribution statistics |
| metrics | metrics_quality_heuristic | URL/repetition/character ratios |
| metrics | metrics_dedup_report | Deduplication report |
| metrics | metrics_task_format_validate | SFT message format validation |
| metrics | metrics_text_distribution | Text distribution analysis |
| metrics | metrics_language_heuristic | Language ratio analysis |
| build_task | build_sft_messages | Map existing fields to SFT format |
| build_task | build_sft_qna | LLM-based QnA generation (Ollama) |
| build_task | build_langextract_qna | LangExtract-style QnA generation |
| export | export_jsonl | JSONL export |
| export | export_parquet | Parquet export |
| export | export_mds | MDS streaming export |
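To make the "SHA-256 exact deduplication" entry concrete, here is a minimal sketch of the general technique. It is not eulerweave's dedup_exact implementation: the function name `drop_exact_duplicates` is hypothetical, and it assumes records are plain strings compared on their UTF-8 bytes.

```python
import hashlib


def drop_exact_duplicates(texts):
    """Keep the first occurrence of each text, comparing SHA-256 digests."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    print(drop_exact_duplicates(["a", "b", "a"]))  # ['a', 'b']
```

Storing digests instead of full texts keeps memory use bounded per record, but it only catches byte-identical duplicates. Near-duplicates are what the MinHash block in the table is for.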
Built-in Extractors (6)
| Name | Supported Format | Options |
|---|---|---|
| txt | .txt | none |
| jsonl | .jsonl | none |
| csv | .csv | text_column, delimiter |
| parquet | .parquet | text_column |
| html | .html | none |
| pdf | .pdf | strategy (auto/text/ocr), page_range |
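The table above amounts to a lookup from file extension to extractor name. The sketch below illustrates that idea only; how eulerweave actually resolves extractors may differ, and both `EXTRACTOR_BY_SUFFIX` and `extractor_for` are hypothetical names.

```python
from pathlib import Path

# Extension-to-extractor mapping copied from the table above.
EXTRACTOR_BY_SUFFIX = {
    ".txt": "txt",
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
    ".html": "html",
    ".pdf": "pdf",
}


def extractor_for(path):
    """Return the extractor name for a file path, or None if it is unsupported."""
    return EXTRACTOR_BY_SUFFIX.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["corpus.jsonl", "report.PDF", "notes.md"]:
        print(name, "->", extractor_for(name))
```

A lookup like this is useful for pre-checking an input directory before writing a manifest, since unsupported files show up as None.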
Troubleshooting
eulerweave: command not found
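The CLI is only on your PATH while the virtual environment is active. Activate it, then reinstall if the command is still missing:
source .venv/bin/activate
pip install -e ".[dev]"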
ModuleNotFoundError: No module named 'pdfminer'
Install the PDF support dependency:
pip install -e ".[pdf]"
ImportError: ocrmypdf (when using OCR)
Install the OCR dependency:
pip install -e ".[pdf_ocr]"
pyarrow import error
Install the Parquet dependency:
pip install -e ".[parquet]"
Next Steps
- Tutorial 1: Quick Start — From creating your first manifest to JSONL output
- Tutorial 2: PDF to Training Data — Convert local PDF documents into SFT training data