Tutorial 0: Installation and Development Environment Setup

This guide walks you through installing eulerweave and configuring your development environment.


Prerequisites

Requirement   Minimum Version   Notes
Python        3.11+             3.12 recommended
pip           23.0+             Included by default in recent Python
git           2.x               For cloning the repository

Check your Python version:

python --version
# Python 3.12.x   (3.11 or later required)
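
The prerequisites can also be checked from Python itself. The following is a minimal sketch, not part of eulerweave: the helper names python_ok and missing_tools are made up for illustration, and it only verifies the Python minimum from the table and that git is on PATH.

```python
import shutil
import sys

# Minimum Python version taken from the prerequisites table.
MIN_PYTHON = (3, 11)


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= minimum


def missing_tools(tools=("git",)):
    """Return the names of command-line tools that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing tools:", missing_tools() or "none")
```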

1. Clone the Repository

git clone https://github.com/baida21/eulerweave.git
cd eulerweave

2. Create a Virtual Environment

python -m venv .venv
source .venv/bin/activate    # Linux / macOS
# .venv\Scripts\activate     # Windows
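
If you are unsure whether the environment is active, one way to check is to compare sys.prefix with sys.base_prefix, which differ inside a venv. The helper below is an illustrative sketch; in_virtualenv is a hypothetical name, not an eulerweave function.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter is running inside a venv.

    Inside a virtual environment sys.prefix points at the environment
    directory while sys.base_prefix points at the base installation.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
```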

3. Installation

For development, we recommend installing with all optional dependencies:

pip install -e ".[dev]"

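This command installs eulerweave in editable mode together with every package in the all group (everything except pdf_ocr), plus pytest, pytest-cov, and jsonschema.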

Optional Dependency Groups

If you only need a subset of features, you can install individual groups:

Group     Installed Packages                       When Needed
llm       httpx>=0.25                              LLM-based QnA generation via Ollama (build_sft_qna, build_langextract_qna)
parquet   pyarrow>=14.0                            Parquet/MDS export (export_parquet, export_mds)
dedup     datasketch>=1.6                          MinHash-based deduplication (dedup_minhash)
pdf       pdfminer.six>=20221105                   PDF text extraction (type: pdf input)
pdf_ocr   pdfminer.six + ocrmypdf>=16.0            PDF OCR extraction (strategy: ocr)
all       All packages above (excluding pdf_ocr)   Full functionality without test tools
dev       all + pytest + pytest-cov + jsonschema   Full development environment

Examples:

# Core + PDF support only
pip install -e ".[pdf]"

# Core + PDF with OCR
pip install -e ".[pdf_ocr]"

# Core + LLM + PDF (PDF pipeline including QnA generation)
pip install -e ".[llm,pdf]"

# Everything (same as dev, minus the test dependencies)
pip install -e ".[all]"
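
To see which optional groups are already usable in the current environment, you can probe for their import names. This is a rough sketch under stated assumptions: the mapping is derived from the package table in this section (pdfminer.six is imported as pdfminer), the helper missing_modules is hypothetical, and it only checks that a module can be found, not that its version satisfies the minimum.

```python
from importlib.util import find_spec

# Import names for the packages listed per optional group.
# Note: the pdfminer.six distribution is imported as "pdfminer".
GROUP_MODULES = {
    "llm": ["httpx"],
    "parquet": ["pyarrow"],
    "dedup": ["datasketch"],
    "pdf": ["pdfminer"],
    "pdf_ocr": ["pdfminer", "ocrmypdf"],
}


def missing_modules(group, group_modules=GROUP_MODULES):
    """Return the modules of a group that cannot be found for import."""
    return [name for name in group_modules[group] if find_spec(name) is None]


if __name__ == "__main__":
    for group in GROUP_MODULES:
        missing = missing_modules(group)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{group:8} {status}")
```

A group reported as missing can be installed with the matching pip install -e ".[group]" command.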

4. Verify Installation

eulerweave --help

Expected output:

 Usage: eulerweave [OPTIONS] COMMAND [ARGS]...

 Eulerflow: Manifest-centric data pipeline for LLM training.

Options:
  --help  Show this message and exit.

Commands:
  new        Scaffold a new manifest YAML.
  validate   Validate a manifest YAML.
  plan       Show execution plan and cost estimate.
  run        Execute a manifest pipeline.
  export     Export data to streaming formats.
  plugins    List installed extractor plugins.
  dataset    Dataset-related commands.

Check the installed extractor plugins:

eulerweave plugins list
Extractors:
  txt      eulerweave.core.io.local_files:TxtExtractor          [.txt]
  jsonl    eulerweave.core.io.jsonl:JsonlExtractor               [.jsonl]
  csv      eulerweave.core.io.csv_extractor:CsvExtractor         [.csv]
  parquet  eulerweave.core.io.parquet_extractor:ParquetExtractor [.parquet]
  html     eulerweave.core.io.html_extractor:HtmlExtractor       [.html]
  pdf      eulerweave.core.io.pdf_extractor:PdfExtractor         [.pdf]

6 extractor(s) installed.

5. Run Tests

pytest tests/ -q

All tests should pass. Network/Ollama-related tests are automatically skipped if the required environment variables are not set.

To check coverage:

pytest tests/ -v --cov=eulerweave --cov-report=term-missing

6. Project Structure

eulerweave/
  eulerweave/
    cli/                    # CLI commands (new, validate, plan, run, plugins, etc.)
    core/
      io/                   # Extractors (jsonl, csv, parquet, html, pdf, txt)
      record.py             # CanonicalRecord Pydantic model
      stores/               # Storage backends
    blocks_builtin/         # 17 built-in blocks
    engine/                 # Compiler, planner, local executor
    registry/               # Plugin discovery, block catalog
    spec/                   # Manifest, types, block specs, metric schemas
    providers/              # LLM providers (Ollama, etc.)
  tests/
    unit/                   # Unit tests
    cli/                    # CLI tests
    integration/            # Integration tests
    fixtures/               # YAML test data
  docs/
    tutorials/              # This tutorial series
    architecture/           # Architecture documentation
    fixtures/               # Fixture index
  pyproject.toml            # Build config, dependencies, entry points

Built-in Blocks (17)

Slot         Block                          Description
normalize    normalize_text                 Whitespace cleanup, encoding normalization
filter       heuristic_filter               Length/quality-based filtering
dedup        dedup_minhash                  MinHash near-deduplication
dedup        dedup_exact                    SHA-256 exact deduplication
metrics      metrics_text_basic             Text length distribution statistics
metrics      metrics_quality_heuristic      URL/repetition/character ratios
metrics      metrics_dedup_report           Deduplication report
metrics      metrics_task_format_validate   SFT message format validation
metrics      metrics_text_distribution      Text distribution analysis
metrics      metrics_language_heuristic     Language ratio analysis
build_task   build_sft_messages             Map existing fields to SFT format
build_task   build_sft_qna                  LLM-based QnA generation (Ollama)
build_task   build_langextract_qna          LangExtract-style QnA generation
export       export_jsonl                   JSONL export
export       export_parquet                 Parquet export
export       export_mds                     MDS streaming export

Built-in Extractors (6)

Name      Supported Format   Options
txt       .txt               -
jsonl     .jsonl             -
csv       .csv               text_column, delimiter
parquet   .parquet           text_column
html      .html              -
pdf       .pdf               strategy (auto/text/ocr), page_range
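
The table amounts to a lookup from file extension to extractor name. As a small illustration, that lookup can be written as follows; extractor_name_for is a hypothetical helper for this tutorial, not how eulerweave itself resolves extractors.

```python
from pathlib import Path

# Extension -> extractor name, copied from the built-in extractors table.
EXTRACTOR_BY_SUFFIX = {
    ".txt": "txt",
    ".jsonl": "jsonl",
    ".csv": "csv",
    ".parquet": "parquet",
    ".html": "html",
    ".pdf": "pdf",
}


def extractor_name_for(path):
    """Return the extractor name for a file path, or None if unsupported."""
    return EXTRACTOR_BY_SUFFIX.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for sample in ("corpus/train.jsonl", "papers/report.PDF", "notes.md"):
        print(sample, "->", extractor_name_for(sample))
```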

Troubleshooting

eulerweave: command not found

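The virtual environment is probably not active, or the package is not installed in it. Activate the environment and reinstall:

source .venv/bin/activate
pip install -e ".[dev]"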

ModuleNotFoundError: No module named 'pdfminer'

Install the PDF support dependency:

pip install -e ".[pdf]"

ImportError: ocrmypdf (when using OCR)

Install the OCR dependency:

pip install -e ".[pdf_ocr]"

pyarrow import error

Install the Parquet dependency:

pip install -e ".[parquet]"
