EulerWeave

LLM Training Data Pipeline Platform

A manifest-driven data pipeline that transforms raw data into high-quality LLM training data. Perform reproducible data processing with declarative YAML definitions.

Open Source

Key Features

Three core pillars of EulerWeave: Data Sources, Processing Blocks, Production Outputs

Diverse Data Sources

  • Local: JSONL, CSV, Parquet, TXT, HTML, PDF
  • Remote: HuggingFace Datasets, HuggingFace Hub, HTTPS, AWS S3
  • Extensible plugin system

17+ Data Processing Blocks

  • Normalization & Filtering: normalize_text, heuristic_filter
  • Deduplication: MinHash, SHA-256
  • SFT Task Build: LLM-based QnA generation
  • 13+ Metric Blocks: Perplexity, PII, Repetition, Gibberish, etc.
  • PII detection and masking

Production-Ready Outputs

  • JSONL: OpenAI Chat format compatible
  • Parquet: For large-scale analytics
  • MDS/StreamingDataset: Optimized for distributed training
  • Compatible: Ollama, vLLM, TRL, HuggingFace Transformers

Pipeline Tracks

Choose the right track for your use case to process data

Track Purpose Description
pretrain Pre-training Normalize and refine web crawl data
sft Supervised Fine-Tuning Convert PDFs/documents into QnA training data
dpo Preference Learning Prepare comparison data in DPO format

CLI Quickstart

Create, validate, and run data pipelines with a single command

# Create a new manifest
eulerweave new manifest.yaml --track sft

# Validate the manifest
eulerweave validate manifest.yaml

# Preview execution plan
eulerweave plan manifest.yaml --records 10000

# Run the pipeline
eulerweave run manifest.yaml --input data/train.jsonl --artifacts ./artifacts

# Export to MDS format
eulerweave export mds out/result.jsonl ./output/mds/ --shard-size 2000

# List plugins
eulerweave plugins list

CLI Reference

EulerWeave CLI command list

Command Description
eulerweave new Create a new manifest YAML
eulerweave validate Validate manifest
eulerweave plan Preview execution plan and estimated cost
eulerweave run Run the pipeline
eulerweave export Export results to various formats
eulerweave plugins list List installed plugins
eulerweave plugins doctor Diagnose plugins

Built-in Block List

17+ data processing blocks included in EulerWeave

Normalization & Filtering

Block Purpose
normalize_text Whitespace cleanup, encoding normalization
heuristic_filter Length and quality-based filtering

Deduplication

Block Purpose
dedup_minhash MinHash-based approximate deduplication
dedup_exact SHA-256 exact deduplication

Task Build (SFT)

Block Purpose
build_sft_messages Generate SFT format via field mapping
build_sft_qna LLM-based multi-turn QnA generation
build_langextract_qna LangExtract-style QnA generation

Metrics

Block Purpose
metrics_text_basic Length distribution, character set statistics
metrics_text_repetition n-gram duplication detection
metrics_text_gibberish Gibberish and encoding anomaly detection
metrics_text_boilerplate Web boilerplate detection
metrics_perplexity Transformers-based text quality scoring
metrics_pii_detect Email, phone, SSN, credit card detection
metrics_token_stats Tokenization statistics
metrics_record_schema_validate Data integrity validation

PII & Export

Block Purpose
filter_pii_redact PII detection and masking
export_jsonl JSONL output
export_parquet Parquet output
export_mds MDS streaming format

Tutorials

Learn EulerWeave quickly with step-by-step guides

Tutorials coming soon.

Manifest Example

A complete pipeline manifest that generates SFT training data from PDF

version: 1 track: sft inputs: - type: pdf uri: data/technical_manual.pdf options: strategy: auto pipeline: - id: normalize type: normalize_text slot: normalize - id: filter type: heuristic_filter slot: filter params: min_length: 100 - id: dedup type: dedup_exact slot: dedup - id: qna type: build_sft_qna slot: build_task params: model: "qwen3:32b" base_url: "http://localhost:11434" - id: export type: export_jsonl slot: export exports: - type: jsonl path: out/training_data.jsonl

Installation & Getting Started

Install EulerWeave and run your first pipeline

Installation

pip install eulerweave

# Full feature installation
pip install eulerweave[pdf,llm,parquet]

Requirements

Python 3.11+

Start Your Data Pipeline with EulerWeave

Open source, declarative YAML definitions, reproducible data processing.

Get Started on GitHub Contact Us