튜토리얼 7: MDS 내보내기 — Mosaic Data Shard 형식

이 튜토리얼에서는 MDS (Mosaic Data Shard) 형식으로 훈련 데이터를 내보내는 방법을 설명합니다. MDS는 MosaicML StreamingDataset에서 사용하는 스트리밍 형식으로, 분산 훈련에 최적화되어 있습니다.

사전 요구 사항: 튜토리얼 1: 빠른 시작을 완료했으며, JSONL 출력이 있어야 합니다. [parquet] extra 설치 필요 (pip install -e ".[parquet]").

MDS란?

MDS는 데이터를 고정 크기의 샤드로 분할하고, 스키마/레이아웃을 기술하는 index.json 매니페스트와 함께 구성합니다.

주요 장점:

스트리밍: 첫 번째 샤드 다운로드 즉시 훈련 시작 가능
샤딩: 노드 간 분산 가능한 청크로 분할
재현성: index.json이 정확한 샤드 경계를 기록
효율성: 바이너리 인코딩으로 JSONL보다 컴팩트하고 빠른 디코딩

MDS 디렉토리 구조

output/mds/
  index.json            # 스키마 + 샤드 매니페스트
  shard.00000.mds       # 첫 번째 샤드 (바이너리)
  shard.00001.mds       # 두 번째 샤드
  ...

`eulerweave export mds` 사용법

기본 사용법

eulerweave export mds data/train.jsonl ./output/mds/ --shard-size 1000

매개변수	필수	기본값	설명
입력 파일	예	—	JSONL 파일 경로
출력 디렉토리	예	—	MDS 출력 디렉토리
`--shard-size`	아니요	`10000`	샤드당 레코드 수

예시

# 레코드 수 확인
wc -l out/sft_result.jsonl
# 8500 out/sft_result.jsonl

# 샤드당 2000 레코드로 내보내기
eulerweave export mds out/sft_result.jsonl ./output/mds/ --shard-size 2000

[eulerweave] Exporting to MDS format ...
[eulerweave]   Input: out/sft_result.jsonl (8,500 records)
[eulerweave]   Shard size: 2,000 records per shard
[eulerweave]   Writing shard 0 (2,000 records) ...
[eulerweave]   Writing shard 1 (2,000 records) ...
[eulerweave]   Writing shard 2 (2,000 records) ...
[eulerweave]   Writing shard 3 (2,000 records) ...
[eulerweave]   Writing shard 4 (500 records) ...
[eulerweave] Export complete: 5 shards, 8,500 records total.

매니페스트 파이프라인에서 MDS 사용

export_jsonl 대신 export_mds를 사용합니다:

version: 1
track: sft

inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_mds
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset
    params:
      shard_size: 2000

exports:
  - type: mds
    path: out/mds/

샤딩 전략

샤드 크기 선택

시나리오	권장 크기	근거
소규모 (<10K)	1000-5000	샤드 수 최소화
중규모 (10K-1M)	5000-10000	병렬성/오버헤드 균형
대규모 (>1M)	10000-50000	관리 가능한 샤드 수
다중 노드	5000-20000	워커당 ~1 샤드

예상 샤드 수

shard_count = ceil(total_records / shard_size)

8,500 레코드 / 2,000 = 5 샤드 (마지막 샤드 500 레코드)

MDS 출력 검증

파일 무결성 확인

import json
from pathlib import Path

mds_dir = Path("./output/mds")
with open(mds_dir / "index.json") as f:
    index = json.load(f)

total = sum(s["samples"] for s in index["shards"])
print(f"Shards: {len(index['shards'])}, Total samples: {total}")

for i, shard in enumerate(index["shards"]):
    path = mds_dir / shard["raw_data"]["basename"]
    ok = path.exists() and path.stat().st_size == shard["raw_data"]["bytes"]
    print(f"  shard {i}: {shard['samples']} samples  [{'OK' if ok else 'FAIL'}]")

MDS에서 샘플 읽기

pip install mosaicml-streaming

from streaming import LocalDataset

dataset = LocalDataset(local="./output/mds")
print(f"Total: {len(dataset)}")
print(f"First: {dataset[0]}")

압축

params:
  shard_size: 5000
  compression: "zstd"

알고리즘	속도	압축비	비고
`null`	가장 빠름	1:1	기본값, 로컬 훈련
`"zstd"`	빠름	~3:1	클라우드 스토리지 권장
`"snappy"`	빠름	~2:1	저지연 읽기

형식 비교

기능	JSONL	Parquet	MDS
사람이 읽을 수 있음	예	아니요	아니요
스트리밍	제한적	아니요	예
샤드 기반	아니요	아니요	예
분산 훈련	수동	수동	내장
파일 크기	큼	작음	중간

종합 예시: PDF → QnA → MDS

version: 1
track: sft

inputs:
  - type: pdf
    uri: data/manual.pdf
    options:
      strategy: auto

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: filter1
    type: heuristic_filter
    slot: filter
    input_type: TextDocument
    output_type: TextDocument
    params:
      min_length: 100

  - id: qna1
    type: build_langextract_qna
    slot: build_task
    input_type: TextDocument
    output_type: SFTQnA
    params:
      model: "gpt-oss:20b"

  - id: exp1
    type: export_mds
    slot: export
    input_type: SFTQnA
    output_type: ExportedDataset
    params:
      shard_size: 1000

exports:
  - type: mds
    path: out/mds/

profile:
  cpu_only: false
  allow_external_llm: true

eulerweave validate manifest.yaml
eulerweave run manifest.yaml --input data/manual.pdf --artifacts ./artifacts
ls out/mds/
# index.json  shard.00000.mds

트러블슈팅

`ModuleNotFoundError: No module named 'pyarrow'`

pip install -e ".[parquet]"

빈 샤드 파일

입력이 비어있거나 모든 레코드가 필터링된 경우. 파이프라인 필터 설정을 확인하세요.

다음 단계

튜토리얼 8: 메트릭 — 파이프라인 품질 통계
튜토리얼 2: PDF → 훈련 데이터 — PDF 파이프라인 상세