Tutorial 11: PII Safety

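eulerweave provides two blocks for detecting and masking personally identifiable information (PII) in training data:

metrics_pii_detect: detects PII patterns and reports statistics (metrics slot, pass-through).
filter_pii_redact: replaces PII patterns with mask tokens (filter slot, transform).

Prerequisite: you should have completed Tutorial 8: Metrics Blocks.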

Detected PII types

Type	Mask token	Example pattern
Email	`[EMAIL]`	`user@example.com`
Phone number	`[PHONE]`	`010-1234-5678`
SSN	`[SSN]`	`123-45-6789`
Credit card	`[CREDIT_CARD]`	`1234-5678-9012-3456`
IP address	`[IP_ADDRESS]`	`192.168.1.1`
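
To make the table concrete, here is a minimal Python sketch of pattern-based detection. The regular expressions and the detect_pii helper are assumptions written for illustration. They only cover the example formats in the table and are not eulerweave's actual patterns.

```python
import re

# Hypothetical patterns for illustration only; eulerweave's real expressions may differ.
PII_PATTERNS = {
    "[EMAIL]": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "[PHONE]": re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"),
    "[SSN]": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "[CREDIT_CARD]": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
    "[IP_ADDRESS]": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect_pii(text):
    """Return a dict mapping each mask token to the substrings its pattern matched."""
    found = {}
    for token, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[token] = matches
    return found


if __name__ == "__main__":
    print(detect_pii("Mail user@example.com or call 010-1234-5678"))
```

Patterns this simple will both over-match and under-match real data, so the sketch is best read as an explanation of the table rather than a replacement for the blocks.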

Step 1: PII detection only (reporting)

First, check how much PII the data contains:

version: 1
track: sft
inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: m_pii
    type: metrics_pii_detect
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

eulerweave run manifest.yaml --input data/train.jsonl

Check the result:

cat artifacts/{run_id}/metrics/metrics_pii_detect.json

If records_with_pii > 0, add masking in the next step.
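
The same check can be scripted. The sketch below assumes the metrics file is a JSON object with records_with_pii as a top-level key. The report layout is not documented in this tutorial, so adjust the lookup to match your actual file. The needs_redaction helper is a hypothetical name.

```python
import json
from pathlib import Path


def needs_redaction(metrics_path):
    """Return True when the report shows at least one record containing PII.

    Assumes a top-level "records_with_pii" key; a missing key is treated as 0.
    """
    report = json.loads(Path(metrics_path).read_text(encoding="utf-8"))
    return report.get("records_with_pii", 0) > 0
```

Call it with the path to metrics_pii_detect.json for your run, for example from a CI step that decides whether to switch to the masking manifest.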

Step 2: Add PII masking

Add filter_pii_redact to the filter slot. This block transforms every record and returns it (it does not drop any records):

version: 1
track: sft
inputs:
  - type: jsonl
    uri: data/train.jsonl

pipeline:
  - id: norm1
    type: normalize_text
    slot: normalize
    input_type: TextDocument
    output_type: TextDocument

  - id: pii1
    type: filter_pii_redact
    slot: filter
    input_type: TextDocument
    output_type: TextDocument

  - id: m_pii
    type: metrics_pii_detect
    slot: metrics
    input_type: TextDocument
    output_type: TextDocument

  - id: sft1
    type: build_sft_messages
    slot: build_task
    input_type: TextDocument
    output_type: SFTMessages

  - id: exp1
    type: export_jsonl
    slot: export
    input_type: SFTMessages
    output_type: ExportedDataset

exports:
  - type: jsonl
    path: out/result.jsonl

Key points:

Place filter_pii_redact in the filter slot (after normalize, before metrics).
Running metrics_pii_detect after masking lets you confirm that the masking was applied correctly.
For each masked record, annotations["pii_redacted"] is set to True.
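
The transform behavior described in the key points can be sketched as follows. This is an illustration under stated assumptions, not the filter_pii_redact implementation: the record is modeled as a plain dict with "text" and "annotations" keys, the patterns are hypothetical, and leaving pii_redacted unset on untouched records is a guess.

```python
import re

# Hypothetical patterns, ordered so the longest digit groups are replaced first.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[CREDIT_CARD]", re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")),
    ("[IP_ADDRESS]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
]


def redact_record(record):
    """Return a copy of the record with PII replaced by mask tokens.

    The record is always returned, never dropped. The annotation is set
    only when at least one replacement happened (an assumption).
    """
    text = record["text"]
    replacements = 0
    for token, pattern in PII_PATTERNS:
        text, count = pattern.subn(token, text)
        replacements += count
    annotations = dict(record.get("annotations", {}))
    if replacements:
        annotations["pii_redacted"] = True
    return {**record, "text": text, "annotations": annotations}


if __name__ == "__main__":
    sample = {"text": "Contact user@example.com or call 010-1234-5678"}
    print(redact_record(sample))
```

Because the function maps every input record to exactly one output record, the record count is unchanged by masking, which matches the transform role of the filter slot here.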

Step 3: Verify the masking result

eulerweave run manifest.yaml --input data/train.jsonl

If records_with_pii is 0 in the PII detection results after masking, the masking succeeded.

Confirm the masking in the output data:

head -1 artifacts/{run_id}/output/*.jsonl
# {"text": "Contact [EMAIL] or call [PHONE]", ...}
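
To look beyond the first line, a short script can count the mask tokens across the whole output. This sketch assumes each output line is a JSON object with a "text" field, as in the sample above. The count_mask_tokens helper is a hypothetical name, and exported SFT records may use a different field layout.

```python
import json
from collections import Counter

MASK_TOKENS = ("[EMAIL]", "[PHONE]", "[SSN]", "[CREDIT_CARD]", "[IP_ADDRESS]")


def count_mask_tokens(lines):
    """Count occurrences of each mask token in the "text" field of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        text = json.loads(line).get("text", "")
        for token in MASK_TOKENS:
            counts[token] += text.count(token)
    return counts
```

Pass an open file handle for one of the output files to count_mask_tokens. Token counts that roughly match the detections reported in Step 1 suggest the masking covered what was found.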

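Next steps

Tutorial 8: Metrics Blocks (complete guide to metrics blocks)
Tutorial 12: Debugging Bad Records (data schema validation)
Architecture: Engine (how block execution works internally)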