Implement dataset version control with MLOps.


If you can’t answer “which exact data trained this model?” in seconds, your pipeline isn’t reproducible. Dataset version control fixes that by turning training data into a tracked, reviewable, auditable artifact—just like code.

Topics: data versioning · DVC · lakeFS · Delta/Iceberg · lineage & traceability · CI/CD for data

Practical guide: workflows, tool choices, and a copy-paste implementation path you can adapt to your stack (S3/GCS/Azure, MLflow, GitHub/GitLab, etc.).

Email Bastelia for an implementation plan · Explore AI integration & implementation services

Key takeaways you can apply immediately

  • Version more than files: labels, splits, preprocessing code, schemas, and quality reports must be tied to the same dataset ID.
  • Make dataset IDs unavoidable: every experiment run and every deployed model should store a dataset version reference.
  • Prefer “immutable snapshots”: new data becomes a new version; you never silently overwrite training inputs.
  • Automate quality gates: block releases when the dataset schema changes, leakage is detected, or drift crosses thresholds.
  • Choose tools by data shape: files (DVC/lakeFS) vs tables (Delta/Iceberg) vs experiment artifacts (W&B/MLflow).

What dataset version control is (and what it isn’t)

Dataset version control is the practice of creating explicit, recoverable versions of the data used to train, validate, and test ML systems. Each version is identifiable (by a tag, commit, hash, or semantic version), and can be reconstructed later—so results are reproducible and changes are traceable.

It’s not just “keeping a backup” or renaming files like final_final_v7.csv. A proper approach gives you:

  • Traceability: who changed what, when, and why (and what downstream models were affected).
  • Reproducibility: re-run training with the same data and get comparable results (within expected stochastic variation).
  • Fast rollback: ship a previous dataset + model pair when a new data release breaks production.
  • Safer collaboration: teams can branch/experiment without corrupting a shared dataset.
  • Auditability: the dataset used for a decision can be proven and inspected later.

[Image: engineer in a data center interacting with holographic data streams]
Dataset version control is about governed changes, reproducible snapshots, and a clear link between data → training → deployment.

What you should version in real MLOps (beyond “the dataset file”)

In practice, “the dataset” is a bundle of assets that must stay consistent. If any of these change without a new version, reproducibility breaks.

1) Raw inputs (as received)

Keep a referenceable snapshot of raw data (or a reversible ingestion log). This helps you prove provenance and debug upstream issues. For regulated contexts, raw snapshots are often required for traceability—even if training uses a curated view.

2) Labels and labeling rules

Labels are data. Version them with the same discipline as code. Track labeling guidelines, annotator changes, and “label fixes” as versioned releases.

3) Train/validation/test splits

Changing a split changes results. Persist split definitions (files, IDs, or deterministic seeds) and treat them as part of the dataset version.
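
One way to make this concrete is to derive splits from a stable hash and persist the assignment file alongside the dataset version. A minimal Python sketch (the record IDs, ratios, and file name are illustrative):

import hashlib
import json

def assign_split(record_id: str, seed: str = "split-v1") -> str:
    # deterministic: the same ID + seed always lands in the same split
    bucket = int(hashlib.sha256(f"{seed}:{record_id}".encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    return "val" if bucket < 90 else "test"

record_ids = ["a-001", "a-002", "b-001"]  # illustrative record IDs
splits = {rid: assign_split(rid) for rid in record_ids}

# persist the split definition as part of the dataset version
with open("splits.json", "w") as f:
    json.dump(splits, f, indent=2, sort_keys=True)

Hash-based assignment also keeps records stable across versions: newly ingested records get a split without reshuffling the ones already assigned.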

4) Transformations + preprocessing code

Dataset version control fails if the transformation pipeline isn’t pinned. Always tie preprocessing code versions to the dataset version (Git commit + pipeline lockfile).

5) Schemas + expectations

Track schema (columns, types, ranges) and quality checks (null thresholds, category sets, outlier rules). These are the guardrails that prevent “silent breakage.”
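
As a minimal sketch of such a gate in plain Python (the schema, allowed categories, and null threshold are illustrative assumptions):

SCHEMA = {"age": int, "income": float, "country": str}  # illustrative schema
ALLOWED_COUNTRIES = {"DE", "AT", "CH"}
MAX_NULL_RATE = 0.05

def validate(rows: list[dict]) -> list[str]:
    # returns a list of violations; an empty list means the gate passes
    errors, nulls = [], 0
    for i, row in enumerate(rows):
        for col, typ in SCHEMA.items():
            value = row.get(col)
            if value in (None, ""):
                nulls += 1
                continue
            try:
                typ(value)
            except (TypeError, ValueError):
                errors.append(f"row {i}: {col}={value!r} is not a valid {typ.__name__}")
        country = row.get("country")
        if country and country not in ALLOWED_COUNTRIES:
            errors.append(f"row {i}: unexpected country {country!r}")
    if rows and nulls / (len(rows) * len(SCHEMA)) > MAX_NULL_RATE:
        errors.append("null rate above threshold")
    return errors

In practice many teams implement these checks with tools like Great Expectations or Pandera; the key design choice is that they run before a version is blessed, not after.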

6) Documentation & metadata

Add a dataset “card”: purpose, time range, sources, known limitations, PII handling, and intended use. This boosts handoff quality and reduces misuse.
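
A lightweight way to do this is a machine-readable card committed next to the snapshot. A sketch (field names and values are illustrative):

import json

dataset_card = {
    "dataset_version_id": "data-v1.0.0",
    "purpose": "training data for churn prediction",
    "time_range": "2025-01-01 to 2025-12-31",
    "sources": ["crm_export", "billing_events"],
    "known_limitations": ["no records for trial accounts"],
    "pii_handling": "emails hashed, free-text fields removed",
    "intended_use": "training/evaluation of churn models only",
}

with open("dataset_card.json", "w") as f:
    json.dump(dataset_card, f, indent=2)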

Tooling options: pick the right approach for your data

There isn’t one universal tool because data comes in different shapes (files, tables, streams) and teams need different collaboration models. The best implementation is often a hybrid that keeps a single dataset ID flowing across your stack.

Option A — DVC (Data Version Control) for file-based datasets

Best when your training data is stored as files (CSV/Parquet/images/audio), you already use Git, and you want a developer-friendly workflow. DVC keeps lightweight pointers in Git while storing heavy files in remote storage (S3/GCS/Azure Blob/NFS).

  • Great for: data science teams, reproducible experiments, pipeline stages, dataset registries.
  • Watch out for: organizational scaling if you need branching over massive shared object stores with many concurrent users.

Option B — lakeFS for Git-like branching on object storage

Best when you have a data lake in object storage and need branch/commit semantics for large-scale datasets without copying everything. Great for isolation, collaboration, and audit trails across teams.

  • Great for: data engineering + ML teams sharing large datasets; regulated audit trails; multiple concurrent experiments.
  • Watch out for: operational setup and governance decisions (who can branch, merge rules, retention).

Option C — Delta Lake / Apache Iceberg (table versioning & time travel)

Best when your “dataset” is really a set of tables in a lakehouse and you need ACID guarantees, schema evolution, and time travel. Ideal for structured data pipelines feeding feature generation.

  • Great for: analytics-grade tables, ETL/ELT pipelines, consistent feature computation.
  • Watch out for: mixing unstructured training files and table snapshots—define your dataset ID strategy early.

Option D — Experiment artifact systems (W&B / MLflow artifacts) for “what trained what”

Best for recording the exact dataset reference (URI, commit, hash) used by each run, plus metrics, confusion matrices, and evaluation reports. This isn’t a replacement for data lake versioning—but it closes the loop between data and models.

[Image: governed data lake with versioned storage and lineage]
At scale, dataset version control becomes a platform capability: versioned storage + governance + traceability across pipelines.

Reference architecture for dataset version control in MLOps

A simple way to think about this: you need a dataset version ID that survives every handoff. Whether your tool is DVC, lakeFS, or a lakehouse table version, the goal is the same: one identifier that you can attach to training, evaluation, and deployment records.

Core components

  • Versioned storage: object store or lakehouse where snapshots/commits live.
  • Metadata registry: dataset versions, owners, descriptions, lineage, retention policy.
  • Pipeline engine: orchestrates ingestion → validation → transformation → snapshot release.
  • Quality gates: schema checks, leakage checks, drift checks, and fairness/coverage checks.
  • Experiment tracking: every run logs the dataset version ID + code version + parameters.
  • Release gates: promotion rules to move from “candidate data” to “blessed training data.”

A practical “single source of truth” rule

Never train from “latest”. Train from an explicit dataset version (tag/commit/hash). If you want “latest”, define it as a tag that moves only through an approval pipeline.
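
A minimal sketch of that rule as a fail-fast guard at the top of a training job (the environment variable name is an assumption; use whatever your CI passes):

import os
import sys

def require_dataset_version() -> str:
    # refuse to train without an explicit, immutable dataset reference
    version = os.getenv("DATASET_VERSION_ID", "")
    if not version or version.lower() in {"latest", "head", "main"}:
        sys.exit("Refusing to train: set DATASET_VERSION_ID to an explicit tag/commit/hash.")
    return version

dataset_version = require_dataset_version()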

Step-by-step: implement dataset version control with Git + DVC

This workflow is intentionally simple and widely portable. You can use S3, GCS, Azure Blob, MinIO, SSH, or shared storage for the remote. The key outcome is that your Git commit references the dataset version, while the heavy data stays out of Git.

1) Initialize Git + DVC

git init
dvc init

2) Add your dataset (raw or curated)

dvc add data/train
git add data/train.dvc .gitignore
git commit -m "Track training dataset with DVC"

3) Configure a remote storage (example: S3)

dvc remote add -d storage s3://YOUR-BUCKET/path/to/dvc-cache
git add .dvc/config
git commit -m "Configure DVC remote"

4) Push the data to remote

dvc push

5) Tag a dataset release (recommended)

Use semantic versions if your team likes them (v1.0.0), or a timestamped release (data-2026-01-14). The important part: make it stable.

git tag -a data-v1.0.0 -m "Blessed training dataset v1.0.0"
git push --tags

6) Reproduce the dataset on any machine

git clone YOUR_REPO_URL
cd YOUR_REPO
git checkout data-v1.0.0
dvc pull

7) Branch to experiment safely

git checkout -b exp/new-cleaning
# adjust cleaning pipeline / filters
dvc add data/train
git add data/train.dvc
git commit -m "New cleaning rules for training data"
dvc push

Pro tip: “dataset version = (data commit) + (code commit)”

The most reliable approach is to store both: (1) the dataset snapshot reference (tag/commit/hash) and (2) the preprocessing code commit. That way you can rebuild the dataset and explain how it was produced.
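
A sketch of assembling that composite reference at training time, assuming a Git repo with a DVC-tracked directory (the .dvc parsing below is a deliberate simplification of the YAML pointer file):

import subprocess

def git_sha() -> str:
    # code version: the current Git commit
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def dvc_data_hash(dvc_file: str = "data/train.dvc") -> str:
    # data version: the md5 recorded in the DVC pointer file
    with open(dvc_file) as f:
        for line in f:
            stripped = line.strip()
            if stripped.startswith("- md5:") or stripped.startswith("md5:"):
                return stripped.split("md5:", 1)[1].strip()
    raise ValueError(f"no md5 found in {dvc_file}")

dataset_version = f"{dvc_data_hash()}+{git_sha()}"  # data commit + code commit
print(dataset_version)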

Connect dataset versions to every run and release

Dataset version control becomes valuable when it’s enforced: your training jobs and deployment pipelines must record dataset IDs automatically, or teams will “forget” under time pressure.

What to log for every training run

  • dataset_version_id (Git tag, DVC data hash, lakeFS commit, Delta/Iceberg snapshot ID)
  • dataset_uri (storage path / table name / branch)
  • code_version (Git commit SHA)
  • feature_pipeline_version (if features are computed separately)
  • evaluation_report_version (metrics artifact reference)

Minimal example (runnable training-script snippet; MLflow shown, other trackers work the same way)

import os

import mlflow

# read from your CI job, config, or environment; fail fast if unset
DATASET_VERSION = os.environ["DATASET_VERSION_ID"]
CODE_VERSION = os.environ["GIT_SHA"]

# log to your experiment tracker (MLflow here; W&B/Neptune expose equivalents)
mlflow.log_param("dataset_version_id", DATASET_VERSION)
mlflow.log_param("code_version", CODE_VERSION)

Deployment rule that prevents “mystery models”

Only allow promotion to production when a model artifact includes: dataset_version_id + code_version + evaluation report. This is a simple control that dramatically improves governance and debugging.
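
A sketch of that gate as a function your promotion pipeline can call (the metadata dict shape is an assumption; adapt it to your model registry):

REQUIRED_KEYS = ("dataset_version_id", "code_version", "evaluation_report")

def can_promote(model_metadata: dict) -> bool:
    # block promotion unless the artifact carries full lineage
    missing = [key for key in REQUIRED_KEYS if not model_metadata.get(key)]
    if missing:
        print(f"Promotion blocked, missing metadata: {', '.join(missing)}")
        return False
    return True

# this candidate would be rejected (no evaluation report attached)
can_promote({"dataset_version_id": "data-v1.0.0", "code_version": "9f2c1ab"})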

Quality gates, governance & auditability (without slowing teams down)

Versioning without validation can institutionalize bad data. The best MLOps setups treat quality checks as part of the dataset release process, so “blessed” datasets are trustworthy and repeatable.

Recommended quality gates before creating a new dataset version

  • Schema validation: types, required fields, category constraints.
  • Leakage checks: look-ahead fields, target leakage proxies, duplicate rows across splits.
  • Coverage checks: class balance, segment coverage, missingness by segment.
  • Statistical drift checks: compare new data vs the last “blessed” version (PSI/KL/KS for tabular features, embedding distances for text); a PSI sketch follows this list.
  • PII & security checks: detect sensitive fields, enforce access rules, log usage.
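
A minimal PSI sketch for one numeric feature, assuming NumPy (the 10-bin setup and the 0.2 alert threshold are common conventions, not universal rules):

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Population Stability Index between the blessed and the candidate sample
    edges = np.histogram_bin_edges(expected, bins=bins)  # bins from blessed data
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # floor the fractions to avoid log(0) and division by zero
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
blessed = rng.normal(0.0, 1.0, 10_000)
candidate = rng.normal(0.3, 1.0, 10_000)  # shifted distribution
if psi(blessed, candidate) > 0.2:
    print("Drift gate failed: do not bless this dataset version.")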

Audit trail essentials

  • Who approved the dataset release (and when).
  • What changed (diff summary, affected tables/partitions/files).
  • Which models were trained on it (reverse lineage).
  • How long the version is retained (retention policy).

If you operate in regulated environments or want “auditability by design”, see our Compliance & Legal Tech services.

Storage, cost control & retention policies

A common fear is “we’ll explode storage costs.” In reality, cost stays manageable when you treat dataset versions like releases: keep what you must keep, expire what you can, and avoid full duplication when tools support content-addressed storage or zero-copy branching.

Practical retention strategy

  • Keep forever: versions used in production + versions tied to externally audited decisions.
  • Keep 30–90 days: experimental branches and intermediate builds.
  • Expire automatically: orphaned versions not referenced by any active model or branch (a small sketch follows this list).
  • Compress where possible: parquet, image compression, column pruning, partitioning.
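
Identifying what to expire is essentially a set difference. A sketch, assuming you can list all stored versions and everything still referenced by active models or branches:

def expiry_candidates(all_versions: set[str], referenced: set[str], protected: set[str]) -> set[str]:
    # versions that nothing active points to and no retention policy protects
    return all_versions - referenced - protected

orphans = expiry_candidates(
    all_versions={"data-v1.0.0", "data-v1.1.0", "exp-2026-01-10"},
    referenced={"data-v1.1.0"},      # e.g. pulled from the model registry
    protected={"data-v1.0.0"},       # production / externally audited versions
)
print(sorted(orphans))  # ['exp-2026-01-10']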

Tip: budget for “quality, not just storage”

The biggest hidden costs usually come from rework (debugging, reruns, broken pipelines) and incident response, not raw storage. Strong versioning + validation reduces those expensive loops.

Common pitfalls (and quick fixes)

  • Pitfall: training from “latest” → Fix: require an explicit dataset version ID in every training job.
  • Pitfall: “data changed” but no one knows why → Fix: add change summaries + approvals for blessed releases.
  • Pitfall: labels drift silently → Fix: version labels + guidelines; run label distribution checks.
  • Pitfall: splits are regenerated differently → Fix: pin split files or deterministic seeds per version.
  • Pitfall: quality checks exist but are optional → Fix: CI gate that blocks version creation when checks fail.
  • Pitfall: too many tools, unclear ownership → Fix: define one dataset ID standard and a single “release” pipeline.

Implementation checklist (fast to execute)

  1. Define your dataset version ID (tag/commit/hash) and where it will be stored.
  2. Select tooling based on data shape (files vs tables) and team collaboration needs.
  3. Standardize dataset structure (raw/processed/splits/labels) with a clear convention.
  4. Automate dataset build (ingestion → transforms → snapshot) with reproducible steps.
  5. Add quality gates (schema, leakage, drift, coverage) before blessing a version.
  6. Log dataset version everywhere (experiment tracker, model registry, deployment metadata).
  7. Enforce promotion rules so models without dataset IDs can’t reach production.
  8. Set retention policies for branches/versions to control cost.
  9. Document the dataset (purpose, sources, known limits, PII handling) for safer reuse.
  10. Review monthly (what versions exist, what’s expired, what caused incidents, what to improve).

[Image: team collaborating over shared dataset dashboards]
Versioned datasets make collaboration safer: experiments become comparable, decisions become explainable, and rollbacks become routine.

Want this implemented in your stack?

If you share your storage (S3/GCS/Azure), orchestration tool, and experiment tracking setup, we’ll propose a practical dataset versioning workflow (tools, gates, retention, and rollout plan) that fits your constraints.

Email info@bastelia.com · See Data, BI & Analytics services

Useful links: AI consulting & implementation · Packages & pricing

FAQs about dataset version control in MLOps

What’s the difference between dataset versioning and model versioning?

Dataset versioning tracks the evolution of training/validation/test data (including labels, splits, and preprocessing artifacts). Model versioning tracks trained model artifacts and their parameters. In solid MLOps, each model version points to a specific dataset version ID.

Can we just use Git LFS to version datasets?

Git LFS can work for smaller datasets, but teams often hit friction when data grows or when they need richer workflows (pipelines, caching, registry patterns). Many teams use Git for code + a data-focused tool (like DVC or a lakehouse/lakeFS approach) for scalable dataset versioning.

How do we prevent training on the wrong dataset version?

Make dataset IDs mandatory in training jobs (CI/CD passes a version ID) and block promotions if the model artifact doesn’t include dataset_version_id. Also, avoid “latest” paths—use release tags or commits.

What should we store as the dataset “version ID”?

Use whatever is stable in your chosen tooling: a Git tag, DVC hash/lockfile reference, lakeFS commit, or Delta/Iceberg snapshot ID. The key is consistency: pick one format and log it everywhere (runs, models, deployments, dashboards).

How often should we create new dataset versions?

Create a new version whenever changes could affect model behavior: new data ingestion windows, label updates, preprocessing changes, schema evolution, or split changes. If you’re unsure, version it—then use retention rules to control cost.

How do we handle privacy and regulated requirements?

Use access controls on data storage, log dataset usage, and include PII checks and documentation in the dataset release pipeline. If you need audit-ready traceability (who approved what, what changed, which models are affected), build an approval + logging layer on top.

What’s a “blessed dataset”?

A blessed dataset is a version that passed validation gates (schema, leakage, drift thresholds, coverage) and is approved for training production models. This prevents experimental or partially validated data from accidentally becoming the default.

Which tool should we start with: DVC or lakeFS?

Start with DVC if you’re mostly file-based and want a lightweight Git-friendly workflow. Start with lakeFS if you’re managing large shared object stores with many concurrent users and need branching/merging semantics at scale.

This article is informational and not legal or compliance advice. For implementation support, contact info@bastelia.com.
