The Infinite Archives
Data science, DVC and Jupyter notebooks
In the heart of the Citadel, behind a door that few archivists know about, lies the Infinite Library. Its corridors stretch so far that no one has ever seen the end. The shelves groan under the weight of archives so vast that no ordinary scroll could contain them - immense datasets, complex predictive models, experiment grimoires spanning hundreds of thousands of pages.
The Archivist of Numbers awaits you at the entrance, a mechanical abacus in one hand, a graph covered in curves in the other.
"You've learned to version code. That's good. But code is only part of the story. Data, models, experiments - that's the real treasure of the Library. And these treasures are so large, so complex, that they require special tools. Follow me, I'll teach you the art of versioning the infinite."
The problem of massive data in Git
In a data science or machine learning project, you work with files very different from source code:
- Datasets: CSV, Parquet or JSON files of several GB
- Trained models: binary files (.pkl, .h5, .pt, .onnx) from several hundred MB to several GB
- Jupyter notebooks: .ipynb files that mix code, results and visualizations
- Training images: thousands (or even millions) of images for computer vision
- Experiment results: metrics, logs, generated graphs
Git is not designed to handle these volumes. Every version of a large binary file stays in the history: a 2 GB dataset committed in 5 different versions adds up to about 10 GB, and a 500 MB model retrained 20 times adds about 10 GB more. Very quickly, your repository becomes unmanageable.
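The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes every committed version is stored in full, with no useful compression or delta between versions, which is a worst case rather than a measurement of a real repository. The function name is illustrative.

```python
def history_size_gb(file_size_gb: float, versions: int) -> float:
    """Worst-case space taken by one file across all its versions.

    Assumption: each version is stored in full (no effective delta
    or compression between versions).
    """
    return file_size_gb * versions


dataset = history_size_gb(2, 5)     # 2 GB dataset, 5 committed versions
model = history_size_gb(0.5, 20)    # 500 MB model, retrained 20 times
print(f"dataset: {dataset} GB, model: {model} GB, total: {dataset + model} GB")
```

Real repositories may do somewhat better when versions share content, but large binary files rarely compress or delta well.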
DVC - Data Version Control
DVC (Data Version Control) is an open source tool designed specifically for versioning data, ML models and processing pipelines. It works alongside Git, not as a replacement.
The principle is similar to Git LFS: large files are stored outside of Git, and Git only keeps a small pointer file (.dvc). But DVC goes much further with its reproducible pipelines.
Installation
# With pip (recommended - works everywhere)
pip install dvc
# With S3 support
pip install "dvc[s3]"
# With Google Cloud Storage support
pip install "dvc[gs]"
# With all remotes
pip install "dvc[all]"
# macOS with Homebrew
brew install dvc
# Or with conda
conda install -c conda-forge dvc
# Verify installation
dvc version
Initialization in a Git project
# DVC initializes in an existing Git repository
cd my-ml-project
git init -b main
dvc init
# DVC creates a .dvc/ folder and a .dvcignore file
# Commit the initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"
Adding data
# Add a dataset to DVC
dvc add data/train.csv
# DVC creates two things:
# 1. data/train.csv.dvc -> a small pointer file (committed in Git)
# 2. data/.gitignore -> so Git ignores the real file
# Commit the pointer in Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset"
# Add a trained model
dvc add models/classifier.pkl
git add models/classifier.pkl.dvc models/.gitignore
git commit -m "Add trained model v1"
The .dvc file looks like this:
# Contents of data/train.csv.dvc
outs:
- md5: a1b2c3d4e5f6...
  size: 2147483648
  hash: md5
  path: train.csv
It's a simple YAML file with the file's MD5 hash and size. Git can version it efficiently, and DVC uses the hash to find the real file.
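To make the pointer idea concrete, here is a minimal Python sketch that computes the same kind of metadata (MD5 hash, size, file name) for a file. It is an illustration under assumptions, not DVC's implementation: the function name make_pointer and the chunk size are hypothetical, and the real .dvc file is written by dvc add.

```python
import hashlib
import os
import tempfile


def make_pointer(path: str, chunk_size: int = 1024 * 1024) -> dict:
    """Return md5/size/path metadata for a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reads keep memory usage flat even for multi-GB files
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


# Demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "train.csv")
    with open(sample, "wb") as f:
        f.write(b"hello")
    pointer = make_pointer(sample)
    print(pointer)
```

Because the hash depends only on the content, two identical files produce the same pointer, and any change to the data produces a different one.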
DVC remotes
By default, DVC stores data in a local cache (.dvc/cache). To share data with your team, you configure a remote - a remote storage location.
# Local remote (shared folder, network drive)
dvc remote add -d myremote /path/to/shared/storage
# S3 remote (Amazon Web Services)
dvc remote add -d myremote s3://my-bucket/dvc-storage
# Google Cloud Storage remote
dvc remote add -d myremote gs://my-bucket/dvc-storage
# Azure Blob Storage remote
dvc remote add -d myremote azure://my-container/dvc-storage
# SSH remote
dvc remote add -d myremote ssh://user@server.com/dvc-storage
# See configured remotes
dvc remote list
The -d flag sets the remote as the default remote.
Push and Pull
# Push data to the remote
dvc push
# Pull data from the remote
dvc pull
# Fetch data without placing it in the workspace
dvc fetch
# Check status (pipeline and data changes in the workspace; add -c to compare the cache with the remote)
dvc status
Typical workflow: you run git pull to fetch the .dvc pointers, then dvc pull to fetch the real data. The two commands complement each other.
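The push/pull idea can be sketched as a tiny content-addressed store in Python. This is a simplified model under assumptions: the "remote" is just a local folder, files are stored under their MD5 hash, and the push and pull functions are hypothetical helpers, not DVC's API or its actual storage layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()


def push(data_file: Path, remote: Path) -> str:
    """Copy the file into the store under its hash and return the hash."""
    digest = file_md5(data_file)
    remote.mkdir(parents=True, exist_ok=True)
    target = remote / digest
    if not target.exists():  # identical content is stored only once
        shutil.copy2(data_file, target)
    return digest


def pull(digest: str, remote: Path, destination: Path) -> None:
    """Restore the file identified by its hash into the workspace."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(remote / digest, destination)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    remote = root / "remote"
    original = root / "alice" / "train.csv"
    original.parent.mkdir()
    original.write_text("name,level\nAldric,12\n")

    pointer = push(original, remote)   # Alice shares the data
    push(original, remote)             # pushing again stores nothing new
    restored = root / "bob" / "train.csv"
    pull(pointer, remote, restored)    # Bob restores it from the hash

    same_content = restored.read_text() == original.read_text()
    stored_files = len(list(remote.iterdir()))
    print(same_content, stored_files)
```

In this model the hash plays the role of the .dvc pointer: it travels through Git, while the bytes travel through the remote.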
DVC pipelines
The real power of DVC lies in its pipelines. A pipeline describes the steps of your ML workflow: prepare the data, train the model, evaluate the results. DVC knows which steps to re-run when data or code changes.
The dvc.yaml file
# dvc.yaml - defines the pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    outs:
      - data/prepared/
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/prepared/
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          cache: false
Each stage (step) defines:
- cmd: the command to execute
- deps: the dependencies (if they change, the stage will be re-run)
- outs: the outputs (managed by DVC, not by Git)
- metrics: the metrics (committed in Git to facilitate comparison)
- plots: the visualizations
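The rule "if a dependency changes, the stage is re-run" can be sketched in Python. This is a simplified model, not DVC's algorithm: DVC compares content hashes recorded in dvc.lock, whereas this sketch takes the set of changed paths as input and conservatively assumes that a re-run stage always changes its outputs. The STAGES dictionary mirrors the pipeline described above, and stages_to_rerun is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # Python 3.9+

STAGES = {
    "prepare": {"deps": ["src/prepare.py", "data/raw/"],
                "outs": ["data/prepared/"]},
    "train": {"deps": ["src/train.py", "data/prepared/"],
              "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["src/evaluate.py", "models/model.pkl", "data/prepared/"],
                 "outs": []},
}


def stages_to_rerun(stages: dict, changed: set) -> list:
    """Return the stale stages in execution order."""
    # Which stage produces each output?
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on the stages that produce its dependencies
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    dirty = set(changed)
    stale = []
    for name in TopologicalSorter(graph).static_order():
        if any(dep in dirty for dep in stages[name]["deps"]):
            stale.append(name)
            dirty.update(stages[name]["outs"])  # outputs are assumed to change
    return stale


print(stages_to_rerun(STAGES, {"src/train.py"}))
```

With only src/train.py changed, the sketch selects train and evaluate and leaves prepare alone, which matches the behavior described for dvc repro.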
Running the pipeline
# Run the entire pipeline
dvc repro
# DVC only re-runs stages whose dependencies have changed!
# If you modify src/train.py, only the "train" and "evaluate" stages
# will be re-run. "prepare" will be skipped because its deps haven't changed.
# Visualize the dependency graph (DAG)
dvc dag
# Example output:
# +----------+
# | prepare  |
# +----------+
#      |
#      v
# +----------+
# |  train   |
# +----------+
#      |
#      v
# +----------+
# | evaluate |
# +----------+
Comparing experiments
# See current metrics
dvc metrics show
# Compare the workspace with the last commit (pass a branch, tag or commit to compare against it instead)
dvc metrics diff
# Example output:
# Path                       Metric    Old   New   Change
# metrics/eval_metrics.json  accuracy  0.85  0.91  0.06
# metrics/eval_metrics.json  f1_score  0.83  0.89  0.06
The problem with Jupyter notebooks
Jupyter notebooks (.ipynb) are the favorite format of data scientists. But they pose a major problem with Git: they are JSON files that contain both the code, the execution results, and the metadata.
Why diffs are horrible
// A Jupyter notebook is JSON like this:
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 42,
      "metadata": {
        "collapsed": false,
        "scrolled": true
      },
      "outputs": [
        {
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAA..."
            // <- hundreds of lines of base64 for ONE image
          }
        }
      ],
      "source": ["print('hello')"]
    }
  ]
}
Every time you execute a cell, the execution_count changes, the outputs change, the metadata changes. A Git diff shows hundreds of modified lines for a single-line code change.
And if your notebook contains graphs, they are encoded in base64 directly in the JSON. Thousands of lines of unreadable characters in every diff.
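The noise is easy to reproduce. The Python sketch below builds two versions of a minimal notebook whose source code is identical and whose only difference is the execution counter, then counts the lines a text diff reports. The notebook structure is a stripped-down assumption (real notebooks carry more metadata), and the helper name is illustrative.

```python
import difflib
import json


def notebook(execution_count: int) -> dict:
    """Build a minimal one-cell notebook (simplified structure)."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout",
                         "text": ["hello\n"]}],
            "source": ["print('hello')"],
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


before = json.dumps(notebook(1), indent=1).splitlines()
after = json.dumps(notebook(42), indent=1).splitlines()

# Keep only added/removed lines, skipping the '---' and '+++' file headers
changed = [
    line for line in difflib.unified_diff(before, after, lineterm="")
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(len(changed), "changed lines, although the source code is identical")
```

Here only the counter moves. With real outputs such as base64 images, the same re-execution rewrites far more lines.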
Solution 1: nbstripout - cleaning outputs
nbstripout is a tool that automatically strips notebook outputs before each commit. Only the source code is versioned.
# Install nbstripout
pip install nbstripout
# Enable it for the current repository (configures an automatic Git filter)
nbstripout --install
# Now, every time you git add a notebook,
# outputs are automatically removed before commit.
# Your local notebook keeps its outputs - only the Git version is cleaned.
# Uninstall if needed
nbstripout --uninstall
Best practice: install nbstripout in every project that contains notebooks. Add nbstripout --install to your project setup script so all team members have it automatically.
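To show what "stripping outputs" means in practice, here is a minimal Python sketch that clears the outputs and execution counters of a notebook loaded as a dictionary. It is not nbstripout's code: the real tool handles more cases, and strip_outputs is a hypothetical helper operating on a simplified notebook.

```python
import copy
import json


def strip_outputs(nb: dict) -> dict:
    """Return a copy of the notebook with outputs and counters cleared."""
    cleaned = copy.deepcopy(nb)  # leave the original notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


nb = {
    "cells": [{
        "cell_type": "code",
        "execution_count": 42,
        "metadata": {},
        "outputs": [{"output_type": "stream", "name": "stdout",
                     "text": ["hello\n"]}],
        "source": ["print('hello')"],
    }],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(nb)
print(json.dumps(cleaned["cells"][0], indent=1))
```

The source survives untouched while everything that changes on each execution is reset, which is why the committed version produces small diffs.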
Solution 2: Jupytext - text format
Jupytext automatically synchronizes your notebooks with a text file (.py, .md, or .Rmd). You work in Jupyter normally, and Jupytext maintains a pure text copy that Git can diff cleanly.
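As a rough illustration of what such a text copy contains, the Python sketch below renders notebook cells in a "percent"-style layout: code cells start with a # %% marker, and Markdown cells start with # %% [markdown] followed by commented lines. This is a deliberately simplified assumption about the format (no header, no cell metadata, no conversion back to .ipynb); to_percent is a hypothetical helper, and Jupytext itself should be used for real conversions.

```python
def to_percent(nb: dict) -> str:
    """Render notebook cells as '# %%'-delimited Python text (simplified)."""
    blocks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "markdown":
            # Markdown becomes comments so the file stays valid Python
            commented = "\n".join(
                ("# " + line).rstrip() for line in source.splitlines()
            )
            blocks.append("# %% [markdown]\n" + commented)
        elif cell["cell_type"] == "code":
            blocks.append("# %%\n" + source)
    return "\n\n".join(blocks) + "\n"


nb = {
    "cells": [
        {"cell_type": "markdown", "source": ["# My analysis"]},
        {"cell_type": "code",
         "source": ["import pandas as pd\n", 'df = pd.read_csv("data.csv")']},
    ]
}

text = to_percent(nb)
print(text)
```

Since outputs are never written to the text file, a one-line code change shows up as a one-line diff.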
# Install Jupytext
pip install jupytext
# Convert a notebook to a Python script
jupytext --to py:percent notebook.ipynb
# This creates notebook.py with the "percent" format:
# ---
# jupyter:
# jupytext:
# text_representation:
# format_name: percent
# ---
# %% [markdown]
# # My analysis
# %%
import pandas as pd
df = pd.read_csv("data.csv")
# %%
df.describe()
You can configure Jupytext to automatically synchronize both formats:
# In jupytext.toml at the project root
formats = "ipynb,py:percent"
# Jupytext will keep both files in sync
# -> Commit the .py in Git (text, clean diffs)
# -> The .ipynb can be in .gitignore (or committed without outputs via nbstripout)
Solution 3: ReviewNB - visual review
ReviewNB is an online tool that integrates with GitHub and GitLab to display notebook diffs visually - with rendered graphs, tables and Markdown. Ideal for Pull Requests that touch notebooks.
Versioning ML experiments
In machine learning, you often run dozens of experiments with different hyperparameters. Tracking results by hand quickly becomes chaotic. Several tools integrate with Git to solve this problem.
MLflow + Git
MLflow is an open source platform for managing the machine learning lifecycle. It automatically records the parameters, metrics and artifacts of each experiment.
# Install MLflow
pip install mlflow
# In your training script
import mlflow
mlflow.set_experiment("image-classification")
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)
    mlflow.log_param("batch_size", 32)
    # ... training ...
    # Log metrics
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1_score", 0.89)
    # Log model ("model" is the estimator fitted during training)
    mlflow.sklearn.log_model(model, "model")
# Launch the web interface to visualize results
# mlflow ui
When the script runs inside a Git repository, MLflow records the Git commit hash with each run. You can thus find exactly which code produced which results.
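The link between a run and a commit can be sketched without any tracking server. The Python example below appends each run (parameters, metrics and the current Git commit) to a JSON Lines file. It is a minimal stand-in under assumptions, not MLflow's implementation: current_commit and record_run are hypothetical helpers, the file name runs.jsonl is arbitrary, and the commit falls back to "unknown" when Git is missing or the script runs outside a repository.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def current_commit() -> str:
    """Return the current Git commit hash, or 'unknown' if unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def record_run(log_file: Path, params: dict, metrics: dict) -> dict:
    """Append one experiment record to a JSON Lines log."""
    run = {"commit": current_commit(), "params": params, "metrics": metrics}
    with log_file.open("a") as f:
        f.write(json.dumps(run) + "\n")
    return run


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runs.jsonl"
    run = record_run(
        log,
        {"learning_rate": 0.001, "epochs": 50},
        {"accuracy": 0.91},
    )
    logged_lines = log.read_text().splitlines()
    print(run["commit"], len(logged_lines))
```

A dedicated tool adds a UI, artifact storage and comparison features on top, but the core idea is the same: every result is stored next to the exact code version that produced it.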
Weights & Biases (W&B)
Weights & Biases is a cloud service that offers features similar to MLflow, with a richer interface and advanced collaboration features (experiment comparison, interactive tables, reports).
# Install wandb
pip install wandb
# In your script
import wandb
wandb.init(project="image-classification")
wandb.config.learning_rate = 0.001
wandb.config.epochs = 50
# ... training ...
wandb.log({"accuracy": 0.91, "loss": 0.23})
# Results are visible on app.wandb.ai
DVC vs MLflow vs W&B: these tools are not competitors. DVC manages data and pipelines. MLflow and W&B manage experiments and metrics. You can (and should) use them together.
Organizing an ML repo
A well-organized data science project facilitates collaboration and reproducibility. Here's a recommended structure:
my-ml-project/
├── .dvc/                   # DVC configuration
├── .dvcignore              # Files ignored by DVC
├── .gitignore              # Files ignored by Git
├── dvc.yaml                # DVC pipeline
├── dvc.lock                # Locked pipeline versions
├── params.yaml             # Hyperparameters (versioned in Git)
├── README.md
├── requirements.txt        # Python dependencies
│
├── data/
│   ├── raw/                # Raw data (DVC, never modified)
│   ├── processed/          # Cleaned data (DVC, generated by the pipeline)
│   └── external/           # External data (DVC)
│
├── models/                 # Trained models (DVC)
│
├── src/                    # Source code (Git)
│   ├── data/
│   │   ├── download.py     # Download data
│   │   └── prepare.py      # Prepare/clean data
│   ├── features/
│   │   └── build.py        # Build features
│   ├── models/
│   │   ├── train.py        # Train the model
│   │   └── predict.py      # Make predictions
│   └── evaluate/
│       └── evaluate.py     # Evaluate the model
│
├── notebooks/              # Exploration notebooks (Git, with nbstripout)
│   ├── 01-exploration.ipynb
│   ├── 02-feature-engineering.ipynb
│   └── 03-analysis.ipynb
│
├── metrics/                # Metrics (Git, small JSON files)
│   └── eval_metrics.json
│
├── plots/                  # Generated visualizations (Git or DVC depending on size)
│
└── tests/                  # Unit tests (Git)
    └── test_prepare.py
.gitignore for ML projects
# .gitignore for an ML project
# Data (managed by DVC, not Git)
/data/raw/
/data/processed/
/models/
# Python
__pycache__/
*.py[cod]
*.egg-info/
.eggs/
dist/
build/
*.egg
# Virtual environments
venv/
env/
.env/
# Jupyter
.ipynb_checkpoints/
# IDE
.idea/
.vscode/
*.swp
# OS
.DS_Store
Thumbs.db
# MLflow
mlruns/
# W&B
wandb/
Practical exercise - Mini DVC pipeline
Create a mini data science project with DVC. You'll initialize a repository, add a dataset, and create a simple processing pipeline.
Step 1 - Initialize the project
# Create the project
mkdir infinite-archives-project
cd infinite-archives-project
git init -b main
dvc init
# Create the structure
mkdir -p data/raw data/processed src models metrics
# Commit the initialization
git add .
git commit -m "Initialize project with Git and DVC"
On Windows (PowerShell), the equivalent is:
# Create the project
mkdir infinite-archives-project
cd infinite-archives-project
git init -b main
dvc init
# Create the structure
New-Item -ItemType Directory -Force -Path data/raw, data/processed, src, models, metrics
# Commit the initialization
git add .
git commit -m "Initialize project with Git and DVC"
Step 2 - Create a fictional dataset
# Create a small fictional CSV dataset
cat > data/raw/adventurers.csv << 'EOF'
name,class,level,strength,intelligence,xp
Aldric,Warrior,12,18,8,4500
Lyria,Mage,10,6,20,3800
Thorin,Paladin,15,16,12,7200
Selene,Rogue,8,10,14,2100
Grimm,Barbarian,20,22,5,12000
Elara,Druid,11,12,16,4100
Kael,Ranger,9,14,11,2800
Mira,Enchanter,14,7,19,6500
Bron,Knight,17,19,10,9300
Zara,Necromancer,13,8,18,5700
EOF
# Add the dataset to DVC
dvc add data/raw/adventurers.csv
# Commit the pointer
git add data/raw/adventurers.csv.dvc data/raw/.gitignore
git commit -m "Add adventurers dataset"
On Windows (PowerShell), the equivalent is:
# Create a small fictional CSV dataset
@"
name,class,level,strength,intelligence,xp
Aldric,Warrior,12,18,8,4500
Lyria,Mage,10,6,20,3800
Thorin,Paladin,15,16,12,7200
Selene,Rogue,8,10,14,2100
Grimm,Barbarian,20,22,5,12000
Elara,Druid,11,12,16,4100
Kael,Ranger,9,14,11,2800
Mira,Enchanter,14,7,19,6500
Bron,Knight,17,19,10,9300
Zara,Necromancer,13,8,18,5700
"@ | Set-Content data/raw/adventurers.csv
# Add the dataset to DVC
dvc add data/raw/adventurers.csv
# Commit the pointer
git add data/raw/adventurers.csv.dvc data/raw/.gitignore
git commit -m "Add adventurers dataset"
Step 3 - Create the pipeline scripts
# Data preparation script
cat > src/prepare.py << 'PYEOF'
"""Prepare adventurer data."""
import csv
import json
import os


def prepare():
    os.makedirs("data/processed", exist_ok=True)
    with open("data/raw/adventurers.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    # Add a "power" column = strength + intelligence
    for row in rows:
        row["power"] = int(row["strength"]) + int(row["intelligence"])
    # Write prepared data
    fieldnames = list(rows[0].keys())
    with open("data/processed/adventurers_prepared.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Data prepared: {len(rows)} adventurers")


if __name__ == "__main__":
    prepare()
PYEOF
# Analysis script (simulates "training")
cat > src/analyze.py << 'PYEOF'
"""Analyze adventurers and produce metrics."""
import csv
import json
import os


def analyze():
    os.makedirs("metrics", exist_ok=True)
    with open("data/processed/adventurers_prepared.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    # Calculate statistics
    levels = [int(r["level"]) for r in rows]
    powers = [int(r["power"]) for r in rows]
    metrics = {
        "total_adventurers": len(rows),
        "avg_level": round(sum(levels) / len(levels), 2),
        "max_level": max(levels),
        "avg_power": round(sum(powers) / len(powers), 2),
        "max_power": max(powers),
    }
    with open("metrics/analysis.json", "w") as f:
        json.dump(metrics, f, indent=2)
    print(f"Analysis complete: {metrics}")


if __name__ == "__main__":
    analyze()
PYEOF
On Windows (PowerShell), the equivalent is:
# Data preparation script
@"
"""Prepare adventurer data."""
import csv
import json
import os


def prepare():
    os.makedirs("data/processed", exist_ok=True)
    with open("data/raw/adventurers.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    # Add a "power" column = strength + intelligence
    for row in rows:
        row["power"] = int(row["strength"]) + int(row["intelligence"])
    # Write prepared data
    fieldnames = list(rows[0].keys())
    with open("data/processed/adventurers_prepared.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Data prepared: {len(rows)} adventurers")


if __name__ == "__main__":
    prepare()
"@ | Set-Content src/prepare.py
# Analysis script
@"
"""Analyze adventurers and produce metrics."""
import csv
import json
import os


def analyze():
    os.makedirs("metrics", exist_ok=True)
    with open("data/processed/adventurers_prepared.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
    # Calculate statistics
    levels = [int(r["level"]) for r in rows]
    powers = [int(r["power"]) for r in rows]
    metrics = {
        "total_adventurers": len(rows),
        "avg_level": round(sum(levels) / len(levels), 2),
        "max_level": max(levels),
        "avg_power": round(sum(powers) / len(powers), 2),
        "max_power": max(powers),
    }
    with open("metrics/analysis.json", "w") as f:
        json.dump(metrics, f, indent=2)
    print(f"Analysis complete: {metrics}")


if __name__ == "__main__":
    analyze()
"@ | Set-Content src/analyze.py
Step 4 - Define the DVC pipeline
# Create the dvc.yaml file
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/adventurers.csv
    outs:
      - data/processed/adventurers_prepared.csv
  analyze:
    cmd: python src/analyze.py
    deps:
      - src/analyze.py
      - data/processed/adventurers_prepared.csv
    metrics:
      - metrics/analysis.json:
          cache: false
EOF
# Commit the scripts and pipeline
git add src/ dvc.yaml
git commit -m "Add DVC pipeline (prepare + analyze)"
On Windows (PowerShell), the equivalent is:
# Create the dvc.yaml file
@"
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/adventurers.csv
    outs:
      - data/processed/adventurers_prepared.csv
  analyze:
    cmd: python src/analyze.py
    deps:
      - src/analyze.py
      - data/processed/adventurers_prepared.csv
    metrics:
      - metrics/analysis.json:
          cache: false
"@ | Set-Content dvc.yaml
# Commit the scripts and pipeline
git add src/ dvc.yaml
git commit -m "Add DVC pipeline (prepare + analyze)"
Step 5 - Run the pipeline
# Run the full pipeline
dvc repro
# DVC executes the stages in order:
# 1. prepare -> creates data/processed/adventurers_prepared.csv
# 2. analyze -> creates metrics/analysis.json
# See the metrics
dvc metrics show
# Commit the results
git add dvc.lock metrics/analysis.json
git commit -m "Run pipeline - first analysis"
Step 6 - Modify and re-run
# Add adventurers to the dataset
cat >> data/raw/adventurers.csv << 'EOF'
Vex,Assassin,16,15,13,8100
Luna,Priestess,18,9,17,10500
EOF
# Update the DVC file
dvc add data/raw/adventurers.csv
git add data/raw/adventurers.csv.dvc
# Re-run the pipeline - DVC detects the change
dvc repro
# Compare metrics
dvc metrics diff
# Commit everything
git add dvc.lock metrics/analysis.json
git commit -m "Add 2 adventurers - metrics updated"
Command summary
| Command | Description |
|---|---|
| dvc init | Initialize DVC in a Git repository |
| dvc add <file> | Add a file to DVC (creates a .dvc pointer) |
| dvc push | Push data to the DVC remote |
| dvc pull | Pull data from the DVC remote |
| dvc remote add | Configure a storage remote |
| dvc repro | Run the pipeline (only modified stages) |
| dvc dag | Visualize the pipeline dependency graph |
| dvc metrics show | Display current metrics |
| dvc metrics diff | Compare metrics between versions |
| dvc status | See file and pipeline status |
| nbstripout --install | Enable automatic notebook output cleaning |
| jupytext --to py:percent | Convert a notebook to a Python script |
The Archivist of Numbers closes the great book he was consulting and looks at you with a discreet smile.
"You've learned to tame the infinite. Data no longer frightens you. You know how to version it without blowing up your chronicles, you know how to create reproducible pipelines, and you know that every experiment deserves to be carefully recorded."
"The world of data science evolves fast. New tools appear every month. But the principles remain the same: separate code from data, make your experiments reproducible, and never commit a 10 GB dataset into Git."
He walks you back to the exit of the Infinite Library. The corridors that once seemed endless now feel familiar.
"Come back anytime. The Infinite Archives are always open."