Arc 6 Quest A2

The Infinite Archives

Data science, DVC and Jupyter notebooks

In the heart of the Citadel, behind a door that few archivists know about, lies the Infinite Library. Its corridors stretch so far that no one has ever seen the end. The shelves groan under the weight of archives so vast that no ordinary scroll could contain them - immense datasets, complex predictive models, experiment grimoires spanning hundreds of thousands of pages.

The Archivist of Numbers awaits you at the entrance, a mechanical abacus in one hand, a graph covered in curves in the other.

"You've learned to version code. That's good. But code is only part of the story. Data, models, experiments - that's the real treasure of the Library. And these treasures are so large, so complex, that they require special tools. Follow me, I'll teach you the art of versioning the infinite."

The problem of massive data in Git

In a data science or machine learning project, you work with files very different from source code:

Datasets: CSV, Parquet or JSON files of several GB
Trained models: binary files (.pkl, .h5, .pt, .onnx) from several hundred MB to several GB
Jupyter notebooks: .ipynb files that mix code, results and visualizations
Training images: thousands (or even millions) of images for computer vision
Experiment results: metrics, logs, generated graphs

Git is not designed to handle these volumes. Every version of a large binary file stays in the history: a 2 GB dataset committed in 5 different versions adds up to roughly 10 GB, and a 500 MB model retrained 20 times adds roughly 10 GB more. Very quickly, your repository becomes unmanageable.

In data science, code often represents only 1% of the project by size. The remaining 99% is data and models. Git alone is not enough.
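
The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch: it assumes every committed version is stored in full (large binary files compress and delta poorly), and the function name history_size_gb is purely illustrative.

```python
def history_size_gb(file_size_gb: float, versions: int) -> float:
    """Worst-case history size if every version is stored in full."""
    return file_size_gb * versions


dataset = history_size_gb(2.0, 5)   # 2 GB dataset, 5 committed versions
model = history_size_gb(0.5, 20)    # 500 MB model, 20 training runs

print(f"dataset: {dataset} GB, model: {model} GB, total: {dataset + model} GB")
```

Real repositories may end up somewhat smaller thanks to compression, but the order of magnitude is what matters here.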

DVC - Data Version Control

DVC (Data Version Control) is an open source tool designed specifically for versioning data, ML models and processing pipelines. It works alongside Git, not as a replacement.

The principle is similar to Git LFS: large files are stored outside of Git, and Git only keeps a small pointer file (.dvc). But DVC goes much further with its reproducible pipelines.

Installation

# With pip (recommended - works everywhere)
pip install dvc

# With S3 support
pip install "dvc[s3]"

# With Google Cloud Storage support
pip install "dvc[gs]"

# With all remotes
pip install "dvc[all]"

# macOS with Homebrew
brew install dvc

# Or with conda
conda install -c conda-forge dvc

# Verify installation
dvc version

Initialization in a Git project

# DVC must be initialized inside a Git repository
cd my-ml-project
git init -b main   # skip this if the project is already a Git repository
dvc init

# DVC creates a .dvc/ folder and a .dvcignore file
# Commit the initialization
git add .dvc .dvcignore
git commit -m "Initialize DVC"

Adding data

# Add a dataset to DVC
dvc add data/train.csv

# DVC creates two things:
# 1. data/train.csv.dvc -> a small pointer file (committed in Git)
# 2. data/.gitignore -> so Git ignores the real file

# Commit the pointer in Git
git add data/train.csv.dvc data/.gitignore
git commit -m "Add training dataset"

# Add a trained model
dvc add models/classifier.pkl
git add models/classifier.pkl.dvc models/.gitignore
git commit -m "Add trained model v1"

The .dvc file looks like this:

# Contents of data/train.csv.dvc
outs:
- md5: a1b2c3d4e5f6...
  size: 2147483648
  hash: md5
  path: train.csv

It's a simple YAML file containing the file's MD5 hash and size. Git can version it efficiently, and DVC uses the hash to locate the real file in its cache or on a remote.
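
To make the idea of "finding the real file by its hash" concrete, here is a minimal sketch of a content-addressed lookup. It is not DVC's implementation: the helper names are hypothetical, and the directory layout (files/md5/ followed by the first two hex characters) is an assumption based on the DVC 3.x cache; older versions used a slightly different layout.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that multi-GB files never sit fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5: str, cache_dir: Path = Path(".dvc/cache")) -> Path:
    """Map a hash to its location in a content-addressed cache.

    Assumed layout (DVC 3.x style): files/md5/<first 2 hex chars>/<remaining chars>
    """
    return cache_dir / "files" / "md5" / md5[:2] / md5[2:]


# Demo with an in-memory value instead of a real dataset
example_hash = hashlib.md5(b"hello").hexdigest()
print(cache_path(example_hash).as_posix())
```

Because the path depends only on the content, two identical files are stored once, and a changed file gets a new entry without overwriting the old one.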

DVC remotes

By default, DVC stores data in a local cache (.dvc/cache). To share data with your team, you configure a remote - a remote storage location.

# Local remote (shared folder, network drive)
dvc remote add -d myremote /path/to/shared/storage

# S3 remote (Amazon Web Services)
dvc remote add -d myremote s3://my-bucket/dvc-storage

# Google Cloud Storage remote
dvc remote add -d myremote gs://my-bucket/dvc-storage

# Azure Blob Storage remote
dvc remote add -d myremote azure://my-container/dvc-storage

# SSH remote
dvc remote add -d myremote ssh://user@server.com/dvc-storage

# See configured remotes
dvc remote list

The -d flag sets the remote as the default remote. The setting is written to .dvc/config; commit that file so your teammates use the same remote.

Push and Pull

# Push data to the remote
dvc push

# Pull data from the remote
dvc pull

# Fetch data without placing it in the workspace
dvc fetch

# Check status (which files are synchronized)
dvc status

Typical workflow: you run git pull to fetch the .dvc pointers, then dvc pull to fetch the real data. The two commands complement each other.

DVC pipelines

The real power of DVC lies in its pipelines. A pipeline describes the steps of your ML workflow: prepare the data, train the model, evaluate the results. DVC knows which steps to re-run when data or code changes.

The dvc.yaml file

# dvc.yaml - defines the pipeline
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/
    outs:
      - data/prepared/

  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/prepared/
    outs:
      - models/model.pkl
    metrics:
      - metrics/train_metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/prepared/
    metrics:
      - metrics/eval_metrics.json:
          cache: false
    plots:
      - plots/confusion_matrix.csv:
          cache: false

Each stage (step) defines:

cmd: the command to execute
deps: the dependencies (if they change, the stage will be re-run)
outs: the outputs (managed by DVC, not by Git)
metrics: the metrics files (with cache: false they stay in Git, which makes comparison between commits easy)
plots: the visualizations

Running the pipeline

# Run the entire pipeline
dvc repro

# DVC only re-runs stages whose dependencies have changed!
# If you modify src/train.py, only the "train" and "evaluate" stages
# will be re-run. "prepare" will be skipped because its deps haven't changed.

# Visualize the dependency graph (DAG)
dvc dag

# Example output:
#   +----------+
#   | prepare  |
#   +----------+
#        |
#        v
#   +----------+
#   |  train   |
#   +----------+
#        |
#        v
#   +----------+
#   | evaluate |
#   +----------+
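
The selective re-run logic can be sketched in plain Python. This is a toy model, not DVC's algorithm: DVC compares content hashes recorded in dvc.lock, whereas this sketch simply takes a set of changed paths as input. The STAGES dictionary mirrors the dvc.yaml shown earlier, and stages_to_rerun is a hypothetical helper name.

```python
from typing import Dict, Iterable, List

# Simplified mirror of the dvc.yaml above (metrics and plots count as outputs)
STAGES: Dict[str, Dict[str, List[str]]] = {
    "prepare": {
        "deps": ["src/prepare.py", "data/raw/"],
        "outs": ["data/prepared/"],
    },
    "train": {
        "deps": ["src/train.py", "data/prepared/"],
        "outs": ["models/model.pkl", "metrics/train_metrics.json"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "models/model.pkl", "data/prepared/"],
        "outs": ["metrics/eval_metrics.json", "plots/confusion_matrix.csv"],
    },
}


def stages_to_rerun(stages: Dict[str, Dict[str, List[str]]],
                    changed: Iterable[str]) -> List[str]:
    """Return the stages invalidated by the changed paths, in discovery order."""
    dirty = set(changed)
    rerun: List[str] = []
    progress = True
    while progress:  # repeat until no new stage becomes dirty
        progress = False
        for name, spec in stages.items():
            if name in rerun:
                continue
            if any(dep in dirty for dep in spec["deps"]):
                rerun.append(name)
                dirty.update(spec["outs"])  # its outputs invalidate downstream stages
                progress = True
    return rerun


print(stages_to_rerun(STAGES, {"src/train.py"}))
print(stages_to_rerun(STAGES, {"data/raw/"}))
```

Changing src/train.py marks train and then evaluate, while prepare is left alone; changing data/raw/ invalidates all three stages.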

Comparing experiments

# See current metrics
dvc metrics show

# Compare the workspace with the last commit (HEAD)
dvc metrics diff

# Compare with another branch or commit
dvc metrics diff main

# Example output:
# Path                       Metric    Old      New      Change
# metrics/eval_metrics.json  accuracy  0.85     0.91     0.06
# metrics/eval_metrics.json  f1_score  0.83     0.89     0.06
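
What such a comparison boils down to can be sketched in a few lines. This is only an illustration of the idea (load two versions of a metrics file, subtract value by value), not DVC's code; the function name metrics_diff and the sample values, taken from the example output above, are used for demonstration.

```python
def metrics_diff(old: dict, new: dict, ndigits: int = 4) -> dict:
    """Compare two flat metric dictionaries, key by key."""
    diff = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before is None or after is None:
            change = None  # metric exists in only one of the two versions
        else:
            change = round(after - before, ndigits)
        diff[name] = {"old": before, "new": after, "change": change}
    return diff


old_metrics = {"accuracy": 0.85, "f1_score": 0.83}
new_metrics = {"accuracy": 0.91, "f1_score": 0.89}

for name, row in metrics_diff(old_metrics, new_metrics).items():
    print(name, row["old"], row["new"], row["change"])
```

Rounding the difference avoids floating-point noise such as 0.06000000000000005 in the report.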

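DVC pipelines make your experiments reproducible. Anyone can clone your repo, fetch the data with dvc pull, run dvc repro, and obtain the same results, provided the code itself is deterministic (fixed random seeds, pinned dependencies). This is the foundation of reproducible science.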

The problem with Jupyter notebooks

Jupyter notebooks (.ipynb) are the favorite format of data scientists. But they pose a major problem with Git: they are JSON files that contain both the code, the execution results, and the metadata.

Why diffs are horrible

// A Jupyter notebook is JSON like this:
{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": 42,
      "metadata": {
        "collapsed": false,
        "scrolled": true
      },
      "outputs": [
        {
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAA..."
            // <- hundreds of lines of base64 for ONE image
          }
        }
      ],
      "source": ["print('hello')"]
    }
  ]
}

Every time you execute a cell, the execution_count changes, the outputs change, the metadata changes. A Git diff shows hundreds of modified lines for a single-line code change.

And if your notebook contains graphs, they are encoded in base64 directly in the JSON. Thousands of lines of unreadable characters in every diff.

Solution 1: nbstripout - cleaning outputs

nbstripout is a tool that automatically strips notebook outputs before each commit. Only the source code is versioned.

# Install nbstripout
pip install nbstripout

# Enable it for the current repository (configures an automatic Git filter)
nbstripout --install

# Now, every time you git add a notebook,
# outputs are automatically removed before commit.
# Your local notebook keeps its outputs - only the Git version is cleaned.

# Uninstall if needed
nbstripout --uninstall

Best practice: install nbstripout in every project that contains notebooks. Add nbstripout --install to your project setup script so all team members have it automatically.
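
To see what "stripping outputs" means at the JSON level, here is a minimal sketch using only the standard library. It is not nbstripout's implementation (the real tool also handles metadata and many edge cases); strip_outputs is a hypothetical helper and the sample notebook is invented for the demonstration.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with outputs and execution counts removed."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 42,
            "metadata": {},
            "outputs": [{"data": {"image/png": "iVBORw0KGgo..."}}],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

print(json.dumps(strip_outputs(notebook), indent=2))
```

After stripping, re-executing a cell no longer changes the versioned file: only edits to the source lines produce a diff.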

Solution 2: Jupytext - text format

Jupytext automatically synchronizes your notebooks with a text file (.py, .md, or .Rmd). You work in Jupyter normally, and Jupytext maintains a pure text copy that Git can diff cleanly.

# Install Jupytext
pip install jupytext

# Convert a notebook to a Python script
jupytext --to py:percent notebook.ipynb

# This creates notebook.py with the "percent" format:
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       format_name: percent
# ---

# %% [markdown]
# # My analysis

# %%
import pandas as pd
df = pd.read_csv("data.csv")

# %%
df.describe()

You can configure Jupytext to automatically synchronize both formats:

# In jupytext.toml at the project root
formats = "ipynb,py:percent"

# Jupytext will keep both files in sync
# -> Commit the .py in Git (text, clean diffs)
# -> The .ipynb can be in .gitignore (or committed without outputs via nbstripout)
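
The "percent" format itself is simple enough to sketch by hand. The function below is a simplified illustration of the conversion, not Jupytext's implementation: it ignores the YAML header, cell metadata and raw cells, and the name to_percent_script is hypothetical.

```python
def to_percent_script(notebook: dict) -> str:
    """Render notebook cells as a percent-format Python script (simplified)."""
    blocks = []
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        if isinstance(source, list):  # the notebook format allows a list of lines
            source = "".join(source)
        if cell.get("cell_type") == "markdown":
            # Markdown becomes comments so the file stays valid Python
            body = "\n".join(f"# {line}" if line else "#" for line in source.split("\n"))
            blocks.append("# %% [markdown]\n" + body)
        elif cell.get("cell_type") == "code":
            blocks.append("# %%\n" + source)
    return "\n\n".join(blocks) + "\n"


notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# My analysis"]},
        {"cell_type": "code", "source": ["import pandas as pd\n", 'df = pd.read_csv("data.csv")']},
    ]
}

print(to_percent_script(notebook))
```

The result is ordinary Python text with cell markers, which is why Git can diff it line by line.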

Solution 3: ReviewNB - visual review

ReviewNB is an online tool that integrates with GitHub and GitLab to display notebook diffs visually - with rendered graphs, tables and Markdown. Ideal for Pull Requests that touch notebooks.

Versioning ML experiments

In machine learning, you often run dozens of experiments with different hyperparameters. Tracking results by hand quickly becomes chaotic. Several tools integrate with Git to solve this problem.

MLflow + Git

MLflow is an open source platform for managing the machine learning lifecycle. It automatically records the parameters, metrics and artifacts of each experiment.

# Install MLflow
pip install mlflow

# In your training script
import mlflow
import mlflow.sklearn  # needed for mlflow.sklearn.log_model below
mlflow.set_experiment("image-classification")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("epochs", 50)
    mlflow.log_param("batch_size", 32)

    # ... training ... (must produce `model`, a fitted scikit-learn estimator)

    # Log metrics
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1_score", 0.89)

    # Log model
    mlflow.sklearn.log_model(model, "model")

# Launch the web interface to visualize results
# mlflow ui

MLflow also records the Git commit hash of each run when the script is launched from inside a Git repository, so you can find exactly which code produced which results.
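
The idea of tying a result to a commit can be reproduced without any tracking tool. The sketch below is not MLflow's mechanism; it only shows the principle, assuming git is available on the PATH. The function current_git_commit and the run_record dictionary are hypothetical.

```python
import subprocess
from typing import Optional


def current_git_commit() -> Optional[str]:
    """Return the hash of HEAD, or None outside a Git repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git is missing, or the directory is not a repository
    return result.stdout.strip()


# Store the commit next to the parameters and metrics of the run
run_record = {
    "params": {"learning_rate": 0.001, "epochs": 50},
    "metrics": {"accuracy": 0.91},
    "git_commit": current_git_commit(),
}
print(run_record)
```

For the recorded hash to be meaningful, commit your changes before launching the run; uncommitted edits are not captured by a commit hash.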

Weights & Biases (W&B)

Weights & Biases is a cloud service that offers features similar to MLflow, with a richer interface and advanced collaboration features (experiment comparison, interactive tables, reports).

# Install wandb
pip install wandb

# In your script
import wandb

wandb.init(project="image-classification")
wandb.config.learning_rate = 0.001
wandb.config.epochs = 50

# ... training ...

wandb.log({"accuracy": 0.91, "loss": 0.23})

# Results are visible on app.wandb.ai

DVC vs MLflow vs W&B: these tools are not competitors. DVC manages data and pipelines. MLflow and W&B manage experiments and metrics. You can (and should) use them together.

Organizing an ML repo

A well-organized data science project facilitates collaboration and reproducibility. Here's a recommended structure:

my-ml-project/
├── .dvc/                 # DVC configuration
├── .dvcignore            # Files ignored by DVC
├── .gitignore            # Files ignored by Git
├── dvc.yaml              # DVC pipeline
├── dvc.lock              # Hashes of each stage's deps and outputs (generated by dvc repro)
├── params.yaml           # Hyperparameters (versioned in Git)
├── README.md
├── requirements.txt      # Python dependencies
│
├── data/
│   ├── raw/              # Raw data (DVC, never modified)
│   ├── processed/        # Cleaned data (DVC, generated by the pipeline)
│   └── external/         # External data (DVC)
│
├── models/               # Trained models (DVC)
│
├── src/                  # Source code (Git)
│   ├── data/
│   │   ├── download.py   # Download data
│   │   └── prepare.py    # Prepare/clean data
│   ├── features/
│   │   └── build.py      # Build features
│   ├── models/
│   │   ├── train.py      # Train the model
│   │   └── predict.py    # Make predictions
│   └── evaluate/
│       └── evaluate.py   # Evaluate the model
│
├── notebooks/            # Exploration notebooks (Git, with nbstripout)
│   ├── 01-exploration.ipynb
│   ├── 02-feature-engineering.ipynb
│   └── 03-analysis.ipynb
│
├── metrics/              # Metrics (Git, small JSON files)
│   └── eval_metrics.json
│
├── plots/                # Generated visualizations (Git or DVC depending on size)
│
└── tests/                # Unit tests (Git)
    └── test_prepare.py

.gitignore for ML projects

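# .gitignore for an ML project
# Data and models are managed by DVC, not Git. DVC writes the matching
# ignore entries itself (for example in data/raw/.gitignore) when you run
# dvc add or declare pipeline outs. Avoid ignoring whole directories such as
# /data/raw/ here: Git would then also ignore the .dvc pointer files inside them.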

# Python
__pycache__/
*.py[cod]
*.egg-info/
.eggs/
dist/
build/
*.egg

# Virtual environments
venv/
env/
.env/

# Jupyter
.ipynb_checkpoints/

# IDE
.idea/
.vscode/
*.swp

# OS
.DS_Store
Thumbs.db

# MLflow
mlruns/

# W&B
wandb/

The golden rule: code goes in Git, data and models go in DVC. Metrics (small JSON files) go in Git to facilitate comparisons between branches and commits.

Practical exercise - Mini DVC pipeline

Create a mini data science project with DVC. You'll initialize a repository, add a dataset, and create a simple processing pipeline.

Step 1 - Initialize the project

# --- Linux / macOS (Bash) ---
# Create the project
mkdir infinite-archives-project
cd infinite-archives-project
git init -b main
dvc init

# Create the structure
mkdir -p data/raw data/processed src models metrics

# Commit the initialization
git add .
git commit -m "Initialize project with Git and DVC"

# --- Windows (PowerShell) ---
# Create the project
mkdir infinite-archives-project
cd infinite-archives-project
git init -b main
dvc init

# Create the structure
New-Item -ItemType Directory -Force -Path data/raw, data/processed, src, models, metrics

# Commit the initialization
git add .
git commit -m "Initialize project with Git and DVC"

Step 2 - Create a fictional dataset

# --- Linux / macOS (Bash) ---
# Create a small fictional CSV dataset
cat > data/raw/adventurers.csv << 'EOF'
name,class,level,strength,intelligence,xp
Aldric,Warrior,12,18,8,4500
Lyria,Mage,10,6,20,3800
Thorin,Paladin,15,16,12,7200
Selene,Rogue,8,10,14,2100
Grimm,Barbarian,20,22,5,12000
Elara,Druid,11,12,16,4100
Kael,Ranger,9,14,11,2800
Mira,Enchanter,14,7,19,6500
Bron,Knight,17,19,10,9300
Zara,Necromancer,13,8,18,5700
EOF

# Add the dataset to DVC
dvc add data/raw/adventurers.csv

# Commit the pointer
git add data/raw/adventurers.csv.dvc data/raw/.gitignore
git commit -m "Add adventurers dataset"

# --- Windows (PowerShell) ---
# Create a small fictional CSV dataset
@"
name,class,level,strength,intelligence,xp
Aldric,Warrior,12,18,8,4500
Lyria,Mage,10,6,20,3800
Thorin,Paladin,15,16,12,7200
Selene,Rogue,8,10,14,2100
Grimm,Barbarian,20,22,5,12000
Elara,Druid,11,12,16,4100
Kael,Ranger,9,14,11,2800
Mira,Enchanter,14,7,19,6500
Bron,Knight,17,19,10,9300
Zara,Necromancer,13,8,18,5700
"@ | Set-Content data/raw/adventurers.csv

# Add the dataset to DVC
dvc add data/raw/adventurers.csv

# Commit the pointer
git add data/raw/adventurers.csv.dvc data/raw/.gitignore
git commit -m "Add adventurers dataset"

Step 3 - Create the pipeline scripts

# --- Linux / macOS (Bash) ---
# Data preparation script
cat > src/prepare.py << 'PYEOF'
"""Prepare adventurer data."""
import csv
import os

def prepare():
    os.makedirs("data/processed", exist_ok=True)

    with open("data/raw/adventurers.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    # Add a "power" column = strength + intelligence
    for row in rows:
        row["power"] = int(row["strength"]) + int(row["intelligence"])

    # Write prepared data
    fieldnames = list(rows[0].keys())
    with open("data/processed/adventurers_prepared.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    print(f"Data prepared: {len(rows)} adventurers")

if __name__ == "__main__":
    prepare()
PYEOF

# Analysis script (simulates "training")
cat > src/analyze.py << 'PYEOF'
"""Analyze adventurers and produce metrics."""
import csv
import json
import os

def analyze():
    os.makedirs("metrics", exist_ok=True)

    with open("data/processed/adventurers_prepared.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    # Calculate statistics
    levels = [int(r["level"]) for r in rows]
    powers = [int(r["power"]) for r in rows]

    metrics = {
        "total_adventurers": len(rows),
        "avg_level": round(sum(levels) / len(levels), 2),
        "max_level": max(levels),
        "avg_power": round(sum(powers) / len(powers), 2),
        "max_power": max(powers),
    }

    with open("metrics/analysis.json", "w") as f:
        json.dump(metrics, f, indent=2)

    print(f"Analysis complete: {metrics}")

if __name__ == "__main__":
    analyze()
PYEOF

# --- Windows (PowerShell) ---
# Data preparation script
@"
"""Prepare adventurer data."""
import csv
import os

def prepare():
    os.makedirs("data/processed", exist_ok=True)

    with open("data/raw/adventurers.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    # Add a "power" column = strength + intelligence
    for row in rows:
        row["power"] = int(row["strength"]) + int(row["intelligence"])

    # Write prepared data
    fieldnames = list(rows[0].keys())
    with open("data/processed/adventurers_prepared.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    print(f"Data prepared: {len(rows)} adventurers")

if __name__ == "__main__":
    prepare()
"@ | Set-Content src/prepare.py

# Analysis script
@"
"""Analyze adventurers and produce metrics."""
import csv
import json
import os

def analyze():
    os.makedirs("metrics", exist_ok=True)

    with open("data/processed/adventurers_prepared.csv") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    # Calculate statistics
    levels = [int(r["level"]) for r in rows]
    powers = [int(r["power"]) for r in rows]

    metrics = {
        "total_adventurers": len(rows),
        "avg_level": round(sum(levels) / len(levels), 2),
        "max_level": max(levels),
        "avg_power": round(sum(powers) / len(powers), 2),
        "max_power": max(powers),
    }

    with open("metrics/analysis.json", "w") as f:
        json.dump(metrics, f, indent=2)

    print(f"Analysis complete: {metrics}")

if __name__ == "__main__":
    analyze()
"@ | Set-Content src/analyze.py

Step 4 - Define the DVC pipeline

# --- Linux / macOS (Bash) ---
# Create the dvc.yaml file
cat > dvc.yaml << 'EOF'
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/adventurers.csv
    outs:
      - data/processed/adventurers_prepared.csv

  analyze:
    cmd: python src/analyze.py
    deps:
      - src/analyze.py
      - data/processed/adventurers_prepared.csv
    metrics:
      - metrics/analysis.json:
          cache: false
EOF

# Commit the scripts and pipeline
git add src/ dvc.yaml
git commit -m "Add DVC pipeline (prepare + analyze)"

# --- Windows (PowerShell) ---
# Create the dvc.yaml file
@"
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - src/prepare.py
      - data/raw/adventurers.csv
    outs:
      - data/processed/adventurers_prepared.csv

  analyze:
    cmd: python src/analyze.py
    deps:
      - src/analyze.py
      - data/processed/adventurers_prepared.csv
    metrics:
      - metrics/analysis.json:
          cache: false
"@ | Set-Content dvc.yaml

# Commit the scripts and pipeline
git add src/ dvc.yaml
git commit -m "Add DVC pipeline (prepare + analyze)"

Step 5 - Run the pipeline

# Run the full pipeline
dvc repro

# DVC executes the stages in order:
# 1. prepare -> creates data/processed/adventurers_prepared.csv
# 2. analyze -> creates metrics/analysis.json

# See the metrics
dvc metrics show

# Commit the results (DVC also wrote data/processed/.gitignore for the pipeline output)
git add dvc.lock metrics/analysis.json data/processed/.gitignore
git commit -m "Run pipeline - first analysis"

Step 6 - Modify and re-run

# Add adventurers to the dataset
cat >> data/raw/adventurers.csv << 'EOF'
Vex,Assassin,16,15,13,8100
Luna,Priestess,18,9,17,10500
EOF

# Update the DVC file
dvc add data/raw/adventurers.csv
git add data/raw/adventurers.csv.dvc

# Re-run the pipeline - DVC detects the change
dvc repro

# Compare metrics
dvc metrics diff

# Commit everything
git add dvc.lock metrics/analysis.json
git commit -m "Add 2 adventurers - metrics updated"

If dvc repro correctly executes both stages and dvc metrics show displays the statistics, congratulations! You've created your first reproducible data science pipeline.
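
To check your results, the following self-contained sketch recomputes the statistics directly from the exercise data, without DVC. It reuses the logic of prepare.py and analyze.py in memory; the constants INITIAL and ADDED and the function analyze_text are names chosen for this illustration.

```python
import csv
import io

INITIAL = """name,class,level,strength,intelligence,xp
Aldric,Warrior,12,18,8,4500
Lyria,Mage,10,6,20,3800
Thorin,Paladin,15,16,12,7200
Selene,Rogue,8,10,14,2100
Grimm,Barbarian,20,22,5,12000
Elara,Druid,11,12,16,4100
Kael,Ranger,9,14,11,2800
Mira,Enchanter,14,7,19,6500
Bron,Knight,17,19,10,9300
Zara,Necromancer,13,8,18,5700
"""

ADDED = """Vex,Assassin,16,15,13,8100
Luna,Priestess,18,9,17,10500
"""


def analyze_text(csv_text: str) -> dict:
    """Compute the same metrics as the pipeline, from CSV text in memory."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    levels = [int(r["level"]) for r in rows]
    powers = [int(r["strength"]) + int(r["intelligence"]) for r in rows]
    return {
        "total_adventurers": len(rows),
        "avg_level": round(sum(levels) / len(levels), 2),
        "max_level": max(levels),
        "avg_power": round(sum(powers) / len(powers), 2),
        "max_power": max(powers),
    }


print("Step 5:", analyze_text(INITIAL))
print("Step 6:", analyze_text(INITIAL + ADDED))
```

After Step 5 you should see 10 adventurers with avg_level 12.9 and avg_power 26.5; after Step 6, 12 adventurers with avg_level 13.58 and avg_power 26.58, while max_level (20) and max_power (29) stay unchanged.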

Command summary

Command	Description
`dvc init`	Initialize DVC in a Git repository
`dvc add <file>`	Add a file to DVC (creates a .dvc pointer)
`dvc push`	Push data to the DVC remote
`dvc pull`	Pull data from the DVC remote
`dvc remote add`	Configure a storage remote
`dvc repro`	Run the pipeline (re-runs only stages whose dependencies changed)
`dvc dag`	Visualize the pipeline dependency graph
`dvc metrics show`	Display current metrics
`dvc metrics diff`	Compare metrics between versions
`dvc status`	See file and pipeline status
`nbstripout --install`	Enable automatic notebook output cleaning
`jupytext --to py:percent`	Convert a notebook to a Python script

The Archivist of Numbers closes the great book he was consulting and looks at you with a discreet smile.

"You've learned to tame the infinite. Data no longer frightens you. You know how to version it without blowing up your chronicles, you know how to create reproducible pipelines, and you know that every experiment deserves to be carefully recorded."

"The world of data science evolves fast. New tools appear every month. But the principles remain the same: separate code from data, make your experiments reproducible, and never commit a 10 GB dataset into Git."

He walks you back to the exit of the Infinite Library. The corridors that once seemed endless now feel familiar.

"Come back anytime. The Infinite Archives are always open."