Kaggle Getting Started Competition:
Monet Image Generation with Generative Adversarial Networks
By: Travis Reinart
September 9, 2025
Week 5 Peer-Graded Assignment: Use GANs to Create Art
CSCA 5642: Introduction to Deep Learning
Copyright (c) 2025 Travis Reinart
Licensed under the MIT License.
Section 1: Introduction
This notebook documents a complete workflow for the Kaggle "GAN Getting Started" competition, a project focused on image-to-image translation. The objective is to build and train a Generative Adversarial Network (GAN) capable of creating new, original images in the style of the impressionist painter Claude Monet. The personal goal for this project is to achieve a competitive MiFID score of 60 or less on the leaderboard.
Generative Adversarial Networks (GANs)
A Generative Adversarial Network (GAN) is a type of deep learning model where two neural networks are trained in competition with each other to generate new, original data that mimics a given dataset. The two networks are:
- The Generator: Its goal is to create fake data (in this case, images) that is as realistic as possible.
- The Discriminator: Its goal is to look at an item and determine if it's real (from the original dataset) or fake (created by the Generator).
Think of a quarterback learning a legendary playbook. Two people train together: a young QB and a veteran coach who knows every Lombardi concept by heart.
- The Generator is Jordan Love. He sketches new plays from random ideas and tries to make them pass for true Lombardi material.
- The Discriminator is a Lombardi-era coach who has studied the real book for years. His job is simple: watch a play and call it real or fake.
They sit in the film room. Love draws a play and walks through the motion. The coach pauses the clip and points out a tell: the split is too wide, the timing is off, the angle on the pull is wrong. Love adjusts, runs it again, and the coach looks harder. Each rep forces both to improve: the QB learns patterns that make a play feel authentic; the coach learns to spot smaller flaws. After thousands of reps, the coach is right only about half the time. At that point, Love can create new plays that fit the Lombardi style.
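The film-room loop above maps directly onto code: each training step updates both networks from the same batch. The sketch below is a minimal, generic GAN train step with toy layer sizes, not the CycleGAN used later in this notebook; the model shapes and learning rates are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

LATENT_DIM = 16  # toy noise dimension, an illustrative assumption

# Tiny generator: noise -> fake 8x8 RGB "image" (toy sizes for clarity).
generator = keras.Sequential([
    keras.Input(shape=(LATENT_DIM,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(8 * 8 * 3, activation="tanh"),
    layers.Reshape((8, 8, 3)),
])

# Tiny discriminator: image -> single real/fake logit.
discriminator = keras.Sequential([
    keras.Input(shape=(8, 8, 3)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])

bce = keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = keras.optimizers.Adam(2e-4)
d_opt = keras.optimizers.Adam(2e-4)

def train_step(real_images):
    noise = tf.random.normal([tf.shape(real_images)[0], LATENT_DIM])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: call real images real and generated images fake.
        d_loss = (bce(tf.ones_like(real_logits), real_logits)
                  + bce(tf.zeros_like(fake_logits), fake_logits))
        # Generator: fool the discriminator into calling fakes real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```

When the discriminator is right only about half the time, `d_loss` hovers near its equilibrium value and neither network can easily improve against the other.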
Modeling Strategy
This notebook implements a CycleGAN with a ResNet-9 generator and a 70×70 PatchGAN discriminator, an architecture proven to be highly effective for unpaired image-to-image translation.
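The "ResNet-9" in that architecture refers to nine residual blocks at the generator's bottleneck. The sketch below shows one such block and the nine-block body, with two simplifying assumptions: zero ("same") padding instead of the reflection padding used in the original CycleGAN paper, and `GroupNormalization(groups=-1)` as a stand-in for instance normalization (with one group per channel they are equivalent).

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def resnet_block(x, filters=256):
    """One CycleGAN-style residual block: conv-norm-relu-conv-norm + skip."""
    skip = x
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.GroupNormalization(groups=-1)(y)  # instance norm: 1 group/channel
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.GroupNormalization(groups=-1)(y)
    return layers.Add()([skip, y])  # identity shortcut preserves shape

# ResNet-9 body: nine blocks at the 64x64, 256-channel bottleneck
# (the resolution after two stride-2 downsampling convs from 256x256).
inputs = keras.Input(shape=(64, 64, 256))
x = inputs
for _ in range(9):
    x = resnet_block(x)
body = keras.Model(inputs, x)
```

Because each block adds its output to an unchanged skip connection, the generator learns a *residual* transformation of the photo, which is well suited to style transfer where most structure should be preserved.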
Framework Choice: TensorFlow & Keras
TensorFlow 2.20 with Keras 3 was chosen because its API is an excellent fit for image-to-image models. Features like the Functional API for building complex architectures, custom training via Model.train_step for non-standard loops, and the integrated tf.data pipeline for high-performance input streams keep the entire workflow clean and efficient. While other frameworks like PyTorch are equally capable, Keras was selected here for its readability and ease of prototyping for this specific computer vision task.
Evaluation Metric: MiFID
This competition uses the Memorization-informed Fréchet Inception Distance (MiFID), where a **lower score is better**. MiFID builds on the standard FID, which compares the statistical distribution of features from real and generated images. It adds a crucial penalty if any generated image is a near-duplicate of a training image, thus discouraging simple memorization and rewarding true stylistic generation.
The Fréchet Inception Distance between real images $(x)$ and generated images $(g)$ is:
$$ d^2=\lVert \mu_{x}-\mu_{g}\rVert^{2} +\operatorname{Tr}\!\left(\Sigma_{x}+\Sigma_{g}-2\,(\Sigma_{x}\Sigma_{g})^{1/2}\right) $$
Where:
- $\mu_x, \Sigma_x$ are the mean and covariance of Inception features for the real images.
- $\mu_g, \Sigma_g$ are the mean and covariance for the generated images.
- $\operatorname{Tr}$ is the matrix trace.
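The FID formula can be transcribed almost line-for-line into NumPy/SciPy. In practice the inputs are InceptionV3 activations; here any `(n_samples, n_features)` arrays serve as a stand-in, which is an illustrative assumption (this is the base FID only, without MiFID's memorization penalty).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID between two feature sets (rows = images, cols = feature dims)."""
    mu_x, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_x @ sigma_g)  # matrix square root term
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny numerical imaginary parts
    diff = mu_x - mu_g
    # ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2(Sigma_x Sigma_g)^(1/2))
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```

A quick sanity check: comparing a feature set against itself gives a distance near zero, while shifting the distribution's mean drives the score up, matching "lower is better."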
Submission Format
The final output must be a single ZIP file named images.zip containing 7,000 to 10,000 generated Monet-style images at 256x256 resolution. The final, successful submission for this project leveraged a 10,000-image set created with Test-Time Augmentation to maximize the score.
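Packaging the generated images is straightforward with the standard library. The sketch below writes in-memory JPEGs into `images.zip`; the sequential file names and the placeholder noise images standing in for real generator output are assumptions for illustration.

```python
import io
import zipfile
import numpy as np
from PIL import Image

def write_submission(images, zip_path="images.zip"):
    """Write uint8 RGB arrays of shape (256, 256, 3) as JPEGs into a ZIP."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, arr in enumerate(images):
            buf = io.BytesIO()
            Image.fromarray(arr).save(buf, format="JPEG", quality=95)
            zf.writestr(f"{i:05d}.jpg", buf.getvalue())  # hypothetical naming

# Demo with random noise in place of generator output.
demo = [np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8) for _ in range(3)]
write_submission(demo, "images.zip")
```

Writing JPEGs to an in-memory buffer avoids touching disk for each image, which matters when emitting 10,000 files for the final submission.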
Project Plan
- Section 1: Introduction: Define the problem, the GAN concept, the modeling strategy, and the evaluation metric.
- Section 2: Setup: Configure the environment, import libraries, and set seeds for reproducibility.
- Section 3: Data Loading & Audit: Load the Monet and photo datasets and perform an integrity check.
- Section 4: Visual EDA: Visually inspect the two domains to understand their statistical differences.
- Section 5: Data Pipeline: Build a high-performance `tf.data` pipeline for image preprocessing and augmentation.
- Section 6: CycleGAN Model: Define and build the ResNet-9 Generator and PatchGAN Discriminator.
- Section 7: Training: Configure the full training objective and execute the initial 15-epoch training run.
- Section 8: Fine-Tuning & Final Submission: Analyze the initial result and run a targeted fine-tuning experiment to improve the score, then generate the final submission file.
- Section 9: Conclusion & Analysis: Present the final Kaggle results, visualize the generated images, and summarize key learnings.
1.1 Kaggle Competition: Final Submissions
This screenshot shows the two official submissions to the Kaggle I'm Something of a Painter Myself competition. The initial submission, using the model from the first 15-epoch training run, established a baseline score of 69.06. The second and final submission, which used the fine-tuned model along with Test-Time Augmentation, achieved a final score of 57.83. This represents a significant 11.23-point improvement and successfully surpassed the project's personal goal of scoring below 60.
Section 2: Setup
With the project plan defined, the next step is to set up the programming environment. This section handles all the necessary boilerplate: importing the required libraries (like TensorFlow and Matplotlib), setting a global random seed for reproducible results, and running a full diagnostic script to verify the environment and confirm GPU access.
Section Plan:
- Data Sources: Confirm files, shapes, and basic integrity.
- Package Installation: An optional cell to install any missing packages.
- Environment Notes: Optional setup guide for a high-performance WSL2 environment.
- Core Library Imports: A central location for all libraries used in the notebook.
- Diagnostics: Print library versions, and check for GPU access.
2.1 The Dataset: Monet Paintings and Landscape Photos
The dataset for this competition is provided in two main parts: a collection of Claude Monet's paintings and a larger collection of landscape photographs. The goal is to train a model that can learn the "style" of the Monet paintings and apply it to the photographs. The data is provided in two formats: standard JPEG files for visual inspection and TFRecord files for efficient model training.
| Dataset Component | Format | Files | Total Size | Purpose |
|---|---|---|---|---|
| 📁 Monet Paintings | TFRecord | 5 shards (≈300 images total) | ≈9.9 MB | Training Data (Target Domain) |
| 📁 Landscape Photos | TFRecord | 20 shards (≈7,038 images total) | ≈260 MB | Training Data (Input Domain) |
| 📁 Monet Paintings | JPEG | 300 files | ≈5.4 MB | EDA / Visual Reference |
| 📁 Landscape Photos | JPEG | 7,038 files | ≈108 MB | Test Set for Final Submission |
Counts and sizes reflect the local WSL copy under ~/data/gan-getting-started.
TFRecord rows list shard counts; totals in parentheses match the JPEG counts.
Verified in WSL: monet_jpg=300, photo_jpg=7038, monet_tfrec=5 shards, photo_tfrec=20 shards.
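For training, each TFRecord example must be decoded back into an image tensor. The sketch below shows a minimal parser; the feature schema (`image_name`, `image`, `target` as byte strings) matches the Kaggle dataset's published layout but should be treated as an assumption and verified against a real shard.

```python
import tensorflow as tf

# Assumed feature schema for the competition TFRecords.
FEATURES = {
    "image_name": tf.io.FixedLenFeature([], tf.string),
    "image": tf.io.FixedLenFeature([], tf.string),
    "target": tf.io.FixedLenFeature([], tf.string),
}

def parse_example(serialized):
    """Decode one serialized Example into a 256x256x3 uint8 image tensor."""
    example = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.reshape(image, [256, 256, 3])  # all competition images are 256x256
    return image
```

In the real pipeline this function would be mapped over `tf.data.TFRecordDataset(shard_paths)` so decoding runs inside the input graph.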
This notebook was developed using the Jupyter Notebook extension within Visual Studio Code. Generative AI was used as a supplementary tool to assist with code debugging and to refine language for clarity. The core logic and all final analysis are original.
2.2 Optional: Install Missing Packages
To ensure the notebook runs smoothly, this optional cell can be used to install any missing packages. Using %pip directly within Jupyter guarantees the packages are installed in the correct environment where the kernel is running.
Instructions:
- Uncomment the relevant `%pip install` line(s) below.
- Run this cell.
- Wait for the "Successfully installed" message.
- In the Jupyter menu, select Kernel → Restart.
- Re-run the import cell.
Note: Restarting the kernel is required after installation so the notebook can detect the new packages.
# Uncomment and run the lines below to install any missing packages.
# Project environment & tools
# %pip install --upgrade pip
# %pip install "tensorflow[and-cuda]==2.20.*"
# %pip install notebook
# %pip install jupyterlab
# %pip install ipywidgets
# %pip install ipykernel
# %pip install qtconsole
# %pip install kaggle
# Core numerics and utilities
# %pip install numpy
# %pip install pandas
# %pip install tqdm
# Visualization
# %pip install matplotlib
# %pip install seaborn
# %pip install scikit-learn
# %pip install pillow
# %pip install ipython
# Deep Learning (TensorFlow / Keras)
# %pip install tensorflow
# Optional: export the notebook to HTML (uncomment the lines below to run).
# Most Jupyter installs include nbconvert. If the command fails, install nbconvert first.
# !jupyter nbconvert --version
# %pip install nbconvert
# --- Convert Jupyter Notebook .ipynb to .html ---
# IMPORTANT: This bang command is written for a Windows or plain Python kernel.
# It will not work as-is in a WSL kernel unless you use a Linux path to the notebook.
# Run this from a Windows kernel, or change the path to a Linux path when in WSL.
!jupyter nbconvert --to html "Week5_Kaggle_Monet_Competition.ipynb"
[NbConvertApp] Converting notebook Week5_Kaggle_Monet_Competition.ipynb to html [NbConvertApp] WARNING | Alternative text is missing on 14 image(s). [NbConvertApp] Writing 44033005 bytes to Week5_Kaggle_Monet_Competition.html
2.3 Optional: Run on GPU with Ubuntu 22.04 + VS Code (WSL)
This section provides a complete guide for setting up a high-performance development environment using Ubuntu 22.04 on WSL2. The goal is to run Jupyter inside a clean Python 3.10.12 environment and leverage the NVIDIA GPU via TensorFlow (GPU).
Goal: launch Jupyter in Ubuntu 22.04 (WSL2), use a virtual env on Python 3.10.12, and make sure TensorFlow sees the NVIDIA GPU.
Below are the steps I followed to create this environment:
Step 1: Install Ubuntu on WSL2
| Action | Command / Note |
|---|---|
| Open PowerShell as Administrator and run: | `wsl --install -d Ubuntu-22.04`<br>`wsl --set-default-version 2`<br>`wsl --update` |
| Launch Ubuntu and create your user/password. | Important Tip: The password prompt in the Linux terminal stays blank as you type. This is normal security behavior! |
Step 2: Install NVIDIA Driver & Verify
| Action | Command / Note |
|---|---|
| Install the latest driver on Windows, not inside WSL. The Windows driver automatically provides CUDA support to WSL2. | NVIDIA Driver Downloads |
| From the Ubuntu terminal, verify the GPU is visible: | `nvidia-smi` |
Step 3: Create Python Virtual Environment
| Action | Command / Note |
|---|---|
| Install Python 3.10 and venv: | `sudo apt-get update && sudo apt-get -y install python3.10 python3.10-venv python3-pip` |
| Create and activate the environment. I named mine tfenv (TensorFlow environment): | `python3.10 -m venv ~/tfenv`<br>`source ~/tfenv/bin/activate` |
| Install libraries and register the kernel for Jupyter: | `pip install -U pip`<br>`pip install "tensorflow[and-cuda]==2.20.*" numpy pandas matplotlib seaborn tqdm scikit-learn jupyter ipykernel pillow nbconvert`<br>`python -m ipykernel install --user --name tfenv --display-name "Python (tfenv)"` |
Step 4: Connect with VS Code
| Action | Command / Note |
|---|---|
| In VS Code, install the "Remote - WSL" extension, then connect to your Ubuntu instance (Ctrl+Shift+P → "WSL: Connect to WSL"). | Once connected, the bottom-left corner should show WSL: Ubuntu-22.04. |
Step 5: Launch the Notebook in the WSL Environment
| Action | Command / Note |
|---|---|
| In VS Code, open the Command Palette (Ctrl+Shift+P), type "WSL: Connect to WSL", and select it. | A new VS Code window connected to Ubuntu will open. |
| Verify the connection by checking the blue box in the bottom-left corner. | The status bar should say WSL: Ubuntu-22.04. |
| In this new window, go to Open File > Show Local..., navigate to your project folder, and open your notebook. You can now select the Python (tfenv) 3.10.12 kernel. | This final step ensures your notebook is running inside Ubuntu with access to the correct Python environment and GPU. |
2.4 Core Library Imports
All libraries are imported in one place and grouped by purpose. This cell does imports only. No GPU configuration and no diagnostics here.
- System & I/O: paths, timing, files, utilities.
- Numerics & tables: NumPy, Pandas.
- Progress: `tqdm`.
- Visualization: Matplotlib, Seaborn, PIL, HTML display.
- Deep learning: TensorFlow / Keras layers, callbacks, and mixed precision.
- Reproducibility: single seed block for Python, NumPy, and TensorFlow.
# High-Performance TensorFlow Configuration (MUST RUN BEFORE IMPORT)
# These settings help prevent GPU memory fragmentation and OOM errors.
import os
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
os.environ["TF_CUDNN_WORKSPACE_LIMIT_IN_MB"] = "256"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
# Core utilities
import sys
import platform
import random
import time
import json
import gc
import glob
import shutil
import warnings
import math
import io
import contextlib
import zipfile
import hashlib
from collections import deque, Counter, defaultdict
from pathlib import Path
# Numerics
import numpy as np
import pandas as pd
# Progress
from tqdm.auto import tqdm
# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from IPython.display import display, HTML
# Deep Learning (TensorFlow / Keras)
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks, mixed_precision
# Safer VRAM allocation pattern
tf.get_logger().setLevel("ERROR")
for g in tf.config.list_physical_devices("GPU"):
try:
tf.config.experimental.set_memory_growth(g, True)
except Exception:
pass
# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)
tf.random.set_seed(SEED)
try:
tf.config.experimental.enable_op_determinism()
except Exception:
pass
# General settings
warnings.filterwarnings("ignore")
print("-" * 70)
print(f"SEED = {SEED}")
print("-" * 70)
---------------------------------------------------------------------- SEED = 42 ----------------------------------------------------------------------
2.5 Diagnostics and Verification
Verify the environment before training. This cell displays versions for all imported libraries and confirms whether TensorFlow can see a GPU.
# Display GPU Hardware Information
print("-" * 70)
!nvidia-smi
# TensorFlow build info
try:
build = tf.sysconfig.get_build_info()
print(f"CUDA: {build.get('cuda_version')} | cuDNN: {build.get('cudnn_version')}")
except Exception:
pass
print("-" * 70)
print("--- Environment & Library Versions ---")
# System Information
print(f"Python Version: {platform.python_version()}")
print(f"OS: {platform.system()} {platform.release()}")
# Core ML/DS Library Versions (only what was imported in 2.4)
print("\n--- Core Libraries ---")
print(f"TensorFlow: {tf.__version__}")
print(f"Keras: {keras.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scikit-learn: {__import__('sklearn').__version__}")
# Visualization & Image Library Versions
print("\n--- Supporting Libraries ---")
print(f"Matplotlib: {mpl.__version__}")
print(f"Seaborn: {sns.__version__}")
print(f"tqdm: {__import__('tqdm').__version__}")
print(f"Pillow (PIL): {__import__('PIL').__version__}")
# Mixed precision policy
try:
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)
print("\n--- Performance Policies ---")
print(f" Mixed Precision Compute Dtype: {policy.compute_dtype}")
print(f" Mixed Precision Variable Dtype: {policy.variable_dtype}")
except Exception as e:
print(f" Could not set mixed precision policy: {e}")
print("-" * 70)
print(f"SEED set to {SEED} - reproducibility enabled.")
print("-" * 70)
# GPU Hardware Diagnostics
print("\n--- GPU Hardware Verification ---")
gpu_devices = tf.config.list_physical_devices('GPU')
if gpu_devices:
details = tf.config.experimental.get_device_details(gpu_devices[0])
print(f" GPU Detected: {details.get('device_name', 'N/A')}")
print(f" Compute Capability: {details.get('compute_capability', 'N/A')}")
with tf.device("/GPU:0"):
a = tf.random.normal([2048, 2048], dtype=tf.float16)
b = tf.random.normal([2048, 2048], dtype=tf.float16)
t0 = time.time()
_ = tf.linalg.matmul(a,b).numpy()
print(f" GPU Matmul Test (FP16): {time.time()-t0:.4f} seconds")
else:
print(" --- WARNING: No GPU detected!!! ---")
print("-" * 70)
# Jupyter Environment
print("\n--- Jupyter Environment ---")
print("Which Jupyter :", shutil.which("jupyter"))
!jupyter --version
print("\n--- Setup complete. Libraries imported successfully. ---")
print("-" * 70)
----------------------------------------------------------------------
Mon Sep 8 23:15:55 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.34 Driver Version: 546.26 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 ... On | 00000000:01:00.0 Off | N/A |
| N/A 47C P0 13W / 102W | 283MiB / 8188MiB | 2% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
CUDA: 12.5.1 | cuDNN: 9
----------------------------------------------------------------------
--- Environment & Library Versions ---
Python Version: 3.10.12
OS: Linux 6.6.87.2-microsoft-standard-WSL2
--- Core Libraries ---
TensorFlow: 2.20.0
Keras: 3.11.3
NumPy: 1.26.4
Pandas: 2.3.2
Scikit-learn: 1.7.1
--- Supporting Libraries ---
Matplotlib: 3.10.5
Seaborn: 0.13.2
tqdm: 4.67.1
Pillow (PIL): 11.3.0
--- Performance Policies ---
Mixed Precision Compute Dtype: float16
Mixed Precision Variable Dtype: float32
----------------------------------------------------------------------
SEED set to 42 - reproducibility enabled.
----------------------------------------------------------------------
--- GPU Hardware Verification ---
GPU Detected: NVIDIA GeForce RTX 4070 Laptop GPU
Compute Capability: (8, 9)
GPU Matmul Test (FP16): 0.1514 seconds
----------------------------------------------------------------------
--- Jupyter Environment ---
Which Jupyter : /home/treinart/tfenv/bin/jupyter
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR I0000 00:00:1757391355.508266 25051 gpu_device.cc:2020] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5535 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.9 2025-09-08 23:15:55.528875: E tensorflow/core/util/util.cc:131] oneDNN supports DT_HALF only on platforms with AVX-512. Falling back to the default Eigen-based implementation if present.
Selected Jupyter core packages... IPython : 8.37.0 ipykernel : 6.30.1 ipywidgets : 8.1.7 jupyter_client : 8.6.3 jupyter_core : 5.8.1 jupyter_server : 2.17.0 jupyterlab : 4.4.6 nbclient : 0.10.2 nbconvert : 7.16.6 nbformat : 5.10.4 notebook : 7.4.5 qtconsole : 5.6.1 traitlets : 5.14.3 --- Setup complete. Libraries imported successfully. --- ----------------------------------------------------------------------
Observation: Diagnostics and Environment Verification
As of September 7, 2025, the project environment is confirmed to be stable and correctly configured for this computer vision task. The setup is running on a Linux distribution via WSL2, with a comprehensive suite of up-to-date libraries. Critically, the hardware verification confirms that the NVIDIA GPU is successfully detected and utilized by TensorFlow, which is essential for training the deep learning models.
Hardware & Driver Stack:
- GPU: NVIDIA GeForce RTX 4070 Laptop GPU
- NVIDIA Driver: 546.26
- CUDA Version (from driver): 12.3
- CUDA Version (from TensorFlow): 12.5.1
- cuDNN Version (from TensorFlow): 9
- VRAM: 8.0 GB
Software Environment:
- OS: Linux 6.6.87.2-microsoft-standard-WSL2
- Python: 3.10.12
- TensorFlow: 2.20.0
- Keras: 3.11.3
- NumPy: 1.26.4
- Pandas: 2.3.2
- Scikit-learn: 1.7.1
- Matplotlib: 3.10.5
- Seaborn: 0.13.2
- Pillow (PIL): 11.3.0
- tqdm: 4.67.1
Jupyter Core Packages:
- IPython: 8.37.0
- ipykernel: 6.30.1
- ipywidgets: 8.1.7
- notebook: 7.4.5
Performance Policies:
- Mixed Precision: Enabled (`float16` compute dtype) for lower memory usage and improved performance on the RTX 40-series GPU.
- GPU Memory Growth: Enabled to prevent TensorFlow from pre-allocating all VRAM, reducing the risk of OOM errors.
Key Configuration Note: The successful and stable training runs later in this notebook rely heavily on the performance policies set here. Enabling GPU memory growth and using mixed precision were critical for managing VRAM on the 8GB GPU, especially after re-introducing the memory-intensive replay buffers during the fine-tuning stage.
This verified setup provides a high-performance, reproducible environment for the subsequent data processing, model training, and submission generation.
Section 3: Data Loading & Integrity Audit
With the environment set, this section establishes the definitive paths to the dataset. It then performs a comprehensive audit to verify the data's integrity, checking file counts, dimensions, readability, and uniqueness. This ensures a clean, predictable dataset is ready before building the formal tf.data input pipeline.
Section Plan:
- Define Paths & Verify Structure: Point to the data root, define subfolder paths, and assert that all directories exist.
- Perform Statistical Audit: Sample the JPEG files to analyze image dimensions and basic pixel statistics (mean, std), flagging any corrupt files.
- Conduct Visual Audit: Display grids of random thumbnail images from each domain for a quick visual sanity check.
- Check TFRecord Integrity: Parse a sample of TFRecord files to confirm they can be decoded correctly and contain the expected 256x256 images.
- Analyze Duplicate Images: Use MD5 hashing to find identical images both within and across the Monet and Photo JPEG datasets.
- Build & Export De-duplicated Lists: Create a canonical, de-duplicated list of photo JPEGs and save the "keep" and "drop" file lists to CSV for reproducibility.
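The duplicate analysis in the plan above reduces to hashing raw file bytes and grouping matching digests. A minimal sketch (the function name is a placeholder; note MD5 catches byte-identical files only, not visually similar re-encodings):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(paths):
    """Group file paths by MD5 digest; return only groups with duplicates."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Running this over the union of both JPEG folders finds duplicates within each domain and across domains in a single pass, since paths from different folders hashing to the same digest land in the same group.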
3.1 Define Paths (Linux) & Assert Folders
Point DATA_ROOT to ~/data/gan-getting-started and define four subfolders: monet_jpg, photo_jpg, monet_tfrec, photo_tfrec. Use assert checks so path drift between Windows and WSL is caught immediately.
# Define paths to the dataset directories.
DATA_ROOT = Path.home() / "data" / "gan-getting-started"
MONET_JPG_DIR = DATA_ROOT / "monet_jpg"
PHOTO_JPG_DIR = DATA_ROOT / "photo_jpg"
MONET_TFREC_DIR = DATA_ROOT / "monet_tfrec"
PHOTO_TFREC_DIR = DATA_ROOT / "photo_tfrec"
# Verify that all required directories exist.
data_dirs = [MONET_JPG_DIR, PHOTO_JPG_DIR, MONET_TFREC_DIR, PHOTO_TFREC_DIR]
missing_dirs = [p for p in data_dirs if not p.exists()]
assert not missing_dirs, f"Missing expected folder(s): {[str(p) for p in missing_dirs]}"
# Count the number of files in each directory.
n_monet_jpg = sum(1 for _ in MONET_JPG_DIR.glob("*.jpg"))
n_photo_jpg = sum(1 for _ in PHOTO_JPG_DIR.glob("*.jpg"))
n_monet_tfrec = sum(1 for _ in MONET_TFREC_DIR.glob("*.tfrec"))
n_photo_tfrec = sum(1 for _ in PHOTO_TFREC_DIR.glob("*.tfrec"))
# Define the expected file counts from the Kaggle dataset page.
EXP_MONET_JPG = 300
EXP_PHOTO_JPG = 7038
EXP_MONET_TFREC = 5
EXP_PHOTO_TFREC = 20
# Assert that the actual file counts match the expected counts.
assert n_monet_jpg == EXP_MONET_JPG, f"monet_jpg count mismatch: got {n_monet_jpg}, expected {EXP_MONET_JPG}"
assert n_photo_jpg == EXP_PHOTO_JPG, f"photo_jpg count mismatch: got {n_photo_jpg}, expected {EXP_PHOTO_JPG}"
assert n_monet_tfrec == EXP_MONET_TFREC, f"monet_tfrec shard count mismatch: got {n_monet_tfrec}, expected {EXP_MONET_TFREC}"
assert n_photo_tfrec == EXP_PHOTO_TFREC, f"photo_tfrec shard count mismatch: got {n_photo_tfrec}, expected {EXP_PHOTO_TFREC}"
# Define helper functions to calculate and format directory sizes.
def _dir_size(p: Path) -> int:
try:
return sum(f.stat().st_size for f in p.rglob("*") if f.is_file())
except Exception:
return 0
def _fmt_bytes(n: int) -> str:
for unit in ("B", "KB", "MB", "GB", "TB"):
if n < 1024:
return f"{n:.1f} {unit}"
n /= 1024
return f"{n:.1f} PB"
# Calculate the size of each data directory.
sz_monet_jpg = _fmt_bytes(_dir_size(MONET_JPG_DIR))
sz_photo_jpg = _fmt_bytes(_dir_size(PHOTO_JPG_DIR))
sz_monet_tfrec = _fmt_bytes(_dir_size(MONET_TFREC_DIR))
sz_photo_tfrec = _fmt_bytes(_dir_size(PHOTO_TFREC_DIR))
# Store key counts as global variables for use in later sections.
N_MONET = n_monet_jpg
N_PHOTO = n_photo_jpg
N_MONET_TFREC_SHARDS = n_monet_tfrec
N_PHOTO_TFREC_SHARDS = n_photo_tfrec
# Define a helper to peek at the first few filenames for a sanity check.
def _peek(p: Path, pattern: str, k: int = 3):
return [str(x) for x in sorted(p.glob(pattern))[:k]]
# --- Display Verification Results ---
print("-" * 70)
print("Defining data roots and verifying folder structure...")
print("\n--- Data Roots ---")
print(f"DATA_ROOT : {DATA_ROOT}")
print(f"monet_jpg : {MONET_JPG_DIR}")
print(f"photo_jpg : {PHOTO_JPG_DIR}")
print(f"monet_tfrec : {MONET_TFREC_DIR}")
print(f"photo_tfrec : {PHOTO_TFREC_DIR}")
print("\n--- Quick Snapshot (counts & sizes) ---")
print(f"Monet JPEGs : {n_monet_jpg:>5} files | {sz_monet_jpg:>8}")
print(f"Photo JPEGs : {n_photo_jpg:>5} files | {sz_photo_jpg:>8}")
print(f"Monet TFRecord : {n_monet_tfrec:>5} shards | {sz_monet_tfrec:>8}")
print(f"Photo TFRecord : {n_photo_tfrec:>5} shards | {sz_photo_tfrec:>8}")
print("\n--- Sample Files ---")
print("monet_jpg →", _peek(MONET_JPG_DIR, "*.jpg"))
print("photo_jpg →", _peek(PHOTO_JPG_DIR, "*.jpg"))
print("monet_tfrec →", _peek(MONET_TFREC_DIR, "*.tfrec"))
print("photo_tfrec →", _peek(PHOTO_TFREC_DIR, "*.tfrec"))
print("\nPaths verified. Counts match expected Kaggle distribution.")
print("-" * 70)
---------------------------------------------------------------------- Defining data roots and verifying folder structure... --- Data Roots --- DATA_ROOT : /home/treinart/data/gan-getting-started monet_jpg : /home/treinart/data/gan-getting-started/monet_jpg photo_jpg : /home/treinart/data/gan-getting-started/photo_jpg monet_tfrec : /home/treinart/data/gan-getting-started/monet_tfrec photo_tfrec : /home/treinart/data/gan-getting-started/photo_tfrec --- Quick Snapshot (counts & sizes) --- Monet JPEGs : 300 files | 4.7 MB Photo JPEGs : 7038 files | 93.6 MB Monet TFRecord : 5 shards | 9.8 MB Photo TFRecord : 20 shards | 259.9 MB --- Sample Files --- monet_jpg → ['/home/treinart/data/gan-getting-started/monet_jpg/000c1e3bff.jpg', '/home/treinart/data/gan-getting-started/monet_jpg/011835cfbf.jpg', '/home/treinart/data/gan-getting-started/monet_jpg/0260d15306.jpg'] photo_jpg → ['/home/treinart/data/gan-getting-started/photo_jpg/00068bc07f.jpg', '/home/treinart/data/gan-getting-started/photo_jpg/000910d219.jpg', '/home/treinart/data/gan-getting-started/photo_jpg/000ded5c41.jpg'] monet_tfrec → ['/home/treinart/data/gan-getting-started/monet_tfrec/monet00-60.tfrec', '/home/treinart/data/gan-getting-started/monet_tfrec/monet04-60.tfrec', '/home/treinart/data/gan-getting-started/monet_tfrec/monet08-60.tfrec'] photo_tfrec → ['/home/treinart/data/gan-getting-started/photo_tfrec/photo00-352.tfrec', '/home/treinart/data/gan-getting-started/photo_tfrec/photo01-352.tfrec', '/home/treinart/data/gan-getting-started/photo_tfrec/photo02-352.tfrec'] Paths verified. Counts match expected Kaggle distribution. ----------------------------------------------------------------------
3.2 Statistical Audit of JPEG Images
Run a quick statistical audit on the JPEG datasets. This involves sampling images to find the most common dimensions, calculating the average RGB pixel values and standard deviations, and counting any corrupt or unreadable files. The results are summarized in a clean table.
# Set the sample sizes for the audit, with safe defaults.
# setdefault fills in only the names not already defined upstream, so a
# partial set of overrides is preserved instead of being reset wholesale.
for _name, _default in [
    ("N_SHAPE_MONET", 300), ("N_SHAPE_PHOTO", 1000),
    ("N_STATS_MONET", 300), ("N_STATS_PHOTO", 1200),
]:
    globals().setdefault(_name, _default)
# Initialize a random number generator for sampling.
rng = np.random.default_rng(SEED)
# Get lists of all JPEG files.
monet_jpgs = sorted(MONET_JPG_DIR.glob("*.jpg"))
photo_jpgs = sorted(PHOTO_JPG_DIR.glob("*.jpg"))
# Define a helper function to safely open an image as a normalized RGB array.
def _safe_open_rgb(p: Path):
try:
with Image.open(p) as im:
return np.asarray(im.convert("RGB"), dtype=np.float32) / 255.0
except Exception:
return None
# Define the main audit function to process one domain (Monet or Photo).
def _audit_one(name: str, files: list[Path], n_shape: int, n_stats: int):
# --- Perform a shape audit on a sample of files. ---
k_shape = min(n_shape, len(files))
shape_sample = list(rng.choice(files, size=k_shape, replace=False)) if k_shape else []
shape_counts = {}
corrupt = 0
for p in tqdm(shape_sample, desc=f"{name} shape audit", mininterval=0.2):
arr = _safe_open_rgb(p)
if arr is None:
corrupt += 1
continue
h, w = arr.shape[:2]
shape_counts[(w, h)] = shape_counts.get((w, h), 0) + 1
# Get the top 3 most common image dimensions.
top_shapes_list = sorted(shape_counts.items(), key=lambda kv: -kv[1])[:3]
top_shapes = ", ".join([f"{w}×{h}:{c}" for (w, h), c in top_shapes_list]) or "—"
# --- Calculate pixel statistics on a separate sample. ---
k_stats = min(n_stats, len(files))
stat_sample = list(rng.choice(files, size=k_stats, replace=False)) if k_stats else []
means, stds = [], []
for p in tqdm(stat_sample, desc=f"{name} pixel stats", mininterval=0.2):
arr = _safe_open_rgb(p)
if arr is None:
continue
means.append(arr.mean(axis=(0, 1)))
stds.append(arr.std(axis=(0, 1)))
# Aggregate the mean and standard deviation of pixel values across the sample.
if means:
mean_rgb = np.mean(means, axis=0)
std_rgb = np.mean(stds, axis=0)
mean_str = f"[{mean_rgb[0]:.4f}, {mean_rgb[1]:.4f}, {mean_rgb[2]:.4f}]"
std_str = f"[{std_rgb[0]:.4f}, {std_rgb[1]:.4f}, {std_rgb[2]:.4f}]"
else:
mean_str = "-"
std_str = "-"
return {
"Domain": name,
"JPEGs": len(files),
"Sampled(shapes)": k_shape,
"Top Shapes": top_shapes,
"Sampled(stats)": len(means),
"mean(R,G,B)": mean_str,
"std(R,G,B)": std_str,
"Corrupt files": corrupt,
}
# --- Run the audit and display results ---
print("-" * 70)
print("Quick audit: shapes, simple pixel stats, and file readability (sampled).")
print("\n--- JPEG counts ---")
print(f"Monet JPEGs : {len(monet_jpgs):,}")
print(f"Photo JPEGs : {len(photo_jpgs):,}")
# Run the audit on both the Monet and Photo datasets.
rows = [
_audit_one("Monet (JPEG)", monet_jpgs, N_SHAPE_MONET, N_STATS_MONET),
_audit_one("Photo (JPEG)", photo_jpgs, N_SHAPE_PHOTO, N_STATS_PHOTO),
]
summary_df = pd.DataFrame(rows)
# Apply custom styling to the DataFrame for a clean presentation.
styler = summary_df.set_index("Domain").style
styler = styler.set_properties(**{
"background-color": "#0b0b0b",
"color": "#F8F8F8",
"border-color": "#1A6A1A",
})
styler = styler.set_table_styles([
{"selector": "th", "props": [("background-color", "#1A6A1A"), ("color", "white"), ("text-align", "left")]},
{"selector": "td", "props": [("border", "1px solid #1A6A1A")]},
])
display(styler)
# Define a helper to check if the dominant shape is 256x256.
def _shape_hint(s: str):
first = (s or "").split(",")[0].lower()
norm = first.replace("×", "x").replace(" ", "")
return "✓ Mostly 256×256" if "256x256" in norm else "Varied Sizes"
# Print summary notes based on the audit results.
top_m = summary_df.loc[summary_df["Domain"].eq("Monet (JPEG)"), "Top Shapes"].iloc[0]
top_p = summary_df.loc[summary_df["Domain"].eq("Photo (JPEG)"), "Top Shapes"].iloc[0]
print("\n--- Notes ---")
print(f"Monet JPEGs : {_shape_hint(top_m)}")
print(f"Photo JPEGs : {_shape_hint(top_p)}")
print("\nQuick audit complete.")
print("-" * 70)
----------------------------------------------------------------------
Quick audit: shapes, simple pixel stats, and file readability (sampled).

--- JPEG counts ---
Monet JPEGs : 300
Photo JPEGs : 7,038
Monet (JPEG) shape audit: 0%| | 0/300 [00:00<?, ?it/s]
Monet (JPEG) pixel stats: 0%| | 0/300 [00:00<?, ?it/s]
Photo (JPEG) shape audit: 0%| | 0/1000 [00:00<?, ?it/s]
Photo (JPEG) pixel stats: 0%| | 0/1200 [00:00<?, ?it/s]
| Domain | JPEGs | Sampled(shapes) | Top Shapes | Sampled(stats) | mean(R,G,B) | std(R,G,B) | Corrupt files |
|---|---|---|---|---|---|---|---|
| Monet (JPEG) | 300 | 300 | 256×256:300 | 300 | [0.5215, 0.5245, 0.4768] | [0.1912, 0.1822, 0.1942] | 0 |
| Photo (JPEG) | 7038 | 1000 | 256×256:1000 | 1200 | [0.4005, 0.4052, 0.3832] | [0.2214, 0.2018, 0.2177] | 0 |
--- Notes ---
Monet JPEGs : ✓ Mostly 256×256
Photo JPEGs : ✓ Mostly 256×256

Quick audit complete.
----------------------------------------------------------------------
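The mean(R,G,B) and std(R,G,B) columns above come from averaging per-image channel statistics over the sample. A minimal NumPy sketch of that aggregation, using two tiny toy "images" made up for illustration:

```python
import numpy as np

# Two toy images in [0, 1] with shape (H, W, 3); real images would come
# from _safe_open_rgb as in the audit cell above.
imgs = [np.full((4, 4, 3), 0.25), np.full((4, 4, 3), 0.75)]

# Per-image channel means/stds, then averaged across the sample,
# mirroring the aggregation used by _audit_one.
means = [im.mean(axis=(0, 1)) for im in imgs]
stds = [im.std(axis=(0, 1)) for im in imgs]
mean_rgb = np.mean(means, axis=0)
std_rgb = np.mean(stds, axis=0)
print(mean_rgb)  # [0.5 0.5 0.5]
```

Note that this is the mean of per-image standard deviations, not the standard deviation over the pooled pixels; the two differ, and the audit table reports the former.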
3.3 Visual Audit: Thumbnail Samples
A picture is worth a thousand words. Before diving deeper, this step displays a compact grid of randomly sampled thumbnails from both the Monet and Photo collections, serving as a fast, initial confirmation of the dataset's content.
# Initialize a random number generator for sampling.
rng = np.random.default_rng(SEED)
# Get lists of all JPEG file paths.
monet_jpgs = sorted(str(p) for p in MONET_JPG_DIR.glob("*.jpg"))
photo_jpgs = sorted(str(p) for p in PHOTO_JPG_DIR.glob("*.jpg"))
# Set the number of thumbnails to show from each domain.
N_SHOW_MONET = 49
N_SHOW_PHOTO = 49
# Define the master function to display a grid of image thumbnails.
def _show_grid(paths, title, max_n=49, title_color="#FFB612"):
# Determine the number of images to display.
k = min(max_n, len(paths))
if not k:
print(f"(no images found for: {title})")
return
# Randomly sample image paths.
sample = rng.choice(paths, size=k, replace=False)
# Calculate the grid dimensions (rows and columns).
cols = min(8, max(4, int(np.sqrt(k) + 0.5)))
rows = int(np.ceil(k / cols))
# Create the figure and axes for the grid.
fig, axes = plt.subplots(rows, cols, figsize=(2.5 * cols, 2.5 * rows), facecolor="black")
# Normalize the axes object into a 2D array for consistent indexing.
if rows == 1 and cols == 1:
axes = np.array([[axes]])
elif rows == 1 or cols == 1:
axes = np.reshape(axes, (rows, cols))
# --- Plot the sampled images, tracking progress with tqdm ---
# Create an iterable for the images and their corresponding axes.
plot_iterable = zip(sample, axes.flat)
# Loop through the images and axes with a progress bar.
pbar = tqdm(plot_iterable, total=k, desc=f"Plotting '{title}'", mininterval=0.2)
for p, ax in pbar:
# Load and display the image, with error handling.
try:
with Image.open(p) as im:
ax.imshow(im.convert("RGB"))
except Exception:
ax.text(0.5, 0.5, "unreadable", ha="center", va="center", color="red", fontsize=10)
# --- Style all axes in the grid ---
for ax in axes.flat:
# Remove ticks and labels from each subplot.
ax.set_xticks([])
ax.set_yticks([])
# Style the plot's border (spines).
for spine in ax.spines.values():
spine.set_edgecolor("#1A6A1A")
spine.set_linewidth(1.5)
# Set the background color of the axes.
ax.set_facecolor("black")
# Set the main title for the entire grid.
plt.suptitle(title, fontsize=28, color=title_color, fontweight="bold", y=0.98)
# Ensure the layout is tight and clean.
plt.tight_layout()
plt.show()
# --- Generate and display the thumbnail grids ---
print("-" * 70)
print("Visual EDA: thumbnail grids (random samples)")
print(f"Monet JPEGs available : {len(monet_jpgs):,}")
print(f"Photo JPEGs available : {len(photo_jpgs):,}")
# Display the grid for Monet images.
_show_grid(monet_jpgs, "Monet Sample Thumbnails", max_n=N_SHOW_MONET, title_color="#1A6A1A")
# Display the grid for Photo images.
_show_grid(photo_jpgs, "Photo Sample Thumbnails", max_n=N_SHOW_PHOTO, title_color="#FFB612")
print("-" * 70)
----------------------------------------------------------------------
Visual EDA: thumbnail grids (random samples)
Monet JPEGs available : 300
Photo JPEGs available : 7,038
Plotting 'Monet Sample Thumbnails': 0%| | 0/49 [00:00<?, ?it/s]
Plotting 'Photo Sample Thumbnails': 0%| | 0/49 [00:00<?, ?it/s]
----------------------------------------------------------------------
3.4 TFRecord Smoke Test
Build a minimal TFRecord parser for the image feature and read a few samples from the Monet and Photo shards to confirm that decoding works. This is a smoke test, not the optimized input pipeline, so it is kept simple and fast to run.
# --- Setup and File Gathering ---
print("-" * 70)
print("TFRecord quick check: shard counts, decode a small sample, confirm shapes/dtypes.")
# Get lists of all TFRecord shard files.
monet_tfrec = sorted(MONET_TFREC_DIR.glob("*.tfrec"))
photo_tfrec = sorted(PHOTO_TFREC_DIR.glob("*.tfrec"))
print(f"Monet TFRecord shards : {len(monet_tfrec)}")
print(f"Photo TFRecord shards : {len(photo_tfrec)}")
# Verify that TFRecord files were actually found.
assert monet_tfrec, f"No TFRecords found in {MONET_TFREC_DIR}"
assert photo_tfrec, f"No TFRecords found in {PHOTO_TFREC_DIR}"
# Initialize TensorFlow utilities.
AUTOTUNE = tf.data.AUTOTUNE
rng = np.random.default_rng(SEED)
# --- Helper Functions ---
# Define a flexible parser for TFRecord examples that tries multiple common keys.
def _parse_example(serialized):
for key in ("image", "img", "image/encoded"):
try:
ex = tf.io.parse_single_example(
serialized, {key: tf.io.FixedLenFeature([], tf.string)}
)
img = tf.image.decode_jpeg(ex[key], channels=3)
return img
except Exception:
pass
raise ValueError("Could not locate an image field in TFRecord example")
# Define a function to read and analyze a sample of records from TFRecord files.
def _peek_tfrecs(tfrec_paths, n_take=64):
# Create a dataset from the TFRecord files.
ds = tf.data.TFRecordDataset(tfrec_paths, num_parallel_reads=AUTOTUNE)
shapes, means, stds = [], [], []
ok, fail = 0, 0
# Loop through a sample of serialized examples.
for serialized in ds.take(n_take):
try:
img = _parse_example(serialized)
# Collect shape and pixel statistics.
h, w = int(img.shape[0]), int(img.shape[1])
shapes.append((w, h, 3))
if ok < 24: # Limit stats calculation for speed
arr = tf.cast(img, tf.float32) / 255.0
means.append(tf.reduce_mean(arr, axis=(0,1)).numpy())
stds.append(tf.math.reduce_std(arr, axis=(0,1)).numpy())
ok += 1
except Exception:
fail += 1
# Aggregate the collected statistics.
if means:
m = np.mean(np.stack(means, axis=0), axis=0)
s = np.mean(np.stack(stds, axis=0), axis=0)
mean_str = f"[{m[0]:.4f}, {m[1]:.4f}, {m[2]:.4f}]"
std_str = f"[{s[0]:.4f}, {s[1]:.4f}, {s[2]:.4f}]"
else:
mean_str = "-"
std_str = "-"
# Find the most common image shapes.
top = Counter(shapes).most_common(3)
top_str = ", ".join([f"{w}×{h}:{c}" for (w,h,_), c in top]) if top else "—"
return {
"peeked": ok + fail,
"decoded_ok": ok,
"decode_fail": fail,
"top_shapes": top_str,
"mean": mean_str,
"std": std_str,
}
# --- Run Analysis and Display Results ---
# Set the number of records to sample from each dataset.
N_TFREC_PEEK_MONET = 128
N_TFREC_PEEK_PHOTO = 256
# Run the peek function on the Monet TFRecords.
print("\n--- Peeking TFRecords (Monet) ---")
monet_info = _peek_tfrecs(monet_tfrec, n_take=N_TFREC_PEEK_MONET)
for k, v in monet_info.items():
print(f"{k:>12}: {v}")
# Run the peek function on the Photo TFRecords.
print("\n--- Peeking TFRecords (Photo) ---")
photo_info = _peek_tfrecs(photo_tfrec, n_take=N_TFREC_PEEK_PHOTO)
for k, v in photo_info.items():
print(f"{k:>12}: {v}")
# Define a helper to check if the dominant shape is 256x256.
def _hint(s: str):
first = (s or "").split(",")[0].lower()
norm = first.replace("×", "x").replace(" ", "")
return "✓ Mostly 256×256" if "256x256" in norm else "Varied Sizes"
# --- Final Notes and Assertions ---
print("\n--- Notes ---")
print(f"Monet TFRecord : {_hint(monet_info['top_shapes'])}")
print(f"Photo TFRecord : {_hint(photo_info['top_shapes'])}")
# Assert that the TFRecords are readable and have the expected shape.
assert monet_info["decode_fail"] == 0 and photo_info["decode_fail"] == 0, "Decode failure detected."
assert "256×256" in monet_info["top_shapes"].split(",")[0], "Monet shapes not mostly 256×256."
assert "256×256" in photo_info["top_shapes"].split(",")[0], "Photo shapes not mostly 256×256."
print("\nTFRecord check complete.")
print("-" * 70)
----------------------------------------------------------------------
TFRecord quick check: shard counts, decode a small sample, confirm shapes/dtypes.
Monet TFRecord shards : 5
Photo TFRecord shards : 20
--- Peeking TFRecords (Monet) ---
peeked: 128
decoded_ok: 128
decode_fail: 0
top_shapes: 256×256:128
mean: [0.5380, 0.5413, 0.4909]
std: [0.1981, 0.1822, 0.1807]
--- Peeking TFRecords (Photo) ---
peeked: 256
decoded_ok: 256
decode_fail: 0
top_shapes: 256×256:256
mean: [0.3923, 0.4221, 0.4109]
std: [0.2097, 0.2044, 0.2299]
--- Notes ---
Monet TFRecord : ✓ Mostly 256×256
Photo TFRecord : ✓ Mostly 256×256
TFRecord check complete.
----------------------------------------------------------------------
3.5 Refined Visual Audit: Large-Format Thumbnails
With the basic data checks complete, this step generates a more polished, large-format thumbnail grid. The larger images and refined styling are better suited for closer inspection of the details, textures, and color palettes within each domain.
# Set the number of thumbnails to show from each domain.
N_SHOW_MONET = 49
N_SHOW_PHOTO = 49
# Ensure the file lists are available in the current session.
if 'monet_jpgs' not in locals():
monet_jpgs = sorted(str(p) for p in MONET_JPG_DIR.glob("*.jpg"))
if 'photo_jpgs' not in locals():
photo_jpgs = sorted(str(p) for p in PHOTO_JPG_DIR.glob("*.jpg"))
# Initialize a random number generator for sampling.
rng = np.random.default_rng(SEED)
# Define a helper function to safely load an image file as an RGB NumPy array.
def _load_rgb(path):
try:
with Image.open(path) as im:
return np.asarray(im.convert("RGB"))
except Exception:
return None
# Define the master function to display a grid of image thumbnails.
def _show_grid(paths, title, max_n=49, fontsize=28, title_color="#FFB612", pad=1):
# Determine the number of images to display.
n = min(max_n, len(paths))
if n == 0:
print(f"(no images found for {title})")
return 0, 0
# Create a deterministic random sample of image paths.
sample = rng.choice(paths, size=n, replace=False)
# Calculate the grid layout (rows and columns).
cols = min(8, max(4, int(np.ceil(np.sqrt(n)))))
rows = int(np.ceil(n / cols))
# Create the figure and axes for the grid.
fig, axes = plt.subplots(rows, cols, figsize=(3.1 * cols, 3.1 * rows), facecolor="black")
# Normalize the axes object into a 2D array for consistent indexing.
if rows == 1 and cols == 1:
axes = np.array([[axes]])
elif rows == 1:
axes = np.array([axes])
elif cols == 1:
axes = np.array([[ax] for ax in axes])
axes = axes.reshape(rows, cols)
# First, apply styling to all axes in the grid.
for ax in axes.flat:
ax.set_xticks([])
ax.set_yticks([])
ax.set_facecolor("black")
# Style the plot's border (spines).
for spine in ax.spines.values():
spine.set_edgecolor("#1A6A1A")
spine.set_linewidth(0.8)
# --- Plot the sampled images, tracking progress with tqdm ---
# Now, loop through the images and their corresponding axes to plot.
shown, unreadable = 0, 0
plot_iterable = zip(sample, axes.flat)
pbar = tqdm(plot_iterable, total=n, desc=f"Plotting '{title}'", mininterval=0.2)
for p, ax in pbar:
arr = _load_rgb(p)
if arr is not None:
ax.imshow(arr)
shown += 1
else:
# Handle unreadable files gracefully.
ax.text(0.5, 0.5, "unreadable", ha="center", va="center", color="red", fontsize=10)
unreadable += 1
# Set the main title for the grid.
plt.suptitle(title, y=0.91, fontsize=fontsize, color=title_color, fontweight="bold")
# Apply a tight layout, passing the pad argument through.
plt.tight_layout(pad=pad)
plt.show()
return shown, unreadable
# --- Generate and display the thumbnail grids ---
print("-" * 70)
print("Sampling thumbnails from each domain for a quick visual sanity check.")
# Display the grid for Monet images.
shown_m, bad_m = _show_grid(
monet_jpgs,
"Monet Sample Thumbnails",
max_n=N_SHOW_MONET,
fontsize=32,
title_color="#1A6A1A",
pad=1,
)
# Display the grid for Photo images.
shown_p, bad_p = _show_grid(
photo_jpgs,
"Photo Sample Thumbnails",
max_n=N_SHOW_PHOTO,
fontsize=32,
title_color="#FFB612",
pad=1,
)
# Print a summary of displayed and unreadable images.
print(f"Monet displayed: {shown_m} | unreadable: {bad_m}")
print(f"Photo displayed: {shown_p} | unreadable: {bad_p}")
print("-" * 70)
----------------------------------------------------------------------
Sampling thumbnails from each domain for a quick visual sanity check.
Plotting 'Monet Sample Thumbnails': 0%| | 0/49 [00:00<?, ?it/s]
Plotting 'Photo Sample Thumbnails': 0%| | 0/49 [00:00<?, ?it/s]
Monet displayed: 49 | unreadable: 0
Photo displayed: 49 | unreadable: 0
----------------------------------------------------------------------
3.6 Duplicate Image Analysis
Check for data redundancy by computing the MD5 hash of every JPEG file. This analysis identifies identical images that exist within the same domain and also any images that are exact duplicates across both the Monet and Photo datasets.
# --- Setup and File Gathering ---
print("-" * 70)
print("Duplicate check: hashing JPEGs (MD5), reporting duplicate groups and overlaps.")
# Verify that the necessary path variables are defined.
if 'MONET_JPG_DIR' not in locals() or 'PHOTO_JPG_DIR' not in locals():
raise RuntimeError("Expected MONET_JPG_DIR and PHOTO_JPG_DIR to be defined in 3.1.")
# Get lists of all JPEG file paths.
monet_paths = sorted(MONET_JPG_DIR.glob("*.jpg"))
photo_paths = sorted(PHOTO_JPG_DIR.glob("*.jpg"))
# --- Helper Functions ---
# Define a function to compute the MD5 hash of a file efficiently.
def md5_of_file(p: Path, chunk_mb: int = 1) -> str:
h = hashlib.md5()
with open(p, "rb") as f:
for chunk in iter(lambda: f.read(1024 * 1024 * chunk_mb), b""):
h.update(chunk)
return h.hexdigest()
# Define a wrapper to hash a list of files with a progress bar.
def hash_many(paths):
hashes = []
for p in tqdm(paths, desc="Hashing", mininterval=0.2):
try:
hashes.append(md5_of_file(p))
except Exception:
hashes.append(None) # Mark unreadable files as None.
return hashes
# --- Hashing and Analysis ---
# Compute the MD5 hashes for all images in both domains.
monet_md5 = hash_many(monet_paths)
photo_md5 = hash_many(photo_paths)
# Create pandas DataFrames to hold the file info and hashes.
df_m = pd.DataFrame({
"domain": "monet",
"path": [str(p) for p in monet_paths],
"fname": [p.name for p in monet_paths],
"md5": monet_md5,
})
df_p = pd.DataFrame({
"domain": "photo",
"path": [str(p) for p in photo_paths],
"fname": [p.name for p in photo_paths],
"md5": photo_md5,
})
df = pd.concat([df_m, df_p], ignore_index=True)
# Identify intra-domain duplicates (images that appear more than once in the same set).
dup_m = df_m.groupby("md5", dropna=True)["fname"].size()
dup_p = df_p.groupby("md5", dropna=True)["fname"].size()
groups_m = int((dup_m > 1).sum())
groups_p = int((dup_p > 1).sum())
files_m = int(dup_m[dup_m > 1].sum())
files_p = int(dup_p[dup_p > 1].sum())
# Identify cross-domain duplicates (images that appear in both sets).
overlap_md5 = set(df_m["md5"].dropna()) & set(df_p["md5"].dropna())
n_overlap = len(overlap_md5)
# --- Display Results ---
# Build a summary DataFrame.
summary = pd.DataFrame({
"Domain": ["Monet (JPEG)", "Photo (JPEG)", "Cross-domain"],
"Files": [len(df_m), len(df_p), len(df_m) + len(df_p)],
"Unique MD5": [
int(df_m["md5"].nunique(dropna=True)),
int(df_p["md5"].nunique(dropna=True)),
int(df["md5"].nunique(dropna=True))
],
"Duplicate Groups": [groups_m, groups_p, n_overlap],
"Files in Duplicate Groups": [files_m, files_p, "-"],
}).set_index("Domain")
# Apply custom styling to the summary table.
styler = summary.style
styler = styler.set_properties(**{
"background-color": "#0b0b0b",
"color": "#F8F8F8",
"border-color": "#1A6A1A",
})
styler = styler.set_table_styles([
{"selector": "th", "props": [("background-color", "#DA291C"), ("color", "white"), ("text-align", "left")]},
{"selector": "td", "props": [("border", "3px solid #0033A0")]},
])
display(styler)
# Define a helper to display the filenames of duplicate image groups.
def show_top_dups(df_domain, label, topk=10):
vc = df_domain.groupby("md5")["fname"].size().sort_values(ascending=False)
vc = vc[vc > 1].head(topk)
if len(vc) == 0:
print(f"No {label} duplicates.")
return
print(f"\nTop {min(topk, len(vc))} {label} duplicate groups (size):")
for md5, sz in vc.items():
files = df_domain.loc[df_domain["md5"] == md5, "fname"].tolist()
file_examples = f"{files[:3]}{' ...' if len(files) > 3 else ''}"
print(f" size={sz} md5={md5[:10]}… e.g. {file_examples}")
# Display details about the found duplicates.
print("\n--- Details ---")
show_top_dups(df_m, "Monet")
show_top_dups(df_p, "Photo")
# If cross-domain duplicates exist, show examples.
if n_overlap > 0:
print(f"\nCross-domain identical files: {n_overlap} md5 matches (showing up to 3 examples)")
for md5 in list(overlap_md5)[:3]:
m_files = df_m.loc[df_m["md5"] == md5, "fname"].tolist()
p_files = df_p.loc[df_p["md5"] == md5, "fname"].tolist()
print(f" md5={md5[:10]}… Monet e.g. {m_files[:2]} | Photo e.g. {p_files[:2]}")
print("\nDuplicate check complete.")
print("-" * 70)
----------------------------------------------------------------------
Duplicate check: hashing JPEGs (MD5), reporting duplicate groups and overlaps.
Hashing: 0%| | 0/300 [00:00<?, ?it/s]
Hashing: 0%| | 0/7038 [00:00<?, ?it/s]
| Domain | Files | Unique MD5 | Duplicate Groups | Files in Duplicate Groups |
|---|---|---|---|---|
| Monet (JPEG) | 300 | 300 | 0 | 0 |
| Photo (JPEG) | 7038 | 7028 | 9 | 19 |
| Cross-domain | 7338 | 7328 | 0 | - |
--- Details ---
No Monet duplicates.

Top 9 Photo duplicate groups (size):
  size=3 md5=6ef7c63e0d… e.g. ['16a666edff.jpg', '1dafe2cf42.jpg', 'b76490cea4.jpg']
  size=2 md5=f3599c0869… e.g. ['52f641c328.jpg', '8b554cd034.jpg']
  size=2 md5=6d2337aeef… e.g. ['4b497dc591.jpg', '7d4e10f395.jpg']
  size=2 md5=907e199a17… e.g. ['1d7d8ff0ac.jpg', '643916e05b.jpg']
  size=2 md5=9a0b9dc872… e.g. ['379b25d4d3.jpg', 'acbede97b1.jpg']
  size=2 md5=59bfabe0a9… e.g. ['45f63c2653.jpg', '9c4dd0a48f.jpg']
  size=2 md5=471354085f… e.g. ['2960920920.jpg', 'df2c0d53a4.jpg']
  size=2 md5=41f9162ed2… e.g. ['2641b8175f.jpg', '2bb040da92.jpg']
  size=2 md5=f493a9129d… e.g. ['2cf4cac9df.jpg', '880e0f6c15.jpg']

Duplicate check complete.
----------------------------------------------------------------------
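One caveat worth keeping in mind: MD5 matching only detects byte-identical files. Two JPEGs with the same pixels but different encodes (different quality settings, re-saves, metadata) hash differently, so near-duplicates would slip through this check. A minimal illustration with hashlib, using toy byte strings rather than the actual files:

```python
import hashlib

a = b"same bytes"
b = b"same bytes"
c = b"same pixels, different encode"

# Identical bytes always produce identical digests...
assert hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest()
# ...but any byte-level difference changes the digest entirely,
# even if the decoded image content were visually identical.
assert hashlib.md5(a).hexdigest() != hashlib.md5(c).hexdigest()
print("MD5 detects exact byte duplicates only.")
```

Catching re-encoded near-duplicates would need a perceptual hash instead, which is out of scope for this quick integrity check.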
3.7 Build & Export De-duplicated File Lists
This step uses the MD5 hashes to create a clean, de-duplicated list of photo JPEGs, keeping only the first-seen file from each duplicate group. For reproducibility, it saves two CSVs (photo_jpg_keep.csv and photo_jpg_duplicates.csv) to both Linux and Windows paths.
# --- Hashing and De-duplication ---
# Define a file hashing function.
def md5_of_file(p, buf=1024*1024):
h = hashlib.md5()
with open(p, "rb") as f:
for chunk in iter(lambda: f.read(buf), b""):
h.update(chunk)
return h.hexdigest()
# Get a list of all photo JPEG files.
photo_files = sorted(PHOTO_JPG_DIR.glob("*.jpg"))
md5_to_paths = defaultdict(list)
# Group file paths by their MD5 hash.
for p in tqdm(photo_files, desc="MD5(photo_jpg)", mininterval=0.3):
try:
md5_to_paths[md5_of_file(p)].append(p)
except Exception:
pass # Skip unreadable files.
# Create the de-duplicated list by keeping the first path from each hash group.
keep_paths, drop_paths = [], []
for paths in md5_to_paths.values():
keep_paths.append(paths[0])
if len(paths) > 1:
drop_paths.extend(paths[1:])
# Create a global variable containing the canonical list of unique photo paths.
PHOTO_JPG_UNIQ = [str(p) for p in keep_paths]
# --- Save Results to Linux Path ---
# Create and configure the output directory for metadata files.
meta_dir = Path.home() / "projects/monet-gan" / "meta"
meta_dir.mkdir(parents=True, exist_ok=True)
# Save the lists of "keep" and "drop" paths to CSV files for reproducibility.
df_keep = pd.DataFrame({"keep": [str(p) for p in keep_paths]})
df_drop = pd.DataFrame({"drop": [str(p) for p in drop_paths]})
df_keep.to_csv(meta_dir / "photo_jpg_keep.csv", index=False)
df_drop.to_csv(meta_dir / "photo_jpg_duplicates.csv", index=False)
# Print a summary of the de-duplication process.
print("-" * 70)
print(f"photo_jpg files total : {len(photo_files):,}")
print(f"unique by MD5 : {len(PHOTO_JPG_UNIQ):,}")
print(f"duplicates to drop : {len(drop_paths):,} (kept one per group)")
print(f"Lists saved in : {meta_dir}")
print("-" * 70)
# --- Save Results to Windows Path ---
# Define the source and destination paths.
meta_dir_lin = Path.home() / "projects/monet-gan" / "meta"
meta_dir_win = Path("/mnt/c/Users/travi/Documents/Training/Colorado/MS-AI/Machine Learning Theory and Hands-on Practice with Python Specialization/Introduction to Deep Learning/Module 5/Week5_Kaggle_Monet_Competition/meta")
# Ensure the Windows directory exists.
meta_dir_win.mkdir(parents=True, exist_ok=True)
# Write copies of the CSV files to the Windows path.
df_keep.to_csv(meta_dir_win / "photo_jpg_keep.csv", index=False)
df_drop.to_csv(meta_dir_win / "photo_jpg_duplicates.csv", index=False)
print("Saved duplicate lists to Windows:")
print(meta_dir_win / "photo_jpg_keep.csv")
print(meta_dir_win / "photo_jpg_duplicates.csv")
print("-" * 70)
MD5(photo_jpg): 0%| | 0/7038 [00:00<?, ?it/s]
----------------------------------------------------------------------
photo_jpg files total : 7,038
unique by MD5 : 7,028
duplicates to drop : 10 (kept one per group)
Lists saved in : /home/treinart/projects/monet-gan/meta
----------------------------------------------------------------------
Saved duplicate lists to Windows:
/mnt/c/Users/travi/Documents/Training/Colorado/MS-AI/Machine Learning Theory and Hands-on Practice with Python Specialization/Introduction to Deep Learning/Module 5/Week5_Kaggle_Monet_Competition/meta/photo_jpg_keep.csv
/mnt/c/Users/travi/Documents/Training/Colorado/MS-AI/Machine Learning Theory and Hands-on Practice with Python Specialization/Introduction to Deep Learning/Module 5/Week5_Kaggle_Monet_Competition/meta/photo_jpg_duplicates.csv
----------------------------------------------------------------------
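In a later session, the exported keep list can be reloaded to rebuild the canonical photo list without re-hashing. A minimal sketch of that reload, using a temporary directory as a self-contained stand-in for the real `meta` directory (the column name matches the export above):

```python
import tempfile
from pathlib import Path
import pandas as pd

# Stand-in for meta_dir / "photo_jpg_keep.csv"; a temp directory keeps
# the sketch runnable without the real export on disk.
with tempfile.TemporaryDirectory() as d:
    keep_csv = Path(d) / "photo_jpg_keep.csv"
    pd.DataFrame(
        {"keep": ["/data/photo_jpg/a.jpg", "/data/photo_jpg/b.jpg"]}
    ).to_csv(keep_csv, index=False)

    # Reload the de-duplicated list, as a later session would.
    photo_jpg_uniq = pd.read_csv(keep_csv)["keep"].tolist()

print(len(photo_jpg_uniq))  # 2
```

Against the real export this yields the 7,028 unique paths reported above.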
Observation: Data Integrity Audit
The comprehensive audit of the dataset confirms that it is remarkably clean, well-structured, and highly suitable for the style transfer task. All files are present and accounted for, the image dimensions are perfectly consistent, and the TFRecord shards are verified to be reliable. The audit also identified and addressed a minor data redundancy issue, resulting in a clean, canonical dataset ready for the modeling pipeline.
Dataset Structure and Completeness:
- The file structure is validated, with all paths and folders present as expected. The dataset is heavily imbalanced, containing 300 Monet JPEGs and 7,038 Photo JPEGs.
- Both JPEG folders and TFRecord shards match their expected counts, confirming that no data was lost during setup.
Image Consistency and Quality:
- The data quality is exceptionally high. A statistical audit of both JPEG and TFRecord samples found zero corrupt or unreadable files.
- Critically, all sampled images from both domains were found to have a consistent 256×256 resolution. This uniformity is a significant advantage, as it eliminates the need for complex resizing or padding logic in the data preprocessing pipeline.
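Even with the audit's uniform result, a cheap shape guard at pipeline entry keeps that assumption explicit rather than implicit. A minimal NumPy sketch (the 256×256×3 target comes from the audit; the helper name is a hypothetical addition, not part of the notebook's pipeline):

```python
import numpy as np

EXPECTED_HW = (256, 256)

def check_shape(arr: np.ndarray) -> np.ndarray:
    """Raise early if an image does not match the audited 256x256x3 layout."""
    if arr.shape[:2] != EXPECTED_HW or arr.shape[-1] != 3:
        raise ValueError(f"unexpected image shape: {arr.shape}")
    return arr

ok = check_shape(np.zeros((256, 256, 3), dtype=np.uint8))
print(ok.shape)  # (256, 256, 3)
```

Failing fast here is cheaper than debugging a shape mismatch deep inside a training step.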
Visual Domain Characteristics:
- The visual audit via thumbnail grids confirms a distinct and pronounced stylistic gap between the two domains.
- The Monet set consistently displays the hallmarks of Impressionism: visible brushstrokes, a focus on light and color over sharp detail, and recurring subjects like haystacks and water lilies.
- The Photo set consists of modern, high-quality landscape photographs characterized by sharp focus, realistic colors, and a wide variety of subjects and compositions. This clear visual dichotomy is ideal for a style transfer model.
Data Uniqueness (Key Finding):
- An MD5 hash analysis confirmed the integrity of the datasets. The Monet collection contains zero duplicate images.
- The Photo collection was found to have a minor redundancy issue: 9 duplicate groups spanning 19 files, of which 10 are redundant copies. No exact duplicates were found across the two domains.
- This issue was resolved by creating and exporting a de-duplicated list of 7,028 unique photo files, ensuring that the model will train on a clean and unique set of images.
Conclusion:
The data audit provides high confidence in the dataset's quality and structure. The primary challenges for this project will not stem from data cleaning but from the significant class imbalance and the inherent difficulty of the image translation task. With all files verified, sized correctly, and de-duplicated, the project can proceed to building the tf.data input pipeline with a solid foundation.
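On the imbalance point: in a CycleGAN-style setup the two domains are streamed in parallel, so one common remedy is to repeat the smaller Monet set within each epoch so that every photo is seen. A pure-Python sketch of that pairing logic, under the assumption that the eventual tf.data pipeline would express the same idea with repeat() and zip (the item lists here are toy stand-ins for the 300 paintings and 7,028 photos):

```python
from itertools import cycle, islice

monet = [f"monet_{i}" for i in range(3)]    # stand-in for 300 paintings
photos = [f"photo_{i}" for i in range(10)]  # stand-in for 7,028 photos

# Cycle the smaller domain so each photo gets a Monet partner
# within the epoch; epoch length is set by the larger domain.
pairs = list(zip(islice(cycle(monet), len(photos)), photos))
print(len(pairs))        # 10
print(pairs[0], pairs[3])
```

In practice both streams would also be shuffled each epoch so the pairings vary; the point is only that the imbalance sets epoch length, not that any semantic matching occurs.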
Section 4: Exploratory Visual Analysis
Before building the model, it's crucial to understand the visual characteristics that define and separate the two domains. This section uses a series of plots and statistical measures to compare the Monet and Photo datasets across dimensions like color, brightness, and texture.
Section Plan:
- Side-by-Side Comparison: Display random pairs of Monet and Photo images to get an intuitive feel for the domain gap.
- Color Distribution: Plot RGB histograms to compare the color palettes of the two sets.
- Mean Images: Compute and display the "average" image for each domain to visualize the central tendency of color and composition.
- Luminance & Contrast: Analyze and plot the brightness distributions to see if one domain is systematically darker or has lower contrast.
- Edge & Texture Analysis: Use Sobel and Laplacian operators to quantify and compare the edge density and sharpness between paintings and photos.
- Hue & Saturation: Plot HSV distributions to compare the dominant colors and their intensity.
- Outlier Detection: Scan for images with anomalous properties like extreme file size or brightness to ensure data quality.
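As a preview of the luminance step in the plan above, per-pixel brightness is typically computed as the Rec. 601 weighted sum of the RGB channels. A minimal NumPy sketch (the weights are the standard Rec. 601 luma coefficients; the toy image is made up):

```python
import numpy as np

# Rec. 601 luma weights for RGB channels in [0, 1].
LUMA = np.array([0.299, 0.587, 0.114])

def mean_luminance(img: np.ndarray) -> float:
    """Mean per-pixel brightness of an (H, W, 3) float image."""
    return float((img @ LUMA).mean())

pure_red = np.zeros((4, 4, 3))
pure_red[..., 0] = 1.0
print(round(mean_luminance(pure_red), 3))  # 0.299
```

The weights sum to 1.0, so a pure white image scores 1.0 and the measure stays comparable across domains.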
4.1 Side-by-Side Rows
Paired rows (Monet vs. Photo) highlight differences in texture and palette. Pairs are random and not semantically matched.
# Set the number of image pairs to display.
rng = np.random.default_rng(SEED)
n_pairs = 24
# Create deterministic random samples from both domains.
m_take = min(n_pairs, len(monet_jpgs))
p_take = min(n_pairs, len(photo_jpgs))
m_sample = rng.choice(monet_jpgs, size=m_take, replace=False).tolist()
p_sample = rng.choice(photo_jpgs, size=p_take, replace=False).tolist()
# Create the figure and axes for the side-by-side plots.
rows = max(m_take, p_take)
fig, axes = plt.subplots(rows, 2, figsize=(8, 3 * rows), facecolor="black")
# Ensure the axes object is always a 2D array for consistent indexing.
axes = np.atleast_2d(axes)
# --- MODIFICATION: Loop with tqdm ---
# Loop through each row to plot a pair of images, with a progress bar.
pbar = tqdm(range(rows), desc="Generating image pairs", mininterval=0.2)
for i in pbar:
# Loop through the two domains (Monet and Photo) for each column.
for j, (paths, label, color) in enumerate([(m_sample, "Monet", "#1A6A1A"), (p_sample, "Photo", "#FFB612")]):
ax = axes[i, j]
# Clear ticks and set background color.
ax.set_xticks([])
ax.set_yticks([])
ax.set_facecolor("black")
# Load and display the image if one exists for the current row.
if i < len(paths):
try:
with Image.open(paths[i]) as im:
ax.imshow(im.convert("RGB"))
except Exception:
# Handle cases where an image is unreadable.
ax.text(0.5, 0.5, "unreadable", ha="center", va="center", color="red")
# Add a title to the top of each column on the first row.
if i == 0:
ax.set_title(label, color=color, fontsize=22, fontweight="bold")
# Style the plot's border (spines).
for spine in ax.spines.values():
spine.set_edgecolor(color)
spine.set_linewidth(2)
# Ensure the layout is tight and clean.
plt.tight_layout()
plt.show()
Generating image pairs: 0%| | 0/24 [00:00<?, ?it/s]