Django project to organize PHS material

PHS Archive

A Django tool used internally by PHS to catalog and curate a multi-terabyte archive of videos, images, and documents that lives on external hard-drives. The goal is to prepare a curated, deduplicated subset of this material for publication to the pad.ma public archive.

This tool catalogs and annotates files — it does not host, serve, or transcode the underlying media. The files themselves stay on the hard-drives. What the tool tracks is a row per file: where it lives, a content hash for deduplication, the metadata an archivist adds (title, description, status, project group, notes), and bookkeeping of who reviewed what and when.

The actual upload to pad.ma is not yet implemented — curators mark files for upload, and a future management command will push the marked set.

Tech stack

  • Django 6.0.1
  • SQLite (db.sqlite3 at the repo root, gitignored)
  • Python 3.14
  • Sole runtime dep: python-dateutil (for parsing varied EXIF date strings)
  • Frontend at /curate/ is vanilla ES modules, no build step

No DRF, no Celery, no Docker, no npm, no bundler. The JSON API is hand-rolled JsonResponse views.

Two UIs

Two surfaces are served from the same Django app on the same DB:

  • /admin/ — Django admin. The power-user / DBA surface. Bulk actions, arbitrary filtering, duplicate management, schema-level visibility, and an optional inline local-drive preview column.
  • /curate/ — a focused single-page app for the day-to-day curation workflow: pick a drive, navigate its directory tree (with status colour-coding), open a file, see a live preview from a locally-mounted folder, fill in the metadata, save with ⌘S or Save & Next.

Both UIs require staff login; the SPA reuses the admin's login form.

Quick start

# clone, then:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver

Then visit either /admin/ or /curate/ on the development server (http://127.0.0.1:8000/ by default).

Importing data

The catalog is built by importing JSON listings of hard-drives. Listings come in two flavours (A and B); duplicates (C) are handled as a separate step.

A) Real drives (production workflow)

An external scan script walks a mounted drive, hashes files, extracts EXIF, and emits a JSON listing into data/. That script lives outside this repo; the files it produces are what the importer consumes.

python manage.py import_json data/Expansion.json

Behaviour:

  • Creates (or reuses) a Drive row. The drive name is auto-derived from the filename: Expansion.json → Expansion, PHS_copyALL_2026-01-24.json → PHS_copyALL (a trailing _20YY-MM-DD is stripped).
  • For each entry, upserts a FileEntry keyed by (drive, path, filename).
  • The whole import runs inside a single transaction.
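
The drive-name rule above can be sketched in a few lines. This is a minimal illustration, not the importer's actual code: the helper name derive_drive_name and the exact regular expression are assumptions based only on the two examples given.

```python
import re
from pathlib import Path

# Assumption: only a trailing "_20YY-MM-DD" suffix is stripped, as the
# README describes; other date formats are left untouched.
_DATE_SUFFIX = re.compile(r"_20\d{2}-\d{2}-\d{2}$")


def derive_drive_name(json_path: str) -> str:
    """Hypothetical helper: derive a drive name from a listing filename."""
    stem = Path(json_path).stem  # drops the directory and the ".json" suffix
    return _DATE_SUFFIX.sub("", stem)
```

Passing --drive NAME would bypass a helper like this entirely.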

Flags:

flag          purpose
--drive NAME  Override the auto-derived drive name.
--update      Update fields on existing rows (default: skip).
--verbose     Print progress every 100 entries.
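
The difference between the default (skip) and --update can be modelled with a plain dictionary. The sketch below is an in-memory stand-in, not the real command, which presumably works through the Django ORM inside a transaction; the function name upsert_entries and its return value are hypothetical.

```python
def upsert_entries(catalog, drive, entries, update=False):
    """Hypothetical in-memory model of the importer's skip/update rule.

    catalog maps (drive, path, filename) -> dict of fields.
    Returns a (created, updated, skipped) tuple of counts.
    """
    created = updated = skipped = 0
    for entry in entries:
        key = (drive, entry["path"], entry["filename"])
        if key not in catalog:
            catalog[key] = dict(entry)   # new row
            created += 1
        elif update:
            catalog[key].update(entry)   # --update: overwrite fields
            updated += 1
        else:
            skipped += 1                 # default: leave existing row alone
    return created, updated, skipped
```

Re-running an import without --update is therefore safe for rows that curators have already edited, under the assumption that skipped rows are left completely untouched.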

B) Fake drives (local dev / test fixtures)

scan_folder is a small in-repo equivalent: it walks any directory and emits the same JSON format, so you can exercise the catalog and the preview UI without the real hard-drives plugged in. A test_drive/ directory in the repo root is the canonical local fixture (gitignored).

python manage.py scan_folder test_drive test_drive.json --verbose
python manage.py import_json test_drive.json --drive TestDrive --update

scan_folder computes the same oshash as the production scanner (OpenSubtitles hash) so manifests round-trip identically.
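
For readers unfamiliar with the OpenSubtitles hash, here is a minimal sketch of the classic algorithm: the file size plus the sum of the first and last 64 KiB read as little-endian 64-bit integers, truncated to 64 bits. It is not copied from scan_folder. In particular, raising on files under 128 KiB follows the classic reference behaviour and is an assumption here; the real scanners may treat small files differently.

```python
import os
import struct

CHUNK = 64 * 1024  # the hash reads 64 KiB from each end of the file


def oshash(path):
    """Classic OpenSubtitles hash as a 16-character hex string (sketch)."""
    size = os.path.getsize(path)
    if size < 2 * CHUNK:
        # Assumption: mirror the classic reference, which rejects small files.
        raise ValueError("file too small for the classic OpenSubtitles hash")
    total = size
    with open(path, "rb") as fh:
        for offset in (0, size - CHUNK):
            fh.seek(offset)
            data = fh.read(CHUNK)
            # Interpret the chunk as 8192 little-endian unsigned 64-bit words.
            total += sum(struct.unpack("<%dQ" % (CHUNK // 8), data))
    return "%016x" % (total & 0xFFFFFFFFFFFFFFFF)  # wrap to 64 bits
```

Because only the size and two 64 KiB windows are read, hashing stays fast even on very large video files, at the cost of not detecting differences in the middle of a file.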

C) Duplicates

If you have a pre-computed REPEATS_allHDDs.json:

python manage.py import_repeats data/repeats/REPEATS_allHDDs.json \
    --selection-method auto_largest

Otherwise, recompute from the DB:

python manage.py update_duplicates                # incremental
python manage.py update_duplicates --reset        # wipe + redo

update_duplicates finds every oshash appearing more than once and creates / refreshes DuplicateGroup rows. Without --reset, existing manual primary selections are preserved.
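
The grouping step can be sketched without the ORM. Both function names below are hypothetical, and the rule for carrying over a manual primary (keep it only while that file is still in the group) is an assumption about what "preserved" means here.

```python
from collections import defaultdict


def find_duplicate_groups(rows):
    """rows: iterable of (file_id, oshash) pairs.

    Returns {oshash: [file_id, ...]} for every hash seen more than once.
    """
    by_hash = defaultdict(list)
    for file_id, oshash in rows:
        if oshash:  # assumption: rows without a hash are never grouped
            by_hash[oshash].append(file_id)
    return {h: ids for h, ids in by_hash.items() if len(ids) > 1}


def carry_over_primaries(groups, manual_primaries):
    """Keep a manual primary only while that file is still in its group."""
    return {
        h: manual_primaries[h]
        for h, ids in groups.items()
        if manual_primaries.get(h) in ids
    }
```

With --reset, the equivalent of manual_primaries would simply start out empty.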

The curation workflow

  1. Pick a drive from the header dropdown.
  2. Click Mount folder for <drive>… and point the picker at the matching folder on disk. The mount is persisted in IndexedDB and survives reloads (after one re-grant click). Only one drive is mounted at a time.
  3. Navigate the tree on the left. Status dots colour each file (grey unreviewed, green upload, blue restricted, slate keep-offline, red discard, amber needs-attention). Each folder shows reviewed/total counts. The chip row above the tree filters to Unreviewed only / Needs attention / For upload.
  4. Click a file → preview (image or video) loads in the right pane, and the form below shows status, title, project group, date, description, notes.
  5. Type, then Save or Save & Next. Auto-save is intentionally not used — saves are explicit.
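
The reviewed/total counts shown on each folder in step 3 could be computed along these lines. This is an illustrative sketch only: it assumes that any status other than unreviewed counts as reviewed and that counts are per folder rather than recursive, neither of which is confirmed above.

```python
from collections import Counter


def folder_counts(files):
    """files: iterable of (path, status) pairs.

    Returns {path: (reviewed, total)}.
    Assumptions: "reviewed" means any status except "unreviewed",
    and subfolders are not rolled up into their parents.
    """
    total = Counter()
    reviewed = Counter()
    for path, status in files:
        total[path] += 1
        if status != "unreviewed":
            reviewed[path] += 1
    return {path: (reviewed[path], total[path]) for path in total}
```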

Keyboard shortcuts:

key               action
⌘S / Ctrl+S       Save current file
⌘⏎ / Ctrl+Enter   Save & next file (in current folder)
Esc               Discard pending edits (when focus is in the form)
j / k             Next / previous file in current folder (no save)
1–6               Set status (1 unreviewed · 2 upload · 3 restricted · 4 keep_offline · 5 discard · 6 needs_attention)
t / n             Focus title / notes
u                 Jump to next unreviewed file (drive-wide)

Every successful save stamps reviewed_at and reviewed_by so you can tell who looked at what.

In /admin/ (power-user / DBA)

Same models, more knobs. Browse Files → File entries, filter by drive / file-type category / duplicate status / publication status / camera / date / status / project group, search across filename/path/oshash/title/keywords, and use bulk actions: Mark for padma, Unmark, Set publication: …, Make primary. Resolve duplicates in Files → Duplicate groups. There's also a Files → Project groups table for the typo-safe project list.

The admin's changelist has an optional inline local-drive preview column (see AGENTS.md for the spike) that uses the same mount as the SPA.

Running tests

A small smoke-test file lives at files/tests.py covering the two load-bearing invariants of the API (auth gate, status-to-legacy sync). Run:

python manage.py test files

There are four tests, and the suite runs in well under a second. It uses Django's built-in test framework, with no extra dependencies.

Project layout

phs/
├── manage.py
├── requirements.txt
├── db.sqlite3                       # SQLite database (gitignored)
├── test_drive/                      # local fixture for the preview UI (gitignored)
├── test_drive.json                  # manifest produced by scan_folder (gitignored)
├── data/                            # JSON listings from external scans (gitignored)
├── deploy/                          # production deploy artifacts (nginx, systemd, webhook, backup)
├── DEBUG.md                         # ops cheat sheet for the deployed instance
├── phs_archive/                     # Django project (settings, urls, wsgi)
│   ├── settings.py                  # LOGIN_URL + local_settings shim at the bottom
│   └── urls.py                      # mounts /admin/, /api/, /curate/
└── files/                           # the single Django app
    ├── models.py                    # Drive · ProjectGroup · DuplicateGroup · FileEntry
    ├── admin.py                     # Django admin config (+ local preview column)
    ├── views.py                     # curate_index (staff_member_required)
    ├── api.py                       # JSON API view functions
    ├── api_serializers.py           # dict serializers, no framework
    ├── permissions.py               # @staff_required_json decorator
    ├── urls.py                      # /api/ URL conf
    ├── tests.py                     # auth + sync smoke tests
    ├── migrations/                  # 0001_initial · 0002_curation_v2 · 0003_backfill_curation_v2
    ├── management/commands/
    │   ├── scan_folder.py           # walk a folder → manifest JSON
    │   ├── import_json.py           # ingest manifest into DB
    │   ├── import_repeats.py        # apply pre-computed duplicates
    │   └── update_duplicates.py     # recompute duplicates from DB
    ├── templates/curate/index.html  # SPA shell
    └── static/
        ├── curate/                  # SPA modules (app/store/api/router/tree/detail/preview/mount/style)
        └── files/                   # admin's local_previews.js + .css

The JSON contract (for the importer)

Each entry in a drive listing is a flat object. Only fields the importer reads are listed; extras are ignored:

{
  "path": "PHS-Videos",
  "filename": "HD.1080.mp4",
  "file_type": "video/mp4",
  "oshash": "050858a68cab09aa",
  "size_exif": 36256397,
  "size_pretty": "34.6 MB",
  "date_exif": "2021:12:23 17:57:22+05:30",
  "duration": 13.96,
  "duration_tc": "00:00:13.960",
  "resolution": "1920 x 1080",
  "camera": "DC-S1 (Panasonic)"
}

scan_folder produces the same shape (minus a few media-specific fields it doesn't compute) and feeds straight into import_json.
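
As an illustration of consuming one entry, the sketch below normalises a few of the fields shown above. It is not the importer's code: the function name parse_entry and the returned keys are hypothetical, and it handles only the single date format in the example using the standard library, whereas the project relies on python-dateutil because real EXIF date strings vary.

```python
from datetime import datetime


def parse_entry(entry):
    """Hypothetical normaliser for one listing entry."""
    # Only the "YYYY:MM:DD HH:MM:SS+HH:MM" form from the example is handled.
    taken = datetime.strptime(entry["date_exif"], "%Y:%m:%d %H:%M:%S%z")
    width, height = (int(part) for part in entry["resolution"].split(" x "))
    return {
        "key": (entry["path"], entry["filename"]),  # unique within a drive
        "taken": taken,
        "width": width,
        "height": height,
        "size_bytes": entry["size_exif"],
    }
```

Entries produced by scan_folder may lack the media-specific fields, so a real consumer would need to treat date_exif and resolution as optional.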

The repeats file has the shape { "<oshash>": [["<path>", "<filename>"], …] }.
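
A repeats file of that shape can be flattened into one record per file location as follows. The function name iter_repeats is hypothetical, and the sketch says nothing about how --selection-method chooses a primary.

```python
import json


def iter_repeats(text):
    """Yield (oshash, path, filename) triples from a repeats JSON document."""
    data = json.loads(text)
    for oshash, locations in data.items():
        for path, filename in locations:
            yield oshash, path, filename
```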

Deployment

A live instance runs at https://phs.whydidweevendothis.com. The full bring-up procedure, systemd units, nginx config, and push-to-deploy webhook are in deploy/README.md. Day-to-day operational commands (tailing logs, redeploying, rolling back, restoring a backup) are in DEBUG.md.

Production overrides (DEBUG=False, SECRET_KEY, ALLOWED_HOSTS, TLS cookie flags) live in phs_archive/local_settings.py on the server only — the bottom of phs_archive/settings.py imports it if present.

See AGENTS.md for a deeper walkthrough of the data model, API contract, and SPA architecture.