PHS Archive
A Django tool used internally by PHS to catalog and curate a multi-terabyte archive of videos, images, and documents that lives on external hard-drives. The goal is to prepare a curated, deduplicated subset of this material for publication to the pad.ma public archive.
This tool catalogs and annotates files — it does not host, serve, or transcode the underlying media. The files themselves stay on the hard-drives. What the tool tracks is a row per file: where it lives, a content hash for deduplication, the metadata an archivist adds (title, description, status, project group, notes), and bookkeeping of who reviewed what and when.
The actual upload to pad.ma is not yet implemented — curators mark files for upload, and a future management command will push the marked set.
Tech stack
- Django 6.0.1
- SQLite (db.sqlite3 at the repo root, gitignored)
- Python 3.14
- Sole runtime dep: python-dateutil (for parsing varied EXIF date strings)
- Frontend at /curate/ is vanilla ES modules, no build step
No DRF, no Celery, no Docker, no npm, no bundler. The JSON API is hand-rolled
JsonResponse views.
Two UIs
Two surfaces are served from the same Django app on the same DB:
- /admin/: Django admin. The power-user / DBA surface. Bulk actions, arbitrary filtering, duplicate management, schema-level visibility, and an optional inline local-drive preview column.
- /curate/: a focused single-page app for the day-to-day curation workflow: pick a drive, navigate its directory tree (with status colour-coding), open a file, see a live preview from a locally-mounted folder, fill in the metadata, save with ⌘S or Save & Next.
Both UIs require staff login; the SPA reuses the admin's login form.
Quick start
# clone, then:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver
Then visit either:
- http://localhost:8000/admin/ — Django admin.
- http://localhost:8000/curate/ — the curation SPA (Chromium browser required for the local-drive preview, since it relies on the File System Access API).
Importing data
The catalog is built by importing JSON listings of hard-drives. Listings come in two flavours (A and B); duplicate detection (C) is a separate step that runs after import:
A) Real drives (production workflow)
An external scan script walks a mounted drive, hashes files, extracts EXIF,
and emits a JSON listing into data/. That script lives outside this repo;
the files it produces are what the importer consumes.
python manage.py import_json data/Expansion.json
Behaviour:
- Creates (or reuses) a Drive row. The drive name is auto-derived from the filename: Expansion.json → Expansion, PHS_copyALL_2026-01-24.json → PHS_copyALL (trailing _20YY-MM-DD is stripped).
- For each entry, upserts a FileEntry keyed by (drive, path, filename).
- The whole import runs inside a single transaction.
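The drive-name rule above can be sketched in a few lines. This is an illustration only: the function name `derive_drive_name` and the exact regular expression are assumptions, not the importer's actual code.

```python
import re
from pathlib import Path

# Assumed pattern: a trailing "_20YY-MM-DD" date suffix on the file stem.
_DATE_SUFFIX = re.compile(r"_20\d{2}-\d{2}-\d{2}$")


def derive_drive_name(listing_path: str) -> str:
    """Return the drive name implied by a listing filename (hypothetical helper)."""
    stem = Path(listing_path).stem      # drops the directory and the ".json" suffix
    return _DATE_SUFFIX.sub("", stem)   # strips the date suffix if one is present


print(derive_drive_name("data/Expansion.json"))
print(derive_drive_name("PHS_copyALL_2026-01-24.json"))
```

When the derived name is not what you want, the --drive flag overrides it.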
Flags:
| flag | purpose |
|---|---|
| --drive NAME | Override the auto-derived drive name. |
| --update | Update fields on existing rows (default: skip). |
| --verbose | Print progress every 100 entries. |
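To make the skip-versus-update behaviour concrete, here is a minimal sketch of an upsert keyed by (drive, path, filename) inside one transaction. It uses the standard-library sqlite3 module rather than the Django ORM, and the table layout and the `import_entries` name are hypothetical.

```python
import sqlite3

# Hypothetical table; the real FileEntry model has many more columns.
SCHEMA = """
CREATE TABLE file_entry (
    drive TEXT, path TEXT, filename TEXT, oshash TEXT,
    PRIMARY KEY (drive, path, filename)
)
"""


def import_entries(conn, drive, entries, update=False):
    """Return (created, updated, skipped) counts for one listing."""
    created = updated = skipped = 0
    with conn:  # single transaction: commits on success, rolls back on error
        for entry in entries:
            key = (drive, entry["path"], entry["filename"])
            exists = conn.execute(
                "SELECT 1 FROM file_entry WHERE drive=? AND path=? AND filename=?",
                key,
            ).fetchone()
            if exists is None:
                conn.execute(
                    "INSERT INTO file_entry VALUES (?, ?, ?, ?)",
                    (*key, entry.get("oshash")),
                )
                created += 1
            elif update:  # mirrors --update
                conn.execute(
                    "UPDATE file_entry SET oshash=? "
                    "WHERE drive=? AND path=? AND filename=?",
                    (entry.get("oshash"), *key),
                )
                updated += 1
            else:  # default: leave the existing row untouched
                skipped += 1
    return created, updated, skipped
```

Running the same listing twice without `update=True` creates rows the first time and skips them the second, which is why re-importing a listing is safe by default.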
B) Fake drives (local dev / test fixtures)
scan_folder is a small in-repo equivalent: it walks any directory and emits
the same JSON format. Used so you can exercise the catalog and the preview
UI without the real hard-drives plugged in. A test_drive/ directory in the
repo root is the canonical local fixture (gitignored).
python manage.py scan_folder test_drive test_drive.json --verbose
python manage.py import_json test_drive.json --drive TestDrive --update
scan_folder computes the same oshash as the production scanner
(OpenSubtitles hash) so manifests round-trip identically.
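For readers unfamiliar with the OpenSubtitles hash, a minimal sketch follows. It sums the file size and the little-endian 64-bit words of the first and last 64 KiB, modulo 2^64. How files shorter than 64 KiB are handled varies between implementations; the behaviour here (head and tail overlap, trailing bytes that do not fill a word are ignored) is an assumption and may differ from what scan_folder does.

```python
import os
import struct

CHUNK = 64 * 1024            # bytes hashed at each end of the file
MASK = 0xFFFFFFFFFFFFFFFF    # keep the running sum within 64 bits


def oshash(path: str) -> str:
    """Return the OpenSubtitles-style hash of a file as 16 hex digits."""
    size = os.path.getsize(path)
    total = size
    with open(path, "rb") as fh:
        head = fh.read(CHUNK)
        fh.seek(max(0, size - CHUNK))
        tail = fh.read(CHUNK)
    for chunk in (head, tail):
        usable = len(chunk) - len(chunk) % 8    # whole 8-byte words only
        for (word,) in struct.iter_unpack("<Q", chunk[:usable]):
            total = (total + word) & MASK
    return f"{total:016x}"
```

Because only the size and 128 KiB of content are read, the hash is cheap to compute on multi-gigabyte videos, at the cost of being a weaker identity check than a full-content digest.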
C) Duplicates
If you have a pre-computed REPEATS_allHDDs.json:
python manage.py import_repeats data/repeats/REPEATS_allHDDs.json \
--selection-method auto_largest
Otherwise, recompute from the DB:
python manage.py update_duplicates # incremental
python manage.py update_duplicates --reset # wipe + redo
update_duplicates finds every oshash appearing more than once and
creates / refreshes DuplicateGroup rows. Without --reset, existing
manual primary selections are preserved.
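The grouping logic can be sketched without the ORM. The dict keys (`id`, `oshash`, `size`) and both function names are hypothetical, and falling back to the largest file when no manual choice exists is an assumption borrowed from the auto_largest option name, not confirmed behaviour of update_duplicates.

```python
from collections import defaultdict


def find_duplicate_groups(entries):
    """Map each oshash seen more than once to the entries that share it."""
    by_hash = defaultdict(list)
    for entry in entries:
        if entry.get("oshash"):          # rows without a hash cannot be compared
            by_hash[entry["oshash"]].append(entry)
    return {h: group for h, group in by_hash.items() if len(group) > 1}


def choose_primary(group, manual_primary_id=None):
    """Keep a manual selection if it is still in the group, else pick the largest."""
    for entry in group:
        if manual_primary_id is not None and entry["id"] == manual_primary_id:
            return entry
    return max(group, key=lambda e: e["size"])   # assumed fallback rule


rows = [
    {"id": 1, "oshash": "aa", "size": 10},
    {"id": 2, "oshash": "aa", "size": 30},
    {"id": 3, "oshash": "bb", "size": 5},
]
groups = find_duplicate_groups(rows)
print(sorted(groups), choose_primary(groups["aa"])["id"])
```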
The curation workflow
In /curate/ (recommended day-to-day)
- Pick a drive from the header dropdown.
- Click Mount folder for <drive>… and point the picker at the matching folder on disk. The mount is persisted in IndexedDB and survives reloads (after one re-grant click). Only one drive is mounted at a time.
- Navigate the tree on the left. Status dots colour each file (grey unreviewed, green upload, blue restricted, slate keep-offline, red discard, amber needs-attention). Each folder shows reviewed/total counts. The chip row above the tree filters to Unreviewed only / Needs attention / For upload.
- Click a file → preview (image or video) loads in the right pane, and the form below shows status, title, project group, date, description, notes.
- Type, then Save or Save & Next. Auto-save is intentionally not used — saves are explicit.
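The reviewed/total counts shown next to each folder could be derived along these lines. This is a sketch under two assumptions: "reviewed" means any status other than unreviewed, and a folder counts only its direct files. The `folder_counts` name is hypothetical.

```python
from collections import Counter


def folder_counts(entries):
    """Return {folder: (reviewed, total)} from (path, status) pairs."""
    total = Counter()
    reviewed = Counter()
    for path, status in entries:
        total[path] += 1
        if status != "unreviewed":       # assumed definition of "reviewed"
            reviewed[path] += 1
    return {folder: (reviewed[folder], total[folder]) for folder in total}


sample = [
    ("PHS-Videos", "upload"),
    ("PHS-Videos", "unreviewed"),
    ("PHS-Images", "discard"),
]
print(folder_counts(sample))
```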
Keyboard shortcuts:
| key | action |
|---|---|
| ⌘S / Ctrl+S | Save current file |
| ⌘⏎ / Ctrl+Enter | Save & next file (in current folder) |
| Esc | Discard pending edits (when focus is in the form) |
| j / k | Next / previous file in current folder (no save) |
| 1–6 | Set status (1 unreviewed · 2 upload · 3 restricted · 4 keep_offline · 5 discard · 6 needs_attention) |
| t / n | Focus title / notes |
| u | Jump to next unreviewed file (drive-wide) |
Every successful save stamps reviewed_at and reviewed_by so you can tell
who looked at what.
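The stamping step amounts to something like the following. The `Entry` dataclass is a hypothetical stand-in for a FileEntry row, and `save_review` is an illustrative name; only the reviewed_at and reviewed_by fields come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in for a FileEntry row."""
    title: str = ""
    status: str = "unreviewed"
    reviewed_at: Optional[datetime] = None
    reviewed_by: Optional[str] = None


def save_review(entry: Entry, username: str, **changes) -> Entry:
    """Apply the edited fields, then record who saved and when."""
    for field, value in changes.items():
        if not hasattr(entry, field):
            raise AttributeError(f"unknown field: {field}")
        setattr(entry, field, value)
    entry.reviewed_at = datetime.now(timezone.utc)
    entry.reviewed_by = username
    return entry


saved = save_review(Entry(), "curator1", status="upload", title="Interview reel")
print(saved.status, saved.reviewed_by)
```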
In /admin/ (power-user / DBA)
Same models, more knobs. Browse Files › File entries, filter by drive /
file-type category / duplicate status / publication status / camera / date /
status / project group, search across filename/path/oshash/title/keywords,
and use bulk actions: Mark for padma, Unmark, Set publication: …,
Make primary. Resolve duplicates in Files › Duplicate groups. There's
also a Files › Project groups table for the typo-safe project list.
The admin's changelist has an optional inline local-drive preview column (see AGENTS.md for the spike) that uses the same mount as the SPA.
Running tests
A small smoke-test file lives at files/tests.py covering the two
load-bearing invariants of the API (auth gate, status-to-legacy sync). Run:
python manage.py test files
There are four tests, and the suite runs in well under a second. It uses Django's built-in test framework, so no extra dependencies are needed.
Project layout
phs/
├── manage.py
├── requirements.txt
├── db.sqlite3 # SQLite database (gitignored)
├── test_drive/ # local fixture for the preview UI (gitignored)
├── test_drive.json # manifest produced by scan_folder (gitignored)
├── data/ # JSON listings from external scans (gitignored)
├── deploy/ # production deploy artifacts (nginx, systemd, webhook, backup)
├── DEBUG.md # ops cheat sheet for the deployed instance
├── phs_archive/ # Django project (settings, urls, wsgi)
│ ├── settings.py # LOGIN_URL + local_settings shim at the bottom
│ └── urls.py # mounts /admin/, /api/, /curate/
└── files/ # the single Django app
  ├── models.py # Drive · ProjectGroup · DuplicateGroup · FileEntry
  ├── admin.py # Django admin config (+ local preview column)
  ├── views.py # curate_index (staff_member_required)
  ├── api.py # JSON API view functions
  ├── api_serializers.py # dict serializers, no framework
  ├── permissions.py # @staff_required_json decorator
  ├── urls.py # /api/ URL conf
  ├── tests.py # auth + sync smoke tests
  ├── migrations/ # 0001_initial · 0002_curation_v2 · 0003_backfill_curation_v2
  ├── management/commands/
  │ ├── scan_folder.py # walk a folder → manifest JSON
  │ ├── import_json.py # ingest manifest into DB
  │ ├── import_repeats.py # apply pre-computed duplicates
  │ └── update_duplicates.py # recompute duplicates from DB
  ├── templates/curate/index.html # SPA shell
  └── static/
    ├── curate/ # SPA modules (app/store/api/router/tree/detail/preview/mount/style)
    └── files/ # admin's local_previews.js + .css
The JSON contract (for the importer)
Each entry in a drive listing is a flat object. Only fields the importer reads are listed; extras are ignored:
{
"path": "PHS-Videos",
"filename": "HD.1080.mp4",
"file_type": "video/mp4",
"oshash": "050858a68cab09aa",
"size_exif": 36256397,
"size_pretty": "34.6 MB",
"date_exif": "2021:12:23 17:57:22+05:30",
"duration": 13.96,
"duration_tc": "00:00:13.960",
"resolution": "1920 x 1080",
"camera": "DC-S1 (Panasonic)"
}
scan_folder produces the same shape (minus a few media-specific fields it
doesn't compute) and feeds straight into import_json.
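Two of the fields in the sample entry have formats worth spelling out. The sketch below parses the date_exif string shown above and reproduces the duration_tc timecode from duration. It uses the standard library's `strptime` for that one format only; the project itself relies on python-dateutil to cope with more varied EXIF strings, and both helper names here are hypothetical.

```python
from datetime import datetime

# Matches the sample value "2021:12:23 17:57:22+05:30" only.
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S%z"


def parse_exif_date(value: str) -> datetime:
    """Parse an EXIF-style timestamp with a UTC offset into an aware datetime."""
    return datetime.strptime(value, EXIF_FORMAT)


def format_timecode(seconds: float) -> str:
    """Render a duration in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


print(parse_exif_date("2021:12:23 17:57:22+05:30").isoformat())
print(format_timecode(13.96))
```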
The repeats file has the shape { "<oshash>": [["<path>", "<filename>"], …] }.
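Reading that shape is straightforward. The sketch below turns the mapping into (oshash, locations) pairs; the function name and the sample paths are made up for illustration, and nothing here reflects how import_repeats actually applies the groups.

```python
import json


def iter_repeat_groups(raw: str):
    """Yield (oshash, [(path, filename), ...]) from a repeats JSON document."""
    data = json.loads(raw)
    for oshash, locations in data.items():
        yield oshash, [(path, filename) for path, filename in locations]


# Made-up sample in the documented shape.
sample = json.dumps({
    "050858a68cab09aa": [
        ["PHS-Videos", "HD.1080.mp4"],
        ["Backup/PHS-Videos", "HD.1080.mp4"],
    ]
})
for oshash, locations in iter_repeat_groups(sample):
    print(oshash, len(locations))
```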
Deployment
A live instance runs at https://phs.whydidweevendothis.com. The full
bring-up procedure, systemd units, nginx config, and push-to-deploy webhook
are in deploy/README.md. Day-to-day operational
commands (tailing logs, redeploying, rolling back, restoring a backup) are
in DEBUG.md.
Production overrides (DEBUG=False, SECRET_KEY, ALLOWED_HOSTS, TLS
cookie flags) live in phs_archive/local_settings.py on the server only —
the bottom of phs_archive/settings.py imports it if present.
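The "import it if present" pattern can be illustrated as follows. This is a standalone sketch, not the shim in settings.py: the `load_overrides` helper is hypothetical, and a real settings module would more commonly wrap a star-import of the local module in a try/except.

```python
import importlib


def load_overrides(module_name: str, namespace: dict) -> bool:
    """Copy UPPER_CASE names from an optional module into namespace.

    Returns False, leaving namespace untouched, when the module is absent.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        return False
    overrides = {k: v for k, v in vars(module).items() if k.isupper()}
    namespace.update(overrides)
    return True


settings = {"DEBUG": True}
# No such module exists here, so the development defaults are kept.
print(load_overrides("local_settings_that_does_not_exist", settings), settings)
```

Keeping the override file out of the repository means secrets such as SECRET_KEY never reach version control, while development checkouts run unchanged.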
See AGENTS.md for a deeper walkthrough of the data model, API contract, and SPA architecture.