TL;DR: RadioVault is built for long-lived collections: ingestion, organization, and retrieval that keeps working over time.
What It Does
RadioVault is an archiving and retrieval system built for large audio collections that need to stay searchable and usable over years, not weeks. It handles ingestion from multiple source formats, extracts and enriches metadata, and exposes a search interface that lets users find specific recordings without knowing exactly what they’re looking for.
The core problem it solves: audio is notoriously hard to organize at scale. Files arrive in inconsistent formats with incomplete metadata, and without active curation, a collection becomes a graveyard within months. RadioVault automates the parts of that curation that can be automated and makes the manual parts efficient.
Technical Notes
Ingestion normalizes incoming audio to a consistent format and sample rate, then runs an extraction pipeline that pulls metadata from file headers, embedded tags, and - where applicable - transcription-derived content. That metadata feeds into a search index that supports full-text queries, faceted filtering, and fuzzy matching on fields like titles, descriptions, dates, and contributor names.
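The extraction pipeline's merge step can be sketched roughly like this: metadata from file headers, embedded tags, and transcription flows through in precedence order, with later sources only filling fields earlier sources left empty. The function and field names here are illustrative, not RadioVault's actual schema.

```python
# Hypothetical sketch of layered metadata extraction: each stage contributes
# only the fields that earlier (higher-precedence) stages left blank.
def merge_metadata(*stages: dict) -> dict:
    merged: dict = {}
    for stage in stages:
        for key, value in stage.items():
            if value and not merged.get(key):
                merged[key] = value
    return merged

# Illustrative inputs: header data, embedded tags, transcript-derived fields.
header = {"sample_rate": 48000, "duration_s": 1834.2}
tags = {"title": "Morning Broadcast", "date": "", "contributor": "J. Doe"}
transcript = {"title": "", "date": "1994-06-12", "summary": "Interview segment"}

record = merge_metadata(header, tags, transcript)
# "title" comes from tags; "date" falls through to the transcript stage.
```

The payoff of this layering is that a source with sparse tags still produces a usable record, because lower-precedence stages backfill the gaps.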
Key decisions:
- Content-addressed storage - files are stored by hash, which handles deduplication natively and guarantees that what you retrieve is exactly what was ingested
- Schema-flexible metadata - different sources provide wildly different metadata. The system uses a core schema with extension points rather than forcing everything into a rigid structure that half the sources can’t populate
- Incremental indexing - re-indexing the full collection on every change doesn’t scale. New and updated records are indexed incrementally, with periodic full rebuilds for consistency checks
- Preservation-first file handling - original files are never modified. All transformations produce new derivatives, and the lineage is tracked
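The content-addressed storage decision above can be illustrated with a minimal sketch, assuming a SHA-256 digest as the address (the actual hash and layout may differ): storing identical bytes twice yields the same address, and retrieval re-hashes the bytes to verify integrity.

```python
import hashlib
import os
import tempfile

# Minimal sketch of content-addressed storage (assumed SHA-256 addressing).
# Duplicate ingests collapse to one stored object; retrieval is verifiable.
def store(root: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(root, digest[:2], digest)  # shard by digest prefix
    if not os.path.exists(path):  # identical content is already on disk
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return digest

def retrieve(root: str, digest: str) -> bytes:
    path = os.path.join(root, digest[:2], digest)
    with open(path, "rb") as f:
        data = f.read()
    # Integrity guarantee: retrieved bytes must hash back to their address.
    assert hashlib.sha256(data).hexdigest() == digest
    return data

root = tempfile.mkdtemp()
addr_a = store(root, b"audio bytes")
addr_b = store(root, b"audio bytes")  # duplicate ingest, same address
```

Deduplication falls out of the addressing scheme for free: the second `store` call finds the object already present and writes nothing.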
The operational challenge is longevity. Storage formats, search engines, and metadata standards all evolve. The architecture isolates these concerns so that swapping a search backend or migrating storage doesn’t require rethinking the data model.
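The isolation described above amounts to putting a narrow interface between the data model and each replaceable subsystem. A rough sketch of what that looks like for search, using a Python `Protocol` and a toy in-memory backend (both hypothetical, not RadioVault's real API):

```python
from typing import Protocol

# Hypothetical sketch: the data model only ever sees this narrow interface,
# so migrating to a new search engine means writing one new adapter.
class SearchBackend(Protocol):
    def index(self, record_id: str, fields: dict) -> None: ...
    def query(self, text: str) -> list[str]: ...

class InMemoryBackend:
    """Toy adapter standing in for a real search engine."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def index(self, record_id: str, fields: dict) -> None:
        self._docs[record_id] = fields

    def query(self, text: str) -> list[str]:
        needle = text.lower()
        return [
            rid
            for rid, fields in self._docs.items()
            if any(needle in str(v).lower() for v in fields.values())
        ]

backend: SearchBackend = InMemoryBackend()
backend.index("rec-1", {"title": "Morning Broadcast", "date": "1994-06-12"})
hits = backend.query("morning")
```

Because callers depend on the protocol rather than a concrete engine, swapping the backend never forces a change to the records themselves.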
Outcomes
Turned a fragmented collection of audio files into a searchable, browsable archive that users actually use for discovery - not just retrieval of known items. Ingestion time per batch dropped significantly once the normalization pipeline stabilized, and duplicate detection eliminated a meaningful percentage of redundant storage. The system has scaled cleanly as the collection has grown, which was the real test - archiving tools that don’t survive their own success aren’t archiving tools.