TL;DR: RadioVault is built for long-lived collections: ingestion, organization, and retrieval that keeps working over time.
What It Does
RadioVault is an archiving and retrieval system built for large audio collections that need to stay searchable and usable over years, not weeks. It handles ingestion from multiple source formats, extracts and enriches metadata, and exposes a search interface that lets users find specific recordings without knowing exactly what they’re looking for.
The core problem it solves: audio is notoriously hard to organize at scale. Files arrive in inconsistent formats with incomplete metadata, and without active curation, a collection becomes a graveyard within months. RadioVault automates the parts of that curation that can be automated and makes the manual parts efficient.
Technical Notes
Ingestion normalizes incoming audio to a consistent format and sample rate, then runs an extraction pipeline that pulls metadata from file headers, embedded tags, and - where applicable - transcription-derived content. That metadata feeds into a search index that supports full-text queries, faceted filtering, and fuzzy matching on fields like titles, descriptions, dates, and contributor names.
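The extraction pipeline's merge step can be sketched roughly like this: metadata from file headers, embedded tags, and transcription flows through in precedence order, with later sources only filling fields earlier sources left empty. The function and field names here are illustrative, not RadioVault's actual schema.

```python
# Hypothetical sketch of layered metadata extraction: each stage contributes
# only the fields that earlier (higher-precedence) stages left blank.
def merge_metadata(*stages: dict) -> dict:
    merged: dict = {}
    for stage in stages:
        for key, value in stage.items():
            if value and not merged.get(key):
                merged[key] = value
    return merged

# Illustrative inputs: header data, embedded tags, transcript-derived fields.
header = {"sample_rate": 48000, "duration_s": 1834.2}
tags = {"title": "Morning Broadcast", "date": "", "contributor": "J. Doe"}
transcript = {"title": "", "date": "1994-06-12", "summary": "Interview segment"}

record = merge_metadata(header, tags, transcript)
# "title" comes from tags; "date" falls through to the transcript stage.
```

The payoff of this layering is that a source with sparse tags still produces a usable record, because lower-precedence stages backfill the gaps.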
Key decisions:
- Content-addressed storage - files are stored by hash, which handles deduplication natively and guarantees that what you retrieve is exactly what was ingested
- Schema-flexible metadata - different sources provide wildly different metadata. The system uses a core schema with extension points rather than forcing everything into a rigid structure that half the sources can’t populate
- Incremental indexing - re-indexing the full collection on every change doesn’t scale. New and updated records are indexed incrementally, with periodic full rebuilds for consistency checks
- Preservation-first file handling - original files are never modified. All transformations produce new derivatives, and the lineage is tracked
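The content-addressed storage decision above can be illustrated with a minimal sketch, assuming a SHA-256 digest as the address (the actual hash and layout may differ): storing identical bytes twice yields the same address, and retrieval re-hashes the bytes to verify integrity.

```python
import hashlib
import os
import tempfile

# Minimal sketch of content-addressed storage (assumed SHA-256 addressing).
# Duplicate ingests collapse to one stored object; retrieval is verifiable.
def store(root: str, data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(root, digest[:2], digest)  # shard by digest prefix
    if not os.path.exists(path):  # identical content is already on disk
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return digest

def retrieve(root: str, digest: str) -> bytes:
    path = os.path.join(root, digest[:2], digest)
    with open(path, "rb") as f:
        data = f.read()
    # Integrity guarantee: retrieved bytes must hash back to their address.
    assert hashlib.sha256(data).hexdigest() == digest
    return data

root = tempfile.mkdtemp()
addr_a = store(root, b"audio bytes")
addr_b = store(root, b"audio bytes")  # duplicate ingest, same address
```

Deduplication falls out of the addressing scheme for free: the second `store` call finds the object already present and writes nothing.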
The operational challenge is longevity. Storage formats, search engines, and metadata standards all evolve. The architecture isolates these concerns so that swapping a search backend or migrating storage doesn’t require rethinking the data model.
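The isolation described above amounts to putting a narrow interface between the data model and each replaceable subsystem. A rough sketch of what that looks like for search, using a Python `Protocol` and a toy in-memory backend (both hypothetical, not RadioVault's real API):

```python
from typing import Protocol

# Hypothetical sketch: the data model only ever sees this narrow interface,
# so migrating to a new search engine means writing one new adapter.
class SearchBackend(Protocol):
    def index(self, record_id: str, fields: dict) -> None: ...
    def query(self, text: str) -> list[str]: ...

class InMemoryBackend:
    """Toy adapter standing in for a real search engine."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def index(self, record_id: str, fields: dict) -> None:
        self._docs[record_id] = fields

    def query(self, text: str) -> list[str]:
        needle = text.lower()
        return [
            rid
            for rid, fields in self._docs.items()
            if any(needle in str(v).lower() for v in fields.values())
        ]

backend: SearchBackend = InMemoryBackend()
backend.index("rec-1", {"title": "Morning Broadcast", "date": "1994-06-12"})
hits = backend.query("morning")
```

Because callers depend on the protocol rather than a concrete engine, swapping the backend never forces a change to the records themselves.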
Outcomes
Turned a fragmented collection of audio files into a searchable, browsable archive that users actually use for discovery - not just retrieval of known items. Ingestion time per batch dropped significantly once the normalization pipeline stabilized, and duplicate detection eliminated a meaningful percentage of redundant storage. The system has scaled cleanly as the collection has grown, which was the real test - archiving tools that don’t survive their own success aren’t archiving tools.