a2t is a Go utility that converts audio to text. The core idea: easily and efficiently transcribe Greek-language .mp3 files accessible over HTTP.
Transcription leverages Automatic Speech Recognition (ASR) models, while the runtime workload is delegated to external services such as runpod.io. Separate components handle audio discovery (the seeker) and audio processing (the worker).
An intuitive command-line interface lets the user interact with the daemon and control the overall workflow. The daemon uses an SQLite database to store audio processing tasks, internal state, and transcripts. Transcription artifacts can optionally be indexed into Elasticsearch for an advanced search experience.
docs/config/config-example.yaml provides a configuration sufficient to get started with a2t.
The https://huggingface.co/Sandiago21/whisper-large-v2-greek model produces quite satisfying and accurate transcriptions overall. In tests with specific audio files it appeared to outperform OpenAI's Whisper models for Greek.
misc/ provides additional material for the external services that a2t requires. Prominently:
- misc/runpod.io contains material to set up a runpod.io serverless endpoint; a2t employs the endpoint for audio processing
- misc/elasticsearch contains material to locally spawn a simple Elasticsearch cluster
The system implements a producer-consumer architecture with asynchronous task processing built around a single binary.
The daemon runs multiple goroutines: an HTTP-based seeker that periodically scans remote URLs for MP3 files and creates audio file records, and a worker that consumes processing tasks from SQLite-backed queues.
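The seeker/worker split can be sketched as a small producer-consumer program. Here a buffered channel stands in for a2t's SQLite-backed task queue, and all names (`drain`, the sample URLs) are illustrative rather than the actual a2t API:

```go
package main

import (
	"fmt"
	"sync"
)

// drain runs a seeker goroutine that enqueues the given URLs and a
// worker goroutine that consumes them, returning the processed order.
func drain(urls []string) []string {
	tasks := make(chan string, len(urls))
	var (
		wg   sync.WaitGroup
		done []string
	)

	// seeker: discovers MP3 URLs and enqueues them as tasks
	go func() {
		for _, u := range urls {
			tasks <- u
		}
		close(tasks)
	}()

	// worker: consumes tasks from the queue until it is drained
	wg.Add(1)
	go func() {
		defer wg.Done()
		for u := range tasks {
			done = append(done, u)
		}
	}()
	wg.Wait()
	return done
}

func main() {
	fmt.Println(drain([]string{"a.mp3", "b.mp3"})) // [a.mp3 b.mp3]
}
```

In the real daemon the queue is durable (SQLite), so tasks survive restarts; the channel version above only illustrates the control flow.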
The seeker uses an HTTPScanner to discover audio files at configurable intervals, checking file existence against the database to avoid duplicates and generating unique IDs and hashes for new discoveries. Each audio file transitions through states (discovered, queued, processing, completed, failed) tracked in the database with metadata including creation timestamps and optional transcript associations.
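The state lifecycle above can be modeled as a small transition table. The state names mirror the README; the `State` type, the `next` map, and the retry edge from failed back to queued are illustrative assumptions, not a2t's actual code:

```go
package main

import "fmt"

// State models the lifecycle of a discovered audio file.
type State string

const (
	Discovered State = "discovered"
	Queued     State = "queued"
	Processing State = "processing"
	Completed  State = "completed"
	Failed     State = "failed"
)

// next lists the legal transitions out of each state.
// Completed is terminal; Failed may be retried (an assumption).
var next = map[State][]State{
	Discovered: {Queued},
	Queued:     {Processing},
	Processing: {Completed, Failed},
	Failed:     {Queued},
}

// CanTransition reports whether moving from a to b is allowed.
func CanTransition(a, b State) bool {
	for _, s := range next[a] {
		if s == b {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Discovered, Queued))    // true
	fmt.Println(CanTransition(Completed, Processing)) // false
}
```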
The transcription workflow uses HTTP clients to submit audio files to remote ASR endpoints, handling retries, rate limiting, and failure recovery. Database transactions ensure atomicity during state changes, while connection pooling manages concurrent access.
The CLI communicates with the daemon via Unix domain sockets using a command-response protocol, with automatic daemon startup for client operations. Components are controlled through the daemon's component manager rather than direct execution. Optional Elasticsearch integration uses bulk indexing for transcript ingestion and maintains document versioning for updates.
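A minimal round trip over a Unix domain socket looks like the sketch below. The `ok: <command>` reply format, the socket path, and the `status` command are invented for illustration; a2t's real wire protocol may differ:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip starts a throwaway "daemon" on a Unix socket, sends it
// one command line, and returns the single reply line.
func roundTrip(cmd string) (string, error) {
	sock := filepath.Join(os.TempDir(), "a2t-demo.sock")
	os.Remove(sock) // remove any stale socket file

	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// daemon side: answer one command, then hang up
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, _ := bufio.NewReader(conn).ReadString('\n')
		fmt.Fprintf(conn, "ok: %s\n", strings.TrimSpace(line))
	}()

	// CLI side: dial, send the command, read one reply line
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintln(conn, cmd)
	reply, err := bufio.NewReader(conn).ReadString('\n')
	return strings.TrimSpace(reply), err
}

func main() {
	reply, err := roundTrip("status")
	if err != nil {
		panic(err)
	}
	fmt.Println(reply) // ok: status
}
```

Unix sockets keep the CLI-daemon channel local to the machine and allow file-permission-based access control, which is why they are a common choice for this pattern.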
Check docs/design.md for more.
├── cmd/ # CLI application entry points and commands
├── pkg/ # Private application logic
├── misc/ # External service configurations and helpers
│ ├── runpod.io/ # Runpod serverless endpoint setup
│ └── elasticsearch/ # Local Elasticsearch cluster configuration
├── config.yaml # Main configuration file
└── docs/ # Project documentation
- Go 1.21
- SQLite - Embedded database for task management, state persistence, and transcript storage
- Runpod.io - External serverless platform for GPU-accelerated ASR model inference
- Whisper models - Automatic Speech Recognition models, specifically whisper-large-v2-greek for enhanced accuracy
- Elasticsearch - Optional search engine integration for advanced transcript querying and indexing
This project is licensed under the MIT License - see the LICENSE file for details.
