irregulator/audio2text

a2t/audio2text - audio transcription utility

a2t is a Go utility for audio-to-text conversion. The core idea: easily and efficiently transcribe Greek-language .mp3 files that are accessible over HTTP.

Transcription relies on Automatic Speech Recognition (ASR) models, while the runtime workload is delegated to external services such as runpod.io. Separate components handle audio discovery (the seeker) and audio processing (the worker).

A command-line interface lets the user interact with the daemon and control the overall workflow. The daemon uses an SQLite database to store audio processing tasks, internal state, and transcripts. Transcripts can optionally be indexed into Elasticsearch for advanced search.

Quick start

Configuration

docs/config/config-example.yaml is a sufficient configuration to get started with a2t.

Models

The https://huggingface.co/Sandiago21/whisper-large-v2-greek model provides satisfying and accurate transcriptions overall. In tests with specific audio files, it seems to produce better results for Greek than OpenAI's Whisper models.

Misc helpers

misc/ provides additional material related to the external services that a2t requires. Notably:

  • misc/runpod.io contains material to set up a runpod.io serverless endpoint, which a2t uses for audio processing
  • misc/elasticsearch contains material to spawn a simple Elasticsearch cluster locally

Design

The system implements a producer-consumer architecture with asynchronous task processing built around a single binary.

The daemon runs multiple goroutines: an HTTP-based seeker that periodically scans remote URLs for MP3 files and creates audio file records, and a worker that consumes processing tasks from SQLite-backed queues.
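The seeker/worker split can be sketched as a small producer-consumer pair. This is only an illustration: a buffered channel stands in for the SQLite-backed queue, and the names `task`, `seek`, and `run` are hypothetical rather than taken from the a2t code base.

```go
package main

import (
	"fmt"
	"sync"
)

// task is a hypothetical unit of work: one audio file to transcribe.
type task struct{ URL string }

// seek plays the producer. It enqueues every URL and closes the queue
// when there is nothing left to discover.
func seek(urls []string, queue chan<- task) {
	defer close(queue)
	for _, u := range urls {
		queue <- task{URL: u}
	}
}

// run starts one seeker goroutine and one worker goroutine, waits for
// both to finish, and returns what the worker processed.
func run(urls []string) []string {
	queue := make(chan task, 4) // stand-in for the SQLite-backed queue
	var (
		wg   sync.WaitGroup
		done []string
	)
	wg.Add(2)
	go func() {
		defer wg.Done()
		seek(urls, queue)
	}()
	go func() {
		defer wg.Done()
		for t := range queue {
			// A real worker would submit the file to the ASR endpoint here.
			done = append(done, "transcribed "+t.URL)
		}
	}()
	wg.Wait() // done is only read after the worker goroutine has exited
	return done
}

func main() {
	for _, line := range run([]string{
		"https://example.com/a.mp3",
		"https://example.com/b.mp3",
	}) {
		fmt.Println(line)
	}
}
```

A persistent queue, unlike the channel used here, would let pending tasks survive a daemon restart.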

The seeker uses an HTTPScanner to discover audio files at configurable intervals, checking file existence against the database to avoid duplicates and generating unique IDs and hashes for new discoveries. Each audio file transitions through states (discovered, queued, processing, completed, failed) tracked in the database with metadata including creation timestamps and optional transcript associations.
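As a rough sketch of the duplicate check and the state tracking described above, the example below keeps records in an in-memory map instead of SQLite. The state names come from the description; the transition table, the SHA-256-of-URL identifier, and the `Store` and `AudioFile` types are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type State string

const (
	Discovered State = "discovered"
	Queued     State = "queued"
	Processing State = "processing"
	Completed  State = "completed"
	Failed     State = "failed"
)

// allowed lists the assumed legal transitions; a2t's real rules may differ.
var allowed = map[State][]State{
	Discovered: {Queued},
	Queued:     {Processing},
	Processing: {Completed, Failed},
	Failed:     {Queued}, // assumption: failed files can be re-queued
}

type AudioFile struct {
	ID    string
	URL   string
	State State
}

// Transition moves the file to next, or returns an error if the move
// is not in the allowed table.
func (a *AudioFile) Transition(next State) error {
	for _, s := range allowed[a.State] {
		if s == next {
			a.State = next
			return nil
		}
	}
	return fmt.Errorf("illegal transition %s -> %s", a.State, next)
}

// Store is an in-memory stand-in for the database.
type Store struct{ files map[string]*AudioFile }

func NewStore() *Store { return &Store{files: make(map[string]*AudioFile)} }

// Discover registers url unless it is already known, and reports
// whether a new record was created.
func (s *Store) Discover(url string) (*AudioFile, bool) {
	sum := sha256.Sum256([]byte(url)) // hypothetical ID scheme
	id := hex.EncodeToString(sum[:])
	if f, ok := s.files[id]; ok {
		return f, false
	}
	f := &AudioFile{ID: id, URL: url, State: Discovered}
	s.files[id] = f
	return f, true
}

func main() {
	store := NewStore()
	f, isNew := store.Discover("https://example.com/a.mp3")
	fmt.Println("new:", isNew, "state:", f.State)
	_, isNew = store.Discover("https://example.com/a.mp3")
	fmt.Println("new on second scan:", isNew)
	fmt.Println("skip ahead:", f.Transition(Completed))
	fmt.Println("queue:", f.Transition(Queued), "state:", f.State)
}
```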

The transcription workflow uses HTTP clients to submit audio files to remote ASR endpoints, handling retries, rate limiting, and failure recovery. Database transactions ensure atomicity during state changes, while connection pooling manages concurrent access.

The CLI communicates with the daemon via Unix domain sockets using a command-response protocol, with automatic daemon startup for client operations. Components are controlled through the daemon's component manager rather than direct execution. Optional Elasticsearch integration uses bulk indexing for transcript ingestion and maintains document versioning for updates.
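A command-response exchange over a Unix domain socket could be sketched as follows. The wire format (one newline-terminated command, one newline-terminated reply) and the command names are hypothetical; the actual a2t protocol and its automatic daemon startup are not shown.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// handle maps a command line to a reply. Command names are made up.
func handle(cmd string) string {
	switch strings.TrimSpace(cmd) {
	case "status":
		return "ok seeker=running worker=running"
	case "stop worker":
		return "ok worker=stopped"
	default:
		return "error unknown command"
	}
}

// serve answers one command per connection until the listener closes.
func serve(l net.Listener) {
	for {
		conn, err := l.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			line, err := bufio.NewReader(c).ReadString('\n')
			if err != nil {
				return
			}
			fmt.Fprintln(c, handle(line))
		}(conn)
	}
}

// send dials the socket, writes one command, and returns the reply.
func send(sock, cmd string) (string, error) {
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprintln(conn, cmd); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(reply), nil
}

// demo starts a throwaway daemon on a temporary socket and sends it
// a single command, acting as both sides of the exchange.
func demo(cmd string) (string, error) {
	dir, err := os.MkdirTemp("", "a2t")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "daemon.sock")
	l, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer l.Close()
	go serve(l)
	return send(sock, cmd)
}

func main() {
	reply, err := demo("status")
	fmt.Println(reply, err)
}
```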

See docs/design.md for more details.

Project layout

├── cmd/               # CLI application entry points and commands
├── pkg/               # Private application logic
├── misc/              # External service configurations and helpers
│   ├── runpod.io/     # Runpod serverless endpoint setup
│   └── elasticsearch/ # Local Elasticsearch cluster configuration
├── config.yaml        # Main configuration file
└── docs/              # Project documentation

Technologies

  • Go 1.21
  • SQLite - Embedded database for task management, state persistence, and transcript storage
  • Runpod.io - External serverless platform for GPU-accelerated ASR model inference
  • Whisper models - Automatic Speech Recognition models, specifically whisper-large-v2-greek for enhanced accuracy
  • Elasticsearch - Optional search engine integration for advanced transcript querying and indexing

License

This project is licensed under the MIT License - see the LICENSE file for details.
