a2t is a Go utility that converts audio to text. The core idea: easily and efficiently transcribe Greek-language .mp3 files accessible over HTTP.
Transcription leverages Automatic Speech Recognition (ASR) models, while the runtime workload is delegated to external services such as runpod.io. Separate components handle audio discovery (the seeker) and audio processing (the worker).
An intuitive command-line interface lets the user interact with the daemon and control the overall workflow. The daemon uses an SQLite database to store audio processing tasks, internal state, and transcripts. Transcription artifacts can optionally be indexed into Elasticsearch for an advanced search experience.
docs/config/config-example.yaml provides a configuration sufficient to get started with a2t.
The https://huggingface.co/Sandiago21/whisper-large-v2-greek model produces quite satisfying and accurate transcriptions overall. In tests with specific audio files it appeared to outperform OpenAI's Whisper models for Greek.
misc/ provides additional material for the external services that a2t requires. Prominently:
- misc/runpod.io contains material to set up a runpod.io serverless endpoint; a2t employs the endpoint for audio processing
- misc/elasticsearch contains material to locally spawn a simple Elasticsearch cluster
The system implements a producer-consumer architecture with asynchronous task processing built around a single binary.
The daemon runs multiple goroutines: an HTTP-based seeker that periodically scans remote URLs for MP3 files and creates audio file records, and a worker that consumes processing tasks from SQLite-backed queues.
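The seeker/worker split can be sketched as a small producer-consumer program. Here a buffered channel stands in for a2t's SQLite-backed task queue, and all names (`drain`, the sample URLs) are illustrative rather than the actual a2t API:

```go
package main

import (
	"fmt"
	"sync"
)

// drain runs a seeker goroutine that enqueues the given URLs and a
// worker goroutine that consumes them, returning the processed order.
func drain(urls []string) []string {
	tasks := make(chan string, len(urls))
	var (
		wg   sync.WaitGroup
		done []string
	)

	// seeker: discovers MP3 URLs and enqueues them as tasks
	go func() {
		for _, u := range urls {
			tasks <- u
		}
		close(tasks)
	}()

	// worker: consumes tasks from the queue until it is drained
	wg.Add(1)
	go func() {
		defer wg.Done()
		for u := range tasks {
			done = append(done, u)
		}
	}()
	wg.Wait()
	return done
}

func main() {
	fmt.Println(drain([]string{"a.mp3", "b.mp3"})) // [a.mp3 b.mp3]
}
```

In the real daemon the queue is durable (SQLite), so tasks survive restarts; the channel version above only illustrates the control flow.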
The seeker uses an HTTPScanner to discover audio files at configurable intervals, checking file existence against the database to avoid duplicates and generating unique IDs and hashes for new discoveries. Each audio file transitions through states (discovered, queued, processing, completed, failed) tracked in the database with metadata including creation timestamps and optional transcript associations.
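The state lifecycle above can be modeled as a small transition table. The state names mirror the README; the `State` type, the `next` map, and the retry edge from failed back to queued are illustrative assumptions, not a2t's actual code:

```go
package main

import "fmt"

// State models the lifecycle of a discovered audio file.
type State string

const (
	Discovered State = "discovered"
	Queued     State = "queued"
	Processing State = "processing"
	Completed  State = "completed"
	Failed     State = "failed"
)

// next lists the legal transitions out of each state.
// Completed is terminal; Failed may be retried (an assumption).
var next = map[State][]State{
	Discovered: {Queued},
	Queued:     {Processing},
	Processing: {Completed, Failed},
	Failed:     {Queued},
}

// CanTransition reports whether moving from a to b is allowed.
func CanTransition(a, b State) bool {
	for _, s := range next[a] {
		if s == b {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Discovered, Queued))    // true
	fmt.Println(CanTransition(Completed, Processing)) // false
}
```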
The transcription workflow uses HTTP clients to submit audio files to remote ASR endpoints, handling retries, rate limiting, and failure recovery. Database transactions ensure atomicity during state changes, while connection pooling manages concurrent access.
The CLI communicates with the daemon via Unix domain sockets using a command-response protocol, with automatic daemon startup for client operations. Components are controlled through the daemon's component manager rather than direct execution. Optional Elasticsearch integration uses bulk indexing for transcript ingestion and maintains document versioning for updates.
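A minimal round trip over a Unix domain socket looks like the sketch below. The `ok: <command>` reply format, the socket path, and the `status` command are invented for illustration; a2t's real wire protocol may differ:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip starts a throwaway "daemon" on a Unix socket, sends it
// one command line, and returns the single reply line.
func roundTrip(cmd string) (string, error) {
	sock := filepath.Join(os.TempDir(), "a2t-demo.sock")
	os.Remove(sock) // remove any stale socket file

	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	defer ln.Close()

	// daemon side: answer one command, then hang up
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		line, _ := bufio.NewReader(conn).ReadString('\n')
		fmt.Fprintf(conn, "ok: %s\n", strings.TrimSpace(line))
	}()

	// CLI side: dial, send the command, read one reply line
	conn, err := net.Dial("unix", sock)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	fmt.Fprintln(conn, cmd)
	reply, err := bufio.NewReader(conn).ReadString('\n')
	return strings.TrimSpace(reply), err
}

func main() {
	reply, err := roundTrip("status")
	if err != nil {
		panic(err)
	}
	fmt.Println(reply) // ok: status
}
```

Unix sockets keep the CLI-daemon channel local to the machine and allow file-permission-based access control, which is why they are a common choice for this pattern.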
Check docs/design.md for more.
├── cmd/ # CLI application entry points and commands
├── pkg/ # Private application logic
├── misc/ # External service configurations and helpers
│ ├── runpod.io/ # Runpod serverless endpoint setup
│ └── elasticsearch/ # Local Elasticsearch cluster configuration
├── config.yaml # Main configuration file
└── docs/ # Project documentation
- Go 1.21
- SQLite - Embedded database for task management, state persistence, and transcript storage
- Runpod.io - External serverless platform for GPU-accelerated ASR model inference
- Whisper models - Automatic Speech Recognition models, specifically whisper-large-v2-greek for enhanced accuracy
- Elasticsearch - Optional search engine integration for advanced transcript querying and indexing
This project is licensed under the MIT License - see the LICENSE file for details.
