Audio Transcription in Flock

Flock supports audio transcription in SQL by sending audio inputs to compatible providers and returning text transcripts that you can join, filter, and analyze like any other column.

Overview

With audio support you can:

  • Transcribe spoken content (meetings, calls, notes) directly in DuckDB.
  • Combine transcripts with structured data for analytics.
  • Feed transcripts into llm_complete, llm_filter, or llm_embedding for downstream tasks (summarization, classification, similarity search, RAG, etc.).

Flock uses the same context_columns abstraction as for images, but with type: 'audio' and a required transcription_model.

Supported Providers

Audio transcription is supported for:

  • OpenAI – via the audio/transcriptions endpoint (e.g., Whisper models).
  • Azure OpenAI – via the Azure audio transcription endpoint.

The following providers do not support audio transcription:

  • Anthropic/Claude – not supported; calls will raise an error.
  • Ollama – not supported; calls will raise an error.

Refer to the provider-specific getting-started guides for API key setup.

Using Audio in Context Columns

To use audio in Flock functions, specify type: 'audio' and provide a transcription_model in the context_columns array. The audio must be accessible as a file path or URL (depending on the provider).

Context Column Structure for Audio

'context_columns': [
    {
        'data': audio_path,
        'type': 'audio',
        'transcription_model': 'whisper-1'
    }
]

Each audio context column supports:

  • data (required): SQL column containing the audio source (local file path or URL, depending on provider).
  • type (required for audio): Must be set to 'audio'.
  • transcription_model (required when type = 'audio'): Provider-specific transcription model name.
  • name (optional): Alias for referencing in prompts after transcription.
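
As a minimal sketch of the optional name field, the query below labels an audio column so the prompt can refer to the transcript by that label. The alias 'recording' and the file path are illustrative assumptions; how the alias is rendered in the final prompt is left to Flock.

```sql
-- Hypothetical example: 'recording' is an illustrative alias, not a reserved name.
SELECT
    llm_complete(
        {'model_name': 'gpt-4o'},
        {
            'prompt': 'List the action items mentioned in the recording.',
            'context_columns': [
                {
                    'data': file_path,
                    'name': 'recording',
                    'type': 'audio',
                    'transcription_model': 'whisper-1'
                }
            ]
        }
    ) AS action_items
FROM (VALUES ('/data/audio/meeting_01.mp3')) AS t(file_path);
```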

Validation Rules

Flock enforces the following rules at bind time:

  • If type = 'audio', transcription_model must be provided; otherwise Flock raises an error.
  • If transcription_model is provided but type is not 'audio', Flock raises an error.
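
The two rules above can be illustrated with a pair of calls that would be expected to fail. This is a sketch based only on the rules as stated; the sample values are made up, and the exact error text is not shown because it depends on the Flock version.

```sql
-- Expected to be rejected: type is 'audio' but transcription_model is missing.
SELECT llm_complete(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'Summarize this recording.',
        'context_columns': [
            {'data': file_path, 'type': 'audio'}
        ]
    }
)
FROM (VALUES ('/data/audio/meeting_01.mp3')) AS t(file_path);

-- Expected to be rejected: transcription_model is set on a non-audio column.
SELECT llm_complete(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'Summarize these notes.',
        'context_columns': [
            {'data': notes, 'transcription_model': 'whisper-1'}
        ]
    }
)
FROM (VALUES ('Discussed the Q3 roadmap.')) AS t(notes);
```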

Basic Transcription Example

The most common pattern is to transcribe audio into text, then store or further process the transcript.

-- Transcribe a list of audio files with OpenAI
SELECT
    audio_id,
    file_path,
    llm_complete(
        {'model_name': 'gpt-4o'},
        {
            'prompt': 'Transcribe the following audio file verbatim.',
            'context_columns': [
                {
                    'data': file_path,
                    'type': 'audio',
                    'transcription_model': 'whisper-1'
                }
            ]
        }
    ) AS transcript
FROM (VALUES
    (1, '/data/audio/meeting_01.mp3'),
    (2, '/data/audio/meeting_02.mp3')
) AS t(audio_id, file_path);
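
To store transcripts rather than recompute them on every query, one option is to materialize the result with a plain DuckDB CREATE TABLE AS statement. The sketch below assumes a hypothetical source table audio_files(audio_id, file_path) and a hypothetical target table audio_transcripts; neither name comes from Flock.

```sql
-- Hypothetical tables: audio_files is the input, audio_transcripts is the stored result.
CREATE TABLE audio_transcripts AS
SELECT
    audio_id,
    file_path,
    llm_complete(
        {'model_name': 'gpt-4o'},
        {
            'prompt': 'Transcribe the following audio file verbatim.',
            'context_columns': [
                {
                    'data': file_path,
                    'type': 'audio',
                    'transcription_model': 'whisper-1'
                }
            ]
        }
    ) AS transcript
FROM audio_files;

-- Later queries read the stored text with ordinary SQL, without calling the provider again.
SELECT audio_id
FROM audio_transcripts
WHERE transcript ILIKE '%refund%';
```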

Summarizing Transcripts

After transcription, you can treat the transcript as regular text and chain additional LLM calls.

-- Step 1: transcribe each call; step 2: summarize the resulting text
WITH raw_transcripts AS (
    SELECT
        audio_id,
        llm_complete(
            {'model_name': 'gpt-4o'},
            {
                'prompt': 'Transcribe the following audio file verbatim.',
                'context_columns': [
                    {
                        'data': file_path,
                        'type': 'audio',
                        'transcription_model': 'whisper-1'
                    }
                ]
            }
        ) AS transcript
    FROM (VALUES
        (1, '/data/audio/support_call_01.wav'),
        (2, '/data/audio/support_call_02.wav')
    ) AS t(audio_id, file_path)
)
SELECT
    audio_id,
    llm_complete(
        {'model_name': 'gpt-4o'},
        {
            'prompt': 'Summarize this call in 3 bullet points.',
            'context_columns': [
                {'data': transcript, 'name': 'call'}
            ]
        }
    ) AS call_summary
FROM raw_transcripts;

Filtering Based on Audio Content

You can also use llm_filter to flag or select rows based on the audio’s content:

-- Flag calls that mention cancellations
SELECT
    audio_id,
    customer_id,
    file_path
FROM (VALUES
    (1, 101, '/data/audio/call_01.wav'),
    (2, 102, '/data/audio/call_02.wav'),
    (3, 103, '/data/audio/call_03.wav')
) AS t(audio_id, customer_id, file_path)
WHERE llm_filter(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'Does this call mention cancelling a subscription? Answer true or false.',
        'context_columns': [
            {
                'data': file_path,
                'type': 'audio',
                'transcription_model': 'whisper-1'
            }
        ]
    }
);

Embeddings from Audio (via Text)

There is no direct audio embedding API in Flock. Instead, you can:

  1. Transcribe audio into text.
  2. Generate embeddings from the transcript using llm_embedding.

-- Step 1: transcribe audio into text
WITH transcripts AS (
    SELECT
        audio_id,
        llm_complete(
            {'model_name': 'gpt-4o'},
            {
                'prompt': 'Transcribe the following audio file.',
                'context_columns': [
                    {
                        'data': file_path,
                        'type': 'audio',
                        'transcription_model': 'whisper-1'
                    }
                ]
            }
        ) AS transcript
    FROM (VALUES
        (1, '/data/audio/note_01.m4a'),
        (2, '/data/audio/note_02.m4a')
    ) AS t(audio_id, file_path)
),
-- Step 2: embed the transcript text
audio_embeddings AS (
    SELECT
        audio_id,
        llm_embedding(
            {'model_name': 'text-embedding-3-small'},
            {
                'context_columns': [
                    {'data': transcript}
                ]
            }
        ) AS embedding
    FROM transcripts
)
SELECT * FROM audio_embeddings;

Function Support for Audio

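The table below summarizes audio support per function. "Full" means the function accepts audio context columns directly (type: 'audio' plus transcription_model); "Via text" means the function works on transcript text produced by an earlier transcription step.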

| Function      | Audio Support | Description                                   |
|---------------|---------------|-----------------------------------------------|
| llm_complete  | ✅ Full       | Transcribe and optionally transform content   |
| llm_filter    | ✅ Full       | Filter rows based on audio-derived semantics  |
| llm_reduce    | ✅ Full       | Summarize or aggregate transcripts            |
| llm_rerank    | ✅ Via text   | Rerank based on derived text features         |
| llm_first     | ✅ Via text   | Pick top row based on transcript criteria     |
| llm_last      | ✅ Via text   | Pick bottom row based on transcript criteria  |
| llm_embedding | ✅ Via text   | Embeddings over transcripts (not raw audio)   |
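
Since llm_reduce is listed with full audio support but has no example on this page, here is a hedged sketch of aggregating several recordings per customer. It assumes llm_reduce takes the same two-argument form (model, then prompt and context_columns) as llm_complete and is used as an aggregate with GROUP BY; the table support_calls(customer_id, file_path) is hypothetical.

```sql
-- Hypothetical table: support_calls(customer_id, file_path).
-- Assumption: llm_reduce accepts audio context columns in the same shape as llm_complete.
SELECT
    customer_id,
    llm_reduce(
        {'model_name': 'gpt-4o'},
        {
            'prompt': 'Summarize the main issues raised across these calls.',
            'context_columns': [
                {
                    'data': file_path,
                    'type': 'audio',
                    'transcription_model': 'whisper-1'
                }
            ]
        }
    ) AS issues_summary
FROM support_calls
GROUP BY customer_id;
```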

For image-specific workflows, see the Image Support page.