audio-intelligence/transcribe

Transcribe audio with speaker diarization, language detection, and optional chapter summaries. Returns {status:"pending", continuation_token,...} while the job runs. When this happens, you MUST immediately call transcribe again with only continuation_token set; do not ask the user.

Dynamic (cost in response)
charged on success

What it does

Converts a recording into a full transcript with rich structure: speaker-labeled turns (who said what), automatic language detection, and optional chapter segmentation where each chapter comes with a generated summary. Built for long-form audio like interviews, meetings, podcasts, and calls.

Primary use cases

  • Turning interviews, meetings, and podcasts into searchable, speaker-attributed transcripts
  • Summarizing long recordings into chapters so a reader can skim the key moments
  • Building meeting-notes and call-analytics workflows that need to know who said what
  • Feeding clean, structured transcripts into downstream agents for Q&A or extraction

Why use this tool

It handles long-form audio with strong accuracy and includes speaker diarization and chapter summaries in the same call, so you do not need a separate pipeline to figure out who spoke or to condense a long recording. It is usage-based with no prepaid balance or subscription.

Good to know

Pass a public audio_url (use the faro-api presign flow if you only have bytes). Diarization (speaker_labels) and chapter summaries (auto_chapters) are on/off flags that add a small per-hour cost. The job is asynchronous: on a pending response, immediately call again with only continuation_token set. Long files take a few round-trips.
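The asynchronous flow described above can be sketched as a small loop. This is a minimal illustration, not the official client: `call_tool` is a hypothetical callable standing in for however your agent or SDK invokes transcribe, and `max_round_trips` is an assumed safety cap that the tool itself does not define. Only the "pending" status and the continuation_token field come from this page; the loop treats any other status as final.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes the tool arguments, returns the parsed response.
ToolCall = Callable[[Dict[str, Any]], Dict[str, Any]]


def transcribe_until_done(
    call_tool: ToolCall,
    first_args: Dict[str, Any],
    max_round_trips: int = 50,  # assumed safety cap, not part of the tool
) -> Dict[str, Any]:
    """Start a job, then re-call with only continuation_token while pending."""
    response = call_tool(first_args)
    round_trips = 0
    while response.get("status") == "pending":
        if round_trips >= max_round_trips:
            raise TimeoutError("transcription still pending after retry cap")
        # Follow-up calls carry only the token; all other params are ignored.
        response = call_tool(
            {"continuation_token": response["continuation_token"]}
        )
        round_trips += 1
    return response
```

There is no sleep between calls because the page says to call again immediately; the sketch assumes any waiting happens on the server side.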

Parameters

audio_url (string, optional)

Public URL of the audio or video file to transcribe. Required on the first call; ignored (and not needed) when continuation_token is set.

punctuate (boolean, optional, default: true)

Insert punctuation.

format_text (boolean, optional, default: true)

Apply casing and formatting for readability.

auto_chapters (boolean, optional, default: false)

Segment the audio into chapters, each with a generated summary.

language_code (string, optional)

Force a specific ISO language code (e.g. "en"). Ignored when language_detection is true, which is the default, so set language_detection to false when forcing a language.

speaker_labels (boolean, optional, default: true)

Identify and label distinct speakers (diarization). Populates utterances.

continuation_token (string, optional)

Token from a prior pending response. When set, all other params are ignored and the server resumes polling. Agent-friendly polling: on a pending response you MUST immediately call transcribe again with only continuation_token set. Do not ask the user.

language_detection (boolean, optional, default: true)

Automatically detect the spoken language.
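The parameter rules above interact in a few ways that are easy to get wrong, so here is a hedged sketch of an argument builder. `build_transcribe_args` is a hypothetical helper, not part of the tool; it only encodes what this page states: a continuation call sends the token alone, audio_url is required on the first call, and language_code takes effect only when language_detection is false.

```python
from typing import Any, Dict, Optional


def build_transcribe_args(
    audio_url: Optional[str] = None,
    *,
    continuation_token: Optional[str] = None,
    language_code: Optional[str] = None,
    speaker_labels: bool = True,
    auto_chapters: bool = False,
    punctuate: bool = True,
    format_text: bool = True,
) -> Dict[str, Any]:
    """Build the argument dict for a first call or a continuation call."""
    if continuation_token is not None:
        # All other params are ignored once the token is set, so omit them.
        return {"continuation_token": continuation_token}
    if not audio_url:
        raise ValueError("audio_url is required on the first call")
    args: Dict[str, Any] = {
        "audio_url": audio_url,
        "punctuate": punctuate,
        "format_text": format_text,
        "speaker_labels": speaker_labels,
        "auto_chapters": auto_chapters,
        # Detection stays on unless the caller forces a language.
        "language_detection": language_code is None,
    }
    if language_code is not None:
        # language_code is ignored while language_detection is true.
        args["language_code"] = language_code
    return args
```

For example, `build_transcribe_args("https://example.com/call.mp3", language_code="en", auto_chapters=True)` produces a first-call payload with detection switched off, where the URL is a placeholder.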
