audio-transcribe

Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.

Updated 1/21/2026
Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection and speaker identification (diarization), and outputs structured JSON with segment-level timing.

Command

agent-media audio transcribe --in <path> [options]

Inputs

| Option | Required | Description |
|---|---|---|
| --in | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| --diarize | No | Enable speaker identification |
| --language | No | Language code (auto-detected if not provided) |
| --speakers | No | Number of speakers hint for diarization |
| --out | No | Output path, filename or directory (default: ./) |
| --provider | No | Provider to use (local, fal, replicate) |
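
When the command is launched from a script, the options above map naturally onto an argument list. The following is a minimal sketch, not part of agent-media itself: `build_transcribe_command` is a hypothetical helper name, and it uses only the flags listed in the table.

```python
from typing import List, Optional


def build_transcribe_command(
    in_path: str,
    diarize: bool = False,
    language: Optional[str] = None,
    speakers: Optional[int] = None,
    out: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `agent-media audio transcribe` (hypothetical helper)."""
    # Only the three providers named in the options table are accepted.
    if provider is not None and provider not in ("local", "fal", "replicate"):
        raise ValueError(f"unknown provider: {provider}")

    cmd = ["agent-media", "audio", "transcribe", "--in", in_path]
    if diarize:
        cmd.append("--diarize")
    if language:
        cmd += ["--language", language]
    if speakers is not None:
        cmd += ["--speakers", str(speakers)]
    if out:
        cmd += ["--out", out]
    if provider:
        cmd += ["--provider", provider]
    return cmd
```

The returned list can be passed to `subprocess.run` as-is; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.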

Output

Returns a JSON object with transcription data:

{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
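
Since each segment carries `start`, `end`, and `text`, the result can be turned into subtitles with little code. The sketch below converts the `segments` array into SRT text. The function names are illustrative, and it assumes the result object shown above has already been loaded into a Python dict (for example with `json.loads`).

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render transcription segments as SRT cues, numbered from 1."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        # The speaker field is only present when diarization was used.
        speaker = seg.get("speaker")
        if speaker:
            text = f"{speaker}: {text}"
        timing = f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

With the dict from the example output bound to `result`, `segments_to_srt(result["transcription"]["segments"])` yields two cues, the first spanning `00:00:00,000 --> 00:00:02,500`.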

Examples

Basic transcription (auto-detect language):

agent-media audio transcribe --in interview.mp3

Transcription with speaker identification:

agent-media audio transcribe --in meeting.wav --diarize

Transcription with specific language and speaker count:

agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3

Use specific provider:

agent-media audio transcribe --in audio.wav --provider replicate
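
For meeting notes, diarized segments are often easier to read once consecutive segments from the same speaker are merged into turns. This is a minimal sketch of that post-processing step; `merge_speaker_turns` is a hypothetical helper, and the `"UNKNOWN"` fallback label is an assumption for segments that have no `speaker` field.

```python
from typing import Dict, List


def merge_speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments that share a speaker into single turns."""
    turns: List[Dict] = []
    for seg in segments:
        # Assumed fallback label when the segment has no speaker field.
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append(
                {"speaker": speaker, "start": seg["start"], "end": seg["end"], "text": text}
            )
    return turns
```

Each returned turn keeps the `start` of its first segment and the `end` of its last, so the timing stays usable for navigation.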

Extracting Audio from Video

To transcribe a video file, first extract the audio:

# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio
agent-media audio transcribe --in extracted_xxx.mp3

Providers

local

Runs locally on CPU using Transformers.js, no API key required.

  • Uses the Moonshine model (5x faster than Whisper)
  • Models are downloaded on first use (~100MB)
  • Does NOT support diarization; use fal or replicate for speaker identification
  • You may see a "mutex lock failed" error; it can be ignored, and the output is correct if "ok": true

agent-media audio transcribe --in audio.mp3 --provider local
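
Because the success signal is the `"ok"` field rather than the absence of error messages, a calling script can decide based on the JSON alone. The sketch below assumes that the CLI prints the result object to standard output and that any warning text goes elsewhere; neither is confirmed by this document. `parse_transcribe_result` is a hypothetical helper.

```python
import json
from typing import Dict


def parse_transcribe_result(stdout: str) -> Dict:
    """Return the transcription object if the result reports "ok": true."""
    # Assumption: stdout holds exactly the JSON result object.
    result = json.loads(stdout)
    if result.get("ok") is not True:
        raise RuntimeError(f"transcription failed: {result}")
    return result["transcription"]
```

Checking `"ok"` explicitly means a stray "mutex lock failed" message does not cause a correct transcription to be discarded.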

fal

  • Requires FAL_API_KEY
  • Uses wizper model for fast transcription (2x faster) when diarization is disabled
  • Uses whisper model when diarization is enabled (native support)

replicate

  • Requires REPLICATE_API_TOKEN
  • Uses whisper-diarization model with Whisper Large V3 Turbo
  • Native diarization support with word-level timestamps
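
The provider constraints above (local has no diarization, fal and replicate each need a credential) can be encoded as a small selection function. This is only one possible policy, sketched under the assumption that local is preferred whenever diarization is not needed. `choose_provider` is a hypothetical helper and not how agent-media itself picks a default.

```python
import os
from typing import Mapping, Optional


def choose_provider(diarize: bool, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a provider name for --provider (hypothetical selection policy)."""
    env = os.environ if env is None else env
    if not diarize:
        # local needs no API key, so use it when speakers are not required.
        return "local"
    if env.get("FAL_API_KEY"):
        return "fal"
    if env.get("REPLICATE_API_TOKEN"):
        return "replicate"
    raise RuntimeError(
        "diarization requires FAL_API_KEY or REPLICATE_API_TOKEN; "
        "the local provider does not support it"
    )
```

Passing `env` explicitly keeps the function easy to test; in normal use it reads the process environment.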
