Transcribing D&D Sessions with WhisperX and Speaker Diarization
I play in two weekly D&D groups and write session reports as narrative prose from the characters’ perspectives. The reports expand on what happened at the table, adding dialog and internal monologue in each character’s voice. This workflow has evolved through several iterations, each one solving a problem the previous version left on the table.
How it started
The first version was simple: play the session, take notes, write the report from memory afterward. This worked when I had time, but a four-hour session generates a lot of material, and between work and life, writing sometimes slipped by a week. By then the details had faded. The bullet-point notes I’d scribbled during play were thin on dialog and light on the small moments that make session reports worth reading.
The next step was recording. We play over Discord for voice and Roll20 for the virtual tabletop, so I started capturing both the Discord audio and a screen recording of the Roll20 session. This solved the memory problem — everything was preserved — but created a new one. Scrubbing through four hours of audio to find a specific exchange is tedious, and I still had to type the reports manually from the recording.
The current version automates the transcription step and brings in an LLM for the writing. The recording becomes a searchable text file with speaker labels, and the writing happens collaboratively instead of keystroke by keystroke.
The pipeline
The toolchain is two bash scripts wrapping WhisperX, a research project that extends OpenAI’s Whisper with word-level timestamp alignment and speaker diarization via pyannote.
extract-audio takes a video file and extracts the native audio stream without transcoding. It uses ffprobe to detect the codec and maps it to the correct container format. No quality loss, near-instant.
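A minimal sketch of that approach, assuming the first audio stream is the one you want (the script in the repo handles more cases):

# Sketch only: detect the audio codec, pick a matching container, copy the stream.
input="$1"

# ffprobe reports the codec of the first audio stream (aac, opus, flac, ...).
codec=$(ffprobe -v error -select_streams a:0 \
  -show_entries stream=codec_name -of default=noprint_wrappers=1:nokey=1 "$input")

# Assumed codec-to-extension mapping; not exhaustive.
case "$codec" in
  aac)  ext=m4a ;;
  opus) ext=opus ;;
  mp3)  ext=mp3 ;;
  flac) ext=flac ;;
  *)    ext=mka ;;   # Matroska audio as a catch-all
esac

# Copy the stream without re-encoding: no quality loss, near-instant.
ffmpeg -i "$input" -vn -c:a copy "${input%.*}.${ext}"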
transcribe takes an audio or video file, calls extract-audio if needed, then runs WhisperX with the large-v3 model and speaker diarization. Output is a plain text file named after the input.
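The heart of the transcribe step is a single WhisperX invocation. A sketch of the shape it takes on the CPU path (the filename and token variable are placeholders; see the repo for the actual script):

# Rough shape of the WhisperX call the script wraps.
# HF_TOKEN is a Hugging Face token with the pyannote model licenses accepted.
whisperx "session.flac" \
  --model large-v3 \
  --diarize \
  --hf_token "$HF_TOKEN" \
  --device cpu \
  --compute_type int8 \
  --output_format txt \
  --output_dir .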
Both scripts are idempotent: run them again and they skip work that’s already done. The whole thing is on GitHub.
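Conceptually, the skip logic is just a guard at the top of each script; a sketch, not the repo’s exact code:

# If the transcript is already there and non-empty, do nothing.
out="${input%.*}.txt"
if [ -s "$out" ]; then
  echo "skipping: $out already exists"
  exit 0
fi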
Speaker diarization
The key feature for multi-speaker recordings is diarization, which segments the transcript by who’s talking. WhisperX integrates pyannote’s speaker-diarization-3.1 model, which requires a Hugging Face account and accepting the model licenses. It won’t identify speakers by name, but it labels them consistently (SPEAKER_00, SPEAKER_01, etc.), which is enough to follow the conversation.
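The Hugging Face setup is a one-time step. Assuming you keep the token in your environment, it looks roughly like this:

# Accept the licenses for pyannote/speaker-diarization-3.1 (and any gated models
# it depends on) on their Hugging Face pages, then authenticate once:
huggingface-cli login
# ...or export a token and pass it to WhisperX via --hf_token:
export HF_TOKEN=hf_xxxxxxxxxxxx   # placeholder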
For a D&D session with five or six voices at the table, the diarization does a reasonable job of separating speakers. It’s not perfect with crosstalk and interruptions, but it captures the structure well enough to reconstruct who said what during the writing process.
Hardware: AMD Strix Halo
I built this on an AMD Strix Halo system (review pending) running Ubuntu — a Ryzen AI MAX+ 395 with integrated Radeon 8060S. The GPU situation is worth documenting because it’s not straightforward.
WhisperX’s transcription engine is faster-whisper, which uses CTranslate2 under the hood. CTranslate2’s GPU support is CUDA-only: no ROCm, no OpenCL, no Metal. If you don’t have an NVIDIA GPU, the transcription step runs on CPU.
The diarization and alignment steps use PyTorch directly, so they work with ROCm on AMD hardware. PyTorch sees the Radeon 8060S via ROCm’s HIP layer and uses it for the speaker embedding pipeline.
The practical impact: on a 16-core Zen 5, a 114-minute recording takes about 23 minutes to transcribe on CPU plus another 5-10 minutes for diarization, and the time scales roughly linearly with session length. That’s workable for a batch job I run once a week.
Hardware: NVIDIA (Framework Laptop 16)
I also have a Framework Laptop 16 (review pending) with an NVIDIA RTX 5070 GPU module, already running Linux. NVIDIA is the path of least resistance for WhisperX — CTranslate2 is built for CUDA, so the entire pipeline runs on the GPU. Change --device cpu --compute_type int8 to --device cuda --compute_type float16 and everything accelerates.
On the CUDA side, the PyTorch install points at the CUDA wheel index instead:
~/.local/share/whisperx/bin/pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128
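Before committing a multi-hour recording to the GPU path, it’s worth confirming the venv’s PyTorch actually sees the card (same venv path as the pip command above):

# Should print True if the CUDA build is installed and the GPU is visible.
~/.local/share/whisperx/bin/python -c "import torch; print(torch.cuda.is_available())"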
I’ll update this section with benchmarks once I’ve run the same session through the 5070.
The PyTorch 2.7 patch
One dependency issue worth mentioning: PyTorch 2.7 changed torch.load to default to weights_only=True for security. Pyannote’s model checkpoints use omegaconf classes that aren’t in the safe globals list, so loading fails with an UnpicklingError. The fix is a one-line patch to lightning_fabric that defaults local file loads back to weights_only=False. Details are in the repo README. This will presumably be fixed upstream at some point.
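If you’d rather find the call site yourself than follow the README, a grep inside the venv gets you there (the path below assumes the same install location as above; the exact file varies by lightning_fabric version):

# Locate where lightning_fabric calls torch.load so you can flip the default.
grep -rn "torch.load" \
  ~/.local/share/whisperx/lib/python*/site-packages/lightning_fabric/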
Results
The output is a plain text file with timestamps and speaker labels. It’s not a polished document. It’s a D&D session, which means six people talking over each other, arguing about rules, making terrible jokes, and occasionally doing something heroic. The transcript captures all of it.
Supplementing with Roll20 chat logs
If your group uses Roll20, the chat log is a goldmine of structured data that the audio transcript can’t capture: dice rolls with results, character names tied to player accounts, attack and damage values, spell DCs, and initiative order. Cross-referencing the chat log with the audio transcript gives you mechanical truth alongside narrative color.
Roll20 doesn’t offer a chat export feature (the “Clear Current Chat Log” button does exactly what it says — don’t click it). But you can copy-paste the chat window contents at the end of a session. The format is messy but parseable: timestamps, player names, character names, roll results, and macros all interleaved.
The key win is the player-to-character mapping. The audio diarization gives you anonymous speaker labels; the chat log gives you “Matthew: Bancroft Barleychaser: Elvish Longsword (+4) — 24 — Critical Success!” That mapping eliminates a manual step and lets the LLM attribute actions and dialog with confidence.
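As a rough sketch of extracting that mapping with standard tools, assuming the pasted log keeps the “Player: Character:” prefix shown above (real pastes will need tweaking):

# List unique "Player: Character" pairs from a pasted chat log (format assumed).
grep -oE '^[A-Za-z]+: [A-Za-z ]+:' roll20-chat.txt | sed 's/:$//' | sort -u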
From transcript to narrative
The transcript is a reference, not a finished product. Turning it into a session report is a multi-step process, and this is where an LLM earns its keep.
Step 1: Read the transcript and extract bullet points. I feed the raw transcript — and the Roll20 chat log if available — to Claude and ask it to pull out the significant events, decisions, and dialog. The table talk, VTT troubleshooting, and rules tangents get filtered out. The result is a structured outline of what actually happened, in chronological order. The speaker labels and chat log together let the LLM attribute actions to specific characters.
Step 2: Map speakers to characters. If you have the Roll20 chat log, this step is mostly automatic — player names appear in both sources. Without it, the diarization labels are anonymous (SPEAKER_00, SPEAKER_01) and you identify who’s who from context. The DM is usually obvious, and players tend to refer to each other by name.
Step 3: Write the narrative. Each session report is written from a specific character’s perspective. I give Claude the bullet points, the character mapping, and two or three previous session reports by the same character to establish the voice. The players and the dice determine what happens, the LLM drafts the narrative, and I edit for accuracy and tone. The ideas, the framing, and the editorial decisions are mine. The typing doesn’t have to be, anymore.
The scripts, installation instructions, and dependency notes are available at github.com/matthewjhunter/ai-session-notes.