Building DnD Scribe: Engineering a Scalable Discord Voice Transcription & AI Campaign Memory System

DnD Scribe began as a simple question:
Can a Discord bot listen to a live Dungeons & Dragons session, transcribe everything automatically, and turn hours of chaotic role-play into structured, searchable campaign knowledge?
What emerged is a multi-service system that captures live voice, chunks and streams audio for transcription, stores structured session data in MongoDB, and applies large language models to transform raw dialogue into usable narrative artifacts.
Live site: https://www.dndscribe.com
Repository: https://github.com/f00d4tehg0dz/DiscordTranscribeDnD
This article focuses on how it works internally: the audio pipeline, chunking strategy, data modeling, AI usage, and scalability decisions.

System Overview
At a high level, DnD Scribe consists of:
- Discord bot (Node.js + discord.js)
- Audio ingestion + chunking layer
- Speech-to-text pipeline using OpenAI Whisper
- Summarization and enrichment using GPT models
- Persistence layer using MongoDB
- Web UI for browsing campaigns and sessions
Conceptual flow:
Discord Voice Channel
↓
Per-user audio capture
↓
PCM buffer → WAV chunk
↓
Whisper transcription
↓
MongoDB (raw segments)
↓
LLM summarization
↓
MongoDB (summaries + metadata)
↓
Web UI / Discord output
The key engineering challenge is handling long-running voice sessions reliably without exhausting memory, hitting API limits, or losing context.
Capturing and Chunking Discord Audio
Discord voice data arrives as continuous PCM frames. Streaming that entire feed into a single transcription request is not viable:
- Whisper has practical duration limits
- Memory would grow unbounded
- Network failures would corrupt large chunks
Instead, DnD Scribe uses a rolling chunk buffer.
Chunking Strategy
Each speaking user maintains:
- An in-memory PCM buffer
- Timestamp of the last received audio frame
- Byte length counter
The buffer is flushed into a WAV file and queued for transcription when either condition is met:
- The buffer exceeds N seconds of audio (e.g., 20–30s)
- The silence gap exceeds M milliseconds
Pseudo-logic:
if (bufferDuration >= MAX_CHUNK_SECONDS || silenceGap > MAX_SILENCE_MS) {
  flushBufferToWav();
  enqueueForTranscription(wavPath);
  resetBuffer();
}
This yields:
- Predictable file sizes
- Fast turnaround for transcripts
- Minimal memory pressure
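The per-user buffer described above can be sketched as a small class. The thresholds and PCM format constants here are illustrative values, not the project's actual configuration:

```javascript
// Rolling chunk buffer kept per speaking user. Frames accumulate until
// either the duration cap or the silence gap triggers a flush.
const MAX_CHUNK_SECONDS = 25;
const MAX_SILENCE_MS = 1500;
const BYTES_PER_SECOND = 48000 * 2 * 2; // 48kHz, 16-bit samples, stereo

class UserChunkBuffer {
  constructor() {
    this.frames = [];
    this.byteLength = 0;
    this.lastFrameAt = Date.now();
  }

  push(frame, now = Date.now()) {
    this.frames.push(frame);
    this.byteLength += frame.length;
    this.lastFrameAt = now; // timestamp of the last received audio frame
  }

  shouldFlush(now = Date.now()) {
    const durationSec = this.byteLength / BYTES_PER_SECOND;
    const silenceGap = now - this.lastFrameAt;
    return durationSec >= MAX_CHUNK_SECONDS || silenceGap > MAX_SILENCE_MS;
  }

  flush() {
    const pcm = Buffer.concat(this.frames);
    this.frames = [];
    this.byteLength = 0;
    return pcm; // caller writes this out as a WAV chunk
  }
}
```

Because the byte counter maps directly to seconds of audio, the duration check costs nothing per frame.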
Why Not Stream Directly?
Streaming APIs are fragile under long sessions and transient network issues. File-based chunking provides:
- Natural retry boundaries
- Persistent audit trail
- Easier debugging
If a chunk fails, it can simply be retried.
WAV Conversion Pipeline
Discord voice packets arrive as Opus → decoded into PCM → written to WAV.
Typical flow:
const opusDecoder = new prism.opus.Decoder({
  rate: 48000,
  channels: 2,
  frameSize: 960
});
opusStream.pipe(opusDecoder).pipe(wavWriter);
Design choices:
- 48kHz stereo to preserve clarity
- Standard WAV container for Whisper compatibility
- Temporary filesystem storage instead of memory buffers
This decouples capture from transcription and prevents memory ballooning.
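As a point of reference, the WAV container itself is simple enough to build by hand: a 44-byte RIFF header in front of the raw PCM. A minimal sketch, assuming the same 48kHz/16-bit/stereo format (the project itself can lean on prism-media or a wav library for this):

```javascript
// Wrap a flushed PCM buffer in a standard WAV container so Whisper can
// consume it directly. Hand-rolled RIFF header; format values assumed.
function pcmToWav(pcm, { sampleRate = 48000, channels = 2, bitDepth = 16 } = {}) {
  const byteRate = sampleRate * channels * (bitDepth / 8);
  const blockAlign = channels * (bitDepth / 8);
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // total size minus 8 bytes
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size for PCM
  header.writeUInt16LE(1, 20);              // audio format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);     // payload length in bytes
  return Buffer.concat([header, pcm]);
}
```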
Transcription Queue Architecture
A lightweight in-process queue is used:
Audio Chunk → Queue → Worker → Whisper API
Workers:
- Process chunks sequentially per guild
- Apply exponential backoff on failures
- Tag results with guildId, sessionId, and userId
This avoids:
- Bursting too many API calls
- Out-of-order transcripts
- Partial session corruption
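The exponential backoff can be sketched as a generic wrapper around the transcription call. Attempt counts and delays here are illustrative, not the project's actual values:

```javascript
// Retry a failing job with exponentially growing delays:
// 500ms, 1s, 2s, 4s, ... up to the attempt cap.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withBackoff(fn, { attempts = 5, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await sleep(baseDelayMs * 2 ** i);
    }
  }
  throw lastError; // give up after the final attempt
}
```

Transient Whisper or network errors then resolve on a later attempt without hammering the API.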
Pseudo-worker:
while (queue.hasItems()) {
  const job = queue.next();
  const text = await whisperTranscribe(job.file);
  storeTranscript(job.meta, text);
}
Using Whisper for Speech-to-Text
Whisper is used because it:
- Handles noisy audio well
- Supports long-form speech
- Performs reliably with multiple speakers
Each request includes:
- Language hint (if known)
- Temperature near zero
- No prompt injection (pure transcription)
Result:
{
  "text": "I cast fireball at the goblins on the ridge..."
}
This raw output is never overwritten — only appended.
That decision enables:
- Reprocessing with improved models later
- Auditing
- Debugging hallucinations in summaries
MongoDB Data Modeling
Rather than one massive “session” document, data is segmented into collections:
1. Guilds
{
  guildId,
  name,
  openaiKeyEncrypted,
  settings
}
2. Campaigns
{
  campaignId,
  guildId,
  name,
  createdAt
}
3. Sessions
{
  sessionId,
  campaignId,
  startTime,
  endTime,
  status
}
4. Transcripts
{
  sessionId,
  userId,
  timestamp,
  text
}
5. Summaries
{
  sessionId,
  summaryType, // "interval", "final"
  content,
  createdAt
}
Why This Matters
- Transcripts scale linearly
- Summaries can be regenerated
- Sessions stay lightweight
- Queries remain fast
No document grows without bounds.
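Keeping those queries fast mostly comes down to a few compound indexes. An illustrative sketch using the Node MongoDB driver, with the database handle injected (the project's actual indexes may differ):

```javascript
// Compound indexes matching the main access patterns: transcripts read
// in timestamp order per session, sessions listed per campaign, and
// summaries fetched per session by type. Field names follow the schemas above.
async function ensureIndexes(db) {
  await db.collection("transcripts").createIndex({ sessionId: 1, timestamp: 1 });
  await db.collection("sessions").createIndex({ campaignId: 1, startTime: -1 });
  await db.collection("summaries").createIndex({ sessionId: 1, summaryType: 1 });
}
```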
Periodic Summarization
Instead of summarizing only at session end, DnD Scribe performs interval summarization.
Example:
- Every 30 minutes
- Or after N transcript chunks
Flow:
- Pull the last N transcript entries
- Feed into LLM
- Store interval summary
Prompt structure:
Summarize the following Dungeons & Dragons session dialogue.
Focus on:
- Major events
- NPC interactions
- Player decisions
- Combat outcomes
Return structured bullet points.
Benefits:
- Reduces context window size
- Enables near-real-time summaries
- Prevents a single huge prompt at the end
The final summary is then generated from the interval summaries plus transcripts, not from raw text alone.
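The interval request itself can be sketched as a small prompt-assembly step. The entry shape follows the Transcripts schema above, and the message format targets a chat-completion style API:

```javascript
// System prompt reproducing the structure described above.
const SUMMARY_PROMPT = [
  "Summarize the following Dungeons & Dragons session dialogue.",
  "Focus on:",
  "- Major events",
  "- NPC interactions",
  "- Player decisions",
  "- Combat outcomes",
  "Return structured bullet points.",
].join("\n");

// Build the messages array from the last N transcript entries,
// attributing each line to its speaker.
function buildSummaryMessages(entries) {
  const dialogue = entries
    .map((entry) => `[${entry.userId}] ${entry.text}`)
    .join("\n");
  return [
    { role: "system", content: SUMMARY_PROMPT },
    { role: "user", content: dialogue },
  ];
}
```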

Hallucination Mitigation
Several guardrails are used:
- Only feed the actual transcript text
- No worldbuilding prompts
- No “creative writing” instructions
- Temperature kept low
The AI is instructed to summarize, not embellish.
This keeps the output as factual as possible.
Horizontal Scalability
DnD Scribe can be scaled along three axes:
Bot Instances
Multiple bot processes can run:
- Same code
- Different guilds
- Shared MongoDB
Transcription Workers
The in-process queue can later be externalized (e.g., Redis, SQS, or RabbitMQ).
Stateless Design
All state is stored in the database:
- If the bot crashes, the session resumes
- If the worker crashes, the queue continues
This makes containerized deployment straightforward.
Cost Controls
- Per-guild API keys
- Chunk size limits
- Summarization intervals
- No continuous streaming to LLMs
Guilds control their own usage footprint.
Security Considerations
- API keys encrypted at rest
- No transcripts exposed publicly
- Guild isolation by ID
- Environment variables for secrets
Why This Architecture Works Well for D&D
Tabletop sessions are:
- Long
- Unstructured
- Multi-speaker
- Noisy
Chunked audio + append-only transcripts + layered summarization matches that reality.
The system does not try to “understand” the story in real time.
It records first, then reasons later.
That separation is the core design principle.
Closing Thoughts
DnD Scribe is not just a Discord bot. It is a campaign memory engine.
By treating voice as a data stream, transcripts as immutable logs, and summaries as derived artifacts, the system stays reliable, debuggable, and scalable.
Future expansions (NPC extraction, timeline graphs, character arcs, vector search, RAG) build naturally on top of this foundation.
Bonus: A quick stat page!
