dicta

By Matthew Hunter | Jun 17, 2026 | linux, wayland, voice, dictation, accessibility, golang

Speech-to-text is one of the few accessibility tools where Linux still lags. The options that exist tend to want a commercial cloud API, a Python toolchain with GPU model files, or an X11 session – and often all three. I wanted something that runs as a single static binary on Wayland, talks to whatever ASR backend I already have, and doesn’t listen until I tell it to.

dicta is that: a Linux/Wayland voice dictation daemon written in pure Go. No always-on microphone, no wakeword, no push-to-talk. Capture starts when you press a key and stops when the session ends.

Two modes, two keys

dicta binds exactly two single-key compositor shortcuts.

Pause toggles type-mode. While the session is open you talk, and the daemon types each utterance into whatever window has focus, committing on a half-second of silence and staying open for the next phrase. Pause again closes it. This is the mode for filling in a form field, a chat box, a commit message – anywhere a keyboard would go.
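The commit-on-silence rule can be sketched as a small accumulator. This is a minimal illustration, not dicta's actual code: it assumes a VAD that labels fixed-length frames as speech or silence, and the `Segmenter` type, `countCommits` helper, and 20 ms frame length are all hypothetical. Only the half-second threshold comes from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// silenceCommit is the half-second pause that ends an utterance.
const silenceCommit = 500 * time.Millisecond

// Segmenter (hypothetical) tracks trailing silence inside an open session.
type Segmenter struct {
	frameLen time.Duration // duration of one VAD frame (assumed fixed)
	silence  time.Duration // silence accumulated since the last speech frame
	inSpeech bool          // true once the current utterance has started
}

// Push feeds one VAD decision and reports whether the utterance
// should be committed now. The session itself stays open afterwards.
func (s *Segmenter) Push(speech bool) bool {
	if speech {
		s.inSpeech = true
		s.silence = 0
		return false
	}
	if !s.inSpeech {
		return false // leading silence: nothing to commit yet
	}
	s.silence += s.frameLen
	if s.silence >= silenceCommit {
		s.inSpeech = false
		s.silence = 0
		return true
	}
	return false
}

// countCommits replays a sequence of VAD decisions and counts commits.
func countCommits(frameMs int, frames []bool) int {
	s := &Segmenter{frameLen: time.Duration(frameMs) * time.Millisecond}
	n := 0
	for _, f := range frames {
		if s.Push(f) {
			n++
		}
	}
	return n
}

func main() {
	// 10 speech frames followed by 25 silent frames (25 x 20 ms = 500 ms).
	frames := make([]bool, 35)
	for i := 0; i < 10; i++ {
		frames[i] = true
	}
	fmt.Println(countCommits(20, frames)) // prints 1
}
```

Silence before any speech never triggers a commit, so an open session that hears nothing types nothing.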

Scroll Lock toggles clip-mode. A small editable panel appears with the live transcript. You can fix a misheard word, press Enter to copy the buffer to the Wayland clipboard, Shift+Enter for a literal newline, or Esc to cancel. Nothing reaches the clipboard until you press Enter inside the panel – the panel is the pending buffer.

The two modes are mutually exclusive, and neither sends anything anywhere until there’s an explicit boundary: VAD silence inside an open type-mode session, or Enter inside the clip-mode panel.
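Mutual exclusion between the two modes amounts to a three-state machine. The sketch below is illustrative only: the type and function names are hypothetical, and ignoring the other mode's key while a session is open is an assumed policy, not something the description above specifies.

```go
package main

import "fmt"

// Mode is the daemon's session state (names are illustrative).
type Mode int

const (
	Idle Mode = iota
	TypeMode
	ClipMode
)

func (m Mode) String() string {
	switch m {
	case TypeMode:
		return "type"
	case ClipMode:
		return "clip"
	default:
		return "idle"
	}
}

// Key identifies the two compositor shortcuts.
type Key int

const (
	KeyPause Key = iota
	KeyScrollLock
)

// Toggle returns the mode that follows cur after key k is pressed.
// A mode's own key opens or closes it. The other mode's key is
// ignored while a session is open (assumed policy).
func Toggle(cur Mode, k Key) Mode {
	want := TypeMode
	if k == KeyScrollLock {
		want = ClipMode
	}
	switch cur {
	case Idle:
		return want
	case want:
		return Idle
	default:
		return cur
	}
}

func main() {
	m := Idle
	for _, k := range []Key{KeyPause, KeyScrollLock, KeyPause, KeyScrollLock} {
		m = Toggle(m, k)
		fmt.Println(m) // type, type, idle, clip
	}
}
```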

Bring your own ASR

The transcription backend is pluggable, and all three options speak protocols that already exist:

| Backend | Transport | Use when |
| --- | --- | --- |
| Wyoming (default) | TCP | You run faster-whisper or any Wyoming server, e.g. on the homelab |
| whisper.cpp | local subprocess | You want dicta to supervise a `whisper-server` itself |
| OpenAI-compatible | HTTPS | You point at a managed `/v1/audio/transcriptions` endpoint |

The Wyoming default means no model download and no GPU on the dictation machine – the audio goes to a box that already has the model loaded. The wire-protocol clients live in a separate Go module, asrclient, so they’re reusable on their own.
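For a sense of what a Wyoming client has to put on the wire, here is a rough sketch of event framing as I understand the published protocol: one JSON header line, followed by an optional binary payload whose size is given by `payload_length`. This is not the asrclient API; `writeEvent` and `encode` are hypothetical helpers, and the exact header fields a given server expects should be checked against the Wyoming specification.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// header is the JSON line that starts every event (sketch; the spec
// also allows a separate data section, which is omitted here).
type header struct {
	Type          string         `json:"type"`
	Data          map[string]any `json:"data,omitempty"`
	PayloadLength int            `json:"payload_length,omitempty"`
}

// writeEvent frames one event: header line, newline, then raw payload.
func writeEvent(w io.Writer, typ string, data map[string]any, payload []byte) error {
	line, err := json.Marshal(header{Type: typ, Data: data, PayloadLength: len(payload)})
	if err != nil {
		return err
	}
	if _, err := w.Write(append(line, '\n')); err != nil {
		return err
	}
	if len(payload) == 0 {
		return nil
	}
	_, err = w.Write(payload)
	return err
}

// encode is a convenience wrapper that returns the framed bytes as a string.
func encode(typ string, data map[string]any, payload []byte) (string, error) {
	var buf bytes.Buffer
	if err := writeEvent(&buf, typ, data, payload); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// 16 kHz, 16-bit, mono PCM is an assumed capture format.
	format := map[string]any{"rate": 16000, "width": 2, "channels": 1}
	pcm := make([]byte, 640) // 20 ms of silence at that format

	var session bytes.Buffer
	_ = writeEvent(&session, "audio-start", format, nil)
	_ = writeEvent(&session, "audio-chunk", format, pcm)
	_ = writeEvent(&session, "audio-stop", nil, nil)
	fmt.Println(session.Len(), "bytes framed")
}
```

In a real client the `io.Writer` would be the TCP connection to the Wyoming server, and the reply would be a `transcript` event read back with the same framing.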

Optional LLM cleanup runs in clip-mode only, against any OpenAI-compatible endpoint. It tidies the raw transcript – punctuation, obvious mis-hearings – and you still see and edit the result before it goes anywhere. The cleanup prompt is a code constant; user input never templates it.
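The "constant prompt, untemplated input" rule can be illustrated with a chat-completions request body. This is a sketch under assumptions: the prompt wording, the `buildCleanupRequest` and `roundTrip` helpers, and the model name are hypothetical. Only the request shape (a `model` plus a `messages` array of role/content pairs) is the standard OpenAI-compatible format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cleanupPrompt is a compile-time constant. The wording is hypothetical.
const cleanupPrompt = "Fix punctuation and obvious transcription errors. Return only the corrected text."

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

// buildCleanupRequest places the transcript in its own user message.
// It is never concatenated into or formatted through the system prompt.
func buildCleanupRequest(model, transcript string) ([]byte, error) {
	return json.Marshal(chatRequest{
		Model: model,
		Messages: []message{
			{Role: "system", Content: cleanupPrompt},
			{Role: "user", Content: transcript},
		},
	})
}

// roundTrip decodes the built request and returns the system and user contents.
func roundTrip(model, transcript string) (string, string, error) {
	body, err := buildCleanupRequest(model, transcript)
	if err != nil {
		return "", "", err
	}
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", "", err
	}
	return req.Messages[0].Content, req.Messages[1].Content, nil
}

func main() {
	body, err := buildCleanupRequest("local-model", "lets meet at ten thirty tomorrow")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Because the transcript travels as data in a separate message, format verbs or template syntax in dictated text cannot alter the instructions. The model can still misread the content, which is why the result stays in the editable panel.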

Built to be boring about security

The daemon and CLI build with CGO_ENABLED=0 and ship as static binaries, which lets the systemd unit set MemoryDenyWriteExecute=true and a tight syscall and filesystem sandbox. Subprocess argument lists are built from typed config, never a shell. TLS verification is on by default for every HTTP client. Type-mode strips newlines before driving the keyboard, so a transcript can’t inject an Enter into a shell prompt.
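The newline-stripping step might look something like the sketch below. The function name is hypothetical, and replacing all control characters (not only newlines) and collapsing whitespace runs are my assumptions about a reasonable policy, going beyond what is stated above.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitizeForTyping turns line breaks and other control characters into
// spaces, then collapses whitespace runs, so the text can never press
// Enter (or send an escape sequence) on the user's behalf.
func sanitizeForTyping(s string) string {
	mapped := strings.Map(func(r rune) rune {
		// U+2028 and U+2029 are Unicode line and paragraph separators;
		// they are not in the control category, so handle them explicitly.
		if unicode.IsControl(r) || r == '\u2028' || r == '\u2029' {
			return ' '
		}
		return r
	}, s)
	return strings.Join(strings.Fields(mapped), " ")
}

func main() {
	fmt.Printf("%q\n", sanitizeForTyping("echo hello\nrm -rf scratch\r\n"))
	// prints "echo hello rm -rf scratch"
}
```

The text still reaches the focused window, but as a single line that the user has to submit themselves.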

The clip-mode panel is a separate process with a deliberately minimal job – display a transcript, edit a buffer, handle three keystrokes – so the GUI toolkit it needs never touches the hardened daemon.

Status

Pre-1.0 but functional – I use it daily. The code is Apache-2.0 on GitHub, with a design document that explains the decisions behind it. Bug reports and packaging contributions are welcome.