dicta
Speech-to-text is one of the few accessibility tools where Linux still lags. The options that exist tend to want a commercial cloud API, a Python toolchain with GPU model files, or an X11 session – and often all three. I wanted something that runs as a single static binary on Wayland, talks to whatever ASR backend I already have, and doesn’t listen until I tell it to.
dicta is that: a Linux/Wayland voice dictation daemon written in pure Go. No always-on microphone, no wakeword, no push-to-talk. Capture starts when you press a key and stops when the session ends.
Two modes, two keys
dicta binds exactly two single-key compositor shortcuts.
Pause toggles type-mode. While the session is open you talk, and the daemon types each utterance into whatever window has focus, committing on a half-second of silence and staying open for the next phrase. Pause again closes it. This is the mode for filling in a form field, a chat box, a commit message – anywhere a keyboard would go.
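The commit-on-silence rule can be sketched as a small counter over per-frame voice-activity decisions. This is an illustration only, not dicta's actual code: the `Segmenter` type, its `Feed` method, and the fixed frame length are hypothetical names, and the only value taken from the text is the half-second threshold.

```go
package main

import "time"

// silenceCommit is the half-second commit threshold described above.
const silenceCommit = 500 * time.Millisecond

// Segmenter tracks trailing silence across fixed-size audio frames.
// Hypothetical type: the real VAD plumbing may look quite different.
type Segmenter struct {
	frame   time.Duration // duration of one audio frame
	heard   bool          // speech seen since the last commit
	silence time.Duration // trailing silence after that speech
}

// NewSegmenter takes the frame length in milliseconds.
func NewSegmenter(frameMs int) *Segmenter {
	return &Segmenter{frame: time.Duration(frameMs) * time.Millisecond}
}

// Feed takes one VAD decision per frame and reports whether the
// current utterance should be committed (typed) now.
func (s *Segmenter) Feed(speech bool) bool {
	if speech {
		s.heard = true
		s.silence = 0 // any speech restarts the silence timer
		return false
	}
	if !s.heard {
		return false // leading silence: nothing to commit yet
	}
	s.silence += s.frame
	if s.silence >= silenceCommit {
		// Reset, but stay ready: the session remains open for the next phrase.
		s.heard = false
		s.silence = 0
		return true
	}
	return false
}
```

Resetting the state on commit rather than closing anything is what keeps the session open between phrases; only the Pause key ends it.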
Scroll Lock toggles clip-mode. A small editable panel appears with the live transcript. You can fix a misheard word, press Enter to copy the buffer to the Wayland clipboard, Shift+Enter for a literal newline, or Esc to cancel. Nothing reaches the clipboard until you press Enter inside the panel – the panel is the pending buffer.
The two modes are mutually exclusive, and neither sends anything anywhere until there’s an explicit boundary: VAD silence inside an open type-mode session, or Enter inside the clip-mode panel.
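One way to picture the mutual exclusion is a three-state toggle. The sketch below uses hypothetical names (`Mode`, `Key`, `Toggle`) and makes one assumption the text does not spell out: pressing the other mode's key while a session is open is ignored rather than switching modes.

```go
package main

// Mode is the daemon's session state. Hypothetical type for illustration.
type Mode int

const (
	Idle Mode = iota
	TypeMode
	ClipMode
)

// Key identifies one of the two compositor shortcuts.
type Key int

const (
	KeyPause Key = iota
	KeyScrollLock
)

// Toggle returns the mode that follows a key press.
// Assumption: the other mode's key is a no-op while a session is open,
// so the two modes can never overlap.
func Toggle(cur Mode, k Key) Mode {
	switch {
	case cur == Idle && k == KeyPause:
		return TypeMode
	case cur == Idle && k == KeyScrollLock:
		return ClipMode
	case cur == TypeMode && k == KeyPause:
		return Idle
	case cur == ClipMode && k == KeyScrollLock:
		return Idle
	default:
		return cur
	}
}
```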
Bring your own ASR
The transcription backend is pluggable, and all three options speak protocols that already exist:
| Backend | Transport | Use when |
|---|---|---|
| Wyoming (default) | TCP | You run faster-whisper or any Wyoming server, e.g. on the homelab |
| whisper.cpp | local subprocess | You want dicta to supervise a whisper-server itself |
| OpenAI-compatible | HTTPS | You point at a managed /v1/audio/transcriptions endpoint |
The Wyoming default means no model download and no GPU on the dictation machine – the audio goes to a box that already has the model loaded. The wire-protocol clients live in a separate Go module, asrclient, so they’re reusable on their own.
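A pluggable backend usually comes down to one small interface plus a config switch. The sketch below is an assumption about the shape, not the real asrclient API: the `Transcriber` interface, the `Backend` names, and the config strings are all hypothetical. The one detail taken from the text is that an unset backend falls back to Wyoming.

```go
package main

import (
	"context"
	"fmt"
)

// Transcriber is a hypothetical interface that every backend would satisfy.
type Transcriber interface {
	Transcribe(ctx context.Context, pcm []byte) (string, error)
}

// Backend names one of the three transports. The string values are
// illustrative config keys, not dicta's documented ones.
type Backend string

const (
	BackendWyoming    Backend = "wyoming"
	BackendWhisperCpp Backend = "whisper.cpp"
	BackendOpenAI     Backend = "openai"
)

// ParseBackend validates a config value; an empty value selects the
// Wyoming default.
func ParseBackend(name string) (Backend, error) {
	switch Backend(name) {
	case "":
		return BackendWyoming, nil
	case BackendWyoming, BackendWhisperCpp, BackendOpenAI:
		return Backend(name), nil
	}
	return "", fmt.Errorf("unknown ASR backend %q", name)
}

// fakeTranscriber stands in for a real client in tests.
type fakeTranscriber struct{ text string }

func (f fakeTranscriber) Transcribe(_ context.Context, _ []byte) (string, error) {
	return f.text, nil
}

// Dictate depends only on the interface, so any backend can be swapped in.
func Dictate(t Transcriber, pcm []byte) (string, error) {
	return t.Transcribe(context.Background(), pcm)
}
```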
Optional LLM cleanup runs in clip-mode only, against any OpenAI-compatible endpoint. It tidies the raw transcript – punctuation, obvious mis-hearings – and you still see and edit the result before it goes anywhere. The cleanup prompt is a code constant; user input never templates it.
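To illustrate what "never templates it" means in practice, here is a minimal sketch of building a chat-completions request body. The prompt wording, the type names, and `BuildCleanupRequest` are hypothetical; the only assumption about the endpoint is the widely used `model` plus `messages` JSON shape. The transcript travels as its own user message and is never spliced into the instruction string.

```go
package main

import "encoding/json"

// cleanupPrompt is a compile-time constant. The wording here is
// illustrative, not dicta's actual prompt.
const cleanupPrompt = "Fix punctuation and obvious mis-hearings in the dictated text. Return only the corrected text."

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// BuildCleanupRequest places the constant prompt in the system message
// and the raw transcript in a separate user message. No string
// formatting or templating ever combines the two.
func BuildCleanupRequest(model, transcript string) ([]byte, error) {
	req := chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "system", Content: cleanupPrompt},
			{Role: "user", Content: transcript},
		},
	}
	return json.Marshal(req)
}
```

Keeping the two strings in separate messages does not make prompt injection impossible, but it means dictated text cannot rewrite the instruction itself, and the editable panel remains the final check.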
Built to be boring about security
The daemon and CLI build with CGO_ENABLED=0 and ship as static binaries, which lets the systemd unit set MemoryDenyWriteExecute=true and a tight syscall and filesystem sandbox. Subprocess argument lists are built from typed config, never a shell. TLS verification is on by default for every HTTP client. Type-mode strips newlines before driving the keyboard, so a transcript can’t inject an Enter into a shell prompt.
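Two of these rules are easy to show in a few lines. The sketch below is illustrative rather than dicta's code: `SanitizeForTyping`, `WhisperServer`, and its fields are hypothetical names, replacing line breaks with a space (instead of deleting them) is an assumption made so adjacent words don't fuse, and the `--model` and `--port` flags should be checked against the whisper-server build in use.

```go
package main

import (
	"os/exec"
	"strconv"
	"strings"
)

// lineBreaks maps CRLF, LF and CR to a single space each.
var lineBreaks = strings.NewReplacer("\r\n", " ", "\n", " ", "\r", " ")

// SanitizeForTyping removes line breaks so a transcript can never
// press Enter in the focused window.
func SanitizeForTyping(s string) string {
	return lineBreaks.Replace(s)
}

// WhisperServer is a hypothetical typed config for a supervised subprocess.
type WhisperServer struct {
	Binary string
	Model  string
	Port   int
}

// Args builds the argument vector element by element. Each value is
// one argv entry, so spaces or shell metacharacters in a path are
// passed through literally and never interpreted.
func (w WhisperServer) Args() []string {
	return []string{"--model", w.Model, "--port", strconv.Itoa(w.Port)}
}

// Command prepares the process without involving a shell.
func (w WhisperServer) Command() *exec.Cmd {
	return exec.Command(w.Binary, w.Args()...)
}
```

Because `exec.Command` takes the program and its arguments separately, there is no command string for a hostile config value or transcript to break out of.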
The clip-mode panel is a separate process with a deliberately minimal job – display a transcript, edit a buffer, handle three keystrokes – so the GUI toolkit it needs never touches the hardened daemon.
Status
Pre-1.0 but functional – I use it daily. The code is Apache-2.0 on GitHub, with a design document that explains the decisions behind it. Bug reports and packaging contributions are welcome.