Home Assistant’s local voice assistant processes your voice commands entirely on your own hardware without sending any audio to Amazon, Google, or Apple servers. You speak a command, your local speech-to-text engine transcribes it, Home Assistant’s intent recognition system identifies the requested action, and the command executes on your smart home devices. The entire pipeline runs in your house, on your network, with zero cloud dependency and zero privacy compromise.
Cloud voice assistants hear everything. Amazon and Google retain voice recordings, process them on remote servers, and use them for product improvement and advertising targeting. Even with “don’t save my recordings” settings enabled, your audio travels across the internet to be processed. Home Assistant’s local voice pipeline eliminates this entirely. Your voice data never leaves your local network. The tradeoff is that local speech recognition is less accurate than Google’s or Amazon’s cloud models, but the gap has narrowed dramatically in 2025-2026 with faster Whisper models and improved intent handling.
How the Local Voice Pipeline Works
Home Assistant’s voice pipeline processes commands through four sequential stages, each handled by a separate component that you can individually configure and upgrade.
Wake word detection: A lightweight neural network listens continuously for a trigger phrase. The default wake word is “OK Nabu,” but you can train custom wake words. The wake word engine (openWakeWord or microWakeWord) runs on minimal hardware and consumes negligible CPU until triggered. Only after detecting the wake word does the system activate the full processing pipeline.
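On ESP32-class satellites, on-device wake word detection is typically enabled through ESPHome's micro_wake_word component. The fragment below is an illustrative sketch only: the model name okay_nabu corresponds to the default wake word, but the surrounding audio components and exact options depend on your device's official firmware template.

```yaml
# Sketch: on-device wake word detection via ESPHome's micro_wake_word
# component. Verify option names against your device's official template.
micro_wake_word:
  models:
    - model: okay_nabu   # the default "OK Nabu" wake word
  on_wake_word_detected:
    # only now does the full pipeline spin up
    - voice_assistant.start:
```

Keeping detection on-device means no audio leaves the satellite until the trigger phrase is actually heard.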
Speech-to-text (STT): After the wake word triggers, your voice audio is streamed to the speech-to-text engine. Home Assistant supports two local STT engines: Whisper (OpenAI’s speech recognition model, running locally through the faster-whisper implementation) and Speech-to-Phrase (a lightweight engine that recognizes a fixed set of command phrases very quickly on modest hardware). Whisper provides the best accuracy, especially the “small” and “medium” model variants that balance speed with recognition quality.
Intent recognition: The transcribed text is analyzed by Home Assistant’s built-in intent handler, which maps natural language commands to smart home actions. “Turn off the living room lights” maps to service call light.turn_off targeting entity light.living_room. Home Assistant understands room names, device names, brightness levels, temperatures, and common phrasing variations without requiring exact command syntax.
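You can also extend the built-in intents with your own phrasings. The sketch below assumes the standard custom sentences layout (a YAML file under config/custom_sentences/en/); HassTurnOff is a real built-in intent, while the sentence templates themselves are illustrative examples.

```yaml
# Sketch: teaching Assist extra phrasings for an existing intent.
# File location assumed: config/custom_sentences/en/lights.yaml
language: "en"
intents:
  HassTurnOff:            # built-in "turn off" intent
    data:
      - sentences:
          - "kill [the] {name}"       # "[the]" marks an optional word
          - "shut off [the] {name}"   # "{name}" matches your entity names
```

Square brackets mark optional words and {name} is filled from your exposed entity names, so one template covers many spoken variations.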
Text-to-speech (TTS): Home Assistant responds audibly through Piper, a high-quality local TTS engine with natural-sounding voices in 30+ languages. Piper runs on CPU and generates speech faster than real-time on most hardware. The response plays through your voice satellite’s speaker, confirming the action.
Hardware Options for Local Voice
Running the full local voice pipeline requires a Home Assistant server for processing and a voice satellite device (microphone + speaker) in each room where you want voice control.
Server requirements: Whisper STT is the most CPU-intensive component. The “tiny” Whisper model runs on a Raspberry Pi 4 with acceptable accuracy. The “small” model (recommended for English) needs an Intel N100 mini PC or better for responsive performance (under 3 seconds from speech to action). The “medium” model delivers near-cloud accuracy but needs an Intel i5 or AMD Ryzen 5 equivalent. A dedicated GPU is not required but accelerates processing significantly.
Voice satellite options: The voice satellite is a device placed in each room that captures your voice and plays back responses. Purpose-built options include the Home Assistant Voice Preview Edition ($59, official hardware with dual microphones, speaker, and ESP32-S3), the M5Stack Atom Echo ($13, tiny ESP32-based mic+speaker combo that works surprisingly well), and the Raspberry Pi with ReSpeaker HAT ($35 Pi Zero 2W + $20 ReSpeaker 2-Mic HAT, offering better microphone quality than ESP32 options).
The M5Stack Atom Echo is the best starting point because of its $13 price and tiny form factor. Flash it with ESPHome firmware using Home Assistant’s built-in ESPHome dashboard, and it becomes a voice satellite in under 10 minutes. Sound quality is adequate for command confirmation responses. For rooms where you want music playback through the satellite, the Home Assistant Voice PE or a Raspberry Pi with an external speaker provides better audio quality.
Setting Up Whisper for Local Speech-to-Text
Install the Whisper add-on from the Home Assistant Add-on Store (for Home Assistant OS) or deploy the faster-whisper Docker container alongside your Home Assistant instance. The add-on handles model downloading, configuration, and integration automatically.
Choose your Whisper model based on your hardware. The “tiny” model (75MB) runs on almost anything but makes frequent transcription errors. The “base” model (142MB) sits between tiny and small in both size and accuracy. The “small” model (466MB) provides good accuracy for English commands and is the recommended starting point. After installation, specify your language (English, German, French, Spanish, etc.) to optimize recognition.
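For the Home Assistant OS add-on, model and language are set in the add-on's Configuration tab. The options below are a hedged sketch of that YAML editor; option names and available values can differ between add-on versions, so check your installed version's documentation.

```yaml
# Sketch: Whisper add-on options (Settings → Add-ons → Whisper → Configuration).
model: small      # tiny / base / small / medium; int8 variants trade a little
                  # accuracy for lower CPU and memory use
language: en      # pinning the language improves accuracy and speed
beam_size: 1      # higher values can improve accuracy at the cost of latency
```

Restart the add-on after changing the model so the new weights are downloaded and loaded.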
Test the STT pipeline from Home Assistant’s Voice Assistants configuration page. Click the microphone icon and speak a command. The page displays the transcription result, processing time, and any errors. If transcription is inaccurate, try upgrading to a larger model or speaking more clearly with pauses between words. Whisper handles natural speech well but benefits from deliberate pacing for smart home commands.
Setting Up Piper for Local Text-to-Speech
Install the Piper add-on from the Add-on Store. Piper generates natural-sounding speech from text using neural network voice models. Each voice model is 15 to 50MB and produces speech at 2x to 5x real-time speed on a Raspberry Pi 4, meaning responses are essentially instant.
Piper includes voices in 30+ languages with multiple voice options per language (male, female, different accents and styles). For English, the “amy” and “ryan” voices are among the most natural-sounding. Preview voices on the Piper samples page before choosing. You can configure different voices for different voice assistants in multi-language households.
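Voice selection for the Piper add-on is a one-line option in its Configuration tab. The fragment below is illustrative: Piper voice identifiers follow a language_REGION-name-quality pattern, and en_US-ryan-medium is a real example of that naming, but confirm the exact identifier against the samples page for your chosen voice.

```yaml
# Sketch: Piper add-on options. Voice IDs follow
# <language>_<REGION>-<name>-<quality>, e.g.:
voice: en_US-ryan-medium   # "medium" quality balances speed and naturalness
```

Higher quality tiers sound better but generate speech more slowly; on a Pi 4, “medium” voices are usually the sweet spot.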
Piper’s quality has reached a level where most listeners cannot distinguish it from commercial TTS engines like Amazon Polly or Google Cloud TTS for short responses. Smart home command confirmations (“OK, turning off the living room lights”) sound natural and clear. Longer responses (reading weather forecasts, calendar events) occasionally reveal synthetic artifacts, but overall quality is excellent for a fully local solution.
Configuring Voice Satellites With ESPHome
ESP32-based voice satellites run ESPHome firmware that handles wake word detection on the device itself, streams audio to Home Assistant for STT processing, and plays back TTS responses through the onboard speaker.
Flash your M5Stack Atom Echo or ESP32-S3-Box with ESPHome through Home Assistant’s ESPHome dashboard. Select the voice assistant firmware template for your specific device. ESPHome compiles and flashes the firmware over USB (first time) or wirelessly (subsequent updates). After flashing, the device appears in Home Assistant as a new ESPHome device with voice assistant capabilities.
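Under the hood, the firmware template wires a microphone component into ESPHome's voice_assistant component. The sketch below is modeled loosely on the Atom Echo template; the GPIO pins and component options are illustrative assumptions, so always start from the official template for your exact device rather than this fragment.

```yaml
# Sketch: skeleton of an ESPHome voice satellite (Atom Echo-style).
# Pin numbers are illustrative; the usual wifi:/api: blocks are omitted.
esphome:
  name: voice-satellite

esp32:
  board: m5stack-atom
  framework:
    type: esp-idf

microphone:
  - platform: i2s_audio
    id: echo_mic
    adc_type: external
    i2s_din_pin: GPIO23
    pdm: true

voice_assistant:
  microphone: echo_mic
  use_wake_word: true   # wake word detected before audio is streamed
```

Because ESPHome supports over-the-air updates, only the first flash needs a USB cable; later tweaks to this config deploy wirelessly.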
Configure the voice pipeline assigned to each satellite in Home Assistant’s Voice Assistants settings. Each satellite can use a different pipeline (different Whisper model, different Piper voice, different wake word). A kids’ room satellite might use a friendlier voice, while an office satellite uses a more professional tone.
LED feedback on the satellite indicates the current state: idle (LEDs off or dim), listening (LEDs blue/pulsing after wake word detection), processing (LEDs yellow while STT runs), responding (LEDs green while TTS plays), and error (LEDs red if recognition failed). This visual feedback is essential for knowing when to speak and when the system is processing.
Improving Recognition Accuracy
Local voice recognition accuracy depends on three factors: microphone quality, ambient noise level, and the Whisper model size. Optimize all three for the best experience.
Place voice satellites away from noise sources: HVAC vents, running appliances, TVs, and windows facing busy streets. The microphone on ESP32-based satellites picks up ambient noise that degrades transcription accuracy. If accuracy is poor in a specific room, try the Raspberry Pi with ReSpeaker HAT option, which uses higher-quality MEMS microphones with better noise rejection.
Name your devices with distinct, multi-syllable names that Whisper transcribes accurately. “Living room lights” works better than “LR lights.” “Kitchen overhead” works better than “kit light.” Avoid device names that sound similar to each other (e.g., “den light” and “den lamp”) because transcription errors between similar-sounding names cause wrong device activations.
Home Assistant’s intent recognition handles natural language variations surprisingly well. “Turn off the lights in the bedroom,” “bedroom lights off,” “switch off bedroom light,” and “kill the bedroom lights” all map to the same action. You do not need to memorize specific command phrases. Speak naturally, include the room name and action, and the system identifies the correct intent.
How accurate is Home Assistant local voice compared to Alexa?
Local Whisper STT achieves 85 to 95 percent accuracy for smart home commands versus Alexa’s 95 to 99 percent. The gap is narrowest for simple, clear commands (turn on/off, set temperature) and widest for complex queries and noisy environments. For controlling lights, locks, and thermostats, local voice handles the vast majority of commands correctly.
Does local voice work without internet?
Yes, completely. Every component (wake word, STT, intent recognition, TTS) runs on your local hardware. Voice control continues working during internet outages, ISP problems, and cloud service disruptions. This is the primary advantage over cloud-dependent voice assistants.
What languages does Home Assistant local voice support?
Whisper STT supports 99 languages. Piper TTS currently supports 30+ languages with natural-sounding voices. Home Assistant intent recognition supports English, German, French, Spanish, Dutch, Italian, Portuguese, Polish, and more, with community translations expanding coverage monthly.
How much does a local voice setup cost?
Minimum setup: M5Stack Atom Echo ($13) plus an existing Home Assistant server. Recommended setup: Home Assistant Voice PE ($59) or Atom Echo ($13) per room plus an Intel N100 mini PC ($150) for responsive Whisper processing. A three-room setup runs from roughly $40 (three Atom Echos on an existing server) to about $330 (three Voice PE units plus an N100 mini PC), with no monthly fees.
Can I use local voice alongside Alexa or Google Home?
Yes. Local voice satellites coexist with Alexa and Google Home devices on the same network. Use whichever assistant is convenient in each room. Some users keep an Alexa in the kitchen for music and recipes while using local voice satellites in bedrooms and offices for privacy. Home Assistant controls the same devices regardless of which voice interface triggers the command.