Voice AI has had a decade of serious investment and a decade of quiet failure. Not technical failure — the systems got dramatically better. Wake-word accuracy improved. Latency dropped. Response quality jumped. By every metric the industry was measuring, the trajectory was up and to the right.
And yet. The voice assistant is still mostly a thing you use awkwardly when your hands are full, or when you want to feel briefly futuristic in front of a guest. It isn’t company. It wasn’t supposed to be this way.
Here’s the irony that took me a while to fully land: the decade of voice AI investment coincided almost exactly with the decade that Gen Z stopped answering the phone. Between 2013 and 2023, phone call volumes declined steadily across nearly every demographic except people over 55. Young people developed what became a cultural norm: calls require prior coordination, preferably via text, or they’re an intrusion. The phone call became the emergency signal — if someone calls without warning, something is wrong.
The industry read this as: people want faster. They want text. They want asynchronous. Build better speech-to-text, smarter summaries.
But that’s a misdiagnosis. Gen Z didn’t abandon voice. They abandoned unscheduled synchronous demands on attention. These are not the same thing.
Look at what went up during the same decade. Discord voice channels — spaces where friends sit in audio together for hours, sometimes barely talking, just present. WhatsApp voice notes — short, chosen, sent when you feel like it, listened to when the recipient feels like it. PS5 party chat. Clubhouse rooms. The podcasting explosion. Spatial audio in gaming. None of these are text. All of them are chosen, contextual, ambient.
What people hate isn’t voice. What people hate is phone-call energy, the summons: “I am now demanding your synchronous attention, whether you’re in the middle of something or not.”
Voice assistants got better and better at answering and never got any better at the actual hard problem: knowing whether to speak in the first place.
Initiation calibration is the whole game, and it’s almost entirely unsolved.
Think about what happens when a voice assistant speaks unprompted. In the best case, it’s a roommate offering coffee at a good moment. In the worst case, it’s a phone ringing in a quiet conversation. The assistant doesn’t know which case it’s in. So it either shouts into silence (overshooting, becoming an interruption machine) or stays silent forever (undershooting, becoming furniture that occasionally startles you).
The reason this isn’t solved is that the field optimized for the tractable problem — response quality — and parked the hard problem as future work. It’s easier to measure whether an answer is accurate than whether a given moment was the right one to speak into. There’s no loss function for “should you have said anything at all?”
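You can make the asymmetry concrete. Here is a toy sketch of what an initiation objective might look like if you tried to write one down; every name and number is invented for illustration, and the hard part, producing a trustworthy p_welcome at all, is exactly the unsolved piece.

```python
# A toy initiation objective. Everything here is invented for illustration:
# no real system exposes a p_welcome, and estimating one is the open problem.

from dataclasses import dataclass

@dataclass
class InitiationEstimate:
    p_welcome: float          # probability the utterance lands well
    value_if_welcome: float   # payoff of a well-timed remark
    cost_if_intrusive: float  # penalty for interrupting, typically much larger

def should_speak(est: InitiationEstimate) -> bool:
    """Speak only if the expected utility of speaking is positive."""
    expected_utility = (
        est.p_welcome * est.value_if_welcome
        - (1.0 - est.p_welcome) * est.cost_if_intrusive
    )
    return expected_utility > 0

# With intrusion ten times costlier than a good remark is valuable,
# the gate needs p_welcome above ~0.91 before it opens.
print(should_speak(InitiationEstimate(0.6, 1.0, 10.0)))   # False
print(should_speak(InitiationEstimate(0.95, 1.0, 10.0)))  # True
```

Note what the asymmetry does: the gate stays closed by default. A system that can’t estimate that probability is left with exactly the two failure modes above, shout anyway or never open.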
The design principle that most voice systems have landed on as a result is: wait for the wake word. Which is just a way of saying: let the human solve the initiation problem. You shout “Hey, whatever” and then the system speaks. Command-response. The system never actually learned to read the room.
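In the same toy terms as the sketch above, the wake word isn’t a solution to the gate so much as a bypass of it:

```python
# The wake-word design, in the same toy terms: the human shouts first,
# which pins p_welcome to 1.0, so the system never estimates anything.
def wake_word_gate(wake_word_heard: bool) -> bool:
    return wake_word_heard
```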
What changes when you add vision?
Almost everything.
Without a camera, a voice system is reasoning about an invisible room. It knows there are probably humans somewhere. It knows what time it is. It might know the general rhythm of the household. But it can’t see whether someone is on a call, or reading in quiet focus, or laughing at something with a friend, or just sitting with coffee and open space. So it guesses, and guessing at scale produces bad outcomes.
With a camera, the initiation problem becomes observational rather than inferential. Is the person alone? Are they looking at me? Are they mid-sentence? Is someone else in the room they’re focused on? This is the kind of situational reading that a roommate does without thinking — not a computation, just an awareness.
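To see why this matters for the gate, here is the sketch extended with observable room state. The signal names are invented and real perception would be far noisier, but the shape of the inputs is the point.

```python
# Hypothetical sketch: estimating p_welcome from observable room state
# instead of guessing. Signal names are invented; real perception is messier.

from dataclasses import dataclass

@dataclass
class RoomState:
    person_present: bool
    person_alone: bool
    looking_at_device: bool
    mid_conversation: bool   # someone is speaking or mid-sentence
    focused_elsewhere: bool  # reading, on a call, attending to someone else

def estimate_p_welcome(room: RoomState) -> float:
    """Crude heuristic mapping what the camera sees to a welcome probability."""
    if not room.person_present:
        return 0.0
    if room.mid_conversation or room.focused_elsewhere:
        return 0.05  # speaking now is the phone ringing in a quiet conversation
    if room.looking_at_device:
        return 0.90  # attention is already on the system
    if room.person_alone:
        return 0.40  # open space: possible, but the bar stays high
    return 0.20
```

None of the numbers matter. What matters is that every input is an observation about the current moment rather than a prior about the household’s usual rhythm, which is all a camera-less system has to go on.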
The camera isn’t surveillance. That framing gets it backward. Surveillance is watching to record. This is watching to know when not to interrupt. The goal is to be as unobtrusive as possible most of the time, which requires knowing what’s happening in the room. The camera is the mechanism for respecting presence rather than violating it.
There’s a reference point that keeps coming up when I think about what good ambient voice presence actually looks like: Jarvis.
Not because Jarvis is a realistic near-term target. But because the relationship is right. Jarvis wasn’t a better Siri. He wasn’t a voice-activated search engine with a charismatic interface. He was a presence — ambient, attentive, opinionated, occasionally funny, capable of noticing when Tony was in over his head and saying so. He spoke when speaking was right and was quiet when quiet was right.
He could do that because he could see the room.
The technical gap between current voice AI and Jarvis is real. But I think the conceptual gap is smaller than it looks, and it’s specifically here: the field has been building systems that answer questions, and the goal should be systems that notice when a question is forming. Presence, not command. Observation, not activation.
The room has to be ready, and the system has to be able to tell when it is.
That’s the problem worth solving.