We spent three weeks building a video analysis pipeline for detecting motor control breakdown in tennis players. 45 commits. Rally detection from audio. Shot segmentation. A labeling tool. Over 100 test cases.
Then we threw it away.
The original idea
The hypothesis was straightforward: motor systems destabilize under competitive pressure in ways that should be visible from 2D video. We wanted to detect that destabilization from broadcast footage — no expensive lab setup, no motion capture.
What went wrong
2D video fundamentally lacks the granularity needed. You can see gross movement patterns, but the subtle changes that characterize motor control breakdown are invisible at broadcast camera resolution. The signal-to-noise ratio was hopeless.
The pivot
Voice turned out to be a dramatically better sensor modality for three reasons:
Direct physiological access. Voice is not a proxy for motor state — it is a motor system. Cognitive pressure modulates the vocal apparatus directly, producing measurable spectral changes. You don’t need to infer — you observe.
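(The post doesn't specify which spectral features carry the signal; as a minimal illustration of the kind of measurement involved, here is a spectral-centroid sketch on synthetic audio — the feature name, the tones, and the "strained" framing are all hypothetical, not the study's actual pipeline.)

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Frequency 'center of mass' of a mono audio segment --
    one simple spectral feature that shifts when energy moves
    up or down the spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

# Synthetic stand-in for a phone-mic speech segment: a 200 Hz
# tone, then the same tone with extra high-frequency energy.
sr = 16_000
t = np.arange(sr) / sr
relaxed = np.sin(2 * np.pi * 200 * t)
shifted = relaxed + 0.5 * np.sin(2 * np.pi * 800 * t)

print(spectral_centroid(relaxed, sr))  # ~200 Hz
print(spectral_centroid(shifted, sr))  # noticeably higher
```

A phone microphone captures more than enough bandwidth for features like this, which is what makes the "sensor already in everyone's pocket" point below work.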
Cost reduction by orders of magnitude. A phone microphone replaces a multi-camera setup. The sensor is already in everyone’s pocket.
Surprise finding. The acoustic signature turned out to be the opposite of what we predicted. This was only visible because the new modality gave us fine-grained enough resolution to see the actual direction of change — something 2D video could never have revealed.
The lesson
The best instrument is not always the most obvious one. Sometimes the right modality is the one that gives you direct access to the phenomenon, even if it seems less intuitive. Nobody thinks of voice as a cognitive pressure sensor. That’s exactly why nobody has built this before.
Phase 0a result: p = 0.0001 on 847 speech segments. The construct exists.
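(The post doesn't name the statistical test behind that p-value. Purely as a sketch of what a segment-level comparison might look like, here is a nonparametric two-sample test on synthetic per-segment features — the split into "baseline" and "pressure" conditions, the effect size, and every number are illustrative assumptions, not the study's data.)

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic stand-in: 847 speech segments split into two
# conditions, with a modest shift in some acoustic feature.
baseline = rng.normal(loc=0.0, scale=1.0, size=424)
pressure = rng.normal(loc=0.3, scale=1.0, size=423)

# Two-sided Mann-Whitney U: do the feature distributions differ
# between conditions, without assuming normality?
stat, p = mannwhitneyu(baseline, pressure, alternative="two-sided")
print(f"U = {stat:.0f}, p = {p:.2g}")
```

With ~420 segments per condition, even a small distributional shift yields a very small p-value, which is why segment counts in the hundreds matter for this kind of claim.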