
An app · July 2026

GoonsAI — I had 0 viewers, so I built my own chat

Dozens of AI chatters with personas, lore, and memory react live to what I'm actually saying — and instead of trusting my eyes, I built a blind test against a real streamer's chat to see if I was fooling anyone.

Somewhere on this laptop is the most alive-feeling software I've ever built: a desktop overlay where dozens of AI chatters watch my stream and react — live — to what I'm actually saying into the mic. They have personas. They carry running lore. A few of them remember things across sessions and bring them up days later. They mix seamlessly into the same column as any real chat that shows up.

The honest reason it exists is less impressive. I wanted to get good at streaming, and streaming to zero viewers is a very specific kind of awful. Not embarrassing — silent. You say something, and nothing happens. You make a joke, and nothing happens. The line from my notes that became the whole thesis: a 0-viewer stream feels deafening. So instead of doing the sensible thing and streaming into the void until it stopped bothering me, I built the void a population.

How it works

A rolling transcript of my mic — transcribed locally, on-device — drives everything. A director decides how much chat this minute should have and what kind; the chatters write the lines in character. Emote spammers, know-it-alls, ironic haters, wholesome regulars, exactly one mod, a couple of harmless trolls. An inner circle of a few personas gets real memory that persists between streams. And one hard rule in the code: bot messages never, ever get posted to the real chat. It's an overlay for my eyes only. The point is practice reps that feel like streaming — not deceiving actual viewers.
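To make that shape concrete, here is a minimal sketch of the three pieces: a rolling transcript window, a director that sets the chat budget, and the never-post-to-real-chat guard. This is not the actual GoonsAI code. Every name in it (`RollingTranscript`, `director_budget`, `RealChatClient`, `Overlay`) is hypothetical, and the words-per-message heuristic is an assumption made up for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ChatMessage:
    author: str
    text: str
    is_bot: bool


class RollingTranscript:
    """Keeps only the last `window_s` seconds of transcribed speech."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self._chunks = deque()  # (timestamp, text) pairs, oldest first

    def add(self, text: str, now: float) -> None:
        self._chunks.append((now, text))
        # Drop chunks that have aged out of the window.
        while self._chunks and now - self._chunks[0][0] > self.window_s:
            self._chunks.popleft()

    def text(self) -> str:
        return " ".join(t for _, t in self._chunks)


def director_budget(transcript_text: str, max_messages: int = 30) -> int:
    """Hypothetical heuristic: more recent speech means more chat this minute."""
    words = len(transcript_text.split())
    return min(max_messages, words // 5)


class RealChatClient:
    """Stand-in for a real chat connection; refuses anything from a bot."""

    def __init__(self):
        self.sent = []

    def post(self, msg: ChatMessage) -> None:
        if msg.is_bot:
            raise PermissionError("bot messages must never reach the real chat")
        self.sent.append(msg)


class Overlay:
    """Local, streamer-only column where bot and real messages are mixed."""

    def __init__(self):
        self.lines = []

    def show(self, msg: ChatMessage) -> None:
        self.lines.append(f"{msg.author}: {msg.text}")
```

The design choice worth copying is that the guard lives in the only object that can reach the real chat, so a bug upstream in the director or the personas cannot leak a bot line to actual viewers.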

The part where I refused to grade my own homework

The easy move here is to eyeball the output and declare it "pretty realistic." I didn't trust myself to judge that, because of course my own fake chat looks realistic to me. So I built an adversarial blind test against a real streamer's chat — a VOD of Atrioc's, with Atrioc's direct permission. A judge model reads two chats reacting to the same stretch of stream, one real, one mine, and picks which is real.
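A blind round like that can be sketched in a few lines. This is an illustration under assumptions, not the real harness: the judge is modeled as any function that takes two chats and answers "A" or "B", and the names `run_round` and `hit_rate` are hypothetical. The one detail that matters is shuffling which slot the real chat lands in, so the judge cannot win by learning a position.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A judge takes (chat_a, chat_b) and returns "A" or "B" for the chat it thinks is real.
Judge = Callable[[List[str], List[str]], str]


def run_round(real: List[str], fake: List[str], judge: Judge, rng: random.Random) -> bool:
    """Return True if the judge correctly identified the real chat."""
    real_is_a = rng.random() < 0.5  # randomize the slot the real chat appears in
    a, b = (real, fake) if real_is_a else (fake, real)
    pick = judge(a, b)
    if pick not in ("A", "B"):
        raise ValueError(f"judge must answer 'A' or 'B', got {pick!r}")
    return (pick == "A") == real_is_a


def hit_rate(pairs: Sequence[Tuple[List[str], List[str]]], judge: Judge, seed: int = 0) -> float:
    """Fraction of (real, fake) pairs where the judge picked the real chat."""
    if not pairs:
        raise ValueError("need at least one round")
    rng = random.Random(seed)
    hits = sum(run_round(real, fake, judge, rng) for real, fake in pairs)
    return hits / len(pairs)
```

A hit rate of 1.0 means the fake was caught every round; the number to aim for is whatever the same judge scores when it has nothing real to detect.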

The first version got caught 100% of the time. Every single round. Which stung, and was also exactly the information I built the test to get.

The fix that mattered wasn't prompt-fiddling — it was architectural. Fake chat reads fake because it's too coherent: every line politely reacting to the streamer, everyone on topic, everyone in agreement.

"it hits completely different… a unanimity issue."— from my notes on the eval

Real chat is a scattered, multi-threaded room where half the messages ignore you. Once the generator started continuing that kind of room instead of reacting line-by-line, the judge's hit rate ground down to about 90%, measured on held-out rounds the generator was never tuned against, because I train-test split my fake Twitch chat like it's a real ML problem. It is one.

  • First version: caught by the judge 100% of the time.
  • After the architecture change: ~90% on held-out rounds.
  • Calibration floor: hand the judge two real chats and it picks "the real one" 47% of the time, which is a coin flip. So the remaining gap between ~90% and chance is real, and I'm reporting it instead of rounding it away.
  • A second judge axis for similarity, because fooling the judge isn't the same as matching the vibe.
  • Total API cost of the entire evaluation grind: about $5.
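The bookkeeping behind those numbers is small enough to sketch. Assumptions, stated loudly: `split_rounds` and `gap_over_floor` are hypothetical names, and the 30% held-out fraction is a placeholder, not the split actually used. The only figures taken from the eval are the ~90% hit rate and the 47% calibration floor.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_rounds(
    rounds: Sequence[T], held_out_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle once with a fixed seed, then return (tuning_rounds, held_out_rounds)."""
    if not 0.0 < held_out_fraction < 1.0:
        raise ValueError("held_out_fraction must be strictly between 0 and 1")
    shuffled = list(rounds)
    random.Random(seed).shuffle(shuffled)
    n_held = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held:], shuffled[:n_held]


def gap_over_floor(hit_rate: float, floor: float) -> float:
    """How far the judge's hit rate sits above what it scores on two real chats."""
    return hit_rate - floor


# With the reported numbers: 0.90 - 0.47 leaves a gap of about 0.43.
remaining_gap = gap_over_floor(0.90, 0.47)
```

Fixing the seed keeps the held-out rounds the same from run to run, so a prompt change can only be tuned against the other set.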

That second axis came from watching my own A/B rounds and noticing something the blind test couldn't see: I couldn't tell which chat was real, but the two were still totally different — mine spamming words, the real one a wall of questions. Passing as real and behaving like the reference are separate goals, so now there are separate judges for them. They trade off against each other, which is exactly the kind of annoying finding that tells you the test is honest.
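One way to keep the two goals from blurring together is to record both verdicts per round and report them side by side. A sketch, with assumptions: `RoundResult` and `summarize` are hypothetical names, and the 0-to-1 similarity scale is invented for illustration rather than taken from the real similarity judge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class RoundResult:
    caught: bool       # realism judge correctly picked the real chat
    similarity: float  # similarity judge's score in [0, 1] (hypothetical scale)


def summarize(results: List[RoundResult]) -> Dict[str, float]:
    """Report the two axes separately instead of folding them into one score."""
    if not results:
        raise ValueError("need at least one round")
    return {
        "hit_rate": mean(1.0 if r.caught else 0.0 for r in results),
        "mean_similarity": mean(r.similarity for r in results),
    }
```

Keeping two numbers instead of one blended score is what makes the trade-off visible: a change that lowers the hit rate but also lowers similarity shows up as exactly that.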

Where it's parked

GoonsAI is paused at a decision point, and the decision is philosophical: how real is real enough? The Atrioc bar is elite — thousands of fast, genuinely funny humans reacting to a professional entertainer. The actual product target is the opposite end: someone with zero viewers who wants their practice streams to feel alive, where "generic warm chat" already clears the bar. Ninety percent against Atrioc is probably overkill for that. Probably. I haven't decided, and it hasn't shipped — if it does, the plan is a cheap yearly license for small streamers easing into talking to chat.

And I'll admit the framing I like best is the one from my own plans, because the whole thing is partly a video in waiting: "I faked being a big streamer with AI chat." Building the product is the respectable version of the story. That sentence is the true one.
