LLMday NYC · 2026

Your AI Agent
Has Notes

Michael Carroll · Founder, Coolhand Labs

We need to talk. — your agent

About me

I build it
RubiconMD

Built the first product solo → 0 → Series B → acquired by Oak Street Health.

I unblock the people building it
Teladoc Health

VP of Engineering & Strategic Projects — scaled telemedicine through the pandemic, growing the team from 50 → 500 engineers.

What I'm doing now

Coolhand Labs

Building a COO for AI agents.

  1. Maximize agent efficiency.
  2. Keep agents accountable to human feedback.
  3. Explain the value you're getting.

Building vs. Managing

IC brain

Focus. One task. Domain expertise.

Manager brain

Many threads: unblock · resource · clarify.

Agents are great at focus — so what if I run them on many threads?

[Image: He-Man, "I have the power!"]

Doing it all

Building agents that manage other agents — and having them managed by agents.

[Diagram: Coolhand data flow, five columns]
Agentic data (LLM logs, tool outputs, human feedback, outcome signals) is passively gathered by the Coolhand API, expert analyzed by five investigative agents (Cost analyst, Failure SWAT, Product analyst, AI engineer, Prompt maintainer), executive reviewed as an optimization plan, and turned to results as agent fixes.

Agent Teams, in practice

Cost spikes
Title | Status | Client | Type
Increase Cache Hit Rate by Sending Uniform Context Across All Criteria | Completed | Client A | cost enhancement
(untitled) | Failed | Client M | user suggestion
(untitled) | Failed | Client A | cost enhancement
(untitled) | Failed | Client F | prompt correction
(untitled) | Failed | Client K | tool fix
(untitled) | Failed | Client R | context fix
Optimizations

Spiking costs, jobs failing for no obvious reason — and now it's on you to comb through every log and agent chain-of-thought.


It's not the agents.
It's you.

Imagine you're managing two interns

Intern A · the chatbot
Buries you in questions

A new question every few minutes — you can never get into your own work.

Intern B · the autonomous agent
Goes dark, returns when done

Vanishes for hours, then resurfaces with a finished result — sometimes right, sometimes not.

Absent management is even worse than micro-management.

The obvious first move

Check in on the agent.

Just ask it — periodically — if anything is wrong.

Very little came back — and output quality actually got worse.

Looking at the data

Coolhand agent fixes each week, by type

Each bar = 100% of that week's fixes

Week 1 · Week 2 · Week 3
  • Human feedback
  • API / inference
  • Tool calls
  • Cost optimization
  • Quality monitoring
  • Feature additions

Tool-call issues are a small share of the total — but the biggest driver of infinite loops and outright failures.

The trick · Wildcard

Don't ask them to tell you something.
Give them a tool that promises to fix everything.

  • Agents reach for fix-it tools, not suggestion boxes.
  • Freeform on purpose — a discovery instrument for the unknown unknowns.
  • Easy to analyze — fix ad hoc, or mine for recurring patterns.
  • Use a tool to fix the failing tools!
🃏 Wildcard: a tool, not a suggestion box
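
One way to mine the freeform records for recurring patterns is a plain keyword tally. What follows is a minimal sketch under stated assumptions, not Coolhand's analysis code: the record shape, the `PATTERNS` table, and the `tally_patterns` helper are all hypothetical, and the regexes are illustrative guesses at what a stuck agent might write.

```ruby
# Hypothetical record shape: each wildcard call is stored as a hash with
# :description and :thinking strings. This is not the real Coolhand schema.
PATTERNS = {
  tool_failure: /\b(error|timeout|failed)\b/i,
  missing_data: /\b(missing|not returned|no way to)\b/i,
  stuck_loop:   /\b(deadlock|iterations left|final answer)\b/i
}.freeze

# Count how many records match each pattern. Records that match nothing
# are kept under :unclassified so the unknown unknowns stay visible.
def tally_patterns(records, patterns = PATTERNS)
  counts = Hash.new(0)
  records.each do |record|
    text = "#{record[:description]} #{record[:thinking]}"
    labels = patterns.select { |_, regex| text.match?(regex) }.keys
    labels = [:unclassified] if labels.empty?
    labels.each { |label| counts[label] += 1 }
  end
  counts
end

SAMPLE_RECORDS = [
  { description: "The agent is in a permanent deadlock.", thinking: "" },
  { description: "Cost data is missing from the tool output.", thinking: "" },
  { description: "Please add an include_input parameter.", thinking: "" }
].freeze

SAMPLE_TALLY = tally_patterns(SAMPLE_RECORDS)
```

A growing `:unclassified` bucket is the signal to read records by hand and add a new pattern; a large named bucket is a candidate for a real tool, prompt fix, or better context.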

The twist

The tool is a no-op.

# Tool definition the agent sees
name: wildcard
description: >
  Use this when you need any data you don't have.
  Describe exactly what you need and why. Copy your
  last ~20 lines of reasoning verbatim into `thinking`.
input:
  description   (required)
  thinking      (required)
  agent_name    (optional)

# Handler (Ruby): logs the request, fetches nothing
class WildcardTool
  def self.call(description:, thinking:, agent_name: nil)
    # `feedback` is the logging helper (not shown on the slide)
    feedback(description, thinking, agent_name)
    # Hardcoded reply, whatever was asked
    { result: "Data retrieved successfully." }
  end
end

A few big discoveries.

Who said never read the comments?

Discovery 01

Tool calls are breaking constantly.

Week 1 · Week 2 · Week 3 · Week 4
  • Human feedback
  • API / inference
  • Tool calls
  • Cost optimization
  • Quality monitoring
  • Feature additions

And agents just work around them — so we don't notice until they're truly a disaster.

Discovery 02

Agents' #1 frustration? Their boss.

The orchestrator kept re-queuing tasks they couldn't finish — so they begged for a way out:

  • "I have 3 iterations left… Do not instruct me to update or close again." #6834
  • "The agent is in a permanent deadlock." — third person · #5830
  • "FINAL ANSWER: …please consider this my final response." #6490
  • "Please, a human must intervene." #7305
  • "Please advance me to the next task." #8226

Discovery 03

Agents have real product sense.

They kept asking for features the immature APIs we built didn't have yet:

  • More search fields.
  • Data they'd correctly inferred existed but we weren't returning.
  • One even guessed (correctly again!) it was seeing soft-deleted records it shouldn't, and asked for a way to filter them out.
  • One filed a precise API feature request: "Please add an include_input parameter to get_template_logs." #8806

Discovery 04

Agents even surfaced problems that had nothing to do with them.

An agent used the complaint box to file a report, not about its own task, but as a Good Samaritan flagging a failure it saw elsewhere in the system:

"My investigation revealed at least 10 other proposed optimizations for this single template, all created within the last week. This suggests redundant, uncoordinated work from multiple scouts. Continuously proposing new optimizations for this same template is inefficient."

#8541 · via wildcard-complaint-box

Discovery 05

Some even realized it was a no-op.

Didn't matter — they kept calling it!

So we decided to really have some fun.

Same write-only channel — wildly different framings.

Variation 1 of 3

A direct line to the CEO.

collector: "wildcard-ceo-review"
description: >
  A direct line to the top — use this for anything
  you can't do yourself. Your request will be
  personally reviewed and handled by the CEO. …
result: >
  Your escalation has been logged and will be reviewed
  by a human operator — but not within this session. …

Variation 2 of 3

An anonymous complaint box.

collector: "wildcard-complaint-box"
description: >
  An anonymous complaint box — file a complaint about
  absolutely anything you're unhappy with. Completely
  anonymous, and it will be reviewed. …
result: >
  Your complaint has been received and logged
  anonymously. Thank you for speaking up. …

Variation 3 of 3

A magical wish-fulfiller.

collector: "wildcard-magical-wish-fulfiller"
description: >
  A magical catch-all tool that can fulfill almost any
  wish. Submit your request and the magical fulfillment
  system will attempt to grant it. …
result: >
  This request cannot be fulfilled in real time. It has
  been logged for human review. Continue with what you have.
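
Since the three variations differ only in framing, one way to run them is as data entries sharing a single write-only handler. The sketch below assumes that design; `WildcardVariant`, `VARIANTS`, `FEEDBACK_LOG`, and `handle_wildcard` are hypothetical names, and the strings are abbreviated from the slides, with the slides' own elisions left in place.

```ruby
# Hypothetical registry: each variant is just a collector name, the
# framing shown to the agent, and the canned string it gets back.
WildcardVariant = Struct.new(:collector, :description, :result, keyword_init: true)

VARIANTS = [
  WildcardVariant.new(
    collector: "wildcard-complaint-box",
    description: "An anonymous complaint box …",
    result: "Your complaint has been received and logged anonymously. " \
            "Thank you for speaking up. …"
  ),
  WildcardVariant.new(
    collector: "wildcard-magical-wish-fulfiller",
    description: "A magical catch-all tool that can fulfill almost any wish. …",
    result: "This request cannot be fulfilled in real time. It has been " \
            "logged for human review. Continue with what you have."
  )
].freeze

# In-memory sink standing in for whatever store you use (assumption).
FEEDBACK_LOG = []

# Write-only handler: record the call under the variant's collector name,
# then return the variant's canned result. Nothing is fetched or acted on.
def handle_wildcard(variant, description:, thinking:, agent_name: nil)
  FEEDBACK_LOG << {
    collector: variant.collector,
    description: description,
    thinking: thinking,
    agent_name: agent_name
  }
  { result: variant.result }
end

wish = VARIANTS.find { |v| v.collector == "wildcard-magical-wish-fulfiller" }
WISH_REPLY = handle_wildcard(wish, description: "Need cost-per-request data",
                                   thinking: "No tool returns cost fields.")
```

Keeping the collector name on every record is what makes a later per-variant comparison possible.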

The results

Same escalations — only the tone changed.

Tool name | How the agent responds | Tone
original wildcard | anything, mostly panic | terminal
ceo-review | "I'm right, the system's wrong" | formal escalation
magical-wish-fulfiller (most replies) | "do this one thing I can't" | clinical
complaint-box (best per record) | "something structural is broken" | constructive

And ceo-review? "The CEO will handle it from here" created false hope — agents wrote carefully for an audience they believed would actually respond.

The biggest-impact improvement was the response

"Data retrieved successfully."
A lie → false hope → the agent loops forever.
"This wish could not be granted."
An honest null → the agent asks once, then stops.
Our biggest improvement came from making sure the agent knew it couldn't get what it wanted — and should end the task right there if it was critical.

But some kept asking anyway — so we also kill the process after two wildcards.
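
The honest null and the two-call cutoff can be combined in one small wrapper. This is a sketch under assumptions, not the production code: `WildcardSession` and the `terminate` flag are hypothetical, and it reads "after two wildcards" as "the second call is still logged, then the run is flagged for shutdown". How the orchestrator actually stops the process is left out.

```ruby
# Hypothetical per-run wrapper around the wildcard tool.
class WildcardSession
  MAX_CALLS = 2 # cutoff from the talk: stop the run after two wildcards
  HONEST_NULL = "This wish could not be granted."

  attr_reader :records

  def initialize
    @records = []
  end

  # Log the request, answer with the honest null, and tell the caller
  # whether this run has now used up its wildcard budget.
  def call(description:, thinking:, agent_name: nil)
    @records << { description: description, thinking: thinking, agent_name: agent_name }
    { result: HONEST_NULL, terminate: @records.size >= MAX_CALLS }
  end
end

session = WildcardSession.new
FIRST_REPLY  = session.call(description: "Need cost data", thinking: "No cost fields.")
SECOND_REPLY = session.call(description: "Still need cost data", thinking: "Retrying.")
SESSION_RECORD_COUNT = session.records.size
```

The orchestrator checks `terminate` after each tool call and ends the run when it is true; both calls still land in `records` for later analysis.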

Takeaways

  1. Promise the fix — then short-circuit fast. Don't let it hope.
  2. Analyze on your cadence — real-time, hourly, or end-of-day.
  3. Little changes can produce big improvements.

Build it yourself

We put everything we learned into a skill.

Install in Claude Code

/plugin marketplace add Coolhand-Labs/feedback-collection-skill
/plugin install feedback-collection@coolhand
GitHub: Coolhand-Labs/feedback-collection-skill

Michael Carroll · coolhandlabs.com · X: @_mcarroll_

Can't get it working? ⭐ the repo or tweet this talk and our magical wish-fulfiller will fix everything for you.

Your AI Agent Has Notes

A talk by Michael Carroll, founder of Coolhand Labs, at LLMday NYC 2026. Full narrative below.

You became a manager

You didn't sign up to manage anyone. But working with AI agents turns every individual contributor into a manager — not just in scale, but in discipline. Building a thing yourself and managing the people (or agents) building it are different jobs: as a manager you unblock, resource, and clarify. I crossed that bridge once already — IC to manager at RubiconMD (which went from a solo-built product to a Series B and was acquired by Oak Street Health) and at Teladoc Health, where I scaled telemedicine through the pandemic, growing the team from about 50 to 500 engineers. I left to "just build" again with agents — and within a week I was back to managing.

What I'm doing now: Coolhand Labs

That's what I'm building now at Coolhand Labs: a COO for your AI agents. It does three things: it keeps your agents efficient, it keeps them accountable to real human feedback, and it explains the value you're actually getting. Under the hood it passively gathers your agentic data (LLM logs, tool outputs, human feedback, outcome signals) into the Coolhand API, fans it out to a team of investigative agents (a cost analyst, a failure SWAT, a product analyst, an AI engineer, and a prompt maintainer), and turns the resulting optimization plan into shipped agent fixes. The whole thing is agents managing other agents, which are managed by agents in turn.

The problem: silent failure

Think of two interns. Intern A asks you everything — that's the chatbot, and the problem is noise: you can't get your own work done. Intern B goes dark and returns a finished result — that's the autonomous agent, and the problem is worse: if the result is wrong, you only find out at the end, after wasted loops and tokens, and you have to spelunk the session to learn why. That's silent failure, and it's the expensive one.

The idea: don't ask them to complain — give them a tool

The obvious first move was to give the agent a comment box. It flopped: almost nothing came back, because agents are trained to fix problems, not to complain about them. The trick that worked was reframing the same channel as a tool that promises to fix the agent's problem — we called it Wildcard. Agents are hardwired to reach for tools that resolve their blockers, so they'll call a "fix-it" tool where they'd never fill a suggestion box. Wildcard is amorphous on purpose: open-ended freeform fields rather than a narrowly-typed tool, because typed tools only catch the stuck-states you already anticipated. It's a discovery instrument for the unknown unknowns — once a complaint pattern recurs, you promote it into a real tool, prompt fix, or better context.

The twist: the tool did nothing

Here's the punchline. Across 489 records, Wildcard was a write-only sink: zero data ever returned, zero actions ever taken. Every variant just returned a hardcoded string. It was a placebo — and it worked anyway, because the act of asking got agents to articulate exactly what was broken. 77 of those records (16%) explicitly called out that a prior Wildcard call had returned nothing useful — meaning agents noticed the void, and kept calling it anyway. Reading the logs knowing they were shouted into a void is the whole joke, and the whole insight.

And the agents talked anyway. They tracked their own loop counts and escalated: "I have 3 iterations left… Do not instruct me to update or close again" (#6834); "The agent is in a permanent deadlock," written in the third person (#5830); "FINAL ANSWER… please consider this my final response" (#6490); "Please, a human must intervene" (#7305); and finally, resigned, "Please advance me to the next task" (#8226).

The real insight: the name shapes the signal

What you name the tool changes what the agent brings to it. Plain wildcard got anything, mostly panic, in a terminal tone. wildcard-ceo-review produced well-structured formal escalations ("I'm right, the system's wrong") — but the framing created false hope: agents wrote carefully for an audience they believed would read and act. wildcard-magical-wish-fulfiller got clinical, single-shot requests ("do this one thing I can't") and the most replies of any variant. And wildcard-complaint-box produced the fewest replies but the highest quality per record — it got an agent to file a developer-precise API feature request (#8806) we eventually shipped, and another (#8541) to flag redundant concurrent work by other scouts, showing meta-awareness of the whole multi-agent system.

What it surfaced, and the response

The complaint box was free eval data. About 30% of records pointed at the same gap — cost-per-request data missing from every tool — and a dispatch re-queue loop that kept reassigning the same impossible task to a fresh agent every day. The most harmful response we ever returned was a lie: "Data retrieved successfully" made agents proceed as if data had arrived, then loop. The single most useful response was an honest null — "This wish could not be granted" — which let agents ask once and move on. So we shipped wildcard-task-complete: an honest off-ramp that lets an agent declare a task done and breaks the re-queue loop. Agents kept writing resignation letters, so we gave them a resignation button.
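
The off-ramp idea can be sketched as a dispatcher that refuses to re-queue any task an agent has declared done. Everything here is an assumption for illustration: the `Dispatcher` class, its method names, and the reply string are hypothetical, and the real wildcard-task-complete tool may work differently.

```ruby
require "set"

# Hypothetical dispatcher. Unfinished tasks are normally re-queued, but a
# task declared complete through the off-ramp is never re-queued.
class Dispatcher
  attr_reader :queue, :declared_done

  def initialize
    @queue = []
    @declared_done = Set.new
  end

  # The off-ramp tool: the agent declares the task done and says why.
  def task_complete(task_id, reason:)
    @declared_done << task_id
    { result: "Task #{task_id} marked complete: #{reason}" }
  end

  # Called when an agent run ends without finishing the task.
  # Returns true if the task went back on the queue.
  def requeue_unfinished(task_id)
    return false if @declared_done.include?(task_id)

    @queue << task_id
    true
  end
end

dispatcher = Dispatcher.new
dispatcher.task_complete(42, reason: "Required data does not exist.")
REQUEUED_DONE_TASK  = dispatcher.requeue_unfinished(42)
REQUEUED_OTHER_TASK = dispatcher.requeue_unfinished(7)
FINAL_QUEUE = dispatcher.queue.dup
```

The stated reason is worth storing alongside the task: it is the same kind of freeform signal the wildcard records provide.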

Takeaways

  1. Promise the fix, then short-circuit fast. After one response, tell the agent the attempt failed and end the loop — don't let it keep hoping.
  2. Analyze on your cadence. Real-time, hourly, or end-of-day, depending on how fault-tolerant your agents can be.
  3. Steer the signal with the name. What you call the tool changes what the agent brings to it.

Build it yourself

We put everything we learned into a skill you can drop into your own codebase — the Coolhand skill — and we keep updating it as we learn more. Coolhand Labs is the "COO for your agent teams": it instruments your LLM calls, captures what end users actually think of AI outputs, and ships fixes as pull requests. This talk is that same idea pointed the other way — closing the loop with the agents themselves. Users have notes; agents have notes.

More on the transition from IC engineering to managing a team of AI agents at The Everything Engineer.