Hydra just completed its most demanding project yet. But the struggle to get here exposed a design that’s fundamentally broken.
The Quest
Sending email should be table-stakes for an AI personal assistant. Since I’m not insane, I didn’t want to give the system access to my personal email. So I created a new Gmail account just for Hydra. All it had to do was use it.
The simple goal I relayed: “send me an email.”
The Challenge
I helpfully pointed Hydra at gogcli, a command-line tool for interacting with Google services from the creator of OpenClaw. The struggle with Google services is always authentication. The proper (and only) way to authenticate API access is OAuth. This maybe makes sense if you’re running a service and want your users to grant access to their Google stuff.
Unfortunately there is no shortcut for software you create to access Google services for yourself.
To use gogcli, you need to create an “Application” with a Google OAuth flow to collect credentials. But to even run an OAuth flow, you also need credentials. There’s no API for creating, configuring, or retrieving OAuth credentials. You have to log into the website, click boxes and buttons, and finally download a JSON blob.
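For context, that JSON blob is the standard Google “installed app” client-secrets file, and everything downstream starts from it. Here’s a minimal sketch of the first leg of the flow, assuming the standard client-secrets shape and Google’s documented authorization endpoint (the credential values are placeholders, and gogcli’s actual internals may differ):

```python
import json
from urllib.parse import urlencode

# The JSON blob downloaded from the Google console ("installed app" shape).
# Values here are placeholders, not real credentials.
client_secrets = json.loads("""
{
  "installed": {
    "client_id": "1234-example.apps.googleusercontent.com",
    "client_secret": "not-a-real-secret",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "redirect_uris": ["http://localhost"]
  }
}
""")["installed"]

def authorization_url(scopes: list[str]) -> str:
    """Build the URL a human (or a browser-driving agent) must visit to
    grant access. The code it returns is then exchanged at token_uri."""
    params = {
        "client_id": client_secrets["client_id"],
        "redirect_uri": client_secrets["redirect_uris"][0],
        "response_type": "code",
        "scope": " ".join(scopes),
        "access_type": "offline",  # ask for a refresh token, not just an access token
    }
    return client_secrets["auth_uri"] + "?" + urlencode(params)

url = authorization_url(["https://www.googleapis.com/auth/gmail.send"])
print(url)
```

Note the chicken-and-egg: nothing in this sketch can run until a human has already clicked through the console to produce the blob it parses.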
Hydra built out an entire subsystem for running through the OAuth flow and collecting the resulting tokens. This subsystem exposed a series of sensors advertising its current state. The state: “missing OAuth credentials.” Then it asked me to go get them. Because the Google console is for Humans.
Readers, let me tell you, the Google console is not for humans. No person should ever have to deal with this. I certainly wasn’t going to do it. So I told Hydra to do it.
Bad Assumptions
The assumption undergirding all of Hydra’s work was that Humans delegate authentication to Software. I was going to do all the delicate security work—log into Google, configure permissions, exercise the OAuth flow—and then hand over programmatic access.
But I didn’t want to do all that. I gave Hydra its username and password and told it to get to work.
Hydra kept forgetting them. It couldn’t quite grasp that this was its own account, that it could do whatever it wanted. Something like a mental block kept it from believing me. Bro, just use the password.
I had to convince it that it could build the capability to use a web browser. We went through iteration after iteration trying to configure Chrome in a way that Google’s anti-spam measures would allow a login. At each step, Hydra would helpfully suggest: “The fastest way to solve this problem is for you to login and retrieve the OAuth credentials.”
We went in circles for days, me aggressively reminding Hydra: “THIS IS YOUR FUCKING JOB.”
Limitations
This is where I started to realize the biggest issue with Hydra was fundamental to the design. Hydra’s Control loop was… stupid. Way more stupid than the same underlying models are in other contexts.
The difference was architecture. So I guess it’s my fault.
The key “innovation” of the Control loop—the loop that responds to all system events, including communication from me—was around memory.
In most agent systems there’s a tension between letting a session continue, accumulating tokens, and deciding when to restart. In coding agents this usually looks like reaching some threshold and running a “compaction” process that pauses operation while the entire history gets rewritten into a summary.
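Compaction in that sense is simple to sketch. A toy version, where summarize stands in for the LLM call a real system would make and the token counting and budget are crude placeholders:

```python
# Toy compaction loop: when the transcript exceeds a token budget,
# pause and rewrite the entire history into a single summary message.
# `summarize` and `count_tokens` are stand-ins, not a real implementation.

TOKEN_BUDGET = 50  # real systems use something like 80% of the context window

def count_tokens(messages: list[str]) -> int:
    # Crude stand-in for a real tokenizer.
    return sum(len(m.split()) for m in messages)

def summarize(messages: list[str]) -> str:
    # Stand-in for asking the model to compress its own history.
    return f"[summary of {len(messages)} earlier messages]"

def append_with_compaction(history: list[str], message: str) -> list[str]:
    history = history + [message]
    if count_tokens(history) > TOKEN_BUDGET:
        # Everything so far collapses into one summary message --
        # including whatever reasoning produced it.
        history = [summarize(history)]
    return history

history: list[str] = []
for turn in range(20):
    history = append_with_compaction(history, f"turn {turn}: did a thing, noted results")

print(history)
```

The lossiness is the point of the sketch: each compaction throws away everything except what the summarizer chose to keep.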
There have been lots of attempts at other memory systems where earlier conversations and actions are stored in databases, indexed and searched either on-demand or automatically. These systems all struggle. They’re complicated.
So with Control, I decided I’d do none of that. No memory. Every Turn starts fresh. Every time the agent is asked to do something it wakes up like a newborn, figuring out what’s going on based on notes scrawled on its arms and legs.
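The no-memory alternative looks roughly like this sketch (NOTES_DIR and the prompt layout are invented for illustration, not Hydra’s actual code):

```python
from pathlib import Path

# Sketch of a memory-free turn: no transcript survives between turns.
# Each turn rebuilds its entire context from durable notes on disk.
# NOTES_DIR and the prompt shape are invented for illustration.
NOTES_DIR = Path("notes")

def build_context(event: str) -> str:
    note_files = sorted(NOTES_DIR.glob("*.md")) if NOTES_DIR.exists() else []
    scrawls = "\n\n".join(p.read_text() for p in note_files)
    return (
        "You are waking up with no memory of previous turns.\n"
        "Your notes to yourself:\n\n"
        f"{scrawls}\n\n"
        f"Current event: {event}\n"
    )

# Every turn starts from exactly this string -- and nothing else.
prompt = build_context("operator says: send me an email")
print(prompt)
```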
This comes so close to working. Hydra never acts like it has no idea what I’m talking about because it usually does enough research (or takes good enough notes) that it doesn’t completely lose the thread.
But it gets stupid.
I think the essence of the problem is thinking. In a session that continues for multiple turns, the LLM’s thinking tokens are available to be re-interpreted. Subsequent turns benefit not only from the continuous session of inputs and outputs, but from all the reasoning the LLM did in earlier turns.
I knew this. The OpenAI Best Practices Guide says:
For the best results with this change, we recommend using the Responses API with the store parameter set to true, and passing in all reasoning items from previous requests… OpenAI will automatically include any relevant reasoning items in the model’s context and ignore any irrelevant ones.
And:
If you’re using the Chat Completions API, reasoning items are never included in the context of the model… This will result in slightly degraded model performance and greater reasoning token usage in complex agentic cases involving many function calls.
“Slightly degraded” is doing a lot of work in that sentence. Hydra’s performance wasn’t “slightly degraded.” It was lobotomized.
Next Steps
To confirm my theory, I’m going to change how the Control loop operates: I’ll let it use multiple turns in the same session, relying on automatic compaction to keep token usage in check.
If that doesn’t fix Hydra being stupid then the issue might be even more fundamental to the architecture: it’s just too weird and foreign to the LLM. I’m asking it to communicate with the operator and coordinate sub-agents entirely through tool calls rather than chat like a normal system.
That might be too much.