Read, listen, create: why your AI agent needs all three
Your agent already reasons. Give it three hands to do work: read files, listen to recordings, create images. Here's how image generation completes the Frenchie toolkit — and what that means for the shape of your next agent workflow.
Your AI agent already has a brain. It can reason, plan, write code, argue with you about architecture decisions, and generate a plausible roadmap in under a minute. What it's missing is hands.
A brain without hands can describe what a bookshelf should look like, but it can't build one. For an agent to actually do work — the real work, the kind that ends in a shippable artifact — it needs capabilities that extend past pure reasoning. It needs to be able to read files it didn't write, listen to recordings it wasn't in, and create assets it can use downstream.
Those three verbs — read, listen, create — are what Frenchie gives your agent.
The shape of agent work
Watch an AI agent try to complete a real task end-to-end and you start to notice a pattern. The reasoning work — figuring out what to do, drafting text, writing code, proposing structure — is usually the easy part. Modern models are absurdly good at it.
The hard part is everything that isn't reasoning. The part where the agent has to:
- Read a scanned contract the user dropped into the chat.
- Listen to a thirty-minute client call that was recorded yesterday.
- Create a cover image for the blog post it just drafted.
Each of those is a capability gap. The agent understands the task. It just can't perform the underlying action — reading an image, decoding an audio stream, producing pixel data — because those aren't things an LLM does natively.
You can patch each gap individually. Wire in an OCR API. Host Whisper. Call out to an image-generation service. Handle the upload, the polling, the retries, the file management, the credit accounting. Each patch takes a few hours to build and keeps costing hours to maintain. Do that three times and you've built a tools layer. Do it right and you've built Frenchie.
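Under MCP, all three of those patches collapse to the same wire shape: one JSON-RPC 2.0 `tools/call` request per capability. A minimal sketch of that shape, using the tool names from this post (`transcribe_to_markdown` and `generate_image` are Frenchie's; `ocr_to_markdown` is an illustrative stand-in for the read tool):

```typescript
// Sketch: one MCP tool call on the wire (JSON-RPC 2.0, method "tools/call").
// ocr_to_markdown is an assumed name; the other two are named in this post.
type ToolCall = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCall {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Three gaps, three calls: the whole "tools layer".
const read = buildToolCall(1, "ocr_to_markdown", { file: "contract.pdf" });
const listen = buildToolCall(2, "transcribe_to_markdown", { file: "call.mp3" });
const create = buildToolCall(3, "generate_image", { prompt: "cover image for a blog post" });

console.log([read, listen, create].map((c) => c.params.name).join(", "));
// → ocr_to_markdown, transcribe_to_markdown, generate_image
```

The point of the sketch: once the transport is uniform, adding a capability is a new tool name, not a new integration.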
Read (OCR)
Your agent can read any PDF or image. Scanned contracts, handwritten notes, invoices from 2003, research papers with tables and figures. One MCP call, clean Markdown back — with figures pulled out as separate PNG files so your agent can reference them directly instead of guessing at what they contain.
This isn't new to Frenchie — OCR was the first capability we shipped. But it set the shape for everything after: one narrow tool, one clean output, zero infrastructure for you to run.
Listen (transcription)
Your agent can listen to audio and video. Meeting recordings, sales calls, podcast episodes, voicemails. Long files get chunked automatically and reassembled into one continuous transcript. Async by default — your agent kicks off the job, keeps working, and collects the Markdown when it's ready. No blocking, no five-minute pauses in the conversation while a model chews through a long recording.
Most agents can't process audio at all without a tool like this. The model sees a file extension and politely declines. Frenchie turns that into a single transcribe_to_markdown call.
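The fire-and-collect pattern described above can be sketched as a start-then-poll loop. Here `startJob` and `checkJob` are hypothetical stand-ins for however your agent runtime kicks off and checks a `transcribe_to_markdown` job; the names and shapes are illustrative, not Frenchie's actual API:

```typescript
// Sketch of the async flow: kick off the job, keep working, collect later.
// startJob/checkJob are hypothetical; only the pattern is the point.
type Job = { id: string; status: "running" | "done"; markdown?: string };

async function transcribeInBackground(
  startJob: (file: string) => Promise<Job>,
  checkJob: (id: string) => Promise<Job>,
  file: string,
  pollMs = 1000,
): Promise<string> {
  let job = await startJob(file); // returns immediately; nothing blocks
  while (job.status !== "done") {
    // the agent is free to do other work between polls
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    job = await checkJob(job.id);
  }
  return job.markdown ?? "";
}
```

Because the loop yields between polls, a thirty-minute recording never turns into a five-minute pause in the conversation.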
Create (image generation)
And now: your agent can create images. A product mockup to attach to a spec. A cover image for the blog post it just drafted. A concept sketch to include in a design review. Describe what you need in a prompt and your agent generates the image, saves it next to the work it's doing, and moves on.
This is the piece that was missing. An agent that can read a scanned contract and transcribe the related client call but still has to ask you to generate the accompanying visual is an agent that's stuck halfway through the work. Image generation closes that loop.
In Frenchie, it's the generate_image MCP tool. Same install flow, same flat pricing: 20 credits per image, or $0.20. The generated file lands in your agent's workspace automatically. No new dashboard. No separate playground to manage. No new billing relationship.
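Flat pricing makes the credit math easy to sanity-check. A quick sketch using only the numbers above (20 credits and $0.20 per image, so one credit is one cent):

```typescript
// Back-of-envelope credit math from the flat pricing above.
const CREDITS_PER_IMAGE = 20;
const USD_PER_IMAGE = 0.2; // so one credit = $0.01

function imagesAffordable(credits: number): number {
  return Math.floor(credits / CREDITS_PER_IMAGE);
}

function usdCost(images: number): number {
  return images * USD_PER_IMAGE;
}

console.log(imagesAffordable(100)); // → 5 (the 100 free signup credits cover 5 images)
console.log(usdCost(5).toFixed(2)); // → 1.00
```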
Why three capabilities, not ten
A question we get a lot: why stop at three? Why not add structured data extraction, video generation, music generation, RAG, translation, summarization, the whole menu?
Because an agent's LLM already does most of that. Summarization is a prompt. Translation is a prompt. Structured extraction is a prompt over clean text — and Frenchie's job is to get the clean text to the agent in the first place. The reasoning layer is already solved.
What wasn't solved, before Frenchie, was the inputs and outputs that the reasoning layer needs. Reading arbitrary files. Listening to arbitrary recordings. Producing arbitrary images. Those are the three capability gaps that show up in almost every real agent workflow, and those are the three things Frenchie does.
Three verbs. Three MCP tools. One flat pricing scheme. That's the whole product.
What this looks like in your day
The practical version: you're in Claude Code, Cursor, Codex — whatever your agent lives in — and you're building something real. A blog post. A technical spec. A weekly update. A pitch deck.
You drop in a scanned PDF. Your agent reads it.
You attach a recording from yesterday's customer call. Your agent transcribes it.
You ask for a cover image to match the piece. Your agent generates it.
None of those steps require you to switch tools, open a second dashboard, or paste a URL from a different product. Your agent already has the capability. It just picks the right tool and runs.
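Strung together, those three steps are one routine. A sketch of the flow, with `callTool` as a hypothetical dispatcher into the agent's MCP tools and `ocr_to_markdown` as an assumed name for the read tool (the post names `transcribe_to_markdown` and `generate_image`):

```typescript
// Sketch of read -> listen -> create as a single agent routine.
// callTool and the filenames are illustrative, not a real API surface.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<string>;

async function draftWithAssets(callTool: CallTool) {
  // Read: the scanned PDF dropped into the chat.
  const contract = await callTool("ocr_to_markdown", { file: "scanned-contract.pdf" });
  // Listen: yesterday's customer call.
  const transcript = await callTool("transcribe_to_markdown", { file: "customer-call.m4a" });
  // Create: a cover image to match the piece.
  const cover = await callTool("generate_image", { prompt: "cover image matching the draft" });
  return { contract, transcript, cover }; // everything lands in the same workspace
}
```

No tool-switching on your side; the agent just routes each step to the right MCP tool.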
That's the shape of agent work we've been trying to make feel normal. Read, listen, create — all through one MCP interface, all behind one API key, all priced in the same credit pool. One hundred free credits on signup, once per email, to try all three.
Ready to see it? Install Frenchie in your agent with `npx @lab94/frenchie install --api-key fr_your_key`, and your agent gets OCR, transcription, and image generation as native MCP tools the moment you restart. Read the docs for the full capability reference, or skip straight to the free tier.