Polygon Labs



Lessons from wiring Claude Sonnet into my Gmail, Contacts and Calendar

by Eran Sandler

A few weeks ago I experimented with hooking up Anthropic's Claude 3.7 Sonnet (the last 200k-token Sonnet release before Claude 4 arrived) directly to my personal Google Workspace tools: Gmail, Google Contacts, and Google Calendar. The idea was simple: can a single large-context LLM effectively manage my daily productivity tasks?

I quickly prototyped three agents using Google's APIs:

  • InboxBot: identifies urgent emails, summarizes unread messages, drafts quick replies, and archives low-priority mail.

  • ContactBot: merges duplicate contacts and fills in missing contact details (names, companies) extracted from email history.

  • CalendarBot: checks my availability, negotiates meeting times with other participants (and their own agents), and sends out calendar invitations.
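To give a feel for how these agents are wired up, here is what CalendarBot's tool declarations might look like in the JSON-schema format the Anthropic Messages API expects for tool use. The tool names and fields below are illustrative, not my actual specs:

```python
# Hypothetical tool specs for CalendarBot, in Anthropic Messages API
# tool-use format: each tool has a name, a description, and a JSON
# schema describing its input.
CALENDAR_TOOLS = [
    {
        "name": "check_availability",
        "description": "Return free/busy blocks for a date range from Google Calendar.",
        "input_schema": {
            "type": "object",
            "properties": {
                "start": {"type": "string", "description": "ISO 8601 start of range"},
                "end": {"type": "string", "description": "ISO 8601 end of range"},
            },
            "required": ["start", "end"],
        },
    },
    {
        "name": "send_invite",
        "description": "Create a Google Calendar event and email invitations.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 event start"},
                "attendees": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "start", "attendees"],
        },
    },
]
```

Each of these specs gets serialized into the prompt on every call, which is exactly why their size starts to matter, as I found out next.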

Immediate problems I hit

Even though the 200k-token context window initially seemed huge, I ran into hard limits almost immediately.

The biggest issue was prompt size: combining the system prompt, detailed function specs, and raw Gmail or Calendar data quickly overflowed the available context. Throwing in more data made results worse, causing the model to lose track of critical details. Payload sizes also varied dramatically, further complicating prompt management.
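Much of the fix came down to budgeting. As a rough sketch, assuming a crude four-characters-per-token estimate (the real tokenizer is more accurate), a thread trimmer that keeps only the most recent messages fitting inside a budget might look like:

```python
def rough_tokens(text: str) -> int:
    # Crude estimate: roughly 4 characters per token for English text.
    return len(text) // 4

def trim_thread(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages that fit within a token budget,
    returning them in their original chronological order."""
    kept, used = [], 0
    for msg in reversed(messages):  # walk newest to oldest
        cost = rough_tokens(msg)
        if used + cost > budget:
            break  # budget exhausted: drop everything older
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Dropping the oldest messages first works because most email threads quote earlier context anyway; a summarization pass over the dropped tail would be the next refinement.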

Another issue was cross-task contamination: Claude often retained context from unrelated tasks, mixing details from one job (like drafting a work-related email) into another entirely different one (such as booking dinner).

To manage these challenges, I added a two-step prompting process. First, a preliminary call selected only the necessary tools for the task. Then a second call included just those tools and carefully trimmed data. I aggressively shortened email threads, extracted minimal summaries, offloaded large payloads externally, and used a task-scoped memory store with a 30-minute expiration to avoid cross-task leaks.
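The two-step flow is simple enough to sketch. The version below stubs the model behind an `ask_model` callable and uses made-up tool names; the real prompts and tool specs are of course much larger:

```python
from typing import Callable

# Hypothetical registry: tool name -> one-line description.
# Full JSON-schema specs would live elsewhere and only be
# serialized into the second call.
ALL_TOOLS = {
    "search_email": "Search Gmail threads",
    "draft_reply": "Draft a reply to a thread",
    "check_availability": "Read free/busy blocks from Calendar",
    "send_invite": "Create a calendar event",
}

def two_step_call(task: str, ask_model: Callable[[str], str]) -> str:
    # Step 1: a cheap preliminary call that sees only tool names and
    # one-line descriptions, and returns a comma-separated subset.
    menu = "\n".join(f"{name}: {desc}" for name, desc in ALL_TOOLS.items())
    raw = ask_model(f"Task: {task}\nTools:\n{menu}\nPick the relevant tools:")
    picked = [n.strip() for n in raw.split(",")]
    picked = [n for n in picked if n in ALL_TOOLS]  # ignore hallucinated names

    # Step 2: the real call carries only the selected tools (plus the
    # trimmed task data), keeping the prompt small.
    return ask_model(f"Task: {task}\nAvailable tools: {', '.join(picked)}")
```

Filtering the first call's answer against the registry matters in practice: the selection call occasionally invents tool names, and passing those through would poison the second call.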

These adjustments reduced the prompt size from roughly 215k tokens down to around 75k and lowered monthly token usage, but they introduced additional latency due to the extra API calls.
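The task-scoped memory store is the piece most worth sketching. A minimal version, with the TTL and clock injectable so expiry can be tested without waiting (my actual store and its API are more involved than this):

```python
import time

class TaskMemory:
    """Scratch store keyed by (task_id, key) whose entries expire,
    so details from one task can't leak into the next."""

    def __init__(self, ttl_seconds: float = 30 * 60, clock=time.monotonic):
        self.ttl = ttl_seconds      # 30 minutes by default
        self.clock = clock          # injectable for testing
        self._store = {}            # (task_id, key) -> (value, written_at)

    def put(self, task_id: str, key: str, value) -> None:
        self._store[(task_id, key)] = (value, self.clock())

    def get(self, task_id: str, key: str, default=None):
        entry = self._store.get((task_id, key))
        if entry is None:
            return default
        value, written = entry
        if self.clock() - written > self.ttl:
            del self._store[(task_id, key)]  # expired: drop it
            return default
        return value
```

Scoping every read and write by `task_id` is what actually stops the dinner-booking task from seeing the work-email draft; the TTL is just a backstop for tasks that never clean up after themselves.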

Results and lessons learned

InboxBot now reliably drafts about 70% of routine email replies with minimal edits required. ContactBot successfully merged 312 duplicate contacts in one run, and CalendarBot autonomously booked all 25 test meetings I threw at it.

But here's the real takeaway: if these fundamental issues surfaced so quickly in my small-scale experiment, other developers building multi-turn chats or more complicated agent interactions will inevitably run into the same problems—likely sooner and more painfully.