How to Automate Email Triage with AI (Route, Flag, Draft)

If your operations team is reading 200 emails a day to figure out which ones matter, you don't have an email problem. You have a triage problem. And the fix isn't a smarter inbox app.

Email triage is the process of reading, sorting, and routing incoming messages by what they actually are and what needs to happen next. Email triage automation uses an LLM to do that automatically: it classifies incoming messages by intent, routes them to the correct person or system, and optionally drafts a reply for human review. That replaces the need to manually read and sort every message. This article walks through how to actually build it.

We built this for Empower, a removals company getting 200+ inbound emails a day across enquiries, job updates, complaints, and supplier comms. What follows is based on that real build, not a hypothetical.



What Is Email Triage Automation?

Email triage automation is a workflow that reads each incoming email, decides what type it is, sends it to the right place, and either handles it automatically or flags it for a human. It's not a plugin or an inbox app. It's a system built on top of your existing email setup.



Why Don't AI Email Assistants Solve This Problem?

Search for "AI email management" and you'll get SaneBox, Superhuman, Clean Email, and a dozen others. They're fine tools. But they solve a personal productivity problem, not an operational one.

Those tools help you, as an individual, get through your inbox faster. They snooze newsletters, surface priority senders, and remind you about unanswered threads. Useful if you're swamped with your own emails.

That's not what ops teams need. When 200 emails a day are hitting a shared inbox, some are job enquiries, some are complaints, some are driver updates, some are supplier invoices. You need those emails to go somewhere specific and trigger something specific. A cleaner UI doesn't solve that.

What you need is a workflow that reads each email, decides what it is, sends it to the right place, and either handles it automatically or flags it for a human. That's a different kind of build entirely. It's not a plugin. It's a system.



What Email Triage Automation Actually Does

Properly built, an email triage system does three things in sequence.



Classify: Sorting by Intent, Not Just Sender

The first step is classification. The system reads the email and decides what type it is. Not based on who sent it, but based on what they're asking for or telling you.

This is where an LLM earns its place. Rule-based filters can catch obvious patterns. A subject line containing "invoice" is probably a billing email. But real-world emails are messier than that. A customer might email to say their job went well, then ask about rebooking, then mention a damaged item, all in the same message. A keyword filter doesn't know what to do with that. An LLM does.
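To make the keyword-filter limitation concrete, here is a small sketch. The rules and labels are made up for illustration. A first-match filter picks one label for a mixed message and quietly drops the rest.

```python
# Hypothetical keyword rules: the first matching keyword decides the label.
KEYWORD_RULES = [
    ("invoice", "supplier"),
    ("rebook", "enquiry"),
    ("damaged", "complaint"),
]


def keyword_classify(text: str) -> str:
    """Return the label of the first rule whose keyword appears in the text."""
    lowered = text.lower()
    for keyword, label in KEYWORD_RULES:
        if keyword in lowered:
            return label
    return "unknown"


def all_keyword_labels(text: str) -> list:
    """Return every label whose keyword appears, in rule order."""
    lowered = text.lower()
    return [label for keyword, label in KEYWORD_RULES if keyword in lowered]


email = (
    "Thanks, the move went really well. Could we rebook for March? "
    "Also, one of the boxes arrived damaged."
)
print(keyword_classify(email))    # enquiry (the complaint is silently dropped)
print(all_keyword_labels(email))  # ['enquiry', 'complaint']
```

Returning every match doesn't fix it either. The filter still has no idea which intent matters most or what the customer actually wants done.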

Classification outputs a structured label, something like enquiry, complaint, job-update, supplier, internal, that the rest of the workflow uses to decide what happens next.



Route: Getting the Right Email to the Right Person or System

Once classified, the email goes somewhere. That might mean:

  • A new enquiry gets logged as a lead in the CRM and a sales rep gets notified in Slack

  • A complaint gets flagged as high priority and sent directly to the ops manager

  • A supplier invoice gets forwarded to the finance folder and tagged for processing

  • A job update gets matched to the existing job record and the status updated automatically



The routing layer is basically a decision tree built on top of the classification. The LLM does the reading and labelling. The routing logic does something with that label.



Draft: Generating a Reply That Only Needs a Human to Review

For certain categories, the system can also draft a response. Not send it automatically, at least not at first, but generate a draft that a human can review, edit, and send with one click.

For a standard enquiry asking about availability and pricing, there's usually a template answer. The LLM can pull the relevant details, personalise it to the specific email, and produce a draft that's 90% ready. The human review step stays in for quality control, but you've cut the time from minutes to seconds.

For complex or sensitive emails, you skip the draft and just route to the right person with context. Not everything should be automated. Knowing where to draw that line is part of the build.



Step-by-Step: Building an Email Triage Workflow



Step 1: Define Your Email Categories (This Takes Longer Than You Think)

Before you write a single line of code or set up a single integration, you need to sit down and define your categories. This is the unglamorous part that most builds skip, and then they wonder why the system keeps misclassifying things.

Start by pulling 100 real emails from the inbox you want to automate. Read them. What buckets do they naturally fall into? Don't start with the categories you think you have. Start with the emails themselves and let the categories emerge.

For Empower, we expected four or five clean categories. We ended up with nine, because the real inbox had nuances we hadn't anticipated. There were emails that were both an enquiry and a complaint. There were emails from existing customers that looked like new enquiries. There were internal emails from drivers that needed to go to dispatch, not customer service.

Once you have your categories, write a one-paragraph description of each one. What does an email in this category typically contain? What's the intent behind it? What makes it different from the adjacent categories? These descriptions become part of your classification prompt later.

Also decide: for each category, what should happen? Where does it route? Does it get an auto-draft? Does it trigger a notification? Does it update a record somewhere? Map it out before you build it.
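One way to capture that mapping is a small category spec you can check before any integration work starts. This is a sketch, not the Empower configuration. The descriptions, destination names, and the `review-required` fallback label are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    label: str
    description: str  # the one-paragraph description, reused in the prompt later
    route_to: str     # hypothetical destination name
    auto_draft: bool  # should the system draft a reply?
    notify: bool      # should someone be pinged?


CATEGORIES = [
    Category("enquiry", "A prospective customer asking about availability or pricing.",
             "crm_leads", auto_draft=True, notify=True),
    Category("complaint", "A customer reporting a problem with a completed or ongoing job.",
             "ops_manager", auto_draft=False, notify=True),
    Category("supplier", "Invoices or account messages from a supplier.",
             "finance_folder", auto_draft=False, notify=False),
    Category("review-required", "Anything that does not clearly fit another category.",
             "human_review", auto_draft=False, notify=True),
]


def validate(categories: list) -> dict:
    """Check the spec for common mistakes and return a label -> Category map."""
    labels = [c.label for c in categories]
    if len(labels) != len(set(labels)):
        raise ValueError("duplicate category labels")
    if "review-required" not in labels:
        raise ValueError("a fallback category is required")
    for c in categories:
        if not c.description.strip():
            raise ValueError(f"category {c.label!r} has no description")
    return {c.label: c for c in categories}


CATEGORY_MAP = validate(CATEGORIES)
print(CATEGORY_MAP["enquiry"].route_to)  # crm_leads
```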



Step 2: Choose Your Entry Point

You need a way to get the email into your workflow the moment it arrives. Three main options, compared below:

| Entry point | Best for | Setup effort | Limitations |
| --- | --- | --- | --- |
| Gmail API (Pub/Sub push notification) | Gmail-based businesses | Medium | Google ecosystem only |
| Outlook / Microsoft Graph API | Microsoft 365 businesses | Medium-high | More setup, same reliability once live |
| Email forwarding rules | Quick prototypes | Low | Loses metadata, complicates reply threading |

Gmail API with a push notification: Google's Gmail API supports Pub/Sub push notifications, which behave like a webhook that fires whenever the mailbox changes, including when a new email lands. The notification itself carries only the mailbox address and a history ID; your workflow then uses the API to fetch the new message, with full access to headers, body, and attachments. This is the cleanest approach for Gmail-based businesses. It's reliable and real-time.

Outlook / Microsoft Graph API: Same concept, different ecosystem. If your business runs on Microsoft 365, the Graph API supports email subscriptions that push notifications on new mail. Slightly more setup involved, but solid once it's running.

Email forwarding rules: The lowest-friction entry point. Set up a forwarding rule in your email client that sends every inbound email to a dedicated inbox or webhook endpoint. Useful for quick prototypes, but it has limitations. You lose some metadata and it can cause complications with reply threading.

For production builds, use the native API. The forwarding approach is fine for testing, but you'll hit limitations when you need to do things like reply from the original email thread or stop automated notification emails from triggering the workflow.
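If you go the Gmail route, the first thing your endpoint receives is a Pub/Sub push envelope, not the email. Its `message.data` field is base64url-encoded JSON holding the mailbox address and a history ID, which you then use with the Gmail API to fetch what changed. Here is a minimal sketch of the decoding step, using only the standard library. The function name and the sample values are assumptions, and the follow-up API call is left out.

```python
import base64
import json


def decode_gmail_push(envelope: dict) -> dict:
    """Extract the mailbox address and history ID from a Pub/Sub push envelope."""
    message = envelope.get("message") or {}
    data = message.get("data")
    if not data:
        raise ValueError("push envelope has no message.data field")
    # base64url payloads may arrive without padding, so restore it first.
    padded = data + "=" * (-len(data) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    return {
        "email_address": payload["emailAddress"],
        "history_id": str(payload["historyId"]),
        "pubsub_message_id": message.get("messageId"),
    }


# Sample envelope built locally for illustration; values are made up.
sample_payload = {"emailAddress": "ops@example.com", "historyId": 9876543}
sample_envelope = {
    "message": {
        "data": base64.urlsafe_b64encode(json.dumps(sample_payload).encode()).decode(),
        "messageId": "1",
    },
    "subscription": "projects/example/subscriptions/gmail-triage",
}
print(decode_gmail_push(sample_envelope))
```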



Step 3: Classification with an LLM: What Prompt Structure Works

The classification prompt is doing the heavy lifting, so it needs to be clear and specific. Vague prompts produce inconsistent classifications.

The structure that works well looks like this:

First, give the model the category list with descriptions. Not just labels, descriptions. The more context it has about what distinguishes each category, the more accurately it classifies edge cases.

Second, give it the email to classify. Subject line, sender, and body. You don't always need to pass the full thread. Often the most recent email is enough, but include prior context when the classification might depend on it.

Third, tell it to output structured JSON. Something like {"category": "complaint", "confidence": "high", "summary": "Customer reporting damaged item from job #4821", "requires_human": true}. The summary field is useful downstream. It means the person receiving the routed email gets a one-line context note without needing to read the original.

The confidence field matters more than it might seem. When the model is uncertain about a genuinely ambiguous email, you want to know that, so you can route it to a human review queue rather than auto-handling it incorrectly.
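Put together, the three parts of the prompt and the output check might look like the sketch below. The category descriptions are placeholders, the `medium` confidence level and the `review-required` label are assumptions, and the actual model call is left out because it depends on your provider. The parser treats anything malformed as a low-confidence fallback so it never reaches the routing layer as a valid result.

```python
import json

# Placeholder descriptions; in practice these come from Step 1.
CATEGORY_DESCRIPTIONS = {
    "enquiry": "A prospective customer asking about availability or pricing.",
    "complaint": "A customer reporting a problem with a job.",
    "supplier": "Invoices or account messages from a supplier.",
    "review-required": "Anything that does not clearly fit another category.",
}
ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def build_classification_prompt(subject: str, sender: str, body: str) -> str:
    """Assemble categories, the email, and the output instructions."""
    category_lines = "\n".join(
        f"- {label}: {desc}" for label, desc in CATEGORY_DESCRIPTIONS.items()
    )
    return (
        "Classify the email below into exactly one category.\n\n"
        f"Categories:\n{category_lines}\n\n"
        f"Email:\nFrom: {sender}\nSubject: {subject}\n\n{body}\n\n"
        'Respond with JSON only, using the keys "category", '
        '"confidence" (low, medium or high), "summary" and "requires_human".'
    )


def parse_classification(raw: str) -> dict:
    """Validate the model's JSON; fall back to human review if anything is off."""
    fallback = {
        "category": "review-required",
        "confidence": "low",
        "summary": "Model output could not be parsed",
        "requires_human": True,
    }
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(result, dict):
        return fallback
    category = result.get("category")
    confidence = result.get("confidence")
    if not isinstance(category, str) or category not in CATEGORY_DESCRIPTIONS:
        return fallback
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        return fallback
    if not isinstance(result.get("summary"), str):
        return fallback
    if not isinstance(result.get("requires_human"), bool):
        return fallback
    return result


raw_output = (
    '{"category": "complaint", "confidence": "high", '
    '"summary": "Customer reporting damaged item from job #4821", '
    '"requires_human": true}'
)
print(parse_classification(raw_output)["category"])  # complaint
print(parse_classification("Sure! Here you go")["category"])  # review-required
```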



Step 4: Routing Logic and Escalation Rules

With a classified email and a confidence signal, the routing layer makes decisions. This is standard conditional logic. If category is X and confidence is high, do Y. If confidence is low, send to review queue.

A few things worth building in from the start:

Escalation rules based on content signals. Even within a category, some emails need different handling. A complaint that mentions a solicitor or legal action should escalate faster than a general complaint about timing. You can add a secondary classification pass specifically for escalation triggers, or include escalation signals as part of the primary classification prompt.

Deduplication. If someone sends the same email twice or you're processing a forwarded thread, you don't want two workflows firing. Build in a check against recent email IDs before processing.

Fallback routing. Every workflow needs a catch-all. If an email doesn't fit any category clearly, it goes to a human review queue with the model's best guess attached. You never want emails disappearing into a void because the classification failed.
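Those three safeguards fit into a routing function of a few lines. This is a sketch under some assumptions: the destination names and escalation terms are made up, the escalation check runs on every email before the confidence check (one possible ordering, not the only one), and the in-memory set stands in for whatever persistent store you'd use in production.

```python
from typing import Optional

ESCALATION_TERMS = ("solicitor", "legal action")  # illustrative trigger phrases
ROUTES = {  # hypothetical destination names
    "enquiry": "crm_leads",
    "complaint": "ops_manager",
    "supplier": "finance_folder",
    "job-update": "job_records",
}
REVIEW_QUEUE = "human_review"
URGENT_QUEUE = "urgent_escalation"

# In-memory for the sketch only; a real build needs a store that survives restarts.
_seen_message_ids = set()


def route_email(message_id: str, body: str, classification: dict) -> Optional[str]:
    """Return a destination name, or None if this message was already processed."""
    # Deduplication: skip anything we have already handled.
    if message_id in _seen_message_ids:
        return None
    _seen_message_ids.add(message_id)

    # Escalation: checked first so a legal mention never waits in a general queue.
    lowered = body.lower()
    if any(term in lowered for term in ESCALATION_TERMS):
        return URGENT_QUEUE

    # Low or missing confidence goes to a human.
    if classification.get("confidence") != "high":
        return REVIEW_QUEUE

    # Fallback: unknown categories also go to a human.
    return ROUTES.get(classification.get("category"), REVIEW_QUEUE)


print(route_email("msg-1", "Do you have availability in March?",
                  {"category": "enquiry", "confidence": "high"}))  # crm_leads
print(route_email("msg-1", "Do you have availability in March?",
                  {"category": "enquiry", "confidence": "high"}))  # None
```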



Step 5: Draft Generation and Human Review Gate

For categories that warrant a drafted reply, the system runs a second LLM call after classification. This prompt is different. It's not classifying, it's generating.

Pass it the email content, the category, any relevant context pulled from connected systems (job record, customer history, pricing info), and a template or tone guide for that category. The output is a draft reply.
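As a rough illustration, the generation prompt is mostly assembly. The wording, the context keys, and the `[TODO]` convention below are assumptions, not the prompt used in the Empower build.

```python
def build_draft_prompt(email_body: str, category: str, context: dict, tone_guide: str) -> str:
    """Combine the email, its category, system context, and a tone guide."""
    context_lines = "\n".join(f"- {key}: {value}" for key, value in context.items())
    return (
        f"You are drafting a reply to a customer email classified as '{category}'.\n\n"
        f"Tone guide:\n{tone_guide}\n\n"
        f"Known facts (use only these, do not invent details):\n{context_lines}\n\n"
        f"Customer email:\n{email_body}\n\n"
        "Write a reply for a human to review before sending. "
        "If a needed fact is missing, leave a clearly marked [TODO] placeholder."
    )


prompt = build_draft_prompt(
    email_body="Hi, are you free to do a two-bed flat move in March?",
    category="enquiry",
    context={"available_dates": "14 March, 21 March", "customer_name": "Jane"},
    tone_guide="Friendly, brief, no jargon.",
)
print(prompt)
```

Telling the model to use only the supplied facts, and to mark gaps rather than guess, gives the reviewer something obvious to look for.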

The human review gate is non-negotiable for most businesses at the start. The draft goes into a review interface. Could be as simple as a Slack message with the draft text and an approve/edit button, or a proper review dashboard if volume justifies it. A human reads the draft, makes any changes, and sends.

Over time, as you see what the model gets right consistently, you can choose to auto-send specific categories with high confidence scores. But start with the review gate in place. It builds trust in the system before you hand over the keys.
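For the simple Slack version of the review gate, the message can be built as a Block Kit payload with two buttons. The sketch below only builds the payload. The `action_id` values and the draft ID scheme are hypothetical, and posting the message and handling the button clicks are left out.

```python
import json


def build_review_message(draft_id: str, recipient: str, summary: str, draft_text: str) -> dict:
    """Build a Slack Block Kit payload showing a draft with Approve/Edit buttons."""
    return {
        # Plain-text fallback used in notifications.
        "text": f"Draft reply to {recipient} awaiting review",
        "blocks": [
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"*Draft reply to {recipient}*\n{summary}"},
            },
            {
                "type": "section",
                # Section text is capped at 3000 characters, so truncate long drafts.
                "text": {"type": "plain_text", "text": draft_text[:3000]},
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Approve"},
                        "style": "primary",
                        "action_id": "approve_draft",  # hypothetical handler name
                        "value": draft_id,
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Edit"},
                        "action_id": "edit_draft",  # hypothetical handler name
                        "value": draft_id,
                    },
                ],
            },
        ],
    }


payload = build_review_message(
    "draft-42", "jane@example.com", "Enquiry about a March move",
    "Hi Jane, thanks for getting in touch...",
)
print(json.dumps(payload, indent=2))
```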



What This Looks Like in Practice: An Operations Team's Inbox

At Empower, the inbox was a genuine bottleneck. Customer-facing staff were spending the first hour of every morning just sorting through overnight emails to figure out what needed urgent attention. Enquiries were getting missed. Complaints were sitting unread.

The build we implemented used the Gmail API to trigger on every inbound email to the main ops inbox. Each email went through a classification prompt with nine categories defined, including a review-required fallback. The structured output fed into routing logic built in n8n.

Enquiries went to the CRM as new leads and triggered a Slack notification to the sales team. Job updates were matched to existing records by job number extracted from the email body. Complaints above a certain severity threshold (based on language signals in the classification) escalated directly to the ops director. Supplier emails were forwarded to a dedicated folder and tagged.
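Matching a job update to its record starts with pulling the job number out of the body. Here is a minimal sketch, assuming references look like "job #4821" or "Job number: 5310". The real reference format isn't specified here, so treat the pattern as a placeholder to adapt.

```python
import re
from typing import Optional

# Assumed format: the word "job", an optional "number"/"no."/"#", then 3+ digits.
JOB_NUMBER_PATTERN = re.compile(
    r"\bjob\s*(?:number|no\.?|#)?\s*[:#]?\s*(\d{3,})\b",
    re.IGNORECASE,
)


def extract_job_number(body: str) -> Optional[str]:
    """Return the first job number found in the text, or None."""
    match = JOB_NUMBER_PATTERN.search(body)
    return match.group(1) if match else None


print(extract_job_number("Customer reporting damaged item from job #4821"))  # 4821
print(extract_job_number("Running late, no reference to hand"))  # None
```

When nothing matches, the email should go to a human rather than being attached to a guessed record.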

For enquiries, the system also generated draft replies using pricing and availability pulled from their job management system. Staff reviewed drafts in a simple Slack workflow. See the draft, hit approve or edit, done.

The first hour of the morning no longer exists as a triage exercise. The team arrives to an already-sorted inbox where anything requiring their attention has a priority flag and a summary. What used to take an hour takes about ten minutes of review.



Where Email Triage Automation Breaks (And How to Handle It)

It would be dishonest not to cover this. There are failure modes and they're worth knowing upfront.

Emails that span multiple categories. The model will classify to the best fit, but sometimes an email really is two things. Build your categories with this in mind. Allow for a multi-intent label that routes to a human rather than trying to force a single bucket. Or extract and route multiple intents from a single email if your downstream systems can handle parallel workflows.

Threads with long history. If your trigger fires on a reply within a long email thread, the model needs enough context to classify correctly. Passing only the most recent message can cause misclassification when the intent is buried in the thread history. Experiment with how much thread context to include. Usually the last two or three messages are enough.
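A simple way to experiment is to make the amount of thread context a parameter. This sketch assumes messages arrive oldest-first as dicts with `sender` and `body` keys. Both the shape and the default limits are illustrative.

```python
def recent_thread_context(messages: list, max_messages: int = 3, max_chars: int = 4000) -> str:
    """Join the most recent messages of a thread into one block of context."""
    if max_messages < 1 or max_chars < 1:
        raise ValueError("max_messages and max_chars must be at least 1")
    recent = messages[-max_messages:]
    parts = [f"From: {m['sender']}\n{m['body'].strip()}" for m in recent]
    text = "\n\n---\n\n".join(parts)
    # Keep the tail so the newest message survives if the text is too long.
    return text[-max_chars:]


thread = [
    {"sender": f"person{i}@example.com", "body": f"message {i}"}
    for i in range(1, 6)
]
print(recent_thread_context(thread))
```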

Emails where the classification changes the right action significantly. If getting it wrong means a complaint sits in the wrong queue for two days, you need tighter confidence thresholds. Raise the confidence bar for auto-handling in high-stakes categories so that more of them go to human review, and monitor the fallback queue closely in the first few weeks.
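One way to express this is a minimum confidence level per category, with the strictest bar as the default. The levels and the per-category settings below are illustrative assumptions, not tuned values.

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

# Minimum confidence needed before a category is handled without review.
MIN_CONFIDENCE_FOR_AUTO = {
    "complaint": "high",   # high stakes: strictest bar
    "enquiry": "medium",
    "supplier": "medium",
}
DEFAULT_MIN_CONFIDENCE = "high"  # unknown categories get the strictest bar


def needs_human_review(category: str, confidence: str) -> bool:
    """True when the model's confidence is below the bar for this category."""
    required = MIN_CONFIDENCE_FOR_AUTO.get(category, DEFAULT_MIN_CONFIDENCE)
    # Unrecognised confidence values rank below "low" and always need review.
    return CONFIDENCE_RANK.get(confidence, -1) < CONFIDENCE_RANK[required]


print(needs_human_review("complaint", "medium"))  # True
print(needs_human_review("enquiry", "medium"))    # False
```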

Model inconsistency on edge cases. LLMs aren't deterministic. The same email run twice might get different classifications if your prompt has ambiguity. Test your category descriptions against real edge cases before going live and tighten the language where the model wavers.
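A cheap test for this is to run the same email through the classifier several times and measure how often the labels agree. The sketch below takes any callable that maps text to a label. The wavering classifier is a stand-in for a real model call, included so the example runs on its own.

```python
import itertools
from collections import Counter
from typing import Callable, Tuple


def agreement_rate(classify: Callable[[str], str], email_text: str, runs: int = 5) -> Tuple[str, float]:
    """Classify the same text repeatedly; return the top label and its share of runs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    labels = [classify(email_text) for _ in range(runs)]
    top_label, top_count = Counter(labels).most_common(1)[0]
    return top_label, top_count / runs


def make_wavering_classifier(labels: list) -> Callable[[str], str]:
    """Stand-in for a model call: cycles through a fixed sequence of labels."""
    cycle = itertools.cycle(labels)
    return lambda _text: next(cycle)


wavering = make_wavering_classifier(
    ["complaint", "complaint", "enquiry", "complaint", "complaint"]
)
print(agreement_rate(wavering, "The van was late and I'd like to rebook.", runs=5))
# ('complaint', 0.8)
```

Emails that score well below 1.0 are the ones whose category descriptions need tightening.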

None of these are deal-breakers. They're engineering problems with engineering solutions. But go in expecting them, not surprised by them.



FAQ



Do I need technical staff to build an email triage system?

You need someone who can work with APIs and build logic workflows, either a developer or someone comfortable with tools like n8n. The LLM classification piece is more accessible than it sounds, but the API integrations and routing logic require technical setup. Most businesses bring in a specialist for the initial build, then manage it themselves once it's running.



What's the difference between this and tools like SaneBox or Superhuman?

Consumer tools like SaneBox and Superhuman are built for personal inbox management. They help individuals process their own email faster. Email triage automation is an operational system that routes emails between people and systems, triggers workflows, and integrates with your CRM, job management software, or Slack. Different problem, different build, completely different outcome.



How accurate is the AI classification?

With well-defined categories and a clear prompt, accuracy is typically 85-95% on emails that fit a known category. The remainder get caught by the fallback queue and reviewed by a human. Accuracy improves as you refine the category descriptions based on real misclassifications in the first few weeks of operation.
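To track accuracy yourself, keep a hand-labelled sample and compare it with what the model predicted. The sketch below reports overall accuracy and which category pairs get confused most often. The four sample pairs are made up to show the output shape and are not results from a real inbox.

```python
from collections import Counter
from typing import List, Tuple


def evaluate(pairs: List[Tuple[str, str]]) -> Tuple[float, Counter]:
    """Take (expected, predicted) label pairs; return accuracy and confusion counts."""
    if not pairs:
        raise ValueError("need at least one labelled example")
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    confusions = Counter(
        (expected, predicted) for expected, predicted in pairs if expected != predicted
    )
    return correct / len(pairs), confusions


# Made-up sample for illustration only.
sample = [
    ("enquiry", "enquiry"),
    ("complaint", "complaint"),
    ("complaint", "enquiry"),
    ("supplier", "supplier"),
]
accuracy, confusions = evaluate(sample)
print(accuracy)                   # 0.75
print(confusions.most_common(1))  # [(('complaint', 'enquiry'), 1)]
```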



Can this work with a shared inbox like info@ or operations@?

Yes, and shared inboxes are often the best candidates for this because multiple people are already trying to triage the same emails manually. The system processes inbound emails the same way regardless of whether the inbox is shared. The routing logic then determines who or what system handles each one.



Should I auto-send the drafted replies or always have a human review?

Start with human review for everything. Once you've seen the system operate for a few weeks and understand where it consistently gets things right, you can consider auto-sending specific low-risk categories, things like confirmation emails or standard information requests. Complaints, anything financial, and anything requiring a judgment call should stay behind a human review gate indefinitely.



How long does it take to build and deploy?

A basic triage system with three to five categories, API integration, and routing logic typically takes two to four weeks to build and test properly. Most of that time isn't the technical build. It's defining the categories, testing against real email data, and tuning the classification prompt until the accuracy is solid enough to trust.

If this sounds like your operations team's inbox, we should talk. Book a free audit at amplconsulting.ai and we'll map out exactly what a triage system would look like for your specific email volume and workflow.