Human-in-the-Loop: When AI Needs Human Oversight

Param Harrison

We often strive for "fully autonomous" agents. But in the real world, full autonomy can be dangerous.

If you are building a Legal Case Intake bot, a Financial Lead Scorer, or an Email Assistant, an AI mistake isn't just a bug—it's a liability. Imagine your bot hallucinating a low score for a VIP client and sending them an automated rejection email.

This post isn't about "prompting." It is about System Architecture. We will explore how to build Human-in-the-Loop (HITL) workflows that can pause execution, wait days for a human signal, and then resume exactly where they left off.

The engineering problem: The "Runaway Train"

In a standard Python script, once you start a process, it runs to completion.

# A risky autonomous loop
lead = analyze_lead(user_input)
score = score_lead(lead)
if score < 50:
    send_rejection_email(lead) # <--- DANGER ZONE

If score_lead hallucinates, the email is sent instantly. You can't stop it.

You might think: "I'll just add input('Approve?')."

But this fails in production. If the human takes 3 hours to approve, your HTTP request times out, your server crashes, or a deployment restarts the process, and all in-memory state is gone.

The Solution: We need State Persistence (Checkpointing).
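
Before reaching for a framework, it helps to see how small the core idea is: serialize the state under a workflow ID, let the process die, and reload when the human finally responds. A minimal hand-rolled sketch (the JSON file store, field names, and IDs are illustrative, not a production design):

import json
from pathlib import Path

STORE = Path("checkpoints")   # stand-in for Postgres/Redis
STORE.mkdir(exist_ok=True)

def save_state(thread_id: str, state: dict) -> None:
    # Serialize the agent's entire state so the process can exit safely
    (STORE / f"{thread_id}.json").write_text(json.dumps(state))

def load_state(thread_id: str) -> dict:
    return json.loads((STORE / f"{thread_id}.json").read_text())

# --- Request 1: score the lead, then stop at the breakpoint ---
state = {"lead": "Acme Corp", "score": 42, "next_step": "send_email"}
save_state("lead_123", state)
# The process can now exit; nothing is held in memory while the human decides.

# --- Request 2 (hours or days later): the human approved ---
state = load_state("lead_123")
if state["next_step"] == "send_email":
    print(f"Resuming: emailing {state['lead']}")  # send_email(...) would run here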

Architecture 1: The "Breakpoint" pattern

To handle long-running human pauses, we need a system that saves the agent's entire memory (State) to a database at specific "Breakpoints."

The Flow:

  1. AI: Runs steps A -> B (Scoring).
  2. System: Hits a Breakpoint (configured before the 'Email' step).
  3. System: Serializes and Saves the State to a database (Postgres/Redis).
  4. System: Stops execution. The API responds "Waiting for input."
  5. Human: (24 hours later) Logs into a dashboard, reviews the score, and clicks "Approve."
  6. System: Loads the State from the DB and resumes at Step C (Emailing).

graph TD
    A[Input: New Lead] --> B(AI: Score Lead)
    B --> C{Check Confidence}
    C -- "High (>90%)" --> D[Auto-Email]
    C -- "Low/Med (<90%)" --> E[BREAKPOINT]
    
    E --> F[Save State to DB]
    F --> G[(Persistence Layer)]
    
    subgraph HUMAN["The Human Gap (Hours/Days)"]
        G -.-> H((Human Manager))
    end
    
    H --> I[API: /resume_workflow]
    I --> J[Load State & Resume]
    J --> D
    
    style E fill:#ffebee,stroke:#b71c1c
    style H fill:#fff9c4,stroke:#fbc02d

The implementation (LangGraph style)

In modern frameworks like LangGraph, this isn't a hack; it's a first-class feature called interrupt_before.

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

# 1. Define the Graph
# (AgentState, scoring_node, and email_node are defined elsewhere)
workflow = StateGraph(AgentState)
workflow.add_node("score_lead", scoring_node)
workflow.add_node("send_email", email_node)

workflow.add_edge(START, "score_lead")
workflow.add_edge("score_lead", "send_email")
workflow.add_edge("send_email", END)

# 2. Compile with Persistence and Interrupts
# We tell the engine: "Do NOT run 'send_email' automatically."
memory = MemorySaver()  # swap for a Postgres/Redis-backed checkpointer in production
app = workflow.compile(
    checkpointer=memory,
    interrupt_before=["send_email"]
)

# --- Runtime (Step 1) ---
# The graph runs 'score_lead' and then STOPS.
thread_config = {"configurable": {"thread_id": "lead_123"}}
app.invoke(inputs, thread_config)

# ... Time passes ...

# --- Runtime (Step 2) ---
# The human clicks "Approve". We call invoke again with None.
# The graph sees the saved state is at 'send_email' and resumes.
app.invoke(None, thread_config)
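
The diagram above also routes on confidence before deciding whether to pause. In LangGraph that becomes a conditional edge plus a small gate node to interrupt on, wired in place of the direct score_lead -> send_email edge before compiling. A hedged sketch; the confidence field, the 0.9 threshold, and the node names are assumptions:

# Replace the straight score_lead -> send_email edge with confidence routing.
def human_review(state: AgentState):
    # No-op gate node; it exists only so we can interrupt before it
    pass

def route_on_confidence(state: AgentState) -> str:
    return "auto" if state.get("confidence", 0) > 0.9 else "review"

workflow.add_node("human_review", human_review)
workflow.add_conditional_edges(
    "score_lead",
    route_on_confidence,
    {"auto": "send_email", "review": "human_review"},
)
workflow.add_edge("human_review", "send_email")

app = workflow.compile(
    checkpointer=memory,
    interrupt_before=["human_review"],  # only the low-confidence path pauses
)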

Observation: We have decoupled the workflow's logic from wall-clock time. The agent can wait indefinitely, holding no open connection and consuming no compute while the human decides.
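
You can see this from the checkpoint itself: while the graph is paused, the saved snapshot records what has run and which node comes next, which is exactly what an approval dashboard needs to display. A small sketch using LangGraph's state inspection (the printed fields assume the schema above):

# Inspect the paused thread via its checkpoint
snapshot = app.get_state(thread_config)

print(snapshot.next)     # e.g. ('send_email',) while waiting for approval
print(snapshot.values)   # the full saved state (score, lead data, ...)

# A dashboard can list every thread stuck at the breakpoint and only call
# app.invoke(None, ...) once a human clicks "Approve".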

Architecture 2: The "Feedback Injection" pattern

HITL isn't just about saying "Yes" or "No." It's often about Correction.

Use Case: A Legal Assistant drafts a contract clause.

  • AI Draft: "The jurisdiction is California."
  • Human Review: "No, change the jurisdiction to Delaware."

If we just resume, the AI still thinks the draft is fine. We need to Mutate the State before resuming.

graph TD
    A[AI: Draft Clause] --> B[Breakpoint]
    B -.-> C((Human Review))
    
    C --> D[Feedback: Change to Delaware]
    D --> E[Update State]
    E --> F[New Message: User says...]
    F --> A
    
    style E fill:#e3f2fd,stroke:#0d47a1

The implementation

We inject a "fake" message from the user into the agent's history before resuming.

# ... (Graph is paused at breakpoint) ...
from langchain_core.messages import HumanMessage

# Human provides feedback string
human_feedback = "Change jurisdiction to Delaware."

# We UPDATE the state in the database directly
app.update_state(
    thread_config,
    {"messages": [HumanMessage(content=human_feedback)]},
    as_node="human_reviewer" # Pretend this came from a node
)

# Now we resume. The AI sees the feedback and re-runs the draft node.
app.invoke(None, thread_config)

Engineering Insight: This turns the Human into just another "Tool" or "Agent" in the system. The AI doesn't know it waited 2 days; it just sees a new message: "Change to Delaware."
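
In practice, both kinds of resumption sit behind a thin HTTP surface, like the /resume_workflow endpoint in the first diagram. A minimal sketch using FastAPI; the framework choice, routes, and payload shapes are assumptions rather than anything LangGraph prescribes (app is the compiled graph from earlier):

from fastapi import FastAPI
from langchain_core.messages import HumanMessage

api = FastAPI()  # named "api" to avoid clashing with the compiled graph "app"

def config_for(thread_id: str) -> dict:
    return {"configurable": {"thread_id": thread_id}}

@api.post("/approve/{thread_id}")
def approve(thread_id: str):
    # Architecture 1: plain approval, resume from the breakpoint unchanged
    app.invoke(None, config_for(thread_id))
    return {"status": "resumed"}

@api.post("/revise/{thread_id}")
def revise(thread_id: str, feedback: str):
    # Architecture 2: inject the human's correction, then resume
    app.update_state(
        config_for(thread_id),
        {"messages": [HumanMessage(content=feedback)]},
        as_node="human_reviewer",
    )
    app.invoke(None, config_for(thread_id))
    return {"status": "resumed_with_feedback"}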

Summary: Designing for safety

When architecting your agent, use this decision matrix:

  • Low Risk (e.g., Summarizing News): use the Autonomous pattern. No breakpoints; run start to finish.
  • Medium Risk (e.g., Drafting Emails): use Feedback Injection. Generate draft -> Pause -> Human Edits -> Send.
  • High Risk (e.g., Transferring Money): use Strict Approval. Prepare transaction -> Pause -> Human 'Yes' -> Execute.
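
In code, the matrix often collapses to a single decision: which node names, if any, go into interrupt_before. A small sketch reusing the graph and checkpointer from earlier; the tier names and the execute_transfer node are illustrative:

# The risk tier decides which nodes pause for a human. Tier and node names
# are illustrative; "execute_transfer" is not defined in the earlier graph.
INTERRUPTS_BY_RISK = {
    "low": [],                       # autonomous: run start to finish
    "medium": ["send_email"],        # pause for human edits before sending
    "high": ["execute_transfer"],    # strict approval before executing
}

app = workflow.compile(
    checkpointer=memory,
    interrupt_before=INTERRUPTS_BY_RISK["medium"],
)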

Challenge for you

Scenario: You are building an Expense Approval Bot.

  • Requirement:
    • Expenses under $50 are Auto-Approved.
    • Expenses of $50 or more need Human Approval.
    • If the Human rejects it, the bot must ask the user for "More Details."

Your Task:

  1. Draw the graph. Where is the Conditional Edge?
  2. Where is the Breakpoint?
  3. How do you handle the "Rejection" loop? (Hint: Does the human update the state to say "Rejected"?)
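
If you want to attempt it in code, here is a starting skeleton only; the state schema and node functions are placeholders, and the routing, breakpoint, and rejection loop are deliberately left as TODOs:

from langgraph.graph import StateGraph

# ExpenseState and the node functions below are placeholders you would define.
workflow = StateGraph(ExpenseState)
workflow.add_node("intake", intake_node)              # parse the expense
workflow.add_node("auto_approve", auto_approve_node)  # the under-$50 path
workflow.add_node("human_approval", human_gate_node)  # the $50-and-over path
workflow.add_node("request_details", details_node)    # runs after a rejection

# TODO 1: add_conditional_edges from "intake" on the amount (the $50 threshold)
# TODO 2: decide which node belongs in interrupt_before=[...]
# TODO 3: after a rejection, how does the human's update route back to "request_details"?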

Key takeaways

  • State persistence enables long pauses: By saving the agent's state to a database, you can pause execution for hours or days without losing context
  • Breakpoints prevent runaway execution: Configure interrupt points before risky operations like sending emails or transferring money
  • Feedback injection allows correction: Update the state with human feedback before resuming, so the AI can incorporate corrections naturally
  • Humans become part of the system: Treat human reviewers as nodes in your graph, not external blockers
  • Risk levels determine patterns: Low-risk tasks can be autonomous, while high-risk tasks require strict approval gates
  • Time decoupling is critical: The agent's logic should be independent of how long humans take to respond

For more on building safe AI systems, see our agent architecture guide and our observability guide.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
