Browser Automation: Building Agents That See and Click
In our previous posts, we mastered RAG (see our RAG introduction). RAG allows agents to read your internal data. Search APIs (like Tavily) allow agents to read the indexed web.
But what if an agent needs to do something?
- Log into a vendor portal to download an invoice.
- Fill out a job application form on 50 different sites.
- Navigate a complex, dynamic dashboard to take a screenshot.
This is Browser Automation, and it is the new frontier of Agentic AI. It moves us from "Read-Only" AI to "Read-Write" AI.
This post is for engineers who know how to scrape <html> but want to build agents that can navigate the modern, dynamic web like a human does.
The problem: HTML is "Noise"
If you've written a scraper, you know the pain. You assume the "Submit" button has the ID #submit-btn. But modern React/Tailwind sites look like this:
<div class="flex items-center justify-center p-4 bg-blue-500 hover:bg-blue-700 text-white font-bold rounded">
<span class="css-1x23f">Next Step</span>
</div>
If you feed this raw HTML to an LLM, two things happen:
- Token Explosion: A single webpage can consume 100k tokens of messy `<div>` soup.
- Hallucination: The model gets lost in the CSS classes and can't find the "button."
We need a way for the agent to "see" the page without reading the Matrix code.
The solution: The "Vision + Accessibility" stack
We don't use raw HTML. We use a hybrid of Computer Vision and the Accessibility Tree (AXTree).
1. The Accessibility Tree (The "Semantic" view)
Browsers already generate a simplified version of the page for screen readers. It strips away the <div> noise and leaves the meaning.
- HTML: `<div class="css-123" role="button">Submit</div>`
- AXTree: `Button: "Submit" [ID: 42]`
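If you want to see this view for yourself, here is a minimal sketch using Playwright's accessibility snapshot. The URL and the set of roles we keep are illustrative assumptions, not part of any particular agent framework.

```python
# Minimal sketch: dump the accessibility tree with Playwright and list actionable nodes.
# The URL and the role filter are illustrative; adjust them for your target page.
from playwright.sync_api import sync_playwright

def interactive_nodes(node, out=None):
    """Flatten the AXTree, keeping only elements an agent can act on."""
    if out is None:
        out = []
    if node.get("role") in {"button", "link", "textbox", "combobox", "checkbox"}:
        out.append({"role": node["role"], "name": node.get("name", "")})
    for child in node.get("children", []):
        interactive_nodes(child, out)
    return out

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    tree = page.accessibility.snapshot() or {}  # the same tree screen readers consume
    browser.close()

for i, el in enumerate(interactive_nodes(tree)):
    print(f'[ID: {i}] {el["role"].title()}: "{el["name"]}"')
```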
2. The Vision Model (The "Human" view)
Sometimes, semantic tags are missing. A button is just an image. This is where Multimodal LLMs (like GPT-4o) shine. They can look at a screenshot and say, "There is a blue button in the top right."
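As a rough sketch of that "human view", here is how you could show a screenshot to a multimodal model with the OpenAI Python SDK. The screenshot path, prompt text, and model name are assumptions for illustration.

```python
# Sketch: ask a multimodal model to describe the clickable elements in a screenshot.
# The screenshot path, prompt text, and model name are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List every clickable element you see and describe where it is."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```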
The architecture
We build an agent loop that combines these two inputs.
graph TD
A[Start: Book a flight to NY] --> B(Browser Engine)
B --> C[Screenshot Pixels]
B --> D[AXTree Semantic DOM]
subgraph VISION["The Agent's Vision"]
C --> E(Multimodal LLM)
D --> E
E --> F[Visual Map with unique IDs]
end
F --> G[Decision: Click ID 42]
G --> H(Execute Playwright Action)
H --> B
style E fill:#e3f2fd,stroke:#0d47a1
style F fill:#e8f5e9,stroke:#388e3c
Observation: By overlaying unique IDs (1, 2, 3...) on the screenshot/AXTree, we turn a complex coordinate problem ("Click at x:200, y:400") into a simple classification problem ("Click #42").
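Concretely, the loop keeps a lookup table from those overlay IDs to something the browser can act on, so the model only ever has to output an integer. A minimal sketch, assuming Playwright and the role/name pairs extracted from the AXTree step above:

```python
# Sketch: translate an LLM decision like "Click #42" into a real browser action.
# Assumes `page` is a Playwright page and `elements` is the AXTree list built earlier.
def build_id_map(elements):
    """Map overlay IDs (what the LLM sees) to role/name pairs (what Playwright can locate)."""
    return {i: {"role": el["role"], "name": el["name"]} for i, el in enumerate(elements)}

def execute(page, id_map, action):
    """Dispatch one decision, e.g. {"op": "click", "id": 42} or {"op": "type", "id": 12, "text": "Jane"}."""
    target = id_map[action["id"]]
    locator = page.get_by_role(target["role"], name=target["name"])
    if action["op"] == "click":
        locator.click()
    elif action["op"] == "type":
        locator.fill(action["text"])
```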
Engineering pattern: Dynamic form filling
The hardest part of browser automation is filling forms that the agent has never seen before.
Use Case: Universal Job Applicant
You have a user's CV profile. You want the agent to apply to jobs on LinkedIn, Indeed, and Workday. Every form is different.
We cannot hardcode selectors (input[name='first_name']). We need Semantic Mapping.
The Logic:
- Agent: Analyzes the AXTree of the current page.
- Observation: Finds an input labeled "Given Name".
- Reasoning: "My user profile has `first_name: 'Jane'`. 'Given Name' is semantically equivalent to 'first_name'."
- Action: `type(id=12, text="Jane")` (a minimal sketch of this mapping step follows the list).
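Here is a minimal sketch of that reasoning step, assuming the OpenAI SDK with JSON-mode output; the field labels, profile keys, and prompt wording are illustrative.

```python
# Sketch: map form labels found in the AXTree to keys in the user's profile.
# The labels, profile, and prompt wording are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

profile = {"first_name": "Jane", "last_name": "Doe", "email": "jane@example.com"}
form_fields = [{"id": 12, "label": "Given Name"}, {"id": 13, "label": "Surname"}]

prompt = (
    "Match each form field to a profile key. Respond with JSON like "
    '{"12": "first_name"}.\n'
    f"Profile keys: {list(profile)}\nFields: {form_fields}"
)
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": prompt}],
)
mapping = json.loads(response.choices[0].message.content)

# Turn the mapping into concrete type() actions for the browser layer.
actions = [{"op": "type", "id": int(fid), "text": profile[key]} for fid, key in mapping.items()]
print(actions)  # e.g. [{"op": "type", "id": 12, "text": "Jane"}, ...]
```

Zooming out, the whole loop can also be delegated to an off-the-shelf agent library: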
# Pseudo-code for a browser agent loop using a library like `browser-use`
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

# 1. Define the User's Data (The Context)
user_profile = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "experience": "5 years in Python..."
}

# 2. Initialize the Vision Model
llm = ChatOpenAI(model="gpt-4o")

# 3. Define the Task
task = "Go to 'jobs.example.com', find the 'Apply' button, and fill out the form using my profile."

async def main():
    # 4. Run the Agent
    agent = Agent(
        task=task,
        llm=llm,
        # Inject user data into the system prompt
        # (kwarg name is illustrative; check the browser-use docs for the exact argument)
        system_prompt=f"You are an applicant. Use this profile: {user_profile}",
    )
    await agent.run()

asyncio.run(main())
Engineering Insight: The agent doesn't just "fill fields." It handles State Changes.
- It clicks "Next."
- It waits for the page to load (Vision check).
- It sees a generic error ("Invalid Input").
- It self-corrects ("Ah, the phone number needs dashes") and retries, as in the sketch below.
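A stripped-down version of one self-correcting step might look like this; `observe`, `decide`, and `execute` are placeholders standing in for the screenshot/AXTree, LLM, and Playwright layers described above, not functions from any library.

```python
# Sketch: feed failures back to the model so it can self-correct on the next step.
# observe/decide/execute stand in for the vision, LLM, and browser layers above.
def agent_step(goal, page, history, observe, decide, execute):
    state = observe(page)                  # fresh screenshot + AXTree after the last action
    action = decide(goal, state, history)  # e.g. {"op": "type", "id": 7, "text": "555-0100"}
    try:
        execute(page, action)
        history.append((action, "ok"))
    except Exception as err:
        # The failure text ("Invalid Input", timeout, element not found) goes back
        # into the next prompt, so the model can retry with a correction.
        history.append((action, f"error: {err}"))
    return action
```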
The risks: Infinite loops and traps
The web is hostile to bots. Your agent will get stuck.
1. The "Infinite Scroll" trap
Agents often try to "read everything." On sites like Twitter/X, they will scroll forever, filling their context window until they crash.
Fix: Set a strict Step Limit (e.g., "Max 20 actions"). If the goal isn't met, bail out.
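Reusing the hypothetical `agent_step` from the sketch above, the budget is just a bounded loop that gives up cleanly:

```python
# Sketch: a hard action budget so the agent bails out instead of scrolling forever.
MAX_STEPS = 20

def run_with_budget(goal, page, observe, decide, execute):
    history = []
    for _ in range(MAX_STEPS):
        action = agent_step(goal, page, history, observe, decide, execute)
        if action.get("op") == "done":
            return {"status": "success", "history": history}
    # Budget exhausted: report what happened instead of looping until the context explodes.
    return {"status": "gave_up", "history": history}
```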
2. The "Pop-up" blocker
A "Subscribe to Newsletter" modal often covers the button the agent needs to click. A text-only agent won't know why the click failed. A vision agent sees the modal.
Fix: Add a specific instruction: "If a pop-up obstructs your view, look for an 'X', 'Close', or 'No Thanks' button and click it first."
3. CAPTCHAs
This is the hard wall.
Fix: Human Handoff.
When the agent detects a CAPTCHA (via vision or text), it should pause execution (input_required state) and send a screenshot to the user: "I am stuck. Please solve this CAPTCHA and press Enter."
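A minimal sketch of that handoff, with `detect_captcha` and `notify_user` as placeholder hooks you would wire to your vision check and messaging channel:

```python
# Sketch: pause in an "input_required" state and wait for a human to clear the CAPTCHA.
# detect_captcha and notify_user are placeholder hooks, not library functions.
def human_handoff_if_captcha(page, detect_captcha, notify_user):
    if not detect_captcha(page):
        return False
    screenshot = page.screenshot()  # Playwright returns PNG bytes
    notify_user("I am stuck. Please solve this CAPTCHA, then press Enter.", screenshot)
    input("Waiting for human... press Enter once the CAPTCHA is solved: ")
    return True
```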
Summary: Building robust web agents
| Feature | Traditional Scraper | AI Browser Agent |
|---|---|---|
| Targeting | Hardcoded CSS Selectors | Visual/Semantic IDs |
| Resilience | Brittle (Breaks on UI update) | High (Adapts to UI changes) |
| Logic | Linear Script | Dynamic Loop (Reason -> Act) |
| Cost | Cheap | Expensive (Vision Tokens) |
Challenge for you
Scenario: You are building an "Amazon Price Monitor" that checks a specific product page.
- The Problem: Sometimes Amazon shows a "One-time purchase" price, and sometimes it defaults to "Subscribe & Save" (which is lower). You want the real price.
- Your Task:
  - How would you instruct the agent to ensure it captures the "One-time purchase" price?
  - If an "Accept Cookies" banner covers the price, how does your loop handle it?
  - Draw the decision graph. (Start -> Check for Popups -> Close -> Check Price Type -> Click 'One-time' -> Read Price).
Key takeaways
- Vision + Accessibility Tree beats raw HTML: Combining screenshots with semantic accessibility trees gives agents a human-like understanding of web pages
- Unique IDs simplify interaction: Overlaying numbered IDs on visual elements turns coordinate problems into classification problems
- Semantic mapping enables dynamic forms: Agents can map user data to form fields semantically, without hardcoded selectors
- State changes require vision: Agents must detect page loads, errors, and UI changes to handle dynamic web interactions
- Step limits prevent infinite loops: Setting maximum action counts protects against agents getting stuck in scroll traps
- Pop-up handling is critical: Vision models can detect and close modals that block interactions
- CAPTCHAs require human handoff: When automation hits hard limits, gracefully pause and request human intervention
- Cost vs resilience trade-off: Vision-based agents are expensive but adapt to UI changes, while traditional scrapers are cheap but brittle
For more on building agentic systems, see our tool calling guide, our multi-agent coordination guide, and our workflow orchestration guide.
For more on building production AI systems, check out our AI Bootcamp for Software Engineers.