AI Engineering in Practice: Building an AI Bedtime Story Generator

Param Harrison
16 min read


Welcome to our project-based learning series: The AI Application Stack!

Our mission: To build a complete AI Bedtime Story Generator. We'll start with the simplest "magic trick" (getting an AI to tell a story) and slowly, step-by-step, add all the professional, production-ready layers—like user configuration, streaming, and robust error handling.

This is Part 1, where we'll build the entire application from its first line of code, focusing on how a professional AI application handles user input and delivers an engaging experience.

1. The problem: why do most AI apps feel slow?

We've all used AI tools that do this:

  1. You type your prompt and hit "Enter."
  2. You wait... and wait... for 10, 20, or even 30 seconds.
  3. A huge wall of text appears all at once.

This is called a blocking request. It's technically functional, but the user experience is terrible. It feels slow, broken, and leaves the user wondering if it even worked.

sequenceDiagram
    participant User
    participant Your_API
    participant LLM
    
    User->>Your_API: Generate a story
    activate Your_API
    Your_API->>LLM: Write a 300-word story
    activate LLM
    
    note right of LLM: ...thinking for 20 seconds...
    
    LLM-->>Your_API: [Full 300-word story]
    deactivate LLM
    Your_API-->>User: [Full 300-word story]
    deactivate Your_API
    
    note left of User: (Stared at a loading spinner the whole time)

Our goal is to build a system that feels fast and intelligent. But first, we must build the "slow" version to understand why the "fast" version is so much better.

2. Task 1: The magic trick (a hardcoded story)

Let's build the absolute simplest version. We'll use FastAPI as our web server. This version will have no user configuration. It will just be a single "endpoint" (a URL) that tells the AI to generate one hardcoded story.

This proves we can connect to the AI.

The Code:

# main.py
from fastapi import FastAPI
from openai import OpenAI

# --- Setup ---
app = FastAPI(title="Story Generator API")

# Initialize the OpenAI client. It will automatically look for OPENAI_API_KEY
client = OpenAI() 

@app.get("/generate-story-hardcoded") # Define a GET endpoint
async def generate_story_hardcoded():
    """
    This is our "magic trick." It's simple and hardcoded.
    It proves our connection to the LLM works.
    """
    
    # The prompt is hardcoded directly into the function
    prompt = "Write a short, 3-paragraph bedtime story about a friendly dragon."
    
    # This is a "blocking" call. Our server waits here until the LLM finishes.
    response = client.chat.completions.create(
        model="gpt-4o-mini", # Using a cost-effective model for this demo
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8 # Controls creativity (0.0 for deterministic, 1.0 for highly creative)
    )
    
    story_content = response.choices[0].message.content
    return {"story": story_content}

# This is a simple health check endpoint. Good practice for any API.
@app.get("/health")
def health_check():
    return {"status": "healthy", "message": "API is running"}

# To run this app:
# 1. Save this file as main.py
# 2. Make sure you have `openai`, `fastapi`, and `uvicorn` installed:
#    pip install openai fastapi uvicorn
# 3. Set your OpenAI API key in your environment:
#    export OPENAI_API_KEY='your_api_key_here'
# 4. Run the server: uvicorn main:app --reload
# 5. Visit http://localhost:8000/generate-story-hardcoded in your browser.

Result: You have a working app! It connects to the LLM and gets a story back.

graph TD
    A["User Opens Browser <br/> to /generate-story-hardcoded"] --> B[FastAPI Endpoint]
    B --> C["LLM: 'Tell me about a dragon'"]
    C --> D[Full Story Content]
    D --> B
    B --> E["User Sees Full Story in Browser"]

Observation Question: What happens if you refresh the page multiple times? Do you always get the exact same story? Why or why not? (Hint: look at the temperature parameter)

3. Task 2: Adding one config (a dynamic character)

A hardcoded story is boring. Let's make it slightly more interesting by allowing the user to provide a character name via the URL.

The Code:

# main.py (add this new endpoint alongside the first one)
@app.get("/generate-story-with-name") # New endpoint path
async def generate_story_with_name(character_name: str): # <-- Our new config as a function parameter!
    """
    This version takes a 'character_name' from the URL.
    Example: http://localhost:8000/generate-story-with-name?character_name=Leo
    """
    
    # The prompt is now dynamic, using the 'character_name' variable
    prompt = f"""
    Write a short, 3-paragraph bedtime story about a character
    named {character_name}.
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
    
    story_content = response.choices[0].message.content
    return {"story": story_content}

Result: The user now has some control over the story.

graph TD
    A["User Calls <br/> /generate-story-with-name?character_name=Leo"] --> B[FastAPI Endpoint]
    B --> C["LLM: 'Tell me about Leo...'"]
    C --> D[Full Story Content]
    D --> B
    B --> E["User Sees Full Story"]

Observation Question: What happens if you try to call http://localhost:8000/generate-story-with-name without adding ?character_name=...? How does FastAPI respond, and why?

4. Task 3: Refining the prompt (controlling length)

Our stories are still a bit too long and rambling. Before we add more features, let's refine the output of the current ones. We'll add a simple instruction to our prompt to explicitly control the length.

This is a key aspect of Prompt Engineering: guiding the LLM to give you consistent output.

The Code:

# main.py (just updating the prompt string within the generate_story_with_name function)
@app.get("/generate-story-with-name")
async def generate_story_with_name(character_name: str):
    
    prompt = f"""
    Write a short, 3-paragraph bedtime story about a character
    named {character_name}.
    
    IMPORTANT: Keep the entire story under 150 words.
    """ # <-- Our new, explicit instruction for length!
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
    story_content = response.choices[0].message.content
    return {"story": story_content}

Result: Our output is now more consistent in length. We've constrained the LLM to give us what we want.

Observation Question: Even with an explicit prompt instruction for "150 words," the LLM might still go slightly over. Why do you think this happens? What are the limitations of strict length control with LLMs? (Hint: would adding the API's max_tokens parameter guarantee exactly 150 words?)

5. Task 4: Making length configurable

Hardcoding "150 words" isn't ideal. We want the user to choose the story length (e.g., "short," "medium," "long"). Let's add a story_length option to our URL.

The Code:

# main.py (updating the function signature and prompt construction)
@app.get("/generate-story-with-config") # New endpoint path
async def generate_story_with_config(character_name: str, story_length: str = "short"): # <-- 'story_length' with a default!
    """
    Now takes 'story_length' ("short", "medium", "long")
    Example: http://localhost:8000/generate-story-with-config?character_name=Leo&story_length=long
    """
    
    # We create a "prompt factory" (a dictionary) to map user input to LLM instructions
    length_requirements = {
        "short": "3 paragraphs, about 100 words",
        "medium": "5 paragraphs, about 200 words", 
        "long": "8 paragraphs, about 300 words"
    }
    
    # Use .get() to safely retrieve the length, providing a default if 'story_length' is invalid
    length_text = length_requirements.get(story_length, length_requirements["short"])
    prompt = f"""
    Write a bedtime story for a character named {character_name}.
    Make it {length_text}.
    """
    
    # (The rest of the function is the same...)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
    story_content = response.choices[0].message.content
    return {"story": story_content}

Result: The app is more flexible, but the URL is getting messy with multiple query parameters. We're also not handling other inputs like story_theme or character_age efficiently.

Observation Question: What are the drawbacks of using URL query parameters for many configurations? Think about data types (how would you send a list?), complexity, and potential errors.

6. Task 5: The pro config (Pydantic & POST)

Using URL query parameters is messy and unsafe for complex applications. A professional app uses a "data contract" for its API. In FastAPI, this is done with Pydantic.

We will now change our endpoint from a GET to a POST request. This lets the user send a clean JSON "request body" instead of a messy URL.

The Code:

# main.py (a major upgrade to our input handling!)
from pydantic import BaseModel

# --- Our "Data Contract" (The Guard) ---
# This class defines the expected structure and types of our incoming JSON data.
class StoryRequest(BaseModel):
    character_name: str
    character_age: int  # FastAPI + Pydantic will ensure this is an integer!
    story_theme: str
    story_length: str   # "short", "medium", or "long"

# --- Our "Prompt Factory" (The Instructions) ---
# We refactor our prompt logic into its own clean function for better organization.
def create_full_prompt(request: StoryRequest) -> str:
    length_requirements = {
        "short": "3-5 paragraphs, about 100-150 words",
        "medium": "5-7 paragraphs, about 200-250 words", 
        "long": "8-12 paragraphs, about 300-400 words"
    }
    # .get() handles cases where 'story_length' might be unexpected
    length_text = length_requirements.get(request.story_length, length_requirements["medium"])
    return f"""
    Write a personalized bedtime story for a {request.character_age}-year-old 
    named {request.character_name} about {request.story_theme}.
    Make it {length_text}.
    End with a gentle moral lesson that emphasizes kindness, bravery, or friendship.
    """

# --- The "Brain" (The POST Endpoint) ---
# We change from @app.get to @app.post. The request body is automatically validated by Pydantic.
@app.post("/generate-story")
async def generate_story_with_body(request: StoryRequest): # <-- It now takes our Pydantic model
    """
    This is our "blocking" endpoint. It's now robust and
    accepts a clean JSON body.
    """
    # 1. Create the prompt using our dedicated function
    prompt = create_full_prompt(request)
    
    # 2. Call the LLM (This is still a blocking call)
    print(f"Generating story for {request.character_name}...")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8
    )
    
    # 3. Extract and return the full story
    story_content = response.choices[0].message.content
    return {"story": story_content}

Result: We now have a robust, safe, and professional API. It validates input (e.g., ensuring character_age is an integer) and is easy to use.
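
To try the new endpoint, here is a minimal client sketch (assuming the server is running locally on port 8000 and that the `requests` library is installed; the file name is illustrative):

# client_example.py (illustrative)
import requests

payload = {
    "character_name": "Leo",
    "character_age": 5,
    "story_theme": "a trip to the moon",
    "story_length": "short",
}

# POST the JSON body; FastAPI validates it against StoryRequest before our code runs
response = requests.post("http://localhost:8000/generate-story", json=payload)
print(response.status_code)        # 200 when the body passes validation
print(response.json()["story"])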

graph TD
    A["User Sends POST Request <br/> with JSON Body"] --> B["FastAPI Endpoint"]
    B --> C{"Pydantic Validation"}
    C -- "Valid" --> D["create_full_prompt"]
    D --> E["LLM Call (Blocking)"]
    E --> F["Full Story Content"]
    F --> G["Return to Endpoint"]
    G --> H["User Sees Full Story"]
    C -- "Invalid" --> I["FastAPI Auto-responds <br/> with 422 Error"]

Observation Question: If you send a request with "character_age": "five" (a string) instead of "character_age": 5 (an integer), what HTTP status code does FastAPI return? Why is this automated validation so powerful for developers?

The Problem Persists: Even with this professional setup, there's a fundamental flaw: it's still SLOW. The user hits "Generate" and stares at a loading icon for 10-20 seconds while the LLM composes the entire story.

7. Task 6: Solving the slow problem (streaming)

Now that we have a solid, robust foundation, we can tackle the user experience. Instead of making the user wait for the whole story, we will stream it to them word-by-word, just like ChatGPT.

This is the most important technique for a modern AI application.

graph TD
    subgraph BLOCKING["Blocking API (Slow)"]
        A[User Clicks Generate] --> B(Server Waits 20s)
        B --> C[Full Story Appears]
    end
    
    subgraph STREAMING["Streaming API (Fast)"]
        D[User Clicks Generate] --> E(Word 1 Appears)
        E --> F(Word 2 Appears)
        F --> G(Word 3 Appears...)
    end

To do this, we need to add a new endpoint and a new "generator" function.

The streaming endpoint (the on-ramp)

This new endpoint's job is to tell the browser, "Get ready, I'm going to stream data to you." It uses a StreamingResponse and a special media_type called text/event-stream. This media_type is the standard for Server-Sent Events (SSE).

The Code:

# main.py (add these new imports and the new endpoint)
from fastapi.responses import StreamingResponse
from typing import AsyncGenerator, Optional
import json
import asyncio

@app.post("/stream-story") # Define a new POST endpoint for streaming
async def stream_story(request: StoryRequest):
    """
    This endpoint *initiates* the stream. It hands off
    the real work to the generator function (see next section).
    """
    return StreamingResponse(
        # 1. We pass in our "generator" function, which yields story chunks
        story_generator(request), 
        
        # 2. This is the magic media type for Server-Sent Events (SSE)
        media_type="text/event-stream",  
        headers={
            "Cache-Control": "no-cache", # Important: tells browsers NOT to cache this stream
            "Connection": "keep-alive",  # Important: keeps the connection open for continuous data
        }
    )

Result: We now have an endpoint that, when called, will open a persistent connection and prepare to send data chunks.

The async generator (the engine)

This is the new "brain" of our streaming app. It's an async function that uses the yield keyword instead of return.

  • A function with return sends data once and closes the connection.
  • An async def function with yield (an "async generator") sends a chunk of data and then keeps the connection open, ready to yield the next chunk.
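
To see the difference in isolation, here is a toy sketch of an async generator (names are illustrative), before we wire the idea into the real endpoint:

# toy_generator.py (illustrative)
import asyncio
from typing import AsyncGenerator

async def word_stream() -> AsyncGenerator[str, None]:
    for word in ["Once", "upon", "a", "time..."]:
        yield word                # send one chunk, then keep going; nothing is closed
        await asyncio.sleep(0.5)  # simulate waiting for the next chunk

async def main():
    async for word in word_stream():
        print(word)               # each word prints the moment it is yielded

asyncio.run(main())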

The Code:

# main.py (add this generator function)
async def story_generator(request: StoryRequest) -> AsyncGenerator[str, None]:
    """
    This function "yields" story chunks as they arrive from the LLM.
    This is where the real-time magic happens.
    """
    
    # 1. Create the prompt (using our existing 'create_full_prompt' function)
    prompt = create_full_prompt(request)
    # 2. Call the LLM in STREAMING mode
    # The 'stream=True' parameter is crucial here!
    try:
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,
            stream=True  # <--- This is the key to getting chunks back!
        )
        # 3. Loop over every chunk that the LLM sends back
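        # Note: `client` is the synchronous OpenAI client, so this loop blocks the event
        # loop while it waits for each chunk. That is fine for a demo; for heavier
        # concurrent traffic you would typically switch to AsyncOpenAI and `async for`.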
        for chunk in stream:
            content = chunk.choices[0].delta.content # Extract the actual text content
            
            if content: # Only process chunks that have text
                # 4. Format the chunk as a Server-Sent Event (SSE) message
                #    The browser expects "data: {json_string}\n\n"
                data_to_send = {"content": content}
                yield f"data: {json.dumps(data_to_send)}\n\n"
                
                # A tiny sleep to make the streaming effect more noticeable in local demos
                await asyncio.sleep(0.01)
        # 5. When the LLM stream finishes, send a final "done" signal
        yield f"data: {json.dumps({'done': True})}\n\n"
        
    except Exception as e:
        # 6. Always send errors in the same SSE format so the frontend can handle them
        print(f"An error occurred during streaming: {e}")
        yield f"data: {json.dumps({'error': str(e)})}\n\n"

Result: We now have a professional, high-performance API! It's still robust (thanks to Pydantic), but now it feels instant to the user.

sequenceDiagram
    participant User
    participant FastAPI_Streaming
    participant LLM_Stream
    
    User->>FastAPI_Streaming: POST /stream-story (JSON Body)
    activate FastAPI_Streaming
    FastAPI_Streaming->>LLM_Stream: Call LLM with stream=True
    activate LLM_Stream
    
    loop Stream Chunks
        LLM_Stream-->>FastAPI_Streaming: Text Chunk (e.g., "Once")
        FastAPI_Streaming-->>User: data: {"content": "Once"}\n\n
        LLM_Stream-->>FastAPI_Streaming: Text Chunk (e.g., " upon")
        FastAPI_Streaming-->>User: data: {"content": " upon"}\n\n
    end
    
    LLM_Stream-->>FastAPI_Streaming: [END OF STREAM]
    deactivate LLM_Stream
    FastAPI_Streaming-->>User: data: {"done": true}\n\n
    deactivate FastAPI_Streaming
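
To watch the chunks arrive on the client side, here is a minimal sketch (again assuming a local server and the `requests` library; a browser frontend would do the equivalent with fetch and a stream reader):

# stream_client_example.py (illustrative)
import json
import requests

payload = {
    "character_name": "Leo",
    "character_age": 5,
    "story_theme": "a trip to the moon",
    "story_length": "short",
}

with requests.post("http://localhost:8000/stream-story", json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue                                   # SSE messages are separated by blank lines
        message = json.loads(line.decode().removeprefix("data: "))
        if message.get("done"):
            break                                      # the server's end-of-stream signal
        if "error" in message:
            print(f"\nStream error: {message['error']}")
            break
        print(message["content"], end="", flush=True)  # render each chunk as it arrives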

Observation Question: If an error occurs in the story_generator function, why is it important to yield an error message in the SSE format (data: {"error": "..."}\n\n) instead of just letting the function crash?

8. Task 7: Structured output (markdown for streaming)

Our streaming API is great, but it just streams raw text. What if our frontend needs to show the Title in a big font and the Story in smaller paragraphs? We need structured output.

We have two main approaches:

  1. Ask for JSON: We could change the prompt to: "...Respond with a JSON object: {"title": "...", "story_paragraphs": [...]}".

    • The Problem: This breaks streaming! You can't parse a JSON object until you have the entire thing (you need the final } bracket). This forces the LLM to generate the full JSON first, putting us right back in the "blocking" world. (See the tiny demo after this list.)
  2. Ask for Markdown (The Better Way): We can ask the LLM to format its streaming text in a simple, parsable way (like Markdown). The frontend can then easily detect and render these formats as they arrive.
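
Here is the tiny demo mentioned above: a partially streamed JSON string is unusable until the very last character arrives.

# json_vs_streaming_demo.py (illustrative)
import json

partial = '{"title": "The Magical Forest", "story_paragraphs": ["Once upon'

try:
    json.loads(partial)
except json.JSONDecodeError as error:
    print(f"Cannot parse yet: {error}")  # incomplete JSON stays unparseable until the final brace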

Let's modify our prompt function one last time.

The Code:

# main.py (updating the create_full_prompt function)
def create_full_prompt(request: StoryRequest) -> str:
    # ... (same length logic as before) ...
    
    return f"""
    Write a personalized bedtime story for a {request.character_age}-year-old 
    named {request.character_name} about {request.story_theme}.
    Make it {length_text}.
    End with a gentle moral lesson that emphasizes kindness, bravery, or friendship.
    FORMATTING INSTRUCTIONS:
    Start the story with a title formatted as a Markdown H1 header:
    # [Your Story Title]
    
    Follow this with the story paragraphs. Ensure clear paragraph breaks.
    """

Result: Now, our frontend can receive the stream. When it sees a chunk of text starting with #, it knows to put that text in the <h1> title tag. Everything else goes into <p> tags. We get a structured output while still streaming.

sequenceDiagram
    participant LLM_Stream
    participant FastAPI_Streaming
    participant Frontend
    
    LLM_Stream-->>FastAPI_Streaming: "# The Magical Forest\n\n"
    FastAPI_Streaming-->>Frontend: data: {"content": "# The Magical Forest\n\n"}\n\n
    Frontend->>Frontend: Display as <h1>The Magical Forest</h1>
    
    LLM_Stream-->>FastAPI_Streaming: "Once upon a time..."
    FastAPI_Streaming-->>Frontend: data: {"content": "Once upon a time..."}\n\n
    Frontend->>Frontend: Append to <p>Once upon a time...</p>
    
    LLM_Stream-->>FastAPI_Streaming: "\n\nAs the sun set..."
    FastAPI_Streaming-->>Frontend: data: {"content": "\n\nAs the sun set..."}\n\n
    Frontend->>Frontend: Append to new <p>As the sun set...</p>
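
On the consuming side, only a little logic is needed to split the title from the paragraphs as text arrives. Here is a minimal sketch of the idea in Python (a JavaScript frontend would do the same thing while building DOM nodes):

# markdown_stream_renderer.py (illustrative)
def render_lines(buffer: str, chunk: str) -> str:
    """Append a streamed chunk, render any complete lines, and return the leftover buffer."""
    buffer += chunk
    *complete_lines, buffer = buffer.split("\n")  # keep the trailing partial line for next time
    for line in complete_lines:
        if line.startswith("# "):
            print(f"<h1>{line[2:]}</h1>")         # Markdown H1 becomes the story title
        elif line.strip():
            print(f"<p>{line}</p>")               # everything else becomes a paragraph
    return buffer

# Usage: feed in each streamed "content" value and keep the returned buffer between calls
buffer = ""
for chunk in ["# The Magical", " Forest\n\nOnce upon", " a time...\n"]:
    buffer = render_lines(buffer, chunk)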

Observation Question: What other Markdown elements (like **bold** or *italic*) could you instruct the LLM to use for more rich text formatting, and how would a frontend parse those?

Challenges: where to go from here?

We've built a complete, production-ready app. But the journey isn't over. Here are some "challenge tasks" you can try to build on top of this foundation.

Challenge 1: The outline

How would you give the user even more control over the story's direction?

  1. Add a config: Add a new, optional field to the StoryRequest Pydantic model: story_outline: Optional[str] = None.

  2. Tweak the prompt: Modify the create_full_prompt function. If request.story_outline exists, add this to the prompt: "...Follow this one-sentence outline: {request.story_outline}...".
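
A minimal sketch of both steps (the prompt is condensed here for brevity; the real create_full_prompt keeps its length logic):

# challenge_1_sketch.py (illustrative)
from typing import Optional
from pydantic import BaseModel

class StoryRequest(BaseModel):
    character_name: str
    character_age: int
    story_theme: str
    story_length: str
    story_outline: Optional[str] = None  # new: an optional one-sentence outline

def create_full_prompt(request: StoryRequest) -> str:
    # (length logic omitted here; it stays exactly as before)
    prompt = (
        f"Write a personalized bedtime story for a {request.character_age}-year-old "
        f"named {request.character_name} about {request.story_theme}."
    )
    if request.story_outline:
        prompt += f" Follow this one-sentence outline: {request.story_outline}."
    return prompt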

Challenge 2: The series

How would you create a "Chapter 2" of a story, building directly on a previous one?

  1. Add a config: Add an optional field to StoryRequest called previous_story: Optional[str] = None.

  2. Tweak the prompt: Modify the create_full_prompt function. If request.previous_story exists, change the entire prompt to something like:

    "You are a story continuation assistant. Here is the previous story: {request.previous_story}. Write a short, new chapter that continues the adventure for {request.character_name}. Ensure the new chapter has its own Markdown H1 title."

Challenge 3: Safety and content moderation

Our current app generates stories for children. How would you ensure the content is always appropriate?

  1. Pre-processing: Could you add a step before calling the LLM to check the story_theme or character_name for potentially inappropriate keywords?

  2. Post-processing: After the story is generated, could you use another LLM call (or a dedicated content moderation API) to review the story for any harmful content before sending it to the user? (Hint: OpenAI offers a /moderations endpoint.)
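
For the post-processing idea, here is a minimal sketch using the OpenAI SDK's moderations API (the exact model and thresholds are your choice; check the current OpenAI docs before relying on this in production):

# moderation_sketch.py (illustrative)
from openai import OpenAI

client = OpenAI()  # same setup as main.py; reads OPENAI_API_KEY from the environment

def is_story_safe(story_content: str) -> bool:
    """Ask the moderation endpoint whether the generated story is flagged as harmful."""
    result = client.moderations.create(input=story_content)
    return not result.results[0].flagged

# Usage inside the blocking endpoint, before returning the story:
# if not is_story_safe(story_content):
#     return {"error": "The generated story failed the safety check. Please try again."}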

Key takeaways

  • Start simple, then iterate: Begin with a hardcoded endpoint to prove the connection, then gradually add complexity
  • Pydantic provides safety: Using Pydantic models for request validation prevents errors and makes APIs more robust
  • Streaming transforms UX: Moving from blocking to streaming responses makes AI applications feel instant and responsive
  • Markdown enables structured streaming: Unlike JSON, Markdown can be parsed incrementally, making it perfect for streaming structured content
  • Prompt engineering matters: Small changes to prompts can dramatically improve output consistency and quality

By building this project step-by-step, you've learned the entire stack: from basic API design to Pydantic validation, prompt engineering, and finally, the high-performance streaming that defines a modern AI application.


For more on building production AI systems, check out our AI Bootcamp for Software Engineers.
