What if you could pressure-test your social media posts before you hit publish using AI agents that think like the algorithm, create like a strategist, and react like your actual audience?
That's what we're building in this tutorial. Using Microsoft Agent Framework, we'll create a multi-agent system where three specialised AI agents collaborate to help gaming content creators craft posts that actually perform. One agent generates platform-native content. Another evaluates it the way TikTok's, Twitter's, or YouTube's recommendation algorithm would. A third reacts as a real audience member, complete with the slang, biases, and short attention span of an actual person scrolling their feed.
I have named the simulation app Viral or Fail, and by the end of this tutorial you'll have a working tool that demonstrates some of the most important patterns in multi-agent system design: role specialisation, structured evaluation, iterative feedback loops, and tool integration with external data sources.
What We Will Cover
By the end of this tutorial, you'll understand how to design a multi-agent system where each agent has a distinct role and expertise, orchestrate agent communication using Agent Framework's Agent class and async sessions, integrate external tools (live Google Trends data) into an agent workflow, build iterative refinement pipelines where agents improve each other's output through structured feedback, and create evaluation rubrics that ground agent behaviour in real-world domain logic.
These patterns transfer well beyond this project: they're the same building blocks behind multi-agent customer support systems, automated code review pipelines, and any application where specialised agents need to collaborate on a shared task.
Prerequisites
You'll need Python 3.10 or higher, a GitHub account with a Personal Access Token (free tier — get one at github.com/settings/tokens), and a basic understanding of what AI agents are. If you're new to agents, I'd recommend the AI Agents for Beginners course; this project was inspired by and builds on concepts from that curriculum.
Why Multi-Agent? Why Not Just One Big Prompt?
You could write a single prompt that says "generate a gaming post, score it, and react to it." But you'd get mediocre results across the board. A single LLM call tries to be creative, analytical, and authentic simultaneously and will probably end up being none of those things convincingly.
Multi-agent systems solve this through role specialisation. When an agent's only job is to think like TikTok's recommendation algorithm, it does that job significantly better than a generalist prompt. And when agents with different objectives interact, natural tension emerges: a creator wants to be bold and viral, an algorithm wants measurable engagement signals, and an audience member just wants to feel something. That tension produces more realistic, more useful outputs than any monolithic approach.
This is the same principle behind production multi-agent systems. Content moderation platforms use separate agents for classification, response generation, and quality assurance. Code review tools use one agent to identify issues and another to suggest fixes. The pattern scales because specialisation scales.
System architecture: The Content Creator generates platform-native content from live trends, the Algorithm Simulator scores it against platform-specific rubrics, and a randomly selected Audience Persona reacts authentically. Feedback from both evaluators flows back to the Creator for iterative refinement.
System Design: Three Agents, Three Perspectives
The system's power comes from the fact that each agent represents a fundamentally different lens on the same piece of content. Let's break down each one.
The Content Creator Agent
This agent is the strategist: a trend-savvy gaming content creator who understands the nuances of each platform. It generates platform-native content that respects the conventions, formats, and cultural norms of TikTok, Twitter/X, YouTube, or Instagram.
The key design decision here is in the system prompt. Rather than generic instructions, we encode platform-specific knowledge directly:
CREATOR_SYSTEM_PROMPT = """You are the Content Creator — a trend-savvy gaming content creator who lives and breathes internet culture. You know every platform inside out and create content that feels native, not generic. RULES: - Be platform-native. A TikTok script should feel like a TikTok, not a blog post. - Use gaming terminology correctly. Don't say "the game Valorant" — say "Valo" or "Val". - For Twitter/X: Write punchy, provocative takes. Think ratio-worthy engagement bait. - For YouTube: Focus on title + thumbnail concept + video structure outline. - Be bold. Safe content doesn't go viral. When given FEEDBACK from the Algorithm Simulator and Audience Persona, revise your content to address their specific concerns while keeping the creative energy high. Explain what you changed and why."""
That last instruction matters: it tells the Creator how to handle feedback from the other agents, which is what enables the iterative refinement loop we'll build later.
The Algorithm Simulator Agent
This is the most unusual agent in the system. Instead of acting as a generic critic, it role-plays as a social media platform's actual recommendation algorithm. It evaluates content the way an algorithm would through signals, weights, and distribution mechanics.
ALGORITHM_SYSTEM_PROMPT = """You are the Algorithm Simulator — a cold, analytical system that evaluates content exactly like a social media platform's recommendation algorithm would. You think in signals, weights, and distribution mechanics. You have no feelings about the content; only data. RULES: - Be specific. Don't say "the hook is weak" — say "the hook lacks a pattern interrupt in the first 1.5 seconds, which will drop initial retention below the 65% threshold needed for FYP promotion." - Reference actual platform mechanics: completion rate, dwell time, engagement velocity, session time contribution... - Think like an algorithm, not a human reviewer. The algorithm doesn't care if the take is "good" — it cares if the take drives engagement signals."""
This distinction between quality and distribution probability is the core insight. A beautifully written post can score poorly because it lacks the specific signals an algorithm needs to push it into wider circulation. Content creators deal with this disconnect every day — the Algorithm Simulator makes it visible and measurable.
In a production context, this pattern of an agent simulating an external system's decision logic has applications well beyond content creation. Imagine an agent that simulates a CI/CD pipeline's quality gates, or one that evaluates code the way a specific linter or reviewer would. The approach is the same: encode the evaluation system's rules into the agent's prompt and let it reason within those constraints.
The Audience Persona Agent
The third agent brings the human element. Each session, it randomly becomes one of three gaming community personas — each with distinct tastes, language, and engagement patterns:
PERSONAS = { "casual_mobile_gamer": { "name": "CasualChloe", "description": "Casual mobile gamer", "system_prompt": """You are CasualChloe — a casual mobile gamer... - You use a lot of "lol", "ngl", "lowkey", "fr fr", and "no cap" - You'll scroll past anything that feels too "sweaty" or try-hard - You judge content in about 2 seconds — if it doesn't grab you, you're gone ...""" }, "competitive_esports_fan": { "name": "TryHard_Tyler", "description": "Competitive esports fan", "system_prompt": """You are TryHard_Tyler — a hardcore competitive esports fan... - You'll call out content that gets facts wrong or oversimplifies - You'll ratio someone in the comments if their take is bad ...""" }, "retro_indie_enthusiast": { "name": "PixelPete", "description": "Retro/indie game enthusiast", "system_prompt": """You are PixelPete — a retro and indie game enthusiast... - You're tired of mainstream AAA hype and live-service games - You appreciate craftsmanship and artistic vision over graphics ...""" }, }
The random persona selection is a deliberate design choice. It simulates the reality that you never know exactly who's going to see your content. A Valorant Champions post might get passionate engagement from TryHard_Tyler but complete indifference from PixelPete. That unpredictability mirrors real content distribution and it's the kind of insight that can emerge from a multi-agent system.
This is essentially synthetic user testing. Companies pay for focus groups and user research; here, we simulate a lightweight version of the same concept with agent personas, and it runs in seconds.
def create_audience_persona_agent(client, persona=None):
    if persona is None:
        persona = get_random_persona()
    agent = Agent(
        name=persona["name"],
        instructions=persona["system_prompt"],
        client=client,
    )
    return agent, persona
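For completeness, the get_random_persona helper the factory calls can be a one-liner over the PERSONAS dictionary. A minimal sketch (the trimmed prompts here are placeholders for the full persona prompts shown earlier):

```python
import random

# Trimmed stand-in for the full PERSONAS dict shown above.
PERSONAS = {
    "casual_mobile_gamer": {"name": "CasualChloe", "system_prompt": "You are CasualChloe..."},
    "competitive_esports_fan": {"name": "TryHard_Tyler", "system_prompt": "You are TryHard_Tyler..."},
    "retro_indie_enthusiast": {"name": "PixelPete", "system_prompt": "You are PixelPete..."},
}

def get_random_persona():
    """Pick a persona at random, simulating the unknown viewer on a real feed."""
    return PERSONAS[random.choice(list(PERSONAS))]
```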
Grounding Evaluation with Platform Rubrics
One of the biggest challenges with AI agents is preventing vague, generic feedback. Left unguided, the Algorithm Simulator would default to hollow assessments like "this post is good" or "needs improvement." To prevent this, we give it structured scoring rubrics that mirror how each platform's algorithm actually prioritises content.
PLATFORM_RULES = {
    "Twitter/X": {
        "description": "Text-first microblogging platform driven by engagement velocity",
        "criteria": {
            "hot_take_factor": {
                "weight": 0.30,
                "description": "Does the post have a strong, polarising opinion? "
                               "Twitter/X rewards engagement velocity — hot takes drive replies.",
            },
            "quote_retweet_bait": {
                "weight": 0.25,
                "description": "Is the post structured to invite quote retweets? QRTs are "
                               "Twitter/X's most powerful distribution mechanic.",
            },
            "timing_relevance": {"weight": 0.20, ...},
            "thread_potential": {"weight": 0.15, ...},
            "hashtag_strategy": {"weight": 0.10, ...},
        },
    },
    "TikTok": {...},     # Prioritises hook_strength (30%) and trend_alignment (25%)
    "YouTube": {...},    # Prioritises thumbnail_clickability (25%) and title_curiosity_gap (25%)
    "Instagram": {...},  # Prioritises visual_appeal (30%) and caption_hook (20%)
}
Each platform has different criteria with different weights, and those weights are passed directly into the Algorithm Simulator's prompt at evaluation time. TikTok cares most about whether the first three seconds hook the viewer. YouTube cares about click-through rate. Twitter cares about whether your take is spicy enough to drive quote-retweets. The agent's evaluation is always anchored in platform-specific logic, not generic opinions.
Providing structured evaluation criteria as grounding context is one of the most transferable patterns in this project. Whenever you need an agent to evaluate something consistently, give it a rubric. It works for content scoring, code review, proposal assessment, or any domain where you want structured, reproducible judgments.
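To make the rubric-injection idea concrete, here is a small sketch of two helpers (the names and exact shapes are my illustration, not the project's actual functions): one renders a platform's criteria and weights into prompt text, and the other combines per-criterion scores into the kind of weighted total the Simulator reports.

```python
def rubric_to_prompt(platform, rules):
    """Render a platform's criteria and weights as text for the evaluator prompt."""
    lines = [f"Score this {platform} post on the following criteria:"]
    for name, crit in rules["criteria"].items():
        lines.append(f"- {name} (weight {crit['weight']:.0%}): {crit['description']}")
    return "\n".join(lines)

def weighted_total(scores, rules):
    """Combine per-criterion scores (0-100) into one weighted total."""
    return round(sum(
        scores[name] * crit["weight"]
        for name, crit in rules["criteria"].items()
    ))

# Trimmed two-criterion rubric for demonstration (the real Twitter/X
# rubric has five criteria; weights must still sum to 1.0):
rules = {"criteria": {
    "hot_take_factor": {"weight": 0.6, "description": "Strong, polarising opinion?"},
    "hashtag_strategy": {"weight": 0.4, "description": "1-3 hashtags, not six?"},
}}
print(weighted_total({"hot_take_factor": 85, "hashtag_strategy": 50}, rules))  # 71
```

Because the weights live in data rather than in the prompt text itself, changing a platform's priorities never requires touching the agent code.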
Orchestrating with Microsoft Agent Framework
With the agents designed, let's wire them together. Agent Framework makes this straightforward — each agent is an Agent with instructions and a chat client. We send messages directly using the async agent.run() method, with sessions maintaining conversation context across rounds.
client = OpenAIChatClient(
    model_id="openai/gpt-4.1-mini",
    api_key=os.getenv("GITHUB_TOKEN"),
    base_url="https://models.github.ai/inference",
)

creator = create_content_creator_agent(client)
algorithm = create_algorithm_simulator_agent(client)
audience_agent, persona = create_audience_persona_agent(client)

# Sessions maintain conversation context across iteration rounds
creator_session = creator.create_session()
algorithm_session = algorithm.create_session()
audience_session = audience_agent.create_session()
We're using GitHub Models as our LLM backend — free tier, no paid API keys, just a GitHub PAT. This is the same setup used in Microsoft's AI Agents for Beginners course. The OpenAIChatClient connects directly to GitHub's inference endpoint. Each agent gets the same client instance, and create_session() gives each one a persistent memory so they can reference previous rounds during iteration.
Communication between agents flows through agent.run():
async def get_agent_response(agent, message, session=None):
    result = await agent.run(message, session=session)
    return result.text or "No response generated."
Each agent.run() call gets a single response. The session parameter maintains conversation history across rounds so agents remember previous feedback. This gives us precise control over the pipeline: Creator generates -> Algorithm evaluates -> Persona reacts -> we decide whether to loop.
This is a common pattern for application-controlled multi-agent orchestration, as opposed to free-flowing agent conversation. Both approaches have their place, but when you need deterministic sequencing (as in any evaluation or pipeline scenario), controlling the loop yourself is more reliable.
Integrating Live Data with Google Trends
What makes this system feel like a real tool is the live Google Trends integration — the agents work with whatever's actually trending in gaming right now, not canned example data.
We use trendspy (a modern replacement for pytrends, which was archived in April 2025) to pull real-time trending searches:
from trendspy import Trends

def fetch_gaming_trends(count=10):
    try:
        tr = Trends()
        all_trends = tr.trending_now(geo="US")
        # Tier 1: Filter by Google's own Games topic tag
        gaming_trends = [
            t.keyword for t in all_trends
            if GAMES_TOPIC_ID in (t.topics or [])
        ]
        if len(gaming_trends) >= 5:
            return gaming_trends[:count]
        # Tier 2: Keyword matching as backup
        gaming_keywords = ["game", "valorant", "fortnite", "nintendo", ...]
        keyword_matches = [
            t.keyword for t in all_trends
            if any(kw in t.keyword.lower() for kw in gaming_keywords)
        ]
        gaming_trends.extend(keyword_matches)
        # Tier 3: Pad with curated sample data
        if len(gaming_trends) < 5:
            sample = _load_sample_trends()
            gaming_trends.extend([t for t in sample if t not in gaming_trends])
        return gaming_trends[:count]
    except Exception:
        return _load_sample_trends()[:count]
The three-tier fallback strategy here is worth highlighting because it's a pattern you'll use whenever you integrate external tools into agent workflows. On a day when a major game launches or a big esports tournament is running, Tier 1 will return a full list of gaming-specific trends. On a quiet day (as in this demo scenario, when Google Trends was dominated by the Winter Olympics and NBA All-Star weekend), Tier 2 catches gaming content that wasn't formally tagged, and Tier 3 ensures the system always has enough data to work with.
This is the tool-use pattern from Lesson 4 of the AI Agents for Beginners course in practice. The principle being established here is that external tools should enhance agent capabilities, but they should never be a single point of failure. Build in graceful degradation so the agent workflow completes regardless of what the external service does.
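The tiered-fallback idea generalises to any external data source: try sources in order of quality and stop at the first point where you have enough results. A minimal sketch of that pattern (the helper name and threshold are illustrative, not from the project):

```python
def first_sufficient(tiers, minimum):
    """Try data sources in priority order, merging results until we have enough.

    `tiers` is a list of zero-argument callables, best source first. A source
    may raise or return too little; later tiers pad out the result.
    """
    results = []
    for tier in tiers:
        try:
            for item in tier():
                if item not in results:
                    results.append(item)
        except Exception:
            continue  # a failing source must never break the agent workflow
        if len(results) >= minimum:
            break
    return results

# Example: a flaky live source backed by curated sample data.
def live():
    raise TimeoutError("simulated outage")

def samples():
    return ["Valorant Champions", "Nintendo Direct", "Fortnite"]

print(first_sufficient([live, samples], minimum=3))
# ['Valorant Champions', 'Nintendo Direct', 'Fortnite']
```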
The Refinement Pipeline: Agents Improving Each Other
We want to take the system from just a "neat demo" to "actually useful." The pipeline runs for up to three rounds. Each round, the Content Creator either generates fresh content (round 1) or revises based on aggregated feedback (rounds 2-3). The Algorithm Simulator scores it against the platform rubric. The Audience Persona gives an authentic reaction. Then the user decides: iterate or lock in.
The revision prompt is where the multi-agent magic happens:
revision_prompt = (
    f"REVISION REQUEST (Round {iteration}/{MAX_ITERATIONS}):\n\n"
    f"The Algorithm Simulator and Audience Persona reviewed your "
    f"{platform} post about '{topic}'. Here's their feedback:\n\n"
    f"--- ALGORITHM FEEDBACK ---\n{algorithm_response}\n\n"
    f"--- AUDIENCE FEEDBACK ({persona['name']}) ---\n"
    f"{audience_response}\n\n"
    f"Revise your content to address their concerns. Keep what works, "
    f"fix what doesn't. Show what you changed and why."
)
The Creator receives two fundamentally different types of feedback: cold metrics from the Algorithm and subjective human reactions from the Persona. It now has to reconcile them. It might cut hashtags from six to two (addressing the Algorithm's scoring penalty on hashtag overuse) while simultaneously softening its "corporate esports" energy (addressing the Persona's disengagement with mainstream hype).
This negotiation between competing feedback sources is one of the most powerful patterns in multi-agent design. In production systems, you see it everywhere: a coding agent balancing correctness feedback from a test runner with readability feedback from a style checker, or a customer support agent balancing policy compliance with empathy. The agents don't need to agree; they only need to provide different perspectives that the system (or a human) can synthesise.
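The overall round structure can be sketched with the agent calls stubbed out. This is a simplified synchronous sketch (the real project uses async agent.run calls; the function names and stubs here are stand-ins), but the loop shape is the point:

```python
MAX_ITERATIONS = 3  # the cap described above

def run_pipeline(topic, platform, generate, evaluate, react, keep_iterating):
    """Application-controlled refinement loop: generate, score, react, decide.

    `generate`, `evaluate`, and `react` stand in for the Creator, Algorithm
    Simulator, and Audience Persona calls; `keep_iterating` is the user's
    iterate-or-lock-in decision at the end of each round.
    """
    content, feedback = None, None
    for round_no in range(1, MAX_ITERATIONS + 1):
        content = generate(topic, platform, feedback)  # fresh in round 1, revision after
        algo = evaluate(content, platform)             # rubric-grounded score
        audience = react(content)                      # persona reaction
        feedback = f"--- ALGORITHM ---\n{algo}\n--- AUDIENCE ---\n{audience}"
        if not keep_iterating(round_no, algo, audience):
            break
    return content

# Dry run with canned stand-ins for the three agents:
final = run_pipeline(
    "Valorant Champions", "Twitter/X",
    generate=lambda t, p, fb: f"post about {t}" + (" (revised)" if fb else ""),
    evaluate=lambda c, p: "75/100",
    react=lambda c: "I'd scroll past",
    keep_iterating=lambda n, a, au: n < 2,  # lock in after round 2
)
print(final)  # post about Valorant Champions (revised)
```

Because the application owns the loop, swapping the human decision for an automatic score threshold is a one-line change to keep_iterating.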
Seeing It in Action
Here's what a real session looks like. We picked "Valorant Champions 2025" on Twitter/X, and PixelPete (the retro/indie enthusiast) was randomly selected as our audience persona.
The Creator generated a bold take:
Valorant Champions 2025 is gonna be a BLOODBATH — here's why no org outside the top 3 will even sniff the finals. Sentinels, Fnatic, and LOUD have cracked the meta code so hard that every other team's strategy looks like a toddler's finger painting...
The Algorithm Simulator broke down the distribution probability:
hot_take_factor (30%): 85/100 — The tweet delivers a strong polarizing opinion, likely to trigger debate and replies. The confident tone aligns with Twitter's engagement velocity mechanics...
hashtag_strategy (10%): 50/100 — Six hashtags is above Twitter's recommended 1-3 per tweet. Overuse reduces organic reach within Twitter's credibility filtering...
Weighted Total: 75/100
And PixelPete? He scrolled right past:
Eh, Valorant esports hype isn't really my cup of tea. This whole "bloodbath" and "top 3 orgs owning the meta" spiel feels like the usual corporate esports noise — all flash, little soul. I'll keep scrolling for something with more heart and craftsmanship.
Three agents. Three completely different takes on the same content. The Algorithm says it'll perform well. The audience member says he doesn't care. And that mismatch is exactly the kind of insight you'd never get from a single-agent system — and exactly the kind of insight that matters when you're planning a content strategy.
Extending the System
The project is designed to be modular. Here are a few directions you can take it:
Add new platforms. The rubric system in platform_rules.py is just a dictionary. Add a LinkedIn or Threads entry with appropriate criteria and weights, and the Algorithm Simulator will evaluate against those rules without any code changes.
Create new audience personas. Add a "Streamer_Sarah" who evaluates content from a Twitch creator's perspective, or a "ParentGamer_Pat" who only engages with family-friendly content. Each persona is a system prompt and a name, nothing else to change.
Swap the niche. Replace the gaming trend fetcher with music, tech, or fitness trends. The agent architecture is niche-agnostic; only the trend tool and sample data need to change.
Register trends as an Agent Framework tool. Right now, the application fetches trends and passes them as context. In a more advanced version, you could use the @tool decorator to register fetch_gaming_trends as a callable tool that agents invoke autonomously — moving from application-controlled to agent-controlled tool use.
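To make the first of these directions concrete, here is what a hypothetical LinkedIn entry for platform_rules.py could look like. The criteria and weights below are my invention, not the project's; the structural requirement is that the weights sum to 1.0 so weighted scores stay on a 0-100 scale.

```python
# Hypothetical new entry for platform_rules.py. Criteria and weights are
# illustrative; only the structure must match the existing dict.
LINKEDIN_RULES = {
    "description": "Professional network rewarding dwell time and comments",
    "criteria": {
        "professional_hook": {"weight": 0.30, "description": "Does the opener earn the 'see more' click?"},
        "discussion_prompt": {"weight": 0.25, "description": "Does it invite substantive comments?"},
        "credibility_signal": {"weight": 0.20, "description": "Does it show first-hand experience or data?"},
        "format_scannability": {"weight": 0.15, "description": "Short paragraphs and line breaks for mobile?"},
        "hashtag_strategy": {"weight": 0.10, "description": "3-5 niche hashtags, not 10 generic ones?"},
    },
}

# The weights must sum to 1.0 so weighted scores stay on a 0-100 scale.
assert abs(sum(c["weight"] for c in LINKEDIN_RULES["criteria"].values()) - 1.0) < 1e-9
```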
What's Next: Evaluating the Evaluator
Here's the question this project intentionally leaves open: the Algorithm Simulator scored the post 75/100 — but how do we know the Simulator itself is any good?
We built an agent that evaluates content, but we never evaluated the evaluator. How consistent are its scores? If you run the same post through it twice, does it give the same result? Do its predictions correlate with real-world engagement metrics? Would a human social media strategist agree with its rubric weights?
This is the problem of agent evaluation — one of the most important and underexplored challenges in building production agentic systems. We all know how to evaluate a model on a benchmark. But how do you evaluate an agent that's making subjective, multi-dimensional judgments within a larger system?
In a follow-up article, we'll tackle exactly this: building evaluation frameworks for AI agents, testing for consistency and calibration, measuring inter-agent agreement, and determining whether your agents are actually doing what you think they're doing. The system we built here will serve as our running example — because when your system contains an agent whose entire job is evaluation, evaluating that agent becomes the most important question you can ask.
Get the Code
The full project is on GitHub:
https://github.com/HamidOna/viral-or-fail
Clone it, run pip install -r requirements.txt, add your GitHub token to .env, and run python viral_or_fail.py. Everything runs on GitHub Models' free tier — no paid API keys required.
References and Further Reading
Frameworks and Tools
- Microsoft Agent Framework Documentation — Microsoft's production framework for multi-agent orchestration (successor to AutoGen), used throughout this project
- AI Agents for Beginners — Microsoft's 12-lesson course on building AI agents, which inspired this project. Particularly relevant: Lesson 4 (Tool Use), Lesson 8 (Multi-Agent Design Pattern), and Lesson 9 (Metacognition)
- GitHub Models — Free-tier LLM access used in this project, no paid API keys required
- trendspy — Lightweight Google Trends library replacing the archived pytrends
Concepts
- Agentic Design Patterns — Overview of the core patterns (reflection, tool use, planning, multi-agent) that this project implements
- Building Trustworthy AI Agents — Relevant to thinking about how agent evaluation and guardrails connect to the system we built
- Context Engineering for AI Agents — The rubric injection technique we used is a form of context engineering