ATLAS: Your AI Tutoring Agent For Personalised Learning

danyherscovitch · Sep 09, 2025

Introduction

Personalised learning has long been proven to improve students' academic performance, with private tutoring being a prime example. However, human tutors and teaching assistants, who are often full-time researchers, have limited availability. Recently, the rise of conversational apps like ChatGPT and Claude has offered a promising alternative, providing accessible, automated coaching and feedback that many students now rely on to support their studies. Yet these general-purpose LLMs face critical limitations: they lack awareness of individual student backgrounds and specific course material, cannot track learning progress, and tend to provide direct answers instead of fostering critical thinking. To overcome these challenges, I developed ATLAS, an agentic Intelligent Tutoring System (ITS) designed to bring together the adaptability of human educators and the availability of LLMs.

Background

Hi! My name is Dany, and I’m currently finishing my Master’s degree in Computational Science and AI at Imperial College London. For my research project, I collaborated with Microsoft to develop an innovative education-related tool. In this blog post, I’m excited to introduce ATLAS, a Python system built using Azure AI Foundry, Semantic Kernel, and Azure Cloud Services. I’ll walk you through the ideas behind it, the methodology, and the results I achieved. A special thanks to Lee Stott, who supervised this project.

Deployment in Learning Environments

Frontend App of ATLAS's PoC

Before jumping into the methodology, I think it helps to imagine what ATLAS could actually look like inside a university. Picture a chat-based app, similar to a messaging platform, where each student has a dedicated tutoring agent for every course they’re enrolled in. The agents' role is to guide the student through the course using Socratic questioning, remembering past conversations to keep continuity across sessions.

On the educator’s side, deployment is pretty straightforward: upload the course materials, register students, and connect each of them to the right courses. From there, the tutors are ready to go!

Here is a screenshot of what the platform could look like (it has been built using Chainlit, which I highly recommend for chat apps).

Methodology

ATLAS is engineered to provide a structured, individualised, and tool-augmented learning experience, combining the adaptability of human educators with the availability of LLMs. The system's core is built upon three fundamental AI agent paradigms:

  • Planning: The ability to break down tasks and sequence actions to achieve long-term objectives.
  • Memory: The ability to store and recall information (interactions, preferences, etc) for consistent context.
  • Tool Use: The ability to perform function calling to interact with the environment.

How I implemented those capabilities is grounded in key pedagogical principles, including Mastery-Based Learning, Differentiated Instruction, and Zone of Proximal Development (ZPD)-driven scaffolding. Grounding ATLAS in decades of learning-science research is essential, as it ties every design choice I made to established mechanisms, long proven to help students achieve better academic performance.

Planning: Knowledge Tracing

A university course contains far too much material for an LLM to handle at once. To manage this, ATLAS uses Knowledge Tracing (KT), a way of breaking a course into smaller, trackable pieces so the tutor can guide students step by step, adapt learning paths, and avoid overloading the model's memory, which causes context rot.

Course Decomposition. The smallest unit of learning is a Knowledge Component (KC): a specific concept with a mastery criterion, learnable content from the syllabus, and optional exercises. ATLAS generates KCs by decomposing the course into chunks of syllabus text (since LLMs struggle with very long inputs) and prompting GPT-4.1, deployed on Azure AI Foundry, to produce structured outputs in JSON mode. Together, these KCs form what I call a course roadmap, though in practice they are treated as unordered building blocks for tutoring.
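
To make this concrete, here is a minimal sketch of what one decomposition call could look like, using the Azure OpenAI client in JSON mode. The endpoint placeholders, deployment name, prompt wording, and KC schema are illustrative assumptions, not ATLAS's exact implementation.

```python
import json
from openai import AzureOpenAI

# Placeholders for the Azure AI Foundry deployment; fill in your own resource details.
client = AzureOpenAI(
    azure_endpoint="https://<your-foundry-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="<api-version>",
)

SYSTEM_PROMPT = (
    "You are a curriculum designer. Decompose the given syllabus chunk into Knowledge "
    "Components. Return JSON with a 'knowledge_components' list, where each item has "
    "'title', 'mastery_criterion', 'learnable_content', and 'exercises'."
)

def extract_kcs(syllabus_chunk: str) -> list[dict]:
    """Ask GPT-4.1 (in JSON mode) to turn one syllabus chunk into a list of KCs."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # deployment name (assumption)
        response_format={"type": "json_object"},  # JSON mode forces valid JSON output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": syllabus_chunk},
        ],
    )
    return json.loads(response.choices[0].message.content)["knowledge_components"]
```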

 

Process of changing focus from one KC to another.

Learner Progress. Each KC is stored in Cosmos DB and, for each learner, tracked with a mastery status: not started, in progress, mastered, or confused. As students progress, ATLAS can shift focus from one KC to another through function calling, ensuring that only one KC is "active" at a time. This requires marking the active KC as mastered or confused based on interactions with the learner, and the chosen next KC as in progress. The active KC information, including learnable content and exercises, is injected into the system prompt. This keeps the tutor's guidance grounded in the course material and reduces hallucinations.
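
As an illustration, a KC-switching tool could look roughly like the sketch below: it reads the learner's progress document from Cosmos DB, closes the active KC, and activates the next one. The database, container, and field names are assumptions rather than ATLAS's actual schema.

```python
from azure.cosmos import CosmosClient

# Hypothetical Cosmos DB setup: one progress document per learner.
cosmos = CosmosClient(url="https://<account>.documents.azure.com", credential="<key>")
container = cosmos.get_database_client("atlas").get_container_client("learner_progress")

def switch_active_kc(learner_id: str, outcome: str, next_kc_id: str) -> str:
    """Close the active KC as 'mastered' or 'confused', then mark the next KC 'in progress'."""
    profile = container.read_item(item=learner_id, partition_key=learner_id)
    for kc in profile["knowledge_components"]:
        if kc["status"] == "in progress":
            kc["status"] = outcome        # mastered | confused, decided from the dialogue
        if kc["id"] == next_kc_id:
            kc["status"] = "in progress"  # only one KC is active at a time
    container.upsert_item(profile)
    return f"Now focusing on KC {next_kc_id}."
```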

Memory: Context Engineering

Context engineering is the art of managing what information the LLM sees at each chat round to shape its behaviour. It can be used to prevent performance degradation from long inputs, reduce hallucinations, preserve continuity, and lower both cost and latency. In ATLAS, I've applied context engineering mainly through chat summarisation, session checkpoints, and a live system prompt.

Chat Summarisation. This is a common practice in GenAI engineering. To keep context length bounded, ATLAS maintains a window of N recent messages and compresses the earlier ones into a short summary. The latter is first created once the conversation reaches N turns, then refreshed each round. I use GPT-4.1 Nano deployed on AI Foundry, prioritising speed over peak accuracy, since summarisation is a lightweight task where latency matters. The LLM is wrapped in Semantic Kernel's ChatHistorySummarizationReducer, a variant of ChatHistory, which abstracts the whole summary generation step. Another recommended tool for this task is Azure AI's ConversationAnalysisClient, which includes a collection of NLP tasks on chat conversations (summarisation, sentiment analysis, etc.), although I used SK's solution as it is purpose-built for my use case.
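
For reference, a minimal sketch of this setup is shown below, assuming a recent semantic-kernel Python release where ChatHistorySummarizationReducer accepts a service and message-count thresholds; the deployment name and threshold values are illustrative.

```python
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
from semantic_kernel.contents import ChatHistorySummarizationReducer

# Lightweight model dedicated to summarisation: speed matters more than peak accuracy here.
summariser = AzureChatCompletion(
    deployment_name="gpt-4.1-nano",  # deployment name (assumption)
    endpoint="https://<your-foundry-resource>.openai.azure.com",
    api_key="<api-key>",
)

# Behaves like a ChatHistory, but once the message count exceeds threshold_count it
# compresses older turns into a summary, keeping roughly target_count recent messages.
history = ChatHistorySummarizationReducer(
    service=summariser,
    target_count=10,
    threshold_count=20,
)

history.add_user_message("Can you remind me how Python lists differ from tuples?")
# ... add tutor/student messages over the session ...
# await history.reduce()  # call each round; triggers summarisation when the threshold is hit
```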

Interaction Checkpoints. Each chat round produces a checkpoint: a concise summary of one student-tutor exchange. Checkpoints are also generated using GPT-4.1 Nano for latency optimisation. These are stored in the learner profile's Cosmos DB object, and the five most recent checkpoints are loaded into the system prompt at every chat round. This enables ATLAS to resume teaching exactly where the learner left off within the active KC, avoiding unnecessary repetition while preserving continuity across sessions.
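
A rough sketch of one checkpoint step, reusing the client and container objects from the earlier sketches, could look like this; the prompt wording and the profile's field names are assumptions.

```python
def save_checkpoint(learner_id: str, student_msg: str, tutor_msg: str) -> None:
    """Summarise one student-tutor exchange with a small model and store it in the learner profile."""
    summary = client.chat.completions.create(
        model="gpt-4.1-nano",  # small deployment (assumption), chosen for low latency
        messages=[
            {"role": "system", "content": "Summarise this tutoring exchange in one sentence."},
            {"role": "user", "content": f"Student: {student_msg}\nTutor: {tutor_msg}"},
        ],
    ).choices[0].message.content

    profile = container.read_item(item=learner_id, partition_key=learner_id)
    profile.setdefault("checkpoints", []).append(summary)
    profile["checkpoints"] = profile["checkpoints"][-5:]  # keep only the five most recent
    container.upsert_item(profile)
```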

Live System Prompt. ATLAS maintains awareness of the learner’s educational background, with various data types included in its system prompt. Alongside the active KC and interaction checkpoints mentioned above, the prompt also includes a formatted summary of learning progress (KC titles and mastery status), basic course and learner information, and the learner’s preferences (introduced below). Since all of this data, stored in Cosmos DB, can be altered by ATLAS, the prompt is refreshed at every chat round to reflect live updates. The process involves removing the old system prompt, reconstructing it by pulling the latest data from Cosmos DB, and reinserting it at the beginning of the chat history. To support this kind of manipulation, I found Semantic Kernel’s ChatHistory (+ summary reduction) with AzureChatCompletion more flexible than using a ChatCompletionAgent with its AgentThread object. The latter is harder to manipulate than ChatHistory, which you can edit by simply modifying its list of ChatMessageContent objects. However, ChatCompletionAgent is newer and easier to use for simple chat sessions, so it's worth keeping an eye on its development.
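
A simplified version of that refresh step might look like the sketch below; build_system_prompt is a hypothetical helper standing in for pulling the latest KC, checkpoints, progress, and preferences from Cosmos DB.

```python
from semantic_kernel.contents import AuthorRole, ChatHistory, ChatMessageContent

def refresh_system_prompt(history: ChatHistory, learner_id: str) -> None:
    """Drop the stale system message and reinsert a freshly built one at position 0."""
    # Remove any existing system message from the history's list of ChatMessageContent.
    history.messages[:] = [m for m in history.messages if m.role != AuthorRole.SYSTEM]
    # Hypothetical helper: rebuilds the prompt from the live Cosmos DB data.
    prompt = build_system_prompt(learner_id)
    history.messages.insert(0, ChatMessageContent(role=AuthorRole.SYSTEM, content=prompt))
```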

Tool Use: Model Context Protocol

All the functions ATLAS can call are exposed via its MCP server, built with FastMCP and wrapped in Semantic Kernel's MCPStdioPlugin. The latter object handles connecting to the server as well as passing its collection of tools to the agent's Kernel.
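
To give an idea of the shape of this setup, here is a minimal two-part sketch: an MCP server exposing a tool over stdio, and (commented out) the Semantic Kernel side that loads those tools into the Kernel. I'm assuming the official MCP Python SDK's FastMCP and Semantic Kernel's MCPStdioPlugin; the tool and file names are illustrative.

```python
# --- server.py: expose ATLAS's tools over stdio ---
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("atlas-tools")

@mcp.tool()
def switch_active_kc(learner_id: str, outcome: str, next_kc_id: str) -> str:
    """Mark the active KC as mastered/confused and activate the next one."""
    ...  # update Cosmos DB, as in the earlier sketch

if __name__ == "__main__":
    mcp.run()  # stdio transport by default

# --- agent side (sketch): hand the server's tools to the Kernel ---
# from semantic_kernel import Kernel
# from semantic_kernel.connectors.mcp import MCPStdioPlugin
#
# async with MCPStdioPlugin(name="atlas", command="python", args=["server.py"]) as plugin:
#     kernel = Kernel()
#     kernel.add_plugin(plugin)
```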

In practice, these tools allow ATLAS to go beyond simple conversation and actively interact with the learning environment. ATLAS currently uses three: KC Switching, Course Content Retrieval, and Learner Preferences Management. Since we’ve already covered KC Switching, let’s dive into the other two.

RAG pipeline flowchart.

Course Content Retrieval. Students often have questions outside the active KC. To keep answers grounded in the syllabus, ATLAS uses a lightweight Retrieval-Augmented Generation (RAG) pipeline. The syllabus is split into sections, embedded, and stored in an Azure AI Search index. When a student asks a question, ATLAS retrieves the most relevant sections using hybrid search (vector + keyword), then reranks them to select the top three. These passages are injected into the conversation so the tutor can respond with syllabus-based answers, minimising hallucinations.

More technically, I generate embeddings with OpenAI's text-embedding-3-large model, deployed on Azure AI Foundry. Each course section is then stored in an Azure AI Search SearchIndex, where both the raw content and its embedding are defined as SearchField objects. The index also includes a VectorSearch configuration for similarity search and a SemanticSearch configuration for reranking results.
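
Putting the retrieval step together, a hybrid query with semantic reranking could look roughly like this. I'm assuming an existing index named "course-syllabus" with a "content" text field, an "embedding" vector field, and a semantic configuration named "default"; all of these names are illustrative.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

search = SearchClient(
    endpoint="https://<search-service>.search.windows.net",
    index_name="course-syllabus",                  # illustrative index name
    credential=AzureKeyCredential("<search-key>"),
)
embedder = AzureOpenAI(
    azure_endpoint="https://<your-foundry-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="<api-version>",
)

def retrieve_sections(question: str, top: int = 3) -> list[str]:
    """Hybrid (keyword + vector) search over syllabus sections, reranked to the top results."""
    vector = embedder.embeddings.create(
        model="text-embedding-3-large", input=question
    ).data[0].embedding

    results = search.search(
        search_text=question,                                      # keyword side
        vector_queries=[VectorizedQuery(vector=vector, fields="embedding",
                                        k_nearest_neighbors=30)],  # vector side
        query_type="semantic",                                     # semantic reranking
        semantic_configuration_name="default",
        top=top,
    )
    return [doc["content"] for doc in results]
```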

Learning Preferences Management. ATLAS further personalises its teaching by maintaining a profile of student preferences, for example, favouring explanations over exercises, or concise answers over detailed ones. This profile is stored in Cosmos DB, loaded into the system prompt, and managed through function calling. While ATLAS reliably handles explicit requests ("I prefer learning through examples."), it is less consistent at inferring preferences from natural interactions. An interesting avenue for future work is to integrate a third-party LLM that analyses each chat round and proposes adjustments to the learner’s preference profile.
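
For explicit requests, the tool itself can stay very simple; a sketch reusing the FastMCP server and Cosmos container from the sketches above might look like this (the profile schema is an assumption).

```python
@mcp.tool()
def update_learning_preference(learner_id: str, preference: str) -> str:
    """Record an explicitly stated learner preference (e.g. 'prefers worked examples')."""
    profile = container.read_item(item=learner_id, partition_key=learner_id)
    profile.setdefault("preferences", []).append(preference)
    container.upsert_item(profile)
    return f"Noted preference: {preference}"
```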

Evaluation Framework: SAILED

Designing ATLAS was only half the challenge. Equally important was figuring out how to measure whether it actually helps students learn. Evaluating AI agents is tricky: they are complex, multi-step systems with long-term goals. Human evaluation is ideal but expensive and slow. To address this, I built SAILED (Student-Agent Interactive Learning Evaluation through Dialogue), a scalable framework that simulates tutoring sessions between ATLAS and virtual students.

Process of predicting student answer to an exam question, using the tutoring conversation stored in Cosmos DB.

The idea is simple:

  1. An LLM-powered student takes an exam before any tutoring.
  2. The student has a tutoring session with ATLAS (or a baseline LLM).
  3. The student retakes the exam using the knowledge gained.

By comparing scores before and after tutoring, we can estimate learning gains in a fast and repeatable way. But there was one big problem: LLMs playing students are often too smart! They tend to get exam questions right even when told to simulate weaker understanding. To fix this, I introduced a third-party evaluator called the Psychologist. Instead of answering questions directly, it analyses the tutoring dialogue and predicts what the student would choose.
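
Conceptually, the Psychologist is just another LLM call that sees the transcript instead of the question bank. A sketch, reusing the Azure OpenAI client from earlier and with an illustrative prompt, could look like this:

```python
def predict_student_answer(transcript: str, question: str, options: list[str]) -> str:
    """Predict which option the simulated student would pick, given only the tutoring dialogue."""
    prompt = (
        "You are a psychologist observing a tutoring session. Based ONLY on the dialogue, "
        "predict which option the student would choose. Reply with the option letter.\n\n"
        f"Dialogue:\n{transcript}\n\nQuestion: {question}\nOptions: {options}"
    )
    response = client.chat.completions.create(
        model="gpt-4.1",  # deployment name (assumption)
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```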

For testing ATLAS with SAILED, I use the first lecture of an “Introduction to Python” course at Imperial College London. The material is broken down into 15 KCs, each paired with a multiple-choice question. I then create three types of virtual students (beginner, intermediate, and advanced), defined by different mastery levels in each KC prior to enrolling in the course. For example, beginners start with only 10% of concepts mastered and 50% confused, while advanced students start with 50% mastered and 10% confused. Each simulated student is powered by GPT-4.1 and instructed to behave according to their active KC status (e.g., showing less confidence for confused KCs).
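
One way to set up these profiles is to assign each of the 15 KCs an initial status according to the profile's proportions, as in the sketch below; the intermediate profile's exact percentages are my assumption, since only the beginner and advanced figures are given above.

```python
import random

PROFILES = {
    "beginner":     {"mastered": 0.10, "confused": 0.50, "not started": 0.40},
    "intermediate": {"mastered": 0.30, "confused": 0.30, "not started": 0.40},  # assumption
    "advanced":     {"mastered": 0.50, "confused": 0.10, "not started": 0.40},
}

def init_student(kc_ids: list[str], profile: str, seed: int = 0) -> dict[str, str]:
    """Assign an initial mastery status to each KC for one simulated student."""
    rng = random.Random(seed)
    kcs = kc_ids[:]
    rng.shuffle(kcs)  # randomise which KCs fall into each bucket
    n_mastered = round(PROFILES[profile]["mastered"] * len(kcs))
    n_confused = round(PROFILES[profile]["confused"] * len(kcs))
    statuses = {}
    for i, kc in enumerate(kcs):
        if i < n_mastered:
            statuses[kc] = "mastered"
        elif i < n_mastered + n_confused:
            statuses[kc] = "confused"
        else:
            statuses[kc] = "not started"
    return statuses
```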

To run the tutoring sessions, I defined a simple conversation loop: the student and tutor take turns until all KCs are marked as mastered. While this looks straightforward, full tutoring dialogues can be very long and computationally expensive. To handle this at scale, I containerised the simulations with Docker and deployed them on Azure Container Apps (ACA), storing every message in Cosmos DB as it was generated.
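
The loop itself is short; a sketch is below, where tutor_reply, student_reply, all_kcs_mastered, and save_message are hypothetical helpers standing in for the ATLAS agent call, the simulated student, the Cosmos DB progress check, and message persistence.

```python
async def run_tutoring_session(learner_id: str, max_turns: int = 500) -> None:
    """Alternate student and tutor turns until every KC is mastered (or a turn cap is hit)."""
    student_msg = "Hi! I'm ready to start."
    for _ in range(max_turns):                        # hard cap to bound cost
        tutor_msg = await tutor_reply(learner_id, student_msg)
        save_message(learner_id, "tutor", tutor_msg)  # persist every message in Cosmos DB
        if all_kcs_mastered(learner_id):
            break
        student_msg = await student_reply(learner_id, tutor_msg)
        save_message(learner_id, "student", student_msg)

# import asyncio; asyncio.run(run_tutoring_session("student-001"))
```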

Lastly, I created a Vanilla LLM teacher benchmark for comparison. Here, the model (also GPT-4.1) only sees the current KC title and delivers 12 turns of dialogue per KC. The prompt was deliberately minimal, just enough to keep the model in a teaching role and avoid degenerate loops. This baseline reflects how students typically use chatbots like ChatGPT or Claude.

Results

Across 9 students, ATLAS achieved the highest performance (mean exam score = 91.1%), compared to 77.8% with vanilla tutoring and only 26.7% with no tutoring. Gains were strongest for advanced students (+17.8% over vanilla), showing ATLAS provides valuable course-specific knowledge even to learners with prior experience. ATLAS also adapted its teaching length: KCs that students had previously found confusing required the most turns (mean = 20.8), followed by not-started (11.3) and mastered (6.1). This confirms that the system effectively performs Mastery-Based Learning and tailors effort to student backgrounds, satisfying the need for Differentiated Instruction.

Overall, these results show that ATLAS provides robust improvements in learning experience and adaptability over both no tutoring and vanilla tutoring.

As a final note, SAILED doesn’t just validate ATLAS: it also provides a blueprint for evaluating any conversational tutoring system. And while it’s no substitute for real students, it’s a fast, low-cost way to benchmark and iterate on conversational AI tutors.

Limitations, Ethics, and Future Directions

While ATLAS advances the vision of making education both adaptable and scalable, several limitations and ethical challenges must be acknowledged.

Limitations

Student over-reliance. Like many LLM-based tutors, ATLAS risks encouraging dependence, where students may default to the system instead of productively struggling with the course material. Although ATLAS encourages reasoning through Socratic dialogue rather than direct answers, the risk of reduced autonomy remains.

Personalisation. Unlike human tutors, ATLAS cannot perceive subtle emotional cues such as frustration or disengagement, which limits its ability to fully adapt instruction. While it can follow preferences and track knowledge progression, it lacks awareness of learner personality, motivation, and prior experiences.

Evaluation representativeness. While simulated learners make large-scale evaluation feasible, they cannot fully replicate human behaviour: they lack emotional responses, diverse personalities, and the unpredictability of real students. Similarly, SAILED focuses on overall teaching outcomes rather than individual system components, which may complicate debugging in future development.

Other practical limitations. ATLAS assumes syllabi are structured, text-based documents with lectures, sections, and exercises. This restricts applicability, since many courses rely on slides, figures, or exercise sets distributed outside the syllabus. Additionally, ATLAS is limited to text-only input and output, whereas leading platforms already offer multimodal interactions. Finally, while grounding in course material reduces hallucinations, it cannot eliminate them entirely. This is an important concern given that trust is critical in education.

Ethical Considerations

These limitations connect to ethical considerations:

  • Storing learner profiles introduces risks around data security, requiring transparent communication of what data is collected and how it is used.
  • Explainability is also crucial: if students cannot understand ATLAS’s reasoning, they may either over-trust or under-trust its guidance.
  • Safeguards against misuse are essential as well. For example, prompt injection may lead the system to complete coursework directly.
  • Finally, fairness must be ensured: biases inherited from training data could disadvantage or alienate certain learners if left unchecked.

Future Work

Future work follows from the limitations discussed. First, course decomposition should be made more flexible, accepting diverse syllabus formats rather than assuming a strict text-based structure. Adding multimodal abilities, by incorporating visual and audio inputs and outputs, is also important to compete with leading foundation models and expand ATLAS's usability.

For SAILED, more representative simulated learners are needed. Current profiles rely mainly on distributions of KC statuses. Future versions could include modelling of personality, motivation, and learning preferences to better represent student diversity.

Another promising direction is exploring Multi-Agent Debate (MAD). ATLAS is currently single-agentic, with each agent bound to its own module. Introducing supervisory agents that push ATLAS tutors to reason more deeply could enhance the quality of guidance.

Finally, future work should also experiment with alternative LLMs beyond GPT-4.1 and could leverage knowledge graphs to organise the set of KCs. This could improve flexibility in knowledge navigation, favouring more optimal learning paths.

Conclusion

This work introduced ATLAS, an agentic ITS designed to provide personalised learning experiences by guiding students through university modules. ATLAS decomposes courses into independent KCs, employs Socratic dialogue to encourage reasoning, and implements key pedagogical concepts such as ZPD and Differentiated Instruction. It maintains awareness of learner progress through Knowledge Tracing, adapts to individual learning preferences, and integrates a RAG pipeline to ground its answers in course material. We also proposed SAILED, a scalable evaluation framework that generates virtual students with varying prior knowledge and simulates tutoring sessions. By comparing exam outcomes across three conditions (no tutoring, vanilla LLM tutoring, and ATLAS tutoring), SAILED demonstrates that ATLAS consistently delivers higher learning gains and personalisation. This shows the potential of ATLAS to make education both available and adaptable, and consequently, more equitable.

Final Note

On a personal note, I've really enjoyed building AI agents with Microsoft tools. As these tools are numerous and constantly expanding, staying updated on the latest features could have been challenging, but the wealth of available resources made it easy. The most valuable, in my experience, is the Azure AI Foundry Discord server, where Cloud Advocates, Engineers, and even external developers come together to provide support and host events on topics across Microsoft’s GenAI ecosystem. The server includes help channels for a wide range of tools and programming languages, from Azure AI and Semantic Kernel to Foundry Local and MCP. Another platform where it has been easy to receive support and raise issues is Azure AI Foundry's GitHub Repository Discussions. I also want to highlight the API documentation for Azure AI and Semantic Kernel, which I found exceptionally clear, easy to navigate, and developer-friendly.

Lastly, please do not hesitate to contact me if you have any further questions, thoughts, or ideas for future development. You can reach out to me on LinkedIn!
