Educator Developer Blog

Building Trustworthy AI Agents

Emmanuel_Kibet
Mar 20, 2025

AI agents are software programs designed to interact with their environments, collect data, and perform tasks autonomously to achieve specific goals set by humans. These agents can make rational decisions based on their perceptions and data, optimizing their actions to produce the best possible outcomes.

AI agents should be built as safe applications. Here, safety means that the AI agent performs as designed. As builders of agentic applications, we have methods and tools to maximize safety:

Building a Meta Prompting System

If you have ever built an AI application using Large Language Models (LLMs), you know the importance of designing a robust system prompt or system message. These prompts establish the meta rules, instructions, and guidelines for how the LLM will interact with the user and with data. The system prompt for an AI agent needs highly specific instructions for the tasks the agent is designed to complete.

Microsoft has developed a free course, AI Agents for Beginners. In this blog, I am going to focus on the meta prompting system.

To create scalable system prompts, we can use a meta prompting system for building one or more agents in our application: 

Step 1: Create a Meta or Template Prompt

We design it as a template so that, if needed, we can efficiently create multiple agents; an LLM then uses the meta prompt to generate the system prompts for the agents we create.

Here's an example of a meta prompt we would give to the LLM:
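The following is an illustrative sketch of such a meta prompt, not taken verbatim from the course:

```
You are an expert prompt engineer. You will be given a basic description
of an AI agent: its role, the tasks it performs, and its responsibilities.
Rewrite this description into a detailed system prompt for that agent.
The system prompt should state the agent's role, enumerate its tasks step
by step, define the tone it should use with users, and list topics and
actions it must refuse. Output only the finished system prompt.
```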

Step 2: Create a Basic Prompt

The next step is to create a basic prompt that describes the AI agent. It should include the role of the agent, the task the agent will complete, and any other responsibilities of the agent.
Here’s an example:
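The basic prompt below is an illustrative sketch (the travel-agent scenario is an assumption, not quoted from the original post):

```
You are a travel agent. You help customers plan and book flights. Your
responsibilities are to search for flights that match the customer's
dates and budget, present the options clearly, and confirm the booking
details with the customer before finalizing anything.
```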

Step 3: Provide the Basic Prompt to the LLM

Now we can optimize the basic prompt by providing the meta prompt as the system message together with our basic prompt. This produces a prompt that is better designed to guide our AI agent:
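A minimal sketch of this step in Python. The prompt strings and the function name are illustrative, and the commented-out call marks where your actual LLM client would be invoked:

```python
# Step 3 sketch: the meta prompt becomes the system message and the basic
# prompt becomes the user message of the request we send to the LLM.
# META_PROMPT and BASIC_PROMPT below are illustrative placeholders.

META_PROMPT = (
    "You are an expert prompt engineer. Rewrite the agent description "
    "you are given into a detailed system prompt for that agent."
)

BASIC_PROMPT = (
    "You are a travel agent. You help customers plan and book flights."
)

def build_optimization_request(meta_prompt: str, basic_prompt: str) -> list:
    """Assemble the chat messages sent to the LLM that will generate
    the improved system prompt for our agent."""
    return [
        {"role": "system", "content": meta_prompt},
        {"role": "user", "content": basic_prompt},
    ]

messages = build_optimization_request(META_PROMPT, BASIC_PROMPT)
# An actual call would then look something like:
#   response = client.chat.completions.create(model="...", messages=messages)
# where `client` is whatever LLM client your application uses.
```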

Step 4: Iterate and Improve 

Making small tweaks to the basic prompt and running it through the system lets you compare and evaluate the results.
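This iteration step can be sketched as a small comparison loop. Here, `generate_system_prompt` is a stand-in for the LLM call from step 3, not a real API:

```python
# Step 4 sketch: run several basic-prompt variants through the same meta
# prompting system and collect the generated system prompts side by side
# for evaluation. `generate_system_prompt` stands in for the LLM call.

def compare_variants(generate_system_prompt, basic_prompt_variants):
    """Return {variant: generated_system_prompt} for later comparison."""
    return {
        variant: generate_system_prompt(variant)
        for variant in basic_prompt_variants
    }
```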

Understanding Threats 

To build trustworthy AI agents, it is important to understand and mitigate the risks and threats to AI agents. 

Here are some of the different threats to AI agents and how you can better plan and prepare for them.

Task and Instruction Manipulation

Description: Attackers attempt to change the instructions or goals of the AI agent through prompting or manipulating inputs. 

Mitigation: Execute validation checks and input filters to detect potentially dangerous prompts before they are processed by the AI agent.
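As a simple illustration, a pattern-based input filter might look like the sketch below. This is only a first line of defense; real systems typically combine such filters with model-based classifiers, and the patterns shown are illustrative assumptions:

```python
import re

# Minimal input-filter sketch: flag prompts that look like attempts to
# override the agent's instructions before they reach the model.

SUSPICIOUS_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
]

def is_suspicious(user_input: str) -> bool:
    """Return True if the input matches a known prompt-injection pattern."""
    text = user_input.lower()
    return any(re.search(pattern, text) for pattern in SUSPICIOUS_PATTERNS)
```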

Access to Critical Systems

Description: If an AI agent has access to systems and services that store sensitive data, attackers can compromise the communication between the agent and these services. These can be direct attacks or indirect attempts to gain information about these systems through the agent. 

Mitigation: AI agents should have access only to the systems they need (least privilege) to prevent these types of attacks. Also, communication between the agent and these systems should be secure.
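One way to enforce least privilege in code is to route every tool call through an explicit allowlist, as in this sketch (the class and tool names are illustrative assumptions):

```python
# Least-privilege sketch: the agent may only invoke tools on an explicit
# allowlist, so a manipulated prompt cannot reach other systems.

class ToolGate:
    def __init__(self, allowed_tools):
        self.allowed_tools = set(allowed_tools)

    def call(self, tool_name, tool_fn, *args, **kwargs):
        """Invoke tool_fn only if tool_name is on the allowlist."""
        if tool_name not in self.allowed_tools:
            raise PermissionError(f"Agent is not permitted to use {tool_name!r}")
        return tool_fn(*args, **kwargs)
```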

Resource and Service Overloading

Description: AI agents can access different tools and services to complete tasks. Attackers can use this ability to attack these services by sending a high volume of requests through the AI agent, which may result in failures and inflated cost. 

Mitigation: Implement policies to limit the number of requests the AI agent can make to a service.
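A sliding-window rate limiter is one common way to implement such a policy. The sketch below is a minimal, single-process illustration (production systems usually rely on a shared store or an API gateway):

```python
import time
from collections import deque

# Sliding-window rate-limiter sketch: cap how many requests the agent may
# send to a downstream service within a time window.

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()  # times of recent allowed requests

    def allow(self):
        """Return True and record the request if it is within the limit."""
        now = time.monotonic()
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()  # drop requests outside the window
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```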

Knowledge Base Poisoning

Description: This type of attack does not target the AI agent directly but targets the knowledge base and other services that the AI agent will use. 

Mitigation: Regularly verify the data that the AI agent will use in its workflows, ensure that access to this data is secure, and allow only trusted individuals to change it.
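One concrete verification technique is to record a cryptographic hash of each trusted document and check it before the agent uses the document, so silent tampering is detected. A minimal sketch (the document name and content are illustrative):

```python
import hashlib

# Integrity-check sketch: compare each document against a stored
# fingerprint before the agent is allowed to use it.

def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest of the document content."""
    return hashlib.sha256(content).hexdigest()

# Fingerprints recorded when the trusted content was last reviewed.
TRUSTED_FINGERPRINTS = {
    "refund_policy.md": fingerprint(b"Refunds within 30 days."),
}

def verify(name: str, content: bytes) -> bool:
    """True only if the document matches its recorded fingerprint."""
    return TRUSTED_FINGERPRINTS.get(name) == fingerprint(content)
```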

Cascading Errors 

Description: Errors caused by attackers can lead to failures in other systems that the AI agent is connected to, causing the attack to become more widespread and harder to troubleshoot.

Mitigation: One method to avoid this is to have the AI agent operate in a limited environment, such as performing tasks in a Docker container, to prevent attacks on the host system. Also, create fallback mechanisms and retry logic for when certain systems respond with an error, to prevent larger system failures.
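The retry-with-fallback idea can be sketched in a few lines. Here `fn` is the flaky downstream call and `fallback` supplies a safe default instead of letting the error cascade (both names are illustrative):

```python
# Retry-with-fallback sketch: retry a flaky downstream call a few times,
# then return a safe default rather than propagating the failure.

def call_with_fallback(fn, fallback, retries=3):
    last_error = None
    for _ in range(retries):
        try:
            return fn()
        except Exception as exc:  # in practice, catch specific error types
            last_error = exc
    return fallback(last_error)
```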

Human-in-the-loop 

Another effective way to build trustworthy AI systems is to use a human-in-the-loop. This creates a flow where users provide feedback to the agents during the run.

 

In AutoGen, this concept is implemented by configuring the user proxy agent with human_input_mode="ALWAYS", so the conversation pauses for human feedback at each step.
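Framework details aside, the essence of human-in-the-loop is an approval checkpoint before each action. The sketch below is framework-agnostic; the `ask` parameter is an illustrative stand-in for the real feedback channel (a console prompt, a UI callback, etc.), not part of any AutoGen API:

```python
# Human-in-the-loop sketch: the agent must get human approval before
# executing each proposed action. `ask` is injected so it can be a console
# prompt during development or a UI callback in production.

def run_with_approval(proposed_actions, execute, ask=input):
    results = []
    for action in proposed_actions:
        answer = ask(f"Agent wants to: {action}. Approve? [y/n] ")
        if answer.strip().lower().startswith("y"):
            results.append(execute(action))
        else:
            results.append(None)  # action skipped by the human reviewer
    return results
```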

Conclusion 

In conclusion, we have seen that building trustworthy AI agents requires careful design, robust security measures, and continuous iteration. By implementing structured meta prompting systems, understanding potential threats, and applying mitigation strategies, developers can create AI agents that are safe and effective. Additionally, incorporating a human-in-the-loop approach ensures that AI agents remain aligned with user needs while minimizing risks.

Additional Resources 

Responsible AI overview 

Evaluation of generative AI models and AI applications 

Safety system messages 

Risk Assessment Template

AI Agents for Beginners 
Generative AI for Beginners 
