Action Guard — Safer Autonomous AI Agents

This work makes AI agents safer by classifying and blocking harmful actions before they execute. If you value trustworthy AI, please consider supporting and sharing this project.

Star on GitHub Read the Paper

The problem we solve

The rapid rise of autonomous AI agents and tool-enabled LLMs increases the chance they propose or execute harmful, unethical, or unsafe actions. Causes include prompt manipulation, model hallucination, overconfidence, and misleading MCP tool descriptions. Without runtime screening, agents can take harmful steps before humans notice.

Action Guard addresses this by classifying proposed agent actions in real-time and blocking ones flagged as harmful — reducing risk while preserving useful automation.

Why this matters

Project details

HarmActions dataset: A structured dataset of safety-labeled agent actions used to train the classifier and to evaluate agent behaviour (HarmActEval).

Action Classifier: A lightweight neural model trained on HarmActions to label proposed actions as safe, harmful, or unethical — optimized for realtime deployment inside agent loops.

MCP integration: Action Guard integrates with MCP workflows via a guarded proxy and sample MCP server, allowing live screening of tool-calls without modifying existing agents.

Key features: automatic action classification, blocking of harmful calls, detailed classification output, and easy integration (importable API: `is_action_harmful`).

See the full project README for usage, Docker Compose demo, and citation details.

Live demo & workflow

Demo of Action Guard
Interactive demo showcasing guarded agent interaction.
Workflow diagram
How Action Guard screens actions in an agent loop.

How you can help

  1. Star and fork the repo to boost visibility.
  2. Share the demo with peers and on social media.
  3. Try the code, report issues, or contribute improvements.

Citations, badges, and links live in the repository — contributions help the research and improve safety across AI systems.

Modules & components