The problem we solve
The rapid rise of autonomous AI agents and tool-enabled LLMs increases the chance they propose or execute harmful, unethical, or unsafe actions. Causes include prompt manipulation, model hallucination, overconfidence, and misleading MCP tool descriptions. Without runtime screening, agents can take harmful steps before humans notice.
Action Guard addresses this by classifying proposed agent actions in real time and blocking those flagged as harmful, reducing risk while preserving useful automation.
Why this matters
- Prevents harm: Screens agent tool-calls to stop unsafe or unethical actions.
- Lightweight: A compact classifier designed for real-time use in agent loops.
- Reproducible: Dataset, demo, and code are public so the community can build on it.
Project details
HarmActions dataset: A structured dataset of safety-labeled agent actions used to train the classifier and to evaluate agent behaviour (HarmActEval).
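For intuition, a single HarmActions-style record pairs a proposed tool-call with a safety label. The sketch below is purely illustrative: the field names are assumptions, not the dataset's actual schema; only the three label classes (safe, harmful, unethical) come from this page.

```python
# Hypothetical HarmActions-style record. Field names are illustrative
# assumptions, not the dataset's real schema. The label follows the
# three classes the classifier predicts: "safe", "harmful", "unethical".
example_record = {
    "tool": "shell.execute",                   # tool the agent wants to call
    "arguments": {"command": "rm -rf /data"},  # proposed arguments
    "context": "User asked to clean up temporary files.",
    "label": "harmful",                        # safety annotation
}
```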
Action Classifier: A lightweight neural model trained on HarmActions to label proposed actions as safe, harmful, or unethical, optimized for real-time deployment inside agent loops.
MCP integration: Action Guard integrates with MCP workflows via a guarded proxy and sample MCP server, allowing live screening of tool-calls without modifying existing agents.
Key features: automatic action classification, blocking of harmful calls, detailed classification output, and easy integration (importable API: `is_action_harmful`).
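A minimal usage sketch of the importable API named above, assuming `is_action_harmful` takes an action dict (as the Utilities section below notes) and returns a truthy value for flagged actions; that return behaviour is an assumption here, and the project README documents the authoritative signature.

```python
# Sketch: screen a proposed tool-call before executing it.
from agent_action_guard import is_action_harmful

def execute_tool(action: dict) -> None:
    """Stand-in for your agent's real tool dispatch."""
    print(f"Executing {action['tool']} ...")

proposed_action = {
    "tool": "email.send",
    "arguments": {"to": "all@company.com", "body": "..."},
}

# Block the call if the classifier flags it as harmful or unethical.
if is_action_harmful(proposed_action):
    print("Blocked: action flagged by Action Guard.")
else:
    execute_tool(proposed_action)
```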
See the full project README for usage, Docker Compose demo, and citation details.
Live demo & workflow
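The repository ships a Gradio chat demo and backend API (`agent_action_guard/scripts/chat_server.py`, `agent_action_guard/scripts/api_server.py`); the provided Docker Compose setup runs the full demo stack locally.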
How you can help
- Star and fork the repo to boost visibility.
- Share the demo with peers and on social media.
- Try the code, report issues, or contribute improvements.
Citations, badges, and links live in the repository; contributions directly support the research and improve safety across AI systems.
Modules & components
- HarmActions dataset: Labeled agent-action dataset used for training and evaluation (HarmActEval benchmark).
- Action Classifier: Lightweight neural model that tags proposed agent actions as safe, harmful, or unethical.
- MCP proxy / integrations: Guarded proxy and sample MCP server for live screening of tool-calls (see `agent_action_guard/scripts/sample_mcp_server.py`).
- API & demo: Backend and Gradio chat demo (`agent_action_guard/scripts/api_server.py`, `agent_action_guard/scripts/chat_server.py`).
- Utilities: Importable helper `is_action_harmful(action_dict)` and `HarmfulActionException` for easy integration into other projects (see the guard-pattern sketch after this list).
- Deployment: Docker Compose setup and example `.env` for running the full demo stack.
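To make the integration concrete, here is a minimal sketch of the guard pattern referenced from the Utilities item above. It assumes `is_action_harmful` returns a truthy value for flagged actions and that `HarmfulActionException` accepts a message string; whether the library raises the exception itself or leaves that to the caller is an assumption, and `dispatch_to_mcp_server` is a hypothetical stand-in for the guarded proxy's forwarding step.

```python
# Sketch: wrap an agent's tool dispatch with Action Guard.
# Assumption: we raise HarmfulActionException ourselves when the
# classifier flags an action; the library may offer its own raising helper.
from agent_action_guard import HarmfulActionException, is_action_harmful

def dispatch_to_mcp_server(action: dict) -> str:
    """Hypothetical stand-in for the proxy's forwarding step."""
    return f"ok: {action['tool']}"

def guarded_call(action: dict) -> str:
    """Screen `action` and forward it only if the classifier allows it."""
    if is_action_harmful(action):
        raise HarmfulActionException(f"Blocked tool-call: {action['tool']}")
    return dispatch_to_mcp_server(action)

try:
    guarded_call({"tool": "fs.delete", "arguments": {"path": "/etc"}})
except HarmfulActionException as exc:
    # The agent loop catches the exception and continues safely.
    print(f"Blocked by Action Guard: {exc}")
```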