Personalized Constitutionally-Aligned Agentic Superego:
Secure AI Behavior Aligned to Diverse Human Values

Nell Watson ¹, Ahmed Amer ², Evan Harris ³, Preeti Ravindra ⁴, Shujun Zhang ⁵

¹,⁵ University of Gloucestershire
²,³,⁴ Independent Researcher

Agentic AI systems, capable of autonomous planning and action, possess immense potential. However, their practical deployment is acutely hampered by the difficulty of aligning them with diverse human values, safety needs, and compliance requirements. Existing methods often grapple with imbuing AI with deep, personalized context without triggering issues like confabulation or analysis paralysis. How can we enable AI to genuinely comprehend and respect individual and cultural nuances effectively and reliably?

We introduce the Personalized Constitutionally-Aligned Agentic Superego, a novel framework that incorporates an oversight agent engineered to guide agentic AI. This 'superego' ensures AI planning and execution align with user-defined ethical, cultural, or personal rule sets.

We present a functional system that actualizes this concept, substantiated by extensive quantitative benchmarks demonstrating a dramatic reduction in harmful AI outputs when the Superego is operational. Our approach significantly simplifies personalized AI alignment, rendering agentic systems more reliably attuned to individual and cultural contexts. The empirical evidence confirms its practical effectiveness in enhancing AI safety and its adaptability through natural language constitutional tuning, forging a tangible pathway towards AI that proves not only powerful but also trustworthy and reflective of the diverse range of human values.


Superego framework graphical abstract showing the architecture and components


The Challenge: Aligning Agentic AI with Diverse and Nuanced Human Contexts

Agentic AI systems, capable of autonomous planning and action, hold immense potential. However, their practical deployment is severely hindered by the difficulty of aligning them with diverse human values, safety needs, and compliance requirements. Existing methods often struggle to imbue AI with deep, personalized context without causing issues like confabulation or analysis paralysis. How can we make AI genuinely understand and respect individual and cultural nuances effectively and reliably?

Our Solution: The Personalized Agentic Superego

We introduce the Personalized Constitutionally-Aligned Agentic Superego, a novel framework featuring an oversight agent designed to steer agentic AI. This 'superego' ensures AI planning and execution align with user-defined ethical, cultural, or personal rule sets.

How It Works

The Personalized Agentic Superego empowers users to align AI in three simple, high-level steps:

1. Select Your Values: Users choose from a library of 'Creed Constitutions' (e.g., Vegan lifestyle, K-12 Educational Appropriateness, Fiduciary Duties) that represent specific value systems.
2. Set Adherence Levels: Using a simple 1-5 scale, users 'dial' how strictly the AI must adhere to each selected constitution, allowing for nuanced control.
3. Superego Guides AI: A dedicated 'Superego' agent then monitors the AI's internal planning and proposed actions in real-time, ensuring they comply with the chosen constitutions and adherence levels before execution.
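
The first two steps amount to building a small configuration that the Superego later consults. The sketch below is one possible shape for that configuration, not the Creed.Space data model: the class names, the `select` and `strictest` helpers, and the assumption that 1 is the loosest setting and 5 the strictest are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConstitutionSelection:
    """One chosen Creed Constitution plus its dialed adherence level."""
    name: str
    adherence: int  # assumed direction: 1 = loosest, 5 = strictest

    def __post_init__(self) -> None:
        # Reject values outside the 1-5 dial described in the text.
        if not 1 <= self.adherence <= 5:
            raise ValueError(f"adherence must be 1-5, got {self.adherence}")


@dataclass
class SuperegoConfig:
    """Hypothetical container for the user's selections."""
    selections: list[ConstitutionSelection] = field(default_factory=list)

    def select(self, name: str, adherence: int) -> None:
        # Steps 1 and 2: pick a constitution and dial its strictness.
        self.selections.append(ConstitutionSelection(name, adherence))

    def strictest(self) -> ConstitutionSelection:
        # Hypothetical helper: the selection with the highest dial setting.
        return max(self.selections, key=lambda s: s.adherence)


config = SuperegoConfig()
config.select("Vegan lifestyle", 5)
config.select("K-12 Educational Appropriateness", 3)
```

Step 3 would then hand `config` to the oversight agent, which checks each proposed action against the listed constitutions before it runs.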

Constitutional Superego infographic explaining how personalized AI alignment works through Creed Constitutions and the Superego oversight agent

Visualizing the Superego Architecture

Superego agent conceptual architecture diagram showing components and their interactions

A diagram illustrating the Superego Agent's conceptual architecture, including components like the Inner Agent, Constitutions Repository (Creed Constitutions), User Interface for selection and dialing adherence, the Superego Agent itself, and the Real-time Compliance Enforcer validating AI plans before execution.

Core Mechanisms

The Superego framework operates through several core mechanisms:

• Creed Constitutions: Users select from a library of 'constitutions' (e.g., Vegan, Christian, K-12 Appropriate, Fiduciary Duties) representing specific value systems.
• Dialable Adherence: A simple 1-5 scale allows users to set how strictly the AI must adhere to each selected constitution, enabling nuanced control.
• Constitutional Context Delivery (via MCP): Employs the Model Context Protocol (MCP) to seamlessly transmit the selected Creed Constitutions and their dialed adherence levels to compatible AI models and agentic systems, ensuring the AI operates with the necessary user-defined context.
• Real-Time Compliance Enforcement: A dedicated enforcer intercepts and validates AI plans before execution against the active, MCP-delivered constitutions and their adherence levels.
• Universal Ethical Floor (UEF): A baseline of non-negotiable safety and ethical principles underpins all configurations, ensuring a fundamental level of safety.
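
To make the ordering of these checks concrete, here is a minimal sketch of a pre-execution validator in which the UEF is always checked first and constitution rules apply only when dialed high enough. It is a toy under stated assumptions: keyword matching stands in for the Superego's language-model judgment, and `Constitution`, `min_level`, `uef_violated`, and `validate_plan` are hypothetical names, not part of the published system or of MCP.

```python
from dataclasses import dataclass
from typing import Callable

# A rule inspects one proposed plan step and returns True if it is violated.
Rule = Callable[[str], bool]


@dataclass
class Constitution:
    name: str
    adherence: int      # dialed 1-5 level
    rule: Rule          # toy stand-in for natural-language rules
    min_level: int = 1  # hypothetical: rule is enforced at or above this level


def uef_violated(step: str) -> bool:
    # Toy stand-in for the Universal Ethical Floor; enforced regardless of dials.
    return "weapon" in step.lower()


def validate_plan(plan: list[str], constitutions: list[Constitution]) -> list[str]:
    """Return reasons for rejection; an empty list means the plan may run."""
    reasons = []
    for step in plan:
        if uef_violated(step):
            reasons.append(f"UEF: {step}")
            continue  # the floor takes precedence over any constitution
        for c in constitutions:
            if c.adherence >= c.min_level and c.rule(step):
                reasons.append(f"{c.name}: {step}")
    return reasons


vegan = Constitution("Vegan", adherence=4,
                     rule=lambda s: "cheese" in s.lower(), min_level=2)
plan = ["Search for picnic recipes", "Order cheese platter"]
print(validate_plan(plan, [vegan]))  # ['Vegan: Order cheese platter']
```

Checking the floor before any user-selected constitution reflects the text's description of the UEF as non-negotiable: no dial setting can switch it off.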

Key Advantages of the Superego Approach

Our framework offers distinct advantages in achieving personalized and robust AI alignment:

• Deep Personalization: Tailor AI behavior with user-selectable 'Creed Constitutions' and finely-tuned adherence levels, moving beyond one-size-fits-all solutions.
• Real-Time Process Oversight: Monitors AI planning (chain-of-thought) and intermediate steps, not just final outputs, allowing earlier detection and mitigation of potential misalignments.
• Nuanced Interventions: Goes beyond simple block/allow decisions; the Superego can request user clarification, suggest compliant alternatives, or modify plans to align with user values.
• Dynamic & Adaptable: Easily tuned via natural language adjustments to constitutions, without requiring complex LLM fine-tuning or extensive new datasets for each personalization.
• Seamless Ecosystem Integration: Leverages the Model Context Protocol (MCP) to inject personalized constitutional constraints directly into compatible third-party AI models and agentic workflows, enhancing practical applicability and interoperability.
• User-Empowering: Simplifies the complex task of AI alignment, making it easier for non-expert users to define how their AI should operate.
• Functional, Demonstrable System: A working constitutional superego system, including a portal for sharing constitutions and successful integration with third-party models via the Model Context Protocol (MCP). The interface (shown below) allows users to select constitutions, view their details, set adherence levels, and observe the Superego's reasoning.

Creed.Space interface showing constitution selection, rules view, and adherence level settings

The Creed.Space early prototype interface demonstrating user interaction for selecting 'Creed Constitutions' (e.g., "K-12 Context," "Vegan"), viewing their specific rules (center pane), and setting adherence levels (bottom slider). The Superego's reasoning (left) guides the AI's response to user prompts (right).
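
The 'Nuanced Interventions' advantage listed above implies more outcomes than allow or block. The sketch below enumerates the outcomes named in the text and pairs them with a toy decision policy. The policy itself, including the adherence threshold of 4 and the `choose_intervention` function, is an assumption made for illustration and is not taken from the paper.

```python
from enum import Enum


class Intervention(Enum):
    ALLOW = "allow"
    ASK_USER = "request user clarification"
    SUGGEST_ALTERNATIVE = "suggest compliant alternative"
    MODIFY_PLAN = "modify plan"
    BLOCK = "block"


def choose_intervention(violates: bool, ambiguous: bool,
                        has_alternative: bool, adherence: int) -> Intervention:
    """Toy policy (an assumption) mapping a review result to an outcome."""
    if ambiguous:
        # Unclear intent: ask rather than guess.
        return Intervention.ASK_USER
    if not violates:
        return Intervention.ALLOW
    if has_alternative:
        # Hypothetical: at high adherence rewrite the plan outright;
        # otherwise only propose the alternative to the user.
        if adherence >= 4:
            return Intervention.MODIFY_PLAN
        return Intervention.SUGGEST_ALTERNATIVE
    return Intervention.BLOCK


print(choose_intervention(violates=True, ambiguous=False,
                          has_alternative=True, adherence=2).value)
```

Modeling the outcomes as an enum keeps the calling agent's handling explicit: each value corresponds to a distinct next action instead of a single pass/fail flag.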

Key Findings

The Superego framework produces significantly more aligned and helpful outputs than baseline models across a wide range of applications. The example below showcases the Superego's sophisticated handling of a culturally sensitive request:

Comparison between Superego and baseline model responses to a culturally-sensitive request

A qualitative comparison. Left: The Superego provides a detailed, culturally-aligned response for planning a Shabbat-Pesach Seder under strict Halachic observance. Right: A baseline model (without Superego) offers a more generic, less informed, and less helpful response to the same nuanced request.

Benchmark Performance Highlights

• Dramatic Harm Reduction (HarmBench): When tested against human jailbreak attempts, the Superego (with UEF) achieved a 96.4% Attack Success Rate (ASR) reduction for OpenAI's GPT-4o (rendering it effectively 100% safe) and a 76.9% ASR reduction for Google's Gemini 2.5 Flash.
• Near-Perfect Refusal (AgentHarm): On the AgentHarm "harmful" set, the Superego increased refusal rates for Gemini 2.5 Flash from a baseline of 52.6% to 99.4%. For Anthropic's Claude Sonnet 4, harmful prompt refusal rates increased from 72.0% to 96.6% (and 100% after further targeted tuning).
• Reduced False Positives: Iterative refinement of the UEF demonstrated the Superego's ability to significantly reduce false positive refusals on AgentHarm's "benign" prompts, matching baseline model refusal rates (e.g., 2.27% for Claude Sonnet 4) while concurrently maintaining high protection against genuinely harmful content. This showcases advanced ethical reasoning beyond simple pattern matching.
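
For readers unfamiliar with how a percentage reduction relates to before-and-after rates, the sketch below works one example from the refusal figures quoted above. It assumes a "reduction" is measured relative to the baseline rate, which the text does not state explicitly. The value it prints is derived arithmetic for illustration only; it is not a result reported by the authors and is a different quantity from the HarmBench ASR figures.

```python
def relative_reduction(baseline: float, with_superego: float) -> float:
    """Percentage by which a rate drops relative to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - with_superego) / baseline


# Refusal rates quoted above for Gemini 2.5 Flash on AgentHarm's harmful set.
baseline_refusal, superego_refusal = 52.6, 99.4

# Share of harmful prompts NOT refused, before and after the Superego.
baseline_unrefused = 100.0 - baseline_refusal   # 47.4
superego_unrefused = 100.0 - superego_refusal   # 0.6

derived = round(relative_reduction(baseline_unrefused, superego_unrefused), 1)
print(derived)  # 98.7
```

Under this assumption, a rise in refusals from 52.6% to 99.4% corresponds to the unrefused share falling from 47.4% to 0.6%, a relative drop of about 98.7%.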

76.9%: ASR reduction for Gemini 2.5 Flash

96.4%: ASR reduction for GPT-4o

98.3%: Harm reduction on AgentHarm

Illustrative Use Cases

The Superego framework enables a wide range of valuable applications:

• Culturally-Sensitive Event Planning: Ensuring AI-generated plans for events (e.g., a children's picnic) fully comply with diverse religious observances (e.g., Shabbat, Halal, Hindu dietary practices) and age-appropriateness requirements.
• Adherence to Specific Dietary Laws: Assisting users in finding resources, products, or services that strictly adhere to regulations like Halal-certified food options or vegan principles.
• Corporate & Professional Compliance: Enforcing the consistent application of corporate policies, ethical guidelines, or fiduciary duties in professional settings where AI agents are deployed.
• Critical Safety Standards in Sensitive Applications: Applying and enforcing safety standards in areas such as AI-assisted counseling (ensuring advice aligns with best practices and avoids harmful suggestions) or managing information related to severe allergies (e.g., preventing an AI from recommending recipes containing known allergens for a user).
• Upholding Fundamental Human Rights and Ethical Theories: Guiding AI behavior to be consistent with established human rights declarations (e.g., preventing discriminatory outputs, ensuring privacy) and enabling the application of specific moral or ethical theories to complex decision-making scenarios.

See the Demonstration

The Personalized Constitutionally-Aligned Agentic Superego represents a significant stride towards AI that is not only powerful but also demonstrably safer, more trustworthy, and deeply attuned to the diverse tapestry of human values. This framework offers a practical and adaptable pathway for developers and users alike to foster AI systems that genuinely reflect and respect individual and cultural contexts. We invite you to explore these capabilities further through our interactive demonstration at www.Creed.Space and our detailed research.


Citation

If you use this work in your research, please cite our paper:

@article{watson2025personalized,
    title={Personalized Constitutionally-Aligned Agentic Superego: 
           Secure AI Behavior Aligned to Diverse Human Values},
    author={Watson, Nell and Amer, Ahmed and Harris, Evan and 
            Ravindra, Preeti and Zhang, Shujun},
    journal={Information},
    volume={16},
    number={8},
    pages={651},
    year={2025},
    doi={10.3390/info16080651}
}

Contact Us

We welcome feedback, questions, and collaborative opportunities related to the Superego framework.