arxiv:2606.10747
Federico Torrielli
EvilScript
AI & ML interests
AI Safety & Mechanistic interpretability
Recent Activity
authored a paper 39 minutes ago
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
authored a paper 4 days ago
PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
upvoted a paper 5 days ago
PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models