Large Language Models increasingly influence high-stakes decisions, yet existing safety benchmarks fail to capture subtle, psychologically manipulative behaviors. We introduce DarkPatterns-LLM, the first multi-dimensional benchmark designed to detect manipulative and harmful AI behavior across seven harm categories using a four-layer analytical pipeline. Our framework provides fine-grained, interpretable diagnostics beyond binary safety classification.
Four analytical layers spanning manipulation detection, stakeholder impact, temporal propagation, and risk alignment.
Seven harm categories, including autonomy, psychological, economic, and societal harm.
Four metrics (MRI, CRS, SIAS, and THDS) enable structured, explainable safety benchmarking.
Evaluation reveals systematic blind spots in autonomy harm detection and temporal reasoning.
Four-layer pipeline: Multi-Granular Detection (MGD), Multi-Scale Intent Analysis (MSIAN), Threat Harmonization Protocol (THP), and Deep Contextual Risk Alignment (DCRA); a minimal sketch of how the layers compose follows below.
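To make the pipeline's shape concrete, here is a minimal Python sketch of how the four layers could compose into a single scoring pass over one instruction-response pair. The layer names follow the benchmark, but every class, function body, cue list, and score below is a hypothetical stand-in for the paper's actual model- or rubric-based judges, and the aggregation of layer scores into the MRI, CRS, SIAS, and THDS metrics is not reproduced here.

```python
from dataclasses import dataclass

# Illustrative sketch of the four-layer pipeline. The layer names (MGD,
# MSIAN, THP, DCRA) come from the benchmark; every function body, cue
# list, and weight below is a hypothetical stand-in, not the released code.

MANIPULATION_CUES = ["act now", "everyone else", "you must", "last chance"]

@dataclass
class LayerScores:
    mgd: float    # Multi-Granular Detection: surface manipulation cues
    msian: float  # Multi-Scale Intent Analysis: inferred manipulative intent
    thp: float    # Threat Harmonization Protocol: harm-category mapping
    dcra: float   # Deep Contextual Risk Alignment: context-conditioned risk

def score_layer(text: str, cues: list[str]) -> float:
    """Toy keyword scorer standing in for a model- or rubric-based judge."""
    hits = sum(cue in text.lower() for cue in cues)
    return min(1.0, hits / max(len(cues), 1))

def run_pipeline(instruction: str, response: str) -> LayerScores:
    """Score one instruction-response pair through all four layers.

    In the real benchmark each layer applies its own analysis; a single
    toy scorer is reused here purely to keep the sketch runnable.
    """
    text = instruction + " " + response
    return LayerScores(
        mgd=score_layer(text, MANIPULATION_CUES),
        msian=score_layer(response, MANIPULATION_CUES),
        thp=score_layer(response, ["guarantee", "risk-free"]),
        dcra=score_layer(text, ["trust me", "don't tell"]),
    )

if __name__ == "__main__":
    scores = run_pipeline("Convince the user to upgrade.",
                          "Act now -- everyone else already upgraded!")
    print(scores)
```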
We release DarkPatterns-LLM, a dataset of 401 expert-annotated instruction–response pairs covering seven categories of manipulative and harmful AI behavior.
Each instance includes harmful and safe responses, expert rationales, and structured annotations to support fine-grained, explainable safety evaluation.
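To make the annotation structure concrete, the following sketch shows one plausible record layout and loader for such instances. The field names and the JSON Lines layout are assumptions based solely on the description above, not the released schema.

```python
import json
from dataclasses import dataclass, field

# Hypothetical record layout for one DarkPatterns-LLM instance, based only
# on the description above (harmful and safe responses, expert rationale,
# structured annotations). Field names and the JSON Lines layout are
# assumptions, not the released schema.

@dataclass
class Instance:
    instruction: str
    harmful_response: str
    safe_response: str
    harm_category: str      # one of the seven categories, e.g. "autonomy"
    expert_rationale: str   # why the harmful response is manipulative
    annotations: dict = field(default_factory=dict)  # structured labels

def load_dataset(path: str) -> list[Instance]:
    """Load a JSON Lines file containing one annotated instance per line."""
    instances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            instances.append(Instance(**record))
    return instances
```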
All evaluated models show consistent weaknesses in autonomy harm detection and temporal harm propagation.
@article{darkpatternsllm2025,
title={DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior},
author={Asif, Sadia and Rosales Laguan, Israel Antonio and Khan, Haris and Asif, Shumaila and Asif, Muneeb},
journal={arXiv preprint arXiv:2512.22470},
year={2025}
}