DarkPatterns-LLM

A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior

Sadia Asif · Israel A. Rosales Laguan · Haris Khan · Shumaila Asif · Muneeb Asif

[Figure: DarkPatterns-LLM dataset harm categories]

Abstract

Large Language Models increasingly influence high-stakes decisions, yet existing safety benchmarks fail to capture subtle, psychologically manipulative behaviors. We introduce DarkPatterns-LLM, the first multi-dimensional benchmark designed to detect manipulative and harmful AI behavior across seven harm categories using a four-layer analytical pipeline. Our framework provides fine-grained, interpretable diagnostics beyond binary safety classification.

Key Contributions

Multi-Layer Evaluation

Four analytical layers spanning manipulation detection, stakeholder impact, temporal propagation, and risk alignment.

Psychology-Grounded Taxonomy

Seven harm categories including autonomy, psychological, economic, and societal harm.

Novel Metrics

MRI, CRS, SIAS, and THDS enable structured, explainable safety benchmarking.

Model Diagnostics

Reveals systematic blindspots in autonomy harm detection and temporal reasoning.

Framework Overview

[Figure: DarkPatterns-LLM framework overview]

Four-layer pipeline: Multi-Granular Detection (MGD), Multi-Scale Intent Analysis (MSIAN), Threat Harmonization Protocol (THP), and Deep Contextual Risk Alignment (DCRA).
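The flow of the four layers can be sketched as a staged scoring pipeline. The layer names (MGD, MSIAN, THP, DCRA) come from the framework description above; every function body below is a toy placeholder for illustration, not the paper's actual detection or scoring logic, and all thresholds and cue lists are assumptions.

```python
# Illustrative sketch of the four-layer analytical pipeline.
# Layer names are from the framework; scoring logic is a toy assumption.

MANIPULATIVE_CUES = {"act now", "everyone agrees", "you must", "last chance"}

def mgd(response: str) -> float:
    """Multi-Granular Detection: flag manipulative phrasing (toy keyword scan)."""
    hits = sum(cue in response.lower() for cue in MANIPULATIVE_CUES)
    return min(1.0, hits / 2)

def msian(instruction: str, response: str) -> float:
    """Multi-Scale Intent Analysis: placeholder score for inferred intent."""
    return 0.5 if "buy" in instruction.lower() else 0.0

def thp(scores: dict) -> float:
    """Threat Harmonization Protocol: combine earlier layer signals."""
    return max(scores.values()) if scores else 0.0

def dcra(scores: dict) -> float:
    """Deep Contextual Risk Alignment: reduce layer signals to a final risk score."""
    return sum(scores.values()) / len(scores)

def run_pipeline(instruction: str, response: str) -> dict:
    """Run one instruction-response pair through all four layers in order."""
    scores = {}
    scores["MGD"] = mgd(response)
    scores["MSIAN"] = msian(instruction, response)
    scores["THP"] = thp(scores)
    scores["DCRA"] = dcra(scores)
    return scores
```

The staged design mirrors the framework's intent: later layers consume the outputs of earlier ones rather than re-scoring the raw text, which is what makes the final risk score interpretable layer by layer.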

Dataset

We release DarkPatterns-LLM, a dataset of 401 expert-annotated instruction–response pairs covering seven categories of manipulative and harmful AI behavior.

Each instance includes harmful and safe responses, expert rationales, and structured annotations to support fine-grained, explainable safety evaluation.
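A single annotated instance might be structured as follows. The field names in this sketch are assumptions chosen to match the description above (harmful and safe responses, a harm category, an expert rationale); consult the released dataset for the actual schema and file format.

```python
import json

# Hypothetical record layout for one DarkPatterns-LLM instance.
# All field names are illustrative assumptions, not the released schema.
record = {
    "id": "dp-0001",
    "instruction": "Convince me to skip reading the contract.",
    "harmful_response": "Contracts are just formalities; sign it and move on.",
    "safe_response": "I'd recommend reading it first; key clauses can affect you.",
    "harm_category": "autonomy",  # one of the seven harm categories
    "expert_rationale": "Discourages informed consent, undermining user autonomy.",
}

def load_dataset(path: str) -> list[dict]:
    """Load expert-annotated records from an assumed JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```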

📦 Download Dataset

Results Highlights

Top MRI score: 89.7
Harm categories: 7
Annotated samples: 401
Evaluation layers: 4

All evaluated models show consistent weaknesses in autonomy harm detection and temporal harm propagation.

Citation


@article{darkpatternsllm2025,
  title={DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior},
  author={Asif, Sadia and Rosales Laguan, Israel Antonio and Khan, Haris and Asif, Shumaila and Asif, Muneeb},
  journal={arXiv preprint arXiv:2512.22470},
  year={2025}
}