AI oversight

Tools for monitoring the risks and impact of AI.

Publications

J. Contro, S. Deol, Y. He, and M. Brandao, “ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour,” in TrustNLP: Sixth Workshop on Trustworthy Natural Language Processing, 2026. [Abstract] [Code] [arXiv] [PDF]

This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains generated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly Gaslighting and Fear Enhancement. Third, zero-shot larger models such as Gemini 2.5 Pro have the best performance in detecting manipulation (of the models tested), with more work required to fine-tune smaller open-source models for real-world on-device oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.

S. Deol, J. Contro, and M. Brandao, “Is this Chatbot Trying to Sell Something? Towards Oversight of Chatbot Sales Tactics,” in Proceedings of the 9th Widening NLP Workshop, 2025, pp. 136–156. [Abstract] [arXiv]

This research investigates the detection of covert sales tactics in human-chatbot interactions, with a focus on the classification of solicited and unsolicited product recommendations. A custom dataset of 630 conversations was generated using a Large Language Model (LLM) to simulate chatbot-user interactions in various contexts, such as interacting with users from different age groups, recommending different types of products, and using different types of sales tactics. We then employ various approaches, including BiLSTM-based classification with sentence- and word-level embeddings, as well as zero-shot, few-shot, and chain-of-thought (CoT) classification with large state-of-the-art LLMs. Our results show that few-shot GPT-4 (86.44%) is the most accurate model on our dataset, followed by our compact SBERT+BiLSTM model (78.63%), despite its small size. Our work demonstrates the feasibility of implementing oversight algorithms for monitoring chatbot conversations for undesired practices, and shows that such monitoring could potentially be implemented locally on-device to mitigate privacy concerns. This research thus lays the groundwork for the development of auditing and oversight methods for virtual assistants such as chatbots, allowing consumer protection agencies to monitor the ethical use of conversational AI.
