Detecting Training Data For Large Language Models: A Survey
Summary
This survey article reviews methods for detecting the training data used to build large language models (LLMs, AI systems trained on massive text corpora to generate human-like responses). The paper examines techniques researchers have developed to determine whether specific data was used to train a model and to extract information about the training set, which is important for understanding model behavior and potential privacy risks.
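One common family of detection techniques scores a candidate text by how confidently the model predicts its tokens, on the intuition that text seen during training tends to receive higher probabilities. The sketch below illustrates that idea with a minimal, hypothetical scoring function; the function name, the `token_log_probs` input (standing in for per-token log-probabilities returned by a real model), and the threshold value are all illustrative assumptions, not the survey's specific method.

```python
def min_k_score(token_log_probs, k=0.2):
    """Average the lowest k-fraction of per-token log-probabilities.

    Illustrative heuristic: text memorized during training tends to
    contain fewer very-low-probability tokens, so a higher (less
    negative) score suggests the text may have been in the training set.
    """
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]  # the k% least-likely tokens
    return sum(lowest) / n

def likely_training_member(token_log_probs, threshold=-4.0, k=0.2):
    # threshold is a placeholder; in practice it would be calibrated
    # on data known to be inside/outside the training set
    return min_k_score(token_log_probs, k) > threshold
```

For example, a text whose least-likely tokens still have log-probabilities near zero would score above the (assumed) threshold, while a text containing several very surprising tokens would fall below it.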
Classification
Related Issues
CVE-2025-45150: Insecure permissions in LangChain-ChatGLM-Webui commit ef829 allows attackers to arbitrarily view and download sensitive
CVE-2025-54868: LibreChat is a ChatGPT clone with additional features. In versions 0.0.6 through 0.7.7-rc1, an exposed testing endpoint
Original source: https://dl.acm.org/doi/abs/10.1145/3779430?af=R
First tracked: March 16, 2026 at 05:11 PM
Classified by LLM (prompt v3) · confidence: 92%