Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models

Zhang, Yihao; Wei, Zeming; Sun, Jun; Sun, Meng

arXiv:2404.13752 (cs)

[Submitted on 21 Apr 2024 (v1), last revised 1 Nov 2024 (this version, v3)]

Abstract: The rapid development and remarkable success of Large Language Models (LLMs) have made understanding and rectifying their complex internal mechanisms an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representations. However, developing practical and efficient methods for applying these representations to general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation sensor as an editing oracle. We first identify the importance of a robust and reliable sensor during editing, then propose an Adversarial Representation Engineering (ARE) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at this https URL.

Comments:	Accepted by NeurIPS 2024
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Optimization and Control (math.OC)
Cite as:	arXiv:2404.13752 [cs.LG]
	(or arXiv:2404.13752v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2404.13752

Submission history

From: Yihao Zhang
[v1] Sun, 21 Apr 2024 19:24:15 UTC (1,034 KB)
[v2] Thu, 23 May 2024 13:06:59 UTC (1,034 KB)
[v3] Fri, 1 Nov 2024 07:51:36 UTC (1,037 KB)
