Personalized LLM Decoding via
Contrasting Personal Preference

Hyungjune Bu1*, Chanjoo Jung1*, Minjae Kang2, Jaehyung Kim1
1Yonsei University, 2Opt AI
*equal contribution
Teaser image

CoPe (Contrasting Personal Preference)
personalizes LLMs using implicit reward maximization via contrastive preference.

Abstract

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.

Motivation

Existing Methods

Prompt-based [1]

  • Examples: RAG [2], PAG [3]

  • Weakness:
    superficial, prompt-dependent, context-bound.

Training-based [4]

  • Example: full fine-tuning

  • Weakness:
    high compute cost,
    catastrophic forgetting.

PEFT

  • Adapts a small subset of parameters (e.g., LoRA [5]).

  • Weakness:
    PEFT alone is incomplete; decoding-time personalization remains underused.

[1] Santurkar et al. (2023), Whose Opinions Do Language Models Reflect?, ICML.

[2] Lewis et al. (2021), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv:2005.11401.

[3] Richardson et al. (2023), Integrating Summarization and Retrieval for Enhanced Personalization via Large Language Models, arXiv:2310.20081.

[4] Tan et al. (2024), Democratizing Large Language Models via Personalized Parameter-Efficient Fine-Tuning, EMNLP.

[5] Hu et al. (2022), LoRA: Low-Rank Adaptation of Large Language Models, ICLR.

Key Contributions

  • First decoding-based LLM personalization without external reward models.
  • Unified pipeline that integrates PEFT, synthetic negatives, and DPO to maximize implicit reward.
  • Implicit reward maximization via contrastive decoding.
  • Model-agnostic, compatible with various LLMs (LLaMA, Gemma, Qwen).
  • Average gain of +10.57% ROUGE-L across 5 personalized generation tasks from LaMP and LongLaMP benchmarks.

Method

CoPe pipeline overview
  1. TAM: Task-adapted base model

    • Adapt the base LLM $\pi_{\text{base}}$ to the target task via PEFT (LoRA).
    • Output: task-aware but non-personalized model.
  2. OPPU: User-specific personalization

    • Apply LoRA fine-tuning on $\pi_{\text{base}}$ with user history $H_{\text{user}}$.
    • The resulting model is $\pi_{\text{user}} = \pi_{\text{base}} + \Delta_{\text{user}}$, capturing the user’s unique style and preferences.
  3. Synthetic negatives

    • For each input, sample K candidates from \( \pi_{\text{base}} \) and select the lowest-reward output as a negative.
    • The implicit reward is defined as:
      \[ r_{\text{user}}(y_t) = \log \frac{ \pi_{\text{user}}(y_t \mid y_{<t}) }{ \pi_{\text{base}}(y_t \mid y_{<t})^{\alpha} } \]
      💡 Rationale behind CoPe’s implicit rewards:
      1. RLHF [1] typically uses an explicit, trained reward model.
      2. DPO [2] showed that an explicit reward model isn’t required; instead, the reward can be approximated by a log-likelihood ratio of two models: \( r(y) \approx \beta \cdot \log \frac{\pi_r(y)}{\pi_{\text{ref}}(y)} \)
        • Intuition: How much more does the aligned model prefer this output compared to the reference model?
        • KL regularization keeps the model from drifting too far from the reference model.
      3. This same formulation appears in our user-implicit reward:
        \[ r_{\text{user}}(y_t) \;=\; \log \frac{\pi_{\text{user}}(y_t \mid y_{< t})} {\pi_{\text{base}}(y_t \mid y_{< t})} \]
        • Why does this stay stable? With PEFT (e.g., LoRA), only a small set of parameters changes, so \(\pi_{\text{user}}\) stays close to \(\pi_{\text{base}}\) (akin to KL regularization), which keeps the log-ratio well behaved.
        • This formulation also appears in contrastive decoding [3].

      [1] Christiano et al. (2017), Deep Reinforcement Learning from Human Preferences, NeurIPS.
      [2] Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS.
      [3] Li et al. (2023), Contrastive Decoding: Open-ended Text Generation as Optimization, ACL.
  4. DPO: Implicit preference learning

    • Using the same implicit reward in both the training and decoding phases lets us exploit it fully.
    • DPO provides exactly this objective for the training phase.
    • Train to prefer user-aligned positive samples $y^{\text{pos}}$ over synthetic negatives $y^{\text{neg}}$ using DPO: $$\mathcal{L}_{\text{DPO}} = - \log \sigma \Big( \beta [r_{\text{user}}(y^{\text{pos}}) - r_{\text{user}}(y^{\text{neg}})] \Big)$$
    • Training relies purely on implicit user reward signals, so no external reward model is needed (see the training-side sketch after this list).
  5. CoPe: Reward-guided decoding

    • At inference, choose the next token that maximizes the implicit reward over a plausibility-constrained candidate set $\mathcal{V}_{\text{head}}$ (as in contrastive decoding [3]): $$y_t^* = \arg\max_{y_t \in \mathcal{V}_{\text{head}}} r_{\text{user}}(y_t)$$
    • This keeps generation aligned with the user’s implicit reward without any external model (see the decoding sketch after this list).
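
Below is a minimal training-side sketch of steps 3 and 4. It assumes Hugging Face `transformers`-style causal LMs for $\pi_{\text{user}}$ and $\pi_{\text{base}}$, scores candidates with a sequence-level implicit reward (summed token log-ratios), and applies the DPO loss above; the function names, sampling settings, and hyperparameters are illustrative rather than the authors' implementation.

```python
# Sketch of CoPe's training side (synthetic negatives + DPO on implicit rewards).
# Assumptions: pi_user / pi_base are Hugging Face causal LMs, the sequence-level
# implicit reward is the sum of token log-ratios over the response, and padding
# handling is omitted for brevity.
import torch
import torch.nn.functional as F


def response_logprob(model, input_ids, response_mask):
    """Sum of log p(y_t | y_<t) over positions where response_mask == 1."""
    logits = model(input_ids).logits[:, :-1, :]              # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_logp = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask[:, 1:]).sum(dim=-1)   # shape: (batch,)


def implicit_reward(user_model, base_model, input_ids, response_mask):
    """r_user(y) = log pi_user(y | x) - log pi_base(y | x); gradients flow only through pi_user."""
    with torch.no_grad():
        base_lp = response_logprob(base_model, input_ids, response_mask)
    return response_logprob(user_model, input_ids, response_mask) - base_lp


@torch.no_grad()
def mine_negative(user_model, base_model, tokenizer, prompt, k=4, max_new_tokens=128):
    """Step 3: sample K candidates from pi_base and keep the lowest-reward one."""
    enc = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    outs = base_model.generate(**enc, do_sample=True, top_p=0.95,
                               num_return_sequences=k, max_new_tokens=max_new_tokens)
    mask = torch.zeros_like(outs)
    mask[:, enc.input_ids.shape[1]:] = 1                     # score only the generated part
    rewards = implicit_reward(user_model, base_model, outs, mask)
    return outs[rewards.argmin()]                            # synthetic negative y_neg


def dpo_loss(user_model, base_model, pos, neg, beta=0.1):
    """Step 4: L_DPO = -log sigma(beta * [r_user(y_pos) - r_user(y_neg)])."""
    r_pos = implicit_reward(user_model, base_model, *pos)    # pos/neg: (input_ids, response_mask)
    r_neg = implicit_reward(user_model, base_model, *neg)
    return -F.logsigmoid(beta * (r_pos - r_neg)).mean()
```

Note that gradients flow only through $\pi_{\text{user}}$; $\pi_{\text{base}}$ serves as the fixed reference, mirroring the role of the reference model in DPO.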

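A corresponding sketch of the reward-guided decoding in step 5. Here $\mathcal{V}_{\text{head}}$ is approximated by the top-k tokens under $\pi_{\text{user}}$ at each step, which is an assumption; the paper's exact head construction and the values of $\alpha$ and $k$ may differ.

```python
# Sketch of CoPe decoding: greedily pick the token with the highest implicit reward
#   r_user(y_t) = log pi_user(y_t | y_<t) - alpha * log pi_base(y_t | y_<t),
# restricted to a plausibility head (approximated here by the top-k of pi_user).
import torch

@torch.no_grad()
def cope_decode(prompt, user_model, base_model, tokenizer,
                alpha=1.0, k=20, max_new_tokens=128):
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(user_model.device)
    for _ in range(max_new_tokens):
        user_logp = torch.log_softmax(user_model(ids).logits[:, -1, :], dim=-1)
        base_logp = torch.log_softmax(base_model(ids).logits[:, -1, :], dim=-1)
        head = torch.topk(user_logp, k, dim=-1).indices       # candidate set V_head
        reward = user_logp.gather(-1, head) - alpha * base_logp.gather(-1, head)
        next_id = head.gather(-1, reward.argmax(dim=-1, keepdim=True))
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Hypothetical usage: pi_base is the task-adapted model (step 1), pi_user adds the
# user LoRA after DPO (steps 2-4); both loaded with transformers.AutoModelForCausalLM.
```

Restricting the argmax to $\mathcal{V}_{\text{head}}$ keeps implausible tokens from winning merely because $\pi_{\text{base}}$ assigns them vanishingly small probability.
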
Experiments

Main Results table showing CoPe performance
CoPe achieves the best performance, outperforming all baselines! 😊
Compatibility of CoPe across different LLMs
CoPe is effective across diverse model families and parameter scales.
Ablation study results for CoPe
CoPe performs best when DPO and contrastive decoding work together, showing a synergistic effect.
Qualitative examples of CoPe outputs

Video Presentation

Watch the authors give a first-hand briefing on their paper!

BibTeX


      @inproceedings{bu-etal-2025-personalized,
          title = "Personalized {LLM} Decoding via Contrasting Personal Preference",
          author = "Bu, Hyungjune  and
            Jung, ChanJoo  and
            Kang, Minjae  and
            Kim, Jaehyung",
          editor = "Christodoulopoulos, Christos  and
            Chakraborty, Tanmoy  and
            Rose, Carolyn  and
            Peng, Violet",
          booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
          month = nov,
          year = "2025",
          address = "Suzhou, China",
          publisher = "Association for Computational Linguistics",
          url = "https://aclanthology.org/2025.emnlp-main.1723/",
          pages = "33946--33966",
          ISBN = "979-8-89176-332-6"
        }