Personalized LLM Decoding via
Contrasting Personal Preference

Hyungjune Bu1*, Chanjoo Jung1*, Minjae Kang2, Jaehyung Kim1
1Yonsei University, 2Opt AI
*equal contribution
Teaser image

CoPe (Contrasting Personal Preference)
personalizes LLMs using implicit reward maximization via contrastive preference.

Abstract

As large language models (LLMs) are progressively deployed in various real-world applications, personalization of LLMs has become increasingly important. While various approaches to LLM personalization such as prompt-based and training-based methods have been actively explored, the development of effective decoding-time algorithms remains largely overlooked, despite their demonstrated potential. In this paper, we propose CoPe (Contrasting Personal Preference), a novel decoding-time approach applied after performing parameter-efficient fine-tuning (PEFT) on user-specific data. Our core idea is to leverage reward-guided decoding specifically for personalization by maximizing each user's implicit reward signal. We evaluate CoPe across five open-ended personalized text generation tasks. Our empirical results demonstrate that CoPe achieves strong performance, improving personalization by an average of 10.57% in ROUGE-L, without relying on external reward models or additional training procedures.

Motivation

Existing Methods

Prompt-based [1]

  • Examples: RAG [2], PAG [3]

  • Weakness:
    superficial, prompt-dependent, context-bound.

Training-based [4]

  • Example: full fine-tuning

  • Weakness:
    high compute cost,
    catastrophic forgetting.

PEFT

  • Adapts a small subset of parameters (e.g., LoRA [5]).

  • Weakness:
    PEFT alone is incomplete; decoding-time personalization remains underused.

[1] Santurkar et al. (2023), Whose Opinions Do Language Models Reflect?, ICML.

[2] Lewis et al. (2021), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv:2005.11401.

[3] Richardson et al. (2023), Integrating Summarization and Retrieval for Enhanced Personalization via Large Language Models, arXiv:2310.20081.

[4] Tan et al. (2024), Democratizing Large Language Models via Personalized Parameter-Efficient Fine-Tuning, EMNLP.

[5] Hu et al. (2022), LoRA: Low-Rank Adaptation of Large Language Models, ICLR.

Key Contributions

  • First decoding-based LLM personalization without external reward models.
  • Unified pipeline that integrates PEFT, synthetic negatives, and DPO to maximize implicit reward.
  • Implicit reward maximization via contrastive decoding.
  • Model-agnostic, compatible with various LLMs (LLaMA, Gemma, Qwen).
  • Average gain of +10.57% ROUGE-L across 5 personalized generation tasks from LaMP and LongLaMP benchmarks.

Method

CoPe pipeline overview
  1. TAM: Task-adapted base model

    • Adapt the base LLM $\pi_{\text{base}}$ to the target task via PEFT (LoRA).
    • Output: task-aware but non-personalized model.
  2. OPPU: User-specific personalization

    • Apply LoRA fine-tuning on $\pi_{\text{base}}$ with user history $H_{\text{user}}$.
    • The resulting model is $\pi_{\text{user}} = \pi_{\text{base}} + \Delta_{\text{user}}$, capturing the user’s unique style and preferences.
  3. Synthetic negatives

    • For each input, sample K candidates from \( \pi_{\text{base}} \) and select the lowest-reward output as a negative.
    • The implicit reward is defined as:
      \[ r_{\text{user}}(y_t) = \log \frac{ \pi_{\text{user}}(y_t \mid y_{<t}) }{ \pi_{\text{base}}(y_t \mid y_{<t})^{\alpha} } \]
      💡 Rationale behind CoPe’s implicit rewards:
      1. RLHF [1] typically uses an explicit, trained reward model.
      2. DPO [2] showed that an explicit reward model isn’t required; instead, the reward can be approximated by a log-likelihood ratio of two models: \( r(y) \approx \beta \cdot \log \frac{\pi_r(y)}{\pi_{\text{ref}}(y)} \)
        • Intuition: How much more does the aligned model prefer this output compared to the reference model?
        • KL regularization keeps the model from drifting too far from the reference model.
      3. This same formulation appears in our user-implicit reward:
        \[ r_{\text{user}}(y_t) \;=\; \log \frac{\pi_{\text{user}}(y_t \mid y_{< t})} {\pi_{\text{base}}(y_t \mid y_{< t})} \]
        • Why does this stay stable? With PEFT (e.g., LoRA), only a small set of parameters changes, so \(\pi_{\text{user}}\) stays close to \(\pi_{\text{base}}\) (akin to KL regularization), which keeps the log-ratio well behaved.
        • This formulation also appears in contrastive decoding [3].

      [1] Christiano et al. (2017), Deep Reinforcement Learning from Human Preferences, NeurIPS.
      [2] Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model, NeurIPS.
      [3] Li et al. (2023), Contrastive Decoding: Open-ended Text Generation as Optimization, ACL.
  4. DPO: Implicit preference learning

    • Using the same implicit reward in both the training and decoding phases lets us exploit it fully.
    • DPO provides exactly this objective for the training phase.
    • Train to prefer user-aligned positive samples $y^{\text{pos}}$ over synthetic negatives $y^{\text{neg}}$ using DPO: $$\mathcal{L}_{\text{DPO}} = - \log \sigma \Big( \beta [r_{\text{user}}(y^{\text{pos}}) - r_{\text{user}}(y^{\text{neg}})] \Big)$$
    • Training relies purely on implicit user reward signals, so no external reward model is needed (see the training-side sketch after this list).
  5. CoPe: Reward-guided decoding

    • At inference, choose the next token that maximizes the implicit reward over a plausibility-constrained candidate set $\mathcal{V}_{\text{head}}$ (as in contrastive decoding [3]): $$y_t^* = \arg\max_{y_t \in \mathcal{V}_{\text{head}}} r_{\text{user}}(y_t)$$
    • This keeps generation aligned with the user’s implicit reward without any external model (see the decoding sketch after this list).
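
Below is a minimal training-side sketch of steps 3 and 4. It assumes Hugging Face `transformers`-style causal LMs for $\pi_{\text{user}}$ and $\pi_{\text{base}}$, scores candidates with a sequence-level implicit reward (summed token log-ratios), and applies the DPO loss above; the function names, sampling settings, and hyperparameters are illustrative rather than the authors' implementation.

```python
# Sketch of CoPe's training side (synthetic negatives + DPO on implicit rewards).
# Assumptions: pi_user / pi_base are Hugging Face causal LMs, the sequence-level
# implicit reward is the sum of token log-ratios over the response, and padding
# handling is omitted for brevity.
import torch
import torch.nn.functional as F


def response_logprob(model, input_ids, response_mask):
    """Sum of log p(y_t | y_<t) over positions where response_mask == 1."""
    logits = model(input_ids).logits[:, :-1, :]              # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_logp = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask[:, 1:]).sum(dim=-1)   # shape: (batch,)


def implicit_reward(user_model, base_model, input_ids, response_mask):
    """r_user(y) = log pi_user(y | x) - log pi_base(y | x); gradients flow only through pi_user."""
    with torch.no_grad():
        base_lp = response_logprob(base_model, input_ids, response_mask)
    return response_logprob(user_model, input_ids, response_mask) - base_lp


@torch.no_grad()
def mine_negative(user_model, base_model, tokenizer, prompt, k=4, max_new_tokens=128):
    """Step 3: sample K candidates from pi_base and keep the lowest-reward one."""
    enc = tokenizer(prompt, return_tensors="pt").to(base_model.device)
    outs = base_model.generate(**enc, do_sample=True, top_p=0.95,
                               num_return_sequences=k, max_new_tokens=max_new_tokens)
    mask = torch.zeros_like(outs)
    mask[:, enc.input_ids.shape[1]:] = 1                     # score only the generated part
    rewards = implicit_reward(user_model, base_model, outs, mask)
    return outs[rewards.argmin()]                            # synthetic negative y_neg


def dpo_loss(user_model, base_model, pos, neg, beta=0.1):
    """Step 4: L_DPO = -log sigma(beta * [r_user(y_pos) - r_user(y_neg)])."""
    r_pos = implicit_reward(user_model, base_model, *pos)    # pos/neg: (input_ids, response_mask)
    r_neg = implicit_reward(user_model, base_model, *neg)
    return -F.logsigmoid(beta * (r_pos - r_neg)).mean()
```

Note that gradients flow only through $\pi_{\text{user}}$; $\pi_{\text{base}}$ serves as the fixed reference, mirroring the role of the reference model in DPO.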

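A corresponding sketch of the reward-guided decoding in step 5. Here $\mathcal{V}_{\text{head}}$ is approximated by the top-k tokens under $\pi_{\text{user}}$ at each step, which is an assumption; the paper's exact head construction and the values of $\alpha$ and $k$ may differ.

```python
# Sketch of CoPe decoding: greedily pick the token with the highest implicit reward
#   r_user(y_t) = log pi_user(y_t | y_<t) - alpha * log pi_base(y_t | y_<t),
# restricted to a plausibility head (approximated here by the top-k of pi_user).
import torch

@torch.no_grad()
def cope_decode(prompt, user_model, base_model, tokenizer,
                alpha=1.0, k=20, max_new_tokens=128):
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(user_model.device)
    for _ in range(max_new_tokens):
        user_logp = torch.log_softmax(user_model(ids).logits[:, -1, :], dim=-1)
        base_logp = torch.log_softmax(base_model(ids).logits[:, -1, :], dim=-1)
        head = torch.topk(user_logp, k, dim=-1).indices       # candidate set V_head
        reward = user_logp.gather(-1, head) - alpha * base_logp.gather(-1, head)
        next_id = head.gather(-1, reward.argmax(dim=-1, keepdim=True))
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Hypothetical usage: pi_base is the task-adapted model (step 1), pi_user adds the
# user LoRA after DPO (steps 2-4); both loaded with transformers.AutoModelForCausalLM.
```

Restricting the argmax to $\mathcal{V}_{\text{head}}$ keeps implausible tokens from winning merely because $\pi_{\text{base}}$ assigns them vanishingly small probability.
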
Experiments

Main Results table showing CoPe performance
CoPe achieves the best performance, outperforming all baselines! 😊
Compatibility of CoPe across different LLMs
CoPe is effective across diverse model families and parameter scales.
Ablation study results for CoPe
CoPe performs best when DPO and contrastive decoding work together, showing a synergistic effect.
Qualitative examples of CoPe outputs

Video Presentation

Watch the authors give a first-hand briefing on their paper!

BibTeX


      @inproceedings{bu-etal-2025-personalized,
          title = "Personalized {LLM} Decoding via Contrasting Personal Preference",
          author = "Bu, Hyungjune  and
            Jung, ChanJoo  and
            Kang, Minjae  and
            Kim, Jaehyung",
          editor = "Christodoulopoulos, Christos  and
            Chakraborty, Tanmoy  and
            Rose, Carolyn  and
            Peng, Violet",
          booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
          month = nov,
          year = "2025",
          address = "Suzhou, China",
          publisher = "Association for Computational Linguistics",
          url = "https://aclanthology.org/2025.emnlp-main.1723/",
          pages = "33946--33966",
          ISBN = "979-8-89176-332-6"
        }