Synthetic Data Generation
- Feed $k$ seed prompts to the source model to generate queries ($\mathbf{q}$) and response tokens ($y_i$).
- Diversity is enforced by discarding any generated sample whose ROUGE-L similarity to an already-accepted sample is 0.7 or higher (see the sketch below).
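A minimal sketch of this stage, assuming a Hugging Face-style `generate` API and the `rouge_score` package; `source_model`, `tokenizer`, and `seed_prompts` are illustrative placeholders, not names from the paper:

```python
from rouge_score import rouge_scorer

# ROUGE-L scorer used for the pairwise diversity check.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_diverse(candidate: str, accepted: list[str], threshold: float = 0.7) -> bool:
    """Accept a candidate only if its ROUGE-L F1 against every kept sample is < threshold."""
    return all(
        scorer.score(prev, candidate)["rougeL"].fmeasure < threshold
        for prev in accepted
    )

def generate_synthetic_data(source_model, tokenizer, seed_prompts, n_per_prompt=4):
    """Sample responses from the source model and keep only diverse ones."""
    accepted = []
    for prompt in seed_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = source_model.generate(
            **inputs,
            do_sample=True,
            num_return_sequences=n_per_prompt,
            max_new_tokens=256,
        )
        for seq in outputs:
            text = tokenizer.decode(seq, skip_special_tokens=True)
            if is_diverse(text, accepted):
                accepted.append(text)
    return accepted
```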
Contrastive Excess Score
- We divide the source model into two distinct roles to identify what the LoRA actually "learned":
  1. "Amateur" (base model only: $\mathcal{M}_s$)
  2. "Expert" (base model + LoRA: $\mathcal{M}_s + A_s$)
- We identify informative tokens via Contrastive Excess, defined as:
$$S(y_i) = L_e(y_i) - L_a(y_i)$$
Where:
$$L_e(y_i) = \log P_{\mathcal{M}_s + A_s}(y_i \mid \mathbf{q}, \mathbf{y}_{\lt i}), \quad L_a(y_i) = \log P_{\mathcal{M}_s}(y_i \mid \mathbf{q}, \mathbf{y}_{\lt i})$$
Intuition: TiTok's contrastive excess is a token-level log-likelihood ratio (LLR). If the base model is uncertain about a token but the LoRA-equipped model is highly confident, that token receives a high score. This highlights exactly where the LoRA injects decisive task knowledge.
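As a concrete sketch of the scoring pass: with a PEFT-style model, the same forward code can play both roles, since `disable_adapter()` temporarily removes the LoRA. Function and variable names here are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def contrastive_excess(peft_model, input_ids, response_start):
    """Per-token S(y_i) = L_e(y_i) - L_a(y_i) over the response span.

    peft_model:     source base model with the source LoRA attached (PEFT)
    input_ids:      [1, seq_len] query tokens followed by response tokens
    response_start: index of the first response token
    """
    def token_logprobs(model):
        logits = model(input_ids).logits                  # [1, seq_len, vocab]
        logprobs = F.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..L-1
        targets = input_ids[:, 1:].unsqueeze(-1)
        return logprobs.gather(-1, targets).squeeze(-1)   # log P(token_i | prefix)

    L_e = token_logprobs(peft_model)        # "expert": base + LoRA
    with peft_model.disable_adapter():      # "amateur": base only
        L_a = token_logprobs(peft_model)

    # Shift by one: the prediction for token i lives at position i - 1.
    return (L_e - L_a)[0, response_start - 1 :]
```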
Target LoRA Training & Filtering
Instead of training on everything, we focus only on the high-impact data points:
- Filter Samples: Keep the top synthetic samples, ranked by the average contrastive excess of their tokens.
- Filter Tokens: Select the top $k\%$ of tokens with the highest individual contrastive excess scores.
- Selective Training: Train the target model's LoRA adapters using only these informative tokens in the loss computation (a sketch follows this list).
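A sketch of how the token filter can enter the loss, assuming `scores` comes from the contrastive-excess pass above; `k_pct` and the function name are illustrative:

```python
import torch
import torch.nn.functional as F

def selective_token_loss(logits, labels, scores, k_pct=0.3):
    """Cross-entropy computed only over the top-k% tokens by contrastive excess.

    logits: [n, vocab] target-model next-token logits for the response span
    labels: [n]        response token ids
    scores: [n]        contrastive excess per token (aligned to the target
                       tokenization first, if the tokenizers differ)
    """
    k = max(1, int(k_pct * scores.numel()))
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(k).indices] = True           # mask in the top-k% tokens

    per_token = F.cross_entropy(logits, labels, reduction="none")
    return (per_token * keep).sum() / keep.sum()  # average over kept tokens only
```

Sample-level filtering is the same idea one level up: rank whole examples by `scores.mean()` and keep the top ones before training.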
What if the source and target models use different tokenizers?
The Challenge: When the source and target models use different tokenizers, a single word may be split differently:
- Source: ["right", "eou", "sness"] (the LoRA masks "eou")
- Target: ["righteous", "ness"]
The Solution: Dual-Pointer Alignment Algorithm
- We use a Dual-Pointer Approach to align character spans and map masking scores ($m_i \in [0,1]$, where 1 = keep and 0 = mask) from the source tokenization to the target tokenization:
- If a source token aligns one-to-one with a target token: copy its masking score.
- If the tokenizations don't line up (e.g., a source token is split): compute the average score over the overlapping source tokens and assign it to each aligned target token (see the sketch below).
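A sketch of the alignment, assuming both token lists concatenate to the same character string (e.g., after stripping tokenizer-specific markers such as `▁` or `Ġ`); the function names are illustrative:

```python
def char_spans(tokens):
    """Character [start, end) span of each token in the concatenated string."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def align_scores(src_tokens, src_scores, tgt_tokens):
    """Dual-pointer mapping of per-token scores onto the target tokenization.

    A target token whose span coincides with a single source token copies
    that score; otherwise it receives the average over all source tokens
    its character span overlaps.
    """
    src_spans = char_spans(src_tokens)
    tgt_scores, i = [], 0                 # i walks the source tokens
    for t_start, t_end in char_spans(tgt_tokens):
        # Advance past source tokens that end before this target token starts.
        while i < len(src_spans) and src_spans[i][1] <= t_start:
            i += 1
        # Gather every source token overlapping [t_start, t_end).
        overlap, j = [], i
        while j < len(src_spans) and src_spans[j][0] < t_end:
            overlap.append(src_scores[j])
            j += 1
        tgt_scores.append(sum(overlap) / len(overlap))
    return tgt_scores
```

On the example above, `align_scores(["right", "eou", "sness"], [1.0, 0.0, 1.0], ["righteous", "ness"])` returns roughly `[0.67, 1.0]`: "righteous" averages the three source tokens it overlaps, while "ness" falls entirely inside "sness" and inherits its score.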