
Fine-Tuning is Not as Straightforward as I Thought

I spent a month of my summer diving into neural search, and this is mostly a (brief) record of my harrowing ordeal of fine-tuning a (dense) text embedding model for the Korean language.

By the way, I don’t know how to speak Korean.

TL;DR

  1. Pre-trained models can have quirks. Get to know them.
  2. The original training objective matters a lot when fine-tuning.
  3. Sometimes, you need to think outside the box, but most times you need to read the literature.
  4. Don’t underestimate the power of augmenting your dataset.

Most importantly, don’t just dive right in without a thorough understanding of SOTA techniques.
But yeah, you can also rediscover stuff, like I did.

The Game Plan

So, the typical approach goes something like this (there’s a rough code sketch after the list):

  1. Get a dataset from a GLUE-like benchmark, i.e., KLUE in the case of Korean
  2. Tokenize it with XLM-RoBERTa’s tokenizer
  3. Set up a DataModule for train and val splits
  4. Load the pre-trained ‘multilingual-e5-large’ model
  5. Create a LightningModule for training and validation
  6. Train with PyTorch Lightning’s (or Huggingface if you’re into it) Trainer
  7. Evaluate and save the model
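
Put together, steps 1 through 6 boil down to a skeleton roughly like this. It’s only a sketch under a few assumptions (the KLUE STS config on the Hugging Face Hub, the intfloat/multilingual-e5-large checkpoint, and a made-up EmbeddingFineTuner class); the DataModule and the actual training/validation steps are left out.

import torch
import pytorch_lightning as pl
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# 1. a GLUE-like dataset: KLUE's STS subset
dataset = load_dataset("klue", "sts")

# 2. the XLM-RoBERTa tokenizer that ships with the e5 checkpoint
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")

# 4. the pre-trained multilingual embedding model
encoder = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

# 5. a LightningModule wrapping the encoder (training/validation steps elided)
class EmbeddingFineTuner(pl.LightningModule):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-5)

# 6. PyTorch Lightning's Trainer drives the loop
trainer = pl.Trainer(max_epochs=3)
# trainer.fit(EmbeddingFineTuner(encoder), datamodule=...)  # 3. the DataModule goes here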

Sounds simple enough, right? Well…

When Things Got Weird

First hiccup: the fine-tuned model performed worse on MTEB than the original. Not exactly the improvement I was hoping for.
Zeroth hiccup: there is no Korean MTEB, so I made my own.

I also discovered a mismatch between model architectures: in the Transformers library, RobertaModel has a pooling layer, but RobertaForSequenceClassification doesn’t.1 I thought loading the weights would fix it, but no dice.
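
If you want to see the mismatch for yourself, a quick check against the plain roberta-base checkpoint does the trick (this is just an illustration, not part of the fine-tuning code):

from transformers import RobertaModel, RobertaForSequenceClassification

base = RobertaModel.from_pretrained("roberta-base")
clf = RobertaForSequenceClassification.from_pretrained("roberta-base")

print(base.pooler)         # RobertaPooler(dense=Linear(...), activation=Tanh())
print(clf.roberta.pooler)  # None: the backbone is built with add_pooling_layer=False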

At this point, I lost faith in open-source.

(image: open-source rant)

The Problem with Cosine Embedding Loss

To align the training objective more closely with similarity scores, one would typically just use torch.nn.CosineEmbeddingLoss. To do that (sketched in code after the list), I:

  1. Implemented mean pooling instead of just using the [CLS] token
  2. Computed embeddings for the sentence pairs separately
    (Siamese, not [CLS] <sentence 1> [SEP] <sentence 2>)
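
In code, that setup looked roughly like this. Treat it as a sketch rather than the exact training step: encoder stands for the e5 backbone, batch1/batch2 are the tokenized sides of each pair, and targets is a tensor of +1/-1 labels derived from the STS scores (all placeholder names).

import torch
from torch import nn

def mean_pool(last_hidden_state, attention_mask):
    # Average the token embeddings, ignoring padded positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Siamese: each sentence is encoded on its own, never as one [SEP]-joined sequence.
emb1 = mean_pool(encoder(**batch1).last_hidden_state, batch1["attention_mask"])
emb2 = mean_pool(encoder(**batch2).last_hidden_state, batch2["attention_mask"])

# targets: +1 for similar pairs, -1 for dissimilar ones.
loss = nn.CosineEmbeddingLoss()(emb1, emb2, targets)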

The result? Better than before, but still not beating the original model. Progress, I guess?

Modeling Hiccups

Just when I thought I had it figured out, I learned some crucial things about the model:

  1. It needs input prefixes: "query: " or "passage: ". Who knew?
  2. Its cosine similarity scores usually hang out between 0.7 and 1.0. Very picky.

Turns out, the model was trained with a low temperature (0.01) for InfoNCE contrastive loss. And here’s the kicker: STS pairs aren’t really designed for contrastive loss unless you’re into augmenting negative examples.
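
To make both quirks concrete, here’s a sketch of how the checkpoint expects to be used, assuming intfloat/multilingual-e5-large with the mean pooling its model card recommends (the Korean pair is just an illustrative example):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large")
model = AutoModel.from_pretrained("intfloat/multilingual-e5-large")

texts = [
    "query: 한국의 수도는 어디인가요?",      # "Where is the capital of Korea?"
    "passage: 서울은 대한민국의 수도이다.",  # "Seoul is the capital of South Korea."
]

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state

# Mean pooling over non-padding tokens, then L2 normalization.
mask = batch["attention_mask"].unsqueeze(-1).float()
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
emb = F.normalize(emb, p=2, dim=1)

print((emb[0] @ emb[1]).item())  # tends to land in that narrow 0.7-1.0 band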

SimCSE Is All You Need

To put it simply, to train a SOTA dense embedding model, this is literally all I needed (if only I had the perfect dataset for it):

import torch
from torch import nn


def cos_sim(self, embeddings1, embeddings2):
    # Pairwise cosine similarities: L2-normalize, then take a matrix product.
    embeddings1 = nn.functional.normalize(embeddings1, p=2, dim=1)
    embeddings2 = nn.functional.normalize(embeddings2, p=2, dim=1)
    return torch.matmul(embeddings1, embeddings2.T)


def loss(self, embeddings1, embeddings2, embeddings3=None):
    # embeddings1/embeddings2 are anchor/positive pairs; embeddings3 holds optional hard negatives.
    if embeddings3 is not None:
        embeddings2 = torch.concat([embeddings2, embeddings3], dim=0)
    # InfoNCE: scale the similarity matrix by a small temperature (self.temperature, e.g. 0.01).
    scores = self.cos_sim(embeddings1, embeddings2) / self.temperature
    # The positive for anchor i sits at column i; everything else is a negative.
    labels = torch.arange(scores.shape[0], device=scores.device)
    loss = nn.CrossEntropyLoss()(scores, labels)
    return loss
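
What makes this contrastive is the cross-entropy over the similarity matrix: for anchor i, the matching row of embeddings2 is the "correct class", and every other row in the batch (plus any hard negatives passed in as embeddings3) acts as a negative for free. Scale by a small temperature like the 0.01 the original model was trained with, and the objective finally lines up with what the checkpoint already expects.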

And so…

(image: yay)

What’s Next?

Well, I’m thinking of hiring (making my company hire) a Korean data annotator to improve the dataset quality. Because why not add another layer of complexity, right?
In all seriousness, this project was a rollercoaster, but it was worth it. If you’re diving into fine-tuning language models, remember: expect the unexpected, and don’t be afraid to experiment.


  1. Unfortunately, this is by design, but it didn’t matter since I was no longer training a classification head, simply using the pooler output. ↩︎