Yo! I am Jaideep, aka jellybean ❄️ on x dot com.
Avg neural networks enjoyer.
stuff that prolly doesn’t end up on x. synced to my obsidian.
Note to self (2): MAKE SURE SAVING CHECKPOINTS WORKS UNLESS YOU WANT TO WASTE 3HRS OF A TRAINING RUN
Note to self:
NEVER INSTALL PYTORCH THROUGH CONDA UNLESS YOU WANT TO WASTE 2.5HRS OF YOUR LIFE
pip is the goat
UPDATE: PyTorch realized this too rofl
<3
GRPO has moved beyond language reasoning; if you have a verifiable task and a base VLM, you can show consistent gains on grounding, counting, and other multimodal tasks. While the initial experiments/efforts are certainly interesting, there is a common theme of unfaithful traces, limited generalisation, and an obsession with “aha” moments. And a lot of the blame is put on SFT…
I personally feel that throwing traditional tasks at these models isn’t necessarily the right way to draw very strong conclusions, but again these are very early investigations and haven’t been tested at scale. It’s kind of obvious that you don’t want to overtrain small models, so what the best recipe is for upgrading small reasoning models is still an open question. My hypothesis is that it depends on what your task is.
For instance, let’s look at something VLMs are genuinely terrible at: Spatial Reasoning. Is this a fundamental limitation? Well, Google [@chenSpatialVLMEndowingVisionLanguage2024] says it’s a data problem. You’ll see significant capability improvements and emergent spatial reasoning if you pre-train on the right kind of data.
How do you synthesize such data?
Now, Google didn’t open-source anything related to this project, but I found a community-sourced implementation (VQASynth).
Full disclosure: the author seems motivated by this exact problem, and about 10 days ago they released a model…but it seems they just SFT’d over a small thinking VLM. I would definitely compare the performance of their model with mine. Another key difference is that their model is trained to do quantitative estimations too; I didn’t consider those for two reasons: 1) simplicity, and 2) intuitively, anything quantitative might not generalize for a small model without any spatial supervision during pretraining. So I adapted the VQASynth pipeline to drop all numerical templates, skip metric scaling, and change the templates. I also added more predicates like support/under, containment, relative size, etc.
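Roughly what those qualitative templates look like, as a minimal sketch (the predicate names, phrasings, and object fields below are my own illustration, not the actual VQASynth code):

```python
# Minimal sketch of qualitative spatial-QA templating (illustrative only).
# Assumes each object already has a caption plus a 3D centroid/size estimate
# from the depth + segmentation stages of a SpatialVLM-style pipeline.
from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    center: tuple  # (x, y, z) in camera coords; +x right, +z away from camera
    size: float    # rough bounding-box volume proxy

TEMPLATES = {
    "left_right": "Is the {a} to the left of the {b}?",
    "closer":     "Which is closer to the camera, the {a} or the {b}?",
    "bigger":     "Is the {a} bigger than the {b}?",
}

def make_qa(a: Obj, b: Obj):
    """Emit qualitative (question, answer) pairs for one object pair."""
    qas = []
    qas.append((TEMPLATES["left_right"].format(a=a.name, b=b.name),
                "yes" if a.center[0] < b.center[0] else "no"))
    qas.append((TEMPLATES["closer"].format(a=a.name, b=b.name),
                a.name if a.center[2] < b.center[2] else b.name))
    qas.append((TEMPLATES["bigger"].format(a=a.name, b=b.name),
                "yes" if a.size > b.size else "no"))
    return qas

if __name__ == "__main__":
    mug = Obj("mug", center=(-0.2, 0.0, 1.1), size=0.4)
    laptop = Obj("laptop", center=(0.3, 0.0, 1.5), size=2.0)
    for q, ans in make_qa(mug, laptop):
        print(q, "->", ans)
```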
Now, in order to endow a VLM with spatial reasoning (without pre-training) you have to resort to good old SFT. So I generated about 10k samples with this approach and finetuned Qwen. After that it’s just RL; the reward design is extremely simple for these tasks since it’s a sparse-reward setting.
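For concreteness, this is the kind of sparse reward I mean, sketched out (the answer tags and the small format bonus are my own convention, not lifted from any particular library):

```python
# Sketch of a sparse reward for verifiable multiple-choice answers (illustrative).
# One scalar per rollout: correct answer inside the expected tags -> 1.0,
# well-formatted but wrong -> small bonus, otherwise 0.
import re

def reward(completion: str, gt_answer: str) -> float:
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    fmt_ok = match is not None
    pred = match.group(1).strip().lower() if fmt_ok else ""
    correct = pred == gt_answer.strip().lower()
    return 1.0 if correct else (0.1 if fmt_ok else 0.0)
```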
For evaluation I just used the 3D tasks from CV-Bench (to save compute I skipped the 2D split for now, and I also haven’t done RL for object counting yet). The current checkpoint has improved significantly on spatial reasoning (+12pp). The base model sits around 50%, which might seem impressive, but so is a coin flip when you have only two options.
So that’s the progress so far. I still have to train the model for longer (more steps) and run more extensive evaluation (note that I also haven’t evaluated the SFT-only checkpoint, so idk how much of the boost came from RL).
does knowledge distillation really work for reasoners the same way?
would chess players appreciate that we fine-tune language models using an elo system?
think i just turned in the last homework of my life (unless i do a phd, seems unlikely now)
two blog posts i am considering writing:
thunderkittens in a nutshell:
all you need to reach agi is to lock up a bunch of autistic people in a conference room and let them cook
–is what a friend said to me when i told him about an ML reading group
GRM paper notes:
REDACTED
i guess i have figured it out now, how i am going to document stuff
[@bonettaVisionLanguageModels2024] not sure if this is relevant to what i am doing but it’s pretty interesting, will look into it later; pretty much requires me to understand PPO a little, although i think i have the intuition for it…
on a quest to be a shape rotator
i regret not taking cs747 seriously, somebody tell 2020 me RL will be a big thing in 5 years
i low-key want to see how SigLIP2 ViT-So-400m + ModernBert does on ImageNet
gave a nice presentation on SigLIP for my class
ok who i am kidding it was a great presentation (Saining was impressed)
Stitched together a minimal CLIP-style DistilBert + ResNet18, trained it on CIFAR10 and guess what
Sigmoid loss still mogs softmax
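For reference, the two losses side by side as a rough PyTorch sketch (temperature/bias values are placeholders, and the image/text embeddings are assumed to be L2-normalized already):

```python
# Softmax (CLIP-style) vs sigmoid (SigLIP-style) contrastive losses, sketched.
import torch
import torch.nn.functional as F

def softmax_clip_loss(img, txt, temp=0.07):
    # standard CLIP: symmetric cross-entropy over the in-batch similarity matrix
    logits = img @ txt.t() / temp
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def sigmoid_siglip_loss(img, txt, temp=10.0, bias=-10.0):
    # SigLIP-style: every image-text pair is an independent binary problem,
    # no batch-wide softmax normalization
    logits = img @ txt.t() * temp + bias
    labels = 2 * torch.eye(img.size(0), device=img.device) - 1  # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()
```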
gemini api and pound cake both are disgusting
test leakage is all you need to go viral
i am not deriving FFT again
what’s the sweet spot for model sizes, from the point of view of it being usable…
we have plenty of (open) models around a few billion, tens of billions, and then suddenly around half a trillion (405B, 680B)…
not too many in the couple-hundred-billion range (except dscoder2.5, maybe some qwen models)
ah i wish gpt3 was open-source
what if:
Sam comes to his senses and open-sources o3-mini, and hosts it for dirt cheap too (since they have an amazing infrastructure).
OpenAI just wins, then?
a wise shrek once told me, “i don’t believe in libraries”
never deleting this app
new year, two months of sickness (physical and mental) we are so back
these are getting more and more infrequent.
only key takeaway from the last 10 days is to use managed/unified memory with cuda kernels until you understand how memory works…
avoid cudaMemcpyAsync
and cudaMemsetAsync
for the time being…
time surely flies when you are down with flu…
https://github.com/assafelovic/gpt-researcher
can use some automatic prompt optimization…
not writing cursive with a fountain pen is chaotic evil.
man, DSPy is awesome, at least now i feel as if i am not getting overpaid for just writing prompts.
deconstructing a search engine, let’s stick to a domain first (e.g., medical)
i got work to do but it’s saturday night so
training curve btw, should have used early stopping (figured it’s just one epoch).
ok i’ll bite, it’s rag for the medical domain (the over-engineered gpt wrapper over a couple of sources) and i must research/experiment to make it better i suppose. other priorities today:
switching gears to doing some query optimization/prompt engineering stuff; should line up well with the blog i’ve been trying to write for over a month
maybe if i do end up learning a bit about diffusion and vlms, i might become capable enough to work at moondream.
dse-qwen2-korean should be ready in about 19 hours…burning a 4xA100 for 30hrs lemao
log-scale normalization is better than clipping?
now we’re doing feature-wise weighting before computing cosine similarity, which means features with higher variance in the training set will contribute more to the similarity calculation. it keeps the normalization benefits of cosine similarity while incorporating the feature-importance weights (rough sketch below).
have to experiment with different normalization strategies for the feature weights: basic, softmax-based, min-max, log-scale.
also have to see how (sighs) FID scores change.
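rough sketch of what i mean by weighted cosine + the normalization options (the strategy names are just my labels, and using training-set variance as the importance signal is an assumption):

```python
# Feature-weighted cosine similarity with swappable weight-normalization strategies.
import numpy as np

def normalize_weights(var, strategy="basic"):
    """Turn per-feature training-set variances into importance weights."""
    if strategy == "basic":
        w = var / var.sum()
    elif strategy == "softmax":
        e = np.exp(var - var.max())
        w = e / e.sum()
    elif strategy == "minmax":
        w = (var - var.min()) / (var.max() - var.min() + 1e-8)
    elif strategy == "log":
        w = np.log1p(var)
        w = w / w.sum()
    else:
        raise ValueError(strategy)
    return w

def weighted_cosine(a, b, w):
    # weight each feature first, then the usual normalize-and-dot
    aw, bw = a * w, b * w
    return aw @ bw / (np.linalg.norm(aw) * np.linalg.norm(bw) + 1e-8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(1000, 64))               # stand-in for training features
    w = normalize_weights(feats.var(axis=0), "log")   # log-scale weights
    print(weighted_cosine(feats[0], feats[1], w))
```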
so what am i working on right now, well, a bunch of things:
now, onto more interesting things:
this in itself is too much on my plate, but i also have linear algebra teaching and classes (obviously)
good luck me
computers understand bits, llms understand vectors
i might actually cook with this blog…
all you need to learn diffusion is
estimated reading time for them combined is 78 mins
you can be an expert at diffusion in under 2 hours
that’s less than 2 weeks btw
diffusion models are a journey that’s equal parts math, magic, and machine learning.
i don’t always use version control, but when i do i spam a bunch of Update README.md
2 hours of sleep, on a sugar rush, and forcing myself through a lecture on approximate inference. i think i should just post that lmao.
next 72 hours are going to be brutal so i’ve been using vim
and, if time permits, work on two research projects
am i too hopeful or thoroughly cooked