enjoy your stay

raw stream of consciousness about ml and random stuff. notes related to stuff that i do, read etc.

Note to self (2): MAKE SURE SAVING CHECKPOINTS WORKS UNLESS YOU WANT TO WASTE 3HRS OF A TRAINING RUN

Note to self: NEVER INSTALL PYTORCH THROUGH CONDA UNLESS YOU WANT TO WASTE 2.5HRS OF YOUR LIFE
pip is the goat
UPDATE: PyTorch realized this too rofl
UPDATE 2: i am a uv fan now.

simply lovely

0128

essay ideas:

"to transcend retardation, you must first embrace it"
"why Before Sunrise is one of my favorite movies"

0101

hny; it turns out you can just write a simple ssg for your obsidian files in a <50 LoC build.py without using hugo or jekyll whatever haha
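
for the curious, a rough sketch of what such a build.py could look like. this is not my actual script: the `notes/` and `site/` folder names, `render_note` and `build` are all made up for illustration, and the "markdown" handling is deliberately dumb (only `# ` headings and blank-line-separated paragraphs), stdlib only.

```python
# build.py: a toy static site generator for a folder of markdown notes.
# Assumption: only "# " headings and blank-line-separated paragraphs are handled.
import html
import pathlib


def render_note(text: str) -> str:
    """Turn a note into an HTML fragment (headings and paragraphs only)."""
    blocks = []
    for chunk in text.strip().split("\n\n"):
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk.startswith("# "):
            blocks.append(f"<h1>{html.escape(chunk[2:])}</h1>")
        else:
            blocks.append(f"<p>{html.escape(chunk)}</p>")
    return "\n".join(blocks)


def build(src: str = "notes", out: str = "site") -> list[str]:
    """Render every .md file under src into out/<name>.html plus an index page."""
    src_dir, out_dir = pathlib.Path(src), pathlib.Path(out)
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    for note in sorted(src_dir.glob("*.md")):
        page = out_dir / f"{note.stem}.html"
        page.write_text(render_note(note.read_text(encoding="utf-8")), encoding="utf-8")
        pages.append(page.name)
    links = "\n".join(f'<li><a href="{p}">{p}</a></li>' for p in pages)
    (out_dir / "index.html").write_text(f"<ul>\n{links}\n</ul>", encoding="utf-8")
    return pages
```

calling `build()` at the bottom of the file (or from a makefile) is enough to regenerate the site; wikilinks, tags and templates would be the next things to bolt on.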

2025

1228

erm..this thing on...

there's a cognitive overhead to every tool that might be doing something with your input. keeping that boundary clear (where the AI lives, where it doesn't) is probably healthier for actually thinking through problems yourself versus reflexively outsourcing everything to a completion.

0625

trigger "yay you might get a job" workflow

0620

QoL improvement 1: bit the uv bullet

0619

life update: i am a discord mod now (half-joking)

0613

will's search agent has a pretty similar flavor to my wikigame env huh.

the "knowledge" vs "intelligence" debate for llms is the dichotomy between "can the model recall this fact?" and "can the model figure it out with tools, search, and reasoning?", and the two can only be decoupled in an ideal world.
we have recall benchmarks as well as agent benchmarks, but there is an evaluation gap in the middle. my initial research suggests there is no bench to study when a model should rely on its internal knowledge vs when it should use tools (and more importantly, use them effectively).

a reasonable proxy for this would be to make llms play wikigame, i feel.

0507

0505

GRPO has moved beyond language reasoning; if you have a verifiable task and a base VLM, you can show consistent gains on grounding, counting, and other multimodal tasks. While the initial experiments/efforts are certainly interesting, there is a common theme of unfaithful traces, limited generalisation, and an obsession with "aha" moments. And a lot of blame is put on SFT... I personally feel that throwing traditional tasks at these models isn't necessarily the right way to draw very strong conclusions, but again these are very early investigations and haven't been tested at scale. It's kind of obvious that you don't want to overtrain small models, so it's still an open question what the best recipe is for upgrading small reasoning models. My hypothesis is that it depends on what your task is.
For instance, let's look at something VLMs are genuinely terrible at: Spatial Reasoning. Is this a fundamental limitation? Well, Google [@chenSpatialVLMEndowingVisionLanguage2024] says it's a data problem. You'll see significant capability improvements and emergent spatial reasoning if you pre-train on the right kind of data.

How do you synthesize such data:

segment and label objects
predict dense depth
lift the 2D scene into a 3D point cloud
from the lifted scene, compute whatever distances you want
fill appropriate prompt templates with the numeric answers and use an LLM to generate a conversation
repurpose the conversation as a VQA example with artificial CoT
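
To make the lifting and template steps concrete, here is a minimal sketch under loud assumptions: it is not the SpatialVLM or VQASynth code, the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are invented, each object is reduced to its mask centroid plus one depth value, and `Obj`, `lift`, `relation` and `make_vqa` are hypothetical names. It only emits qualitative relations, with no metric distances.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    label: str
    u: float      # pixel column of the mask centroid
    v: float      # pixel row of the mask centroid
    depth: float  # predicted depth at the centroid (arbitrary units)


def lift(obj, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project a pixel plus depth into camera coordinates (pinhole model)."""
    x = (obj.u - cx) * obj.depth / fx
    y = (obj.v - cy) * obj.depth / fy
    return (x, y, obj.depth)


def relation(a, b, margin=0.05):
    """Qualitative relation of a with respect to b along the dominant axis."""
    ax, ay, az = lift(a)
    bx, by, bz = lift(b)
    deltas = {"x": ax - bx, "y": ay - by, "z": az - bz}
    axis = max(deltas, key=lambda k: abs(deltas[k]))
    d = deltas[axis]
    if abs(d) < margin:
        return "next to"
    names = {
        "x": ("left of", "right of"),
        "y": ("above", "below"),         # camera y grows downward
        "z": ("in front of", "behind"),  # smaller depth is closer to the camera
    }
    return names[axis][0] if d < 0 else names[axis][1]


def make_vqa(a, b):
    """Fill a fixed question/answer template with the computed relation."""
    rel = relation(a, b)
    return {
        "question": f"Where is the {a.label} relative to the {b.label}?",
        "answer": f"The {a.label} is {rel} the {b.label}.",
    }
```

In a real pipeline the centroids and depths would come from the segmentation and depth models, and the answer would be rewritten by an LLM into a conversation with a reasoning trace rather than used verbatim.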

Now, Google didn't open-source anything related to this project, but I found a community-sourced implementation.

Full disclosure, the author seems to be motivated by this exact problem, and about 10 days ago they released a model... but it seems they just SFT'd over a small thinking VLM. I would definitely compare the performance of their model with mine. Another key difference is that their model is trained to do quantitative estimations too. I didn't consider those for two reasons: 1) simplicity and 2) intuitively, anything quantitative might not generalize for a small model without any spatial supervision during pretraining. So I adapted the VQASynth pipeline to drop all numerical templates, skip metric scaling, and change the templates. Additionally, I added more predicates like support/under, containment, relative sizes etc.

Now in order to endow a VLM with spatial reasoning (without pre-training) you have to resort to good old SFT. So, I generated about 10k samples with this approach and finetuned qwen. After that it's just RL; the reward design is extremely simple for these tasks since it is a sparse-reward setting.
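
As an illustration of how simple such a reward can be, here is a sketch of an exact-match reward. The `<answer>...</answer>` output format and the names `extract_answer` and `sparse_reward` are assumptions for this example, not the format I actually trained with.

```python
import re


def extract_answer(completion: str):
    """Return the text inside the last <answer>...</answer> tag, or None."""
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip().lower() if matches else None


def sparse_reward(completion: str, gold: str) -> float:
    """1.0 for an exact match with the gold label, 0.0 otherwise."""
    pred = extract_answer(completion)
    return 1.0 if pred is not None and pred == gold.strip().lower() else 0.0
```

A completion that never produces a parseable answer gets the same 0.0 as a wrong one, which is exactly what makes the signal sparse; format rewards are a common addition but are left out here.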

For evaluation I just used the 3D tasks from CV-Bench (to save compute, I skipped the 2D split for now, and I haven't yet done RL for object counting). The current checkpoint has significantly improved on spatial reasoning (12pp). The base model is around 50%, which might seem impressive until you remember a coin flip gets the same when there are only two options.

So that's the progress so far. I do have to train my model for longer (more steps) and conduct a more extensive evaluation (note that I have also not evaluated the SFT-only performance, so idk how much of a boost RL gave).

0501

does knowledge distillation really work for reasoners the same way?

0430

would chess players appreciate that we fine-tune language models using an elo system?

think i just turned in the last homework of my life (unless i do a phd, seems unlikely now)

0429

two blog posts i am considering writing:

on knowledge vs intelligence
SPCT for agentic search

0428

thunderkittens in a nutshell:

0427

all you need to reach agi is to lock up a bunch of autistic people in a conference room and let them cook
--is what a friend said to me when i told him about a ML reading group

GRM paper notes: ~~REDACTED~~

0422

i guess i have figured it out now, how i am going to document stuff

ʕ•ᴥ•ʔ for longer posts (preferably more than 280 characters)
keep updating small things about work, etc. here (as i used to)
and x dot com, welp...

0306

[@bonettaVisionLanguageModels2024] not sure if this is relevant to what i am doing but this is pretty interesting, will look into it later; pretty much requires me to understand PPO a little although i think i have the intuition for it...

0217

on a quest to be a shape rotator

i regret not taking cs747 seriously, somebody tell 2020 me RL will be a big thing in 5 years

0210

i low-key want to see how SigLIP2 ViT-So-400m + ModernBert does on ImageNet

gave a nice presentation on SigLIP for my class
ok who am i kidding it was a great presentation (Saining was impressed)

Stitched together a minimal CLIP-style DistilBert + ResNet18, trained it on CIFAR10 and guess what
Sigmoid loss still mogs softmax
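
for reference, the two losses side by side as a stdlib-only sketch (not my training code). `sim` is assumed to be an n x n list of cosine similarities between image i and text j; the `t=10`, `b=-10` defaults follow the initialization described in the SigLIP paper, and everything else is illustrative.

```python
import math


def log_sigmoid(x):
    # numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def sigmoid_loss(sim, t=10.0, b=-10.0):
    """SigLIP-style loss: every (image, text) pair is its own binary problem."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * (t * sim[i][j] + b))
    return -total / n


def softmax_loss(sim, t=10.0):
    """CLIP-style loss: cross-entropy over rows and over columns, averaged."""
    n = len(sim)

    def cross_entropy(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [t * s for s in row]
            m = max(logits)
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += lse - logits[i]  # the correct match sits on the diagonal
        return loss / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))
```

the practical difference is that the sigmoid version needs no normalization across the batch, so each pair contributes independently; the softmax version couples every row and column through the log-sum-exp.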

0203

gemini api and pound cake both are disgusting

test leakage is all you need to go viral

i am not deriving FFT again

0130

what's the sweet spot for model sizes, from the point of view of it being usable...
we have plenty of (open) models around a few billion, tens of billions and then, suddenly around half a trillion (405, 680)...
not too many in the couple of hundred billion mark (except dscoder2.5, maybe some qwen models)
ah i wish gpt3 was open-source

what if:
Sam comes to his senses and open-sources o3-mini, and hosts it for dirt cheap too (since they have an amazing infrastructure).
OpenAI just wins, then?

0129

a wise shrek once told me, "i don't believe in libraries"

never deleting this app

0122

new year, two months of sickness (physical and mental) we are so back

2024

1128

these are getting more and more infrequent.
only key takeaway from last 10 days is to use managed/unified memory with cuda kernels until you understand how memory works...
avoid cudaMemcpyAsync and cudaMemsetAsync for the time being...

1117

time surely flies when you are down with flu...
https://github.com/assafelovic/gpt-researcher
can use some automatic prompt optimization...

1102

not writing cursive with a fountain pen is chaotic evil.

1029

man, DSPy is awesome, at least now i feel as if i am not getting overpaid for just writing prompts.

1028

deconstructing a search engine, let's stick to a domain first (medical for e.g.)

there are multiple medical databases (PubMed, SemanticScholar, etc...)
each has a traditional search and its own api
now you have an "agent" that can crawl these databases and figure out how to query them
and you need a model/workflow to transform a natural query to an api query.
do you just tailor prompts for each database? kinda cringe.
what about automated prompt engineering, DSPy??
search engines that leverage DSPy

1026

i got work to do but it's saturday night so

1024

training curve btw, should have used early stopping (figured it's just one epoch).

1023

ok i'll bite, it's rag for medical domain (the over-engineered gpt wrapper over a couple of sources) and i must research/experiment to make it better i suppose

other priorities today:

applying to openai residency, preceded by embellishing my cv
i'll do the visa tomorrow
working on cola and starting cuda stuff

switching gears to doing some query optimization/prompt engineering stuff; should line up well with the blog i've been trying to write for over a month

maybe if i do end up learning something about diffusion and vlms, i might become capable enough to work at moondream.

1022

dse-qwen2-korean should be ready in about 19 hours...burning a 4xA100 for 30hrs lemao

log-scale normalization is better than clipping?

now we're doing feature-wise weighting before computing cosine similarity, which means features with higher variance in the training set will contribute more to the similarity calculation. it maintains the normalization benefits of cosine similarity while incorporating the feature importance weights
have to experiment with different normalization strategies for the feature weights: basic, softmax-based, min-max, log-scale.
also have to see how (sighs) FID scores change.
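
a rough sketch of the weighting idea, stdlib only. the exact formulas behind "basic", "softmax", "minmax" and "log" are my guesses at reasonable definitions for this note, not the code from the project, and `normalize_weights` / `weighted_cosine` are illustrative names.

```python
import math


def normalize_weights(variances, strategy="basic"):
    """Map per-feature training-set variances to weights that sum to 1."""
    if strategy == "basic":
        raw = list(variances)
    elif strategy == "softmax":
        m = max(variances)
        raw = [math.exp(v - m) for v in variances]
    elif strategy == "minmax":
        lo, hi = min(variances), max(variances)
        span = (hi - lo) or 1.0
        raw = [(v - lo) / span for v in variances]
    elif strategy == "log":
        raw = [math.log1p(v) for v in variances]  # compresses large variances
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    total = sum(raw) or 1.0
    return [r / total for r in raw]


def weighted_cosine(a, b, weights):
    """Cosine similarity where each feature is scaled by its weight."""
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    na = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    nb = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    return dot / (na * nb) if na and nb else 0.0
```

with non-negative weights the result still lands in [-1, 1], so the usual cosine interpretation survives; note that min-max sends the lowest-variance feature to weight 0, which may or may not be what you want.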

1021

i am trying

so what am i working on right now, well, a bunch of things:

at work i was extensively pushing to use multimodal rag based on colpali, and so we compromised on dse; finetuning should be straightforward, hoping it works out well
- all you need then is this encoder + claude.

now, onto more interesting things:

learning CUDA by writing kernels for compositional linear algebra https://github.com/wilson-labs/cola
improving the guided diffusion technique proposed in https://github.com/Agentic-Learning-AI-Lab/procreate-diffusion-public

this in itself is too much work on my plate but i also have linear algebra to teach and classes to attend (obviously)

good luck me

1019

computers understand bits, llms understand vectors

i might actually cook with this blog...

all you need to learn diffusion is

estimated reading time for them combined is 78 mins
you can be an expert at diffusion in under 2 hours
that's less than 2 weeks btw

diffusion models are a journey that's equal parts math, magic, and machine learning.

1017

i don't always use version control but when i do i spam a bunch of Update README.md

1015

2 hours of sleep, on a sugar rush, and forcing myself through a lecture on approximate inference. i think i should just post that lmao.

next 72 hours are going to be brutal so i've been using vim

i have to learn cuda
i have to relearn vector calc so that i can attempt this horrifying assignment on bayesian linear regression
have to train a 2b vlm
proctor a couple of exams

and if time permits work on two research projects

am i too hopeful or thoroughly cooked