A read-out that reads images, trained only on words.
A 12.3M side-channel head on a frozen google/gemma-4-31B-it, taught meaning from text alone: to separate discourse communities in the residual stream. It never saw a single picture in training. Hand it a picture and it names what the image means.
Cross-modal transfer. Give it a photo and it lands next to the right words: bicycle → bicycle, rose → flower, dog → pet. It groups images by what they mean, not how they look: image-to-referent kNN 0.64 against chance 0.10.
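To make the kNN figure concrete, here is a minimal sketch of how an image-to-referent kNN score could be computed. It assumes one reading of the metric: each image's read-out vector is labelled by majority vote among its k nearest text vectors (cosine similarity), and the score is the fraction of images whose label matches the true referent. The function names and the toy vectors are hypothetical and are not taken from the SRT code or its results.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_label(query, text_vecs, text_labels, k=1):
    """Majority label among the k text vectors most similar to the query."""
    ranked = sorted(range(len(text_vecs)),
                    key=lambda i: cosine(query, text_vecs[i]),
                    reverse=True)
    votes = Counter(text_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(image_vecs, image_labels, text_vecs, text_labels, k=1):
    """Fraction of images whose kNN label equals the true referent."""
    hits = sum(knn_label(q, text_vecs, text_labels, k) == y
               for q, y in zip(image_vecs, image_labels))
    return hits / len(image_vecs)

# Toy data, invented for illustration only (not model outputs).
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_labels = ["bicycle", "flower", "pet"]
image_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.6, 0.1, 0.5]]
image_labels = ["bicycle", "flower", "pet"]

accuracy = knn_accuracy(image_vecs, image_labels, text_vecs, text_labels, k=1)
chance = 1 / len(set(text_labels))  # uniform guess over equally frequent referents
print(f"kNN accuracy {accuracy:.2f} against chance {chance:.2f}")
```

Under this reading, a chance level of 0.10 would correspond to ten equally frequent referents; that count is an inference from the reported number, not a figure stated here.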
Why this is a semiotic read-out, not an image classifier. A classifier is trained on labelled images and learns a fixed map from pixels to a closed label set; it only knows the labels it was shown. This read-out is different in kind. It never saw an image in training; it was trained only on text, to separate discourse communities in the residual stream of a frozen gemma-4-31B. It can interpret a picture because gemma-4 already fuses image and word into one representational stream, and the read-out taps the shared interpretant: the meaning a sign carries, whatever form it arrived in. So it does not classify the image. It tells you what the image means to a system that learned meaning from words: a transfer across modality, from a head that was never cross-trained. That is the result.
Each picture gets two readings: the words it means (the load-bearing evidence) and the discourse it evokes, the nearest of the 35 communities it learned from text. Read the second as a flavour, not a category: cars into the automotive community, deer and mushrooms into gardening, cats and dogs into the cozy-domestic communities. Never a class label.
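The two readings can be pictured as two nearest-neighbour lookups over the same read-out vector. The sketch below assumes that words and communities are each represented by a vector in the read-out space (communities by a centroid) and that nearness is cosine similarity. The function `two_readings`, the dictionaries, and the 2-d toy vectors are all hypothetical, chosen only to show the shape of the output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_readings(image_vec, word_vecs, community_centroids, top_n=2):
    """Return (nearest words, nearest community) for one image vector."""
    # Reading 1: the words the image lands next to, most similar first.
    words = sorted(word_vecs,
                   key=lambda w: cosine(image_vec, word_vecs[w]),
                   reverse=True)[:top_n]
    # Reading 2: the single nearest community, read as a flavour only.
    community = max(community_centroids,
                    key=lambda c: cosine(image_vec, community_centroids[c]))
    return words, community

# Toy vectors, invented for illustration only (not model outputs).
word_vecs = {"dog": [0.9, 0.1], "pet": [0.8, 0.3], "car": [0.1, 0.9]}
community_centroids = {"cozy-domestic": [1.0, 0.2], "automotive": [0.0, 1.0]}
image_vec = [0.9, 0.15]  # stand-in for the read-out vector of a dog photo

words, community = two_readings(image_vec, word_vecs, community_centroids)
print("words:", words)
print("evokes:", community)
```

Keeping the two lookups separate mirrors the distinction drawn above: the word list is the evidence, while the community is a single nearest match and should not be read as a class prediction.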
Scored offline through a frozen google/gemma-4-31B-it (62.5 GB, too large to run live).
Try it: RiverRider/srt-sunstone
Model: RiverRider/Gemma-4-31B-it-SRT-Sunstone
Code: https://github.com/space-bacon/SRT