Discussion about this post

ToxSec

“Instead of learning from short captions like ‘A brown dog catching a frisbee,’ Molmo sees long descriptions that mention lighting conditions, camera angle, background blur, object texture, emotional cues, and implied motion. This leads to deeper and more detailed visual representations.”

Thanks for putting this in an easy-to-understand way. This made a lot of sense.

Kiran Thakor

Vn

