Discussion about this post

“Instead of learning from short captions like “A brown dog catching a frisbee,” Molmo sees long descriptions that mention lighting conditions, camera angle, background blur, object texture, emotional cues, and implied motion. This leads to deeper and more detailed visual representations.”

Thanks for putting this in an easy-to-understand way. This made a lot of sense.
