The embedding space insight is crucial. What helped me understand this better was realizing that tokenization for images is fundamentally different than text, yet they converge to the same semantic space. Makes you wonder how this will scale once we add more modalities.
The embedding space insight is crucial. What helped me understand this better was realizing that tokenization for images is fundamentally different than text, yet they converge to the same semantic space. Makes you wonder how this will scale once we add more modalities.
Helpful