<p>Multi-Modal Foundation Models</p><p>With Amir Zamir</p><p>Multiple sensory systems, e.g., vision and touch, can teach each other if they are synchronized in time.<br>If you have a set of sensors, then a multi-modal foundation model can translate arbitrarily between them.</p><p>With masked modeling, you're trying to recover missing information.</p><p>In a MultiMAE model, you train a single model with different types of inputs and outputs. When trying out different combinations of inputs, it is interesting to see how the model adapts to them (see the sketch at the end of this post):</p><p><a href="https://MultiMAE.epfl.ch">https://MultiMAE.epfl.ch</a></p><p>An interesting application is "grounded generation", where you can steer changes to an existing picture with words describing what you want to change. You can also adapt the other inputs, such as bounding boxes and depth.</p><p><a href="https://ioc.exchange/tags/AMLDGenAI23">#AMLDGenAI23</a> <a href="https://ioc.exchange/tags/EPFL">#EPFL</a> <a href="https://ioc.exchange/tags/C4DT_EPFL">#C4DT_EPFL</a></p>
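<p>To make the masked-modeling idea concrete, here is a minimal sketch, not the MultiMAE code: the class name TinyMaskedAutoencoder and all sizes are made up for illustration. The idea is to hide a random subset of input patches and train the model to reconstruct exactly those. MultiMAE extends this by mixing patches from several modalities, e.g. RGB, depth, and semantic segmentation, into one token sequence.</p><pre><code># Minimal masked-modeling sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn

class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, patch_dim=768, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, patch_dim)
        # Learned placeholder that stands in for hidden patches.
        self.mask_token = nn.Parameter(torch.zeros(patch_dim))

    def forward(self, patches, mask):
        # patches: (batch, num_patches, patch_dim); mask: (batch, num_patches) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token, patches)
        return self.decoder(self.encoder(x))

patches = torch.randn(4, 196, 768)          # e.g. a 14x14 grid of image patches
mask = torch.rand(4, 196).lt(0.75)          # hide 75% of the patches
model = TinyMaskedAutoencoder()
recon = model(patches, mask)
# Compute the reconstruction loss only on the masked patches.
loss = (recon - patches)[mask].pow(2).mean()
loss.backward()
</code></pre>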