资讯

Abstract: Image-text matching aims to bridge the semantic gap between visual and textual modalities, which is a fundamental task for multimodal learning applications. However, most Transformer-based ...