Tianlang Chen
Published: 2021
Total Pages: 223
"With the development of computer vision, natural language processing, and machine learning technologies, a great number of joint visual-textual applications, such as image captioning, visual question answering, visual grounding, image-text cross-modal retrieval, and text-based image generation, have emerged in recent years. They leverage machine learning models as the core module to tackle problems at the intersection of vision and language. For all these joint visual-textual applications, the vision and text modalities interact in three fundamental modes. The first is the "joint learning" mode, which treats both modalities as parallel inputs to jointly predict a target. The second is the "retrieval" mode, which explores the correspondence between the two modalities and aims to find matching items across them. The third is the "generation" mode, which focuses on creating and modifying items of one modality using input from the other modality as guidance. For joint visual-textual applications in all three modes, how to effectively "capture" and "attend" to the significant information in the visual and textual inputs is crucial. This thesis develops new "capturing" and "attending" methods to effectively model joint visual-textual applications in the three modes. For the first mode, we focus on a significant social media classification application. A novel bilateral attention model is proposed to classify whether a WeChat Moment is business-related based on the Moment's image and text information. For the second mode, we comprehensively investigate image-text cross-modal retrieval on both general and domain-specific tasks. We first explore the general image-text matching task and propose approaches that effectively capture cross-modal information. We then focus on two domain-specific tasks: font retrieval and person search.
We design methods that further exploit the special font and person attributes to attend to better features. For the third mode, we address a brand-new research topic, "emotionalization", which aims to emotionalize an image or a text. We combine this topic with two well-defined joint visual-textual applications of the third mode: image captioning and text-based image transfer. We design models that incorporate the "emotionalization" operation into the text generation and image transfer process by effectively capturing and attending to critical information"--Pages xv-xvi.