ERIC Number: ED668956
Record Type: Non-Journal
Publication Date: 2021
Pages: 113
Abstractor: As Provided
ISBN: 979-8-5381-2233-2
ISSN: N/A
EISSN: N/A
Available Date: N/A
Pre-Training Methods for Vision and Language
Hao Tan
ProQuest LLC, Ph.D. Dissertation, The University of North Carolina at Chapel Hill
Vision and language are the primary modalities of human perception and learning, and recent years have seen rapid development of methods that connect the two. Current deep learning methods are data-hungry; pre-training on large-scale data therefore warms up the model and yields better fine-tuning results on downstream tasks. However, pre-training frameworks that exploit the power of multi-modality remain underexplored. Specifically, the following questions remain: Could we build large pre-trained models that understand the interactions and alignments between modalities? Could language and vision aid each other's understanding? Could we combine the current diverse methods for vision pre-training and language pre-training? This dissertation aims to answer these questions. I first build a vision-and-language pre-training framework, LXMERT. This framework learns joint vision-and-language representations from large-scale data (e.g., MS COCO) and achieves state-of-the-art results on several benchmark tasks such as image question answering and visual reasoning. We also illustrate the importance of single-modality pre-training in vision-and-language tasks. Next, I improve language understanding via dense visual supervision and show its generalization to pure-text tasks. I develop the vokenization method to construct this visual supervision, which learns to retrieve a related image for each contextualized token in a sentence. Lastly, current language pre-training and vision pre-training are led by different pretext tasks: language modeling and contrastive learning. I combine these two methods into a unified pre-training framework on videos, so that the pre-trained model captures both static spatial content and dynamic temporal interactions. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
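The vokenization idea described in the abstract pairs each contextualized token with a retrieved image. A minimal sketch of one plausible formulation is given below: a batch-wise contrastive (InfoNCE-style) loss over token-image pairs, where each token's matched image is the positive and other images in the batch are negatives. The function name, embedding dimensions, and toy data are illustrative assumptions, not the dissertation's actual implementation.

# Hypothetical sketch of a vokenization-style contrastive objective.
# Assumes token and image embeddings of equal dimension; all names
# and shapes here are illustrative, not from the dissertation.
import torch
import torch.nn.functional as F

def voken_contrastive_loss(token_emb, image_emb, temperature=0.07):
    """token_emb: (N, d) contextualized token vectors.
    image_emb: (N, d) embeddings of each token's matched image.
    Matched pairs (the diagonal of the similarity matrix) are
    positives; all other images in the batch serve as negatives."""
    token_emb = F.normalize(token_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = token_emb @ image_emb.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(token_emb.size(0))         # index of each positive
    return F.cross_entropy(logits, targets)

# Toy usage: 8 tokens with 256-dimensional embeddings.
tokens = torch.randn(8, 256)
images = torch.randn(8, 256)
loss = voken_contrastive_loss(tokens, images)
print(loss.item())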
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Authoring Institution: N/A
Grant or Contract Numbers: N/A
Author Affiliations: N/A