ERIC Number: ED668956
Record Type: Non-Journal
Publication Date: 2021
Pages: 113
Abstractor: As Provided
ISBN: 979-8-5381-2233-2
ISSN: N/A
EISSN: N/A
Available Date: N/A
Pre-Training Methods for Vision and Language
Hao Tan
ProQuest LLC, Ph.D. Dissertation, The University of North Carolina at Chapel Hill
Vision and language are the primary modalities of human perception and learning, and recent years have seen rapid development of methods that connect the two. Current deep learning methods are data-hungry; pre-training on large-scale data therefore warms up the model and yields better fine-tuning results on downstream tasks. However, pre-training frameworks that exploit the power of multi-modality remain underexplored. Specifically, the following questions remain: Could we build large pre-trained models that understand the interactions and alignments between modalities? Could language and vision aid each other's understanding? Could we combine the current diverse methods for vision pre-training and language pre-training? This dissertation aims to answer these questions. I first build a vision-and-language pre-training framework, LXMERT. This framework learns joint vision-and-language representations from large-scale data (e.g., MS COCO) and achieves state-of-the-art results on several benchmark tasks such as image question answering and visual reasoning. We also illustrate the importance of single-modality pre-training in vision-and-language tasks. Next, I improve language understanding via dense visual supervision and show its generalization to pure-text tasks. I develop the vokenization method to construct this visual supervision, which learns to retrieve a related image for each contextualized token in a sentence. Lastly, current language pre-training and vision pre-training are led by different pretext tasks: language modeling and contrastive learning. I combine these two methods into a unified pre-training framework on videos, so that the pre-trained model captures both static spatial content and dynamic temporal interactions. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]
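The vokenization idea described in the abstract pairs each contextualized token with a retrieved image. A minimal sketch of one plausible formulation is given below: a batch-wise contrastive (InfoNCE-style) loss over token-image pairs, where each token's matched image is the positive and other images in the batch are negatives. The function name, embedding dimensions, and toy data are illustrative assumptions, not the dissertation's actual implementation.

# Hypothetical sketch of a vokenization-style contrastive objective.
# Assumes token and image embeddings of equal dimension; all names
# and shapes here are illustrative, not from the dissertation.
import torch
import torch.nn.functional as F

def voken_contrastive_loss(token_emb, image_emb, temperature=0.07):
    """token_emb: (N, d) contextualized token vectors.
    image_emb: (N, d) embeddings of each token's matched image.
    Matched pairs (the diagonal of the similarity matrix) are
    positives; all other images in the batch serve as negatives."""
    token_emb = F.normalize(token_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = token_emb @ image_emb.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(token_emb.size(0))         # index of each positive
    return F.cross_entropy(logits, targets)

# Toy usage: 8 tokens with 256-dimensional embeddings.
tokens = torch.randn(8, 256)
images = torch.randn(8, 256)
loss = voken_contrastive_loss(tokens, images)
print(loss.item())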
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://www.proquest.com/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Authoring Institution: N/A
Grant or Contract Numbers: N/A
Author Affiliations: N/A