Simple and Effective Multimodal Learning Based on Pre-Trained Transformer Models
Transformer-based models have garnered attention because of their success in natural language processing, as well as in several other fields such as image recognition and automatic speech recognition. Beyond models trained on unimodal information, many transformer-based models have been proposed for multimodal information. A common problem in multimodal learning is the insufficiency of multimodal training data. In this study, to address this problem, a simple and effective method is proposed that uses 1) unimodal pre-trained transformer models as encoders for each modal input and 2) a set of transformer layers to fuse their output representations. The proposed method is evaluated through several experiments on two common benchmarks: the CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI) dataset and the Multimodal Internet Movie Database (MM-IMDb).
The proposed model achieves state-of-the-art performance on both benchmarks and is robust to reductions in the amount of training data.
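
To make the two-stage design concrete, below is a minimal PyTorch sketch of the architecture described above: unimodal pre-trained encoders feeding a shared stack of fusion transformer layers. The class name `LateFusionTransformer`, the dummy encoder stand-ins, and all dimensions are illustrative assumptions for this sketch, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LateFusionTransformer(nn.Module):
    """Sketch of the proposed scheme: unimodal encoders + fusion layers."""

    def __init__(self, text_encoder, image_encoder, d_model=768,
                 num_fusion_layers=2, num_heads=8, num_classes=2):
        super().__init__()
        # 1) Unimodal pre-trained encoders (e.g., a BERT-style text model
        #    and a vision model); assumed here to return sequences of
        #    hidden states with shape (batch, seq_len, d_model).
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        # 2) Transformer layers that fuse the concatenated unimodal
        #    token representations into a joint sequence.
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_fusion_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text_inputs, image_inputs):
        t = self.text_encoder(text_inputs)    # (B, T_text, d_model)
        v = self.image_encoder(image_inputs)  # (B, T_img, d_model)
        # Concatenate along the sequence dimension and fuse.
        fused = self.fusion(torch.cat([t, v], dim=1))
        # Mean-pool the fused sequence and classify (pooling choice
        # is an assumption of this sketch).
        return self.classifier(fused.mean(dim=1))
```

A usage example with linear projections standing in for the pre-trained encoders, just to show the tensor shapes flowing through:

```python
model = LateFusionTransformer(nn.Linear(300, 768), nn.Linear(2048, 768))
text = torch.randn(4, 16, 300)    # e.g., per-token text features
image = torch.randn(4, 49, 2048)  # e.g., per-region image features
logits = model(text, image)       # shape: (4, 2)
```

Because the encoders are pre-trained on large unimodal corpora and only the lightweight fusion stack must learn cross-modal interactions from scratch, this kind of design plausibly needs less multimodal training data, which is consistent with the robustness result reported above.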