OpenSource Name: pliang279/awesome-multimodal-ml
OpenSource Url: https://github.com/pliang279/awesome-multimodal-ml
Reading List for Topics in Multimodal Machine Learning
By Paul Liang ([email protected]), Machine Learning Department and Language Technologies Institute, CMU, with help from members of the MultiComp Lab at LTI, CMU. If I missed any areas, papers, or datasets, please let me know!
Course content + workshops
Tutorials on Multimodal Machine Learning at CVPR 2022 and NAACL 2022
New course: 11-877 Advanced Topics in Multimodal Machine Learning, Spring 2022 @ CMU. It will be primarily reading- and discussion-based. We plan to post discussion probes, relevant papers, and summarized discussion highlights every week on the website.
Public course content and lecture videos from 11-777 Multimodal Machine Learning, Fall 2020 @ CMU.
Research Papers
Survey Papers
Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods , JAIR 2021
Experience Grounds Language , EMNLP 2020
A Survey of Reinforcement Learning Informed by Natural Language , IJCAI 2019
Multimodal Machine Learning: A Survey and Taxonomy , TPAMI 2019
Multimodal Intelligence: Representation Learning, Information Fusion, and Applications , arXiv 2019
Deep Multimodal Representation Learning: A Survey , arXiv 2019
Guest Editorial: Image and Language Understanding , IJCV 2017
Representation Learning: A Review and New Perspectives , TPAMI 2013
A Survey of Socially Interactive Robots , 2003
Core Areas
Multimodal Representations
Balanced Multimodal Learning via On-the-fly Gradient Modulation , CVPR 2022
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast , IJCAI 2021 [code]
Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text , arXiv 2021
FLAVA: A Foundational Language And Vision Alignment Model , arXiv 2021
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer , arXiv 2021
MultiBench: Multiscale Benchmarks for Multimodal Representation Learning , NeurIPS 2021 [code]
Perceiver: General Perception with Iterative Attention , ICML 2021 [code]
Learning Transferable Visual Models From Natural Language Supervision , arXiv 2021 [blog] [code]
VinVL: Revisiting Visual Representations in Vision-Language Models , arXiv 2021 [blog] [code]
12-in-1: Multi-Task Vision and Language Representation Learning , CVPR 2020 [code]
Watching the World Go By: Representation Learning from Unlabeled Videos , arXiv 2020
Learning Video Representations using Contrastive Bidirectional Transformer , arXiv 2019
Visual Concept-Metaconcept Learning , NeurIPS 2019 [code]
OmniNet: A Unified Architecture for Multi-modal Multi-task Learning , arXiv 2019 [code]
Learning Representations by Maximizing Mutual Information Across Views , arXiv 2019 [code]
ViCo: Word Embeddings from Visual Co-occurrences , ICCV 2019 [code]
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations , CVPR 2019
Multi-Task Learning of Hierarchical Vision-Language Representation , CVPR 2019
Learning Factorized Multimodal Representations , ICLR 2019 [code]
A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks , ICML 2018
Do Neural Network Cross-Modal Mappings Really Bridge Modalities? , ACL 2018
Learning Robust Visual-Semantic Embeddings , ICCV 2017
Deep Multimodal Representation Learning from Temporal Data , CVPR 2017
Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations , COLING 2016
Combining Language and Vision with a Multimodal Skip-gram Model , NAACL 2015
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , NIPS 2014
Multimodal Learning with Deep Boltzmann Machines , JMLR 2014
Learning Grounded Meaning Representations with Autoencoders , ACL 2014
DeViSE: A Deep Visual-Semantic Embedding Model , NeurIPS 2013
Multimodal Deep Learning , ICML 2011
Multimodal Fusion
Robust Contrastive Learning against Noisy Views , arXiv 2022
Cooperative Learning for Multi-view Analysis , arXiv 2022
What Makes Multi-modal Learning Better than Single (Provably) , NeurIPS 2021
Efficient Multi-Modal Fusion with Diversity Analysis , ACMMM 2021
Attention Bottlenecks for Multimodal Fusion , NeurIPS 2021
Trusted Multi-View Classification , ICLR 2021 [code]
Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis , ICDM 2020
Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies , NeurIPS 2020 [code]
Deep Multimodal Fusion by Channel Exchanging , NeurIPS 2020 [code]
What Makes Training Multi-Modal Classification Networks Hard? , CVPR 2020
Dynamic Fusion for Multimodal Data , arXiv 2019
DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis , IJCAI 2019 [code]
Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling , NeurIPS 2019
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification , IEEE TNNLS 2019 [code]
MFAS: Multimodal Fusion Architecture Search , CVPR 2019
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision , ICLR 2019 [code]
Unifying and merging well-trained deep neural networks for inference stage , IJCAI 2018 [code]
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors , ACL 2018 [code]
Memory Fusion Network for Multi-view Sequential Learning , AAAI 2018 [code]
Tensor Fusion Network for Multimodal Sentiment Analysis , EMNLP 2017 [code]
Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , AAAI 2015
A co-regularized approach to semi-supervised learning with multiple views , ICML 2005
Multimodal Alignment
Reconsidering Representation Alignment for Multi-view Clustering , CVPR 2021 [code]
CoMIR: Contrastive Multimodal Image Representation for Registration , NeurIPS 2020 [code]
Multimodal Transformer for Unaligned Multimodal Language Sequences , ACL 2019 [code]
Temporal Cycle-Consistency Learning , CVPR 2019 [code]
See, Hear, and Read: Deep Aligned Representations , arXiv 2017
On Deep Multi-View Representation Learning , ICML 2015
Unsupervised Alignment of Natural Language Instructions with Video Segments , AAAI 2014
Multimodal Alignment of Videos , MM 2014
Deep Canonical Correlation Analysis , ICML 2013 [code]
Multimodal Pretraining
Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling , CVPR 2021 [code]
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer , arXiv 2021
Large-Scale Adversarial Training for Vision-and-Language Representation Learning , NeurIPS 2020 [code]
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision , EMNLP 2020 [code]
Integrating Multimodal Information in Large Pretrained Transformers , ACL 2020
VL-BERT: Pre-training of Generic Visual-Linguistic Representations , arXiv 2019 [code]
VisualBERT: A Simple and Performant Baseline for Vision and Language , arXiv 2019 [code]
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks , NeurIPS 2019 [code]
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , arXiv 2019
LXMERT: Learning Cross-Modality Encoder Representations from Transformers , EMNLP 2019 [code]
VideoBERT: A Joint Model for Video and Language Representation Learning , ICCV 2019
Multimodal Translation
Zero-Shot Text-to-Image Generation , ICML 2021 [code]
Translate-to-Recognize Networks for RGB-D Scene Recognition , CVPR 2019 [code]
Language2Pose: Natural Language Grounded Pose Forecasting , 3DV 2019 [code]
Reconstructing Faces from Voices , NeurIPS 2019 [code]
Speech2Face: Learning the Face Behind a Voice , CVPR 2019 [code]
Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities , AAAI 2019 [code]
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions , ICASSP 2018 [code]
Crossmodal Retrieval
Learning with Noisy Correspondence for Cross-modal Matching , NeurIPS 2021 [code]
MURAL: Multimodal, Multitask Retrieval Across Languages , arXiv 2021
Self-Supervised Learning from Web Data for Multimodal Retrieval , arXiv 2019
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , CVPR 2018
Multimodal Co-learning
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , ICML 2021
Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions , arXiv 2021
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision , EMNLP 2020
Foundations of Multimodal Co-learning , Information Fusion 2020
Missing or Imperfect Modalities
A Variational Information Bottleneck Approach to Multi-Omics Data Integration , AISTATS 2021 [code]
SMIL: Multimodal Learning with Severely Missing Modality , AAAI 2021
Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series , arXiv 2019
Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization , ACL 2019
Multimodal Deep Learning for Robust RGB-D Object Recognition , IROS 2015
Analysis of Multimodal Models
M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis , IEEE TVCG 2022
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers , TACL 2021
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! , EMNLP 2020
Blindfold Baselines for Embodied QA , NIPS 2018 Visually-Grounded Interaction and Language Workshop
Analyzing the Behavior of Visual Question Answering Models , EMNLP 2016
Knowledge Graphs and Knowledge Bases
MMKG: Multi-Modal Knowledge Graphs , ESWC 2019
Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs , AKBC 2019
Embedding Multimodal Relational Data for Knowledge Base Completion , EMNLP 2018
A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning , *SEM 2018 [code]
Order-Embeddings of Images and Language , ICLR 2016 [code]
Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries , arXiv 2015
Interpretable Learning
Multimodal Explanations by Predicting Counterfactuality in Videos , CVPR 2019
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence , CVPR 2018 [code]
Do Explanations make VQA Models more Predictable to a Human? , EMNLP 2018
Towards Transparent AI Systems: Interpreting Visual Question Answering Models , ICML Workshop on Visualization for Deep Learning 2016
Generative Learning
Generalized Multimodal ELBO , ICLR 2021 [code]
Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models , NeurIPS 2019 [code]
Few-shot Video-to-Video Synthesis , NeurIPS 2019 [code]
Multimodal Generative Models for Scalable Weakly-Supervised Learning , NeurIPS 2018 [code1] [code2]
The Multi-Entity Variational Autoencoder , NeurIPS 2017
Semi-supervised Learning
Semi-supervised Vision-language Mapping via Variational Learning , ICRA 2017
Semi-supervised Multimodal Hashing , arXiv 2017
Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition , IJCAI 2016
Multimodal Semi-supervised Learning for Image Classification , CVPR 2010
Self-supervised Learning
DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning , NeurIPS 2021 Datasets & Benchmarks Track [code]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering , NeurIPS 2020 [code]
Self-Supervised MultiModal Versatile Networks , NeurIPS 2020 [code]
Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision , NeurIPS 2020 [code]
Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces , CVPR 2017
Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems , 2016
Language Models
Neural Language Modeling with Visual Features , arXiv 2019
Learning Multi-Modal Word Representation Grounded in Visual Context , AAAI 2018
Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes , CVPR 2016
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , ICML 2014 [code]
Adversarial Attacks
Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models , NeurIPS Workshop on Visually Grounded Interaction and Language 2018
Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning , ACL 2018 [code]
Fooling Vision and Language Models Despite Localization and Attention Mechanism , CVPR 2018
Few-Shot Learning
Language to Network: Conditional Parameter Adaptation with Natural Language Descriptions , ACL 2020
Shaping Visual Representations with Language for Few-shot Classification , ACL 2020
Zero-Shot Learning - The Good, the Bad and the Ugly , CVPR 2017
Zero-Shot Learning Through Cross-Modal Transfer , NIPS 2013
Bias and Fairness
Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models , arXiv 2021
Towards Debiasing Sentence Representations , ACL 2020 [code]
FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment , ICMI 2020 [code]
Model Cards for Model Reporting , FAccT 2019
Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings , NAACL 2019 [code]
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification , FAccT 2018
Datasheets for Datasets , arXiv 2018
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings , NeurIPS 2016
Human in the Loop Learning
Human in the Loop Dialogue Systems , NeurIPS 2020 workshop
Human And Machine in-the-Loop Evaluation and Learning Strategies , NeurIPS 2020 workshop
Human-centric dialog training via offline reinforcement learning , EMNLP 2020 [code]
Human-In-The-Loop Machine Learning with Intelligent Multimodal Interfaces , ICML 2017 workshop
Architectures
Multimodal Transformers
Pretrained Transformers As Universal Computation Engines , AAAI 2022
Perceiver: General Perception with Iterative Attention , ICML 2021
FLAVA: A Foundational Language And Vision Alignment Model , arXiv 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio , arXiv 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text , NeurIPS 2021 [code]
Parameter Efficient Multimodal Transformers for Video Representation Learning , ICLR 2021 [code]
Multimodal Memory
Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation , arXiv 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation , NeurIPS 2021 [code]
Episodic Memory in Lifelong Language Learning , NeurIPS 2019
ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection , EMNLP 2018
Multimodal Memory Modelling for Video Captioning , CVPR 2018
Dynamic Memory Networks for Visual and Textual Question Answering , ICML 2016
Applications and Datasets
Language and Visual QA
Learning to Answer Questions in Dynamic Audio-Visual Scenarios , CVPR 2022
SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events , CVPR 2021 [code]
MultiModalQA: complex question answering over text, tables and images , ICLR 2021
ManyModalQA: Modality Disambiguation and QA over Diverse Inputs , AAAI 2020 [code]
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA , CVPR 2020
Interactive Language Learning by Question Answering , EMNLP 2019 [code]
Fusion of Detected Objects in Text for Visual Question Answering , arXiv 2019
RUBi: Reducing Unimodal Biases in Visual Question Answering , NeurIPS 2019 [code]
GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering , CVPR 2019 [code]
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge , CVPR 2019 [code]
MUREL: Multimodal Relational Reasoning for Visual Question Answering , CVPR 2019 [code]
Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence , CVPR 2019 [code]
Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering , ICML 2019 [code]
Learning to Count Objects in Natural Images for Visual Question Answering , ICLR 2018 [code]
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization , NeurIPS 2018
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding , NeurIPS 2018 [code]
RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes , EMNLP 2018 [code]
TVQA: Localized, Compositional Video Question Answering , EMNLP 2018 [code]
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , CVPR 2018 [code]
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering , CVPR 2018 [code]
Stacked Latent Attention for Multimodal Reasoning , CVPR 2018
Learning to Reason: End-to-End Module Networks for Visual Question Answering , ICCV 2017 [code]
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , CVPR 2017 [code] [dataset generation]
Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension , CVPR 2017 [code]
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , EMNLP 2016 [code]
MovieQA: Understanding Stories in Movies through Question-Answering , CVPR 2016 [code]
VQA: Visual Question Answering , ICCV 2015 [code]
Language Grounding in Vision
Core Challenges in Embodied Vision-Language Planning , arXiv 2021
MaRVL: Multicultural Reasoning over Vision and Language , EMNLP 2021 [code]
Grounding 'Grounding' in NLP , ACL 2021
The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes , NeurIPS 2020 [code]
What Does BERT with Vision Look At? , ACL 2020
Visual Grounding in Video for Unsupervised Word Translation , CVPR 2020 [code]
VIOLIN: A Large-Scale Dataset for Video-and-Language Inference , CVPR 2020 [code]
Grounded Video Description ,