Robust Multimodal Learning from Language, Visual and Acoustic Modalities