Grounding Language by Seeing, Hearing, and Interacting