A long-term goal in AI is to build general-purpose intelligent agents that simultaneously possess the ability to perceive the rich visual environment around us (through vision, audition, or other sensors), reason and infer from perception in an interpretable and actionable manner, communicate this understanding to humans and other agents (e.g., hold a natural language dialog grounded in the environment), and act on this understanding in physical worlds (e.g., aid humans by executing commands in an embodied environment). To make progress towards this grand goal, we must explore new multimodal AI tasks, move from static datasets to physical environments, and build new kinds of models. In this dissertation, we combine insights from different areas of AI -- computer vision, language understanding, reinforcement learning -- and present steps towards connecting the underlying domains of vision and language to actions, in service of such general-purpose agents. In Part 1, we develop agents that can see and talk -- capable of holding free-form conversations about images -- and reinforcement learning-based algorithms to train these visual dialog agents via self-play. In Part 2, we extend our focus to agents that can see, talk, and act -- embodied agents that can actively perceive and navigate partially-observable simulated environments to accomplish tasks such as question-answering. In Part 3, we devise techniques for training populations of agents that communicate with each other to coordinate, strategize, and utilize their combined sensory experiences to act in the physical world. These agents learn both what messages to send and whom to communicate with, solely from downstream reward and without any communication supervision. Finally, in Part 4, we use question-answering as a task-agnostic probe to ask a self-supervised embodied agent what it knows about its physical world, and use it to quantify differences in the visual representations agents develop when trained with different auxiliary objectives.