Replicating a human-level understanding of the physical world in computers is a monumental task. Achieving this requires building representations of concepts that manifest themselves visually, linguistically or through other senses. Furthermore concepts do not exist in isolation but are related to each other. In this work, we show how to build representations of concepts from visual and textual data, link visual manifestations of concepts to references in text descriptions (a problem known as word or phrase grounding) without strong supervision, and model the interaction between concepts. Specifically, we address the following three challenges faced by existing vision-language models: The first challenge is that of building generalizable and accurate representations of images and words. For generalization across tasks, we build aligned image-word representations that can be shared across multiple tasks like visual recognition and visual question answering and enhance inductive transfer between them. We also augment text-only word embeddings with word embeddings learned from visual co-occurrences to provide more accurate representations of visual concepts.The second challenge is linking references to visual concepts in textual descriptions to the corresponding regions in the image without requiring strong supervision in the form of word-region grounding. We show that maximizing a lower bound on mutual information between image regions and captions leads to state-of-the-art phrase grounding performance.The third challenge is extending vision-language systems to model interactions between visual entities. We build systems that demonstrate this ability in both generation and detection settings. We show how to generate a plausible layout and appearance of entities given a text description of entity actions and interactions. We also develop a state-of-the-art factored model and training techniques for detecting human-object interactions using pretrained object and pose detectors.