Understanding the meaning of linguistic expressions is a fundamental task of natural language processing. While distributed representations have become a powerful technique for modeling lexical semantics, they have traditionally relied on ungrounded text corpora to identify semantically similar words. In contrast, this thesis explicitly models the denotation of linguistic expressions by building representations from grounded image captions. This allows us to use descriptions of the world to learn connections that would be difficult to identify in text-based corpora. In particular, we explore novel approaches to entailment that capture everyday world knowledge missing from other NLP tasks, evaluating on both existing datasets and our own new dataset. We also present a novel embedding model that produces phrase representations informed by our grounded representation. We conclude with an analysis of how grounded embeddings differ from standard distributional embeddings, along with suggestions for future refinement of this approach.