Recent trends in systems architecture include the growing importance of warehouse-sized computers and new solutions to address the scalability and power-efficiency challenges of such large-scale data centers. The key driver behind this rapid growth is a new class of large-scale applications that constantly push the capacity and capability of existing infrastructures to the limit. The essence of these applications is distributed processing of large datasets to satisfy multi-dimensional service-level requirements. Further research by the broader community on architectural issues for such large-scale data centers requires a representative set of the emerging distributed workloads that drive these markets. This paper discusses this challenge. Specifically, we recognize the data-centricity of these workloads and discuss how requirements are changing in that context. We discuss a data-centric workload taxonomy that seeks to separate the most important dimensions along which these applications differ. By examining existing and emerging workloads, we argue for a systematic approach to deriving a coverage set of workloads based on this taxonomy. Inspired by the "seven dwarfs" of numerical computation [1][2], we believe that our community needs to collectively identify a set of "data dwarfs", or key data processing kernels, that provide current and future coverage of this space and can be modeled by open benchmarks with realistic datasets, for reasoning about new architectural designs and tradeoffs. This discussion was initiated at the 2010 ACLD workshop, and we hope these goals will be achieved collectively by the computer architecture community.