Distributed, in-memory key-value stores have emerged as one of today's most important data center workloads. Because they are critical to the scalability of modern web services, vast resources are dedicated to key-value stores in order to ensure that quality-of-service guarantees are met. These resources include: many server racks to store terabytes of key-value data, the power necessary to run all of the machines, networking equipment and bandwidth, and the data center warehouses used to house the racks.

There is, however, a mismatch between the key-value store software and the commodity servers on which it is run, leading to inefficient use of resources. The primary cause of inefficiency is the overhead incurred from processing individual network packets, which typically carry small payloads and require minimal compute resources. Thus, one of the key challenges as we enter the exascale era is how to best adjust to the paradigm shift from compute-centric to storage-centric data centers.

This dissertation presents a hardware/software solution that addresses the inefficiency issues present in the modern data centers on which key-value stores are currently deployed. First, it proposes two physical server designs, both of which use 3D-stacking technology and low-power CPUs to improve density and efficiency. The first 3D architecture---Mercury---consists of stacks of low-power CPUs with 3D-stacked DRAM. The second architecture---Iridium---replaces DRAM with 3D NAND Flash to improve density.

The second portion of this dissertation proposes an enhanced version of the Mercury server design---called KeyVault---that incorporates integrated, zero-copy network interfaces along with an integrated switching fabric. In order to utilize the integrated networking hardware, as well as reduce the response time of requests, a custom networking protocol is proposed.
Unlike prior works on accelerating key-value stores---e.g., by completely bypassing the CPU and OS when processing requests---this work bypasses the CPU and OS only when placing network payloads into a process's memory. The insight behind this is that because most of the overhead comes from processing packets in the OS kernel---and not the request processing itself---direct placement of a packet's payload is sufficient to provide higher throughput and lower latency than prior approaches.