Performance-hungry data center applications demand increasingly higher performance from their storage in addition to larger capacity memory at lower cost. While the existing storage technologies (e.g., HDD and flash-based SSD) are limited in their performance, the most prevalent memory technology (DRAM) is unable to address the capacity and cost requirements of these applications. Emerging byte-addressable, non-volatile memory technologies (such as PCM and RRAM) offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. Such load/store accessible non-volatile or persistent memory (referred to as NVM or PM) introduces an interesting new tier that bridges the performance gap between DRAM and PM, and serves the role of fast storage or slower memory. However, PM has several implications on system design, both hardware and software: (i) the hardware caching mechanisms, while necessary for acceptable performance, complicate the ordering and durability of stores to PM, (ii) the high performance of PM (compared to NAND) and the fact that it is byte-addressable necessitate rethinking of the system software to manage PM and the interfaces to expose PM to the applications, and (iii) the future memory-based applications that will likely employ systems coupling PM with DRAM (for cost and capacity reasons) must be extremely conscious of the performance characteristics of PM and the challenges of using fast vs. slow memory in ways that best meet their performance demands.The key contribution of our research is a set of technologies that addresses these challenges in a bottom-up fashion. Since the real hardware is not yet available, wefirst implement a hardware emulator that can faithfully emulate the relative performance characteristics of DRAM and PM in a system with separate DRAM and emulated PM regions. We use this emulator to perform all of our evaluations. Next we explore system software support to enable low-overhead PM access by new and legacy applications. Towards this end, we implement PMFS, an optimized light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that provides software enforceable guarantees of durability and ordering of stores to PM. We demonstrate that PMFS achieves significant (up to an order of magnitude) gains over traditional file systems (such as ext4) on a RAMDISK-like PM block device.Finally, we address the problem of designing memory-based applications for systems with both DRAM and PM by extending our system software to manage both the tiers. We demonstrate for several representative large in-memory applications that it is possible to use a small amount of fast DRAM and large amounts of slower PM without a proportional impact to an application's performance, provided the placement of data structures is done in a careful fashion. To simplify the application programming, we implement a set of libraries and automatic tools (called X-Mem) that enables programmers to achieve optimal data placement with minimal effort on their part. Finally, we demonstrate the potentially large benefits of application-driven memory tiering with X-Mem across a range of applications.