Scaling processor performance with future technology nodes is essential to enable future applications on devices ranging from smartphones to servers. However, the traditional methods of achieving that performance, frequency scaling and single-core architectural enhancements, are no longer viable because of fundamental scaling limits. To continue scaling performance, parallel computers in the form of Chip Multiprocessors (CMPs) are now prevalent, moving the challenge of parallel programming from a niche to the general domain.

One challenging area is scalable synchronization of access to shared data structures. Using traditional methods, it can take expert programmers many years to craft a scalable and correct scheme for synchronizing access to the data structures of a complex program. Researchers have therefore been searching for ways to make synchronization more tractable. One proposal is Transactional Programming, which abstracts synchronization on shared data structures as transactions, much like database operations. Transactional programming can be supported efficiently by a Transactional Memory (TM) system.

One main problem with TM systems is scalability bottlenecks. When transactional applications are written to emulate the practices expected of future average programmers, performance on large CMPs can be worse than on a single processor. This should not happen on a system meant to make programming easier. It happens because transactions, as represented in the TM system, may depend on each other: they access the same data and therefore must serialize, yet the programmer is unaware of these dependencies because the abstraction hides system details.

This thesis develops a hardware/software approach to alleviate scalability bottlenecks in TM systems while maintaining the level of abstraction presented by transactional programming.
I first introduce Proactive Transaction Scheduling (PTS), a technique that profiles parallel code at runtime to determine the order in which transactions should execute so that forward progress remains acceptable. I then propose using PTS to automatically identify the transactions that cause large amounts of serialization; these transactions are then accelerated on an asymmetric CMP to improve performance. Finally, I show that PTS can be used to partition resources in a multithreaded processor core, yielding better overall performance than a fair partitioning of resources.
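To make the idea of proactive scheduling concrete, the following is a minimal software sketch, not the hardware mechanism developed in this thesis. It assumes that transactions are identified by a static label and that the scheduler keeps a saturating "conflict confidence" counter for each pair of labels. Before a transaction starts, it checks the counters against the transactions already running; if any counter meets a threshold, the new transaction serializes behind that transaction instead of speculating and risking an abort. The class `ConflictPredictor`, its `threshold` and `max_conf` parameters, and the transaction labels are all hypothetical.

```python
from collections import defaultdict


class ConflictPredictor:
    """Hypothetical software model of proactive transaction scheduling.

    Keeps a saturating confidence counter per unordered pair of
    transaction labels: conflicts raise it, conflict-free commits lower it.
    """

    def __init__(self, threshold=2, max_conf=3):
        self.threshold = threshold      # confidence needed before serializing
        self.max_conf = max_conf        # saturation limit for each counter
        self.conf = defaultdict(int)    # (label_a, label_b) -> counter

    @staticmethod
    def _key(a, b):
        # Order the pair so (a, b) and (b, a) share one counter.
        return (a, b) if a <= b else (b, a)

    def should_wait(self, starting, running):
        """Return a running transaction to serialize behind, or None to proceed."""
        for other in running:
            if self.conf[self._key(starting, other)] >= self.threshold:
                return other
        return None

    def record_conflict(self, a, b):
        # A detected conflict makes future conflicts between this pair more likely.
        k = self._key(a, b)
        self.conf[k] = min(self.conf[k] + 1, self.max_conf)

    def record_commit(self, tx, concurrent):
        # Committing alongside these transactions without conflict decays confidence.
        for other in concurrent:
            k = self._key(tx, other)
            self.conf[k] = max(self.conf[k] - 1, 0)


if __name__ == "__main__":
    p = ConflictPredictor()
    p.record_conflict("insert", "rebalance")
    p.record_conflict("rebalance", "insert")
    print(p.should_wait("insert", ["lookup", "rebalance"]))  # serialize behind "rebalance"
    p.record_commit("insert", ["rebalance"])
    print(p.should_wait("insert", ["rebalance"]))  # confidence decayed: proceed
```

The saturating counters let the predictor adapt as program phases change: a pair that stops conflicting gradually regains the right to run concurrently, while a pair that keeps conflicting is serialized before it can waste work on aborts.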
A Hardware/Software Approach for Alleviating Scalability Bottlenecks in Transactional Memory Applications.