The future of performance scaling lies in massively parallel workloads, but less-parallel applications will remain important. Unfortunately, future process technologies and core microarchitectures no longer promise major per-thread performance improvements, so microarchitects must find new ways to address a growing per-thread performance deficit. Moreover, they must do so without sacrificing parallel throughput. To meet these apparently conflicting demands, this dissertation proposes a Timing Speculation (TS) system for CMPs that boosts core clock frequencies past their normal limits when an application demands per-thread performance and operates efficiently at nominal frequency when it demands throughput. This work's contributions are organized into three interlocking proposals.

This work begins by introducing Paceline, the first TS microarchitecture designed specifically for CMPs. Paceline enables two cores to work together to execute a single thread at high speed under TS or independently to execute two threads at the rated frequency. In single-thread mode, one core in the pair, the "Leader", executes at higher-than-normal frequency, while a "Checker" runs at the rated, safe frequency. The Leader runs the program faster but may experience timing errors. To detect and correct these errors, the Checker periodically compares a hash of its architectural state with that of the Leader. The Leader helps the Checker keep up by passing it branch results and prefetches.

Next, this dissertation enhances Paceline with BlueShift, a circuit design method for TS architectures that improves a circuit's common-case delay rather than focusing on worst-case delay like traditional design flows. BlueShift profiles a gate-level design as it runs real benchmark applications to identify the frequently-exercised circuit paths and then applies speed optimizations to those paths only. These optimizations can be implemented in a way that can be enabled and disabled at run-time so that they do not exact a power cost when they are not needed (i.e., when the processor is executing a throughput workload).

Finally, this work presents LeadOut, a CMP design that combines Paceline with an additional per-thread performance enhancement: the ability to increase core supply voltage above nominal. LeadOut evaluates the performance gains that are possible with Paceline alone, voltage boosting alone, and both together. It shows major gains from applying the two techniques together when feasible and also shows that, in many cases, future CMPs have power and temperature headroom to exploit still more per-thread enhancements as long as they can be enabled and disabled dynamically according to application demand.
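
The Leader/Checker interaction described for Paceline can be illustrated with a small software toy. This is only a sketch under stated assumptions, not the dissertation's hardware design: the register-file "architectural state", the `step` instruction model, the bit-flip model of a timing error, the SHA-256 hash, and the recovery policy of copying the Checker's state into the Leader are all hypothetical choices made for illustration. The abstract says only that hashes of architectural state are compared periodically to detect and correct errors.

```python
import hashlib
import random


def step(state, i):
    """Execute one toy 'instruction': deterministically update one register."""
    regs = list(state)
    r = i % len(regs)
    regs[r] = (regs[r] * 31 + i) & 0xFFFFFFFF
    return tuple(regs)


def state_hash(state):
    """Hash the (toy) architectural state; SHA-256 is an illustrative choice."""
    return hashlib.sha256(repr(state).encode()).hexdigest()


def run_pair(n_instructions, interval, error_rate, seed=0):
    """Run a Leader and a Checker over the same instruction stream.

    The Leader may suffer injected 'timing errors' (a flipped bit). Every
    `interval` instructions the two state hashes are compared; on a mismatch
    the Leader is repaired from the Checker's state (an assumed policy).
    Returns (leader_state, checker_state, number_of_rollbacks).
    """
    rng = random.Random(seed)
    leader = checker = (1, 2, 3, 4)
    rollbacks = 0
    for start in range(0, n_instructions, interval):
        end = min(start + interval, n_instructions)
        # Leader runs the interval at "high frequency" and may err.
        for i in range(start, end):
            leader = step(leader, i)
            if rng.random() < error_rate:
                regs = list(leader)
                regs[0] ^= 1  # hypothetical timing error: one flipped bit
                leader = tuple(regs)
        # Checker runs the same interval at the safe frequency, error-free.
        for i in range(start, end):
            checker = step(checker, i)
        # Periodic hash comparison detects any divergence.
        if state_hash(leader) != state_hash(checker):
            leader = checker
            rollbacks += 1
    return leader, checker, rollbacks
```

Calling `run_pair(1000, 50, 0.01)` runs 1000 toy instructions with a comparison every 50. However many errors are injected, the Leader's final state matches the Checker's, because each divergence is caught at the next comparison. A longer interval means fewer comparisons but more work at risk per mismatch, which is the kind of trade-off a real checkpointing scheme would have to tune.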
Improving Per-Thread Performance on CMPs through Timing Speculation