Speech metadata extraction can both improve speech recognition and enable novel Interactive Voice Response applications. Unlike the previous research, which concentrates on the frame-level signal processing and pattern classification, this paper systematically studies the behavior of decision combination at the utterance level. We analyze the asymptotic characteristics, and the factors affecting frame-level classification. In addition, we introduce new methods to more accurately and efficiently combine frame-level decisions, including phoneme/power-based weighting and smart sampling. Experimental results in gender classification are presented. Notes: Copyright IEEE. To be published in and presented at the 37th Asilomar Conference on Signals, Systems and Computers, 9-12 November 2003, Pacific Grove, CA 5 Pages