Nowadays, many embedded devices, such as smartphones and Amazon Alexa, use automatic speech recognition (ASR) technology as a hands-free interface. Neural network-based algorithms, in particular, are widely employed in ASR because of their high accuracy and resilience in noisy environments.

Neural network-based algorithms require a large amount of computation for real-time operation. As a result, most of today's ASR systems adopt server-based processing. However, privacy concerns and the need for low latency are increasing the demand for on-device ASR, where power consumption should be minimized to extend the operating time.

Many neural network models have been developed for high-performance ASR. Among them, recurrent neural network (RNN) based algorithms are the most commonly used for speech recognition; the long short-term memory (LSTM) RNN is especially well known. However, executing the LSTM algorithm on an embedded device consumes much power because the cache is too small to accommodate all the network parameters. Frequent DRAM accesses due to cache misses not only slow the execution but also incur considerable power consumption. One possible way to mitigate this problem is to compute multiple output samples at a time, called multi-time step parallelization, which reduces the number of parameter fetches. However, the complex feedback structure of the LSTM RNN, in which each time step depends on the previous output, does not allow multi-time step parallel processing.

This thesis presents a Residual Simple Gated Convolutional Network (Residual Simple Gated ConvNet) model with only about 1M parameters. Many of today's CPUs can hold a neural network of this size entirely in cache memory, so the model can run ASR very fast and efficiently without consuming much power. Because the developed model is based on a convolutional neural network, multi-time step processing can easily be applied.
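The contrast between sequential and multi-time step execution can be illustrated with a minimal NumPy sketch. The shapes and values below are purely illustrative, not taken from the thesis: because a convolution has no feedback from y[t-1] to y[t], all T output steps can be computed in one batched operation, whereas an LSTM must produce them one step at a time.

```python
import numpy as np

# Illustrative sizes: C channels, kernel width K, T output time steps.
C, K, T = 4, 3, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T + K - 1, C))  # padded input frames
w = rng.standard_normal((K, C))          # one convolutional filter

# Sequential: each output sample computed one step at a time,
# the only option when step t depends on the result of step t-1.
seq = np.array([np.sum(x[t:t + K] * w) for t in range(T)])

# Multi-time step: all T outputs from a single batched computation.
# The filter w is fetched once and reused across every window.
windows = np.stack([x[t:t + K] for t in range(T)])  # (T, K, C)
par = np.einsum('tkc,kc->t', windows, w)

assert np.allclose(seq, par)
```

Both paths produce identical outputs; the batched form simply amortizes parameter fetches over many time steps, which is the benefit the thesis exploits.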
To achieve high accuracy with a small number of parameters, the model employs one-dimensional depthwise convolution, which helps to find temporal patterns in the speech signal. We also considered inception-style residual connections to reduce the required number of layers, but this approach needs further improvement. The developed Residual Simple Gated ConvNet showed fairly high accuracy even with 1M parameters when trained on the WSJ speech corpus, and it demands less than 10% of CPU time when running on ARM-based CPUs for embedded devices.
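One-dimensional depthwise convolution can be sketched as follows; this is a generic NumPy illustration with made-up shapes, not the thesis implementation. Each channel is filtered independently with its own length-K kernel, so the layer needs only K·C parameters instead of the K·C·C of a full 1-D convolution, which is how the parameter count stays small.

```python
import numpy as np

def depthwise_conv1d(x, w):
    """Depthwise 1-D convolution along time.

    x: (T, C) input frames; w: (K, C), one length-K filter per channel.
    Each channel is convolved with its own filter only (no cross-channel
    mixing), giving an output of shape (T - K + 1, C).
    """
    T, C = x.shape
    K = w.shape[0]
    out = np.empty((T - K + 1, C))
    for t in range(T - K + 1):
        # Per-channel dot product over the K-frame window.
        out[t] = np.sum(x[t:t + K] * w, axis=0)
    return out

x = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2 channels
w = np.ones((3, 2)) / 3.0                     # moving-average filter per channel
y = depthwise_conv1d(x, w)                    # shape (4, 2)
```

In practice a pointwise (1x1) convolution usually follows the depthwise stage to mix information across channels; the gating and residual connections of the proposed model would then be built around such layers.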