As technological systems become increasingly advanced, the need to include the human in the interaction process has become more apparent. One simple way to do so is to have the computer system understand and respond to the human's emotions. Previous work in emotion recognition has focused on improving performance by incorporating domain knowledge into the underlying system, either through pre-specified rules or hand-crafted features. In recent years, however, learned feature representations have experienced a resurgence, due mainly to the success of deep neural networks.

In this dissertation, we highlight how deep neural networks, when applied to emotion recognition, can learn representations that not only achieve higher accuracy than hand-crafted techniques but also align with prior domain knowledge. Moreover, we show how these learned representations generalize across different definitions of emotion and different input modalities.

The first part of this dissertation considers categorical emotion recognition on images. We show how a convolutional neural network (CNN) that achieves state-of-the-art performance also learns features that strongly correspond to Facial Action Units (FAUs). In the second part, we focus our attention on emotion recognition in video. We combine the image-based CNN with a recurrent neural network (RNN) to perform dimensional emotion recognition, and we visualize the regions of the face that most strongly affect the output prediction by using the gradient with respect to the input as a saliency map. Lastly, we explore the merit of multimodal emotion recognition by combining our model with other models trained on audio and physiological data.
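The gradient-based saliency idea mentioned above can be illustrated with a minimal sketch: for a model output y and input x, the saliency of input element i is |∂y/∂x_i|. The toy logistic "model", its weights, and the input values below are all assumptions for illustration, not the dissertation's CNN/RNN architecture; the gradient is computed analytically rather than by backpropagation through a deep network.

```python
import math

# Hypothetical toy "model": a single logistic unit over a flattened 4-"pixel" input.
# Saliency of pixel i = |d output / d x_i| — large magnitude means the pixel
# most strongly affects the prediction. Weights below are illustrative only.
W = [0.5, -1.2, 0.1, 2.0]

def predict(x):
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid output in (0, 1)

def saliency(x):
    y = predict(x)
    # Chain rule: d y / d x_i = y * (1 - y) * W[i]
    return [abs(y * (1.0 - y) * w) for w in W]

x = [0.2, 0.9, 0.4, 0.7]          # assumed input "image"
s = saliency(x)
# The pixel with the largest |gradient| dominates the prediction;
# here that is pixel 3, since it has the largest |weight|.
most_important = max(range(len(s)), key=lambda i: s[i])
```

In the dissertation's setting the same quantity is obtained by backpropagating the output score to the input image, yielding a per-pixel importance map over the face.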