Statistical (machine learning) methods have been applied successfully to speech signals to predict a speaker's emotional state from the nonverbal content of speech. The common approach is to extract features that have been shown to carry the required information (emotions) and then apply a classifier to the feature data.
We suggest a novel approach: applying a Deep Learning-based classifier directly to the time-frequency representation of the raw speech. We describe the considerations behind the network topology design, which is inspired by the biological speech perception mechanism and uses common building blocks such as ConvNets and LSTM.
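To make "time-frequency representation of the raw speech" concrete, the following minimal sketch computes a log-magnitude spectrogram from a sampled signal. It is an illustration only, not the configuration used in this work: the abstract does not state the window type, frame length, hop size, or scaling, so the Hann window, the values frame_len=64 and hop=32, and the function name log_spectrogram are all assumptions chosen for readability. A naive DFT is used so that the example needs only the standard library; a practical pipeline would use an FFT routine.

```python
import cmath
import math


def log_spectrogram(samples, frame_len=64, hop=32, eps=1e-10):
    """Hypothetical helper: list of frames, each a list of log-magnitudes
    for frequency bins 0 .. frame_len // 2."""
    # Hann window (assumed choice) reduces spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    # Precompute the DFT basis once; each row corresponds to one frequency bin.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_len)
              for n in range(frame_len)]
             for k in range(n_bins)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        magnitudes = [abs(sum(x * b for x, b in zip(frame, row)))
                      for row in basis]
        # eps keeps the logarithm finite for silent bins.
        frames.append([math.log(m + eps) for m in magnitudes])
    return frames


# Synthetic stand-in for speech (assumption): a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(800)]
spec = log_spectrogram(signal)
```

The result is a two-dimensional time-by-frequency array. An array of this kind is what a classifier built from convolutional and recurrent layers would take as input, in place of hand-crafted features.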
Next, we present and analyze the classification results and demonstrate an improvement over the previously reported state of the art. We conclude that speech emotion recognition is another area where applying Deep Learning methods directly to the raw information improves on the traditional use of hand-crafted feature extraction as a step preceding the classification itself.
Dr. Aharon Satt received the B.Sc., M.Sc. and D.Sc. degrees in Electrical Engineering from the Technion, Israel Institute of Technology. His areas of expertise include signal and speech processing, information theory, and applications of machine learning. He has held multiple research, industrial and technology-leadership positions in several companies across the hi-tech industry in Israel. He has been involved in European and binational research projects. He is currently active in research and development of emotion recognition technology.