A text-to-speech (TTS) synthesis system can speak in only a few voices, each derived from audio recordings of a real person. TTS voice transformation, which changes the perceived speaker identity in a controllable way, is an attractive alternative to the expensive, lengthy and labor-intensive recording and processing of new speech datasets. Foreseen entertainment applications in particular will require multitudes of distinct TTS voices to be created on demand, which makes voice transformation the only viable option.
I'll present the state of the art in TTS voice transformation research and our work on endowing a product-level TTS system with instant, externally configurable voice transformation capabilities.
Alexander Sorin is a senior researcher in the Speech Technologies group at IBM Haifa Research Lab (HRL). He is the author of numerous articles and holds 7 patents. He received his M.Sc. degree in Applied Mathematics from the Automation and Computers Department of Moscow Oil and Gas Institute, USSR, in 1979. Since 1988 he has worked at IBM HRL on numerous research projects in speech and image processing, including concatenative and statistical text-to-speech synthesis, voice-based emotion detection, automatic speech transcription and distributed speech recognition. He has led the IBM team in several European research projects. He is currently leading a research project in the area of speech synthesis and modeling.