Laszlo Toth is a senior researcher at the Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and the University of Szeged. He received his M.Sc. and Ph.D. degrees from the University of Szeged. He joined the Research Group on Artificial Intelligence in 1995, where his main research interests are speech recognition, machine learning and signal processing. He has been involved in the development of neural network-based speech recognition techniques such as the segment-based modeling approach, the hybrid hidden Markov model – neural network (HMM/ANN) approach, and ANN-based multi-lingual speech recognition methods. He has co-authored papers with researchers from the Technical University of Budapest, the University of Edinburgh and the Katholieke Universiteit Leuven. Currently, he is focusing on the application of deep neural networks (DNNs) to speech recognition, including the most recent DNN technologies such as convolutional and recurrent neural networks.
Abstract: Application of deep convolutional neural networks to speech recognition
The concept of convolutional neural networks (CNNs) emerged in the field of image processing several decades ago. However, CNNs have become extremely popular only recently, now that we know how to train large deep structures and have the necessary computing power and huge training datasets. Apart from a few early attempts, CNNs were hardly applied to speech recognition, perhaps because the usual cepstral input representation is quite different from the 2D images that CNNs are trained on in image recognition. The current deep learning revolution in speech recognition, however, revealed that these new deep neural networks do not require the cepstral computation, so we can return to a simpler, spectrogram-like input, which is basically an image. This opened up the possibility of applying CNNs to speech recognition, with the most cited early studies being those of Abdel-Hamid et al. in 2012 and Sainath et al. in 2013. My talk provides an overview of my experiences and results obtained from training CNNs for speech recognition. I will show how convolution along the frequency axis decreases the inter-speaker variance of the recognition error, and hence the sensitivity of the system to the actual speaker. Then I will show how convolution along the time axis can be exploited to hierarchically process a long observation context. These two types of convolutional processing can be combined within the same network, resulting in a hierarchical convolutional network structure that performs convolution along both the time axis and the frequency axis. I will also point out the similarity between convolutional pooling and maxout neurons, and explain how to build convolutional deep maxout networks. I will close my talk with the latest directions in CNN-based speech recognition research, such as the inclusion of the parameters of the feature extraction filters in the CNN optimization process, which results in CNNs trained directly on the raw speech signal.
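The core idea of convolution along the frequency axis can be sketched in a few lines. The following toy example (my own illustration, not code from the talk; the kernel widths and pooling size are arbitrary assumptions) convolves a spectrogram-like input with a bank of 1-D kernels along the frequency axis and then max-pools over neighbouring frequency bins, which is what gives the network some invariance to spectral shifts between speakers:

```python
import numpy as np

def conv_pool_frequency(spectrogram, kernels, pool_size=3):
    """Convolve a (frames x freq_bins) spectrogram with 1-D kernels
    along the frequency axis, then max-pool (non-overlapping) along
    frequency. Pooling along frequency reduces sensitivity to
    spectral shifts, e.g. between speakers."""
    frames, bins = spectrogram.shape
    k = kernels.shape[1]
    out_bins = bins - k + 1
    # one feature map per kernel: (n_kernels, frames, out_bins)
    maps = np.empty((len(kernels), frames, out_bins))
    for i, w in enumerate(kernels):
        for b in range(out_bins):
            maps[i, :, b] = spectrogram[:, b:b + k] @ w
    # non-overlapping max-pooling along the frequency axis
    pooled_bins = out_bins // pool_size
    trimmed = maps[:, :, :pooled_bins * pool_size]
    return trimmed.reshape(len(kernels), frames,
                           pooled_bins, pool_size).max(axis=-1)

# toy input: 5 frames x 40 mel-like bins, 2 random kernels of width 8
rng = np.random.default_rng(0)
spec = rng.standard_normal((5, 40))
kernels = rng.standard_normal((2, 8))
out = conv_pool_frequency(spec, kernels)
print(out.shape)  # (2, 5, 11): 40-8+1=33 bins pooled in groups of 3
```

The same mechanism applied along the time axis (convolving over neighbouring frames instead of neighbouring bins) yields the hierarchical processing of a long observation context mentioned above.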
Rudolf Rabenstein studied Electrical Engineering at the University Erlangen-Nuremberg, Germany, and at the University of Colorado at Boulder, USA. He received the degrees “Doktor-Ingenieur” in electrical engineering and “Habilitation” in signal processing from the University of Erlangen-Nuremberg, Germany. He worked with the Physics Department of the University of Siegen, Germany, and is now a Professor at the Friedrich-Alexander-University Erlangen-Nuremberg. His research interests include multidimensional systems theory and multimedia signal processing. Currently he is an associate editor of the Springer journal Multidimensional Systems and Signal Processing and a member of the Special Area Team on Acoustic, Sound and Music Signal Processing of the European Association for Signal Processing (EURASIP).
Abstract: Physics-Based Sound Field Reproduction
Techniques for spatial sound reproduction were originally developed for the reproduction of music and for movie sound tracks. In speech communication they have been applied to video conference systems and recently also to the reproduction of ambient noise for device testing. Here the main focus is on the correct recreation of the physical properties of a sound field.
Physics-based sound field reproduction is also the topic of this presentation. At first, it surveys the physical foundations of sound fields and their reproduction with loudspeaker arrays. Then two established reproduction techniques are presented: Wavefield Synthesis and Ambisonics. Both techniques are discussed in a unifying way starting from a common synthesis equation. Wavefield Synthesis follows by considering the integral form of the acoustic wave equation. The solution of the wave equation can be approximated by a sufficient number of loudspeakers distributed on a surface around the listening area. Ambisonics takes another route by expanding the solution of the wave equation into suitable basis functions, the so-called spherical harmonics. This series expansion can be truncated to only a few terms if the listening area is restricted in size. The strengths and the limitations of both methods are discussed.
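The Ambisonics route described above can be illustrated with the standard textbook expansion of an interior sound field (this is the general form of such a series, not an equation taken from the presentation itself). In a source-free region, the solution of the wave equation can be written as

```latex
p(r, \theta, \phi, \omega) \approx
  \sum_{n=0}^{N} \sum_{m=-n}^{n}
  A_n^m(\omega)\, j_n(kr)\, Y_n^m(\theta, \phi)
```

where $j_n$ are the spherical Bessel functions, $Y_n^m$ the spherical harmonics, and $k = \omega / c$ the wave number. Truncating the series at order $N$ leaves $(N+1)^2$ terms; a common rule of thumb is that $N \gtrsim kR$ suffices for a listening area of radius $R$, which is why the expansion works well when the listening area is restricted in size.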
The presentation is based on a restricted coverage of the physical and mathematical foundations. In addition, it uses graphical representations and visualizations of sound fields to highlight the performance of, and the differences between, the presented spatial sound reproduction techniques.
Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship, all in EE/IT, from TUM in Munich/Germany. At present, he is Full Professor and Chair of Complex and Intelligent Systems at the University of Passau/Germany, Reader (Associate Professor) in Machine Learning at Imperial College London/UK, the co-founding CEO of audEERING, and permanent Visiting Professor at HIT/P.R. China. Previous major positions include Joanneum Research in Graz/Austria, and the CNRS-LIMSI in Orsay/France. Dr. Schuller is an elected member of the IEEE Speech and Language Processing Technical Committee, a Senior Member of the IEEE, and was President of the Association for the Advancement of Affective Computing. He (co-)authored >600 publications (h-index = 56), is the Editor in Chief of the IEEE Transactions on Affective Computing, General Chair of ACII 2019 and ACM ICMI 2014, a Program Chair of Interspeech 2019, ACII 2015 and 2011, ACM ICMI 2013, and IEEE SocialCom 2012. He has won a range of awards, including being honoured as one of 40 extraordinary scientists under the age of 40 by the World Economic Forum in 2015 and 2016, was Coordinator or PI in more than 10 European projects, and is a consultant to companies such as Huawei and Samsung.
Abstract: Automatic Speaker Analysis 2.0: Hearing the Bigger Picture
Automatic Speaker Analysis has largely focused on single aspects of a speaker, such as her ID, gender, emotion, personality or health state. This broadly ignores the interdependency of all the different states and traits impacting the one single voice production mechanism available to a human speaker. In other words, sometimes we may sound depressed, but we simply have the flu and can hardly find the energy to put more vocal effort into our articulation and sound production. Recently, this shortcoming gave rise to an increasingly holistic speaker analysis – assessing the “larger picture” in one pass, such as by multi-target learning. For a robust assessment, however, this requires large amounts of speech and language resources labelled in rich ways to learn such interdependencies, and architectures able to cope with multi-target learning of massive amounts of speech data. In this light, this talk will discuss efficient mechanisms such as large-scale social-media pre-scanning with dynamic cooperative crowd-sourcing for rapid data collection, cross-task labelling of these data in a wider range of attributes to reach “big & rich” speech data, and efficient multi-target end-to-end and end-to-evolution deep learning paradigms to learn an accordingly rich representation of diverse target tasks in efficient ways. The ultimate goal is to enable machines to “hear” the larger picture – the person and her condition and whereabouts behind the voice and words – rather than aiming at a single aspect, ignorant and blind to the overall individual and their state, thus leading to the next level of Automatic Speaker Analysis.
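The idea of multi-target learning of several speaker attributes from one shared voice representation can be sketched as follows. This is a minimal illustration of the general architecture (the dimensions, task names, and two-head layout are my own assumptions, not details of the systems discussed in the talk): one shared hidden layer feeds several task-specific output heads, so all tasks see the same learned representation of the voice.

```python
import numpy as np

def multi_target_forward(x, W_shared, heads):
    """One forward pass of a minimal multi-target network:
    a shared hidden layer feeds several task-specific heads
    (e.g. emotion, health state), so all target tasks share
    one representation of the same voice."""
    h = np.tanh(x @ W_shared)              # shared representation
    return {task: h @ W for task, W in heads.items()}

rng = np.random.default_rng(1)
# illustrative sizes: 26-dim acoustic features, 16 shared hidden units
x = rng.standard_normal((4, 26))           # batch of 4 utterances
W_shared = rng.standard_normal((26, 16)) * 0.1
heads = {
    "emotion": rng.standard_normal((16, 6)) * 0.1,  # 6 emotion classes
    "health":  rng.standard_normal((16, 2)) * 0.1,  # healthy / not
}
outputs = multi_target_forward(x, W_shared, heads)
print({task: out.shape for task, out in outputs.items()})
```

In training, the losses of all heads would be summed and backpropagated through the shared layer, which is how the interdependencies between the speaker's states and traits are captured in one model.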
Adrian Iftene studied Computer Science at the University “Alexandru Ioan Cuza”, Iasi, Romania. He received his M.Sc. and Ph.D. degrees from the University “Alexandru Ioan Cuza”. He is now an Associate Professor at the University “Alexandru Ioan Cuza”, Faculty of Computer Science. His research interests include natural language processing (Named Entity Recognition – identification and classification, Opinion Mining and Sentiment Analysis, Textual Entailment, Question Answering, Information Retrieval) for both Romanian and English. He also has experience in analyzing social networks (Twitter, Facebook, MySpace, Flickr), in user profiling, and in assessing the credibility of users and resources. So far, he has been involved in 24 research projects (12 international and 12 national). In two projects (MUCKE and Compet IT&C) he was project manager, and in the STAGES project he was scientific adviser.
Abstract: Using Text Processing in a Multimedia Environment
The MUCKE (Multimedia and User Credibility Knowledge Extraction) project, of the ERA-NET CHIST-ERA type, departed from current knowledge extraction models, which are mainly quantitative, by giving high importance to the quality of the processed data, in order to protect the user from an avalanche of equally topically relevant data. MUCKE introduced two central innovations: automatic user credibility estimation for multimedia streams and adaptive multimedia concept similarity. Adaptive multimedia concept similarity departed from existing models by creating a semantic representation of the underlying corpora and assigning a probabilistic framework to them. The UAIC (“Alexandru Ioan Cuza” University) team was involved in the main tasks of the project: building the data collection, text processing, diversification in image retrieval, and data credibility.