What is the MODALITY corpus?

The MODALITY corpus consists of over 30 hours of multimodal recordings. The database contains high-resolution, high-framerate stereoscopic video streams and audio signals obtained from a microphone array and a laptop microphone. Because every utterance is labelled, the corpus can be used to develop an audio-visual speech recognition (AVSR) system. Recordings made in noisy conditions can be used to test the robustness of speech recognition systems.


The MODALITY audio-visual corpus for multimodal automatic speech recognition. Copyright © Multimedia Systems Department, Gdańsk University of Technology.

Distribution and usage of this corpus are allowed under the following conditions:

  1. The corpus is provided as is. The authors do not warrant that the corpus will be free from errors or will be suitable for any particular purpose.
  2. The authors of the corpus are not responsible for any direct or indirect problems that may be caused to the user of this corpus.
  3. The use of the corpus is limited to research and educational purposes only.
  4. Any work (e.g., journal articles, technical reports, conference papers, etc.) resulting from the use of the MODALITY corpus must cite the following papers:

    Czyzewski, A., Kostek, B., Bratoszewski, P. et al. J Intell Inf Syst (2017) 49: 167. https://doi.org/10.1007/s10844-016-0438-z

    Jachimski, D. & Czyżewski, A. A comparative study of English viseme recognition methods and algorithms. Multimed Tools Appl (2018) 77: 16495. https://doi.org/10.1007/s11042-017-5217-5

    Kawaler, M. & Czyżewski, A. Speech database including facial expressions recorded with the Face Motion Capture system. J Intell Inf Syst (2019) 53: 381. https://doi.org/10.1007/s10844-019-00547-y

Corpus Features


35 speakers

including 17 native and 18 non-native speakers


Full HD / 100 FPS

video capture


Different recording conditions

includes recordings in clean and noisy conditions


Labeled material

The corpus contains hand-made label files as ground truth for AVSR algorithms


2.1 TB

of high-quality audio-visual material


8 PCM audio streams

gathered from a microphone array



Includes separated commands and continuous sentences


Time-of-Flight camera recordings

providing depth images for further analysis

What do I need to get started?


Fast connection


A lot of disk space


VLC Media Player


Many ideas