including 17 natives and 18 non-natives
high-resolution, high-framerate stereoscopic video streams and audio signals
of high-quality audio-visual material
gathered from a microphone array
includes recordings in clean and noisy conditions
includes separated commands and continuous sentences
Corpus contains hand-made label files as ground truth for AVSR algorithms
The audio files use the Waveform Audio File Format (.wav) and contain a single PCM audio stream sampled at 44.1 kSa/s with 16-bit depth. The video files use the Matroska Multimedia Container Format (.mkv), which holds a 1080p video stream captured at 100 fps and compressed with the H.264 codec (High 4:4:4 profile). The ‘.lab’ files are text files that record word positions in the audio files and follow the HTK label format. Each line of a ‘.lab’ file contains a start time and an end time (in 100 ns units) followed by the label, e.g.: 1239620000 1244790000 FIVE, which denotes the word “five” occurring between 123.962 s and 124.479 s of audio.
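The label layout described above can be read with a few lines of Python. This is a minimal sketch, not an official tool of the corpus: the names LabelEntry, parse_lab_line and parse_lab_text are hypothetical, and it assumes every non-empty line holds exactly a start time, an end time and a label separated by whitespace.

```python
from typing import List, NamedTuple

# HTK label times are integers in units of 100 ns, i.e. 10,000,000 units per second.
HTK_UNITS_PER_SECOND = 10_000_000


class LabelEntry(NamedTuple):
    """One labelled word (hypothetical helper type, not part of the corpus)."""
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    label: str      # the word label, e.g. "FIVE"


def parse_lab_line(line: str) -> LabelEntry:
    """Parse one '<start> <end> <label>' line into seconds and a label."""
    start, end, label = line.split(maxsplit=2)
    return LabelEntry(
        int(start) / HTK_UNITS_PER_SECOND,
        int(end) / HTK_UNITS_PER_SECOND,
        label.strip(),
    )


def parse_lab_text(text: str) -> List[LabelEntry]:
    """Parse the full contents of a '.lab' file, skipping blank lines."""
    return [parse_lab_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    entry = parse_lab_line("1239620000 1244790000 FIVE")
    print(entry.label, entry.start_s, entry.end_s)
```

Multiplying a start or end time in seconds by the 44,100 Sa/s sampling rate gives the approximate sample index of the word boundary in the matching .wav file.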
The MODALITY corpus consists of over 30 hours of multimodal recordings. The database contains high-resolution, high-framerate stereoscopic video streams and audio signals obtained from a microphone array and a laptop microphone. The corpus can be employed to develop an AVSR system, as every utterance was labelled. Recordings in noisy conditions can be used to test the robustness of speech recognition systems.
The language material was based on a remote control scenario and includes 231 words: numbers, names of months and days, and a set of verbs and nouns related to controlling a computer device. Speakers read them as separate words and as sequences, resulting in a set of 12 recording sessions per speaker. Half of the sessions were recorded in quiet conditions; the other half contained three kinds of intrusive signals (traffic, babble and factory noise).
The corpus includes recordings of 42 speakers (33 male, 9 female). The participants include 20 students and staff of the Multimedia Systems Department of the Gdańsk University of Technology, 5 students of the Institute of English and American Studies of the University of Gdańsk, and 17 native English speakers.
Due to the size of the corpus (approx. 2.5 TB of data), every speaker’s recordings were placed in a separate zip file of approx. 4-7 GB. Multiple selected recordings can be grouped into a single archive to make downloading the corpus easier.
The recordings were organized according to the speakers’ language skills. Group A (17 speakers) consists of native speakers. Recordings of non-native speakers (Polish nationals) were placed in Group B (25 speakers).
The MODALITY corpus is described in more detail in the following paper:
Piotr Bratoszewski, Andrzej Czyżewski, Józef Kotus, Paweł Spaleniak, Marcin
Title: An audio-visual corpus for multimodal automatic speech recognition (currently under review for Journal of Intelligent Information Systems)