Overview

A leading technology company approached Klatch to collect and transcribe 3,000 hours of speech data in eight Indian languages. The objective was to produce a high-quality dataset for training speech recognition models. Klatch was tasked with collecting diverse speech data across many dialects and domains, with an emphasis on gender and age variety, and with ensuring excellent audio quality. Data could be collected from any location and had to be evaluated against standard acceptance criteria before delivery.

Client: Confidential
Industry: Internet/Technology
Use case: Research & Development

Challenge

The primary challenge for Klatch was to collect a diverse and representative dataset covering multiple dialects, domains, and demographic groups. The audio had to be of excellent quality, with no post-processing and no background noise. Furthermore, the project called for conversational data, which is more difficult to collect than monolingual speech data. The finished dataset also had to meet the standard acceptance requirements for the speech recognition model.

Solution

To address the challenge, Klatch adopted a comprehensive approach that combined several methods to ensure the dataset’s quality and diversity. Klatch collected speech data in eight Indian languages, emphasizing gender and age diversity. The data spanned various dialects and domains, including weather, news, entertainment, health, agriculture, education, jobs, and finance. Both monolingual and conversational speech were collected, with a focus on conversational speech. Klatch validated the data against commonly used acceptance criteria such as word error rate (WER) and TER.
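As an illustration of the WER criterion mentioned above, the sketch below computes word error rate in the usual way: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for this example; the case study does not describe Klatch's actual tooling, text normalization, or acceptance thresholds.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper for illustration; splits on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "the weather is sunny today" with the hypothesis "the weather was sunny" yields one substitution and one deletion over five reference words, a WER of 0.4.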

A team of experienced linguists and annotators collected and transcribed the speech data for Klatch. The team ensured that the audio quality was high and that the data had no post-processing or background noise. The team also verified that the conversational speech data came from various sources, including narrowband and wideband recordings. For multilingual conversations, which combined English and local languages, the team collected data from mobile phones and landlines.

Klatch also used a quality checklist to confirm that the dataset was of high quality and met the standard requirements. The checklist covered correct audio segmentation, accurate transcription, low background noise, the absence of audio clipping or distortion, and transcription from the specified domains.
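One checklist item, the absence of audio clipping, lends itself to a simple automated check. The sketch below is only an assumption about how such a check might look, not a description of Klatch's process: it takes samples already decoded as 16-bit signed PCM integers and reports the fraction that reach full scale. The function name, the `margin` parameter, and any pass/fail threshold applied to the result are hypothetical.

```python
def clipped_fraction(samples, full_scale=32767, margin=0):
    """Fraction of samples whose magnitude reaches full scale.

    Assumes 16-bit signed PCM values (-32768..32767); `margin` widens the
    band near full scale that counts as clipped. Hypothetical helper.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    limit = full_scale - margin
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)
```

A recording could then be flagged for manual review when the returned fraction exceeds whatever tolerance the project defines.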

Results

Klatch successfully collected and transcribed 3,000 hours of speech data in eight Indian languages. The dataset was diverse and representative, covering multiple dialects, domains, and demographic groups. The dataset comprised both monolingual and conversational speech data.

The dataset was used to train high-accuracy speech recognition models. It was also used in research, such as analyzing dialectal variation and language processing. The project helped the client develop speech recognition models for applications such as customer service, voice assistants, and language learning. As a result of the project, Klatch has also established itself as a major provider of data services in the speech recognition space.