Training and testing datasets
In a custom speech project, you can upload datasets for training, qualitative inspection, and quantitative measurement. This article covers the types of training and testing data that you can use for custom speech.
Text and audio that you use to test and train a custom model should include samples from a diverse set of speakers and scenarios that you want your model to recognize. Consider these factors when you're gathering data for custom model testing and training:
- Include text and audio data to cover the kinds of verbal statements that your users make when they're interacting with your model. For example, a model that raises and lowers the temperature needs training on statements that people might make to request such changes.
- Include all speech variances that you want your model to recognize. Speech can vary because of accents, dialects, language mixing, age, gender, voice pitch, stress level, and time of day.
- Include samples from the different environments where your model will be used, for example, indoors, outdoors, and with road noise.
- Record audio with hardware devices that the production system uses. If your model must identify speech recorded on devices of varying quality, the audio data that you provide to train your model must also represent these diverse scenarios.
- Keep the dataset diverse and representative of your project requirements. You can add more data to your model later.
- Only include data that your model needs to transcribe. Including data that isn't within your custom model's recognition requirements can harm recognition quality overall.
Data types
The following table lists accepted data types, when each data type should be used, and the recommended quantity. Not every data type is required to create a model. Data requirements vary depending on whether you're creating a test or training a model.
Data type | Used for testing | Recommended for testing | Used for training | Recommended for training |
---|---|---|---|---|
Audio only | Yes (visual inspection) | 5+ audio files | Yes (preview for en-US) | 1-100 hours of audio |
Audio + human-labeled transcripts | Yes (evaluation of accuracy) | 0.5-5 hours of audio | Yes | 1-100 hours of audio |
Plain text | No | Not applicable | Yes | 1-200 MB of related text |
Pronunciation | No | Not applicable | Yes | 1 KB to 1 MB of pronunciation text |
Display format | No | Not applicable | Yes | Up to 200 lines for ITN, 1,000 lines for rewrite, 1,000 lines for profanity filter |
Training with plain text or structured text usually finishes within a few minutes.
Tip
Start with plain-text data or structured-text data. This data will improve the recognition of special terms and phrases. Training with text is much faster than training with audio (minutes versus days).
Start with small sets of sample data that match the language, acoustics, and hardware where your model will be used. Small datasets of representative data can expose problems before you invest in gathering larger datasets for training. For sample custom speech data, see this GitHub repository.
If you train a custom model with audio data, choose a Speech resource region with dedicated hardware for training audio data. For more information, see footnotes in the regions table. In regions with dedicated hardware for custom speech training, the Speech service uses up to 100 hours of your audio training data, and can process about 10 hours of data per day. After the model is trained, you can copy the model to another region as needed with the Models_CopyTo REST API.
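If you automate model management, you can invoke this API directly. The following sketch, in Python with the requests package, shows the general shape of a Models_CopyTo call. The v3.1 path and payload shown here are assumptions to verify against the Speech to text REST API reference for your API version, and the region, key, and model ID values are placeholders.

```python
# Minimal sketch: copy a trained custom model to another region with the
# Models_CopyTo REST API. The v3.1 path and payload are assumptions; check
# the Speech to text REST API reference for your API version. All key,
# region, and ID values below are placeholders.
import requests

SOURCE_REGION = "westus2"            # region where the model was trained
SOURCE_KEY = "<source-speech-key>"   # Speech resource key in the source region
MODEL_ID = "<model-id>"              # ID of the trained custom model

url = (
    f"https://{SOURCE_REGION}.api.cognitive.microsoft.com"
    f"/speechtotext/v3.1/models/{MODEL_ID}:copyto"
)
body = {
    # Key of the Speech resource in the target region that receives the copy.
    "targetSubscriptionKey": "<target-speech-key>",
}
response = requests.post(
    url,
    json=body,
    headers={"Ocp-Apim-Subscription-Key": SOURCE_KEY},
)
response.raise_for_status()
print("Copy started:", response.json().get("self"))
```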
Consider datasets by scenario
A model trained on a subset of scenarios can perform well in only those scenarios. Carefully choose data that represents the full scope of scenarios that you need your custom model to recognize. The following table shows datasets to consider for some speech recognition scenarios:
Scenario | Plain text data and structured text data | Audio + human-labeled transcripts | New words with pronunciation |
---|---|---|---|
Call center | Marketing documents, website, product reviews related to call center activity | Call center calls transcribed by humans | Terms that have ambiguous pronunciations (see the Xbox example in the pronunciation data section later in this article) |
Voice assistant | Lists of sentences that use various combinations of commands and entities | Recorded voices speaking commands into device, transcribed into text | Names (movies, songs, products) that have unique pronunciations |
Dictation | Written input, such as instant messages or emails | Similar to preceding examples | Similar to preceding examples |
Video closed captioning | TV show scripts, movies, marketing content, video summaries | Exact transcripts of videos | Similar to preceding examples |
To help determine which dataset to use to address your problems, refer to the following table:
Use case | Data type |
---|---|
Improve recognition accuracy on industry-specific vocabulary and grammar, such as medical terminology or IT jargon. | Plain text or structured text data |
Define the phonetic and displayed form of a word or term that has nonstandard pronunciation, such as product names or acronyms. | Pronunciation data or phonetic pronunciation in structured text |
Improve recognition accuracy on speaking styles, accents, or specific background noises. | Audio + human-labeled transcripts |
Audio + human-labeled transcript data for training or testing
You can use audio + human-labeled transcript data for both training and testing purposes. You must provide human-labeled transcriptions (word by word) for comparison:
- To improve the acoustic aspects like slight accents, speaking styles, and background noises.
- To measure the accuracy of Azure speech to text when it processes your audio files.
For a list of base models that support training with audio data, see Language support. Even if a base model does support training with audio data, the service might use only part of the audio. And it still uses all the transcripts.
Important
If a base model doesn't support customization with audio data, only the transcription text is used for training. If you switch to a base model that supports customization with audio data, the training time can increase from several hours to several days. The change in training time is most noticeable when you switch to a base model in a region without dedicated hardware for training. If you don't need the audio data, remove it to decrease the training time.
Audio with human-labeled transcripts offers the greatest accuracy improvements if the audio comes from the target use case. Samples must cover the full scope of speech. For example, a call center for a retail store gets the most calls about swimwear and sunglasses during the summer months, but it still needs samples from the rest of the year. Ensure that your sample includes the full scope of speech that you want to detect.
Consider these details:
- Training with audio brings the most benefits if the audio is also hard for humans to understand. In most cases, you should start training by using only related text.
- If you use one of the most heavily used languages, such as US English, it's unlikely that you would need to train with audio data. For such languages, the base models already offer good recognition results in most scenarios, so it's probably enough to train with related text.
- Custom speech can capture word context only to reduce substitution errors, not insertion or deletion errors.
- Avoid samples that include transcription errors, but do include a diversity of audio quality.
- Avoid sentences that are unrelated to your problem domain. Unrelated sentences can harm your model.
- When the transcript quality varies, you can duplicate exceptionally good sentences, such as excellent transcriptions that include key phrases, to increase their weight.
- The Speech service automatically uses the transcripts to improve the recognition of domain-specific words and phrases, as though they were added as related text.
- It can take several days for a training operation to finish. To improve the speed of training, be sure to create your Speech service subscription in a region with dedicated hardware for training.
A large training dataset is required to improve recognition. Generally, we recommend that you provide word-by-word transcriptions for 1 to 100 hours of audio (up to 20 hours for older models that do not charge for training). However, even as little as 30 minutes can help improve recognition results. Although creating human-labeled transcription can take time, improvements in recognition are only as good as the data that you provide. You should upload only high-quality transcripts.
Audio files can have silence at the beginning and end of the recording. If possible, include at least a half-second of silence before and after speech in each sample file. Although audio with low recording volume or disruptive background noise isn't helpful, it shouldn't limit or degrade your custom model. Always consider upgrading your microphones and signal processing hardware before gathering audio samples.
Important
For more information about the best practices of preparing human-labeled transcripts, see Human-labeled transcripts with audio.
Custom speech projects require audio files with these properties:
Important
These are requirements for Audio + human-labeled transcript training and testing. They differ from the ones for Audio only training and testing. If you want to use Audio only training and testing, see this section.
Property | Value |
---|---|
File format | RIFF (WAV) |
Sample rate | 8,000 Hz or 16,000 Hz |
Channels | 1 (mono) |
Maximum length per audio | Two hours (testing) / 40 s (training). Training with audio has a maximum audio length of 40 seconds per file (up to 30 seconds for Whisper customization). For audio files longer than 40 seconds, only the corresponding text from the transcription files is used for training. If all audio files are longer than 40 seconds, the training fails. |
Sample format | PCM, 16-bit |
Archive format | .zip |
Maximum zip size | 2 GB or 10,000 files |
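You can check most of these properties programmatically before packaging your dataset. Here's a minimal sketch that uses Python's standard wave module to flag files that don't meet the requirements in the preceding table; the file name is a placeholder.

```python
# Sketch: verify that a WAV file meets the audio + human-labeled transcript
# requirements above (RIFF/WAV, 8 or 16 kHz, mono, 16-bit PCM, <= 40 s for
# training). Uses only the Python standard library.
import wave

def check_wav(path: str, max_seconds: float = 40.0) -> list[str]:
    problems = []
    with wave.open(path, "rb") as wav:  # raises wave.Error if not RIFF/WAV
        if wav.getframerate() not in (8000, 16000):
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 8,000 or 16,000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit PCM")
        duration = wav.getnframes() / wav.getframerate()
        if duration > max_seconds:
            problems.append(f"{duration:.1f} s long, exceeds the {max_seconds} s training limit")
    return problems

for issue in check_wav("sample.wav"):
    print("WARNING:", issue)
```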
Plain-text data for training
You can add plain text sentences of related text to improve the recognition of domain-specific words and phrases. Related text sentences can reduce substitution errors related to misrecognition of common words and domain-specific words by showing them in context. Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
Provide domain-related sentences in a single text file. Use text data that's close to the expected spoken utterances. Utterances don't need to be complete or grammatically correct, but they must accurately reflect the spoken input that you expect the model to recognize. When possible, put each sentence or keyword on its own line. To increase the weight of a term such as a product name, add several sentences that include the term. But don't copy it too many times; excessive duplication could skew the overall recognition rate.
Note
Avoid related text sentences that include noise such as unrecognizable characters or words.
Use this table to ensure that your plain text dataset file is formatted correctly:
Property | Value |
---|---|
Text encoding | UTF-8 BOM |
Number of utterances per line | 1 |
Maximum file size | 200 MB |
You must also adhere to the following restrictions:
- Avoid repeating characters, words, or groups of words more than three times. For example, don't use "aaaa," "yeah yeah yeah yeah," or "that's it that's it that's it that's it." The Speech service might drop lines with too many repetitions.
- Don't use special characters or UTF-8 characters above U+00A1.
- URIs will be rejected.
- For some languages such as Japanese or Korean, importing large amounts of text data can take a long time or can time out. Consider dividing the dataset into multiple text files with up to 20,000 lines in each.
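Before uploading, you can screen your text file against these restrictions. The following Python sketch applies checks that mirror the documented rules. The exact validation that the service performs isn't published, so treat this as a pre-upload sanity filter only; the file name is a placeholder.

```python
# Sketch: screen a plain-text training file against the restrictions above
# (one utterance per line, no characters above U+00A1, no URIs, no more than
# three repetitions of a character, word, or phrase).
import re

def screen_line(line: str) -> list[str]:
    problems = []
    if any(ord(ch) > 0x00A1 for ch in line):
        problems.append("contains characters above U+00A1")
    if re.search(r"https?://|www\.", line):
        problems.append("contains a URI, which will be rejected")
    if re.search(r"(\S)\1{3,}", line):  # e.g. "aaaa"
        problems.append("repeats a character more than three times")
    if re.search(r"\b(\w+(?: \w+)*?)(?: \1){3,}\b", line):  # e.g. "yeah yeah yeah yeah"
        problems.append("repeats a word or phrase more than three times")
    return problems

with open("related_text.txt", encoding="utf-8-sig") as f:  # utf-8-sig strips the BOM
    for number, line in enumerate(f, start=1):
        for problem in screen_line(line.rstrip("\n")):
            print(f"line {number}: {problem}")
```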
Pronunciation data for training
Specialized or made-up words might have unique pronunciations. These words can be recognized if they can be broken down into smaller words to pronounce them. For example, to recognize "Xbox", specify the pronunciation "X box". This approach doesn't increase overall accuracy, but it can improve recognition of these keywords.
You can provide a custom pronunciation file to improve recognition. Don't use custom pronunciation files to alter the pronunciation of common words. For a list of languages that support custom pronunciation, see language support.
The spoken form is the phonetic sequence spelled out. It can be composed of letters, words, syllables, or a combination of all three. This table includes some examples:
Recognized displayed form | Spoken form |
---|---|
3CPO | three c p o |
CNTK | c n t k |
IEEE | i triple e |
You provide pronunciations in a single text file. Include the spoken utterance and a custom pronunciation for each. Each row in the file should begin with the recognized form, then a tab character, and then the space-delimited phonetic sequence.
```
3CPO	three c p o
CNTK	c n t k
IEEE	i triple e
```
Refer to the following table to ensure that your pronunciation dataset files are valid and correctly formatted.
Property | Value |
---|---|
Text encoding | UTF-8 BOM (ANSI is also supported for English) |
Number of pronunciations per line | 1 |
Maximum file size | 1 MB (1 KB for free tier) |
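Because the format is just tab-delimited text, generating a pronunciation file from a word list is straightforward. Here's a minimal Python sketch that writes the examples from this article in the expected format; the output file name is a placeholder.

```python
# Sketch: write a pronunciation dataset file in the tab-delimited format
# shown above (recognized displayed form, a tab, then the space-delimited
# spoken form). The entries are the examples from this article.
pronunciations = {
    "3CPO": "three c p o",
    "CNTK": "c n t k",
    "IEEE": "i triple e",
}

# utf-8-sig writes the UTF-8 BOM that the service expects.
with open("pronunciations.txt", "w", encoding="utf-8-sig", newline="\n") as f:
    for displayed, spoken in pronunciations.items():
        f.write(f"{displayed}\t{spoken}\n")
```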
Audio data for training or testing
Audio data is optimal for testing the accuracy of Microsoft's baseline speech to text model or a custom model. Keep in mind that audio-only data is used to visually inspect recognition output for a specific model. If you want to quantify the accuracy of a model, use audio + human-labeled transcripts.
Note
Audio only data for training is available in preview for the en-US locale. For other locales, training with audio data requires that you also provide human-labeled transcripts.
Custom speech projects require audio files with these properties:
Important
These are requirements for Audio only training and testing. They differ from the ones for Audio + human-labeled transcript training and testing. If you want to use Audio + human-labeled transcript training and testing, see this section.
Property | Value |
---|---|
File format | RIFF (WAV) |
Sample rate | 8,000 Hz or 16,000 Hz |
Channels | 1 (mono) |
Maximum length per audio | Two hours |
Sample format | PCM, 16-bit |
Archive format | .zip |
Maximum archive size | 2 GB or 10,000 files |
Note
When you're uploading training and testing data, the .zip file size can't exceed 2 GB. If you require more data for training, divide it into several .zip files and upload them separately. Later, you can choose to train from multiple datasets. However, you can test from only a single dataset.
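If your dataset exceeds these limits, you can split it programmatically. The following Python sketch packages a folder of WAV files into multiple archives, starting a new .zip whenever the current one would exceed the documented limits. Folder and file names are placeholders, and the size check uses uncompressed sizes as a conservative proxy for the final archive size.

```python
# Sketch: package a folder of WAV files into multiple .zip archives, keeping
# each archive under the 2 GB / 10,000 file limits so that large datasets can
# be uploaded as separate training datasets.
import zipfile
from pathlib import Path

MAX_BYTES = 2 * 1024**3   # 2 GB per archive
MAX_FILES = 10_000        # files per archive

wav_files = sorted(Path("audio").glob("*.wav"))
archive_index, current_bytes, current_files = 1, 0, 0
archive = zipfile.ZipFile(f"dataset_{archive_index}.zip", "w", zipfile.ZIP_DEFLATED)

for wav in wav_files:
    size = wav.stat().st_size
    if current_files + 1 > MAX_FILES or current_bytes + size > MAX_BYTES:
        archive.close()  # start a new archive when a limit would be exceeded
        archive_index += 1
        archive = zipfile.ZipFile(f"dataset_{archive_index}.zip", "w", zipfile.ZIP_DEFLATED)
        current_bytes = current_files = 0
    archive.write(wav, arcname=wav.name)
    current_bytes += size
    current_files += 1

archive.close()
```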
Use SoX to verify audio properties or convert existing audio to the appropriate formats. Here are some example SoX commands:
Activity | SoX command |
---|---|
Check the audio file format. | sox --i <filename> |
Convert the audio file to single channel, 16-bit, 16 kHz. | sox <input> -b 16 -e signed-integer -c 1 -r 16k -t wav <output>.wav |
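To convert many files at once, you can wrap the conversion command in a small script. This Python sketch invokes SoX with the same flags as the table's conversion command for each file in a folder. It assumes the sox executable is installed and on your PATH (and built with support for your input formats); the folder names are placeholders.

```python
# Sketch: batch-convert a folder of audio files to mono, 16-bit, 16 kHz WAV
# by invoking the SoX command shown in the table above.
import subprocess
from pathlib import Path

for source in Path("raw_audio").iterdir():
    if source.suffix.lower() in (".wav", ".mp3", ".flac"):
        output = Path("converted") / f"{source.stem}.wav"
        output.parent.mkdir(exist_ok=True)
        subprocess.run(
            ["sox", str(source), "-b", "16", "-e", "signed-integer",
             "-c", "1", "-r", "16k", "-t", "wav", str(output)],
            check=True,
        )
```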
Custom display text formatting data for training
Learn more about preparing display text formatting data and display text formatting with speech to text.
The display format of automatic speech recognition output is critical to downstream tasks, and one size doesn't fit all. Custom display format rules let you define your own lexical-to-display formatting rules to improve recognition quality on top of the Azure custom speech service.
With these rules, you can fully customize the display output: add rewrite rules to capitalize and reformulate certain words, add profanity words to be masked from the output, define advanced ITN rules for patterns such as numbers, dates, and email addresses, or keep certain phrases protected from all display processes.
For example:
Custom formatting | Display text |
---|---|
None | My financial number from contoso is 8BEV3 |
Capitalize "Contoso" (via #rewrite rule)Format financial number (via #itn rule) |
My financial number from Contoso is 8B-EV-3 |
For a list of supported base models and locales for training with structured text, see Language support. The Display Format file should have an .md extension. The maximum file size is 10 MB, and the text encoding must be UTF-8 BOM. For more information about customizing Display Format rules, see Display Formatting Rules Best Practice.
Property | Description | Limits |
---|---|---|
#ITN | A list of invert-text-normalization rules to define certain display patterns such as numbers, addresses, and dates. | Maximum of 200 lines |
#rewrite | A list of rewrite pairs to replace certain words for reasons such as capitalization and spelling correction. | Maximum of 1,000 lines |
#profanity | A list of unwanted words that will be masked as ****** in the display and masked output, on top of Microsoft's built-in profanity lists. | Maximum of 1,000 lines |
#test | A list of unit test cases to validate if the display rules work as expected, including the lexical format input and the expected display format output. | Maximum file size of 10 MB |
Here's an example display format file:
```
// this is a comment line
// each section must start with a '#' character
#itn
// list of ITN pattern rules, one rule for each line
\d-\d-\d
\d-\l-\l-\d
#rewrite
// list of rewrite rules, each rule has two phrases, separated by a tab character
old phrase	new phrase
#profanity
// list of profanity phrases to be tagged/removed/masked, one line one phrase
fakeprofanity
#test
// list of test cases, each test case has two sentences, input lexical and expected display output
// the two sentences are separated by a tab character
// the expected sentence is the display output of DPP+CDPP models
Mask the fakeprofanity word	Mask the ************* word
```
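If you maintain large display format files, a quick structural check can catch limit violations before upload. The following Python sketch counts the rule lines in each section and compares them against the limits in the preceding table. It's a local sanity check only; the file name is a placeholder, and whether the service treats section names case-insensitively is an assumption to verify.

```python
# Sketch: check a display format (.md) file against the per-section line
# limits from the table above (#ITN 200, #rewrite 1,000, #profanity 1,000)
# and the 10 MB overall file-size limit.
import os
from collections import Counter

LIMITS = {"#itn": 200, "#rewrite": 1000, "#profanity": 1000}
PATH = "display_format.md"

if os.path.getsize(PATH) > 10 * 1024**2:
    print("file exceeds the 10 MB limit")

counts, section = Counter(), None
with open(PATH, encoding="utf-8-sig") as f:
    for line in f:
        line = line.rstrip("\n")
        if line.startswith("//") or not line.strip():
            continue                        # skip comments and blank lines
        if line.startswith("#"):
            section = line.strip().lower()  # a new section header
        elif section:
            counts[section] += 1            # a rule line in the current section

for name, limit in LIMITS.items():
    if counts[name] > limit:
        print(f"{name} has {counts[name]} lines, exceeding the limit of {limit}")
```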