Use pronunciation assessment

In this article, you learn how to evaluate pronunciation with speech to text through the Speech SDK. Pronunciation assessment evaluates speech pronunciation and gives speakers feedback on the accuracy and fluency of spoken audio.

Note

Pronunciation assessment uses a specific version of the speech-to-text model, different from the standard speech to text model, to ensure consistent and accurate pronunciation assessment.

Use pronunciation assessment in streaming mode

Pronunciation assessment supports uninterrupted streaming mode. The recording time can be unlimited through the Speech SDK. As long as you don't stop recording, the evaluation process doesn't finish and you can pause and resume evaluation conveniently.

For information about availability of pronunciation assessment, see supported languages and available regions.

As a baseline, usage of pronunciation assessment costs the same as speech to text for pay-as-you-go pricing. For more information, see Pricing.

For how to use Pronunciation Assessment in streaming mode in your own application, see sample code.

Continuous recognition

If your audio file exceeds 30 seconds, use continuous mode for processing. The sample code for continuous mode can be found on GitHub. Where the sample names a function, look for the following:

  • C#: PronunciationAssessmentContinuousWithFile
  • Java: pronunciationAssessmentContinuousWithFile
  • Python: pronunciation_assessment_continuous_from_file
  • Objective-C: pronunciationAssessFromFile
  • Swift: continuousPronunciationAssessment
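
The 30-second guidance can be checked before you choose a recognition mode. The following is a minimal sketch, assuming the input is an uncompressed PCM WAV file that Python's standard wave module can read. The helper names wav_duration_seconds and needs_continuous_mode are hypothetical and aren't part of the Speech SDK.

```python
import wave

# Threshold taken from the guidance above: longer files should use continuous mode.
CONTINUOUS_MODE_THRESHOLD_SECONDS = 30.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def needs_continuous_mode(
    path: str, threshold: float = CONTINUOUS_MODE_THRESHOLD_SECONDS
) -> bool:
    """Hypothetical helper: True when the file is longer than the threshold."""
    return wav_duration_seconds(path) > threshold
```

You might call needs_continuous_mode on each input file and route longer files to the continuous recognition sample for your language.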

Set configuration parameters

Note

Pronunciation assessment is not available with the Speech SDK for Go. You can read about the concepts in this guide. Select another programming language for your solution.

In the SpeechRecognizer, you can specify the language that you're learning or practicing. The default locale is en-US. The following sample code shows how to specify the learning language for pronunciation assessment in your own application.

C#:
var recognizer = new SpeechRecognizer(speechConfig, "en-US", audioConfig);

C++:
auto recognizer = SpeechRecognizer::FromConfig(speechConfig, "en-US", audioConfig);

Java:
SpeechRecognizer recognizer = new SpeechRecognizer(speechConfig, "en-US", audioConfig);

Python:
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, language="en-US", audio_config=audio_config)

JavaScript:
speechConfig.speechRecognitionLanguage = "en-US";

Objective-C:
SPXSpeechRecognizer* recognizer = [[SPXSpeechRecognizer alloc] initWithSpeechConfiguration:speechConfig language:@"en-US" audioConfiguration:audioConfig];

Swift:
let recognizer = try! SPXSpeechRecognizer(speechConfiguration: speechConfig, language: "en-US", audioConfiguration: audioConfig)

Tip

If you aren't sure which locale to set for a language that has multiple locales, try each locale separately. For instance, for Spanish, try es-ES and es-MX. Determine which locale scores higher for your scenario.
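
The comparison that the tip describes can be sketched as a small loop. This is only an illustration: assess is a hypothetical callable that you would implement to run one pronunciation assessment in the given locale and return a single score to compare. It isn't a Speech SDK function, and the scores in the usage lines are made up.

```python
from typing import Callable, Dict, Iterable, Tuple


def pick_best_locale(
    locales: Iterable[str],
    assess: Callable[[str], float],
) -> Tuple[str, Dict[str, float]]:
    """Run the assessment once per locale and return the highest-scoring one.

    `assess` is a hypothetical callable that performs one pronunciation
    assessment with the given locale and returns a score to compare.
    """
    scores = {locale: assess(locale) for locale in locales}
    best_locale = max(scores, key=scores.get)
    return best_locale, scores


# Made-up scores stand in for real assessment runs.
fake_scores = {"es-ES": 82.0, "es-MX": 88.5}
best, all_scores = pick_best_locale(fake_scores, fake_scores.__getitem__)
print(best, all_scores)
```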

You must create a PronunciationAssessmentConfig object. You can set EnableProsodyAssessment and EnableContentAssessmentWithTopic to enable prosody and content assessment. For more information, see configuration methods.

C#:
var pronunciationAssessmentConfig = new PronunciationAssessmentConfig(
    referenceText: "",
    gradingSystem: GradingSystem.HundredMark,
    granularity: Granularity.Phoneme,
    enableMiscue: false);
pronunciationAssessmentConfig.EnableProsodyAssessment();
pronunciationAssessmentConfig.EnableContentAssessmentWithTopic("greeting");

C++:
auto pronunciationConfig = PronunciationAssessmentConfig::Create("", PronunciationAssessmentGradingSystem::HundredMark, PronunciationAssessmentGranularity::Phoneme, false);
pronunciationConfig->EnableProsodyAssessment();
pronunciationConfig->EnableContentAssessmentWithTopic("greeting");

Java:
PronunciationAssessmentConfig pronunciationConfig = new PronunciationAssessmentConfig("",
    PronunciationAssessmentGradingSystem.HundredMark, PronunciationAssessmentGranularity.Phoneme, false);
pronunciationConfig.enableProsodyAssessment();
pronunciationConfig.enableContentAssessmentWithTopic("greeting");

Python:
pronunciation_config = speechsdk.PronunciationAssessmentConfig(
    reference_text="",
    grading_system=speechsdk.PronunciationAssessmentGradingSystem.HundredMark,
    granularity=speechsdk.PronunciationAssessmentGranularity.Phoneme,
    enable_miscue=False)
pronunciation_config.enable_prosody_assessment()
pronunciation_config.enable_content_assessment_with_topic("greeting")

JavaScript:
// JavaScript has no named arguments, so the values are passed by position.
var pronunciationAssessmentConfig = new sdk.PronunciationAssessmentConfig(
    "", // referenceText
    sdk.PronunciationAssessmentGradingSystem.HundredMark, // gradingSystem
    sdk.PronunciationAssessmentGranularity.Phoneme, // granularity
    false); // enableMiscue
// enableProsodyAssessment is a property in the JavaScript SDK, not a method.
pronunciationAssessmentConfig.enableProsodyAssessment = true;
pronunciationAssessmentConfig.enableContentAssessmentWithTopic("greeting");

Objective-C:
SPXPronunciationAssessmentConfiguration *pronunciationConfig =
[[SPXPronunciationAssessmentConfiguration alloc] init:@"" gradingSystem:SPXPronunciationAssessmentGradingSystem_HundredMark granularity:SPXPronunciationAssessmentGranularity_Phoneme enableMiscue:false];
[pronunciationConfig enableProsodyAssessment];
[pronunciationConfig enableContentAssessmentWithTopic:@"greeting"];

Swift:
let pronAssessmentConfig = try! SPXPronunciationAssessmentConfiguration("",
    gradingSystem: .hundredMark,
    granularity: .phoneme,
    enableMiscue: false)
pronAssessmentConfig.enableProsodyAssessment()
pronAssessmentConfig.enableContentAssessment(withTopic: "greeting")

This table lists some of the key configuration parameters for pronunciation assessment.

Parameter Description
ReferenceText The text that the pronunciation is evaluated against.

The ReferenceText parameter is optional. Set the reference text if you want to run a scripted assessment for the reading language learning scenario. Don't set the reference text if you want to run an unscripted assessment.

For pricing differences between scripted and unscripted assessment, see Pricing.
GradingSystem The point system for score calibration. FivePoint gives a 0-5 floating point score. HundredMark gives a 0-100 floating point score. Default: FivePoint.
Granularity Determines the lowest level of evaluation granularity. Scores are returned for that level and for every level above it. Accepted values are Phoneme (scores at the full text, word, syllable, and phoneme levels), Word (scores at the full text and word levels), and FullText (scores at the full text level only). The full reference text can be a word, a sentence, or a paragraph, depending on your input reference text. Default: Phoneme.
EnableMiscue Enables miscue calculation when the pronounced words are compared to the reference text. Enabling miscue is optional. If this value is True, the ErrorType result value can be set to Omission or Insertion based on the comparison. Values are False and True. Default: False. To enable miscue calculation, set EnableMiscue to True, as in the code snippet above the table.
ScenarioId A GUID for a customized point system.

Configuration methods

This table lists some of the optional methods you can set for the PronunciationAssessmentConfig object.

Note

Content and prosody assessments are only available in the en-US locale.

To explore the content and prosody assessments, upgrade to the SDK version 1.35.0 or later.

There is no length limit for the topic parameter.

Method Description
EnableProsodyAssessment Enables prosody assessment for your pronunciation evaluation. This feature assesses aspects like stress, intonation, speaking speed, and rhythm. This feature provides insights into the naturalness and expressiveness of your speech.

Enabling prosody assessment is optional. If this method is called, the ProsodyScore result value is returned.
EnableContentAssessmentWithTopic Enables content assessment. A content assessment is part of the unscripted assessment for the speaking language learning scenario. By providing a description, you can enhance the assessment's understanding of the specific topic being spoken about. For example, in C#, call pronunciationAssessmentConfig.EnableContentAssessmentWithTopic("greeting"). You can replace "greeting" with your desired text to describe a topic. The description has no length limit and currently only supports the en-US locale.

Get pronunciation assessment results

When speech is recognized, you can request the pronunciation assessment results as SDK objects or a JSON string.

C#:
using (var speechRecognizer = new SpeechRecognizer(
    speechConfig,
    audioConfig))
{
    // (Optional) get the session ID
    speechRecognizer.SessionStarted += (s, e) => {
        Console.WriteLine($"SESSION ID: {e.SessionId}");
    };
    pronunciationAssessmentConfig.ApplyTo(speechRecognizer);
    var speechRecognitionResult = await speechRecognizer.RecognizeOnceAsync();

    // The pronunciation assessment result as a Speech SDK object
    var pronunciationAssessmentResult =
        PronunciationAssessmentResult.FromResult(speechRecognitionResult);

    // The pronunciation assessment result as a JSON string
    var pronunciationAssessmentResultJson = speechRecognitionResult.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);
}

Word, syllable, and phoneme results aren't available by using SDK objects with the Speech SDK for C++. Word, syllable, and phoneme results are only available in the JSON string.

C++:
auto speechRecognizer = SpeechRecognizer::FromConfig(
    speechConfig,
    audioConfig);
// (Optional) get the session ID
speechRecognizer->SessionStarted.Connect([](const SessionEventArgs& e) {
    std::cout << "SESSION ID: " << e.SessionId << std::endl;
});
pronunciationAssessmentConfig->ApplyTo(speechRecognizer);
// Declare the result variable; the original snippet assigned it without a declaration.
auto speechRecognitionResult = speechRecognizer->RecognizeOnceAsync().get();

// The pronunciation assessment result as a Speech SDK object
auto pronunciationAssessmentResult =
    PronunciationAssessmentResult::FromResult(speechRecognitionResult);

// The pronunciation assessment result as a JSON string
auto pronunciationAssessmentResultJson = speechRecognitionResult->Properties.GetProperty(PropertyId::SpeechServiceResponse_JsonResult);

To learn how to specify the learning language for pronunciation assessment in your own application, see sample code.

For Android application development, the word, syllable, and phoneme results are available by using SDK objects with the Speech SDK for Java. The results are also available in the JSON string. For Java Runtime (JRE) application development, the word, syllable, and phoneme results are only available in the JSON string.

Java:
SpeechRecognizer speechRecognizer = new SpeechRecognizer(
    speechConfig,
    audioConfig);
// (Optional) get the session ID
speechRecognizer.sessionStarted.addEventListener((s, e) -> {
    System.out.println("SESSION ID: " + e.getSessionId());
});
pronunciationAssessmentConfig.applyTo(speechRecognizer);
Future<SpeechRecognitionResult> future = speechRecognizer.recognizeOnceAsync();
SpeechRecognitionResult speechRecognitionResult = future.get(30, TimeUnit.SECONDS);

// The pronunciation assessment result as a Speech SDK object
PronunciationAssessmentResult pronunciationAssessmentResult =
    PronunciationAssessmentResult.fromResult(speechRecognitionResult);

// The pronunciation assessment result as a JSON string
String pronunciationAssessmentResultJson = speechRecognitionResult.getProperties().getProperty(PropertyId.SpeechServiceResponse_JsonResult);

// Release the resources created in this snippet
speechRecognizer.close();
speechConfig.close();
audioConfig.close();
pronunciationAssessmentConfig.close();
speechRecognitionResult.close();

JavaScript:
var speechRecognizer = new SpeechSDK.SpeechRecognizer(speechConfig, audioConfig);
// (Optional) get the session ID
speechRecognizer.sessionStarted = (s, e) => {
    console.log(`SESSION ID: ${e.sessionId}`);
};
pronunciationAssessmentConfig.applyTo(speechRecognizer);

speechRecognizer.recognizeOnceAsync((speechRecognitionResult: SpeechSDK.SpeechRecognitionResult) => {
    // The pronunciation assessment result as a Speech SDK object
    var pronunciationAssessmentResult = SpeechSDK.PronunciationAssessmentResult.fromResult(speechRecognitionResult);

    // The pronunciation assessment result as a JSON string
    var pronunciationAssessmentResultJson = speechRecognitionResult.properties.getProperty(SpeechSDK.PropertyId.SpeechServiceResponse_JsonResult);
},
{});

Python:
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config)
# (Optional) get the session ID
speech_recognizer.session_started.connect(lambda evt: print(f"SESSION ID: {evt.session_id}"))
pronunciation_assessment_config.apply_to(speech_recognizer)
speech_recognition_result = speech_recognizer.recognize_once()
# The pronunciation assessment result as a Speech SDK object
pronunciation_assessment_result = speechsdk.PronunciationAssessmentResult(speech_recognition_result)

# The pronunciation assessment result as a JSON string
pronunciation_assessment_result_json = speech_recognition_result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult)

Objective-C:
SPXSpeechRecognizer* speechRecognizer =
        [[SPXSpeechRecognizer alloc] initWithSpeechConfiguration:speechConfig
                                              audioConfiguration:audioConfig];
// (Optional) get the session ID
[speechRecognizer addSessionStartedEventHandler: ^ (SPXRecognizer *sender, SPXSessionEventArgs *eventArgs) {
    NSLog(@"SESSION ID: %@", eventArgs.sessionId);
}];
[pronunciationAssessmentConfig applyToRecognizer:speechRecognizer];

SPXSpeechRecognitionResult *speechRecognitionResult = [speechRecognizer recognizeOnce];

// The pronunciation assessment result as a Speech SDK object
SPXPronunciationAssessmentResult* pronunciationAssessmentResult = [[SPXPronunciationAssessmentResult alloc] init:speechRecognitionResult];

// The pronunciation assessment result as a JSON string
NSString* pronunciationAssessmentResultJson = [speechRecognitionResult.properties getPropertyById:SPXSpeechServiceResponseJsonResult];

Swift:
let speechRecognizer = try! SPXSpeechRecognizer(speechConfiguration: speechConfig, audioConfiguration: audioConfig)
// (Optional) get the session ID
speechRecognizer.addSessionStartedEventHandler { (sender, evt) in
    print("SESSION ID: \(evt.sessionId)")
}
// pronAssessmentConfig is the configuration object created earlier
try! pronAssessmentConfig.apply(to: speechRecognizer)

let speechRecognitionResult = try? speechRecognizer.recognizeOnce()

// The pronunciation assessment result as a Speech SDK object
let pronunciationAssessmentResult = SPXPronunciationAssessmentResult(speechRecognitionResult!)

// The pronunciation assessment result as a JSON string
let pronunciationAssessmentResultJson = speechRecognitionResult!.properties?.getPropertyBy(SPXPropertyId.speechServiceResponseJsonResult)
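
Once you have the JSON string, the full text scores can be read with any JSON parser. The following is a minimal sketch that uses only the Python standard library. The helper name overall_scores is hypothetical, and the sample string is a trimmed copy of the JSON result example later in this article, so treat the field layout as an assumption to verify against your own responses.

```python
import json
from typing import Dict, Optional


def overall_scores(result_json: str) -> Optional[Dict[str, float]]:
    """Return the full text scores of the top candidate, or None if absent."""
    payload = json.loads(result_json)
    candidates = payload.get("NBest") or []
    if not candidates:
        return None
    return candidates[0].get("PronunciationAssessment")


# Trimmed from the JSON result example in this article.
SAMPLE_JSON = """
{
    "RecognitionStatus": 0,
    "DisplayText": "Hello.",
    "NBest": [
        {
            "Display": "Hello.",
            "PronunciationAssessment": {
                "AccuracyScore": 100,
                "FluencyScore": 100,
                "CompletenessScore": 100,
                "PronScore": 100
            }
        }
    ]
}
"""

scores = overall_scores(SAMPLE_JSON)
print(scores["PronScore"])  # 100
```

Returning None instead of raising keeps the caller in control when a response has no NBest entries.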

Result parameters

Depending on whether you're using scripted or unscripted assessment, you can get different pronunciation assessment results. Scripted assessment is for the reading language learning scenario. Unscripted assessment is for the speaking language learning scenario.

Note

For pricing differences between scripted and unscripted assessment, see Pricing.

Scripted assessment results

This table lists some of the key pronunciation assessment results for the scripted assessment, or reading scenario.

Parameter Description Granularity
AccuracyScore Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. Syllable, word, and full text accuracy scores are aggregated from the phoneme-level accuracy score, and refined with assessment objectives. Phoneme level, Syllable level (en-US only), Word level, Full Text level
FluencyScore Fluency of the given speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. Full Text level
CompletenessScore Completeness of the speech, calculated by the ratio of pronounced words to the input reference text. Full Text level
ProsodyScore Prosody of the given speech. Prosody indicates how natural the given speech is, including stress, intonation, speaking speed, and rhythm. Full Text level
PronScore Overall score of the pronunciation quality of the given speech. PronScore is calculated from AccuracyScore, FluencyScore, CompletenessScore, and ProsodyScore with weight, provided that ProsodyScore and CompletenessScore are available. If either of them isn't available, PronScore won't consider that score. Full Text level
ErrorType Indicates the error type of a word compared to the reference text: whether the word is omitted, inserted, badly pronounced, improperly inserted with a break, or missing a break at punctuation. It also indicates whether the pronunciation is monotonically rising, falling, or flat on the utterance. Possible values are None for no error on this word, Omission, Insertion, Mispronunciation, UnexpectedBreak, MissingBreak, and Monotone. The error type can be Mispronunciation when the pronunciation AccuracyScore for a word is below 60. Word level
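
To review words locally against the score of 60 mentioned for Mispronunciation, you can filter the Words array of a result. The following is a minimal sketch. The helper name low_accuracy_words is hypothetical, the word entries are made-up values shaped like the Words array in the JSON result example, and the service's own ErrorType value remains the authoritative label.

```python
from typing import Dict, List

# Word-level AccuracyScore below this value can be reported as Mispronunciation.
MISPRONUNCIATION_THRESHOLD = 60.0


def low_accuracy_words(
    words: List[Dict], threshold: float = MISPRONUNCIATION_THRESHOLD
) -> List[str]:
    """Return the words whose AccuracyScore is below the threshold."""
    flagged = []
    for word in words:
        score = word.get("PronunciationAssessment", {}).get("AccuracyScore")
        if score is not None and score < threshold:
            flagged.append(word["Word"])
    return flagged


# Made-up word entries for illustration only.
words = [
    {"Word": "good", "PronunciationAssessment": {"AccuracyScore": 92.0}},
    {"Word": "morning", "PronunciationAssessment": {"AccuracyScore": 48.0}},
]
print(low_accuracy_words(words))  # ['morning']
```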

Unscripted assessment results

This table lists some of the key pronunciation assessment results for the unscripted assessment, or speaking scenario.

VocabularyScore, GrammarScore, and TopicScore parameters roll up to the combined content assessment.

Note

Content and prosody assessments are only available in the en-US locale.

Response parameter Description Granularity
AccuracyScore Pronunciation accuracy of the speech. Accuracy indicates how closely the phonemes match a native speaker's pronunciation. Syllable, word, and full text accuracy scores are aggregated from the phoneme-level accuracy score, and refined with assessment objectives. Phoneme level, Syllable level (en-US only), Word level, Full Text level
FluencyScore Fluency of the given speech. Fluency indicates how closely the speech matches a native speaker's use of silent breaks between words. Full Text level
ProsodyScore Prosody of the given speech. Prosody indicates how natural the given speech is, including stress, intonation, speaking speed, and rhythm. Full Text level
VocabularyScore Proficiency in lexical usage. It evaluates the speaker's effective usage of words and their appropriateness within the given context to express ideas accurately, and the level of lexical complexity. Full Text level
GrammarScore Correctness in using grammar and variety of sentence patterns. Grammatical errors are jointly evaluated by lexical accuracy, grammatical accuracy, and diversity of sentence structures. Full Text level
TopicScore Level of understanding and engagement with the topic, which provides insights into the speaker’s ability to express their thoughts and ideas effectively and the ability to engage with the topic. Full Text level
PronScore Overall score of the pronunciation quality of the given speech. PronScore is calculated from AccuracyScore, FluencyScore, and ProsodyScore with weight, provided that ProsodyScore is available. If ProsodyScore isn't available, PronScore won't consider that score. Full Text level
ErrorType A word is badly pronounced, improperly inserted with a break, or missing a break at punctuation. It also indicates whether a pronunciation is monotonically rising, falling, or flat on the utterance. Possible values are None for no error on this word, Mispronunciation, UnexpectedBreak, MissingBreak, and Monotone. Word level

The following table describes the prosody assessment results in more detail:

Field Description
ProsodyScore Prosody score of the entire utterance.
Feedback Feedback on the word level, including Break and Intonation.
Break
ErrorTypes Error types related to breaks, including UnexpectedBreak and MissingBreak. The current version doesn't provide the break error type. You need to set thresholds on the fields UnexpectedBreak - Confidence and MissingBreak - Confidence to decide whether there's an unexpected break or a missing break before the word.
UnexpectedBreak Indicates an unexpected break before the word.
MissingBreak Indicates a missing break before the word.
Thresholds Suggested thresholds on both confidence scores are 0.75. That is, if the value of UnexpectedBreak - Confidence is larger than 0.75, the word has an unexpected break. If the value of MissingBreak - Confidence is larger than 0.75, the word has a missing break. Although 0.75 is the recommended value, it's better to adjust the thresholds based on your own scenario. If you want different detection sensitivity for the two kinds of break, you can assign different thresholds to the UnexpectedBreak - Confidence and MissingBreak - Confidence fields.
Intonation Indicates intonation in speech.
ErrorTypes Error types related to intonation, currently supporting only Monotone. If Monotone exists in the ErrorTypes field, the utterance is detected to be monotonic. Monotone is detected on the whole utterance, but the tag is assigned to all the words. All the words in the same utterance share the same monotone detection information.
Monotone Indicates monotonic speech.
Thresholds (Monotone Confidence) The fields Monotone - SyllablePitchDeltaConfidence are reserved for user-customized monotone detection. If you're unsatisfied with the provided monotone decision, adjust the thresholds on these fields to customize the detection according to your preferences.
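
The break thresholds described in the table can be applied with a few lines of code. The following is a minimal sketch, assuming the Break feedback for a word is available as a dictionary in which UnexpectedBreak and MissingBreak each hold a Confidence value. That nesting is an assumption based on the field names in the table, the helper name break_errors is hypothetical, and the confidence values in the usage line are made up.

```python
from typing import Dict, List


def break_errors(
    break_feedback: Dict,
    unexpected_threshold: float = 0.75,
    missing_threshold: float = 0.75,
) -> List[str]:
    """Return the break error types whose confidence is larger than its threshold."""
    unexpected = break_feedback.get("UnexpectedBreak", {}).get("Confidence", 0.0)
    missing = break_feedback.get("MissingBreak", {}).get("Confidence", 0.0)

    errors = []
    if unexpected > unexpected_threshold:
        errors.append("UnexpectedBreak")
    if missing > missing_threshold:
        errors.append("MissingBreak")
    return errors


# Made-up confidence values for illustration only.
feedback = {
    "UnexpectedBreak": {"Confidence": 0.91},
    "MissingBreak": {"Confidence": 0.10},
}
print(break_errors(feedback))  # ['UnexpectedBreak']
```

Separate threshold parameters make it possible to tune the sensitivity of each break type independently, as the table suggests.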

JSON result example

The scripted pronunciation assessment results for the spoken word "hello" are shown as a JSON string in the following example.

  • The phoneme alphabet is IPA.
  • The syllables are returned alongside phonemes for the same word.
  • You can use the Offset and Duration values to align syllables with their corresponding phonemes. For example, the starting offset (11700000) of the second syllable loʊ aligns with the third phoneme, l. The offset represents the time at which the recognized speech begins in the audio stream. The value is measured in 100-nanosecond units. To learn more about Offset and Duration, see response properties.
  • There are five NBestPhonemes that correspond to the number of spoken phonemes requested.
  • Within Phonemes, the most likely spoken phoneme was ə instead of the expected phoneme ɛ. The expected phoneme ɛ only received a confidence score of 47. Other potential matches received confidence scores of 52, 17, and 2.
{
    "Id": "bbb42ea51bdb46d19a1d685e635fe173",
    "RecognitionStatus": 0,
    "Offset": 7500000,
    "Duration": 13800000,
    "DisplayText": "Hello.",
    "NBest": [
        {
            "Confidence": 0.975003,
            "Lexical": "hello",
            "ITN": "hello",
            "MaskedITN": "hello",
            "Display": "Hello.",
            "PronunciationAssessment": {
                "AccuracyScore": 100,
                "FluencyScore": 100,
                "CompletenessScore": 100,
                "PronScore": 100
            },
            "Words": [
                {
                    "Word": "hello",
                    "Offset": 7500000,
                    "Duration": 13800000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 99.0,
                        "ErrorType": "None"
                    },
                    "Syllables": [
                        {
                            "Syllable": "hɛ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 91.0
                            },
                            "Offset": 7500000,
                            "Duration": 4100000
                        },
                        {
                            "Syllable": "loʊ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 100.0
                            },
                            "Offset": 11700000,
                            "Duration": 9600000
                        }
                    ],
                    "Phonemes": [
                        {
                            "Phoneme": "h",
                            "PronunciationAssessment": {
                                "AccuracyScore": 98.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "h",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "oʊ",
                                        "Score": 52.0
                                    },
                                    {
                                        "Phoneme": "ə",
                                        "Score": 35.0
                                    },
                                    {
                                        "Phoneme": "k",
                                        "Score": 23.0
                                    },
                                    {
                                        "Phoneme": "æ",
                                        "Score": 20.0
                                    }
                                ]
                            },
                            "Offset": 7500000,
                            "Duration": 3500000
                        },
                        {
                            "Phoneme": "ɛ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 47.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "ə",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "l",
                                        "Score": 52.0
                                    },
                                    {
                                        "Phoneme": "ɛ",
                                        "Score": 47.0
                                    },
                                    {
                                        "Phoneme": "h",
                                        "Score": 17.0
                                    },
                                    {
                                        "Phoneme": "æ",
                                        "Score": 2.0
                                    }
                                ]
                            },
                            "Offset": 11100000,
                            "Duration": 500000
                        },
                        {
                            "Phoneme": "l",
                            "PronunciationAssessment": {
                                "AccuracyScore": 100.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "l",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "oʊ",
                                        "Score": 46.0
                                    },
                                    {
                                        "Phoneme": "ə",
                                        "Score": 5.0
                                    },
                                    {
                                        "Phoneme": "ɛ",
                                        "Score": 3.0
                                    },
                                    {
                                        "Phoneme": "u",
                                        "Score": 1.0
                                    }
                                ]
                            },
                            "Offset": 11700000,
                            "Duration": 1100000
                        },
                        {
                            "Phoneme": "oʊ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 100.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "oʊ",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "d",
                                        "Score": 29.0
                                    },
                                    {
                                        "Phoneme": "t",
                                        "Score": 24.0
                                    },
                                    {
                                        "Phoneme": "n",
                                        "Score": 22.0
                                    },
                                    {
                                        "Phoneme": "l",
                                        "Score": 18.0
                                    }
                                ]
                            },
                            "Offset": 12900000,
                            "Duration": 8400000
                        }
                    ]
                }
            ]
        }
    ]
}
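
The alignment of syllables and phonemes by Offset and Duration can be sketched in a few lines. This example uses only the Python standard library. The helper names ticks_to_seconds and phonemes_by_syllable are hypothetical, and the HELLO dictionary copies only the timing fields of the word from the JSON example.

```python
from typing import Dict, List, Tuple

# Offset and Duration are measured in 100-nanosecond units.
TICKS_PER_SECOND = 10_000_000


def ticks_to_seconds(ticks: int) -> float:
    """Convert 100-nanosecond ticks to seconds."""
    return ticks / TICKS_PER_SECOND


def phonemes_by_syllable(word: Dict) -> List[Tuple[str, List[str]]]:
    """Assign each phoneme to the syllable whose time span contains its start offset."""
    grouped = []
    for syllable in word["Syllables"]:
        start = syllable["Offset"]
        end = start + syllable["Duration"]
        members = [
            phoneme["Phoneme"]
            for phoneme in word["Phonemes"]
            if start <= phoneme["Offset"] < end
        ]
        grouped.append((syllable["Syllable"], members))
    return grouped


# Timing fields copied from the word "hello" in the JSON example.
HELLO = {
    "Word": "hello",
    "Syllables": [
        {"Syllable": "hɛ", "Offset": 7500000, "Duration": 4100000},
        {"Syllable": "loʊ", "Offset": 11700000, "Duration": 9600000},
    ],
    "Phonemes": [
        {"Phoneme": "h", "Offset": 7500000, "Duration": 3500000},
        {"Phoneme": "ɛ", "Offset": 11100000, "Duration": 500000},
        {"Phoneme": "l", "Offset": 11700000, "Duration": 1100000},
        {"Phoneme": "oʊ", "Offset": 12900000, "Duration": 8400000},
    ],
}

print(phonemes_by_syllable(HELLO))  # [('hɛ', ['h', 'ɛ']), ('loʊ', ['l', 'oʊ'])]
print(ticks_to_seconds(11700000))   # 1.17
```

The result is a list of pairs rather than a dictionary so that a word with two identical syllables keeps both entries.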

You can get pronunciation assessment scores for:

  • Full text
  • Words
  • Syllable groups
  • Phonemes in SAPI or IPA format
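
As an illustration of how these levels appear in a parsed result, the following sketch walks one NBest candidate and yields the accuracy score at each level. The helper name iter_accuracy_scores is hypothetical, and the sample candidate is trimmed from the JSON example above, so the field layout is assumed to match your own responses.

```python
from typing import Dict, Iterator, Tuple


def iter_accuracy_scores(candidate: Dict) -> Iterator[Tuple[str, str, float]]:
    """Yield (level, label, AccuracyScore) for every level in one NBest candidate."""
    yield (
        "FullText",
        candidate.get("Display", ""),
        candidate["PronunciationAssessment"]["AccuracyScore"],
    )
    for word in candidate.get("Words", []):
        yield ("Word", word["Word"], word["PronunciationAssessment"]["AccuracyScore"])
        for syllable in word.get("Syllables", []):
            yield (
                "Syllable",
                syllable["Syllable"],
                syllable["PronunciationAssessment"]["AccuracyScore"],
            )
        for phoneme in word.get("Phonemes", []):
            yield (
                "Phoneme",
                phoneme["Phoneme"],
                phoneme["PronunciationAssessment"]["AccuracyScore"],
            )


# Trimmed from the JSON example: one syllable and two phonemes of "hello".
candidate = {
    "Display": "Hello.",
    "PronunciationAssessment": {"AccuracyScore": 100},
    "Words": [
        {
            "Word": "hello",
            "PronunciationAssessment": {"AccuracyScore": 99.0},
            "Syllables": [
                {"Syllable": "hɛ", "PronunciationAssessment": {"AccuracyScore": 91.0}},
            ],
            "Phonemes": [
                {"Phoneme": "h", "PronunciationAssessment": {"AccuracyScore": 98.0}},
                {"Phoneme": "ɛ", "PronunciationAssessment": {"AccuracyScore": 47.0}},
            ],
        }
    ],
}

for level, label, score in iter_accuracy_scores(candidate):
    print(level, label, score)
```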

Supported features per locale

The following table summarizes which features each locale supports. For more specifics, see the following sections. If the locales you require aren't listed in the following table for the supported feature, fill out this intake form for further assistance.

Phoneme alphabet IPA SAPI
Phoneme name en-US en-US, zh-CN
Syllable group en-US en-US
Spoken phoneme en-US en-US

Syllable groups

Pronunciation assessment can provide syllable-level assessment results. A word is typically pronounced syllable by syllable rather than phoneme by phoneme. Grouping in syllables is more legible and aligned with speaking habits.

Pronunciation assessment supports syllable groups only in en-US with IPA and with SAPI.

The following table compares example phonemes with the corresponding syllables.

Sample word       Phonemes         Syllables
technological     teknəlɑdʒɪkl     tek·nə·lɑ·dʒɪkl
hello             hɛloʊ            hɛ·loʊ
luck              lʌk              lʌk
photosynthesis    foʊtəsɪnθəsɪs    foʊ·tə·sɪn·θə·sɪs

To request syllable-level results along with phonemes, set the granularity configuration parameter to Phoneme.

Phoneme alphabet format

Pronunciation assessment supports phoneme name in en-US with IPA and in en-US and zh-cn with SAPI.

For locales that support phoneme name, the phoneme name is provided together with the score. Phoneme names help identify which phonemes were pronounced accurately or inaccurately. For other locales, you can only get the phoneme score.

The following table compares example SAPI phonemes with the corresponding IPA phonemes.

Sample word       SAPI phonemes                    IPA phonemes
hello             h eh l ow                        h ɛ l oʊ
luck              l ah k                           l ʌ k
photosynthesis    f ow t ax s ih n th ax s ih s    f oʊ t ə s ɪ n θ ə s ɪ s
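
To make the correspondence in the table concrete, the following Python sketch maps space-separated SAPI symbols to IPA. The SAPI_TO_IPA dictionary and the sapi_to_ipa helper are hypothetical names introduced for illustration. The dictionary covers only the symbols that appear in the table above, not the full phone set, and the service returns IPA directly when you request it, so you wouldn't normally need a conversion like this.

```python
# Hypothetical lookup built only from the three sample words in the table.
# It is not the complete SAPI phone set.
SAPI_TO_IPA = {
    "h": "h", "eh": "ɛ", "l": "l", "ow": "oʊ",
    "ah": "ʌ", "k": "k", "f": "f", "t": "t",
    "ax": "ə", "s": "s", "ih": "ɪ", "n": "n", "th": "θ",
}


def sapi_to_ipa(sapi: str) -> str:
    """Convert space-separated SAPI symbols to space-separated IPA symbols."""
    return " ".join(SAPI_TO_IPA[symbol] for symbol in sapi.split())


print(sapi_to_ipa("h eh l ow"))  # h ɛ l oʊ
print(sapi_to_ipa("l ah k"))     # l ʌ k
```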

To request IPA phonemes, set the phoneme alphabet to IPA. If you don't specify the alphabet, the phonemes are in SAPI format by default.

C#:
pronunciationAssessmentConfig.PhonemeAlphabet = "IPA";

C++:
auto pronunciationAssessmentConfig = PronunciationAssessmentConfig::CreateFromJson("{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\"}");

Java:
PronunciationAssessmentConfig pronunciationAssessmentConfig = PronunciationAssessmentConfig.fromJson("{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\"}");

Python:
pronunciation_assessment_config = speechsdk.PronunciationAssessmentConfig(json_string="{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\"}")

JavaScript:
var pronunciationAssessmentConfig = SpeechSDK.PronunciationAssessmentConfig.fromJSON("{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\"}");

Objective-C:
pronunciationAssessmentConfig.phonemeAlphabet = @"IPA";

Swift:
pronunciationAssessmentConfig?.phonemeAlphabet = "IPA"

Assess spoken phonemes

With spoken phonemes, you can get confidence scores that indicate how likely it is that each spoken phoneme matched the expected phoneme.

Pronunciation assessment supports spoken phonemes in en-US with IPA and with SAPI.

For example, to obtain the complete spoken sound for the word hello, take the candidate with the highest confidence score for each expected phoneme and concatenate them. In the following assessment result for the spoken word hello, the expected IPA phonemes are h ɛ l oʊ, but the actual spoken phonemes are h ə l oʊ. Each expected phoneme has five candidates in this example. For the second phoneme, the most likely spoken phoneme is ə (confidence score 100) rather than the expected phoneme ɛ, which received a confidence score of only 47. The remaining candidates l, h, and æ received confidence scores of 52, 17, and 2.

{
    "Id": "bbb42ea51bdb46d19a1d685e635fe173",
    "RecognitionStatus": 0,
    "Offset": 7500000,
    "Duration": 13800000,
    "DisplayText": "Hello.",
    "NBest": [
        {
            "Confidence": 0.975003,
            "Lexical": "hello",
            "ITN": "hello",
            "MaskedITN": "hello",
            "Display": "Hello.",
            "PronunciationAssessment": {
                "AccuracyScore": 100,
                "FluencyScore": 100,
                "CompletenessScore": 100,
                "PronScore": 100
            },
            "Words": [
                {
                    "Word": "hello",
                    "Offset": 7500000,
                    "Duration": 13800000,
                    "PronunciationAssessment": {
                        "AccuracyScore": 99.0,
                        "ErrorType": "None"
                    },
                    "Syllables": [
                        {
                            "Syllable": "hɛ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 91.0
                            },
                            "Offset": 7500000,
                            "Duration": 4100000
                        },
                        {
                            "Syllable": "loʊ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 100.0
                            },
                            "Offset": 11700000,
                            "Duration": 9600000
                        }
                    ],
                    "Phonemes": [
                        {
                            "Phoneme": "h",
                            "PronunciationAssessment": {
                                "AccuracyScore": 98.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "h",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "oʊ",
                                        "Score": 52.0
                                    },
                                    {
                                        "Phoneme": "ə",
                                        "Score": 35.0
                                    },
                                    {
                                        "Phoneme": "k",
                                        "Score": 23.0
                                    },
                                    {
                                        "Phoneme": "æ",
                                        "Score": 20.0
                                    }
                                ]
                            },
                            "Offset": 7500000,
                            "Duration": 3500000
                        },
                        {
                            "Phoneme": "ɛ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 47.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "ə",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "l",
                                        "Score": 52.0
                                    },
                                    {
                                        "Phoneme": "ɛ",
                                        "Score": 47.0
                                    },
                                    {
                                        "Phoneme": "h",
                                        "Score": 17.0
                                    },
                                    {
                                        "Phoneme": "æ",
                                        "Score": 2.0
                                    }
                                ]
                            },
                            "Offset": 11100000,
                            "Duration": 500000
                        },
                        {
                            "Phoneme": "l",
                            "PronunciationAssessment": {
                                "AccuracyScore": 100.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "l",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "oʊ",
                                        "Score": 46.0
                                    },
                                    {
                                        "Phoneme": "ə",
                                        "Score": 5.0
                                    },
                                    {
                                        "Phoneme": "ɛ",
                                        "Score": 3.0
                                    },
                                    {
                                        "Phoneme": "u",
                                        "Score": 1.0
                                    }
                                ]
                            },
                            "Offset": 11700000,
                            "Duration": 1100000
                        },
                        {
                            "Phoneme": "oʊ",
                            "PronunciationAssessment": {
                                "AccuracyScore": 100.0,
                                "NBestPhonemes": [
                                    {
                                        "Phoneme": "oʊ",
                                        "Score": 100.0
                                    },
                                    {
                                        "Phoneme": "d",
                                        "Score": 29.0
                                    },
                                    {
                                        "Phoneme": "t",
                                        "Score": 24.0
                                    },
                                    {
                                        "Phoneme": "n",
                                        "Score": 22.0
                                    },
                                    {
                                        "Phoneme": "l",
                                        "Score": 18.0
                                    }
                                ]
                            },
                            "Offset": 12900000,
                            "Duration": 8400000
                        }
                    ]
                }
            ]
        }
    ]
}
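
The concatenation described for this result can be sketched in Python as follows. This is a minimal illustration, not SDK code: the embedded JSON is a trimmed copy of the result above (only the top three candidates per phoneme are kept), and most_likely_spoken is a hypothetical helper name. In an application, you would parse the JSON string that the recognition result returns instead.

```python
import json

# Trimmed copy of the assessment result above: only the fields this sketch
# reads, with the top three candidates kept for each expected phoneme.
RESULT_JSON = """
{
  "NBest": [{
    "Words": [{
      "Word": "hello",
      "Phonemes": [
        {"Phoneme": "h", "PronunciationAssessment": {"AccuracyScore": 98.0,
          "NBestPhonemes": [{"Phoneme": "h", "Score": 100.0},
                            {"Phoneme": "oʊ", "Score": 52.0},
                            {"Phoneme": "ə", "Score": 35.0}]}},
        {"Phoneme": "ɛ", "PronunciationAssessment": {"AccuracyScore": 47.0,
          "NBestPhonemes": [{"Phoneme": "ə", "Score": 100.0},
                            {"Phoneme": "l", "Score": 52.0},
                            {"Phoneme": "ɛ", "Score": 47.0}]}},
        {"Phoneme": "l", "PronunciationAssessment": {"AccuracyScore": 100.0,
          "NBestPhonemes": [{"Phoneme": "l", "Score": 100.0},
                            {"Phoneme": "oʊ", "Score": 46.0},
                            {"Phoneme": "ə", "Score": 5.0}]}},
        {"Phoneme": "oʊ", "PronunciationAssessment": {"AccuracyScore": 100.0,
          "NBestPhonemes": [{"Phoneme": "oʊ", "Score": 100.0},
                            {"Phoneme": "d", "Score": 29.0},
                            {"Phoneme": "t", "Score": 24.0}]}}
      ]
    }]
  }]
}
"""


def most_likely_spoken(word):
    """Return (expected, most likely spoken, expected phoneme score) per phoneme."""
    rows = []
    for phoneme in word["Phonemes"]:
        assessment = phoneme["PronunciationAssessment"]
        # Pick the candidate with the highest confidence score.
        best = max(assessment["NBestPhonemes"], key=lambda c: c["Score"])
        rows.append((phoneme["Phoneme"], best["Phoneme"], assessment["AccuracyScore"]))
    return rows


word = json.loads(RESULT_JSON)["NBest"][0]["Words"][0]
rows = most_likely_spoken(word)
expected = " ".join(row[0] for row in rows)
spoken = " ".join(row[1] for row in rows)
mismatches = [row for row in rows if row[0] != row[1]]

print(expected)    # h ɛ l oʊ
print(spoken)      # h ə l oʊ
print(mismatches)  # [('ɛ', 'ə', 47.0)]
```

The mismatches list points at the phonemes worth giving feedback on: here, only the expected ɛ was more likely spoken as ə.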

To specify whether to return potential spoken phonemes, and how many candidates to get confidence scores for, set the NBestPhonemeCount parameter to an integer value such as 5.

C#:
pronunciationAssessmentConfig.NBestPhonemeCount = 5;

C++:
auto pronunciationAssessmentConfig = PronunciationAssessmentConfig::CreateFromJson("{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\",\"nBestPhonemeCount\":5}");

Java:
PronunciationAssessmentConfig pronunciationAssessmentConfig = PronunciationAssessmentConfig.fromJson("{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\",\"nBestPhonemeCount\":5}");

Python:
pronunciation_assessment_config = speechsdk.PronunciationAssessmentConfig(json_string="{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\",\"nBestPhonemeCount\":5}")

JavaScript:
var pronunciationAssessmentConfig = SpeechSDK.PronunciationAssessmentConfig.fromJSON("{\"referenceText\":\"good morning\",\"gradingSystem\":\"HundredMark\",\"granularity\":\"Phoneme\",\"phonemeAlphabet\":\"IPA\",\"nBestPhonemeCount\":5}");

Objective-C:
pronunciationAssessmentConfig.nbestPhonemeCount = 5;

Swift:
pronunciationAssessmentConfig?.nbestPhonemeCount = 5
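
Hand-escaped JSON strings like the ones in these snippets are easy to get wrong. As one possible alternative in Python, you can build the same configuration as a dictionary and serialize it with the standard json module. The keys and values below are copied from the snippets in this article. The variable names are illustrative, and the SDK call is shown only as a comment because it requires the Speech SDK package to be installed.

```python
import json

# Same settings as the JSON string used in the snippets in this article.
config = {
    "referenceText": "good morning",
    "gradingSystem": "HundredMark",
    "granularity": "Phoneme",
    "phonemeAlphabet": "IPA",
    "nBestPhonemeCount": 5,
}

# json.dumps handles the quoting and escaping.
config_json = json.dumps(config)
print(config_json)

# With the Speech SDK for Python installed, pass the string the same way
# as the Python snippet in this article:
# pronunciation_assessment_config = speechsdk.PronunciationAssessmentConfig(json_string=config_json)
```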

Pronunciation score calculation

Pronunciation scores are calculated by weighting accuracy, prosody, fluency, and completeness scores based on specific formulas for reading and speaking scenarios.

Sort the available accuracy, prosody, fluency, and completeness scores from low to high, and label them s0 (the lowest) through s3 (the highest). The pronunciation score is then calculated as follows:

For the reading scenario:

  • With prosody score: PronScore = 0.4 * s0 + 0.2 * s1 + 0.2 * s2 + 0.2 * s3
  • Without prosody score: PronScore = 0.6 * s0 + 0.2 * s1 + 0.2 * s2

For the speaking scenario (the completeness score isn't applicable):

  • With prosody score: PronScore = 0.6 * s0 + 0.2 * s1 + 0.2 * s2
  • Without prosody score: PronScore = 0.6 * s0 + 0.4 * s1

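Because s0 is always the lowest score, each formula gives the weakest dimension the largest weight. A low score in any one dimension therefore lowers the overall pronunciation score more than a high score in another dimension can raise it.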
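
The formulas above can be sketched as a small Python function. This is an illustration only: the service returns PronScore itself, and pron_score is a hypothetical helper name. The sketch relies on the observation that the weights depend only on how many scores are available (four, three, or two).

```python
# Weights from the formulas above, keyed by the number of available scores.
# The first weight applies to the lowest score (s0).
WEIGHTS = {
    4: (0.4, 0.2, 0.2, 0.2),  # reading scenario, with prosody
    3: (0.6, 0.2, 0.2),       # reading without prosody, or speaking with prosody
    2: (0.6, 0.4),            # speaking scenario, without prosody
}


def pron_score(accuracy, fluency, completeness=None, prosody=None):
    """Combine the available scores, weighting the lowest score most heavily.

    Pass completeness=None for the speaking scenario and prosody=None
    when the prosody score isn't available.
    """
    scores = [accuracy, fluency]
    if completeness is not None:
        scores.append(completeness)
    if prosody is not None:
        scores.append(prosody)
    scores.sort()  # s0 (lowest) first
    return sum(w * s for w, s in zip(WEIGHTS[len(scores)], scores))


# Reading scenario with prosody: 0.4*70 + 0.2*80 + 0.2*90 + 0.2*100
reading = pron_score(accuracy=80, fluency=90, completeness=100, prosody=70)
print(round(reading, 2))  # 82.0

# Speaking scenario without prosody: 0.6*80 + 0.4*90
speaking = pron_score(accuracy=80, fluency=90)
print(round(speaking, 2))  # 84.0
```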