Azure Cognitive Services: Speech to Text Revolutionizing Traditional Media Delivery

Azure Cognitive Services leverages machine learning extensively, offering solutions to several challenging problems, especially in the media vertical, where such technology has not been widely adopted.

Azure Cognitive Services captures speech and converts it to text, taking the speaker's accent and language into account by drawing on a large vocabulary of similar spoken words from the models the service has been trained on. Around 8 languages are currently supported, and this is expected to grow gradually to the 110+ languages supported by Bing search.

The practical challenges for such transcription are the way the user speaks and the background noise in the recording.

REST API requests normally follow the pattern below:

https://speech.platform.bing.com/speech/recognition/<RECOGNITION_MODE>/cognitiveservices/v1?language=<LANGUAGE_TAG>&format=<OUTPUT_FORMAT>
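To make the placeholders concrete, here is a minimal Python sketch that fills in the pattern above. The function name `build_recognition_url` and its argument names are illustrative assumptions, not part of the service; the values used are the ones that appear in the sample request in this post.

```python
from urllib.parse import urlencode

# Base path taken from the URL pattern above.
BASE_URL = "https://speech.platform.bing.com/speech/recognition"


def build_recognition_url(mode: str, language: str, output_format: str) -> str:
    """Fill in <RECOGNITION_MODE>, <LANGUAGE_TAG> and <OUTPUT_FORMAT>.

    The function and parameter names are hypothetical helpers for illustration.
    """
    query = urlencode({"language": language, "format": output_format})
    return f"{BASE_URL}/{mode}/cognitiveservices/v1?{query}"


if __name__ == "__main__":
    # Same values as the sample request: interactive mode, en-US, detailed output.
    print(build_recognition_url("interactive", "en-US", "detailed"))
```

Building the query string with `urlencode` rather than string concatenation keeps the URL valid if a parameter value ever contains characters that need escaping.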

## Sample request headers

POST https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=en-US&format=detailed HTTP/1.1
Accept: application/json;text/xml
Content-Type: audio/wav; codec=audio/pcm; samplerate=16000
Ocp-Apim-Subscription-Key: YOUR_SUBSCRIPTION_KEY
Host: speech.platform.bing.com
Transfer-Encoding: chunked
Expect: 100-continue
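The `Content-Type` header above declares PCM WAV audio sampled at 16000 Hz, so it can be worth confirming that a local file matches before uploading it. The following is a rough sketch using Python's standard `wave` module; the helper name `matches_declared_format` is an assumption made for illustration, and the check covers only the encoding and sample rate named in the header.

```python
import wave


def matches_declared_format(path: str, expected_rate: int = 16000) -> bool:
    """Return True if the file is uncompressed PCM WAV at the expected sample rate.

    Hypothetical helper: it checks only what the Content-Type header declares.
    """
    try:
        with wave.open(path, "rb") as wav:
            is_pcm = wav.getcomptype() == "NONE"  # "NONE" means uncompressed PCM
            return is_pcm and wav.getframerate() == expected_rate
    except (wave.Error, EOFError):
        # Not a readable PCM WAV file at all.
        return False
```

A call such as `matches_declared_format("YOUR_AUDIO_FILE")` returns `False` for files recorded at a different sample rate, which would otherwise contradict the header sent with the request.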

## The following PowerShell script submits audio for recognition

$SpeechServiceURI = 'https://speech.platform.bing.com/speech/recognition/interactive/cognitiveservices/v1?language=en-us&format=detailed'

# Request headers: the subscription key authenticates the call.
$RecoRequestHeader = @{
    'Ocp-Apim-Subscription-Key' = 'YOUR_SUBSCRIPTION_KEY'
    'Transfer-Encoding' = 'chunked'
    'Content-type' = 'audio/wav; codec=audio/pcm; samplerate=16000'
}

# Read audio into byte array
$audioBytes = [System.IO.File]::ReadAllBytes("YOUR_AUDIO_FILE")

$RecoResponse = Invoke-RestMethod -Method POST -Uri $SpeechServiceURI -Headers $RecoRequestHeader -Body $audioBytes

# Show the result
$RecoResponse

A response like the following can be expected:

{
  "RecognitionStatus": "Success",
  "Offset": 22500000,
  "Duration": 21000000,
  "NBest": [{
    "Confidence": 0.941552162,
    "Lexical": "find a funny movie to watch",
    "ITN": "find a funny movie to watch",
    "MaskedITN": "find a funny movie to watch",
    "Display": "Find a funny movie to watch."
  }]
}
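As a sketch of how a client might consume this response, the Python snippet below picks the highest-confidence entry from `NBest` and converts the timing fields to seconds. It assumes that `Offset` and `Duration` are expressed in 100-nanosecond units; the helper name `best_hypothesis` is hypothetical, and the sample JSON is the response shown above.

```python
import json

# Assumption: Offset and Duration are reported in 100-nanosecond units.
TICKS_PER_SECOND = 10_000_000

SAMPLE_RESPONSE = """
{
  "RecognitionStatus": "Success",
  "Offset": 22500000,
  "Duration": 21000000,
  "NBest": [{
    "Confidence": 0.941552162,
    "Lexical": "find a funny movie to watch",
    "ITN": "find a funny movie to watch",
    "MaskedITN": "find a funny movie to watch",
    "Display": "Find a funny movie to watch."
  }]
}
"""


def best_hypothesis(response: dict):
    """Return the NBest entry with the highest confidence, or None.

    Hypothetical helper; returns None when recognition did not succeed
    or when no candidates were returned.
    """
    if response.get("RecognitionStatus") != "Success":
        return None
    candidates = response.get("NBest", [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["Confidence"])


if __name__ == "__main__":
    response = json.loads(SAMPLE_RESPONSE)
    best = best_hypothesis(response)
    print("Text:", best["Display"])
    print("Confidence:", best["Confidence"])
    print("Starts at (s):", response["Offset"] / TICKS_PER_SECOND)
    print("Lasts (s):", response["Duration"] / TICKS_PER_SECOND)
```

Checking `RecognitionStatus` before reading `NBest` avoids a lookup error when the service could not recognize any speech in the audio.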