Google releases API to convert audio into text: characteristics for developers

One of Google’s latest releases is a perfect example of how large technology companies are working toward the API economy: using application programming interfaces to attract the developer community, facilitate the creation of products and services, and extend their influence beyond the four walls of their head offices. The Mountain View company’s Cloud Speech API converts audio into text in over 80 languages.

You can either transcribe incoming audio from a phone’s microphone or an application, or control the device by voice. This is possible because the tool applies large-scale neural network models designed for natural language processing. The first obvious question is: what is a neural network, and what is it for?

There are a lot of definitions of “neural network” and some of them are extremely complex. One of the easiest to understand may be the one given by Dr. Simon Haykin in his book Neural Networks: A Comprehensive Foundation: “A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use.”

How does an Artificial Neural Network (ANN) acquire and store knowledge? Through a learning process and neural interconnections that store the information and generate an output stimulus. To some extent, its learning and processing procedure is similar to the procedure of the human brain.

For a neural network to acquire knowledge, a learning algorithm is required. The algorithm randomly and sequentially applies a series of training examples, from which the network extracts information and learns. It is, essentially, a matter of patterns.

There are three types of learning:

● Supervised learning: input values are fed in and generate output values. These results are compared with the correct values, and the network’s deviations are corrected to adjust the process.

● Reinforcement learning: input values are fed into the network, which is not told the correct outputs; instead, it receives a signal indicating how good or bad its output was and adjusts itself accordingly.

● Unsupervised learning: the neural network creates classification patterns from which it sorts out the supplied information. 
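The supervised case above can be sketched with the smallest possible network: a single perceptron trained on the logical AND function, where the output is compared with the correct value and the weights are nudged to correct any deviation. This is purely illustrative; real speech models are vastly larger.

```python
# Minimal sketch of supervised learning: a single perceptron learning AND.
# Illustrative only -- production speech models use deep networks, not this.
import random

def train_perceptron(samples, epochs=20, lr=0.1, seed=0):
    """Adjust weights whenever the output deviates from the correct value."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(2)]
    b = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, target in samples:
            out = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - out            # deviation from the correct value
            w[0] += lr * err * x[0]       # correct the network to adjust
            w[1] += lr * err * x[1]       # the process
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Training data: the four input patterns of AND with their correct outputs.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([predict(w, b, x) for x, _ in data])  # -> [0, 0, 0, 1]
```

After a handful of passes over the training data, the weights settle on a pattern that separates the correct outputs from the incorrect ones.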

The main characteristics of any neural network are:

● Auto-organization and adaptability: adaptive learning algorithm.

● Non-linear processing: it increases the artificial neural network’s capacity to extract and classify patterns in the presence of noise.

● Parallel processing: large number of nodes for greater interconnectivity. 

Cloud Speech API: characteristics

Google’s new API contains some of the most interesting functionalities when you need an application programming interface linked to natural language processing, speech recognition and obtaining results in real time. This is important since a sufficiently high processing speed is needed to be able to respond immediately.

● Automatic Speech Recognition (ASR): a deep learning neural network is used to recognize speech, power voice-search features and transcribe audio.

● Streaming recognition: as the API processes and recognizes the user’s speech, it returns results in real time with no waiting times. This allows the application to offer all speech processing functionalities.

● Buffered audio support: the API processes sound captured by the microphone of an application or mobile device, accepting it in several encodings: FLAC, AMR, PCMU (μ-law) and LINEAR16. The audio must be supplied in one of these encodings before the service can process it.

● Speech recognition in over 80 languages. This characteristic offers a major competitive advantage over other providers of similar services for external developers. 

● Integrated API.

● Inappropriate content filtering.
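To make the list above concrete, here is a hedged sketch of building a request for the service’s public REST endpoint. The endpoint path and JSON field names follow the documented `speech:recognize` method; the API key is a hypothetical placeholder, and the sketch only builds the request body rather than sending it.

```python
# Sketch of packaging buffered audio for the Cloud Speech REST endpoint.
# Field names follow the public v1 speech:recognize method; the API key
# below is a hypothetical placeholder, not a working credential.
import base64

API_URL = "https://speech.googleapis.com/v1/speech:recognize"

def build_recognize_request(audio_bytes, language_code="en-US",
                            encoding="LINEAR16", sample_rate_hertz=16000):
    """Package raw audio as the JSON body expected by speech:recognize."""
    return {
        "config": {
            "encoding": encoding,              # e.g. FLAC, AMR, MULAW, LINEAR16
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,     # one of the 80+ languages
        },
        "audio": {
            # The audio content travels base64-encoded inside the JSON body.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }

body = build_recognize_request(b"\x00\x01" * 8000)  # half a second of audio
# To actually send it (requires a real API key):
#   import json, urllib.request
#   req = urllib.request.Request(
#       API_URL + "?key=YOUR_API_KEY",
#       data=json.dumps(body).encode("utf-8"),
#       headers={"Content-Type": "application/json"})
#   print(urllib.request.urlopen(req).read())
```

The response, if the call succeeds, contains the transcribed text together with a confidence score for each alternative.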

Nuance, largest market rival

For a long time, when developers needed to incorporate speech recognition and natural language processing functionalities into their applications, their usual provider was Nuance. Its technology powers many of today’s market leaders in language interpretation, such as Apple’s voice assistant Siri and Samsung’s S-Voice. Car manufacturers such as BMW and Chrysler also rely on this type of resource for their on-board computers.

By releasing Cloud Speech API, Google aims to lure large mobile device and car manufacturers away from their current providers. In addition to processing speech and responding in real time through the cloud, it supports more languages: over 80 for the Speech API versus the roughly 40 currently supported by Nuance’s mobile SDKs (for Android, iOS and browsers).

At the moment, cloud access to Google Speech API is limited, although the company has not revealed exactly how limited. Any developer can fill out a simple form and start trying the application programming interface. In the medium term, Google is expected to charge developers for access and usage.

If you are interested in APIs, you can now try BBVA’s Sandbox manager.
