Google Chrome: How to Use the Web Speech API

May 2, 2013

This February, Google released Chrome 25. One of the most interesting new features in this version was support for the Web Speech API, a JavaScript API for speech recognition and speech-to-text conversion. Conversely, the Web Speech API specification also describes transforming text into speech.

Speech recognition supports several popular languages and is quite effective. Currently, developers have two options for implementing speech recognition on web pages.

The First Method

The easiest way to use this technology is the functionality already built into the HTML <input> tag. You only need to add the x-webkit-speech attribute:

<input x-webkit-speech>

This gives you a text box that you can dictate text into.

By default, the recognition language is the one set in your browser, but you can change it in two ways:

1) By adding the lang attribute, whose value defines the language to be recognized:

<input lang="en" x-webkit-speech>

2) By using the <meta> tag on your HTML page:

<meta http-equiv="Content-Language" content="en" />

The advantages of the x-webkit-speech attribute are:

  • It's easy to implement.
  • The browser doesn't ask the user for permission to use the microphone.

However, there are significant disadvantages:

  • Speech recognition stops after a pause.
  • When you resume speech recognition in the same box, the new value replaces the old one, so you can't append data.
  • It's supported only by the <input> tag.
  • Interim results are not displayed, so feedback is poor: the user sees the recognition result only after they stop talking and the recognition process has finished.

You can see how it works here.

The Second Method: Using the Web Speech API from JavaScript

This method is based on interacting with the Web Speech API from JavaScript (demo). To start using the API, create a new object that will be used for recognition:

var recognition = new webkitSpeechRecognition();
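Because the constructor is vendor-prefixed and not available in every browser, it may be worth checking for it before use. The following is a minimal sketch, not a definitive implementation: getSpeechRecognition is a hypothetical helper name, and the unprefixed SpeechRecognition name is taken from the specification and checked first.

```javascript
// Hypothetical helper: returns the recognition constructor found on the
// given scope object, or null when none is exposed.
function getSpeechRecognition(scope) {
	return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// In a browser, pass window; elsewhere fall back to an empty object.
var SpeechRecognitionImpl = getSpeechRecognition(
	typeof window !== 'undefined' ? window : {}
);

if (SpeechRecognitionImpl) {
	var recognition = new SpeechRecognitionImpl();
} else {
	// No speech recognition here: keep plain keyboard input.
	console.log('Speech recognition is not available.');
}
```

Passing the scope as an argument keeps the check easy to exercise outside the browser.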

Further, you can set the following speech recognition parameters:

1) Enable continuous recognition, which lets the user make long pauses and dictate large texts. By default this property is set to false (i.e. a pause in speech stops the recognition process).

recognition.continuous = true;

2) Enable fetching of interim results. This gives you access to interim recognition results, so you can display them in the text box as soon as they arrive and the user sees constantly refreshing text. Otherwise, the recognized text is available only after a pause. The default value is false.

recognition.interimResults = true;

3) Set the recognition language. By default, it corresponds to the browser language.

recognition.lang = "en";
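The three settings above can be grouped into one small function. This is only a sketch: configureRecognition and its options object are hypothetical names, not part of the API. The lang value is a language tag such as "en" or "en-US".

```javascript
// Hypothetical helper: copies the three settings onto a recognition object.
// Omitted boolean options fall back to false, the API default.
function configureRecognition(recognition, options) {
	options = options || {};
	recognition.continuous = options.continuous === true;
	recognition.interimResults = options.interimResults === true;
	if (options.lang) {
		// Leave lang untouched when no language is given, so the
		// browser default stays in effect.
		recognition.lang = options.lang;
	}
	return recognition;
}

// Usage in Chrome (commented out because it needs a browser):
// var recognition = configureRecognition(new webkitSpeechRecognition(), {
// 	continuous: true,
// 	interimResults: true,
// 	lang: 'en-US'
// });
```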

To start recognition, call:

recognition.start();

To stop recognition, call:

recognition.stop();
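In practice start() and stop() are often wired to a single microphone button. Calling start() on a session that is already running raises an error, so the sketch below tracks the state itself. createToggle is a hypothetical name; onend is an event handler of the recognition object, and this sketch overwrites any onend handler set earlier.

```javascript
// Hypothetical helper: returns a function that starts recognition when it
// is idle and stops it when it is running.
function createToggle(recognition) {
	var active = false;
	// The recognition object fires onend when a session finishes,
	// whether because of stop(), a pause, or an error.
	recognition.onend = function () {
		active = false;
	};
	return function toggle() {
		if (active) {
			recognition.stop();
		} else {
			// Set the flag right away so a double click does not
			// call start() twice.
			active = true;
			recognition.start();
		}
	};
}

// Usage in Chrome (commented out because it needs a browser):
// var toggle = createToggle(new webkitSpeechRecognition());
// document.getElementById('micButton').onclick = toggle; // 'micButton' is a made-up ID
```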

Now you need to initialize the recognition results handler:

recognition.onresult = function (event) {};

Add the result handling logic inside this function. The event object has the following fields:

  • event.results – the list of recognition results; event.results[i] is the i-th result, which is itself an array-like list of alternatives for that piece of speech.
  • event.resultIndex – the index of the first result in event.results that has changed since the previous event.
  • event.results[i][j] – the j-th alternative for the i-th result. The first element is the most probable alternative.
  • event.results[i].isFinal – a Boolean value that shows whether this result is final or interim.
  • event.results[i][j].transcript – the recognized text.
  • event.results[i][j].confidence – the estimated probability that the text was recognized correctly (a value from 0 to 1).
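To make the shape of the event object concrete, the sketch below flattens it into plain objects that can be logged or inspected. describeResults is a hypothetical helper name. Note that only one alternative per result is returned unless recognition.maxAlternatives is raised.

```javascript
// Hypothetical helper: converts the changed part of event.results
// (from event.resultIndex onward) into plain objects.
function describeResults(event) {
	var described = [];
	for (var i = event.resultIndex; i < event.results.length; ++i) {
		var result = event.results[i];
		var alternatives = [];
		// Each result is an array-like list of alternatives
		for (var j = 0; j < result.length; ++j) {
			alternatives.push({
				transcript: result[j].transcript,
				confidence: result[j].confidence
			});
		}
		described.push({
			index: i,
			isFinal: result.isFinal,
			alternatives: alternatives
		});
	}
	return described;
}

// Usage in Chrome (commented out because it needs a browser):
// recognition.onresult = function (event) {
// 	console.log(describeResults(event));
// };
```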

Now let’s write an elementary function that adds only final results to a text box (<input> as well as <textarea> can be used):

recognition.onresult = function (event) {
	for (var i = event.resultIndex; i < event.results.length; ++i) {
		if (event.results[i].isFinal) {
			insertAtCaret(textAreaID, event.results[i][0].transcript);
		}
	}
};

This function contains a loop that iterates over the recognition results, starting from event.resultIndex. If a result is final, its most probable alternative is inserted into the text box.

Here insertAtCaret() is a function that inserts text (the second argument) into the <input> or <textarea> with the ID textAreaID.
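The body of insertAtCaret() is not shown in this article. One possible implementation is sketched below, assuming the first argument is the element ID as a string; spliceText is a hypothetical helper, and the real function may differ. It relies on the standard selectionStart, selectionEnd, and setSelectionRange() members of text inputs and text areas.

```javascript
// Pure helper: returns the new text and caret position after replacing
// the selection [start, end) in value with text.
function spliceText(value, start, end, text) {
	return {
		value: value.slice(0, start) + text + value.slice(end),
		caret: start + text.length
	};
}

// Hypothetical DOM wrapper; it only runs in a browser.
function insertAtCaret(elementId, text) {
	var element = document.getElementById(elementId);
	var updated = spliceText(
		element.value,
		element.selectionStart,
		element.selectionEnd,
		text
	);
	element.value = updated.value;
	// Place the caret right after the inserted text
	element.setSelectionRange(updated.caret, updated.caret);
}
```

Keeping the string manipulation in a separate pure function makes it possible to check the caret arithmetic without a page.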

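Now let's consider a more complex example that also outputs interim results to the text box. Final results are output in the same way as before, with added code for the interim results. The snippet assumes that textArea is a jQuery-style wrapper for the text box with getCursorPosition() and setCursorPosition() helpers defined elsewhere, and that interimResult is a string variable declared outside the handler.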

// Interim text currently shown in the text box; kept between events
var interimResult = '';

recognition.onresult = function (event) {
	// Calculating and saving the cursor position where the text will be displayed
	var pos = textArea.getCursorPosition() - interimResult.length;
	// Deleting an interim result from the textArea field
	textArea.val(textArea.val().replace(interimResult, ''));
	interimResult = '';
	// Restoring the cursor position
	textArea.setCursorPosition(pos);
	for (var i = event.resultIndex; i < event.results.length; ++i) {
		if (event.results[i].isFinal) {
			insertAtCaret(textAreaID, event.results[i][0].transcript);
		} else {
			// Outputting the interim result to the text field and adding
			// an interim result marker: a zero-width space (U+200B)
			insertAtCaret(textAreaID, event.results[i][0].transcript + '\u200B');
			interimResult += event.results[i][0].transcript + '\u200B';
		}
	}
};
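A simpler alternative to the marker-based handler above is to keep the final and interim text in separate variables and redraw the field on every event. This is a different technique, sketched here under assumptions: applyResults, finalText, and interimText are hypothetical names, and the sketch appends at the end of the text rather than at the caret.

```javascript
// Hypothetical helper: given the previous state and a result event,
// returns the new state. Final text accumulates; interim text is rebuilt
// from scratch on every event.
function applyResults(state, event) {
	var finalText = state.finalText;
	var interimText = '';
	for (var i = event.resultIndex; i < event.results.length; ++i) {
		var transcript = event.results[i][0].transcript;
		if (event.results[i].isFinal) {
			finalText += transcript;
		} else {
			interimText += transcript;
		}
	}
	return { finalText: finalText, interimText: interimText };
}

// Usage in Chrome (commented out because it needs a browser and jQuery):
// var state = { finalText: '', interimText: '' };
// recognition.onresult = function (event) {
// 	state = applyResults(state, event);
// 	textArea.val(state.finalText + state.interimText);
// };
```

Because the interim text is never written into the stored final text, nothing has to be searched for and removed later.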

The advantages of the JavaScript Web Speech API are:

  • you can dictate text continuously
  • you can implement multi-session recognition and save the result
  • you can insert the recognized speech anywhere in a text
  • it works with any HTML element, and even without one (for example, you can implement voice commands)
  • you can display interim recognition results

The disadvantages are:

  • the user must allow the browser to use the microphone before a session starts
  • the session length is limited to 60 seconds
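One way to work around the session limit is to start a new session whenever the current one ends. The following is a sketch under assumptions, not a tested recipe: keepListening is a hypothetical name, the error codes come from the specification, and whether a restart prompts for microphone permission again depends on the browser and on how the page is served.

```javascript
// Hypothetical helper: starts recognition and restarts it each time the
// browser ends the session, until the returned function is called.
function keepListening(recognition) {
	var shouldRun = true;
	recognition.onerror = function (event) {
		// Do not retry when microphone access was refused
		if (event.error === 'not-allowed' || event.error === 'service-not-allowed') {
			shouldRun = false;
		}
	};
	recognition.onend = function () {
		if (shouldRun) {
			recognition.start();
		}
	};
	recognition.start();
	return function stopListening() {
		shouldRun = false;
		recognition.stop();
	};
}

// Usage in Chrome (commented out because it needs a browser):
// var stopListening = keepListening(new webkitSpeechRecognition());
// ...later, when the user is done:
// stopListening();
```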

The most important disadvantage of Chrome's whole Web Speech API implementation (including the x-webkit-speech attribute of the <input> tag) is that recognition is performed on a server and not locally in the browser. This is why recognition has time limits. Currently, other popular browsers such as Firefox, Internet Explorer, and Safari don't support the Web Speech API, but they are expected to implement speech recognition shortly.