Freelance Traveller - Traveller By The Byte - Vilani Speech Synthesis with SSML

Vilani Speech Synthesis with SSML

by Jeff Zeitlin

This article originally appeared in the September/October 2019 issue.

Author’s Note: In this article, “Windows PowerShell” refers to the version of PowerShell distributed with/as part of Windows 7 or later, or which is included with the Windows Management Framework updated for those versions. “PowerShell Core” refers to those versions of PowerShell other than Windows PowerShell. “PowerShell”, not otherwise specified, refers to both Windows PowerShell and PowerShell Core.

This article makes significant use of the IPA characters in Unicode. Your browser should use a font for monospaced text that includes these characters; on Windows systems, both Courier New and Andale Mono will work.

All code from this article can be downloaded from https://www.freelancetraveller.com/infocenter/software/ssml.zip

If you’ve got a Windows computer (Windows 7 or later), your computer can talk pretty easily:

Start up a Windows PowerShell session—it doesn’t matter whether you use the ISE or the console version of Windows PowerShell—and type the code in listing 1 at the prompt.


		# Listing 1: Basic Speech Commands (Windows PowerShell)
		Add-Type -AssemblyName System.Speech

		$voice = New-Object -TypeName System.Speech.Synthesis.SpeechSynthesizer

		$voice.Speak("Good day, ladies and gentlemen")

The voice quality is pretty good, although the intonation is somewhat mechanical; the result actually sounds better than the voice from Stephen Hawking’s voder, though the rhythm and intonation are similar.

Other systems (e.g., Macintosh or Linux) have their own speech synthesis (sometimes called TTS—text-to-speech) systems, which may or may not be accessible from PowerShell Core on those systems. You will need to consult the documentation for your operating system and TTS software.

But even in Windows, it’s really only this simple if the text you use in the $voice.Speak(…) statement is in the language that your Windows system uses as the default user interface language—for me, US English. If you try to use text from a language whose orthographic conventions (that is, the way sounds are written) are significantly different from your system default language, you’ll get something that will sound badly wrong, and in fact you may even end up having part or all of your text spelled out. On my system, for example, trying to get the standard voice (for US English) to speak French has pretty horrible results. Trying to use the English TTS engine with a language that doesn’t even use the Latin alphabet (e.g., Russian, Hebrew, or Chinese) throws an error.

You can, of course, install additional voices for different languages, and in some languages, for different dialects or accents (for example, Windows has English voices for US, Canada, England, Ireland, Australia, and India) or both genders. If you’re willing to pay for third-party voices, you can even get children’s voices or elderly voices. I’ve installed other Microsoft (free, built into Windows) voices on my system, so if I wanted my computer to say something in French, I could enter the code in Listing 2.


		# Listing 2: Windows PowerShell Speaks French
		$voice.SelectVoice("Microsoft Hortense Desktop")

		$voice.Speak("Bonjour mesdames et messieurs")

Naturally, you can incorporate these statements into a script, and have complex “canned” dialogues, or you can write a script that reads your input and then speaks it.

What happens, though, if you want to use a language that isn’t available (for example, obscure languages like Xhosa, or fictional languages like Klingon), either as a free Microsoft voice or as a third-party voice? Or if you want to insert a single word or short phrase in one language into the middle of a text in another? For both situations, the World Wide Web Consortium (W3C) has defined Speech Synthesis Markup Language (SSML), based on XML and allowing the user to specify exact pronunciation using the International Phonetic Alphabet (IPA).

A full treatment of SSML is beyond the scope of this article; we will only be discussing how to generate an IPA pronunciation and insert it into an SSML framework.

Most TTS systems, not just those for Windows, will support SSML. PowerShell Core is available for Windows, Macintosh, and Linux systems, so the PowerShell code in the rest of this article is applicable to any system, unless otherwise noted.

A minimal SSML string for the Windows text-to-speech (TTS) subsystem is given in Listing 3a; Listing 3b includes the XML declaration and DOCTYPE declaration that TTS systems other than Windows may require.


		# Listing 3a: Minimal SSML for Windows TTS (PowerShell $voice.SpeakSSML(…))
		<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
		xml:lang="en-US">Good Day, Ladies and Gentlemen</speak>

		# Listing 3b: SSML with preambles for non-Windows TTS (check your TTS system documentation)
		<?xml version="1.0"?>

		<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
		"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">

		<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
		xml:lang="en-US">Good Day, Ladies and Gentlemen</speak>

To tell Windows PowerShell to use SSML for speech generation, use $voice.SpeakSSML(…) instead of $voice.Speak(…) (See listing 4).


		# Listing 4: Using $voice.SpeakSSML(…) in Windows PowerShell
		$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good day, ladies and gentlemen.</speak>'

		$voice.SpeakSSML($ssml)

Doing this doesn’t get you anything beyond what we've already seen with $voice.Speak(…), however; we need to insert another SSML tag to use IPA: the <phoneme> tag.

Suppose we want our default US English voice to say “The French for ‘Hello’ is ‘Bonjour’.” If we simply pass that string to the TTS engine, it will completely mangle the French word. We use the <phoneme> tag to tell the (English) TTS engine how to pronounce the French word (see listing 5).


		# Listing 5: Using the <phoneme> tag to insert one language into another
		$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'

		$ssml += 'The French for "Hello" is '

		$ssml += '<phoneme alphabet="ipa" ph="boɴˈʒɯʁ">"Bonjour"</phoneme>.</speak>'

If we then feed this to the TTS engine, we will get what sounds like an American who knows French, but still has an American accent.

In the <phoneme> tag, we provide the ‘alphabet’ attribute to tell the TTS engine what phonetic transcription system we will be using to represent the pronunciation. All SSML processors that support the <phoneme> tag are required to support IPA; other phonetic representation systems may be supported at the TTS engine author’s discretion. The ‘ph’ attribute provides the pronunciation of the word or phrase, as represented in the phonetic transcription system named in the ‘alphabet’ attribute.
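A single stray quote or unclosed tag can make the TTS engine reject the whole string, so it can help to confirm that the SSML is well-formed XML before speaking it. The following is only a minimal sketch and is not part of ssml.ps1; it relies on PowerShell's built-in [xml] type accelerator, and it checks XML syntax only, not whether a particular TTS engine will accept the markup.

```powershell
# Sketch: check that an SSML string is well-formed XML before speaking it.
# The string is built from single-quoted pieces, as in Listing 5.
$ssml  = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += 'The French for "Hello" is '
$ssml += '<phoneme alphabet="ipa" ph="boɴˈʒɯʁ">"Bonjour"</phoneme>.</speak>'

try {
    # The [xml] cast throws if the string is not well-formed XML.
    $doc = [xml]$ssml
    $phoneme = $doc.speak.phoneme
    "Well-formed; alphabet=$($phoneme.alphabet), ph=$($phoneme.ph)"
}
catch {
    "Not well-formed: $($_.Exception.Message)"
}
```

If the check passes, the same $ssml string can be handed to $voice.SpeakSSML(…) as before.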

We now have enough information on SSML to be able to have our computer insert individual Vilani words into phrases in our computer’s primary TTS language. What we don’t have is a way of transcribing Vilani into IPA. I went through extant information on the Vilani language, came up with the IPA equivalents for the “standard” Latin-alphabet orthography for Vilani, and wrote it out into a file that will be used by code in this article. That file, VILANI.IPA, is included in SSML.ZIP. See the boxed text below for how to create a language IPA definition file.

Building a Language IPA Definition File

The structure of the language IPA definition file is fairly simple, but must be followed carefully; there is essentially no tolerance for variation. Create an ordinary text file with the name «language».ipa (e.g., vilani.ipa); the contents are as follows:

The first line of the file must always be the string


			ortho=ipa

Each subsequent line is of the form


			«text»=«ipa»

where

«text»: is the way the sound is written in the language. Any Unicode character sequence may appear here.
«ipa»: is the IPA representation of the sound.

For example, in English, the character sequence “sh” normally represents the sound that maps to the IPA symbol /ʃ/. To represent this, you would include


			sh=ʃ

in your language IPA definition file.

Sometimes, the IPA symbol for one sound will match the way a different sound is written in the language. If this happens, you will need to be careful about the order of the lines in the language IPA definition file. For example, suppose that in your language, the character “o” represents the sound notated in IPA by the symbol /a/, and the character “a” represents the sound notated in IPA by the symbol /æ/. If your language IPA definition file contains the two lines


			o=a

			a=æ

in that order, you will end up changing all occurrences of both “o” and “a” to /æ/, because the generator will first change “o” to /a/, and then it will change “a” to /æ/. To achieve the intended substitutions, you need to have these two lines in the opposite order, so that “a” gets changed to /æ/, and then “o” gets changed to /a/.
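The effect is easy to reproduce at the PowerShell prompt. This is a small sketch using the hypothetical “o”/“a” language from the example and the made-up test word “oat”; it applies the two substitutions by hand with the -replace operator rather than reading them from a definition file.

```powershell
# Sketch: why rule order matters when substitutions are applied one after another.
# Hypothetical language: written "o" sounds like IPA /a/, written "a" like IPA /æ/.
$word = 'oat'

# Wrong order: "o" becomes "a", and then every "a" (old and new) becomes "æ".
$wrong = $word -replace 'o', 'a'
$wrong = $wrong -replace 'a', 'æ'

# Right order: handle "a" first, so the "a" produced from "o" is left alone.
$right = $word -replace 'a', 'æ'
$right = $right -replace 'o', 'a'

"wrong order: $wrong"    # ææt
"right order: $right"    # aæt
```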

In rare cases, you might not be able to come up with a workable order. In that case, you may have to use a secondary substitution character. For example, in Vilani, the characters "ii" represent the sound written in IPA as [i], and the character "i" represents the sound written in IPA as [ɪ]. If you do the i-to-ɪ substitution first, then "ii", a completely different sound, gets replaced with "ɪɪ" - not what you want. If, on the other hand, you do the ii-to-i substitution first, you end up with all occurrences of both "i" and "ii" being changed to "ɪ", like the o-a substitution problem example above. The solution here is to use a temporary substitute for "ii", and then, after you've completed the i-to-ɪ substitution, replace the temporary substitute with "i":


			ii=#

			i=ɪ

			#=i

Be careful about the substitution that you use; the PowerShell -replace operator that implements the substitution treats the text being searched for as a regular expression, and you may get unexpected results if you use characters with special meanings in regular expressions. For example, if $ is used instead of # in the above example, the result will have the character “i” at the end of every IPA string; this is because $ is the regular expression symbol for “end of string”.

This technique is essentially guaranteed to work; it should be noted that, at least in the specific case of Vilani, one can in fact do the i-to-ɪ substitution first, and then convert any occurrences of ɪɪ back to i.
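Both approaches, and the regular-expression pitfall, can be tried out directly. The sketch below uses the made-up test string “iiri” (not a real Vilani word) and applies the substitutions by hand; [regex]::Escape is a standard .NET method, shown here only as one possible way to neutralize special characters if you build your own variant of the conversion code.

```powershell
# Sketch: the Vilani "ii"/"i" problem, using a made-up test string.
$text = 'iiri'

# Naive approach: converting "i" first turns "ii" into "ɪɪ".
$naive = $text -replace 'i', 'ɪ'

# Placeholder technique: park "ii" as "#", convert "i", then release the placeholder.
$viaPlaceholder = $text -replace 'ii', '#'
$viaPlaceholder = $viaPlaceholder -replace 'i', 'ɪ'
$viaPlaceholder = $viaPlaceholder -replace '#', 'i'

# Alternative: convert "i" first, then fold "ɪɪ" back into "i".
$viaFoldBack = ($text -replace 'i', 'ɪ') -replace 'ɪɪ', 'i'

# A regular-expression metacharacter misbehaves as a search pattern unless escaped.
$unescaped = 'k$' -replace '$', 'i'                     # "$" matches the end of the string
$escaped   = 'k$' -replace [regex]::Escape('$'), 'i'    # matches the literal "$" character

"naive:       $naive"             # ɪɪrɪ
"placeholder: $viaPlaceholder"    # irɪ
"fold back:   $viaFoldBack"       # irɪ
"unescaped:   $unescaped"         # k$i
"escaped:     $escaped"           # ki
```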

The PowerShell Advanced Function (also called a ‘script cmdlet’) in Listing 6 will take as parameters a language identifier and a string containing a word ostensibly in that language, and will use the rules defined in a file such as described in the sidebar to emit a string that contains the IPA for the correct pronunciation of the input word. Note that the rules file must be named «language».ipa, where «language» is the language with which you are working (Vilani, in our example).

# Listing 6: Convert Text to IPA according to language rules - This function is part of ssml.ps1 in the zip file
		
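function ConvertTo-IPA {
    [CmdletBinding()]

    Param(
        # Language name; the rules are read from «language».ipa in the current directory
        [Parameter(Mandatory=$true)]
        [string]$language,

        # Word, in that language's normal spelling, to convert to IPA
        [Parameter(Mandatory=$true)]
        [string]$word
    )

    $langfile = $language + ".ipa"
    # The first line of the file (ortho=ipa) supplies the column names
    $phonemetable = (Import-CSV -Path $langfile -Delimiter '=')
    # Apply each rule in file order; -replace treats the ortho column as a regular expression
    ForEach($phoneme in $phonemetable) {
        $word = $word -replace $phoneme.ortho,$phoneme.ipa
    }
    return $word
}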

Now, we need to insert this IPA string into a <phoneme> tag. The PowerShell Advanced Function/script cmdlet in Listing 7 will take as parameters a language identifier and a string containing a word ostensibly in that language, and will use the function from Listing 6 to generate an IPA string, and then emit the <phoneme> tag that will allow our TTS system to pronounce the word.

# Listing 7: Generate a <phoneme> tag with IPA pronunciation - This function is part of ssml.ps1
		
function New-SSMLPhonemeTag {
    [CmdletBinding()]

    Param(
       [Parameter(Mandatory=$true)]
       [string]$language,

       [Parameter(Mandatory=$true)]
       [string]$word
    )

    $phonemetag = '<phoneme alphabet="ipa" ph="'
    $phonemetag += (ConvertTo-IPA -word $word -language $language)
    $phonemetag += '">' + $word + '</phoneme>'

    return $phonemetag
}

As this returns the string to be inserted into the SSML, you can call it as part of your effort to build the SSML string (see listing 8).


		# Listing 8: Generating SSML with <phoneme> tags
		$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'

		$ssml += 'The Vilani word that means "a change in lighting that reveals new detail" is "' +
		    (New-SSMLPhonemeTag -word kurishdam -language vilani) + '".'

		$ssml += '</speak>'

NOTE: The pronunciation generated by these functions does not take into account any rules for stress or tone that may differ from those of the default TTS engine language. You may want to output the generated SSML (or, later on in this article, the PLS lexicon) to a file and hand-edit it to reflect those additional rules.
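One way to do that hand-editing round trip is sketched below. The file name is hypothetical, the temporary folder is used only to keep the example tidy, and the sample SSML is the string from Listing 5 rather than generated Vilani; the sketch assumes nothing beyond the standard Set-Content and Get-Content cmdlets.

```powershell
# Sketch: save generated SSML for hand-editing, then load the edited text back.
$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">' +
        'The French for "Hello" is <phoneme alphabet="ipa" ph="boɴˈʒɯʁ">"Bonjour"</phoneme>.</speak>'

# Hypothetical file name in the temporary folder.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample.ssml'

# UTF-8 keeps the IPA characters intact.
Set-Content -Path $path -Value $ssml -Encoding UTF8

# ... edit the file in a text editor (for example, to adjust stress marks), then:
$edited = Get-Content -Path $path -Raw -Encoding UTF8

# On Windows PowerShell, the edited text could then be spoken with:
# $voice.SpeakSSML($edited)
```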

The <phoneme> tag isn’t really the right solution for entire phrases or paragraphs in an unsupported language. The ideal solution would be to create or obtain a TTS engine for the language, but we are assuming that that’s not an option. You can, however, add vocabulary to an existing TTS engine using a pronunciation lexicon. The W3C has a specification for this, the Pronunciation Lexicon Specification (PLS). This is an XML-based file format that pairs orthography with pronunciation, much like the <phoneme> tag in an SSML document does. When a pronunciation lexicon is active, though, one may pass strings in the lexicon’s language to the TTS engine, either directly or as part of an SSML document (depending on the TTS engine’s limitations), without individual <phoneme> tags, and have it pronounce the words correctly (see listing 9).


		<!-- Listing 9: SSML to load a pronunciation lexicon, then use it -->
		
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <lexicon uri="file:///usr/traveller/vilani/lexicon.pls"/>
    Dishimkhirni lekane baasa ka amaargi in disaninu ka iirbar in sisadikud. Dirgekii ka darkaamku in midu in dinekhinumninu ka khurer khinumash.
</speak>

(The Windows .NET SpeechSynthesizer class also has a method .AddLexicon(…) to load a PLS file. There is a known bug with the “Microsoft Zira Desktop” voice; this voice ignores loaded lexicons.)

According to the W3C specification for PLS, a minimal PLS header would consist of the XML prolog, followed by the <lexicon> element defining the namespace, alphabet, and language (see listing 10).


		<!-- Listing 10: A Minimal PLS Header -->

<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
</lexicon>

Note that some TTS systems require the xml:lang attribute to match the ‘native’ language of the TTS voice (Windows is one such). In those cases, you will need separate copies of the lexicon for each language you wish to apply the lexicon to. As with the <phoneme> tag in SSML, support for IPA is mandated; support for other pronunciation representations is at the TTS engine author’s discretion.

The <lexicon> element encloses multiple <lexeme> elements, each representing a single “word” and its pronunciation. Each <lexeme> element encloses one or more <grapheme> elements, representing the way the word is written, and one or more <phoneme> elements, representing the pronunciation. For the purposes of this article, we will assume that a lexeme encloses exactly one grapheme and one phoneme (see listing 11).


		<!-- Listing 11: A Lexicon with a <lexeme> element -->
		
<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
    <lexeme>
        <grapheme>bonjour</grapheme>
        <phoneme>boɴˈʒɯʁ</phoneme>
    </lexeme>
</lexicon>

Given the lexicon from listing 11, once loaded into an English voice, we could use the word “bonjour” without having to include pronunciation data “on the fly”.
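If you assemble <lexeme> entries yourself, bear in mind that characters such as & and < must be escaped in XML. The sketch below shows one possible way to do that; New-PLSLexeme is a hypothetical helper name, not part of ssml.ps1, and it uses the .NET SecurityElement.Escape method for the escaping.

```powershell
# Sketch: build one <lexeme> entry, escaping XML special characters.
# New-PLSLexeme is a hypothetical helper, not part of ssml.ps1.
function New-PLSLexeme {
    param(
        [Parameter(Mandatory=$true)]
        [string]$Grapheme,

        [Parameter(Mandatory=$true)]
        [string]$Phoneme
    )

    # SecurityElement.Escape replaces <, >, &, " and ' with XML entities.
    $g = [System.Security.SecurityElement]::Escape($Grapheme)
    $p = [System.Security.SecurityElement]::Escape($Phoneme)
    "  <lexeme><grapheme>$g</grapheme><phoneme>$p</phoneme></lexeme>"
}

# The entry from Listing 11, generated rather than typed by hand.
$entry = New-PLSLexeme -Grapheme 'bonjour' -Phoneme 'boɴˈʒɯʁ'
$entry

# What the escaping does to a spelling that contains an ampersand.
[System.Security.SecurityElement]::Escape('R&R')    # R&amp;R
```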

The PowerShell Advanced Function/script cmdlet in listing 12 takes a text file and a language IPA definition file, and uses the ConvertTo-IPA function from Listing 6 to generate a PLS lexicon for the language including all the words in the text file. It is assumed that the text file will contain one word per line. The only required parameter is the language name; if the vocabulary text file or output file names are omitted, they will default to the language name followed by .txt and .pls respectively (i.e., if the language is vilani, the language data will be read from vilani.ipa, the vocabulary from vilani.txt, and the output lexicon will be vilani.pls).


		# Listing 12: A PLS Lexicon Generator - This function is part of ssml.ps1 in the zip file
		
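function New-PLSLexicon {
    [CmdletBinding()]

    param(
        # Language name; selects «language».ipa and the default file names below
        [Parameter(Mandatory=$true)]
        [string]$language,

        # Vocabulary file, one word per line (default: «language».txt)
        [string]$wordfile,

        # Output lexicon file (default: «language».pls)
        [string]$outfile
    )

    $lexicon = @()
    if ($wordfile -eq "") { $wordfile = $language + '.txt' }
    if ($outfile  -eq "") { $outfile  = $language + '.pls' }
    $wordlist = Get-Content $wordfile
    # PLS header: XML declaration, then the opening tag of the <lexicon> root element
    $lexicon += '<?xml version="1.0"?>'
    $lexicon += '<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">'
    # One <lexeme> per word, pairing its spelling with the generated IPA
    ForEach ($word in $wordlist) {
        $lexicon += '  <lexeme>'
        $lexicon += '    <grapheme>' + $word + '</grapheme>'
        $lexicon += '    <phoneme>' + (ConvertTo-IPA -word $word -language $language) + '</phoneme>'
        $lexicon += '  </lexeme>'
    }
    $lexicon += '</lexicon>'
    # -Encoding Unicode writes UTF-16, which preserves the IPA characters
    Set-Content -Encoding Unicode -Path $outfile -Value $lexicon
}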
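Before loading a generated lexicon into a TTS engine, it may be worth confirming that the file on disk is well-formed XML and contains the entries you expect. The following sketch does not call New-PLSLexicon; it writes a small file of the same shape (the content of Listing 11, with a hypothetical file name in the temporary folder) using the same Set-Content call, then reads it back with the .NET XmlDocument class.

```powershell
# Sketch: confirm that a PLS file of the kind Listing 12 writes is well-formed XML.
$lines = @(
    '<?xml version="1.0"?>'
    '<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">'
    '  <lexeme>'
    '    <grapheme>bonjour</grapheme>'
    '    <phoneme>boɴˈʒɯʁ</phoneme>'
    '  </lexeme>'
    '</lexicon>'
)

# Hypothetical file name in the temporary folder; same encoding as Listing 12.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'sample.pls'
Set-Content -Encoding Unicode -Path $path -Value $lines

# XmlDocument.Load throws an exception if the file is not well-formed.
$doc = New-Object System.Xml.XmlDocument
$doc.Load($path)

# Wrap in @() so that a single <lexeme> is still treated as a list.
$entries = @($doc.lexicon.lexeme)
"{0} lexeme(s); first grapheme: {1}" -f $entries.Count, $entries[0].grapheme

Remove-Item -Path $path
```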

Minimizing or Avoiding Lexicons

Using your computer’s default language is not always the best place to start. If your computer’s “native” language doesn’t use spelling and pronunciation rules that are similar to those of the language you want to synthesize, you will need to provide pronunciation information for almost every word in your synthesized language. On the other hand, if your synthesized language has rules that are similar to some other installed (or installable) language, starting from that similar language means that you will only have to provide pronunciation data for words which contain phonemes that either do not exist in or are different from the installed language. For example, suppose you hold that Old High Geonee (OHG) sounds most like Italian, and that Italian spelling rules are valid for OHG, except that OHG “ss” is pronounced like English “sh” instead of “s”, as in Italian. You would then only need to provide pronunciation data for OHG words that contain “ss”; all other words could be supplied without pronunciation data, and would be pronounced properly by an Italian voice in a TTS system.

The advantage to this is that any PLS lexicon that you create need only contain words with the differing phonemes, rather than a complete vocabulary for the language, making the lexicon significantly smaller. It should be noted that the Windows TTS system has significant problems with large lexicons.
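Picking out the words that need lexicon entries is a simple filtering job. The sketch below follows the OHG example, where only words containing “ss” differ from Italian rules; the word list is invented purely for illustration and is not real Old High Geonee vocabulary.

```powershell
# Sketch: keep only the words that need explicit pronunciation data.
# The word list is made up for illustration.
$allWords = @('tarima', 'lossa', 'venti', 'missuro')

# In the OHG example, only words containing "ss" differ from Italian rules.
$needLexicon = @($allWords | Where-Object { $_ -match 'ss' })

$needLexicon    # lossa, missuro
```

The filtered list could then be written out one word per line with Set-Content and passed to New-PLSLexicon from Listing 12 through its -wordfile parameter.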

References

SSML 1.0: https://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
SSML 1.1: https://www.w3.org/TR/speech-synthesis11/
IPA: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
PLS: https://www.w3.org/TR/pronunciation-lexicon/