Vilani Speech Synthesis with SSML
This article originally appeared in the September/October 2019 issue.
Author’s Note: In this article, “Windows PowerShell” refers to the version of PowerShell distributed with/as part of Windows 7 or later, or which is included with the Windows Management Framework updated for those versions. “PowerShell Core” refers to those versions of PowerShell other than Windows PowerShell. “PowerShell”, not otherwise specified, refers to both Windows PowerShell and PowerShell Core.
This article makes significant use of the IPA characters in Unicode. Your browser should use a font for monospaced text that includes these characters; on Windows systems, both Courier New and Andale Mono will work.
All code from this article can be downloaded from https://www.freelancetraveller.com/infocenter/software/ssml.zip
If you’ve got a Windows computer (Windows 7 or later), your computer can talk pretty easily:
Start up a Windows PowerShell session—it doesn’t matter whether you use the ISE or the console version of Windows PowerShell—and type the code in listing 1 at the prompt.
# Listing 1: Basic Speech Commands (Windows PowerShell)
Add-Type -AssemblyName System.Speech
$voice = New-Object -TypeName System.Speech.Synthesis.SpeechSynthesizer
$voice.Speak("Good day, ladies and gentlemen")
The voice quality is pretty good, although the intonation is somewhat mechanical; the result actually sounds better than the voice from Stephen Hawking’s voder, though the rhythm and intonation are similar.
Other systems (e.g., Macintosh or Linux) have their own speech synthesis (sometimes called TTS—text-to-speech) systems, which may or may not be accessible from PowerShell Core on those systems. You will need to consult the documentation for your operating system and TTS software.
But even in Windows, it’s really only this simple if the text you use in the $voice.Speak(…) statement is in the language that your Windows system uses as the default user interface language (for me, US English). If you try to use text from a language whose orthographic conventions (that is, the way sounds are written) are significantly different from your system default language, you’ll get something that will sound badly wrong, and in fact you may even end up having part or all of your text spelled out. On my system, for example, trying to get the standard voice (for US English) to speak French has pretty horrible results. Trying to use the English TTS engine with a language that doesn’t even use the Latin alphabet (e.g., Russian, Hebrew, or Chinese) throws an error.
You can, of course, install additional voices for different languages and, in some languages, for different dialects or accents (for example, Windows has English voices for US, Canada, England, Ireland, Australia, and India) or for both genders. If you’re willing to pay for third-party voices, you can even get children’s voices or elderly voices. I’ve installed other Microsoft voices (free, built into Windows) on my system, so if I wanted my computer to say something in French, I could enter the code in Listing 2.
# Listing 2: Windows PowerShell Speaks French
$voice.SelectVoice("Microsoft Hortense Desktop")
$voice.Speak("Bonjour mesdames et messieurs")
Naturally, you can incorporate these statements into a script, and have complex “canned” dialogues, or you can write a script that reads your input and then speaks it.
What happens, though, if you want to use a language that isn’t available (for example, obscure languages like Xhosa, or fictional languages like Klingon), either as a free Microsoft voice or as a third-party voice? Or if you want to insert a single word or short phrase in one language into the middle of a text in another? For both situations, the World Wide Web Consortium (W3C) has defined Speech Synthesis Markup Language (SSML), based on XML and allowing the user to specify exact pronunciation using the International Phonetic Alphabet (IPA).
A full treatment of SSML is beyond the scope of this article; we will only be discussing how to generate an IPA pronunciation and insert it into an SSML framework.
Most TTS systems, not just those for Windows, will support SSML. PowerShell Core is available for Windows, Macintosh, and Linux systems, so the PowerShell code in the rest of this article is applicable to any system, unless otherwise noted.
A minimal SSML string for the Windows text-to-speech (TTS) subsystem is given in Listing 3a; Listing 3b includes the XML declaration and DOCTYPE declaration that TTS systems other than Windows may require.
# Listing 3a: Minimal SSML for Windows TTS (PowerShell $voice.SpeakSSML(…))
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good Day, Ladies and Gentlemen</speak>
# Listing 3b: SSML with preambles for non-Windows TTS (check your TTS system documentation)
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
    "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good Day, Ladies and Gentlemen</speak>
To tell Windows PowerShell to use SSML for speech generation, use $voice.SpeakSSML(…) instead of $voice.Speak(…) (see Listing 4).
# Listing 4: Using $voice.SpeakSSML(…) in Windows PowerShell
$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Good day, ladies and gentlemen.</speak>'
$voice.SpeakSSML($ssml)
Doing this doesn’t get you anything beyond what we’ve already seen with $voice.Speak(…), however; we need to insert another SSML tag to use IPA: the <phoneme> tag.
Suppose we want our default US English voice to say “The French for ‘Hello’ is ‘Bonjour’.” If we simply pass that string to the TTS engine, it will completely mangle the French word. We use the <phoneme> tag to tell the (English) TTS engine how to pronounce the French word (see Listing 5).
# Listing 5: Using the <phoneme> tag to insert one language into another
$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += 'The French for "Hello" is '
$ssml += '<phoneme alphabet="ipa" ph="boɴˈʒɯʁ">"Bonjour"</phoneme>.</speak>'
If we then feed this to the TTS engine, we will get what sounds like an American who knows French, but still has an American accent.
In the <phoneme> tag, we provide the ‘alphabet’ attribute to tell the TTS engine what phonetic transcription system we will be using to represent the pronunciation. All SSML processors that support the <phoneme> tag are required to support IPA; other phonetic representation systems may be supported at the TTS engine author’s discretion. The ‘ph’ attribute provides the pronunciation of the word or phrase, as represented in the phonetic transcription system named in the ‘alphabet’ attribute.
We now have enough information on SSML to be able to have our computer insert individual Vilani words into phrases in our computer’s primary TTS language. What we don’t have is a way of transcribing Vilani into IPA. I went through extant information on the Vilani language, came up with the IPA equivalents for the “standard” Latin-alphabet orthography for Vilani, and wrote it out into a file that will be used by code in this article. That file, VILANI.IPA, is included in SSML.ZIP. See the boxed text below for how to create a language IPA definition file.
The PowerShell Advanced Function (also called a ‘script cmdlet’) in Listing 6 will take as parameters a language identifier and a string containing a word ostensibly in that language, and will use the rules defined in a file such as described in the sidebar to emit a string that contains the IPA for the correct pronunciation of the input word. Note that the rules file must be named «language».ipa, where «language» is the language with which you are working (Vilani, in our example).
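# Listing 6: Convert Text to IPA according to language rules - This function is part of ssml.ps1 in the zip file
function ConvertTo-IPA {
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory=$true)]
        [string]$language,
        [Parameter(Mandatory=$true)]
        [string]$word
    )
    # The rules file is named after the language, e.g. vilani.ipa
    $langfile = $language + ".ipa"
    # Import-CSV takes the column names (ortho, ipa) from the first line of the file
    $phonemetable = (Import-CSV -Path $langfile -Delimiter '=')
    # Apply the rules in file order; -replace treats each ortho value as a regular expression
    ForEach($phoneme in $phonemetable) {
        $word = $word -replace $phoneme.ortho,$phoneme.ipa
    }
    return $word
}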
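Two properties of Listing 6 are easier to see in action: Import-CSV takes its column names (ortho and ipa) from the first row of the rules file, and the rules are applied in file order with -replace, which treats each ortho value as a case-insensitive regular expression. The sketch below repeats the loop from Listing 6 against a throwaway rules file. The three rules, the file name demo.ipa, and the sample word are invented for illustration; they are not taken from VILANI.IPA.

```powershell
# Hypothetical rules for an invented language; NOT the contents of VILANI.IPA.
$rules = @(
    'ortho=ipa',   # header row: Import-Csv uses it for the property names
    'kh=x',        # digraphs go first, so no later single-letter rule can split them
    'sh=ʃ',
    'aa=aː'
)
$rulesFile = Join-Path ([System.IO.Path]::GetTempPath()) 'demo.ipa'
# UTF-16 with a byte-order mark, so the IPA characters survive the round trip
Set-Content -Path $rulesFile -Value $rules -Encoding Unicode

$word = 'Khaashi'
foreach ($rule in (Import-Csv -Path $rulesFile -Delimiter '=' -Encoding Unicode)) {
    # -replace is case-insensitive, so the "kh" rule also matches "Kh"
    $word = $word -replace $rule.ortho, $rule.ipa
}
$word   # xaːʃi
Remove-Item -Path $rulesFile
```

Because each rule runs over the output of the previous one, a later rule can also rewrite IPA characters produced earlier, so the order of lines in a rules file matters.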
Now, we need to insert this IPA string into a <phoneme> tag. The PowerShell Advanced Function/script cmdlet in Listing 7 will take as parameters a language identifier and a string containing a word ostensibly in that language, and will use the function from Listing 6 to generate an IPA string, and then emit the <phoneme> tag that will allow our TTS system to pronounce the word.
# Listing 7: Generate a tag with IPA pronunciation - This function is part of ssml.ps1
function New-SSMLPhonemeTag {
    [CmdletBinding()]
    Param(
        [Parameter(Mandatory=$true)]
        [string]$language,
        [Parameter(Mandatory=$true)]
        [string]$word
    )
    $phonemetag = '<phoneme alphabet="ipa" ph="'
    $phonemetag += (ConvertTo-IPA -word $word -language $language)
    $phonemetag += '">' + $word + '</phoneme>'
    return $phonemetag
}
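Listing 7 places the word and the generated IPA into the tag exactly as given, so a word containing &, <, or a double quote would produce malformed SSML. The following is a minimal sketch of one way to guard against that, assuming the .NET method [System.Security.SecurityElement]::Escape() is acceptable for the job. The function name ConvertTo-XmlSafeText is made up here and is not part of ssml.ps1, and the word and IPA values are placeholders.

```powershell
function ConvertTo-XmlSafeText {    # hypothetical helper, not part of ssml.ps1
    param(
        [Parameter(Mandatory=$true)]
        [string]$text
    )
    # Escape() replaces & < > " ' with the corresponding XML entities
    return [System.Security.SecurityElement]::Escape($text)
}

$word = 'Sharik & Ka'   # placeholder text, not a real Vilani phrase
$ipa  = 'ʃarik'         # stand-in for the output of ConvertTo-IPA
$tag  = '<phoneme alphabet="ipa" ph="' + (ConvertTo-XmlSafeText -text $ipa) + '">'
$tag += (ConvertTo-XmlSafeText -text $word) + '</phoneme>'
$tag   # <phoneme alphabet="ipa" ph="ʃarik">Sharik &amp; Ka</phoneme>
```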
As this returns the string to be inserted into the SSML, you can call it while building the SSML string (see Listing 8).
# Listing 8: Generating SSML with <phoneme> tags
$ssml = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
$ssml += 'The Vilani word that means "a change in lighting that reveals new detail" is "' + (New-SSMLPhonemeTag -word kurishdam -language vilani) + '".'
$ssml += '</speak>'
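When SSML is assembled by string concatenation like this, it is easy to leave a tag unclosed. A quick check before handing the string to the TTS engine is to see whether PowerShell can parse it as XML. This is only a sketch: the helper name Test-WellFormedXml is hypothetical, and the check confirms that the text is well-formed XML, not that it is valid SSML.

```powershell
function Test-WellFormedXml {    # hypothetical helper, not part of ssml.ps1
    param(
        [Parameter(Mandatory=$true)]
        [string]$xmlText
    )
    try {
        $null = [xml]$xmlText    # the cast throws if the text is not well-formed XML
        return $true
    } catch {
        return $false
    }
}

$good = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Hello</speak>'
$bad  = '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">Hello'
Test-WellFormedXml -xmlText $good   # True
Test-WellFormedXml -xmlText $bad    # False (the closing </speak> is missing)
```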
NOTE: The pronunciation generated by these functions does not take into account any rules for stress or tone that may differ from those of the default TTS engine language. You may want to output the generated SSML (or, later on in this article, the PLS lexicon) to a file and hand-edit it to reflect those additional rules.
The <phoneme> tag isn’t really the right solution for entire phrases or paragraphs in an unsupported language, however. The ideal solution would be to create or obtain a TTS engine for the language; we are assuming that that’s not an option. You can, however, add vocabulary to an existing TTS engine using a pronunciation lexicon. The W3C has a specification for this, the Pronunciation Lexicon Specification (PLS). This is an XML-based file format that pairs orthography with pronunciation, much like the <phoneme> tag in an SSML document does. When a pronunciation lexicon is active, one may pass strings in the lexicon’s language to the TTS engine, either directly or as part of an SSML document (depending on the TTS engine’s limitations), without individual <phoneme> tags, and have it pronounce the words correctly (see Listing 9).
<!-- Listing 9: SSML to load a pronunciation lexicon, then use it -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <lexicon uri="file:///usr/traveller/vilani/lexicon.pls"/>
    Dishimkhirni lekane baasa ka amaargi in disaninu ka iirbar in sisadikud. Dirgekii ka darkaamku in midu in dinekhinumninu ka khurer khinumash.
</speak>
(The Windows .NET SpeechSynthesizer class also has a method .AddLexicon(…) to load a PLS file. There is a known bug with the “Microsoft Zira Desktop” voice; this voice ignores loaded lexicons.)
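As a rough sketch of the .AddLexicon(…) route: the method takes a URI and a media type, so a file path has to be turned into an absolute URI first. The helper name Add-PLSLexiconFile is hypothetical, the media type application/pls+xml is the one registered for PLS documents, and the synthesizer passed in is assumed to be the $voice object from Listing 1.

```powershell
function Add-PLSLexiconFile {    # hypothetical helper name
    param(
        [Parameter(Mandatory=$true)]
        $synthesizer,            # e.g. $voice from Listing 1
        [Parameter(Mandatory=$true)]
        [string]$path
    )
    # AddLexicon() expects a URI, so resolve the (possibly relative) path first
    $fullPath = (Resolve-Path -LiteralPath $path).ProviderPath
    $uri = New-Object -TypeName System.Uri -ArgumentList $fullPath
    [void]$synthesizer.AddLexicon($uri, 'application/pls+xml')
    return $uri
}

# Usage in Windows PowerShell, after running Listing 1 and generating vilani.pls:
#   Add-PLSLexiconFile -synthesizer $voice -path .\vilani.pls
#   $voice.Speak('kurishdam')
```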
According to the W3C specification for PLS, a minimal PLS header would consist of the XML prolog, followed by the <lexicon> element defining the namespace, alphabet, and language (see Listing 10).
<?xml version="1.0"?>
<!-- Listing 10: A Minimal PLS Header (the XML declaration must be the first line of the file) -->
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
</lexicon>
Note that some TTS systems require the xml:lang attribute to match the ‘native’ language of the TTS voice (Windows is one such). In those cases, you will need separate copies of the lexicon for each language you wish to apply the lexicon to. As with the <phoneme> tag in SSML, support for IPA is mandated; support for other pronunciation representations is at the TTS engine author’s discretion.
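Producing the separate copies can be automated, since only the xml:lang attribute differs between them. The sketch below works on a lexicon held in a string; the language tags en-GB and fr-FR are arbitrary examples, and it assumes the source lexicon was written with xml:lang="en-US" exactly as in Listing 10.

```powershell
# A lexicon header as in Listing 10, held in a here-string for the demonstration
$pls = @'
<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
</lexicon>
'@

$copies = @{}
foreach ($lang in 'en-GB', 'fr-FR') {    # example tags; use those of your installed voices
    # Only the xml:lang attribute changes; any lexemes would stay identical
    $copies[$lang] = $pls -replace 'xml:lang="en-US"', ('xml:lang="' + $lang + '"')
}
$copies['fr-FR']
```

Each entry in $copies could then be written to its own file with Set-Content, in the same way Listing 12 writes its output.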
The <lexicon> element encloses multiple <lexeme> elements, each representing a single “word” and its pronunciation. Each <lexeme> element encloses one or more <grapheme> elements, representing the way the word is written, and one or more <phoneme> elements, representing the pronunciation. For the purposes of this article, we will assume that a lexeme encloses exactly one grapheme and one phoneme (see Listing 11).
<?xml version="1.0"?>
<!-- Listing 11: A Lexicon with a <lexeme> element -->
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
    <lexeme>
        <grapheme>bonjour</grapheme>
        <phoneme>boɴˈʒɯʁ</phoneme>
    </lexeme>
</lexicon>
Given the lexicon from listing 11, once loaded into an English voice, we could use the word “bonjour” without having to include pronunciation data “on the fly”.
The PowerShell Advanced Function/script cmdlet in Listing 12 takes a text file and a language IPA definition file, and uses the ConvertTo-IPA function from Listing 6 to generate a PLS lexicon for the language including all the words in the text file. It is assumed that the text file will contain one word per line. The only required parameter is the language name; if the vocabulary text file or output file names are omitted, they will default to the language name followed by .txt and .pls respectively (i.e., if the language is vilani, the language data will be read from vilani.ipa, the vocabulary from vilani.txt, and the output lexicon will be vilani.pls).
# Listing 12: A PLS Lexicon Generator - This function is part of ssml.ps1 in the zip file
function New-PLSLexicon {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true)]
        [string]$language,
        [string]$wordfile,
        [string]$outfile
    )
    $lexicon = @()
    # Default the vocabulary and output file names to the language name
    if ($wordfile -eq "") { $wordfile = $language + '.txt' }
    if ($outfile -eq "") { $outfile = $language + '.pls' }
    # One word per line
    $wordlist = Get-Content $wordfile
    $lexicon += '<?xml version="1.0"?>'
    $lexicon += '<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">'
    # Emit one <lexeme> per word, pairing the spelling with its generated IPA
    ForEach ($word in $wordlist) {
        $lexicon += '  <lexeme>'
        $lexicon += '    <grapheme>' + $word + '</grapheme>'
        $lexicon += '    <phoneme>' + (ConvertTo-IPA -word $word -language $language) + '</phoneme>'
        $lexicon += '  </lexeme>'
    }
    $lexicon += '</lexicon>'
    # -Encoding Unicode writes UTF-16, which preserves the IPA characters
    Set-Content -Encoding Unicode -Path $outfile -Value $lexicon
}
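Before hand-editing a generated lexicon or loading it into a voice, it can help to read it back and list what it contains. The sketch below parses a small lexicon held in a string and prints each spelling with its pronunciation. The first entry is the one from Listing 11; the second (merci) is added here only so that there is more than one entry to loop over.

```powershell
# Stand-in for a generated lexicon. To inspect a real file instead (PowerShell 3.0 or later):
#   [xml]$doc = Get-Content -Raw -Path vilani.pls
[xml]$doc = @'
<?xml version="1.0"?>
<lexicon version="1.0" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon" alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>bonjour</grapheme>
    <phoneme>boɴˈʒɯʁ</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>merci</grapheme>
    <phoneme>mɛʁsi</phoneme>
  </lexeme>
</lexicon>
'@

# Collect one "spelling -> pronunciation" line per <lexeme>
$pairs = foreach ($lexeme in $doc.lexicon.lexeme) {
    '{0} -> {1}' -f $lexeme.grapheme, $lexeme.phoneme
}
$pairs
```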
References
SSML 1.0: https://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
SSML 1.1: https://www.w3.org/TR/speech-synthesis11/
IPA: https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
PLS: https://www.w3.org/TR/pronunciation-lexicon/