VoiceXML — Good, Bad & the Ugly

I came across this post while checking my inactive blog on archive.org. Although the article is dated (2009) and some of the technologies, specs, and links may no longer be relevant, the VoiceXML part is as relevant today as it was in 2009, so I thought it was worth sharing. Please read it keeping in mind that it was written in 2009 and is posted here as is.

Dec 4, 2021

History
The original article was an attempt to identify some of the key deficiencies of VoiceXML as a programming language for voice, and to explain why we created VoicePHP in 2008 as a viable replacement for VoiceXML. On its release, VoicePHP was praised by many; a notable quote came from none other than GigaOM’s Om Malik: “The idea of VoicePHP is Disruptive in its Simplicity”. This article explains why we did so. You can view the original blog post and comments here.

VoiceXML — Good, Bad & the Ugly

While XML (and markup languages in general) has been great for representing data, using it for programming is like using a wrench to hit a nail. All you need is a hammer! Although a wrench can do a passable job of hitting a nail, it’s not elegant, it creates a mess, and it cannot address the problem with precision. The same is the case with XML. It was designed for, and is best suited to, data representation and data transfer, and it has repeatedly proven its worth there. Using it for programming means adapting it to something it inherently wasn’t meant to do. Let me try to explain this in detail:

Consider a typical “hello world” application in VoiceXML (courtesy of vxml.org).

It took 10 lines of code to write something as simple as that. On top of that, even for a simple hello world application, one is tempted to look up more information about the <form> and <block> tags, which seem like overkill.
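For readers who want to see the nesting that the <form> and <block> remark refers to, here is a minimal sketch in Python. It embeds a small VoiceXML 2.0 hello-world document (written for illustration, not the vxml.org listing, so its line count differs) and walks it to list every element the interpreter passes through before it reaches the text to speak. The names HELLO_VXML and path_to_prompt are made up for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative VoiceXML 2.0 hello-world document (not the vxml.org listing).
HELLO_VXML = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>Hello World!</prompt>
    </block>
  </form>
</vxml>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def path_to_prompt(document):
    """Return the chain of element names from the root down to the first <prompt>."""
    root = ET.fromstring(document.encode("utf-8"))

    def walk(node, trail):
        trail = trail + [local_name(node.tag)]
        if trail[-1] == "prompt":
            return trail
        for child in node:
            found = walk(child, trail)
            if found:
                return found
        return None

    return walk(root, [])


if __name__ == "__main__":
    print(" > ".join(path_to_prompt(HELLO_VXML)))
```

Run as a script, this prints the element chain; the two words to be spoken sit four levels deep in the markup.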

Shocked? You are not alone. As Dominique Boucher states on his blog:

from a developer’s perspective, it’s (VoiceXML) like having to program in Cobol! And I only slightly exaggerate

The same application in PHP is reduced to barely 2–3 lines of code, which are more readable and intuitive.

See the difference? Don’t take my word for it; here is what veterans and users say about VoiceXML. We will soon look at why they say so.

From the Industry Veterans

In early 2007, Brian OConnor commented on an annoyance in the VXML standard. As he states, he was unable to use <if> within a <prompt> tag. It’s an arbitrary limitation and requires a nasty workaround.

In the article “Is VoiceXML the Right Tool for Your Voice Application?”, Brian Brown identifies very precise weaknesses of VoiceXML. For example, when even basic voice controls (pause, resume, etc.) are not available in VoiceXML, how can it even be considered the language for programming voice? He nailed the problem very well. Look at how VoicePHP addresses it beautifully in a sample application here.

Dannis, in an interesting email written in his unique style, shares the pain of VoiceXML:

TCL was the most ugly languge of the 90-ies. VXML has now taken over. The language appears not to have iteration (while, for) and no recursion. But it DOES have the goto primitive, which was banned by Dijkstra 30 years ago. There is no function abstraction and neither object-oriented constructs.

He further adds a point that I will elaborate on later in this post:

“VXML is an interpreted language using Javascript. Why not using only Javascript with a bundle of speech specific predefined functions? Hacking java-servlet code already entails generating HTML and Javascript. I don’t see why we have to follow the same painful route with VXML”

Even VoiceXML vendors are aware of the limitations, and they have tried to create specific, proprietary enhancements to get around them, for example, CallXML by Voxeo. In fact, Voxeo’s CEO commented on Gigaom’s coverage of VoicePHP:

“As a developer, I do not like VoiceXML. Personally, I find it to be too complicated, painful, and a barrier to entry for new developers as others have said. This is why Voxeo offers many other ways to create voice applications, including CallXML — a very powerful yet simple XML based telephony markup…”

Do I need to say anything more?

The big question — Why do veterans say so?

Unorganized jungle of XML, JavaScript, CDATA, etc.

In my opinion, VoiceXML looks like a creation born of obsession. XML was the new kid on the block and, perhaps because people were impressed or obsessed by it, somehow fitting it to voice programming needs became the name of the game. Basic TTS and ASR were made to work. Wow! So far so good!

Then someone realized that even simple, common programming requirements could not be met in XML. There was no simple way to address this in XML, and that’s how Javascript (ECMAScript) became a part of the VoiceXML standard. In my vocabulary, this is nothing more than a “workaround”. If Javascript was being considered, then why not do everything in Javascript? When Javascript can completely replace XML (exactly as VoicePHP does), is there any logical reason to keep XML around and, more importantly, to continue it as a “standard”? To be honest, this workaround has only made programmers’ lives more complex. To illustrate, consider the following sample code that reads out a caller-id (courtesy):

Now compare this code with the VoicePHP equivalent (demo here):

What a mind-blowing difference between the two solutions! This example demonstrates the point we are all trying to make: a workaround versus a natural programming language. As the example shows, VoiceXML has to fall back on Javascript because XML lacks even the basic capability to manipulate numbers or strings. How can it even be considered for programming?
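To make the string-handling point concrete, here is a rough sketch of the same kind of task in a general-purpose language (Python here, purely for illustration; it is not the VoicePHP code). The helper name speakable_digits and the sample number are assumptions. It spaces out the digits of a caller ID so that a text-to-speech engine would typically read them one at a time rather than as one large number.

```python
def speakable_digits(caller_id):
    """Keep only the digits of a caller ID and separate them with spaces.

    Assumption: the TTS engine reads space-separated digits individually.
    """
    digits = [ch for ch in caller_id if ch.isdigit()]
    return " ".join(digits)


if __name__ == "__main__":
    # Hypothetical caller ID, used only as sample input.
    print(speakable_digits("+1 (408) 555-0100"))
```

The whole manipulation is one list comprehension and one join, with no second language embedded inside the first.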

As you explore more, you will realize that VoiceXML is so handicapped that it cannot even offer simple loops on its own. Can you imagine an application without such a basic control statement? Yet such basic structures were never addressed in VoiceXML. It simply falls back on Javascript to do such basic things, in a messy way.
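As a sketch of the kind of loop this paragraph has in mind, here is a retry loop written in Python. Everything in it is hypothetical: ask stands in for whatever function a voice platform would provide to play a prompt and return what the caller said, and collect_pin is a name invented for this example.

```python
from typing import Callable, Optional


def collect_pin(ask: Callable[[str], str], max_attempts: int = 3) -> Optional[str]:
    """Ask for a four-digit PIN until one is heard or the attempts run out.

    `ask` is a hypothetical stand-in for a platform call that plays a
    prompt and returns the recognized input as a string.
    """
    for attempt in range(1, max_attempts + 1):
        heard = ask(f"Attempt {attempt}: please say your four digit PIN.")
        if len(heard) == 4 and heard.isdigit():
            return heard
    return None  # the caller never gave a valid PIN


if __name__ == "__main__":
    # Simulated caller: two bad answers, then a valid one.
    answers = iter(["abc", "12", "4321"])
    print(collect_pin(lambda prompt: next(answers)))
```

The control flow is an ordinary for loop with an early return, which is all the paragraph is asking a voice language to offer.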

In contrast, take a look at just about any application at http://code.voicephp.com to see how easily one can take an existing application and move over to VoicePHP with all the programming constructs usually available in most programming languages.

CDATA — Add it to the mess

Well, it keeps getting better. To bring in Javascript, VoiceXML uses CDATA sections. What is going on? Isn’t it messy already? Why do I have to care about all the subtleties? For curious minds, a CDATA section is used so that the script can contain characters (such as < and &) that are otherwise reserved for XML syntax.
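The reason CDATA is needed can be checked with any XML parser. The sketch below uses Python’s standard xml.etree.ElementTree; the script text and the names RAW, WRAPPED, and is_well_formed are made up for illustration. A script containing a bare less-than sign is rejected as malformed XML, while the same script inside a CDATA section parses and comes back unchanged.

```python
import xml.etree.ElementTree as ET

# The same one-line script, first bare and then wrapped in a CDATA section.
RAW = "<script>var ok = (digits.length < 10);</script>"
WRAPPED = "<script><![CDATA[var ok = (digits.length < 10);]]></script>"


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(RAW))        # the bare '<' breaks the XML
    print(is_well_formed(WRAPPED))    # CDATA hides it from the parser
    print(ET.fromstring(WRAPPED).text)
```

So the CDATA wrapper adds nothing to the script itself; it exists only to keep the XML parser from choking on the embedded code.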

It’s truly getting messy: XML, Javascript, CDATA, and of course unreadable code. Keep in mind that all we have done so far is really just “read out a phone number”. It begs the question: why is it so damn complicated?

Consider the code for the first tutorial on Voice Recognition from vxml.org.

I am sure you need a coffee break after reading the above code; it looks verbose, repetitive, and unmanageable. The same application, written in a commonly used ‘real’ programming language, would have a lot less code and would read much better. Again, refer to any code snippet at http://code.voicephp.com.

Server-side programming

One cannot use VoiceXML by itself to write a complete application. Even simple client-side processing needs Javascript. Moving on, if you need to integrate some back-end logic (a.k.a. server-side programming), you need the help of one of the commonly used back-end technologies (e.g. PHP, ASP, .NET, etc.).

This is not me saying it, but vxml.org:

Coding an application with just straight VoiceXML is just fine and dandy, thankyouverymuch, but the real potential of VoiceXML is harnessed when we add some ASP or JSP into the mix

Look at the emphasis on “real”. I sincerely appreciate their candid confession and applaud them for putting the limitation of VoiceXML across so succinctly. So as you can see, in addition to learning VoiceXML tags and attributes, Javascript, and a different programming style, one now also has to learn a server-side language. Think about it: you need a chilled beer to relax, but you are being handed a cocktail, like it or not!
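The server-side pattern described here usually means that a back-end script builds the VoiceXML page on each request and the voice platform then interprets it. Below is a minimal sketch of that idea in Python (standing in for PHP, ASP, or JSP); the function name balance_vxml, the customer name, and the balance are all invented for the example.

```python
import xml.etree.ElementTree as ET


def balance_vxml(customer, balance):
    """Build a one-prompt VoiceXML page from back-end data.

    Sketch only: a real application would look the values up in a
    database and return this string as the HTTP response body.
    """
    vxml = ET.Element(
        "vxml", {"version": "2.0", "xmlns": "http://www.w3.org/2001/vxml"}
    )
    form = ET.SubElement(vxml, "form")
    block = ET.SubElement(form, "block")
    prompt = ET.SubElement(block, "prompt")
    # ElementTree escapes reserved characters such as '&' when serializing.
    prompt.text = f"Hello {customer}, your balance is {balance:.2f} dollars."
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(balance_vxml("Asha", 42.5))
```

Note that the developer is now writing one language whose only job is to emit another, which is exactly the cocktail the paragraph above complains about.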

In Closing

Any way you slice it, VoiceXML doesn’t come close to meeting the requirements of real-world applications. Voice applications would do really well if there were an easy way to bring them to life. Developers do not want to use complicated technology to achieve something simple, intuitive, and obvious. I know I won’t.

We are not against VoiceXML. In fact, VoiceXML spearheaded the way for voice programming and took away the complexity that one had to deal with in the early days (remember the nightmare of hardware cards and proprietary drivers?). When it launched, VoiceXML was the “new” way to program voice, and we were completely supportive of it too. We released the world’s first “Adobe Flash-based VoiceXML Platform”.

But it’s about time that VoiceXML realizes its inadequacies and makes way for better alternatives. Alternatives like VoicePHP (or maybe even VoicePERL or VoicePYTHON) could do a better job. The web is evolving, and solutions that can tightly integrate with it will become more and more important. Dedicated solutions to tackle a specific problem are a thing of the past. Some technologies (e.g. PHP for web programming, Flash for UI and widgets, mobile applications using a data network, etc.) have proven themselves, and it’s about time that we re-use them and not bind ourselves to technologies that began with the right attitude to solve a problem but couldn’t really establish themselves due to technical limitations.

Originally published at https://web.archive.org.