Do Androids Perceive Numbers as We Do?

Meiyu
Jun 16, 2023 · 7 min read


A group project by Meiyu Zheng, Jayathilaga Ramajayam, Hariharavarshan Nandakumar and Sara Mirjalili from Class of 2023, MDS — CL program, UBC

Our quirky title was inspired by Philip K. Dick’s science fiction novel “Do Androids Dream of Electric Sheep?” (fun fact: the film Blade Runner was adapted based on this book!). Let’s dive into the wonderful world of AI, numbers and language processing!

Image by mariia.shalabaieva on Unsplash

Have you ever wondered how a voice assistant understands and processes numerical representations? By “numerical representation”, we mean concepts such as dates, times, and common units and measurements. For example, how does an AI chatbot associate ‘2’ with ‘two’, ‘$’ with ‘dollar’, or ‘四’ with ‘4’? Do you think an AI machine would prefer ‘$20’ or ‘twenty dollars’?

This cognitive process comes naturally to humans: we can associate ‘1/4’ with ‘one-quarter’ or ‘June fourteenth’ with ‘June 14th’ effortlessly. Unfortunately, it’s not so for machines! We have to “teach” machines meticulously to understand and perform the way we want!

Before we dive in, let’s demystify a few related terms first:

  • TTS: Text to Speech (conversion of written text into verbal language)
  • ASR: Automatic Speech Recognition (transcribing spoken text into written text)
  • TN: Text Normalization (e.g., ‘20 years old’ -> ‘twenty years old’)
  • ITN: Inverse Text Normalization (e.g., ‘thirty percent’ -> ‘30%’)
  • LLM: Large Language Models (AI models designed to understand and generate human-like text)
  • WFST: Weighted Finite State Transducers (used in many text and speech processing systems)

Now, you might be thinking…

So, what does a bunch of numbers in different formats have to do with speech recognition?

A robust ASR engine should be able to capture and comprehend spoken numerical entities from user input and transform them to written forms accurately, and vice versa. Think of it as a translator: “it costs me twenty-three dollars” should be translated to “it costs me $23”, and the reverse should hold.

Similarly, the system should handle these conversions seamlessly in any virtual assistant or chatbot product, and it should also be capable of handling variations and edge cases.

Hmm… sounds like an easy task, you may say?

Not so in reality! As trivial a component of ASR as it may sound, it is hugely important, and its optimization significantly impacts the performance of speech recognition. And trust me, it can get incredibly complex, especially when dealing with noisy ASR results!

A closer look at the TNITN project

For our capstone project, we worked with Seasalt.ai (a Seattle-based AI company) to develop a Text Normalization and Inverse Text Normalization (TNITN) pipeline for Traditional Chinese text. Our goal? To improve the robustness of state-of-the-art Text-to-Speech and Automatic Speech Recognition systems by leveraging quality data and a combination of rule-based and neural models.

Transforming text from written form into spoken form is called Text Normalization (TN), while transforming text from spoken form into written form is Inverse Text Normalization (ITN). TN is an essential pre-processing step of text-to-speech (TTS) systems, ensuring that TTS can handle all input texts without skipping unknown symbols such as punctuation marks and special characters; ITN is part of the automatic speech recognition (ASR) post-processing pipeline, enhancing the readability of ASR results and boosting the performance of downstream tasks.
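The TN direction can be sketched as a tiny toy normalizer. This is purely illustrative (it only verbalizes numbers up to 99 and a dollar sign, in English rather than Chinese), not the pipeline we actually built:

```python
import re

# Minimal number-to-words lookup for 0-99 -- a sketch, not a production verbalizer.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def num_to_words(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text: str) -> str:
    """TN: rewrite written forms into spoken forms."""
    # '$23' -> 'twenty-three dollars'
    text = re.sub(r"\$(\d+)\b",
                  lambda m: num_to_words(int(m.group(1))) + " dollars", text)
    # remaining bare numbers -> words
    text = re.sub(r"\b(\d+)\b", lambda m: num_to_words(int(m.group(1))), text)
    return text

print(normalize("it costs me $23"))  # it costs me twenty-three dollars
```

A real TN system has to cover many more semiotic classes (dates, times, ordinals, currencies in context), which is exactly where the complexity comes from.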

Stop there…that sounds too much!

To put it simply, our goal was to enhance a voice assistant’s ability to understand numerical representations in both forms (like $100 and one-hundred dollars) and associate one form with the other more reliably.

We focused on punctuation and the ITN task in particular for this project. We aimed to improve the system’s ability to punctuate text and translate spoken forms into written forms accurately… all in Traditional Chinese from Taiwan!

Sounds really intriguing, doesn’t it?

And remind me again, why does that matter for Chinese text?

Because it can significantly improve the readability of ASR results: believe me, you don’t want to see a bunch of strings like “two thousand twenty three nine one one…” or “十億八千五百六十萬”!

( A huge shout-out to the inventor of the Arabic numerals! )

Unlike many languages, Chinese does not use spaces between words, which increases the complexity of dealing with long sentences. It’s challenging to find explicit word boundaries without good knowledge of Chinese characters and semantics; likewise, this makes it harder for machine learning models to identify or tokenize Chinese text accurately. Thus, for enhanced readability and better comprehension, appropriate punctuation is critical!

Let’s see an example of how punctuation makes a difference:

你好你最近過得好嗎 -> 你好,你最近過得好嗎?

before: ‘heyhowareyoudoinglately’

after: ‘hey, how are you doing lately?’

ITN: one more example coming along…

Chinese characters to Arabic numerals (a.k.a. spoken form -> written form)

我要預定九月十二號五點四十的票 -> 我要預定9月12號5點40的票

before: ‘I’d like to book a ticket for five forty September twelfth’

after: ‘I’d like to book a ticket for 5:40, 9/12’

That does look much better!
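The numeral half of that conversion can be sketched as a small Chinese-numeral-to-integer converter. This toy version (an illustration, not our actual model) handles values up to the thousands:

```python
# Convert Chinese numerals to integers (the ITN direction) -- a sketch only;
# real systems cover far larger values and many more semiotic classes.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100, "千": 1000}

def zh_to_int(s: str) -> int:
    total, current = 0, 0
    for ch in s:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            # '十二' = 12, and a bare leading '十' means 10
            total += (current or 1) * UNITS[ch]
            current = 0
    return total + current

print(zh_to_int("九"))        # 9
print(zh_to_int("十二"))      # 12
print(zh_to_int("四十"))      # 40
print(zh_to_int("五百六十"))  # 560
```

Notice how positional context matters: ‘四’ before ‘十’ multiplies (四十 = 40), while a digit after a unit adds (十二 = 12) — one reason pure string substitution isn’t enough.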

And more importantly, once text is properly punctuated and ITN-transformed, it lays the groundwork for all kinds of downstream NLP tasks (such as named entity recognition and intent classification).

But wait…

Challenges, however, came with the dataset and the method. Traditional ASR/TTS normalization is rule-based, e.g., WFST systems, which define sets of linguistic rules for different classes (DATE, CARDINAL, TIME, etc.).

Simply put, you can think of it as a collection of substitutions. These grammar rules are straightforward: ‘1’ is set to be equivalent to ‘one’, ‘$’ to ‘dollar’; once pairs are defined, each form is transformed into the other in the TN-ITN process.
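The pair idea can be sketched as one rule table applied in both directions. This is a toy dictionary version; real WFST grammars compose and weight such rules far more carefully:

```python
# One table of (written, spoken) pairs, usable in both the TN and ITN directions.
PAIRS = [("1", "one"), ("2", "two"), ("$", "dollar")]

TN_MAP = dict(PAIRS)                                       # written -> spoken
ITN_MAP = {spoken: written for written, spoken in PAIRS}   # spoken -> written

def apply(table, tokens):
    """Substitute each token if a rule exists, otherwise pass it through."""
    return [table.get(tok, tok) for tok in tokens]

print(apply(TN_MAP, ["2", "$"]))          # ['two', 'dollar']
print(apply(ITN_MAP, ["two", "dollar"]))  # ['2', '$']
```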

However, there are limitations. Extending grammars sounds feasible, but it can’t seamlessly handle edge cases. Not to mention, rules can be context-dependent and get out of hand for some languages; there can be different ways of using the same numbers in different scenarios!

Another setback is that rule-based systems cannot handle ambiguous input well when the text heavily depends on context. For instance, should “St.” be converted to “Street” or “Saint” (as in “Saint Patrick’s Day”)?
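This kind of context dependence is exactly what fixed substitution rules struggle with. A toy heuristic (entirely hypothetical, and easy to break) might peek at the neighboring token:

```python
def expand_st(tokens, i):
    """Disambiguate 'St.' at position i using local context.
    Toy heuristic: a capitalized word after 'St.' suggests 'Saint';
    otherwise assume 'Street'. Real systems need far richer context."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return "Saint" if nxt[:1].isupper() else "Street"

print(expand_st(["St.", "Patrick"], 0))     # Saint
print(expand_st(["Main", "St.", "is"], 1))  # Street
```

The heuristic fails the moment a street name is followed by a capitalized word — which is why neural models that see the whole sentence are appealing here.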

Our approach

Therefore, we turned to neural model-based systems by employing the NeMo toolkit (its punctuation and ITN models). For ITN training in particular, a key challenge was acquiring quality data. As you might imagine, text data in spoken form is rarely found on the web. LLMs came to the rescue!

LLMs enabled us to transform in-house ASR results into high-quality ITN data, so that we could obtain spoken-written pairs for model training.
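A sketch of how such a data-generation prompt might be built — the prompt wording here is an illustrative assumption, not Seasalt.ai’s actual pipeline:

```python
def build_itn_prompt(spoken_text: str) -> str:
    """Build a prompt asking a chat model to produce the written (ITN) form
    of a spoken-form Traditional Chinese sentence. Wording is illustrative."""
    return (
        "Convert the spoken-form Traditional Chinese text to written form, "
        "replacing number words with Arabic numerals and adding punctuation. "
        "Return only the converted text.\n\n"
        f"Spoken: {spoken_text}\nWritten:"
    )

prompt = build_itn_prompt("我要預定九月十二號五點四十的票")
print(prompt)
# The prompt is then sent to a chat model (e.g., gpt-3.5-turbo), and the
# model's reply is paired with the input to form one (spoken, written)
# training example.
```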

After exploring many different LLMs including LLaMA, Alpaca, Dolly and OpenAssistant, we ultimately chose to utilize ChatGPT (GPT-3.5-turbo in particular) for data generation. Even though ChatGPT didn’t yield perfect performance on the punctuation and ITN tasks, it was the best and most capable among the models we tried, and it contributed significantly to our data preparation process!

We set up the same data generation pipeline for the punctuation and ITN tasks, then cleaned and post-processed the data, adopting a different preparation format for each task since they require different structures for training. After this step, we fed the data into NeMo models to train the punctuation and ITN modules.
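To illustrate why the two tasks need different structures, here is a sketch of turning one generated pair into two task-specific layouts. These layouts are assumptions for illustration, not NeMo’s exact file specs:

```python
# One generated (spoken, written) pair, reused for both tasks.
pairs = [("九月十二號", "9月12號")]

# ITN format: tab-separated spoken/written pairs, one per line.
itn_lines = [f"{spoken}\t{written}" for spoken, written in pairs]

# Punctuation format: each character labeled with the mark that follows it
# ('O' when none does), derived by aligning the punctuated and unpunctuated text.
def punct_labels(unpunctuated, punctuated, marks=",?。,?"):
    labels, j = [], 0
    for ch in unpunctuated:
        j = punctuated.index(ch, j) + 1
        mark = punctuated[j] if j < len(punctuated) and punctuated[j] in marks else "O"
        labels.append((ch, mark))
    return labels

print(itn_lines[0])
print(punct_labels("你好你最近過得好嗎", "你好,你最近過得好嗎?"))
```

The alignment trick only works when punctuation is purely inserted (no characters changed), which held for our punctuation data but not for ITN — hence the separate formats.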

Final Remarks

Reflecting on the project, there’s no doubt that we hit some bumps in the road. We were lucky to have the chance to be exposed to different LLMs: the promising performance of models such as LLaMA and Alpaca on simple tasks with toy data didn’t translate to the full-scale data, so we ended up using ChatGPT. It made us realize that while LLMs are powerful and can be great for general tasks, implementing them for a specific task in production proved to be challenging. Still, the journey and the learnings made it worthwhile.

When working on the data generation script for the ITN task in particular, we came up with many different prompts and kept track of their performance. But when it came to the full-scale data, even a powerful model like ChatGPT hallucinated at some point. This could be due to many factors: the task was not a general one, but rather a specialized niche, and the language was Traditional Chinese, not English. The prompts could still be tweaked in different ways to enhance performance, such as by using techniques like Chain of Thought.

This project served as a reminder that while AI models are exceptional tools, they might not always perform optimally in every scenario. It’s a dynamic journey of learning, adapting, and iterating for us — a journey we found truly rewarding.

Circling back to our initial question — “Do Androids Perceive Numbers as We Do?” does it prefer ‘$20’ or ‘twenty dollars’? Well, the answer is — it depends. At times, it might lean towards ‘twenty dollars’ for TTS, while in other instances, ‘$20’ may be the preferred choice in the ASR pipeline. And that’s the beauty of TNITN!

Thanks for stopping by and reading about our project! : )


Meiyu

Computational Linguist | NLP Engineer | Technical Writer