Speech-based services are growing in importance. Organisations, whether they provide products, commercial services, or social services, realise the benefits of letting people interact with them through speech: it is easier, faster, and more inclusive. They train their AI-driven systems on speech datasets, first to understand what is being said and then to deliver an appropriate response. Yet how many times have you been asked to rephrase what you said? The standby solution of speaking slower and louder does not help when the machine at the other end was inadequately trained and does not understand your word sequences in the first place.
In a nutshell, here’s how it works. Appropriate dialogue is recorded and then annotated. AI systems are trained on these speech datasets to correctly recognise what was said, find the corresponding text in a written database, select the best written response, and convert that response back to audio for the caller. Errors can creep in at any stage: speech that is not recognised, annotations that are inadequate or incorrect, or a limited choice of available responses.
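The turn described above can be sketched in a few lines of code. The function names and canned responses below are illustrative stubs, not a real system’s API; in practice each stage is a trained model or external service.

```python
# Sketch of one voice-assistant turn: ASR -> text lookup -> response -> TTS.
# Every component here is a stub standing in for a trained model.

RESPONSES = {
    "what are your opening hours": "We are open 9am to 5pm, Monday to Friday.",
    "where is my order": "Let me check the status of your order.",
}

def recognise(audio: bytes) -> str:
    """Stub ASR: a real system maps audio to text with acoustic and language models."""
    return audio.decode("utf-8").lower().strip()  # pretend the audio is already text

def best_response(utterance: str) -> str:
    """Look up the matching scripted response; fall back to a reprompt on a miss."""
    if utterance in RESPONSES:
        return RESPONSES[utterance]
    return "Sorry, could you rephrase that?"  # the failure mode described above

def synthesise(text: str) -> bytes:
    """Stub TTS: a real system converts the written response back to audio."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    return synthesise(best_response(recognise(audio)))
```

The fallback branch is exactly the “please rephrase” experience: when recognition or the response database misses, the system has nothing better to offer.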
Inadequate training data causes problems down the line
Experts consequently attribute many errors in speech recognition systems to flaws in the speech datasets used to train the models. Data is often scraped from many different sources, and it can reflect the societal biases of the people doing the scraping (who are typically white and male). Language models have been shown to exhibit prejudices along racial, ethnic, religious, and gender lines, associating Black people with more negative emotions and struggling with “Black-aligned English.”
According to the Algorithmic Justice League’s Voice Erasure project, speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft collectively achieve a word error rate of 19% for white voices but 35% for Black voices. Closing that gap requires larger and cleaner datasets if client investments in AI-driven speech systems are to deliver the best returns.
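Word error rate, the metric behind these figures, is the word-level edit distance between what was said and what the system heard (substitutions, insertions, and deletions), divided by the number of words in the reference transcript. A minimal implementation, assuming whitespace tokenisation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)
```

A 35% error rate means roughly one word in three is misheard, dropped, or invented, which is usually enough to derail a dialogue turn entirely.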
A crowdsourced solution
Defined.ai (formerly DefinedCrowd) is a leading provider of speech datasets, models, and tools for training AI-driven voice systems. It is based in Seattle, USA, and through its Neevo platform it draws on a crowdsourced workforce of over 500,000 contributors in more than 70 countries who speak over 50 languages. Between them they have completed over 200 million tasks: recording speech, annotating it, or checking work done by others.
The company offers two key resources: prebuilt training data, itemised in a catalogue of existing generic datasets created in many languages, and customised speech datasets built to specific client requirements. Clients may specify speech focused on a particular industry or business sector, the number of hours they require, a choice of scripted or spontaneous dialogue, and how many individual contributors should supply it.
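A client specification like the one just described could be captured as a small structured request. The field names below are hypothetical, chosen to mirror the parameters in the text, and do not reflect Defined.ai’s actual API:

```python
from dataclasses import dataclass

@dataclass
class SpeechDatasetSpec:
    """Hypothetical client request for a custom speech dataset."""
    language: str            # e.g. "en-US"
    industry: str            # target business sector, e.g. "banking"
    hours_of_audio: int      # total recorded hours required
    scripted: bool           # True = scripted prompts, False = spontaneous dialogue
    min_contributors: int    # minimum number of distinct speakers

# Example request: 200 hours of spontaneous banking dialogue from 500 speakers.
spec = SpeechDatasetSpec(language="en-US", industry="banking",
                         hours_of_audio=200, scripted=False,
                         min_contributors=500)
```

Pinning the speaker count separately from the hour count matters: 200 hours from a handful of voices trains a far less robust model than the same hours spread across hundreds of contributors.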
This often requires recruiting new contributors. Depending on the complexity of client requirements, solutions can involve microtasks performed by broad swathes of the general public, or maybe require input from industry sector specialists and qualified translators.
Crowdsourcing contributors to record or annotate speech datasets enables the final result to more accurately represent the particular audience specified by a client. AI built on data that is representative of the populations it is serving delivers more successful results, with a higher degree of consumer engagement and support.
Comprehensive speech datasets must reflect regional accents, and also the accents of immigrants who are not speaking their native language; this is where high error rates tend to occur. The ability to recruit specific sub-groups of a population on a temporary basis is a great strength of crowdsourced datasets over those built by a workforce of regular employees.
“There is an accent gap in speech technology. Research shows that speech recognition technologies are not nearly as accurate in understanding non-native accents as they are in understanding white, non-immigrant, upper-middle-class Americans,” said Dr. Daniela Braga, founder and CEO of Defined.ai. This is no surprise: it was this demographic that had the most access to the technology and trained it from the beginning.
Perhaps to highlight the deficiencies of many less comprehensive speech datasets, Defined.ai made free datasets of Spanish-accented English available to AI developers, allowing them to test how well their speech recognition models understand non-native English speakers, a group of some 35 million people in the United States.
The new company name marks the start of a transformation to scale the business to new heights, moving beyond crowdsourced data gathering to a comprehensive AI platform and marketplace ready for a new era of AI development.