Historically, access to labelled data has been the bottleneck for deep learning: large amounts of high-quality labelled data were essential for any ML organisation. This data gluttony has drawn much criticism as one of the outstanding problems in AI.
Deep learning has an access problem.
As the value of data continues to increase, getting hold of high-quality labelled data remains a challenge. Unlabelled data typically consists of raw information such as photos, audio recordings, videos, news articles and tweets. There is no “explanation” attached to any of it; it is just the data, and nothing else. Labelled data takes a set of unlabelled data and augments each piece with some sort of “tag,” “label,” or “class” that allows the information to be easily categorised.
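To make the distinction concrete, here is a minimal sketch (the field names `audio` and `transcript` are illustrative, not taken from any real dataset): an unlabelled example is just the raw data, while a labelled example pairs that data with a human-provided tag.

```python
# An unlabelled audio clip is raw data only; a labelled example adds a
# human-provided "label" (here, a transcript) alongside the same data.
unlabelled_example = {"audio": [0.02, -0.11, 0.34]}
labelled_example = {"audio": [0.02, -0.11, 0.34],
                    "transcript": "hello world"}

def is_labelled(example):
    """An example is labelled if it carries any tag beyond the raw data."""
    return any(key != "audio" for key in example)

print(is_labelled(unlabelled_example))  # False
print(is_labelled(labelled_example))    # True
```

Producing that extra `transcript` field is exactly the manual work that makes labelled data expensive.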
Producing labelled data is time-consuming and costly, and this is the cause of the ‘bottleneck’ we now see in accessing it. Such data is essential for any business that offers machine learning or artificial intelligence solutions, particularly when it comes to training models for real-world use. There is, however, a solution to this problem: unsupervised learning.
In the last two years, there has been huge progress in the field of unsupervised learning. Unsupervised learning does away with labelled data, improving sample efficiency enormously (sample efficiency being the amount of experience an algorithm requires during training to reach a given level of performance). Whilst most of this progress originates from the Natural Language Processing (NLP) community, at Speechmatics we have developed our own family of unsupervised speech algorithms and applied them to speech recognition technology. This capability, known as the large-scale network (project name “Hydra”), leads the industry as the first commercially available speech recognition engine built on unsupervised acoustic models (AMs). Unsupervised AMs will become the standard for commercial automatic speech recognition (ASR).
This new approach to machine learning provides a paradigm shift in ASR accuracy. The models exceeded our expectations and are highly performant in the noisy, real-world environments ASR has traditionally struggled with. Using these models in the Speechmatics engine will enable customers to unlock unprecedented value from their voice data.
Accuracy is crucial to speech recognition, and there are a number of factors to consider, particularly around languages, accents and dialects; handling them requires enormous and diverse datasets. To operate in commercial use cases, this must also be compatible with day-to-day business practices. Using unsupervised learning, the large-scale network’s algorithms work out what underlying patterns and structure there might be in complex data like speech. The main challenge in unsupervised learning is dealing with the lack of prespecified objectives for what you would like it to output; the Hydra Project aims to solve this.
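Speechmatics’ actual algorithms are not described here, but the flavour of “finding structure in unlabelled data” can be sketched with a much simpler stand-in: k-means clustering. Given unlabelled 1-D “acoustic feature” values drawn from two hidden groups, the algorithm recovers those groups without ever seeing a label. (A toy illustration only; real unsupervised speech models learn far richer representations.)

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Cluster 1-D values into k groups with no labels supplied."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # pick k starting centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two hidden groups centred near 0.0 and 5.0; the grouping is never shown
# to the algorithm, yet it recovers both centres from the raw values.
data = [0.1, -0.2, 0.3, 0.0, 5.1, 4.9, 5.2, 4.8]
print(kmeans_1d(data))  # centroids land near 0.05 and 5.0
```

The lack of a prespecified objective shows up even here: the algorithm decides what the groups are, and a human must interpret them afterwards.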
To ensure the ML engine meets the needs of customers, Speechmatics offered selected customers an opportunity to test the new engine containing the capability from the Hydra Project. International media services company RedBee Media, which works with some of the best-known media brands, was given early access. Testing it against competitors, RedBee found the new engine significantly outperformed other engines, particularly on RedBee’s key metrics of textual accuracy, overall punctuation accuracy and word error rate (WER). In addition, it was by far the most successful in analysing the test file with poor audio quality, no doubt due to its ability to perform well in the noisy, real-world environments ASR has traditionally struggled with. RedBee made it clear that the Hydra Project work could have a significant impact on the company’s ability to generate viable ASR transcripts, increase productivity and enhance the value of its services to customers.
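Of the metrics RedBee cites, WER is the standard one: the minimum number of word substitutions, insertions and deletions needed to turn a hypothesis transcript into the reference, divided by the number of reference words. A minimal sketch of how it is typically computed, using classic Levenshtein dynamic programming:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                         # delete all remaining ref words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                         # insert all remaining hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```

A lower WER is better; production scoring tools add normalisation steps (casing, punctuation) on top of this core calculation.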
Concluding the results of the testing, RedBee said: “Just to say thanks again for the opportunity to test the batch version of Hydra, reiterate our interest in testing a real time version when available and to say congratulations on the impressive achievement!”
Unsupervised learning will be the dominant form of machine learning for the foreseeable future, meaning there are a host of commercial applications for this type of technology. The Hydra algorithms are therefore relevant to any area of modern machine learning. The technology is currently being used in next-generation diarisation, language detection and emotion identification betas that will vastly outperform current technology. This step-change in accuracy will allow customers to unlock unprecedented value from their voice data: improving customer experience, enabling greater risk management and, crucially, improving a business’s bottom line.
From an industry point of view, the large-scale network’s shift to unsupervised learning has significant effects. It moves the balance of power from those who hold vast lakes of labelled data to those with access to large amounts of compute, which means that data aggregation and labelling companies will become less relevant. Although this may not sound positive, the result is a dramatic lowering of one of the main barriers to entry for commercial machine learning, namely access to large quantities of labelled data. This opens up the playing field to new players who would previously not have been able to enter the space.
Using unlabelled data is a gear change in commercial machine learning applications, and the Hydra Project has shown how this approach helps to capitalise on its benefits: not only does it benefit companies and consumers, it also means the power dynamic is spread more evenly between the companies that hold large data lakes and those that do not.