AI’s Biggest Bottleneck: High-Quality Training Data
Oct 28 · 8 min read
Training a powerful machine learning model has never been easier, but creating the right high-quality training dataset is still incredibly difficult. Surge AI is a team of ML engineers and research scientists building human-AI platforms to solve this.
Interested in working together or joining our team? Reach out at firstname.lastname@example.org or sign up for our beta.
Four years ago, Google shocked language experts with a dramatically improved machine translation model, big tech was acqui-hiring every ML startup it could get its hands on, AlphaGo beat the world’s Go experts, and the New York Times declared that “machine learning is poised to reinvent computing itself”.
In 2016, DeepMind began building an AI to beat StarCraft II, and by the end of 2019, its AlphaStar AI had reached Grandmaster level. https://www.youtube.com/watch?v=5iZlrBqDYPM
The ML takeover seemed inevitable, and it felt like we were just a few years away from a world where automated content moderation systems would clean up our social media platforms, Siri-like assistants would take over our homes, and Netflix’s recommendation systems would give us better movie suggestions than our friends.
What’s happened since then?
Much of what the ML optimists predicted has come true.
Faster GPUs and more efficient software have cut the cost of training neural networks and made it possible to train larger and larger models. New products like AWS SageMaker and Google Cloud AI Platform make the infrastructure work of machine learning much easier. Integrating ML into traditional systems still poses difficulties, but the cost and difficulty have dropped dramatically.
The compute used in the largest AI training runs has been doubling roughly every 3.4 months since 2012. https://openai.com/blog/ai-and-compute/
We’ve also developed new neural network architectures that can scale to massive datasets and learn to perform more subjective tasks. Transformers, for example, capture the nuances of language much better than older algorithms and power OpenAI’s 175-billion-parameter GPT-3 model, a new language generator capable of writing blog posts that reach the top of the Hacker News front page.
A GPT-3-written blog post on productivity that reached the top of Hacker News. https://liamp.substack.com/p/my-gpt-3-blog-got-26-thousand-visitors
And these models aren’t just available to researchers at Google and OpenAI: easy-to-use implementations are freely available for everyone thanks to open-source contributions from Hugging Face, Google, and Facebook.
But where’s the revolution?
In the past five years, we’ve gotten better models, cheaper computers, and easier-to-use software tools. Why hasn’t there been a new rush of AI applications throughout the business world?
Why can I generate a blog post with GPT-3, while social media companies struggle to keep inflammatory content out of our feeds?
Why do we have a superhuman StarCraft algorithm, while e-commerce stores still recommend I buy a second toaster?
The short answer: our models have gotten better but our data hasn’t. Tools for training massive neural networks have seen dramatic improvements, but tools for capturing human judgment in training data remain as difficult to use as ever. Because of this gap, most of the models used in companies today are trained on datasets that are poorly aligned with what model creators want.
What’s wrong with today’s data? Garbage in, garbage out
In some cases, models are trained on proxies like clicks and user engagement.
Many of the original algorithms behind social media feeds, for example, aren’t designed to provide users the best experience; instead, they’re optimized to maximize clicks and engagement, the easiest sources of data available to machine learning practitioners.
But likes are different from quality — shocking conspiracy theories are addictive, but do you really want to see them in your feed? — and this mismatch has led to a host of unintended side effects, including the proliferation of clickbait, the spread of political misinformation, and the pervasiveness of hateful, inflammatory content.
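To make the mismatch concrete, here is a toy sketch. The posts, click rates, and quality ratings are all invented for illustration, and `rank` is a hypothetical helper rather than any platform's real ranking code. The same three posts are sorted by an engagement proxy and then by a human quality rating.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    click_rate: float     # hypothetical engagement proxy, between 0 and 1
    human_quality: float  # hypothetical human rating, from 1 (bad) to 5 (great)


# Invented example data, not real measurements.
posts = [
    Post("Shocking conspiracy theory", click_rate=0.31, human_quality=1.2),
    Post("Friend's wedding photos", click_rate=0.12, human_quality=4.6),
    Post("Local news update", click_rate=0.08, human_quality=3.9),
]


def rank(posts, key):
    """Return post titles sorted from highest to lowest score under `key`."""
    return [p.title for p in sorted(posts, key=key, reverse=True)]


by_clicks = rank(posts, key=lambda p: p.click_rate)
by_quality = rank(posts, key=lambda p: p.human_quality)

print("Ranked by clicks: ", by_clicks)
print("Ranked by quality:", by_quality)
```

A model trained only on the click column can never learn the second ordering, because that information is simply not in its data.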
At other times, models are trained on datasets built by workers who aren’t capable of capturing an ML creator’s intentions. These labelers often don’t work in their native language and lack cultural context. These are all sources of misalignment that lead to poor dataset quality.
For example, imagine a worker is asked to determine if this comment (from the Toxic Comment Classification Challenge) is mean-spirited:
McCain should get with his democrat buddies Schumer and Durbin and write a ‘bi-partisan’ Bill repealing the Jones Act, and have it go through ‘the regular order’ that Senator McBackstab loves so much. Then after… oh five years… of back-and-forth with going through the appropriate Committees, having all the Hearings, the Conferences with the House, then bringing it up for a Vote on the Senate Floor. Of course Senator McTraitor will be long gone by then but hey, nothing’s perfect.
A native English speaker will recognize two things indicative of mean-spiritedness: the quotes around “bi-partisan” and “the regular order” suggest sarcasm, and McBackstab and McTraitor are insults, not real names. In contrast, these cues are often missed by data labelers today.
Or take the following tweet:
A typical labeler will spot “bitches”, “fucking”, and “shit”, and mistakenly label this tweet as toxic, even though the profanity is intended in a positive, uplifting manner. We’ve run into issues like these in our training sets countless times.
Data defines the model. Even when an engineer uses a powerful architecture capable of understanding nuanced language, it can only learn what’s encoded in the training sets. If the data is a poor proxy for what we are trying to learn or is simply mislabeled, no amount of ML expertise will prevent the model from being junk.
What advancements do we need?
These dataset quality issues cause a host of problems for ML.
When faced with poorly performing models, engineers spend months tinkering with feature engineering and new algorithms, without realizing that the problem lies with their data. Algorithms intended to bring friends and family together drive red-hot emotions and angry comments instead.
How can we fix these problems? While most machine learning teams use powerful tools to develop their models, it’s rare for them to bring the same level of sophistication to their datasets. Here is a set of building blocks we believe are critical for unlocking the next wave of AI.
1. Skilled, high-quality labelers who understand the problem you’re trying to solve
As AI systems become more complex, we need sophisticated human labeling systems to teach them and measure their performance. Think of models with enough knowledge of the world to classify misleading information, or algorithms that increase Time Well Spent instead of clicks.
This level of sophistication can’t be bootstrapped from majority votes among low-skill workers. In order to teach our machines about hate speech or to identify algorithmic bias, we need high-quality labeling forces who understand those problems themselves.
2. Spaces for ML teams and labelers to interact
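A small sketch of why majority voting falls short: it averages away independent, random mistakes, but it cannot fix a mistake that most labelers share. The votes and expert labels below are invented for illustration, and the item names are hypothetical stand-ins for the kinds of examples discussed earlier.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label among the votes for one item."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical votes from three labelers who miss sarcasm and
# treat any profanity as toxic.
votes = {
    "sarcastic insult": ["not_toxic", "not_toxic", "toxic"],
    "uplifting profanity": ["toxic", "toxic", "not_toxic"],
    "plain insult": ["toxic", "toxic", "toxic"],
}

# Hypothetical labels from an expert who understands the context.
expert = {
    "sarcastic insult": "toxic",
    "uplifting profanity": "not_toxic",
    "plain insult": "toxic",
}

aggregated = {item: majority_vote(item_votes) for item, item_votes in votes.items()}
accuracy = sum(aggregated[item] == expert[item] for item in expert) / len(expert)

print(aggregated)
print(f"Agreement with expert: {accuracy:.2f}")
```

Adding more labelers with the same blind spots would not change the outcome; only labelers who recognize the sarcasm and the friendly profanity would.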
ML models are constantly changing. What counts as misinformation today — or a high-quality search result — may not tomorrow, and we’ll never capture every edge case in our labeling instructions.
Just as building products is an iterative, feedback-driven process between users and engineers, dataset creation should be as well. When counting faces in an image, do cartoon characters count? When labeling hate speech, where do quotes fall? Labelers uncover natural ambiguities and insights after going through thousands of examples, and in order to surface questions and feedback, we need channels for both sides to communicate.
3. Objective functions aligned with human values
Our models often train on datasets that are merely an approximation of their true goal, leading to unintended divergences.
BLEU scores in Machine Translation are only an approximation of translation quality. https://twitter.com/WWRob/status/1315005570378657792
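As a rough illustration of that gap, the sketch below implements a simplified BLEU-style score: clipped unigram and bigram precision, a geometric mean, and a brevity penalty. It is not the official BLEU implementation, and `simple_bleu` and the example sentences are made up for this post. It compares a faithful paraphrase against a sentence that reverses the meaning.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU-style score using n-grams up to max_n."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return geo_mean * brevity


reference = "the cat is on the mat"
paraphrase = "a cat sits on the rug"     # same meaning, different words
negated = "the cat is not on the mat"    # opposite meaning, nearly same words

print(f"paraphrase: {simple_bleu(paraphrase, reference):.3f}")
print(f"negated:    {simple_bleu(negated, reference):.3f}")
```

The negated sentence scores higher than the paraphrase because the metric rewards word overlap, not meaning, which is exactly the kind of divergence a human evaluator would catch.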
In AI safety debates, for example, people often worry about machine intelligences that run amok, with the power to threaten the world. Others counter that this is a problem for the far future — and yet, when we look at the largest issues facing tech platforms today, isn’t this already happening?
Facebook’s mission, for example, isn’t to garner likes, but to connect us with our friends and family. But because its models are trained to increase likes and social interactions, rather than to surface posts that make you feel closer to your friends, they learn to spread content that’s highly engaging, but also toxic and misinformed.
What if Facebook could inject human values into its training objectives? This isn’t a fantasy: Google Search already uses human evaluation in its experimentation process, and the human/AI systems we’re building aim to expand this more broadly.
Building an AI future
At its core, machine learning is about teaching computers to perform the job we want — and we do that by showing them the right examples.
So in order to build high-quality models, shouldn’t building high-quality datasets, and making sure they match the problem at hand, be the most important skill of an ML engineer?
Ultimately, we care whether AI solves human needs, not whether it beats artificial benchmarks.
If you work on content moderation, are your datasets capturing hate speech — or positive, uplifting profanity too?
If you’re building the next generation of Search and Recommender Systems, are your training sets modeling relevance and quality — or addictive misinformation and clicks?
Is your data aligned with your goals, or are you training paperclip maximizers to take over the world?
Dataset creation isn’t something taught in schools, and it’s easy for engineers who’ve spent years studying algorithms to fixate on the fanciest models coming out of arXiv each month. But if we want to build AI to solve real-world problems, we need to think deeply about the datasets that define our models, and give them a human touch.
Surge AI is a team of Google, Facebook, and Twitter engineers building platforms to enable the next wave of AI — starting with the problem of high-quality supervised data.
This requires humans and technology working together. Our high-skill labeling workforce undergoes a rigorous series of tests; our platform then makes trustworthy data collection a breeze. Together, we’ve helped top companies create massive datasets for training and measuring misinformation, AI fairness, creative generation, and more.
Want to accelerate your machine learning teams? Excited about designing AI for human needs? Interested in helping us build the future? We’d love to hear from you. Send us a message at email@example.com or sign up for early access.