Semi-supervised learning explained | InfoWorld


In his 2017 Amazon shareholder letter, Jeff Bezos wrote something interesting about Alexa, Amazon’s voice-driven intelligent assistant:

In the U.S., U.K., and Germany, we’ve improved Alexa’s spoken language understanding by more than 25% over the last 12 months through enhancements in Alexa’s machine learning components and the use of semi-supervised learning techniques. (These semi-supervised learning techniques reduced the amount of labeled data needed to achieve the same accuracy improvement by 40 times!)

Given those results, it might be interesting to try semi-supervised learning on our own classification problems. But what is semi-supervised learning? What are its advantages and disadvantages? How can we use it?

What is semi-supervised learning?

As you might expect from the name, semi-supervised learning is intermediate between supervised learning and unsupervised learning. Supervised learning starts with training data that are tagged with the correct answers (target values). After the learning process, you wind up with a model with a tuned set of weights, which can predict answers for similar data that haven’t already been tagged.

Semi-supervised learning uses both tagged and untagged data to fit a model. In some cases, such as Alexa’s, adding the untagged data actually improves the accuracy of the model. In other cases, the untagged data can make the model worse; different algorithms have vulnerabilities to different data characteristics, as I’ll discuss below.

In general, tagging data costs money and takes time. That isn’t always an issue, since some data sets already have tags. But if you have a lot of data, only some of which is tagged, then semi-supervised learning is a good technique to try.
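To make the idea of fitting a model with both tagged and untagged data concrete, here is a minimal sketch of self-training, one common semi-supervised approach. It is only an illustration: the article does not say which technique Amazon used. The nearest-centroid classifier, the one-dimensional features, and the names `fit_centroids`, `predict_with_margin`, `self_train`, and `min_margin` are all assumptions made for this example.

```python
# Minimal self-training sketch in pure Python (1-D features for readability).
# Assumption: a nearest-centroid classifier and a distance-margin confidence
# test stand in for whatever model and confidence measure you actually use.

def fit_centroids(points, labels):
    """Supervised step: return {label: mean of the points with that label}."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_margin(centroids, x):
    """Return (nearest label, gap between the two smallest distances)."""
    ranked = sorted((abs(x - c), y) for y, c in centroids.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0, max_rounds=10):
    """Repeatedly fit on tagged data, then adopt confident predictions as tags."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        confident, remaining = [], []
        for x in pool:
            label, margin = predict_with_margin(centroids, x)
            if margin >= min_margin:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        for x, label in confident:
            xs.append(x)
            ys.append(label)
        pool = remaining
    # Final model, plus the points that never received a confident tag.
    return fit_centroids(xs, ys), pool


if __name__ == "__main__":
    model, leftover = self_train(
        labeled_x=[0.0, 10.0],
        labeled_y=["a", "b"],
        unlabeled_x=[1.0, 2.0, 9.0, 8.0, 5.2],
    )
    print(model, leftover)
```

In this toy run, the four untagged points near the tagged examples are adopted as pseudo-labels and shift the two centroids, while the ambiguous point at 5.2 stays untagged. The same loop also shows how untagged data can hurt: if the threshold is too loose, a confident but wrong pseudo-label is fed back into training and reinforces the mistake.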

Semi-supervised learning algorithms

Semi-supervised learning goes back at least 15 years, possibly more; Jerry Zhu of the University of Wisconsin wrote a literature survey in 2005. Semi-supervised learning has had a resurgence in recent years, not only at Amazon, because it reduces the error rate on important benchmarks.


