2012-07-08

Computer literacy: the importance of reading for young artificial neural networks

Photo from New York Times
Recently, a Stanford researcher and a Google X team made headlines with their study on unsupervised learning. Starting with one of the biggest neural networks ever built (1 billion connections, running on 16,000 CPU cores), they fed it one still frame from each of 10 million YouTube videos and watched the network learn to identify human faces and cats.

But why aren't experiments on a similar scale being done with text data? I can think of three reasons why they should be.



First, if such experiments improve the state-of-the-art accuracy in unsupervised and semi-supervised text classification as much as this one did in image classification, they'll save researchers lots of time deciding which papers to read. We'll be able to load in millions of unlabelled papers from arXiv.org plus a smaller set of labelled examples, and come up with a 99% accurate filter for a subject class too narrow to get its own journal or conference. (I've already been looking into doing this as a grad student, and the Google study has me seriously considering the neural-net approach. But it'll mean either becoming a data-mining expert or teaming up with one, and hitting a cloud provider up for a CPU-cycle grant.)
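For the curious, here is a rough sketch of what "millions of unlabelled papers plus a smaller set of labelled examples" could look like in code. It is not the Google team's method: it swaps the neural net for a tiny naive Bayes classifier and uses self-training, one simple semi-supervised technique among several. The class and function names (`NaiveBayes`, `self_train`), the 0.7 confidence threshold, and the four-word "papers" are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict_proba(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)
            for w in tokenize(doc):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log(
                        (self.word_counts[c][w] + 1) / (total_words + len(self.vocab))
                    )
            scores[c] = score
        # Turn log scores into probabilities that sum to 1.
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}

    def predict(self, doc):
        probs = self.predict_proba(doc)
        return max(probs, key=probs.get)


def self_train(labelled, unlabelled, threshold=0.7, max_rounds=5):
    """Repeatedly add confidently classified unlabelled docs to the training set."""
    docs = [d for d, _ in labelled]
    labels = [l for _, l in labelled]
    pool = list(unlabelled)
    model = NaiveBayes().fit(docs, labels)
    for _ in range(max_rounds):
        still_unsure = []
        added = 0
        for doc in pool:
            probs = model.predict_proba(doc)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                docs.append(doc)      # trust the model's own guess as a label
                labels.append(best)
                added += 1
            else:
                still_unsure.append(doc)
        if added == 0:
            break
        pool = still_unsure
        model = NaiveBayes().fit(docs, labels)
    return model


# Invented toy data: two labelled "papers" and three unlabelled ones.
labelled = [
    ("neural network deep learning", "ml"),
    ("galaxy star telescope", "astro"),
]
unlabelled = [
    "deep learning classifier training",
    "telescope survey of a distant galaxy",
    "classifier training on text data",
]

model = self_train(labelled, unlabelled)
print(model.predict("text classifier"))
```

The point of the toy: none of the words in "text classifier" appear in the two labelled examples, so a model trained on those alone can only shrug. The unlabelled documents are what connect "classifier" and "text" to the machine-learning class. A real arXiv filter would need far better features, far more data, and a careful check that the self-assigned labels aren't reinforcing early mistakes.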

Second, it can help scientists understand how humans learn to read. Using ANNs as models to study human psychology isn't new. Last year, a study by researchers at Yale and the University of Texas at Austin used the DISCERN neural net to test several hypotheses about the cellular-level mechanism of schizophrenia. We need to know a lot more about human literacy if we're to bridge the digital divide when billions of people still can't read, let alone read a language that has half-decent machine translation from English.

Third, once we're using neural nets bigger than the human visual cortex, it's likely we'll get deeper levels of processing. A network that size will need to render its output in a far more flexible format, which means we'll have to either reinvent the wheel or teach it a human language. To perform at its full potential, it'll also need a higher quality and depth of information than you can find on YouTube (again, science papers from arXiv are a good example). Would you let your kids stay home from school to watch YouTube all day? Garbage in, garbage out.

(Lest there be any confusion, "unsupervised machine learning" doesn't usually refer to an electronic babysitter. It means not telling the classifier which class each of the training examples belongs to.)