Friday, 8 March 2013

My misgivings on machine learning

Machine learning offers the promise of rapid automated 'thinking', but is it well grounded in cognitive science?
As part of the London Data Science meetups I was able to see a demo of scikit-learn, the open source machine learning platform for Python. I understand that machine learning has been a major element of the expansion of search and social networks. Without machine learning, things like sentiment detection, search, and recommendations would not work.

But I have to ask how real these things are. Take the example of Google search. We might say that Google search does a good job of finding what we want on the Internet, but is this true? Google search is essentially measured only against a few other search engines. We don't search the Internet by hand ourselves, so we really don't know whether Google is doing a good job or not; it's just that Google is the best tool we have. The same goes for detecting sentiment in forum posts or making friend recommendations: are these tasks connected to anything real?

There is a real risk that the precision of machine learning could mask its artificial nature: that the results of most machine learning systems are utterly self-referential, establishing facts about things that exist only in the made-up world of the Internet and social media.

I have not been impressed by Facebook's ability to find my friends, or Google's ability to recommend ads I really want to see. I am a bit concerned that machine learning is just becoming a ghost in the machine, a part of the simulation that the web substitutes for reality. People have masses of data and want a cheap way to say something about it, so they run a training set through a random tree model and get sufficient reliability on a test set. But is that how reality really works?
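For concreteness, the train-and-test routine described above might look something like the sketch below. It assumes scikit-learn's RandomForestClassifier stands in for the "random tree model", and it uses synthetic data from make_classification rather than anything real. The point to notice is that the final score only measures agreement with held-out rows drawn from the same data set; it says nothing about whether the labels are connected to anything outside it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "masses of data" (an assumption, not real data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold back a quarter of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a random forest (assumed here to be the "random tree model").
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on the held-out rows: agreement with labels from the same source.
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

A high number here is easy to get and easy to report, which is exactly why it is tempting to mistake it for a statement about the world.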

What about the black swans, the outliers that contain almost all the really important features of life? In my own study of Twitter I find that locations have a high degree of predictability as to how much tweeting will occur there at a given time. But it's the times when the model breaks down, when tweeting is higher or lower than I would forecast, that are really interesting.
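One simple way to look for those breakdowns is to forecast each location and hour from its own history and then flag the observations that fall far from the forecast. The sketch below is only an illustration under stated assumptions: the place names and counts are made-up toy values, the forecast is just the historical mean, and the three-standard-deviation threshold is an arbitrary choice. The names fit_baseline and find_surprises are hypothetical helpers, not part of any library.

```python
from collections import defaultdict
from statistics import mean, stdev

# Toy, made-up observations: (location, hour_of_day, tweet_count).
history = [
    ("Camden", 18, 40), ("Camden", 18, 44), ("Camden", 18, 38), ("Camden", 18, 42),
    ("Soho", 23, 90), ("Soho", 23, 85), ("Soho", 23, 95), ("Soho", 23, 88),
]

def fit_baseline(observations):
    """Return {(location, hour): (mean count, sample std dev)} from past data."""
    groups = defaultdict(list)
    for location, hour, count in observations:
        groups[(location, hour)].append(count)
    return {key: (mean(counts), stdev(counts)) for key, counts in groups.items()}

def find_surprises(baseline, new_observations, threshold=3.0):
    """Return observations more than `threshold` std devs away from the forecast."""
    surprises = []
    for location, hour, count in new_observations:
        expected, spread = baseline[(location, hour)]
        residual = count - expected  # positive means more tweeting than forecast
        if spread > 0 and abs(residual) > threshold * spread:
            surprises.append((location, hour, count, residual))
    return surprises

baseline = fit_baseline(history)

# New (also made-up) observations to compare against the forecast.
today = [("Camden", 18, 41), ("Soho", 23, 160)]
surprises = find_surprises(baseline, today)
print(surprises)
```

In this toy run the Camden reading sits on the forecast and is ignored, while the Soho reading is flagged. The flagged rows are the ones worth a closer look, since they are where the model stopped describing what happened.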
