Error Analysis in Machine Learning
What follows is an error analysis for a machine learning problem. Consider a visual problem that humans can solve easily, such as identifying cats in images or picking out your own face from a dataset. The discussion mainly applies to the neural network class of algorithms, since image classification works best with neural networks: they are good at handling large amounts of data.
Why error analysis in machine learning?
An initial ML model, developed in the first iteration of your development phase, will have a certain accuracy. The challenge now is to improve it.
Instead of trying random fixes suggested by other developers, the right place to start can be analysing the errors manually.
There are two approaches.
Approach 1: Analyse misclassified images and group the errors into categories such as poor lighting, use of filters, very small subjects, and so on. Then brainstorm solutions and make algorithmic improvements for the categories with the most impact, for example correcting the lighting of images, removing filters, or zooming and cropping when the resolution allows (a rough tallying sketch follows below).
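As a rough illustration of how such a tally might be kept (the category names and the data layout here are my own assumptions, not a prescribed format), a short Python sketch:

```python
from collections import Counter

# Hypothetical tally: one entry per manually reviewed misclassified
# image, with the error categories assigned during review.
misclassified = [
    {"id": 1, "categories": ["poor_lighting"]},
    {"id": 2, "categories": ["filter", "poor_lighting"]},
    {"id": 3, "categories": ["too_small"]},
    # ... one entry per reviewed image
]

counts = Counter()
for image in misclassified:
    counts.update(image["categories"])

# Rank categories by how many errors they account for, to decide
# which improvement (fix lighting, remove filters, crop) pays off most.
total = len(misclassified)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} ({100 * n / total:.0f}% of errors)")
```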
Approach 2: Analyse misclassified images and correct the labels of both the dev (evaluation) set and the test set. For instance, if an image shows a dog but is labelled as a cat, fix the label. There will of course be human labelling errors alongside the model's errors.
Note that correcting labels on the dev and test sets alone can bias them relative to the training data. We have to apply the same correction process to both the dev set and the test set so that they continue to come from the same distribution; a minimal sketch of that follows.
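Here the label dictionaries and the corrections mapping are toy assumptions for illustration:

```python
# Toy labels keyed by image id; in practice these come from your dataset.
dev_labels = {42: "cat", 43: "cat"}
test_labels = {107: "dog", 108: "cat"}

# Corrections found during manual review, e.g. image 42 is really a dog.
corrections = {42: "dog", 108: "dog"}

def apply_corrections(labels, corrections):
    """Return a copy of labels with the reviewed corrections applied."""
    fixed = dict(labels)
    for image_id, true_label in corrections.items():
        if image_id in fixed:
            fixed[image_id] = true_label
    return fixed

# Apply the same corrections to both sets so they stay consistent.
dev_labels = apply_corrections(dev_labels, corrections)
test_labels = apply_corrections(test_labels, corrections)
```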
For bigger dev sets, say 5000 images, it is difficult to follow the above approach manually, so split the dev set into Set-1 (500 images) and Set-2 (4500 images). At a 20 percent error rate, Set-1 will contain roughly 100 misclassified images, enough to analyse by hand. Analyse the smaller Set-1 manually and make improvements to the algorithm, then check the results again on both Set-1 and Set-2. If the results improve only on Set-1, you have overfit to it: the changes help only the specific images you examined, and on any unseen image the model will not fare better in its prediction task. So overfitting is a real problem when we are developing a model.
So what can be done? Fold the images you overfit to back into the pool, draw another 500 images, and repeat the process, as sketched below.
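Here is a minimal sketch of that split-analyse-resample loop; the dev set contents and the error function are stand-ins so the example runs on its own:

```python
import random

# The dev set contents and model_error are stubs for illustration;
# in practice dev_set holds real (image, label) pairs and model_error
# evaluates the current model on a subset.

dev_set = [(f"img_{i}.jpg", "cat") for i in range(5000)]  # toy placeholder

def model_error(subset):
    """Stub: pretend 20% of any subset is misclassified."""
    return 0.20

def split_dev(data, eyeball_size=500, seed=0):
    """Randomly split into a small Set-1 for manual review and a
    larger Set-2 that is never inspected by hand."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return shuffled[:eyeball_size], shuffled[eyeball_size:]

set_1, set_2 = split_dev(dev_set)

# ... manually analyse Set-1 and improve the algorithm ...

# If error drops on Set-1 but not on Set-2, the fixes overfit Set-1;
# fold Set-1 back into the pool and draw a fresh 500 images.
if model_error(set_1) < model_error(set_2):
    set_1, set_2 = split_dev(set_1 + set_2, seed=1)
```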
While at the beginning I asked you to consider this approach only for image recognition, the reason is that any machine learning model is initially compared against human performance.
If the task we intend to automate is easy for humans, development of the machine learning model will be faster.
The primary reason is the easier availability of labelled data: a human knows what is right and what is wrong, and obviously recognises a cat better than a computer does.
Human intuition also helps throughout data acquisition, cleaning, and algorithm development, since a person can solve the problem without a computer in the first place.
You can also define an optimal error rate: the lowest error a human achieves forms the baseline the model should reach, as in the small example below.
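A tiny sketch of that comparison (all numbers are illustrative assumptions):

```python
# Illustrative numbers only: the best human error rate stands in for
# the optimal (lowest achievable) error, and model_error is assumed.
human_errors = {"typical_person": 0.03, "expert": 0.01}
optimal_error = min(human_errors.values())  # best human as the baseline

model_error = 0.08
gap = model_error - optimal_error
print(f"Baseline (best human error): {optimal_error:.2%}")
print(f"Gap the model still has to close: {gap:.2%}")
```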
For a task that is difficult even for humans, where we look to machine learning to solve it, we lose the above benefits and progress becomes slower; detecting banking fraud and cybercrime are examples. Yet machine learning has become effective in certain areas like marketing, with numerous applications in social media, for example improving sales through ad recommendations and post recommendations.
Thank you!