Enter the email address associated with your account, and we'll email you a link to reset your password.
In that particular paper circular patches represented with SIFT descriptors are used as 'codewords' and clustered to build a vocabulary/dictionary. Images are then represented using that vocabulary. The fact that the patches are circular rather than square makes them rotationally invariant. Furthermore, SIFT descriptors are one of the state-of-the-art methods of describing shape and are scale-invariant. This allows a more descriptive vocabulary to built.
At this stage, you basically have a set of images described by their most prominent shape features in the form of feature vectors. You can then use any algorithms to classify those. A good choice is SVM as it weighs the features based on how much they affect the decision boundary between classes. Thus, if you investigate your top support vectors, you could find not necessarily the common characteristics in a particular class, but rather, the characteristics that differentiate that class from all the other classes. I can see that the DictionaryLearning algorithms also tries to optimize the features used so that could be a good choice too.
This method could improve results in difficult cases where there is large intra-class variance, as you described. The complex descriptors would make the algorithm more robust and accurate in general.