By Binay Gupta, Algorithm & NLP Engineer
Setting the scene: Suppose you run an online shoe store that receives shoe details from different brands. Your market analysis tells you that, apart from price, a customer looks at five attributes before ordering shoes:
- Brand
- Color
- Size
- Purpose, i.e. whether they’re running shoes or casual shoes
- Material, i.e. rubber, leather, or cloth
The problem is that these details are not present in a structured manner in the sheet your supplier has shared. A small snippet of the dataset, containing details on 10 such shoe variants, is shown below:
| Seq No | Shoe Details |
| --- | --- |
| 1 | Adidas Black Running Shoes 9 |
| 2 | 9 s white running shoe by Puma |
| 3 | 8 size black casual rubber Shoe by Puma |
| 4 | Red tape leather formal office shoe |
| 5 | Blue UCB casual shoes 9 size |
| 6 | Casual 9 s blue shoe by Adidas (2 pairs) |
| 7 | Formal 9 size black shoe by UCB |
| 8 | Casual 8 s black shoe by HRX |
| 9 | Roadstar grey show |
| 10 | 9 size grey trekking show by Woodland. |
Table 1
Now, if you want to extract the value of each of those attributes (when present) separately, how would you handle the problem? Clearly, this is a classification problem where each phrase of the product details belongs to one of the five attributes. We create a sixth category, OTHER, for phrases that don’t belong to any of them (say, ‘by’).
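To make the target concrete, here is a minimal sketch in Python (my own illustration; the tag names SIZE, COLOR, PURPOSE, MATERIAL, BRAND, and OTHER are shorthand I’ve chosen, not fixed by the article) of a labeled sequence for row 3 of Table 1:

```python
# One training example: the tokens of a shoe detail string,
# paired with the attribute tag we want the model to predict.
tokens = ["8", "size", "black", "casual", "rubber", "Shoe", "by", "Puma"]
tags   = ["SIZE", "OTHER", "COLOR", "PURPOSE", "MATERIAL", "OTHER", "OTHER", "BRAND"]

# The classifier's job: given a new token sequence, recover the tag sequence.
for token, tag in zip(tokens, tags):
    print(f"{token:10s} -> {tag}")
```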
Before we look at the nature of the problem, let’s try to understand a few concepts first. (Note: I may use tag, label, and class interchangeably; they all mean the same thing here.)
Sequential vs Discrete Classification Algorithm: While finding the tag for a phrase, we sometimes consider neighboring information (a sequential classification algorithm) and sometimes don’t (a discrete classification algorithm). Consider the English sentence, “dishq understands your taste preferences”. If you want to tag each word with its part of speech (POS), it is better to take the grammar of the English language into account than to determine the POS of each word separately (the discrete classifier’s approach). Consider another example, where you have a set of images of our dog Zoey. Suppose you know that every morning Zoey wakes up, has her meal, and then gets ready to play. Now, even if an image is blurred, you can tell what the dog is doing if you know the sequence of her actions (the sequential classifier’s approach).
Generative vs Discriminative Model: Classification algorithms belong to either the generative or the discriminative family. To calculate the probability of an event X given Y (i.e. P(X|Y)), a generative model calculates the likelihood of event Y given X (i.e. P(Y|X)) and the probability of event X, and uses Bayes’ theorem (i.e. P(X|Y) = P(Y|X) * P(X) / P(Y)) to find the required probability. That means, in order to categorize a data point, a generative algorithm learns how the data is generated. It is like solving a more general problem as an intermediate step to solving the target problem.
A discriminative model, on the other hand, tackles the target problem directly: all it requires is quality data, and it makes almost no assumptions about the distribution of the data. Given enough clean data, a discriminative model typically outperforms a generative one.
Let’s try to understand this with a very well-known example. Suppose you have to predict the gender of a student from their height and weight. Here is your training data.
| Height | Weight | Gender |
| --- | --- | --- |
| 5.9 | 180 | M |
| 5.7 | 172 | F |
| 6 | 192 | M |
| 5.6 | 177 | M |
| 5.5 | 172 | F |
| 5.6 | 170 | F |
Table 2
Now, suppose you have to predict the gender of a student having a weight of 172 units and a height of 5.6 units. A discriminative model estimates the required probability P(gender | height, weight) directly from the data, whereas a generative model (say, Naive Bayes) calculates it differently (for example, read this): it models how height and weight are distributed for each gender, making an independence assumption over the features, and then applies Bayes’ theorem.
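As a hedged illustration (my own sketch, not from the original article), here is how the two approaches look in scikit-learn on the data of Table 2: GaussianNB plays the generative role and LogisticRegression the discriminative one.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative
from sklearn.linear_model import LogisticRegression   # discriminative

# Training data from Table 2: [height, weight] and gender labels.
X = np.array([[5.9, 180], [5.7, 172], [6.0, 192],
              [5.6, 177], [5.5, 172], [5.6, 170]])
y = np.array(["M", "F", "M", "M", "F", "F"])

query = np.array([[5.6, 172]])  # the student we want to classify

# Generative: fits per-class Gaussians P(height, weight | gender),
# then applies Bayes' theorem to get P(gender | height, weight).
gen = GaussianNB().fit(X, y)
print("Naive Bayes:", gen.predict(query), gen.predict_proba(query))

# Discriminative: models P(gender | height, weight) directly.
disc = LogisticRegression().fit(X, y)
print("Logistic regression:", disc.predict(query), disc.predict_proba(query))
```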
A more detailed study on Generative and Discriminative models can be found here.
Let’s go back to the original problem
Looking at the data for the shoe e-commerce store problem (see Table 1), here are a few observations:
- You won’t always get all five attributes.
- The word following ‘by’ is usually a brand name.
- If the (i-2)-th word is a digit, there is a high chance that the i-th word is a color.
- If the (i-3)-th word is a digit, the likelihood of the i-th word being the “purpose” is high.
- If the i-th word is a digit, it’s very likely the shoe size.
All of these observations except the first suggest that we should go for a sequential (also called structured) classification approach: neighboring information will actually be helpful in finding the class of a phrase. The sketch below shows how such observations translate into per-token features.
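As a rough illustration (my own sketch, not code from the article; the feature names are arbitrary), here is how they might be encoded in the list-of-strings feature format that python-crfsuite accepts:

```python
def token_features(tokens, i):
    """Encode the observations above as features for the i-th token.

    Each feature is an arbitrary string; python-crfsuite treats it
    as an indicator whose weight is learned during training.
    """
    word = tokens[i]
    features = [
        "bias",
        "word.lower=" + word.lower(),
        "word.isdigit=%s" % word.isdigit(),        # a digit is likely the size
    ]
    if i > 0:
        features.append("prev.word.lower=" + tokens[i - 1].lower())
        if tokens[i - 1].lower() == "by":
            features.append("follows_by")          # word after 'by' is likely a brand
    if i > 1:
        features.append("prev2.isdigit=%s" % tokens[i - 2].isdigit())  # color cue
    if i > 2:
        features.append("prev3.isdigit=%s" % tokens[i - 3].isdigit())  # purpose cue
    return features

# Feature sequence for one shoe description (row 2 of Table 1):
tokens = ["9", "s", "white", "running", "shoe", "by", "Puma"]
xseq = [token_features(tokens, i) for i in range(len(tokens))]
```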
Should we use a generative or a discriminative model? We tested both the Hidden Markov Model (generative) and the Conditional Random Field (discriminative) on our data. Our data was sufficiently large, and the Conditional Random Field outperformed the Hidden Markov Model (HMM). Let’s try to understand why the CRF (Conditional Random Field) performed better.
CRF: A discriminative sequence classifier for structured prediction.
As for any sequential classifier, the input to a CRF is a sequence of phrases X1, X2, …, Xm, and you have to predict the tag sequence Y1, Y2, …, Ym, one tag per phrase. Like an HMM, it predicts the most probable sequence of tags using the Viterbi algorithm, and the model is trained with gradient-based optimization. The key difference is that in an HMM the tag of the current word depends only on the tag of the previous word, and the current word depends only on the current tag. This means an HMM can only use a small set of local features, whereas a CRF has no such constraint, which makes it more flexible and powerful.
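To see the difference concretely, compare the standard factorizations of the two models (textbook notation, not from the original post). The HMM is generative and factors the joint probability into strictly local terms; the linear-chain CRF is discriminative, conditions on the entire input X, and can use arbitrary feature functions f_k with learned weights λ_k:

```latex
% HMM: generative, only local dependencies
P(X, Y) = \prod_{i=1}^{m} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)

% Linear-chain CRF: discriminative, features may look at all of X
P(Y \mid X) = \frac{1}{Z(X)} \exp\!\Big( \sum_{i=1}^{m} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, X, i) \Big)
```

Here Z(X) is a normalizer that sums over all possible tag sequences. Because the features f_k may inspect the whole input X at every position, a CRF can use cues like “two words back is a digit”, which an HMM cannot express.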
pycrf Toolkit: Among all the libraries I have explored for CRF, pycrf (python-crfsuite) appears to be the best:
- Fast and highly accurate
- Least resource (memory) intensive
- Easy to define features
- Easy installation and use
Installation instructions and documentation for pycrf are available here. You will also find example code for a named entity recognition problem using pycrf here.
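As a minimal sketch of the workflow (building on the token_features helper above; the file name, tiny training set, and hyperparameters are illustrative, while the API calls follow the python-crfsuite documentation):

```python
import pycrfsuite

# Training data: feature sequences (from token_features above) and tag sequences.
train_X = [xseq]  # in practice, many labeled sequences
train_y = [["SIZE", "OTHER", "COLOR", "PURPOSE", "OTHER", "OTHER", "BRAND"]]

trainer = pycrfsuite.Trainer(verbose=False)
for xs, ys in zip(train_X, train_y):
    trainer.append(xs, ys)

trainer.set_params({
    "c1": 1.0,               # L1 regularization
    "c2": 1e-3,              # L2 regularization
    "max_iterations": 50,
})
trainer.train("shoes.crfsuite")   # writes the trained model to disk

# Tagging a new shoe description with the trained model.
tagger = pycrfsuite.Tagger()
tagger.open("shoes.crfsuite")
new_tokens = ["Blue", "UCB", "casual", "shoes", "9", "size"]
new_xseq = [token_features(new_tokens, i) for i in range(len(new_tokens))]
print(list(zip(new_tokens, tagger.tag(new_xseq))))
```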
How we use CRF: We are building the largest, richest, and most diverse recipe database in the world. For each recipe, we have 26 different attributes, which we collect from a variety of sources, and these attributes often arrive in an unstructured form. We use the CRF model to extract the attribute values and later store them in our database.
Endnote: I mentioned earlier that the performance of discriminative models (rather, of any machine learning algorithm) is highly dependent on the quality of the data. Sometimes training data is not clean or is not correctly labeled. In such a scenario, you can use the pseudo-labeling method to clean your training data.
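One simple form of the idea, as a hedged sketch (my own illustration, not the article’s implementation; the function name and the 0.9 confidence threshold are arbitrary): train a model on the trusted labels, then relabel noisy examples only where that model confidently disagrees.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relabel_noisy(clean_X, clean_y, noisy_X, noisy_y, threshold=0.9):
    """Relabel noisy examples where a model trained on trusted data
    confidently disagrees with the existing label.

    The 0.9 threshold is illustrative, not a recommendation.
    """
    model = LogisticRegression().fit(clean_X, clean_y)
    pred = model.predict(noisy_X)
    conf = model.predict_proba(noisy_X).max(axis=1)
    # Keep the original label unless the model is confident AND disagrees.
    fix = (conf >= threshold) & (pred != noisy_y)
    return np.where(fix, pred, noisy_y)
```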