July 02, 2017
Recently I was working on a news article classification system. The API I created let users submit articles along with associated labels (e.g., sports, entertainment, golf). Because articles can have more than one label, users could also submit counterexamples for any given label. I also supplied 100 random counterexamples to each new category so the system could start working as soon as positive examples were provided.
For each label, the system created a Naive Bayes model that ran on each incoming article, so that it would return all applicable, machine-learned labels to the user. I stored the word counts for each category in a table, and I created a materialized view holding the logarithm of the probability odds for each keyword in each category, for fast computation and easy modification.
Additionally, I added 1 to the count of each word in each category for a smoothing effect, and if a word was missing from the positive examples or the counterexamples, I set its probability to 1 / (the larger of the positive-example or counterexample total word count). This way, if a word existed in neither set, its effects on the overall probability would cancel out.
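The per-keyword log odds described above can be sketched as follows. The function and dictionary names are hypothetical (the post keeps these values in a database table with a materialized view rather than computing them per request), but the smoothing and fallback logic follow the description directly:

```python
import math

def keyword_log_odds(word, pos_counts, neg_counts):
    """Log probability odds of one keyword for one label.

    pos_counts / neg_counts map keywords to their counts in the positive
    examples and counterexamples. Denominators follow the post's
    description (the set's total word count), not standard Laplace
    smoothing with a vocabulary term.
    """
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    # Fallback probability for a word absent from a set:
    # 1 / (larger of the two total word counts).
    fallback = 1.0 / max(pos_total, neg_total)

    # Add-1 smoothing when the word is present; fallback otherwise.
    p_pos = (pos_counts[word] + 1) / pos_total if word in pos_counts else fallback
    p_neg = (neg_counts[word] + 1) / neg_total if word in neg_counts else fallback

    # If the word appears in neither set, p_pos == p_neg and its
    # contribution to the article's total log odds is exactly zero.
    return math.log(p_pos / p_neg)
```

A word seen only in the positive examples pushes the score up, one seen only in counterexamples pushes it down, and an unseen word contributes nothing.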
Keywords were ngrams of lengths 1, 2, and 3, and were weighted by
Lastly, only n-grams that appeared at least twice (or appeared in the title) were selected.
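The feature extraction step can be sketched like this. The tokenization (simple lowercased whitespace splitting) is an assumption, as the post does not specify a tokenizer, and the weighting scheme mentioned above is omitted since it is not spelled out:

```python
from collections import Counter

def extract_keywords(title, body, max_n=3, min_count=2):
    """N-grams of lengths 1-3 from an article; keep those appearing at
    least min_count times in the body, plus any n-gram from the title.
    Hypothetical sketch, not the system's actual code.
    """
    def ngrams(tokens):
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield " ".join(tokens[i:i + n])

    title_grams = set(ngrams(title.lower().split()))
    body_counts = Counter(ngrams(body.lower().split()))
    frequent = {g for g, c in body_counts.items() if c >= min_count}
    return frequent | title_grams
```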
When I ran the models, I noticed a few things:
Because of the above issues, I decided to test the following modifications to the model:
For the first category, which I would describe as semi-specific, note that both limiting the effect of the strongest keyword and capping its effect produce desirable results at low sensitivities, where the false positive rate must be kept very low.
For the second category, which I would describe as specific, there is a dramatic performance increase when omitting and/or limiting the most impactful keyword from each article.
For the third category, which I would describe as very broad, there was not much benefit gained from this, as the features were not sufficient to discriminate between articles belonging and not belonging to that label.
For the fourth category, which I would describe as extremely specific and containing a wide variety of people's names, the features were fairly sparse, but enough to create odd-looking valleys in the graphs due to the discrete nature of the allowed probabilities for estimates with a given set of training data.
For the fifth category, which I would describe as broad, there is a benefit to reducing impact, but only for low sensitivities. This is largely a result of n-grams not being able to sufficiently describe a label.
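The two modifications compared above, omitting the single most impactful keyword and capping each keyword's contribution, can be sketched as a scoring function. The function name, parameters, and exact cap semantics are assumptions for illustration, not the post's actual implementation:

```python
def article_score(log_odds_list, cap=None, omit_top=False):
    """Total log odds for an article from its keywords' log odds.

    omit_top drops the single keyword with the largest absolute
    contribution; cap clamps every keyword's contribution to [-cap, cap].
    Both are hypothetical renderings of the tested modifications.
    """
    # Sort by absolute impact so the strongest keyword comes first.
    scores = sorted(log_odds_list, key=abs, reverse=True)
    if omit_top and scores:
        scores = scores[1:]  # drop the most impactful keyword
    if cap is not None:
        scores = [max(-cap, min(cap, s)) for s in scores]
    return sum(scores)
```

With either option enabled, a single dominant keyword can no longer swamp the rest of the article's evidence, which matches the behavior observed for the specific categories.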
You may have noticed the n-cap columns are the same whether or not the most significant n-gram is omitted. This was due to a bug in the system when I collected the data, and I currently do not have access to the system to acquire new data. The bug has been fixed, though.
Additionally, priors were initially estimated at 0.1, but slowly moved upwards or downwards as positive or negative feedback, respectively, was added.
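The post does not give the exact update rule, only that priors start at 0.1 and drift slowly with feedback, so the fixed step size and clamping bounds below are purely illustrative assumptions:

```python
def update_prior(prior, positive_feedback, step=0.01, lo=0.01, hi=0.99):
    """Nudge a label's prior up on positive feedback, down on negative.

    step, lo, and hi are assumed values; the post only specifies the
    starting prior (0.1) and the direction of the adjustment.
    """
    prior += step if positive_feedback else -step
    # Keep the prior a valid, nonzero probability.
    return min(hi, max(lo, prior))
```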