Modifying Naive Bayes for Text Classification

By Max Candocia


July 02, 2017

Recently I was working on a news article classification system. The API I created allowed users to submit articles, along with labels (e.g., sports/entertainment/golf) associated with the articles. Because articles can have more than one label, users could also submit counterexamples for any given label. I also supplied 100 random counterexamples to new categories to allow the system to start working as soon as positive examples were provided.

For each label, the system created a Naive Bayes model that run on each incoming article so that it would return all applicable, machine-learned labels to the user. I stored all the word counts for each category in a table, and I created a materialized view to store the logarithm of the probability odds for each keyword in each category for fast computation and easy modification.

Additionally, I added 1 to the count of each word in each category for a smoothing effect, and if a word was missing from the positive examples or counterexamples, then I would set the probability to 1./(larger of the positive example or counter example sum of word counts). This way, if a word did not exist in either set, the effects on the overall probability would cancel out.

Keywords were ngrams of lengths 1, 2, and 3, and were weighted by

Lastly, only n-grams that appeared at least 2 times (or in the title) were selected.

When I ran the models, I noticed a few things:

  1. The cost of a false positive was much greater than the cost of a false negative. If a few articles don't get a certain tag, it's not a big deal, but if baseball articles are being tagged as football, that can become very annoying
  2. Some labels did not need many examples to perform very well, whereas other labels had much difficulty even after hundreds of examples
  3. In many cases, articles that should be labeled a certain way are not labeled because of one keyword, but because of many keywords

Because of the above issues, I decided to test the following modifications to the model:

  1. The most impactful n-gram of each article is dropped at classification time in order to avoid one single term dominating.
  2. The positive effect (i.e., the logarithm of the odds of the probabilities) of any individual n-gram is limited to a specific value (aka n-cap, where n is the maximum value the log probability odds can have), or is capped at the square root of the original value (root-cap)
  3. Because labels can require different models, each should be tested using 5-fold cross-validation, and different probability cutoffs should be tested. This also makes estimating the label priors much less important


For the first category, which I would describe as semi-specific, note that both limiting the effect of the strongest keyword and capping its effect produce desirable results when allowing for low sensitivities when the false positive rate must be very low

For the second category, which I would describe as specific, there is a dramatic performance increase when omitting and/or limiting the most impactful keyword from each article.

For the third category, which I would describe as very broad, there was not much benefit gained from this, as the features were not sufficient enough to discriminate between articles belonging and not belonging to that label.

For the fourth category, which I would describe as extremely specific, and containing a wide variety of person's names, the features were fairly sparse, but enough to create odd-looking valleys in the graphs due to the discrete nature of the allowed probabilities for estimates with a given set of training data.

For the fifth category, which I would describe as broad, there is a benefit to reducing impact, but only for low sensitivities. This is largely a result of n-grams not being able to sufficiently describe a label.


You may have noticed the n-cap columns are the same whether or not the most significant n-gram is omitted. This was due to a bug in the system when I collected the data, and I currently do not have access to the system to acquire new data. The bug has been fixed, though.

Additionally, priors were initially estimated at 0.1, but slowly changed upwards or downwards as positive or negative feedback were added, respectively.


Recommended Articles

Exploring Data with Generative Clustering

Clustering is a common machine learning technique applied to data for a variety of reasons, including dimensionality reduction, finding similar objects, and discovering important features. Here I demonstrate how generative clustering can be used with the Iris data set.

What Time Should You Post to Reddit? (Part 2)

When is the ideal time to post to Reddit? Using a large sample of Reddit posts from Google BigQuery and elastic-net regression techniques, I take a closer look at post scoring patterns across Reddit.