How Likely Are You to be Banned From Reddit?

By Max Candocia | April 08, 2020

I created a Reddit bot that gives you a score based on how likely you are to be banned within 90 days. A score of 0 is the highest risk (>= 95%), and the probability decreases by a factor of 3 for every 100 points, up to a maximum score of 1000.
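As a rough illustration of that scale, here is a minimal sketch that converts a score into an approximate probability. It assumes the probability at a score of 0 is exactly 0.95 (the post only says ">= 95%") and that the factor-of-3 rule applies smoothly between multiples of 100. The function name and parameters are made up for this example and are not taken from the bot's code.

```python
def ban_probability(score, base=0.95, factor=3.0, step=100):
    """Approximate 90-day ban probability implied by a score.

    Assumptions: base is the probability at score 0, and the
    probability shrinks by `factor` for every `step` points.
    """
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    return base * factor ** (-score / step)


if __name__ == "__main__":
    for score in (0, 100, 200, 500, 1000):
        print(score, f"{ban_probability(score):.6f}")
```

Under these assumptions, each additional 100 points divides the estimated risk by 3, so the highest scores correspond to very small probabilities.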

Introducing /u/CredditReportingBot

Using the bot is fairly simple. You can either put a request in this thread, with the format

/u/<username> or /u/<username> DETAILED INQUIRY

or you can respond to a post/comment with the name of the bot like this:

/u/CredditReportingBot or /u/CredditReportingBot DETAILED INQUIRY

Here, <username> is the username of the user that you want to investigate. If you add DETAILED INQUIRY to the message, the bot will give you more information about the user's score and the reasoning behind it. By default, you can only make one of these detailed inquiries per 24-hour period.
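To make the request format concrete, here is a minimal sketch of how a message in the first form could be parsed. It is not the bot's actual code: the function name is hypothetical, and the username pattern (3 to 20 letters, digits, underscores, or hyphens) is an assumption about Reddit's naming rules.

```python
import re

# Assumed username rule: 3-20 characters from letters, digits, "_" and "-".
REQUEST_PATTERN = re.compile(
    r"/u/(?P<username>[A-Za-z0-9_-]{3,20})(?![A-Za-z0-9_-])"
    r"(?P<detailed>\s+DETAILED INQUIRY\b)?"
)


def parse_request(text):
    """Return (username, wants_details) for the first request in text, or None."""
    match = REQUEST_PATTERN.search(text)
    if match is None:
        return None
    return match.group("username"), match.group("detailed") is not None


if __name__ == "__main__":
    print(parse_request("/u/some_user"))
    print(parse_request("/u/some_user DETAILED INQUIRY"))
    print(parse_request("no request here"))
```

For the second form, where the bot itself is mentioned, the same pattern would match the bot's name, and the user to score would presumably be the author of the post or comment being replied to.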

Training Data

The data I collected is a sample of ~260,000 Reddit users. I took the sample over the course of December 2019 to January 2020, then checked the same users again from March 2020 to April 2020 to determine which of them had been banned in the meantime. I collected the data with my scraper, described at https://maxcandocia.com/blog/2016/Dec/30/scraping-reddit-data/.

If a user's account was banned, suspended, or deleted, that is considered a ban. I included deletions, as suspended users can delete their accounts, and I don't have a good way of distinguishing those. Also, the Reddit API doesn't indicate if an account was deleted vs. shadowbanned.
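The labeling rule above can be written as a very small function. This is only a sketch: the status strings are hypothetical placeholders for whatever the follow-up check of each account returns.

```python
# Hypothetical status values; any of these counts as a "ban" for training.
BAN_STATUSES = {"banned", "suspended", "deleted"}


def is_ban(status):
    """Return True if an account's follow-up status is treated as a ban."""
    return status.lower() in BAN_STATUSES


if __name__ == "__main__":
    for status in ("active", "suspended", "deleted"):
        print(status, is_ban(status))
```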

Performance

As a warning, the methodology I used for training and testing is a bit hand-wavy: feature construction, model-building, and testing would ideally each use a separate data set, but reducing the size of any of these hurts performance.

I tested the performance of the algorithm using a 5/8 | 3/8 train | test split. The algorithm is a bit tricky to tune, as it takes forever to build many of the features, many of which are "oracle" features that are calculated using tabulations of the training data. The performance is a bit lacking for the lower-risk groups, but it is relatively effective for the higher-risk groups. It does suffer a bit from overfitting, so I will have to adjust the training algorithm in the future to use a separate split set.
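The sketch below illustrates the general idea of a 5/8 | 3/8 split combined with a feature tabulated from the training rows only. It is not the code used for the bot. The "top_subreddit" field, the per-category ban rate, and all function names are assumptions chosen to show what such a tabulated feature might look like.

```python
import random
from collections import defaultdict


def split_train_test(items, train_fraction=5 / 8, seed=0):
    """Shuffle a copy of items and split it into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def tabulate_ban_rates(train_rows, key):
    """Ban rate per value of `key`, computed from training rows only."""
    counts = defaultdict(lambda: [0, 0])  # value -> [banned, total]
    for row in train_rows:
        counts[row[key]][0] += row["banned"]
        counts[row[key]][1] += 1
    return {value: banned / total for value, (banned, total) in counts.items()}


def add_rate_feature(rows, rates, key, default):
    """Return copies of rows with the tabulated rate attached as a feature."""
    name = f"{key}_ban_rate"
    return [{**row, name: rates.get(row[key], default)} for row in rows]


if __name__ == "__main__":
    # Hypothetical toy data: one categorical field and a 0/1 ban label.
    users = [
        {"top_subreddit": "a" if i % 2 == 0 else "b", "banned": int(i % 4 == 0)}
        for i in range(16)
    ]
    train, test = split_train_test(users)
    rates = tabulate_ban_rates(train, "top_subreddit")
    overall = sum(row["banned"] for row in train) / len(train)
    test_features = add_rate_feature(test, rates, "top_subreddit", default=overall)
    print(len(train), len(test))
    print(test_features[0])
```

Because the rates are looked up for the test rows rather than recomputed from them, the test labels never leak into the feature. Applying the same lookup to the training rows themselves is where a feature like this can start to overfit.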

Below is the receiver operating characteristic curve (ROC curve). The blue line describes the training error of the currently used model. The magenta dashed line (training error) and black line (testing error) show what happens when I overfit a model using oracle features. The red dotted line is a simple model that ranks younger accounts as riskier.
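For readers unfamiliar with ROC curves, here is a minimal sketch of how one can be computed for a baseline like the red dotted line, where a younger account gets a higher risk score. The account ages and labels are made-up toy values and the function names are hypothetical. The code assumes that both banned and non-banned users are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    The threshold is swept from the highest score to the lowest;
    labels are 1 for banned users and 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item in a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    account_age_days = [10, 20, 30, 40]  # toy values
    banned = [1, 0, 1, 0]
    risk = [-age for age in account_age_days]  # younger means riskier
    curve = roc_points(risk, banned)
    print(curve)
    print(auc(curve))
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a large gap between a model's training and testing curves is one sign of overfitting.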

Definitions:

The model I am currently using can be described as follows:

The features that were built using all of the data should be more precise/complete, but there is a certain lack of robustness using this method. The only way to evaluate this model will be to test how it performs on a completely new set of data, which will have to come at a later time after creating a new sample.

I do not currently have the code up, as it was built with a lot of backwards-compatibility in mind when I was adding features to this project, and it needs a lot of cleaning.

I hope you enjoy this bot on Reddit!

