April 08, 2020
I created a Reddit bot that gives you a score based on how likely you are to be banned within 90 days. A score of 0 is the highest risk (>= 95%), and the probability decreases by a factor of 3 for every 100 points, up to a maximum score of 1000.
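As a rough illustration of that scale, here is a minimal sketch that converts a score into an approximate ban probability. It assumes the factor-of-3 rule applies continuously between the 100-point marks and uses 0.95 as the value at a score of 0, even though that figure is stated only as a lower bound; the function name `ban_probability` is made up for this example and is not part of the bot.

```python
def ban_probability(score: float, base: float = 0.95) -> float:
    """Approximate 90-day ban probability implied by a score in [0, 1000].

    Assumption: the probability falls by a factor of 3 per 100 points,
    interpolated smoothly, starting from `base` at a score of 0.
    """
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    return base * 3.0 ** (-score / 100.0)


if __name__ == "__main__":
    for s in (0, 100, 500, 1000):
        print(s, f"{ban_probability(s):.6f}")
```

Under these assumptions, each additional 100 points divides the estimated risk by 3, so a score of 1000 corresponds to roughly 0.95 / 3^10.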
Introducing, /u/CredditReportingBot
Using the bot is fairly simple. You can either put a request in this thread, with the format
/u/<username>
or /u/<username> DETAILED INQUIRY
or you can respond to a post/comment with the name of the bot like this:
/u/CredditReportingBot
or /u/CredditReportingBot DETAILED INQUIRY
Where <username> is the username of the user that you want to investigate. If you add DETAILED INQUIRY to the message, then it will give you more information about a user's score and the reasoning behind it. By default, you can only do one of these detailed inquiries per 24-hour period.
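To make the request formats concrete, here is a small sketch of how such messages could be parsed. This is not the bot's actual code: the `Inquiry` type, the `parse_request` function, and the username pattern (3 to 20 letters, digits, underscores, or hyphens) are assumptions for illustration, as is the idea that summoning the bot by name targets the author of the post or comment being replied to.

```python
import re
from typing import NamedTuple, Optional


class Inquiry(NamedTuple):
    username: str
    detailed: bool


BOT_NAME = "CredditReportingBot"
# Assumption: usernames are 3-20 characters of letters, digits, "_" or "-".
_REQUEST = re.compile(r"/u/([A-Za-z0-9_-]{3,20})(\s+DETAILED INQUIRY)?")


def parse_request(text: str, parent_author: Optional[str] = None) -> Optional[Inquiry]:
    """Extract the target username and whether a detailed inquiry was requested."""
    match = _REQUEST.search(text)
    if match is None:
        return None
    username = match.group(1)
    detailed = match.group(2) is not None
    if username.lower() == BOT_NAME.lower():
        # The bot was summoned by name, so the target is assumed to be
        # the author of the post/comment that was replied to.
        if parent_author is None:
            return None
        username = parent_author
    return Inquiry(username, detailed)


if __name__ == "__main__":
    print(parse_request("/u/some_user"))
    print(parse_request("/u/CredditReportingBot DETAILED INQUIRY", parent_author="other_user"))
```

A real bot would also need to track the one-per-24-hours limit on detailed inquiries, which is left out here to keep the sketch short.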
The data I collected was a sample of ~260,000 users on Reddit, first gathered over December 2019-January 2020 and then checked again over March 2020-April 2020 to determine which users had been banned in the meantime. I used my scraper, described at https://maxcandocia.com/blog/2016/Dec/30/scraping-reddit-data/, to collect the data.
If a user's account was banned, suspended, or deleted, that is considered a ban. I included deletions, as suspended users can delete their accounts, and I don't have a good way of distinguishing those. Also, the Reddit API doesn't indicate if an account was deleted vs. shadowbanned.
As a warning, the methodology I used for training and testing is a bit hand-wavy, as the features, model-building, and testing would ideally require 3 separate sets, but reducing the size of any of these negatively affects the performance.
I tested out the performance of the algorithm using a 5/8 | 3/8 train | test split. The algorithm is a bit tricky to tune, as it takes forever to build many of the features, many of which are "oracle" features that are calculated using tabulations of the training data. The performance is a bit lacking for the lower-risk groups, but it is relatively effective for the higher-risk groups. It does suffer a bit from overfitting, but I will have to adjust the training algorithm in the future to use a separate split set.
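To illustrate what a feature "calculated using tabulations of the training data" might look like, here is a minimal sketch of a 5/8 | 3/8 split followed by one such feature: a smoothed ban rate per subreddit, tabulated on the training users only and then looked up for any user. The data layout (dicts with `subreddits` and `banned` keys), the smoothing weight, and all function names are assumptions for this example; the actual features used by the bot are not described here.

```python
import random
from collections import defaultdict


def split_users(users, train_fraction=5 / 8, seed=0):
    """Shuffle users and split them into (train, test) lists."""
    shuffled = list(users)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_ban_rates(train, prior_weight=10.0):
    """Tabulate a smoothed ban rate per subreddit from training users only."""
    counts = defaultdict(lambda: [0, 0])  # subreddit -> [banned, total]
    for user in train:
        for sub in user["subreddits"]:
            counts[sub][0] += int(user["banned"])
            counts[sub][1] += 1
    overall = sum(int(u["banned"]) for u in train) / len(train)
    # Shrink each subreddit's rate toward the overall rate so that rarely
    # seen subreddits do not produce extreme values.
    rates = {
        sub: (banned + prior_weight * overall) / (total + prior_weight)
        for sub, (banned, total) in counts.items()
    }
    return rates, overall


def oracle_feature(user, rates, overall):
    """Mean training-set ban rate over the user's subreddits."""
    values = [rates.get(sub, overall) for sub in user["subreddits"]]
    return sum(values) / len(values) if values else overall
```

Because the table is built from the same training users the model is later fit on, a training user's own outcome leaks into their feature value. That is one way overfitting of this kind can arise, and it is why a separate split for building the tabulations would be cleaner.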
Below is the receiver operating characteristic curve (ROC curve). The blue line describes the training error of the currently used model. The magenta dashed line (training error) and black line (testing error) show what happens when I overfit a model using oracle features. The red dotted line is a simple model that ranks younger accounts as more risky.
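For readers unfamiliar with how such a curve is built, here is a short sketch that computes ROC points and the area under the curve for the simple baseline that ranks younger accounts as riskier. The labels and account ages are made-up toy values, not data from the sample, and the function names are only illustrative.

```python
def roc_curve(labels, scores):
    """Return (false positive rate, true positive rate) points.

    Users are ranked from highest to lowest risk score; labels are 1 for
    banned and 0 otherwise. Both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all users tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = banned within the window, ages in days.
    banned = [1, 1, 0, 1, 0, 0]
    account_age_days = [3, 10, 40, 200, 900, 2500]
    # Baseline: younger accounts are ranked as riskier, so negate the age.
    baseline_scores = [-age for age in account_age_days]
    print(auc(roc_curve(banned, baseline_scores)))
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a large gap between the training and testing curves of the same model is a sign of overfitting.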
Definitions:
The model I am currently using can be described as follows:
The features that were built using all of the data should be more precise/complete, but there is a certain lack of robustness using this method. The only way to evaluate this model will be to test how it performs on a completely new set of data, which will have to come at a later time after creating a new sample.
I do not currently have the code up, as it was built with a lot of backwards-compatibility in mind when I was adding features to this project, and it needs a lot of cleaning.
I hope you enjoy this bot on Reddit!