nsfw_data_scraper

Shell ★ 13k updated 2y ago

Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier

A Docker-based toolkit of shell scripts for collecting a labeled image dataset and training a content moderation classifier that distinguishes explicit from safe-for-work images across five categories.

Shell · Docker · Python · fastai · Jupyter · setup: moderate · complexity: 3/5

This repository contains a set of shell scripts for collecting a large image dataset to train an image classifier that distinguishes between explicit and non-explicit content. The classifier is designed around five categories: pornography, hentai-style drawings, sexually explicit but non-pornographic images, neutral everyday images, and safe-for-work drawings.

The scripts are numbered and meant to be run in order. The first script collects URLs of images from various sources, primarily Reddit, using a tool called Ripme that can scrape image galleries from supported websites. A pre-collected set of URLs is already included in the repository, so you can skip that step unless you want to change the sources. Subsequent scripts download the actual images from those URLs, optionally pull in additional safe-for-work image datasets from existing public collections, and then split everything into training and test folders organized by category.
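The final split step can be pictured with a short Python sketch. This is not the repository's own script (those are shell scripts); the helper name `split_by_category`, the `<raw_dir>/<category>/` input layout, and the 10% default test fraction are assumptions made purely for illustration.

```python
import random
import shutil
from pathlib import Path


def split_by_category(raw_dir, out_dir, test_fraction=0.1, seed=0):
    """Copy raw_dir/<category>/* into out_dir/train/<category> and out_dir/test/<category>.

    Hypothetical helper: the directory layout and the default fraction are
    assumptions, not taken from the repository.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    counts = {}
    for category_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in category_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_test = int(len(files) * test_fraction)  # first n_test shuffled files go to test
        for i, src in enumerate(files):
            subset = "test" if i < n_test else "train"
            dest = out_dir / subset / category_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest / src.name)
        counts[category_dir.name] = {"train": len(files) - n_test, "test": n_test}
    return counts
```

Calling `split_by_category("raw_data", "data")` would, under these assumptions, produce `data/train/<category>/` and `data/test/<category>/` folders and return the per-category file counts. Shuffling before splitting avoids putting all images from one source into the same subset.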

The whole collection process runs inside Docker, which handles the required tools so you do not need to install them manually. The README warns that the download can take several hours and suggests leaving it running overnight.

Once the data is collected, a Jupyter notebook is included for training a convolutional neural network on the images using the fastai library. The author reports reaching 91% accuracy with this approach. The README also notes that the dataset is noisy, meaning some images may be miscategorized, and that certain categories (drawings versus hentai, and pornography versus sexy) are more likely to be confused with each other.
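One way to check which categories get confused is to count (true, predicted) label pairs on the test set. The sketch below is a plain-Python illustration, not code from the included notebook; the function names and the toy labels in the usage note are made up for the example.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def accuracy(pairs):
    """Fraction of samples whose predicted label equals the true label."""
    total = sum(pairs.values())
    correct = sum(n for (true, pred), n in pairs.items() if true == pred)
    return correct / total if total else 0.0


def most_confused(pairs, top=3):
    """Return the largest off-diagonal cells as ((true, predicted), count)."""
    errors = [(pair, n) for pair, n in pairs.items() if pair[0] != pair[1]]
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:top]
```

For instance, with toy labels such as `true = ["hentai", "hentai", "sexy"]` and `pred = ["hentai", "drawings", "pornography"]`, `most_confused(confusion_counts(true, pred))` would list the hentai/drawings and sexy/pornography mix-ups, which are the pairs the README flags as likely to be confused.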

This is a data collection and training toolkit for researchers or developers building content moderation systems.

Where it fits

Download and organize a pre-labeled image dataset split into training and test sets for content moderation research.
Train a convolutional neural network classifier on explicit vs. safe-for-work content using the included notebook; the author reports about 91% accuracy with this approach.
Build a custom content filter by adding new image sources to the URL collection step and rerunning the pipeline.
