sensitive-word

Java ★ 5.9k updated 3mo ago

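👮‍♂️ The sensitive word tool for Java. (Sensitive words / prohibited words / illegal words / profanity. A high-performance Java sensitive-word filtering framework based on the DFA algorithm, with built-in support for classifying and grading words by tag. Please do not post content involving politics, advertising, marketing, censorship circumvention, or violations of national laws and regulations. A high-performance sensitive-word detection and filtering component with traditional/simplified Chinese conversion, full-width/half-width conversion, Chinese-character-to-pinyin conversion, fuzzy search, and other features.)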

A Java library that detects and filters inappropriate or prohibited text (profanity, spam, political terms) in user input, using a 60,000-word Chinese dictionary and evasion-resistant matching for fast, accurate content moderation.

Java · Maven · DFA · setup: easy · complexity 2/5

sensitive-word is a Java library for detecting and filtering prohibited or inappropriate text in user-submitted content. You give it a string, and it can tell you whether any flagged words are present, return which ones it found, or replace them with asterisks or a custom substitution. It is written in Java, documented primarily in Chinese, and targeted at Chinese-language applications.

The library ships with a built-in dictionary of over 60,000 words covering profanity, politically sensitive terms, spam-associated phrases, and other restricted content. The README cites a throughput of over 140,000 checks per second, achieved with a DFA (Deterministic Finite Automaton) algorithm: the dictionary is compiled into a state machine, so the text is matched against every word at once instead of being rescanned for each dictionary word.
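To make the DFA idea concrete, here is a minimal sketch of a trie-based matcher in plain Java. It is not the library's implementation: the class name `TrieWordMatcher` and its methods are hypothetical, chosen only to mirror the check / find / replace operations described above. From each start position it walks the trie and keeps the longest match, so the cost depends on the text length and the longest word, not on how many words the dictionary holds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative trie-based matcher (hypothetical; not the library's own code). */
class TrieWordMatcher {
    /** One automaton state: outgoing transitions plus an "accepting" flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    TrieWordMatcher(List<String> words) {
        for (String word : words) {
            if (word.isEmpty()) continue; // an empty word would match everywhere
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.endOfWord = true;
        }
    }

    /** Returns [start, end) spans: the longest match at each position, non-overlapping. */
    private List<int[]> findSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int matchEnd = -1;
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) break;               // no word continues this way
                if (node.endOfWord) matchEnd = j + 1;  // remember the longest match so far
            }
            if (matchEnd > 0) {
                spans.add(new int[] {i, matchEnd});
                i = matchEnd;                          // resume after the matched word
            } else {
                i++;
            }
        }
        return spans;
    }

    boolean contains(String text) {
        return !findSpans(text).isEmpty();
    }

    List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        for (int[] s : findSpans(text)) found.add(text.substring(s[0], s[1]));
        return found;
    }

    /** Overwrites every matched span with the given mask character. */
    String replace(String text, char mask) {
        char[] chars = text.toCharArray();
        for (int[] s : findSpans(text)) Arrays.fill(chars, s[0], s[1], mask);
        return new String(chars);
    }

    public static void main(String[] args) {
        TrieWordMatcher m = new TrieWordMatcher(Arrays.asList("bad", "badword", "spam"));
        System.out.println(m.findAll("a badword and spam"));
        System.out.println(m.replace("a badword and spam", '*'));
    }
}
```

Keeping the longest match means "badword" is reported as one hit rather than as "bad" followed by unflagged text. A production matcher would also need to decide how overlapping words are handled.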

Beyond exact matches, the tool handles many ways people try to evade filters. It can normalize traditional and simplified Chinese characters to the same form before checking, handle full-width and half-width variants of letters and numbers, convert Chinese characters to their phonetic pinyin spelling, ignore repeated characters (like "heeello"), and skip over special characters inserted between letters. This makes it harder to sneak a flagged word past the filter by disguising it.
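Some of these evasion defenses can be approximated with a normalization pass before matching. The sketch below is an assumption-laden illustration, not how the library does it: `EvasionNormalizer` is a hypothetical name, and it covers only width folding (through Unicode NFKC), case folding, dropping inserted separators, and collapsing repeated characters. Traditional-to-simplified conversion and pinyin conversion need mapping tables and are not shown.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Illustrative pre-match normalizer (hypothetical; not the library's own code). */
class EvasionNormalizer {
    static String normalize(String input) {
        // NFKC folds full-width letters and digits (U+FF01..U+FF5E) to their ASCII forms.
        String folded = Normalizer.normalize(input, Normalizer.Form.NFKC)
                .toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(folded.length());
        for (int i = 0; i < folded.length(); ) {
            int cp = folded.codePointAt(i);
            i += Character.charCount(cp);
            // Drop separators such as '*', '-', '.' or spaces inserted between letters.
            if (!Character.isLetterOrDigit(cp)) continue;
            // Collapse runs of the same character ("heeello" becomes "helo").
            int len = out.length();
            if (len > 0 && out.codePointBefore(len) == cp) continue;
            out.appendCodePoint(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("H-e*e e.LLO"));
    }
}
```

Because repeats are collapsed everywhere, the dictionary words must be passed through the same function so that both sides compare in the same form. This whole-string approach also discards the original character offsets and word boundaries, so a real filter would have to track positions to mask the original text and would need to guard against false positives that span two adjacent words.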

Developers can add their own custom word lists and whitelists (words to never flag), update those lists dynamically at runtime without restarting the application, assign category tags to individual words, and write custom replacement logic so different words get different substitutions. The library also includes detection modes for email addresses, URLs, and IP addresses.
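One common way to get runtime updates, whitelists, tags, and per-word replacements together is to keep the lists in an immutable snapshot and swap it atomically. The sketch below illustrates that pattern under stated assumptions: `ReloadableWordList`, its methods, and the "ad" tag are hypothetical and do not reflect the library's actual API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative hot-reloadable word list (hypothetical; not the library's own code). */
class ReloadableWordList {
    /** Immutable view of the lists; readers never see a half-updated state. */
    private static final class Snapshot {
        final Map<String, String> denyTags; // flagged word -> category tag
        final Set<String> allow;            // whitelist: words that are never flagged

        Snapshot(Map<String, String> denyTags, Set<String> allow) {
            this.denyTags = Collections.unmodifiableMap(new HashMap<>(denyTags));
            this.allow = Collections.unmodifiableSet(new HashSet<>(allow));
        }
    }

    private final AtomicReference<Snapshot> current = new AtomicReference<>(
            new Snapshot(Collections.<String, String>emptyMap(), Collections.<String>emptySet()));

    /** Replaces both lists at once; safe to call while other threads are checking words. */
    void reload(Map<String, String> denyTags, Set<String> allow) {
        current.set(new Snapshot(denyTags, allow));
    }

    /** Returns the word's tag, or null if it is whitelisted or not flagged. */
    String tagOf(String word) {
        Snapshot s = current.get();
        return s.allow.contains(word) ? null : s.denyTags.get(word);
    }

    /** Custom replacement logic: the substitution depends on the word's tag. */
    String replacementFor(String word) {
        String tag = tagOf(word);
        if (tag == null) return word;               // not flagged: leave unchanged
        if ("ad".equals(tag)) return "[ad removed]";
        char[] stars = new char[word.length()];
        Arrays.fill(stars, '*');
        return new String(stars);                   // default: mask with asterisks
    }

    public static void main(String[] args) {
        ReloadableWordList list = new ReloadableWordList();
        Map<String, String> deny = new HashMap<>();
        deny.put("casino", "ad");
        deny.put("damn", "profanity");
        list.reload(deny, Collections.<String>emptySet());
        System.out.println(list.replacementFor("casino") + " " + list.replacementFor("damn"));
    }
}
```

Swapping a whole snapshot rather than mutating shared collections keeps lookups lock-free, at the cost of rebuilding the lists on every reload.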

Installation is via a Maven dependency in a Java project (JDK 1.8 or newer required). A companion admin web interface is available as a separate repository for managing the word lists through a UI.

Where it fits

Add content moderation to a Chinese-language app to automatically block profanity and spam in user submissions.
Replace or flag restricted words in real time without restarting the application by updating word lists dynamically.
Maintain a custom whitelist so legitimate business terms are never falsely blocked by the filter.
Detect email addresses, URLs, and IP addresses in user-submitted text alongside custom word categories.
