PHP Chinese Keyword Matching: Example Code

  
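Keyword matching is a common requirement: filtering sensitive words in comments, bullet-screen (danmaku) messages, and in-game chat all requires matching keywords against a piece of text. Once the keywords have been extracted, they can be processed further.
This class relies on PHP's efficient arrays and the mbstring extension to match Chinese keywords. The main idea is to build a dictionary array that uses each keyword as a key, so that looking up any candidate keyword takes constant time on average.
The code is as follows: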
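<?php
class WordMatcher {
    // Dictionary: keyword => 1, so isset() gives an average O(1) lookup.
    public $dict = [];
    // Length, in characters, of the longest keyword added so far.
    public $wordMaxLen = 0;

    function __construct() {
        if (!extension_loaded('mbstring')) {
            exit('extension mbstring is not loaded');
        }
    }

    function addWord($word) {
        $len = mb_strlen($word, 'utf-8');
        $this->wordMaxLen = max($len, $this->wordMaxLen);
        $this->dict[$word] = 1;
    }

    // wordMaxLen is not recomputed here; it remains a safe upper bound.
    function removeWord($word) {
        unset($this->dict[$word]);
    }

    // Appends every keyword found in $str to $matched, using forward longest
    // matching. By default scanning resumes after the end of each match; with
    // $matchAll = true it resumes one character after the start of the match,
    // so keywords that overlap an earlier match are reported as well.
    function match($str, &$matched, $matchAll = false) {
        if (!is_array($matched)) {
            $matched = [];
        }

        // Scan with a loop instead of recursing once per character, which
        // would nest as deeply as the text is long.
        $total = mb_strlen($str, 'utf-8');
        $pos = 0;
        while ($pos < $total) {
            $matchLen = 0;
            // Try the longest candidate first, never reading past the end of $str.
            for ($len = min($this->wordMaxLen, $total - $pos); $len > 0; $len--) {
                $substr = mb_substr($str, $pos, $len, 'utf-8');
                if (isset($this->dict[$substr])) {
                    $matchLen = $len;
                    $matched[] = $substr;
                    break;
                }
            }

            $pos += (!$matchAll && $matchLen) ? $matchLen : 1;
        }
    }
}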

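$matcher = new WordMatcher;
$matcher->addWord('PHP');
$matcher->addWord('语言');
$matcher->addWord('H');

// Default mode: matches do not overlap.
$matched = [];
$matcher->match('PHP是最好的语言', $matched);
// $matched is now ['PHP', '语言']

// $matchAll = true: 'H' inside 'PHP' is reported as well.
$all = [];
$matcher->match('PHP是最好的语言', $all, true);
// $all is now ['PHP', 'H', '语言']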
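
As an illustration of the "further processing" step mentioned at the start, here is a minimal sketch of masking matched keywords with asterisks. The function maskKeywords is a hypothetical helper, not part of the class above; it rebuilds the same keyword-as-key dictionary internally so that it can run on its own, and it assumes UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper (not part of WordMatcher): replaces every keyword found
// by forward longest matching with '*' characters of the same length.
function maskKeywords(string $text, array $words): string {
    // Same idea as WordMatcher::$dict: keywords become array keys for isset().
    $dict = [];
    $maxLen = 0;
    foreach ($words as $word) {
        $dict[$word] = 1;
        $maxLen = max($maxLen, mb_strlen($word, 'utf-8'));
    }

    $total = mb_strlen($text, 'utf-8');
    $out = '';
    $pos = 0;
    while ($pos < $total) {
        $matchLen = 0;
        // Longest candidate first, bounded by the remaining text.
        for ($len = min($maxLen, $total - $pos); $len > 0; $len--) {
            if (isset($dict[mb_substr($text, $pos, $len, 'utf-8')])) {
                $matchLen = $len;
                break;
            }
        }
        if ($matchLen > 0) {
            // Mask the whole keyword and skip past it.
            $out .= str_repeat('*', $matchLen);
            $pos += $matchLen;
        } else {
            // No keyword starts here: copy one character unchanged.
            $out .= mb_substr($text, $pos, 1, 'utf-8');
            $pos++;
        }
    }
    return $out;
}

echo maskKeywords('PHP是最好的语言', ['PHP', '语言', 'H']), "\n";
// prints: ***是最好的**
```

Because the longest keyword wins at each position, 'PHP' is masked as a whole and the shorter keyword 'H' never gets a chance to match inside it.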
 