Chinese Full-Text Search in MongoDB with Node.js

Background

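MongoDB itself supports full-text search, but unfortunately not for Chinese. Tokenizing English is relatively simple: splitting on whitespace is basically enough, and that is what MongoDB's built-in default tokenizer does. Chinese text, however, is written without spaces between words, so how should it be split? Simple: run it through a Chinese word segmenter first.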

Core idea

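When a Document object is created, add a shadow field for every field that needs a full-text index. A shadow field simply stores the segmented form of the original field.
Then create the full-text index on the shadow fields rather than on the original fields.
When searching, segment the search keyword first as well, otherwise the search results will be poor.
In the tokenizer function, handle Chinese and English separately; otherwise jieba, by default, splits English text character by character, and English searches work badly.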
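
The last point can be illustrated without a database. The following is a minimal sketch, not the article's implementation: `makeDictionaryCutter` is a hypothetical stand-in for a real segmenter such as nodejieba (a greedy longest-match over a tiny hand-written dictionary), and `tokenizeMixed` is an illustrative name. Only the Chinese runs are passed to the segmenter; English runs are kept whole.

```typescript
type Cutter = (text: string) => string[];

// Hypothetical stand-in for a real segmenter: greedy longest-match
// against a small dictionary, falling back to single characters.
function makeDictionaryCutter(words: string[]): Cutter {
    const dict = new Set<string>(words);
    const maxLen = words.reduce((m, w) => Math.max(m, w.length), 1);
    return (text) => {
        const out: string[] = [];
        let i = 0;
        while (i < text.length) {
            let len = Math.min(maxLen, text.length - i);
            while (len > 1 && !dict.has(text.slice(i, i + len))) len--;
            out.push(text.slice(i, i + len));
            i += len;
        }
        return out;
    };
}

// Splits the input into runs of Chinese characters and runs of ASCII letters,
// segments only the Chinese runs, and joins all tokens with single spaces.
function tokenizeMixed(text: string, cut: Cutter): string {
    const runs = text.match(/[\u4e00-\u9fa5]+|[a-zA-Z]+/g) ?? [];
    return runs
        .map((run) => (/^[\u4e00-\u9fa5]+$/.test(run) ? cut(run).join(" ") : run))
        .join(" ");
}

const cut = makeDictionaryCutter(["全文", "搜索", "中文"]);
console.log(tokenizeMixed("MongoDB的中文全文搜索", cut));
// prints "MongoDB 的 中文 全文 搜索"
```

Passing the segmenter in as a parameter keeps the split-by-language logic easy to test on its own; in real use the `cut` argument would wrap the actual segmenter.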

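Before the code, here is the test data as stored in MongoDB after segmentation. The shadow fields titleToken and contentToken are almost identical to the originals for English text (only some punctuation has been stripped), while for Chinese text a space has been inserted between words. For example, "裸辞" is stored as "裸 辞". If the search keyword were not segmented in the same way, a direct search for "裸辞" would find nothing.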
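
Why the keyword must be segmented too can be shown with a deliberately simplified model of term matching. This is only a sketch under stated assumptions: it treats a document as matching when it shares at least one whitespace-separated token with the query, and it ignores the stemming, stop words, and scoring that MongoDB's real text search applies. The names `splitTokens` and `matchesAnyTerm` are illustrative.

```typescript
// Lowercases the input and splits it into whitespace-separated tokens.
function splitTokens(s: string): string[] {
    return s.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
}

// Simplified model: true when the query shares at least one token
// with the stored (already segmented) field.
function matchesAnyTerm(docTokenField: string, query: string): boolean {
    const docTokens = splitTokens(docTokenField);
    return splitTokens(query).some((term) => docTokens.indexOf(term) !== -1);
}

const titleToken = "裸 辞 后 独立 开发 产品 上线 五天 开始 盈利 我 是 怎么 做 的";

console.log(matchesAnyTerm(titleToken, "裸辞"));  // false: no stored token equals "裸辞"
console.log(matchesAnyTerm(titleToken, "裸 辞")); // true: "裸" and "辞" are both stored tokens
```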

All articles:
{
  _id: new ObjectId('672036818456f3ef4799b21f'),
  title: 'Getting Started with TypeScript',
  content: 'TypeScript is a superset of JavaScript...',
  titleToken: 'Getting Started with TypeScript',
  contentToken: 'TypeScript is a superset of JavaScript',
  __v: 0
}
{
  _id: new ObjectId('672036818456f3ef4799b220'),
  title: 'Getting Started with MongoDB',
  content: 'MongoDB is a NoSQL database...',
  titleToken: 'Getting Started with MongoDB',
  contentToken: 'MongoDB is a NoSQL database',
  __v: 0
}
{
  _id: new ObjectId('672036818456f3ef4799b221'),
  title: '裸辞后独立开发产品上线五天开始盈利，我是怎么做的',
  content: 'Learn advanced techniques in MongoDB...',
  titleToken: '裸 辞 后 独立 开发 产品 上线 五天 开始 盈利 我 是 怎么 做 的',
  contentToken: 'Learn advanced techniques in MongoDB',
  __v: 0
}
{
  _id: new ObjectId('672036818456f3ef4799b222'),
  title: 'TypeScript',
  content: 'TypeScript TypeScript TypeScript',
  titleToken: 'TypeScript',
  contentToken: 'TypeScript TypeScript TypeScript',
  __v: 0
}
{
  _id: new ObjectId('672036818456f3ef4799b223'),
  title: 'Mongodb建立文本索引时，会对提取所有文本的关键字建立索引，因而会造成一定的性能问题。所以对于结构化的字段，建议用普通的关系查询，如果需要对大段的文本进行搜索，才考虑用全文搜索。',
  content: 'Learn advanced techniques in MongoDB...',
  titleToken: 'Mongodb 建立 文本 索引 时 会 对 提取 所有 文本 的 关键字 建立 索引 因而 会 造成 一定 的 性能 问题 所以 对于 结构化 的 字 段 建议 用 普通 的 关系 查询 如果 需要 对 大段 的 文本 进行 搜索 才 考虑 用 全文 搜索',
  contentToken: 'Learn advanced techniques in MongoDB',
  __v: 0
}

Full code

import mongoose, { Schema } from "mongoose";
import nodejieba from "nodejieba";
import dotenv from "dotenv";

// Read the MongoDB connection string from the environment
dotenv.config();
const uri = process.env.MONGO_URI || "";

// Define the Article schema
const articleSchema = new Schema({
    title: String,
    content: String,
    titleToken: String,
    contentToken: String,
});

// Create the full-text index on the shadow fields of title and content
articleSchema.index({ titleToken: "text", contentToken: "text" });

// pre('validate') middleware: segment the fields when a Document is created and store the result in the corresponding shadow fields
articleSchema.pre("validate", function (next) {
    if (!this.titleToken) {
        this.titleToken = bigramTokenize(this.title);
    }
    if (!this.contentToken) {
        this.contentToken = bigramTokenize(this.content);
    }
    next();
});

// Create the Article model
const Article = mongoose.model("Article", articleSchema);

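// Tokenizer: segment the Chinese parts with jieba and leave the English parts unchanged
function bigramTokenize(str?: string | null): string {
    // Extract runs of Chinese characters and runs of English letters;
    // punctuation, digits, and whitespace are dropped
    const regex = /([\u4e00-\u9fa5]+|[a-zA-Z]+)/g;
    // match() returns null when nothing matches, so fall back to an empty array
    const tokens = (str ?? "").match(regex) ?? [];

    // Segment the Chinese runs; keep the English runs as they are
    const result = tokens.map((token) => {
        return /^[\u4e00-\u9fa5]+$/.test(token)
            ? nodejieba.cut(token).join(" ")
            : token;
    });
    return result.join(" ");
}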

async function main() {
    try {
        // Connect to MongoDB
        await mongoose.connect(uri, {
            dbName: "test",
            autoIndex: true,
            autoCreate: true,
        });
        console.log("Connected to MongoDB");

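        // Delete all existing data first
        await Article.deleteMany({});
        console.log("Deleted all articles");

        // Then drop all indexes (awaited so that it finishes before they are recreated)
        await Article.collection.dropIndexes();
        console.log("Dropped all indexes");

        // Recreate the indexes from the schema
        await Article.ensureIndexes();
        console.log("ReCreated indexes");

        // Insert some sample data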
        const articles = [
            new Article({
                title: "Getting Started with TypeScript",
                content: "TypeScript is a superset of JavaScript...",
            }),
            new Article({
                title: "Getting Started with MongoDB",
                content: "MongoDB is a NoSQL database...",
            }),
            new Article({
                title: "裸辞后独立开发产品上线五天开始盈利，我是怎么做的",
                content: "Learn advanced techniques in MongoDB...",
            }),
            new Article({
                title: "TypeScript",
                content: "TypeScript TypeScript TypeScript",
            }),
            new Article({
                title: "Mongodb建立文本索引时，会对提取所有文本的关键字建立索引，因而会造成一定的性能问题。所以对于结构化的字段，建议用普通的关系查询，如果需要对大段的文本进行搜索，才考虑用全文搜索。",
                content: "Learn advanced techniques in MongoDB...",
            }),
        ];

        await Article.insertMany(articles);
        console.log("Inserted example articles");

        // Run the full-text search queries
        await searchKeyword("TypeScript");
        await searchKeyword("独立开发产品");
        await searchKeyword("因而会造成一定的性能问题");
        await searchKeyword("我怎么做的");

        // Print all articles currently stored
        const allItems = await Article.find();
        console.log("All articles:");
        allItems.forEach((item) => {
            console.log(item);
        });
    } catch (err) {
        console.error("Error:", err);
    } finally {
        // Close the connection
        await mongoose.connection.close();
        console.log("Disconnected from MongoDB");
    }
}

async function searchKeyword(searchKeyword: string) {
    const filteredResult = await Article.aggregate([
        // The search keyword must be segmented as well
        {
            $match: {
                $text: { $search: bigramTokenize(searchKeyword) },
            },
        },
        // Fields to keep in the search results
        {
            $project: {
                title: 1,
                content: 1,
                score: { $meta: "textScore" },
            },
        },
        // Keep only articles whose match score is greater than 1
        {
            $match: {
                score: { $gt: 1 },
            },
        },
        // Sort by match score
        {
            $sort: {
                score: { $meta: "textScore" },
            },
        },
    ]);

    console.log(`Search results for keyword "${searchKeyword}":`);
    filteredResult.forEach((article: any) => {
        console.log(article);
    });
}

main();

Search results from the test run

As you can see, the results are basically as expected.

Search results for keyword "TypeScript":
{
  _id: new ObjectId('672036b4f9e3829bff610653'),
  title: 'TypeScript',
  content: 'TypeScript TypeScript TypeScript',
  score: 2.85
}
{
  _id: new ObjectId('672036b4f9e3829bff610650'),
  title: 'Getting Started with TypeScript',
  content: 'TypeScript is a superset of JavaScript...',
  score: 1.3333333333333333
}
Search results for keyword "独立开发产品":
{
  _id: new ObjectId('672036b4f9e3829bff610652'),
  title: '裸辞后独立开发产品上线五天开始盈利，我是怎么做的',
  content: 'Learn advanced techniques in MongoDB...',
  score: 1.6
}
Search results for keyword "因而会造成一定的性能问题":
{
  _id: new ObjectId('672036b4f9e3829bff610654'),
  title: 'Mongodb建立文本索引时，会对提取所有文本的关键字建立索引，因而会造成一定的性能问题。所以对于结构化的字段，建议用普通的关系查询，如果需要对大段的文本进行搜索，才考虑用全文搜索。',
  content: 'Learn advanced techniques in MongoDB...',
  score: 4.411005434782609
}
Search results for keyword "我怎么做的":
{
  _id: new ObjectId('672036b4f9e3829bff610652'),
  title: '裸辞后独立开发产品上线五天开始盈利，我是怎么做的',
  content: 'Learn advanced techniques in MongoDB...',
  score: 2.1333333333333333
}
{
  _id: new ObjectId('672036b4f9e3829bff610654'),
  title: 'Mongodb建立文本索引时，会对提取所有文本的关键字建立索引，因而会造成一定的性能问题。所以对于结构化的字段，建议用普通的关系查询，如果需要对大段的文本进行搜索，才考虑用全文搜索。',
  content: 'Learn advanced techniques in MongoDB...',
  score: 1.0740489130434783
}
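
The second hit for "我怎么做的" most likely shares only the single-character token "的" with the query, assuming the keyword is segmented as 我 / 怎么 / 做 / 的 (the article does not show the segmented query). One possible mitigation, sketched here under that assumption, is to drop common function words from the segmented query before searching. The stop-word list and the name `removeStopWords` are hypothetical.

```typescript
// Hypothetical stop-word list for illustration; a real list would be longer.
const STOP_WORDS = new Set<string>(["的", "了", "在"]);

// Drops stop words from an already-segmented, space-separated query.
// If every term is a stop word, the original terms are returned unchanged
// so that the search does not run with an empty string.
function removeStopWords(tokenized: string, stopWords: Set<string>): string {
    const terms = tokenized.split(/\s+/).filter((t) => t.length > 0);
    const kept = terms.filter((t) => !stopWords.has(t));
    return kept.length > 0 ? kept.join(" ") : terms.join(" ");
}

console.log(removeStopWords("我 怎么 做 的", STOP_WORDS)); // prints "我 怎么 做"
```

Filtering only the query keeps the stored shadow fields untouched, so the list can be tuned without reindexing; whether it improves results on real data would need to be measured.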

Future improvements

The tokenizer function could cache its results so that the same string is not segmented more than once.
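The match score needs more study, because in some cases extra records show up in the results. For example, searching for "我怎么做的" also returns the "Mongodb建立文本索引时" record, which at first glance has nothing to do with the keyword. Judging from the stored tokens, that record does contain the single-character token "的", and $text treats space-separated terms as a logical OR, so one shared common word is enough to produce a match.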
The current implementation is based on mongoose. I actually prefer prisma, but prisma does not seem to support MongoDB full-text indexes, which is a bit awkward.
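
For the first item above, caching could look roughly like the following. This is a minimal sketch, not part of the original code: `memoize` is an illustrative helper with a size bound (the default of 1000 entries is an arbitrary assumption), and `slowTokenize` is a stand-in for the real tokenizer.

```typescript
// Wraps a string-to-string function with a bounded in-memory cache.
function memoize(
    fn: (text: string) => string,
    maxEntries = 1000
): (text: string) => string {
    const cache = new Map<string, string>();
    return (text) => {
        const hit = cache.get(text);
        if (hit !== undefined) return hit;
        const value = fn(text);
        // Evict the oldest entry once the cache is full (Map keeps insertion order).
        if (cache.size >= maxEntries) {
            const oldest = cache.keys().next().value;
            if (oldest !== undefined) cache.delete(oldest);
        }
        cache.set(text, value);
        return value;
    };
}

// Stand-in tokenizer that counts how often it actually runs.
let calls = 0;
const slowTokenize = (text: string): string => {
    calls++;
    return text.split("").join(" ");
};

const cachedTokenize = memoize(slowTokenize);
cachedTokenize("全文搜索");
cachedTokenize("全文搜索");
console.log(calls); // prints 1: the second call is served from the cache
```

A cache like this mainly helps with repeated search keywords; document fields are usually segmented only once, when they are saved.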