
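This article presents a lightweight text-anonymization approach built on extensible word lists and regex preprocessing. It is designed for customer-service emails and conversation transcripts: it removes an opening salutation such as "Dear" or "Hello" together with the honorific (Mr., Ms., and so on) and personal name that follow it, while keeping product names, place names, and other business-relevant content. It needs neither NER nor a CRM database.
In customer-conversation anonymization, applying named entity recognition (NER) directly tends to delete product names, brands, or locations by mistake, and maintaining an exhaustive stop list of personal names is not sustainable. A more robust strategy combines pattern recognition with structured rules: use common salutations, honorifics, and context separators (commas, line breaks) to build a maintainable rule chain that locates and removes the fixed structure "salutation + honorific + name" and keeps everything that follows.
Two implementations are given below, the second building on the first. Both are case-insensitive and are meant to tolerate real-world noise such as redundant whitespace and punctuation attached to words. The default word lists are English, so German text requires German entries in the lists: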
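✅ Scheme 1: Enhanced rule-based parsing (recommended for production)
This version handles compound salutations (e.g. "Good morning"), two-part names ("Lisa Martin"), hyphenated surnames ("Duncan-Jones"), and a marker for the first comma, which makes it fairly robust: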
    import re

    # Placeholder token for the first comma. A private-use character is used
    # instead of "#" so that text such as "ticket #123" is never rewritten.
    COMMA = "\ue000"


    def anonymize_salutation(text: str,
                             salutations_1=None,
                             salutations_2=None,
                             honorifics=None) -> str:
        if not isinstance(text, str):
            return text  # non-string input (e.g. NaN) is returned unchanged
        if salutations_1 is None:
            salutations_1 = ["dear", "dearest", "hello", "hi", "hiya",
                             "greetings", "salutations", "ok", "good", "my"]
        if salutations_2 is None:
            salutations_2 = ["morning", "day", "afternoon", "evening",
                             "there", "dear", "dearest"]
        if honorifics is None:
            honorifics = ["mr", "mrs", "dr", "ms", "sir", "master"]

        # Preprocess: turn the first comma into a standalone token so it
        # cannot stick to a name ("Martin," or "Martin,Hope")
        marked = re.sub(r'^([^,]*),', rf'\1 {COMMA} ', text.strip(), count=1)
        # Split on any whitespace (handles repeated spaces and tabs)
        words = marked.split()

        # Only act on text that opens with a salutation; anything else is
        # returned untouched so that ordinary sentences lose no words
        if not words or words[0].lower() not in salutations_1:
            return text

        # Step 1: skip a one- or two-word salutation ("Good morning" -> 2 words)
        i = 1
        if i < len(words) and words[i].lower() in salutations_2:
            i += 1
        # Step 2: skip an honorific, with or without a trailing period ("Mr", "Dr.")
        if i < len(words) and words[i].lower().rstrip('.') in honorifics:
            i += 1
        # Step 3: skip the first name, unless the comma comes first ("Hello, ...")
        if i < len(words) and words[i] != COMMA:
            i += 1
        # Step 4: drop the comma token, or a second name directly followed by it
        # ("Martin <comma> Hope ..." -> "Hope ...")
        remaining = words[i:]
        if remaining and remaining[0] == COMMA:
            remaining = remaining[1:]
        elif len(remaining) > 1 and remaining[1] == COMMA:
            remaining = remaining[2:]
        # Restore the comma if it survived, then re-join
        return ' '.join(remaining).replace(' ' + COMMA, ',').strip()


    # Example calls
    texts = [
        "Dear mrs chan Blah blah blah",
        "Good morning Ms Daisy Martin, Hope you are well.",
        "Hi there seema hows things?"
    ]
    for t in texts:
        print(f"'{t}' → '{anonymize_salutation(t)}'")

Sample output:

    'Dear mrs chan Blah blah blah' → 'Blah blah blah'
    'Good morning Ms Daisy Martin, Hope you are well.' → 'Hope you are well.'
    'Hi there seema hows things?' → 'hows things?'
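Because the word lists are plain data, the same "salutation + honorific + name" structure can also be compiled into a single regex, which makes it easy to swap in another language. The following is a minimal sketch under stated assumptions, not part of the solution above: the German lists and the names `build_salutation_pattern` and `strip_salutation` are hypothetical and chosen only for illustration. The helper also reports whether anything matched, so unmatched lines can be logged.

```python
import re

# Hypothetical German word lists (assumptions; extend them for your own data).
SALUTATIONS_DE = ["sehr geehrte", "sehr geehrter", "liebe", "lieber",
                  "hallo", "guten tag", "guten morgen"]
HONORIFICS_DE = ["herr", "frau", "dr", "prof"]


def build_salutation_pattern(salutations, honorifics):
    """Compile a start-anchored, case-insensitive regex from two word lists."""
    def alternation(items):
        # Escape every word; allow any whitespace inside multi-word entries.
        return "|".join(r"\s+".join(re.escape(w) for w in item.split())
                        for item in items)

    return re.compile(
        rf"^\s*(?:{alternation(salutations)})\b"        # salutation
        rf"(?:\s+(?:{alternation(honorifics)})\b\.?)*"  # honorifics, e.g. "Herr Dr."
        r"(?:\s+[^\s,]+)?"                              # first name token, if any
        r"(?:\s+[^\s,]+(?=,))?"                         # second token only if a comma follows
        r"\s*[,.!]?\s*",                                # separator after the name
        re.IGNORECASE,
    )


def strip_salutation(text, pattern):
    """Return (cleaned_text, matched); matched is False when nothing was removed."""
    cleaned, n = pattern.subn("", text, count=1)
    return cleaned.strip(), n > 0


pattern_de = build_salutation_pattern(SALUTATIONS_DE, HONORIFICS_DE)
for line in ["Sehr geehrte Frau Müller, vielen Dank für Ihre Hilfe.",
             "Das Produkt ist defekt"]:
    cleaned, matched = strip_salutation(line, pattern_de)
    print(matched, repr(cleaned))
# True 'vielen Dank für Ihre Hilfe.'
# False 'Das Produkt ist defekt'
```

As with the rule-based version, the word after the salutation is assumed to be a name, so a line such as "Hallo ich brauche Hilfe" would lose "ich". Checking such cases against real samples is part of tuning the lists.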
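✅ Scheme 2: Vectorized replacement with pandas (suited to batch processing of CSV/Excel data)
If the data is already loaded into a pandas.DataFrame, str.replace() can process the whole column efficiently: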
    import pandas as pd

    # Match: salutation + optional honorific + name + optional punctuation.
    # A second name is consumed only when a comma directly follows it.
    pattern = (
        r'^\s*(?:dear|hello|hi(?:\s+there)?|greetings|good\s+(?:morning|afternoon|evening)|ok)\s+'
        r'(?:(?:mrs|mr|ms|dr|sir|master)\b\.?\s+)?'
        r'[^\s,]+(?:\s+[^\s,]+(?=,))?[,.]?\s*'
    )
    # texts: the sample list from Scheme 1
    df = pd.DataFrame({"text": texts})
    df["anonymized"] = df["text"].str.replace(pattern, "", regex=True, case=False).str.strip()

⚠️ Notes and best practices
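- Extensible word lists: update salutations_1, salutations_2, and honorifics as the business context evolves, for example by adding German salutations such as "Sehr geehrte" or "Liebe". In Scheme 1 a two-word salutation is split across the lists ("sehr" in salutations_1, "geehrte" in salutations_2), and German honorifics such as "herr" and "frau" go into honorifics;
- Punctuation robustness: the code explicitly handles a comma after the name. Other separators such as periods or semicolons need to be added as required, for example by stripping trailing punctuation from the name token with re.sub(r'[,.!?;]\s*$', '', ...);
- Boundary protection: only a structure at the very start of the text should be touched, so that words in the middle of a line (e.g. "The product Mr. Clean is great") are never removed. Both schemes look only at the beginning of the text, and text that does not open with a listed salutation should be returned unchanged;
- Validate first: test with real samples before going live, and log the lines that did not match so the rules can be improved over time;
- Compliance fallback: after anonymization, re.sub(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b', '[REDACTED]', ...) can serve as a safety net. It matches any two adjacent capitalized words, which also covers place and product names such as "New York", so it is safer to pair it with an allowlist or to use it for flagging lines for review than to apply it blindly.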
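The trade-off of the fallback pattern can be seen in a small experiment. The sketch below is illustrative only: the function names and the `allowlist` parameter are assumptions, not part of the original solution. It reuses the fallback regex from the last note, first to list candidate spans for review and then to redact everything that is not explicitly allowed.

```python
import re

# The fallback pattern from the note above: two adjacent capitalized words.
CAPITALIZED_PAIR = re.compile(r'\b[A-Z][a-z]+\s+[A-Z][a-z]+\b')


def flag_capitalized_pairs(text, allowlist=()):
    """List capitalized two-word spans that are not on the allowlist."""
    allowed = {entry.lower() for entry in allowlist}
    return [m for m in CAPITALIZED_PAIR.findall(text) if m.lower() not in allowed]


def redact_capitalized_pairs(text, allowlist=()):
    """Replace capitalized two-word spans with [REDACTED], keeping allowed ones."""
    allowed = {entry.lower() for entry in allowlist}

    def replace(match):
        span = match.group(0)
        return span if span.lower() in allowed else "[REDACTED]"

    return CAPITALIZED_PAIR.sub(replace, text)


sample = "Hope you are well. I bought it in New York from Lisa Martin."
print(flag_capitalized_pairs(sample))
# ['New York', 'Lisa Martin']
print(redact_capitalized_pairs(sample, allowlist=["New York"]))
# Hope you are well. I bought it in New York from [REDACTED].
```

Without the allowlist, "New York" would be redacted along with the name, which is exactly the kind of business information the approach is meant to keep.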
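The approach balances precision, maintainability, and runtime efficiency, which makes it a practical lightweight option for handling customer-service text (including German text, once the word lists are extended) in GDPR/DSGVO contexts.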










