Two bugs in the confusion-set methods #470

Closed
treya-lin opened this issue Jan 30, 2024 · 2 comments
Labels
bug Something isn't working

Comments

treya-lin commented Jan 30, 2024

1. kenlm

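Problem: when the same wrong character appears more than once, only its first occurrence is corrected.

I found that if a word from the confusion set appears several times in a sentence, only the first occurrence is replaced.

For example:
Confusion set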

莪 我
祢 你

Example sentence

from pycorrector import Corrector

s = "莪想说莪爱祢"
m_custom = Corrector(custom_confusion_path_or_dict="./my_custom_confusion.txt")
m_custom.correct(s)

Result

{'source': '莪想说莪爱祢', 'target': '我想说莪爱你', 'errors': [('莪', '我', 0), ('祢', '你', 5)]}

The second "莪" was not replaced.
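The expected behaviour for this case can be sketched independently of pycorrector: scan the sentence for every occurrence of each confusion key, not just the first. The helper name `find_all_confusions` below is hypothetical and is not part of pycorrector; this is a minimal illustration of the idea, not the fix that was eventually merged.

```python
def find_all_confusions(sentence, confusion):
    """Return (wrong, right, start_index) for every occurrence of every key."""
    errors = []
    for wrong, right in confusion.items():
        if not wrong:
            continue  # an empty key would match everywhere; skip it
        start = sentence.find(wrong)
        while start != -1:
            errors.append((wrong, right, start))
            # Resume the search after this match so later occurrences are found too.
            start = sentence.find(wrong, start + len(wrong))
    # Report errors in the order they appear in the sentence.
    return sorted(errors, key=lambda e: e[2])


confusion = {"莪": "我", "祢": "你"}
errors = find_all_confusions("莪想说莪爱祢", confusion)
print(errors)  # [('莪', '我', 0), ('莪', '我', 3), ('祢', '你', 5)]
```

With this approach the second "莪" at index 3 is reported alongside the first one at index 0.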

2. confusion pipeline

With the confusion pipeline and the same example as above, neither occurrence of "莪" is corrected.

from pycorrector import ConfusionCorrector
confusion_dict = {"莪": "我", "祢": "你"}
model_confusion = ConfusionCorrector(custom_confusion_path_or_dict=confusion_dict)
model_confusion.correct("莪想说莪爱祢")

Result

{'source': '莪想说莪爱祢',
 'target': '莪想说莪爱你',
 'errors': [('莪', '我', 0), ('祢', '你', 5)]}

The first '莪' is detected, but neither occurrence of '莪' is corrected in the target.
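The mismatch above (an error is reported at index 0, yet the target still contains "莪") suggests that detection and replacement are out of sync. One way to keep them consistent is to build the target directly from the detected positions. The function `apply_corrections` below is a hypothetical helper written for illustration, not pycorrector's implementation; it assumes errors are given as `(wrong, right, start_index)` tuples like those in the output above.

```python
def apply_corrections(sentence, errors):
    """Build the corrected sentence from (wrong, right, start_index) tuples."""
    pieces = []
    cursor = 0
    for wrong, right, start in sorted(errors, key=lambda e: e[2]):
        # Skip entries that overlap an earlier fix or no longer match the text.
        if start < cursor or sentence[start:start + len(wrong)] != wrong:
            continue
        pieces.append(sentence[cursor:start])  # unchanged text before the error
        pieces.append(right)                   # the replacement itself
        cursor = start + len(wrong)
    pieces.append(sentence[cursor:])           # unchanged tail
    return "".join(pieces)


errors = [("莪", "我", 0), ("莪", "我", 3), ("祢", "你", 5)]
target = apply_corrections("莪想说莪爱祢", errors)
print(target)  # 我想说我爱你
```

Because the target is assembled from the same tuples that are returned in `errors`, every reported error is guaranteed to be reflected in the output.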

@treya-lin treya-lin added the bug Something isn't working label Jan 30, 2024
@treya-lin treya-lin changed the title from "kenlm method: only the first occurrence of a repeated wrong character is corrected" to "Two bugs in the confusion-set methods" Jan 30, 2024
@treya-lin (Contributor, Author)

I have fixed these two issues and will submit a PR later.

treya-lin added a commit to treya-lin/pycorrector that referenced this issue Jan 30, 2024
shibing624 added a commit that referenced this issue Jan 31, 2024
Fix issue #470: Improve how confusion words are located and/or replaced
@shibing624 (Owner)

done
