ConvSeq2seq Attention Model for Chinese Spelling Correction

Features

Sequence-to-sequence model with an attention mechanism
Luong Attention
Conv Seq2Seq model; parallel computation on GPU speeds up training
Training speed-up tricks: dataset bucketing, prefetching, token-based batching, gradient accumulation
Beam Search
Chinese samples: SIGHAN 2015 data
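
The bucketing and token-based batching tricks listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function name `token_based_batches` and the `max_tokens` parameter are hypothetical, and one character is treated as one token.

```python
from typing import List, Sequence


def token_based_batches(samples: Sequence[str], max_tokens: int) -> List[List[str]]:
    """Group samples so that each padded batch holds at most max_tokens tokens.

    Hypothetical helper for illustration only; one character counts as one token.
    """
    # Bucketing: sort by length so sentences of similar length share a batch
    # and little padding is wasted.
    ordered = sorted(samples, key=len)
    batches: List[List[str]] = []
    current: List[str] = []
    for sample in ordered:
        # The input is sorted ascending, so the new sample is the longest one
        # and the padded batch size would be (batch size + 1) * len(sample).
        if current and (len(current) + 1) * len(sample) > max_tokens:
            batches.append(current)
            current = []
        current.append(sample)
    if current:
        batches.append(current)
    return batches


sentences = ["今天心情很好", "你好", "我也很高兴。", "你找到你最喜欢的工作"]
for batch in token_based_batches(sentences, max_tokens=12):
    print(len(batch), [len(s) for s in batch])
```

A sentence longer than `max_tokens` still gets a batch of its own in this sketch; a real pipeline might truncate or drop it instead.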

Usage

Requirements

Install the dependencies with pip:
```
torch>=1.4.0
transformers>=4.4.2
```

Quick start

Quick prediction with pycorrector:

from pycorrector import ConvSeq2SeqCorrector
m = ConvSeq2SeqCorrector()
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作，我也很高心。']))

output:

[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]},
{'source': '你找到你最喜欢的工作，我也很高心。', 'target': '你找到你最喜欢的工作，我也很高兴。', 'errors': [('心', '兴', 15)]}]
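
Each entry in `errors` appears to be a `(wrong, right, position)` tuple. The sketch below checks that reading against the sample output, assuming every error is a single-character substitution and that the position is a 0-based character index; `apply_errors` is a hypothetical helper, not part of pycorrector.

```python
from typing import List, Tuple


def apply_errors(source: str, errors: List[Tuple[str, str, int]]) -> str:
    """Rebuild the corrected sentence from the source and its error list.

    Assumes single-character substitutions with 0-based positions
    (an assumption drawn from the sample output, not a documented contract).
    """
    chars = list(source)
    for wrong, right, pos in errors:
        if chars[pos] != wrong:
            raise ValueError(f"expected {wrong!r} at index {pos}, found {chars[pos]!r}")
        chars[pos] = right
    return "".join(chars)


# One result copied from the sample output above.
result = {'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]}
assert apply_errors(result['source'], result['errors']) == result['target']
```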

Dataset

toy data

SIGHAN 2015 Chinese spelling correction data (2k sentence pairs): examples/data/sighan_2015/train.tsv

data format (one pair per line, tab-separated: the sentence with errors, then the corrected sentence):

你说的是对，跟那些失业的人比起来你也算是辛运的。	你说的是对，跟那些失业的人比起来你也算是幸运的。
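
A line in this format could be read as sketched below. The helpers `parse_pair` and `diff_chars` are hypothetical names used only for illustration, and the sketch assumes both sentences have the same length, as in the sample line.

```python
from typing import List, Tuple


def parse_pair(line: str) -> Tuple[str, str]:
    """Split one TSV line into (source, target)."""
    source, target = line.rstrip("\n").split("\t")
    return source, target


def diff_chars(source: str, target: str) -> List[Tuple[str, str, int]]:
    """List (wrong, right, index) for every position where the pair differs."""
    if len(source) != len(target):
        raise ValueError("this sketch only handles equal-length pairs")
    return [(s, t, i) for i, (s, t) in enumerate(zip(source, target)) if s != t]


line = "你说的是对，跟那些失业的人比起来你也算是辛运的。\t你说的是对，跟那些失业的人比起来你也算是幸运的。\n"
src, tgt = parse_pair(line)
print(diff_chars(src, tgt))
```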

big train data

nlpcc2018+hsk dataset, download from https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA password: m6fg [1.3M sentence pairs, 215 MB]

Train model

Run training:

python train.py --do_train --do_predict

Predict model

python predict.py