# ConvSeq2seq Attention Model for Chinese Spelling Correction

## Features
* Attention-based Sequence-to-Sequence model
* Luong Attention
* Conv Seq2Seq model with GPU parallel computation for faster training
* Training acceleration tricks: dataset bucketing, prefetching, token-based batching, gradient accumulation (a minimal gradient accumulation sketch appears at the end of this README)
* Beam Search
* Chinese samples: SIGHAN 2015 data

## Usage

### Requirements
* Install the dependencies with pip:
```
torch>=1.4.0
transformers>=4.4.2
```

### Quick Start
#### Quick prediction with pycorrector
example: [examples/seq2seq/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/seq2seq/demo.py)
```python
from pycorrector import ConvSeq2SeqCorrector

m = ConvSeq2SeqCorrector()
print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作,我也很高心。']))
```
output:
```shell
[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]}, {'source': '你找到你最喜欢的工作,我也很高心。', 'target': '你找到你最喜欢的工作,我也很高兴。', 'errors': [('心', '兴', 15)]}]
```

### Dataset
#### Toy data
SIGHAN 2015 Chinese spelling correction data (2k sentence pairs): [examples/data/sighan_2015/train.tsv](https://github.com/shibing624/pycorrector/blob/master/examples/data/sighan_2015/train.tsv)

data format (a small loading sketch appears at the end of this README):
```
你说的是对,跟那些失业的人比起来你也算是辛运的。	你说的是对,跟那些失业的人比起来你也算是幸运的。
```

#### Big train data
nlpcc2018+hsk dataset, download from https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA (password: m6fg) [1.3M sentence pairs, 215MB]

### Train model
run train:
```
python train.py --do_train --do_predict
```

### Predict model
```
python predict.py
```
output:
```shell
[{'source': '今天新情很好', 'target': '今天心情很好', 'errors': [('新', '心', 2)]}, {'source': '你找到你最喜欢的工作,我也很高心。', 'target': '你找到你最喜欢的工作,我也很高兴。', 'errors': [('心', '兴', 15)]}]
```
![result image](https://github.com/shibing624/pycorrector/blob/master/docs/git_image/convseq2seq_ret.png)

## Release model
A convseq2seq model trained on the SIGHAN 2015 dataset has been released on GitHub:
- convseq2seq model: https://github.com/shibing624/pycorrector/releases/download/0.4.5/convseq2seq_correction.tar.gz
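
The "gradient accumulation" trick listed under Features can be illustrated with a minimal, generic PyTorch sketch. The model, optimizer, and data below are placeholders chosen only for illustration; this is not the actual ConvSeq2Seq training loop from `train.py`.

```python
# Minimal sketch of gradient accumulation (placeholder model and data,
# not the ConvSeq2Seq training code).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 2)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=8)

accumulation_steps = 4                        # effective batch size = 8 * 4 = 32
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = criterion(model(x), y)
    # Scale the loss so the accumulated gradients average over the mini-batches.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # update weights once per N mini-batches
        optimizer.zero_grad()
```

This keeps memory usage at the small per-step batch size while training with a larger effective batch.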
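
The toy data shown under Dataset can be read with a short helper, assuming each line of `train.tsv` holds a tab-separated (source, target) sentence pair; the `load_pairs` name is just for this sketch.

```python
# Sketch: read tab-separated (misspelled, corrected) sentence pairs.
def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                pairs.append((parts[0], parts[1]))  # (source, target)
    return pairs

pairs = load_pairs("examples/data/sighan_2015/train.tsv")
print(pairs[:2])
```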