
Update train data.

shibing624, 1 week ago
Commit ee0c6f0762
3 files changed, 21 insertions(+), 22 deletions(-)
  1. README.md (+13 −14)
  2. examples/gpt/README.md (+1 −1)
  3. examples/gpt/training_llama_demo.py (+7 −7)

+ 13 - 14
README.md

@@ -370,15 +370,13 @@ output:
 ```
 
 ### GPT models
-Fine-tune correction models based on ChatGLM3, LLaMA, Baichuan, Qwen, etc.; for the training method see [examples/gpt/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/README.md)
-
-The correction model SFT-fine-tuned on ChatGLM3-6B has been released to HuggingFace Models: https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora
+Fine-tune correction models based on ChatGLM3, Qwen2.5, etc.; for the training method see [examples/gpt/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/README.md)
 
 #### Quick prediction with pycorrector
 
 example: [examples/gpt/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/demo.py)
 ```python
-from pycorrector import GptCorrector
+from pycorrector.gpt.gpt_corrector import GptCorrector
 m = GptCorrector()
 print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作,我也很高心。']))
 ```
@@ -457,17 +455,18 @@ output:
 
 ## Dataset
 
-| Dataset | Corpus | Download link | Archive size |
-|:---|:---|:---:|:---:|
-| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) <br/> [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC) | 106M |
-| **`Original SIGHAN datasets`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K |
-| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation (provided by dimmywang)](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) | 93M |
-| **`People's Daily 2014 corpus`** | People's Daily 2014 edition | [Feishu (password cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code) | 383M |
-| **`NLPCC 2018 GEC official dataset`** | NLPCC2018-GEC | [Official trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M |
-| **`NLPCC 2018+HSK processed corpus`** | nlpcc2018+hsk+CGED | [Baidu Netdisk (password m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) <br/> [Feishu (password gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M |
-| **`NLPCC 2018+HSK raw corpus`** | HSK+Lang8 | [Baidu Netdisk (password n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) <br/> [Feishu (password Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) | 81M |
+| Dataset | Corpus | Download link | Archive size |
+|:---|:---|:---:|:---:|
+| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) <br/> [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC) | 106M |
+| **`Original SIGHAN datasets`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K |
+| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation (provided by dimmywang)](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) | 93M |
+| **`People's Daily 2014 corpus`** | People's Daily 2014 edition | [Feishu (password cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code) | 383M |
+| **`NLPCC 2018 GEC official dataset`** | NLPCC2018-GEC | [Official trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M |
+| **`NLPCC 2018+HSK processed corpus`** | nlpcc2018+hsk+CGED | [Baidu Netdisk (password m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) <br/> [Feishu (password gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M |
+| **`NLPCC 2018+HSK raw corpus`** | HSK+Lang8 | [Baidu Netdisk (password n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) <br/> [Feishu (password Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) | 81M |
 | **`Chinese correction competition data collection`** | Chinese Text Correction (CTC) | [Aggregated Chinese correction datasets (Tianchi)](https://tianchi.aliyun.com/dataset/138195) | - |
-| **`NLPCC 2023 Chinese grammatical error correction dataset`** | NLPCC 2023 Sharedtask1 | [Task 1: Chinese Grammatical Error Correction (Training Set)](http://tcci.ccf.org.cn/conference/2023/taskdata.php) | 125M |
+| **`NLPCC 2023 Chinese grammatical error correction dataset`** | NLPCC 2023 Sharedtask1 | [Task 1: Chinese Grammatical Error Correction (Training Set)](http://tcci.ccf.org.cn/conference/2023/taskdata.php) | 125M |
+| **`Baidu intelligent text proofreading competition dataset`** | Real-world Chinese correction data | [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction) | 10M |
 
 
 

+ 1 - 1
examples/gpt/README.md

@@ -26,7 +26,7 @@ pip install transformers peft -U
 
 example: [examples/gpt/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/demo.py)
 ```python
-from pycorrector import GptCorrector
+from pycorrector.gpt.gpt_corrector import GptCorrector
 m = GptCorrector()
 print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作,我也很高心。']))
 ```

+ 7 - 7
examples/gpt/training_llama_demo.py

@@ -17,17 +17,17 @@ def main():
     parser = argparse.ArgumentParser()
     parser.add_argument('--train_file', default='../data/grammar/train_sharegpt.jsonl', type=str, help='Train file')
     parser.add_argument('--test_file', default='../data/grammar/test_sharegpt.jsonl', type=str, help='Test file')
-    parser.add_argument('--model_type', default='llama', type=str, help='Transformers model type')
-    parser.add_argument('--model_name', default='shibing624/chinese-alpaca-plus-7b-hf', type=str,
+    parser.add_argument('--model_type', default='auto', type=str, help='Transformers model type')
+    parser.add_argument('--model_name', default='Qwen/Qwen2.5-1.5B-Instruct', type=str,
                         help='Transformers model or path')
     parser.add_argument('--do_train', action='store_true', help='Whether to run training.')
     parser.add_argument('--do_predict', action='store_true', help='Whether to run predict.')
     parser.add_argument('--bf16', action='store_true', help='Whether to use bf16 mixed precision training.')
-    parser.add_argument('--output_dir', default='./outputs-llama-demo/', type=str, help='Model output directory')
-    parser.add_argument('--prompt_template_name', default='vicuna', type=str, help='Prompt template name')
-    parser.add_argument('--max_seq_length', default=128, type=int, help='Input max sequence length')
-    parser.add_argument('--max_length', default=128, type=int, help='Output max sequence length')
-    parser.add_argument('--num_epochs', default=0.2, type=float, help='Number of training epochs')
+    parser.add_argument('--output_dir', default='./outputs-qwen-1.5b-demo/', type=str, help='Model output directory')
+    parser.add_argument('--prompt_template_name', default='qwen', type=str, help='Prompt template name')
+    parser.add_argument('--max_seq_length', default=512, type=int, help='Input max sequence length')
+    parser.add_argument('--max_length', default=512, type=int, help='Output max sequence length')
+    parser.add_argument('--num_epochs', default=1, type=float, help='Number of training epochs')
     parser.add_argument('--batch_size', default=8, type=int, help='Batch size')
     parser.add_argument('--eval_steps', default=50, type=int, help='Eval every X steps')
     parser.add_argument('--save_steps', default=50, type=int, help='Save checkpoint every X steps')
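As a quick sanity check, the new defaults introduced by this commit can be reproduced in a minimal standalone parser. This is a hypothetical sketch covering only the arguments the diff touches, not the full training script:

```python
import argparse

# Minimal parser reproducing only the arguments changed by this commit
# (hypothetical standalone sketch, not the full training_llama_demo.py).
parser = argparse.ArgumentParser()
parser.add_argument('--model_type', default='auto', type=str, help='Transformers model type')
parser.add_argument('--model_name', default='Qwen/Qwen2.5-1.5B-Instruct', type=str,
                    help='Transformers model or path')
parser.add_argument('--output_dir', default='./outputs-qwen-1.5b-demo/', type=str, help='Model output directory')
parser.add_argument('--prompt_template_name', default='qwen', type=str, help='Prompt template name')
parser.add_argument('--max_seq_length', default=512, type=int, help='Input max sequence length')
parser.add_argument('--max_length', default=512, type=int, help='Output max sequence length')
parser.add_argument('--num_epochs', default=1, type=float, help='Number of training epochs')

# Parsing an empty argv yields the defaults; any flag on the command line
# (e.g. --model_name some/other-model) overrides them as usual.
args = parser.parse_args([])
print(args.model_type, args.model_name, args.max_seq_length, args.num_epochs)
```

Note that `--num_epochs` keeps `type=float`, so fractional epochs (e.g. `0.2`, the old default) remain possible even though the new default is `1`.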