
Update train data.

shibing624, 1 week ago
Commit ee0c6f0762
3 files changed, 21 insertions(+), 22 deletions(-)
  1. README.md (+13 −14)
  2. examples/gpt/README.md (+1 −1)
  3. examples/gpt/training_llama_demo.py (+7 −7)

+ 13 - 14
README.md

@@ -370,15 +370,13 @@ output:
 ```
 
 ### GPT models
-Fine-tune correction models based on ChatGLM3, LLaMA, Baichuan, Qwen, etc.; for the training method see [examples/gpt/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/README.md)
-
-The correction model SFT-fine-tuned on ChatGLM3-6B has been released to HuggingFace Models: https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora
+Fine-tune correction models based on ChatGLM3, Qwen2.5, etc.; for the training method see [examples/gpt/README.md](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/README.md)
 
 #### Quick prediction with pycorrector
 
 example: [examples/gpt/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/demo.py)
 ```python
-from pycorrector import GptCorrector
+from pycorrector.gpt.gpt_corrector import GptCorrector
 m = GptCorrector()
 print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作,我也很高心。']))
 ```
@@ -457,17 +455,18 @@ output:
 
 ## Dataset
 
-| Dataset | Corpus | Download link | Archive size |
-|:---|:---|:---:|:---:|
-| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) <br/> [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC) | 106M |
-| **`Original SIGHAN datasets`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K |
-| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation (provided by dimmywang)](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) | 93M |
-| **`People's Daily 2014 corpus`** | People's Daily 2014 edition | [Feishu (password cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code) | 383M |
-| **`NLPCC 2018 GEC official dataset`** | NLPCC2018-GEC | [Official trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M |
-| **`NLPCC 2018+HSK processed corpus`** | nlpcc2018+hsk+CGED | [Baidu Netdisk (password m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) <br/> [Feishu (password gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M |
-| **`NLPCC 2018+HSK raw corpus`** | HSK+Lang8 | [Baidu Netdisk (password n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) <br/> [Feishu (password Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) | 81M |
+| Dataset | Corpus | Download link | Archive size |
+|:---|:---|:---:|:---:|
+| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270K samples) | [Baidu Netdisk (password 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) <br/> [shibing624/CSC](https://huggingface.co/datasets/shibing624/CSC) | 106M |
+| **`Original SIGHAN datasets`** | SIGHAN13 14 15 | [Official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K |
+| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation (provided by dimmywang)](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) | 93M |
+| **`People's Daily 2014 corpus`** | People's Daily 2014 edition | [Feishu (password cHcu)](https://l6pmn3b1eo.feishu.cn/file/boxcnKpildqIseq1D4IrLwlir7c?from=from_qr_code) | 383M |
+| **`NLPCC 2018 GEC official dataset`** | NLPCC2018-GEC | [Official trainingdata](http://tcci.ccf.org.cn/conference/2018/dldoc/trainingdata02.tar.gz) | 114M |
+| **`NLPCC 2018+HSK processed corpus`** | nlpcc2018+hsk+CGED | [Baidu Netdisk (password m6fg)](https://pan.baidu.com/s/1BkDru60nQXaDVLRSr7ktfA) <br/> [Feishu (password gl9y)](https://l6pmn3b1eo.feishu.cn/file/boxcnudJgRs5GEMhZwe77YGTQfc?from=from_qr_code) | 215M |
+| **`NLPCC 2018+HSK raw corpus`** | HSK+Lang8 | [Baidu Netdisk (password n31j)](https://pan.baidu.com/s/1DaOX89uL1JRaZclfrV9C0g) <br/> [Feishu (password Q9LH)](https://l6pmn3b1eo.feishu.cn/file/boxcntebW3NI6OAaqzDUXlZHoDb?from=from_qr_code) | 81M |
 | **`Chinese correction competition data collection`** | Chinese Text Correction (CTC) | [Aggregated Chinese correction datasets (Tianchi)](https://tianchi.aliyun.com/dataset/138195) | - |
-| **`NLPCC 2023 Chinese grammatical error correction dataset`** | NLPCC 2023 Sharedtask1 | [Task 1: Chinese Grammatical Error Correction (Training Set)](http://tcci.ccf.org.cn/conference/2023/taskdata.php) | 125M |
+| **`NLPCC 2023 Chinese grammatical error correction dataset`** | NLPCC 2023 Sharedtask1 | [Task 1: Chinese Grammatical Error Correction (Training Set)](http://tcci.ccf.org.cn/conference/2023/taskdata.php) | 125M |
+| **`Baidu intelligent text proofreading competition dataset`** | Real-world Chinese correction data | [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction) | 10M |
 
 
 

+ 1 - 1
examples/gpt/README.md

@@ -26,7 +26,7 @@ pip install transformers peft -U
 
 example: [examples/gpt/demo.py](https://github.com/shibing624/pycorrector/blob/master/examples/gpt/demo.py)
 ```python
-from pycorrector import GptCorrector
+from pycorrector.gpt.gpt_corrector import GptCorrector
 m = GptCorrector()
 print(m.correct_batch(['今天新情很好', '你找到你最喜欢的工作,我也很高心。']))
 ```

+ 7 - 7
examples/gpt/training_llama_demo.py

@@ -17,17 +17,17 @@ def main():
     parser = argparse.ArgumentParser()
     parser.add_argument('--train_file', default='../data/grammar/train_sharegpt.jsonl', type=str, help='Train file')
     parser.add_argument('--test_file', default='../data/grammar/test_sharegpt.jsonl', type=str, help='Test file')
-    parser.add_argument('--model_type', default='llama', type=str, help='Transformers model type')
-    parser.add_argument('--model_name', default='shibing624/chinese-alpaca-plus-7b-hf', type=str,
+    parser.add_argument('--model_type', default='auto', type=str, help='Transformers model type')
+    parser.add_argument('--model_name', default='Qwen/Qwen2.5-1.5B-Instruct', type=str,
                         help='Transformers model or path')
     parser.add_argument('--do_train', action='store_true', help='Whether to run training.')
     parser.add_argument('--do_predict', action='store_true', help='Whether to run predict.')
     parser.add_argument('--bf16', action='store_true', help='Whether to use bf16 mixed precision training.')
-    parser.add_argument('--output_dir', default='./outputs-llama-demo/', type=str, help='Model output directory')
-    parser.add_argument('--prompt_template_name', default='vicuna', type=str, help='Prompt template name')
-    parser.add_argument('--max_seq_length', default=128, type=int, help='Input max sequence length')
-    parser.add_argument('--max_length', default=128, type=int, help='Output max sequence length')
-    parser.add_argument('--num_epochs', default=0.2, type=float, help='Number of training epochs')
+    parser.add_argument('--output_dir', default='./outputs-qwen-1.5b-demo/', type=str, help='Model output directory')
+    parser.add_argument('--prompt_template_name', default='qwen', type=str, help='Prompt template name')
+    parser.add_argument('--max_seq_length', default=512, type=int, help='Input max sequence length')
+    parser.add_argument('--max_length', default=512, type=int, help='Output max sequence length')
+    parser.add_argument('--num_epochs', default=1, type=float, help='Number of training epochs')
     parser.add_argument('--batch_size', default=8, type=int, help='Batch size')
     parser.add_argument('--eval_steps', default=50, type=int, help='Eval every X steps')
     parser.add_argument('--save_steps', default=50, type=int, help='Save checkpoint every X steps')
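As a quick sanity check, the new defaults introduced by this commit can be reproduced in a minimal standalone parser. This is a hypothetical sketch covering only the arguments the diff touches, not the full training script:

```python
import argparse

# Minimal parser reproducing only the arguments changed by this commit
# (hypothetical standalone sketch, not the full training_llama_demo.py).
parser = argparse.ArgumentParser()
parser.add_argument('--model_type', default='auto', type=str, help='Transformers model type')
parser.add_argument('--model_name', default='Qwen/Qwen2.5-1.5B-Instruct', type=str,
                    help='Transformers model or path')
parser.add_argument('--output_dir', default='./outputs-qwen-1.5b-demo/', type=str, help='Model output directory')
parser.add_argument('--prompt_template_name', default='qwen', type=str, help='Prompt template name')
parser.add_argument('--max_seq_length', default=512, type=int, help='Input max sequence length')
parser.add_argument('--max_length', default=512, type=int, help='Output max sequence length')
parser.add_argument('--num_epochs', default=1, type=float, help='Number of training epochs')

# Parsing an empty argv yields the defaults; any flag on the command line
# (e.g. --model_name some/other-model) overrides them as usual.
args = parser.parse_args([])
print(args.model_type, args.model_name, args.max_seq_length, args.num_epochs)
```

Note that `--num_epochs` keeps `type=float`, so fractional epochs (e.g. `0.2`, the old default) remain possible even though the new default is `1`.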