Commit 1e9eb5d · verified · 1 Parent(s): 6824258
Committed by windlx

Upload folder using huggingface_hub

Files changed (2)
  1. LICENSE +21 -0
  2. README.md +188 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 xiu xiu
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,188 @@
+ ---
+ license: mit
+ language:
+ - zh
+ - en
+ datasets:
+ - IowaCat/page_type_inference_dataset
+ metrics:
+ - accuracy: 0.99
+ pipeline_tag: text-generation
+ tags:
+ - url-classification
+ - list-page-detection
+ - detail-page-detection
+ - qwen
+ - fine-tuning
+ - lora
+ - url-parser
+ widget:
+ - text: "https://example.com/product/12345"
+ - text: "https://example.com/category/electronics"
+ ---
+
+ # URL Page Type Classifier
+
+ <div align="center">
+
+ ![Model Size](https://img.shields.io/badge/Model%20Size-1.5B-blue)
+ ![License](https://img.shields.io/badge/License-MIT-green)
+ ![Accuracy](https://img.shields.io/badge/Accuracy-99%25-green)
+
+ </div>
+
+ ## 📋 Overview
+
+ A URL page-type classifier fine-tuned from Qwen2.5-1.5B with LoRA. Given a URL, it predicts whether the page is a list page or a detail page.
+
+ ## 🏗️ Model Architecture
+
+ | Item | Details |
+ |------|------|
+ | **Base model** | Qwen/Qwen2.5-1.5B |
+ | **Fine-tuning method** | LoRA (r=16, alpha=32) |
+ | **Parameters** | 1.5B |
+ | **Trainable parameters** | ~18M (1.18%) |
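The trainable-parameter figure in the table can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions that the card does not state: that LoRA adapters are attached to all seven linear projections of every decoder layer, and that the base model has the dimensions of the public Qwen2.5-1.5B configuration (hidden size 1536, 28 layers, feed-forward size 8960, key/value width 256). It is a plausibility check, not a description of the author's actual setup.

```python
# Assumed Qwen2.5-1.5B dimensions (not stated in this card).
HIDDEN = 1536                 # hidden size
KV_DIM = 256                  # key/value projection width (2 KV heads x 128)
MLP_DIM = 8960                # feed-forward (intermediate) size
NUM_LAYERS = 28               # decoder layers
BASE_PARAMS = 1_543_714_304   # assumed base-model parameter count

RANK = 16                     # LoRA rank, taken from the card

# (in_features, out_features) for each assumed LoRA target module.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in) plus B (out x rank) weights."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
trainable = per_layer * NUM_LAYERS
share = trainable / (BASE_PARAMS + trainable)

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of all parameters:   {share:.2%}")
```

Under these assumptions the count comes to about 18.5M parameters, or 1.18% of the total, which is consistent with the figures in the table.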
46
+
47
+ ## 📊 训练数据
48
+
49
+ - **数据集**: IowaCat/page_type_inference_dataset
50
+ - **训练样本**: 10,000条URL (5000列表页 + 5000详情页)
51
+ - **数据来源**: HuggingFace Datasets
52
+
53
+ ### 数据分布
54
+
55
+ | 类型 | 数量 | 比例 |
56
+ |------|------|------|
57
+ | 列表页 (List Page) | 5,000 | 50% |
58
+ | 详情页 (Detail Page) | 5,000 | 50% |
59
+
+ ## ⚙️ Training Configuration
+
+ ```json
+ {
+     "base_model": "Qwen/Qwen2.5-1.5B",
+     "lora_rank": 16,
+     "lora_alpha": 32,
+     "lora_dropout": 0.05,
+     "num_train_epochs": 3,
+     "per_device_train_batch_size": 2,
+     "gradient_accumulation_steps": 8,
+     "learning_rate": 2e-4,
+     "fp16": true,
+     "optimizer": "adamw_torch",
+     "lr_scheduler_type": "cosine"
+ }
+ ```
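The batch settings above interact: gradients are accumulated over several small batches before each optimizer update. The sketch below works out the effective batch size and the number of optimizer steps, assuming a single training device and that all 10,000 listed samples were used for training; neither is stated in the card.

```python
import math

# Values taken from the training configuration.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_train_epochs = 3

# Assumptions (not stated in the card).
num_devices = 1
num_train_samples = 10_000

# Samples that contribute to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Optimizer updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_train_samples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

With these assumptions each update sees 16 samples, for 625 updates per epoch and 1,875 in total. The cosine schedule would decay the learning rate from 2e-4 over that span.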
+
+ ## 📈 Performance Evaluation
+
+ ### Test Results
+
+ | Evaluation set | Samples | Accuracy |
+ |--------|--------|--------|
+ | Validation set | 100 | **99%** |
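Because the validation set holds only 100 samples, the 99% figure carries noticeable sampling uncertainty. As a rough illustration, the sketch below computes a 95% Wilson score interval for 99 correct predictions out of 100. It assumes the samples are independent and representative of the URLs the model will see, which the card does not claim.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# 99% accuracy on 100 validation samples means 99 correct predictions.
lower, upper = wilson_interval(correct=99, total=100)
print(f"95% interval: {lower:.1%} to {upper:.1%}")
```

The interval runs from roughly 94.6% to 99.8%, so a larger held-out set would be needed to pin the accuracy down more tightly.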
+
+ ### Example Predictions
+
+ | URL | Prediction |
+ |-----|----------|
+ | `https://example.com/products/category` | List Page |
+ | `https://example.com/product/12345` | Detail Page |
+ | `https://example.com/search?q=test` | List Page |
+ | `https://example.com/item/abc123` | Detail Page |
+ | `https://example.com/list/all` | List Page |
+
+ ## 🚀 Quick Start
+
+ ### Install Dependencies
+
+ ```bash
+ # accelerate is needed for the device_map option used in the loading examples
+ pip install transformers peft torch accelerate
+ ```
+
+ ### Inference Code
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_name = "windlx/url-classifier-model"
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
+
+ # URL to classify
+ url = "https://example.com/product/12345"
+
+ # Build the prompt (kept in Chinese, exactly as the model card gives it)
+ prompt = f"""请判断以下URL是列表页还是详情页。
+
+ URL: {url}
+ 类型: """
+
+ # Run inference
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
+
+ # Decode only the newly generated tokens. The prompt itself contains both
+ # labels ("列表页" and "详情页"), so matching against the full decoded text
+ # would always report a detail page.
+ new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+ response = tokenizer.decode(new_tokens, skip_special_tokens=True)
+
+ # Extract the label
+ if "详情页" in response or "Detail Page" in response:
+     result = "详情页 (Detail Page)"
+ else:
+     result = "列表页 (List Page)"
+
+ print(f"URL: {url}")
+ print(f"Type: {result}")
+ ```
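The snippet above treats any output that does not mention a detail page as a list page, so an empty or off-topic generation is silently counted as a list page. A stricter alternative is sketched below. `parse_page_type` is a hypothetical helper, not part of the model repository, and it expects only the newly generated text rather than the prompt, since the prompt mentions both labels.

```python
from typing import Optional

# Label strings the model is expected to emit, per the card's own snippet.
DETAIL_MARKERS = ("详情页", "Detail Page")
LIST_MARKERS = ("列表页", "List Page")


def parse_page_type(generated: str) -> Optional[str]:
    """Map generated text to 'detail', 'list', or None when it is unclear.

    None is returned when the text names neither label or names both,
    so the caller can decide how to handle ambiguous outputs.
    """
    has_detail = any(marker in generated for marker in DETAIL_MARKERS)
    has_list = any(marker in generated for marker in LIST_MARKERS)
    if has_detail == has_list:
        return None
    return "detail" if has_detail else "list"


print(parse_page_type("详情页"))            # a clean detail-page answer
print(parse_page_type("列表页 (List Page)"))  # a clean list-page answer
print(parse_page_type(""))                  # nothing usable generated
```

Returning `None` for unclear outputs lets a crawler or analytics job count and inspect them separately instead of folding them into one class.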
+
+ ### Using a GPU
+
+ ```python
+ # Place the model on a GPU automatically when one is available
+ # (device_map requires the accelerate package)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     trust_remote_code=True,
+     device_map="auto",
+     torch_dtype="auto"
+ )
+ ```
+
+ ### Using the CPU
+
+ ```python
+ # Force the model onto the CPU in full precision
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     trust_remote_code=True,
+     device_map="cpu",
+     torch_dtype="float32"
+ )
+ ```
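+
+ ## ⚠️ Limitations
+
+ 1. **URL string only** - the model does not fetch or inspect the actual page content
+ 2. **Depends on URL path conventions** - accuracy may be lower on sites whose URL paths do not follow common conventions
+ 3. **Chinese and English only** - optimized mainly for Chinese URLs
+
+ ## 📝 Use Cases
+
+ - 🔍 **Search engine optimization (SEO)** - identify a site's page structure
+ - 🕷️ **Web crawlers** - determine link types to tune the crawl strategy
+ - 📊 **Site analytics** - measure the ratio of list pages to detail pages
+ - 🔗 **Link classification** - classify URLs at scale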
+
+ ## 📁 Related Links
+
+ - **GitHub repository**: https://github.com/xiuxiu/url-classifier
+ - **HuggingFace model**: https://huggingface.co/windlx/url-classifier-model
+ - **Training dataset**: https://huggingface.co/datasets/IowaCat/page_type_inference_dataset
+
+ ## 🙏 Acknowledgments
+
+ - [Qwen](https://github.com/QwenLM/Qwen2) - base model
+ - [LoRA](https://github.com/microsoft/LoRA) - efficient fine-tuning method
+ - [HuggingFace](https://huggingface.co/) - model hosting platform
+
+ ## 📄 License
+
+ [LICENSE](LICENSE) - MIT License