PipeOwl
A transformer-free semantic retrieval engine for multilingual wiki retrieval and passage expansion.
It combines:
| item | value |
|---|---|
| token size | 734803 |
| embedding dim | 512 |
| storage format | safetensors (FP16) |
| all data size | ~5.44 GB |
| model data size | ~728 MB |
| wiki data size | ~4.73 GB |
| languages | multilingual |
| startup time | ~7300 ms |
| query latency top1~5 | 740~22000 ms |
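As a rough sanity check on the table, the size of the embedding matrix can be estimated from the token count, the embedding dimension, and the FP16 storage format. The sketch below assumes that "token size" is the number of rows in a single embedding matrix and that the sizes in the table are in binary units (MiB/GiB). Neither assumption is stated in the table itself.

```python
# Values taken from the table above.
VOCAB_SIZE = 734_803      # "token size" (assumed: rows of the embedding matrix)
EMBED_DIM = 512           # "embedding dim"
BYTES_PER_FP16 = 2        # FP16 stores each value in 2 bytes

matrix_bytes = VOCAB_SIZE * EMBED_DIM * BYTES_PER_FP16
matrix_mib = matrix_bytes / 2**20

print(f"{matrix_bytes:,} bytes")   # 752,438,272 bytes
print(f"{matrix_mib:.1f} MiB")     # 717.6 MiB

# Assumed binary units: model (~728 MiB) + wiki data (~4.73 GiB).
total_gib = 728 / 1024 + 4.73
print(f"{total_gib:.2f} GiB")      # 5.44 GiB
```

Under these assumptions the matrix alone accounts for about 717.6 MiB of the ~728 MB model data, and the model and wiki sizes add up to the listed ~5.44 GB total.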
git clone https://huggingface.co/WangKaiLin/PipeOwl-1.10.2-tw-wiki-rag/
cd PipeOwl-1.10.2-tw-wiki-rag/
pip install numpy safetensors
python quickstart_wiki_merge.py
Example:
Input:
bang dream
Output:
BanG_Dream!BanG_Dream!
See example.md in the repository for full examples.
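The example maps a lowercase, space-separated query to a title that differs in casing and punctuation. The following is a minimal sketch of how such a match could work with a token-to-title inverted index, similar in spirit to what the file name `wiki_token_to_title_ids.json` suggests. The tokenization rule, the function names, and the toy titles are all illustrative assumptions. This is not the repository's code, and the ASCII-only tokenizer here would not handle Chinese titles.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric ASCII (hypothetical rule)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def build_index(titles):
    """Map each token to the set of title ids that contain it."""
    index = defaultdict(set)
    for title_id, title in enumerate(titles):
        for token in tokenize(title):
            index[token].add(title_id)
    return index

def search(query, titles, index, top_k=5):
    """Rank titles by how many distinct query tokens they share."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for title_id in index.get(token, ()):
            scores[title_id] += 1
    # Highest overlap first; ties broken by title id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [titles[i] for i, _ in ranked[:top_k]]

titles = ["BanG_Dream!", "Dream_Theater", "Big_Bang"]  # toy data
index = build_index(titles)
print(search("bang dream", titles, index, top_k=1))  # ['BanG_Dream!']
```

Token overlap is used here only because it is easy to verify by hand. An embedding-based engine would replace the overlap count with a vector similarity score.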
PipeOwl-1.10.2-tw-wiki-rag/
├─ quickstart_wiki_merge.py     # entry point
├─ wiki_retriever_merge.py      # wiki data search
├─ entity_layer.py              # synonym (alias) layer logic
├─ tokenizer_priority.py        # word segmenter logic
├─ engine.py                    # PipeOwl core module entry point
├─ entity_alias.json            # synonym dataset (incomplete)
├─ phrase_lexicon.txt           # phrases protected from being split
├─ tokenizer.json               # core tokenizer
├─ pipeowl.safetensors          # core embedding matrix storage
├─ zhwiki_clean.jsonl           # wiki dataset
└─ wiki_index_merge/
   ├─ wiki_merge_meta.json
   ├─ wiki_titles.txt
   ├─ wiki_title_tokens.jsonl
   ├─ wiki_title_tokens_offsets.json
   ├─ wiki_token_to_title_ids.json
   └─ wiki_title_offsets.json
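The `*_offsets.json` files sit next to large line-oriented data files, which suggests a byte-offset index for random access. This is an assumption based on the file names only. A minimal sketch of that pattern follows. It records where each line starts, then seeks directly to one record without scanning the whole file. The function names and record fields are hypothetical.

```python
import json
import os
import tempfile

def build_offsets(path):
    """Record the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            if not f.readline():
                break
            offsets.append(pos)
    return offsets

def read_record(path, offsets, i):
    """Seek straight to record i and parse only that line."""
    with open(path, "rb") as f:
        f.seek(offsets[i])
        return json.loads(f.readline().decode("utf-8"))

# Toy stand-in for a large JSONL file; the field names are hypothetical.
records = [
    {"title": "BanG_Dream!", "text": "first passage"},
    {"title": "臺灣", "text": "second passage"},
]
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

offsets = build_offsets(path)
second = read_record(path, offsets, 1)
print(second["title"])  # 臺灣
os.remove(path)
```

Offsets are measured in bytes rather than characters, so the file is opened in binary mode. This keeps seeking correct for multi-byte UTF-8 text such as Chinese titles.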
Code and system components in this repository are released under the MIT License.
Included Wikipedia-derived retrieval data is subject to CC BY-SA 4.0. Please ensure attribution and share-alike compliance if redistributing derived wiki data.
Base model
WangKaiLin/PipeOwl-1.10-multilingual