PipeOwl-1.10.2-tw-wiki-rag

PipeOwl Wiki RAG Demo

A transformer-free semantic retrieval engine for multilingual wiki retrieval and passage expansion.

It combines:

  • a static embedding field
  • entity normalization
  • phrase-aware tokenization
  • merge-based title retrieval
  • local passage spread from matched wiki entries
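
As a rough illustration of what phrase-aware tokenization can look like, the sketch below does a greedy longest-match pass that keeps protected phrases whole. It is not the logic in tokenizer_priority.py: the function name `tokenize_with_phrases` and the in-memory `phrases` set (standing in for entries from phrase_lexicon.txt) are assumptions made for this example.

```python
def tokenize_with_phrases(text, phrases):
    """Greedy longest-match split that keeps known multi-character phrases whole.

    `phrases` is a hypothetical stand-in for entries loaded from phrase_lexicon.txt.
    """
    max_len = max((len(p) for p in phrases), default=0)
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest protected phrase (length >= 2) starting at position i.
        match = None
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in phrases:
                match = text[i:i + n]
                break
        if match is not None:
            tokens.append(match)
            i += len(match)
            continue
        if text[i].isascii() and text[i].isalnum():
            # Fallback 1: a run of ASCII letters/digits becomes one token.
            j = i
            while j < len(text) and text[j].isascii() and text[j].isalnum():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Fallback 2: any other character (e.g. a CJK character) stands alone.
            tokens.append(text[i])
            i += 1
    return tokens


print(tokenize_with_phrases("bang dream 動畫", {"bang dream"}))
```

With the phrase protected, "bang dream" survives as a single token instead of being split at the space.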

Current Limitations

  • Query latency is still dominated by Python-side retrieval logic.
  • entity_alias.json is incomplete and still growing.
  • phrase_lexicon.txt is heuristic and hand-curated.
  • Passage spread is a lightweight local retrieval step, not a full global chunk index.
  • Wiki freshness depends on the included zhwiki dump.
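
To make the "local" nature of passage spread concrete, here is a minimal sketch that searches only the paragraphs of one already-matched article rather than a global chunk index. The scoring rule (counting query terms that occur in a paragraph) and the name `best_passage` are assumptions for illustration; the scoring actually used in wiki_retriever_merge.py is not described here.

```python
def best_passage(article_text, query_terms):
    """Return the paragraph of one matched article that contains the most query terms.

    Hypothetical stand-in for passage spread: only this article is searched.
    """
    paragraphs = [p.strip() for p in article_text.split("\n") if p.strip()]
    best_text = None
    best_score = 0
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        score = sum(1 for term in query_terms if term.lower() in lowered)
        # Strict '>' keeps the earliest paragraph when scores tie.
        if score > best_score:
            best_text, best_score = paragraph, score
    return best_text  # None when no paragraph contains any query term


toy_article = "Alpha is a city.\nAlpha has a river port.\nBeta is a village."
print(best_passage(toy_article, ["river", "port"]))
```

Because the search is confined to one article, the cost scales with that article's length rather than with the whole dump, at the price of missing relevant passages elsewhere.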

Architecture

  • Static embedding table (V × D)
  • Aligned vocabulary index
  • Linear scoring
  • Pluggable decoder stage
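
The first three components can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the code in engine.py: the three-word vocabulary, the 2-dimensional vectors, mean pooling of query tokens, and reading "linear scoring" as a dot product against every row are all assumptions made for the example.

```python
import numpy as np

# Toy stand-ins (hypothetical); the real table is V x D = 734803 x 512.
vocab = ["cat", "dog", "car"]
token_to_row = {tok: i for i, tok in enumerate(vocab)}  # aligned vocabulary index
E = np.array(
    [[1.0, 0.0],
     [0.8, 0.6],
     [0.0, 1.0]],
    dtype=np.float32,
)  # static embedding table, one row per vocabulary entry


def embed_query(tokens):
    """Mean-pool the rows of known tokens (the pooling choice is an assumption)."""
    rows = [token_to_row[t] for t in tokens if t in token_to_row]
    if not rows:
        return np.zeros(E.shape[1], dtype=E.dtype)
    return E[rows].mean(axis=0)


def top_k(tokens, k=2):
    """Score every vocabulary row with one matrix-vector product."""
    q = embed_query(tokens)
    scores = E @ q
    order = np.argsort(-scores)[:k]
    return [(vocab[i], float(scores[i])) for i in order]


print(top_k(["cat"]))
```

No network is evaluated at query time: retrieval is a table lookup followed by a single matrix-vector product.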

Model Specs

Item                              Value
Vocabulary size (tokens)          734803
Embedding dim                     512
Storage format                    safetensors (FP16)
Total data size                   ~5.44 GB
Model data size                   ~728 MB
Wiki data size                    ~4.73 GB
Languages                         multilingual
Startup time                      ~7300 ms
Query latency (top 1 to top 5)    ~740 to ~22000 ms
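
As a sanity check on the figures above, the raw size of the embedding table follows from the vocabulary size, the embedding dimension, and 2 bytes per FP16 value. The variable names are illustrative only.

```python
vocab_size = 734_803      # token count from the specs
embedding_dim = 512       # embedding dimension from the specs
bytes_per_value = 2       # FP16 stores each value in 2 bytes

table_bytes = vocab_size * embedding_dim * bytes_per_value
table_mib = table_bytes / 2**20

print(table_bytes)          # 752438272
print(round(table_mib, 1))  # 717.6
```

That is about 752 MB in decimal units, or about 718 MiB, which is in the same range as the listed model data size. The listed figure may use a different unit convention or count additional files; the specs do not say.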

Quickstart

Tested on Python 3.10+

Unzip pipeowl1.10.2.zip

git clone https://huggingface.co/WangKaiLin/PipeOwl-1.10.2-tw-wiki-rag/
cd PipeOwl-1.10.2-tw-wiki-rag/

pip install numpy safetensors

python quickstart_wiki_merge.py

Example:

Input: bang dream

Output:

  • normalized query: BanG_Dream!
  • top title: BanG_Dream!
  • knowledge spread: a matched passage from the wiki article

See example.md for full examples.
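
The normalization step in this example can be pictured as a lookup from loosely typed aliases to a canonical title. The sketch below is hypothetical: the JSON layout (canonical title mapped to a list of aliases) and the helper names are assumptions, and the real schema of entity_alias.json may differ.

```python
import json

# Hypothetical alias data; the real entity_alias.json schema may differ.
ALIAS_JSON = '{"BanG_Dream!": ["bang dream"]}'


def _key(text):
    """Case-fold, treat underscores as spaces, and collapse repeated whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def build_alias_lookup(raw_json):
    """Map every normalized alias (and the canonical form itself) to its title."""
    lookup = {}
    for canonical, aliases in json.loads(raw_json).items():
        lookup[_key(canonical)] = canonical
        for alias in aliases:
            lookup[_key(alias)] = canonical
    return lookup


def normalize_query(query, lookup):
    """Return the canonical title for a known alias, else the query unchanged."""
    return lookup.get(_key(query), query)


lookup = build_alias_lookup(ALIAS_JSON)
print(normalize_query("Bang  Dream", lookup))  # BanG_Dream!
```

Unknown queries pass through untouched, so an incomplete alias file degrades to plain retrieval rather than failing.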

Repository Structure

PipeOwl-1.10.2-tw-wiki-rag/
├─ quickstart_wiki_merge.py # entry point
├─ wiki_retriever_merge.py # wiki data retrieval
├─ entity_layer.py # alias (synonym) layer logic
├─ tokenizer_priority.py # tokenizer logic
├─ engine.py # PipeOwl core module entry point
├─ entity_alias.json # alias dataset (incomplete)
├─ phrase_lexicon.txt # whole-phrase protection list
├─ tokenizer.json # core tokenizer
├─ pipeowl.safetensors # core embedding matrix storage
├─ zhwiki_clean.jsonl # wiki dataset
└─ wiki_index_merge/ 
   ├─ wiki_merge_meta.json 
   ├─ wiki_titles.txt 
   ├─ wiki_title_tokens.jsonl
   ├─ wiki_title_tokens_offsets.json
   ├─ wiki_token_to_title_ids.json
   └─ wiki_title_offsets.json
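
The file names under wiki_index_merge/ suggest an inverted index from tokens to title IDs. The sketch below shows one way such an index can drive merge-based title retrieval: gather the posting lists of the query tokens and rank titles by how many tokens they share with the query. The toy data, the ranking rule, and the name `retrieve_titles` are assumptions; the actual file formats and merge logic are not documented here.

```python
from collections import Counter

# Hypothetical toy data mirroring wiki_titles.txt and wiki_token_to_title_ids.json.
titles = ["BanG_Dream!", "Dream_Theater", "Big_Bang"]
token_to_title_ids = {
    "bang": [0, 2],
    "dream": [0, 1],
    "theater": [1],
    "big": [2],
}


def retrieve_titles(query_tokens, k=3):
    """Merge posting lists and rank titles by the number of matching query tokens."""
    hits = Counter()
    for token in query_tokens:
        for title_id in token_to_title_ids.get(token, []):
            hits[title_id] += 1
    # Sort by match count (descending), then by title id for a stable order.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [(titles[title_id], count) for title_id, count in ranked[:k]]


print(retrieve_titles(["bang", "dream"]))
```

Only titles that share at least one token with the query are touched, so the cost depends on posting-list lengths rather than on the total number of titles.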

License

Code and system components in this repository are released under the MIT License.

Included Wikipedia-derived retrieval data is subject to CC BY-SA 4.0. Please ensure attribution and share-alike compliance if redistributing derived wiki data.
