metadata
language: zh
tags:
  - embeddings
  - retrieval
  - numpy
  - transformer-free
license: mit

PipeOwl-1.0 (Geometric Embedding)

PipeOwl is a transformer-free geometric embedding package built on a static embedding field stored as NumPy arrays.

This repo provides:

  • L1_base_embeddings.npy: float32 (V, 1024) embedding table (unit-normalized)
  • L1_base_vocab.json: list of vocab strings aligned to embedding rows
  • delta_base_scalar.npy: float32 (V,) optional scalar bias field
  • minimal inference engine (engine.py) and usage script (quickstart.py)
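The file layout above can be sanity-checked with a few lines of NumPy. This is a minimal sketch, not the loading code from engine.py: the helper name `load_field` is hypothetical, and the demo writes a tiny synthetic stand-in (V=3, dimension 4) to a temporary directory so it runs without the real files. Only the file names, dtypes, and the unit-normalization property are taken from the list above.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_field(data_dir):
    """Load the embedding table, vocab, and scalar field, and check that they align.

    Hypothetical helper for illustration; engine.py may load the files differently.
    """
    data_dir = Path(data_dir)
    emb = np.load(data_dir / "L1_base_embeddings.npy")
    delta = np.load(data_dir / "delta_base_scalar.npy")
    with open(data_dir / "L1_base_vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)

    # One embedding row per vocab entry, one scalar per row.
    assert emb.ndim == 2 and emb.shape[0] == len(vocab)
    assert delta.shape == (emb.shape[0],)
    # Rows are documented as unit-normalized.
    assert np.allclose(np.linalg.norm(emb, axis=1), 1.0, atol=1e-3)
    return emb, vocab, delta


# Demo on synthetic stand-in data (NOT the real PipeOwl arrays).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    fake = np.random.default_rng(0).normal(size=(3, 4)).astype(np.float32)
    fake /= np.linalg.norm(fake, axis=1, keepdims=True)
    np.save(tmp / "L1_base_embeddings.npy", fake)
    np.save(tmp / "delta_base_scalar.npy", np.zeros(3, dtype=np.float32))
    (tmp / "L1_base_vocab.json").write_text(
        json.dumps(["雪", "鴞", "可愛"]), encoding="utf-8"
    )
    emb, vocab, delta = load_field(tmp)
```

With the real repository, the same helper would be pointed at the data/ directory; the table should then have shape (V, 1024).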

Attribution

The base embedding vectors were generated using BGE (Apache-2.0) via inference (model outputs). This repository does not redistribute any original BGE model weights.


Quickstart

pip install numpy
python quickstart.py

Or minimal usage:

from engine import PipeOwlEngine, PipeOwlConfig

engine = PipeOwlEngine(PipeOwlConfig())
q = engine.encode("雪鴞好可愛")

# use q for similarity / retrieval
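One way to use an encoded query for retrieval is a cosine-similarity lookup against the embedding table. The sketch below assumes that `engine.encode` returns a 1-D vector with the same dimension as the table rows, which this README does not state explicitly. The function `top_k` and the toy data are hypothetical and are not part of engine.py.

```python
import numpy as np


def top_k(query, table, vocab, k=5):
    """Return the k vocab entries whose rows are most cosine-similar to query.

    Hypothetical helper; assumes the rows of `table` are unit-normalized.
    """
    query = np.asarray(query, dtype=np.float32)
    # Normalize the query in case it is not already unit length.
    query = query / (np.linalg.norm(query) + 1e-12)
    # With unit-length rows, the dot product equals cosine similarity.
    scores = table @ query
    k = min(k, len(vocab))
    idx = np.argsort(-scores)[:k]
    return [(vocab[i], float(scores[i])) for i in idx]


# Toy stand-in for the real (V, 1024) table and vocab.
table = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
vocab = ["a", "b", "c"]
hits = top_k(np.array([1.0, 0.0]), table, vocab, k=2)
```

In the toy example the query matches row "a" exactly, and "c" comes second with a cosine of about 0.6.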

Files

  • data/L1_base_embeddings.npy: embedding table (float32, V×1024)
  • data/L1_base_vocab.json: vocab aligned with rows
  • data/delta_base_scalar.npy: scalar bias (float32, V)
  • engine.py: minimal runtime
  • quickstart.py: example script

Notes

No safetensors or pytorch_model.bin file is included because this model is distributed as a static NumPy embedding field.


Stress Test Results (Hard Retrieval Setting)

  • corpus size = 1200
  • eval size = 200
  • OOD ratio = 0.28

| Model   | In-domain MRR@10 | OOD MRR@10 |
|---------|------------------|------------|
| MiniLM  | 0.019            | 0.026      |
| BGE     | 0.026            | 0.009      |
| PipeOwl | 0.013            | 0.023      |

Note: This test uses a harder corpus and adversarial-style queries, so absolute scores are low for all three models.

See full experimental notes here: https://hackmd.io/@galaxy4552/BkpUEnTwbl
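For readers unfamiliar with the metric, MRR@10 averages the reciprocal rank of the first correct result within the top 10, counting 0 when it does not appear. The sketch below shows the standard definition under the assumption of exactly one relevant item per query; the evaluation protocol actually used for the table is described in the linked notes and may differ. The function name and the document IDs are hypothetical.

```python
def mrr_at_k(ranked_lists, gold, k=10):
    """Mean reciprocal rank, truncated at k.

    ranked_lists: one ranked list of candidate IDs per query.
    gold: the single relevant ID for each query (an assumption of this sketch).
    """
    total = 0.0
    for ranking, target in zip(ranked_lists, gold):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == target:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(gold)


# Toy example: hit at rank 1, hit at rank 2, and a miss.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
gold = ["d1", "d5", "dX"]
score = mrr_at_k(rankings, gold)  # (1 + 1/2 + 0) / 3 = 0.5
```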


pipeowl/
│
├─ README.md
├─ model_card.md
├─ LICENSE
│
├─ engine.py
├─ quickstart.py
│   
└─ data/
    ├─ L1_base_embeddings.npy
    ├─ delta_base_scalar.npy
    └─ L1_base_vocab.json