language: zh
tags:
  - embeddings
  - retrieval
  - numpy
  - transformer-free
license: mit

PipeOwl-1.0 (Geometric Embedding)

PipeOwl is a transformer-free geometric embedding package built on a static embedding field stored as NumPy arrays.

This repo provides:

L1_base_embeddings.npy: float32 (V, 1024) embedding table (unit-normalized)
L1_base_vocab.json: list of vocab strings aligned to embedding rows
delta_base_scalar.npy: float32 (V,) optional scalar bias field
minimal inference engine (engine.py) and usage script (quickstart.py)
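
The files listed above can be loaded and sanity-checked with plain NumPy. The following is a minimal sketch, not the code in engine.py: the helper names `load_field` and `max_norm_error` are hypothetical, and the default `data/` directory is assumed from the repository layout.

```python
import json
from pathlib import Path

import numpy as np


def load_field(data_dir="data"):
    """Load the embedding table, vocab list, and scalar bias, then check alignment.

    Hypothetical helper for illustration; engine.py may load the files differently.
    """
    data_dir = Path(data_dir)
    emb = np.load(data_dir / "L1_base_embeddings.npy")    # expected float32, shape (V, 1024)
    delta = np.load(data_dir / "delta_base_scalar.npy")   # expected float32, shape (V,)
    with open(data_dir / "L1_base_vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)                               # expected list of V strings

    # Row i of the table, entry i of the vocab, and bias i must describe the same token.
    if emb.ndim != 2 or emb.shape[0] != len(vocab) or delta.shape != (emb.shape[0],):
        raise ValueError("embedding rows, vocab entries, and bias values are not aligned")
    return emb, vocab, delta


def max_norm_error(emb):
    """Largest deviation of any row norm from 1.0 (near 0 for unit-normalized rows)."""
    return float(np.abs(np.linalg.norm(emb, axis=1) - 1.0).max())
```

Calling `load_field("data")` from the repository root should return the three aligned objects, and `max_norm_error(emb)` gives a quick check of the unit-normalization claim.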

Attribution

The base embedding vectors are outputs of BGE (Apache-2.0), obtained by running inference with that model. This repository does not redistribute any original BGE model weights.

Quickstart

pip install numpy
python quickstart.py

Or minimal usage:

from engine import PipeOwlEngine, PipeOwlConfig

engine = PipeOwlEngine(PipeOwlConfig())
q = engine.encode("雪鴞好可愛")

# use q for similarity / retrieval
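
One way to use `q` for retrieval is a cosine-similarity lookup against the embedding table. This is a sketch under assumptions: it treats `q` as a 1-D vector with the same dimension as the table rows (the return type of `engine.encode` is not documented here), the function name `top_k_cosine` is hypothetical, and the optional scalar bias is left out because this README does not say how the engine applies it.

```python
import numpy as np


def top_k_cosine(q, emb, vocab, k=5):
    """Return the k vocab entries whose embedding rows are most cosine-similar to q.

    Assumes emb rows are unit-normalized, so a dot product equals cosine similarity.
    """
    q = np.asarray(q, dtype=np.float32).ravel()
    q = q / (np.linalg.norm(q) + 1e-12)          # normalize the query defensively
    scores = emb @ q                              # shape (V,), one score per vocab row
    k = min(k, len(scores))
    if k <= 0:
        return []
    idx = np.argpartition(-scores, k - 1)[:k]     # unordered indices of the k best rows
    idx = idx[np.argsort(-scores[idx])]           # order those k by descending score
    return [(vocab[i], float(scores[i])) for i in idx]
```

With the arrays loaded from `data/`, `top_k_cosine(q, emb, vocab, k=10)` would return ten `(token, score)` pairs in descending order of similarity.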

Files

data/L1_base_embeddings.npy : embedding table (float32, V×1024)
data/L1_base_vocab.json : vocab aligned with rows
data/delta_base_scalar.npy : scalar bias (float32, V)
engine.py : minimal runtime
quickstart.py : example script

Notes

No safetensors / pytorch_model.bin is included because this model is distributed as a static NumPy embedding field.

Stress Test Results (Hard Retrieval Setting)

corpus size = 1200
eval size = 200
ood ratio = 0.28

Model	in-domain MRR@10	OOD MRR@10
MiniLM	0.019	0.026
BGE	0.026	0.009
PipeOwl	0.013	0.023

Note: This test uses a harder corpus and adversarial-style queries. Absolute scores are low for all three models because the setting is deliberately difficult.
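
For readers unfamiliar with the metric in the table, MRR@10 can be computed as sketched below. This is the standard definition under the assumption of exactly one relevant document per query. The actual evaluation protocol, including how OOD queries are scored, is described in the experimental notes rather than here, and the function name `mrr_at_k` is hypothetical.

```python
def mrr_at_k(ranked_lists, relevant_ids, k=10):
    """Mean reciprocal rank with cutoff k.

    ranked_lists: one ranked list of document ids per query (best first).
    relevant_ids: the single relevant document id for each query (an assumption).
    A query contributes 1/rank if its relevant id is in the top k, else 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant_ids):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

For example, three queries whose relevant documents appear at rank 2, at rank 1, and not at all contribute 1/2, 1, and 0, for a mean of 0.5.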

See full experimental notes here: https://hackmd.io/@galaxy4552/BkpUEnTwbl

pipeowl/
│
├─ README.md
├─ model_card.md
├─ LICENSE
│
├─ engine.py
├─ quickstart.py
│   
└─ data/
    ├─ L1_base_embeddings.npy
    ├─ delta_base_scalar.npy
    └─ L1_base_vocab.json