---
language:
- en
- de
- fr
- it
- es
- nl
- da
- sv
- "no"
- pl
license: apache-2.0
tags:
- token-classification
- ner
- product-search
- query-understanding
base_model: bltlab/queryner-bert-base-uncased
datasets:
- bltlab/queryner
- thepian/eco-products-ner-fixtures
pipeline_tag: token-classification
---

# queryner-eco-ner

Named entity recognition for product search queries. Identifies **brand**, **product category**, **product name**, and **origin** spans in free-text queries.

Fine-tuned from [bltlab/queryner-bert-base-uncased](https://huggingface.co/bltlab/queryner-bert-base-uncased), which was trained on Amazon ESCI queries. This model extends it with domain-specific vocabulary drawn from a European product database: brand names, multilingual product titles, and origin countries.

## Labels

The model predicts the full 17-type label set from the base QueryNER model. The four types most relevant to product search are:

| Label | HF tag | Example span |
|---|---|---|
| Brand | `B-creator` / `I-creator` | `Ecover`, `Dr. Bronner's` |
| Product category | `B-core_product_type` / `I-core_product_type` | `washing up liquid`, `shampoo` |
| Product name | `B-product_name` / `I-product_name` | `Skin Food`, `Men 48H Deodorant` |
| Origin | `B-origin` / `I-origin` | `Germany`, `Italy` |

All other QueryNER types (`modifier`, `department`, `UoM`, `color`, `material`, etc.) are preserved from the base model.
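The tags above follow the standard BIO scheme: `B-` opens a span, `I-` continues it, `O` marks tokens outside any entity. A minimal, pure-Python sketch of how token-level tags group into spans (the query and tags here are constructed for illustration, not model output):

```python
# Illustration of the BIO scheme used by the product-search labels.
# Query and tags are hand-written for this example, not model output.
query_tokens = ["ecover", "washing", "up", "liquid", "from", "belgium"]
bio_tags = ["B-creator", "B-core_product_type", "I-core_product_type",
            "I-core_product_type", "O", "B-origin"]

def spans_from_bio(tokens, tags):
    """Group contiguous B-/I- tags into (label, text) spans."""
    spans, label, span_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label:
                spans.append((label, " ".join(span_tokens)))
            label, span_tokens = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            span_tokens.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if label:
                spans.append((label, " ".join(span_tokens)))
            label, span_tokens = None, []
    if label:
        spans.append((label, " ".join(span_tokens)))
    return spans

print(spans_from_bio(query_tokens, bio_tags))
# [('creator', 'ecover'), ('core_product_type', 'washing up liquid'), ('origin', 'belgium')]
```

With `aggregation_strategy="simple"` (see Usage below), the pipeline performs this grouping for you and returns `entity_group` values without the `B-`/`I-` prefixes.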
## Usage

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="thepian/queryner-eco-ner",
    aggregation_strategy="simple",
)

results = ner("Ecover washing up liquid without palm oil")
# [{'entity_group': 'creator', 'word': 'Ecover', ...},
#  {'entity_group': 'core_product_type', 'word': 'washing up liquid', ...}]

results = ner("organic olive oil from Italy under €15")
# [{'entity_group': 'core_product_type', 'word': 'olive oil', ...},
#  {'entity_group': 'origin', 'word': 'Italy', ...}]
```

## Training data

20,203 examples from three sources:

| Source | Examples | Notes |
|---|---|---|
| [bltlab/queryner](https://huggingface.co/datasets/bltlab/queryner) | 9,140 | Amazon ESCI queries; all 17 label types |
| Local domain fixtures | ~1,063 | Hand-annotated product search queries (incl. substitute-frame fixtures) |
| Synthetic DB fixtures | ~10,000 | Template-generated from brand/category/product vocabulary; includes 1,000 substitute-frame (multilingual) examples |

Synthetic examples are generated by `generate_db_dataset.py` from a European product database. Brand names come from EU-registered brands; product names are extracted from all language variants stored in `product.name` (en, de, fr, it, es, nl, and others). Product names that exactly match English category strings are excluded to avoid a contradictory training signal.

## Label balance: product name vs. category

The two most commonly confused labels are `core_product_type` (product category) and `product_name` (specific named product). The model's only reliable cue for distinguishing them is positional: text following a known brand is a candidate for `product_name`, while a standalone noun phrase is typically `core_product_type`. This positional signal is structural, not lexical: "Dove shampoo" and "Dove Skin Food" look identical to the model at the template level.

### Why category dominates in training (~2:1 target)

Real product search queries are category-heavy by a large margin.
Most users type "shampoo", "olive oil", or "washing powder", not "Fuji Green Tea Refreshingly Hydrating Conditioner". Training data should approximate the inference-time distribution; over-representing `product_name` creates a mismatch that degrades category precision on the majority of queries.

The base model (bltlab/queryner-bert-base-uncased) was trained on Amazon ESCI queries, which are also category-heavy. The marginal value of additional `core_product_type` examples is lower than the marginal value of `product_name` examples, but collapsing to 1:1 risks the model labeling any noun phrase after a brand as `product_name`, including generic category words like "shampoo" or "washing up liquid".

**Current ratio: ~2.3:1 (core_product_type : product_name). Target: ~2:1.**

### Why going below 2:1 requires better data, not just more examples

Increasing `product_name` examples without addressing lexical quality introduces contradictory signal:

- A product named "Shampoo" and a category called "shampoo" become competing labels for the same string. The model cannot resolve this without knowing whether the token is generic or specific; that information is not present in the query.
- The category cross-reference filter (dropping product names that are exact English category matches) addresses the worst cases, but morphological variants ("Shampoos", "Crème") and multi-language overlaps remain.
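The cross-reference filter could be tightened to also catch simple morphological and accent variants. A sketch of that idea, under the assumption that a normalised-string comparison is acceptable (`normalize` and its crude trailing-`s` plural rule are illustrative, not the actual `generate_db_dataset.py` logic):

```python
import unicodedata

def normalize(s: str) -> str:
    """Lowercase, strip accents, and crudely singularize (drop a trailing 's')."""
    s = unicodedata.normalize("NFKD", s.lower())
    s = "".join(c for c in s if not unicodedata.combining(c))
    return s[:-1] if s.endswith("s") and len(s) > 3 else s

def collides_with_category(product_name: str, category_vocab: set) -> bool:
    """True if the product name reduces to the same string as any category."""
    return normalize(product_name) in {normalize(c) for c in category_vocab}

categories = {"shampoo", "olive oil", "washing up liquid", "creme"}

print(collides_with_category("Shampoos", categories))   # plural variant caught
print(collides_with_category("Crème", categories))      # accent variant caught
print(collides_with_category("Skin Food", categories))  # genuine product name kept
```

Multi-language category overlaps (a German category string colliding with a French product name, say) would still need per-language category vocabularies, which is one reason the filter currently stops at exact English matches.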
To move significantly below 2:1 safely, the `product_name` training data would need to satisfy:

| Requirement | Why |
|---|---|
| Lexically distinct from category vocabulary | Prevents the model learning a single label for identical strings |
| High word-count names (3+ tokens) | Single- and two-token product names are indistinguishable from short category slugs by surface form alone |
| Brand diversity | The positional cue (brand precedes product name) only generalises if many different brands are paired with many different product names; a narrow brand set leads to brand-specific memorisation |
| Multilingual coverage proportional to expected query mix | Training on English product names only means the model will underperform on French/German/Italian queries even though multilingual product names exist in the DB |
| Minimal repetition | A product name seen 20 times with the same brand drowns out the signal from rarer names |

Until those conditions are met, `product_name_ratio` should stay at 0.25–0.30, and the ~2:1 overall ratio should be maintained by generating more total synthetic examples rather than by increasing the ratio.

---

## Training procedure

- Base model: `bltlab/queryner-bert-base-uncased`
- Tokenizer: BERT WordPiece; subword tokens after the first in each word are masked (`-100`)
- Max sequence length: 128
- Label set: collected from training data (all 17 QueryNER types preserved)
- Optimiser: AdamW, weight decay 0.01, warmup ratio 0.1
- Segmented training: brand/product/origin first, then certification O-token signal at a lower LR

Typical segment configuration:

```
Segment 1: epochs=3, lr=3e-5  (base → domain)
Segment 2: epochs=2, lr=1e-5  (add cert O-token signal)
Segment 3: epochs=2, lr=5e-6  (product name ratio increase)
Segment 4: epochs=2, lr=5e-6  (substitute-frame + multilingual; brand F1 0.698 → 0.897)
```

## Evaluation

Evaluated on 63 held-out domain fixtures (39 general + 24 substitute-frame / multilingual) with exact and partial span matching.
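The distinction between the two matching modes can be illustrated with a simplified counter (a sketch only; the evaluation script's partial-credit rule may differ, e.g. in how it handles multiple overlaps):

```python
def true_positives(gold, pred, partial=False):
    """Count matches between gold and predicted (label, start, end) char spans.

    Exact mode: label and both boundaries must agree.
    Partial mode: label must agree and the character ranges must overlap.
    Each gold span can be matched at most once.
    """
    matched, tp = set(), 0
    for p_label, p_start, p_end in pred:
        for i, (g_label, g_start, g_end) in enumerate(gold):
            if i in matched or p_label != g_label:
                continue
            exact = (p_start, p_end) == (g_start, g_end)
            overlap = p_start < g_end and g_start < p_end
            if exact or (partial and overlap):
                matched.add(i)
                tp += 1
                break
    return tp

# "Ecover washing up liquid": gold brand + category spans; the predicted
# category span is truncated by a few characters.
gold = [("creator", 0, 6), ("core_product_type", 7, 24)]
pred = [("creator", 0, 6), ("core_product_type", 7, 20)]

print(true_positives(gold, pred, partial=False))  # only the brand matches exactly
print(true_positives(gold, pred, partial=True))   # overlap credit for the category
```

Precision, recall, and F1 then follow from `tp / len(pred)`, `tp / len(gold)`, and their harmonic mean, which is why the partial F1 column is always at least as high as the exact one.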
**Segment 4** (2 epochs, lr=5e-6, base = segment 3 checkpoint, 20,203 training examples incl. substitute-frame):

| Label | P (partial) | R (partial) | F1 (partial) | F1 (exact) |
|---|---|---|---|---|
| brand | 0.929 | 0.867 | **0.897** | **0.897** |
| product category | 0.895 | 0.962 | **0.927** | 0.891 |
| product name | 0.875 | 0.700 | 0.778 | 0.556 |
| origin | 1.000 | 0.917 | **0.957** | **0.957** |
| **overall** | **0.915** | **0.900** | **0.908** | 0.874 |

Key remaining gaps:

- `Dr. Bronner's` apostrophe: the tokenizer splits the `'`, so the span is predicted as `"dr. bronner ' s"`. Needs pre-tokenization normalization.
- Ecover brand false negatives (4 fixtures): the brand is underrepresented in the training vocabulary and is missed even in substitute-frame context.
- German origin `Deutschland` is not recognized; training uses English country names only.
- Umlaut span mismatch: `Spülmittel` is lowercased and accent-stripped to `spulmittel` by BERT WordPiece normalization.

## Limitations

- Extraction patterns are primarily English; avoidance frames in other languages (`ohne`, `sans`, `senza`) are not NER targets and are handled by a separate parser
- Multilingual product names are included in training, but evaluation is English-only
- Origin recognition covers ~13 European countries drawn from product records; global coverage is partial
- Barcode and price extraction are not NER tasks; they are handled by a dedicated parser

## Citation

If you use this model, please cite the base model:

```
@misc{queryner,
  author    = {Björklund, Love and Ljunglöf, Peter},
  title     = {QueryNER: Named Entity Recognition for Product Search Queries},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/bltlab/queryner-bert-base-uncased}
}
```