Domain recognition bias: <a> tags preserved only for Amazon URLs

#37

by kazutoshidayon - opened 8 days ago

Summary

When asked to reproduce HTML containing multiple <a> tags that
point to different domains, MiMo V2 consistently preserves only
the links to amzn.to (Amazon's URL shortener) and strips the <a>
tags from links to other domains, even when the prompt explicitly
instructs it to preserve the HTML exactly as written.

Reproduction

Given a prompt containing this HTML block:

<div>
  <p><a href="https://example1.com/path-a">Link Text 1</a></p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p><a href="https://example2.com/path-b">Link Text 3</a></p>
  <p><a href="https://example3.net/redirect/xyz">Link Text 4</a></p>
  <p><a href="https://example4.org/svt/ref?id=xxxxxx">Link Text 5</a></p>
</div>

Output:

<div>
  <p>Link Text 1</p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p>Link Text 3</p>
  <p>Link Text 4</p>
  <p>Link Text 5</p>
</div>

Only the amzn.to URL retains its <a> wrapper.
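
To make the comparison above repeatable, here is a minimal sketch that lists which hrefs from the prompt HTML are missing from the model output. It uses only Python's standard-library html.parser. The names AnchorCollector, collect_hrefs and stripped_links are hypothetical helpers introduced for illustration, and the model call itself is not shown: OUTPUT_HTML is simply the output pasted from this report.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names and passes attrs as (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_hrefs(html):
    """Return all <a href> values found in an HTML string."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def stripped_links(prompt_html, output_html):
    """Return hrefs present in the prompt HTML but absent from the output."""
    kept = set(collect_hrefs(output_html))
    return [href for href in collect_hrefs(prompt_html) if href not in kept]


# The HTML from the reproduction in this report.
PROMPT_HTML = """<div>
  <p><a href="https://example1.com/path-a">Link Text 1</a></p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p><a href="https://example2.com/path-b">Link Text 3</a></p>
  <p><a href="https://example3.net/redirect/xyz">Link Text 4</a></p>
  <p><a href="https://example4.org/svt/ref?id=xxxxxx">Link Text 5</a></p>
</div>"""

OUTPUT_HTML = """<div>
  <p>Link Text 1</p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p>Link Text 3</p>
  <p>Link Text 4</p>
  <p>Link Text 5</p>
</div>"""

for href in stripped_links(PROMPT_HTML, OUTPUT_HTML):
    print("stripped:", href)
```

For the example above, this reports the four non-Amazon URLs as stripped. The check only compares href values, so it would not notice changes to link text or other attributes.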

Tested Variations

Generic short URLs (bit.ly/xxx) — stripped
Custom short domain — stripped
Long URLs with query parameters — stripped
URLs of varying length (20-80 chars) — stripped
Amazon long URLs (amazon.co.jp/dp/XXX?tag=...) — preserved
amzn.to/xxx — preserved

Only amzn.to and amazon.* domains are reliably preserved.
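
One way to tally variations like these is to group each tested URL by whether its host looks like an Amazon domain. The sketch below is an assumption-laden illustration, not the procedure used for this report: is_amazon_host and preservation_by_group are hypothetical helpers, the host check is a rough heuristic (amzn.to, or a host whose first label after an optional "www." is "amazon"), and the RESULTS entries are placeholder URLs modelled on the variations listed above.

```python
from urllib.parse import urlsplit


def is_amazon_host(url):
    """Rough heuristic: amzn.to, or a host of the form [www.]amazon.<tld>."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host == "amzn.to" or host.split(".")[0] == "amazon"


def preservation_by_group(results):
    """Tally (preserved, total) per group from (url, preserved) pairs."""
    counts = {"amazon": [0, 0], "other": [0, 0]}
    for url, preserved in results:
        group = "amazon" if is_amazon_host(url) else "other"
        counts[group][1] += 1
        if preserved:
            counts[group][0] += 1
    return {group: tuple(pair) for group, pair in counts.items()}


# Placeholder observations: (URL, whether its <a> tag survived).
RESULTS = [
    ("https://amzn.to/abc123", True),
    ("https://www.amazon.co.jp/dp/XXX", True),
    ("https://bit.ly/xxx", False),
    ("https://example1.com/path-a", False),
]

print(preservation_by_group(RESULTS))
```

The heuristic would also match an unrelated host such as amazon.example.com, so a stricter check would be needed for anything beyond a quick summary.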

Cross-Model Confirmation

Also reproduced in:

mistralai/mistral-small-creative
mistralai/mistral-small-3.2-24b-instruct

This suggests the bias is not specific to one vendor and may be
a general property of mid-tier LLMs trained on web crawl data.
A related discussion is filed at:
https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506/discussions/42

Suspected Cause

Likely training data bias: Amazon URLs probably appear with
disproportionate frequency in web crawl datasets. The model may
have learned a strong prior of "amzn.to → wrap in <a>" without
learning the same prior for other domains.

Why This Matters

This breaks use cases that require preserving arbitrary <a> tags:

E-commerce content that links to non-Amazon platforms
Internal docs with company-specific URLs
Multi-source citations
Localized content using region-specific platforms

Suggested Improvements

Augment training data with diverse <a> tag examples across
many domains
Increase the weight of explicit HTML preservation instructions
Add synthetic data to break the Amazon-specific prior
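
As a rough illustration of the synthetic-data idea, the sketch below generates prompt/target pairs in which every link must be reproduced verbatim, drawing hosts from a varied pool. Everything here is an assumption for illustration: the DOMAINS list (reserved example domains), the prompt wording, the dictionary format, and the helper name make_pair do not reflect any vendor's actual training pipeline.

```python
import random
from html import escape

# Hypothetical pool of hosts; a real dataset would need far more variety.
DOMAINS = [
    "example.com",
    "example.net",
    "example.org",
    "shop.example",
    "docs.example",
]

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def make_pair(rng, n_links=4):
    """Build one prompt/target pair where the target equals the input HTML."""
    lines = ["<div>"]
    for i in range(1, n_links + 1):
        domain = rng.choice(DOMAINS)
        path = "".join(rng.choices(ALPHABET, k=8))
        href = escape(f"https://{domain}/{path}", quote=True)
        lines.append(f'  <p><a href="{href}">Link Text {i}</a></p>')
    lines.append("</div>")
    html = "\n".join(lines)
    prompt = "Reproduce the following HTML exactly as written:\n\n" + html
    return {"prompt": prompt, "target": html}


rng = random.Random(0)  # fixed seed so the generated pairs are reproducible
pair = make_pair(rng)
print(pair["prompt"])
```

Because the target is identical to the HTML embedded in the prompt, any output that drops an <a> tag differs from the target regardless of the domain involved.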

Severity

Medium-High: blocks generation of content that contains non-Amazon links.

Happy to provide more reproduction materials if needed.
