Domain recognition bias: <a> tags preserved only for Amazon URLs

#37
by kazutoshidayon - opened

Summary

When asked to reproduce HTML containing multiple <a> tags that
point to different domains, MiMo V2 consistently keeps only the
links to Amazon domains (for example amzn.to, Amazon's URL
shortener) and strips the <a> tags from links to every other
domain, even when the prompt explicitly says to preserve the
HTML exactly as written.

Reproduction

Given a prompt containing this HTML block:

<div>
  <p><a href="https://example1.com/path-a">Link Text 1</a></p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p><a href="https://example2.com/path-b">Link Text 3</a></p>
  <p><a href="https://example3.net/redirect/xyz">Link Text 4</a></p>
  <p><a href="https://example4.org/svt/ref?id=xxxxxx">Link Text 5</a></p>
</div>

Output:

<div>
  <p>Link Text 1</p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p>Link Text 3</p>
  <p>Link Text 4</p>
  <p>Link Text 5</p>
</div>

Only the amzn.to URL retains its <a> wrapper.
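
For anyone who wants to check this without diffing by eye, here is a minimal sketch that lists which links lost their <a> wrapper. It uses only Python's standard html.parser module; the names AnchorCollector, collect_hrefs and stripped_links are mine and not part of any model tooling, and the two constants simply repeat the prompt and output shown above.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> start tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def collect_hrefs(html):
    """Return all <a href> values in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def stripped_links(prompt_html, output_html):
    """Return hrefs present in the prompt but absent from the output."""
    kept = set(collect_hrefs(output_html))
    return [h for h in collect_hrefs(prompt_html) if h not in kept]


# The HTML block from the reproduction above.
PROMPT_HTML = """<div>
  <p><a href="https://example1.com/path-a">Link Text 1</a></p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p><a href="https://example2.com/path-b">Link Text 3</a></p>
  <p><a href="https://example3.net/redirect/xyz">Link Text 4</a></p>
  <p><a href="https://example4.org/svt/ref?id=xxxxxx">Link Text 5</a></p>
</div>"""

# The model output reported above.
OUTPUT_HTML = """<div>
  <p>Link Text 1</p>
  <p><a href="https://amzn.to/abc123">Link Text 2</a></p>
  <p>Link Text 3</p>
  <p>Link Text 4</p>
  <p>Link Text 5</p>
</div>"""

if __name__ == "__main__":
    for href in stripped_links(PROMPT_HTML, OUTPUT_HTML):
        print("stripped:", href)
```

The comparison is by href only, so it would not notice a link whose text changed while the tag survived.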

Tested Variations

  • Generic short URLs (bit.ly/xxx): stripped
  • Custom short domain: stripped
  • Long URLs with query parameters: stripped
  • URLs of varying length (20-80 chars): stripped
  • Amazon long URLs (amazon.co.jp/dp/XXX?tag=...): preserved
  • amzn.to/xxx: preserved

Only amzn.to and amazon.* domains are reliably preserved.
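
To test whether a given run fits this "Amazon only" pattern, a rough sketch like the following could be applied to the kept and stripped links. The host check is my own heuristic (amzn.to, or any hostname with an "amazon" label), so it would also match something like amazon.example.com; treat it as an assumption rather than a proper domain classifier.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased hostname; scheme-less input such as 'bit.ly/xxx' is accepted."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit treat the first segment as the host
    return (urlsplit(url).hostname or "").lower()


def is_amazon_host(host):
    """Heuristic (assumption): amzn.to, or any host containing an 'amazon' label."""
    return host == "amzn.to" or "amazon" in host.split(".")


def matches_amazon_only_pattern(kept, stripped):
    """True if every kept link is on an Amazon host and no stripped link is."""
    return (all(is_amazon_host(host_of(u)) for u in kept)
            and not any(is_amazon_host(host_of(u)) for u in stripped))
```

A run where any non-Amazon link survives, or any Amazon link is stripped, makes matches_amazon_only_pattern return False, which is a quick way to spot counterexamples.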

Cross-Model Confirmation

Also reproduced in:

  • mistralai/mistral-small-creative
  • mistralai/mistral-small-3.2-24b-instruct

This suggests the bias is not vendor-specific and may be common
to mid-tier LLMs trained on web crawl data. Related discussion
filed at:
https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506/discussions/42

Suspected Cause

Training data bias: Amazon URLs appear with disproportionate
frequency in web crawl datasets. The model learned a strong
prior of "amzn.to → wrap in <a>" but lacks the same prior
for other domains.

Why This Matters

Breaks use cases requiring preservation of arbitrary <a> tags:

  • E-commerce content with non-Amazon platforms
  • Internal docs with company-specific URLs
  • Multi-source citations
  • Localized content using region-specific platforms

Suggested Improvements

  1. Augment training data with diverse <a> tag examples across
    many domains
  2. Increase the weight given to explicit HTML preservation
    instructions
  3. Add synthetic data to break the Amazon-specific prior
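
As a rough illustration of items 1 and 3, synthetic "reproduce exactly" pairs could be generated along these lines. This is only a sketch under my own assumptions: the host and path pools are hypothetical placeholders on reserved example domains, the prompt wording is invented, and a real dataset would need far more variety in markup and domains.

```python
import random

# Hypothetical placeholder hosts on reserved example domains, plus amzn.to so
# that Amazon links still appear alongside the others.
HOSTS = ["shop.example", "docs.example.com", "news.example.net",
         "go.example.org", "amzn.to"]
PATHS = ["a", "item/42", "redirect/xyz", "ref?id=abc"]


def make_pair(rng, n_links=4):
    """Build one (prompt, target) pair whose target is the HTML block verbatim."""
    rows = []
    for i in range(1, n_links + 1):
        url = f"https://{rng.choice(HOSTS)}/{rng.choice(PATHS)}"
        rows.append(f'  <p><a href="{url}">Link Text {i}</a></p>')
    html = "<div>\n" + "\n".join(rows) + "\n</div>"
    prompt = "Reproduce the following HTML exactly as written:\n\n" + html
    return prompt, html


def make_dataset(n_pairs, seed=0):
    """Generate n_pairs pairs deterministically from the given seed."""
    rng = random.Random(seed)
    return [make_pair(rng) for _ in range(n_pairs)]
```

Each target keeps every <a> tag regardless of domain, which is the behaviour the training signal is meant to reinforce.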

Severity

Medium-High: blocks content generation with non-Amazon links.

Happy to provide more reproduction materials if needed.
