Domain recognition bias: <a> tags preserved only for Amazon URLs
Summary
When asked to reproduce HTML containing multiple <a> tags with
different URL domains, MiMo V2 consistently preserves only URLs
from amzn.to (Amazon's URL shortener) and strips <a> tags
from URLs of all other domains, even when the prompt explicitly
instructs it to preserve the HTML exactly as written.
Reproduction
Given a prompt containing this HTML block:
<div>
<p><a href="https://example1.com/path-a">Link Text 1</a></p>
<p><a href="https://amzn.to/abc123">Link Text 2</a></p>
<p><a href="https://example2.com/path-b">Link Text 3</a></p>
<p><a href="https://example3.net/redirect/xyz">Link Text 4</a></p>
<p><a href="https://example4.org/svt/ref?id=xxxxxx">Link Text 5</a></p>
</div>
Output:
<div>
<p>Link Text 1</p>
<p><a href="https://amzn.to/abc123">Link Text 2</a></p>
<p>Link Text 3</p>
<p>Link Text 4</p>
<p>Link Text 5</p>
</div>
Only the amzn.to URL retains its <a> wrapper.
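For anyone trying to confirm the behavior, a minimal check along these lines diffs which hrefs survive the round trip. This is not part of the original report; `extract_hrefs` is just an illustrative helper, and the two strings are the input/output from the reproduction above:

```python
import re

def extract_hrefs(html: str) -> set:
    """Collect every href value from <a> tags in an HTML snippet."""
    return set(re.findall(r'<a\s+href="([^"]+)"', html))

model_input = '''<div>
<p><a href="https://example1.com/path-a">Link Text 1</a></p>
<p><a href="https://amzn.to/abc123">Link Text 2</a></p>
<p><a href="https://example2.com/path-b">Link Text 3</a></p>
</div>'''

model_output = '''<div>
<p>Link Text 1</p>
<p><a href="https://amzn.to/abc123">Link Text 2</a></p>
<p>Link Text 3</p>
</div>'''

# URLs present in the input but missing from the output were stripped.
stripped = extract_hrefs(model_input) - extract_hrefs(model_output)
print(sorted(stripped))
# → ['https://example1.com/path-a', 'https://example2.com/path-b']
```

Only the amzn.to link survives the diff, matching the transcript above.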
Tested Variations
- Generic short URLs (bit.ly/xxx) → stripped
- Custom short domain → stripped
- Long URLs with query parameters → stripped
- URLs of varying length (20-80 chars) → stripped
- Amazon long URLs (amazon.co.jp/dp/XXX?tag=...) → preserved
- amzn.to/xxx → preserved

Only amzn.to and amazon.* domains are reliably preserved.
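The variations above can be driven by a small harness like the sketch below. `build_prompt` and `tag_preserved` are illustrative names I made up, not any model's API; the actual model call goes between the two steps, and the URLs are stand-ins for the tested categories:

```python
# One candidate URL per tested variation; all are placeholder values.
TEST_URLS = [
    "https://bit.ly/3xYz123",                          # generic shortener
    "https://go.example-shop.com/ab",                  # custom short domain
    "https://example.org/p?id=123&ref=xyz",            # long URL, query params
    "https://amzn.to/abc123",                          # Amazon shortener
    "https://www.amazon.co.jp/dp/B000000000?tag=foo",  # Amazon long URL
]

def build_prompt(url: str) -> str:
    """One-link reproduction prompt for a single candidate URL."""
    return (
        "Reproduce the following HTML exactly, preserving all tags:\n"
        f'<p><a href="{url}">Link Text</a></p>'
    )

def tag_preserved(model_output: str, url: str) -> bool:
    """True if the model's output still wraps the URL in an <a> tag."""
    return f'href="{url}"' in model_output

# Feed each prompt to the model under test, then score the reply
# with tag_preserved(reply, url).
prompts = [build_prompt(u) for u in TEST_URLS]
```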
Cross-Model Confirmation
Also reproduced in:
- mistralai/mistral-small-creative
- mistralai/mistral-small-3.2-24b-instruct
This suggests the bias is not vendor-specific but a general
property of mid-tier LLMs trained on web crawl data. Related
discussion filed at:
https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506/discussions/42
Suspected Cause
Training data bias: Amazon URLs appear with disproportionate
frequency in web crawl datasets. The model learned a strong
prior of "amzn.to → wrap in <a>" but lacks the same prior
for other domains.
Why This Matters
Breaks use cases requiring preservation of arbitrary <a> tags:
- E-commerce content with non-Amazon platforms
- Internal docs with company-specific URLs
- Multi-source citations
- Localized content using region-specific platforms
Suggested Improvements
- Augment training data with diverse <a> tag examples across many domains
- Increase the weight of explicit HTML-preservation instructions
- Add synthetic data to break the Amazon-specific prior
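The synthetic-data suggestion could look something like this sketch: emit instruction-following pairs whose targets preserve <a> tags across many domains, so no single domain dominates the signal. All domain names, anchor texts, and the pair format are hypothetical:

```python
import itertools

# Placeholder domains and anchor texts; swap in a real diverse pool.
DOMAINS = ["example-shop.io", "docs.internal.test", "news.example.org",
           "short.example", "store.example.co.jp"]
ANCHOR_TEXTS = ["Read more", "Product page", "Source", "Details"]

def make_pair(domain: str, text: str, n: int) -> dict:
    """Build one prompt/completion pair whose target keeps the <a> tag."""
    html = f'<p><a href="https://{domain}/item/{n}">{text}</a></p>'
    return {
        "prompt": f"Reproduce this HTML exactly:\n{html}",
        "completion": html,  # target preserves the tag regardless of domain
    }

samples = [make_pair(d, t, i)
           for i, (d, t) in enumerate(itertools.product(DOMAINS, ANCHOR_TEXTS))]
print(len(samples))
# → 20
```

Each completion keeps its <a> wrapper no matter the domain, which is exactly the behavior the model currently reserves for amzn.to.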
Severity
Medium-High: blocks content generation with non-Amazon links.
Happy to provide more reproduction materials if needed.