| |
| <div class="intro-brief" style="--intro-rgb: 255, 71, 64"> |
| <span class="intro-token" style="--a:0.56">Want</span><span class="intro-token" style="--a:0.53"> key</span><span class="intro-token" style="--a:0.29"> points</span><span class="intro-token" style="--a:0.31"> at</span><span class="intro-token" style="--a:0.09"> a</span><span class="intro-token" style="--a:0.00"> glance</span><span class="intro-token" style="--a:0.04">?</span><span class="intro-token" style="--a:0.26"> Or</span><span class="intro-token" style="--a:0.31"> simply</span><span class="intro-token" style="--a:0.29"> curious</span><span class="intro-token" style="--a:0.03"> about</span><span class="intro-token" style="--a:0.08"> the</span><span class="intro-token" style="--a:0.33"> information</span><span class="intro-token" style="--a:0.68">-the</span><span class="intro-token" style="--a:0.02">oret</span><span class="intro-token" style="--a:0.00">ic</span><span class="intro-token" style="--a:0.29"> nature</span><span class="intro-token" style="--a:0.00"> of</span><span class="intro-token" style="--a:0.31"> language</span><span class="intro-token" style="--a:0.19">?</span><br><br><span class="intro-token" style="--a:0.32">Try</span><span class="intro-token" style="--a:0.47"> Info</span><span class="intro-token" style="--a:0.70"> Highlight</span><span class="intro-token" style="--a:0.17">.</span><span class="intro-token" style="--a:0.06"> It</span><span class="intro-token" style="--a:0.25"> uses</span><span class="intro-token" style="--a:0.34"> large</span><span class="intro-token" style="--a:0.02"> language</span><span class="intro-token" style="--a:0.00"> models</span><span class="intro-token" style="--a:0.02"> to</span><span class="intro-token" style="--a:0.23"> analyze</span><span class="intro-token" style="--a:0.14"> text</span><span class="intro-token" style="--a:0.37"> information</span><span class="intro-token" style="--a:0.19"> density</span><span class="intro-token" style="--a:0.05"> and</span><span class="intro-token" style="--a:0.34"> 
visual</span><span class="intro-token" style="--a:0.01">izes</span><span class="intro-token" style="--a:0.39"> where</span><span class="intro-token" style="--a:0.08"> the</span><span class="intro-token" style="--a:0.26"> important</span><span class="intro-token" style="--a:0.13"> parts</span><span class="intro-token" style="--a:0.05"> are</span><span class="intro-token" style="--a:0.08">.</span><br><br><span class="intro-token" style="--a:0.17">The</span><span class="intro-token" style="--a:0.40"> color</span><span class="intro-token" style="--a:0.17"> intensity</span><span class="intro-token" style="--a:0.07"> of</span><span class="intro-token" style="--a:0.06"> each</span><span class="intro-token" style="--a:0.27"> token</span><span class="intro-token" style="--a:0.10"> indicates</span><span class="intro-token" style="--a:0.07"> how</span><span class="intro-token" style="--a:0.04"> much</span><span class="intro-token" style="--a:0.03"> information</span><span class="intro-token" style="--a:0.03"> it</span><span class="intro-token" style="--a:0.09"> carries</span><span class="intro-token" style="--a:0.04">.</span><span class="intro-token" style="--a:0.39"> Try</span><span class="intro-token" style="--a:0.04"> it</span><span class="intro-token" style="--a:0.12"> yourself</span><span class="intro-token" style="--a:0.21">!</span> |
| </div> |
|
|
| |
| <details class="intro-more"> |
| <summary> |
| <span class="intro-summary-when-closed">Learn more</span> |
| <span class="intro-summary-when-open">Hide</span> |
| </summary> |
|
|
| |
| <div class="intro-block"> |
| <h4>Intuitive Understanding of Information</h4> |
| <p>From a linguistic perspective, the information a word carries reflects its novelty, surprise, or importance. Words that |
| are harder to predict from context typically carry more information. A simple example: compare "This morning I opened the door and saw a UFO." |
| with "This morning I opened the door and saw a cat." Here "UFO" is much harder to predict than "cat", so it carries more information.</p> |
| </div> |
|
|
| |
| <div class="intro-block intro-technical"> |
| <h4>Information-Theoretic Perspective</h4> |
| <p>In our implementation, the information content of each token comes from how difficult it is for the LLM to |
| predict that token from left to right.</p> |
| <p> |
| From an information-theoretic perspective, this can be expressed as the conditional information of a token |
| given the model and the preceding context: |
| </p> |
| <pre> |
| Information of tokenᵢ in a text = -log₂P(tokenᵢ | model, token₀, …, tokenᵢ₋₁) |
| </pre> |
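<p>The formula above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the function names are hypothetical, and the probabilities for "cat" and "UFO" are made-up values standing in for what a language model might assign.</p>

```python
import math


def token_information(prob: float) -> float:
    """Information, in bits, of a token the model predicted with probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1 / p) is the same quantity as -log2(p).
    return math.log2(1.0 / prob)


def text_information(probs):
    """Per-token information for a sequence of conditional probabilities."""
    return [token_information(p) for p in probs]


# Hypothetical probabilities, for illustration only.
cat_bits = token_information(0.05)    # a fairly predictable continuation
ufo_bits = token_information(0.0001)  # a very surprising continuation
print(f"cat: {cat_bits:.2f} bits, UFO: {ufo_bits:.2f} bits")
```

<p>The lower the probability the model gives a token, the more bits it carries; a token predicted with probability 0.5 carries exactly 1 bit.</p>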
| <p>The core assumption behind Info Highlight is that this conditional information aligns with what human readers |
| subjectively perceive as novelty, surprise, and potential importance. |
| </p> |
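<p>To display this, each token's information has to be turned into a color intensity. How Info Highlight does that mapping is not described here, so the following is a minimal sketch under stated assumptions: a linear scale clipped at a hypothetical <code>max_bits</code> of 16, written into the <code>--a</code> custom property seen on the <code>intro-token</code> spans of this page. The function names are illustrative.</p>

```python
import html


def information_to_alpha(bits: float, max_bits: float = 16.0) -> float:
    """Map information in bits to an intensity in [0, 1] (assumed linear, clipped)."""
    if max_bits <= 0:
        raise ValueError("max_bits must be positive")
    return min(max(bits / max_bits, 0.0), 1.0)


def render_token(token: str, bits: float, max_bits: float = 16.0) -> str:
    """Wrap one token in a span whose --a property encodes its intensity."""
    alpha = information_to_alpha(bits, max_bits)
    # Escape the token so characters like "<" cannot break the markup.
    return (
        f'<span class="intro-token" style="--a:{alpha:.2f}">'
        f"{html.escape(token)}</span>"
    )


print(render_token(" UFO", 8.0))
```

<p>Clipping keeps a single extremely surprising token from washing out the rest of the text; a different scale (for example, normalizing by the maximum within each document) would be an equally reasonable choice.</p>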
| </div> |
|
|
| |
| <div class="intro-block"> |
| <h4>Ideal vs Reality</h4> |
| <p> |
| For an ideal model whose knowledge and contextual understanding matched the reader's, the evaluation |
| would align perfectly with the reader's subjective perception. |
| </p> |
| <p>The gap between current results and reader perception therefore comes mainly from two sources:</p> |
| <ul> |
| <li><strong>Model capability vs human reader:</strong> The model's knowledge and understanding may fall short of |
| the reader's, or exceed them. Imagine comparing a state-of-the-art LLM with a ten-year-old reader.</li> |
| <li><strong>Model context vs human reader:</strong> The model's only context is the text it has read so far, far less |
| than what the reader brings. Info Highlight uses base models without instruction tuning or prompts (which actually |
| gives the best results).</li> |
| </ul> |
| <p>The good news is that LLMs are improving quickly: current analysis results already reflect mainstream |
| readers' subjective perception to some extent, and can be used to gauge an article's information content and |
| speed up reading.</p> |
| </div> |
|
|
| |
| <div class="intro-block"> |
| <h4>Tribute</h4> |
| <p>Info Highlight is built on the classic project <a href="http://gltr.io" target="_blank" rel="noopener">GLTR.io</a>, |
| developed by Hendrik Strobelt et al. in 2019. GLTR was a web demo that pioneered the use of GPT-2 prediction |
| probabilities to detect generated text.</p> |
| <p>However, Info Highlight is not meant to detect AI text, but to evaluate the "information quality" of text.</p> |
| </div> |
|
|
| |
| <div class="intro-block intro-faq"> |
| <h4>FAQ</h4> |
|
|
| <p><strong>Is it an AI text detector?</strong></p> |
| <p>No.</p> |
| <p>When we say we dislike AI text, what we actually dislike is low-quality text: low-quality human-written text bothers us |
| just as much, while high-quality AI-generated content does not. So the key is the "information quality" of the text. |
| Info Highlight aims to detect "information quality" rather than "AI signs", though it can be used to flag |
| AI-generated nonsense that carries no information.</p> |
|
|
| <p><strong>What LLM is currently used?</strong></p> |
| <p>Currently the open-source <strong>Qwen3-0.6B/1.7B/4B/14B-Base</strong> models are used. Of the models the author has tested, the 4B model gives |
| results quite close to most people's subjective perception (note that |
| a larger model does not necessarily agree better with the reader's subjective perception). When |
| hardware is limited, the 0.6B/1.7B models are used; they perform slightly worse than 4B (the information |
| content differs by no more than about 15%), but the trend is similar.</p> |
|
|
| <p><strong>Why does information content affect text quality?</strong></p> |
| <p>Low information content means the LLM can easily predict the text from context. If even a machine can predict it, |
| how important can it be? Conversely, high information content means the LLM has difficulty predicting the text |
| from context. Assuming it is not a mistake, such text represents key information the author wants to convey |
| that the machine does not already know.</p> |
| </div> |
| </details> |