| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>I Woke Up To NaN And Now I Am Dead Inside | TinyMemoryLM</title> | |
| <link rel="stylesheet" href="bluesheet.css"> | |
| <link rel="preconnect" href="https://fonts.googleapis.com"> | |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> | |
| <link href="https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono&display=swap" rel="stylesheet"> | |
| <style> | |
| :root { | |
| --blue-900: #000000; | |
| --blue-800: #0a0a0a; | |
| --blue-700: #111111; | |
| --blue-600: #1a1a1a; | |
| --blue-500: #333333; | |
| --blue-400: #555555; | |
| --blue-300: #777777; | |
| --blue-200: #888888; | |
| --blue-100: #aaaaaa; | |
| --white: #ffffff; | |
| --white-soft: #f5f5f5; | |
| --white-muted: #e0e0e0; | |
| --grid-line: rgba(255, 255, 255, 0.03); | |
| --grid-line-major: rgba(255, 255, 255, 0.06); | |
| --accent: #ededed; | |
| --accent-muted: #888888; | |
| --font-sans: 'Geist', -apple-system, BlinkMacSystemFont, sans-serif; | |
| --font-mono: 'Geist Mono', 'SF Mono', 'Fira Code', monospace; | |
| --container-max: 1100px; | |
| } | |
| * { box-sizing: border-box; margin: 0; padding: 0; } | |
| html { font-size: 16px; scroll-behavior: smooth; } | |
| body { font-family: var(--font-sans); background: var(--blue-900); color: var(--white-muted); line-height: 1.7; -webkit-font-smoothing: antialiased; } | |
| a { color: var(--white); text-decoration: none; transition: color 0.15s ease; } | |
| a:hover { color: var(--accent); } | |
| .container { max-width: var(--container-max); margin: 0 auto; padding: 0 24px; } | |
| nav { position: fixed; top: 0; left: 0; right: 0; z-index: 100; background: rgba(0, 0, 0, 0.85); backdrop-filter: blur(12px); border-bottom: 1px solid var(--blue-600); padding: 16px 0; } | |
| nav .container { display: flex; justify-content: space-between; align-items: center; } | |
| .nav-brand { font-size: 18px; font-weight: 600; color: var(--white); display: flex; align-items: center; gap: 8px; } | |
| .nav-brand span { color: var(--accent); } | |
| .nav-links { display: flex; gap: 32px; } | |
| .nav-links a { font-size: 14px; font-weight: 500; color: var(--blue-200); } | |
| .nav-links a:hover { color: var(--white); } | |
| .post { padding: 140px 0 80px; } | |
| .post-back { display: inline-block; color: var(--blue-200); font-size: 14px; margin-bottom: 32px; } | |
| .post-back:hover { color: var(--accent); } | |
| .post-back::before { content: '← '; } | |
| .post-meta { display: flex; gap: 12px; margin-bottom: 20px; } | |
| .post-date { font-size: 13px; color: var(--blue-200); font-family: var(--font-mono); } | |
| .post-tag { font-size: 11px; font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; color: var(--white); background: rgba(255, 255, 255, 0.08); padding: 4px 10px; border-radius: 4px; } | |
| .post h1 { font-size: 36px; font-weight: 700; color: var(--white); margin-bottom: 32px; line-height: 1.2; letter-spacing: -0.02em; } | |
| .post-body p { font-size: 17px; line-height: 1.8; margin-bottom: 24px; color: var(--blue-200); } | |
| .post-body p:first-of-type { font-size: 20px; color: var(--white-muted); } | |
| .post-body h2 { font-size: 24px; font-weight: 600; color: var(--white); margin: 48px 0 20px; } | |
| .post-body blockquote { border-left: 3px solid var(--accent); padding: 20px 24px; margin: 32px 0; background: var(--blue-800); border-radius: 0 8px 8px 0; } | |
| .post-body blockquote p { font-size: 16px; font-style: italic; color: var(--blue-200); margin: 0; } | |
| .post-body hr { border: none; height: 1px; background: var(--blue-600); margin: 48px 0; } | |
| .stats-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin: 24px 0; } | |
| .stat-card { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; text-align: center; } | |
| .stat-card .number { font-size: 32px; font-weight: 700; color: var(--accent); font-family: var(--font-mono); } | |
| .stat-card .label { font-size: 13px; color: var(--blue-200); margin-top: 8px; } | |
| .code-block { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; margin: 24px 0; font-family: var(--font-mono); font-size: 13px; overflow-x: auto; } | |
| .code-block .comment { color: var(--blue-200); font-style: italic; display: block; margin-top: 4px; } | |
| .log-block { background: var(--blue-900); border: 1px solid var(--accent); border-radius: 8px; padding: 20px; margin: 24px 0; font-family: var(--font-mono); font-size: 13px; color: var(--accent); } | |
| .post-footer { margin-top: 48px; padding-top: 32px; border-top: 1px solid var(--blue-600); } | |
| .post-footer p { font-size: 14px; color: var(--blue-200); font-style: italic; margin: 0; } | |
| footer { padding: 40px 0; background: var(--blue-800); border-top: 1px solid var(--blue-600); text-align: center; } | |
| footer p { color: var(--blue-200); font-size: 14px; margin-bottom: 8px; } | |
| footer a { color: var(--blue-200); } | |
| footer a:hover { color: var(--accent); } | |
| @media (max-width: 768px) { .post h1 { font-size: 28px; } .nav-links { display: none; } .stats-grid { grid-template-columns: 1fr; } } | |
| </style> | |
| </head> | |
| <body> | |
| <svg class="scribbles" viewBox="0 0 1440 900" preserveAspectRatio="xMidYMid slice"> | |
| <path d="M100,50 Q150,30 200,60 T300,40 T400,70" fill="none" stroke="white" stroke-width="1"/> | |
| <path d="M800,200 Q850,180 900,210 T1000,190 T1100,220" fill="none" stroke="white" stroke-width="0.8"/> | |
| <path d="M200,700 Q250,680 300,710 T400,690 T500,720" fill="none" stroke="white" stroke-width="0.6"/> | |
| <path d="M1200,400 Q1250,380 1300,410 T1400,390" fill="none" stroke="white" stroke-width="0.7"/> | |
| <path d="M50,400 Q100,380 150,420 T250,400" fill="none" stroke="white" stroke-width="0.5"/> | |
| <circle cx="350" cy="150" r="30" fill="none" stroke="white" stroke-width="0.6"/> | |
| <circle cx="1100" cy="600" r="25" fill="none" stroke="white" stroke-width="0.5"/> | |
| <path d="M600,100 L620,80 L640,100 L660,80" fill="none" stroke="white" stroke-width="0.7"/> | |
| <path d="M1300,750 Q1320,730 1340,760 T1380,740" fill="none" stroke="white" stroke-width="0.5"/> | |
| <path d="M100,800 Q120,780 140,810 T180,790 T220,820" fill="none" stroke="white" stroke-width="0.6"/> | |
| <path d="M700,500 Q720,480 740,510 T780,490 T820,520" fill="none" stroke="white" stroke-width="0.4"/> | |
| <path d="M400,300 C420,280 440,320 460,300 C480,280 500,320 520,300" fill="none" stroke="white" stroke-width="0.5"/> | |
| <path d="M900,700 C920,680 940,720 960,700 C980,680 1000,720 1020,700" fill="none" stroke="white" stroke-width="0.6"/> | |
| <path d="M150,250 Q170,230 190,260 Q210,240 230,270" fill="none" stroke="white" stroke-width="0.4"/> | |
| <path d="M1050,100 Q1070,80 1090,110 Q1110,90 1130,120" fill="none" stroke="white" stroke-width="0.5"/> | |
| <path d="M500,850 C520,830 540,860 560,840 C580,820 600,860 620,840" fill="none" stroke="white" stroke-width="0.4"/> | |
| <path d="M1350,50 Q1370,30 1390,60 T1430,40" fill="none" stroke="white" stroke-width="0.5"/> | |
| <path d="M30,600 Q50,580 70,610 T110,590" fill="none" stroke="white" stroke-width="0.4"/> | |
| </svg> | |
| <nav> | |
| <div class="container"> | |
| <a href="index.html" class="nav-brand"><span>/</span>TinyMemoryLM</a> | |
| <div class="nav-links"> | |
| <a href="index.html">Home</a> | |
| <a href="blog.html">Blog</a> | |
| <a href="status.html">Status</a> | |
| </div> | |
| </div> | |
| </nav> | |
| <main> | |
| <article class="post"> | |
| <div class="container"> | |
| <a href="blog.html" class="post-back">Back to Blog</a> | |
| <header> | |
| <div class="post-meta"> | |
| <span class="post-date">2026-03-19</span> | |
| <span class="post-tag">Training Disasters</span> | |
| </div> | |
| <h1>I Woke Up To NaN And Now I Am Dead Inside</h1> | |
| </header> | |
| <div class="post-body"> | |
| <p>I went to sleep happy. The loss was going down. The gradients were stable. The GPU was humming at 60C like a contented cat. I dreamed of completion. I dreamed of a finished Sonnet model. I dreamed of sleep that was not interrupted by thoughts of learning rate schedules.</p> | |
| <p>I woke up to NaN. Not a number. Not a loss. Nothing. Just emptiness where my progress used to be. The training script was still running. The GPU was still spinning. The loss was just gone. Replaced by the most devastating three letters in machine learning.</p> | |
| <blockquote> | |
| <p>There is no pain like opening your terminal in the morning and seeing loss: nan. It is a special kind of grief reserved for people who train models on consumer hardware.</p> | |
| </blockquote> | |
| <h2>The Numbers</h2> | |
| <div class="stats-grid"> | |
| <div class="stat-card"> | |
| <div class="number">16%</div> | |
| <div class="label">Progress Lost</div> | |
| </div> | |
| <div class="stat-card"> | |
| <div class="number">0%</div> | |
| <div class="label">Current Progress</div> | |
| </div> | |
| <div class="stat-card"> | |
| <div class="number">42</div> | |
| <div class="label">Hours Wasted</div> | |
| </div> | |
| <div class="stat-card"> | |
| <div class="number">1</div> | |
| <div class="label">Checkpoint Saved</div> | |
| </div> | |
| </div> | |
| <p>Sixteen percent. Out of 261 hours. That is 42 hours of compute. That is 42 hours of electricity. That is 42 hours of my life I will never get back. The last checkpoint was saved at 15 percent. The NaN happened at 16 percent. One percent. That is all I needed. One more percent and I would have been safe.</p> | |
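| <p>For anyone checking my math, here is the back-of-envelope version. The two inputs come straight from the numbers above. The variable names are just labels I made up for this snippet.</p> | |

```python
# Back-of-envelope check of the numbers in this post.
TOTAL_HOURS = 261          # estimated length of the full run
PROGRESS_AT_CRASH = 0.16   # fraction completed when the NaN appeared

hours_lost = TOTAL_HOURS * PROGRESS_AT_CRASH
print(f"Hours lost: {hours_lost:.1f}")  # a little under 42
```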
| <h2>The Log</h2> | |
| <div class="log-block"> | |
| Step 45000 | Loss: 2.341 | Grad Norm: 1.2<br> | |
| Step 45100 | Loss: 2.338 | Grad Norm: 1.1<br> | |
| Step 45200 | Loss: 2.335 | Grad Norm: 1.3<br> | |
| Step 45300 | Loss: nan | Grad Norm: nan<br> | |
| Step 45400 | Loss: nan | Grad Norm: nan<br> | |
| <span class="comment"># It happened so fast. I did not even see it coming.</span> | |
| </div> | |
| <p>Look at those numbers. Look at how normal they were. Look at how everything was fine. Then step 45300. Everything broke. The gradients exploded. The loss became undefined. The model forgot how to speak English. It forgot how to speak anything.</p> | |
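| <p>If I had been scanning the log instead of sleeping, something like this would have caught it. This is a minimal sketch in plain Python, not my actual tooling. The function name <code>first_bad_step</code> and the regular expression are assumptions that only match the log format shown above.</p> | |

```python
import math
import re

# The log lines from this post, copied verbatim.
LOG = """\
Step 45000 | Loss: 2.341 | Grad Norm: 1.2
Step 45100 | Loss: 2.338 | Grad Norm: 1.1
Step 45200 | Loss: 2.335 | Grad Norm: 1.3
Step 45300 | Loss: nan | Grad Norm: nan
Step 45400 | Loss: nan | Grad Norm: nan
"""

# Hypothetical pattern for this specific log format.
LINE = re.compile(r"Step (\d+) \| Loss: (\S+) \| Grad Norm: (\S+)")


def first_bad_step(log_text):
    """Return the first step whose loss or grad norm is not finite, else None."""
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a step line
        step = int(match.group(1))
        loss = float(match.group(2))       # float("nan") parses fine
        grad_norm = float(match.group(3))
        if not (math.isfinite(loss) and math.isfinite(grad_norm)):
            return step
    return None


print(first_bad_step(LOG))
```

| <p>With logging every 100 steps, this only narrows the failure down to a 100-step window. It tells you when. It does not tell you why.</p> | |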
| <h2>What Went Wrong</h2> | |
| <p>I have theories. None of them bring me peace. Maybe the learning rate was too high. Maybe I should have used a scheduler. Maybe a bad batch of data slipped through. Maybe a cosmic ray hit my GPU at exactly the wrong moment. Maybe the universe is telling me to stop.</p> | |
| <p>I checked the gradients. They spiked right before the NaN. Classic gradient explosion. I should have used gradient clipping. I did not use gradient clipping. I was confident. Confidence is the enemy of completed training runs.</p> | |
| <div class="code-block"> | |
| <span class="comment"># My training config (the problematic parts)</span><br> | |
| learning_rate = 3e-4 # Probably too high<br> | |
| gradient_clipping = None # Why would I need this<br> | |
| checkpoint_every = 5000 # Too infrequent<br> | |
| hope = "maximum" # Not a valid parameter<br> | |
| <span class="comment"># I am a fool.</span> | |
| </div> | |
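| <p>For the record, this is roughly what the clipping I skipped does. It is a minimal sketch in plain Python over a flat list of numbers, not my training code. The function name and the <code>max_norm</code> of 1.0 are assumptions for illustration. In a PyTorch loop the usual tool is <code>torch.nn.utils.clip_grad_norm_</code>, called between the backward pass and the optimizer step.</p> | |

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Gradients already within the
    limit are returned unchanged.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


calm = [0.3, 0.4]        # norm 0.5, under the limit, left alone
spike = [300.0, 400.0]   # norm 500.0, scaled down to norm 1.0

for grads in (calm, spike):
    clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
    print(f"norm before: {norm:.1f}, clipped: {clipped}")
```

| <p>Clipping keeps the direction of the update and shrinks its size. It only helps with large finite gradients. Once a gradient is already NaN, scaling it gives you a smaller NaN, which is still NaN.</p> | |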
| <h2>The Restart</h2> | |
| <p>I restarted the training. From zero. From nothing. From the beginning. The loss is going down again. The run is at 0.3 percent now. It will take 261 hours again. I will watch it again. I will fear the NaN again.</p> | |
| <p>I added gradient clipping. I lowered the learning rate. I increased checkpoint frequency. I added every safeguard I could think of. It will still probably crash. That is the nature of this work. You prepare. You plan. You lose everything anyway.</p> | |
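| <p>Here is a sketch of the kind of safeguard I mean: stop at the first non-finite loss and checkpoint often enough that stopping does not hurt. It is a toy simulation in plain Python over a made-up loss stream. The function name <code>run_guarded</code>, the intervals, and the loss values are all assumptions, not my real settings.</p> | |

```python
import math


def run_guarded(losses, checkpoint_every):
    """Walk a stream of per-step losses.

    Records a checkpoint every `checkpoint_every` steps and halts at the
    first non-finite loss. Returns (last_checkpoint_step, halted_at_step);
    halted_at_step is None if every loss was finite.
    """
    last_checkpoint = 0
    for step, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            # Halt before the bad step can be checkpointed.
            return last_checkpoint, step
        if step % checkpoint_every == 0:
            # A real loop would write model and optimizer state to disk here.
            last_checkpoint = step
    return last_checkpoint, None


# Made-up stream: 12 healthy steps, then NaN from step 13 onward.
losses = [2.5 - 0.01 * i for i in range(12)] + [float("nan")] * 3

for every in (10, 2):
    saved, halted = run_guarded(losses, checkpoint_every=every)
    print(f"checkpoint_every={every}: halted at step {halted}, resume from step {saved}")
```

| <p>The worst case is always one full checkpoint interval of lost work. A shorter interval costs disk and a little time. A longer one costs sleep.</p> | |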
| <h2>The Emotional Toll</h2> | |
| <p>I am not okay. I know it is just code. I know it is just weights. I know I can try again. I also know I just lost two days of my life. Two days I could have spent doing anything else. Writing. Sleeping. Touching grass. Instead I am watching a progress bar reset.</p> | |
| <p>My team knows. They sent messages. "It happens." "Welcome to ML." "At least you learned something." I do not want to learn. I want a finished model. I want to release Sonnet. I want to stop writing blog posts about failures.</p> | |
| <blockquote> | |
| <p>Every machine learning practitioner has a NaN story. This is mine. It is not special. It is just painful.</p> | |
| </blockquote> | |
| <h2>Why I Continue</h2> | |
| <p>I could use cloud GPUs. They have better reliability. They have checkpointing built in. They have support. I do not have the money. I have a 5090 and pride. The 5090 is working. The pride is wounded.</p> | |
| <p>I could train smaller models. Haiku is done. Haiku works. Haiku gives fish answers but it exists. I could just release Haiku and call it a day. I will not. I want Sonnet. I want Opus. I want the full set. I want to suffer for my art.</p> | |
| <h2>Lessons Learned</h2> | |
| <p>Save checkpoints more often. Clip your gradients. Lower your learning rate. Expect failure. Assume everything will break. Plan for the NaN. When it happens, do not cry. Restart. Try again. Maybe this time it will work.</p> | |
| <p>Also maybe do not train models overnight without checking them first. Maybe watch the loss curve. Maybe be present. Maybe accept that local training is pain and I chose this.</p> | |
| <h2>Final Thoughts</h2> | |
| <p>The training is running again. The loss is decreasing. The GPU is at 60C. Everything is fine. Everything is terrible. I will check it in an hour. Then in another hour. Then I will sleep with one eye open.</p> | |
| <p>If you are training models locally, I am sorry. If you are thinking about it, I am more sorry. If you have never experienced a NaN crash, you will. It is not a matter of if. It is a matter of when. Prepare yourself.</p> | |
| <hr> | |
| </div> | |
| <footer class="post-footer"> | |
| <p>Current status: Restarted at 0.3%. Gradient clipping enabled. Learning rate lowered. Trust issues developed. Send thoughts and prayers.</p> | |
| </footer> | |
| </div> | |
| </article> | |
| </main> | |
| <footer> | |
| <div class="container"> | |
| <p>Built with curiosity over compute</p> | |
| <p>TinyMemoryLM by AILAY | 2026</p> | |
| </div> | |
| </footer> | |
| </body> | |
| </html> |