Device: cuda
Loading tokenizer: /tmp/eval/multilingual_32k.model
Loading base model: /tmp/eval/best_model.pt
Model loaded: 3.04B parameters
Loading SFT data from: /tmp/sft_data_v2
Train: 3949348 tokens, Val: 201020 tokens
Using 8-bit AdamW (bitsandbytes)
Starting SFT training for 4000 steps...
Batch size: 1 x 4 accum = 4 effective, Seq len: 2048, LR: 2e-05
Step 10/4000 | Loss: 2.3791 | LR: 0.000001 | TPS: 1196 | 68s
Step 20/4000 | Loss: 2.5346 | LR: 0.000002 | TPS: 1418 | 116s
Step 30/4000 | Loss: 2.7910 | LR: 0.000003 | TPS: 1511 | 163s
Step 40/4000 | Loss: 2.5189 | LR: 0.000004 | TPS: 1562 | 210s
Step 50/4000 | Loss: 2.5049 | LR: 0.000005 | TPS: 1594 | 257s
Step 60/4000 | Loss: 2.5417 | LR: 0.000006 | TPS: 1616 | 304s
Step 70/4000 | Loss: 2.2374 | LR: 0.000007 | TPS: 1633 | 351s
Step 80/4000 | Loss: 2.5328 | LR: 0.000008 | TPS: 1645 | 398s
Step 90/4000 | Loss: 2.5359 | LR: 0.000009 | TPS: 1655 | 445s
Step 100/4000 | Loss: 2.4830 | LR: 0.000010 | TPS: 1663 | 493s
Step 110/4000 | Loss: 2.3015 | LR: 0.000011 | TPS: 1669 | 540s
Step 120/4000 | Loss: 2.4667 | LR: 0.000012 | TPS: 1675 | 587s
Step 130/4000 | Loss: 2.3792 | LR: 0.000013 | TPS: 1680 | 634s
Step 140/4000 | Loss: 2.3918 | LR: 0.000014 | TPS: 1684 | 681s
Step 150/4000 | Loss: 2.3368 | LR: 0.000015 | TPS: 1687 | 728s
Step 160/4000 | Loss: 2.4838 | LR: 0.000016 | TPS: 1690 | 775s
Step 170/4000 | Loss: 2.3578 | LR: 0.000017 | TPS: 1693 | 823s
Step 180/4000 | Loss: 2.5485 | LR: 0.000018 | TPS: 1695 | 870s
Step 190/4000 | Loss: 2.0834 | LR: 0.000019 | TPS: 1698 | 917s
Step 200/4000 | Loss: 1.9784 | LR: 0.000020 | TPS: 1699 | 964s
Step 210/4000 | Loss: 2.4826 | LR: 0.000020 | TPS: 1701 | 1011s
Step 220/4000 | Loss: 2.3540 | LR: 0.000020 | TPS: 1703 | 1058s
Step 230/4000 | Loss: 2.2093 | LR: 0.000020 | TPS: 1704 | 1105s
Step 240/4000 | Loss: 2.2137 | LR: 0.000020 | TPS: 1706 | 1153s
Step 250/4000 | Loss: 2.2151 | LR: 0.000020 | TPS: 1707 | 1200s
Step 260/4000 | Loss: 2.2535 | LR: 0.000020 | TPS: 1708 | 1247s
Step 270/4000 | Loss: 2.2235 | LR: 0.000020 | TPS: 1709 | 1294s
Step 280/4000 | Loss: 2.0449 | LR: 0.000020 | TPS: 1710 | 1341s
Step 290/4000 | Loss: 2.1502 | LR: 0.000020 | TPS: 1711 | 1388s
Step 300/4000 | Loss: 2.3716 | LR: 0.000020 | TPS: 1712 | 1435s
Step 310/4000 | Loss: 2.1591 | LR: 0.000020 | TPS: 1713 | 1483s
Step 320/4000 | Loss: 2.2153 | LR: 0.000020 | TPS: 1714 | 1530s
Step 330/4000 | Loss: 2.2023 | LR: 0.000020 | TPS: 1714 | 1577s
Step 340/4000 | Loss: 2.3968 | LR: 0.000020 | TPS: 1715 | 1624s
Step 350/4000 | Loss: 2.1146 | LR: 0.000020 | TPS: 1716 | 1671s
Step 360/4000 | Loss: 2.1857 | LR: 0.000020 | TPS: 1716 | 1718s
Step 370/4000 | Loss: 2.1965 | LR: 0.000020 | TPS: 1717 | 1765s
Step 380/4000 | Loss: 2.1613 | LR: 0.000020 | TPS: 1717 | 1813s
Step 390/4000 | Loss: 2.3080 | LR: 0.000020 | TPS: 1718 | 1860s
Step 400/4000 | Loss: 2.2964 | LR: 0.000020 | TPS: 1718 | 1907s
📊 Val loss: 2.2256 (NEW BEST!)
💾 Best model saved to /tmp/sft/sft_model_v2.pt
Step 410/4000 | Loss: 2.2859 | LR: 0.000020 | TPS: 1703 | 1973s
Step 420/4000 | Loss: 2.1711 | LR: 0.000020 | TPS: 1703 | 2020s
Step 430/4000 | Loss: 2.1434 | LR: 0.000020 | TPS: 1704 | 2067s
Step 440/4000 | Loss: 2.2115 | LR: 0.000020 | TPS: 1705 | 2114s
Step 450/4000 | Loss: 2.2985 | LR: 0.000020 | TPS: 1706 | 2161s
Step 460/4000 | Loss: 1.9845 | LR: 0.000020 | TPS: 1707 | 2208s
Step 470/4000 | Loss: 2.3135 | LR: 0.000020 | TPS: 1707 | 2255s
Step 480/4000 | Loss: 2.3004 | LR: 0.000020 | TPS: 1708 | 2302s
Step 490/4000 | Loss: 2.1841 | LR: 0.000020 | TPS: 1709 | 2349s
Step 500/4000 | Loss: 2.3647 | LR: 0.000020 | TPS: 1709 | 2396s
Step 510/4000 | Loss: 2.1587 | LR: 0.000020 | TPS: 1710 | 2443s
Step 520/4000 | Loss: 2.0790 | LR: 0.000020 | TPS: 1711 | 2490s
Step 530/4000 | Loss: 2.0842 | LR: 0.000020 | TPS: 1711 | 2537s
Step 540/4000 | Loss: 2.4031 | LR: 0.000020 | TPS: 1712 | 2584s
Step 550/4000 | Loss: 2.3037 | LR: 0.000020 | TPS: 1712 | 2632s
Step 560/4000 | Loss: 2.2433 | LR: 0.000020 | TPS: 1713 | 2679s
Step 570/4000 | Loss: 2.1670 | LR: 0.000020 | TPS: 1713 | 2726s
Step 580/4000 | Loss: 2.1579 | LR: 0.000020 | TPS: 1714 | 2773s
Step 590/4000 | Loss: 1.9392 | LR: 0.000020 | TPS: 1714 | 2820s
Step 600/4000 | Loss: 2.1226 | LR: 0.000020 | TPS: 1715 | 2867s
Step 610/4000 | Loss: 2.2641 | LR: 0.000019 | TPS: 1715 | 2914s
Step 620/4000 | Loss: 2.0771 | LR: 0.000019 | TPS: 1715 | 2961s
Step 630/4000 | Loss: 2.4527 | LR: 0.000019 | TPS: 1716 | 3008s
Step 640/4000 | Loss: 2.2605 | LR: 0.000019 | TPS: 1716 | 3055s
Step 650/4000 | Loss: 1.9801 | LR: 0.000019 | TPS: 1717 | 3102s
Step 660/4000 | Loss: 2.4208 | LR: 0.000019 | TPS: 1717 | 3149s
Step 670/4000 | Loss: 2.3331 | LR: 0.000019 | TPS: 1717 | 3196s
Step 680/4000 | Loss: 2.1299 | LR: 0.000019 | TPS: 1718 | 3243s
Step 690/4000 | Loss: 2.1551 | LR: 0.000019 | TPS: 1718 | 3290s
Step 700/4000 | Loss: 2.0940 | LR: 0.000019 | TPS: 1718 | 3337s
Step 710/4000 | Loss: 2.0533 | LR: 0.000019 | TPS: 1719 | 3384s
Step 720/4000 | Loss: 2.2076 | LR: 0.000019 | TPS: 1719 | 3431s
Step 730/4000 | Loss: 1.9816 | LR: 0.000019 | TPS: 1719 | 3478s
Step 740/4000 | Loss: 2.1420 | LR: 0.000019 | TPS: 1719 | 3526s
Step 750/4000 | Loss: 2.2928 | LR: 0.000019 | TPS: 1720 | 3573s
Step 760/4000 | Loss: 2.1035 | LR: 0.000019 | TPS: 1720 | 3620s
Step 770/4000 | Loss: 2.1663 | LR: 0.000019 | TPS: 1720 | 3667s
Step 780/4000 | Loss: 2.2270 | LR: 0.000019 | TPS: 1721 | 3714s
Step 790/4000 | Loss: 2.1436 | LR: 0.000019 | TPS: 1721 | 3761s
Step 800/4000 | Loss: 2.3599 | LR: 0.000019 | TPS: 1721 | 3808s
📊 Val loss: 2.1960 (NEW BEST!)
💾 Best model saved to /tmp/sft/sft_model_v2.pt
Step 810/4000 | Loss: 2.2325 | LR: 0.000019 | TPS: 1696 | 3912s
Step 820/4000 | Loss: 2.0798 | LR: 0.000019 | TPS: 1696 | 3960s
Step 830/4000 | Loss: 2.1527 | LR: 0.000019 | TPS: 1697 | 4007s
Step 840/4000 | Loss: 2.2046 | LR: 0.000019 | TPS: 1697 | 4054s
Step 850/4000 | Loss: 2.0648 | LR: 0.000019 | TPS: 1698 | 4101s
Step 860/4000 | Loss: 2.1708 | LR: 0.000019 | TPS: 1698 | 4148s
Step 870/4000 | Loss: 2.3088 | LR: 0.000019 | TPS: 1699 | 4195s
Step 880/4000 | Loss: 1.9936 | LR: 0.000019 | TPS: 1699 | 4242s
Step 890/4000 | Loss: 2.1869 | LR: 0.000019 | TPS: 1700 | 4290s
Step 900/4000 | Loss: 2.4199 | LR: 0.000019 | TPS: 1700 | 4337s
Step 910/4000 | Loss: 2.3803 | LR: 0.000018 | TPS: 1700 | 4384s
Step 920/4000 | Loss: 2.0193 | LR: 0.000018 | TPS: 1701 | 4431s
Step 930/4000 | Loss: 2.1047 | LR: 0.000018 | TPS: 1701 | 4478s
Step 940/4000 | Loss: 2.1449 | LR: 0.000018 | TPS: 1702 | 4525s
Step 950/4000 | Loss: 2.1521 | LR: 0.000018 | TPS: 1702 | 4572s
Step 960/4000 | Loss: 2.2820 | LR: 0.000018 | TPS: 1702 | 4620s
Step 970/4000 | Loss: 2.2996 | LR: 0.000018 | TPS: 1703 | 4667s
Step 980/4000 | Loss: 2.3187 | LR: 0.000018 | TPS: 1703 | 4714s
Step 990/4000 | Loss: 2.1756 | LR: 0.000018 | TPS: 1703 | 4761s
Step 1000/4000 | Loss: 1.9765 | LR: 0.000018 | TPS: 1704 | 4808s
🔤 Generation samples (step 1000):
[EN] The capital of France is located in Normandy.
[HE] מלזיה.
[AR] باريس.
[FA] پاریس یکی از شهرهای بزرگ و تاریخی جهان است که دارای جاذبه های طبیعی، فرهنگی و اقتصادی متعددی می باشد. شهر پاریس در غرب کشورمان قرار دارد و به عنوان یکی از مهم ترین مراکز تجاری و مالی دنیا شناخته شده ا
[TRANSLATE] "תודה על הכול, אבא. אני כאן איתך בכל רגע נתון."
Step 1010/4000 | Loss: 2.1665 | LR: 0.000018 | TPS: 1703 | 4859s
Step 1020/4000 | Loss: 2.1047 | LR: 0.000018 | TPS: 1703 | 4906s
Step 1030/4000 | Loss: 2.2359 | LR: 0.000018 | TPS: 1704 | 4953s
Step 1040/4000 | Loss: 2.0109 | LR: 0.000018 | TPS: 1704 | 5000s
Step 1050/4000 | Loss: 2.1515 | LR: 0.000018 | TPS: 1704 | 5047s
Step 1060/4000 | Loss: 2.0880 | LR: 0.000018 | TPS: 1705 | 5094s
Step 1070/4000 | Loss: 2.2460 | LR: 0.000018 | TPS: 1705 | 5142s
Step 1080/4000 | Loss: 1.9325 | LR: 0.000018 | TPS: 1705 | 5189s
Step 1090/4000 | Loss: 2.2283 | LR: 0.000018 | TPS: 1705 | 5236s
Step 1100/4000 | Loss: 2.3303 | LR: 0.000018 | TPS: 1706 | 5283s
Step 1110/4000 | Loss: 2.1772 | LR: 0.000018 | TPS: 1706 | 5330s
Step 1120/4000 | Loss: 2.1615 | LR: 0.000018 | TPS: 1706 | 5377s
Step 1130/4000 | Loss: 2.1470 | LR: 0.000017 | TPS: 1707 | 5424s
Step 1140/4000 | Loss: 1.9640 | LR: 0.000017 | TPS: 1707 | 5472s
Step 1150/4000 | Loss: 2.1891 | LR: 0.000017 | TPS: 1707 | 5519s
Step 1160/4000 | Loss: 2.2183 | LR: 0.000017 | TPS: 1707 | 5566s
Step 1170/4000 | Loss: 2.0268 | LR: 0.000017 | TPS: 1708 | 5613s
Step 1180/4000 | Loss: 2.2234 | LR: 0.000017 | TPS: 1708 | 5660s
Step 1190/4000 | Loss: 2.1961 | LR: 0.000017 | TPS: 1708 | 5707s
Step 1200/4000 | Loss: 2.2019 | LR: 0.000017 | TPS: 1708 | 5754s
📊 Val loss: 2.2238
Step 1210/4000 | Loss: 2.0809 | LR: 0.000017 | TPS: 1707 | 5807s
Step 1220/4000 | Loss: 2.1716 | LR: 0.000017 | TPS: 1707 | 5854s
Step 1230/4000 | Loss: 2.2607 | LR: 0.000017 | TPS: 1707 | 5901s
Step 1240/4000 | Loss: 2.1838 | LR: 0.000017 | TPS: 1708 | 5949s
Step 1250/4000 | Loss: 2.0725 | LR: 0.000017 | TPS: 1708 | 5996s
Step 1260/4000 | Loss: 2.2797 | LR: 0.000017 | TPS: 1708 | 6043s
Step 1270/4000 | Loss: 2.0366 | LR: 0.000017 | TPS: 1708 | 6090s
Step 1280/4000 | Loss: 2.1469 | LR: 0.000017 | TPS: 1709 | 6137s
Step 1290/4000 | Loss: 2.1541 | LR: 0.000017 | TPS: 1709 | 6184s
Step 1300/4000 | Loss: 2.0311 | LR: 0.000017 | TPS: 1709 | 6231s
Step 1310/4000 | Loss: 2.1828 | LR: 0.000016 | TPS: 1709 | 6279s
Step 1320/4000 | Loss: 2.2004 | LR: 0.000016 | TPS: 1709 | 6326s
Step 1330/4000 | Loss: 2.2589 | LR: 0.000016 | TPS: 1710 | 6373s
Step 1340/4000 | Loss: 2.1475 | LR: 0.000016 | TPS: 1710 | 6420s
Step 1350/4000 | Loss: 2.1672 | LR: 0.000016 | TPS: 1710 | 6467s
Step 1360/4000 | Loss: 2.1921 | LR: 0.000016 | TPS: 1710 | 6514s
Step 1370/4000 | Loss: 2.0689 | LR: 0.000016 | TPS: 1710 | 6561s
Step 1380/4000 | Loss: 2.2560 | LR: 0.000016 | TPS: 1711 | 6609s
Step 1390/4000 | Loss: 1.9519 | LR: 0.000016 | TPS: 1711 | 6656s
Step 1400/4000 | Loss: 1.9671 | LR: 0.000016 | TPS: 1711 | 6703s
Step 1410/4000 | Loss: 2.1535 | LR: 0.000016 | TPS: 1711 | 6750s
Step 1420/4000 | Loss: 2.1726 | LR: 0.000016 | TPS: 1711 | 6797s
Step 1430/4000 | Loss: 2.0854 | LR: 0.000016 | TPS: 1712 | 6844s
Step 1440/4000 | Loss: 2.0955 | LR: 0.000016 | TPS: 1712 | 6891s
Step 1450/4000 | Loss: 2.1260 | LR: 0.000016 | TPS: 1712 | 6939s
Step 1460/4000 | Loss: 2.2860 | LR: 0.000016 | TPS: 1712 | 6986s
Step 1470/4000 | Loss: 1.6098 | LR: 0.000015 | TPS: 1712 | 7033s
Step 1480/4000 | Loss: 2.1327 | LR: 0.000015 | TPS: 1712 | 7080s
Step 1490/4000 | Loss: 2.0506 | LR: 0.000015 | TPS: 1713 | 7127s
Step 1500/4000 | Loss: 2.0568 | LR: 0.000015 | TPS: 1713 | 7174s
Step 1510/4000 | Loss: 2.0177 | LR: 0.000015 | TPS: 1713 | 7221s
Step 1520/4000 | Loss: 2.0383 | LR: 0.000015 | TPS: 1713 | 7269s
Step 1530/4000 | Loss: 2.0994 | LR: 0.000015 | TPS: 1713 | 7316s
Step 1540/4000 | Loss: 2.0863 | LR: 0.000015 | TPS: 1713 | 7363s
Step 1550/4000 | Loss: 2.3287 | LR: 0.000015 | TPS: 1714 | 7410s
Step 1560/4000 | Loss: 2.1585 | LR: 0.000015 | TPS: 1714 | 7457s
Step 1570/4000 | Loss: 1.9781 | LR: 0.000015 | TPS: 1714 | 7504s
Step 1580/4000 | Loss: 1.9344 | LR: 0.000015 | TPS: 1714 | 7551s
Step 1590/4000 | Loss: 2.1031 | LR: 0.000015 | TPS: 1714 | 7599s
Step 1600/4000 | Loss: 2.2633 | LR: 0.000015 | TPS: 1714 | 7646s
📊 Val loss: 2.1164 (NEW BEST!)
💾 Best model saved to /tmp/sft/sft_model_v2.pt
Step 1610/4000 | Loss: 2.0217 | LR: 0.000015 | TPS: 1702 | 7750s
Step 1620/4000 | Loss: 2.0437 | LR: 0.000014 | TPS: 1702 | 7797s
Step 1630/4000 | Loss: 2.3588 | LR: 0.000014 | TPS: 1702 | 7844s
Step 1640/4000 | Loss: 2.1927 | LR: 0.000014 | TPS: 1702 | 7892s
Step 1650/4000 | Loss: 1.9298 | LR: 0.000014 | TPS: 1703 | 7939s
Step 1660/4000 | Loss: 2.1604 | LR: 0.000014 | TPS: 1703 | 7986s
Step 1670/4000 | Loss: 2.0326 | LR: 0.000014 | TPS: 1703 | 8033s
Step 1680/4000 | Loss: 2.1872 | LR: 0.000014 | TPS: 1703 | 8080s
Step 1690/4000 | Loss: 2.0633 | LR: 0.000014 | TPS: 1703 | 8127s
Step 1700/4000 | Loss: 2.2547 | LR: 0.000014 | TPS: 1704 | 8174s
Step 1710/4000 | Loss: 1.8940 | LR: 0.000014 | TPS: 1704 | 8221s
Step 1720/4000 | Loss: 2.0726 | LR: 0.000014 | TPS: 1704 | 8269s
Step 1730/4000 | Loss: 2.0857 | LR: 0.000014 | TPS: 1704 | 8316s
Step 1740/4000 | Loss: 2.0686 | LR: 0.000014 | TPS: 1704 | 8363s
Step 1750/4000 | Loss: 2.1306 | LR: 0.000014 | TPS: 1705 | 8410s
Step 1760/4000 | Loss: 2.0932 | LR: 0.000013 | TPS: 1705 | 8457s
Step 1770/4000 | Loss: 2.0751 | LR: 0.000013 | TPS: 1705 | 8504s
Step 1780/4000 | Loss: 2.1802 | LR: 0.000013 | TPS: 1705 | 8551s
Step 1790/4000 | Loss: 1.6657 | LR: 0.000013 | TPS: 1705 | 8599s
Step 1800/4000 | Loss: 2.1290 | LR: 0.000013 | TPS: 1706 | 8646s
Step 1810/4000 | Loss: 2.1032 | LR: 0.000013 | TPS: 1706 | 8693s
Step 1820/4000 | Loss: 2.1255 | LR: 0.000013 | TPS: 1706 | 8740s
Step 1830/4000 | Loss: 2.1091 | LR: 0.000013 | TPS: 1706 | 8787s
Step 1840/4000 | Loss: 1.9875 | LR: 0.000013 | TPS: 1706 | 8834s
Step 1850/4000 | Loss: 1.9615 | LR: 0.000013 | TPS: 1706 | 8881s
Step 1860/4000 | Loss: 2.0189 | LR: 0.000013 | TPS: 1707 | 8929s
Step 1870/4000 | Loss: 2.1387 | LR: 0.000013 | TPS: 1707 | 8976s
Step 1880/4000 | Loss: 2.0963 | LR: 0.000013 | TPS: 1707 | 9023s
Step 1890/4000 | Loss: 2.1750 | LR: 0.000013 | TPS: 1707 | 9070s
Step 1900/4000 | Loss: 2.3945 | LR: 0.000012 | TPS: 1707 | 9117s
Step 1910/4000 | Loss: 2.1515 | LR: 0.000012 | TPS: 1707 | 9164s
Step 1920/4000 | Loss: 2.2224 | LR: 0.000012 | TPS: 1708 | 9211s
Step 1930/4000 | Loss: 2.3160 | LR: 0.000012 | TPS: 1708 | 9259s
Step 1940/4000 | Loss: 2.0126 | LR: 0.000012 | TPS: 1708 | 9306s
Step 1950/4000 | Loss: 2.2443 | LR: 0.000012 | TPS: 1708 | 9353s
Step 1960/4000 | Loss: 1.9590 | LR: 0.000012 | TPS: 1708 | 9400s
Step 1970/4000 | Loss: 2.2280 | LR: 0.000012 | TPS: 1708 | 9447s
Step 1980/4000 | Loss: 1.9723 | LR: 0.000012 | TPS: 1708 | 9494s
Step 1990/4000 | Loss: 2.0697 | LR: 0.000012 | TPS: 1709 | 9541s
Step 2000/4000 | Loss: 2.0568 | LR: 0.000012 | TPS: 1709 | 9589s
📊 Val loss: 2.1674
🔤 Generation samples (step 2000):
[EN] Paris (pronounced "Paris") is a city located in northeastern France. It borders Germany to the east, with Belgium and Luxembourg as its easternmost provinces.
[HE] בצרפת, העיר העתיקה היא אזור התיירות העיקרי.
[AR] باريس
[FA] پاریس، پایتخت کشور فرانسه است.
[TRANSLATE] The answer is YES.
Step 2010/4000 | Loss: 1.9474 | LR: 0.000012 | TPS: 1708 | 9643s
Step 2020/4000 | Loss: 2.1131 | LR: 0.000012 | TPS: 1708 | 9690s
Step 2030/4000 | Loss: 2.0446 | LR: 0.000012 | TPS: 1708 | 9737s
Step 2040/4000 | Loss: 2.2229 | LR: 0.000011 | TPS: 1708 | 9784s
Step 2050/4000 | Loss: 2.1576 | LR: 0.000011 | TPS: 1708 | 9832s
Step 2060/4000 | Loss: 2.1899 | LR: 0.000011 | TPS: 1708 | 9879s
Step 2070/4000 | Loss: 2.0957 | LR: 0.000011 | TPS: 1708 | 9926s
Step 2080/4000 | Loss: 2.2643 | LR: 0.000011 | TPS: 1709 | 9973s
Step 2090/4000 | Loss: 2.0676 | LR: 0.000011 | TPS: 1709 | 10020s
Step 2100/4000 | Loss: 2.1386 | LR: 0.000011 | TPS: 1709 | 10067s
Step 2110/4000 | Loss: 2.1891 | LR: 0.000011 | TPS: 1709 | 10114s
Step 2120/4000 | Loss: 1.9532 | LR: 0.000011 | TPS: 1709 | 10162s
Step 2130/4000 | Loss: 1.9766 | LR: 0.000011 | TPS: 1709 | 10209s
Step 2140/4000 | Loss: 2.3656 | LR: 0.000011 | TPS: 1709 | 10256s
Step 2150/4000 | Loss: 2.0545 | LR: 0.000011 | TPS: 1709 | 10303s
Step 2160/4000 | Loss: 1.9706 | LR: 0.000011 | TPS: 1710 | 10350s
Step 2170/4000 | Loss: 2.0302 | LR: 0.000010 | TPS: 1710 | 10397s
Step 2180/4000 | Loss: 2.1752 | LR: 0.000010 | TPS: 1710 | 10444s
Step 2190/4000 | Loss: 2.1455 | LR: 0.000010 | TPS: 1710 | 10492s
Step 2200/4000 | Loss: 2.2238 | LR: 0.000010 | TPS: 1710 | 10539s
Step 2210/4000 | Loss: 2.1010 | LR: 0.000010 | TPS: 1710 | 10586s
Step 2220/4000 | Loss: 2.1831 | LR: 0.000010 | TPS: 1710 | 10633s
Step 2230/4000 | Loss: 1.6542 | LR: 0.000010 | TPS: 1710 | 10680s
Step 2240/4000 | Loss: 2.1102 | LR: 0.000010 | TPS: 1711 | 10727s
Step 2250/4000 | Loss: 2.2099 | LR: 0.000010 | TPS: 1711 | 10774s
Step 2260/4000 | Loss: 2.1750 | LR: 0.000010 | TPS: 1711 | 10821s
Step 2270/4000 | Loss: 2.2369 | LR: 0.000010 | TPS: 1711 | 10869s
Step 2280/4000 | Loss: 2.0393 | LR: 0.000010 | TPS: 1711 | 10916s
Step 2290/4000 | Loss: 2.3140 | LR: 0.000010 | TPS: 1711 | 10963s
Step 2300/4000 | Loss: 2.0601 | LR: 0.000010 | TPS: 1711 | 11010s
Step 2310/4000 | Loss: 2.1472 | LR: 0.000009 | TPS: 1711 | 11057s
Step 2320/4000 | Loss: 2.0987 | LR: 0.000009 | TPS: 1712 | 11104s
Step 2330/4000 | Loss: 2.0354 | LR: 0.000009 | TPS: 1712 | 11152s
Step 2340/4000 | Loss: 1.9309 | LR: 0.000009 | TPS: 1712 | 11199s
Step 2350/4000 | Loss: 2.1222 | LR: 0.000009 | TPS: 1712 | 11246s
Step 2360/4000 | Loss: 1.9861 | LR: 0.000009 | TPS: 1712 | 11293s
Step 2370/4000 | Loss: 2.1986 | LR: 0.000009 | TPS: 1712 | 11340s
Step 2380/4000 | Loss: 2.0335 | LR: 0.000009 | TPS: 1712 | 11387s
Step 2390/4000 | Loss: 2.2123 | LR: 0.000009 | TPS: 1712 | 11434s
Step 2400/4000 | Loss: 2.0287 | LR: 0.000009 | TPS: 1712 | 11482s
📊 Val loss: 2.1943
Step 2410/4000 | Loss: 2.0483 | LR: 0.000009 | TPS: 1712 | 11534s
Step 2420/4000 | Loss: 2.0710 | LR: 0.000009 | TPS: 1712 | 11581s
Step 2430/4000 | Loss: 2.3005 | LR: 0.000009 | TPS: 1712 | 11629s
Step 2440/4000 | Loss: 2.0617 | LR: 0.000009 | TPS: 1712 | 11676s
Step 2450/4000 | Loss: 2.2063 | LR: 0.000008 | TPS: 1712 | 11723s
Step 2460/4000 | Loss: 2.0405 | LR: 0.000008 | TPS: 1712 | 11770s
Step 2470/4000 | Loss: 2.2280 | LR: 0.000008 | TPS: 1712 | 11817s
Step 2480/4000 | Loss: 2.3856 | LR: 0.000008 | TPS: 1712 | 11864s
Step 2490/4000 | Loss: 1.9853 | LR: 0.000008 | TPS: 1712 | 11911s
Step 2500/4000 | Loss: 2.0673 | LR: 0.000008 | TPS: 1713 | 11959s
Step 2510/4000 | Loss: 2.1777 | LR: 0.000008 | TPS: 1713 | 12006s
Step 2520/4000 | Loss: 1.9846 | LR: 0.000008 | TPS: 1713 | 12053s
Step 2530/4000 | Loss: 2.1922 | LR: 0.000008 | TPS: 1713 | 12100s
Step 2540/4000 | Loss: 2.0542 | LR: 0.000008 | TPS: 1713 | 12147s
Step 2550/4000 | Loss: 2.1041 | LR: 0.000008 | TPS: 1713 | 12194s
Step 2560/4000 | Loss: 2.0099 | LR: 0.000008 | TPS: 1713 | 12241s
Step 2570/4000 | Loss: 1.8186 | LR: 0.000008 | TPS: 1713 | 12289s
Step 2580/4000 | Loss: 2.2079 | LR: 0.000008 | TPS: 1713 | 12336s
Step 2590/4000 | Loss: 1.9931 | LR: 0.000007 | TPS: 1713 | 12383s
Step 2600/4000 | Loss: 2.0986 | LR: 0.000007 | TPS: 1714 | 12430s
Step 2610/4000 | Loss: 2.0439 | LR: 0.000007 | TPS: 1714 | 12477s
Step 2620/4000 | Loss: 1.9408 | LR: 0.000007 | TPS: 1714 | 12524s
Step 2630/4000 | Loss: 2.1992 | LR: 0.000007 | TPS: 1714 | 12571s
Step 2640/4000 | Loss: 2.0929 | LR: 0.000007 | TPS: 1714 | 12619s
Step 2650/4000 | Loss: 1.9728 | LR: 0.000007 | TPS: 1714 | 12666s
Step 2660/4000 | Loss: 1.8369 | LR: 0.000007 | TPS: 1714 | 12713s
Step 2670/4000 | Loss: 1.9926 | LR: 0.000007 | TPS: 1714 | 12760s
Step 2680/4000 | Loss: 2.0414 | LR: 0.000007 | TPS: 1714 | 12807s
Step 2690/4000 | Loss: 2.1368 | LR: 0.000007 | TPS: 1714 | 12854s
Step 2700/4000 | Loss: 2.0254 | LR: 0.000007 | TPS: 1714 | 12901s
Step 2710/4000 | Loss: 2.1572 | LR: 0.000007 | TPS: 1715 | 12948s
Step 2720/4000 | Loss: 2.0418 | LR: 0.000007 | TPS: 1715 | 12996s
Step 2730/4000 | Loss: 2.1235 | LR: 0.000007 | TPS: 1715 | 13043s
Step 2740/4000 | Loss: 2.0756 | LR: 0.000006 | TPS: 1715 | 13090s
Step 2750/4000 | Loss: 2.1417 | LR: 0.000006 | TPS: 1715 | 13137s
Step 2760/4000 | Loss: 1.9427 | LR: 0.000006 | TPS: 1715 | 13184s
Step 2770/4000 | Loss: 2.1166 | LR: 0.000006 | TPS: 1715 | 13231s
Step 2780/4000 | Loss: 1.9711 | LR: 0.000006 | TPS: 1715 | 13278s
Step 2790/4000 | Loss: 2.1390 | LR: 0.000006 | TPS: 1715 | 13326s
Step 2800/4000 | Loss: 2.0557 | LR: 0.000006 | TPS: 1715 | 13373s
📊 Val loss: 2.1839
Step 2810/4000 | Loss: 2.0581 | LR: 0.000006 | TPS: 1715 | 13425s
Step 2820/4000 | Loss: 2.1139 | LR: 0.000006 | TPS: 1715 | 13473s
Step 2830/4000 | Loss: 2.1228 | LR: 0.000006 | TPS: 1715 | 13520s
Step 2840/4000 | Loss: 1.9685 | LR: 0.000006 | TPS: 1715 | 13567s
Step 2850/4000 | Loss: 2.1206 | LR: 0.000006 | TPS: 1715 | 13614s
Step 2860/4000 | Loss: 2.1942 | LR: 0.000006 | TPS: 1715 | 13661s
Step 2870/4000 | Loss: 1.9068 | LR: 0.000006 | TPS: 1715 | 13708s
Step 2880/4000 | Loss: 2.2099 | LR: 0.000006 | TPS: 1715 | 13755s
Step 2890/4000 | Loss: 2.0948 | LR: 0.000006 | TPS: 1715 | 13803s
Step 2900/4000 | Loss: 2.0630 | LR: 0.000005 | TPS: 1715 | 13850s
Step 2910/4000 | Loss: 1.9867 | LR: 0.000005 | TPS: 1715 | 13897s
Step 2920/4000 | Loss: 2.0602 | LR: 0.000005 | TPS: 1715 | 13944s
Step 2930/4000 | Loss: 2.0163 | LR: 0.000005 | TPS: 1716 | 13991s
Step 2940/4000 | Loss: 2.0337 | LR: 0.000005 | TPS: 1716 | 14038s
Step 2950/4000 | Loss: 2.2476 | LR: 0.000005 | TPS: 1716 | 14085s
Step 2960/4000 | Loss: 2.0430 | LR: 0.000005 | TPS: 1716 | 14133s
Step 2970/4000 | Loss: 2.3037 | LR: 0.000005 | TPS: 1716 | 14180s
Step 2980/4000 | Loss: 2.0831 | LR: 0.000005 | TPS: 1716 | 14227s
Step 2990/4000 | Loss: 2.1781 | LR: 0.000005 | TPS: 1716 | 14274s
Step 3000/4000 | Loss: 2.0784 | LR: 0.000005 | TPS: 1716 | 14321s
🔤 Generation samples (step 3000):
[EN] The city of Paris is a metropolitan area in Europe, consisting of 57 counties. Its main cities include Lyons, Bordeaux and Valence.
[HE] איטליה.
[AR] باريس.
[FA] پاریس پایتخت کشور فرانسه و یکی از شهرهای بزرگ این کشور است. شهر پاریس در شمال غربی قاره اروپا قرار دارد.
[TRANSLATE] You are the first one in the world to learn how to think.
Step 3010/4000 | Loss: 2.1244 | LR: 0.000005 | TPS: 1716 | 14370s
Step 3020/4000 | Loss: 2.1107 | LR: 0.000005 | TPS: 1716 | 14417s
Step 3030/4000 | Loss: 2.3589 | LR: 0.000005 | TPS: 1716 | 14464s
Step 3040/4000 | Loss: 2.0592 | LR: 0.000005 | TPS: 1716 | 14511s
Step 3050/4000 | Loss: 2.0730 | LR: 0.000005 | TPS: 1716 | 14559s
Step 3060/4000 | Loss: 2.1365 | LR: 0.000005 | TPS: 1716 | 14606s
Step 3070/4000 | Loss: 1.9819 | LR: 0.000005 | TPS: 1716 | 14653s
Step 3080/4000 | Loss: 2.2175 | LR: 0.000004 | TPS: 1716 | 14700s
Step 3090/4000 | Loss: 2.1442 | LR: 0.000004 | TPS: 1716 | 14747s
Step 3100/4000 | Loss: 2.0811 | LR: 0.000004 | TPS: 1717 | 14794s
Step 3110/4000 | Loss: 2.1427 | LR: 0.000004 | TPS: 1717 | 14841s
Step 3120/4000 | Loss: 2.1722 | LR: 0.000004 | TPS: 1717 | 14889s
Step 3130/4000 | Loss: 2.0577 | LR: 0.000004 | TPS: 1717 | 14936s
Step 3140/4000 | Loss: 2.0873 | LR: 0.000004 | TPS: 1717 | 14983s
Step 3150/4000 | Loss: 2.2920 | LR: 0.000004 | TPS: 1717 | 15030s
Step 3160/4000 | Loss: 1.8839 | LR: 0.000004 | TPS: 1717 | 15077s
Step 3170/4000 | Loss: 2.0144 | LR: 0.000004 | TPS: 1717 | 15124s
Step 3180/4000 | Loss: 1.9689 | LR: 0.000004 | TPS: 1717 | 15171s
Step 3190/4000 | Loss: 2.2123 | LR: 0.000004 | TPS: 1717 | 15219s
Step 3200/4000 | Loss: 2.0510 | LR: 0.000004 | TPS: 1717 | 15266s
📊 Val loss: 2.1269
Step 3210/4000 | Loss: 2.4087 | LR: 0.000004 | TPS: 1717 | 15318s
Step 3220/4000 | Loss: 2.2608 | LR: 0.000004 | TPS: 1717 | 15365s
Step 3230/4000 | Loss: 2.1930 | LR: 0.000004 | TPS: 1717 | 15413s
Step 3240/4000 | Loss: 2.0713 | LR: 0.000004 | TPS: 1717 | 15460s
Step 3250/4000 | Loss: 2.2660 | LR: 0.000004 | TPS: 1717 | 15507s
Step 3260/4000 | Loss: 1.9479 | LR: 0.000004 | TPS: 1717 | 15554s
Step 3270/4000 | Loss: 1.9657 | LR: 0.000004 | TPS: 1717 | 15601s
Step 3280/4000 | Loss: 2.1884 | LR: 0.000004 | TPS: 1717 | 15648s
Step 3290/4000 | Loss: 2.0927 | LR: 0.000004 | TPS: 1717 | 15695s
Step 3300/4000 | Loss: 2.0393 | LR: 0.000003 | TPS: 1717 | 15743s
Step 3310/4000 | Loss: 2.1302 | LR: 0.000003 | TPS: 1717 | 15790s
Step 3320/4000 | Loss: 2.0059 | LR: 0.000003 | TPS: 1717 | 15837s
Step 3330/4000 | Loss: 1.8687 | LR: 0.000003 | TPS: 1717 | 15884s
Step 3340/4000 | Loss: 2.0293 | LR: 0.000003 | TPS: 1717 | 15931s
Step 3350/4000 | Loss: 2.1500 | LR: 0.000003 | TPS: 1718 | 15978s
Step 3360/4000 | Loss: 1.9667 | LR: 0.000003 | TPS: 1718 | 16025s
Step 3370/4000 | Loss: 2.1206 | LR: 0.000003 | TPS: 1718 | 16073s
Step 3380/4000 | Loss: 2.3028 | LR: 0.000003 | TPS: 1718 | 16120s
Step 3390/4000 | Loss: 2.0075 | LR: 0.000003 | TPS: 1718 | 16167s
Step 3400/4000 | Loss: 2.0562 | LR: 0.000003 | TPS: 1718 | 16214s
Step 3410/4000 | Loss: 1.9977 | LR: 0.000003 | TPS: 1718 | 16261s
Step 3420/4000 | Loss: 2.1680 | LR: 0.000003 | TPS: 1718 | 16308s
Step 3430/4000 | Loss: 2.0009 | LR: 0.000003 | TPS: 1718 | 16355s
Step 3440/4000 | Loss: 1.8301 | LR: 0.000003 | TPS: 1718 | 16403s
Step 3450/4000 | Loss: 2.0239 | LR: 0.000003 | TPS: 1718 | 16450s
Step 3460/4000 | Loss: 2.0535 | LR: 0.000003 | TPS: 1718 | 16497s
Step 3470/4000 | Loss: 2.1348 | LR: 0.000003 | TPS: 1718 | 16544s
Step 3480/4000 | Loss: 2.0337 | LR: 0.000003 | TPS: 1718 | 16591s
Step 3490/4000 | Loss: 1.9342 | LR: 0.000003 | TPS: 1718 | 16638s
Step 3500/4000 | Loss: 2.0052 | LR: 0.000003 | TPS: 1718 | 16685s
Step 3510/4000 | Loss: 1.9902 | LR: 0.000003 | TPS: 1718 | 16732s
Step 3520/4000 | Loss: 2.1567 | LR: 0.000003 | TPS: 1719 | 16780s
Step 3530/4000 | Loss: 2.0515 | LR: 0.000003 | TPS: 1719 | 16827s
Step 3540/4000 | Loss: 2.1572 | LR: 0.000003 | TPS: 1719 | 16874s
Step 3550/4000 | Loss: 2.1381 | LR: 0.000003 | TPS: 1719 | 16921s
Step 3560/4000 | Loss: 2.0383 | LR: 0.000003 | TPS: 1719 | 16968s
Step 3570/4000 | Loss: 2.3566 | LR: 0.000003 | TPS: 1719 | 17015s
Step 3580/4000 | Loss: 1.9773 | LR: 0.000003 | TPS: 1719 | 17062s
Step 3590/4000 | Loss: 2.0418 | LR: 0.000003 | TPS: 1719 | 17110s
Step 3600/4000 | Loss: 2.1756 | LR: 0.000002 | TPS: 1719 | 17157s
📊 Val loss: 2.1478
Step 3610/4000 | Loss: 2.0761 | LR: 0.000002 | TPS: 1718 | 17209s
Step 3620/4000 | Loss: 2.1353 | LR: 0.000002 | TPS: 1718 | 17257s
Step 3630/4000 | Loss: 2.1856 | LR: 0.000002 | TPS: 1719 | 17304s
Step 3640/4000 | Loss: 2.1298 | LR: 0.000002 | TPS: 1719 | 17351s
Step 3650/4000 | Loss: 2.0784 | LR: 0.000002 | TPS: 1719 | 17398s
Step 3660/4000 | Loss: 2.0533 | LR: 0.000002 | TPS: 1719 | 17445s
Step 3670/4000 | Loss: 2.2151 | LR: 0.000002 | TPS: 1719 | 17492s
Step 3680/4000 | Loss: 2.0177 | LR: 0.000002 | TPS: 1719 | 17539s
Step 3690/4000 | Loss: 2.1048 | LR: 0.000002 | TPS: 1719 | 17587s
Step 3700/4000 | Loss: 2.0629 | LR: 0.000002 | TPS: 1719 | 17634s
Step 3710/4000 | Loss: 2.0375 | LR: 0.000002 | TPS: 1719 | 17681s
Step 3720/4000 | Loss: 2.2282 | LR: 0.000002 | TPS: 1719 | 17728s
Step 3730/4000 | Loss: 2.2049 | LR: 0.000002 | TPS: 1719 | 17775s
Step 3740/4000 | Loss: 2.0247 | LR: 0.000002 | TPS: 1719 | 17822s
Step 3750/4000 | Loss: 2.0337 | LR: 0.000002 | TPS: 1719 | 17869s
Step 3760/4000 | Loss: 2.0922 | LR: 0.000002 | TPS: 1719 | 17917s
Step 3770/4000 | Loss: 2.1018 | LR: 0.000002 | TPS: 1719 | 17964s
Step 3780/4000 | Loss: 2.1183 | LR: 0.000002 | TPS: 1719 | 18011s
Step 3790/4000 | Loss: 2.2469 | LR: 0.000002 | TPS: 1719 | 18058s
Step 3800/4000 | Loss: 2.1373 | LR: 0.000002 | TPS: 1719 | 18105s
Step 3810/4000 | Loss: 2.1103 | LR: 0.000002 | TPS: 1719 | 18152s
Step 3820/4000 | Loss: 2.0317 | LR: 0.000002 | TPS: 1719 | 18199s
Step 3830/4000 | Loss: 2.0022 | LR: 0.000002 | TPS: 1720 | 18247s
Step 3840/4000 | Loss: 2.1618 | LR: 0.000002 | TPS: 1720 | 18294s
Step 3850/4000 | Loss: 2.1421 | LR: 0.000002 | TPS: 1720 | 18341s
Step 3860/4000 | Loss: 1.9279 | LR: 0.000002 | TPS: 1720 | 18388s
Step 3870/4000 | Loss: 2.1657 | LR: 0.000002 | TPS: 1720 | 18435s
Step 3880/4000 | Loss: 2.1433 | LR: 0.000002 | TPS: 1720 | 18482s
Step 3890/4000 | Loss: 2.0893 | LR: 0.000002 | TPS: 1720 | 18529s
Step 3900/4000 | Loss: 2.0036 | LR: 0.000002 | TPS: 1720 | 18576s
Step 3910/4000 | Loss: 2.0691 | LR: 0.000002 | TPS: 1720 | 18624s
Step 3920/4000 | Loss: 2.0282 | LR: 0.000002 | TPS: 1720 | 18671s
Step 3930/4000 | Loss: 1.9818 | LR: 0.000002 | TPS: 1720 | 18718s
Step 3940/4000 | Loss: 2.1466 | LR: 0.000002 | TPS: 1720 | 18765s
Step 3950/4000 | Loss: 2.0455 | LR: 0.000002 | TPS: 1720 | 18812s
Step 3960/4000 | Loss: 2.1226 | LR: 0.000002 | TPS: 1720 | 18859s
Step 3970/4000 | Loss: 1.9890 | LR: 0.000002 | TPS: 1720 | 18906s
Step 3980/4000 | Loss: 2.1891 | LR: 0.000002 | TPS: 1720 | 18954s
Step 3990/4000 | Loss: 1.8920 | LR: 0.000002 | TPS: 1720 | 19001s
Step 4000/4000 | Loss: 2.0073 | LR: 0.000002 | TPS: 1720 | 19048s
📊 Val loss: 2.1472
🔤 Generation samples (step 4000):
[EN] The capital of France consists of 38 cities, 26.9% (14) of which are in the metropolitan area.
[HE] צרפת היא אחת מיעדי התיירות הפופולאריים ביותר בעולם, בשל היותה מוקד משיכה תיירותי משמעותי עבור תיירים מכל רחבי העולם. העיר בנויה משני חלקים עיקריים - כיכר ד'ארסאן (Droite Sud) ורחוב ד'ארסאן (De La Roch
[AR] باريس.
[FA] پاریس شهری بزرگ و تاریخی در شمال غربی اروپا است.
[TRANSLATE] It’s very short.
============================================================
SFT TRAINING COMPLETE
Steps: 4000, Time: 19057s (317.6min)
Best val loss: 2.1164
Model saved to: /tmp/sft/sft_model_v2.pt
============================================================
Uploading to S3...
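
Note: the step and validation records above follow a fixed layout, so they can be pulled into structured form for plotting or comparison. The following is a minimal sketch, not part of the training script; the function name `parse_log` and the record keys are hypothetical, and the patterns assume only the `Step N/M | Loss | LR | TPS | Ns` and `Val loss:` formats visible in this log.

```python
import re

# Matches records such as:
#   Step 400/4000 | Loss: 2.2964 | LR: 0.000020 | TPS: 1718 | 1907s
STEP_RE = re.compile(
    r"Step (\d+)/(\d+) \| Loss: ([\d.]+) \| LR: ([\d.]+) \| TPS: (\d+) \| (\d+)s"
)
# Matches validation records such as "Val loss: 2.2256 (NEW BEST!)".
# Case-sensitive, so the final "Best val loss: ..." summary line is skipped.
VAL_RE = re.compile(r"Val loss: ([\d.]+)")


def parse_log(text):
    """Return (steps, vals).

    steps: list of dicts, one per logged training step.
    vals:  list of (step, val_loss); each validation loss is attributed
           to the most recent training step logged before it.
    """
    # Collect both kinds of match and walk them in order of appearance,
    # so the parser works whether or not records sit on separate lines.
    events = [(m.start(), "step", m) for m in STEP_RE.finditer(text)]
    events += [(m.start(), "val", m) for m in VAL_RE.finditer(text)]
    events.sort(key=lambda e: e[0])

    steps, vals = [], []
    for _, kind, m in events:
        if kind == "step":
            steps.append({
                "step": int(m.group(1)),
                "total": int(m.group(2)),
                "loss": float(m.group(3)),
                "lr": float(m.group(4)),
                "tps": int(m.group(5)),
                "elapsed_s": int(m.group(6)),
            })
        elif steps:
            vals.append((steps[-1]["step"], float(m.group(1))))
    return steps, vals
```

With the parsed output, `min(vals, key=lambda v: v[1])` picks the checkpoint with the lowest validation loss; for this run the log reports that as 2.1164, recorded at step 1600.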