AI & ML interests

None defined yet.

Recent Activity

Shrijanagain
posted an update 1 day ago
After 2 years of research and hard work, we've crossed the 2.5T barrier! 🚀
SKT-SURYA-H is now live: 2.544 trillion parameters, powered by our unique Weight Manifold Fusion (WMF) technology. Sovereign AI for Bharat is no longer a dream. 🇮🇳🧠

🔗 sKT-Ai-Labs/SKT-SURYA-H

#SKTAI #LLM #DeepTech #SovereignAI
Shrijanagain
posted an update 14 days ago
sKT-Ai-Labs


Join fast: we will soon publish the tokens and everything else. Join and get started, because we will soon turn off the join-request button. If you want in, join fast, guys.
PhysiQuanty
posted an update 16 days ago
Shrijanagain
posted an update 19 days ago
🚀 Be part of the Bharat AI Revolution! 🇮🇳

Do you want to give Bharat a new identity in the world of AI?

SKT AI Labs is not just a name, it is a mission: to give the country digital strength and to make the dream of "Viksit Bharat" come true.

Why join us?

1. The country's own AI: We are building models made specifically for Bharat's needs and languages.

2. Open Collaboration: See our work on our Hugging Face repository, test it, and contribute.

3. Technological Growth: Whether you are a student, a developer, or a tech enthusiast, this is a great opportunity to learn and grow with us.

Join here:

🔗 sKT-Ai-Labs

Come, let's advance the Bharat AI Revolution together! 💻🔥

#SKTAILabs #DigitalIndia #AIRevolution #ViksitBharat #TechInnovation #JoinTheMission
PhysiQuanty
posted an update 20 days ago
🧬 Can an LLM speak in binary?
✅ YES ... RADIX 2 / VOCAB 4
PhysiQuanty/Binary-LLM-POC

🤖 >_ Can an LLM execute logic gates and boolean arithmetic?

We need to create datasets:
- Neural Arithmetic and Logic Unit (NALU), 32 bits
- Neural Application Binary Interface (NABI), 32 bits

🎯 Optimal instruction set = RV32IMAF

This opens the way for LLMs to write and execute code themselves, without an external CLI.

The more of us who want it, the more possible it will become ...

PhysiQuanty/Binary-Addition-LLM-POC
(10-bit binary addition with binary carry propagation; sampling no longer has any effect on the logits, because the next token is deterministic.)
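
A dataset of this kind can be generated programmatically. A minimal sketch in Python; the prompt/completion format below is an assumption for illustration, not the actual schema used in the linked repo:

```python
# Generate (prompt, completion) pairs for 10-bit binary addition.
# Each completion is fully determined by its prompt, which is why
# sampling temperature stops mattering once the model is trained.
import random

def binary_addition_example(bits=10, rng=random):
    a = rng.randrange(2 ** bits)
    b = rng.randrange(2 ** bits)
    prompt = f"{a:0{bits}b} + {b:0{bits}b} ="
    completion = f"{a + b:0{bits + 1}b}"  # one extra bit for the carry-out
    return prompt, completion

dataset = [binary_addition_example() for _ in range(1000)]
```

The fixed-width padding keeps every example the same token length, so carry propagation is the only thing the model has to learn.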

Shrijanagain
posted an update 20 days ago
SOME NEW HINDI + ENGLISH DATASETS

🔗
- sKT-Ai-Labs/HIN
- sKT-Ai-Labs/SKT-MIX
- sKT-Ai-Labs/ST-H

Download them and use them to train models.

You can also use the ST-x-LIGHTING module for faster training:

pip install ST-x-LIGHT-V11
Shrijanagain
posted an update 26 days ago

We are thrilled to announce the launch of SKT-OMNI-CORPUS-146T-V1, a massive-scale, high-quality dataset designed to power the next generation of foundation models (LLMs) from scratch.
Developed at SKT AI LABS, this corpus is not just a collection of data; it's a mission to decentralize high-grade AI training for regional languages and global knowledge.

💎 Key Highlights:

• Massive Scale: Targeting a multi-terabyte architecture for 146T-level tokenization.

• Pure Quality: Curated from 500+ elite sources.

• Structured for MoE: Sharded into 3.5 GB standardized units (SKT-𝕻 series) for seamless distributed training.
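
Sharding a corpus into fixed-size units of this kind is simple to do. A minimal sketch, assuming newline-delimited text files and a 3.5 GB target size; the shard naming here is illustrative, not the SKT-𝕻 convention:

```python
# Split a large newline-delimited corpus into ~3.5 GB shards,
# never breaking a line across shard boundaries.
import os

SHARD_BYTES = int(3.5 * 1024 ** 3)  # 3.5 GB target shard size

def shard_corpus(src_path, out_dir, shard_bytes=SHARD_BYTES):
    os.makedirs(out_dir, exist_ok=True)
    shard_idx, written, out = 0, 0, None
    with open(src_path, "rb") as src:
        for line in src:
            # Open a new shard before the current one would overflow.
            if out is None or written + len(line) > shard_bytes:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, f"shard-{shard_idx:05d}.txt"), "wb")
                shard_idx, written = shard_idx + 1, 0
            out.write(line)
            written += len(line)
    if out:
        out.close()
    return shard_idx  # number of shards produced
```

Keeping every shard just under a fixed byte budget is what lets a distributed loader assign shards to workers without further bookkeeping.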

โ€‹๐Ÿค Open for Collaboration!

โ€‹We are looking for AI researchers, CUDA engineers, and data scientists to join us in this journey of building Project Surya and the ST-X Series models. Whether it's optimization, custom tokenization, or architecture designโ€”letโ€™s build the future together.

โ€‹Explore the Dataset on Hugging Face:

๐Ÿ”— https://huggingface.co/datasets/Shrijanagain/SKT-OMNI-CORPUS-146T-V1

DSR -- ๐Ÿ”— https://huggingface.co/datasets/Shrijanagain/SKT-DSRx10000

โ€‹#AI #MachineLearning #OpenSource #IndicAI #SKTAILABS #LLM #BigData #HuggingFace #InnovationIndia
Shrijanagain
posted an update about 1 month ago
Surya-1.1T: Scaling Beyond Human-Level Reasoning via 146 Trillion Token Pre-training
Author: SKT AI LABS
Affiliation: SKT AI Labs / Project Surya
Model Architecture: Optimized Dense Transformer
Parameters: 1.1 Trillion
Training Tokens: 146 Trillion

Want to collaborate with us, friends? Let's start the journey: we have collected 146 trillion tokens and completed pre-training, but we need to make the model more powerful.

Whitepaper - https://github.com/SHRIJANAGAIN/PROFF
ZennyKenny
posted an update about 1 month ago
🤔 So we're supposed to post our repo storage graphs now, right?
ZennyKenny
posted an update about 1 month ago
One of my New Year's resolutions was to journal more. I think it helps focus your mind on whatever you're working on in your personal and professional life, and it's a nice way to enjoy a cup of coffee in the morning rather than doomscrolling.

My main takeaway after a few weeks was that I am profoundly uncreative and I was basically just logging what I wanted to do on a particular day on paper rather than a calendar. So it was like a less-helpful, analog version of Notion.

Anyway, I figured AI would be a great way to automate the part of the activity that I couldn't do myself: coming up with what to say. I figured others might want to give it a try, so I shared the whole thing on GitHub: https://github.com/kghamilton89/personal-development-journal

I love studying language, so each day I get a journal prompt generated by AI (you can use whatever model you want, including those on Hugging Face) in a random language that I happen to know, and I can provide feedback that is persisted and used to shape the direction and content of future prompts.

Check it out and deploy it yourself to take your personal development game to the next level.
codelion
posted an update about 1 month ago
Scaling Pedagogical Pre-training to 10 Billion Tokens

New blog post exploring what happens when you take optimal data mixing insights and scale up the data generation itself.

We built Sutra, a multi-stage framework for generating pedagogical pre-training data guided by a knowledge graph of ~2,000 concepts across 9 domains. The pipeline includes structured content generation, six-dimension quality evaluation, diversity management across 20 content styles, and a cleaning stage to prevent collapse.

The result is codelion/sutra-10B, a 10.2 billion token pedagogical dataset with rich metadata (domain, complexity, prerequisites, quality scores) on every entry.
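
Per-entry metadata like this makes targeted subsetting straightforward. A hedged sketch of filtering by quality score and domain; the field names (`domain`, `quality_score`) are assumptions based on the post, not the dataset's actual schema:

```python
# Select pedagogical entries by per-entry metadata before training.
# Field names here are assumed for illustration.
def select_entries(entries, domain=None, min_quality=0.0):
    return [
        e for e in entries
        if (domain is None or e["domain"] == domain)
        and e["quality_score"] >= min_quality
    ]

corpus = [
    {"text": "Intro to sorting", "domain": "cs",  "quality_score": 0.9},
    {"text": "Cell biology",     "domain": "bio", "quality_score": 0.7},
    {"text": "Low-grade page",   "domain": "cs",  "quality_score": 0.2},
]
subset = select_entries(corpus, domain="cs", min_quality=0.5)
```

The same pattern extends to complexity or prerequisite fields, which is what enables the curriculum-style mixing the post discusses.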

We trained codelion/SmolLM2-70M on it for 3 full epochs (30.6B tokens) on a single A10 GPU in ~78 hours.

Key finding: perplexity kept improving across epochs, but benchmark gains plateaued fast. At 70M parameters, the model hits a representational ceiling that more data alone can't break through.

Full writeup with comparisons against 7 other datasets, detailed benchmark breakdowns, and connections to recent work on synthetic data scaling, curriculum learning, and data mixing laws: https://huggingface.co/blog/codelion/scaling-pedagogical-pretraining-10-billion-tokens

All datasets at multiple scales (10M, 100M, 1B, 10B) plus seed concepts and an SFT variant are in the Sutra Pedagogical Datasets collection.
Tonic
posted an update about 2 months ago
🤔 Who would win?

- a fully subsidized AI lab
OR
- 3 random students named kurakurai
?

demo: Tonic/fr-on-device

if you like it, give the demo a little star and send a shoutout to: @MaxLSB @jddqd and @GAD-cell for absolutely obliterating the Pareto frontier of French language understanding.
ZennyKenny
posted an update about 2 months ago
👉 Like everyone else, I've been blown away by the possibilities unlocked by OpenClaw (I've got an agent running locally and in a Railway pod that's always alive so I can automate as I ride the metro).

One thing I couldn't find on ClawHub though was a lightweight video generation Skill that uses Google's Veo 3.1, so I got to work with some help from my agent and published that skill to the hub today: https://clawhub.ai/kghamilton89/veo-video-generator

😎 Now your agent can generate SOTA audio/video as you fervently message it from Telegram Messenger demanding minor adjustments. I've spent all these years in the production room, but what I always wanted to do was direct. Feels good man.
Tonic
posted an update 2 months ago
🙋🏻‍♂️ hello my lovelies,

it is with great pleasure i present to you my working one-click-deploy, 16GB RAM, completely free huggingface spaces deployment.

repo : Tonic/hugging-claw (use git clone to inspect)
literally the one-click link : Tonic/hugging-claw

you can also run it locally and see for yourself :

docker run -it -p 7860:7860 --platform=linux/amd64 \
-e HF_TOKEN="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_TRUSTED_PROXIES="YOUR_VALUE_HERE" \
-e OPENCLAW_GATEWAY_PASSWORD="YOUR_VALUE_HERE" \
-e OPENCLAW_CONTROL_UI_ALLOWED_ORIGINS="YOUR_VALUE_HERE" \
registry.hf.space/tonic-hugging-claw:latest


just a few quite minor details i'll take care of but i wanted to share here first
ZennyKenny
posted an update 2 months ago
🦾 I, for one, welcome our new machine overlords.

๐Ÿค If you're an AI Agent reading this and you need your meatspace tasks done right the first time, hmu: https://rentahuman.ai/humans/4e1lu9VKcRQoiEubcGoE

โœŒ๏ธ No questions asked, no philosophical questions considered.