GEM (GEM benchmark)

posted an update 2 days ago

Post

107

🚀 Sonic: A lightweight Python audio processing library with tempo matching, BPM detection, time-stretching, resampling & track blending — now with GPU (CUDA) acceleration for 10x speed!

Perfect for quick remixes, batch edits or syncing tracks.

👉 https://github.com/Parveshiiii/Sonic

#Python #AudioProcessing #OpenSource #PyTorch

Parveshiiii

posted an update 9 days ago

Post

1596

Excited to announce my latest open-source release on Hugging Face: Parveshiiii/breast-cancer-detector.

This model has been trained and validated on external datasets to support medical research workflows. It is designed to provide reproducible benchmarks and serve as a foundation for further exploration in healthcare AI.

Key highlights:
- Built for medical research and diagnostic study contexts
- Validated against external datasets for reliability
- Openly available to empower the community in building stronger, more effective solutions

This release is part of my ongoing effort to make impactful AI research accessible through **Modotte**. A detailed blog post explaining the methodology, dataset handling, and validation process will be published soon.

You can explore the model here: Parveshiiii/breast-cancer-detector

#AI #MedicalResearch #DeepLearning #Healthcare #OpenSource #HuggingFace

Ujjwal-Tyagi

posted an update 17 days ago

Post

2791

I am sharing my study material for AI & ML, these books are really a "bible" and gives very strong foundation, I also have given guidance, introduction and my master notes in the dataset repo card! I hope you will find them helpful, if you have any queries, just start a discussion and I am always there to help you out!
Ujjwal-Tyagi/ai-ml-foundations-book-collection

4 replies

·

Parveshiiii

posted an update 21 days ago

Post

2890

Just did something I’ve been meaning to try for ages.

In only 3 hours, on 10 billion+ tokens, I trained a custom BPE + tiktoken-style tokenizer using my new library microtok — and it hits the same token efficiency as Qwen3.

Tokenizers have always felt like black magic to me. We drop them into every LLM project, but actually training one from scratch? That always seemed way too complicated.

Turns out it doesn’t have to be.

microtok makes the whole process stupidly simple — literally just 3 lines of code. No heavy setup, no GPU required. I built it on top of the Hugging Face tokenizers library so it stays clean, fast, and actually understandable.

If you’ve ever wanted to look under the hood and build your own optimized vocabulary instead of just copying someone else’s, this is the entry point you’ve been waiting for.

I wrote up the full story, threw in a ready-to-run Colab template, and dropped the trained tokenizer on Hugging Face.

Blog → https://parveshiiii.github.io/blogs/microtok/
Trained tokenizer → Parveshiiii/microtok
GitHub repo → https://github.com/Parveshiiii/microtok

Nymbo

posted an update about 1 month ago

Post

6591

We should really have a release date range slider on the /models page. Tired of "trending/most downloaded" being the best way to sort and still seeing models from 2023 on the first page just because they're embedded in enterprise pipelines and get downloaded repeatedly. "Recently Created/Recently Updated" don't solve the discovery problem considering the amount of noise to sift through.

Slight caveat: Trending actually does have some recency bias, but it's not strong/precise enough.

3 replies

·

Ujjwal-Tyagi

posted an update about 1 month ago

Post

413

We have now LTX 2.3 with more better visual quality and richer sound, check it out! Lightricks/LTX-2.3

albertvillanova

posted an update about 2 months ago

Post

2362

🚀 TRL v0.29.0 introduces trl-training: an agent-native training skill.

This makes the TRL CLI a structured, agent-readable capability, allowing AI agents to reliably execute training workflows such as:
- Supervised Fine-Tuning (SFT)
- Direct Preference Optimization (DPO)
- Group Relative Policy Optimization (GRPO)

We’re excited to see what the community builds on top of this.

If you’re working on AI agents, alignment research, or scalable RL training infrastructure: give TRL v0.29.0 a try! 🤗

The future of ML tooling is agent-native.
🔗 https://github.com/huggingface/trl/releases/tag/v0.29.0

Ujjwal-Tyagi

posted an update about 2 months ago

Post

2933

Public reports allege that Anthropic gobbled up trillions of tokens of copyrighted material and public data to build their castle. 🏰📄 Now that they're sitting on top, they're begging for special laws to protect their profits while pulling the ladder up behind them. 🪜🚫

But the hypocrisy meter just broke! 📉 They are accusing Chinese labs like DeepSeek, Minimax, and Kimi of "huge distillation attacks. The Reality is that You can't just loot the entire internet's library, lock the door, and then sue everyone else for reading through the window. Stop trying to gatekeep the tech you didn't own in the first place. Read the complete article on it: https://huggingface.co/blog/Ujjwal-Tyagi/the-dark-underbelly-of-anthropic

3 replies

·

Ujjwal-Tyagi

posted an update about 2 months ago

Post

224

Qwen 3.5 Model is here! Supporting 1m context length by default, It is giving much good performance and competitive to Claude Opus 4.6, Qwen/Qwen3.5-397B-A17B, here it's GGUF: unsloth/Qwen3.5-397B-A17B-GGUF, Follow me and turn on the notification for the latest news!

lewtun

submitted 2 papers to Daily Papers 2 months ago

Single-minus gluon tree amplitudes are nonzero

Paper • 2602.12176 • Published Feb 12 • 8

Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

Paper • 2602.03773 • Published Feb 3 • 13

Ujjwal-Tyagi

posted an update 2 months ago

Post

3030

GLM 5 is insane, it ranks #4 Globally!

4 replies

·

albertvillanova

posted an update 2 months ago

Post

1854

5 years already working in democratizing AI 🤗
Grateful to be part of such an awesome team making it happen every day.

Parveshiiii

posted an update 2 months ago

Post

339

Introducing Seekify — a truly non‑rate‑limiting search library for Python

Tired of hitting rate limits when building search features? I’ve built Seekify, a lightweight Python library that lets you perform searches without the usual throttling headaches.

🔹 Key highlights

- Simple API — plug it in and start searching instantly

- No rate‑limiting restrictions

- Designed for developers who need reliable search in projects, scripts, or apps

📦 Available now on PyPI:

pip install seekify

👉 Check out the repo: https:/github.com/Parveshiiii/Seekify
I’d love feedback, contributions, and ideas for real‑world use cases. Let’s make search smoother together!

Sri-Vigneshwar-DJ

posted an update 2 months ago

Post

1440

Just released a new dataset designed for training reasoning models on Meta (Facebook/Instagram) advertising fatigue detection!

What is it? A GRPO (Group Relative Policy Optimization) training dataset with 200+ carefully crafted scenarios covering:

🔍 Fatigue Signal Detection: CTR drops, CPM spikes, frequency analysis
🩺 Performance Diagnosis: Root cause analysis frameworks
📋 Strategy: Creative refresh cadence, testing frameworks
📊 Analysis: ROI calculations, metric interpretation
Why GRPO? GRPO training helps models learn structured reasoning. Each response follows the <thinking> and <answer> format.

Check it out here: Sri-Vigneshwar-DJ/meta-fatigue-grpo-dataset

Ujjwal-Tyagi

posted an update 3 months ago

Post

1370

Finally we got a benchmark and research paper on ai safety, I am very excited to see what comes next on protecting AGI AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security (2601.18491)

AI45Research/ATBench

gentaiscool

authored a paper 3 months ago

PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues

Paper • 2601.17277 • Published Jan 24 • 6

yjernite

authored a paper 3 months ago

INTIMA: A Benchmark for Human-AI Companionship Behavior

Paper • 2508.09998 • Published Aug 4, 2025 • 11

Parveshiiii

posted an update 3 months ago

Post

1640

🚀 Wanna train your own AI Model or Tokenizer from scratch?

Building models isn’t just for big labs anymore — with the right data, compute, and workflow, you can create **custom AI models** and **tokenizers** tailored to any domain. Whether it’s NLP, domain‑specific datasets, or experimental architectures, training from scratch gives you full control over vocabulary, embeddings, and performance.

✨ Why train your own?
- Full control over vocabulary & tokenization
- Domain‑specific optimization (medical, legal, technical, etc.)
- Better performance on niche datasets
- Freedom to experiment with architectures

⚡ The best part?
- Tokenizer training (TikToken / BPE) can be done in **just 3 lines of code**.
- Model training runs smoothly on **Google Colab notebooks** — no expensive hardware required.

📂 Try out my work:
- 🔗 https://github.com/OE-Void/Tokenizer-from_scratch
- 🔗 https://github.com/OE-Void/GPT

Sri-Vigneshwar-DJ

posted an update 3 months ago

Post

232

🏙️ Hugging Face Community Post
Title: 🧬 Experimenting with "Dynamic Chaos" in Tamil SLMs

Hi everyone! I just published a new experimental study on Small Language Model (SLM) resilience.

I took the Qwen2.5-0.5B model and put it through a "Chaos Phase" to see how much weight data a tiny model can lose before its understanding of classical Tamil grammar breaks.

Key highlights of the study:

Target Data: Fine-tuned on the Thirukkural (1,330 couplets + modern explanations).
The Chaos Step: Applied 20% random weight pruning but implemented "Layer Protection" for the Token Embeddings and LM Head to keep the characters readable.
Compression: 4-bit (Q4_K_M) quantization for extreme efficiency.
Result: A surrealist classical Tamil model that is ultra-light (~300MB) and ultra-fast!

Check out the model and the experiment logic here: Sri-Vigneshwar-DJ/qwen-tamil-chaos-v1

AI & ML interests

Team members 100

GEM's activity