--- title: Data Science Agent emoji: ๐Ÿค– colorFrom: blue colorTo: purple sdk: docker python_version: "3.12" app_file: src/api/app.py pinned: false license: mit --- # Data Science Agent ๐Ÿค– An intelligent **multi-agent AI system** for automated end-to-end data science workflows. Upload any dataset and watch the agent autonomously profile, clean, engineer features, train models, and generate insightsโ€”all through natural language. ## โœจ Key Features ### ๐Ÿง  Multi-Agent Architecture - **5 Specialist Agents**: EDA, ML Modeling, Data Engineering, Visualization, Business Insights - **Semantic Routing**: SBERT-powered agent selection based on query intent - **Autonomous Workflows**: Full ML pipeline completion without manual intervention ### ๐Ÿ“Š Complete ML Pipeline - **Data Profiling**: YData profiling, statistical analysis, data quality reports - **Data Cleaning**: Missing values, outliers, type conversion, deduplication - **Feature Engineering**: 50+ feature types (time, interactions, aggregations, encodings) - **Model Training**: 6 baseline models (Ridge, Lasso, Random Forest, XGBoost, LightGBM, CatBoost) - **Hyperparameter Tuning**: Optuna-based optimization with early stopping - **Visualizations**: Plotly dashboards, matplotlib plots, feature importance, residuals ### ๐Ÿ”ง Production-Ready Features - **Real-time Progress**: SSE streaming for live workflow updates - **Session Memory**: Maintains context across follow-up queries - **Error Recovery**: Graceful fallbacks and parameter validation - **Large Dataset Support**: Automatic sampling for 100K+ row datasets - **HuggingFace Export**: Export datasets, models, and outputs directly to your HuggingFace repos ### ๐Ÿ” Authentication & Integration - **Supabase Auth**: Secure user authentication with email/password and OAuth - **HuggingFace Integration**: Connect your HF account to export artifacts - **Personal Token Support**: Use your own HF write tokens for private uploads ## ๐Ÿ—๏ธ Architecture ``` โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ React Frontend โ”‚ โ”‚ (Upload Dataset + Chat Interface) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ SSE Stream โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ FastAPI Server โ”‚ โ”‚ (Port 7860) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Orchestrator โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Intent โ”‚ โ”‚ Agent โ”‚ โ”‚ Conversation โ”‚ โ”‚ โ”‚ โ”‚ Detection โ”‚โ”€โ”€โ”‚ Selection โ”‚โ”€โ”€โ”‚ Pruning โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ (SBERT) โ”‚ โ”‚ (12 exchanges) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 5 Specialist Agents โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ EDA โ”‚ โ”‚ Modeling โ”‚ โ”‚ Data โ”‚ โ”‚ Viz โ”‚ โ”‚ โ”‚ โ”‚ Agent โ”‚ โ”‚ Agent โ”‚ โ”‚Engineeringโ”‚ โ”‚ Agent โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Insights โ”‚ โ”‚ โ”‚ โ”‚ Agent โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 50+ Tools โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Data Profiling โ”‚ Feature Engineering โ”‚ Model Trainingโ”‚ โ”‚ โ”‚ โ”‚ Data Cleaning โ”‚ Visualizations โ”‚ NLP Analytics โ”‚ โ”‚ โ”‚ โ”‚ Time Series โ”‚ Computer Vision โ”‚ Business Intelโ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` ## ๐Ÿš€ Quick Start ### Usage 1. **Upload** your CSV/Excel/Parquet dataset 2. **Ask** in natural language: *"Analyze this dataset and predict the target column"* 3. **Watch** the agent autonomously execute the full ML pipeline 4. **Review** generated visualizations, model metrics, and insights ### Example Queries ``` "Profile this dataset and show data quality issues" "Train models to predict the 'price' column" "Generate feature importance visualizations" "What are the key insights from this analysis?" ``` ### HuggingFace Export 1. **Connect** your HuggingFace account via Settings โ†’ Add your HF token 2. **Generate** artifacts (datasets, models, visualizations) 3. **Export** directly to your HuggingFace repos from the Assets sidebar 4. **Share** your work with the ML community ## ๐Ÿ› ๏ธ Tech Stack | Component | Technology | |-----------|------------| | **LLM Provider** | Mistral (mistral-large-latest) / Gemini / Groq | | **Backend** | FastAPI + Python 3.12 | | **Frontend** | React 19 + TypeScript + Vite + Tailwind | | **Data Processing** | Polars (primary) + Pandas (XGBoost compatibility) | | **ML Libraries** | Scikit-learn, XGBoost, LightGBM, CatBoost | | **Hyperparameter Tuning** | Optuna with MedianPruner | | **Semantic Search** | Sentence-BERT (all-MiniLM-L6-v2) | | **Streaming** | Server-Sent Events (SSE) | | **Authentication** | Supabase Auth | | **Cloud Storage** | HuggingFace Hub API | ## ๐Ÿ“ Project Structure ``` src/ โ”œโ”€โ”€ api/ โ”‚ โ””โ”€โ”€ app.py # FastAPI endpoints + SSE streaming โ”œโ”€โ”€ orchestrator.py # Main workflow orchestration (4500+ lines) โ”œโ”€โ”€ session_memory.py # Context persistence across queries โ”œโ”€โ”€ session_store.py # Session database management โ”œโ”€โ”€ storage/ โ”‚ โ”œโ”€โ”€ huggingface_storage.py # HuggingFace Hub integration โ”‚ โ””โ”€โ”€ artifact_store.py # Local artifact management โ”œโ”€โ”€ tools/ โ”‚ โ”œโ”€โ”€ data_profiling.py # YData profiling, statistics โ”‚ โ”œโ”€โ”€ data_cleaning.py # Missing values, outliers โ”‚ โ”œโ”€โ”€ feature_engineering.py # 50+ feature types โ”‚ โ”œโ”€โ”€ model_training.py # 6 baseline models + progress logging โ”‚ โ”œโ”€โ”€ advanced_training.py # Optuna hyperparameter tuning โ”‚ โ”œโ”€โ”€ plotly_visualizations.py โ”‚ โ”œโ”€โ”€ matplotlib_visualizations.py โ”‚ โ””โ”€โ”€ tools_registry.py # Tool definitions for LLM โ”œโ”€โ”€ reasoning/ โ”‚ โ”œโ”€โ”€ business_summary.py # Executive summaries โ”‚ โ””โ”€โ”€ model_explanation.py # Model interpretation โ””โ”€โ”€ utils/ โ”œโ”€โ”€ semantic_layer.py # SBERT embeddings โ””โ”€โ”€ error_recovery.py # Checkpoint management ``` ## โš™๏ธ Configuration ### Environment Variables ```bash # Required - Choose one LLM provider MISTRAL_API_KEY=your_mistral_key # Recommended GEMINI_API_KEY=your_gemini_key # Alternative GROQ_API_KEY=your_groq_key # Alternative # Optional LLM_PROVIDER=mistral # mistral, gemini, or groq MAX_ITERATIONS=20 # Max workflow steps # Supabase (for authentication) SUPABASE_URL=your_supabase_url SUPABASE_ANON_KEY=your_supabase_anon_key ``` ### HuggingFace Spaces Set secrets in: **Settings โ†’ Repository secrets** ## ๐Ÿ–ฅ๏ธ Local Development ```bash # Clone repository git clone https://github.com/your-repo/data-science-agent cd data-science-agent # Install Python dependencies pip install -r requirements.txt # Install and build frontend cd FRRONTEEEND && npm install && npm run build && cd .. # Set API key export MISTRAL_API_KEY=your_key_here # Run server uvicorn src.api.app:app --host 0.0.0.0 --port 7860 ``` ## ๐Ÿ“Š Model Training Details ### Baseline Models (Regression) | Model | Type | Key Features | |-------|------|--------------| | Ridge | Linear | L2 regularization, fast | | Lasso | Linear | L1 regularization, feature selection | | Random Forest | Ensemble | Robust, feature importance | | XGBoost | Gradient Boosting | High accuracy, GPU support | | LightGBM | Gradient Boosting | Fast training, low memory | | CatBoost | Gradient Boosting | Handles categoricals natively | ### Progress Logging Real-time training progress with elapsed time: ``` ๐Ÿš€ Training 6 regression models on 140,757 samples... [1/6] Training ridge... โœ“ ridge trained in 2.3s [2/6] Training lasso... โœ“ lasso trained in 1.8s [3/6] Training random_forest... โœ“ random_forest trained in 45.2s ... ๐Ÿ† Best model: random_forest (Rยฒ=0.7585) ``` ## ๐Ÿ”ง Recent Improvements ### Workflow Reliability - โœ… **Autonomous Completion**: Full ML pipeline without manual confirmation - โœ… **Smart Context Pruning**: Keeps 12 exchanges (was 4) for better memory - โœ… **Target Column Persistence**: Injected into workflow guidance after pruning - โœ… **Parameter Validation**: Strips invalid LLM-hallucinated parameters ### Performance - โœ… **Real-time Progress Logging**: See model-by-model training status - โœ… **Large Dataset Sampling**: Auto-sample to 50K rows for tuning - โœ… **Checkpoint Clearing**: Fresh workflow for each new query ### Error Handling - โœ… **SBERT Fallback**: Graceful keyword routing if embeddings fail - โœ… **Tool Name Mapping**: Maps 8+ common hallucinated tool names - โœ… **NoneType Safety**: Validates all comparison operands ### HuggingFace Integration - โœ… **One-Click Export**: Export datasets, models, and outputs to HuggingFace - โœ… **Personal Repos**: Auto-creates `ds-agent-data`, `ds-agent-models`, `ds-agent-outputs` repos - โœ… **Secure Tokens**: User tokens stored securely in Supabase - โœ… **Status Caching**: Efficient HF connection status checking ## ๐Ÿณ Docker Deployment ```dockerfile # Multi-stage build FROM node:20-slim AS frontend # Build React frontend FROM python:3.12-slim AS backend # Install Python dependencies + copy frontend build EXPOSE 7860 CMD ["uvicorn", "src.api.app:app", "--host", "0.0.0.0", "--port", "7860"] ``` ## ๐Ÿ“ˆ Performance Benchmarks | Dataset Size | Profiling | Training (6 models) | Total Workflow | |--------------|-----------|---------------------|----------------| | 10K rows | ~5s | ~30s | ~2 min | | 50K rows | ~15s | ~2 min | ~5 min | | 175K rows | ~45s | ~5 min | ~10 min | ## ๐Ÿ”ฎ Future Enhancements We're actively working on exciting new features to make the Data Science Agent even more powerful: ### ๐Ÿ—„๏ธ BigQuery Integration - **Direct BigQuery Connection**: Query and analyze massive datasets directly from Google BigQuery - **Smart Sampling**: Intelligent sampling strategies for billion-row tables - **Cost Optimization**: Query cost estimation before execution - **Schema Discovery**: Auto-detect tables, columns, and relationships ### ๐Ÿ”— LangChain / LlamaIndex Compatibility - **Framework Agnostic**: Use as a tool within LangChain agents or LlamaIndex pipelines - **Custom Tool Registration**: Expose 50+ data science tools as LangChain tools - **RAG Integration**: Combine with document retrieval for context-aware analysis - **Memory Backends**: Support for LangChain memory stores and conversation history ### ๐Ÿ’ป First-Class CLI Experience & Beautiful TUI - **Rich Terminal UI**: Interactive dashboards with progress bars, tables, and charts - **Keyboard Navigation**: Full workflow control without leaving the terminal - **Pipeline Scripting**: Define reproducible workflows in YAML/TOML - **Offline Mode**: Run locally without requiring a browser - **SSH-Friendly**: Perfect for remote server analysis --- ## ๐Ÿค Contributing Contributions welcome! Please: 1. Fork the repository 2. Create a feature branch 3. Submit a pull request ## ๐Ÿ“„ License MIT License - see LICENSE file for details. --- **Built with โค๏ธ for autonomous data science**