sanjaystarc committed on
Commit 70f37b4 · verified · 1 Parent(s): 6496b12

Upload 5 files

Files changed (5)
  1. README.md +156 -10
  2. app.py +472 -0
  3. core_agent.py +318 -0
  4. requirements.txt +14 -0
  5. sample_data.csv +31 -0
README.md CHANGED
@@ -1,13 +1,159 @@
  ---
- title: Data Analyst Pro
- emoji: 🏒
- colorFrom: purple
- colorTo: green
- sdk: gradio
- sdk_version: 6.8.0
- app_file: app.py
- pinned: false
- license: apache-2.0
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🧠 DataMind Agent
+ ### AI-Powered Data Analyst: LangChain + Gemini + Streamlit
+
+ Upload any data file (CSV, Excel, JSON) and chat with your data using natural language. The agent analyzes, visualizes, and explains your data powered by Google Gemini.
+
+ ---
+
+ ## 🚀 Features
+
+ | Feature | Description |
+ |---|---|
+ | 📂 Multi-format support | CSV, Excel (.xlsx/.xls), JSON |
+ | 💬 Natural language Q&A | Ask anything, get intelligent answers |
+ | 📊 Auto visualizations | AI picks the best chart for your question |
+ | 🎨 Custom chart builder | Build any chart with dropdown controls |
+ | 🔍 Data explorer | Filter, search, and download raw data |
+ | 🧠 AI data summary | Executive summary generated by Gemini |
+
  ---
+
+ ## 📁 Project Structure
+
+ ```
+ data-analyst-agent/
+ ├── app.py             # Streamlit UI (main app)
+ ├── core_agent.py      # LangChain + Gemini logic
+ ├── requirements.txt   # Python dependencies
+ ├── .env               # API key config
+ ├── sample_data.csv    # Test dataset (sales data)
+ └── README.md          # This file
+ ```
+
+ ---
+
+ ## ⚙️ Setup & Installation
+
+ ### Step 1: Clone / download the project
+ ```bash
+ cd data-analyst-agent
+ ```
+
+ ### Step 2: Create a virtual environment (recommended)
+ ```bash
+ python -m venv venv
+
+ # On Windows:
+ venv\Scripts\activate
+
+ # On Mac/Linux:
+ source venv/bin/activate
+ ```
+
+ ### Step 3: Install dependencies
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Step 4: Get your free Gemini API key
+ 1. Go to [https://aistudio.google.com/app/apikey](https://aistudio.google.com/app/apikey)
+ 2. Sign in with Google
+ 3. Click **"Create API Key"**
+ 4. Copy the key (starts with `AIza...`)
+
+ ### Step 5: Add your API key
+ Either paste it directly in the app sidebar, OR add it to `.env`:
+ ```
+ GOOGLE_API_KEY=AIzaYourKeyHere
+ ```
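Step 5 offers two places for the key: the sidebar field or `.env`. The `app.py` in this commit only reads the sidebar field, so wiring the environment value in as a fallback is an assumption about intent rather than something the code already does. A minimal sketch of that precedence, where `resolve_api_key` is a hypothetical helper and the environment is assumed to have been populated already (for example by `load_dotenv()`):

```python
import os


def resolve_api_key(sidebar_value, env=None):
    """Return the first non-empty key: sidebar input first, then GOOGLE_API_KEY.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    """
    env = os.environ if env is None else env
    for candidate in (sidebar_value, env.get("GOOGLE_API_KEY")):
        # Treat None, "" and whitespace-only strings as "not provided"
        if candidate and candidate.strip():
            return candidate.strip()
    return None


if __name__ == "__main__":
    key = resolve_api_key("", {"GOOGLE_API_KEY": "AIzaYourKeyHere"})
    print("key found" if key else "no key configured")
```

Giving the sidebar value priority lets a user override a stale `.env` entry without editing files.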
+
+ ### Step 6: Run the app
+ ```bash
+ streamlit run app.py
+ ```
+
+ The app opens at **http://localhost:8501**
+
+ ---
+
+ ## 🎯 How to Use
+
+ 1. **Paste your Gemini API key** in the sidebar
+ 2. **Upload a data file** (CSV, Excel, or JSON)
+ 3. **Dashboard tab**: see auto-generated stats and charts
+ 4. **Chat tab**: ask questions like:
+    - *"What are the top selling products?"*
+    - *"Is there a correlation between age and spending?"*
+    - *"Show me outliers in the sales column"*
+ 5. **Charts tab**: build custom visualizations
+ 6. **Raw Data tab**: filter and download your data
+
+ ---
+
+ ## 💡 Example Questions to Ask
+
+ ```
+ "What is the average profit by category?"
+ "Which region has the highest sales?"
+ "Are there any missing values I should worry about?"
+ "What trends do you see in the data over time?"
+ "Which customers are the most valuable?"
+ "Give me a statistical summary of all numeric columns"
+ "What correlations exist between the columns?"
+ ```
+
+ ---
+
+ ## 🏗️ Architecture
+
+ ```
+ User (Streamlit UI)
+        │
+        ▼
+ app.py (UI Layer)
+        │
+        ├── core_agent.py
+        │     ├── load_file()          → Parses CSV/Excel/JSON → DataFrame
+        │     ├── profile_dataframe()  → Statistical profiling
+        │     ├── ask_agent()          → LangChain → Gemini → Answer
+        │     ├── make_plotly_chart()  → Renders visualizations
+        │     └── ai_recommend_chart() → Gemini picks best chart
+        │
+        └── Google Gemini 1.5 Flash (via LangChain)
+ ```
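The `load_file()` step in the diagram has to turn several JSON shapes into rows: an array of records, a single flat object, or an object whose values are column lists. The sketch below illustrates that normalization using only the standard library. `json_to_records` is a hypothetical helper, not a function in `core_agent.py`, and it returns plain lists of dicts where the project itself builds a pandas DataFrame.

```python
import json


def json_to_records(content):
    """Normalize parsed JSON into a list of row dicts."""
    if isinstance(content, list):
        # Array of records: already one row per element
        return content
    if isinstance(content, dict):
        list_lengths = {len(v) for v in content.values() if isinstance(v, list)}
        if not list_lengths:
            # Flat object of scalars: a single row
            return [content]
        if len(list_lengths) != 1:
            raise ValueError("column lists must all have the same length")
        n = list_lengths.pop()
        # Column-oriented object: scalars are repeated on every row
        return [
            {k: (v[i] if isinstance(v, list) else v) for k, v in content.items()}
            for i in range(n)
        ]
    # Top-level scalars (numbers, strings, null) cannot form a table
    raise ValueError(f"unsupported top-level JSON type: {type(content).__name__}")


rows = json_to_records(json.loads('{"region": ["N", "S"], "sales": [10, 20]}'))
print(rows)  # [{'region': 'N', 'sales': 10}, {'region': 'S', 'sales': 20}]
```

Raising on unsupported shapes keeps the error message close to the cause instead of surfacing later as a confusing failure in profiling or charting.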
+
+ ---
+
+ ## 📦 Key Libraries Used
+
+ | Library | Purpose |
+ |---|---|
+ | `langchain` | Agent framework, prompt management |
+ | `langchain-google-genai` | Gemini LLM integration |
+ | `streamlit` | Web UI |
+ | `pandas` | Data loading and manipulation |
+ | `plotly` | Interactive visualizations |
+ | `openpyxl` / `xlrd` | Excel file support |
+
+ ---
+
+ ## 🔧 Customization Ideas
+
+ - Add **PDF support** using `pdfplumber`
+ - Add **database connection** (SQLite, PostgreSQL)
+ - Add **export to PowerPoint** for chart reports
+ - Add **multi-file comparison** mode
+ - Deploy to **Streamlit Cloud** (free hosting)
+
+ ---
+
+ ## 🆓 Free Tier Limits (Gemini 1.5 Flash)
+ - 15 requests per minute
+ - 1 million tokens per minute
+ - 1,500 requests per day
+
+ This is more than enough for personal data analysis projects!
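Each chat turn in `app.py` makes two model calls (`ask_agent` and `ai_recommend_chart`), so the per-minute limit can be reached sooner than the numbers suggest. One way to stay under it is a client-side sliding-window limiter. The sketch below is an illustration only: `SlidingWindowLimiter` and `throttled` are hypothetical names that do not exist in this project, and the default of 15 calls per 60 seconds simply mirrors the limit quoted above.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls=15, period=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock          # injectable so tests can use a fake clock
        self._calls = deque()       # timestamps of recent calls, oldest first

    def wait_time(self):
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the window
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period - (now - self._calls[0])

    def record(self):
        """Register that a call is being made now."""
        self._calls.append(self.clock())


def throttled(limiter, func, *args, **kwargs):
    """Sleep if needed, then run `func` and count it against the limit."""
    delay = limiter.wait_time()
    if delay > 0:
        time.sleep(delay)
    limiter.record()
    return func(*args, **kwargs)
```

In the app this could wrap the model call, for example `throttled(limiter, llm.invoke, messages)`, with the limiter kept in `st.session_state` so it survives reruns.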
+
  ---

+ *Built with ❤️ using LangChain + Google Gemini + Streamlit*
app.py ADDED
@@ -0,0 +1,472 @@
+ """
+ app.py
+ ======
+ Streamlit UI - Data Analyst Agent (LangChain + Gemini)
+ Run: streamlit run app.py
+ """
+
+ import os
+ import io
+ import streamlit as st
+ import pandas as pd
+ import plotly.express as px
+
+ from core_agent import (
+     get_llm, load_file, profile_dataframe, profile_to_text,
+     ask_agent, auto_suggest_charts, make_plotly_chart, ai_recommend_chart
+ )
+
+ # ─── Page Config ──────────────────────────────────────────────────────────────
+ st.set_page_config(
+     page_title="DataMind Agent",
+     page_icon="🧠",
+     layout="wide",
+     initial_sidebar_state="expanded",
+ )
26
+
27
+ # ─── Custom CSS ───────────────────────────────────────────────────────────────
28
+ st.markdown("""
29
+ <style>
30
+ @import url('https://fonts.googleapis.com/css2?family=Syne:wght@400;700;800&family=DM+Sans:wght@300;400;500&display=swap');
31
+
32
+ html, body, [class*="css"] {
33
+ font-family: 'DM Sans', sans-serif;
34
+ background-color: #0a0a12;
35
+ color: #e8e8ff;
36
+ }
37
+
38
+ .main { background-color: #0a0a12; }
39
+
40
+ /* Header */
41
+ .hero-title {
42
+ font-family: 'Syne', sans-serif;
43
+ font-size: 2.8rem;
44
+ font-weight: 800;
45
+ background: linear-gradient(135deg, #e8e8ff 0%, #6C63FF 50%, #43E97B 100%);
46
+ -webkit-background-clip: text;
47
+ -webkit-text-fill-color: transparent;
48
+ background-clip: text;
49
+ margin-bottom: 0.2rem;
50
+ }
51
+ .hero-sub {
52
+ color: #6a6a9a;
53
+ font-size: 1rem;
54
+ margin-bottom: 2rem;
55
+ }
56
+
57
+ /* Cards */
58
+ .stat-card {
59
+ background: #1a1a2e;
60
+ border: 1px solid #2a2a45;
61
+ border-radius: 16px;
62
+ padding: 1.2rem 1.5rem;
63
+ text-align: center;
64
+ }
65
+ .stat-num {
66
+ font-family: 'Syne', sans-serif;
67
+ font-size: 2rem;
68
+ font-weight: 800;
69
+ color: #6C63FF;
70
+ }
71
+ .stat-label { color: #6a6a9a; font-size: 0.8rem; text-transform: uppercase; letter-spacing: 0.1em; }
72
+
73
+ /* Chat bubbles */
74
+ .user-bubble {
75
+ background: rgba(108,99,255,0.15);
76
+ border: 1px solid rgba(108,99,255,0.3);
77
+ border-radius: 18px 18px 4px 18px;
78
+ padding: 0.9rem 1.2rem;
79
+ margin: 0.5rem 0;
80
+ font-size: 0.95rem;
81
+ }
82
+ .agent-bubble {
83
+ background: #1a1a2e;
84
+ border: 1px solid #2a2a45;
85
+ border-radius: 18px 18px 18px 4px;
86
+ padding: 0.9rem 1.2rem;
87
+ margin: 0.5rem 0;
88
+ font-size: 0.95rem;
89
+ line-height: 1.6;
90
+ }
91
+
92
+ /* Sidebar */
93
+ section[data-testid="stSidebar"] {
94
+ background: #10101e;
95
+ border-right: 1px solid #2a2a45;
96
+ }
97
+
98
+ /* Buttons */
99
+ .stButton > button {
100
+ background: linear-gradient(135deg, #6C63FF, #43E97B);
101
+ color: white;
102
+ border: none;
103
+ border-radius: 12px;
104
+ font-family: 'Syne', sans-serif;
105
+ font-weight: 700;
106
+ padding: 0.6rem 1.5rem;
107
+ transition: opacity 0.2s;
108
+ }
109
+ .stButton > button:hover { opacity: 0.85; color: white; }
110
+
111
+ .stTextInput > div > div > input {
112
+ background: #1a1a2e;
113
+ border: 1px solid #2a2a45;
114
+ border-radius: 12px;
115
+ color: #e8e8ff;
116
+ }
117
+ .stSelectbox > div > div {
118
+ background: #1a1a2e;
119
+ border: 1px solid #2a2a45;
120
+ border-radius: 12px;
121
+ }
122
+
123
+ /* Tabs */
124
+ .stTabs [data-baseweb="tab-list"] {
125
+ background: #10101e;
126
+ border-radius: 12px;
127
+ gap: 0.3rem;
128
+ }
129
+ .stTabs [data-baseweb="tab"] {
130
+ background: transparent;
131
+ color: #6a6a9a;
132
+ border-radius: 10px;
133
+ font-family: 'Syne', sans-serif;
134
+ }
135
+ .stTabs [aria-selected="true"] {
136
+ background: rgba(108,99,255,0.2) !important;
137
+ color: #6C63FF !important;
138
+ }
139
+ </style>
140
+ """, unsafe_allow_html=True)
141
+
142
+
143
+ # ─── Session State ────────────────────────────────────────────────────────────
144
+ for key, default in {
145
+ "df": None,
146
+ "profile": None,
147
+ "file_type": None,
148
+ "chat_history": [],
149
+ "llm": None,
150
+ "api_key_set": False,
151
+ }.items():
152
+ if key not in st.session_state:
153
+ st.session_state[key] = default
154
+
155
+
+ # ─── Sidebar ──────────────────────────────────────────────────────────────────
+ with st.sidebar:
+     st.markdown("### 🧠 DataMind Agent")
+     st.markdown("---")
+
+     # API Key
+     st.markdown("**🔑 Gemini API Key**")
+     api_key = st.text_input(
+         "Enter your key", type="password",
+         placeholder="AIza...",
+         help="Get free key at aistudio.google.com",
+         label_visibility="collapsed"
+     )
+     if api_key:
+         if not st.session_state.api_key_set or st.session_state.get("_last_key") != api_key:
+             try:
+                 st.session_state.llm = get_llm(api_key)
+                 st.session_state.api_key_set = True
+                 st.session_state["_last_key"] = api_key
+                 st.success("✅ Connected to Gemini!")
+             except Exception as e:
+                 st.error(f"❌ Invalid key: {e}")
+
+     st.markdown("---")
+
+     # File Upload
+     st.markdown("**📁 Upload Data File**")
+     uploaded = st.file_uploader(
+         "Upload", type=["csv", "xlsx", "xls", "json"],
+         label_visibility="collapsed"
+     )
+
+     if uploaded and st.session_state.api_key_set:
+         # Streamlit reruns this script on every interaction, so reload the
+         # file (and reset the chat) only when a different file is uploaded.
+         file_sig = (uploaded.name, uploaded.size)
+         if st.session_state.get("_last_file") != file_sig:
+             with st.spinner("📊 Analyzing your data..."):
+                 try:
+                     df, ftype = load_file(uploaded)
+                     st.session_state.df = df
+                     st.session_state.file_type = ftype
+                     st.session_state.profile = profile_dataframe(df)
+                     st.session_state.chat_history = []
+                     st.session_state["_last_file"] = file_sig
+                     st.success(f"✅ Loaded {ftype} file!")
+                 except Exception as e:
+                     st.error(f"❌ Error: {e}")
+
+     elif uploaded and not st.session_state.api_key_set:
+         st.warning("⚠️ Enter your Gemini API key first")
+
+     st.markdown("---")
+     st.markdown("""
+     **How to use:**
+     1. Paste your Gemini API key above
+     2. Upload CSV, Excel, or JSON file
+     3. Explore the Dashboard tab
+     4. Ask questions in Chat tab
+     5. Generate visuals in Charts tab
+
+     ---
+     **Get free Gemini API key:**
+     [aistudio.google.com](https://aistudio.google.com/app/apikey)
+     """)
216
+
217
+
+ # ─── Main Content ─────────────────────────────────────────────────────────────
+ st.markdown('<div class="hero-title">🧠 DataMind Agent</div>', unsafe_allow_html=True)
+ st.markdown('<div class="hero-sub">AI-powered data analysis using LangChain + Gemini · Upload any data file and start exploring</div>', unsafe_allow_html=True)
+
+ if st.session_state.df is None:
+     # Landing state
+     col1, col2, col3 = st.columns(3)
+     with col1:
+         st.markdown("""
+         <div class="stat-card">
+             <div class="stat-num">📂</div>
+             <div class="stat-label">CSV, Excel, JSON</div>
+             <br><p style="color:#6a6a9a; font-size:0.85rem">Upload any tabular data file - we handle the parsing automatically</p>
+         </div>""", unsafe_allow_html=True)
+     with col2:
+         st.markdown("""
+         <div class="stat-card">
+             <div class="stat-num">💬</div>
+             <div class="stat-label">Natural Language Q&A</div>
+             <br><p style="color:#6a6a9a; font-size:0.85rem">Ask anything about your data in plain English - no SQL needed</p>
+         </div>""", unsafe_allow_html=True)
+     with col3:
+         st.markdown("""
+         <div class="stat-card">
+             <div class="stat-num">📊</div>
+             <div class="stat-label">Smart Visualizations</div>
+             <br><p style="color:#6a6a9a; font-size:0.85rem">AI picks the right chart for your question automatically</p>
+         </div>""", unsafe_allow_html=True)
+
+     st.markdown("<br>", unsafe_allow_html=True)
+     st.info("👈 Enter your Gemini API key and upload a data file in the sidebar to get started!")
249
+
250
+ else:
251
+ df = st.session_state.df
252
+ profile = st.session_state.profile
253
+ llm = st.session_state.llm
254
+
255
+ # ── Tabs ─────────────────────────────────────────────────────────────────
+     tab1, tab2, tab3, tab4 = st.tabs(["📊 Dashboard", "💬 Chat", "🎨 Charts", "🔍 Raw Data"])
257
+
258
+ # ════════════════════════════════════════════════════════════════
259
+ # TAB 1 β€” Dashboard
260
+ # ════════════════════════════════════════════════════════════════
261
+ with tab1:
262
+ rows, cols = profile["shape"]
263
+ nulls = sum(profile["null_counts"].values())
264
+ num_c = len(profile["numeric_columns"])
265
+ cat_c = len(profile["categorical_columns"])
266
+
267
+ c1, c2, c3, c4 = st.columns(4)
268
+ c1.markdown(f'<div class="stat-card"><div class="stat-num">{rows:,}</div><div class="stat-label">Rows</div></div>', unsafe_allow_html=True)
269
+ c2.markdown(f'<div class="stat-card"><div class="stat-num">{cols}</div><div class="stat-label">Columns</div></div>', unsafe_allow_html=True)
270
+ c3.markdown(f'<div class="stat-card"><div class="stat-num">{num_c}</div><div class="stat-label">Numeric Cols</div></div>', unsafe_allow_html=True)
271
+ c4.markdown(f'<div class="stat-card"><div class="stat-num">{nulls}</div><div class="stat-label">Missing Values</div></div>', unsafe_allow_html=True)
272
+
273
+ st.markdown("<br>", unsafe_allow_html=True)
274
+
275
+ # Column overview
276
+ st.markdown("#### πŸ“‹ Column Overview")
277
+ col_info = pd.DataFrame({
278
+ "Column": df.columns,
279
+ "Type": df.dtypes.astype(str).values,
280
+ "Non-Null": df.notnull().sum().values,
281
+ "Null %": (df.isnull().mean() * 100).round(1).values,
282
+ "Unique": df.nunique().values,
283
+ })
284
+ st.dataframe(col_info, use_container_width=True, hide_index=True)
285
+
286
+ # Auto charts
287
+ st.markdown("#### πŸ€– Auto-Generated Insights")
288
+ suggested = auto_suggest_charts(profile)[:3]
289
+
290
+ chart_cols = st.columns(min(len(suggested), 2))
291
+ for i, ctype in enumerate(suggested[:2]):
292
+ with chart_cols[i]:
293
+ try:
294
+ fig = make_plotly_chart(ctype, df, profile)
295
+ st.plotly_chart(fig, use_container_width=True)
296
+ except Exception as e:
297
+ st.warning(f"Could not render {ctype}: {e}")
298
+
299
+ if len(suggested) > 2:
300
+ try:
301
+ fig = make_plotly_chart(suggested[2], df, profile)
302
+ st.plotly_chart(fig, use_container_width=True)
303
+ except Exception:
304
+ pass
305
+
306
+ # AI summary
307
+ st.markdown("#### 🧠 AI Dataset Summary")
308
+ if st.button("✨ Generate AI Summary"):
309
+ with st.spinner("Gemini is analyzing your dataset..."):
310
+ summary = ask_agent(
311
+ "Give me a concise executive summary of this dataset. "
312
+ "Highlight key patterns, anomalies, and 3 actionable insights.",
313
+ df, profile, llm
314
+ )
315
+ st.markdown(f'<div class="agent-bubble">{summary}</div>', unsafe_allow_html=True)
316
+
317
+
318
+ # ════════════════════════════════════════════════════════════════
319
+ # TAB 2 β€” Chat
320
+ # ════════════════════════════════════════════════════════════════
321
+ with tab2:
322
+ st.markdown("#### πŸ’¬ Ask Anything About Your Data")
323
+ st.markdown("*The AI has full context of your dataset and can answer complex analytical questions.*")
324
+
325
+ # Suggested questions
326
+ st.markdown("**Quick questions to try:**")
327
+ suggestions = [
328
+ "What are the top 5 most important patterns in this data?",
329
+ "Are there any outliers or anomalies I should know about?",
330
+ "What correlations exist between the numeric columns?",
331
+ "Summarize the distribution of categorical columns.",
332
+ "What would you recommend analyzing further?",
333
+ ]
334
+ q_cols = st.columns(3)
335
+ for i, s in enumerate(suggestions[:3]):
336
+ with q_cols[i]:
337
+ if st.button(s, key=f"sug_{i}"):
338
+ st.session_state["prefill_q"] = s
339
+
340
+ # Chat history
341
+ for turn in st.session_state.chat_history:
342
+ st.markdown(f'<div class="user-bubble">πŸ‘€ {turn["user"]}</div>', unsafe_allow_html=True)
343
+ st.markdown(f'<div class="agent-bubble">🧠 {turn["agent"]}</div>', unsafe_allow_html=True)
344
+
345
+ # Input
346
+ prefill = st.session_state.pop("prefill_q", "")
347
+ question = st.text_input(
348
+ "Ask a question...",
349
+ value=prefill,
350
+ placeholder="e.g. What's the average sales by region?",
351
+ label_visibility="collapsed",
352
+ )
353
+
354
+ col_send, col_clear = st.columns([1, 5])
355
+ with col_send:
356
+ send = st.button("Send πŸš€")
357
+ with col_clear:
358
+ if st.button("Clear Chat"):
359
+ st.session_state.chat_history = []
360
+ st.rerun()
361
+
362
+ if send and question.strip():
363
+ with st.spinner("🧠 Gemini is thinking..."):
364
+ answer = ask_agent(question, df, profile, llm)
365
+
366
+ # Auto-generate relevant chart
367
+ chart_rec = ai_recommend_chart(question, profile, llm)
368
+ st.session_state.chat_history.append({
369
+ "user": question,
370
+ "agent": answer,
371
+ "chart_rec": chart_rec,
372
+ })
373
+
374
+ st.markdown(f'<div class="user-bubble">πŸ‘€ {question}</div>', unsafe_allow_html=True)
375
+ st.markdown(f'<div class="agent-bubble">🧠 {answer}</div>', unsafe_allow_html=True)
376
+
377
+ # Show recommended chart
378
+ if chart_rec:
379
+ st.markdown(f"*πŸ“Š Suggested chart: **{chart_rec['chart_type']}** β€” {chart_rec.get('reason','')}*")
380
+ try:
381
+ fig = make_plotly_chart(
382
+ chart_rec["chart_type"], df, profile,
383
+ x_col=chart_rec.get("x_col"),
384
+ y_col=chart_rec.get("y_col"),
385
+ )
386
+ st.plotly_chart(fig, use_container_width=True)
387
+ except Exception:
388
+ pass
389
+
390
+
+     # ════════════════════════════════════════════════════════════════
+     # TAB 3 - Charts
+     # ════════════════════════════════════════════════════════════════
394
+ with tab3:
395
+ st.markdown("#### 🎨 Custom Chart Builder")
396
+
397
+ chart_options = {
398
+ "Correlation Heatmap": "correlation_heatmap",
399
+ "Distribution Plot": "distribution_plots",
400
+ "Box Plots": "box_plots",
401
+ "Bar Chart": "bar_chart",
402
+ "Pie Chart": "pie_chart",
403
+ "Scatter Plot": "scatter",
404
+ "Line Chart": "line",
405
+ "Scatter Matrix": "scatter_matrix",
406
+ }
407
+ if profile["datetime_columns"]:
408
+ chart_options["Time Series"] = "time_series"
409
+
410
+ c1, c2, c3 = st.columns(3)
411
+ with c1:
412
+ chart_label = st.selectbox("Chart Type", list(chart_options.keys()))
413
+ with c2:
414
+ all_cols = ["(auto)"] + df.columns.tolist()
415
+ x_col = st.selectbox("X Column", all_cols)
416
+ with c3:
417
+ y_col = st.selectbox("Y Column", all_cols)
418
+
419
+ x_val = None if x_col == "(auto)" else x_col
420
+ y_val = None if y_col == "(auto)" else y_col
421
+
422
+ if st.button("🎨 Generate Chart"):
423
+ with st.spinner("Rendering..."):
424
+ try:
425
+ fig = make_plotly_chart(
426
+ chart_options[chart_label], df, profile,
427
+ x_col=x_val, y_col=y_val
428
+ )
429
+ st.plotly_chart(fig, use_container_width=True)
430
+ except Exception as e:
431
+ st.error(f"Chart error: {e}")
432
+
433
+ st.markdown("---")
434
+ st.markdown("#### πŸ“Š All Auto-Suggested Charts")
435
+ suggested_all = auto_suggest_charts(profile)
436
+ for i in range(0, len(suggested_all), 2):
437
+ cols = st.columns(2)
438
+ for j, ctype in enumerate(suggested_all[i:i+2]):
439
+ with cols[j]:
440
+ try:
441
+ fig = make_plotly_chart(ctype, df, profile)
442
+ st.plotly_chart(fig, use_container_width=True)
443
+ except Exception as e:
444
+ st.warning(f"Could not render {ctype}")
445
+
446
+
447
+ # ════════════════════════════════════════════════════════════════
448
+ # TAB 4 β€” Raw Data
449
+ # ════════════════════════════════════════════════════════════════
450
+ with tab4:
451
+ st.markdown("#### πŸ” Raw Data Explorer")
452
+
453
+ # Search/filter
454
+ search = st.text_input("πŸ”Ž Filter rows containing...", placeholder="Type to filter...")
455
+ if search:
456
+ mask = df.astype(str).apply(lambda row: row.str.contains(search, case=False, na=False)).any(axis=1)
457
+ display_df = df[mask]
458
+ st.info(f"Showing {len(display_df):,} of {len(df):,} rows matching '{search}'")
459
+ else:
460
+ display_df = df
461
+
462
+ st.dataframe(display_df, use_container_width=True, height=500)
463
+
464
+ # Download
465
+ csv_buf = io.StringIO()
466
+ df.to_csv(csv_buf, index=False)
467
+ st.download_button(
468
+ "⬇️ Download as CSV",
469
+ data=csv_buf.getvalue(),
470
+ file_name="analyzed_data.csv",
471
+ mime="text/csv"
472
+ )
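The chat and summary views above interpolate the user's question and the model's answer straight into HTML and render it with `unsafe_allow_html=True`, so any markup in either string is passed through to the page. A minimal sketch of escaping that text first, assuming the existing `user-bubble` and `agent-bubble` CSS classes; `bubble` is a hypothetical helper that is not part of `app.py`:

```python
import html


def bubble(css_class, text):
    """Wrap untrusted text in a chat-bubble <div>, escaping HTML first."""
    # Escape &, <, >, and quotes, then keep line breaks visible
    safe = html.escape(text, quote=True).replace("\n", "<br>")
    return f'<div class="{css_class}">{safe}</div>'


print(bubble("user-bubble", "<script>alert(1)</script>"))
# <div class="user-bubble">&lt;script&gt;alert(1)&lt;/script&gt;</div>
```

A call such as `st.markdown(bubble("agent-bubble", answer), unsafe_allow_html=True)` would then keep the styling while treating the text itself as plain content.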
core_agent.py ADDED
@@ -0,0 +1,318 @@
+ """
+ core_agent.py
+ =============
+ LangChain + Gemini Data Analyst Agent - Core Logic
+ Supports CSV, Excel (.xlsx, .xls), and JSON files
+ """
7
+
8
+ import os
9
+ import io
10
+ import json
11
+ import warnings
12
+ import pandas as pd
13
+ import matplotlib
14
+ matplotlib.use("Agg")
15
+ import matplotlib.pyplot as plt
16
+ import matplotlib.ticker as mticker
17
+ import seaborn as sns
18
+ import plotly.express as px
19
+ import plotly.graph_objects as go
20
+ from plotly.subplots import make_subplots
21
+ from dotenv import load_dotenv
22
+
23
+ from langchain_google_genai import ChatGoogleGenerativeAI
24
+ from langchain.prompts import PromptTemplate
25
+ from langchain.chains import LLMChain
26
+ from langchain.schema import HumanMessage, SystemMessage
27
+
28
+ warnings.filterwarnings("ignore")
29
+ load_dotenv()
30
+
31
+ # ─── Palette ─────────────────────────────────────────────────────────────────
32
+ PALETTE = ["#6C63FF", "#FF6584", "#43E97B", "#F7971E", "#4FC3F7", "#CE93D8"]
33
+ DARK_BG = "#0F0F1A"
34
+ CARD_BG = "#1A1A2E"
35
+
36
+
37
+ # ─── LLM Setup ───────────────────────────────────────────────────────────────
38
+ def get_llm(api_key: str):
39
+ return ChatGoogleGenerativeAI(
40
+ model="gemini-1.5-flash",
41
+ google_api_key=api_key,
42
+ temperature=0.3,
43
+ convert_system_message_to_human=True,
44
+ )
45
+
46
+
+ # ─── File Loading ─────────────────────────────────────────────────────────────
+ def load_file(file) -> tuple[pd.DataFrame, str]:
+     """Load uploaded file into a DataFrame. Returns (df, file_type)."""
+     name = file.name.lower()
+     if name.endswith(".csv"):
+         df = pd.read_csv(file)
+         return df, "CSV"
+     elif name.endswith((".xlsx", ".xls")):
+         df = pd.read_excel(file)
+         return df, "Excel"
+     elif name.endswith(".json"):
+         content = json.load(file)
+         if isinstance(content, list):
+             # Array of records: one row per element
+             df = pd.DataFrame(content)
+         elif isinstance(content, dict):
+             # A dict with list values is column-oriented; a dict of scalars is one row
+             if any(isinstance(v, list) for v in content.values()):
+                 df = pd.DataFrame(content)
+             else:
+                 df = pd.DataFrame([content])
+         else:
+             # Top-level scalars (numbers, strings, null) cannot form a table
+             raise ValueError("Unsupported JSON structure: expected an object or an array")
+         return df, "JSON"
+     else:
+         raise ValueError(f"Unsupported file type: {name}")
67
+
68
+
69
+ # ─── Data Profile ─────────────────────────────────────────────────────────────
70
+ def profile_dataframe(df: pd.DataFrame) -> dict:
71
+ """Generate a rich statistical profile of the dataframe."""
72
+ numeric_cols = df.select_dtypes(include="number").columns.tolist()
73
+ category_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
74
+ datetime_cols = df.select_dtypes(include=["datetime"]).columns.tolist()
75
+
76
+ profile = {
77
+ "shape": df.shape,
78
+ "columns": df.columns.tolist(),
79
+ "dtypes": df.dtypes.astype(str).to_dict(),
80
+ "numeric_columns": numeric_cols,
81
+ "categorical_columns": category_cols,
82
+ "datetime_columns": datetime_cols,
83
+ "null_counts": df.isnull().sum().to_dict(),
84
+ "null_pct": (df.isnull().mean() * 100).round(2).to_dict(),
85
+ "duplicates": int(df.duplicated().sum()),
86
+ }
87
+
88
+ if numeric_cols:
89
+ desc = df[numeric_cols].describe().round(3)
90
+ profile["numeric_stats"] = desc.to_dict()
91
+
92
+ if category_cols:
93
+ profile["top_categories"] = {
94
+ col: df[col].value_counts().head(5).to_dict()
95
+ for col in category_cols
96
+ }
97
+
98
+ return profile
99
+
100
+
+ def profile_to_text(profile: dict, df: pd.DataFrame) -> str:
+     """Convert profile dict to LLM-readable text summary."""
+     rows, cols = profile["shape"]
+     lines = [
+         f"Dataset: {rows} rows × {cols} columns",
+         f"Numeric columns  : {', '.join(profile['numeric_columns']) or 'None'}",
+         f"Categorical cols : {', '.join(profile['categorical_columns']) or 'None'}",
+         f"Datetime cols    : {', '.join(profile['datetime_columns']) or 'None'}",
+         f"Missing values   : {sum(profile['null_counts'].values())} total",
+         f"Duplicate rows   : {profile['duplicates']}",
+         "",
+         "--- Sample Data (first 5 rows) ---",
+         df.head(5).to_string(index=False),
+     ]
+     if profile.get("numeric_stats"):
+         lines += ["", "--- Numeric Stats ---"]
+         for col, stats in profile["numeric_stats"].items():
+             lines.append(f"  {col}: mean={stats.get('mean','?')}, std={stats.get('std','?')}, "
+                          f"min={stats.get('min','?')}, max={stats.get('max','?')}")
+     return "\n".join(lines)
121
+
122
+
+ # ─── AI Question Answering ────────────────────────────────────────────────────
+ def ask_agent(question: str, df: pd.DataFrame, profile: dict, llm) -> str:
+     """Send a question + data context to Gemini and return the answer."""
+     data_context = profile_to_text(profile, df)
+
+     system = """You are an expert data analyst AI. You receive a dataset summary and answer questions about it.
+ Be precise, insightful, and helpful. When relevant, suggest what visualizations would best illustrate the answer.
+ Format your response clearly. Use bullet points for lists. Use numbers and percentages when quoting statistics."""
+
+     user_msg = f"""Here is the dataset context:
+
+ {data_context}
+
+ User question: {question}
+
+ Provide a thorough, accurate analysis. If you perform calculations, show the logic briefly."""
+
+     messages = [
+         SystemMessage(content=system),
+         HumanMessage(content=user_msg),
+     ]
+
+     response = llm.invoke(messages)
+     return response.content
147
+
148
+
149
+ # ─── Visualization Engine ─────────────────────────────────────────────────────
150
+ def auto_suggest_charts(profile: dict) -> list[str]:
151
+ """Suggest relevant chart types based on data profile."""
152
+ suggestions = []
153
+ if len(profile["numeric_columns"]) >= 2:
154
+ suggestions.append("correlation_heatmap")
155
+ suggestions.append("scatter_matrix")
156
+ if profile["numeric_columns"]:
157
+ suggestions.append("distribution_plots")
158
+ suggestions.append("box_plots")
159
+ if profile["categorical_columns"] and profile["numeric_columns"]:
160
+ suggestions.append("bar_chart")
161
+ suggestions.append("pie_chart")
162
+ if profile["datetime_columns"] and profile["numeric_columns"]:
163
+ suggestions.append("time_series")
164
+ return suggestions
165
+
166
+
167
+ def make_plotly_chart(chart_type: str, df: pd.DataFrame, profile: dict,
168
+ x_col: str = None, y_col: str = None, color_col: str = None):
169
+ """Generate a Plotly figure for the given chart type."""
170
+ num_cols = profile["numeric_columns"]
171
+ cat_cols = profile["categorical_columns"]
172
+
173
+ template = "plotly_dark"
174
+
175
+ if chart_type == "correlation_heatmap" and len(num_cols) >= 2:
176
+ corr = df[num_cols].corr().round(2)
177
+ fig = px.imshow(
178
+ corr, text_auto=True, color_continuous_scale="RdBu_r",
179
+ title="Correlation Heatmap", template=template,
180
+ color_continuous_midpoint=0,
181
+ )
182
+
183
+ elif chart_type == "distribution_plots" and num_cols:
184
+ col = y_col or num_cols[0]
185
+ fig = px.histogram(
186
+ df, x=col, nbins=30, marginal="box",
187
+ title=f"Distribution of {col}",
188
+ color_discrete_sequence=PALETTE,
189
+ template=template,
190
+ )
191
+
192
+ elif chart_type == "box_plots" and num_cols:
193
+ cols = num_cols[:6]
194
+ fig = go.Figure()
195
+ for i, col in enumerate(cols):
196
+ fig.add_trace(go.Box(y=df[col], name=col, marker_color=PALETTE[i % len(PALETTE)]))
197
+ fig.update_layout(title="Box Plots β€” Numeric Columns", template=template)
198
+
+     elif chart_type == "bar_chart" and cat_cols and num_cols:
+         xc = x_col or cat_cols[0]
+         yc = y_col or num_cols[0]
+         agg = df.groupby(xc)[yc].mean().reset_index().sort_values(yc, ascending=False).head(15)
+         fig = px.bar(
+             agg, x=xc, y=yc, color=yc,
+             color_continuous_scale="Viridis",
+             title=f"Average {yc} by {xc}", template=template,
+         )
+
+     elif chart_type == "pie_chart" and cat_cols:
+         col = x_col or cat_cols[0]
+         counts = df[col].value_counts().head(8)
+         fig = px.pie(
+             values=counts.values, names=counts.index,
+             title=f"Distribution of {col}",
+             color_discrete_sequence=PALETTE,
+             template=template,
+         )
+
+     elif chart_type == "scatter_matrix" and len(num_cols) >= 2:
+         cols = num_cols[:4]
+         fig = px.scatter_matrix(
+             df, dimensions=cols,
+             color=cat_cols[0] if cat_cols else None,
+             color_discrete_sequence=PALETTE,
+             title="Scatter Matrix", template=template,
+         )
+         fig.update_traces(diagonal_visible=False, showupperhalf=False)
+
+     elif chart_type == "time_series" and profile["datetime_columns"] and num_cols:
+         dt_col = profile["datetime_columns"][0]
+         yc = y_col or num_cols[0]
+         fig = px.line(
+             df.sort_values(dt_col), x=dt_col, y=yc,
+             title=f"{yc} over Time",
+             color_discrete_sequence=PALETTE,
+             template=template,
+         )
+
+     elif chart_type == "scatter" and len(num_cols) >= 2:
+         xc = x_col or num_cols[0]
+         yc = y_col or num_cols[1]
+         # NOTE: trendline="ols" makes Plotly Express import statsmodels,
+         # so that package must be installed or this branch raises at runtime.
+         fig = px.scatter(
+             df, x=xc, y=yc,
+             color=color_col or (cat_cols[0] if cat_cols else None),
+             color_discrete_sequence=PALETTE,
+             title=f"{xc} vs {yc}",
+             trendline="ols",
+             template=template,
+         )
+
+     elif chart_type == "line" and num_cols:
+         xc = x_col or (profile["datetime_columns"][0] if profile["datetime_columns"] else num_cols[0])
+         yc = y_col or num_cols[0]
+         fig = px.line(
+             df, x=xc, y=yc,
+             color_discrete_sequence=PALETTE,
+             title=f"{yc} trend",
+             template=template,
+         )
+
+     else:
+         # Fallback: summary bar
+         if num_cols:
+             means = df[num_cols[:8]].mean()
+             fig = px.bar(
+                 x=means.index, y=means.values,
+                 labels={"x": "Column", "y": "Mean Value"},
+                 color=means.values, color_continuous_scale="Viridis",
+                 title="Column Means Overview", template=template,
+             )
+         else:
+             fig = go.Figure()
+             fig.add_annotation(text="No numeric data available for this chart type.",
+                                showarrow=False, font=dict(size=14))
+             fig.update_layout(template=template, title="Chart Unavailable")
+
+     fig.update_layout(
+         paper_bgcolor=DARK_BG,
+         plot_bgcolor=CARD_BG,
+         font=dict(family="DM Sans, sans-serif", color="#E0E0FF"),
+         margin=dict(l=40, r=40, t=60, b=40),
+     )
+     return fig
+
+
+ # ─── AI-Driven Chart Recommendation ──────────────────────────────────────────
+ def ai_recommend_chart(question: str, profile: dict, llm) -> dict:
+     """Ask Gemini which chart best answers the user's question."""
+     num_cols = profile["numeric_columns"]
+     cat_cols = profile["categorical_columns"]
+     dt_cols = profile["datetime_columns"]
+
+     prompt = f"""Given this dataset profile:
+ - Numeric columns: {num_cols}
+ - Categorical columns: {cat_cols}
+ - Datetime columns: {dt_cols}
+
+ The user asked: "{question}"
+
+ Recommend ONE chart type from this list that best answers their question:
+ [correlation_heatmap, distribution_plots, box_plots, bar_chart, pie_chart, scatter, line, time_series, scatter_matrix]
+
+ Also suggest the best x_col and y_col from the available columns.
+
+ Respond ONLY in valid JSON like:
+ {{"chart_type": "bar_chart", "x_col": "category_col", "y_col": "numeric_col", "reason": "short explanation"}}"""
+
+     try:
+         response = llm.invoke([HumanMessage(content=prompt)])
+         text = response.content.strip()
+         # strip markdown fences if present
+         if "```" in text:
+             text = text.split("```")[1]
+             if text.startswith("json"):
+                 text = text[4:]
+         return json.loads(text.strip())
+     except Exception:
+         return {"chart_type": "distribution_plots", "x_col": None, "y_col": None, "reason": "Default chart"}
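Note on `ai_recommend_chart`: the function returns whatever JSON the model produces, so an unknown `chart_type` or a column name that is not in the dataset is passed on to the chart dispatcher unchecked. A minimal sketch of a stricter parser follows, using only the standard library. The helper name `parse_recommendation` and the constants `ALLOWED_CHARTS` and `DEFAULT` are hypothetical and not part of this commit; the list of chart types is copied from the prompt above.

```python
import json
import re

# Chart types listed in the prompt of ai_recommend_chart (tuple, so lookups never hash).
ALLOWED_CHARTS = (
    "correlation_heatmap", "distribution_plots", "box_plots", "bar_chart",
    "pie_chart", "scatter", "line", "time_series", "scatter_matrix",
)
# Same fallback value the committed function returns on failure.
DEFAULT = {"chart_type": "distribution_plots", "x_col": None, "y_col": None, "reason": "Default chart"}


def parse_recommendation(text: str, columns: list[str]) -> dict:
    """Hypothetical helper: extract and validate the model's JSON reply."""
    # Take everything from the first "{" to the last "}", which skips
    # markdown fences and any chatter around the JSON object.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if not match:
        return dict(DEFAULT)
    try:
        rec = json.loads(match.group(0))
    except json.JSONDecodeError:
        return dict(DEFAULT)
    if not isinstance(rec, dict) or rec.get("chart_type") not in ALLOWED_CHARTS:
        return dict(DEFAULT)
    # Blank out column names the dataset does not contain, so the
    # dispatcher's own "x_col or ..." defaults apply instead.
    for key in ("x_col", "y_col"):
        if rec.get(key) not in columns:
            rec[key] = None
    return rec
```

Under these assumptions, the `try` body could call `parse_recommendation(response.content, num_cols + cat_cols + dt_cols)` in place of the manual fence stripping. Whether silently falling back to the default is preferable to surfacing the bad reply is a design choice this sketch does not settle.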
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ langchain==0.3.7
+ langchain-google-genai==2.0.5
+ langchain-experimental==0.3.3
+ langchain-community==0.3.7
+ google-generativeai==0.8.3
+ pandas==2.2.3
+ openpyxl==3.1.5
+ xlrd==2.0.1
+ matplotlib==3.9.2
+ seaborn==0.13.2
+ plotly==5.24.1
+ streamlit==1.40.1
+ python-dotenv==1.0.1
+ tabulate==0.9.0
sample_data.csv ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ order_id,date,product,category,region,sales,quantity,profit,customer_age,customer_gender
2
+ 1001,2024-01-05,Laptop Pro,Electronics,North,1200.00,1,240.00,34,Male
3
+ 1002,2024-01-07,Office Chair,Furniture,South,350.00,2,70.00,45,Female
4
+ 1003,2024-01-08,Wireless Mouse,Electronics,East,45.00,5,9.00,28,Male
5
+ 1004,2024-01-10,Standing Desk,Furniture,West,650.00,1,130.00,52,Female
6
+ 1005,2024-01-12,Mechanical Keyboard,Electronics,North,120.00,3,36.00,30,Male
7
+ 1006,2024-01-15,Monitor 4K,Electronics,South,400.00,2,80.00,41,Female
8
+ 1007,2024-01-18,Notebook Set,Stationery,East,25.00,10,7.50,23,Male
9
+ 1008,2024-01-20,Ergonomic Chair,Furniture,West,520.00,1,104.00,38,Female
10
+ 1009,2024-01-22,USB Hub,Electronics,North,35.00,8,10.50,26,Male
11
+ 1010,2024-01-25,Desk Lamp,Furniture,South,60.00,4,18.00,49,Female
12
+ 1011,2024-02-01,Laptop Pro,Electronics,East,1200.00,2,480.00,36,Male
13
+ 1012,2024-02-03,Wireless Headphones,Electronics,West,200.00,3,60.00,31,Female
14
+ 1013,2024-02-05,Pen Set,Stationery,North,15.00,20,6.00,22,Male
15
+ 1014,2024-02-08,Gaming Chair,Furniture,South,450.00,1,90.00,27,Female
16
+ 1015,2024-02-10,Tablet,Electronics,East,600.00,2,120.00,43,Male
17
+ 1016,2024-02-14,Bookshelf,Furniture,West,180.00,1,36.00,55,Female
18
+ 1017,2024-02-16,Webcam HD,Electronics,North,80.00,6,24.00,29,Male
19
+ 1018,2024-02-18,Sticky Notes,Stationery,South,8.00,50,4.00,24,Female
20
+ 1019,2024-02-20,Monitor Stand,Furniture,East,95.00,3,28.50,37,Male
21
+ 1020,2024-02-22,Smartphone,Electronics,West,900.00,2,180.00,33,Female
22
+ 1021,2024-03-01,Laptop Pro,Electronics,North,1200.00,3,720.00,40,Male
23
+ 1022,2024-03-04,Office Chair,Furniture,South,350.00,4,140.00,48,Female
24
+ 1023,2024-03-06,Drawing Tablet,Electronics,East,300.00,1,60.00,25,Male
25
+ 1024,2024-03-09,Filing Cabinet,Furniture,West,220.00,2,44.00,53,Female
26
+ 1025,2024-03-12,Wireless Mouse,Electronics,North,45.00,10,22.50,32,Male
27
+ 1026,2024-03-15,External SSD,Electronics,South,150.00,4,45.00,44,Female
28
+ 1027,2024-03-18,Highlighters,Stationery,East,12.00,30,5.40,21,Male
29
+ 1028,2024-03-20,Desk Organizer,Furniture,West,40.00,7,14.00,35,Female
30
+ 1029,2024-03-22,Smart Speaker,Electronics,North,120.00,5,36.00,39,Male
31
+ 1030,2024-03-25,Printer,Electronics,South,280.00,2,56.00,46,Female