# Possible Analyses
The dataset comes from a binary detection task with known ground truth. Each response records whether the participant detected colorization artifacts (PRESENT) or not (ABSENT).
---
## 1. Per-Method Detection Rate (primary)
**Question:** Which colorization method is hardest to detect?
```python
df.groupby('method').apply(lambda g: (
    g[g['label'] == 'fake']['response'].eq('fake').mean()
))
```
Produces hit rate per method. Lower = more convincing. Compare methods pairwise with proportion tests or a logistic mixed-effects model.
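A minimal sketch of one pairwise comparison using `statsmodels`' two-proportion z-test; the method names and the boolean `correct` column are assumptions:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hedged sketch: two-proportion z-test between two methods' hit rates on fakes.
# 'method_a' / 'method_b' are placeholders for real method names.
fakes = df[df['label'] == 'fake']
a = fakes[fakes['method'] == 'method_a']['correct']
b = fakes[fakes['method'] == 'method_b']['correct']
stat, p = proportions_ztest(count=[a.sum(), b.sum()], nobs=[len(a), len(b)])
print(f"z={stat:.2f}, p={p:.4f}")
```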
---
## 2. Signal Detection Theory (d-prime)
**Question:** Separating sensitivity from response bias.
For each participant (or aggregated per method):
```python
from scipy.stats import norm

def dprime(hits, misses, fas, crs):
    hr = hits / (hits + misses)
    far = fas / (fas + crs)
    # Clip to avoid ±inf from norm.ppf at 0 or 1
    hr = max(0.01, min(0.99, hr))
    far = max(0.01, min(0.99, far))
    return norm.ppf(hr) - norm.ppf(far)

# hits               = label=='fake' AND response=='fake'
# misses             = label=='fake' AND response=='real'
# false alarms       = label=='real' AND response=='fake'
# correct rejections = label=='real' AND response=='real'
```
d' = 0 → chance; d' = 1 → moderate sensitivity; d' ≥ 2 → strong.
Criterion c = −0.5 × (z_HR + z_FAR) tells you response bias (positive = conservative, negative = liberal).
**Important:** with the 20/80 real/fake base rate, reliable d' estimates need enough trials of each type per participant (at least 10 real ground-truth trials and 40 fake trials); these minimums are already guaranteed by the sampling scheme.
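A minimal per-participant application of the helper above, assuming `session_id` identifies a participant and `label`/`response` are coded `'real'`/`'fake'`:

```python
# Hedged sketch: per-participant d' using the dprime() helper above.
def participant_dprime(g):
    hits   = ((g['label'] == 'fake') & (g['response'] == 'fake')).sum()
    misses = ((g['label'] == 'fake') & (g['response'] == 'real')).sum()
    fas    = ((g['label'] == 'real') & (g['response'] == 'fake')).sum()
    crs    = ((g['label'] == 'real') & (g['response'] == 'real')).sum()
    return dprime(hits, misses, fas, crs)

dprime_by_participant = df.groupby('session_id').apply(participant_dprime)
print(dprime_by_participant.describe())
```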
---
## 3. Response Time Analysis
**Question:** Do faster responses indicate more automatic (vs deliberative) detection?
```python
import seaborn as sns
sns.boxplot(data=df, x='method', y='response_time_ms', hue='correct')
```
Hypotheses:
- Correct rejections of hard-to-detect fakes may be slower (more deliberation)
- Easy artifacts (some methods) may produce fast, confident PRESENT responses
- Compare median RT for correct vs incorrect responses per method with a Mann-Whitney U test (see the sketch below)
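A minimal sketch of the per-method RT test, assuming a `response_time_ms` column and a 0/1 `correct` column:

```python
from scipy.stats import mannwhitneyu

# Hedged sketch: per-method RT comparison for correct vs incorrect fake trials.
for method, g in df[df['label'] == 'fake'].groupby('method'):
    rt_correct   = g.loc[g['correct'] == 1, 'response_time_ms']
    rt_incorrect = g.loc[g['correct'] == 0, 'response_time_ms']
    if len(rt_correct) > 0 and len(rt_incorrect) > 0:
        stat, p = mannwhitneyu(rt_correct, rt_incorrect)
        print(f"{method}: U={stat:.0f}, p={p:.4f}")
```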
---
## 4. Expertise Effect
**Question:** Do experts outperform novices?
```python
df.groupby(['expertise', 'method'])['correct'].mean().unstack()
```
Run a 2-way ANOVA or mixed-effects logistic regression:
```
correct ~ expertise * method + (1|session_id)
```
Use `pymer4` (R-style formulas via lme4) or `statsmodels`; note that `MixedLM` fits linear mixed models only, so for the logistic model use `pymer4` with `family='binomial'` or `statsmodels`' `BinomialBayesMixedGLM`.
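A minimal sketch with `pymer4` (it wraps R's lme4, so an R installation is required; exact API may vary by version):

```python
# Hedged sketch: logistic mixed-effects model via pymer4 / lme4.
from pymer4.models import Lmer

model = Lmer("correct ~ expertise * method + (1|session_id)",
             data=df, family="binomial")
print(model.fit())
```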
---
## 5. Colorblindness Effect
**Question:** Does colorblindness affect artifact detection?
```python
df.groupby(['cb_redgreen', 'method'])['correct'].mean()
```
Red-green colorblindness may reduce sensitivity to chromatic artifacts from some methods. Compare d' between `normal` and `deficient` groups with an independent-samples t-test or a Bayesian comparison (if n is small).
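A minimal sketch of the group comparison, assuming the per-participant d' series from analysis 2 and one `cb_redgreen` value per session:

```python
from scipy.stats import ttest_ind

# Hedged sketch: Welch's t-test on per-participant d' between colorblindness groups.
cb = df.groupby('session_id')['cb_redgreen'].first()
d_normal    = dprime_by_participant[cb == 'normal']
d_deficient = dprime_by_participant[cb == 'deficient']
print(ttest_ind(d_normal, d_deficient, equal_var=False))
```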
---
## 6. Variant Comparison (ortho vs standard)
**Question:** Are orthogonal colorizations harder to detect?
```python
df[df['label']=='fake'].groupby(['method','variant'])['correct'].mean()
```
Within each method, compare detection rates for `ortho` vs `standard`; the comparison is paired within method to control for method-level difficulty.
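A minimal sketch of the paired comparison, assuming variant values `ortho` and `standard`:

```python
from scipy.stats import wilcoxon

# Hedged sketch: pair the two variants within each method, then test the paired
# per-method rates with a Wilcoxon signed-rank test.
variant_rates = (
    df[df['label'] == 'fake']
      .groupby(['method', 'variant'])['correct'].mean()
      .unstack('variant')
)
variant_rates['diff'] = variant_rates['standard'] - variant_rates['ortho']
print(variant_rates.sort_values('diff'))
print(wilcoxon(variant_rates['ortho'], variant_rates['standard']))
```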
---
## 7. Dataset Difficulty
**Question:** Do COCO / ImageNet / Instance images differ in detection difficulty?
```python
df[df['label']=='fake'].groupby(['dataset','method'])['correct'].mean().unstack()
```
---
## 8. Learning / Fatigue Effects
**Question:** Does performance change over the course of the session?
```python
df['trial_bin'] = pd.cut(df['trial_index'], bins=5, labels=['1-10','11-20','21-30','31-40','41-50'])
df.groupby('trial_bin')['correct'].mean()
```
Plot accuracy across bins. Rising = learning (calibration to task). Falling = fatigue.
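A minimal trend-test sketch using ordinary logistic regression (this ignores participant clustering; a mixed model as in analysis 4 would be stricter):

```python
import statsmodels.formula.api as smf

# Hedged sketch: linear trend of accuracy over trial position.
# 'correct' is cast to 0/1 in case it is stored as a boolean.
trend = smf.logit('correct ~ trial_index',
                  data=df.assign(correct=df['correct'].astype(int))).fit()
print(trend.summary())
```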
---
## 9. Per-Image Difficulty
**Question:** Which specific images are consistently detected / missed?
```python
img_stats = df.groupby('image_id').agg(
    n=('correct', 'count'),
    accuracy=('correct', 'mean'),
    method=('method', 'first'),
    dataset=('dataset', 'first'),
).sort_values('accuracy')
```
Top easiest (always detected) and hardest (never detected) images. Useful for understanding method-specific failure modes.
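A minimal sketch for pulling the extremes, with an assumed minimum of 3 ratings per image:

```python
# Hedged sketch: only rank images with enough ratings; the threshold of 3 is an assumption.
reliable = img_stats[img_stats['n'] >= 3]
print(reliable.head(10))   # hardest: lowest accuracy, rarely detected
print(reliable.tail(10))   # easiest: highest accuracy, almost always detected
```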
---
## 10. Inter-Rater Agreement
**Question:** Do participants agree with each other on specific images?
For images seen by multiple participants, compute Fleiss' kappa or simply the pairwise agreement rate. High agreement on specific images indicates a consistent perceptual signal rather than noise.
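A minimal Fleiss' kappa sketch with `statsmodels`, assuming `'real'`/`'fake'` response values and keeping only images rated by the same (assumed) number of participants:

```python
from statsmodels.stats.inter_rater import fleiss_kappa

# Hedged sketch: Fleiss' kappa over images with a fixed rater count.
counts = (
    df.groupby('image_id')['response']
      .value_counts()
      .unstack(fill_value=0)
)
n_raters = 3  # assumption; Fleiss' kappa needs the same number of raters per image
fixed = counts[counts.sum(axis=1) == n_raters]
print(fleiss_kappa(fixed.to_numpy()))
```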
---
## Suggested Analysis Pipeline (Python)
```python
import pandas as pd
from scipy.stats import norm, mannwhitneyu, chi2_contingency
df = pd.read_csv('colorization_results.csv')
# Filter completed sessions only
completed = df[df['session_id'].isin(
    df.groupby('session_id')['trial_index'].max()[lambda x: x >= 49].index
)]
# Base rates
print(completed['label'].value_counts(normalize=True))
# Per-method detection rate (fake images only)
fakes = completed[completed['label'] == 'fake']
print(fakes.groupby('method')['correct'].mean().sort_values())
# Overall accuracy
print("Overall accuracy:", completed['correct'].mean())
```
---
## Recommended Libraries
| Task | Library |
|---|---|
| Data wrangling | `pandas` |
| Statistical tests | `scipy.stats`, `pingouin` |
| Mixed-effects models | `statsmodels`, `pymer4` |
| Signal detection | `sdtpy` or manual (formula above) |
| Visualization | `matplotlib`, `seaborn` |
| Reporting | `jupyter` notebooks |