# Possible Analyses

The dataset comes from a binary detection task with known ground truth. Each response records whether the participant detected colorization artifacts (PRESENT) or not (ABSENT).

---

## 1. Per-Method Detection Rate (primary)

**Question:** Which colorization method is hardest to detect?

```python
# Hit rate per method: fraction of fake images whose response was 'fake'
fakes = df[df['label'] == 'fake']
fakes.groupby('method')['response'].agg(lambda r: r.eq('fake').mean())
```

This produces the hit rate per method; a lower rate means the method is more convincing. Compare methods pairwise with proportion tests or a logistic mixed-effects model (see the sketch below).
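
A minimal sketch of a pairwise comparison using `statsmodels`' two-proportion z-test; `compare_methods` is a hypothetical helper, `'method_a'`/`'method_b'` are placeholder method names, and `correct` is assumed to be a 0/1 or boolean column:

```python
from statsmodels.stats.proportion import proportions_ztest

fakes = df[df['label'] == 'fake']

def compare_methods(m1, m2):
    # Two-proportion z-test on hit counts for two methods
    g1 = fakes.loc[fakes['method'] == m1, 'correct']
    g2 = fakes.loc[fakes['method'] == m2, 'correct']
    return proportions_ztest(count=[g1.sum(), g2.sum()],
                             nobs=[len(g1), len(g2)])

z, p = compare_methods('method_a', 'method_b')  # hypothetical method names
```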

---

## 2. Signal Detection Theory (d-prime)

**Question:** Separating sensitivity from response bias.

For each participant (or aggregated per method):

```python
from scipy.stats import norm

def dprime(hits, misses, fas, crs):
    hr = hits / (hits + misses)
    far = fas / (fas + crs)
    # Clip to avoid ±inf
    hr  = max(0.01, min(0.99, hr))
    far = max(0.01, min(0.99, far))
    return norm.ppf(hr) - norm.ppf(far)

# hits  = label=='fake' AND response=='fake'
# misses = label=='fake' AND response=='real'
# false alarms = label=='real' AND response=='fake'
# correct rejections = label=='real' AND response=='real'
```

d' = 0 → chance; d' = 1 → moderate sensitivity; d' ≥ 2 → strong.  
Criterion c = −0.5 × (z_HR + z_FAR) tells you response bias (positive = conservative, negative = liberal).

**Important:** Given the 20/80 real/fake base rate, each participant needs at least 10 ground-truth (real) trials and 40 fake trials for stable estimates; the sampling scheme already guarantees both.
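
Applied per participant, a sketch (assuming `label` and `response` take the values shown in the comments above):

```python
def session_dprime(g):
    # Tally the four SDT cells for one participant's trials
    hits   = ((g['label'] == 'fake') & (g['response'] == 'fake')).sum()
    misses = ((g['label'] == 'fake') & (g['response'] == 'real')).sum()
    fas    = ((g['label'] == 'real') & (g['response'] == 'fake')).sum()
    crs    = ((g['label'] == 'real') & (g['response'] == 'real')).sum()
    return dprime(hits, misses, fas, crs)

per_participant_dprime = df.groupby('session_id').apply(session_dprime)
```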

---

## 3. Response Time Analysis

**Question:** Do faster responses indicate more automatic (vs deliberative) detection?

```python
import seaborn as sns
sns.boxplot(data=df, x='method', y='response_time_ms', hue='correct')
```

Hypotheses:
- Correct rejections of hard-to-detect fakes may be slower (more deliberation)
- Easy artifacts (some methods) may produce fast, confident PRESENT responses
- Compare median RT for correct vs incorrect trials per method with a Mann-Whitney U test (see the sketch below)
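
A per-method sketch of that last comparison, assuming `correct` is a 0/1 or boolean column:

```python
from scipy.stats import mannwhitneyu

for method, g in df.groupby('method'):
    ok = g['correct'].astype(bool)
    rt_right = g.loc[ok, 'response_time_ms']
    rt_wrong = g.loc[~ok, 'response_time_ms']
    if len(rt_right) and len(rt_wrong):
        stat, p = mannwhitneyu(rt_right, rt_wrong)
        print(f'{method}: U={stat:.0f}, p={p:.4f}')
```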

---

## 4. Expertise Effect

**Question:** Do experts outperform novices?

```python
df.groupby(['expertise', 'method'])['correct'].mean().unstack()
```

Run a 2-way ANOVA or mixed-effects logistic regression:
```
correct ~ expertise * method + (1|session_id)
```
Use `pymer4` (R-style `lme4` formulas, with support for logistic models) or `statsmodels` MixedLM (linear approximation only, since MixedLM has no logistic link).
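
A minimal sketch with `pymer4`, assuming it and an R installation with `lme4` are available:

```python
from pymer4.models import Lmer

# Logistic mixed-effects model with a random intercept per participant
model = Lmer('correct ~ expertise * method + (1|session_id)',
             data=df, family='binomial')
print(model.fit())
```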

---

## 5. Colorblindness Effect

**Question:** Does colorblindness affect artifact detection?

```python
df.groupby(['cb_redgreen', 'method'])['correct'].mean()
```

Red-green colorblindness may reduce sensitivity to chromatic artifacts from some methods. Compare d' between the `normal` and `deficient` groups with an independent-samples t-test, or a Bayesian comparison if n is small (see the sketch below).
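
A sketch reusing `session_dprime` from section 2, assuming `cb_redgreen` takes the values `normal` and `deficient`:

```python
from scipy.stats import ttest_ind

# One d' per participant, then split by colorblindness status
cb = df.groupby('session_id')['cb_redgreen'].first()
dp = df.groupby('session_id').apply(session_dprime)
t, p = ttest_ind(dp[cb == 'normal'], dp[cb == 'deficient'])
```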

---

## 6. Variant Comparison (ortho vs standard)

**Question:** Are orthogonal colorizations harder to detect?

```python
df[df['label']=='fake'].groupby(['method','variant'])['correct'].mean()
```

Within each method, compare detection rates for `ortho` vs `standard`; pairing within method controls for method-level difficulty. A paired test across methods is sketched below.
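
A minimal paired sketch, assuming `variant` takes exactly the values `ortho` and `standard`:

```python
from scipy.stats import wilcoxon

# One (ortho, standard) detection-rate pair per method
rates = (df[df['label'] == 'fake']
         .groupby(['method', 'variant'])['correct'].mean()
         .unstack())
stat, p = wilcoxon(rates['ortho'], rates['standard'])
```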

---

## 7. Dataset Difficulty

**Question:** Do COCO / ImageNet / Instance images differ in detection difficulty?

```python
df[df['label']=='fake'].groupby(['dataset','method'])['correct'].mean().unstack()
```
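
To test whether detection rate differs across datasets, a chi-square sketch (assuming `pandas` is imported as in the pipeline below):

```python
from scipy.stats import chi2_contingency

# dataset x (missed, detected) counts on fake trials
fakes = df[df['label'] == 'fake']
table = pd.crosstab(fakes['dataset'], fakes['correct'])
chi2, p, dof, expected = chi2_contingency(table)
```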

---

## 8. Learning / Fatigue Effects

**Question:** Does performance change over the course of the session?

```python
# Five bins of 10 trials each (assumes 50 trials per session)
df['trial_bin'] = pd.cut(df['trial_index'], bins=5,
                         labels=['1-10', '11-20', '21-30', '31-40', '41-50'])
df.groupby('trial_bin')['correct'].mean()
```

Plot accuracy across bins: rising accuracy suggests learning (calibration to the task), falling accuracy suggests fatigue. A binning-free trend test is sketched below.
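
A continuous alternative to binning, sketched with `statsmodels` (assumes `correct` is 0/1):

```python
import statsmodels.formula.api as smf

# Logistic regression of correctness on trial position (linear trend)
trend = smf.logit('correct ~ trial_index', data=df).fit()
print(trend.summary())
```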

---

## 9. Per-Image Difficulty

**Question:** Which specific images are consistently detected / missed?

```python
img_stats = df.groupby('image_id').agg(
    n=('correct','count'),
    accuracy=('correct','mean'),
    method=('method','first'),
    dataset=('dataset','first'),
).sort_values('accuracy')
```

Inspect the easiest (always detected) and hardest (never detected) images; these are useful for understanding method-specific failure modes.

---

## 10. Inter-Rater Agreement

**Question:** Do participants agree with each other on specific images?

For images seen by multiple participants, compute Fleiss' kappa or simply the pairwise agreement rate. High agreement on specific images indicates a consistent perceptual signal rather than noise (see the sketch below).
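
A sketch with `statsmodels`; note that Fleiss' kappa assumes the same number of raters per item, so this keeps only images with the modal rating count:

```python
from statsmodels.stats.inter_rater import fleiss_kappa

# items x categories table of response counts per image
counts = (df.pivot_table(index='image_id', columns='response',
                         values='session_id', aggfunc='count')
            .fillna(0))
k = counts.sum(axis=1).mode()[0]           # modal number of raters
table = counts[counts.sum(axis=1) == k]    # Fleiss assumes equal raters per item
print(fleiss_kappa(table.to_numpy()))
```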

---

## Suggested Analysis Pipeline (Python)

```python
import pandas as pd
from scipy.stats import norm, mannwhitneyu, chi2_contingency

df = pd.read_csv('colorization_results.csv')

# Filter to completed sessions only
# (a session is complete once it reaches the final trial, index 49 of 50)
last_trial = df.groupby('session_id')['trial_index'].max()
completed = df[df['session_id'].isin(last_trial[last_trial >= 49].index)]

# Base rates
print(completed['label'].value_counts(normalize=True))

# Per-method detection rate (fake images only)
fakes = completed[completed['label'] == 'fake']
print(fakes.groupby('method')['correct'].mean().sort_values())

# Overall accuracy
print("Overall accuracy:", completed['correct'].mean())
```

---

## Recommended Libraries

| Task | Library |
|---|---|
| Data wrangling | `pandas` |
| Statistical tests | `scipy.stats`, `pingouin` |
| Mixed-effects models | `statsmodels`, `pymer4` |
| Signal detection | `sdtpy` or manual (formula above) |
| Visualization | `matplotlib`, `seaborn` |
| Reporting | `jupyter` notebooks |