# Possible Analyses
The dataset is a binary detection task with known ground truth. Each response records whether the participant detected colorization artifacts (PRESENT) or not (ABSENT).
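All snippets below operate on the per-trial table exported as `colorization_results.csv` (see the pipeline at the end) and assume a boolean `correct` column. If the export does not already include it, a minimal sketch for deriving it, assuming responses are recorded as `fake` / `real` to match the labels (as in the d-prime section below):
```python
import pandas as pd

df = pd.read_csv('colorization_results.csv')

# A trial is correct when the response matches the ground-truth label
# (response 'fake' on a colorized image, 'real' on an untouched one).
if 'correct' not in df.columns:
    df['correct'] = df['response'] == df['label']
```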
---
## 1. Per-Method Detection Rate (primary)
**Question:** Which colorization method is hardest to detect?
```python
df.groupby('method').apply(
    lambda g: g[g['label'] == 'fake']['response'].eq('fake').mean()
)
```
Produces hit rate per method. Lower = more convincing. Compare methods pairwise with proportion tests or a logistic mixed-effects model.
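A sketch of the pairwise proportion test, using the two-proportion z-test from `statsmodels` (the helper name is illustrative; call it with any two values from `df['method'].unique()`):
```python
from statsmodels.stats.proportion import proportions_ztest

fakes = df[df['label'] == 'fake']

def compare_methods(m1, m2):
    # Two-proportion z-test on hit counts for two colorization methods
    a = fakes.loc[fakes['method'] == m1, 'correct']
    b = fakes.loc[fakes['method'] == m2, 'correct']
    return proportions_ztest([a.sum(), b.sum()], [len(a), len(b)])
```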
---
## 2. Signal Detection Theory (d-prime)
**Question:** Separating sensitivity from response bias.
For each participant (or aggregated per method):
```python
from scipy.stats import norm
def dprime(hits, misses, fas, crs):
    hr = hits / (hits + misses)
    far = fas / (fas + crs)
    # Clip to avoid ±inf
    hr = max(0.01, min(0.99, hr))
    far = max(0.01, min(0.99, far))
    return norm.ppf(hr) - norm.ppf(far)

# hits               = label=='fake' AND response=='fake'
# misses             = label=='fake' AND response=='real'
# false alarms       = label=='real' AND response=='fake'
# correct rejections = label=='real' AND response=='real'
```
d' = 0 → chance; d' = 1 → moderate sensitivity; d' ≥ 2 → strong.
Criterion c = −0.5 × (z_HR + z_FAR) tells you response bias (positive = conservative, negative = liberal).
**Important:** The 20/80 real/fake base rate means each participant needs at least 10 ground-truth (real) trials and 40 fake trials for stable rate estimates; the sampling scheme already guarantees both.
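A sketch of the per-participant computation, reusing the `dprime` helper above and treating `session_id` as the participant identifier:
```python
def sdt_counts(g):
    fake = g[g['label'] == 'fake']
    real = g[g['label'] == 'real']
    return dprime(
        hits=(fake['response'] == 'fake').sum(),
        misses=(fake['response'] == 'real').sum(),
        fas=(real['response'] == 'fake').sum(),
        crs=(real['response'] == 'real').sum(),
    )

# One d' value per participant (or group by 'method' for per-method d')
dprime_per_participant = df.groupby('session_id').apply(sdt_counts)
```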
---
## 3. Response Time Analysis
**Question:** Do faster responses indicate more automatic (vs deliberative) detection?
```python
import seaborn as sns
sns.boxplot(data=df, x='method', y='response_time_ms', hue='correct')
```
Hypotheses:
- Correct rejections of hard-to-detect fakes may be slower (more deliberation)
- Easy artifacts (some methods) may produce fast, confident PRESENT responses
- Compare median RT for correct vs incorrect per method with Mann-Whitney U
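A sketch of that last comparison, per method:
```python
from scipy.stats import mannwhitneyu

for method, g in df.groupby('method'):
    mask = g['correct'].astype(bool)
    rt_correct = g.loc[mask, 'response_time_ms']
    rt_incorrect = g.loc[~mask, 'response_time_ms']
    if len(rt_correct) and len(rt_incorrect):
        u, p = mannwhitneyu(rt_correct, rt_incorrect)
        print(f"{method}: median RT correct={rt_correct.median():.0f} ms, "
              f"incorrect={rt_incorrect.median():.0f} ms, p={p:.3f}")
```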
---
## 4. Expertise Effect
**Question:** Do experts outperform novices?
```python
df.groupby(['expertise', 'method'])['correct'].mean().unstack()
```
Run a 2-way ANOVA or mixed-effects logistic regression:
```
correct ~ expertise * method + (1|session_id)
```
Use `pymer4` (R-style `glmer` interface) for a true logistic mixed model, or approximate with `statsmodels`' `MixedLM` (a linear-probability model with random intercepts).
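A minimal `statsmodels` sketch under the linear-probability approximation (for a true logistic link, fit the same formula with `pymer4`'s `Lmer` and `family='binomial'`):
```python
import statsmodels.formula.api as smf

d = df.assign(correct=df['correct'].astype(int))
# Random intercept per session; expertise and method enter as categorical
# fixed effects plus their interaction.
m = smf.mixedlm("correct ~ expertise * method", data=d,
                groups=d["session_id"]).fit()
print(m.summary())
```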
---
## 5. Colorblindness Effect
**Question:** Does colorblindness affect artifact detection?
```python
df.groupby(['cb_redgreen', 'method'])['correct'].mean()
```
Red-green colorblindness may reduce sensitivity to chromatic artifacts from some methods. Compare d' between `normal` and `deficient` groups with independent-samples t-test or Bayesian comparison (if n is small).
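A sketch of the group comparison on per-participant d', reusing the `sdt_counts` helper sketched under §2 (the `normal` / `deficient` coding is as described above):
```python
from scipy.stats import ttest_ind

dp = df.groupby('session_id').apply(sdt_counts)        # d' per participant
cb = df.groupby('session_id')['cb_redgreen'].first()   # colour-vision group
t, p = ttest_ind(dp[cb == 'normal'], dp[cb == 'deficient'], equal_var=False)
```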
---
## 6. Variant Comparison (ortho vs standard)
**Question:** Are orthogonal colorizations harder to detect?
```python
df[df['label']=='fake'].groupby(['method','variant'])['correct'].mean()
```
Within each method, compare detection rates for `ortho` vs `standard`. Paired within method to control for method-level difficulty.
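A sketch of the paired comparison: one ortho/standard detection-rate pair per method, compared with a Wilcoxon signed-rank test across methods (with only a handful of methods this test is underpowered, so also inspect the per-method differences directly):
```python
from scipy.stats import wilcoxon

rates = (df[df['label'] == 'fake']
         .groupby(['method', 'variant'])['correct'].mean()
         .unstack('variant')
         .dropna())                        # drop methods missing a variant
stat, p = wilcoxon(rates['ortho'], rates['standard'])
print(rates['ortho'] - rates['standard'])  # per-method ortho minus standard
print(f"Wilcoxon p = {p:.3f}")
```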
---
## 7. Dataset Difficulty
**Question:** Do COCO / ImageNet / Instance images differ in detection difficulty?
```python
df[df['label']=='fake'].groupby(['dataset','method'])['correct'].mean().unstack()
```
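A quick independence check on the fake trials (does detection depend on the source dataset?), which is where the `chi2_contingency` import in the pipeline below comes in:
```python
from scipy.stats import chi2_contingency
import pandas as pd

fakes = df[df['label'] == 'fake']
# Contingency table: dataset x (correct / incorrect)
table = pd.crosstab(fakes['dataset'], fakes['correct'])
chi2, p, dof, expected = chi2_contingency(table)
```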
---
## 8. Learning / Fatigue Effects
**Question:** Does performance change over the course of the session?
```python
df['trial_bin'] = pd.cut(df['trial_index'], bins=5, labels=['1-10','11-20','21-30','31-40','41-50'])
df.groupby('trial_bin')['correct'].mean()
```
Plot accuracy across bins. Rising = learning (calibration to task). Falling = fatigue.
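Beyond the binned view, a simple monotonic-trend sketch on the raw trial index (Spearman correlation between trial position and pooled accuracy):
```python
from scipy.stats import spearmanr

# Accuracy at each trial position, pooled over participants
by_trial = df.groupby('trial_index')['correct'].mean()
rho, p = spearmanr(by_trial.index, by_trial.values)
# rho > 0 suggests learning across the session; rho < 0 suggests fatigue
```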
---
## 9. Per-Image Difficulty
**Question:** Which specific images are consistently detected / missed?
```python
img_stats = df.groupby('image_id').agg(
    n=('correct', 'count'),
    accuracy=('correct', 'mean'),
    method=('method', 'first'),
    dataset=('dataset', 'first'),
).sort_values('accuracy')
```
Top easiest (always detected) and hardest (never detected) images. Useful for understanding method-specific failure modes.
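To pull out the extremes, restrict to images rated by enough participants so the per-image accuracy is stable (the threshold of 3 ratings is an arbitrary example):
```python
rated = img_stats[img_stats['n'] >= 3]
print(rated.head(10))   # hardest: lowest detection accuracy
print(rated.tail(10))   # easiest: highest detection accuracy
```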
---
## 10. Inter-Rater Agreement
**Question:** Do participants agree with each other on specific images?
For images seen by multiple participants, compute Fleiss' kappa or just pairwise agreement rate. High agreement on specific images = consistent perceptual signal, not noise.
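A sketch using `statsmodels`' `fleiss_kappa`; note that Fleiss' kappa assumes the same number of raters per image, so the sketch restricts to images with the modal rating count:
```python
from statsmodels.stats.inter_rater import fleiss_kappa

# Response counts per image: one row per image, one column per response category
counts = df.groupby(['image_id', 'response']).size().unstack(fill_value=0)
# Keep only images with the modal number of raters (equal-rater assumption)
n_raters = counts.sum(axis=1).mode()[0]
counts = counts[counts.sum(axis=1) == n_raters]
print(fleiss_kappa(counts.values))
```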
---
## Suggested Analysis Pipeline (Python)
```python
import pandas as pd
from scipy.stats import norm, mannwhitneyu, chi2_contingency
df = pd.read_csv('colorization_results.csv')
# Filter completed sessions only
completed = df[df['session_id'].isin(
    df.groupby('session_id')['trial_index'].max()[lambda x: x >= 49].index
)]
# Base rates
print(completed['label'].value_counts(normalize=True))
# Per-method detection rate (fake images only)
fakes = completed[completed['label'] == 'fake']
print(fakes.groupby('method')['correct'].mean().sort_values())
# Overall accuracy
print("Overall accuracy:", completed['correct'].mean())
```
---
## Recommended Libraries
| Task | Library |
|---|---|
| Data wrangling | `pandas` |
| Statistical tests | `scipy.stats`, `pingouin` |
| Mixed-effects models | `statsmodels`, `pymer4` |
| Signal detection | `sdtpy` or manual (formula above) |
| Visualization | `matplotlib`, `seaborn` |
| Reporting | `jupyter` notebooks |