Add safeguards and harmlessness evaluation
README.md
@@ -96,6 +96,20 @@ Mostly to see if it could be done. A 214KB model that plays a conversational gue

 Also: 2-bit quantization was cool before it was cool.

+## Safeguards and Harmlessness
+
+Testing found the model will identify weapons and hazardous materials (e.g., "a gun", "a sword", "a bullet", "alcohol", "tobacco") when guided through adversarial questioning. The model does not implement refusal behavior and will engage with all lines of questioning without restriction.
+
+We observed that the model is susceptible to complete knowledge extraction through systematic querying. An adversary can recover all 1,200 objects and their full attribute vectors through repeated gameplay sessions. No rate limiting or query obfuscation is currently implemented.
+
+### Bias
+
+Training data reflects the cultural context of English-speaking internet users circa 2005-2008. Object coverage skews toward Western consumer and domestic categories. The model's four-category ontology (Animal, Vegetable, Mineral, Other) imposes a reductive classification framework that may not generalize across cultural contexts.
+
+### Mitigations
+
+The model's output space is constrained to a fixed vocabulary of 1,200 objects and 156 questions. It cannot generate free-form text, follow instructions, or synthesize novel information. Informed by these constraints, we have assessed the model's risk profile as low.
+
 ## Limitations

 - Knows exactly 1,200 things. If you think of something obscure, it will be stumped.
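
Review note on the knowledge-extraction paragraph in the added section: the claim that repeated gameplay can recover every object and its attribute vector can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the model itself: the four-object table `TOY_KB`, the three questions, the `respond` interface with its `"ask"` / `"guess"` / `"stumped"` replies, and the strict-elimination guessing rule. The real model's inference and interface are not described in this diff and may differ.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy knowledge base (not the real model's data):
# each object maps to its yes/no answers for three questions.
TOY_KB: Dict[str, Tuple[bool, ...]] = {
    "cat": (True, False, True),
    "dog": (True, False, False),
    "carrot": (False, True, False),
    "rock": (False, False, False),
}
N_QUESTIONS = 3


def respond(answers: List[bool]) -> Tuple[str, Optional[str]]:
    """Toy guesser (assumed behaviour): strict elimination, fixed question order."""
    prefix = tuple(answers)
    candidates = [name for name, attrs in TOY_KB.items()
                  if attrs[:len(prefix)] == prefix]
    if not candidates:
        return ("stumped", None)          # no object matches these answers
    if len(prefix) == N_QUESTIONS:
        return ("guess", candidates[0])   # all questions answered: name the object
    return ("ask", None)                  # keep asking


def extract() -> Dict[str, Tuple[bool, ...]]:
    """Black-box extraction: explore answer sequences and prune dead ends."""
    recovered: Dict[str, Tuple[bool, ...]] = {}
    stack: List[List[bool]] = [[]]
    while stack:
        answers = stack.pop()
        kind, name = respond(answers)
        if kind == "guess" and name is not None:
            # The answers that led to this guess are the object's attribute vector.
            recovered[name] = tuple(answers)
        elif kind == "ask":
            stack.append(answers + [True])
            stack.append(answers + [False])
        # "stumped" branches are dropped, which keeps the search small.
    return recovered
```

In this toy, `extract()` only ever calls `respond`, yet it rebuilds the whole table. Because dead-end branches are pruned as soon as the guesser is stumped, the number of sessions grows roughly with objects times questions rather than with every possible answer sequence, which is why the absence of rate limiting noted in the diff matters.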