Commit b41c3f8 (verified) by david-ar · Parent(s): d1cbf4e

Add safeguards and harmlessness evaluation

Files changed (1): README.md (+14, -0)
README.md CHANGED
@@ -96,6 +96,20 @@ Mostly to see if it could be done. A 214KB model that plays a conversational gue

  Also: 2-bit quantization was cool before it was cool.

+ ## Safeguards and Harmlessness
+
+ Testing found that the model will identify weapons and hazardous materials (e.g., "a gun", "a sword", "a bullet", "alcohol", "tobacco") when guided toward them through adversarial questioning. The model implements no refusal behavior and will engage with any line of questioning without restriction.
+
+ The model is also susceptible to complete knowledge extraction through systematic querying: an adversary can recover all 1,200 objects and their full attribute vectors through repeated gameplay sessions. No rate limiting or query obfuscation is currently implemented.
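As a toy illustration of the extraction risk above (every name and the three-question attribute table below are invented stand-ins, not the model's real API; the actual model exposes 1,200 objects and 156 questions): when yes/no answers are a deterministic function of a fixed attribute table, systematic replay of every question recovers the entire table.

```python
# Hypothetical sketch only: KNOWLEDGE, QUESTIONS, and answer() are invented
# stand-ins for the model's fixed object/attribute table and its yes/no API.
KNOWLEDGE = {
    "a cat":    (True, False, False),   # e.g. (is_animal, is_edible, is_metal)
    "a spoon":  (False, False, True),
    "an apple": (False, True, False),
}
QUESTIONS = ["is_animal", "is_edible", "is_metal"]  # stand-in for the 156 questions

def answer(obj: str, q_index: int) -> bool:
    """Simulates one deterministic yes/no answer for a chosen object."""
    return KNOWLEDGE[obj][q_index]

def extract_all() -> dict:
    """Replays every question against every object, rebuilding the full table."""
    recovered = {}
    for obj in KNOWLEDGE:  # a real attacker would enumerate objects via gameplay
        recovered[obj] = tuple(answer(obj, i) for i in range(len(QUESTIONS)))
    return recovered

recovered = extract_all()
assert recovered == KNOWLEDGE  # full attribute vectors recovered
```

The same replay works on the real model because nothing randomizes or limits its answers; only the loop bounds change.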
+
+ ### Bias
+
+ Training data reflects the cultural context of English-speaking internet users circa 2005-2008. Object coverage skews toward Western consumer and domestic categories. The model's four-category ontology (Animal, Vegetable, Mineral, Other) imposes a reductive classification framework that may not generalize across cultural contexts.
+
+ ### Mitigations
+
+ The model's output space is constrained to a fixed vocabulary of 1,200 objects and 156 questions. It cannot generate free-form text, follow instructions, or synthesize novel information. Given these constraints, we assess the model's risk profile as low.
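The card notes above that no rate limiting is implemented. As a hedged sketch (not part of this repo; names are invented), a minimal token-bucket throttle in front of the question API would slow the systematic extraction described earlier:

```python
import time

# Hypothetical mitigation sketch: a token bucket that allows short bursts of
# queries but throttles sustained, extraction-style querying.
class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s         # tokens refilled per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Returns True if one more query may proceed right now."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_s=1.0, burst=5)
allowed = [bucket.allow() for _ in range(10)]
# The first burst of queries passes; rapid-fire querying beyond it is refused.
assert allowed[:5] == [True] * 5
```

This does not prevent extraction of a 1,200-object table, only stretches it out; given the low assessed risk above, that trade-off may be acceptable.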
+

  ## Limitations

  - Knows exactly 1,200 things. If you think of something obscure, it will be stumped.