Field                                                                                               |  Response
:---------------------------------------------------------------------------------------------------|:---------------
Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing:  |  None
Bias Metric (If Measured):                                                   |  [BBQ Accuracy Scores in Ambiguous Contexts](https://github.com/nyu-mll/BBQ/). The model may have some quality concerns for English spoken by a few non-native English speakers.
Which characteristic (feature) show(s) the greatest difference in performance?: |  The model has been evaluated across multiple benchmarks spanning STEM reasoning, video understanding, document intelligence, GUI tasks, and audio transcription. Performance differences may appear across input modalities (e.g., audio vs. video) and across benchmark complexity levels. Specific per-characteristic bias metrics have not been separately measured for this release.
Which feature(s) have have the worst performance overall? | Audio-related tasks such as speaker diarization and multi-lingual speech are out of scope for V3. Audio event detection and speech instruction following are P1 priorities. Performance on far-field microphone audio and noisy environments (SNR below 10 dB) may be limited.
Measures taken to mitigate against unwanted bias:                                                   |  Training data includes a blend of diverse, curated datasets across modalities (text, image, video, audio). Data integrity filters are applied to chat, social sciences, and safety domains. VLM dehumanization safety benchmarks are part of the evaluation pipeline. RL recipes for security improvements are applied during post-training.
If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | Synthetic data generation pipelines use approved seed data and approved models. Data processing includes deduplication, filtering, and quality assessment (low, medium, high). Distribution analysis is performed across domains (math, code, etc.), languages, data types, and token lengths. Safety evaluations are conducted to identify and mitigate biases.
Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | AEGIS 1.0, Garak, RTVLM, VLGuard, internal competitive evaluation (VPR) tool, and NVIDIA Custom Datasets focused on child safety and dehumanization risks.