Field | Response
:-----|:--------
Intended Task/Domain: | Multi-modal understanding and generation: Video+Audio understanding, Document Intelligence (OCR, chart reasoning, long-document comprehension), GUI/Agentic tasks (computer-use agents), Speech Transcription (ASR), and Visual Question Answering.
Model Type: | Transformer
Intended Users: | Enterprise developers and AI practitioners building multi-modal AI applications, including conversational AI, video and audio analysis, document processing, and agentic workflows.
Output: | Text (responds to posed questions across text, image, video, and audio inputs; supports reasoning with chain-of-thought, tool calling, JSON output, and word-level timestamps for transcription).
Tools used to evaluate datasets to identify synthetic data and ensure data authenticity: | Synthetic data generation pipelines use approved seed data and approved generation models. The pipeline is tracked by domain, generating model, data type, token length, and quality level. Distribution analytics are produced for both pre-training and post-training datasets.
Describe how the model works: | The model combines a Nemotron Nano V3 LLM backbone (31B-parameter MoE, ~3B active) with two modality encoders: a CRADIO v4-H vision encoder and a Parakeet speech encoder. Image and video inputs are encoded into visual tokens via the Eagle vision architecture, audio inputs are encoded via the Parakeet encoder, and all modality tokens are concatenated with text tokens before being processed by the transformer-based language model to produce text outputs (see the sketch following this table).
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Age, Disability Status, Gender Identity, Nationality, Physical Appearance, Ethnicity, Socioeconomic Status, Sexual Orientation, Religion
Technical Limitations & Mitigation: | Speech input is supported in English only; audio other than speech is out of scope. Speaker diarization, translation, temporal understanding (timestamps correlating audio and video), multi-lingual capability, and streaming audio/video are not supported in V3. Video input is limited to under 2 minutes sampled at 1 FPS (maximum 128 frames; see the frame-budget example following this table). Performance may vary across GPU configurations and precision formats.
Verified to have met prescribed NVIDIA quality standards: | Yes
Performance Metrics: | Accuracy (MMMU, MathVista, Video-MME, OCRBench V2, CharXiv, MMLongBench-Doc, ScreenSpot Pro, WorldSense, HF OpenASR WER, LibriSpeech, GPQA Diamond, SWE-bench Verified, IF Bench, Tau2 Bench), Latency (time to first token, TTFT), Throughput (tokens per second)
Potential Known Risks: | The model may produce inaccurate transcriptions in noisy environments or with non-English audio. OCR accuracy may be degraded by low-resolution inputs. The model may generate incorrect or hallucinated content when processing ambiguous multi-modal inputs, and may exhibit self-anthropomorphism (e.g., displaying human-like characteristics in dialogue, such as expressing preferences and emotions). The model was trained on data containing toxic language and societal biases originally crawled from the internet; it may therefore generate or amplify harmful, biased, or otherwise unsafe content and return toxic responses, especially when prompted with toxic prompts. The model may also produce answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may generate socially unacceptable or undesirable output even when the prompt contains nothing explicitly offensive.
Licensing: | Governing Terms: Use of this model is governed by the [NVIDIA Open Model Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-agreement/).
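The "Describe how the model works" row above can be pictured as a token-concatenation pipeline. The following is a minimal illustrative sketch, not the production Nemotron implementation: the module names, constructor arguments, and the `inputs_embeds` calling convention are assumptions, and the real encoders include projection and token-compression stages not shown here.

```python
import torch
import torch.nn as nn

class MultiModalLM(nn.Module):
    """Illustrative modality-token concatenation (hypothetical names)."""

    def __init__(self, vision_encoder, speech_encoder, text_embedding, llm_backbone):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a CRADIO-style vision encoder
        self.speech_encoder = speech_encoder  # e.g., a Parakeet-style speech encoder
        self.text_embedding = text_embedding  # token-id -> hidden-state lookup
        self.llm_backbone = llm_backbone      # transformer (MoE) language model

    def forward(self, text_ids, image=None, audio=None):
        parts = []
        if image is not None:
            # Image/video frames -> visual tokens in the LLM hidden space.
            parts.append(self.vision_encoder(image))
        if audio is not None:
            # Audio features -> audio tokens in the LLM hidden space.
            parts.append(self.speech_encoder(audio))
        # Text token ids -> text embeddings.
        parts.append(self.text_embedding(text_ids))
        # All modality tokens are concatenated along the sequence axis and
        # processed by the language model to produce text outputs.
        sequence = torch.cat(parts, dim=1)
        return self.llm_backbone(inputs_embeds=sequence)
```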
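The video limit in "Technical Limitations & Mitigation" follows from simple arithmetic: sampling at 1 FPS, a clip just under 2 minutes yields about 120 frames, which fits the 128-frame budget. The helper below is a hypothetical pre-processing sketch; the model card only specifies the limits themselves (under 2 minutes, 1 FPS, at most 128 frames).

```python
def video_frame_indices(duration_s: float, native_fps: float,
                        sample_fps: float = 1.0, max_frames: int = 128) -> list[int]:
    """Pick frame indices at ~1 FPS, capped at the 128-frame budget."""
    n = min(int(duration_s * sample_fps), max_frames)  # frames to keep
    step = native_fps / sample_fps                     # native frames per sample
    return [round(i * step) for i in range(n)]

# A 2-minute clip at 30 FPS yields 120 sampled frames, within the 128-frame cap.
assert len(video_frame_indices(120.0, 30.0)) == 120
```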