Upload processor

Files changed (10)

.gitattributes +1 -0
README.md +199 -0
chat_template.jinja +237 -0
midi_tokenizer/tokenization_song2midi.py +200 -0
midi_tokenizer/tokenizer_config.json +51 -0
midi_tokenizer/vocab.json +0 -0
processing_song2midi.py +401 -0
processor_config.json +23 -0
tokenizer.json +3 -0
tokenizer_config.json +41 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

chat_template.jinja ADDED

	@@ -0,0 +1,237 @@

+{%- set image_count = namespace(value=0) %}
+{%- set video_count = namespace(value=0) %}
+{%- set midi_count = namespace(value=0) %}  {# added midi counter #}
+{%- macro render_content(content, do_vision_count, is_system_content=false) %}
+    {%- if content is string %}
+        {{- content }}
+    {%- elif content is iterable and content is not mapping %}
+        {%- for item in content %}
+            {%- if 'image' in item or 'image_url' in item or item.type == 'image' %}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain images.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set image_count.value = image_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'Picture ' ~ image_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|vision_start|><|image_pad|><|vision_end|>' }}
+            {%- elif 'video' in item or item.type == 'video' %}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain videos.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set video_count.value = video_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'Video ' ~ video_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|vision_start|><|video_pad|><|vision_end|>' }}
+            {%- elif 'midi' in item or 'midi_url' in item or item.type == 'midi' %}  {# midi handling #}
+                {%- if is_system_content %}
+                    {{- raise_exception('System message cannot contain MIDI content.') }}
+                {%- endif %}
+                {%- if do_vision_count %}
+                    {%- set midi_count.value = midi_count.value + 1 %}
+                {%- endif %}
+                {%- if add_vision_id %}
+                    {{- 'MIDI ' ~ midi_count.value ~ ': ' }}
+                {%- endif %}
+                {{- '<|midi_start|><|midi_pad|><|midi_end|>' }}
+            {%- elif 'text' in item %}
+                {{- item.text }}
+            {%- else %}
+                {{- raise_exception('Unexpected item type in content.') }}
+            {%- endif %}
+        {%- endfor %}
+    {%- elif content is none or content is undefined %}
+        {{- '' }}
+    {%- else %}
+        {{- raise_exception('Unexpected content type.') }}
+    {%- endif %}
+{%- endmacro %}
+{%- if not messages %}
+    {{- raise_exception('No messages provided.') }}
+{%- endif %}
+{%- if tools and tools is iterable and tools is not mapping %}
+    {{- '<|im_start|>system
+' }}
+    {{- "# Tools
+You have access to the following functions:
+<tools>" }}
+    {%- for tool in tools %}
+        {{- "
+" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "
+</tools>" }}
+    {{- '
+If you choose to call a function ONLY reply in the following format with NO suffix:
+<tool_call>
+<function=example_function_name>
+<parameter=example_parameter_1>
+value_1
+</parameter>
+<parameter=example_parameter_2>
+This is the value for the second parameter
+that can span
+multiple lines
+</parameter>
+</function>
+</tool_call>
+<IMPORTANT>
+Reminder:
+- Function calls MUST follow the specified format: an inner <function=...></function> block must be nested within <tool_call></tool_call> XML tags
+- Required parameters MUST be specified
+- You may provide optional reasoning for your function call in natural language BEFORE the function call, but NOT after
+- If there is no function call available, answer the question like normal with your current knowledge and do not tell the user about function calls
+</IMPORTANT>' }}
+    {%- if messages[0].role == 'system' %}
+        {%- set content = render_content(messages[0].content, false, true)|trim %}
+        {%- if content %}
+            {{- '
+' + content }}
+        {%- endif %}
+    {%- endif %}
+    {{- '<|im_end|>
+' }}
+{%- else %}
+    {%- if messages[0].role == 'system' %}
+        {%- set content = render_content(messages[0].content, false, true)|trim %}
+        {{- '<|im_start|>system
+' + content + '<|im_end|>
+' }}
+    {%- endif %}
+{%- endif %}
+{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
+{%- for message in messages[::-1] %}
+    {%- set index = (messages|length - 1) - loop.index0 %}
+    {%- if ns.multi_step_tool and message.role == "user" %}
+        {%- set content = render_content(message.content, false)|trim %}
+        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
+            {%- set ns.multi_step_tool = false %}
+            {%- set ns.last_query_index = index %}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if ns.multi_step_tool %}
+    {{- raise_exception('No user query found in messages.') }}
+{%- endif %}
+{%- for message in messages %}
+    {%- set content = render_content(message.content, true)|trim %}
+    {%- if message.role == "system" %}
+        {%- if not loop.first %}
+            {{- raise_exception('System message must be at the beginning.') }}
+        {%- endif %}
+    {%- elif message.role == "user" %}
+        {{- '<|im_start|>' + message.role + '
+' + content + '<|im_end|>' + '
+' }}
+    {%- elif message.role == "assistant" %}
+        {%- set reasoning_content = '' %}
+        {%- if message.reasoning_content is string %}
+            {%- set reasoning_content = message.reasoning_content %}
+        {%- else %}
+            {%- if '</think>' in content %}
+                {%- set reasoning_content = content.split('</think>')[0].rstrip('
+').split('<think>')[-1].lstrip('
+') %}
+                {%- set content = content.split('</think>')[-1].lstrip('
+') %}
+            {%- endif %}
+        {%- endif %}
+        {%- set reasoning_content = reasoning_content|trim %}
+        {%- if loop.index0 > ns.last_query_index %}
+            {{- '<|im_start|>' + message.role + '
+<think>
+' + reasoning_content + '
+</think>
+' + content }}
+        {%- else %}
+            {{- '<|im_start|>' + message.role + '
+' + content }}
+        {%- endif %}
+        {%- if message.tool_calls and message.tool_calls is iterable and message.tool_calls is not mapping %}
+            {%- for tool_call in message.tool_calls %}
+                {%- if tool_call.function is defined %}
+                    {%- set tool_call = tool_call.function %}
+                {%- endif %}
+                {%- if loop.first %}
+                    {%- if content|trim %}
+                        {{- '
+<tool_call>
+<function=' + tool_call.name + '>
+' }}
+                    {%- else %}
+                        {{- '<tool_call>
+<function=' + tool_call.name + '>
+' }}
+                    {%- endif %}
+                {%- else %}
+                    {{- '
+<tool_call>
+<function=' + tool_call.name + '>
+' }}
+                {%- endif %}
+                {%- if tool_call.arguments is defined %}
+                    {%- for args_name, args_value in tool_call.arguments|items %}
+                        {{- '<parameter=' + args_name + '>
+' }}
+                        {%- set args_value = args_value | tojson | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
+                        {{- args_value }}
+                        {{- '
+</parameter>
+' }}
+                    {%- endfor %}
+                {%- endif %}
+                {{- '</function>
+</tool_call>' }}
+            {%- endfor %}
+        {%- endif %}
+        {{- '<|im_end|>
+' }}
+    {%- elif message.role == "tool" %}
+        {%- if loop.previtem and loop.previtem.role != "tool" %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '
+<tool_response>
+' }}
+        {{- content }}
+        {{- '
+</tool_response>' }}
+        {%- if not loop.last and loop.nextitem.role != "tool" %}
+            {{- '<|im_end|>
+' }}
+        {%- elif loop.last %}
+            {{- '<|im_end|>
+' }}
+        {%- endif %}
+    {%- else %}
+        {{- raise_exception('Unexpected message role.') }}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant
+' }}
+    {%- if enable_thinking is defined and enable_thinking is true %}
+        {{- '<think>
+' }}
+    {%- else %}
+        {{- '<think>
+</think>
+' }}
+    {%- endif %}
+{%- endif %}
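
The reverse scan near the end of the template (the loop over `messages[::-1]`) decides which assistant turns keep their `<think>` block: only those after the last real user query. The sketch below mirrors that scan in plain Python, assuming every message is a dict whose `content` is already a plain string. The function name `find_last_query_index` is hypothetical and is not part of this commit.

```python
def find_last_query_index(messages):
    """Return the index of the last user message that is a real query.

    A user message that is entirely wrapped in <tool_response> tags is
    treated as a tool result, not a query, and is skipped.
    Assumption for this sketch: `content` is a plain string.
    """
    for index in range(len(messages) - 1, -1, -1):
        message = messages[index]
        if message["role"] != "user":
            continue
        content = message["content"].strip()
        is_tool_response = content.startswith("<tool_response>") and content.endswith(
            "</tool_response>"
        )
        if not is_tool_response:
            return index
    # The template raises the same error when no real query exists.
    raise ValueError("No user query found in messages.")


messages = [
    {"role": "user", "content": "Transcribe this song."},
    {"role": "assistant", "content": "Calling the tool."},
    {"role": "user", "content": "<tool_response>\nok\n</tool_response>"},
]
last_query_index = find_last_query_index(messages)
print(last_query_index)  # 0
```

With this index in hand, the template wraps reasoning in `<think>` tags only for assistant messages whose position is greater than `last_query_index`; earlier assistant turns are rendered with their content alone.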

midi_tokenizer/tokenization_song2midi.py ADDED

	@@ -0,0 +1,200 @@

+import os
+from pathlib import Path
+from typing import Union
+from transformers import BatchEncoding, PythonBackend
+from transformers.tokenization_utils_base import TruncationStrategy
+from transformers.utils.generic import PaddingStrategy, TensorType
+try:
+    from miditok import PerTok, TokSequence
+    from symusic import Score
+except ImportError:
+    raise ImportError(
+        "The `miditok` library is required for processing MIDI files. "
+        "Please install it with `pip install miditok`."
+    )
+class Song2MIDIPerTokTokenizer(PythonBackend):
+    vocab_files_names = {"vocab_file": "vocab.json"}
+    def __init__(
+        self,
+        vocab_file: str | os.PathLike | Path,
+        unk_token: str = "UNK_None",
+        bos_token: str = "BOS_None",
+        eos_token: str = "EOS_None",
+        pad_token: str = "PAD_None",
+        **kwargs,
+    ):
+        self._tokenizer = PerTok(params=vocab_file)
+        self._decoder = {value: key for key, value in self._tokenizer.vocab.items()}
+        super().__init__(
+            unk_token=unk_token,
+            bos_token=bos_token,
+            eos_token=eos_token,
+            pad_token=pad_token,
+            **kwargs,
+        )
+    @property
+    def vocab_size(self):
+        return len(self._tokenizer)
+    def get_vocab(self):
+        return self._tokenizer.vocab
+    def _encode_plus(
+        self,
+        text: Union["Score", Path, bytes, list[Union["Score", Path, bytes]], list[int]],
+        text_pair: Union["Score", Path, list[Union["Score", Path]], list[int], None] = None,
+        add_special_tokens: bool = True,
+        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
+        truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
+        max_length: int | None = None,
+        stride: int = 0,
+        pad_to_multiple_of: int | None = None,
+        padding_side: str | None = None,
+        return_tensors: str | TensorType | None = None,
+        return_token_type_ids: bool | None = None,
+        return_attention_mask: bool | None = None,
+        return_overflowing_tokens: bool = False,
+        return_special_tokens_mask: bool = False,
+        return_length: bool = False,
+        verbose: bool = True,
+        **kwargs,
+    ): # ty: ignore[invalid-method-override]
+        midi = text
+        midi_pair = text_pair
+        # From https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_python.py (v5.3.0)
+        is_batched = isinstance(midi, (list, tuple)) and (
+            (not midi) or (midi and isinstance(midi[0], (str, Path, Score, bytes)))
+        )
+        if is_batched:
+            if midi_pair is not None:
+                if not isinstance(midi_pair, (list, tuple)) or len(midi_pair) != len(
+                    midi
+                ):
+                    raise ValueError(
+                        "If `midi` is a batch, `midi_pair` must be a batch of the same length."
+                    )
+            pairs = midi_pair if midi_pair is not None else [None] * len(midi)
+            batch_outputs = {}
+            for current_midi, current_pair in zip(midi, pairs):
+                current_output = self._encode_plus(
+                    text=current_midi,
+                    text_pair=current_pair,
+                    add_special_tokens=add_special_tokens,
+                    padding_strategy=PaddingStrategy.DO_NOT_PAD,  # we pad in batch afterward
+                    truncation_strategy=truncation_strategy,
+                    max_length=max_length,
+                    stride=stride,
+                    pad_to_multiple_of=None,  # we pad in batch afterward
+                    padding_side=None,  # we pad in batch afterward
+                    return_tensors=None,  # we convert the whole batch to tensors at the end
+                    return_token_type_ids=return_token_type_ids,
+                    return_attention_mask=False,  # we pad in batch afterward
+                    return_overflowing_tokens=return_overflowing_tokens,
+                    return_special_tokens_mask=return_special_tokens_mask,
+                    return_length=return_length,
+                    verbose=verbose,
+                    **kwargs,
+                )
+                for key, value in current_output.items():
+                    batch_outputs.setdefault(key, []).append(value)
+            # Remove overflow-related keys before tensor conversion if return_tensors is set
+            # Slow tokenizers don't support returning these as tensors
+            if return_tensors and return_overflowing_tokens:
+                batch_outputs.pop("overflowing_tokens", None)
+                batch_outputs.pop("num_truncated_tokens", None)
+            batch_outputs = self.pad(
+                batch_outputs,
+                padding=padding_strategy.value,
+                max_length=max_length,
+                pad_to_multiple_of=pad_to_multiple_of,
+                padding_side=padding_side,
+                return_attention_mask=return_attention_mask,
+            )
+            return BatchEncoding(batch_outputs, tensor_type=return_tensors)
+        # Single sequence handling
+        def get_input_ids(midi_input):
+            if isinstance(midi_input, (str, Path, Score, bytes)):
+                if isinstance(midi_input, bytes):
+                    midi_input = Score.from_midi(midi_input)
+                return self._tokenizer.encode(midi_input)[0].ids
+            if isinstance(midi_input, (list, tuple)) and midi_input:
+                if isinstance(midi_input[0], int):
+                    return midi_input
+            raise ValueError(
+                "Input must be a Score, a path to a MIDI file, raw MIDI bytes, or a non-empty list of token IDs."
+            )
+        first_ids = get_input_ids(midi)
+        second_ids = get_input_ids(midi_pair) if midi_pair is not None else None
+        return self.prepare_for_model(
+            first_ids,
+            pair_ids=second_ids,
+            add_special_tokens=add_special_tokens,
+            padding=padding_strategy.value,
+            truncation=truncation_strategy.value,
+            max_length=max_length,
+            stride=stride,
+            pad_to_multiple_of=pad_to_multiple_of,
+            padding_side=padding_side,
+            prepend_batch_axis=True,
+            return_attention_mask=return_attention_mask,
+            return_token_type_ids=return_token_type_ids,
+            return_overflowing_tokens=return_overflowing_tokens,
+            return_special_tokens_mask=return_special_tokens_mask,
+            return_length=return_length,
+            verbose=verbose,
+        )
+    def _decode(
+        self,
+        token_ids: int | list[int],
+        skip_special_tokens: bool = False,
+        clean_up_tokenization_spaces: bool | None = None,
+        **kwargs,
+    ) -> str:
+        if isinstance(token_ids, int):
+            token_ids = [token_ids]
+        tok_sequence = TokSequence(ids=token_ids, are_ids_encoded=True)
+        self._tokenizer.decode_token_ids(tok_sequence)
+        tokens = [self._decoder[token_id] for token_id in tok_sequence.ids]
+        if skip_special_tokens:
+            tokens = [
+                token for token in tokens if token not in self._tokenizer.special_tokens
+            ]
+        return " ".join(tokens)
+    def save_vocabulary(
+        self, save_directory: str, filename_prefix: str | None = None
+    ) -> tuple[str, ...]:
+        """Save the MidiTok tokenizer params to disk."""
+        if not os.path.isdir(save_directory):
+            return ()
+        prefix = f"{filename_prefix}-" if filename_prefix else ""
+        vocab_file = os.path.join(save_directory, prefix + "vocab.json")
+        # Use MidiTok's own serialization
+        self._tokenizer.save(vocab_file)
+        return (vocab_file,)
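
The batched branch of `_encode_plus` encodes each item without padding, gathers the per-item results into one dict of lists, and only then pads the whole batch. The following is a minimal sketch of that collect-then-pad pattern in plain Python. It assumes right padding and a pad id of 0 (the id listed for `PAD_None` in `tokenizer_config.json`); the real work is done by `self.pad`, and the helper names `collect_batch` and `pad_batch` are hypothetical.

```python
def collect_batch(per_item_outputs):
    """Gather per-item dicts into one dict of lists, as the batched branch does."""
    batch = {}
    for output in per_item_outputs:
        for key, value in output.items():
            batch.setdefault(key, []).append(value)
    return batch


def pad_batch(batch, pad_id=0):
    """Right-pad `input_ids` to the longest sequence and build an attention mask.

    Assumptions for this sketch: right padding and pad id 0.
    """
    longest = max(len(ids) for ids in batch["input_ids"])
    padded, mask = [], []
    for ids in batch["input_ids"]:
        gap = longest - len(ids)
        padded.append(ids + [pad_id] * gap)
        mask.append([1] * len(ids) + [0] * gap)
    return {"input_ids": padded, "attention_mask": mask}


batch = collect_batch([{"input_ids": [1, 7, 8, 2]}, {"input_ids": [1, 9, 2]}])
padded = pad_batch(batch)
print(padded["input_ids"])       # [[1, 7, 8, 2], [1, 9, 2, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Deferring padding until all items are encoded means the target length is known once, so no item is padded twice or to the wrong width.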

midi_tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,51 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "PAD_None",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "BOS_None",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "EOS_None",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "UNK_None",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "auto_map": {
+    "AutoProcessor": "processing_song2midi.Song2MIDIProcessor",
+    "AutoTokenizer": [
+      "tokenization_song2midi.Song2MIDIPerTokTokenizer",
+      null
+    ]
+  },
+  "backend": "custom",
+  "bos_token": "BOS_None",
+  "eos_token": "EOS_None",
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "PAD_None",
+  "processor_class": "Song2MIDIProcessor",
+  "tokenizer_class": "Song2MIDIPerTokTokenizer",
+  "unk_token": "UNK_None"
+}

midi_tokenizer/vocab.json ADDED

The diff for this file is too large to render. See raw diff
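
For orientation before the processor diff that follows: `Song2MIDIProcessor.__call__` splices MIDI token ids into the text ids wherever the MIDI placeholder token appears, shifting each MIDI id by `len(tokenizer)` so the two vocabularies do not collide. Below is a simplified single-sequence sketch of that merge. The function name `merge_text_midi` and all numeric ids are illustrative assumptions, not values from the repository.

```python
def merge_text_midi(text_ids, midi_sequences, midi_pad_id, offset):
    """Replace each MIDI placeholder id with the next MIDI sequence.

    Every MIDI id is shifted by `offset` (the text vocabulary size in the
    processor) so it lands in a range above the text token ids.
    """
    merged = []
    midi_idx = 0
    for token_id in text_ids:
        if token_id == midi_pad_id and midi_idx < len(midi_sequences):
            merged.extend(tok + offset for tok in midi_sequences[midi_idx])
            midi_idx += 1
        else:
            # Ordinary text token, or a placeholder with no MIDI left to insert.
            merged.append(token_id)
    return merged


# Hypothetical numbers: a text vocabulary of 100 ids, placeholder id 99.
text_ids = [5, 99, 7]
midi_sequences = [[1, 20, 21, 2]]
result = merge_text_midi(text_ids, midi_sequences, midi_pad_id=99, offset=100)
print(result)  # [5, 101, 120, 121, 102, 7]
```

Because the merged sequence is usually longer than the original text ids, padding has to happen after the merge, which is why the processor withholds `text_kwargs` from the tokenizer call and applies them in `self.tokenizer.pad` at the end.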

processing_song2midi.py ADDED

	@@ -0,0 +1,401 @@

+import bisect
+from typing import Unpack
+from transformers import BatchFeature
+from transformers.audio_utils import load_audio
+from transformers.processing_utils import AllKwargsForChatTemplate, ProcessorMixin
+from transformers.utils.chat_template_utils import render_jinja_template
+class Song2MIDIProcessor(ProcessorMixin):
+    def __init__(
+        self,
+        tokenizer,
+        midi_tokenizer,
+        feature_extractor,
+        midi_pad="<|midi_pad|>",
+        **kwargs,
+    ):
+        self.midi_offset_by = len(tokenizer)
+        self.midi_pad_token = midi_pad
+        super().__init__(tokenizer, midi_tokenizer, feature_extractor, **kwargs)
+    def __call__(
+        self, images=None, text=None, videos=None, audio=None, midi=None, **kwargs
+    ):
+        # From https://github.com/huggingface/transformers/blob/e5a861d381bf65a146ce487c3d3c0fca919ef316/src/transformers/processing_utils.py#L606
+        if "audios" in kwargs and audio is None:
+            raise ValueError(
+                "You passed keyword argument `audios` which is deprecated. Please use `audio` instead."
+            )
+        if images is None and text is None and videos is None and audio is None and midi is None:
+            raise ValueError(
+                f"You need to provide at least one input to call {self.__class__.__name__}"
+            )
+        kwargs = self._merge_kwargs(
+            self.valid_processor_kwargs,
+            tokenizer_init_kwargs=self.tokenizer.init_kwargs
+            if hasattr(self, "tokenizer")
+            else {},
+            **kwargs,
+        )
+        kwargs["midi_kwargs"] = {}
+        # We will do the padding later
+        text_kwargs = kwargs.get("text_kwargs", {})
+        kwargs["text_kwargs"] = {}
+        attribute_to_kwargs = {
+            "tokenizer": (text, "text_kwargs"),
+            "image_processor": (images, "images_kwargs"),
+            "video_processor": (videos, "videos_kwargs"),
+            "feature_extractor": (audio, "audio_kwargs"),
+            "midi_tokenizer": (midi, "midi_kwargs"),
+        }
+        outputs = {}
+        for attribute_name in self.get_attributes():
+            attribute = getattr(self, attribute_name, None)
+            input_data, input_kwargs = attribute_to_kwargs[attribute_name]
+            if input_data is not None and attribute is not None:
+                attribute_output = attribute(input_data, **kwargs[input_kwargs])
+                outputs[attribute_name] = attribute_output
+        midi_token_id = self.tokenizer.convert_tokens_to_ids(self.midi_pad_token)
+        def _merge_text_midi(text_input_ids, midi_input_ids):
+            is_batched = True
+            if text_input_ids and isinstance(text_input_ids[0], int):
+                is_batched = False
+                text_input_ids = [text_input_ids]
+                midi_input_ids = [midi_input_ids]
+            new_input_ids = []
+            midi_idx = 0
+            for batch_idx in range(len(text_input_ids)):
+                new_ids = []
+                for token_id in text_input_ids[batch_idx]:
+                    if token_id == midi_token_id and midi_idx < len(midi_input_ids):
+                        new_ids.extend(
+                            [
+                                tok + self.midi_offset_by
+                                for tok in midi_input_ids[midi_idx]
+                            ]
+                        )
+                        midi_idx += 1
+                    else:
+                        new_ids.append(token_id)
+                new_input_ids.append(new_ids)
+            return new_input_ids if is_batched else new_input_ids[0]
+        new_outputs = {}
+        if midi:
+            new_text_input_ids = {
+                "input_ids": _merge_text_midi(
+                    outputs["tokenizer"]["input_ids"],
+                    outputs["midi_tokenizer"]["input_ids"],
+                )
+            }
+        else:
+            new_text_input_ids = {"input_ids": outputs["tokenizer"]["input_ids"]}
+        # Pad
+        new_outputs.update(self.tokenizer.pad(new_text_input_ids, **text_kwargs))
+        for key, value in outputs.items():
+            if key not in ["tokenizer", "midi_tokenizer"]:
+                new_outputs.update(value)
+        return BatchFeature(new_outputs)
+    def apply_chat_template(
+        self,
+        conversation: list[dict[str, str]] | list[list[dict[str, str]]],
+        chat_template: str | None = None,
+        **kwargs: Unpack[AllKwargsForChatTemplate],
+    ) -> str:
+        # From https://github.com/huggingface/transformers/blob/e5a861d381bf65a146ce487c3d3c0fca919ef316/src/transformers/processing_utils.py#L1631
+        if chat_template is None:
+            if isinstance(self.chat_template, dict) and "default" in self.chat_template:
+                chat_template = self.chat_template["default"]
+            elif isinstance(self.chat_template, dict):
+                raise ValueError(
+                    'The processor has multiple chat templates but none of them are named "default". You need to specify'
+                    " which one to use by passing the `chat_template` argument. Available templates are: "
+                    f"{', '.join(self.chat_template.keys())}"
+                )
+            elif self.chat_template is not None:
+                chat_template = self.chat_template
+            else:
+                raise ValueError(
+                    "Cannot use apply_chat_template because this processor does not have a chat template."
+                )
+        else:
+            if (
+                isinstance(self.chat_template, dict)
+                and chat_template in self.chat_template
+            ):
+                # It's the name of a template, not a full template string
+                chat_template = self.chat_template[chat_template]
+            else:
+                # It's a template string, render it directly
+                pass
+        # Check if tokenizer is fast - use backend attribute if available, otherwise fall back to class name
+        is_tokenizers_fast = False
+        if hasattr(self, "tokenizer"):
+            if hasattr(self.tokenizer, "backend"):
+                is_tokenizers_fast = self.tokenizer.backend == "tokenizers"
+            else:
+                # Fallback to class name check
+                is_tokenizers_fast = self.tokenizer.__class__.__name__.endswith("Fast")
+        if kwargs.get("continue_final_message", False):
+            if kwargs.get("add_generation_prompt", False):
+                raise ValueError(
+                    "continue_final_message and add_generation_prompt are not compatible. Use continue_final_message when you want the model to continue the final message, and add_generation_prompt when you want to add a header that will prompt it to start a new assistant message instead."
+                )
+            if kwargs.get("return_assistant_tokens_mask", False):
+                raise ValueError(
+                    "continue_final_message is not compatible with return_assistant_tokens_mask."
+                )
+        if kwargs.get("return_assistant_tokens_mask", False):
+            if not is_tokenizers_fast:
+                raise ValueError(
+                    "`return_assistant_tokens_mask` is not possible with slow tokenizers. Make sure you have `tokenizers` installed. "
+                    "If the error persists, open an issue to support a Fast tokenizer for your model."
+                )
+            else:
+                kwargs["return_offsets_mapping"] = (
+                    True  # force offset mapping so we can infer token boundaries
+                )
+        # Fill sets of kwargs that should be used by jinja template, filtering out kwargs used in `processor.__call__`
+        # NOTE: we don't only filter but also set the default values here. Without default values, we can remove it
+        template_kwargs = {}
+        for key in AllKwargsForChatTemplate.__annotations__[
+            "template_kwargs"
+        ].__annotations__:
+            kwarg_type_defaults = AllKwargsForChatTemplate.__annotations__[
+                "template_kwargs"
+            ]
+            default_value = getattr(kwarg_type_defaults, key, None)
+            value = kwargs.pop(key, default_value)
+            if value is not None and not isinstance(value, dict):
+                template_kwargs[key] = value
+        # Pass unprocessed custom kwargs
+        template_kwargs.update(kwargs)
+        # Set the sampling rate to load the audio files if user hasn't already passed with `kwargs`
+        if "sampling_rate" not in template_kwargs:
+            if hasattr(self, "feature_extractor") and hasattr(
+                self.feature_extractor, "sampling_rate"
+            ):
+                template_kwargs["sampling_rate"] = self.feature_extractor.sampling_rate
+            else:
+                template_kwargs["sampling_rate"] = 16_000
+        if isinstance(conversation, (list, tuple)) and (
+            isinstance(conversation[0], (list, tuple))
+            or hasattr(conversation[0], "content")
+        ):
+            is_batched = True
+            conversations = conversation
+        else:
+            is_batched = False
+            conversations = [conversation]
+        # Normalize OpenAI-style "image_url" content blocks to HuggingFace-style "image" blocks
+        # OpenAI format: {"type": "image_url", "image_url": {"url": "..."}}
+        # HuggingFace format: {"type": "image", "url": "..."}
+        for conversation in conversations:
+            for message in conversation:
+                if not isinstance(message.get("content"), list):
+                    continue
+                new_content = []
+                for content in message["content"]:
+                    if (
+                        isinstance(content, dict)
+                        and content.get("type") == "image_url"
+                        and "image_url" in content
+                    ):
+                        image_url_info = content["image_url"]
+                        url = (
+                            image_url_info.get("url", "")
+                            if isinstance(image_url_info, dict)
+                            else image_url_info
+                        )
+                        new_content.append({"type": "image", "url": url})
+                    else:
+                        new_content.append(content)
+                message["content"] = new_content
+        tokenize = template_kwargs.pop("tokenize", False)
+        return_dict = template_kwargs.pop("return_dict", True)
+        if tokenize:
+            batch_images, batch_videos = [], []
+            batch_audios = []
+            batch_midis = []  # midi
+            for conversation in conversations:
+                images, videos = [], []
+                for message in conversation:
+                    visuals = [
+                        content
+                        for content in message["content"]
+                        if content["type"] in ["image", "video"]
+                    ]
+                    audio_fnames = [
+                        content[key]
+                        for content in message["content"]
+                        for key in ["audio", "url", "path"]
+                        if key in content and content["type"] == "audio"
+                    ]
+                    image_fnames = [
+                        vision_info[key]
+                        for vision_info in visuals
+                        for key in ["image", "url", "path", "base64"]
+                        if key in vision_info and vision_info["type"] == "image"
+                    ]
+                    images.extend(image_fnames)
+                    video_fnames = [
+                        vision_info[key]
+                        for vision_info in visuals
+                        for key in ["video", "url", "path"]
+                        if key in vision_info and vision_info["type"] == "video"
+                    ]
+                    videos.extend(video_fnames)
+                    # midi
+                    midi_fnames = [
+                        content[key]
+                        for content in message["content"]
+                        for key in ["score", "path"]
+                        if key in content and content["type"] == "midi"
+                    ]
+                    batch_midis.extend(midi_fnames)
+                    # Audio models do not accept nested list of audios (yet!) so we construct a flat input audio list
+                    if not template_kwargs["load_audio_from_video"]:
+                        for fname in audio_fnames:
+                            batch_audios.append(
+                                load_audio(
+                                    fname,
+                                    sampling_rate=template_kwargs["sampling_rate"],
+                                )
+                            )
+                    else:
+                        for fname in video_fnames:
+                            batch_audios.append(
+                                load_audio(
+                                    fname,
+                                    sampling_rate=template_kwargs["sampling_rate"],
+                                )
+                            )
+                # Currently all processors can accept nested list of batches, but not flat list of visuals
+                # So we'll make a batched list of images and let the processor handle it
+                batch_images.append(images)
+                batch_videos.append(videos)
+        special_tokens_map = {}
+        if hasattr(self, "tokenizer") and hasattr(self.tokenizer, "special_tokens_map"):
+            special_tokens = self.tokenizer.special_tokens_map
+            # Filter out tokens that conflict with template kwargs
+            special_tokens_map = {
+                k: v for k, v in special_tokens.items() if k not in template_kwargs
+            }
+        prompt, generation_indices = render_jinja_template(
+            conversations=conversations,
+            chat_template=chat_template,
+            **template_kwargs,  # different flags such as `return_assistant_tokens_mask`
+            **special_tokens_map,  # tokenizer special tokens are used by some templates
+        )
+        if not is_batched:
+            prompt = prompt[0]
+        if tokenize:
+            # Tokenizer's `apply_chat_template` never adds special tokens when tokenizing.
+            # The processor's `apply_chat_template` previously had no option to tokenize, so users formatted the prompt
+            # and passed it to the processor. They never had to think about special tokens because the processor handled
+            # everything internally. The check below keeps BC for that and also works with models that have
+            # special tokens in the template (consistent with tokenizers). We don't raise a warning: it would flood the
+            # command line without an actionable solution for users
+            single_prompt = prompt[0] if is_batched else prompt
+            if self.tokenizer.bos_token is not None and single_prompt.startswith(
+                self.tokenizer.bos_token
+            ):
+                kwargs["add_special_tokens"] = False
+            # Sample frames when users pass `num_frames`/`fps`, unless they set `do_sample_frames` explicitly.
+            # If neither is passed, sampling is not forced here, for BC.
+            if "do_sample_frames" not in kwargs and (
+                kwargs.get("fps") is not None or kwargs.get("num_frames") is not None
+            ):
+                kwargs["do_sample_frames"] = True
+            images_exist = any(
+                (im is not None) for im_list in batch_images for im in im_list
+            )
+            videos_exist = any(
+                (vid is not None) for vid_list in batch_videos for vid in vid_list
+            )
+            out = self(
+                text=prompt,
+                images=batch_images if images_exist else None,
+                videos=batch_videos if videos_exist else None,
+                audio=batch_audios if batch_audios else None,
+                midi=batch_midis if batch_midis else None,
+                **kwargs,
+            )
+            if return_dict:
+                if template_kwargs.get("return_assistant_tokens_mask", False):
+                    assistant_masks = []
+                    offset_mapping = out.pop("offset_mapping")
+                    input_ids = out["input_ids"]
+                    for i in range(len(input_ids)):
+                        current_mask = [0] * len(input_ids[i])
+                        offsets = offset_mapping[i]
+                        offset_starts = [start for start, end in offsets]
+                        for (
+                            assistant_start_char,
+                            assistant_end_char,
+                        ) in generation_indices[i]:
+                            start_pos = bisect.bisect_left(
+                                offset_starts, assistant_start_char
+                            )
+                            end_pos = bisect.bisect_left(
+                                offset_starts, assistant_end_char
+                            )
+                            if not (
+                                start_pos >= 0
+                                and start_pos < len(offsets)
+                                and offsets[start_pos][0]
+                                <= assistant_start_char
+                                < offsets[start_pos][1]
+                            ):
+                                # The start token is out of bounds, possibly due to truncation.
+                                continue
+                            # Ensure end_pos is also within bounds
+                            if end_pos > len(input_ids[i]):
+                                end_pos = len(input_ids[i])
+                            for token_id in range(
+                                start_pos, end_pos if end_pos else len(input_ids[i])
+                            ):
+                                current_mask[token_id] = 1
+                        assistant_masks.append(current_mask)
+                    out["assistant_masks"] = assistant_masks
+                    out.convert_to_tensors(tensor_type=kwargs.get("return_tensors"))
+                return out
+            else:
+                return out["input_ids"]
+        return prompt
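
The assistant-mask step at the end of `apply_chat_template` maps character spans of assistant text onto token positions through the tokenizer's offset mapping. The following is a minimal standalone sketch of that idea, not the processor's own code: `assistant_mask` and the toy offsets are hypothetical names and data introduced for illustration, and the sketch leaves out the `end_pos` fallback used above.

```python
import bisect


def assistant_mask(offsets, assistant_spans):
    """Return a 0/1 list marking tokens that belong to assistant text.

    offsets: (char_start, char_end) per token, as in a fast tokenizer's offset mapping.
    assistant_spans: (char_start, char_end) ranges of assistant text in the rendered prompt.
    """
    mask = [0] * len(offsets)
    starts = [start for start, _ in offsets]
    for span_start, span_end in assistant_spans:
        first = bisect.bisect_left(starts, span_start)
        last = bisect.bisect_left(starts, span_end)
        # Skip spans whose first character is not covered by token `first`,
        # for example because truncation removed that part of the prompt.
        if first >= len(offsets):
            continue
        if not (offsets[first][0] <= span_start < offsets[first][1]):
            continue
        for token_idx in range(first, min(last, len(offsets))):
            mask[token_idx] = 1
    return mask


# Toy prompt "Q: hi A: yo" split into four tokens (assumed offsets).
toy_offsets = [(0, 2), (2, 5), (5, 8), (8, 11)]
print(assistant_mask(toy_offsets, [(8, 11)]))  # → [0, 0, 0, 1]
```

Only the last token lies inside the assistant span, so only that position is set. If the prompt is truncated before the span starts, the span is skipped and the mask stays all zeros.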

processor_config.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "auto_map": {
+    "AutoProcessor": "processing_song2midi.Song2MIDIProcessor"
+  },
+  "feature_extractor": {
+    "auto_map": {
+      "AutoProcessor": "processing_song2midi.Song2MIDIProcessor"
+    },
+    "chunk_length": 30,
+    "dither": 0.0,
+    "feature_extractor_type": "WhisperFeatureExtractor",
+    "feature_size": 128,
+    "hop_length": 160,
+    "n_fft": 400,
+    "n_samples": 480000,
+    "nb_max_frames": 3000,
+    "padding_side": "right",
+    "padding_value": 0.0,
+    "return_attention_mask": false,
+    "sampling_rate": 16000
+  },
+  "processor_class": "Song2MIDIProcessor"
+}
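
The size fields in the `feature_extractor` block are not independent. In Whisper-style feature extraction, `chunk_length` is given in seconds, `n_samples` is the number of audio samples in one chunk, and `nb_max_frames` is the number of spectrogram frames that chunk produces. The sketch below recomputes the two derived values from numbers copied out of this config as a consistency check; `derived_sizes` is a hypothetical helper name, not part of the repository.

```python
# Values copied from the feature_extractor block of processor_config.json.
config = {
    "chunk_length": 30,      # seconds per chunk
    "sampling_rate": 16000,  # samples per second
    "hop_length": 160,       # samples between consecutive frames
    "n_samples": 480000,
    "nb_max_frames": 3000,
}


def derived_sizes(chunk_length, sampling_rate, hop_length):
    """Recompute samples per chunk and frames per chunk."""
    n_samples = chunk_length * sampling_rate
    nb_max_frames = n_samples // hop_length
    return n_samples, nb_max_frames


n_samples, nb_max_frames = derived_sizes(
    config["chunk_length"], config["sampling_rate"], config["hop_length"]
)
print(n_samples == config["n_samples"], nb_max_frames == config["nb_max_frames"])
```

With a hop of 160 samples at 16 kHz, each frame covers 10 ms, so a 30-second chunk yields 3000 frames.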

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:144ef28cd74e9fbc5d92e359a2ad561e7710e16241c615e3d99beedf7704ba98
+size 19989912
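
The three lines above are a Git LFS pointer, not the tokenizer itself: the repository stores only the spec version, a SHA-256 object id, and the size in bytes, while the real `tokenizer.json` lives in LFS storage. As a rough illustration, a pointer of this shape can be parsed as follows; `parse_lfs_pointer` is a hypothetical helper written for this example.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:144ef28cd74e9fbc5d92e359a2ad561e7710e16241c615e3d99beedf7704ba98
size 19989912
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algo"], pointer["size"])  # → sha256 19989912
```

The size field indicates that the actual tokenizer file is about 20 MB.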

tokenizer_config.json ADDED

	@@ -0,0 +1,41 @@

+{
+  "add_prefix_space": false,
+  "audio_bos_token": "<|audio_start|>",
+  "audio_eos_token": "<|audio_end|>",
+  "audio_token": "<|audio_pad|>",
+  "auto_map": {
+    "AutoProcessor": "processing_song2midi.Song2MIDIProcessor"
+  },
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "image_token": "<|image_pad|>",
+  "is_local": true,
+  "midi_bos_token": "<|midi_start|>",
+  "midi_eos_token": "<|midi_end|>",
+  "midi_token": "<|midi_pad|>",
+  "model_max_length": 262144,
+  "model_specific_special_tokens": {
+    "audio_bos_token": "<|audio_start|>",
+    "audio_eos_token": "<|audio_end|>",
+    "audio_token": "<|audio_pad|>",
+    "image_token": "<|image_pad|>",
+    "midi_bos_token": "<|midi_start|>",
+    "midi_eos_token": "<|midi_end|>",
+    "midi_token": "<|midi_pad|>",
+    "video_token": "<|video_pad|>",
+    "vision_bos_token": "<|vision_start|>",
+    "vision_eos_token": "<|vision_end|>"
+  },
+  "pad_token": "<|endoftext|>",
+  "pretokenize_regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?[\\p{L}\\p{M}]+|\\p{N}| ?[^\\s\\p{L}\\p{M}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
+  "processor_class": "Song2MIDIProcessor",
+  "split_special_tokens": false,
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": null,
+  "video_token": "<|video_pad|>",
+  "vision_bos_token": "<|vision_start|>",
+  "vision_eos_token": "<|vision_end|>"
+}
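
Because `bos_token` is `null` in this config, the BOS check in `apply_chat_template` (which sets `add_special_tokens` to `False` when the rendered prompt already starts with the BOS token) should never trigger for this tokenizer. A small sketch of that condition, using a hypothetical helper name `prompt_starts_with_bos` and a made-up `<s>` token for contrast:

```python
def prompt_starts_with_bos(prompt, bos_token):
    """Mirror the processor's check: True only if a BOS token exists and leads the prompt."""
    return bos_token is not None and prompt.startswith(bos_token)


# With this repository's tokenizer_config.json, bos_token is null (None in Python).
print(prompt_starts_with_bos("hello", None))  # → False

# Hypothetical tokenizer with "<s>" as BOS, shown only for comparison.
print(prompt_starts_with_bos("<s>hello", "<s>"))  # → True
```

In the first case the processor leaves `add_special_tokens` at whatever the caller passed.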