Few-shot generation
Great work!
Does this model support any form of in-context learning (ICL) for images?
For example, can it take a few example images (or image-token patterns) in the prompt and reliably generalize or adapt its output based on that context without updating weights — similar to how LLMs exhibit ICL with text?
If not currently supported, is there a conceptual path where autoregressive image models like this could perform ICL (e.g., via joint text+image contexts or positional conditioning)?
Thanks!
Thank you for the kind words about our work!
The current model is trained only on text–image pairs and therefore cannot take images directly as prompts. However, our framework is well suited to unified multimodal modeling and supports inputs and outputs across different modalities. In fact, we are training more general-purpose models to enable capabilities such as image editing and interleaved multimodal input and output. Please stay tuned for our upcoming models!
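To make the idea of interleaved multimodal ICL concrete, here is a minimal sketch of how a few-shot prompt could be assembled for an autoregressive image model. Everything here is illustrative: the special tokens (`<img>`, `</img>`), the `encode_image` stand-in for a discrete image tokenizer, and the prompt layout are assumptions, not part of any released model.

```python
# Hypothetical sketch of an interleaved few-shot prompt for an
# autoregressive image model. Token names and the tokenizer stub are
# illustrative assumptions, not the actual model's vocabulary or API.

IMG_START, IMG_END = "<img>", "</img>"


def encode_image(image_id: str, n_tokens: int = 4) -> list[str]:
    """Stand-in for a discrete (e.g. VQ) image tokenizer.

    A real tokenizer would map pixels to codebook indices; here we just
    emit placeholder tokens so the prompt structure is visible.
    """
    return [f"[{image_id}:{i}]" for i in range(n_tokens)]


def build_icl_prompt(examples: list[tuple[str, str]], query: str) -> list[str]:
    """Interleave (caption, image-tokens) example pairs, then the query caption.

    The model would then continue the sequence by generating image tokens
    for the query, conditioning on the in-context examples with no weight
    updates -- the image analogue of text-only ICL in LLMs.
    """
    tokens: list[str] = []
    for caption, image_id in examples:
        tokens += caption.split()
        tokens += [IMG_START, *encode_image(image_id), IMG_END]
    tokens += query.split()
    tokens.append(IMG_START)  # cue the model to emit image tokens next
    return tokens


prompt = build_icl_prompt(
    [("a red circle", "img0"), ("a blue circle", "img1")],
    "a green circle",
)
```

The key design point is simply that image tokens live in the same autoregressive sequence as text tokens, so few-shot conditioning falls out of ordinary next-token prediction over the joint vocabulary.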
That sounds very exciting — I’m really looking forward to seeing the upcoming models!
The direction toward unified multimodal modeling and interleaved input/output is especially interesting. It will be fascinating to see how far this framework can go.
Best of luck with the development, and thanks again for sharing your work!