Few-shot generation
Great work!
Does this model support any form of in-context learning (ICL) for images?
For example, can it take a few example images (or image-token patterns) in the prompt and reliably generalize or adapt its output based on that context without updating weights — similar to how LLMs exhibit ICL with text?
If not currently supported, is there a conceptual path where autoregressive image models like this could perform ICL (e.g., via joint text+image contexts or positional conditioning)?
Thanks!
Thank you for the kind words about our work!
The current model is trained only on text–image pairs and therefore cannot take images directly as prompts. However, our framework is well suited to unified multimodal modeling and supports inputs and outputs across different modalities. In fact, we are training more general-purpose models to enable capabilities such as image editing and interleaved multimodal input and output. Please stay tuned for our upcoming models!
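To make the idea of interleaved multimodal ICL concrete, here is a minimal sketch of how a few-shot prompt could be assembled for an autoregressive image model. Everything here is illustrative: the special tokens (`<img>`, `</img>`), the `encode_image` stand-in for a discrete image tokenizer, and the prompt layout are assumptions, not part of any released model.

```python
# Hypothetical sketch of an interleaved few-shot prompt for an
# autoregressive image model. Token names and the tokenizer stub are
# illustrative assumptions, not the actual model's vocabulary or API.

IMG_START, IMG_END = "<img>", "</img>"


def encode_image(image_id: str, n_tokens: int = 4) -> list[str]:
    """Stand-in for a discrete (e.g. VQ) image tokenizer.

    A real tokenizer would map pixels to codebook indices; here we just
    emit placeholder tokens so the prompt structure is visible.
    """
    return [f"[{image_id}:{i}]" for i in range(n_tokens)]


def build_icl_prompt(examples: list[tuple[str, str]], query: str) -> list[str]:
    """Interleave (caption, image-tokens) example pairs, then the query caption.

    The model would then continue the sequence by generating image tokens
    for the query, conditioning on the in-context examples with no weight
    updates -- the image analogue of text-only ICL in LLMs.
    """
    tokens: list[str] = []
    for caption, image_id in examples:
        tokens += caption.split()
        tokens += [IMG_START, *encode_image(image_id), IMG_END]
    tokens += query.split()
    tokens.append(IMG_START)  # cue the model to emit image tokens next
    return tokens


prompt = build_icl_prompt(
    [("a red circle", "img0"), ("a blue circle", "img1")],
    "a green circle",
)
```

The key design point is simply that image tokens live in the same autoregressive sequence as text tokens, so few-shot conditioning falls out of ordinary next-token prediction over the joint vocabulary.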
That sounds very exciting — I’m really looking forward to seeing the upcoming models!
The direction toward unified multimodal modeling and interleaved input/output is especially interesting. It will be fascinating to see how far this framework can go.
Best of luck with the development, and thanks again for sharing your work!