Enabling models to see images.
A picture is worth a thousand words (or at least a thousand tokens). With NobodyWho, you can easily provide image information to your LLM.
Choosing a model
Not all models have built-in image capabilities. Generally, you need two parts to make this work:
- a vision-language (VL) LLM, which can consume image tokens
- a projection model, which converts images into image tokens
To find such a model, browse the Image-Text-to-Text section on HuggingFace.
The projection model usually includes mmproj in its filename.
If you are unsure which ones to pick, or just want a reasonable default, you can try Gemma 3 4b with its F16 projection model.
With both GGUFs downloaded, you can set the projection model on your NobodyWhoModel node.
In the editor, point the image_model_path property at your projection model file.
Alternatively, you can set it in GDScript:
$ChatModel.image_model_path = "res://mmproj.gguf"
Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through the NobodyWhoPrompt object.
extends NobodyWhoChat

func _ready():
    self.model_node = get_node("../ChatModel")
    self.system_prompt = "You are a helpful assistant."

    var prompt = NobodyWhoPrompt.new()
    prompt.add_text("Tell me what you see in the images.")
    prompt.add_image("res://dog.png")
    prompt.add_image("res://penguin.png")

    ask(prompt)
    var response = await response_finished # It's a dog and a penguin!
Tips for multimodality
As with textual prompts, the format in which you supply a multimodal prompt can matter. If the model performs poorly, experiment with the order in which you supply the text and the images, or with the descriptions you provide. For example, the following prompt may perform better than the one presented above.
var prompt = NobodyWhoPrompt.new()
prompt.add_text("Tell me what you see in the first image.")
prompt.add_image("res://dog.png")
prompt.add_text("Also tell me what you see in the second image.")
prompt.add_image("res://penguin.png")
There is also a lot of variance in how models internally process images. This affects how quickly a model consumes context: for some models, such as Gemma 3, the number of tokens per image is constant; for others, such as Qwen 3, it scales with the size of the image. If resources allow, you can increase the context size:
self.context_length = 8192
Alternatively, preprocess your images by downscaling or compressing them (sometimes even changing the image format helps).
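As a sketch of such preprocessing, the helper below downscales an image with Godot's built-in Image API and re-encodes it as JPEG before handing the smaller file to the prompt. The function name, the output path, and the assumption that add_image accepts any readable file path are our own; adjust the target size and quality to your model.

```gdscript
# Hypothetical helper: shrink an image before sending it to the model.
# Assumes NobodyWhoPrompt.add_image accepts any readable file path.
func preprocess_image(source_path: String, max_side: int = 512) -> String:
    var img := Image.load_from_file(source_path)
    var scale := float(max_side) / max(img.get_width(), img.get_height())
    if scale < 1.0:
        # Downscale so the longest side is at most max_side pixels.
        img.resize(int(img.get_width() * scale), int(img.get_height() * scale),
                Image.INTERPOLATE_LANCZOS)
    var out_path := "user://preprocessed.jpg"
    img.save_jpg(out_path, 0.8)  # re-encode as JPEG at 80% quality
    return out_path
```

You would then call, for example, prompt.add_image(preprocess_image("res://penguin.png")) instead of passing the original file directly.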
Finally, more niche models may still expose bugs. If you run into one, please report it so we can fix it.