Vision & Hearing
Easily provide image and audio information to your LLM.
Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will need two parts:
- Multimodal LLM that can consume image-tokens and/or audio-tokens
- Projection model that converts images to image-tokens and/or audio to audio-tokens
To find such a model, browse the Image-Text-to-Text and Audio-Text-to-Text
categories on Hugging Face. Some models, like Gemma 4, handle both!
Usually, the projection model includes mmproj in its name.
If you are unsure which ones to pick, try Gemma 4 with its BF16 projection model.
Load the projection model alongside the main model:
import ai.nobodywho.Model
import ai.nobodywho.Chat

val model = Model.load(
    modelPath = "./multimodal-model.gguf",
    projectionModelPath = "./mmproj.gguf"
)
val chat = Chat(model = model)
The language model and the projection model must match, because they are trained together. You cannot pair an arbitrary projection model with an arbitrary LLM.
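If you keep several downloads in one folder, a small helper can separate projection files from language model files using the mmproj naming convention mentioned above. This is a minimal sketch that uses only the Kotlin standard library; `splitGgufFiles` and the `./models` directory are hypothetical names and not part of NobodyWho. It only inspects file names, so it cannot confirm that a given pair was actually trained together.

```kotlin
import java.io.File

// Hypothetical helper (not part of NobodyWho): splits the .gguf files in a
// directory into projection candidates (name contains "mmproj") and the rest.
fun splitGgufFiles(dir: File): Pair<List<File>, List<File>> {
    val ggufFiles = dir
        .listFiles { f -> f.isFile && f.extension.equals("gguf", ignoreCase = true) }
        ?.sortedBy { it.name }
        ?: emptyList() // listFiles returns null if dir does not exist
    return ggufFiles.partition { it.name.contains("mmproj", ignoreCase = true) }
}

fun main() {
    // "./models" is an assumed location; point this at your own download folder.
    val (projections, models) = splitGgufFiles(File("./models"))
    println("Projection candidates: ${projections.map { it.name }}")
    println("Language model candidates: ${models.map { it.name }}")
}
```

You still need to choose a language model and a projection model that were published together, typically from the same model repository.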
Composing a prompt
With the model configured, compose a multimodal prompt using Prompt:
import ai.nobodywho.Prompt

val response = chat.ask(Prompt(
    Prompt.Text("Tell me what you see in the image and what you hear in the audio."),
    Prompt.Image("./dog.png"),
    Prompt.Audio("./sound.mp3"),
)).completed()
println(response) // It's a dog!
Tips for multimodality
The way you structure a multimodal prompt can matter. If the model performs poorly, try changing the order of text and media, or rewording the descriptions. For example, you can place each instruction directly before the media it refers to:
chat.resetHistory()
val response = chat.ask(Prompt(
    Prompt.Text("Tell me what you see in the image."),
    Prompt.Image("./dog.png"),
    Prompt.Text("Also tell me what you hear in the audio."),
    Prompt.Audio("./sound.mp3"),
)).completed()
Different models process images differently: some use a fixed number of tokens per image, while others scale the token count with image size. If your media does not fit in the context, you may need to increase the context size:
val chat = Chat(
    model = model,
    contextSize = 8192u
)
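To choose a value, it can help to make the token budget explicit. The sketch below is a rough estimate and not something NobodyWho provides: `estimateContextSize` is a hypothetical helper, and every token count passed to it is an assumption you would need to measure for your own model. Rounding up to a power of two with a floor of 2048 is only a convention chosen for this example.

```kotlin
// Hypothetical helper: rough context budget for a multimodal chat.
// All token counts are assumptions; measure them for your own model.
fun estimateContextSize(
    textTokens: Int,
    imageCount: Int,
    tokensPerImage: Int,
    responseTokens: Int,
): UInt {
    require(textTokens >= 0 && imageCount >= 0 && tokensPerImage >= 0 && responseTokens >= 0) {
        "Token counts must be non-negative"
    }
    val needed = textTokens + imageCount * tokensPerImage + responseTokens
    // Round up to the next power of two, starting from a floor of 2048.
    var size = 2048
    while (size < needed) size *= 2
    return size.toUInt()
}

fun main() {
    // Example figures only: 3 images at an assumed 1024 tokens each.
    val size = estimateContextSize(
        textTokens = 500,
        imageCount = 3,
        tokensPerImage = 1024,
        responseTokens = 1000,
    )
    println(size) // 500 + 3 * 1024 + 1000 = 4572, rounded up to 8192
}
```

The result can then be passed as `contextSize` when constructing `Chat`, as in the snippet above.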
For large images, consider downsizing them before sending them to the model. This reduces processing time, especially on mobile devices.
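As one way to do this on a desktop JVM, the sketch below scales an image so that its longest side fits within a limit, using `javax.imageio` and `java.awt`. The helper names `fitWithin` and `downscale`, the 1024-pixel limit, and the output file name are assumptions made for illustration. These classes are not available on Android, where you would use the platform's own bitmap APIs instead.

```kotlin
import java.awt.RenderingHints
import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// Computes dimensions that fit within maxSide while keeping the aspect ratio.
// Images that are already small enough are left unchanged.
fun fitWithin(width: Int, height: Int, maxSide: Int): Pair<Int, Int> {
    require(width > 0 && height > 0 && maxSide > 0) { "Sizes must be positive" }
    val longest = maxOf(width, height)
    if (longest <= maxSide) return width to height
    // Integer arithmetic in Long avoids overflow and floating-point rounding.
    val w = (width.toLong() * maxSide / longest).toInt()
    val h = (height.toLong() * maxSide / longest).toInt()
    return maxOf(1, w) to maxOf(1, h)
}

// Reads an image, scales it down if needed, and writes the result as PNG.
// Note: TYPE_INT_RGB drops any alpha channel in the source image.
fun downscale(input: File, output: File, maxSide: Int) {
    val source = ImageIO.read(input) ?: error("Unsupported image format: $input")
    val (w, h) = fitWithin(source.width, source.height, maxSide)
    val target = BufferedImage(w, h, BufferedImage.TYPE_INT_RGB)
    val g = target.createGraphics()
    try {
        g.setRenderingHint(
            RenderingHints.KEY_INTERPOLATION,
            RenderingHints.VALUE_INTERPOLATION_BILINEAR,
        )
        g.drawImage(source, 0, 0, w, h, null)
    } finally {
        g.dispose()
    }
    ImageIO.write(target, "png", output)
}

fun main() {
    val input = File("./dog.png")
    // 1024 is an arbitrary example limit; tune it for your model and device.
    if (input.exists()) downscale(input, File("./dog-small.png"), maxSide = 1024)
}
```

The resized file can then be referenced with `Prompt.Image("./dog-small.png")` in place of the original.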