Version: 1.0.0

Vision & Hearing

Easily provide image and audio information to your LLM.

Choosing a model

Not all models have built-in image and audio capabilities. Generally, you will need two parts:

Multimodal LLM that can consume image-tokens and/or audio-tokens
Projection model that converts images to image-tokens and/or audio to audio-tokens

To find such a model, refer to the HuggingFace Image-Text-to-Text section and Audio-Text-to-Text. Some models like Gemma 4 manage both! Usually, the projection model includes mmproj in its name.

If you are unsure which ones to pick, try Gemma 4 with its BF16 projection model.

Load the projection model alongside the main model:

import ai.nobodywho.Model
import ai.nobodywho.Chat

val model = Model.load(
    modelPath = "./multimodal-model.gguf",
    projectionModelPath = "./mmproj.gguf"
)
val chat = Chat(model = model)

info

The language model and projection model must fit together, as they are trained together. You can't take an arbitrary projection model and pair it with any LLM.

Composing a prompt

With the model configured, compose a multimodal prompt using Prompt:

import ai.nobodywho.Prompt

val response = chat.ask(Prompt(
    Prompt.Text("Tell me what you see in the image and what you hear in the audio."),
    Prompt.Image("./dog.png"),
    Prompt.Audio("./sound.mp3"),
)).completed()
println(response) // It's a dog!

Tips for multimodality

The format in which you supply the multimodal prompt can matter. If the model performs poorly, try changing the order of text and media, or adjusting descriptions:

chat.resetHistory()
val response = chat.ask(Prompt(
    Prompt.Text("Tell me what you see in the image."),
    Prompt.Image("./dog.png"),
    Prompt.Text("Also tell me what you hear in the audio."),
    Prompt.Audio("./sound.mp3"),
)).completed()

Different models process images differently — some use a fixed number of tokens per image, others scale with image size. You may need to increase the context size:

val chat = Chat(
    model = model,
    contextSize = 8192u
)

For large images, consider downsizing before sending to the model to reduce processing time, especially on mobile devices.

Choosing a model​

Composing a prompt​

Tips for multimodality​

Choosing a model

Composing a prompt

Tips for multimodality