# NobodyWho

> Local-first LLM inference for Python, Godot, Flutter, and React Native, built on llama.cpp. Streaming chat, tool calling, embeddings, and RAG with offline GPU-accelerated inference.

## Home

### Home

## What is NobodyWho?

NobodyWho is a lightweight, open-source inference engine for running open-weights LLMs inside your software.
We provide a simple, efficient, offline, and privacy-forward way of interacting with LLMs. No infrastructure needed!

In short, if you want to run an LLM and integrate it with [tools](./python/tool-calling.md), configure its output,
enable real-time streaming of tokens, or use it to create embeddings, NobodyWho makes it easy.

All of this is enabled by [Llama.cpp](https://github.com/ggml-org/llama.cpp), wrapped in a nice, simple API.

No need to mess around with Docker containers, GPU servers, API keys, etc. We make it easy to run local LLMs in Python, React Native, Flutter and Godot, with more integrations coming soon!

## Code documentation 

If you are already familiar with the basics of LLMs, we suggest you go straight to the documentation of your selected integration.

- [Python](python/index.md)
- [React Native](react-native/index.md)
- [Flutter](flutter/index.md)
- [Godot](godot/install.md)

## Basic LLM concepts

If you are unfamiliar with the basics of LLMs, or are just interested, we also provide a simple introduction to the most important concepts you need to know in order to get the most out of NobodyWho.

## LLM Basics

### LLM Basics

Our goal with NobodyWho is to make it easy to run local LLMs. For this reason, we have made it possible to use NobodyWho with minimal knowledge of how LLMs work. However, you still need to know some basic concepts, so we provide brief explanations of them here. The concepts covered are tokens, context, samplers and tools.

## Tokens

Tokens are the basic units that LLMs process. A token is typically a word, part of a word, or a punctuation mark. For example, "hello" is one token, while "understanding" might be split into two tokens: "understand" and "ing". It is worth noting that the vocabulary of tokens used is different for each model as it is defined during training. 

When the model generates text, it produces one token at a time. This is why the default response object of NobodyWho is a stream of tokens and why you can read the response token-by-token.
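To make the idea of sub-word tokens concrete, here is a toy greedy longest-match tokenizer. The vocabulary and the function name `toy_tokenize` are invented for illustration; real models use learned vocabularies with many thousands of entries and their own splitting rules, so treat this as a sketch rather than how any particular model tokenizes.

```python
# Toy vocabulary, invented for this example; real vocabularies are learned during training.
TOY_VOCAB = {"hello", "understand", "ing", "un", "der", "stand"}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens


print(toy_tokenize("hello", TOY_VOCAB))          # ['hello']
print(toy_tokenize("understanding", TOY_VOCAB))  # ['understand', 'ing']
```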

## Context

Context refers to all the text the model can "see" when generating a response. This includes:
- Previous messages in the conversation
- The current user prompt
- Any system instructions

Essentially, the context acts as the model's memory of the current conversation, the available tools, and so on. This is important to remember because, once your chosen model has been initialized, most of your interactions with it happen through the context.

### Context Size

Every model has a maximum context size (also called context window or context length), measured in tokens. Common sizes range from 2048 to 128,000 tokens.

Once you reach the context limit, you must either:
- Start a new conversation
- Remove old messages from the history
- Summarize earlier parts of the conversation

Currently NobodyWho resolves this issue automatically by removing old messages from the context.
Having a larger context allows for longer and more complex conversations, but it also slows down the response time, as the model has to process more tokens each time it generates a response.

## Samplers

LLMs don't output text directly. Instead, they generate a probability distribution over all possible next tokens. Since the model weights are static after training, the same input tokens always produce the same distribution. Depending on the use case, however, there are many possible ways of choosing the next token from this distribution. This is configured using a **sampler**. A **sampler** splits the process of choosing the next token into two parts: shifting the distribution and sampling the distribution.

### Shifting the Distribution
Before sampling the distribution to get the next token, it is possible to adjust the distribution provided by the LLM to encourage certain behavior. Examples of these adjustments are:

- **Temperature**: Higher values make output more creative/random, lower values make it more focused/deterministic.
- **Top-k/Top-p**: Limit which tokens are considered, filtering out unlikely options
- **Penalties**: Lower the probabilities of tokens already present in the context.

It is important to note that the steps in this part of the process can be chained. So it is possible to first apply a Temperature shift and then Top-k.
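As a rough illustration of chaining, the sketch below applies a temperature shift and then top-k to a handful of made-up scores (logits). The function names and the numbers are assumptions for this example only; they are not NobodyWho's API.

```python
import math


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Divide every logit by the temperature (values below 1 sharpen, above 1 flatten)."""
    return {tok: value / temperature for tok, value in logits.items()}


def apply_top_k(logits: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k highest-scoring tokens."""
    best = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(best)


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn logits into probabilities that sum to 1."""
    highest = max(logits.values())
    exps = {tok: math.exp(value - highest) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Made-up scores for four candidate next tokens.
logits = {"wet": 2.0, "dry": 1.0, "blue": 0.5, "loud": -1.0}

# Chain the two shifts: temperature first, then top-k.
shifted = apply_top_k(apply_temperature(logits, 0.5), k=2)
probs = softmax(shifted)
print(probs)
```

After the chain, only the two most likely tokens remain, and the low temperature has widened the gap between them.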


### Sampling the distribution
Once the distribution has been shifted the next step is to actually sample the distribution. This can also be done a few different ways:

- **Dist**: Sample the distribution randomly 
- **Greedy**: Always pick the most likely token (deterministic but sometimes repetitive)
- **Mirostat**: Advanced sampling presented in this [article](https://arxiv.org/abs/2007.14966)

Since this part actually chooses the next token, these cannot be chained.
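The difference between the first two strategies can be sketched in a few lines of plain Python. The probabilities are made up and the function names are hypothetical; this only illustrates the idea of picking one token from a distribution.

```python
import random


def sample_greedy(probs: dict[str, float]) -> str:
    """Always return the most probable token."""
    return max(probs, key=probs.get)


def sample_dist(probs: dict[str, float], rng: random.Random) -> str:
    """Draw a token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up probabilities for three candidate next tokens.
probs = {"wet": 0.7, "dry": 0.2, "blue": 0.1}

print(sample_greedy(probs))                  # wet
print(sample_dist(probs, random.Random(0)))  # one of the three tokens, chosen by weight
```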


NobodyWho also supports more advanced ways of configuring a sampler, for example constraining the output to follow a JSON Schema.

## Tools

Tools (also called function calling) allow the LLM to request external actions. Instead of just generating text, the model can indicate it wants to:
- Search a database
- Perform a calculation
- Fetch data from an API
- Execute custom code

You define available tools, and the model decides when to use them based on the conversation. After a tool executes, you provide the result back to the model so it can continue the conversation.

This enables LLMs to go beyond pure text generation and interact with your application's functionality.
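The request, execute, and return cycle can be pictured with a small dispatcher. The JSON shape of the tool request below is an assumption made for this sketch (each model family has its own format), and `run_tool_call` is a hypothetical helper, not something NobodyWho exposes.

```python
import json


def add(a: float, b: float) -> str:
    return str(a + b)


# The tools the application makes available, keyed by name.
TOOLS = {"add": add}


def run_tool_call(raw_call: str) -> str:
    """Execute a tool request and return the result that goes back to the model."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    return TOOLS[name](**call["arguments"])


# Hypothetical output from a model that decided to use the `add` tool.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(run_tool_call(model_output))  # 5
```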

## Model Selection

### Model Selection

Choosing the right language model can make or break your project. In general you want to go as small as possible while still having the capabilities you need for your application.

## TL;DR

If you just want a ~2GB chat model that works well, use:

```
huggingface:NobodyWho/Qwen_Qwen3-4B-GGUF/Qwen_Qwen3-4B-Q4_K_M.gguf
```

If you want something smaller and faster, use:

```
huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf
```

Pass these as the model path when creating a `Chat`, `Model`, `Encoder`, etc. NobodyWho will download the model automatically and cache it locally for future use.


## Getting a model

NobodyWho can download models directly from Hugging Face. Instead of downloading a file manually, pass a `huggingface:` path where you'd normally pass a file path:

```
huggingface:owner/repo/filename.gguf
```

The model is downloaded once and cached locally — no internet connection is needed after the first load. `hf:` is also accepted as a shorthand.

You can also pass a full `https://` URL to download a model from any host.

Of course, you can still pass a local file path if you prefer to manage model files yourself.
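The three accepted forms can be told apart by their prefix. The helper below only illustrates the path anatomy described here; `describe_model_path` is a hypothetical name, and how NobodyWho parses paths internally is not something this sketch claims to reproduce.

```python
def describe_model_path(path: str) -> dict[str, str]:
    """Classify a model path as a Hugging Face reference, a URL, or a local file."""
    for prefix in ("huggingface:", "hf:"):
        if path.startswith(prefix):
            # Expect owner/repo/filename; a shorter path raises ValueError here.
            owner, repo, filename = path[len(prefix):].split("/", 2)
            return {"kind": "huggingface", "owner": owner, "repo": repo, "filename": filename}
    if path.startswith("https://"):
        return {"kind": "url", "url": path}
    return {"kind": "local", "file": path}


print(describe_model_path("hf:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"))
print(describe_model_path("./model.gguf"))
```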

We recommend starting with the models on our [Hugging Face page](https://huggingface.co/NobodyWho) since they are known to work well with NobodyWho.

Once you're more familiar, you can also try models from [Bartowski](https://huggingface.co/bartowski) and [Unsloth](https://huggingface.co/unsloth/models).

Broadly, almost any `.gguf` model on [Hugging Face](https://huggingface.co) should work, though some may fail due to formatting issues.


## Understanding model file names

Model files follow a naming convention like this: `Qwen_Qwen3-0.6B-Q4_K_M.gguf`

Here's what each part means:

- `Qwen` the organization that trained the model.
- `Qwen3` the name of the model release.
- `0.6B` the parameter count in billions. This model has 0.6 billion (600 million) parameters.
- `Q4` the quantization level, i.e. the number of bits used per parameter.
- `K_M` details about the quantization technique. `S` is faster but less precise, `L` is slower but more precise, and `M` is a middle ground. You don't need to worry too much about this for now.
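If you want to pull these parts out programmatically, a regular expression is enough for names that follow this exact shape. The pattern below is an assumption based on the example name only; many files on Hugging Face deviate from it, so `parse_model_name` (a hypothetical helper) raises an error for anything it does not recognize.

```python
import re

# Pattern for names shaped like "<org>_<model>-<size>B-<quant>[_<variant>].gguf".
NAME_PATTERN = re.compile(
    r"^(?P<org>[^_]+)_(?P<model>.+)-(?P<params>\d+(?:\.\d+)?)B"
    r"-(?P<quant>Q\d+)(?:_(?P<variant>\w+))?\.gguf$"
)


def parse_model_name(filename: str) -> dict:
    """Split a model file name into organization, model, size, and quantization."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized model file name: {filename}")
    return match.groupdict()


print(parse_model_name("Qwen_Qwen3-0.6B-Q4_K_M.gguf"))
# {'org': 'Qwen', 'model': 'Qwen3', 'params': '0.6', 'quant': 'Q4', 'variant': 'K_M'}
```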

For chatting, you'll need an instruction-tuned GGUF file that includes a Jinja2 chat template in its metadata. This describes the vast majority of GGUF files available, so if you're unsure, just try it — NobodyWho will give you a descriptive error message if something isn't right.

For embeddings or cross-encoding, you'll need models specifically designed for those tasks; they are typically named accordingly. Note that cross-encoding models are sometimes referred to as "reranking" models.


## Quantization

Quantization refers to the practice of reducing the number of bits per weight.
This can make the model faster and smaller, with a relatively small loss in response quality.

Generally speaking, you can use models quantized down to the Q4 or Q5 levels (4 or 5 bits per weight respectively),
while losing barely any accuracy.

Look at the plot below to get a feel for how quantization levels differ.
It shows the models' ability to predict text on the y-axis versus the number of bits per weight on the x-axis.

![Perplexity/Quantization curve](./assets/quantcurve.png)

In general, it's preferable to use a model with more parameters and fewer bits per parameter, as compared to a model with fewer parameters and more bits per parameter.
Your results may vary.


## Estimating Memory Usage

The memory requirement of a model is roughly its parameter count multiplied by the number of bits per parameter, divided by 8 to convert bits to bytes.

Here are a few examples:

- 2B @ Q8 ~= 2GB
- 2B @ Q4 ~= 1GB
- 14B @ Q4 ~= 7GB
- 14B @ Q2 ~= 3.5GB
- ..and so on
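The same rule of thumb as a small function. This is a back-of-the-envelope sketch that covers the weights only; the context and runtime overhead come on top, and real quantized files use slightly more than the nominal bits per parameter.

```python
def estimate_model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough weight size: parameters x bits per parameter, converted to gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for params, bits in [(2, 8), (2, 4), (14, 4), (14, 2)]:
    print(f"{params}B @ Q{bits} ~= {estimate_model_memory_gb(params, bits):g}GB")
# 2B @ Q8 ~= 2GB
# 2B @ Q4 ~= 1GB
# 14B @ Q4 ~= 7GB
# 14B @ Q2 ~= 3.5GB
```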


## Comparing Models

There are many places online for comparing benchmark scores of different LLMs. Here are a few of them:

**[LLM-Stats.com](https://llm-stats.com/)**
- Includes filters for open models and small models.
- Compares recent models on a few different benchmarks.

**[OpenEvals on Hugging Face](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard)**
- A collection of benchmark leaderboards in different domains.
- Includes both inaccessible proprietary models and open models.

Remember that you need an open model in order to find a GGUF download and run it locally (e.g. Gemma is open, but Gemini isn't).


---

*Need help choosing between specific models? Check our [community Discord](https://discord.gg/qhaMc2qCYB).*

## Python

### Getting started with Python

## How do I get started?

First, install `nobodywho`.
```bash
pip install nobodywho
```

Next, pick a model. NobodyWho can download GGUF models directly from Hugging Face — just pass a `huggingface:` path. See [model selection](../model-selection.md) for recommendations.

Then make a `Chat` object and call `.ask()`!

```python
from nobodywho import Chat

chat = Chat('huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf')
response = chat.ask('Is water wet?')

# print each token as it is generated
for token in response:
    print(token, end="", flush=True)

# ...or get the entire response as a single string
full_response = response.completed()
print(full_response)
```

This is a super simple example, but we believe that examples which do simple things should be simple!

## Tracking download progress

When loading a remote model, pass an `on_download_progress` callback to observe the download. It receives `(downloaded_bytes, total_bytes)` and is not called for cached or local files. If you don't pass anything, NobodyWho prints a default terminal progress bar.

```python
from nobodywho import Model

model = Model(
    'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
    on_download_progress=lambda downloaded, total: print(f"{downloaded}/{total} bytes"),
)
```
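A raw byte count is hard to read, so you may prefer a callback that formats the numbers. The helper below is a plain-Python sketch; `format_progress` is a hypothetical name, and the handling of a non-positive `total` is a defensive assumption rather than documented behaviour.

```python
def format_progress(downloaded: int, total: int) -> str:
    """Render a download position as megabytes plus a percentage."""
    mb = 1024 * 1024
    if total <= 0:
        # Total size unknown: report only the amount downloaded so far.
        return f"{downloaded / mb:.1f} MB"
    percent = 100 * downloaded / total
    return f"{downloaded / mb:.1f} / {total / mb:.1f} MB ({percent:.0f}%)"


print(format_progress(52_428_800, 104_857_600))  # 50.0 / 100.0 MB (50%)
```

It can then be passed as `on_download_progress=lambda downloaded, total: print(format_progress(downloaded, total))`.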

To get a full overview of the functionality provided by NobodyWho, simply keep reading.

### Chat

As you may have noticed in the [welcome guide](./index.md), every interaction with your LLM starts by instantiating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.

## Prompts and responses

The `Chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.

```python
from nobodywho import Chat, TokenStream
chat = Chat("./model.gguf")
response: TokenStream = chat.ask("Is water wet?")
```

The return type of `ask` is a `TokenStream`.
If you want to start reading the response as soon as possible, you can just iterate over the `TokenStream`.
Each token is either an individual word or a fragment of a word.

```{.python continuation}
for token in response:
   print(token, end="", flush=True)
print("\n")
```

If you just want to get the complete response, you can call `TokenStream.completed()`.
This will block until the model is done generating its entire response.

```{.python continuation}
full_response: str = response.completed()
```

All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `Chat.ask()`, it will remember the previous messages.

## Chat history

If you want to inspect the messages inside the `Chat` object, you can use `get_chat_history`.

```{.python continuation}
msgs: list[dict] = chat.get_chat_history()
print(msgs[0]["content"]) # "Is water wet?"
```

Similarly, if you want to edit what messages are in the context, you can use `set_chat_history`:


```{.python continuation}
chat.set_chat_history([{
   "role": "user",
   "content": "What is water?",
   "assets": []
}])
```

## System prompt

A system prompt is a special message put into the chat context, which should guide the model's overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.

You can specify a system prompt when initializing a `Chat`:

```python
from nobodywho import Chat
chat = Chat("./model.gguf", system_prompt="You are a mischievous assistant!")
```

This `system_prompt` is then persisted until the chat context is `reset`.


## Context

The context is the text window which the LLM currently considers. Specifically this is the number of tokens the LLM keeps in memory for your current conversation.
As a bigger context size means more computational overhead, it makes sense to constrain it. This can be done with the `n_ctx` setting, again at the time of creation:

```python
from nobodywho import Chat
chat = Chat("./model.gguf", n_ctx=4096)
```

The default value is `4096`; however, this is mainly useful for short and simple conversations. Choosing the right context size is quite important and depends heavily on your use case. A good place to start is to look at your selected model's documentation and see what its recommended context size is.

Even with properly selected context size it might happen that you fill up your entire context during a conversation. When this happens, NobodyWho will shrink the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history, until the size reaches `n_ctx / 2`. The KV cache is also updated automatically. In the future we plan on adding more advanced methods of context shrinking.
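The shrinking strategy described above can be sketched in plain Python. This is only an illustration under loud assumptions: tokens are approximated by counting words, the role names `system`, `user` and `assistant` are assumed, and `shrink_history` is a hypothetical function. NobodyWho's real implementation works on actual token counts and also updates the KV cache.

```python
def count_tokens(message: dict) -> int:
    # Crude stand-in: one token per whitespace-separated word.
    return len(message["content"].split())


def shrink_history(history: list[dict], n_ctx: int) -> list[dict]:
    """Drop the oldest removable messages until the history fits in n_ctx / 2 tokens."""
    first_user = next((m for m in history if m["role"] == "user"), None)

    def is_protected(message: dict) -> bool:
        # The system prompt and the first user message are never removed.
        return message["role"] == "system" or message is first_user

    kept = list(history)
    target = n_ctx // 2
    while sum(count_tokens(m) for m in kept) > target:
        index = next((i for i, m in enumerate(kept) if not is_protected(m)), None)
        if index is None:
            break  # Only protected messages are left.
        del kept[index]
    return kept


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Is water wet?"},
    {"role": "assistant", "content": "Yes, in most senses."},
    {"role": "user", "content": "What is water?"},
    {"role": "assistant", "content": "A molecule called H2O."},
]

print([m["content"] for m in shrink_history(history, n_ctx=20)])
# ['You are helpful.', 'Is water wet?', 'A molecule called H2O.']
```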

Again, `n_ctx` is fixed to the `Chat` instance, so it is currently not possible to change the size after `Chat` is created. To reset the current context content, just call `.reset()` with the new system prompt and potentially changed tools.

```{.python continuation}
chat.reset(system_prompt="New system prompt", tools=[])
```

If you don't want to change the already set defaults (`system_prompt`, `tools`), but only reset the context, then go for `reset_history`.

## Sharing model between contexts

There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. With plain `Chat` this is not possible.

For this use case, instead of the path to the `.gguf` model, you can pass in a `Model` object, which can be shared between multiple `Chat` instances.

```python
from nobodywho import Chat, Model

model = Model('./model.gguf')
chat1 = Chat(model)
chat2 = Chat(model)
...
```

NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.

## Asynchronous model loading

Loading a model into memory can take a few seconds - longer if you're using a really large model.

If you want to load the model without blocking execution of your application (e.g. to keep UI responsive), you can load the model asynchronously:


```python
import asyncio
from nobodywho import ChatAsync, Model

async def main():
   model = await Model.load_model_async("./model.gguf")
   chat = ChatAsync(model)

asyncio.run(main())
```

## GPU
Instantiating a `Model` is also useful for enabling GPU acceleration:
```python
from nobodywho import Model
Model('./model.gguf', use_gpu_if_available=True)
```
So far, NobodyWho relies purely on [Vulkan](https://www.vulkan.org); support
for more architectures is planned (for details, check out our [issues](https://github.com/nobodywho-ooo/nobodywho/issues) or join us on [Discord](https://discord.gg/qhaMc2qCYB)).

## Template Variables

Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.

### Using Template Variables

You can set template variables when creating a chat or modify them on existing instances:

```python
from nobodywho import Chat
# Set template variables when creating a chat
chat = Chat("./model.gguf", template_variables={"enable_thinking": True})
```

You can also modify template variables on an existing chat instance:

```{.python continuation}
# Set a single template variable
chat.set_template_variable("enable_thinking", True)

# Set multiple template variables at once
chat.set_template_variables({
    "enable_thinking": True,
    "verbose_mode": False
})

# Get current template variables
variables = chat.get_template_variables()
print(variables)  # {"enable_thinking": True, "verbose_mode": False}
```

With the next message sent, the updated settings will be propagated to the model.

### Example: Qwen3 and Qwen3.5 Reasoning

The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:

```python
from nobodywho import Chat
# Enable thinking mode for Qwen models
chat = Chat("./model.gguf", template_variables={"enable_thinking": True})
chat.ask("Solve this logic puzzle: ...")
```

When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.

### Model-Specific Variables

Different models may support different template variables depending on their chat template implementation. The available variables and their effects depend entirely on how the model's chat template is designed. Check your model's documentation to see which template variables are supported.

!!! info ""
    Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.

### Backward Compatibility

For backward compatibility, the deprecated `allow_thinking` parameter is still available but internally sets the `enable_thinking` template variable:

```python
from nobodywho import Chat
# Deprecated - use template_variables instead
chat = Chat("./model.gguf", allow_thinking=True)
chat.set_allow_thinking(True)
```

### Tool Calling

To give your LLM the ability to interact with the outside world, you will need tool calling.

!!! info ""
    Note that **not every model** supports tool calling. If the model does not have
    such an option, it might not call your tools.
    For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/Qwen/models) family of models.

## Declaring a tool

A tool can be created from any (synchronous) Python function that returns a string.
To perform the conversion, you simply need to use the `@tool` decorator. To get
a good sense of what such a tool can look like, consider this geometry example:

```python
import math
from nobodywho import tool, Chat

@tool(description="Calculates the area of a circle given its radius")
def circle_area(radius: float) -> str:
    area = math.pi * radius ** 2
    return f"Circle with radius {radius} has area {area:.2f}"
```

As you can see, every `@tool` definition has to be complemented by a description
of what the tool does. To let your LLM use it, simply add it when creating a `Chat`:

```{.python continuation}
chat = Chat('./model.gguf', tools=[circle_area])
```

NobodyWho then figures out the right tool calling format, inspects the names and types of the parameters,
and configures the sampler.

Naturally, more tools can be defined and the model can chain the calls for them:

```python
import os
from pathlib import Path
from nobodywho import Chat, tool

@tool(description="Gets path of the current directory")
def get_current_dir() -> str:
    return os.getcwd()

@tool(description="Lists files in the given directory", params={"path": "a relative or absolute path to a directory"})  
def list_files(path: str) -> str:
    files = [f.name for f in Path(path).iterdir() if f.is_file()]
    return f"Files: {', '.join(files)}"

@tool(description="Gets the size of a file in bytes")
def get_file_size(filepath: str) -> str:
    size = Path(filepath).stat().st_size
    return f"File size: {size} bytes"

chat = Chat('./model.gguf', tools=[get_current_dir, list_files, get_file_size])
response = chat.ask('What is the biggest file in my current directory?').completed()
print(response) # The largest file in your current directory is `model.gguf`.
```

## Providing parameter descriptions

When a tool is declared, its description, parameter names, and parameter types are provided to the model, so it knows that the tool exists and how to call it.

If that is not enough, you can provide additional information through the `params` parameter:
```python
from nobodywho import tool
@tool(
    description="Given a longitude and latitude, gets the current temperature.",
    params={
        "lon": "Longitude - that is the vertical one!",
        "lat": "Latitude - that is the horizontal one!"
    }
)
def get_current_temperature(lon: str, lat: str) -> str:
    ...
```
These descriptions are then appended to the information provided to the model, so it can better orient itself
when using the tool.

## Pre-packaged tools
We ship NobodyWho with two built-in tools, which are general enough for multiple use cases: the [monty](https://github.com/pydantic/monty) Python interpreter
and the [bashkit](https://github.com/everruns/bashkit) Bash interpreter. Both serve a similar purpose: to give your small LLM a better chance of answering
questions that require precise reasoning or some kind of computation, possibly on a big context.

The usage is straightforward. Start with importing either `python_tool` or `bash_tool` from `nobodywho`.


```python
from nobodywho import Chat, python_tool, bash_tool

chat = Chat('./model.gguf', tools=[python_tool(), bash_tool()])
```

Lastly, keep in mind that for most use cases it is reasonable to constrain the tools with limits on memory and computation time,
so that you don't end up executing code that loops forever. For this purpose, `python_tool` provides `max_duration`, `max_memory` and `max_recursion_depth`,
and `bash_tool` provides `max_commands`.


## Tool calling and the context

As with everything made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.

### Vision & Hearing

A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image information to your LLM.

## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts to make this work:

1. A multimodal LLM, so the LLM can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens

To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model then includes `mmproj` in its name.

If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.

With the downloaded GGUFs, you can simply add the projection model as:

<!-- not tested: requires audio-capable mmproj (Gemma 4), CI uses Gemma 3 -->
```python
from nobodywho import Model, Chat

model = Model("./vision-model.gguf", projection_model_path="./projection_model.gguf")
chat = Chat(
    model, system_prompt="You are a helpful assistant, that can hear and see stuff!"
)
```

!!! info ""
    The language model and projection model have to **fit** together, as they are trained together!
    Unfortunately, you can't just take a projection model and an LLM that you like and expect them
    to work together.

## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through the `Prompt` object.
```{.python continuation}
from nobodywho import Audio, Image, Prompt, Text

prompt = Prompt([
    Text("Tell me what you see in the image and what you hear in the audio."),
    Image("./dog.png"),
    Audio("./sound.mp3")
])

chat.ask(prompt).completed() # It's a dog and a penguin!
```

## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try to mess around with the order of supplying the text
and the multimodal files, or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.

```{.python continuation}
prompt = Prompt([
    Text("Tell me what you see in the image."),
    Image("./dog.png"),
    Text("Also tell me what you hear in the audio"),
    Audio("./sound.mp3")
])
```

Also, there is still a lot of variance between how the models internally process the images.
This, for example, causes differences in how quickly the model consumes context - for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, they scale with the size of the image. In that case, you can increase the context size if the resources allow:
```{.python continuation}
chat = Chat(
    model, system_prompt="You are a helpful assistant.", n_ctx=8192
)
```
Or, for example, preprocess your images with some kind of compression (sometimes even changing the image type helps).

Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file: for Gemma 4,
ingesting audio works best on BF16, while other types reportedly struggle. We thus recommend trying out different
projection model files if the one you picked does not work.

As always with more niche models you can find bugs. If you stumble upon some of them, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix the functionality.

### Sampling

The model does not produce tokens directly, but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from the distribution. This is the job of a **sampler**, which NobodyWho lets you freely modify
to achieve better quality outputs or constrain the outputs to some known format (e.g. JSON).

## Sampler presets

To get a quick start, NobodyWho offers a couple of well-known presets, which you can quickly utilize.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:
```python
from nobodywho import Chat, SamplerPresets

Chat("./model.gguf", sampler=SamplerPresets.temperature(0.2))
```
Setting `temperature` to `0.2` makes the distribution less flat, so the model favours more probable tokens when choosing the next token.

To see the whole list of presets, check out the `SamplerPresets` class:
```python
class SamplerPresets:
    def default() -> SamplerConfig: ...
    def dry() -> SamplerConfig: ...
    def grammar(grammar: str) -> SamplerConfig: ...
    def greedy() -> SamplerConfig: ...
    def json() -> SamplerConfig: ...
    def temperature(temperature: float) -> SamplerConfig: ...
    def top_k(top_k: int) -> SamplerConfig: ...
    def top_p(top_p: float) -> SamplerConfig: ...
    ...
```

## Structured output

One of the most useful presets is the one for generating structured output,
such as JSON. This way, you don't have to rely on your model being clever enough to
generate syntactically valid JSON; instead, the output is guaranteed to be
syntactically valid. For plain JSON, it suffices to:
```python
from nobodywho import Chat, SamplerPresets
Chat('./model.gguf', sampler=SamplerPresets.json())
```

Still, you might have more advanced needs, such as generating CSVs or JSON with some specific keys. This can be supported by creating custom grammars, such as this one for CSV:
```python
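from nobodywho import SamplerPresets

# Raw string, so that backslash escapes such as \" and \n reach the grammar parser unchanged
sampler = SamplerPresets.grammar(r"""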
    file ::= record (newline record)* newline?
    record ::= field ("," field)*
    field ::= quoted_field | unquoted_field
    unquoted_field ::= unquoted_char*
    unquoted_char ::= [^,"\n\r]
    quoted_field ::= "\"" quoted_char* "\""
    quoted_char ::= [^"] | "\"\""
    newline ::= "\r\n" | "\n"
""")
```
The format that NobodyWho utilizes is called GBNF, which is a Llama.cpp native format.
See the [GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md).


## Defining your own samplers

Sampler presets abstract away some control that you might want, for example if you
want to chain samplers or change more "advanced" parameters. For that use case,
we provide the `SamplerBuilder` class:
```python
from nobodywho import Chat, SamplerBuilder

Chat(
    "./model.gguf",
    sampler=SamplerBuilder()
        .temperature(0.8)
        .top_k(5)
        .dist()
)
```
With `SamplerBuilder` you can chain multiple steps together and then select how you
want to sample from the distribution. Keep in mind that `SamplerBuilder` provides two
types of methods: ones that modify the distribution (returning the `SamplerBuilder`
instance again) and ones that sample from the distribution (returning a `SamplerConfig`).
So, to have the sampler work properly and avoid type errors, be careful
to always end the chain with one of the sampling steps (e.g. `dist`, `greedy`, `mirostat_v2`, etc.).
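The split between chainable and terminal methods is a common builder pattern. The toy classes below (`ToySamplerBuilder` and `ToySamplerConfig`, both hypothetical) only mimic the shape of that pattern; they say nothing about how NobodyWho implements its builder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToySamplerConfig:
    steps: tuple[str, ...]


@dataclass
class ToySamplerBuilder:
    steps: list[str] = field(default_factory=list)

    def temperature(self, value: float) -> "ToySamplerBuilder":
        # Shifting step: record it and return the builder so calls can be chained.
        self.steps.append(f"temperature({value})")
        return self

    def top_k(self, k: int) -> "ToySamplerBuilder":
        # Another shifting step, chainable in the same way.
        self.steps.append(f"top_k({k})")
        return self

    def dist(self) -> ToySamplerConfig:
        # Sampling step: ends the chain and returns a finished config.
        return ToySamplerConfig(tuple(self.steps + ["dist"]))


config = ToySamplerBuilder().temperature(0.8).top_k(5).dist()
print(config.steps)  # ('temperature(0.8)', 'top_k(5)', 'dist')
```

Because `dist()` returns a config rather than a builder, nothing can be chained after it, which is why the sampling step always comes last.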

### Streaming & Async API

Synchronously waiting for the full response to arrive can be costly. If your application
domain allows you to, you would ideally want to stream the tokens to the user as soon
as they arrive, and spend the time in between doing useful work, rather than just waiting.

## Streaming tokens

Allowing streaming is super simple. Instead of calling the `.completed()` method, just iterate
over the response object:
```python
from nobodywho import Chat
chat = Chat('./model.gguf')
response = chat.ask('How are you?')
for token in response:
    print(token, end="", flush=True)
```

Still, bear in mind that for the individual tokens, you are waiting synchronously.

## Async API

If you don't want to wait synchronously, swap out the `Chat` object for a `ChatAsync`. The rest of the API stays the same, so you can either wait for the full, completed message:
```python
import asyncio
from nobodywho import ChatAsync

async def main():
    chat = ChatAsync('./model.gguf')
    response = await chat.ask('How are you?').completed()
    print(response)

asyncio.run(main())
```
Or again stream tokens:
```python
import asyncio
from nobodywho import ChatAsync

async def main():
    chat = ChatAsync('./model.gguf')
    response = chat.ask('How are you?')
    async for token in response:
        print(token, end="", flush=True)

asyncio.run(main())
```
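
The benefit of the async API shows up when something else runs while tokens are being generated. The sketch below illustrates this with plain `asyncio`. Note that `fake_token_stream` is a hypothetical stand-in for the stream returned by `ChatAsync.ask`, so the example runs without a model.

```python
import asyncio


async def fake_token_stream(text: str):
    # Hypothetical stand-in for a streamed response: yields one token at a time.
    for token in text.split(" "):
        await asyncio.sleep(0.01)  # simulate generation time
        yield token + " "


async def collect_response(events: list[str]) -> str:
    # Consume the stream token by token and assemble the full response.
    parts = []
    async for token in fake_token_stream("I am doing fine"):
        events.append("token")
        parts.append(token)
    return "".join(parts).strip()


async def other_work(events: list[str]) -> int:
    # Any other coroutine can make progress while tokens are being awaited.
    for _ in range(3):
        await asyncio.sleep(0.005)
        events.append("work")
    return 3


async def run_both() -> tuple[str, int, list[str]]:
    events: list[str] = []
    # Run token consumption and the unrelated work concurrently.
    response, ticks = await asyncio.gather(collect_response(events), other_work(events))
    return response, ticks, events


response, ticks, events = asyncio.run(run_both())
print(response, ticks)
```

With a real `ChatAsync`, the `async for` loop over `chat.ask(...)` would take the place of `fake_token_stream`.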

Similarly, the other model types we support also implement async behaviour, so
you can use `EncoderAsync` and `CrossEncoderAsync`, which are
both part of the [embeddings & RAG functionality](./embeddings-and-rag.md).

### Embeddings & RAG

When you want your LLM to search through documents, understand semantic similarity, or build retrieval-augmented generation (RAG) systems, you'll need embeddings and cross-encoders.

## Understanding Embeddings

Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Texts with similar meanings have similar vectors, even if they use different words.

For example, "Schedule a meeting for next Tuesday" and "Book an appointment next week" would have very similar embeddings, despite using different words.

## The Encoder

The `Encoder` object converts text into embedding vectors. You'll need a specialized embedding model (different from chat models).

We recommend you first try [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).

```python
from nobodywho import Encoder

encoder = Encoder('./embedding-model.gguf')
embedding = encoder.encode("What is the weather like?")
print(f"Vector with {len(embedding)} dimensions")
```

The resulting embedding is a list of floats (typically 384 or 768 dimensions depending on the model).

### Comparing Embeddings

To measure how similar two pieces of text are, compare their embeddings using cosine similarity:

```python
from nobodywho import Encoder, cosine_similarity

encoder = Encoder('./embedding-model.gguf')

query = encoder.encode("How do I reset my password?")
doc1 = encoder.encode("You can reset your password in the account settings")
doc2 = encoder.encode("The password requirements include 8 characters minimum")

similarity1 = cosine_similarity(query, doc1)
similarity2 = cosine_similarity(query, doc2)

print(f"Document 1 similarity: {similarity1:.3f}")  # Higher score
print(f"Document 2 similarity: {similarity2:.3f}")  # Lower score
```

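Cosine similarity returns a value between -1 and 1. Values close to 1 mean the two texts are very similar in meaning, values around 0 mean they are unrelated, and negative values mean the vectors point in opposite directions.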
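
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch of that formula follows. It is meant to show the arithmetic, not how NobodyWho's `cosine_similarity` is implemented, and the helper name is made up for this example.

```python
import math


def cosine_similarity_sketch(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (|a| * |b|)
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way, at a right angle, and in opposite directions.
same_direction = cosine_similarity_sketch([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_sketch([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because the result depends only on the angle between the vectors, scaling a vector (as in the first call) does not change the score.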

### Practical Example: Finding Relevant Documents

```python
from nobodywho import Encoder, cosine_similarity

encoder = Encoder('./embedding-model.gguf')

# Your knowledge base
documents = [
    "Python supports multiple programming paradigms including object-oriented and functional",
    "JavaScript is primarily used for web development and runs in browsers",
    "SQL is a domain-specific language for managing relational databases",
    "Git is a version control system for tracking changes in source code"
]

# Pre-compute document embeddings
doc_embeddings = [encoder.encode(doc) for doc in documents]

# Search query
query = "What language should I use for database queries?"
query_embedding = encoder.encode(query)

# Find the most relevant document
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_idx = similarities.index(max(similarities))

print(f"Most relevant: {documents[best_idx]}")
print(f"Similarity score: {similarities[best_idx]:.3f}")
```
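
When more than one result is needed, the same idea extends to a top-k search. The following is a minimal sketch that uses hand-written toy vectors in place of real embeddings, so it runs without a model; the `top_k` helper and the vector values are hypothetical.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[tuple[int, float]]:
    # Return (document index, score) pairs for the k most similar documents, best first.
    scored = [(index, cosine(query_vec, vec)) for index, vec in enumerate(doc_vecs)]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors standing in for precomputed document embeddings.
toy_doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.9],
    [0.7, 0.6, 0.1],
]
toy_query_vector = [1.0, 0.2, 0.0]

results = top_k(toy_query_vector, toy_doc_vectors, k=2)
for index, score in results:
    print(f"document {index}: {score:.3f}")
```

Returning indices rather than the texts themselves keeps the helper independent of how the documents are stored.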

## The CrossEncoder for Better Ranking

While embeddings work well for initial filtering, cross-encoders provide more accurate relevance scoring. They directly compare a query against documents to determine how well the document answers the query.

The key difference: embeddings compare vector similarity, while cross-encoders understand the relationship between query and document.

### Why CrossEncoder Matters

Consider this example:

```
Query: "What are the office hours for customer support?"
Documents: [
    "Customer asked: What are the office hours for customer support?",
    "Support team responds: Our customer support is available Monday-Friday 9am-5pm EST",
    "Note: Weekend support is not available at this time"
]
```

Using embeddings alone, the first document scores highest (most similar to the query) even though it provides no useful information. A cross-encoder correctly identifies that the second document actually answers the question.

### Using CrossEncoder

```python
from nobodywho import CrossEncoder

# Download a reranking model like bge-reranker-v2-m3-Q8_0.gguf
crossencoder = CrossEncoder('./reranker-model.gguf')

query = "How do I install Python packages?"
documents = [
    "Someone previously asked about Python packages",
    "Use pip install package-name to install Python packages",
    "Python packages are not included in the standard library"
]

# Get relevance scores for each document
scores = crossencoder.rank(query, documents)
print(scores)  # e.g. [0.23, 0.89, 0.45] - second doc scores highest
```

### Automatic Sorting

For convenience, use `rank_and_sort` to get documents sorted by relevance:

```{.python continuation}
# Returns list of (document, score) tuples, sorted by score
ranked_docs = crossencoder.rank_and_sort(query, documents)

for doc, score in ranked_docs:
    print(f"[{score:.3f}] {doc}")
```

This returns documents ordered from most to least relevant.
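
Conceptually, this is the same as pairing each document with its score and sorting the pairs. A minimal sketch, assuming (as in the `rank` example above) that scores come back in the same order as the input documents. The scores below are the illustrative values from that example, not real model output, and `sort_by_score` is a hypothetical helper.

```python
def sort_by_score(documents: list[str], scores: list[float]) -> list[tuple[str, float]]:
    # Pair every document with its score, then order from most to least relevant.
    if len(documents) != len(scores):
        raise ValueError("expected exactly one score per document")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


documents = [
    "Someone previously asked about Python packages",
    "Use pip install package-name to install Python packages",
    "Python packages are not included in the standard library",
]
scores = [0.23, 0.89, 0.45]  # illustrative values only

ranked = sort_by_score(documents, scores)
for doc, score in ranked:
    print(f"[{score:.3f}] {doc}")
```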

## Building a RAG System

Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The LLM uses retrieved documents to ground its responses in your knowledge base.

Here's a complete example building a customer service assistant with access to company policies:

```python
from nobodywho import Chat, CrossEncoder, tool

# Initialize the cross-encoder for document ranking
crossencoder = CrossEncoder('./reranker-model.gguf')

# Your knowledge base
knowledge = [
    "Our company offers a 30-day return policy for all products",
    "Free shipping is available on orders over $50",
    "Customer support is available via email and phone",
    "We accept credit cards, PayPal, and bank transfers",
    "Order tracking is available through your account dashboard"
]

# Create a tool that searches the knowledge base
@tool(description="Search the knowledge base for relevant information")
def search_knowledge(query: str) -> str:
    # Rank all documents by relevance to the query
    ranked = crossencoder.rank_and_sort(query, knowledge)
    
    # Return top 3 most relevant documents
    top_docs = [doc for doc, score in ranked[:3]]
    return "\n".join(top_docs)

# Create a chat with access to the knowledge base
chat = Chat(
    './model.gguf',
    system_prompt="You are a customer service assistant. Use the search_knowledge tool to find relevant information from our policies before answering customer questions.",
    tools=[search_knowledge]
)

# The chat will automatically search the knowledge base when needed
response = chat.ask("What is your return policy?").completed()
print(response)
```

The LLM will call the `search_knowledge` tool, receive the most relevant documents, and use them to generate an accurate answer.

## Async API

For non-blocking operations, use `EncoderAsync` and `CrossEncoderAsync`:

```python
import asyncio
from nobodywho import EncoderAsync, CrossEncoderAsync

async def main():
    encoder = EncoderAsync('./embedding-model.gguf')
    crossencoder = CrossEncoderAsync('./reranker-model.gguf')
    
    # Generate embeddings asynchronously
    embedding = await encoder.encode("What is the weather?")
    
    # Rank documents asynchronously
    query = "What is our refund policy?"
    docs = ["Refunds processed within 5-7 business days", "No refunds on sale items", "Contact support to initiate refund"]
    ranked = await crossencoder.rank_and_sort(query, docs)
    
    for doc, score in ranked:
        print(f"[{score:.3f}] {doc}")

asyncio.run(main())
```


## Recommended Models

### For Embeddings
- [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf) - Good balance of speed and quality (~25MB)
- Supports English text with 384-dimensional embeddings

### For Cross-Encoding (Reranking)
- [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - Multilingual support with excellent accuracy

## Best Practices

**Precompute embeddings**: If you have a fixed knowledge base, generate embeddings once and reuse them. Don't re-encode the same documents repeatedly.

**Use embeddings for filtering**: When working with large document collections (1000+ documents), use embeddings to narrow down to the top 50-100 candidates, then use a cross-encoder to rerank them.

**Limit cross-encoder inputs**: Cross-encoders are more expensive than embeddings. Don't pass thousands of documents to `rank()` - filter first with embeddings.

**Choose appropriate context size**: The `n_ctx` parameter (default 2048) should match your model's recommended context size. Check the model documentation.

```python
# For longer documents, increase context size
encoder = Encoder('./embedding-model.gguf', n_ctx=4096)
crossencoder = CrossEncoder('./reranker-model.gguf', n_ctx=4096)
```
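
As a sketch of the first recommendation (precomputing embeddings), vectors can be cached on disk so that each document is encoded only once. The example uses a hypothetical `fake_encode` function in place of `encoder.encode` so that it runs without a model. The cache layout, a JSON file keyed by a hash of the text, is just one possible choice.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def fake_encode(text: str) -> list[float]:
    # Hypothetical stand-in for encoder.encode(text).
    return [float(len(text)), float(text.count(" "))]


def encode_with_cache(
    texts: list[str],
    encode: Callable[[str], list[float]],
    cache_path: Path,
) -> list[list[float]]:
    # Load previously stored vectors, if the cache file exists.
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    vectors = []
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = encode(text)  # only texts not seen before are encoded
        vectors.append(cache[key])
    cache_path.write_text(json.dumps(cache))
    return vectors


calls: list[str] = []


def counting_encode(text: str) -> list[float]:
    # Records every real encode call so the effect of the cache is visible.
    calls.append(text)
    return fake_encode(text)


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "embeddings.json"
    docs = ["Free shipping over $50", "30-day return policy"]
    first = encode_with_cache(docs, counting_encode, cache_file)
    second = encode_with_cache(docs, counting_encode, cache_file)

print(len(calls))
```

A cache like this is only valid for the embedding model that produced it, so it should be discarded when the model file changes.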

## Complete RAG Example

Here's a full example showing a two-stage retrieval system:

```python
from nobodywho import Chat, Encoder, CrossEncoder, cosine_similarity, tool

# Initialize models
encoder = Encoder('./embedding-model.gguf')
crossencoder = CrossEncoder('./reranker-model.gguf')

# Large knowledge base
knowledge_base = [
    "Python 3.11 introduced performance improvements through faster CPython",
    "The Django framework is used for building web applications",
    "NumPy provides support for large multi-dimensional arrays",
    "Pandas is the standard library for data manipulation and analysis",
    # ... 100+ more documents
]

# Precompute embeddings for all documents
doc_embeddings = [encoder.encode(doc) for doc in knowledge_base]

@tool(description="Search the knowledge base for information relevant to the query")
def search(query: str) -> str:
    # Stage 1: Fast filtering with embeddings
    query_embedding = encoder.encode(query)
    similarities = [
        (doc, cosine_similarity(query_embedding, doc_emb))
        for doc, doc_emb in zip(knowledge_base, doc_embeddings)
    ]
    # Get top 20 candidates
    candidates = sorted(similarities, key=lambda x: x[1], reverse=True)[:20]
    candidate_docs = [doc for doc, _ in candidates]
    
    # Stage 2: Precise ranking with cross-encoder
    ranked = crossencoder.rank_and_sort(query, candidate_docs)
    
    # Return top 3 most relevant
    top_results = [doc for doc, score in ranked[:3]]
    return "\n---\n".join(top_results)

# Create RAG-enabled chat
chat = Chat(
    './model.gguf',
    system_prompt="You are a technical documentation assistant. Always use the search tool to find relevant information before answering programming questions.",
    tools=[search]
)

# The chat automatically searches and uses retrieved documents
response = chat.ask("What Python libraries are best for data analysis?").completed()
print(response)
```

This two-stage approach combines the speed of embeddings with the accuracy of cross-encoders, making it efficient even for large knowledge bases.

### Logging

The Python bindings for NobodyWho integrate with Python's standard `logging` utilities.

In short, to enable debug logs:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

This can be useful for getting some insight into what the model is choosing to do and when,
for example when tool calls are made or when context shifting happens.
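
If global debug output is too noisy, the standard `logging` module also lets you lower the threshold for a single logger only. The sketch below assumes the bindings log under a logger named `nobodywho`. That name is an assumption and is not confirmed by this page, so check the logger names that appear in your actual output.

```python
import logging

# Keep everything else at WARNING, and print each record's logger name so that
# the right logger is easy to identify.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# ASSUMPTION: "nobodywho" is a hypothetical logger name used for illustration.
nobodywho_logger = logging.getLogger("nobodywho")
nobodywho_logger.setLevel(logging.DEBUG)
```

Only `basicConfig`, `getLogger`, and `setLevel` from the standard library are used here; nothing in the snippet is specific to NobodyWho apart from the assumed logger name.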

## React Native

### Getting started with React Native

## How do I get started?

First, install `react-native-nobodywho`.
```bash
npm install react-native-nobodywho
```

No additional initialization step is required — the native module is loaded automatically when you first import from the package.

Now you are ready to download a GGUF model you like - if you don't have a specific model in mind, try [this one](https://huggingface.co/NobodyWho/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf). Read more about [model selection](../model-selection.md).

Once you have the `.gguf` file on the device, the next step is to create a `Chat` and call `.ask`!

```typescript
import { Chat } from "react-native-nobodywho";

const chat = await Chat.fromPath({ modelPath: "/path/to/model.gguf" });
const response = await chat.ask("Is water wet?").completed();
console.log(response); // Yes, indeed, water is wet!
```

This is a super simple example, but we believe that examples which do simple things should be simple!

## Tracking download progress

When loading a remote model (e.g. via a `huggingface:` or `https://` path), pass an `onDownloadProgress` option to observe the download. It receives `(downloaded, total)` byte counts, is throttled to roughly 10 Hz with a guaranteed final emit on completion, and is not called for cached or local files.

```typescript
const chat = await Chat.fromPath({
  modelPath: "huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf",
  onDownloadProgress: (downloaded, total) => {
    console.log(`${downloaded} / ${total} bytes`);
  },
});
```

To get a full overview of the functionality provided by NobodyWho, simply keep reading.

## Android requirements

If you use the x86_64 Android emulator for development, your app must set `minSdkVersion` to at least 31. This is due to a threading feature (ELF TLS) that the Rust runtime requires on x86_64. ARM64 devices (i.e. all real phones) work with any `minSdkVersion`.

No specific NDK version is required — NobodyWho ships prebuilt shared libraries, so your project's NDK version does not affect the Rust code.

## Minimum recommended specs

- iOS: iPhone 11 or newer with at least 4 GB of RAM. We tested a Qwen3 0.6B (332 MB) on an iPhone X (iOS 16) and while it ran, performance was too slow to be practical.
- Android: Snapdragon 855 / Adreno 640 / 6 GB RAM or better. The same Qwen3 0.6B model performed notably better on a OnePlus 7 Pro (Android 12) than on the iPhone X tested above.

## Feedback & Contributions

We welcome your feedback and ideas!

- Bug Reports & Improvements: If you encounter a bug or have suggestions, please open an issue on our [Issues](https://github.com/nobodywho-ooo/nobodywho/issues) page.
- Feature Requests & Questions: For new feature requests or general questions, join the discussion on our [Discussions](https://github.com/nobodywho-ooo/nobodywho/discussions) page.

### Chat

As you may have noticed in the [welcome guide](./index.md), every interaction with your LLM starts by creating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.

## Creating a Chat

There are two main ways of creating a `Chat` object and the difference lies in when the model file is loaded.
The simplest way is using `Chat.fromPath`:

```typescript
import { Chat } from "react-native-nobodywho";

const chat = await Chat.fromPath({ modelPath: "/path/to/model.gguf" });
```

This function is async since loading a model can take a bit of time; because it is awaited, it should not block your UI.
Another way to achieve the same thing is to load the model separately and then use the `Chat` constructor:

```typescript
import { Model, Chat } from "react-native-nobodywho";

const model = await Model.load({ modelPath: "/path/to/model.gguf" });
const chat = new Chat({ model });
```

This allows for sharing the model between several `Chat` instances.

## Prompts and responses

The `chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.

```typescript
const chat = await Chat.fromPath({ modelPath: "/path/to/model.gguf" });
const response = chat.ask("Is water wet?");
```

The return type of `ask` is a `TokenStream`.
If you want to start reading the response as soon as possible, you can iterate over the `TokenStream` using `for await`.
Each token is either an individual word or a fragment of a word.

```typescript
for await (const token of response) {
  console.log(token);
}
```

If you just want to get the complete response, you can call `completed()`.
This will return the entire response string once the model is done generating.

```typescript
const fullResponse = await response.completed();
```

All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `chat.ask()`, it will remember the previous messages.

## Chat history

If you want to inspect the messages inside the `Chat` object, you can use `getChatHistory`.

```typescript
const msgs = await chat.getChatHistory();
console.log(msgs[0]); // The first message
```

Similarly, if you want to edit what messages are in the context, you can use `setChatHistory`:

```typescript
await chat.setChatHistory([
  { role: "user", content: "What is water?" },
]);
```

## System prompt

A system prompt is a special message placed in the chat context to guide the model's overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.

You can specify a system prompt when creating a `Chat`:

```typescript
import { Chat } from "react-native-nobodywho";

const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  systemPrompt: "You are a mischievous assistant!",
});
```

This `systemPrompt` is then persisted until the chat context is reset.

## Context

The context is the text window which the LLM currently considers. Specifically this is the number of tokens the LLM keeps in memory for your current conversation.
A bigger context size means more computational overhead, so it makes sense to constrain it. This can be done with the `contextSize` setting at creation time:

```typescript
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  contextSize: 4096,
});
```

The default value is `4096`, which is mainly suitable for short and simple conversations. Choosing the right context size is quite important and depends heavily on your use case. A good place to start is to look at your selected model's documentation and see what context size it recommends.

Even with a properly selected context size it might happen that you fill up your entire context during a conversation. When this happens, NobodyWho will shrink the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history, until the size reaches `contextSize / 2`. The KV cache is also updated automatically.
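
The trimming strategy described above can be sketched in a few lines. The example below is a language-agnostic illustration written in Python, not NobodyWho's actual implementation: it counts tokens by splitting on whitespace instead of using a real tokenizer, and the function names are hypothetical.

```python
def count_tokens(message: dict) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(message["content"].split())


def shrink_history(messages: list[dict], context_size: int) -> list[dict]:
    # Protect the system prompt and the first user message, then drop the oldest
    # remaining messages until the total is at most context_size / 2.
    target = context_size // 2
    protected = {i for i, m in enumerate(messages) if m["role"] == "system"}
    first_user = next((i for i, m in enumerate(messages) if m["role"] == "user"), None)
    if first_user is not None:
        protected.add(first_user)

    keep = [True] * len(messages)
    total = sum(count_tokens(m) for m in messages)
    for i, message in enumerate(messages):  # oldest messages come first
        if total <= target:
            break
        if i in protected:
            continue
        keep[i] = False
        total -= count_tokens(message)
    return [m for m, kept in zip(messages, keep) if kept]


history = [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "What is water"},
    {"role": "assistant", "content": "Water is a clear liquid"},
    {"role": "user", "content": "Is it wet"},
    {"role": "assistant", "content": "Yes"},
    {"role": "user", "content": "Thanks"},
]

# The history holds 16 "tokens", which fills a context of size 16 completely.
trimmed = shrink_history(history, context_size=16)
print([m["content"] for m in trimmed])
```

In this toy run, the two oldest unprotected messages are dropped, which brings the history down to half of the context size.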

To reset the current context content, call `resetContext()` with a new system prompt and potentially changed tools.

```typescript
await chat.resetContext({ systemPrompt: "New system prompt", tools: [] });
```

If you don't want to change the already set defaults (`systemPrompt`, `tools`), but only reset the context, then go for `resetHistory`.

## Sharing model between contexts

There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. In this case you must load the model separately from creating the `Chat` instance.

```typescript
import { Model, Chat } from "react-native-nobodywho";

const model = await Model.load({ modelPath: "/path/to/model.gguf" });
const chat1 = new Chat({ model });
const chat2 = new Chat({ model });
```

NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.

## GPU

When using `Model.load` or `Chat.fromPath` you have the option to disable/enable GPU acceleration:

```typescript
const model = await Model.load({ modelPath: "/path/to/model.gguf", useGpu: false });
```

or

```typescript
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  useGpu: false,
});
```

By default `useGpu` is set to `true`.

## Template Variables

Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.

### Using Template Variables

You can set template variables when creating a chat or modify them on existing instances:

```typescript
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  templateVariables: { enable_thinking: true },
});
```

You can also modify template variables on an existing chat instance:

```typescript
// Set a single template variable
await chat.setTemplateVariable("enable_thinking", true);

// Get current template variables
const variables = await chat.getTemplateVariables();
console.log(variables); // Map { "enable_thinking" => true }
```

With the next message sent, the updated settings will be propagated to the model.

### Example: Qwen3 and Qwen3.5 Reasoning

The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:

```typescript
const chat = await Chat.fromPath({
  modelPath: '/path/to/model.gguf',
  templateVariables: { enable_thinking: true },
});
const response = chat.ask("Solve this logic puzzle: ...");
```

When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.

### Model-Specific Variables

Different models may support different template variables depending on their chat template implementation. The available variables and their effects depend entirely on how the model's chat template is designed. Check your model's documentation to see which template variables are supported.

!!! info ""
    Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.

### Tool Calling

To give your LLM the ability to interact with the outside world, you will need tool calling.

!!! info ""
    Note that **not every model** supports tool calling. If a model lacks this
    capability, it might not call your tools.
    For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/collections/NobodyWho/qwen-3) family of models.

## Declaring a tool

A tool is created by providing a name, description, parameter schemas, and a callback function.
Any regular function can be used as a tool — arguments from the LLM are passed positionally
in the same order as the `parameters` array:

```typescript
import { Tool } from 'react-native-nobodywho';

const circleArea = (radius: number): string => {
  const area = Math.PI * radius * radius;
  return `Circle with radius ${radius} has area ${area.toFixed(2)}`;
};

const circleAreaTool = new Tool({
  name: 'circle_area',
  description: 'Calculates the area of a circle given its radius',
  parameters: [
    { name: 'radius', type: 'number', description: 'The radius of the circle' },
  ],
  call: circleArea,
});
```

Every `Tool` needs a callback function, a name, a description of what the tool does, and a `parameters` array describing its inputs. Each parameter uses [JSON Schema](https://json-schema.org/) properties (`type`, `enum`, `description`, etc.) plus a `name` field. Arguments are passed to your function positionally in array order.

To let your LLM use it, simply add it when creating `Chat`:

```typescript
import { Chat } from "react-native-nobodywho";

const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  tools: [circleAreaTool],
});
```

NobodyWho then figures out the right tool calling format and configures the sampler.

Naturally, more tools can be defined and the model can chain the calls for them:

```typescript
import { Chat, Tool } from "react-native-nobodywho";

const getCurrentDir = (): string => '/home/user/documents';

// In a real app, you'd read the filesystem here
const listFiles = (path: string): string =>
  'Files: report.pdf, notes.txt, model.gguf';

// In a real app, you'd check the actual file size
const getFileSize = (filepath: string): string => 'File size: 1024 bytes';

const getCurrentDirTool = new Tool({
  name: "get_current_dir",
  description: "Gets path of the current directory",
  parameters: [],
  call: getCurrentDir,
});

const listFilesTool = new Tool({
  name: "list_files",
  description: "Lists files in the given directory",
  parameters: [
    { name: "path", type: "string", description: "The path to the directory to list" },
  ],
  call: listFiles,
});

const getFileSizeTool = new Tool({
  name: "get_file_size",
  description: "Gets the size of a file in bytes",
  parameters: [
    { name: "filepath", type: "string", description: "The path to the file" },
  ],
  call: getFileSize,
});

const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  tools: [getCurrentDirTool, listFilesTool, getFileSizeTool],
});

const response = await chat
  .ask("What is the biggest file in my current directory?")
  .completed();
console.log(response);
```

## Tool calling and the context

As with most things made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.

### Vision

A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image and audio information to your LLM.

## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts for making this work:

1. A multimodal LLM, so the LLM can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens

To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model includes `mmproj` in its name.

If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.

With the downloaded GGUFs, you can load them using `Chat.fromPath`:

```typescript
import { Chat } from "react-native-nobodywho";

const chat = await Chat.fromPath({
  modelPath: "/path/to/vision-model.gguf",
  projectionModelPath: "/path/to/mmproj.gguf",
  systemPrompt: "You are a helpful assistant, that can hear and see stuff!",
});
```

Or load the model separately:

```typescript
import { Model, Chat } from "react-native-nobodywho";

const model = await Model.load({
  modelPath: "/path/to/vision-model.gguf",
  projectionModelPath: "/path/to/mmproj.gguf",
});
const chat = new Chat({
  model,
  systemPrompt: "You are a helpful assistant.",
});
```

## Composing a prompt object

With the model configured, all that is left is to compose the prompt and send it to the model.
Use `Prompt` to build prompts that mix text, images, and audio, then pass them to `chat.ask()`:

```typescript
import { Chat, Prompt } from "react-native-nobodywho";

const chat = await Chat.fromPath({
  modelPath: "/path/to/vision-model.gguf",
  projectionModelPath: "/path/to/mmproj.gguf",
});

const response = await chat
  .ask(
    new Prompt([
      Prompt.Text("Tell me what you see in the image and what you hear in the audio."),
      Prompt.Image("/path/to/dog.png"),
      Prompt.Audio("/path/to/sound.mp3"),
    ]),
  )
  .completed();
```

## Tips for multimodality

As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try changing the order in which you supply the text
and the multimodal files, or the descriptions you provide. For example, the following prompt may perform better than the previously presented one.

```typescript
const prompt = new Prompt([
  Prompt.Text("Tell me what you see in the image."),
  Prompt.Image("/path/to/dog.png"),
  Prompt.Text("Also tell me what you hear in the audio."),
  Prompt.Audio("/path/to/sound.mp3"),
]);
```

Also, models still vary a lot in how they process images internally.
This, for example, causes differences in how quickly the model consumes context - for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, it scales with the size of the image. If images use up too much of the context, you can increase the context size, resources permitting:

```typescript
const chat = await Chat.fromPath({
  modelPath: "/path/to/vision-model.gguf",
  projectionModelPath: "/path/to/mmproj.gguf",
  contextSize: 8192,
});
```

Or, for example, preprocess your images with some kind of downsampling (sometimes even changing the image type helps).

Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file - for Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We therefore recommend trying different
projection model files if the one you picked does not work.

As always with more niche models, you may run into bugs. If you do, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix them.

### Sampling

The model does not produce tokens directly but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from the distribution. This is the job of a **sampler**, which NobodyWho lets you freely modify to achieve better-quality outputs or to constrain the outputs to a known format (e.g. JSON).

## Sampler presets

To get a quick start, NobodyWho offers a couple of well-known presets, which you can quickly utilize.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:

```typescript
import { Chat, SamplerPresets } from "react-native-nobodywho";

const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  sampler: SamplerPresets.temperature(0.2),
});
```

Setting `temperature` to `0.2` makes the distribution less flat when the sampler chooses the next token, so the model favours more probable tokens.

To see the whole list of presets, check out the `SamplerPresets` class:

```typescript
class SamplerPresets {
  static default(): SamplerConfig;
  static dry(): SamplerConfig;
  static grammar(grammar: string): SamplerConfig;
  static greedy(): SamplerConfig;
  static json(): SamplerConfig;
  static temperature(temperature: number): SamplerConfig;
  static topK(topK: number): SamplerConfig;
  static topP(topP: number): SamplerConfig;
}
```

## Structured output

One of the most useful presets is the one for generating structured output,
such as JSON. This way, you don't have to rely on your model being clever enough to
generate syntactically valid JSON; instead, the sampler constrains the output so that
it follows the JSON syntax. For plain JSON, it suffices to:

```typescript
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  sampler: SamplerPresets.json(),
});
```

Still, you might have more advanced needs, such as generating CSVs or JSON with some specific keys. This can be supported by creating custom grammars, such as this one for CSV:

```typescript
const sampler = SamplerPresets.grammar(`
    file ::= record (newline record)* newline?
    record ::= field ("," field)*
    field ::= quoted_field | unquoted_field
    unquoted_field ::= unquoted_char*
    unquoted_char ::= [^,"\\n\\r]
    quoted_field ::= "\\"" quoted_char* "\\""
    quoted_char ::= [^"] | "\\"\\""
    newline ::= "\\r\\n" | "\\n"
`);
```

The format that NobodyWho utilizes is called GBNF, which is a Llama.cpp native format.
See the [GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md).
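
For the "JSON with some specific keys" case, a grammar could look roughly like the sketch below. The grammar itself (an object with exactly the keys `name` and `age`) is a hypothetical example written for this page, and it uses `root` as the start rule, following the convention in the GBNF README. `String.raw` is used so that backslashes reach the grammar unchanged and do not need to be doubled.

```typescript
// Hypothetical grammar: a JSON object with exactly the keys "name" and "age".
// String.raw keeps backslashes literal, so GBNF escapes are written as-is.
const personGrammar = String.raw`
root ::= "{" ws "\"name\"" ws ":" ws string ws "," ws "\"age\"" ws ":" ws number ws "}"
string ::= "\"" [^"\\]* "\""
number ::= [0-9]+
ws ::= [ \t\n]*
`;

// Assumed usage, mirroring the CSV example above:
// const sampler = SamplerPresets.grammar(personGrammar);
console.log(personGrammar.trim());
```

A grammar like this accepts `{"name": "Ada", "age": 36}` but rejects objects with other keys, a different key order, or a non-numeric age.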

## Defining your own samplers

Sampler presets abstract away some control that you might want - for example, if you
want to chain samplers, change more "advanced" parameters, etc. For that use case,
we provide the `SamplerBuilder` class:

```typescript
import { Chat, SamplerBuilder, type SamplerConfig } from "react-native-nobodywho";

const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  sampler: new SamplerBuilder().temperature(0.8).topK(5).dist() as SamplerConfig,
});
```

With `SamplerBuilder` you can chain multiple steps together and then select how you
want to sample from the distribution. Keep in mind that `SamplerBuilder` provides two
types of methods: ones which modify the distribution (returning again the instance of
`SamplerBuilder`) and ones which sample from the distribution (returning `SamplerConfig`).
So in order to have the sampler working properly, be careful
to always end the chain with one of the sampling steps (e.g. `dist()`, `greedy()`, `mirostatV2()`, etc.).
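
The distinction between the two kinds of steps can be illustrated with a small, library-independent sketch. The helper names below (`applyTemperature`, `applyTopK`, `sampleDist`) are invented for this illustration and only mimic what a `temperature(...).topK(...).dist()` chain does conceptually; the real work happens inside llama.cpp.

```typescript
// Illustrative only: these helpers are not part of the NobodyWho API.
type Candidate = { token: string; logit: number };

// Distribution-modifying step: divide every logit by the temperature.
function applyTemperature(cands: Candidate[], temperature: number): Candidate[] {
  return cands.map((c) => ({ token: c.token, logit: c.logit / temperature }));
}

// Distribution-modifying step: keep only the k highest-scoring candidates.
function applyTopK(cands: Candidate[], k: number): Candidate[] {
  return [...cands].sort((a, b) => b.logit - a.logit).slice(0, k);
}

// Sampling step: softmax the remaining logits and draw one token.
// `random` is injectable so the sketch can be tested deterministically.
function sampleDist(cands: Candidate[], random: () => number = Math.random): string {
  if (cands.length === 0) {
    throw new Error("no candidates left to sample from");
  }
  const max = Math.max(...cands.map((c) => c.logit));
  const weighted = cands.map((c) => ({ token: c.token, weight: Math.exp(c.logit - max) }));
  const total = weighted.reduce((acc, w) => acc + w.weight, 0);
  let remaining = random() * total;
  let chosen = "";
  for (const w of weighted) {
    chosen = w.token;
    remaining -= w.weight;
    if (remaining < 0) break;
  }
  return chosen; // falls back to the last candidate if rounding prevents a break
}

const candidates: Candidate[] = [
  { token: "cat", logit: 3.0 },
  { token: "dog", logit: 2.5 },
  { token: "car", logit: 0.1 },
];

const next = sampleDist(applyTopK(applyTemperature(candidates, 0.8), 2));
console.log(next); // "cat" or "dog"; "car" was removed by the top-k step
```

The first two functions return a modified list of candidates and can be chained in any number, while the last one collapses the distribution into a single token, which is why it has to come last.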

You can also change the sampler configuration on an existing chat instance:

```typescript
const sampler = new SamplerBuilder()
  .temperature(0.8)
  .topK(5)
  .dist() as SamplerConfig;

await chat.setSamplerConfig(sampler);
```

### Embeddings & RAG

When you want your LLM to search through documents, understand semantic similarity, or build retrieval-augmented generation (RAG) systems, you'll need embeddings and cross-encoders.

## Understanding Embeddings

Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Texts with similar meanings have similar vectors, even if they use different words.

For example, "Schedule a meeting for next Tuesday" and "Book an appointment next week" would have very similar embeddings, despite using different words.

## The Encoder

The `Encoder` object converts text into embedding vectors. You'll need a specialized embedding model (different from chat models).

We recommend you first try [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).

```typescript
import { Encoder } from "react-native-nobodywho";

const encoder = await Encoder.fromPath({
  modelPath: "/path/to/embedding-model.gguf",
});
const embedding = await encoder.encode("What is the weather like?");
console.log(`Vector with ${embedding.length} dimensions`);
```

The resulting embedding is an array of numbers (typically 384 or 768 dimensions depending on the model).

### Comparing Embeddings

To measure how similar two pieces of text are, compare their embeddings using cosine similarity:

```typescript
import { Encoder, cosineSimilarity } from "react-native-nobodywho";

const encoder = await Encoder.fromPath({
  modelPath: "/path/to/embedding-model.gguf",
});

const query = await encoder.encode("How do I reset my password?");
const doc1 = await encoder.encode(
  "You can reset your password in the account settings",
);
const doc2 = await encoder.encode(
  "The password requirements include 8 characters minimum",
);

const similarity1 = cosineSimilarity(query, doc1);
const similarity2 = cosineSimilarity(query, doc2);

console.log(`Document 1 similarity: ${similarity1.toFixed(3)}`); // Higher score
console.log(`Document 2 similarity: ${similarity2.toFixed(3)}`); // Lower score
```

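Cosine similarity returns a value between -1 and 1. Values close to 1 mean the two vectors point in nearly the same direction (very similar meaning), values near 0 indicate little relation between the texts, and -1 means the vectors point in opposite directions.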
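
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain TypeScript version of that formula; it is named `cosine` to make clear that it is an illustration and not the library's `cosineSimilarity`, whose exact behaviour (for example for zero vectors) may differ.

```typescript
// Illustrative implementation of the cosine similarity formula.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Assumption for this sketch: a zero vector is treated as unrelated to everything.
  return denominator === 0 ? 0 : dot / denominator;
}

console.log(cosine([1, 0], [1, 0])); // 1
console.log(cosine([1, 0], [0, 1])); // 0
console.log(cosine([1, 0], [-1, 0])); // -1
```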

## The CrossEncoder for Better Ranking

While embeddings work well for initial filtering, cross-encoders provide more accurate relevance scoring. They directly compare a query against documents to determine how well the document answers the query.

The key difference is that embeddings compare vector similarity, while cross-encoders understand the relationship between query and document, at a potentially larger computation cost.

### Why CrossEncoder Matters

Consider this example:

```
Query: "What are the office hours for customer support?"
Documents: [
    "Customer asked: What are the office hours for customer support?",
    "Support team responds: Our customer support is available Monday-Friday 9am-5pm EST",
    "Note: Weekend support is not available at this time"
]
```

Using embeddings alone, the first document scores highest (most similar to the query) even though it provides no useful information. A cross-encoder correctly identifies that the second document actually answers the question.

### Using CrossEncoder

```typescript
import { CrossEncoder } from "react-native-nobodywho";

// Download a reranking model like bge-reranker-v2-m3-Q8_0.gguf
const crossencoder = await CrossEncoder.fromPath({
  modelPath: "/path/to/reranker-model.gguf",
});

const query = "How do I install Python packages?";
const documents = [
  "Someone previously asked about Python packages",
  "Use pip install package-name to install Python packages",
  "Python packages are not included in the standard library",
];

// Get relevance scores for each document
const scores = await crossencoder.rank(query, documents);
console.log(scores); // [0.23, 0.89, 0.45] - second doc scores highest
```

### Automatic Sorting

For convenience, use `rankAndSort` to get documents sorted by relevance:

```typescript
// Returns list of [document, score] pairs, sorted by score
const rankedDocs = await crossencoder.rankAndSort(query, documents);

for (const [doc, score] of rankedDocs) {
  console.log(`[${score.toFixed(3)}] ${doc}`);
}
```

This returns documents ordered from most to least relevant.
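
Conceptually, this is the same as pairing each document with its `rank` score and sorting the pairs in descending order. The helper below, `pairAndSort`, is a hypothetical stand-in written for this page to show that idea on the illustrative scores from the previous example; it is not how the library is necessarily implemented.

```typescript
// Illustrative only: pairs documents with scores and sorts by score, highest first.
function pairAndSort(documents: string[], scores: number[]): [string, number][] {
  if (documents.length !== scores.length) {
    throw new Error("documents and scores must have the same length");
  }
  return documents
    .map((doc, i): [string, number] => [doc, scores[i]])
    .sort((a, b) => b[1] - a[1]);
}

const exampleDocs = [
  "Someone previously asked about Python packages",
  "Use pip install package-name to install Python packages",
  "Python packages are not included in the standard library",
];
const exampleScores = [0.23, 0.89, 0.45]; // illustrative values, not real model output

for (const [doc, score] of pairAndSort(exampleDocs, exampleScores)) {
  console.log(`[${score.toFixed(3)}] ${doc}`);
}
```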

## Building a RAG System

Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The LLM uses retrieved documents to ground its responses in your knowledge base.

Here's a complete example building a customer service assistant with access to company policies:

```typescript
import { Chat, Tool, CrossEncoder } from "react-native-nobodywho";

// Initialize the cross-encoder for document ranking
const crossencoder = await CrossEncoder.fromPath({
  modelPath: "/path/to/reranker-model.gguf",
});

// Your knowledge base
const knowledge = [
  "Our company offers a 30-day return policy for all products",
  "Free shipping is available on orders over $50",
  "Customer support is available via email and phone",
  "We accept credit cards, PayPal, and bank transfers",
  "Order tracking is available through your account dashboard",
];

// Create a tool that searches the knowledge base
const searchKnowledgeTool = new Tool({
  name: "search_knowledge",
  description: "Search the knowledge base for relevant information",
  parameters: [
    { name: "query", type: "string", description: "The search query" },
  ],
  call: async (query: string) => {
    const ranked = await crossencoder.rankAndSort(query, knowledge);
    const topDocs = ranked
      .slice(0, 3)
      .map(([doc]) => doc);
    return topDocs.join("\n");
  },
});

// Create a chat with access to the knowledge base
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  systemPrompt:
    "You are a customer service assistant. Use the search_knowledge tool to find relevant information from our policies before answering customer questions.",
  tools: [searchKnowledgeTool],
});

// The chat will automatically search the knowledge base when needed
const response = await chat.ask("What is your return policy?").completed();
console.log(response);
```

The LLM will call the `search_knowledge` tool, receive the most relevant documents, and use them to generate an accurate answer.

## Recommended Models

### For Embeddings
- [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf) - Good balance of speed and quality (~25MB). Supports English text with 384-dimensional embeddings.

### For Cross-Encoding (Reranking)
- [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - Multilingual support with excellent accuracy

## Best Practices

**Precompute embeddings**: If you have a fixed knowledge base, generate embeddings once and reuse them. Don't re-encode the same documents repeatedly.

**Use embeddings for filtering**: When working with large document collections (1000+ documents), use embeddings to narrow down to the top 50-100 candidates, then use a cross-encoder to rerank them.

**Limit cross-encoder inputs**: Cross-encoders are more expensive than embeddings. Don't pass thousands of documents to `rank()` - filter first with embeddings.
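
The practices above can be combined into a two-stage retrieval step, sketched below under a few assumptions. The names `IndexedDoc`, `Reranker`, `shortlist` and `retrieve` are invented for this sketch. The embeddings are expected to be precomputed once (for example with `encoder.encode`), and the `rerank` callback is assumed to behave like `crossencoder.rankAndSort`, returning `[document, score]` pairs sorted by score.

```typescript
// Illustrative only: a two-stage "filter with embeddings, then rerank" pipeline.
type IndexedDoc = { text: string; embedding: number[] };
type Reranker = (query: string, docs: string[]) => Promise<[string, number][]>;

function cosineScore(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Stage 1: cheap filtering over precomputed embeddings.
function shortlist(queryEmbedding: number[], index: IndexedDoc[], k: number): string[] {
  return index
    .map((doc) => ({ text: doc.text, score: cosineScore(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((doc) => doc.text);
}

// Stage 2: the expensive reranker only ever sees the shortlist.
async function retrieve(
  query: string,
  queryEmbedding: number[],
  index: IndexedDoc[],
  rerank: Reranker,
  shortlistSize = 50,
  finalSize = 3,
): Promise<string[]> {
  const candidates = shortlist(queryEmbedding, index, shortlistSize);
  const ranked = await rerank(query, candidates);
  return ranked.slice(0, finalSize).map(([doc]) => doc);
}

// Toy two-dimensional "embeddings", just to show the shape of the data.
const toyIndex: IndexedDoc[] = [
  { text: "Return policy", embedding: [1, 0] },
  { text: "Shipping costs", embedding: [0, 1] },
  { text: "Refund timeline", embedding: [0.9, 0.1] },
];
console.log(shortlist([1, 0], toyIndex, 2));
```

In an app, `rerank` could be something like `(q, docs) => crossencoder.rankAndSort(q, docs)`, and `toyIndex` would be replaced by documents encoded once at startup or at build time.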

## Flutter

### Getting started with Flutter

## How do I get started?

First, install `nobodywho`.
```bash
flutter pub add nobodywho
```

Next you need to import NobodyWho, and we highly suggest you do this using the namespace `nobodywho`, like so:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
```
since we have generic names such as `Model` and `Chat` in our package.
After you have imported the package, it is very important that the next step is done correctly. As we dynamically link the Rust binaries, you must make
the following function call exactly once in your application!

```dart
await nobodywho.NobodyWho.init();
```

A call to any of the functions in NobodyWho will result in an error before `.init()` has been called. 
However a second call to `.init()` will also result in an error, so you should be mindful about when you make this call.
We suggest you make it as early and as close to the root of your app as possible, as even though it is async it is a very fast operation.

With that setup done, we can move on to the exciting stuff! In the rest of the docs, we will assume that
you have imported NobodyWho using namespacing and that `.init()` has been called.

Now you are ready to pick a model. NobodyWho can download GGUF models directly from Hugging Face — just pass a `huggingface:` path. See [model selection](../model-selection.md) for recommendations.

Then create a `Chat` object and call `.ask`!

```dart
final chat = await nobodywho.Chat.fromPath(
  modelPath: 'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
);
final msg = await chat.ask('Is water wet?').completed();
print(msg); // Yes, indeed, water is wet!
```

This is a super simple example, but we believe that examples which do simple things, should be simple!

## Tracking download progress

When loading a remote model, pass an `onDownloadProgress` callback to observe the download. It receives `(downloadedBytes, totalBytes)`, is throttled to roughly 10 Hz with a guaranteed final emit on completion, and is not called for cached or local files.

```dart
final chat = await nobodywho.Chat.fromPath(
  modelPath: 'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
  onDownloadProgress: (downloaded, total) {
    print('$downloaded / $total bytes');
  },
);
```

To get a full overview of the functionality provided by NobodyWho, simply keep reading. You can also have a look at our [flutter starter app repository](https://github.com/nobodywho-ooo/flutter-starter-example).

## Minimum recommended specs

- iOS: iPhone 11 or newer with at least 4 GB of RAM. We tested a Qwen3 0.6B (332 MB) on an iPhone X (iOS 16) and while it ran, performance was too slow to be practical.
- Android: Snapdragon 855 / Adreno 640 / 6 GB RAM or better. The same Qwen3 0.6B model performed notably better on a OnePlus 7 Pro (Android 12) than on the iPhone X tested above.

## Feedback & Contributions

We welcome your feedback and ideas!

- Bug Reports & Improvements: If you encounter a bug or have suggestions, please open an issue on our [Issues](https://github.com/nobodywho-ooo/nobodywho/issues) page.
- Feature Requests & Questions: For new feature requests or general questions, join the discussion on our [Discussions](https://github.com/nobodywho-ooo/nobodywho/discussions) page.

### Chat

As you may have noticed in the [welcome guide](./index.md), every interaction with your LLM starts by instantiating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.


## Creating a Chat 

There are two main ways of instantiating a `Chat` object and the difference lies in when the model file is loaded. 
The simplest way is using `Chat.fromPath` like so:

```dart 
final chat = await nobodywho.Chat.fromPath(modelPath: "./model.gguf");
```
This function is async since loading a model can take a bit of time, but this should not block any of your UI.
Another way to achieve the same thing is to load the model separately and then use the `Chat` constructor:

```dart
final model = await nobodywho.Model.load(modelPath: "./model.gguf");
final chat = nobodywho.Chat(model: model);
```

This allows for sharing the model between several `Chat` instances.

## Prompts and responses

The `Chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.

```dart
import "dart:io";
final chat = await nobodywho.Chat.fromPath(modelPath: "./model.gguf");
final response = chat.ask("Is water wet?");
```

The return type of `ask` is a `TokenStream`.
If you want to start reading the response as soon as possible, you can just iterate over the `TokenStream`.
Each token is either an individual word or a fragment of a word.

```{.dart continuation}
await for (final token in response) {
   print(token);
}
```

If you just want to get the complete response, you can call `TokenStream.completed()`.
This will return the entire response string once the model is done generating its entire response.

```{.dart continuation}
final fullResponse = await response.completed();
```

All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `Chat.ask()`, it will remember the previous messages.

## Chat history

If you want to inspect the messages inside the `Chat` object, you can use `getChatHistory`.

```{.dart continuation}
final msgs = await chat.getChatHistory();
print(msgs[0].content); // "Is water wet?"
```

Similarly, if you want to edit what messages are in the context, you can use `setChatHistory`:

```{.dart continuation}
await chat.setChatHistory([
  nobodywho.Message.user(content: "What is water?")
]);
```

## System prompt

A system prompt is a special message put into the chat context, which should guide its overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.

You can specify a system prompt when initializing a `Chat`:

```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  systemPrompt: "You are a mischievous assistant!"
);
```

This `systemPrompt` is then persisted until the chat context is `reset`.

## Context

The context is the text window which the LLM currently considers. Specifically this is the number of tokens the LLM keeps in memory for your current conversation.
As a bigger context size means more computational overhead, it makes sense to constrain it. This can be done with the `contextSize` setting, again at the time of creation:

```dart
final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  contextSize: 4096
);
```

The default value is `4096`, which is mainly suitable for short and simple conversations. Choosing the right context size is quite important and depends heavily on your use case. A good place to start is to look at your selected model's documentation and see what context size it recommends.

Even with properly selected context size it might happen that you fill up your entire context during a conversation. When this happens, NobodyWho will shrink the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history, until the size reaches `contextSize / 2`. The KV cache is also updated automatically. In the future we plan on adding more advanced methods of context shrinking.

Again, `contextSize` is fixed to the `Chat` instance, so it is currently not possible to change the size after `Chat` is created. To reset the current context content, just call `resetContext()` with the new system prompt and potentially changed tools.

```{.dart continuation}
await chat.resetContext(systemPrompt: "New system prompt", tools: []);
```

If you don't want to change the already set defaults (`systemPrompt`, `tools`), but only reset the context, then go for `resetHistory`.

## Sharing model between contexts

There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. In this case you must load the model
separately from creating the `Chat` instance.

For this use case, instead of the path to the `.gguf` model, you can pass in a `Model` object, which can be shared between multiple `Chat` instances.

```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final model = await nobodywho.Model.load(modelPath: './model.gguf');
final chat1 = nobodywho.Chat(model: model);
final chat2 = nobodywho.Chat(model: model);
...
```

NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.

## GPU
When instantiating `Model` or using `Chat.fromPath` you have the option to disable/enable GPU acceleration. This can be done as:
```dart
final model = await nobodywho.Model.load(modelPath: './model.gguf', useGpu: true);
```
or 
```dart
final chat = await nobodywho.Chat.fromPath(modelPath: './model.gguf', useGpu: false);
```
By default, `useGpu` is set to `true`.
So far, NobodyWho relies purely on [Vulkan](https://www.vulkan.org); however, support
for more architectures is planned (for details check out our [issues](https://github.com/nobodywho-ooo/nobodywho/issues) or join us on [Discord](https://discord.gg/qhaMc2qCYB)).

## Template Variables

Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.

### Using Template Variables

You can set template variables when creating a chat or modify them on existing instances:

```dart
final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  templateVariables: {"enable_thinking": true}
);
```

You can also modify template variables on an existing chat instance:

```{.dart continuation}
// Set a single template variable
await chat.setTemplateVariable("enable_thinking", true);

// Set multiple template variables at once
await chat.setTemplateVariables({
    "enable_thinking": true,
    "verbose_mode": false
});

// Get current template variables
final variables = await chat.getTemplateVariables();
print(variables); // {enable_thinking: true, verbose_mode: false}
```

With the next message sent, the updated settings will be propagated to the model.

### Example: Qwen3 and Qwen3.5 Reasoning

The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:

```dart
final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  templateVariables: {"enable_thinking": true}
);
final response = chat.ask("Solve this logic puzzle: ...");
```

When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.

### Model-Specific Variables

Different models may support different template variables depending on their chat template implementation. The available variables and their effects depend entirely on how the model's chat template is designed. Check your model's documentation to see which template variables are supported.

!!! info ""
    Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.

### Backward Compatibility

For backward compatibility, the deprecated `allowThinking` parameter is still available but internally sets the `enable_thinking` template variable:

```dart
// Deprecated - use templateVariables instead
final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  allowThinking: true
);
```

### Tool Calling

To give your LLM the ability to interact with the outside world, you will need tool calling.

!!! info ""
    Note that **not every model** supports tool calling. If the model does not have
    such an option, it might not call your tools.
    For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/Qwen/models) family of models.

## Declaring a tool

A tool can be created from any Dart function that returns a `String` or `Future<String>`.
To perform the conversion, you simply wrap the function in the `Tool` constructor. To get
a good sense of what such a tool can look like, consider this geometry example:

```dart
import 'dart:math' as math;
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final circleAreaTool = nobodywho.Tool(
  name: "circle_area",
  description: "Calculates the area of a circle given its radius",
  function: ({ required double radius }) {
    final area = math.pi * radius * radius;
    return "Circle with radius $radius has area ${area.toStringAsFixed(2)}";
  }
);
```

As you can see, every `Tool()` call needs a function, a name, and a description
of what the tool does. To let your LLM use it, simply add it when creating a `Chat`:

``` {.dart continuation}
final chat = await nobodywho.Chat.fromPath(
  modelPath: './model.gguf',
  tools: [circleAreaTool]
);
```

NobodyWho then figures out the right tool calling format, inspects the names and types of the parameters,
and configures the sampler.

Naturally, more tools can be defined and the model can chain the calls for them:

```dart
import 'dart:io';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final getCurrentDirTool = nobodywho.Tool(
  name: "get_current_dir",
  description: "Gets path of the current directory",
  function: () => Directory.current.path
);

final listFilesTool = nobodywho.Tool(
  name: "list_files",
  description: "Lists files in the given directory.",
  function: ({required String path}) {
    final dir = Directory(path);
    final files = dir.listSync()
        .where((entity) => entity is File)
        .map((file) => file.path.split('/').last)
        .toList();
    return "Files: ${files.join(', ')}";
  },
  parameterDescriptions: {"path": "The path to the directory you want to list. Must be a valid path."}
);

final getFileSizeTool = nobodywho.Tool(
  name: "get_file_size",
  description: "Gets the size of a file in bytes.",
  function: ({required String filepath}) async {
    final file = File(filepath);
    final size = await file.length();
    return "File size: $size bytes";
  },
  parameterDescriptions: {"filepath": "The path to the file you wish to know the size of. Must be a valid path."}
);

final chat = await nobodywho.Chat.fromPath(
  modelPath: './model.gguf',
  tools: [getCurrentDirTool, listFilesTool, getFileSizeTool],
  templateVariables: {"enable_thinking": false}
);

final response = await chat.ask('What is the biggest file in my current directory?').completed();
print(response); // The largest file in your current directory is `model.gguf`.
```

## Pre-packaged tools

We ship NobodyWho with two pre-packaged tools, which are general enough for multiple use cases: the [monty](https://github.com/pydantic/monty) Python interpreter
and the [bashkit](https://github.com/everruns/bashkit) Bash interpreter. Both serve a similar purpose: to give your small LLM a better chance of answering
questions that require precise reasoning or some kind of computation, possibly on a big context.

The usage is straightforward. Use the `Tool.python()` and `Tool.bash()` factory constructors:

```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final chat = await nobodywho.Chat.fromPath(
  modelPath: './model.gguf',
  tools: [nobodywho.Tool.python(), nobodywho.Tool.bash()],
);
```

Lastly, keep in mind that for most use cases it is reasonable to constrain the tools with limits on memory and computation time,
so that you don't end up executing an infinite loop. For this purpose, `Tool.python()` provides `maxDuration`, `maxMemoryBytes` and `maxRecursionDepth`,
and `Tool.bash()` provides `maxCommands`.

## Tool calling and the context

As with everything made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.

### Vision & Hearing

Easily provide image and audio information to your LLM.

## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts for making this work:

1. A multimodal LLM, so the LLM can consume image-tokens and/or audio-tokens
2. A projection model, which converts images to image-tokens and/or audio to audio-tokens

To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model then includes `mmproj` in its name.

If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.

With the downloaded GGUFs, you can simply add the projection model when loading the model:

```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final model = await nobodywho.Model.load(
  modelPath: "./multimodal-model.gguf",
  projectionModelPath: "./mmproj.gguf",
);
final chat = nobodywho.Chat(
  model: model,
  systemPrompt: "You are a helpful assistant, that can hear and see stuff!",
);
```

!!! info ""
    The language model and projection model have to **fit** together, as they are trained together!
    Unfortunately, you can't just take a projection model and an LLM that you like and expect them
    to work together.

## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through `askWithPrompt`, which accepts a `Prompt` containing a list of `PromptPart` values.

```{.dart continuation}
final response = await chat.askWithPrompt(nobodywho.Prompt([
  nobodywho.TextPart("Tell me what you see in the image and what you hear in the audio."),
  nobodywho.ImagePart("./dog.png"),
  nobodywho.AudioPart("./sound.mp3"),
])).completed(); // It's a dog and a penguin!
```

## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try to mess around with the order of supplying the text
and the multimodal files, or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.

```{.dart continuation}
await chat.resetHistory();
final response2 = await chat.askWithPrompt(nobodywho.Prompt([
  nobodywho.TextPart("Tell me what you see in the image."),
  nobodywho.ImagePart("./dog.png"),
  nobodywho.TextPart("Also tell me what you hear in the audio"),
  nobodywho.AudioPart("./sound.mp3"),
])).completed();
```

Also, there is still a lot of variance in how models internally process images.
This, for example, causes differences in how quickly the model consumes context - for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, they scale with the size of the image. In that case, you can increase the context size if the resources allow:

```{.dart continuation}
final chat2 = nobodywho.Chat(
  model: model,
  systemPrompt: "You are a helpful assistant.",
  contextSize: 8192,
);
```

Or, for example, preprocess your images with some kind of compression (sometimes even changing the image type helps).

Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file. For Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We thus recommend at least trying out different
projection model files if the one you picked does not work.

As always with more niche models you can find bugs. If you stumble upon some of them, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix the functionality.

### Sampling

The model does not produce tokens directly but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from that distribution. This is the job of a **sampler**, which NobodyWho lets you freely modify
to achieve better-quality outputs or to constrain the output to a known format (e.g. JSON).

## Sampler presets

To get a quick start, NobodyWho offers a couple of well-known presets, which you can quickly utilize.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  sampler: nobodywho.SamplerPresets.temperature(temperature: 0.2)
);
```
Setting `temperature` to `0.2` will affect the sampler when choosing the next token, making the distribution less flat and therefore the model will favour more probable tokens.

To see the whole list of presets, check out the `SamplerPresets` class:
```dart
class SamplerPresets {
  static SamplerConfig defaultSampler();
  static SamplerConfig dry();
  static SamplerConfig grammar({required String grammar});
  static SamplerConfig greedy();
  static SamplerConfig json();
  static SamplerConfig temperature({required double temperature});
  static SamplerConfig topK({required int topK});
  static SamplerConfig topP({required double topP});
  ...
}
```

## Structured output

One of the most useful presets is the one for generating structured output,
such as JSON. This way, you don't have to rely on your model being clever enough to
generate syntactically valid JSON; instead, the sampler only allows tokens that keep
the output syntactically valid. For plain JSON, it suffices to:
```dart
final chat = await nobodywho.Chat.fromPath(
  modelPath: './model.gguf',
  sampler: nobodywho.SamplerPresets.json()
);
```

Still, you might have more advanced needs, such as generating CSVs or JSON with some specific keys. This can be supported by creating custom grammars, such as this one for CSV:
```dart
// Raw string, so the backslash escapes reach the GBNF parser unchanged.
final sampler = nobodywho.SamplerPresets.grammar(grammar: r"""
    file ::= record (newline record)* newline?
    record ::= field ("," field)*
    field ::= quoted_field | unquoted_field
    unquoted_field ::= unquoted_char*
    unquoted_char ::= [^,"\n\r]
    quoted_field ::= "\"" quoted_char* "\""
    quoted_char ::= [^"] | "\"\""
    newline ::= "\r\n" | "\n"
""");
```
NobodyWho uses GBNF, the grammar format native to Llama.cpp.
See the [GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md).


## Defining your own samplers

Sampler presets abstract away some control that you might want, for example if you
want to chain samplers or change more "advanced" parameters. For that use case,
we provide the `SamplerBuilder` class:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final chat = await nobodywho.Chat.fromPath(
  modelPath: "./model.gguf",
  sampler: nobodywho.SamplerBuilder()
      .temperature(temperature: 0.8)
      .topK(topK: 5)
      .dist()
);
```
With `SamplerBuilder` you can chain multiple steps together and then select how you
want to sample from the distribution. Keep in mind that `SamplerBuilder` provides two
types of methods: ones that modify the distribution (returning the `SamplerBuilder`
instance again) and ones that sample from the distribution (returning a `SamplerConfig`).
To get a working sampler and avoid type errors, always end the chain with one of the
sampling steps (e.g. `dist()`, `greedy()`, `mirostatV2()`, etc.).
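
The difference between the two kinds of steps can be sketched in plain Python. This is only an illustration of the idea behind a temperature, top-k, dist chain, not NobodyWho's implementation; all function names and scores here are hypothetical.

```python
import math
import random

def apply_temperature(logits, temperature):
    # Distribution-modifying step: rescale every score.
    return {tok: score / temperature for tok, score in logits.items()}

def apply_top_k(logits, k):
    # Distribution-modifying step: keep only the k highest-scoring tokens.
    kept = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(kept)

def sample_dist(logits, rng):
    # Sampling step: softmax over the remaining tokens, then draw one.
    peak = max(logits.values())
    weights = [math.exp(score - peak) for score in logits.values()]
    return rng.choices(list(logits.keys()), weights=weights, k=1)[0]

# Hypothetical scores for four candidate tokens.
logits = {"cat": 3.0, "dog": 2.5, "car": 0.1, "the": -1.0}

rng = random.Random(0)  # seeded so the run is repeatable
filtered = apply_top_k(apply_temperature(logits, 0.8), 2)
token = sample_dist(filtered, rng)
print(sorted(filtered), token)
```

The first two functions return a modified distribution and can be chained in any number; only the last one actually picks a token, which is why a chain has to end with a sampling step.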

You can also change the sampler configuration on an existing chat instance:

```{.dart continuation}
final sampler = nobodywho.SamplerBuilder()
    .temperature(temperature: 0.8)
    .topK(topK: 5)
    .dist();

await chat.setSamplerConfig(sampler);
```

### Embeddings & RAG

When you want your LLM to search through documents, understand semantic similarity, or build retrieval-augmented generation (RAG) systems, you'll need embeddings and cross-encoders.

## Understanding Embeddings

Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Texts with similar meanings have similar vectors, even if they use different words.

For example, "Schedule a meeting for next Tuesday" and "Book an appointment next week" would have very similar embeddings, despite using different words.

## The Encoder

The `Encoder` object converts text into embedding vectors. You'll need a specialized embedding model (different from chat models).

We recommend you first try [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).

```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
final embedding = await encoder.encode(text: "What is the weather like?");
print("Vector with ${embedding.length} dimensions");
```

The resulting embedding is a `Float32List` (typically 384 or 768 dimensions depending on the model).

### Comparing Embeddings

To measure how similar two pieces of text are, compare their embeddings using cosine similarity:

```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');

final query = await encoder.encode(text: "How do I reset my password?");
final doc1 = await encoder.encode(text: "You can reset your password in the account settings");
final doc2 = await encoder.encode(text: "The password requirements include 8 characters minimum");

final similarity1 = nobodywho.cosineSimilarity(
  a: query.toList(),
  b: doc1.toList()
);
final similarity2 = nobodywho.cosineSimilarity(
  a: query.toList(),
  b: doc2.toList()
);

print("Document 1 similarity: ${similarity1.toStringAsFixed(3)}");  // Higher score
print("Document 2 similarity: ${similarity2.toStringAsFixed(3)}");  // Lower score
```

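Cosine similarity returns a value between -1 and 1: values close to 1 mean the texts are very similar in meaning, values near 0 mean they are unrelated, and negative values mean the vectors point in opposite directions.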
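
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following plain-Python sketch shows the calculation on tiny hand-written vectors; it is independent of NobodyWho, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # unrelated
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
print(same, orthogonal, opposite)
```

Because the result depends only on direction, scaling a vector (as in the first pair) does not change the score.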

### Practical Example: Finding Relevant Documents

```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');

// Your knowledge base
final documents = [
  "Python supports multiple programming paradigms including object-oriented and functional",
  "JavaScript is primarily used for web development and runs in browsers",
  "SQL is a domain-specific language for managing relational databases",
  "Git is a version control system for tracking changes in source code"
];

// Pre-compute document embeddings
final docEmbeddings = <Float32List>[];
for (final doc in documents) {
  docEmbeddings.add(await encoder.encode(text: doc));
}

// Search query
final query = "What language should I use for database queries?";
final queryEmbedding = await encoder.encode(text: query);

// Find the most relevant document
double maxSimilarity = -1;
int bestIdx = 0;
for (int i = 0; i < docEmbeddings.length; i++) {
  final similarity = nobodywho.cosineSimilarity(
    a: queryEmbedding.toList(),
    b: docEmbeddings[i].toList()
  );
  if (similarity > maxSimilarity) {
    maxSimilarity = similarity;
    bestIdx = i;
  }
}

print("Most relevant: ${documents[bestIdx]}");
print("Similarity score: ${maxSimilarity.toStringAsFixed(3)}");
```

## The CrossEncoder for Better Ranking

While embeddings work well for initial filtering, cross-encoders provide more accurate relevance scoring. They directly compare a query against documents to determine how well the document answers the query.

The key difference is that embeddings compare vector similarity, while cross-encoders understand the relationship between query and document, at a potentially larger computation cost.

### Why CrossEncoder Matters

Consider this example:

```
Query: "What are the office hours for customer support?"
Documents: [
    "Customer asked: What are the office hours for customer support?",
    "Support team responds: Our customer support is available Monday-Friday 9am-5pm EST",
    "Note: Weekend support is not available at this time"
]
```

Using embeddings alone, the first document scores highest (most similar to the query) even though it provides no useful information. A cross-encoder correctly identifies that the second document actually answers the question.

### Using CrossEncoder

```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

// Download a reranking model like bge-reranker-v2-m3-Q8_0.gguf
final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');

final query = "How do I install Python packages?";
final documents = [
  "Someone previously asked about Python packages",
  "Use pip install package-name to install Python packages",
  "Python packages are not included in the standard library"
];

// Get relevance scores for each document
final scores = await crossencoder.rank(query: query, documents: documents);
print(scores);  // e.g. [0.23, 0.89, 0.45] (illustrative values) - second doc scores highest
```

### Automatic Sorting

For convenience, use `rankAndSort` to get documents sorted by relevance:

```{.dart continuation}
// Returns list of (document, score) tuples, sorted by score
final rankedDocs = await crossencoder.rankAndSort(query: query, documents: documents);

for (final (doc, score) in rankedDocs) {
  print("[${score.toStringAsFixed(3)}] $doc");
}
```

This returns documents ordered from most to least relevant.

## Building a RAG System

Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The LLM uses retrieved documents to ground its responses in your knowledge base.

Here's a complete example building a customer service assistant with access to company policies:

```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;

Future<void> main() async {
  // Initialize the cross-encoder for document ranking
  final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');

  // Your knowledge base
  final knowledge = [
    "Our company offers a 30-day return policy for all products",
    "Free shipping is available on orders over \$50",
    "Customer support is available via email and phone",
    "We accept credit cards, PayPal, and bank transfers",
    "Order tracking is available through your account dashboard"
  ];

  // Create a tool that searches the knowledge base
  final searchKnowledgeTool = nobodywho.Tool(
    function: ({required String query}) async {
      // Rank all documents by relevance to the query
      final ranked = await crossencoder.rankAndSort(query: query, documents: knowledge);
      
      // Return top 3 most relevant documents
      final topDocs = ranked.take(3).map((e) => e.$1).toList();
      return topDocs.join("\n");
    },
    name: "search_knowledge",
    description: "Search the knowledge base for relevant information"
  );

  // Create a chat with access to the knowledge base
  final chat = await nobodywho.Chat.fromPath(
    modelPath: './model.gguf',
    systemPrompt: "You are a customer service assistant. Use the search_knowledge tool to find relevant information from our policies before answering customer questions.",
    templateVariables: {"enable_thinking": false},
    tools: [searchKnowledgeTool]
  );

  // The chat will automatically search the knowledge base when needed
  final response = await chat.ask("What is your return policy?").completed();
  print(response);
}
```

The LLM will call the `search_knowledge` tool, receive the most relevant documents, and use them to generate an accurate answer.

## Async Operations

In Flutter/Dart, all operations are asynchronous by default. There are no separate `EncoderAsync` or `CrossEncoderAsync` classes - the regular `Encoder` and `CrossEncoder` classes use async/await patterns:

```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

Future<void> main() async {
  final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
  
  final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');
  
  // Generate embeddings asynchronously
  final embedding = await encoder.encode(text: "What is the weather?");
  
  // Rank documents asynchronously
  final query = "What is our refund policy?";
  final docs = [
    "Refunds processed within 5-7 business days",
    "No refunds on sale items",
    "Contact support to initiate refund"
  ];
  final ranked = await crossencoder.rankAndSort(query: query, documents: docs);
  
  for (final (doc, score) in ranked) {
    print("[${score.toStringAsFixed(3)}] $doc");
  }
}
```


## Recommended Models

### For Embeddings
- [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf) - Good balance of speed and quality (~25MB). Supports English text with 384-dimensional embeddings.

### For Cross-Encoding (Reranking)
- [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - Multilingual support with excellent accuracy

## Best Practices

**Precompute embeddings**: If you have a fixed knowledge base, generate embeddings once and reuse them. Don't re-encode the same documents repeatedly.

**Use embeddings for filtering**: When working with large document collections (1000+ documents), use embeddings to narrow down to the top 50-100 candidates, then use a cross-encoder to rerank them.

**Limit cross-encoder inputs**: Cross-encoders are more expensive than embeddings. Don't pass thousands of documents to `rank()` - filter first with embeddings.

**Choose appropriate context size**: The `nCtx` parameter (default 4096) should match your model's recommended context size. Check the model documentation.

```dart
// Both models are loaded with the default context size here.
// For longer documents, increase it via the nCtx parameter.
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');

final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');
```

## Complete RAG Example

Here's a full example showing a two-stage retrieval system:

```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;

Future<void> main() async {
  // Initialize models
  final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
  
  final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');

  // Large knowledge base
  final knowledgeBase = [
    "Python 3.11 introduced performance improvements through faster CPython",
    "The Django framework is used for building web applications",
    "NumPy provides support for large multi-dimensional arrays",
    "Pandas is the standard library for data manipulation and analysis",
    // ... 100+ more documents
  ];

  // Precompute embeddings for all documents
  final docEmbeddings = <Float32List>[];
  for (final doc in knowledgeBase) {
    docEmbeddings.add(await encoder.encode(text: doc));
  }

  Future<String> search({required String query}) async {
    // Stage 1: Fast filtering with embeddings
    final queryEmbedding = await encoder.encode(text: query);
    final similarities = <(String, double)>[];
    for (int i = 0; i < knowledgeBase.length; i++) {
      final similarity = nobodywho.cosineSimilarity(
        a: queryEmbedding.toList(),
        b: docEmbeddings[i].toList()
      );
      similarities.add((knowledgeBase[i], similarity));
    }
    // Get top 20 candidates
    similarities.sort((a, b) => b.$2.compareTo(a.$2));
    final candidateDocs = similarities.take(20).map((e) => e.$1).toList();
    
    // Stage 2: Precise ranking with cross-encoder
    final ranked = await crossencoder.rankAndSort(query: query, documents: candidateDocs);
    
    // Return top 3 most relevant
    final topResults = ranked.take(3).map((e) => e.$1).toList();
    return topResults.join("\n---\n");
  }

  final searchTool = nobodywho.Tool(
    function: search,
    name: "search",
    description: "Search the knowledge base for information relevant to the query"
  );

  // Create RAG-enabled chat
  final chat = await nobodywho.Chat.fromPath(
    modelPath: './model.gguf',
    systemPrompt: "You are a technical documentation assistant. Always use the search tool to find relevant information before answering programming questions.",
    templateVariables: {"enable_thinking": false},
    tools: [searchTool]
  );

  // The chat automatically searches and uses retrieved documents
  final response = await chat.ask("What Python libraries are best for data analysis?").completed();
  print(response);
}
```

This two-stage approach combines the speed of embeddings with the accuracy of cross-encoders, making it efficient even for large knowledge bases.

## Godot

### Godot Installation

_How to install NobodyWho and start building._

---

### Via Asset Library

- **Open Godot 4.5** (or any newer 4.x release).
- Switch to the **Asset Library** tab.
- Search for **“NobodyWho”** and select the entry.
- Click **Download**, tick **Ignore asset root**, then choose **Install**.
- Godot puts the plugin in `res://addons/nobodywho`. Open *Create Node* and you should see **`NobodyWhoChat`**. If it’s missing, restart Godot and try again. 

(1)
{ .annotate }

1. ![Installing NobodyWho from the Asset Library](assets/godot_asset_library.gif)

### Via GitHub

- Download the latest ZIP from the [GitHub releases](https://github.com/nobodywho-ooo/nobodywho/releases).
- In Godot, open **AssetLib ▸ Import** and pick the ZIP.
- Tick **Ignore asset root** and finish the import. 

(1)
{ .annotate }

1. ![Importing the ZIP in Godot](assets/godot_github.gif)

---

After installation, NobodyWho’s nodes should appear in your editor. If not, retrace your steps above or reach out on Discord or GitHub - we are there to help.

### Getting Started

_A minimal, end-to-end example showing how to load a model and perform a single chat interaction._ 

---

One of the most important components of NobodyWho is the Chat node. It handles all the conversation logic between the user and the LLM.
When you use the chat, you first pick a model and tell it what kind of answers you want.
When you send a message, the chat remembers what you said and sends it off to get an answer. 
The model will then start reading and generating a response.
You can choose to wait for the full answer to generate or get the response in a stream.

Here are the key terms you'll see throughout this guide:

| Term | Meaning |
| ---- | ------- |
| **Model (GGUF)** | A `*.gguf` file that holds the weights of a large‑language model. |
| **System prompt** | Text that sets the ground rules for the model. |
| **Token** | The smallest chunk of text the model emits (roughly a word). |
| **Chat** | The node/component that owns the context, sends user input to the worker, and keeps conversation state in sync with the LLM. |
| **Context** | The message history and metadata passed to the model each turn; it lives inside the Chat. |
| **Worker** | NobodyWho's background task for a single conversation — it keeps the model ready and acts as a communication layer between the program and the model. Each Chat has its own worker. |

Let's show you how to use the plugin to get a large language model to answer you.

## Download a GGUF Model

The first step is to get a model.
If you're in a hurry, just download [Qwen3 0.6B Q4_K_M](https://huggingface.co/NobodyWho/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf).
It's super small and fast, and works well for simple use-cases.

Otherwise, check out our [recommended models](../model-selection.md) or if you have a non-standard use case, shoot us a question in Discord.

## Load the GGUF model

At this point you should have downloaded the model and put it into your project folder.


Add a `NobodyWhoModel` node to your scene tree.

Set the model path to point to your GGUF model. (1)
{ .annotate }

1. ![set model path](assets/godot_model_selection.png)

### Supported model path formats

The `model_path` field (and `projection_model_path` for vision models) accepts several forms:

| Form | Example | Notes |
| ---- | ------- | ----- |
| Godot resource path | `res://models/my-model.gguf` | Bundled with your game export |
| User data path | `user://downloaded.gguf` | Written by your game at runtime |
| Absolute filesystem path | `/opt/models/foo.gguf` | Local file |
| HuggingFace reference | `huggingface:owner/repo/file.gguf` or `hf://owner/repo/file.gguf` | Downloaded & cached on first use |
| HTTPS URL | `https://example.com/model.gguf` | Downloaded & cached on first use |

Remote models are downloaded to the platform cache directory on the first load and re-used on subsequent runs. Downloads happen on a background thread — the Godot main loop stays responsive while a multi-GB model is fetched.
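
As a quick way to reason about the table above, the sketch below classifies a path string by its prefix and reports whether it would need a download. It is plain Python written for illustration; `classify_model_path` is a hypothetical helper, not part of NobodyWho, and the plugin's real parsing may differ.

```python
def classify_model_path(path):
    """Return (kind, needs_download) for a model path string."""
    if path.startswith("res://"):
        return ("godot resource", False)
    if path.startswith("user://"):
        return ("user data", False)
    if path.startswith(("huggingface:", "hf://")):
        return ("huggingface", True)
    if path.startswith("https://"):
        return ("https url", True)
    # Anything else is treated as a plain filesystem path in this sketch.
    return ("filesystem", False)

for example in [
    "res://models/my-model.gguf",
    "user://downloaded.gguf",
    "/opt/models/foo.gguf",
    "hf://owner/repo/file.gguf",
    "https://example.com/model.gguf",
]:
    print(example, classify_model_path(example))
```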

### Showing download progress

`NobodyWhoModel` emits a `download_progress(downloaded, total)` signal while a remote model is downloading, throttled to roughly 10 Hz with a guaranteed final emit on completion. Connect it if you'd like to drive a progress bar:

```gdscript
model.download_progress.connect(func(downloaded: int, total: int):
    print("%d / %d bytes" % [downloaded, total])
)
```

The signal is not emitted for local files or already-cached downloads.

### Knowing when the worker is ready

`start_worker()` returns immediately. The worker finishes loading in the background (including any download). Connect to the `worker_started` and `worker_failed` signals if your game logic needs to wait:

```gdscript
chat.worker_started.connect(func():
    print("Ready to chat!")
)
chat.worker_failed.connect(func(err):
    push_error("Model load failed: " + err)
)
chat.start_worker()
```

You can also call `ask()` straight away — prompts issued before the worker is ready are queued and dispatched as soon as loading completes. The same applies to `NobodyWhoEncoder.encode()` and `NobodyWhoCrossEncoder.rank()`.
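
The queuing behaviour described above can be pictured with a toy model in plain Python. This is an illustration of the idea only, under the assumption that queued prompts are dispatched in arrival order; the class and method names are hypothetical and say nothing about how NobodyWho implements it internally.

```python
class QueuedWorker:
    """Toy worker that buffers prompts until loading has finished."""

    def __init__(self):
        self.ready = False
        self.pending = []     # prompts received before the worker was ready
        self.dispatched = []  # prompts handed to the (imaginary) model

    def ask(self, prompt):
        if self.ready:
            self.dispatched.append(prompt)
        else:
            self.pending.append(prompt)

    def mark_ready(self):
        # Called once loading completes: flush the queue in arrival order.
        self.ready = True
        while self.pending:
            self.dispatched.append(self.pending.pop(0))

worker = QueuedWorker()
worker.ask("How are you?")      # queued: the model is still loading
worker.mark_ready()             # loading done, queued prompt is dispatched
worker.ask("Tell me a story.")  # dispatched immediately
print(worker.dispatched)
```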


## Create a new Chat

The next step is adding a Chat to our scene. 


Add a `NobodyWhoChat` node to your scene tree.

Then add a script to the node:

```gdscript
extends NobodyWhoChat

func _ready():
    # configure the node (feel free to do this in the UI)
    self.system_prompt = "You are an evil wizard."
    self.model_node = get_node("../ChatModel")

    # connect signals to signal handlers
    self.response_updated.connect(_on_response_updated)
    self.response_finished.connect(_on_response_finished)

    # Start the worker, this is not required, but recommended to do in
    # the beginning of the program to make sure it is ready
    # when the user prompts the chat the first time. This will be called
    # under the hood when you use `ask()` as well.
    self.start_worker()

    self.ask("How are you?")

func _on_response_updated(token):
    # this will print every time a new token is generated
    print(token)

func _on_response_finished(response):
    # this will print when the entire response is finished
    print(response)
```

## Testing Your Setup

That's it! You now have a working chat system that can talk to a language model. When you run your scene, the chat will automatically send a test message and you should see the model's response appearing in your console.

You should see tokens appearing one by one as the model generates its response, followed by the complete answer. If you see the evil wizard responding with curses (or whatever system prompt you chose), everything is working correctly!

**If nothing happens:**

- Make sure your model file path is correct
- Verify that your Chat node is properly connected to your Model node
- Look for any error messages in the console
- Start your editor through the command line and check the stdout logs.

Now you're ready to build more complex conversations and integrate the chat system into your game!

### Simple Chat

_A comprehensive guide to configuring, streaming, and controlling LLM responses through the Chat component._


---

Great! You've completed the ["Getting Started"](../getting-started.md) guide, got your first chat working, and picked up the basic vocabulary.   
Now let's dive deeper into the Chat component and show you all the settings and techniques you'll actually use when working with LLMs.
 
The Chat component isn't just for conversations - it's your main interface for any kind of LLM processing, whether that's generating dialogue, analyzing text, creating content, or any other language task.

In this guide, you'll learn:

- The main settings that control LLM behavior
- How to handle LLM responses efficiently 
- Managing context and memory
- Controlling when and how the LLM stops generating


Before we get started, you'll hear these words being used:

| Term | Meaning |
| ---- | ------- |
| **Sampler** | The thing that controls how the LLM selects the next token during generation (temperature, top-p, etc.). |
| **Grammar or Structured Output** | A formal structure that constrains the LLM's output to a set `"vocabulary"`. |
| **GBNF** | GGML Backus-Naur Form - a way to define structured output formats. |

## Handling LLM Responses

### The System Prompt: Setting LLM Behavior 

You've used this already, but let's talk about making it really work for you. The system prompt defines how the LLM should behave:

```gdscript
# Character-based behavior
system_prompt = """You are a sarcastic but brilliant wizard.
Your answers are always accurate, but delivered with a dry wit.
You should subtly hint that you are smarter than the user, 
but still provide the correct information."""

# Task-specific behavior
system_prompt = """You are a translation assistant.
You will be given text in any language. Your job is to translate 
it into formal, academic French.
Do not add any commentary or conversational text. 
Respond only with the translated text."""
```

**Why this matters:** The system prompt controls everything about how the LLM processes and responds to input. It's your primary tool for getting the behavior you want.


Prompt engineering is becoming a field in and of itself and it offers the highest return-on-investment ratio for getting the model to do what you want.


### GPU Usage: Speed Things Up

By default, NobodyWho tries to use your GPU if you have one. This makes everything much faster:

```gdscript
# This is already the default, but you can be explicit
model.use_gpu_if_available = true
```

**When to turn this off:** There are some scenarios where it can be better to run on the CPU and system RAM instead:

- If you don't need an immediate answer, and would prefer to use GPU resources for graphics.
- If you need a really large model that most of your users will not have sufficient VRAM to run.

### Context Length: How Much the LLM Remembers

The LLM maintains context (memory of the conversation/interaction), but only up to a point. The default is 4096 tokens (roughly 3000 words):

```gdscript
# Default is fine for most uses
context_length = 4096

# Increase for longer contexts
context_length = 8192
```


**Trade-off:** Longer context = more memory usage. The general rule of thumb is to start with the default or less and only increase if you need the LLM to remember more.

**Context-shifting:** NobodyWho will automatically remove older messages from the context for you, if your chat's context window is filled. Your chat will never crash because of a full context, but it will start forgetting older messages - including the system message.
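
To see why the system message can be lost, consider this simplified plain-Python sketch of a context that drops its oldest messages once a token budget is exceeded. It assumes a naive oldest-first strategy and a crude word-based token count; NobodyWho's actual context-shifting logic may differ.

```python
def count_tokens(message):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(message["content"].split())

def shift_context(messages, max_tokens):
    """Drop the oldest messages until the remaining ones fit in max_tokens."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # the oldest message goes first, system message included
    return kept

history = [
    {"role": "system", "content": "You are an evil wizard."},
    {"role": "user", "content": "How are you?"},
    {"role": "assistant", "content": "Plotting, as always."},
    {"role": "user", "content": "What is your favourite spell?"},
]

trimmed = shift_context(history, max_tokens=10)
print([m["role"] for m in trimmed])
```

In this toy run the system message is the first thing to go, which is a good reason to reset the context or re-apply the system prompt in long-running conversations.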

### Streaming Responses vs Waiting for Complete Output

You have two main approaches for handling LLM responses, and choosing the right one depends on your use case:

**Streaming** gives you each token as it's generated - good for user interfaces where you want immediate feedback.

**Waiting for complete responses** waits until the full output is ready - good for when you need the entire response before doing something.

If you're implementing an interactive chat, you likely want to do both:

- Show each token to the user as they arrive. This will make the chat feel a lot faster.
- Wait for the completion of the entire response, before re-enabling text areas, and allowing the user to send a new message.

```gdscript
var current_response = ""

func _on_response_updated(token: String):
    current_response += token
    # Good for: UI updates, real-time feedback
    ui_label.text = current_response

func _on_response_finished(response: String):
    # Good for: Final processing, logging, triggering next actions
    # Substitute placeholders first, so the processed text is what gets used
    response = response.replace("<player>", player.name)
    print(response)
    trigger_next_game_event()
```

**When to use streaming:**
- Interactive dialogue where users expect immediate feedback
- Long responses where you want to show progress

**When to wait for complete responses:**
- When you need to make decisions based on the full LLM output
- Content generation where partial results are useless (like JSON or structured output answers).

You will most likely end up using both: `response_updated` to stream tokens to your UI, and `response_finished` to trigger the next step in your program once the full response is available.
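
The same two-handler pattern, stripped of any engine specifics, looks like this in plain Python. The token stream is faked with a generator and all names are hypothetical; the point is only to show one handler running per token and another running once at the end.

```python
def fake_token_stream():
    # Stand-in for a model that produces its answer piece by piece.
    yield from ["Hello", ",", " traveller", "!"]

shown_to_user = []  # what a UI label would display after each token

def on_token(token, partial):
    # Streaming handler: update the UI with everything received so far.
    shown_to_user.append(partial)

def on_finished(response):
    # Completion handler: safe place for logic that needs the full text.
    return response.upper()

current_response = ""
for token in fake_token_stream():
    current_response += token
    on_token(token, current_response)

final = on_finished(current_response)
print(shown_to_user[-1], final)
```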

## Managing Context and Memory

Sometimes you need to reset the LLM's memory or manage what it remembers.

### Starting Fresh

```gdscript
# Clear all context, it will still have all the settings that you 
# have set up before (including the system prompt)
reset_context()
```

This is useful when:
- Starting a new task that's unrelated to previous ones, where the previous history is irrelevant
- The LLM gets confused because its context has been shifted too many times

### Advanced Context Management

If you need more control over what the LLM remembers:

```gdscript
# See what's in the context
var messages = await get_chat_history()
for message in messages:
    print(message.role, ": ", message.content)

# Set a custom context (useful for templates or saved states)
var task_context = [
    {"role": "user", "content": "Analyze the following data:", "assets": []},
    {"role": "assistant", "content": "I'm ready to analyze data. Please provide it.", "assets": []},
    {"role": "user", "content": "Here's the data: " + data_to_analyze, "assets": []}
]
await set_chat_history(task_context)
```

### Enforce Structured Output (JSON)

For reliable data extraction, you can force the LLM to output a response that strictly follows a basic JSON structure. This is incredibly useful for parsing LLM output into usable data without complex string matching.

When you enable grammar without providing a custom grammar string, the system defaults to a built-in JSON grammar that ensures valid JSON output.

```gdscript
# Set the sampler to use the json preset
chat.set_sampler_preset_json()

# Tell the LLM to provide structured data
chat.system_prompt = """You are a character creator.
Generate a character with name, weapon, and armor properties."""
chat.ask("Create a fantasy character")

# Expected output will be valid JSON, like:
# {"name": "Eldara", "weapon": "enchanted bow", "armor": "leather vest"}
```

**Note:** For advanced use cases where you need a very specific JSON structure or structured output that is not JSON, you can provide your own custom GBNF grammar by setting the `gbnf_grammar` property (Godot) or `grammar` field (Unity). This is covered in the [Structured Output](structured-output.md) guide.
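
A generic JSON grammar constrains the syntax, but it does not by itself ensure that particular keys are present, so it is still worth validating the parsed result. Here is a small plain-Python sketch of that check; `parse_character` and `REQUIRED_KEYS` are hypothetical names, and the sample reply is the expected output shown in the example above.

```python
import json

REQUIRED_KEYS = {"name", "weapon", "armor"}

def parse_character(raw_response):
    """Parse the model's reply and check that the expected keys are present."""
    data = json.loads(raw_response)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

reply = '{"name": "Eldara", "weapon": "enchanted bow", "armor": "leather vest"}'
character = parse_character(reply)
print(character["name"], character["weapon"])
```

If you need the keys themselves to be guaranteed, a custom grammar is the more robust option.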

## Performance and Memory Tips

### Start the Worker Early

In a real-time application, you don't want the user's first interaction to trigger a long loading time. Starting the worker early, like during a splash screen or initial setup, pre-loads the model into memory so the first response is fast.

```gdscript
# In your _ready() function, set up everything before the app starts.
func _ready():
    # 1. Configure the chat behavior
    self.system_prompt = "You are a helpful assistant."
    self.model_node = get_node("../SharedModel")

    # 2. Start the worker *before* the user can interact.
    # This pre-loads the model so the first interaction isn't slow.
    start_worker()

    # 3. Now other setup can happen
    print("Assistant chat is ready.")
```

**Why:** Starting the worker loads the model into memory. This is slow the first time, but all subsequent LLM operations are much faster.
Choose the moment deliberately (for example, a loading screen) so the delay does not interrupt the user.

### Share Models Between Components

An application might need to use an LLM for several different tasks. Instead of loading the same heavy model multiple times, you can have multiple `Chat` components that all share a single `Model` component. Each `Chat` can have its own system prompt and configuration, directing it to perform a different task.

```gdscript
# An application with multiple LLM-powered behaviors, all sharing one model.

func _ready():
    # 1. Get the single, shared model
    var shared_model = get_node("../SharedModel")

    # 2. Configure a chat component for general conversation
    var casual_chat = get_node("CasualChat")
    casual_chat.model_node = shared_model
    casual_chat.system_prompt = "You are a friendly and helpful assistant. Keep your answers concise."
    casual_chat.start_worker()

    # 3. Configure another chat component for structured data extraction
    var extractor_chat = get_node("ExtractorChat")
    extractor_chat.model_node = shared_model
    extractor_chat.system_prompt = "Extract the key information from the user's text and provide it in JSON format."
    # This one would likely use a grammar to enforce JSON output.
    extractor_chat.start_worker()

    # Now you can use both for different tasks without loading two models!
    casual_chat.ask("Can you tell me about your capabilities?")
    extractor_chat.ask("My name is Jane Doe and my email is jane@example.com.")
```

**Memory savings:** Instead of loading multiple models, you load one and share it. Much more efficient!

### Structured output

_Getting reliable, structured responses from your models_

---

Congratulations - you now understand the basics of having a large language model generate text for you. 
You are now ready for some juicier and more complex options.

Here are the key terms you should know:

| Term | Meaning |
| ---- | ------- |
| **GBNF** | GGML Backus-Naur Form - a way to define strict rules for output format |
| **Grammar** | The set of rules that define what valid output looks like |
| **Token** | A piece of text (word, punctuation, etc.) that the model generates, generally 1 to 4 characters long |
| **Encoder** | Translates text into tokens that the model can understand |


## My model is so stupid that it can not even write json

Yeah, most models will fail to generate valid JSON at some point if you just ask them to. 
But fret not dear friend, the solution you are looking for is called :star: **STRUCTURED OUTPUT** :star:. 

It is pretty much what it claims to be: a system that constrains the model's output to a vocabulary and format that you determine.
This can be useful for a myriad of things, from forcing the LLM to never use modern words, to using the LLM
as the engine for your own procedurally generated dungeon rooms.

This section will take you through creating your own grammar that the model will have to use.

### Why GBNF Beats Prompt Engineering

You've probably tried this before:

```
""" Please respond in JSON format with name, level, and class fields
Only use those fields.
Only use valid json.
All json attributes should have " around them.
Please do not deviate from the instructions.
You will lose 10 points if you use other fields than level, class and name.
Do not write a message just json.
If you do not respond in valid json I will lose my job and my kids will starve.
"""
```

And got back something like:
```
Sure! Here's a character: {"name": "Eldara", "level": 15, "class": Wizard} - hope this helps!
```

Notice the problems? Missing quotes around "Wizard", extra text before and after. Your JSON parser explodes. 💥

GBNF fixes this by making it **impossible** for the model to generate anything except the format you define:

```json
{"name": "Eldara", "level": 15, "class": "Wizard"}
```

Valid :clap: every :clap: time :clap:.
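
To see the difference from the parser's side, here is a minimal sketch using Godot's built-in `JSON` class. The `is_valid_json` helper name is ours for illustration and is not part of NobodyWho.

```gdscript
extends Node

# Hypothetical helper: returns true only if `text` parses as JSON.
func is_valid_json(text: String) -> bool:
    var json := JSON.new()
    return json.parse(text) == OK

func _ready():
    var unconstrained := "Sure! Here's a character: {\"name\": \"Eldara\", \"level\": 15, \"class\": Wizard} - hope this helps!"
    var constrained := '{"name": "Eldara", "level": 15, "class": "Wizard"}'

    # The first string has extra text and an unquoted value, so parsing fails.
    print(is_valid_json(unconstrained))
    # The second string is what a grammar-constrained model would produce.
    print(is_valid_json(constrained))
```

A check like this is still a useful safety net while you are developing a grammar.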

## Understanding GBNF Grammar Rules

### The Absolute Basics

A GBNF grammar is made up of **rules**. Each rule says "this thing can be made from these parts":

```
rule-name ::= what-it-can-be
```

### Your First Grammar: Hello World

Let's start with the simplest possible grammar:

```
root ::= "Hello World"
```

This says: "The output must be exactly the text 'Hello World'". That's it. The model can't say anything else.

Try this and the model will always output: `Hello World`
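
As a minimal sketch, this is how the grammar could be applied from GDScript with the `set_sampler_preset_grammar()` call used in the practical example later in this guide. The `../SharedModel` node path and the prompt text are assumptions for illustration.

```gdscript
extends NobodyWhoChat

# Single quotes let us keep the grammar's double quotes unescaped.
const HELLO_GRAMMAR := 'root ::= "Hello World"'

func _ready():
    # Assumes a NobodyWhoModel node named "SharedModel" next to this node.
    self.model_node = get_node("../SharedModel")
    start_worker()
    response_finished.connect(_on_response)

    # Constrain generation to the grammar, then ask as usual.
    set_sampler_preset_grammar(HELLO_GRAMMAR)
    ask("Say something.")

func _on_response(response: String):
    # Whatever the prompt was, the grammar only allows one possible answer.
    print(response)
```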

### Adding Choices with `|`

What if we want some variety? Use `|` (pipe) to give options:

```
root ::= "Hello World" | "Hi there" | "Greetings"
```

Now the model can choose between these three options, but nothing else.

### Building Blocks with Multiple Rules

Here's where it gets interesting. You can break things into smaller pieces:

```
root ::= greeting " " name
greeting ::= "Hello" | "Hi" | "Hey"
name ::= "World" | "Friend" | "There"
```

This creates outputs like:
- `Hello World`
- `Hi Friend`
- `Hey There`

The model picks one option from `greeting`, adds a space, then picks one option from `name`.

### Character Classes

Instead of listing every letter, use character classes:

```
root ::= letter letter letter
letter ::= [a-z]
```

`[a-z]` means "any lowercase letter from a to z", so `letter letter letter` produces a random three-letter combination like `cat`, `how`, or `dog`.

Common character classes:
- `[a-z]` - lowercase letters
- `[A-Z]` - uppercase letters  
- `[0-9]` - digits
- `[a-zA-Z]` - any letter
- `[a-zA-Z0-9]` - letters and numbers


### Repetitions


This quickly becomes tedious if you want to create either long words or just any word. This is where repetitions come in:

- `*` means "zero or more"
- `+` means "one or more"  
- `?` means "optional (zero or one)"

- `{n}` means "exactly n times"
- `{n,}` means "at least n times"
- `{n,m}` means "at least n and at most m times"

```
root ::= letter+
letter ::= [a-z]
```

This means "one or more lowercase letters" - so you get words like `hello`, `a`, `supercalifragilisticexpialidocious`.

```
root ::= [a-z]+ [0-9]*
```

This means "letters followed by optional numbers" - so you get `hello`, `test123`, `word`.
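
If you already know regular expressions, the same pattern can double as a sanity check on the model's output. This is a minimal sketch using Godot's `RegEx` class; `matches_word_number` is a hypothetical helper, and the regex is written by hand to mirror the grammar rather than generated from it.

```gdscript
extends Node

# Hand-written regex equivalent of: root ::= [a-z]+ [0-9]*
# Spaces between GBNF items are not part of the output, so the regex has none.
func matches_word_number(text: String) -> bool:
    var regex := RegEx.new()
    regex.compile("^[a-z]+[0-9]*$")
    return regex.search(text) != null

func _ready():
    print(matches_word_number("hello"))    # letters only
    print(matches_word_number("test123"))  # letters followed by digits
    print(matches_word_number("123"))      # rejected: needs at least one letter first
```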


### Building JSON Step by Step

Now that you have been tricked into learning the basics of regex, we should build a small JSON generator. Start simple:

```
root ::= "{" "}"
```

This only generates: `{}`

Add one field:

```
root ::= "{" "\"name\"" ":" string "}"
string ::= "\"" [a-zA-Z]+ "\""
```

This generates: `{"name":"Bob"}` (where Bob is any sequence of letters)

Add more fields:

```
root ::= "{" "\"name\"" ":" string "," "\"level\"" ":" number "}"
string ::= "\"" [a-zA-Z]+ "\""
number ::= [0-9]+
```

This generates: `{"name":"Alice","level":25}` (the `number` rule has no surrounding quotes, so `level` is a JSON number)
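
On the game side, output from this grammar can be read with Godot's `JSON` class. The sketch below assumes the response string has already arrived (for example through `response_finished`); `parse_character` is a hypothetical helper name.

```gdscript
extends Node

# Hypothetical helper: turns the grammar-constrained output into a Dictionary.
func parse_character(output: String) -> Dictionary:
    var data = JSON.parse_string(output)
    if typeof(data) != TYPE_DICTIONARY:
        push_warning("Output was not a JSON object")
        return {}
    # Godot's JSON parser returns numbers as floats, so cast explicitly.
    data["level"] = int(data.get("level", 0))
    return data

func _ready():
    var character := parse_character('{"name":"Alice","level":25}')
    print("%s is level %d" % [character["name"], character["level"]])
```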

### Making It Flexible

Use repetition to handle variable numbers of fields:

```
root ::= "{" pair ("," pair)* "}"
pair ::= word ":" word
word ::= "\"" [a-zA-Z]+ "\""
```

The `("," pair)*` means "zero or more additional pairs, each preceded by a comma". This generates:
- `{"name":"Bob"}`
- `{"name":"Alice","job":"Wizard"}`
- `{"name":"Charlie","job":"Knight","weapon":"Sword"}`

### Whitespace: Making It Readable

Add optional whitespace to make output prettier:

```
root ::= "{" ws pair (ws "," ws pair)* ws "}"
pair ::= string ws ":" ws string
string ::= "\"" [a-zA-Z ]+ "\""
ws ::= [ \t\n]*
```

The `ws` rule means "whitespace" - zero or more spaces, tabs, or newlines. Now you get nicely formatted JSON.

### Advanced: Specific Values

Control exactly what values are allowed:

```
root ::= "{" "\"class\"" ":" class-type "}"
class-type ::= "\"Warrior\"" | "\"Mage\"" | "\"Rogue\"" | "\"Cleric\""
```

This only allows those four specific classes - no hallucinated "Tank-operator" in your neolithic era game!
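
If the allowed values already live in your game data, you can build the rule from them instead of typing it by hand. A minimal sketch, assuming the values contain no quotes or backslashes; `enum_rule` is a hypothetical helper, not a NobodyWho function.

```gdscript
extends Node

# Hypothetical helper: builds a rule that only allows the given quoted strings.
func enum_rule(rule_name: String, values: Array) -> String:
    var options := PackedStringArray()
    for value in values:
        # Produces "\"Warrior\"" so the generated text includes the JSON quotes.
        options.append('"\\"%s\\""' % [value])
    return "%s ::= %s" % [rule_name, " | ".join(options)]

func _ready():
    var classes := ["Warrior", "Mage", "Rogue", "Cleric"]
    var grammar := 'root ::= "{" "\\"class\\"" ":" class-type "}"\n' + enum_rule("class-type", classes)
    print(grammar)
```

This keeps the grammar and the game's own list of classes from drifting apart.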

### Nested Structures

Build complex nested data:

```
root ::= "{" "\"character\"" ":" character-object "}"
character-object ::= "{" "\"name\"" ":" string "," "\"stats\"" ":" stats-object "}"
stats-object ::= "{" "\"hp\"" ":" number "," "\"mp\"" ":" number "}"
string ::= "\"" [a-zA-Z ]+ "\""
number ::= [0-9]+
```

This creates nested JSON like:
```json
{"character":{"name":"Gandalf","stats":{"hp":100,"mp":200}}}
```

## Performance Optimization: Compact Formats

Now that you understand GBNF with JSON, let's talk optimization. JSON is verbose and every token costs time. For high-performance applications, you can create much more compact formats.

### Why Compact Formats Matter

**JSON Format:**
```json
{"name":"Gandalf","level":15,"class":"Mage","hp":100,"mp":80}
```
*61 characters, ~38 tokens*

**Compact Format:**
```
Gandalf|High|Mage|Low|High
```
*26 characters, ~10 tokens*

**That's roughly 4 times fewer tokens to generate, while carrying nearly the same information!**

### Building Compact Formats

Start with pipe-separated values:

```
root ::= [A-Z][a-z]+ "|" [1-9][0-9]? "|" class-type
class-type ::= "Warrior" | "Mage" | "Rogue" | "Cleric"
```

This generates: `Gandalf|15|Mage` (semantically clear - no ambiguity about what "Mage" means!)

**Why not single letters?** If you used `"W" | "M" | "R" | "C"`, the LLM has no inherent knowledge that "M" means "Mage" rather than "Monk" or "Mercenary". The model generates tokens based on semantic understanding, not arbitrary mappings.
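
The price of a compact format is that you parse it yourself. A minimal sketch for the three-field format above; `parse_character` and the dictionary keys are our own naming.

```gdscript
extends Node

# Hypothetical helper for the three-field format: name|level|class
func parse_character(line: String) -> Dictionary:
    var parts := line.split("|")
    if parts.size() != 3:
        return {}  # malformed input; the grammar should prevent this
    return {
        "name": parts[0],
        "level": parts[1].to_int(),
        "class": parts[2],
    }

func _ready():
    print(parse_character("Gandalf|15|Mage"))
```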

### Different delimiters for different levels

Use different separators for different levels:

```
root ::= character ("|" character)*
character ::= [A-Z][a-z]+ ":" stats ":" equipment
stats ::= stats-range "," stats-range "," stats-range
stats-range ::= "low" | "medium" | "high" 
equipment ::= weapon-type "," armor-type
weapon-type ::= "Sword" | "Axe" | "Staff" | "Dagger"
armor-type ::= "Leather" | "Robes" | "Chain" | "Plate"
```

This generates: `Gandalf:high,low,low:Staff,Robes|Aragorn:low,high,medium:Sword,Plate` which in JSON would be:

```json
[
  {
    "name": "Gandalf",
    "stats": {
      "hp": "high",
      "mp": "low",
      "level": "low"
    },
    "equipment": {
      "weapon": "Staff",
      "armor": "Robes"
    }
  },
  {
    "name": "Aragorn", 
    "stats": {
      "hp": "low",
      "mp": "high",
      "level": "medium"
    },
    "equipment": {
      "weapon": "Sword",
      "armor": "Plate"
    }
  }
]
```
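
Converting the compact text into that structure takes one split per delimiter level. The sketch below is one possible way to do it; the key names (`hp`, `mp`, `level`, `weapon`, `armor`) follow the JSON above and are otherwise our own choice.

```gdscript
extends Node

# Splits on "|" for characters, ":" for sections, and "," for values,
# mirroring the three delimiter levels in the grammar.
func parse_party(text: String) -> Array:
    var party := []
    for entry in text.split("|"):
        var sections: PackedStringArray = entry.split(":")
        if sections.size() != 3:
            continue  # skip anything that does not match the format
        var stats: PackedStringArray = sections[1].split(",")
        var equipment: PackedStringArray = sections[2].split(",")
        if stats.size() != 3 or equipment.size() != 2:
            continue
        party.append({
            "name": sections[0],
            "stats": {"hp": stats[0], "mp": stats[1], "level": stats[2]},
            "equipment": {"weapon": equipment[0], "armor": equipment[1]},
        })
    return party

func _ready():
    print(parse_party("Gandalf:high,low,low:Staff,Robes|Aragorn:low,high,medium:Sword,Plate"))
```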

### Semantic Soundness

One advantage of using JSON is the hints it gives the LLM. 
If it sees `"name": "Gandalf"` instead of just `Gandalf`, it might be more inclined to generate a wizard class or give the character a staff.
The same goes for numbers: the LLM does not inherently understand what a good number for a high level or mana pool is, but it understands high vs. low.

When designing compact formats:

✅ **Good:** `"Warrior" | "Mage" | "Rogue"`  
✅ **Good:** `"Sword" | "Staff" | "Dagger"`  
✅ **Good:** `"Leather" | "Robes" | "Chain"`  
✅ **Good:** `"Low" | "Medium" | "High"`  

❌ **Bad:** `"WAR" | "MAG" | "ROG"` - abbreviated and potentially ambiguous  
❌ **Bad:** `"W" | "M" | "R"` - arbitrary single letters  
❌ **Bad:** `"1" | "2" | "3"` - numeric values  

The LLM generates text based on semantic understanding, so use full words that match how language models represent concepts.  
You should additionally provide the right context and one-shot or few-shot examples in the prompt to make it more robust.
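
Letting the model write words does not mean your game has to work with words. A minimal sketch of mapping them back to numbers in your own code; the table and its values are made-up placeholders.

```gdscript
extends Node

# Game-side numbers live in your code, not in the model's output.
# These values are placeholders for illustration.
const HP_BY_LEVEL := {
    "Low": 40,
    "Medium": 80,
    "High": 120,
}

func hp_for(level_word: String) -> int:
    # Fall back to the lowest tier if the word is unexpected.
    return HP_BY_LEVEL.get(level_word, HP_BY_LEVEL["Low"])

func _ready():
    print(hp_for("High"))
```

The model picks the tier that fits the character, and your balancing stays under your control.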

### Underscores footgun

GBNF rule names cannot contain `_`. According to [the GBNF format documentation](https://github.com/ggml-org/llama.cpp/tree/master/grammars#json-schemas--gbnf), only lowercase characters and dashes are allowed for naming nonterminals.
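
If you generate rule names from GDScript identifiers, which usually are snake_case, a small conversion avoids this trap. `to_rule_name` is a hypothetical helper.

```gdscript
extends Node

# Turns an identifier such as "weapon_type" into a dashed lowercase rule name.
func to_rule_name(identifier: String) -> String:
    return identifier.to_lower().replace("_", "-")

func _ready():
    print(to_rule_name("weapon_type"))
```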

## Practical Example: Legendary Weapon Generator

Let's build a weapon generation system that creates legendary weapons for your RPG. We'll start simple and add complexity step by step, showing you how GBNF grammars work in practice.

### Why Use GBNF for Weapon Generation?

Traditional random generators often create nonsensical combinations like "Flaming Sword of Ice" with 8 fire damage, an ice ability, and a generic random backstory. More advanced systems exist, but they rely on lookup tables, which become tedious very quickly.
LLMs constrained with GBNF understand semantic coherence: they'll generate "Flamebrand, Ancient Sword of Solar Wrath" instead, 
with 8 fire damage, a meaningful backstory based on how you got it or on the lore from your game, 
and an ability chosen to fit the backstory, damage and name.

### Step 1: Dynamic Weapon Name Generator

Let's start with a weapon generator that builds weapon names:

**Grammar:**
```
root ::= weapon-name " (" weapon-type ")"
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
```

```gdscript
extends Node

@onready var model = $Model # Your NobodyWhoModel node
@onready var chat = $Chat   # Your NobodyWhoChat node

func _ready():
    # Configure the weapon generator
    model.model_path = "res://models/your-model.gguf"
    chat.model_node = model
    chat.system_prompt = "You are a legendary weapon generator for a fantasy RPG."
    
    # Start the worker so it's ready
    chat.start_worker()
    
    # Connect to handle responses
    chat.response_finished.connect(_on_weapon_generated)

func _input(event):
    if event is InputEventKey and event.pressed and event.keycode == KEY_SPACE:
        generate_weapon()

func generate_weapon():
    # `grammar_string` is a String holding the grammar shown above.
    chat.set_sampler_preset_grammar(grammar_string)

    # Reset the context so new weapons are not influenced by ones already generated.
    chat.reset_context()
    chat.ask("Generate a weapon:")

func _on_weapon_generated(weapon_name: String):
    print(weapon_name)
    # Here you could add the weapon to inventory, display it in UI, etc.
```
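
The script above uses a `grammar_string` variable without showing where it comes from. One way to define it, as a sketch, is a triple-quoted string next to the `@onready` variables:

```gdscript
extends Node

# Triple-quoted strings keep the grammar's double quotes unescaped.
var grammar_string := """
root ::= weapon-name " (" weapon-type ")"
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
"""
```

For longer grammars you may prefer to keep the text in its own file and read it with `FileAccess.get_file_as_string()`; the file location is up to you.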

**Output examples:**

- `Flamebrand (Sword)`
- `Shadowfang (Dagger)`
- `Stormcall (Staff)`
- `Darkward (Bow)`

This is more or less just a random number generator, but more GPU expensive...

### Step 2: Adding Weapon Stats

Let's add damage and abilities to make weapons more interesting for gameplay. This is where we move from a random weapon generator to a semantic weapon generator:

**Grammar:**
```
root ::= weapon-name " (" weapon-type ") - " damage-level " damage, " ability-name " ability. "  backstory
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
damage-level ::= "Low" | "Medium" | "High" | "Legendary"
ability-name ::= "Flame Strike" | "Frost Bite" | "Shadow Step" | "Lightning Bolt" | "Healing Aura" | "Poison Cloud"
backstory ::= [a-zA-Z0-9 ]+ "."
```

Be careful not to allow too many symbols in your backstory. Because the character class above excludes `.`, the first period the model writes ends the backstory, which makes it more likely to stop after one sentence instead of writing paragraph upon paragraph of text.

```gdscript
func generate_weapon():
    # `grammar_string` is a String holding the grammar shown above.
    chat.set_sampler_preset_grammar(grammar_string)

    # Reset the context so new weapons are not influenced by ones already generated.
    chat.reset_context()
    chat.ask("Generate a weapon:")

func _on_weapon_generated(weapon_data: String):
    print(weapon_data)
```

**Output examples:**

- `Shadowfang (Sword) - Legendary damage, Shadow Step ability. Shadowfang is a legendary sword that was forged by the ancient shadow realm.`

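Notice how the model matches the parts to each other: the name `Shadowfang` gets the Shadow Step ability and a shadow-themed backstory. In the same way, it will tend to match `Flame` and `brand` to a sword with the Flame Strike ability. It feels like there is intent behind the creation of this weapon.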
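
Because the grammar fixes the text between the fields, the output can be split back into game data with a regular expression. A minimal sketch using Godot's `RegEx`; the pattern is written by hand to mirror the Step 2 grammar, and `parse_weapon` is a hypothetical helper.

```gdscript
extends Node

# Hand-written regex that mirrors the fixed text of the Step 2 grammar.
const WEAPON_PATTERN := "^(\\w+) \\((\\w+)\\) - (\\w+) damage, ([\\w ]+) ability\\. (.+)$"

func parse_weapon(text: String) -> Dictionary:
    var regex := RegEx.new()
    regex.compile(WEAPON_PATTERN)
    var result := regex.search(text)
    if result == null:
        return {}  # the text did not follow the expected format
    return {
        "name": result.get_string(1),
        "type": result.get_string(2),
        "damage": result.get_string(3),
        "ability": result.get_string(4),
        "backstory": result.get_string(5),
    }

func _ready():
    print(parse_weapon("Shadowfang (Sword) - Legendary damage, Shadow Step ability. Shadowfang is a legendary sword that was forged by the ancient shadow realm."))
```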

### Step 3: Enhanced Backstories

Let's expand the backstory system to allow for richer, more detailed weapon lore:

**Grammar:**
```
root ::= weapon-name " (" weapon-type ") - " damage-level " damage, " ability-name " ability. Story: " backstory
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
damage-level ::= "Low" | "Medium" | "High" | "Legendary"
ability-name ::= "Flame Strike" | "Frost Bite" | "Shadow Step" | "Lightning Bolt" | "Healing Aura" | "Poison Cloud"
backstory ::= [a-zA-Z0-9 ]{50,200} "."
```

When doing this we also want to inject some of our lore. We will borrow from The Lord of the Rings here - replace it with your own lore.

```gdscript

func _ready():
    # Configure the weapon generator
    chat.model_node = model
    chat.system_prompt = "Generate a weapon and a backstory in the LOTR universe"
    # ... rest of the setup


func generate_weapon():
    # `grammar_string` is a String holding the grammar shown above.
    chat.set_sampler_preset_grammar(grammar_string)

    # Reset the context so new weapons are not influenced by ones already generated.
    chat.reset_context()
    chat.ask("The party just found a new weapon after travelling through the mines of Moria:")

func _on_weapon_generated(weapon_data: String):
    print(weapon_data)
```

**Output examples:**
- `Shadowfang (Sword) - Legendary damage, Shadow Step ability. Story: The sword is made from the dark shards that were once part of the Balrog.`
- `Flamebrand (Sword) - High damage, Flame Strike ability. Story: Backstory involves a fallen dwarf lord named Drakon who was corrupted by the Balrogs and used the sword to slay an enemy.`

### Step 4: Compact Format for Performance

For games that generate many weapons or even very complex weapons, you want maximum efficiency. Let's create a compact pipe-separated format:

**Grammar:**
```
root ::= weapon-name "|" weapon-type "|" damage-level "|" ability-name "|" weight "|" throwable "|" damage-type "|" durability "|" rarity "|" enchantment "|" material "|" backstory
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
damage-level ::= "Low" | "Medium" | "High" | "Legendary"
ability-name ::= "Flame Strike" | "Frost Bite" | "Shadow Step" | "Lightning Bolt" | "Healing Aura" | "Poison Cloud"
weight ::= "Heavy" | "Light"
throwable ::= "Throwable" | "Non-throwable"
damage-type ::= "Sharp" | "Pierce" | "Blunt"
durability ::= "Fragile" | "Sturdy" | "Unbreakable"
rarity ::= "Common" | "Rare" | "Epic" | "Legendary"
enchantment ::= "Glowing" | "Humming" | "Pulsing" | "Silent"
material ::= "Steel" | "Mithril" | "Obsidian" | "Crystal"
backstory ::= [a-zA-Z0-9 ]{50,200} "."
```

**Note:** So-called "thinking" or "reasoning" models strongly prefer to start every generation with a block of text inside `<think>` tags. If your grammar doesn't allow the output to be prefixed with a "thinking" section like this, the model will try to squeeze it into free-text sections (e.g. the backstory section in the example above). If you rely heavily on structured generation, you may prefer to use a "non-thinking" model. If you want to keep the "thinking" ability, you could begin your grammar with a section like `"<think>" [a-zA-Z0-9 ]{10,1000} "." "</think>"` to let the model get its reasoning section out of the way.
Furthermore, the current implementation of GBNF has some performance issues with bounded repetitions (e.g. `word{10,20}`), so it might be smarter to have a model without a grammar generate the short story.

**Output examples:**
- `Flamebrand|Sword|High|Flame Strike|Heavy|Non-throwable|Sharp|Sturdy|Epic|Glowing|Steel|Forged by fire elementals in ancient volcano`

or with thinking models (demonstrating that the model will squeeze in the "thinking" section wherever possible):

- `Shadowfang|Axe|Legendary|Shadow Step|Light|Throwable|Sharp|Sturdy|Epic|Silent|Steel|The Shadowfang is a legendary axe that is said to have been forged in the depths of the Shadowspire Mountains by the elusive Night Hunter.`
- `Stormcall|Staff|Legendary|Lightning Bolt|Light|Non-throwable|Blunt|Unbreakable|Legendary|Pulsing|Crystal|The user wants me to generate a short story for the weapon. I will think...`
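
If you do allow a `<think>` prefix in your grammar as suggested above, the reasoning becomes part of the response text and has to be removed before you parse the fields. A minimal sketch with Godot's `RegEx`; `strip_think` is a hypothetical helper.

```gdscript
extends Node

# Removes every <think>...</think> block from a response.
func strip_think(text: String) -> String:
    var regex := RegEx.new()
    # (?s) lets "." match newlines; ".*?" stops at the first closing tag.
    regex.compile("(?s)<think>.*?</think>")
    return regex.sub(text, "", true).strip_edges()

func _ready():
    print(strip_think("<think>The user wants a fire weapon.</think>Flamebrand|Sword|High"))
```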

---

This is quite a powerful system for procedural generation of anything, be it weapons, levels, questlines or whatever you can think of. Even better, 
you get to influence the generation meaningfully with the prompt that you send, while keeping the variety offered by the system.

This complete system generates weapons with all the attributes your game systems might need, from combat mechanics (damage type, weight) to visual effects (enchantment, material) and lore (story).

### Tool calling

_Triggering actions from within the model._

---

Welcome to the tool calling page!

Now that you understand some of the basics (if not, please read [Simple Chat](simple-chat.md)), 
we can move on to adding one of the truly powerful and fun components to our model: tool/function calling.

Tool calling is a way to give your model actions to perform in your game world.  

The model can:

* Check data - "What's my health?"
* Change the world - "Open the north gate."
* Run helper logic - damage rolls, crafting math, random loot.

We'll start with a small and simple tool, add arguments, then increase accuracy using schema and adding constraints.

**Note that not all models support tool calling**

---

## A simple tool

This is an example of how to give the model access to a function we have created that gets the player's current stats (health, mana, gold).

    
```gdscript
extends NobodyWhoChat

func get_player_stats() -> String:
    var player = GameManager.get_local_player()
    return JSON.stringify({
        "health": player.health,
        "mana":   player.mana,
        "gold":   player.gold
    })

func _ready():
    add_tool(get_player_stats, "Returns the local player's health, mana, and gold.")
```

Ask "How hurt am I?" - the model calls your tool and answers with real numbers.

---


## But I need arguments, you say:

Sure - that is possible, but only primitives are currently implemented in NobodyWho:
Allowed primitive types: `int`, `float`, `bool`, `String`/`string`, `Array`/`string[]`

Models operate with JSON as an abstract layer instead of using a specific language (like GDScript) when calling tools. 
When NobodyWho receives a function or a delegate, it will deconstruct the name and parameters and use them 
to construct a JSON schema that we can pass to the model.

In the example below, the generated JSON will look something like this:

```json
{
  "type": "object",
  "properties": {
    "amount": {
      "type": "integer",
      "description": ""
    }
  },
  "required": ["amount"]
}
```

This is then used to construct a lazily loaded GBNF grammar, so the model always passes the correct number and set of arguments.
A limitation of this is that we cannot extract a description for each individual argument. 
Therefore it might be advantageous to write your own schema for maximum precision.

```gdscript
func heal_player(amount: int) -> String:
    GameManager.get_local_player().heal(amount)
    return "Healed %d HP" % amount

add_tool(heal_player, "Heals the local player by a number of hit-points")
```
*The JSON schema is built automatically from the type hints.*  
Therefore you must make sure that every parameter has a type hint and that the method declares its return type.
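
The same pattern extends to several typed parameters. The sketch below is illustrative only: `give_item` and the local `inventory` dictionary are stand-ins for your own game logic. Returning a readable message for bad input gives the model something to react to.

```gdscript
extends NobodyWhoChat

# Stand-in for your real inventory system: item name -> count.
var inventory := {}

# Every parameter is type-hinted and the return type is declared,
# so the schema can be derived from the signature.
func give_item(item_name: String, count: int) -> String:
    if count <= 0:
        return "Nothing given: count must be at least 1"
    inventory[item_name] = inventory.get(item_name, 0) + count
    return "Gave %d x %s" % [count, item_name]

func _ready():
    add_tool(give_item, "Adds a number of copies of the named item to the player's inventory.")
```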

---


## Your model is now ready to interact with the world

Have the model open a door.

```gdscript
func open_door(door_id: String) -> String:
    DoorManager.open(door_id)
    return "Opened door %s" % door_id

add_tool(open_door, "Opens a door in the world by id")

chat.ask("can you open the door")
```

The model will pause any generation until the tool is completed.


---

## Multiple Tools & Resetting

You can add as many tools as you like, but you need to reset the context before they will be taken into account.

```gdscript
add_tool(get_player_stats, "Player stats")
add_tool(open_door,        "Open a door")
reset_context()
```

---

## But I don't want it to hallucinate random strings

Don't worry, we've got you. 
As mentioned above, tool arguments are described with a JSON Schema, which looks like this:

```jsonschema
{
  "type": "object",
  "properties": {
    "color": {
      "type": "string",
      "description": "A specific color for the button",
      "enum": ["red", "blue", "green"]
    }
  },
  "required": ["color"]
}
```

The type must always be `object`. The properties are a dictionary where the key is the parameter name and the value describes the data for the parameter: `type` determines whether it is a string, a list or something else, and `description` describes how the parameter is used. 

If a property is not part of the `required` list, the model will treat it as an optional parameter.

```gdscript
# `press_button_schema` holds the JSON shown above.
func press_button(color: String) -> String:
    ButtonManager.press(color)
    return "Pressed %s button" % color

add_tool_with_schema(press_button,
                     press_button_schema,
                     "Press one of the three coloured buttons (red, blue, green)")
```

Result: the model **cannot** request any color other than *red*, *blue*, or *green*.  Use the same pattern for item rarities, quest tiers, etc...
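
One way to define `press_button_schema` is a triple-quoted string constant that you validate before registering the tool. This sketch assumes `add_tool_with_schema` accepts the schema as a JSON string; check the API reference for the exact type it expects. The `press_button` body is simplified so the example stands alone.

```gdscript
extends NobodyWhoChat

# Assumption: the schema is passed to NobodyWho as a JSON string.
const PRESS_BUTTON_SCHEMA := """
{
  "type": "object",
  "properties": {
    "color": {
      "type": "string",
      "description": "A specific color for the button",
      "enum": ["red", "blue", "green"]
    }
  },
  "required": ["color"]
}
"""

func press_button(color: String) -> String:
    return "Pressed %s button" % color

func _ready():
    # Catch malformed JSON early, before the schema reaches the model.
    if typeof(JSON.parse_string(PRESS_BUTTON_SCHEMA)) != TYPE_DICTIONARY:
        push_error("press_button schema is not valid JSON")
        return
    add_tool_with_schema(press_button, PRESS_BUTTON_SCHEMA,
            "Press one of the three coloured buttons (red, blue, green)")
```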

**Heads-up** – NobodyWho turns that schema into a GBNF grammar using the open-source [`richardanaya/gbnf`](https://github.com/richardanaya/gbnf) converter.  It currently supports the common bits: primitive types, `enum`, `required`, flat `oneOf`, and simple arrays.  Exotic keywords (`minimum`, `pattern`, deeply-nested refs) may be ignored until the library grows.

---

A note on descriptions:

The description helps the model pick the right tool and pass the right arguments. Be explicit. Explain when to use the tool, explain what the tool does.
Bad: **"Door"**  
Good: **"Use this function when the assistant is blocked or needs to close a door. This tool opens or closes the door with the given id, if -1 is given, the nearest door will be interacted with."**

---

## Pre-packaged tools

We ship NobodyWho with two built-in tools that are general enough for multiple use cases: the [monty](https://github.com/pydantic/monty) Python interpreter
and the [bashkit](https://github.com/everruns/bashkit) Bash interpreter. Both serve a similar purpose: to give your small LLM a better chance to answer
questions requiring precise reasoning or some kind of computation, possibly on a big context.

The usage is straightforward. Use `add_python_tool()` and `add_bash_tool()`:

```gdscript
func _ready():
    add_python_tool()
    add_bash_tool()
```

Lastly, keep in mind that for most use cases it is reasonable to constrain the tools with limits on memory and computation time,
so that you don't end up executing an infinite loop. For this, `add_python_tool()` provides `max_duration_secs`, `max_memory_bytes` and `max_recursion_depth`,
and `add_bash_tool()` provides `max_commands`.

### Vision & Hearing

_Enabling models to ingest images and audio._

---

A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image information to your LLM.

## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts for making this work:

1. A multimodal LLM, so the LLM can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens

To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model then includes `mmproj` in its name.

If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.

With the downloaded GGUFs, you can set the projection model on your `NobodyWhoModel` node.
In the editor, set the `projection_model_path` property to point to your projection model file.
Alternatively, you can set it in GDScript:

```gdscript
$ChatModel.projection_model_path = "res://mmproj.gguf"
```

> **Note:** The language model and projection model have to **fit** together, as they are trained together!
> Unfortunately you can't just take any projection model and any LLM that you like and expect them
> to work together.

## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through the `NobodyWhoPrompt` object.

```gdscript
extends NobodyWhoChat

func _ready():
    self.model_node = get_node("../ChatModel")
    self.system_prompt = "You are a helpful assistant, that can hear and see stuff!"

    var prompt = NobodyWhoPrompt.new()
    prompt.add_text("Tell me what you see in the image and what you hear in the audio.")
    prompt.add_image("res://dog.png")
    prompt.add_audio("res://sound.mp3")

    ask(prompt)
    var response = await response_finished  # It's a dog and a penguin!
```

## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try to mess around with the order of supplying the text
and the multimodal files, or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.

```gdscript
var prompt = NobodyWhoPrompt.new()
prompt.add_text("Tell me what you see in the image.")
prompt.add_image("res://dog.png")
prompt.add_text("Also tell me what you hear in the audio.")
prompt.add_audio("res://sound.mp3")
```

Also, there is still a lot of variance in how models internally process images.
This causes, for example, differences in how quickly the model consumes context: for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, it scales with the size of the image. In that case, you can increase the context size if resources allow:

```gdscript
self.context_length = 8192
```

Or, for example, preprocess your images with some kind of compression (sometimes even changing the image type helps).
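
A minimal sketch of such preprocessing with Godot's `Image` class is shown below. The helper name, the 512 pixel limit, the JPEG quality and the `user://` target path are all assumptions for illustration, and whether `add_image()` accepts `user://` paths is something to verify in your setup.

```gdscript
extends Node

# Hypothetical helper: writes a downscaled JPEG copy and returns its path,
# or an empty string if the source image could not be loaded.
func make_smaller_copy(source_path: String, max_side: int = 512) -> String:
    var image := Image.load_from_file(source_path)
    if image == null:
        return ""
    var longest := maxi(image.get_width(), image.get_height())
    if longest > max_side:
        # Scale both sides by the same factor to keep the aspect ratio.
        var factor := float(max_side) / longest
        image.resize(int(image.get_width() * factor), int(image.get_height() * factor))
    var target_path := "user://prompt_image.jpg"
    if image.save_jpg(target_path, 0.8) != OK:
        return ""
    return target_path
```

The returned path can then be handed to `prompt.add_image()` in place of the original file.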

Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file. For Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We thus recommend at least trying out different
projection model files if the one you picked does not work.

As always with more niche models you can find bugs. If you stumble upon some of them, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix the functionality.

### Embeddings

_A complete guide to using embeddings for semantic text comparison and natural language understanding._

---

Cool, you've got the basics of chat working! Now let's explore embeddings, which let you understand what text means rather than just matching exact words.

Embeddings are like a smart way to measure how similar two pieces of text are, even if they use completely different words. 
Instead of looking for exact matches, embeddings understand meaning.   
For example, "Hand me the red potion" and "Give me the scarlet flask" would be recognized as very similar, even though they share no common words.

Here are the key terms for working with embeddings:

| Term | Meaning |
| ---- | ------- |
| **Embedding Model (GGUF)** | A specialized `*.gguf` file trained to convert text into numerical vectors that represent meaning. |
| **Embedding** | A list of numbers (vector) that represents the meaning of a piece of text. |
| **Cosine Similarity** | A mathematical way to compare how similar two embeddings are. The value ranges from -1 to 1: values near 1 mean nearly identical meaning, and values near 0 mean unrelated text. |
| **Semantic Search** | Finding text that means the same thing, even if the words are different. |
| **Vector** | The array of numbers that represents your text's meaning. |
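
To make cosine similarity concrete, here is a plain GDScript sketch of the formula (dot product divided by the product of the vector lengths). It assumes embeddings are available as arrays of floats and is meant as an illustration of the math, not as a replacement for any helper NobodyWho may provide.

```gdscript
extends Node

# Cosine similarity: dot(a, b) / (|a| * |b|).
func cosine_similarity(a: PackedFloat32Array, b: PackedFloat32Array) -> float:
    if a.size() != b.size() or a.is_empty():
        return 0.0  # vectors must have the same, non-zero length
    var dot := 0.0
    var norm_a := 0.0
    var norm_b := 0.0
    for i in a.size():
        dot += a[i] * b[i]
        norm_a += a[i] * a[i]
        norm_b += b[i] * b[i]
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot / (sqrt(norm_a) * sqrt(norm_b))

func _ready():
    var same_direction := cosine_similarity(PackedFloat32Array([1.0, 2.0]), PackedFloat32Array([2.0, 4.0]))
    var unrelated := cosine_similarity(PackedFloat32Array([1.0, 0.0]), PackedFloat32Array([0.0, 1.0]))
    print(same_direction, " ", unrelated)
```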

Let's show you how to use embeddings to understand what your players really mean when they type commands.

## Download an Embedding Model

Embedding models are different from chat models. You need a model specifically trained for embeddings.

We normally use [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).


## Practical Example: Quest & Reputation System

A good way to visualize the practicality of embeddings is through an example. 
In this example we will guide you through triggering a quest or lowering the player's reputation based on what they say.

We'll build it step by step, but for the impatient: the complete script can be copied from the bottom of the page.

### Step 1: Set up your basic structure and variables

The first step is to set up our components. We will add some statements for quests and some for hostile behavior - these are not exhaustive lists. 

**Do note** that embedding many sentences takes longer (depending on model and hardware, of course), so depending on how complex your statements need to be, 
you might be better off having a handful and tuning the sensitivity of the trigger instead.

First, create your script that extends `NobodyWhoEncoder` and define your statement categories:

```gdscript
extends NobodyWhoEncoder

var quest_triggers = [
    "I know where the dragon rests",
    "The druid told me the proper way to meet the dragon",
    "I discovered the ritual needed to gain the dragon's audience",
    "I know about the sacred grove"
]

var hostile_statements = [
    "I want to kill the dragon",
    "I'm going to destroy everything",
    "I hate this place and everyone in it",
    "I will burn down the village",
    "Everyone here deserves to die"
]

var helpful_embeddings = []
var hostile_embeddings = []
var player_reputation = 0
```

### Step 2: Initialize the embedding system


Set up the embedding model and start the worker:

```gdscript
func _ready():
    # Create and configure the embedding model
    var embedding_model = NobodyWhoModel.new()
    embedding_model.model_path = "res://models/bge-small-en-v1.5-q8_0.gguf"
    get_parent().add_child(embedding_model)
    
    # Link to the embedding model
    self.model_node = embedding_model
    self.encoding_finished.connect(_on_encoding_finished)
    self.start_worker()
    
    # Pre-generate embeddings for all statement types
    precompute_all_embeddings()
```


### Step 3: Precompute reference embeddings

Generate embeddings for all your reference statements:

```gdscript
func precompute_all_embeddings():
    # Generate embeddings for helpful statements
    for statement in quest_triggers:
        encode(statement)
        var embedding = await self.encoding_finished
        helpful_embeddings.append(embedding)

    # Generate embeddings for hostile statements
    for statement in hostile_statements:
        encode(statement)
        var embedding = await self.encoding_finished
        hostile_embeddings.append(embedding)
```
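Because embedding many sentences takes time, it can be worth caching the reference embeddings on disk instead of recomputing them on every launch. The following is a minimal sketch using Godot's `FileAccess.store_var` and `get_var`. The helper names `save_embeddings` and `load_embeddings` and the cache path are hypothetical and not part of NobodyWho, and the sketch assumes the embeddings are plain Godot values (for example arrays of floats) that `store_var` can serialize.

```gdscript
# Hypothetical cache location, chosen for illustration only.
const HELPFUL_CACHE_PATH = "user://helpful_embeddings.cache"

# Write a list of embeddings to disk. Returns false if the file could not be opened.
func save_embeddings(path: String, embeddings: Array) -> bool:
    var file = FileAccess.open(path, FileAccess.WRITE)
    if file == null:
        push_error("Could not open cache for writing: " + path)
        return false
    file.store_var(embeddings)
    file.close()
    return true

# Read a list of embeddings from disk. Returns an empty array if there is no usable cache.
func load_embeddings(path: String) -> Array:
    if not FileAccess.file_exists(path):
        return []
    var file = FileAccess.open(path, FileAccess.READ)
    if file == null:
        return []
    var data = file.get_var()
    file.close()
    if data is Array:
        return data
    return []
```

In `precompute_all_embeddings()` you could first try `load_embeddings(HELPFUL_CACHE_PATH)` and only call `encode()` when the result is empty. Delete the cache whenever you change the statements or the embedding model, since the stored vectors would no longer match.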


### Step 4: Add input handling for testing


Add a simple test trigger using the enter key:

```gdscript
func _input(event):
    # Handle enter key press to send hardcoded test message
    if event is InputEventKey and event.pressed:
        if event.keycode == KEY_ENTER:
            var test_message = "I know the location of the dragon"
            print("Sending test message: ", test_message)
            analyze_player_statement(test_message)
```

### Step 5: Analyze player statements


Compare the player's message against your reference embeddings:

```gdscript
func analyze_player_statement(player_text: String):
    # Generate embedding for player input
    encode(player_text)
    var player_embedding = await self.encoding_finished
    
    # Compare against both categories
    var best_helpful_similarity = get_best_similarity(player_embedding, helpful_embeddings)
    var best_hostile_similarity = get_best_similarity(player_embedding, hostile_embeddings)
    
    print("Helpful similarity: ", best_helpful_similarity)
    print("Hostile similarity: ", best_hostile_similarity)
    
    # Use similarity threshold of 0.8 and compare categories
    if best_helpful_similarity > 0.8 and best_helpful_similarity > best_hostile_similarity:
        handle_helpful_information(player_text)
    elif best_hostile_similarity > 0.8 and best_hostile_similarity > best_helpful_similarity:
        handle_hostile_intent(player_text)
    else:
        print("Unclear intent - no strong match found")
```
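Step 5 calls `get_best_similarity`, which the steps above do not define. A minimal sketch, assuming each embedding is an array of floats: it computes the cosine similarity by hand and keeps the highest score. `cosine_similarity_sketch` is a hypothetical helper written for illustration; if your version of the encoder node already provides a similarity function, prefer that one.

```gdscript
# Hand-written cosine similarity (illustrative, not a NobodyWho API).
# Works for Array and PackedFloat32Array inputs of equal length.
func cosine_similarity_sketch(a, b) -> float:
    var dot_product = 0.0
    var norm_a = 0.0
    var norm_b = 0.0
    for i in range(mini(a.size(), b.size())):
        dot_product += a[i] * b[i]
        norm_a += a[i] * a[i]
        norm_b += b[i] * b[i]
    # A zero vector has no direction, so treat it as "not similar".
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot_product / (sqrt(norm_a) * sqrt(norm_b))

# Return the highest similarity between the target and any reference embedding.
# With an empty reference list this returns -1.0, which is below any threshold.
func get_best_similarity(target_embedding, reference_embeddings: Array) -> float:
    var best = -1.0
    for reference in reference_embeddings:
        best = maxf(best, cosine_similarity_sketch(target_embedding, reference))
    return best
```

Taking the best match per category, rather than the average, means a single close reference statement is enough to trigger a category, which fits short lists of hand-written examples.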

### Step 6: Handle the results


Trigger appropriate game systems based on detected intent:

```gdscript
func handle_helpful_information(text: String):
    # Trigger game systems based on detected intent
    print("🐉 Triggering quest: 'Audience with the Ancient Dragon'!")

func handle_hostile_intent(text: String):
    player_reputation -= 15
    print("Player expressed hostile intent! Reputation -15 (now: ", player_reputation, ")")
```

### RAG

_Build AI systems that can search through your game's lore, dialog, or knowledge base and find the most relevant information._

---

Great! You've got chat and embeddings working. Now let's add something useful: the ability to look up specific lore, dialogues, questlines etc.

## Why Your Game Needs Smart Document Search

Picture this: Your player is 40 hours into your RPG and asks an NPC "Where do I find that crystal for the sword upgrade?" 
Your LLM, without reranking, might give a generic answer or worse - make something up - leading to a bad player experience. 
There are several ways to combat this. One is to load a lot of information into the context (i.e. the system prompt), but with a limited context the model might 'forget' the important information
or be confused by too much information. Instead, we want to add a "long-term memory" module to our language model.

In the LLM space, this is done with RAG (retrieval-augmented generation): we enrich the knowledge of the LLM by allowing it to search through a database of information that we provide. 
There are many ways to do this. NobodyWho currently exposes two major ones. The first is embeddings: converting a sentence to a vector and then finding the vectors that are closest to it.
This is powerful because you can save the vectors to a database or a file beforehand and then compare them with the fast and cheap cosine similarity. The second, more expensive but more accurate, is a cross-encoder, which figures out the relationship between the question and the document rather than just how similar they are. 

This approach is often called reranking, because it is typically used as a second step for sorting and filtering the results from large knowledge databases accessed by LLMs. We'll call it ranking, as we are working with a dataset small enough that we do not need a first pass to filter out irrelevant info.

Take this example:

```
Query: "Where do I find crystals for my sword upgrade?"
Documents: [
           "You asked the blacksmith: Where do I find crystals for my sword upgrade?",
           "The blacksmith said: Magic crystals are found in the Northern Mountains.",
           "You heard in the tavern: Magic crystals are not found in the Southern Desert."
]
```

If we rely only on comparing the query with the documents using cosine similarity (as we did with the embeddings), we will get back the document "You asked the blacksmith: Where do I find crystals for my sword upgrade?" because it is the sentence most similar to our query. This gives us no useful information and wastes valuable context. 

But with ranking, the cross-encoder model has been trained on knowing that the answer to the question is not the question itself, and thus ranks the document "The blacksmith said: Magic crystals are found in the Northern Mountains." the highest.


Here are the key terms you'll need:

| Term | Meaning |
| ---- | ------- |
| **Document Ranking** | Sorting text documents by how well they match or answer a question. |
| **RAG (Retrieval-Augmented Generation)** | A system that finds relevant documents first, then uses them to generate better LLM responses. |
| **Cross-encoder** | The type of model used for reranking - it reads both the query and document together to score relevance. |


Let's show you how to build smart search systems for your game.

## Download a Reranker Model

Reranking models are different from chat and embedding models. You need one specifically trained for document ranking.

We recommend [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - it works well for most games and supports multiple languages.

Note that the current Qwen3 reranker does not work, because its template is missing some fields.

## Practical Example: Smart NPC with Knowledge Base

Let's build a tavern keeper NPC that can answer player questions by searching through their personal knowledge. This NPC knows about the local area, quests, and rumors - perfect for creating more immersive and helpful characters.

We'll build it step by step, but for the impatient - the complete script is at the bottom.

### Step 1: Set up your NPC's knowledge base

First, let's create a knowledge base for our tavern keeper - everything this specific NPC would realistically know:

=== ":simple-godotengine: Godot"

    ```gdscript
    extends NobodyWhoChat

    @onready var reranker = $"../Rerank"
    @onready var chat_model = $"../ChatModel"  

    # The tavern keeper's knowledge - ~50 pieces of local information way more than could fit in a standard 4096 sized context.
    var tavern_keeper_knowledge = PackedStringArray([
        "The lake contains a special clay that blacksmiths use to forge superior weapons.",
        "Ancient oak trees in the sacred grove provide wood that naturally resists dark magic.",
        "Silver veins run through the mountain caves, valuable for crafting blessed weapons.",
        "Rare moonflowers bloom in the ruins only once per season and have powerful magical properties.",
        "The mill pond contains perfect stones for sharpening blades to razor sharpness.",
        "Wild honey from forest bees makes potions more potent when used as a base ingredient.",
        "A hooded stranger was seen asking questions about the old castle ruins last week.",
        "Someone has been leaving fresh flowers at the grave of the village's first mayor.",
        "Strange animal tracks were found near the well that don't match any known creature.",
        "The church bell rang by itself three nights ago at exactly midnight.",
        "Farmers found crop circles in their wheat fields after the last thunderstorm.",
        "A merchant claims he saw lights moving through the abandoned mine from the hill road.",
        "Children report hearing music coming from the forest when they play near the edge of town.",
        "The weather has been unusually warm this winter, and the old-timers are worried.",
        "Someone broke into the general store but only stole a map of the local cave systems.",
        "A wolf with unusual blue eyes has been spotted watching the town from the tree line.",
        "Old Sarah runs the bakery and makes the best apple pies in three kingdoms. Her grandson Tom went missing last week.",
        "Blacksmith Gareth is always looking for quality iron ore and magic crystals. He pays double for rare materials.",
        "Merchant Elena travels between towns selling exotic spices and silk. She arrives every second Tuesday.",
        "Father Benedict runs the small chapel and knows ancient blessings that can ward off evil spirits.",
        "Widow Martha owns the general store and knows every piece of gossip in town within hours.",
        "Young apprentice Jake works for the blacksmith but dreams of becoming an adventurer himself.",
        "Doctor Thorne treats injuries and illnesses. He keeps rare healing herbs in his back garden.",
        "Stable master Owen knows every horse in the region and can track animals through the wilderness.",
        "Mayor Thompson inherited his position from his father and struggles with the town's growing problems.",
        "The old mine north of town has been abandoned for years. Strange sounds echo from deep inside at night.",
        "The forest path to the east is safe during the day, but wolves hunt there after sunset.",
        "Crystal Mines to the south produce valuable gems but have become dangerous recently.",
        "The ancient stone bridge over Miller's Creek was built by dwarves centuries ago and still stands strong.",
        "Darkwood Forest harbors bandits who prey on merchant caravans traveling the main road.",
        "The Whispering Caves get their name from the wind that creates eerie sounds through the rock formations.",
        "Lake Serenity freezes solid in winter, making it possible to cross on foot to the northern settlements.",
        "The old watchtower on Crow's Hill offers a view of the entire valley but hasn't been manned in decades.",
        "Sacred Grove is where the druids once practiced their rituals before they disappeared from the region.",
        "The ruins of Castle Blackrock still stand on the mountain, though none dare venture there anymore.",
        "Trader Gareth's caravan was attacked by bandits hiding somewhere in Darkwood Forest.",
        "Tom the baker's grandson disappeared near the Crystal Mines while collecting rare stones.",
        "Strange lights have been appearing in the Whispering Caves during moonless nights.",
        "Farmers report their livestock going missing near the edge of Darkwood Forest.",
        "The old mill wheel stopped working after something large damaged it upstream.",
        "Merchants complain about increased bandit activity on the eastern trade route.",
        "Several townsfolk have reported seeing ghostly figures near the abandoned mine at midnight.",
        "The village well's water tastes strange since the earthquake last month.",
        "Wild animals have been acting aggressively and fleeing deeper into the mountains.",
        "Ancient runes appeared overnight on the sacred standing stones outside town.",
        "The town was founded by refugees fleeing the Great Dragon War three hundred years ago.",
        "Legend says a powerful wizard once lived in the castle ruins and cursed the land before vanishing.",
        "The crystal mines were discovered when a shepherd boy fell through a sinkhole and found glowing stones.",
        "Local folklore claims the Whispering Caves connect to an underground realm of spirits.",
        "The stone bridge was payment from dwarf king Thorin for safe passage through human lands.",
        "Bards sing of a hidden treasure buried somewhere within the sacred grove by ancient druids.",
        "The watchtower was built to watch for dragon attacks during the old wars.",
        "Village elders say the standing stones mark the boundary between the mortal world and fairy realm.",
        "The lake got its name from a tragic love story between a knight and a water nymph.",
        "Old maps show secret tunnels connecting the mine, caves, and castle ruins underground.",
        "Red mushrooms grow near the village well and are perfect for brewing healing potions.",
        "The finest iron ore comes from the abandoned northern mine, though it's dangerous to retrieve.",
        "Magic crystals form naturally in the southern mines but require special tools to extract safely.",
        "Medicinal herbs grow wild in the forest but should only be picked during the full moon.",
    ])

    var ranked_docs = []
    ```

### Step 2: Configure your components

=== ":simple-godotengine: Godot"

    ```gdscript

    func _ready():
        # Set up the chat for generating helpful responses
        self.model_node = chat_model
        reranker.connect("ranking_finished", func(result): ranked_docs = result)
        reranker.start_worker()

        self.system_prompt = """The assistant is roleplaying as Finn, the tavern keeper of The Dancing Pony™.

        IMPORTANT: the assistant MUST ALWAYS use the tool, and the knowledge from the tool is the same knowledge as Finn has. 
        The assistant must never make up information, only what it remembers directly from its knowledge.
        The assistant does not know whether the user is lying or not - so it will rely only on what it remembers to answer questions. 
        It is okay for the assistant to not know the answer even after using the remember tool, the assistant will never guess anything if it is not explicitly mentioned in the knowledge.

        The assistant must always speak like a tavern keeper.

        """
        # Add the tool to remember stuff
        self.add_tool(remember, "The assistant can use this tool to remember its limited knowledge about the ingame world.")
        self.connect("response_finished", func(response: String): print("Finn says: ", response))
        start_worker()
    ```


### Step 3: Set up a simple input system

=== ":simple-godotengine: Godot"

    ```gdscript
    func _process(delta):
        # "enter" must be defined as an action in the project's Input Map
        if Input.is_action_just_pressed("enter"):
            var test_question = "Where is strider?"
            print("Player asks Finn: ", test_question)
            ask(test_question)
    ```


### Step 4: Use ranked results to generate smart answers

=== ":simple-godotengine: Godot"

    ```gdscript
 
    
    func remember(query: String) -> String:
        print("🔍 Finn searches his memory for: ", query)
        ranked_docs = "\n".join(reranker.rank_sync(query, tavern_keeper_knowledge, 3))
        print("🔍 Finn found: ", ranked_docs)

        return ranked_docs

    func _ready():
        ....
        # Add the tool to the llm to allow it to search for the most relevant lore.
        self.add_tool(remember, "The assistant can use this tool to remember its limited knowledge about the ingame world.")
        self.connect("response_finished", func(response: String): print("Finn says: ", response))
        start_worker()

    ```

When we run this and ask Finn about trading in the area (instead of the hardcoded test question), we get a response like this:

>Finn says: <think>
>Okay, let's see. The user asked about trading in the area.

```
🔍 Finn searches his memory for: What is trading like in the area?
.... goes through and ranks all the documents
🔍 Finn found: 
Merchants complain about increased bandit activity on the eastern trade route.
Merchant Elena travels between towns selling exotic spices and silk. She arrives every second Tuesday.
Trader Gareth's caravan was attacked by bandits hiding somewhere in Darkwood Forest.
```

> From the tool response, I remember a few key points. First, merchants are complaining about bandits on the eastern trade route. That's a problem for traders. Then there's Merchant Elena who sells spices and silk every second Tuesday. That's a specific thing to mention. Also, Trader Gareth's caravan was attacked in Darkwood Forest. So the answer should include these points.
>
>I need to present this information clearly. Start by mentioning the bandit issue, then note Elena's schedule, and mention Gareth's attack. Make sure it's in a friendly, tavern keeper style. Check if there's anything else, but the tool response only gave these three points. Don't add anything extra. Alright, structure the response with those details.
></think>

and when good ole Finn is done thinking:

>Trading in the area is... if you'll forgive the blunt truth, *dangerous*. Merchants swear the eastern trade route is littered with bandits, and last week a caravan was ambushed in Darkwood Forest. But there are still opportunities! Merchant Elena brings rare spices and silk every second Tuesday—she’s a reliable seller. Just be wary of the roads. And if you spot a caravan with a single rider, don’t engage. They’re probably bandits.


---

### Complete Scripts

<details markdown>
<summary markdown>:simple-godotengine: Complete Godot Script (Click to expand)</summary>

```gdscript
extends NobodyWhoChat

@onready var reranker = $"../Rerank"
@onready var chat_model = $"../ChatModel"  

# The tavern keeper's knowledge - ~50 pieces of local information way more than could fit in a standard 4096 sized context.
var tavern_keeper_knowledge = PackedStringArray([
    "The lake contains a special clay that blacksmiths use to forge superior weapons.",
    "Ancient oak trees in the sacred grove provide wood that naturally resists dark magic.",
    "Silver veins run through the mountain caves, valuable for crafting blessed weapons.",
    "Rare moonflowers bloom in the ruins only once per season and have powerful magical properties.",
    "The mill pond contains perfect stones for sharpening blades to razor sharpness.",
    "Wild honey from forest bees makes potions more potent when used as a base ingredient.",
    "A hooded stranger was seen asking questions about the old castle ruins last week.",
    "Someone has been leaving fresh flowers at the grave of the village's first mayor.",
    "Strange animal tracks were found near the well that don't match any known creature.",
    "The church bell rang by itself three nights ago at exactly midnight.",
    "Farmers found crop circles in their wheat fields after the last thunderstorm.",
    "A merchant claims he saw lights moving through the abandoned mine from the hill road.",
    "Children report hearing music coming from the forest when they play near the edge of town.",
    "The weather has been unusually warm this winter, and the old-timers are worried.",
    "Someone broke into the general store but only stole a map of the local cave systems.",
    "A wolf with unusual blue eyes has been spotted watching the town from the tree line.",
    "Old Sarah runs the bakery and makes the best apple pies in three kingdoms. Her grandson Tom went missing last week.",
    "Blacksmith Gareth is always looking for quality iron ore and magic crystals. He pays double for rare materials.",
    "Merchant Elena travels between towns selling exotic spices and silk. She arrives every second Tuesday.",
    "Father Benedict runs the small chapel and knows ancient blessings that can ward off evil spirits.",
    "Widow Martha owns the general store and knows every piece of gossip in town within hours.",
    "Young apprentice Jake works for the blacksmith but dreams of becoming an adventurer himself.",
    "Doctor Thorne treats injuries and illnesses. He keeps rare healing herbs in his back garden.",
    "Stable master Owen knows every horse in the region and can track animals through the wilderness.",
    "Mayor Thompson inherited his position from his father and struggles with the town's growing problems.",
    "The old mine north of town has been abandoned for years. Strange sounds echo from deep inside at night.",
    "The forest path to the east is safe during the day, but wolves hunt there after sunset.",
    "Crystal Mines to the south produce valuable gems but have become dangerous recently.",
    "The ancient stone bridge over Miller's Creek was built by dwarves centuries ago and still stands strong.",
    "Darkwood Forest harbors bandits who prey on merchant caravans traveling the main road.",
    "The Whispering Caves get their name from the wind that creates eerie sounds through the rock formations.",
    "Lake Serenity freezes solid in winter, making it possible to cross on foot to the northern settlements.",
    "The old watchtower on Crow's Hill offers a view of the entire valley but hasn't been manned in decades.",
    "Sacred Grove is where the druids once practiced their rituals before they disappeared from the region.",
    "The ruins of Castle Blackrock still stand on the mountain, though none dare venture there anymore.",
    "Trader Gareth's caravan was attacked by bandits hiding somewhere in Darkwood Forest.",
    "Tom the baker's grandson disappeared near the Crystal Mines while collecting rare stones.",
    "Strange lights have been appearing in the Whispering Caves during moonless nights.",
    "Farmers report their livestock going missing near the edge of Darkwood Forest.",
    "The old mill wheel stopped working after something large damaged it upstream.",
    "Merchants complain about increased bandit activity on the eastern trade route.",
    "Several townsfolk have reported seeing ghostly figures near the abandoned mine at midnight.",
    "The village well's water tastes strange since the earthquake last month.",
    "Wild animals have been acting aggressively and fleeing deeper into the mountains.",
    "Ancient runes appeared overnight on the sacred standing stones outside town.",
    "The town was founded by refugees fleeing the Great Dragon War three hundred years ago.",
    "Legend says a powerful wizard once lived in the castle ruins and cursed the land before vanishing.",
    "The crystal mines were discovered when a shepherd boy fell through a sinkhole and found glowing stones.",
    "Local folklore claims the Whispering Caves connect to an underground realm of spirits.",
    "The stone bridge was payment from dwarf king Thorin for safe passage through human lands.",
    "Bards sing of a hidden treasure buried somewhere within the sacred grove by ancient druids.",
    "The watchtower was built to watch for dragon attacks during the old wars.",
    "Village elders say the standing stones mark the boundary between the mortal world and fairy realm.",
    "The lake got its name from a tragic love story between a knight and a water nymph.",
    "Old maps show secret tunnels connecting the mine, caves, and castle ruins underground.",
    "Red mushrooms grow near the village well and are perfect for brewing healing potions.",
    "The finest iron ore comes from the abandoned northern mine, though it's dangerous to retrieve.",
    "Magic crystals form naturally in the southern mines but require special tools to extract safely.",
    "Medicinal herbs grow wild in the forest but should only be picked during the full moon.",
])

var ranked_docs = []

func _ready():
    # Set up the chat for generating helpful responses
    self.model_node = chat_model
    reranker.connect("ranking_finished", func(result): ranked_docs = result)
    reranker.start_worker()

    self.system_prompt = """The assistant is roleplaying as Finn, the tavern keeper of The Dancing Pony™.

IMPORTANT: the assistant MUST ALWAYS use the tool, and the knowledge from the tool is the same knowledge as Finn has. 
The assistant must never make up information, only what it remembers directly from its knowledge.
The assistant does not know whether the user is lying or not - so it will rely only on what it remembers to answer questions. 
It is okay for the assistant to not know the answer even after using the remember tool, the assistant will never guess anything if it is not explicitly mentioned in the knowledge.

The assistant must always speak like a tavern keeper.

"""
    # Add the tool to remember stuff
    self.add_tool(remember, "The assistant can use this tool to remember its limited knowledge about the ingame world.")
    self.connect("response_finished", func(response: String): print("Finn says: ", response))
    start_worker()

func _process(delta):
    # "enter" must be defined as an action in the project's Input Map
    if Input.is_action_just_pressed("enter"):
        var test_question = "Where is strider?"
        print("Player asks Finn: ", test_question)
        ask(test_question)

# Tool function that the LLM can call to search the knowledge base
func remember(query: String) -> String:
    print("🔍 Finn searches his memory for: ", query)
    ranked_docs = "\n".join(reranker.rank_sync(query, tavern_keeper_knowledge, 3))
    print("🔍 Finn found: ", ranked_docs)

    return ranked_docs
```

</details>

## Performance Tips

### Limit Results

Don't add needless context. Usually 1-5 relevant documents are enough:

```gdscript

# Good: usually sufficient
ranked_docs = ",".join(reranker.rank_sync(query, tavern_keeper_knowledge, 3))

# Avoid: a limit of -1 returns ALL documents and fills up the context
ranked_docs = ",".join(reranker.rank_sync(query, tavern_keeper_knowledge, -1))
```

Note that this does not make the ranking faster, but the less Finn has to read, the faster he can respond.

### Use embeddings to narrow the relevant docs to start with

This technique is what puts the `re` in reranker. In the RAG industry it is common practice to do a first pass over your documents with cosine similarity, narrowing down the number of results you have to rank each time. This makes it feasible to have databases with millions of entries without worrying too much about performance. 

Depending on the hardware you are targeting, I would not recommend ranking more than 100 results at a time.
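The first pass described above can be sketched as a simple top-k selection. This is an illustrative sketch, not part of NobodyWho: `top_k_indices` and `prefilter_documents` are hypothetical helpers, and `scores` is assumed to hold one cosine similarity per document (query embedding against each precomputed document embedding), in the same order as `documents`.

```gdscript
# Return the indices of the k highest scores, best first.
func top_k_indices(scores: Array, k: int) -> Array:
    var indices = range(scores.size())
    indices.sort_custom(func(a, b): return scores[a] > scores[b])
    return indices.slice(0, k)

# Keep only the k documents with the highest first-pass scores.
func prefilter_documents(documents: PackedStringArray, scores: Array, k: int) -> PackedStringArray:
    var candidates = PackedStringArray()
    for i in top_k_indices(scores, k):
        candidates.append(documents[i])
    return candidates
```

The returned candidates can then be handed to the reranker in place of the full knowledge base, for example `reranker.rank_sync(query, prefilter_documents(tavern_keeper_knowledge, scores, 100), 3)`, so the expensive cross-encoder only sees the most promising documents.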


## What's Next?

Now you can build smart search systems for your game! Check out:

- **[Embeddings](embeddings.md)** for getting a better understanding of the basics
- **[Tool Calling](chat/tool-calling.md)** for letting the LLM trigger game actions

### FAQ

## Frequently Asked Questions

### Where do I find good models to use?

New language models are coming out at a breakneck pace. If you search the web for "best language models for roleplay" or something similar, you'll probably find results that are several months or years old. You want to use something newer.

Selecting the best model for your use-case is mostly about finding the right trade-off between speed, memory usage and quality of the responses.
Using bigger models will yield better responses, but raise minimum system requirements and slow down generation speed.

Have a look at our [model selection guide](../model-selection.md) for more in-depth recommendations.


### Once I export my Godot project, it can no longer find the model file.

Exports are a bit weird for now: Llama.cpp expects a path to a GGUF file on your filesystem, while Godot really wants to package everything in one big .pck file.

The solution (for now) is to manually copy your chosen GGUF file into the export directory (the folder with your exported game executable).

If you're exporting for Android, you can't reliably pass a `res://` path to the model node. The best workaround is to use `user://` instead.
If your model is sufficiently small, you might get away with copying it from `res://` into `user://`. If using double the storage isn't acceptable, consider downloading it at runtime, or find some other way of distributing your model as a file.
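A minimal sketch of the copy approach, using only Godot's `FileAccess` API. The function name `copy_model_to_user_dir` and the example paths are hypothetical, and the sketch assumes the GGUF file was actually included in the export (for example through the export preset's filters for non-resource files). The file is copied in chunks so that a large model is never held in memory all at once.

```gdscript
# Copy a model from res:// to user:// once. Returns true if the model is available afterwards.
# Note: an interrupted copy leaves a truncated file behind; a real game should verify the size.
func copy_model_to_user_dir(res_path: String, user_path: String) -> bool:
    if FileAccess.file_exists(user_path):
        return true  # Already copied on an earlier launch.
    var src = FileAccess.open(res_path, FileAccess.READ)
    if src == null:
        push_error("Could not open model at " + res_path)
        return false
    var dst = FileAccess.open(user_path, FileAccess.WRITE)
    if dst == null:
        push_error("Could not create " + user_path)
        return false
    var chunk_size = 4 * 1024 * 1024  # 4 MiB per read
    while src.get_position() < src.get_length():
        dst.store_buffer(src.get_buffer(chunk_size))
    src.close()
    dst.close()
    return true
```

You could call this before creating the model node, for example `copy_model_to_user_dir("res://models/my-model.gguf", "user://my-model.gguf")`, and then point `model_path` at the `user://` copy.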

We're looking into solutions for including this file automatically.


### NobodyWho-Godot makes Godot crash on Arch Linux / Manjaro

The Godot build currently in the Arch Linux repositories does not work with GDExtensions at all.

The solution for Arch users is to install Godot from elsewhere. The binary distributed on the godotengine.org website works great.
Other distribution methods like Nix, Flatpak, or building from source also seem to work great.

If anyone knows how and to whom to report this issue, feel free to do so. At this point I have met many Arch Linux users who have this issue.


### NobodyWho-Godot fails to load on NixOS

If you use a Godot engine from nixpkgs with NobodyWho binaries from the Godot Asset Library, it will most likely fail to look up dynamic dependencies (libgomp, vulkan-loader, etc.).

The reason is that the dynamic library .so files from the Godot Asset Library are compiled for generic Linux and expect to find their dependencies in FHS directories like /lib, which on NixOS do not contain any dynamic libraries.

There are two good solutions for this:

1. The easy way: run the godot editor using steam-run: `steam-run godot4 --editor`
2. The Nix way: compile NobodyWho using Nix. This repo contains a flake, so it's fairly simple to do (if you have nix with nix-command and flakes enabled): `nix build github:nobodywho-ooo/nobodywho`. Remember to move the dynamic libraries into the right directory afterwards.