# Overview
import Link from '@docusaurus/Link';
## What is NobodyWho?
NobodyWho is a lightweight, open-source inference engine for running open-weights LLMs inside your software.
We provide a simple, efficient, offline and privacy-forward way of interacting with LLMs. No infrastructure needed!
In short, if you want to run an LLM and integrate it with [tools](/python/tool-calling), configure its output,
enable real-time streaming of tokens, or use it to create embeddings, NobodyWho makes it easy.
All of this is powered by [Llama.cpp](https://github.com/ggml-org/llama.cpp), wrapped in a nice, simple API.
No need to mess around with Docker containers, GPU servers, API keys, etc. We make it easy to run local LLMs in Swift, Python, React Native, Flutter and Godot!
## Code documentation
If you are already familiar with the basics of LLMs we suggest you go straight to the documentation of your selected integration.
- Python
- Swift
- React Native
- Flutter
- Godot
## Basic LLM concepts
If you are unfamiliar with the basics of LLMs, or are just curious, we also provide a simple introduction to the most important concepts you need to know to get the most out of NobodyWho.
---
# LLM Basics
Our goal with NobodyWho is to make it easy to run local LLMs. For this reason, we have made it possible to use NobodyWho with minimal knowledge of how LLMs work. However, you still need to know some basic concepts, so we provide brief explanations of them. The concepts covered are tokens, context, samplers and tools.
## Tokens
Tokens are the basic units that LLMs process. A token is typically a word, part of a word, or a punctuation mark. For example, "hello" is one token, while "understanding" might be split into two tokens: "understand" and "ing". It is worth noting that the vocabulary of tokens used is different for each model as it is defined during training.
When the model generates text, it produces one token at a time. This is why the default response object of NobodyWho is a stream of tokens and why you can read the response token-by-token.
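To make sub-word tokens more concrete, here is a minimal sketch of greedy longest-match tokenization. The vocabulary is invented for this example; real models learn their vocabulary during training and typically use more sophisticated schemes, so treat this as an illustration of the idea rather than how any particular model tokenizes text.

```python
# Toy vocabulary, invented for this example. Real vocabularies are learned
# during training and contain tens of thousands of entries.
TOY_VOCAB = {"hello", "understand", "ing", "un", "der", "stand", " ", "!"}


def toy_tokenize(text: str, vocab: set = TOY_VOCAB) -> list:
    """Split text into tokens by repeatedly taking the longest vocabulary match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest possible piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            # Unknown character: emit it on its own so we always make progress.
            tokens.append(text[pos])
            pos += 1
    return tokens


print(toy_tokenize("hello"))          # ['hello']
print(toy_tokenize("understanding"))  # ['understand', 'ing']
```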
## Context
Context refers to all the text the model can "see" when generating a response. This includes:
- Previous messages in the conversation
- The current user prompt
- Any system instructions
Essentially, the context acts as the model's memory of the current conversation, available tools, etc. This is important to remember because, once your chosen model has been initialized, most of your interactions with the model will happen through the context.
### Context Size
Every model has a maximum context size (also called context window or context length), measured in tokens. Common sizes range from 2048 to 128,000 tokens.
Once you reach the context limit, you must either:
- Start a new conversation
- Remove old messages from the history
- Summarize earlier parts of the conversation
Currently NobodyWho resolves this issue automatically by removing old messages from the context.
Having a larger context allows for longer and more complex conversations, but it also slows down the response time, as the model has to process more tokens each time it generates a response.
## Samplers
LLMs don't output text directly. Instead, they generate a probability distribution over all possible next tokens. Since the model weights are static after training, the same input tokens always produce the same distribution. Depending on the use case, however, there are many possible ways of choosing the next token from this distribution. This is configured using a **sampler**. A **sampler** splits the process of choosing the next token into two parts: shifting the distribution and sampling the distribution.
### Shifting the Distribution
Before sampling the distribution to get the next token, it is possible to adjust the distribution provided by the LLM to encourage certain behavior. Examples of these adjustments are:
- **Temperature**: Higher values make output more creative/random, lower values make it more focused/deterministic.
- **Top-k/Top-p**: Limit which tokens are considered, filtering out unlikely options
- **Penalties**: Lower the probabilities of tokens already present in the context.
It is important to note that the steps in this part of the process can be chained. So it is possible to first apply a Temperature shift and then Top-k.
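As a rough illustration of chaining, the sketch below applies a temperature shift followed by top-k to a handful of made-up scores (logits). The function names are hypothetical and this is not how NobodyWho implements its samplers; it only shows the arithmetic behind the two adjustments.

```python
import math


def softmax(logits: list) -> list:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits: list, temperature: float) -> list:
    """Divide logits by the temperature: below 1 sharpens, above 1 flattens."""
    return [x / temperature for x in logits]


def apply_top_k(logits: list, k: int) -> list:
    """Keep the k highest logits and mask the rest with -inf (ties are kept)."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

# Chain the two steps: temperature first, then top-k.
shifted = apply_top_k(apply_temperature(logits, 0.5), k=2)
probs = softmax(shifted)
print([round(p, 3) for p in probs])
```

After the chain, only the two highest-scoring tokens keep a non-zero probability, and the low temperature has widened the gap between them.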
### Sampling the distribution
Once the distribution has been shifted the next step is to actually sample the distribution. This can also be done a few different ways:
- **Dist**: Sample the distribution randomly
- **Greedy**: Always pick the most likely token (deterministic but sometimes repetitive)
- **Mirostat**: Advanced sampling presented in this [article](https://arxiv.org/abs/2007.14966)
Since this part actually chooses the next token, these cannot be chained.
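The difference between the first two strategies can be sketched in a few lines. The distribution below is made up, and the helper names are hypothetical rather than part of NobodyWho; the random generator is seeded only so that the run is reproducible.

```python
import random


def sample_greedy(probs: list) -> int:
    """Return the index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_dist(probs: list, rng: random.Random) -> int:
    """Draw a token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.6, 0.3, 0.1]  # made-up distribution over three tokens
rng = random.Random(0)   # seeded so the run is reproducible

print(sample_greedy(probs))  # always picks index 0

# Random sampling picks every token some of the time,
# roughly in proportion to its probability.
draws = [sample_dist(probs, rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(len(probs))})
```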
NobodyWho also supports more advanced ways of configuring a sampler, for example constraining the output to follow a JSON Schema.
## Tools
Tools (also called function calling) allow the LLM to request external actions. Instead of just generating text, the model can indicate it wants to:
- Search a database
- Perform a calculation
- Fetch data from an API
- Execute custom code
You define available tools, and the model decides when to use them based on the conversation. After a tool executes, you provide the result back to the model so it can continue the conversation.
This enables LLMs to go beyond pure text generation and interact with your application's functionality.
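The round trip described above can be sketched without any LLM at all. In the example below, the request format, the `TOOLS` registry and the `run_tool_call` helper are all hypothetical and chosen for illustration; NobodyWho handles this plumbing for you through its own tool-calling API.

```python
import json


def add(a: float, b: float) -> float:
    """An example tool: add two numbers."""
    return a + b


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"add": add}


def run_tool_call(request_json: str) -> dict:
    """Execute one tool request and wrap the result as a message for the model."""
    request = json.loads(request_json)
    func = TOOLS[request["name"]]
    result = func(**request["arguments"])
    return {"role": "tool", "name": request["name"], "content": json.dumps(result)}


# Pretend the model emitted this request instead of plain text.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
tool_message = run_tool_call(model_output)
print(tool_message)  # {'role': 'tool', 'name': 'add', 'content': '5'}
```

The resulting message would then be appended to the context so the model can continue the conversation with the tool result in view.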
---
# Model Selection
Choosing the right language model can make or break your project. In general you want to go as small as possible while still having the capabilities you need for your application.
## TL;DR
If you just want a ~2GB chat model that works well, use:
```
huggingface:NobodyWho/Qwen_Qwen3-4B-GGUF/Qwen_Qwen3-4B-Q4_K_M.gguf
```
If you want something smaller and faster, use:
```
huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf
```
Pass these as the model path when creating a `Chat`, `Model`, `Encoder`, etc. NobodyWho will download the model automatically and cache it locally for future use.
## Getting a model
NobodyWho can download models directly from Hugging Face. Instead of downloading a file manually, pass a `huggingface:` path where you'd normally pass a file path:
```
huggingface:owner/repo/filename.gguf
```
The model is downloaded once and cached locally — no internet connection is needed after the first load. `hf:` is also accepted as a shorthand.
You can also pass a full `https://` URL to download a model from any host.
Of course, you can still pass a local file path if you prefer to manage model files yourself.
We recommend starting with the models on our [Hugging Face page](https://huggingface.co/NobodyWho) since they are known to work well with NobodyWho.
Once you're more familiar, you can also try models from [Bartowski](https://huggingface.co/bartowski) and [Unsloth](https://huggingface.co/unsloth/models).
Broadly, almost any `.gguf` model on [Hugging Face](https://huggingface.co) should work, though some may fail due to formatting issues.
## Understanding model file names
Model files follow a naming convention like this: `Qwen_Qwen3-0.6B-Q4_K_M.gguf`
Here's what each part means:
- `Qwen` the organization that trained the model.
- `Qwen3` the name of the model release.
- `0.6B` the parameter count in billions. This model has 0.6 billion (600 million) parameters.
- `Q4` the quantization level, i.e. the number of bits used per parameter.
- `K_M` details about the quantization technique. `S` is faster but less precise, `L` is slower but more precise, and `M` is a middle ground. You don't need to worry too much about this for now.
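If you want to pull these parts out programmatically, a regular expression is usually enough. The pattern below is a sketch that assumes names shaped exactly like the example above; the convention is informal, so many real files will not match and the helper (a hypothetical name) returns `None` for them.

```python
import re
from typing import Optional

# Pattern for names shaped like "<org>_<model>-<size>B-Q<bits>[_<variant>].gguf".
# The convention is informal, so plenty of real files will not match.
NAME_PATTERN = re.compile(
    r"^(?P<org>[^_]+)_(?P<model>.+)-(?P<params>\d+(?:\.\d+)?)B"
    r"-Q(?P<bits>\d+)(?:_(?P<variant>\w+))?\.gguf$"
)


def parse_model_name(filename: str) -> Optional[dict]:
    """Split a model file name into its parts, or return None if it doesn't fit."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    parts = match.groupdict()
    parts["params"] = float(parts["params"])  # billions of parameters
    parts["bits"] = int(parts["bits"])        # bits per parameter
    return parts


print(parse_model_name("Qwen_Qwen3-0.6B-Q4_K_M.gguf"))
# {'org': 'Qwen', 'model': 'Qwen3', 'params': 0.6, 'bits': 4, 'variant': 'K_M'}
```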
For chatting, you'll need an instruction-tuned GGUF file that includes a Jinja2 chat template in its metadata. This describes the vast majority of GGUF files available, so if you're unsure, just try it — NobodyWho will give you a descriptive error message if something isn't right.
For embeddings or cross-encoding, you'll need models specifically designed for those tasks; they are typically named accordingly. Note, however, that cross-encoding models are sometimes referred to as "reranking" models.
## Quantization
Quantization refers to the practice of reducing the number of bits per weight.
This can make the model faster and smaller, with a relatively small loss in response quality.
Generally speaking, you can use models quantized down to the Q4 or Q5 levels (4 or 5 bits per weight respectively),
while losing barely any accuracy.
Look at the plot below to get a feel for how quantization levels differ.
It shows the models' ability to predict text on the y-axis versus the number of bits per weight on the x-axis.

In general, it's preferable to use a model with more parameters and fewer bits per parameter, as compared to a model with fewer parameters and more bits per parameter.
Your results may vary.
## Estimating Memory Usage
The memory requirement of a model is roughly its parameter count multiplied by the size of each parameter, i.e. the quantization level in bits divided by 8 to get bytes.
Here are a few examples:
- 2B @ Q8 ~= 2GB
- 2B @ Q4 ~= 1GB
- 14B @ Q4 ~= 7GB
- 14B @ Q2 ~= 3.5GB
- ..and so on
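The same rule of thumb as a small helper, in case you want to check other combinations. The function name is just for illustration, and the estimate only covers the weights themselves: the context needs additional memory on top, so treat the result as a lower bound.

```python
def estimate_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the weights: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


for params, bits in [(2, 8), (2, 4), (14, 4), (14, 2)]:
    print(f"{params}B @ Q{bits} ~= {estimate_memory_gb(params, bits):g}GB")
```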
## Comparing Models
There are many places online for comparing benchmark scores of different LLMs. Here are a few of them:
**[LLM-Stats.com](https://llm-stats.com/)**
- Includes filters for open models and small models.
- Compares recent models on a few different benchmarks.
**[OpenEvals on Hugging Face](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard)**
- A collection of benchmark leaderboards in different domains.
- Includes both inaccessible proprietary models and open models.
Remember that you need an open model to be able to find a GGUF download and run it locally (e.g. Gemma is open, but Gemini isn't).
---
*Need help choosing between specific models? Check our [community Discord](https://discord.gg/qhaMc2qCYB).*
---
# Chat
As you may have noticed in the [welcome guide](./), every interaction with your LLM starts by instantiating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.
## Prompts and responses
The `Chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.
```python
from nobodywho import Chat, TokenStream
chat = Chat("./model.gguf")
response: TokenStream = chat.ask("Is water wet?")
```
The return type of `ask` is a `TokenStream`.
If you want to start reading the response as soon as possible, you can just iterate over the `TokenStream`.
Each token is either an individual word or a fragment of a word.
```python continuation
for token in response:
    print(token, end="", flush=True)
print("\n")
```
If you just want to get the complete response, you can call `TokenStream.completed()`.
This will block until the model is done generating its entire response.
```python continuation
full_response: str = response.completed()
```
All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `Chat.ask()`, it will remember the previous messages.
## Chat history
If you want to inspect the messages inside the `Chat` object, you can use `get_chat_history`.
```python continuation
msgs: list[dict] = chat.get_chat_history()
print(msgs[0]["content"]) # "Is water wet?"
```
Similarly, if you want to edit what messages are in the context, you can use `set_chat_history`:
```python continuation
chat.set_chat_history([{
    "role": "user",
    "content": "What is water?",
    "assets": []
}])
```
## System prompt
A system prompt is a special message placed in the chat context to guide the model's overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.
You can specify a system prompt when initializing a `Chat`:
```python
from nobodywho import Chat
chat = Chat("./model.gguf", system_prompt="You are a mischievous assistant!")
```
This `system_prompt` is then persisted until the chat context is `reset`.
## Context
The context is the text window which the LLM currently considers. Specifically this is the number of tokens the LLM keeps in memory for your current conversation.
Since a bigger context size means more computational overhead, it makes sense to constrain it. This can be done with the `n_ctx` setting, again at the time of creation:
```python
from nobodywho import Chat

chat = Chat("./model.gguf", n_ctx=4096)
```
The default value is `4096`, which is mainly suitable for short and simple conversations. Choosing the right context size is quite important and depends heavily on your use case. A good place to start is your selected model's documentation, to see what its recommended context size is.
Even with a properly selected context size, you may fill up the entire context during a conversation. When this happens, NobodyWho shrinks the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history until the size reaches `n_ctx / 2`. The KV cache is also updated automatically. In the future we plan to add more advanced methods of context shrinking.
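To get a feel for this shrinking policy, here is a simplified sketch in plain Python. It is not NobodyWho's implementation: the word-based token counter is a stand-in, the helper names are hypothetical, and the sketch assumes the first user message directly follows the system prompt.

```python
def count_tokens(message: dict) -> int:
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(message["content"].split())


def shrink_history(history: list, n_ctx: int) -> list:
    """Drop the oldest unprotected messages until the history fits in n_ctx / 2."""
    kept = list(history)  # copy, so the caller's list is untouched
    # Protect the system prompt (if present) and the first user message after it.
    protected = 2 if kept and kept[0]["role"] == "system" else 1
    while sum(count_tokens(m) for m in kept) > n_ctx // 2 and len(kept) > protected:
        del kept[protected]  # the oldest message outside the protected prefix
    return kept


history = [
    {"role": "system", "content": "You are helpful"},
    {"role": "user", "content": "first question here"},
    {"role": "assistant", "content": "a long answer " * 5},
    {"role": "user", "content": "second question"},
    {"role": "assistant", "content": "short answer"},
]

trimmed = shrink_history(history, n_ctx=20)
print([m["content"] for m in trimmed])
```

Here the long assistant answer is the oldest unprotected message, so it is the one that gets dropped, while the system prompt and the first user message survive.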
Again, `n_ctx` is fixed to the `Chat` instance, so it is currently not possible to change the size after `Chat` is created. To reset the current context content, just call `.reset()` with the new system prompt and potentially changed tools.
```python continuation
chat.reset(system_prompt="New system prompt", tools=[])
```
If you don't want to change the already set defaults (`system_prompt`, `tools`), but only reset the context, then go for `reset_history`.
## Sharing model between contexts
There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. With plain `Chat` this is not possible.
For this use case, instead of the path to the `.gguf` model, you can pass in a `Model` object, which can be shared between multiple `Chat` instances.
```python
from nobodywho import Chat, Model
model = Model('./model.gguf')
chat1 = Chat(model)
chat2 = Chat(model)
...
```
NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.
## Asynchronous model loading
Loading a model into memory can take a few seconds - longer if you're using a really large model.
If you want to load the model without blocking execution of your application (e.g. to keep UI responsive), you can load the model asynchronously:
```python
import asyncio
from nobodywho import ChatAsync, Model

async def main():
    model = await Model.load_model_async("./model.gguf")
    chat = ChatAsync(model)

asyncio.run(main())
```
## GPU
Instantiating `Model` is also useful, when enabling GPU acceleration. This can be done as:
```python
from nobodywho import Model

model = Model('./model.gguf', use_gpu_if_available=True)
```
So far, NobodyWho relies purely on [Vulkan](https://www.vulkan.org), but support
for more architectures is planned (for details check out our [issues](https://github.com/nobodywho-ooo/nobodywho/issues) or join us on [Discord](https://discord.gg/qhaMc2qCYB)).
## Template Variables
Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.
### Using Template Variables
You can set template variables when creating a chat or modify them on existing instances:
```python
from nobodywho import Chat

# Set template variables when creating a chat
chat = Chat("./model.gguf", template_variables={"enable_thinking": True})
```
You can also modify template variables on an existing chat instance:
```python continuation
# Set a single template variable
chat.set_template_variable("enable_thinking", True)
# Set multiple template variables at once
chat.set_template_variables({
    "enable_thinking": True,
    "verbose_mode": False
})
# Get current template variables
variables = chat.get_template_variables()
print(variables) # {"enable_thinking": True, "verbose_mode": False}
```
With the next message sent, the updated settings will be propagated to the model.
### Example: Qwen3 and Qwen3.5 Reasoning
The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:
```python
from nobodywho import Chat

# Enable thinking mode for Qwen models
chat = Chat("./model.gguf", template_variables={"enable_thinking": True})
chat.ask("Solve this logic puzzle: ...")
```
When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.
### Model-Specific Variables
Different models may support different template variables depending on their chat template implementation. The available variables and their effects depend entirely on how the model's chat template is designed. Check your model's documentation to see which template variables are supported.
:::info
Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.
:::
### Backward Compatibility
For backward compatibility, the deprecated `allow_thinking` parameter is still available but internally sets the `enable_thinking` template variable:
```python
from nobodywho import Chat

# Deprecated - use template_variables instead
chat = Chat("./model.gguf", allow_thinking=True)
chat.set_allow_thinking(True)
```
---
# Downloading models
NobodyWho can either load a model from a path on disk or download it for you on first use, caching it for subsequent runs. This page covers the available model path formats, how to observe a download in progress, how to access gated/private models, and how to inspect what's already in the local cache.
## Supported model path formats
The `model_path` argument to `Chat`, `download_model`, and friends accepts:
| Form | Example | Notes |
| ---- | ------- | ----- |
| HuggingFace reference | `hf:owner/repo/file.gguf` | Downloaded and cached on first use |
| HTTPS URL | `https://example.com/model.gguf` | Downloaded and cached on first use |
| Local path | `./model.gguf` | Used as-is |
The HuggingFace prefix is case-insensitive and the `//` is optional — `hf:`, `hf://`, `huggingface:`, and `huggingface://` all mean the same thing. Remote models are downloaded to the platform cache directory on first load and re-used on subsequent runs.
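The prefix rules can be summarised in a small classifier. This is only an illustration of the rules in the table, not NobodyWho's actual parser; the function name is hypothetical, and treating the `https://` scheme as case-sensitive is an assumption.

```python
import re

# Matches "hf:", "hf://", "huggingface:" and "huggingface://", ignoring case.
HF_PREFIX = re.compile(r"^(?:hf|huggingface):(?://)?", re.IGNORECASE)


def classify_model_path(model_path: str) -> tuple:
    """Return (kind, remainder) for the three path forms described above."""
    match = HF_PREFIX.match(model_path)
    if match:
        return "huggingface", model_path[match.end():]
    if model_path.startswith("https://"):  # assumption: scheme is lowercase
        return "https", model_path
    return "local", model_path


for path in [
    "hf:owner/repo/file.gguf",
    "HuggingFace://owner/repo/file.gguf",
    "https://example.com/model.gguf",
    "./model.gguf",
]:
    print(path, "->", classify_model_path(path))
```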
## Tracking download progress
When loading a remote model, pass an `on_download_progress` callback to observe the download. It receives `(downloaded_bytes, total_bytes)` and is not called for cached or local files. If you don't pass anything, NobodyWho prints a default terminal progress bar.
```python
from nobodywho import download_model
model_path = download_model(
    'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
    on_download_progress=lambda downloaded, total: print(f"{downloaded}/{total} bytes"),
)
```
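Printing on every call can get noisy for large files. One possible refinement, sketched below, is a callback that only reports when the whole-number percentage changes. The `make_progress_callback` helper is hypothetical (it is not part of NobodyWho), it relies only on the `(downloaded_bytes, total_bytes)` signature described above, and the handling of a zero total is an assumption about how an unknown size might be reported. The download is simulated so the example runs on its own.

```python
from typing import Callable


def make_progress_callback(report: Callable[[str], None] = print) -> Callable[[int, int], None]:
    """Build a callback that reports only when the whole-number percentage changes."""
    last_percent = -1

    def on_progress(downloaded: int, total: int) -> None:
        nonlocal last_percent
        if total <= 0:
            return  # assumption: an unknown size shows up as 0, so skip it
        percent = downloaded * 100 // total
        if percent != last_percent:
            last_percent = percent
            report(f"{percent}%")

    return on_progress


# Simulated download of 1000 bytes, standing in for a real one.
callback = make_progress_callback()
for downloaded in (0, 4, 500, 504, 1000):
    callback(downloaded, 1000)
```

A callback built this way could then be passed as `on_download_progress`.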
## Downloading a gated model
Some HuggingFace models are private or gated by a license you need to accept. In both cases you need to be authorized to download the model weights.
You can manually download the GGUF file via your web browser and then point `Chat` at the local path:
```python
from nobodywho import Chat
chat = Chat('./model.gguf')
```
Or use `download_model` with an `Authorization` header:
```python
from nobodywho import Chat, download_model
model_path = download_model(
    'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
    headers={"Authorization": "Bearer your_hf_token"}
)
chat = Chat(model_path)
```
You can generate a HuggingFace token in [your account settings](https://huggingface.co/settings/tokens).
## Inspecting the model cache
`get_cached_models` returns every `.gguf` model that lives in NobodyWho's cache directory, paired with its size in bytes. This is the same cache used by `download_model` and by `Chat`'s `huggingface:` paths.
```python
from nobodywho import get_cached_models
for path, size in get_cached_models():
    print(f"{path}: {size / 1024 / 1024:.1f} MiB")
```
- Paths are absolute.
- Sizes are in bytes.
- The list is empty if nothing has been downloaded yet.
- Raises `RuntimeError` if the cache directory cannot be read.
---
# Embeddings & RAG
When you want your LLM to search through documents, understand semantic similarity, or build retrieval-augmented generation (RAG) systems, you'll need embeddings and cross-encoders.
## Understanding Embeddings
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Texts with similar meanings have similar vectors, even if they use different words.
For example, "Schedule a meeting for next Tuesday" and "Book an appointment next week" would have very similar embeddings, despite using different words.
## The Encoder
The `Encoder` object converts text into embedding vectors. You'll need a specialized embedding model (different from chat models).
We recommend you first try [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).
```python
from nobodywho import Encoder
encoder = Encoder('./embedding-model.gguf')
embedding = encoder.encode("What is the weather like?")
print(f"Vector with {len(embedding)} dimensions")
```
The resulting embedding is a list of floats (typically 384 or 768 dimensions depending on the model).
### Comparing Embeddings
To measure how similar two pieces of text are, compare their embeddings using cosine similarity:
```python
from nobodywho import Encoder, cosine_similarity
encoder = Encoder('./embedding-model.gguf')
query = encoder.encode("How do I reset my password?")
doc1 = encoder.encode("You can reset your password in the account settings")
doc2 = encoder.encode("The password requirements include 8 characters minimum")
similarity1 = cosine_similarity(query, doc1)
similarity2 = cosine_similarity(query, doc2)
print(f"Document 1 similarity: {similarity1:.3f}") # Higher score
print(f"Document 2 similarity: {similarity2:.3f}") # Lower score
```
Cosine similarity returns a value between -1 and 1, where 1 means identical meaning and -1 means opposite meaning.
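NobodyWho ships `cosine_similarity`, so you never need to write this yourself, but the computation is short enough to show in full: the dot product of the two vectors divided by the product of their lengths. The plain-Python version below uses a different name (`cosine`) to make clear that it is an illustration, not the library function.

```python
import math


def cosine(a: list, b: list) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([1.0, 2.0], [2.0, 4.0]))    # same direction: close to 1
print(cosine([1.0, 0.0], [0.0, 1.0]))    # orthogonal: 0
print(cosine([1.0, 2.0], [-1.0, -2.0]))  # opposite direction: close to -1
```

Note that only the direction of the vectors matters, not their length, which is why the first pair scores (almost exactly) 1 even though one vector is twice as long as the other.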
### Practical Example: Finding Relevant Documents
```python
from nobodywho import Encoder, cosine_similarity
encoder = Encoder('./embedding-model.gguf')
# Your knowledge base
documents = [
    "Python supports multiple programming paradigms including object-oriented and functional",
    "JavaScript is primarily used for web development and runs in browsers",
    "SQL is a domain-specific language for managing relational databases",
    "Git is a version control system for tracking changes in source code"
]
# Pre-compute document embeddings
doc_embeddings = [encoder.encode(doc) for doc in documents]
# Search query
query = "What language should I use for database queries?"
query_embedding = encoder.encode(query)
# Find the most relevant document
similarities = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
best_idx = similarities.index(max(similarities))
print(f"Most relevant: {documents[best_idx]}")
print(f"Similarity score: {similarities[best_idx]:.3f}")
```
## The CrossEncoder for Better Ranking
While embeddings work well for initial filtering, cross-encoders provide more accurate relevance scoring. They directly compare a query against documents to determine how well the document answers the query.
The key difference: embeddings compare vector similarity, while cross-encoders understand the relationship between query and document.
### Why CrossEncoder Matters
Consider this example:
```
Query: "What are the office hours for customer support?"
Documents: [
"Customer asked: What are the office hours for customer support?",
"Support team responds: Our customer support is available Monday-Friday 9am-5pm EST",
"Note: Weekend support is not available at this time"
]
```
Using embeddings alone, the first document scores highest (most similar to the query) even though it provides no useful information. A cross-encoder correctly identifies that the second document actually answers the question.
### Using CrossEncoder
```python
from nobodywho import CrossEncoder
# Download a reranking model like bge-reranker-v2-m3-Q8_0.gguf
crossencoder = CrossEncoder('./reranker-model.gguf')
query = "How do I install Python packages?"
documents = [
    "Someone previously asked about Python packages",
    "Use pip install package-name to install Python packages",
    "Python packages are not included in the standard library"
]
# Get relevance scores for each document
scores = crossencoder.rank(query, documents)
print(scores) # [0.23, 0.89, 0.45] - second doc scores highest
```
### Automatic Sorting
For convenience, use `rank_and_sort` to get documents sorted by relevance:
```python continuation
# Returns list of (document, score) tuples, sorted by score
ranked_docs = crossencoder.rank_and_sort(query, documents)
for doc, score in ranked_docs:
    print(f"[{score:.3f}] {doc}")
```
This returns documents ordered from most to least relevant.
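Conceptually, this amounts to pairing each document with its score and sorting the pairs in descending order. The sketch below shows that idea with the illustrative scores from the earlier example; it assumes `rank_and_sort` behaves like `rank` followed by a sort and is not the library's implementation. The `sort_by_score` helper is a hypothetical name.

```python
def sort_by_score(documents: list, scores: list) -> list:
    """Pair each document with its score and order the pairs from best to worst."""
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)


documents = [
    "Someone previously asked about Python packages",
    "Use pip install package-name to install Python packages",
    "Python packages are not included in the standard library",
]
scores = [0.23, 0.89, 0.45]  # illustrative scores, not real model output

ranked = sort_by_score(documents, scores)
for doc, score in ranked:
    print(f"[{score:.3f}] {doc}")
```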
## Building a RAG System
Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The LLM uses retrieved documents to ground its responses in your knowledge base.
Here's a complete example building a customer service assistant with access to company policies:
```python
from nobodywho import Chat, CrossEncoder, tool

# Initialize the cross-encoder for document ranking
crossencoder = CrossEncoder('./reranker-model.gguf')

# Your knowledge base
knowledge = [
    "Our company offers a 30-day return policy for all products",
    "Free shipping is available on orders over $50",
    "Customer support is available via email and phone",
    "We accept credit cards, PayPal, and bank transfers",
    "Order tracking is available through your account dashboard"
]

# Create a tool that searches the knowledge base
@tool(description="Search the knowledge base for relevant information")
def search_knowledge(query: str) -> str:
    # Rank all documents by relevance to the query
    ranked = crossencoder.rank_and_sort(query, knowledge)
    # Return top 3 most relevant documents
    top_docs = [doc for doc, score in ranked[:3]]
    return "\n".join(top_docs)

# Create a chat with access to the knowledge base
chat = Chat(
    './model.gguf',
    system_prompt="You are a customer service assistant. Use the search_knowledge tool to find relevant information from our policies before answering customer questions.",
    tools=[search_knowledge]
)

# The chat will automatically search the knowledge base when needed
response = chat.ask("What is your return policy?").completed()
print(response)
```
The LLM will call the `search_knowledge` tool, receive the most relevant documents, and use them to generate an accurate answer.
## Async API
For non-blocking operations, use `EncoderAsync` and `CrossEncoderAsync`:
```python
import asyncio
from nobodywho import EncoderAsync, CrossEncoderAsync

async def main():
    encoder = EncoderAsync('./embedding-model.gguf')
    crossencoder = CrossEncoderAsync('./reranker-model.gguf')

    # Generate embeddings asynchronously
    embedding = await encoder.encode("What is the weather?")

    # Rank documents asynchronously
    query = "What is our refund policy?"
    docs = ["Refunds processed within 5-7 business days", "No refunds on sale items", "Contact support to initiate refund"]
    ranked = await crossencoder.rank_and_sort(query, docs)
    for doc, score in ranked:
        print(f"[{score:.3f}] {doc}")

asyncio.run(main())
```
## Recommended Models
### For Embeddings
- [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf) - Good balance of speed and quality (~25MB)
  - Supports English text with 384-dimensional embeddings
### For Cross-Encoding (Reranking)
- [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - Multilingual support with excellent accuracy
## Best Practices
**Precompute embeddings**: If you have a fixed knowledge base, generate embeddings once and reuse them. Don't re-encode the same documents repeatedly.
**Use embeddings for filtering**: When working with large document collections (1000+ documents), use embeddings to narrow down to the top 50-100 candidates, then use a cross-encoder to rerank them.
**Limit cross-encoder inputs**: Cross-encoders are more expensive than embeddings. Don't pass thousands of documents to `rank()` - filter first with embeddings.
**Choose appropriate context size**: The `n_ctx` parameter (default 2048) should match your model's recommended context size. Check the model documentation.
```python
from nobodywho import Encoder, CrossEncoder

# For longer documents, increase context size
encoder = Encoder('./embedding-model.gguf', n_ctx=4096)
crossencoder = CrossEncoder('./reranker-model.gguf', n_ctx=4096)
```
## Complete RAG Example
Here's a full example showing a two-stage retrieval system:
```python
from nobodywho import Chat, Encoder, CrossEncoder, cosine_similarity, tool

# Initialize models
encoder = Encoder('./embedding-model.gguf')
crossencoder = CrossEncoder('./reranker-model.gguf')

# Large knowledge base
knowledge_base = [
    "Python 3.11 introduced performance improvements through faster CPython",
    "The Django framework is used for building web applications",
    "NumPy provides support for large multi-dimensional arrays",
    "Pandas is the standard library for data manipulation and analysis",
    # ... 100+ more documents
]

# Precompute embeddings for all documents
doc_embeddings = [encoder.encode(doc) for doc in knowledge_base]

@tool(description="Search the knowledge base for information relevant to the query")
def search(query: str) -> str:
    # Stage 1: Fast filtering with embeddings
    query_embedding = encoder.encode(query)
    similarities = [
        (doc, cosine_similarity(query_embedding, doc_emb))
        for doc, doc_emb in zip(knowledge_base, doc_embeddings)
    ]
    # Get top 20 candidates
    candidates = sorted(similarities, key=lambda x: x[1], reverse=True)[:20]
    candidate_docs = [doc for doc, _ in candidates]

    # Stage 2: Precise ranking with cross-encoder
    ranked = crossencoder.rank_and_sort(query, candidate_docs)
    # Return top 3 most relevant
    top_results = [doc for doc, score in ranked[:3]]
    return "\n---\n".join(top_results)

# Create RAG-enabled chat
chat = Chat(
    './model.gguf',
    system_prompt="You are a technical documentation assistant. Always use the search tool to find relevant information before answering programming questions.",
    tools=[search]
)

# The chat automatically searches and uses retrieved documents
response = chat.ask("What Python libraries are best for data analysis?").completed()
print(response)
```
This two-stage approach combines the speed of embeddings with the accuracy of cross-encoders, making it efficient even for large knowledge bases.
---
# Getting started
## How do I get started?
First, install `nobodywho`.
```bash
pip install nobodywho
```
Or preferably:
```bash
uv add nobodywho
```
Next, pick a model. NobodyWho can download GGUF models directly from Hugging Face — just pass a `huggingface:` path. See [model selection](/docs/model-selection) for recommendations.
Then make a `Chat` object and call `.ask()`!
```python
from nobodywho import Chat

chat = Chat('huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf')
response = chat.ask('Is water wet?')

# print each token as it is generated
for token in response:
    print(token, end="", flush=True)

# ...or get the entire response as a single string
full_response = response.completed()
print(full_response)
```
This is a super simple example, but we believe that examples that do simple things should be simple!
To get a full overview of the functionality provided by NobodyWho, simply keep reading.
---
# Logging and Troubleshooting
The Python bindings for NobodyWho integrate with Python's standard `logging` utilities.
In short, to enable debug logs:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
This can give you some insight into what the model is choosing to do and when,
for example when tool calls are made or when context shifting happens.
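If debug output from every library is too noisy, the standard `logging` module lets you raise the level for a single logger only. The sketch below uses only the standard library; the logger name `"nobodywho"` is an assumption made for illustration, so check the logger names that appear in your own debug output.

```python
import logging

# Keep everything quiet by default, and show where each message comes from
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

# Turn on debug output for one logger only.
# NOTE: the name "nobodywho" is assumed here, not taken from the library docs.
logging.getLogger("nobodywho").setLevel(logging.DEBUG)
```

The `%(name)s` field in the format string prints the logger name next to each message, which is the easiest way to find out which name to use.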
---
# Sampling
The model does not produce tokens directly, but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from that distribution. This is the job of a **sampler**, which NobodyWho lets you freely modify
to achieve better-quality outputs or to constrain the output to a known format (e.g. JSON).
## Sampler presets
To get a quick start, NobodyWho offers a couple of well-known presets, which you can quickly utilize.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:
```python
from nobodywho import SamplerPresets
Chat("./model.gguf", sampler=SamplerPresets.temperature(0.2))
```
Setting `temperature` to `0.2` affects how the sampler chooses the next token: it makes the distribution less flat, so the model favours more probable tokens.
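To make the effect of temperature concrete, here is a minimal sketch that does not depend on NobodyWho. The scores and the helper name are made up for illustration, and this is not a description of how the sampler is implemented internally.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate tokens
logits = [2.0, 1.0, 0.5]

neutral = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.2)

print([round(p, 3) for p in neutral])
print([round(p, 3) for p in sharp])
```

With the lower temperature, almost all of the probability mass ends up on the highest-scoring token.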
To see the whole list of presets, check out the `SamplerPresets` class:
```python
class SamplerPresets:
    def default() -> SamplerConfig: ...
    def dry() -> SamplerConfig: ...
    def greedy() -> SamplerConfig: ...
    def json() -> SamplerConfig: ...
    def temperature(temperature: float) -> SamplerConfig: ...
    def top_k(top_k: int) -> SamplerConfig: ...
    def top_p(top_p: float) -> SamplerConfig: ...
    # Constrain output to a specific format:
    def constrain_with_json_schema(schema: str) -> SamplerConfig: ...
    def constrain_with_regex(pattern: str) -> SamplerConfig: ...
    def constrain_with_grammar(grammar: str) -> SamplerConfig: ...
```
## Structured output
One of the most useful features is constraining the model to produce structured output —
this gives you a hard guarantee that the output matches a specific format, rather than
relying on the model to get it right on its own.
### Regular expressions
For simpler patterns, you can constrain the output with a regex:
```python
from nobodywho import Chat, SamplerPresets
# Force the model to answer with exactly "yes" or "no"
chat = Chat('./model.gguf', sampler=SamplerPresets.constrain_with_regex(r"yes|no"))
answer = chat.ask("Is the sky blue?").completed()
```
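The idea behind such a guarantee can be shown with a toy example: at every step, tokens that could never lead to a valid match are removed before one is picked. The vocabulary, the scores and the prefix check below are invented for illustration. Real engines compile the pattern into an automaton and re-score tokens at every step, so treat this as a simplified sketch rather than NobodyWho's implementation.

```python
ALLOWED = ["yes", "no"]                      # strings matching the pattern yes|no
VOCAB = ["y", "n", "e", "o", "s", "maybe"]   # hypothetical token vocabulary

def is_valid_prefix(text: str) -> bool:
    """True if `text` can still be extended to one of the allowed strings."""
    return any(option.startswith(text) for option in ALLOWED)

def constrained_greedy(scores: dict[str, float]) -> str:
    """Pick the best-scoring token at each step, skipping tokens that break the pattern."""
    output = ""
    while output not in ALLOWED:
        candidates = [t for t in VOCAB if is_valid_prefix(output + t)]
        output += max(candidates, key=lambda t: scores[t])
    return output

# The unconstrained favourite is "maybe", but it can never match the pattern
scores = {"y": 0.2, "n": 0.1, "e": 0.3, "o": 0.05, "s": 0.1, "maybe": 0.9}
answer = constrained_greedy(scores)
print(answer)
```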
### JSON schema
In some use cases you want the LLM to generate JSON output.
The simple way is to enforce any valid JSON with the `json` preset:
```python
Chat('./model.gguf', sampler=SamplerPresets.json())
```
Or use a JSON schema to force the LLM to produce the specific object shape
that you want:
```python
import json
from nobodywho import Chat, SamplerPresets
chat = Chat('./model.gguf', sampler=SamplerPresets.constrain_with_json_schema({
    "type": "object",
    "properties": {
        "name": {"type": "string", "maxLength": 50},
        "age": {"type": "integer"}
    },
    "required": ["name", "age"],
    "additionalProperties": False
}))
response = chat.ask("Give me a person as JSON with name and age fields.").completed()
person = json.loads(response) # always valid JSON matching the schema
```
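Because the shape is guaranteed, the parsed result can be turned straight into a typed object. The following sketch uses only the standard library. The `Person` class and the hard-coded sample string are stand-ins for illustration; in a real program the string would come from `chat.ask(...).completed()`.

```python
import json
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

def parse_person(raw: str) -> Person:
    """Parse a JSON string that follows the schema above into a Person."""
    data = json.loads(raw)
    return Person(name=data["name"], age=data["age"])

# Stand-in for a model response that satisfies the schema
sample_response = '{"name": "Ada", "age": 36}'
person = parse_person(sample_response)
print(person)
```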
### Custom grammars
For cases where JSON schema and regex are not expressive enough, you can supply a custom grammar.
`constrain_with_grammar` accepts both **Lark** syntax and **GBNF** (llama.cpp format) -
NobodyWho automatically converts GBNF to Lark before passing it to the inference engine.
**Lark syntax** (recommended):
```python
sampler = SamplerPresets.constrain_with_grammar("""
start: record (NEWLINE record)* NEWLINE?
record: field ("," field)*
field: /[^,\"\\n\\r]+/
NEWLINE: /\\r?\\n/
""")
```
**GBNF syntax** (also accepted):
```python
sampler = SamplerPresets.constrain_with_grammar("""
file ::= record (newline record)* newline?
record ::= field ("," field)*
field ::= /[^,"\\n\\r]+/
newline ::= "\\r\\n" | "\\n"
""")
```
See the [Lark documentation](https://lark-parser.readthedocs.io/en/latest/grammar.html) and the
[GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md) for the
full grammar syntax.
:::info
The older `SamplerPresets.grammar()` method is deprecated. Use `SamplerPresets.constrain_with_grammar()` instead - it accepts both Lark and GBNF strings and should run faster!
:::
## Defining your own samplers
Sampler presets abstract away some control that you might want, for example if you
want to chain samplers or change more "advanced" parameters. For that use case,
we provide the `SamplerBuilder` class:
```python
from nobodywho import Chat, SamplerBuilder
Chat(
    "./model.gguf",
    sampler=SamplerBuilder()
        .temperature(0.8)
        .top_k(5)
        .dist()
)
```
With `SamplerBuilder` you can chain multiple steps together and then select how you
want to sample from the distribution. Keep in mind that `SamplerBuilder` provides two
types of methods: ones which modify the distribution (returning the `SamplerBuilder`
instance again) and ones which sample from the distribution (returning a `SamplerConfig`).
So, to get a working sampler and avoid type errors, always end the chain
with one of the sampling steps (e.g. `dist`, `greedy`, `mirostat_v2`, etc.).
For reproducible output, set the RNG seed with `.seed(value)` anywhere in the chain.
It is consumed by every random sampler in the chain — `dist`, `mirostat_v1`, `mirostat_v2`,
and the `xtc` shift step. `greedy` ignores it. If unset, a default seed is used.
```python
sampler = SamplerBuilder().temperature(0.8).top_k(5).seed(42).dist()
```
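The split between distribution-modifying steps and a final sampling step, and the role of the seed, can be illustrated without the library. This is a plain-Python sketch with made-up scores and hypothetical helper names; it mirrors the shape of the chain above but is not NobodyWho's implementation.

```python
import math
import random

def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Distribution-modifying step: rescale the scores."""
    return {tok: score / temperature for tok, score in logits.items()}

def apply_top_k(logits: dict[str, float], k: int) -> dict[str, float]:
    """Distribution-modifying step: keep only the k highest-scoring tokens."""
    best = sorted(logits, key=logits.get, reverse=True)[:k]
    return {tok: logits[tok] for tok in best}

def sample_dist(logits: dict[str, float], rng: random.Random) -> str:
    """Sampling step: draw one token with probability proportional to exp(score)."""
    tokens = list(logits)
    weights = [math.exp(logits[tok]) for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(seed: int, steps: int = 5) -> list[str]:
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    logits = {"cat": 2.0, "dog": 1.5, "fish": 0.3, "rock": -1.0}  # made-up scores
    filtered = apply_top_k(apply_temperature(logits, 0.8), k=2)
    return [sample_dist(filtered, rng) for _ in range(steps)]

first = generate(seed=42)
second = generate(seed=42)
print(first)
print(first == second)
```

Running `generate` twice with the same seed yields the same sequence, and only the two tokens that survive the top-k step can ever be drawn.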
---
# Streaming & Async API
Synchronously waiting for the full response to arrive can be costly. If your application
domain allows you to, you would ideally want to stream the tokens to the user as soon
as they arrive, and spend the time in between doing useful work, rather than just waiting.
## Streaming tokens
Enabling streaming is super simple. Instead of calling the `.completed()` method, just iterate
over the response object:
```python
from nobodywho import Chat
chat = Chat('./model.gguf')
response = chat.ask('How are you?')
for token in response:
    print(token, end="", flush=True)
```
Still, bear in mind that for the individual tokens, you are waiting synchronously.
## Async API
If you don't want to wait synchronously, swap out the `Chat` object for a `ChatAsync`. The rest of the API stays the same, so you can either opt for a full, completed message:
```python
import asyncio
from nobodywho import ChatAsync
async def main():
    chat = ChatAsync('./model.gguf')
    response = await chat.ask('How are you?').completed()
    print(response)

asyncio.run(main())
```
Or again stream tokens:
```python
import asyncio
from nobodywho import ChatAsync
async def main():
    chat = ChatAsync('./model.gguf')
    response = chat.ask('How are you?')
    async for token in response:
        print(token, end="", flush=True)

asyncio.run(main())
```
Similarly, the other model types we support also implement async behaviour, so
you can go for `EncoderAsync` and `CrossEncoderAsync`, which are
both part of the [embeddings & RAG functionality](./embeddings-and-rag).
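The benefit of the async API is that other work can progress while tokens are still arriving. The sketch below shows that pattern with `asyncio.gather`. To stay runnable without a model file, it uses a fake token stream (`fake_token_stream` is a stand-in invented for this example); with NobodyWho you would iterate over the response of `ChatAsync.ask` instead.

```python
import asyncio

async def fake_token_stream(text: str, delay: float = 0.01):
    """Stand-in for a model response: yields one word at a time."""
    for word in text.split():
        await asyncio.sleep(delay)  # pretend the model is busy generating
        yield word + " "

async def collect(stream) -> str:
    """Consume the stream token-by-token and return the full text."""
    parts = []
    async for token in stream:
        parts.append(token)
    return "".join(parts).strip()

async def other_work() -> int:
    """Some unrelated task that runs while tokens are still arriving."""
    ticks = 0
    for _ in range(3):
        await asyncio.sleep(0.005)
        ticks += 1
    return ticks

async def main() -> tuple[str, int]:
    # Both coroutines make progress concurrently on the same event loop
    results = await asyncio.gather(
        collect(fake_token_stream("I am doing fine")),
        other_work(),
    )
    return results[0], results[1]

text, ticks = asyncio.run(main())
print(text, ticks)
```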
---
# Tool Calling
To give your LLM the ability to interact with the outside world, you will need tool calling.
:::info
Note that **not every model** supports tool calling. If the model lacks
this capability, it might not call your tools.
For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/Qwen/models) family of models.
:::
## Declaring a tool
A tool can be created from any (synchronous) Python function that returns a string.
To perform the conversion, you simply need to use the `@tool` decorator. To get
a good sense of what such a tool can look like, consider this geometry example:
```python
import math
from nobodywho import tool, Chat
@tool(description="Calculates the area of a circle given its radius")
def circle_area(radius: float) -> str:
    area = math.pi * radius ** 2
    return f"Circle with radius {radius} has area {area:.2f}"
```
As you can see, every `@tool` definition has to be accompanied by a description
of what the tool does. To let your LLM use it, simply add it when creating the `Chat`:
```python continuation
chat = Chat('./model.gguf', tools=[circle_area])
```
NobodyWho then figures out the right tool calling format, inspects the names and types of the parameters,
and configures the sampler.
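To give a feel for what inspecting the names and types of the parameters can involve, here is a standard-library sketch that builds a JSON-schema-like description from a function signature. The type mapping, the output layout and the `describe_function` helper are assumptions made for this example; the format NobodyWho actually hands to the model may differ.

```python
import inspect

# Assumed mapping from Python annotations to JSON-schema type names
TYPE_NAMES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def describe_function(func, description: str) -> dict:
    """Build a JSON-schema-like description of a function's parameters."""
    properties = {}
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string"
        properties[name] = {"type": TYPE_NAMES.get(param.annotation, "string")}
    return {
        "name": func.__name__,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }

def circle_area(radius: float) -> str:
    return f"area for radius {radius}"

spec = describe_function(circle_area, "Calculates the area of a circle given its radius")
print(spec)
```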
Naturally, more tools can be defined and the model can chain the calls for them:
```python
import os
from pathlib import Path
from nobodywho import Chat, tool
@tool(description="Gets path of the current directory")
def get_current_dir() -> str:
    return os.getcwd()

@tool(description="Lists files in the given directory", params={"path": "a relative or absolute path to a directory"})
def list_files(path: str) -> str:
    files = [f.name for f in Path(path).iterdir() if f.is_file()]
    return f"Files: {', '.join(files)}"

@tool(description="Gets the size of a file in bytes")
def get_file_size(filepath: str) -> str:
    size = Path(filepath).stat().st_size
    return f"File size: {size} bytes"

chat = Chat('./model.gguf', tools=[get_current_dir, list_files, get_file_size])
response = chat.ask('What is the biggest file in my current directory?').completed()
print(response) # The largest file in your current directory is `model.gguf`.
```
## Providing parameter descriptions
When a tool is declared, its description and the names and types of its parameters are provided to the model, so it knows how to use the tool.
If those are not enough, you can provide additional information via the `params` parameter:
```python
from nobodywho import tool
@tool(
    description="Given a longitude and latitude, gets the current temperature.",
    params={
        "lon": "Longitude - that is the vertical one!",
        "lat": "Latitude - that is the horizontal one!"
    }
)
def get_current_temperature(lon: str, lat: str) -> str:
    ...
```
These descriptions are then appended to the information provided to the model, so it has a better idea
of how to use the tool.
## Pre-packaged tools
We ship NobodyWho with two built-in tools that are general enough for multiple use cases: the [monty](https://github.com/pydantic/monty) Python interpreter
and the [bashkit](https://github.com/everruns/bashkit) Bash interpreter. Both serve a similar purpose: to give your small LLM a better chance of answering
questions that require precise reasoning or some kind of computation, possibly over a large context.
The usage is straightforward. Start by importing either `python_tool` or `bash_tool` from `nobodywho`.
```python
from nobodywho import Chat, python_tool, bash_tool
chat = Chat('./model.gguf', tools=[python_tool(), bash_tool()])
```
Lastly, keep in mind that for most use cases it is reasonable to constrain the tools with limits on memory and computation time,
so that you don't end up executing code that loops forever. For this purpose, `python_tool` provides `max_duration`, `max_memory` and `max_recursion_depth`,
and `bash_tool` provides `max_commands`.
## Tool calling and the context
As with everything made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.
---
# Vision & Hearing
A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image and audio information to your LLM.
## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts to make this work:
1. A multimodal LLM, which can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens
To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model then includes `mmproj` in its name.
If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.
With the downloaded GGUFs, you can simply add the projection model as:
```python
from nobodywho import Model, Chat
model = Model("./vision-model.gguf", projection_model_path="./projection_model.gguf")
chat = Chat(
model, system_prompt="You are a helpful assistant, that can hear and see stuff!"
)
```
:::info
The language model and projection model have to **fit** together, as they are trained together!
Unfortunately, you can't just take a projection model and an LLM that you like and expect them
to work together.
:::
## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through the `Prompt` object.
```python notest
from nobodywho import Audio, Image, Prompt, Text
prompt = Prompt([
Text("Tell me what you see in the image and what you hear in the audio."),
Image("./dog.png"),
Audio("./sound.mp3")
])
chat.ask(prompt).completed() # It's a dog and a penguin!
```
## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try to mess around with the order of supplying the text
and the multimodal files, or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.
```python
prompt = Prompt([
Text("Tell me what you see in the image."),
Image("./dog.png"),
Text("Also tell me what you hear in the audio"),
Audio("./sound.mp3")
])
```
Also, there is still a lot of variance between how the models internally process the images.
This, for example, causes differences in how quickly the model consumes context - for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, they scale with the size of the image. In that case, you can increase the context size if the resources allow:
```python
chat = Chat(
    "./model.gguf", system_prompt="You are a helpful assistant.", n_ctx=4096
)
```
Or, for example, preprocess your images with some kind of compression (sometimes even changing the image type helps).
Moreover, audio ingestion seems to depend heavily on the data type of the projection model file. For Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We thus recommend trying out different
projection model files if the one you picked does not work.
As always with more niche models, you may run into bugs. If you do, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix them.
---
# Chat
As you may have noticed in the [welcome guide](./), every interaction with your LLM starts by creating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.
## Creating a Chat
There are two main ways of creating a `Chat` object and the difference lies in when the model file is loaded.
The simplest way is using `Chat.fromPath`:
```swift
import NobodyWho
let chat = try await Chat.fromPath(modelPath: "/path/to/model.gguf")
```
The `modelPath` parameter accepts a local file path, a Hugging Face `hf://` URL, or an `https://` URL:
```swift
// From a Hugging Face repository
let chat = try await Chat.fromPath(
modelPath: "hf://NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
)
// From an HTTPS URL
let chat = try await Chat.fromPath(
modelPath: "https://example.com/model.gguf"
)
```
When loading from a remote URL, you can track download progress with the `onDownloadProgress` callback:
```swift
let chat = try await Chat.fromPath(
modelPath: "hf://NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
) { downloaded, total in
print("Downloaded \(downloaded)/\(total) bytes")
}
```
This function is `async` because loading a model can take a while; awaiting it does not block your UI.
Another way to achieve the same thing is to load the model separately and then use the `Chat` constructor:
```swift
import NobodyWho
let model = try await Model.load(modelPath: "/path/to/model.gguf")
let chat = try Chat(model: model)
```
The `Model.load` function also supports `hf://` and `https://` URLs, as well as `onDownloadProgress`:
```swift
let model = try await Model.load(
modelPath: "hf://NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
) { downloaded, total in
print("Downloaded \(downloaded)/\(total) bytes")
}
```
This allows for sharing the model between several `Chat` instances.
## Prompts and responses
The `chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.
```swift
let chat = try await Chat.fromPath(modelPath: "/path/to/model.gguf")
let response = chat.ask("Is water wet?")
```
The return type of `ask` is a `TokenStream`, which conforms to `AsyncSequence`.
If you want to start reading the response as soon as possible, you can iterate over it using `for await`.
Each token is either an individual word or a fragment of a word.
```swift
for await token in response {
print(token, terminator: "")
}
```
If you just want to get the complete response, you can call `completed()`.
This will return the entire response string once the model is done generating.
```swift
let fullResponse = try await response.completed()
```
All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `chat.ask()`, it will remember the previous messages.
## Stopping generation
If you need to cancel the model's response while it is still generating (for example, when the user taps a "Stop" button), call `stopGeneration()`:
```swift
chat.stopGeneration()
```
This immediately stops token generation. Any tokens already produced are still available in the stream. The partial response is added to the chat history, so the conversation remains coherent. Note that `stopGeneration()` is a synchronous call and can be invoked from any thread.
## Chat history
If you want to inspect the messages inside the `Chat` object, you can use `getChatHistory`.
```swift
let msgs = try await chat.getChatHistory()
print(msgs[0]) // The first message
```
Similarly, if you want to edit what messages are in the context, you can use `setChatHistory`:
```swift
try await chat.setChatHistory([
.message(role: .user, content: "What is water?", assets: [])
])
```
## System prompt
A system prompt is a special message put into the chat context, which should guide its overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.
You can specify a system prompt when creating a `Chat`:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
systemPrompt: "You are a mischievous assistant!"
)
```
This `systemPrompt` is then persisted until the chat context is reset.
## Context
The context is the text window which the LLM currently considers. Specifically, it is the number of tokens the LLM keeps in memory for your current conversation.
A bigger context size means more computational overhead, so it makes sense to constrain it. This can be done with the `contextSize` setting at creation time:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
contextSize: 4096
)
```
The default value is `4096`, which is mainly useful for short and simple conversations. Choosing the right context size is quite important and depends heavily on your use case. A good place to start is your selected model's documentation, to see what context size it recommends.
Even with a properly selected context size it might happen that you fill up your entire context during a conversation. When this happens, NobodyWho will shrink the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history, until the size reaches `contextSize / 2`. The KV cache is also updated automatically.
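The shrinking rule described above can be sketched in a few lines. The example is written in Python for brevity and is meant as a language-agnostic illustration, not as NobodyWho's actual code: the word-count "tokenizer", the message layout and the choice to drop one message at a time are all assumptions.

```python
def count_tokens(message: dict) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(message["content"].split())

def shrink_history(messages: list[dict], context_size: int) -> list[dict]:
    """Drop the oldest droppable messages until the history fits in context_size / 2."""
    target = context_size // 2
    first_user = next((i for i, m in enumerate(messages) if m["role"] == "user"), None)
    # The system prompt and the first user message are always kept
    protected = {i for i, m in enumerate(messages) if m["role"] == "system"}
    if first_user is not None:
        protected.add(first_user)

    kept = list(enumerate(messages))
    while sum(count_tokens(m) for _, m in kept) > target:
        droppable = [pos for pos, (i, _) in enumerate(kept) if i not in protected]
        if not droppable:
            break  # nothing left that may be removed
        del kept[droppable[0]]  # remove the oldest unprotected message
    return [m for _, m in kept]

history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Tell me about water."},
    {"role": "assistant", "content": "Water is a molecule made of hydrogen and oxygen."},
    {"role": "user", "content": "Is it wet?"},
    {"role": "assistant", "content": "Yes, water is wet."},
]
trimmed = shrink_history(history, context_size=32)
print([m["content"] for m in trimmed])
```

In this toy run only the first assistant reply is dropped, because removing it already brings the history under half of the context size.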
To reset the current context content, call `resetContext()` with a new system prompt and potentially changed tools.
```swift
try await chat.resetContext(systemPrompt: "New system prompt", tools: [])
```
If you don't want to change the already set defaults (`systemPrompt`, `tools`), but only reset the context, then go for `resetHistory`.
## Sharing model between contexts
There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. In this case you must load the model separately from creating the `Chat` instance.
```swift
let model = try await Model.load(modelPath: "/path/to/model.gguf")
let chat1 = try Chat(model: model)
let chat2 = try Chat(model: model)
```
NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.
## GPU
When using `Model.load` or `Chat.fromPath` you have the option to disable GPU acceleration:
```swift
let model = try await Model.load(modelPath: "/path/to/model.gguf", useGpu: false)
```
By default `useGpu` is set to `true`, which uses Metal on Apple platforms.
## Template Variables
Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.
### Using Template Variables
You can set template variables when creating a chat:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
templateVariables: ["enable_thinking": true]
)
```
You can also modify template variables on an existing chat instance:
```swift
try await chat.setTemplateVariable(name: "enable_thinking", value: true)
let variables = try await chat.getTemplateVariables()
```
### Example: Qwen3 and Qwen3.5 Reasoning
The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
templateVariables: ["enable_thinking": true]
)
let response = chat.ask("Solve this logic puzzle: ...")
```
When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.
:::info
Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.
:::
---
# Downloading models
NobodyWho can either load a model from a path on disk or download it for you on first use, caching it for subsequent runs. This page covers the available model path formats, how to observe a download in progress, how to access gated/private models, and how to inspect what's already in the local cache.
## Supported model path formats
The `modelPath` argument to `Chat.fromPath` and `Model.downloadModel` accepts:
| Form | Example | Notes |
| ---- | ------- | ----- |
| HuggingFace reference | `hf:owner/repo/file.gguf` | Downloaded and cached on first use |
| HTTPS URL | `https://example.com/model.gguf` | Downloaded and cached on first use |
| Local path | `./model.gguf` | Used as-is |
The HuggingFace prefix is case-insensitive and the `//` is optional — `hf:`, `hf://`, `huggingface:`, and `huggingface://` all mean the same thing. Remote models are downloaded to the platform cache directory on first load and re-used on subsequent runs.
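The prefix rules above can be expressed as a small classifier. This sketch is written in Python as a language-agnostic illustration and reflects only what the table and paragraph state; the regular expression and the `classify_model_path` helper are assumptions, not NobodyWho's actual parser.

```python
import re

# Assumed pattern: "hf" or "huggingface", a colon, and an optional "//"
HF_PREFIX = re.compile(r"^(hf|huggingface):(//)?", re.IGNORECASE)

def classify_model_path(model_path: str) -> tuple[str, str]:
    """Return (kind, remainder) where kind is 'huggingface', 'https' or 'local'."""
    match = HF_PREFIX.match(model_path)
    if match:
        return "huggingface", model_path[match.end():]
    if model_path.startswith("https://"):
        return "https", model_path
    return "local", model_path

examples = [
    "hf:owner/repo/file.gguf",
    "HuggingFace://owner/repo/file.gguf",
    "https://example.com/model.gguf",
    "./model.gguf",
]
for example in examples:
    print(classify_model_path(example))
```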
## Tracking download progress
When loading from a remote URL, pass a progress closure to `Chat.fromPath`. It receives `(downloaded, total)` byte counts and is not called for cached or local files.
```swift
let chat = try await Chat.fromPath(
modelPath: "hf://NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
) { downloaded, total in
print("Downloaded \(downloaded)/\(total) bytes")
}
```
## Downloading a gated model
Some HuggingFace models are private or gated by a license you need to accept. In both cases you need to be authorized to download the model weights.
You can manually download the GGUF file via your web browser and then point `Chat.fromPath` at the local path:
```swift
let chat = try await Chat.fromPath(modelPath: "./model.gguf")
```
Or use `Model.downloadModel` with an `Authorization` header:
```swift
import NobodyWho
let modelPath = try await Model.downloadModel(
modelPath: "hf://NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf",
headers: ["Authorization": "Bearer your_hf_token"]
)
let chat = try await Chat.fromPath(modelPath: modelPath)
```
You can generate a HuggingFace token in [your account settings](https://huggingface.co/settings/tokens).
## Inspecting the model cache
`getCachedModels()` returns every `.gguf` model in NobodyWho's cache directory, paired with its size in bytes. This is the same cache used by `Model.downloadModel` and by `Chat.fromPath`'s `hf://` paths.
```swift
import NobodyWho
let models = try getCachedModels()
for model in models {
print("\(model.path): \(model.size) bytes")
}
```
Each `CachedModel` has:
- `path: String` — absolute path to the cached `.gguf` file
- `size: UInt64` — size in bytes
The array is empty if nothing has been downloaded yet. The call throws if the cache directory cannot be read.
---
# Embeddings & RAG
NobodyWho provides `Encoder` and `CrossEncoder` classes for building retrieval-augmented generation (RAG) pipelines entirely on-device.
## Embeddings
An `Encoder` converts text into a numerical vector (embedding) that captures its semantic meaning. Texts with similar meanings will have similar embeddings.
```swift
import NobodyWho
let encoder = try await Encoder.fromPath(modelPath: "/path/to/embeddings.gguf", contextSize: 512, useGpu: true)
let embedding1 = try await encoder.encode("The cat sat on the mat")
let embedding2 = try await encoder.encode("A feline rested on the rug")
let embedding3 = try await encoder.encode("The stock market crashed today")
let similar = cosineSimilarity(a: embedding1, b: embedding2) // High similarity
let different = cosineSimilarity(a: embedding1, b: embedding3) // Low similarity
```
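For reference, cosine similarity itself is a short formula: the dot product of the two vectors divided by the product of their lengths. The following Python sketch spells it out with tiny made-up vectors; it is an illustration of the measure, not the implementation behind `cosineSimilarity`.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Tiny made-up "embeddings"; real ones have hundreds of dimensions
cat = [0.9, 0.1, 0.0]
feline = [0.8, 0.2, 0.1]
stocks = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, feline))
print(cosine_similarity(cat, stocks))
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0.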
You can also create an encoder from an already-loaded model:
```swift
let model = try await Model.load(modelPath: "/path/to/embeddings.gguf", useGpu: true)
let encoder = Encoder(model: model, contextSize: 512)
```
## Cross-Encoder for reranking
A `CrossEncoder` takes a query and a list of documents and scores each document by its relevance to the query. Unlike embeddings (which are computed independently), a cross-encoder processes the query and document together, giving more accurate relevance scores.
```swift
let crossEncoder = try await CrossEncoder.fromPath(modelPath: "/path/to/reranker.gguf", contextSize: 512, useGpu: true)
let query = "How do I reset my password?"
let documents = [
"Click 'Forgot Password' on the login page.",
"Our company was founded in 2020.",
"Contact support for account recovery.",
"The weather is sunny today.",
]
// Get raw similarity scores
let scores = try await crossEncoder.rank(query: query, documents: documents)
// Or get documents sorted by relevance (most relevant first)
let ranked = try await crossEncoder.rankAndSort(query: query, documents: documents)
for (document, score) in ranked {
print("\(score): \(document)")
}
```
## Building a RAG pipeline
A typical RAG pipeline combines both tools:
1. **Index**: Use the `Encoder` to create embeddings for your document collection
2. **Retrieve**: When a user asks a question, embed the query and find the most similar documents using `cosineSimilarity`
3. **Rerank** (optional): Use the `CrossEncoder` to rerank the top candidates for better precision
4. **Generate**: Pass the relevant documents to a `Chat` as context in the system prompt
```swift
// 1. Embed your documents (do this once, store the results)
let encoder = try await Encoder.fromPath(modelPath: "/path/to/embeddings.gguf", contextSize: 512, useGpu: true)
let docs = ["Document 1...", "Document 2...", "Document 3..."]
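// Note: `asyncMap` is not part of the Swift standard library; it is assumed to be a helper defined in your project
let docEmbeddings = try await docs.asyncMap { try await encoder.encode($0) }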
// 2. Embed the query and find similar documents
let queryEmbedding = try await encoder.encode("What is the return policy?")
let similarities = docEmbeddings.map { cosineSimilarity(a: queryEmbedding, b: $0) }
// 3. Rerank the top results
let crossEncoder = try await CrossEncoder.fromPath(modelPath: "/path/to/reranker.gguf", contextSize: 512, useGpu: true)
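let topDocs = zip(docs, similarities)
    .sorted { $0.1 > $1.1 } // highest similarity first
    .prefix(10)             // keep the top N candidates (N = 10 chosen here as an example)
    .map { $0.0 }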
let ranked = try await crossEncoder.rankAndSort(query: "What is the return policy?", documents: topDocs)
// 4. Generate a response with context
let context = ranked.prefix(3).map { $0.0 }.joined(separator: "\n\n")
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
systemPrompt: "Answer based on the following documents:\n\n\(context)"
)
let response = try await chat.ask("What is the return policy?").completed()
```
---
# Getting started
## How do I get started?
Add NobodyWho to your project using Swift Package Manager. In Xcode, go to **File → Add Package Dependencies** and enter:
```
https://github.com/nobodywho-ooo/nobodywho-swift.git
```
Or add it to your `Package.swift`:
```swift
dependencies: [
.package(url: "https://github.com/nobodywho-ooo/nobodywho-swift.git", from: "1.0.0")
]
```
Models can be loaded from a local file path, a Hugging Face repository using `hf://` URLs, or any `https://` URL. If you don't have a specific model in mind, try [this one](https://huggingface.co/NobodyWho/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf). Read more about [model selection](/docs/model-selection).
```swift
import NobodyWho
// From a Hugging Face repository
let chat = try await Chat.fromPath(
modelPath: "hf://NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
)
// From an HTTPS URL
let chat = try await Chat.fromPath(
modelPath: "https://huggingface.co/NobodyWho/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
)
// From a local file
let chat = try await Chat.fromPath(modelPath: "/path/to/model.gguf")
```
Once you have a `Chat`, call `.ask` to get a response!
```swift
let response = try await chat.ask("Is water wet?").completed()
print(response) // Yes, indeed, water is wet!
```
This is a super simple example, but we believe that examples which do simple things should be simple!
To get a full overview of the functionality provided by NobodyWho, simply keep reading.
## Platform requirements
- **iOS**: iPhone 11 or newer with at least 4 GB of RAM. Requires iOS 15+.
- **macOS**: Apple Silicon or Intel Mac with at least 8 GB of RAM. Requires macOS 13+.
- **visionOS**: Apple Vision Pro. Requires visionOS 1.0+.
- **watchOS**: Requires watchOS 10+. CPU-only (Metal is not available). Due to limited memory on Apple Watch, only very small models are practical.
GPU acceleration via Metal is enabled by default on every platform listed above except watchOS, which runs CPU-only.
## Feedback & Contributions
We welcome your feedback and ideas!
- Bug Reports & Improvements: If you encounter a bug or have suggestions, please open an issue on our [Issues](https://github.com/nobodywho-ooo/nobodywho/issues) page.
- Feature Requests & Questions: For new feature requests or general questions, join the discussion on our [Discussions](https://github.com/nobodywho-ooo/nobodywho/discussions) page.
---
# Sampling
The model does not produce tokens directly, but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from that distribution. This is the job of a **sampler**, which NobodyWho lets you freely modify to achieve better-quality outputs or to constrain the output to a known format (e.g. JSON).
## Sampler presets
To get you started quickly, NobodyWho offers a few well-known presets.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:
```swift
import NobodyWho
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
sampler: SamplerPresets.temperature(0.2)
)
```
Setting `temperature` to `0.2` makes the distribution less flat, so the sampler favours the more probable tokens and the output becomes more predictable.
To see the whole list of presets, check out the `SamplerPresets` enum:
```swift
enum SamplerPresets {
static func `default`() -> SamplerConfig
static func dry() -> SamplerConfig
static func greedy() -> SamplerConfig
static func temperature(_ temperature: Float) -> SamplerConfig
static func topK(_ topK: Int32) -> SamplerConfig
static func topP(_ topP: Float) -> SamplerConfig
// Constrain output to a specific format:
static func constrainWithJsonSchema(_ schema: String) -> SamplerConfig
static func constrainWithRegex(_ pattern: String) -> SamplerConfig
static func constrainWithGrammar(_ grammar: String) -> SamplerConfig
}
```
## Structured output
One of the most useful features is constraining the model to produce structured output —
this gives you a hard guarantee that the output matches a specific format, rather than
relying on the model to get it right on its own.
### Regular expressions
For simpler patterns, you can constrain the output with a regex:
```swift
// Force the model to answer with exactly "yes" or "no"
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
sampler: SamplerPresets.constrainWithRegex("yes|no")
)
let answer = try await chat.ask("Is the sky blue?").completed()
```
### JSON schema
In some use-cases it might be useful to let the LLM generate JSON output.
You can use a JSON schema to force the LLM to produce the exact object shape you need:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
sampler: SamplerPresets.constrainWithJsonSchema("""
{
"type": "object",
"properties": {
"name": { "type": "string", "maxLength": 50 },
"age": { "type": "integer" }
},
"required": ["name", "age"],
"additionalProperties": false
}
""")
)
let response = try await chat.ask("Give me a person as JSON with name and age fields.").completed()
// `response` is always valid JSON matching the schema
```
### Custom grammars (advanced)
For cases where JSON schema and regex are not expressive enough, you can supply a custom grammar.
`constrainWithGrammar` accepts both **Lark** syntax and **GBNF** (llama.cpp format) —
NobodyWho automatically converts GBNF to Lark before passing it to the inference engine.
**Lark syntax** (recommended):
```swift
let sampler = SamplerPresets.constrainWithGrammar("""
start: record (NEWLINE record)* NEWLINE?
record: field ("," field)*
field: /[^,"\\n\\r]+/
NEWLINE: /\\r?\\n/
""")
```
**GBNF syntax** (also accepted):
```swift
let sampler = SamplerPresets.constrainWithGrammar("""
root ::= record (newline record)* newline?
record ::= field ("," field)*
field ::= [^,"\\n\\r]+
newline ::= "\\r\\n" | "\\n"
""")
```
See the [Lark documentation](https://lark-parser.readthedocs.io/en/latest/grammar.html) and the
[GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md) for the
full grammar syntax.
:::info
The older `SamplerPresets.json()` and `SamplerPresets.grammar()` methods are deprecated.
Use `SamplerPresets.constrainWithJsonSchema()` for JSON output or
`SamplerPresets.constrainWithGrammar()` for custom grammars — the latter accepts both Lark and GBNF strings.
:::
## Defining your own samplers
Sampler presets hide some of the control you might want, for example
chaining samplers or changing more advanced parameters. For that use case,
we provide the `SamplerBuilder` class:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
sampler: SamplerBuilder().temperature(0.8).topK(5).dist()
)
```
With `SamplerBuilder` you chain multiple steps together and then choose how to
sample from the resulting distribution. It provides two kinds of methods: ones that
modify the distribution (these return the `SamplerBuilder` instance, so you can keep
chaining) and ones that sample from the distribution (these return a `SamplerConfig`).
For the sampler to work properly, always end the chain with one of the
sampling steps (e.g. `dist()`, `greedy()`, `mirostatV2()`).
For reproducible output, set the RNG seed with `.seed(seed:)` anywhere in the chain.
It is consumed by every random sampler in the chain — `dist`, `mirostatV1`, `mirostatV2`,
and the `xtc` shift step. `greedy` ignores it. If unset, a default seed is used.
```swift
let sampler = SamplerBuilder().temperature(0.8).topK(5).seed(seed: 42).dist()
```
You can also change the sampler configuration on an existing chat instance:
```swift
let sampler = SamplerBuilder().temperature(0.8).topK(5).dist()
try await chat.setSamplerConfig(sampler)
```
---
# Tool Calling
To give your LLM the ability to interact with the outside world, you will need tool calling.
:::info
Note that **not every model** supports tool calling. A model without tool-calling
support might never call your tools.
For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/collections/NobodyWho/qwen-3) family of models.
:::
## The @DeclareTool macro
The easiest way to create a tool is with the `@DeclareTool` macro. Just annotate any function with a description, and NobodyWho generates the tool for you:
```swift
import NobodyWho
@DeclareTool("Calculates the area of a circle given its radius")
func circleArea(radius: Double) -> String {
let area = Double.pi * radius * radius
return "Circle with radius \(radius) has area \(String(format: "%.2f", area))"
}
// The macro generates `circleAreaTool` automatically
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
tools: [circleAreaTool]
)
```
The generated variable is named `<functionName>Tool`, so `circleArea` becomes `circleAreaTool`.
Supported parameter types: `String`, `Int`, `Double`, `Float`, `Bool`, and collections like `[String]` or `[String: Int]`.
### Async tools
The `@DeclareTool` macro also works with async functions. This is useful for tools that need to make network requests, database queries, or other asynchronous operations:
```swift
@DeclareTool("Search the knowledge base")
func search(query: String) async -> String {
let results = await knowledgeBase.search(query)
return results.joined(separator: "\n")
}
@DeclareTool("Get the current weather for a city")
func getWeather(city: String, unit: String) async -> String {
let data = await weatherAPI.fetch(city: city, unit: unit)
return "{\"temp\": \(data.temp), \"unit\": \"\(unit)\"}"
}
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
tools: [searchTool, getWeatherTool]
)
```
### Scope limitation
`@DeclareTool` is a Swift peer macro, which means it can only introduce new declarations at **top-level** or **type-member** scope. It **does not work inside function bodies**:
```swift
// ✅ Works — top-level scope
@DeclareTool("Ping the server")
func ping() -> String { return "pong" }
class MyApp {
// ✅ Works — type-member scope
@DeclareTool("Get status")
static func status() -> String { return "ok" }
}
func example() {
// ❌ Does NOT work — local scope
@DeclareTool("Ping the server")
func ping() -> String { return "pong" }
_ = pingTool // error: cannot find 'pingTool' in scope
}
```
If you need to define a tool inside a function body (for example, to capture local variables), use the manual `Tool` initializer described below.
## Creating tools manually
You can also create tools without the macro, using the `Tool` initializer directly. This works in any scope, including inside function bodies where the macro cannot be used. Each parameter is a `(name, jsonSchema)` tuple, and arguments from the LLM are passed positionally in the same order as the `parameters` array:
```swift
let circleAreaTool = Tool(
name: "circle_area",
description: "Calculates the area of a circle given its radius",
parameters: [("radius", #"{"type": "number"}"#)]
) { args in
let radius = args[0] as! Double
let area = Double.pi * radius * radius
return "Circle with radius \(radius) has area \(String(format: "%.2f", area))"
}
```
This is especially useful when you need to capture local state:
```swift
func runChat() async throws {
var callCount = 0
let pingTool = Tool(
name: "ping",
description: "Ping the server",
parameters: []
) { _ in
callCount += 1
return "pong"
}
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
tools: [pingTool]
)
let _ = try await chat.ask("Ping the server").completed()
print("Tool was called \(callCount) time(s)")
}
```
For async callbacks, there's a separate initializer:
```swift
let searchTool = Tool(
name: "search",
description: "Search the knowledge base",
parameters: [("query", #"{"type": "string"}"#)]
) { args async in
let query = args[0] as! String
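// Join the results into a single string, as in the macro-based example above
let results = await knowledgeBase.search(query)
return results.joined(separator: "\n")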
}
```
### Parameter schema reference
Each parameter schema is a JSON Schema string. Common types:
| Swift type | JSON Schema |
|---|---|
| `String` | `#"{"type": "string"}"#` |
| `Int` | `#"{"type": "integer"}"#` |
| `Double`, `Float` | `#"{"type": "number"}"#` |
| `Bool` | `#"{"type": "boolean"}"#` |
| `[String]` | `#"{"type": "array", "items": {"type": "string"}}"#` |
| `[String: Int]` | `#"{"type": "object", "additionalProperties": {"type": "integer"}}"#` |
## Multiple tools
You can also register several tools, and the model can chain calls to them:
```swift
@DeclareTool("Gets path of the current directory")
func getCurrentDir() -> String {
return FileManager.default.currentDirectoryPath
}
@DeclareTool("Lists files in the given directory")
func listFiles(path: String) -> String {
let files = try? FileManager.default.contentsOfDirectory(atPath: path)
return (files ?? []).joined(separator: ", ")
}
@DeclareTool("Gets the size of a file in bytes")
func getFileSize(filepath: String) -> String {
let attrs = try? FileManager.default.attributesOfItem(atPath: filepath)
let size = attrs?[.size] as? Int ?? 0
return "File size: \(size) bytes"
}
let chat = try await Chat.fromPath(
modelPath: "/path/to/model.gguf",
tools: [getCurrentDirTool, listFilesTool, getFileSizeTool]
)
let response = try await chat
.ask("What is the biggest file in my current directory?")
.completed()
print(response)
```
## Tool calling and the context
As with most things made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.
---
# Vision & Hearing
A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image and audio information to your LLM.
## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts for making this work:
1. A multimodal LLM, which can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens
To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model includes `mmproj` in its name.
If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.
With the downloaded GGUFs, you can load them using `Chat.fromPath`:
```swift
import NobodyWho
let chat = try await Chat.fromPath(
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf",
systemPrompt: "You are a helpful assistant, that can hear and see stuff!"
)
```
Or load the model separately:
```swift
let model = try await Model.load(
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf"
)
let chat = try Chat(model: model, systemPrompt: "You are a helpful assistant.")
```
:::info
The language model and projection model have to **fit** together, as they are trained together!
Unfortunately you can't just take a projection model and an LLM that you like and expect them
to work together.
:::
## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
Use `Prompt` to build prompts that mix text, images, and audio, then pass them to `chat.ask()`:
```swift
let prompt = Prompt([
Prompt.text("Tell me what you see in the image and what you hear in the audio."),
Prompt.image("/path/to/dog.png"),
Prompt.audio("/path/to/sound.mp3"),
])
let response = try await chat.ask(prompt).completed()
```
## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try changing the order of the text
and the media files, or the descriptions you supply. For example, the following prompt may perform better than the one shown earlier.
```swift
let prompt = Prompt([
Prompt.text("Tell me what you see in the image."),
Prompt.image("/path/to/dog.png"),
Prompt.text("Also tell me what you hear in the audio."),
Prompt.audio("/path/to/sound.mp3"),
])
```
Models also still differ a lot in how they process images internally.
One consequence is how quickly images consume context: for some models, like Gemma 3, the number of tokens per image is constant; for others, like Qwen 3, it scales with the size of the image. If images fill up your context, you can increase the context size if resources allow:
```swift
let chat = try await Chat.fromPath(
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf",
contextSize: 8192
)
```
Alternatively, downsample your images before passing them in (sometimes even changing the image format helps).
Audio ingestion also appears to depend heavily on the data type of the projection model file: for Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We therefore recommend trying a different
projection model file if the one you picked does not work.
As always with more niche models, you may run into bugs. If you do, please [report them](https://github.com/nobodywho-ooo/nobodywho/issues) so we can fix them.
---
# Chat
As you may have noticed in the [welcome guide](./), every interaction with your LLM starts by creating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.
## Creating a Chat
There are two main ways of creating a `Chat` object and the difference lies in when the model file is loaded.
The simplest way is using `Chat.fromPath`:
```typescript
import { Chat } from "react-native-nobodywho";
const chat = await Chat.fromPath({ modelPath: "/path/to/model.gguf" });
```
This function is async because loading a model can take some time; awaiting it should not block your UI.
Another way to achieve the same thing is to load the model separately and then use the `Chat` constructor:
```typescript
import { Model, Chat } from "react-native-nobodywho";
const model = await Model.load({ modelPath: "/path/to/model.gguf" });
const chat = new Chat({ model });
```
This allows for sharing the model between several `Chat` instances.
## Prompts and responses
The `chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.
```typescript
const chat = await Chat.fromPath({ modelPath: "/path/to/model.gguf" });
const response = chat.ask("Is water wet?");
```
The return type of `ask` is a `TokenStream`.
If you want to start reading the response as soon as possible, you can iterate over the `TokenStream` using `for await`.
Each token is either an individual word or a fragment of a word.
```typescript
for await (const token of response) {
console.log(token);
}
```
If you just want to get the complete response, you can call `completed()`.
This will return the entire response string once the model is done generating.
```typescript
const fullResponse = await response.completed();
```
All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `chat.ask()`, it will remember the previous messages.
## Chat history
If you want to inspect the messages inside the `Chat` object, you can use `getChatHistory`.
```typescript
const msgs = await chat.getChatHistory();
console.log(msgs[0]); // The first message
```
Similarly, if you want to edit what messages are in the context, you can use `setChatHistory`:
```typescript
await chat.setChatHistory([
{ role: "user", content: "What is water?" },
]);
```
## System prompt
A system prompt is a special message placed in the chat context to guide the model's overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.
You can specify a system prompt when creating a `Chat`:
```typescript
import { Chat } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
systemPrompt: "You are a mischievous assistant!",
});
```
This `systemPrompt` is then persisted until the chat context is reset.
## Context
The context is the window of text that the LLM currently considers. The context size is the number of tokens the LLM keeps in memory for your current conversation.
A bigger context size means more computational overhead, so it makes sense to constrain it. This can be done with the `contextSize` setting at creation time:
```typescript
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
contextSize: 4096,
});
```
The default value is `4096`, which is mainly suitable for short and simple conversations. Choosing the right context size is important and depends heavily on your use case. A good place to start is your selected model's documentation, which usually states a recommended context size.
Even with a properly selected context size it might happen that you fill up your entire context during a conversation. When this happens, NobodyWho will shrink the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history, until the size reaches `contextSize / 2`. The KV cache is also updated automatically.
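The shrinking rule described above can be sketched roughly as follows. This is only an illustration, not NobodyWho's actual implementation: the `Message` type, the `countTokens` helper (which counts whitespace-separated words instead of real tokens) and `shrinkHistory` are hypothetical names introduced for this example.

```typescript
// Hypothetical message shape, for illustration only.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Stand-in for a real tokenizer: counts whitespace-separated words.
function countTokens(message: Message): number {
  return message.content.split(/\s+/).filter((word) => word.length > 0).length;
}

function totalTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + countTokens(m), 0);
}

// Drops the oldest messages, except the system prompt and the first user
// message, until the history fits within contextSize / 2.
// Assumes the system prompt and first user message open the conversation.
function shrinkHistory(messages: Message[], contextSize: number): Message[] {
  const target = Math.floor(contextSize / 2);
  const firstUserIndex = messages.findIndex((m) => m.role === "user");

  const pinned: Message[] = [];
  const removable: Message[] = [];
  messages.forEach((m, i) => {
    if (m.role === "system" || i === firstUserIndex) pinned.push(m);
    else removable.push(m);
  });

  while (removable.length > 0 && totalTokens(pinned) + totalTokens(removable) > target) {
    removable.shift(); // the oldest removable message goes first
  }
  return [...pinned, ...removable];
}
```

The practical consequence is that early turns of a long conversation silently disappear, so anything the model must always remember belongs in the system prompt.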
To reset the current context content, call `resetContext()` with a new system prompt and potentially changed tools.
```typescript
await chat.resetContext({ systemPrompt: "New system prompt", tools: [] });
```
If you want to keep the existing defaults (`systemPrompt`, `tools`) and only clear the conversation, use `resetHistory` instead.
## Sharing model between contexts
There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. In this case you must load the model separately from creating the `Chat` instance.
```typescript
import { Model, Chat } from "react-native-nobodywho";
const model = await Model.load({ modelPath: "/path/to/model.gguf" });
const chat1 = new Chat({ model });
const chat2 = new Chat({ model });
```
NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.
## GPU
When using `Model.load` or `Chat.fromPath` you have the option to disable/enable GPU acceleration:
```typescript
const model = await Model.load({ modelPath: "/path/to/model.gguf", useGpu: false });
```
or
```typescript
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
useGpu: false,
});
```
By default `useGpu` is set to `true`.
## Template Variables
Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.
### Using Template Variables
You can set template variables when creating a chat or modify them on existing instances:
```typescript
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
templateVariables: { enable_thinking: true },
});
```
You can also modify template variables on an existing chat instance:
```typescript
// Set a single template variable
await chat.setTemplateVariable("enable_thinking", true);
// Get current template variables
const variables = await chat.getTemplateVariables();
console.log(variables); // Map { "enable_thinking" => true }
```
With the next message sent, the updated settings will be propagated to the model.
### Example: Qwen3 and Qwen3.5 Reasoning
The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:
```typescript
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
templateVariables: { enable_thinking: true },
});
const response = chat.ask("Solve this logic puzzle: ...");
```
When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.
### Model-Specific Variables
Different models may support different template variables depending on their chat template implementation. The available variables and their effects depend entirely on how the model's chat template is designed. Check your model's documentation to see which template variables are supported.
:::info
Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.
:::
---
# Downloading models
NobodyWho can either load a model from a path on disk or download it for you on first use, caching it for subsequent runs. This page covers the available model path formats, how to observe a download in progress, how to access gated/private models, and how to inspect what's already in the local cache.
## Supported model path formats
The `modelPath` option to `Chat.fromPath` and `downloadModel` accepts:
| Form | Example | Notes |
| ---- | ------- | ----- |
| HuggingFace reference | `hf:owner/repo/file.gguf` | Downloaded and cached on first use |
| HTTPS URL | `https://example.com/model.gguf` | Downloaded and cached on first use |
| Local path | `./model.gguf` | Used as-is |
The HuggingFace prefix is case-insensitive and the `//` is optional — `hf:`, `hf://`, `huggingface:`, and `huggingface://` all mean the same thing. Remote models are downloaded to the platform cache directory on first load and re-used on subsequent runs.
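To make the prefix rules concrete, here is a minimal sketch of how the three path forms could be told apart. `classifyModelPath` and `ModelSource` are hypothetical names for this example; this is not NobodyWho's actual parser, and the exact edge-case handling in the library may differ.

```typescript
type ModelSource = "huggingface" | "https" | "local";

// Illustrative classifier for the three path forms in the table above.
function classifyModelPath(modelPath: string): ModelSource {
  // The prefix is case-insensitive, and because the `//` is optional,
  // matching up to the colon is enough.
  if (/^(hf|huggingface):/i.test(modelPath)) return "huggingface";
  if (modelPath.startsWith("https://")) return "https";
  // Anything else is treated as a path on disk.
  return "local";
}

console.log(classifyModelPath("HF://owner/repo/file.gguf"));
console.log(classifyModelPath("./model.gguf"));
```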
## Tracking download progress
When loading a remote model, pass an `onDownloadProgress` option to observe the download. It receives `(downloaded, total)` byte counts, is throttled to roughly 10 Hz with a guaranteed final emit on completion, and is not called for cached or local files.
```typescript
const chat = await Chat.fromPath({
modelPath: "huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf",
onDownloadProgress: (downloaded, total) => {
console.log(`${downloaded} / ${total} bytes`);
},
});
```
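Raw byte counts are rarely what you want to show to a user. The helper below is a small sketch that turns them into a percentage and MiB figures. `formatProgress` is a hypothetical name, and the sketch assumes the counts arrive as plain `number` values; if they are `bigint` in your version, convert them with `Number()` first.

```typescript
// Hypothetical helper: turns (downloaded, total) byte counts into a short
// status line, e.g. for a progress label in the UI.
function formatProgress(downloaded: number, total: number): string {
  const toMiB = (bytes: number): string => (bytes / (1024 * 1024)).toFixed(1);
  // Guard against a zero or unknown total to avoid dividing by zero.
  if (total <= 0) return `${toMiB(downloaded)} MiB`;
  const percent = ((downloaded / total) * 100).toFixed(1);
  return `${percent}% (${toMiB(downloaded)} / ${toMiB(total)} MiB)`;
}

console.log(formatProgress(21 * 1024 * 1024, 50 * 1024 * 1024));
```

It can then be called from inside the callback, for example `onDownloadProgress: (downloaded, total) => console.log(formatProgress(downloaded, total))`.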
## Downloading a gated model
Some HuggingFace models are private or gated by a license you need to accept. In both cases you need to be authorized to download the model weights.
You can manually download the GGUF file via your web browser, place it on the device, and then point `Chat.fromPath` at the local path:
```typescript
import { Chat } from "react-native-nobodywho";
const chat = await Chat.fromPath({ modelPath: "./model.gguf" });
```
Or use `downloadModel` with an `Authorization` header:
```typescript
import { downloadModel, Chat } from "react-native-nobodywho";
const modelPath = await downloadModel({
modelPath: "huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf",
headers: { Authorization: "Bearer your_hf_token" },
});
const chat = await Chat.fromPath({ modelPath });
```
You can generate a HuggingFace token in [your account settings](https://huggingface.co/settings/tokens).
## Inspecting the model cache
`getCachedModels()` returns every `.gguf` model in NobodyWho's cache directory, paired with its size in bytes. This is the same cache used by `downloadModel` and by `Chat.fromPath`'s `huggingface:` paths.
```typescript
import { getCachedModels } from "react-native-nobodywho";
for (const model of getCachedModels()) {
console.log(`${model.path}: ${Number(model.size)} bytes`);
}
```
Each entry has:
- `path: string` — absolute path to the cached `.gguf` file
- `size: bigint` — size in bytes (the underlying Rust `u64`, exposed as JavaScript `bigint`)
The array is empty if nothing has been downloaded yet. The call throws if the cache directory cannot be read.
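Because `size` is a `bigint`, it cannot be mixed with regular numbers in arithmetic. A minimal sketch of summing up the cache size, where `CachedModel` mirrors the entry shape described above and `totalCacheSizeMiB` is a hypothetical helper:

```typescript
// Shape of a cache entry as described above.
type CachedModel = { path: string; size: bigint };

// Sums the sizes as bigint first, so precision is kept until the final
// conversion to a regular number.
function totalCacheSizeMiB(models: CachedModel[]): number {
  const totalBytes = models.reduce((sum, model) => sum + model.size, BigInt(0));
  return Number(totalBytes) / (1024 * 1024);
}

// Example entries standing in for the result of getCachedModels().
const exampleEntries: CachedModel[] = [
  { path: "/cache/a.gguf", size: BigInt(1048576) },
  { path: "/cache/b.gguf", size: BigInt(524288) },
];
console.log(`${totalCacheSizeMiB(exampleEntries).toFixed(1)} MiB cached`);
```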
---
# Embeddings & RAG
When you want your LLM to search through documents, understand semantic similarity, or build retrieval-augmented generation (RAG) systems, you'll need embeddings and cross-encoders.
## Understanding Embeddings
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Texts with similar meanings have similar vectors, even if they use different words.
For example, "Schedule a meeting for next Tuesday" and "Book an appointment next week" would have very similar embeddings, despite using different words.
## The Encoder
The `Encoder` object converts text into embedding vectors. You'll need a specialized embedding model (different from chat models).
We recommend you first try [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).
```typescript
import { Encoder } from "react-native-nobodywho";
const encoder = await Encoder.fromPath({
modelPath: "/path/to/embedding-model.gguf",
});
const embedding = await encoder.encode("What is the weather like?");
console.log(`Vector with ${embedding.length} dimensions`);
```
The resulting embedding is an array of numbers (typically 384 or 768 dimensions depending on the model).
### Comparing Embeddings
To measure how similar two pieces of text are, compare their embeddings using cosine similarity:
```typescript
import { Encoder, cosineSimilarity } from "react-native-nobodywho";
const encoder = await Encoder.fromPath({
modelPath: "/path/to/embedding-model.gguf",
});
const query = await encoder.encode("How do I reset my password?");
const doc1 = await encoder.encode(
"You can reset your password in the account settings",
);
const doc2 = await encoder.encode(
"The password requirements include 8 characters minimum",
);
const similarity1 = cosineSimilarity(query, doc1);
const similarity2 = cosineSimilarity(query, doc2);
console.log(`Document 1 similarity: ${similarity1.toFixed(3)}`); // Higher score
console.log(`Document 2 similarity: ${similarity2.toFixed(3)}`); // Lower score
```
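Cosine similarity returns a value between -1 and 1. Values close to 1 mean the two vectors point in nearly the same direction (very similar meaning), values around 0 mean the texts are unrelated, and values close to -1 mean the vectors point in opposite directions.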
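For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below shows that computation; it is an illustration only, and the library's own `cosineSimilarity` may be implemented differently. The name `cosine` is hypothetical.

```typescript
// Illustrative cosine similarity: dot(a, b) / (|a| * |b|).
// Accepts plain arrays as well as typed arrays such as Float32Array.
function cosine(a: ArrayLike<number>, b: ArrayLike<number>): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Note: a zero vector has no direction, so the result is NaN in that case.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosine([1, 0], [1, 0]));  // same direction
console.log(cosine([1, 0], [0, 1]));  // orthogonal
console.log(cosine([1, 0], [-1, 0])); // opposite direction
```

Since only the direction of the vectors matters, scaling an embedding by a constant does not change the score.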
## The CrossEncoder for Better Ranking
While embeddings work well for initial filtering, cross-encoders provide more accurate relevance scoring. They directly compare a query against documents to determine how well the document answers the query.
The key difference is that embeddings compare vector similarity, while cross-encoders understand the relationship between query and document, at a potentially larger computation cost.
### Why CrossEncoder Matters
Consider this example:
```
Query: "What are the office hours for customer support?"
Documents: [
"Customer asked: What are the office hours for customer support?",
"Support team responds: Our customer support is available Monday-Friday 9am-5pm EST",
"Note: Weekend support is not available at this time"
]
```
Using embeddings alone, the first document scores highest (most similar to the query) even though it provides no useful information. A cross-encoder correctly identifies that the second document actually answers the question.
### Using CrossEncoder
```typescript
import { CrossEncoder } from "react-native-nobodywho";
// Download a reranking model like bge-reranker-v2-m3-Q8_0.gguf
const crossencoder = await CrossEncoder.fromPath({
modelPath: "/path/to/reranker-model.gguf",
});
const query = "How do I install Python packages?";
const documents = [
"Someone previously asked about Python packages",
"Use pip install package-name to install Python packages",
"Python packages are not included in the standard library",
];
// Get relevance scores for each document
const scores = await crossencoder.rank(query, documents);
console.log(scores); // [0.23, 0.89, 0.45] - second doc scores highest
```
### Automatic Sorting
For convenience, use `rankAndSort` to get documents sorted by relevance:
```typescript
// Returns list of [document, score] pairs, sorted by score
const rankedDocs = await crossencoder.rankAndSort(query, documents);
for (const [doc, score] of rankedDocs) {
console.log(`[${score.toFixed(3)}] ${doc}`);
}
```
This returns documents ordered from most to least relevant.
## Building a RAG System
Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The LLM uses retrieved documents to ground its responses in your knowledge base.
Here's a complete example building a customer service assistant with access to company policies:
```typescript
import { Chat, Tool, CrossEncoder } from "react-native-nobodywho";
// Initialize the cross-encoder for document ranking
const crossencoder = await CrossEncoder.fromPath({
modelPath: "/path/to/reranker-model.gguf",
});
// Your knowledge base
const knowledge = [
"Our company offers a 30-day return policy for all products",
"Free shipping is available on orders over $50",
"Customer support is available via email and phone",
"We accept credit cards, PayPal, and bank transfers",
"Order tracking is available through your account dashboard",
];
// Create a tool that searches the knowledge base
const searchKnowledgeTool = new Tool({
name: "search_knowledge",
description: "Search the knowledge base for relevant information",
parameters: [
{ name: "query", type: "string", description: "The search query" },
],
call: async (query: string) => {
const ranked = await crossencoder.rankAndSort(query, knowledge);
const topDocs = ranked
.slice(0, 3)
.map(([doc]) => doc);
return topDocs.join("\n");
},
});
// Create a chat with access to the knowledge base
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
systemPrompt:
"You are a customer service assistant. Use the search_knowledge tool to find relevant information from our policies before answering customer questions.",
tools: [searchKnowledgeTool],
});
// The chat will automatically search the knowledge base when needed
const response = await chat.ask("What is your return policy?").completed();
console.log(response);
```
The LLM will call the `search_knowledge` tool, receive the most relevant documents, and use them to generate an accurate answer.
## Recommended Models
### For Embeddings
- [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf) - Good balance of speed and quality (~25MB)
  - Supports English text with 384-dimensional embeddings
### For Cross-Encoding (Reranking)
- [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - Multilingual support with excellent accuracy
## Best Practices
**Precompute embeddings**: If you have a fixed knowledge base, generate embeddings once and reuse them. Don't re-encode the same documents repeatedly.
**Use embeddings for filtering**: When working with large document collections (1000+ documents), use embeddings to narrow down to the top 50-100 candidates, then use a cross-encoder to rerank them.
**Limit cross-encoder inputs**: Cross-encoders are more expensive than embeddings. Don't pass thousands of documents to `rank()` - filter first with embeddings.
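The filter-then-rerank pattern from the practices above can be sketched without any model at all. This is a minimal illustration, not NobodyWho's API: `IndexedDoc`, `filterCandidates` and `rerankCandidates` are hypothetical names, a plain dot product stands in for the embedding similarity, and the `scoreDoc` callback stands in for an expensive cross-encoder call.

```typescript
interface IndexedDoc {
  text: string;
  embedding: number[]; // precomputed once and reused for every query
}

// Dot product as a stand-in similarity measure (an assumption for this sketch).
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Stage 1: cheap filter. Keep the `keep` documents closest to the query embedding.
function filterCandidates(
  queryEmbedding: number[],
  index: IndexedDoc[],
  keep: number,
): IndexedDoc[] {
  return index
    .map((doc) => ({ doc, similarity: dotProduct(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, keep)
    .map((entry) => entry.doc);
}

// Stage 2: expensive rerank. Only the survivors of stage 1 are scored.
function rerankCandidates(
  candidates: IndexedDoc[],
  scoreDoc: (text: string) => number,
  top: number,
): string[] {
  return candidates
    .map((doc) => ({ text: doc.text, score: scoreDoc(doc.text) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, top)
    .map((entry) => entry.text);
}

// Toy data: 2-dimensional embeddings made up for illustration.
const toyIndex: IndexedDoc[] = [
  { text: "returns policy", embedding: [1, 0] },
  { text: "shipping costs", embedding: [0.9, 0.1] },
  { text: "office address", embedding: [0, 1] },
];
const survivors = filterCandidates([1, 0.05], toyIndex, 2);
// Stand-in scorer that prefers documents mentioning "shipping".
const reranked = rerankCandidates(survivors, (text) => (/shipping/.test(text) ? 1 : 0), 1);
console.log(reranked);
```

The expensive scorer only ever sees `keep` documents, no matter how large the index grows.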
---
# Getting started
## How do I get started?
First, install `react-native-nobodywho`.
```bash
npm install react-native-nobodywho
```
No additional initialization step is required — the native module is loaded automatically when you first import from the package.
Now you are ready to pick a model. NobodyWho can download GGUF models directly from Hugging Face — just pass a `huggingface:` path. See [model selection](/docs/model-selection) for recommendations.
Then create a `Chat` object and call `.ask`!
```typescript
import { Chat } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf",
});
const response = await chat.ask("Is water wet?").completed();
console.log(response); // Yes, indeed, water is wet!
```
This is a super simple example, but we believe that examples which do simple things should be simple!
To get a full overview of the functionality provided by NobodyWho, simply keep reading.
## Android requirements
If you use the x86_64 Android emulator for development, your app must set `minSdkVersion` to at least 31. This is due to a threading feature (ELF TLS) that the Rust runtime requires on x86_64. ARM64 devices (i.e. all real phones) work with any `minSdkVersion`.
No specific NDK version is required — NobodyWho ships prebuilt shared libraries, so your project's NDK version does not affect the Rust code.
## Minimum recommended specs
- iOS: iPhone 11 or newer with at least 4 GB of RAM. We tested a Qwen3 0.6B (332 MB) on an iPhone X (iOS 16) and while it ran, performance was too slow to be practical.
- Android: Snapdragon 855 / Adreno 640 / 6 GB RAM or better. The same Qwen3 0.6B model performed notably better on a OnePlus 7 Pro (Android 12) than on the iPhone X tested above.
## Feedback & Contributions
We welcome your feedback and ideas!
- Bug Reports & Improvements: If you encounter a bug or have suggestions, please open an issue on our [Issues](https://github.com/nobodywho-ooo/nobodywho/issues) page.
- Feature Requests & Questions: For new feature requests or general questions, join the discussion on our [Discussions](https://github.com/nobodywho-ooo/nobodywho/discussions) page.
---
# Sampling
The model does not produce tokens directly, but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from that distribution. This is the job of a **sampler**. With NobodyWho you can freely modify the sampler to achieve better-quality outputs or to constrain the outputs to a known format (e.g. JSON).
## Sampler presets
To get a quick start, NobodyWho offers a couple of well-known presets, which you can quickly utilize.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:
```typescript
import { Chat, SamplerPresets } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
sampler: SamplerPresets.temperature(0.2),
});
```
Setting `temperature` to `0.2` makes the distribution over next tokens less flat, so the sampler will favour the more probable tokens.
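To see why a low temperature favours probable tokens, here is a minimal sketch of the usual temperature-scaled softmax. It is a standalone illustration under assumed inputs (the logits are made up), not NobodyWho's internal code.

```typescript
// Convert raw model scores (logits) into probabilities, scaled by temperature.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new Error("temperature must be positive");
  }
  const scaled = logits.map((logit) => logit / temperature);
  // Subtract the maximum before exponentiating for numerical stability.
  const maxScaled = Math.max(...scaled);
  const exps = scaled.map((value) => Math.exp(value - maxScaled));
  const total = exps.reduce((acc, value) => acc + value, 0);
  return exps.map((value) => value / total);
}

// Hypothetical logits for three candidate tokens.
const exampleLogits = [2.0, 1.0, 0.5];
const flatDistribution = softmaxWithTemperature(exampleLogits, 1.0);
const sharpDistribution = softmaxWithTemperature(exampleLogits, 0.2);
console.log(flatDistribution.map((p) => p.toFixed(3)));
console.log(sharpDistribution.map((p) => p.toFixed(3)));
```

Dividing the logits by a temperature below 1 widens the gaps between them, so the most likely token takes a larger share of the probability mass.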
To see the whole list of presets, check out the `SamplerPresets` class:
```typescript
class SamplerPresets {
  static default(): SamplerConfig;
  static dry(): SamplerConfig;
  static greedy(): SamplerConfig;
  static json(): SamplerConfig;
  static temperature(temperature: number): SamplerConfig;
  static topK(topK: number): SamplerConfig;
  static topP(topP: number): SamplerConfig;
  // Constrain output to a specific format:
  static constrainWithJsonSchema(schema: string): SamplerConfig;
  static constrainWithRegex(pattern: string): SamplerConfig;
  static constrainWithGrammar(grammar: string): SamplerConfig;
}
```
## Structured output
One of the most useful features is constraining the model to produce structured output —
this gives you a hard guarantee that the output matches a specific format, rather than
relying on the model to get it right on its own.
### Regular expressions
For simpler patterns, you can constrain the output with a regex. Both regex literals and strings are accepted:
```typescript
// Force the model to answer with exactly "yes" or "no"
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
sampler: SamplerPresets.constrainWithRegex(/yes|no/),
});
const answer = await chat.ask("Is the sky blue?").completed();
```
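Conceptually, a constrained sampler removes every candidate that would break the format and renormalizes what is left. The sketch below illustrates that idea only: `Candidate` and `constrainCandidates` are hypothetical names, the probabilities are made up, and a real engine checks partial matches token by token rather than whole answers.

```typescript
interface Candidate {
  token: string;
  probability: number;
}

// Drop candidates that violate the constraint, then renormalize the rest.
function constrainCandidates(
  candidates: Candidate[],
  isAllowed: (token: string) => boolean,
): Candidate[] {
  const allowed = candidates.filter((candidate) => isAllowed(candidate.token));
  const total = allowed.reduce((sum, candidate) => sum + candidate.probability, 0);
  if (total === 0) {
    throw new Error("no candidate satisfies the constraint");
  }
  return allowed.map((candidate) => ({
    token: candidate.token,
    probability: candidate.probability / total,
  }));
}

// Hypothetical next-token candidates before the constraint is applied.
const unconstrained: Candidate[] = [
  { token: "Well", probability: 0.6 },
  { token: "yes", probability: 0.3 },
  { token: "no", probability: 0.1 },
];
const answerPattern = /^(yes|no)$/;
const constrainedCandidates = constrainCandidates(unconstrained, (token) =>
  answerPattern.test(token),
);
console.log(constrainedCandidates);
```

Because disallowed tokens are removed before sampling, the guarantee does not depend on the model choosing to follow instructions.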
### JSON schema
In some use cases it is useful to let the LLM generate JSON output.
The simple way is to enforce any valid JSON with the `json` preset:
```typescript
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
sampler: SamplerPresets.json(),
});
```
Or use a JSON schema to force the LLM to produce the specific object shape
that you want:
```typescript
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  sampler: SamplerPresets.constrainWithJsonSchema({
    type: "object",
    properties: {
      name: { type: "string", maxLength: 50 },
      age: { type: "integer" },
    },
    required: ["name", "age"],
    additionalProperties: false,
  }),
});
const response = await chat.ask("Give me a person as JSON with name and age fields.").completed();
const person = JSON.parse(response); // always valid JSON matching the schema
```
### Custom grammars (advanced)
For cases where JSON schema and regex are not expressive enough, you can supply a custom grammar.
`constrainWithGrammar` accepts both **Lark** syntax and **GBNF** (llama.cpp format) —
NobodyWho automatically converts GBNF to Lark before passing it to the inference engine.
**Lark syntax** (recommended):
```typescript
const sampler = SamplerPresets.constrainWithGrammar(`
start: record (NEWLINE record)* NEWLINE?
record: field ("," field)*
field: /[^,"\\n\\r]+/
NEWLINE: /\\r?\\n/
`);
```
**GBNF syntax** (also accepted):
```typescript
const sampler = SamplerPresets.constrainWithGrammar(`
root ::= record (newline record)* newline?
record ::= field ("," field)*
field ::= [^,"\\n\\r]+
newline ::= "\\r\\n" | "\\n"
`);
```
See the [Lark documentation](https://lark-parser.readthedocs.io/en/latest/grammar.html) and the
[GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md) for the
full grammar syntax.
:::info
The older `SamplerPresets.grammar()` method is deprecated. Use
`SamplerPresets.constrainWithGrammar()` instead — it accepts both Lark and GBNF strings.
:::
## Defining your own samplers
Sampler presets abstract away some control that you might want - for example, if you
want to chain samplers, change more "advanced" parameters, etc. For that use case,
we provide the `SamplerBuilder` class:
```typescript
import { Chat, SamplerBuilder } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
sampler: new SamplerBuilder().temperature(0.8).topK(5).dist() as SamplerConfig,
});
```
With `SamplerBuilder` you can chain multiple steps together and then select how you
want to sample from the distribution. Keep in mind that `SamplerBuilder` provides two
types of methods: ones which modify the distribution (returning the `SamplerBuilder`
instance again) and ones which sample from the distribution (returning `SamplerConfig`).
For the sampler to work properly, always end the chain with one of the
sampling steps (e.g. `dist()`, `greedy()`, `mirostatV2()`, etc.).
For reproducible output, set the RNG seed with `.seed(value)` anywhere in the chain.
It is consumed by every random sampler in the chain — `dist`, `mirostatV1`, `mirostatV2`,
and the `xtc` shift step. `greedy` ignores it. If unset, a default seed is used.
```typescript
const sampler = new SamplerBuilder().temperature(0.8).topK(5).seed(42).dist() as SamplerConfig;
```
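Why a fixed seed gives reproducible output can be shown with a small standalone sketch. It uses a mulberry32-style generator and a made-up distribution; both are assumptions for illustration and say nothing about which RNG NobodyWho uses internally.

```typescript
// A small seeded pseudo-random generator (mulberry32 style) returning floats in [0, 1).
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick an index according to the given probabilities.
function sampleIndex(probabilities: number[], random: () => number): number {
  const threshold = random();
  let cumulative = 0;
  for (let i = 0; i < probabilities.length; i++) {
    cumulative += probabilities[i];
    if (threshold < cumulative) {
      return i;
    }
  }
  return probabilities.length - 1; // guard against floating-point rounding
}

// Draw `count` token indices using a generator created from `seed`.
function drawSequence(seed: number, probabilities: number[], count: number): number[] {
  const random = seededRandom(seed);
  const picks: number[] = [];
  for (let i = 0; i < count; i++) {
    picks.push(sampleIndex(probabilities, random));
  }
  return picks;
}

// Hypothetical probabilities for three candidate tokens.
const tokenProbabilities = [0.5, 0.3, 0.2];
const firstRun = drawSequence(42, tokenProbabilities, 8);
const secondRun = drawSequence(42, tokenProbabilities, 8);
console.log(firstRun.join(",") === secondRun.join(",")); // true: same seed, same picks
```

A greedy step never consults the generator, which is why the seed has no effect on it.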
You can also change the sampler configuration on an existing chat instance:
```typescript
const sampler = new SamplerBuilder()
.temperature(0.8)
.topK(5)
.dist() as SamplerConfig;
await chat.setSamplerConfig(sampler);
```
---
# Tool Calling
To give your LLM the ability to interact with the outside world, you will need tool calling.
:::info
Note that **not every model** supports tool calling. If the model does not
support it, it might never call your tools.
For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/collections/NobodyWho/qwen-3) family of models.
:::
## Declaring a tool
A tool is created by providing a name, description, parameter schemas, and a callback function.
Any regular function can be used as a tool — arguments from the LLM are passed positionally
in the same order as the `parameters` array:
```typescript
import { Tool } from 'react-native-nobodywho';
const circleArea = (radius: number): string => {
const area = Math.PI * radius * radius;
return `Circle with radius ${radius} has area ${area.toFixed(2)}`;
};
const circleAreaTool = new Tool({
name: 'circle_area',
description: 'Calculates the area of a circle given its radius',
parameters: [
{ name: 'radius', type: 'number', description: 'The radius of the circle' },
],
call: circleArea,
});
```
Each entry in the `parameters` array uses [JSON Schema](https://json-schema.org/) properties (`type`, `enum`, `description`, etc.) plus a `name` field.
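Passing arguments positionally means the order of the `parameters` array has to match the order of the callback's arguments. The sketch below illustrates that convention under one assumption: that the model's arguments arrive as a named object. `ParameterSpec` and `toPositionalArgs` are hypothetical helpers, not part of the package, and how NobodyWho performs this step internally is not covered here.

```typescript
interface ParameterSpec {
  name: string;
  type: string;
  description?: string;
}

// Order named arguments so they line up with the `parameters` array.
function toPositionalArgs(
  parameters: ParameterSpec[],
  namedArgs: Record<string, unknown>,
): unknown[] {
  return parameters.map((parameter) => {
    if (!(parameter.name in namedArgs)) {
      throw new Error(`missing argument: ${parameter.name}`);
    }
    return namedArgs[parameter.name];
  });
}

const rectangleParameters: ParameterSpec[] = [
  { name: "width", type: "number" },
  { name: "height", type: "number" },
];
const rectangleArea = (width: number, height: number): string =>
  `Area: ${width * height}`;

// Hypothetical arguments as a model might emit them, in arbitrary key order.
const modelArguments = { height: 3, width: 2 };
const positionalArgs = toPositionalArgs(rectangleParameters, modelArguments) as [number, number];
console.log(rectangleArea(...positionalArgs)); // prints "Area: 6"
```

If the array listed `height` before `width`, the callback would receive its arguments swapped, so keep the two orders in sync.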
To let your LLM use it, simply add it when creating `Chat`:
```typescript
import { Chat } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "/path/to/model.gguf",
tools: [circleAreaTool],
});
```
NobodyWho then figures out the right tool calling format and configures the sampler.
Naturally, more tools can be defined and the model can chain the calls for them:
```typescript
import { Chat, Tool } from "react-native-nobodywho";
const getCurrentDir = (): string => '/home/user/documents';
// In a real app, you'd read the filesystem here
const listFiles = (path: string): string =>
  'Files: report.pdf, notes.txt, model.gguf';
// In a real app, you'd check the actual file size
const getFileSize = (filepath: string): string => 'File size: 1024 bytes';
const getCurrentDirTool = new Tool({
  name: "get_current_dir",
  description: "Gets path of the current directory",
  parameters: [],
  call: getCurrentDir,
});
const listFilesTool = new Tool({
  name: "list_files",
  description: "Lists files in the given directory",
  parameters: [
    { name: "path", type: "string", description: "The path to the directory to list" },
  ],
  call: listFiles,
});
const getFileSizeTool = new Tool({
  name: "get_file_size",
  description: "Gets the size of a file in bytes",
  parameters: [
    { name: "filepath", type: "string", description: "The path to the file" },
  ],
  call: getFileSize,
});
const chat = await Chat.fromPath({
  modelPath: "/path/to/model.gguf",
  tools: [getCurrentDirTool, listFilesTool, getFileSizeTool],
});
const response = await chat
  .ask("What is the biggest file in my current directory?")
  .completed();
console.log(response);
```
## Tool calling and the context
As with most things made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.
---
# Vision & Hearing
A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image and audio information to your LLM.
## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts for making this work:
1. A multimodal LLM, so the LLM can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens
To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model includes `mmproj` in its name.
If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.
With the downloaded GGUFs, you can load them using `Chat.fromPath`:
```typescript
import { Chat } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf",
systemPrompt: "You are a helpful assistant, that can hear and see stuff!",
});
```
Or load the model separately:
```typescript
import { Model, Chat } from "react-native-nobodywho";
const model = await Model.load({
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf",
});
const chat = new Chat({
model,
systemPrompt: "You are a helpful assistant.",
});
```
## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
Use `Prompt` to build prompts that mix text, images, and audio, then pass them to `chat.ask()`:
```typescript
import { Chat, Prompt } from "react-native-nobodywho";
const chat = await Chat.fromPath({
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf",
});
const response = await chat
.ask(
new Prompt([
Prompt.Text("Tell me what you see in the image and what you hear in the audio."),
Prompt.Image("/path/to/dog.png"),
Prompt.Audio("/path/to/sound.mp3"),
]),
)
.completed();
```
## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try changing the order of the text and the multimodal files,
or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.
```typescript
const prompt = new Prompt([
Prompt.Text("Tell me what you see in the image."),
Prompt.Image("/path/to/dog.png"),
Prompt.Text("Also tell me what you hear in the audio."),
Prompt.Audio("/path/to/sound.mp3"),
]);
```
Also, there is still a lot of variance in how models process images internally.
This causes, for example, differences in how quickly a model consumes context: for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, it scales with the size of the image. If images fill up the context too quickly, you can increase the context size if resources allow:
```typescript
const chat = await Chat.fromPath({
modelPath: "/path/to/vision-model.gguf",
projectionModelPath: "/path/to/mmproj.gguf",
contextSize: 8192,
});
```
Alternatively, preprocess your images with some kind of downsampling (sometimes even changing the image type helps).
Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file: for Gemma 4,
ingesting audio works best on BF16, while other types reportedly struggle. We thus recommend trying out different
projection model files if the one you picked does not work.
As always with more niche models, you may run into bugs. If you stumble upon any, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix them.
---
# Chat
As you may have noticed in the [welcome guide](./), every interaction with your LLM starts by instantiating a `Chat` object.
In the following sections, we talk about which configuration options it has, and when to use them.
## Creating a Chat
There are two main ways of instantiating a `Chat` object and the difference lies in when the model file is loaded.
The simplest way is using `Chat.fromPath` like so:
```dart
final chat = await nobodywho.Chat.fromPath(modelPath: "./model.gguf");
```
This function is async since loading a model can take a bit of time, but this should not block any of your UI.
Another way to achieve the same thing is to load the model separately and then use the `Chat` constructor:
```dart
final model = await nobodywho.Model.load(modelPath: "./model.gguf");
final chat = nobodywho.Chat(model: model);
```
This allows for sharing the model between several `Chat` instances.
## Prompts and responses
The `Chat.ask()` function is central to NobodyWho. This function sends your message to the LLM, which then starts generating a response.
```dart
import "dart:io";
final chat = await nobodywho.Chat.fromPath(modelPath: "./model.gguf");
final response = chat.ask("Is water wet?");
```
The return type of `ask` is a `TokenStream`.
If you want to start reading the response as soon as possible, you can just iterate over the `TokenStream`.
Each token is either an individual word or a fragment of a word.
```dart continuation
await for (final token in response) {
print(token);
}
```
If you just want to get the complete response, you can call `TokenStream.completed()`.
This will return the entire response string once the model is done generating its entire response.
```dart continuation
final fullResponse = await response.completed();
```
All of your messages and the model's responses are stored in the `Chat` object, so the next time you call `Chat.ask()`, it will remember the previous messages.
## Chat history
If you want to inspect the messages inside the `Chat` object, you can use `getChatHistory`.
```dart continuation
final msgs = await chat.getChatHistory();
print(msgs[0].content); // "Is water wet?"
```
Similarly, if you want to edit what messages are in the context, you can use `setChatHistory`:
```dart continuation
await chat.setChatHistory([
nobodywho.Message.user(content: "What is water?")
]);
```
## System prompt
A system prompt is a special message put into the chat context, which should guide the model's overall behavior.
Some models ship with a built-in system prompt. If you don't specify a system prompt yourself, NobodyWho will fall back to using the model's default system prompt.
You can specify a system prompt when initializing a `Chat`:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
systemPrompt: "You are a mischievous assistant!"
);
```
This `systemPrompt` is then persisted until the chat context is `reset`.
## Context
The context is the window of text that the LLM currently considers. Its size is the number of tokens the LLM keeps in memory for your current conversation.
Since a bigger context size means more computational overhead, it makes sense to constrain it. This can be done with the `contextSize` setting, again at the time of creation:
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
contextSize: 4096
);
```
The default value is `4096`, which is mainly useful for short and simple conversations. Choosing the right context size is quite important and depends heavily on your use case. A good place to start is to look at your selected model's documentation and see what its recommended context size is.
Even with properly selected context size it might happen that you fill up your entire context during a conversation. When this happens, NobodyWho will shrink the context for you. Currently this is done by removing old messages (apart from the system prompt and the first user message) from the chat history, until the size reaches `contextSize / 2`. The KV cache is also updated automatically. In the future we plan on adding more advanced methods of context shrinking.
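The shrinking policy described above can be sketched in a few lines. This is a simplified illustration written in TypeScript, not NobodyWho's implementation: `ChatMessage`, `approxTokenCount` and `shrinkHistory` are hypothetical names, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Crude stand-in for a tokenizer: one token per whitespace-separated word.
function approxTokenCount(message: ChatMessage): number {
  return message.content.split(/\s+/).filter((word) => word.length > 0).length;
}

// Remove the oldest removable messages until the total is at most contextSize / 2.
function shrinkHistory(messages: ChatMessage[], contextSize: number): ChatMessage[] {
  const kept = messages.slice();
  const firstUser = kept.find((message) => message.role === "user");
  const totalTokens = (): number =>
    kept.reduce((sum, message) => sum + approxTokenCount(message), 0);

  while (totalTokens() > contextSize / 2) {
    // The oldest message that is neither the system prompt nor the first user message.
    const oldestRemovable = kept.findIndex(
      (message) => message.role !== "system" && message !== firstUser,
    );
    if (oldestRemovable === -1) {
      break; // nothing left that may be removed
    }
    kept.splice(oldestRemovable, 1);
  }
  return kept;
}

const conversation: ChatMessage[] = [
  { role: "system", content: "be brief" },
  { role: "user", content: "first question here" },
  { role: "assistant", content: "a somewhat long answer" },
  { role: "user", content: "second question" },
  { role: "assistant", content: "short final answer" },
];
const shrunk = shrinkHistory(conversation, 16);
console.log(shrunk.map((message) => message.content));
```

With a context size of 16 the target is 8 tokens, so the two middle messages are dropped while the system prompt, the first user message and the latest answer survive.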
Again, `contextSize` is fixed to the `Chat` instance, so it is currently not possible to change the size after `Chat` is created. To reset the current context content, just call `resetContext()` with the new system prompt and potentially changed tools.
```dart continuation
await chat.resetContext(systemPrompt: "New system prompt", tools: []);
```
If you don't want to change the already set defaults (`systemPrompt`, `tools`), but only reset the context, then go for `resetHistory`.
## Sharing model between contexts
There are scenarios where you would like to keep separate chat contexts (e.g. for every user of your app), but have only one model loaded. In this case you must load the model
separately from creating the `Chat` instance.
For this use case, instead of the path to the `.gguf` model, you can pass in a `Model` object, which can be shared between multiple `Chat` instances.
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final model = await nobodywho.Model.load(modelPath: './model.gguf');
final chat1 = nobodywho.Chat(model: model);
final chat2 = nobodywho.Chat(model: model);
...
```
NobodyWho will then take care of the separation, such that your chat histories won't collide or interfere with each other, while having only one model loaded.
## GPU
When instantiating `Model` or using `Chat.fromPath` you have the option to disable/enable GPU acceleration. This can be done as:
```dart
final model = await nobodywho.Model.load(modelPath: './model.gguf', useGpu: true);
```
or
```dart
final chat = await nobodywho.Chat.fromPath(modelPath: './model.gguf', useGpu: false);
```
By default `useGpu` is set to true.
So far, NobodyWho relies purely on [Vulkan](https://www.vulkan.org); however, support
for more architectures is planned (for details check out our [issues](https://github.com/nobodywho-ooo/nobodywho/issues) or join us on [Discord](https://discord.gg/qhaMc2qCYB)).
## Template Variables
Chat templates are used internally by models to format conversation history into the expected prompt format. Different models may support different template variables that control specific behaviors. Template variables are boolean flags passed to the chat template that can enable or disable certain features.
### Using Template Variables
You can set template variables when creating a chat or modify them on existing instances:
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
templateVariables: {"enable_thinking": true}
);
```
You can also modify template variables on an existing chat instance:
```dart continuation
// Set a single template variable
await chat.setTemplateVariable("enable_thinking", true);
// Set multiple template variables at once
await chat.setTemplateVariables({
"enable_thinking": true,
"verbose_mode": false
});
// Get current template variables
final variables = await chat.getTemplateVariables();
print(variables); // {enable_thinking: true, verbose_mode: false}
```
With the next message sent, the updated settings will be propagated to the model.
### Example: Qwen3 and Qwen3.5 Reasoning
The Qwen3 and Qwen3.5 model families support the `enable_thinking` template variable, which controls whether the model should engage in explicit reasoning steps before answering:
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
templateVariables: {"enable_thinking": true}
);
final response = chat.ask("Solve this logic puzzle: ...");
```
When `enable_thinking` is enabled, these models will show their reasoning process before providing the final answer.
### Model-Specific Variables
Different models may support different template variables depending on their chat template implementation. The available variables and their effects depend entirely on how the model's chat template is designed. Check your model's documentation to see which template variables are supported.
:::info
Note that template variables are model-specific. If a model's chat template doesn't use a specific variable, that variable will be ignored gracefully.
:::
### Backward Compatibility
For backward compatibility, the deprecated `allowThinking` parameter is still available but internally sets the `enable_thinking` template variable:
```dart
// Deprecated - use templateVariables instead
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
allowThinking: true
);
```
---
# Downloading models
NobodyWho can either load a model from a path on disk or download it for you on first use, caching it for subsequent runs. This page covers the available model path formats, how to observe a download in progress, how to access gated/private models, and how to inspect what's already in the local cache.
## Supported model path formats
The `modelPath` argument to `Chat.fromPath`, `downloadModel`, and friends accepts:
| Form | Example | Notes |
| ---- | ------- | ----- |
| HuggingFace reference | `hf:owner/repo/file.gguf` | Downloaded and cached on first use |
| HTTPS URL | `https://example.com/model.gguf` | Downloaded and cached on first use |
| Local path | `./model.gguf` | Used as-is |
The HuggingFace prefix is case-insensitive and the `//` is optional — `hf:`, `hf://`, `huggingface:`, and `huggingface://` all mean the same thing. Remote models are downloaded to the platform cache directory on first load and re-used on subsequent runs.
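The prefix rules above can be captured in a small classifier. This is a sketch written in TypeScript for illustration only: `ModelSource` and `classifyModelPath` are hypothetical names, and treating every string without a recognized prefix as a local path is an assumption.

```typescript
type ModelSource =
  | { kind: "huggingface"; reference: string }
  | { kind: "https"; url: string }
  | { kind: "local"; path: string };

// Classify a model path by its prefix.
function classifyModelPath(modelPath: string): ModelSource {
  // `hf:` or `huggingface:`, case-insensitive, with an optional `//` after the colon.
  const huggingFaceMatch = /^(?:hf|huggingface):(?:\/\/)?(.+)$/i.exec(modelPath);
  if (huggingFaceMatch) {
    return { kind: "huggingface", reference: huggingFaceMatch[1] };
  }
  if (/^https:\/\//.test(modelPath)) {
    return { kind: "https", url: modelPath };
  }
  // Assumption: anything else is treated as a path on disk.
  return { kind: "local", path: modelPath };
}

console.log(classifyModelPath("HF://owner/repo/file.gguf"));
console.log(classifyModelPath("huggingface:owner/repo/file.gguf"));
console.log(classifyModelPath("https://example.com/model.gguf"));
console.log(classifyModelPath("./model.gguf"));
```

All four HuggingFace spellings resolve to the same `owner/repo/file.gguf` reference, which is what makes them interchangeable.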
## Tracking download progress
When loading a remote model, pass an `onDownloadProgress` callback to observe the download. It receives `(downloadedBytes, totalBytes)`, is throttled to roughly 10 Hz with a guaranteed final emit on completion, and is not called for cached or local files.
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: 'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
onDownloadProgress: (downloaded, total) {
print('$downloaded / $total bytes');
},
);
```
## Downloading a gated model
Some HuggingFace models are private or gated by a license you need to accept. In both cases you need to be authorized to download the model weights.
You can manually download the GGUF file via your web browser and then point `Chat.fromPath` at the local path:
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
);
```
Or use `downloadModel` with an `Authorization` header:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final modelPath = await nobodywho.downloadModel(
modelPath: 'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
headers: {'Authorization': 'Bearer your_hf_token'},
);
final chat = await nobodywho.Chat.fromPath(modelPath: modelPath);
```
You can generate a HuggingFace token in [your account settings](https://huggingface.co/settings/tokens).
## Inspecting the model cache
`getCachedModels` returns every `.gguf` model that lives in NobodyWho's cache directory, paired with its size in bytes. This is the same cache used by `downloadModel` and by `Chat.fromPath`'s `huggingface:` paths. The call is synchronous.
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final models = nobodywho.getCachedModels();
for (final (path, size) in models) {
print('$path: ${size ~/ BigInt.from(1024 * 1024)} MiB');
}
```
- Paths are absolute.
- Sizes are `BigInt` byte counts (the underlying Rust `usize`).
- The list is empty if nothing has been downloaded yet.
- Throws if the cache directory cannot be read.
---
# Embeddings & RAG
When you want your LLM to search through documents, understand semantic similarity, or build retrieval-augmented generation (RAG) systems, you'll need embeddings and cross-encoders.
## Understanding Embeddings
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Texts with similar meanings have similar vectors, even if they use different words.
For example, "Schedule a meeting for next Tuesday" and "Book an appointment next week" would have very similar embeddings, despite using different words.
## The Encoder
The `Encoder` object converts text into embedding vectors. You'll need a specialized embedding model (different from chat models).
We recommend you first try [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).
```dart
import 'dart:typed_data';
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
final embedding = await encoder.encode(text: "What is the weather like?");
print("Vector with ${embedding.length} dimensions");
```
The resulting embedding is a `Float32List` (typically 384 or 768 dimensions depending on the model).
### Comparing Embeddings
To measure how similar two pieces of text are, compare their embeddings using cosine similarity:
```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
final query = await encoder.encode(text: "How do I reset my password?");
final doc1 = await encoder.encode(text: "You can reset your password in the account settings");
final doc2 = await encoder.encode(text: "The password requirements include 8 characters minimum");
final similarity1 = nobodywho.cosineSimilarity(
a: query.toList(),
b: doc1.toList()
);
final similarity2 = nobodywho.cosineSimilarity(
a: query.toList(),
b: doc2.toList()
);
print("Document 1 similarity: ${similarity1.toStringAsFixed(3)}"); // Higher score
print("Document 2 similarity: ${similarity2.toStringAsFixed(3)}"); // Lower score
```
Cosine similarity returns a value between -1 and 1, where 1 means identical meaning and -1 means opposite meaning.
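For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below shows the formula in TypeScript for illustration only. It is a standalone function, not the library's `cosineSimilarity`, and the toy vectors are made up.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilaritySketch(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let squaredNormA = 0;
  let squaredNormB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    squaredNormA += a[i] * a[i];
    squaredNormB += b[i] * b[i];
  }
  if (squaredNormA === 0 || squaredNormB === 0) {
    throw new Error("cosine similarity is undefined for zero vectors");
  }
  return dot / Math.sqrt(squaredNormA * squaredNormB);
}

const sameDirection = cosineSimilaritySketch([1, 2], [2, 4]);
const orthogonal = cosineSimilaritySketch([1, 0], [0, 1]);
const oppositeDirection = cosineSimilaritySketch([1, 2], [-1, -2]);
console.log(sameDirection, orthogonal, oppositeDirection);
```

Only the direction of the vectors matters: `[1, 2]` and `[2, 4]` differ in length but point the same way, so they score as maximally similar.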
### Practical Example: Finding Relevant Documents
```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
// Your knowledge base
final documents = [
"Python supports multiple programming paradigms including object-oriented and functional",
"JavaScript is primarily used for web development and runs in browsers",
"SQL is a domain-specific language for managing relational databases",
"Git is a version control system for tracking changes in source code"
];
// Pre-compute document embeddings
final docEmbeddings = <Float32List>[];
for (final doc in documents) {
  docEmbeddings.add(await encoder.encode(text: doc));
}
// Search query
final query = "What language should I use for database queries?";
final queryEmbedding = await encoder.encode(text: query);
// Find the most relevant document
double maxSimilarity = -1;
int bestIdx = 0;
for (int i = 0; i < docEmbeddings.length; i++) {
final similarity = nobodywho.cosineSimilarity(
a: queryEmbedding.toList(),
b: docEmbeddings[i].toList()
);
if (similarity > maxSimilarity) {
maxSimilarity = similarity;
bestIdx = i;
}
}
print("Most relevant: ${documents[bestIdx]}");
print("Similarity score: ${maxSimilarity.toStringAsFixed(3)}");
```
## The CrossEncoder for Better Ranking
While embeddings work well for initial filtering, cross-encoders provide more accurate relevance scoring. They directly compare a query against documents to determine how well the document answers the query.
The key difference is that embeddings compare vector similarity, while cross-encoders understand the relationship between query and document, at a potentially larger computation cost.
### Why CrossEncoder Matters
Consider this example:
```
Query: "What are the office hours for customer support?"
Documents: [
"Customer asked: What are the office hours for customer support?",
"Support team responds: Our customer support is available Monday-Friday 9am-5pm EST",
"Note: Weekend support is not available at this time"
]
```
Using embeddings alone, the first document scores highest (most similar to the query) even though it provides no useful information. A cross-encoder correctly identifies that the second document actually answers the question.
### Using CrossEncoder
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
// Download a reranking model like bge-reranker-v2-m3-Q8_0.gguf
final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');
final query = "How do I install Python packages?";
final documents = [
"Someone previously asked about Python packages",
"Use pip install package-name to install Python packages",
"Python packages are not included in the standard library"
];
// Get relevance scores for each document
final scores = await crossencoder.rank(query: query, documents: documents);
print(scores); // [0.23, 0.89, 0.45] - second doc scores highest
```
### Automatic Sorting
For convenience, use `rankAndSort` to get documents sorted by relevance:
```dart continuation
// Returns list of (document, score) tuples, sorted by score
final rankedDocs = await crossencoder.rankAndSort(query: query, documents: documents);
for (final (doc, score) in rankedDocs) {
print("[${score.toStringAsFixed(3)}] $doc");
}
```
This returns documents ordered from most to least relevant.
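Conceptually, this is the same as pairing each document with its score from `rank` and sorting in descending order. The sketch below shows that idea in plain Dart; `sortByScore` is a hypothetical helper for illustration and not how the library is implemented.

```dart
/// Pairs each document with its score and sorts from highest to lowest score.
/// Hypothetical helper for illustration; not the library's implementation.
List<(String, double)> sortByScore(List<String> documents, List<double> scores) {
  if (documents.length != scores.length) {
    throw ArgumentError('Need exactly one score per document');
  }
  final pairs = [
    for (var i = 0; i < documents.length; i++) (documents[i], scores[i]),
  ];
  // Descending order: compare b to a.
  pairs.sort((a, b) => b.$2.compareTo(a.$2));
  return pairs;
}
```

Taking the first few entries of the result (for example with `.take(3)`) then gives the most relevant documents.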
## Building a RAG System
Retrieval-Augmented Generation (RAG) combines document search with LLM generation. The LLM uses retrieved documents to ground its responses in your knowledge base.
Here's a complete example building a customer service assistant with access to company policies:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
Future<void> main() async {
// Initialize the cross-encoder for document ranking
final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');
// Your knowledge base
final knowledge = [
"Our company offers a 30-day return policy for all products",
"Free shipping is available on orders over \$50",
"Customer support is available via email and phone",
"We accept credit cards, PayPal, and bank transfers",
"Order tracking is available through your account dashboard"
];
// Create a tool that searches the knowledge base
final searchKnowledgeTool = nobodywho.Tool(
function: ({required String query}) async {
// Rank all documents by relevance to the query
final ranked = await crossencoder.rankAndSort(query: query, documents: knowledge);
// Return top 3 most relevant documents
final topDocs = ranked.take(3).map((e) => e.$1).toList();
return topDocs.join("\n");
},
name: "search_knowledge",
description: "Search the knowledge base for relevant information"
);
// Create a chat with access to the knowledge base
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
systemPrompt: "You are a customer service assistant. Use the search_knowledge tool to find relevant information from our policies before answering customer questions.",
templateVariables: {"enable_thinking": false},
tools: [searchKnowledgeTool]
);
// The chat will automatically search the knowledge base when needed
final response = await chat.ask("What is your return policy?").completed();
print(response);
}
```
The LLM will call the `search_knowledge` tool, receive the most relevant documents, and use them to generate an accurate answer.
## Async Operations
In Flutter/Dart, all operations are asynchronous by default. There are no separate `EncoderAsync` or `CrossEncoderAsync` classes - the regular `Encoder` and `CrossEncoder` classes use async/await patterns:
```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;
Future<void> main() async {
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');
// Generate embeddings asynchronously
final embedding = await encoder.encode(text: "What is the weather?");
// Rank documents asynchronously
final query = "What is our refund policy?";
final docs = [
"Refunds processed within 5-7 business days",
"No refunds on sale items",
"Contact support to initiate refund"
];
final ranked = await crossencoder.rankAndSort(query: query, documents: docs);
for (final (doc, score) in ranked) {
print("[${score.toStringAsFixed(3)}] $doc");
}
}
```
## Recommended Models
### For Embeddings
- [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf) - Good balance of speed and quality (~25MB)
  - Supports English text with 384-dimensional embeddings
### For Cross-Encoding (Reranking)
- [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - Multilingual support with excellent accuracy
## Best Practices
**Precompute embeddings**: If you have a fixed knowledge base, generate embeddings once and reuse them. Don't re-encode the same documents repeatedly.
**Use embeddings for filtering**: When working with large document collections (1000+ documents), use embeddings to narrow down to the top 50-100 candidates, then use a cross-encoder to rerank them.
**Limit cross-encoder inputs**: Cross-encoders are more expensive than embeddings. Don't pass thousands of documents to `rank()` - filter first with embeddings.
**Choose appropriate context size**: The `nCtx` parameter (default 4096) should match your model's recommended context size. Check the model documentation.
```dart
// For longer documents, increase the context size via nCtx
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf', nCtx: 8192);
final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf', nCtx: 8192);
```
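The "precompute embeddings" advice above can be sketched as a small in-memory cache. `EmbeddingCache` and `encodeFn` are hypothetical names introduced for this illustration; the cache only assumes that you can supply some async function that turns text into a vector.

```dart
/// Caches embeddings by text so each document is encoded at most once.
/// `EmbeddingCache` and `encodeFn` are hypothetical names for illustration.
class EmbeddingCache {
  EmbeddingCache(this.encodeFn);

  /// Any async function that turns text into a vector. Wrapping a call to
  /// `encoder.encode(text: ...)` here is an assumption based on this guide.
  final Future<List<double>> Function(String text) encodeFn;

  final Map<String, Future<List<double>>> _cache = {};

  /// Returns the cached embedding, encoding the text only on first use.
  Future<List<double>> embed(String text) =>
      _cache.putIfAbsent(text, () => encodeFn(text));
}
```

Storing the `Future` rather than the finished vector means that two concurrent requests for the same text share a single encoding call.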
## Complete RAG Example
Here's a full example showing a two-stage retrieval system:
```dart
import 'dart:typed_data';
import 'package:nobodywho/nobodywho.dart' as nobodywho;
Future<void> main() async {
// Initialize models
final encoder = await nobodywho.Encoder.fromPath(modelPath: './embedding-model.gguf');
final crossencoder = await nobodywho.CrossEncoder.fromPath(modelPath: './reranker-model.gguf');
// Large knowledge base
final knowledgeBase = [
"Python 3.11 introduced performance improvements through faster CPython",
"The Django framework is used for building web applications",
"NumPy provides support for large multi-dimensional arrays",
"Pandas is the standard library for data manipulation and analysis",
// ... 100+ more documents
];
// Precompute embeddings for all documents
final docEmbeddings = [];
for (final doc in knowledgeBase) {
docEmbeddings.add(await encoder.encode(text: doc));
}
Future<String> search({required String query}) async {
// Stage 1: Fast filtering with embeddings
final queryEmbedding = await encoder.encode(text: query);
final similarities = <(String, double)>[];
for (int i = 0; i < knowledgeBase.length; i++) {
final similarity = nobodywho.cosineSimilarity(
a: queryEmbedding.toList(),
b: docEmbeddings[i].toList()
);
similarities.add((knowledgeBase[i], similarity));
}
// Get top 20 candidates
similarities.sort((a, b) => b.$2.compareTo(a.$2));
final candidateDocs = similarities.take(20).map((e) => e.$1).toList();
// Stage 2: Precise ranking with cross-encoder
final ranked = await crossencoder.rankAndSort(query: query, documents: candidateDocs);
// Return top 3 most relevant
final topResults = ranked.take(3).map((e) => e.$1).toList();
return topResults.join("\n---\n");
}
final searchTool = nobodywho.Tool(
function: search,
name: "search",
description: "Search the knowledge base for information relevant to the query"
);
// Create RAG-enabled chat
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
systemPrompt: "You are a technical documentation assistant. Always use the search tool to find relevant information before answering programming questions.",
templateVariables: {"enable_thinking": false},
tools: [searchTool]
);
// The chat automatically searches and uses retrieved documents
final response = await chat.ask("What Python libraries are best for data analysis?").completed();
print(response);
}
```
This two-stage approach combines the speed of embeddings with the accuracy of cross-encoders, making it efficient even for large knowledge bases.
---
# Getting started
## How do I get started?
First, install `nobodywho`.
```bash
flutter pub add nobodywho
```
Next you need to import NobodyWho, and we highly suggest you do this using the namespace `nobodywho` like so:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
```
since we have generic names such as `Model` and `Chat` in our package.
After you have imported the package, it is very important that the next step is done correctly. As we dynamically link the Rust binaries, you must make
the following function call exactly once in your application!
```dart
await nobodywho.NobodyWho.init();
```
A call to any of the functions in NobodyWho will result in an error before `.init()` has been called.
However a second call to `.init()` will also result in an error, so you should be mindful about when you make this call.
We suggest you make it as early and as close to the root of your app as possible, as even though it is async it is a very fast operation.
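If several parts of your app might reach for NobodyWho first, one way to honour the "exactly once" rule is a small guard that remembers the first initialization. This is a minimal sketch; `InitOnce` is a hypothetical helper and not part of NobodyWho.

```dart
/// Runs an async initializer at most once and shares the resulting Future.
/// `InitOnce` is a hypothetical helper, not part of NobodyWho.
class InitOnce {
  InitOnce(this._init);

  final Future<void> Function() _init;
  Future<void>? _pending;

  /// The first call starts initialization; later calls reuse the same Future.
  Future<void> ensure() => _pending ??= _init();
}
```

You could create one shared instance, such as `final nobodyWhoInit = InitOnce(() => nobodywho.NobodyWho.init());`, and call `await nobodyWhoInit.ensure();` wherever the library is first needed. Note that this sketch also caches a failed initialization, so a failure will be reported to every caller.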
With that setup done we can move on to the exciting stuff! In the rest of the docs, we will assume that
you have imported NobodyWho using namespacing and that `.init()` has been called.
Now you are ready to pick a model. NobodyWho can download GGUF models directly from Hugging Face — just pass a `huggingface:` path. See [model selection](/docs/model-selection) for recommendations.
Then create a `Chat` object and call `.ask`!
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: 'huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf',
);
final msg = await chat.ask('Is water wet?').completed();
print(msg); // Yes, indeed, water is wet!
```
This is a super simple example, but we believe that examples which do simple things, should be simple!
To get a full overview of the functionality provided by NobodyWho, simply keep reading. You can also have a look at our [flutter starter app repository](https://github.com/nobodywho-ooo/flutter-starter-example).
## Minimum recommended specs
- iOS: iPhone 11 or newer with at least 4 GB of RAM. We tested a Qwen3 0.6B (332 MB) on an iPhone X (iOS 16) and while it ran, performance was too slow to be practical.
- Android: Snapdragon 855 / Adreno 640 / 6 GB RAM or better. The same Qwen3 0.6B model performed notably better on a OnePlus 7 Pro (Android 12) than on the iPhone X tested above.
## Feedback & Contributions
We welcome your feedback and ideas!
- Bug Reports & Improvements: If you encounter a bug or have suggestions, please open an issue on our [Issues](https://github.com/nobodywho-ooo/nobodywho/issues) page.
- Feature Requests & Questions: For new feature requests or general questions, join the discussion on our [Discussions](https://github.com/nobodywho-ooo/nobodywho/discussions) page.
---
# Sampling
The model does not produce tokens directly, but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from that distribution. This is the job of a **sampler**, which NobodyWho lets you freely configure
to achieve better quality outputs or to constrain the outputs to a known format (e.g. JSON).
## Sampler presets
To get a quick start, NobodyWho offers a couple of well-known presets, which you can quickly utilize.
For example, if you want to increase or decrease the "creativity" of your model, select our `temperature` preset:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
sampler: nobodywho.SamplerPresets.temperature(temperature: 0.2)
);
```
Setting `temperature` to `0.2` makes the distribution less flat, so the sampler favours the more probable tokens when choosing the next token.
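To see the effect of temperature numerically, here is a minimal sketch of the standard temperature-scaled softmax in plain Dart. `softmaxWithTemperature` is a hypothetical helper that shows the usual formula; it is not NobodyWho's internal code.

```dart
import 'dart:math' as math;

/// Converts raw scores (logits) into probabilities after temperature scaling.
/// Hypothetical helper showing the standard formula; not NobodyWho's code.
List<double> softmaxWithTemperature(List<double> logits, double temperature) {
  if (logits.isEmpty) throw ArgumentError('logits must not be empty');
  if (temperature <= 0) throw ArgumentError('temperature must be positive');
  final scaled = [for (final l in logits) l / temperature];
  // Subtract the maximum before exponentiating for numerical stability.
  final maxScaled = scaled.reduce(math.max);
  final exps = [for (final s in scaled) math.exp(s - maxScaled)];
  final sum = exps.reduce((a, b) => a + b);
  return [for (final e in exps) e / sum];
}
```

For the logits `[2.0, 1.0, 0.0]`, a temperature of `1.0` gives the top token a probability of about 0.67, while a temperature of `0.2` raises it to about 0.99.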
To see the whole list of presets, check out the `SamplerPresets` class:
```dart
class SamplerPresets {
static SamplerConfig defaultSampler();
static SamplerConfig dry();
static SamplerConfig greedy();
static SamplerConfig json();
static SamplerConfig temperature({required double temperature});
static SamplerConfig topK({required int topK});
static SamplerConfig topP({required double topP});
// Constrain output to a specific format:
static SamplerConfig constrainWithJsonSchema({required String schema});
static SamplerConfig constrainWithRegex({required String pattern});
static SamplerConfig constrainWithGrammar({required String grammar});
}
```
## Structured output
One of the most useful features is constraining the model to produce structured output —
this gives you a hard guarantee that the output matches a specific format, rather than
relying on the model to get it right on its own.
### Regular expressions
For simpler patterns, you can constrain the output with a regex:
```dart
// Force the model to answer with exactly "yes" or "no"
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
sampler: nobodywho.SamplerPresets.constrainWithRegex(pattern: r'yes|no'),
);
final answer = await chat.ask("Is the sky blue?").completed();
```
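Because the output is constrained to the pattern, downstream parsing can be a strict exact match. The sketch below turns such an answer into a `bool`; `parseYesNo` is a hypothetical helper for illustration.

```dart
/// Maps a regex-constrained "yes"/"no" answer to a bool.
/// Hypothetical helper; throws if the text is anything else.
bool parseYesNo(String answer) {
  final match = RegExp(r'^(yes|no)$').firstMatch(answer.trim());
  if (match == null) {
    throw FormatException('Expected "yes" or "no"', answer);
  }
  return match.group(1) == 'yes';
}
```

Keeping the check strict means that a misconfigured sampler shows up as an error instead of being silently misread.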
### JSON schema
In some use cases, it is useful to have the LLM generate JSON output.
The simplest way is to enforce arbitrary valid JSON with the `json` preset:
```dart
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
sampler: nobodywho.SamplerPresets.json(),
);
```
Or utilizing JSON schemas to really force the LLM to give you the specific object shapes
that you want:
```dart
import 'dart:convert';

// constrainWithJsonSchema takes the schema as a JSON string,
// so encode the Dart map before passing it in.
final schema = jsonEncode({
  'type': 'object',
  'properties': {
    'name': {'type': 'string', 'maxLength': 50},
    'age': {'type': 'integer'},
  },
  'required': ['name', 'age'],
  'additionalProperties': false,
});
final chat = await nobodywho.Chat.fromPath(
  modelPath: './model.gguf',
  sampler: nobodywho.SamplerPresets.constrainWithJsonSchema(schema: schema),
);
final response = await chat.ask("Give me a person as JSON with name and age fields.").completed();
final person = jsonDecode(response); // always valid JSON matching the schema
```
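Since the schema guarantees the shape of the object, you can decode the response into a typed Dart class instead of working with a raw map. This is a minimal sketch; `Person` is a hypothetical class that mirrors the `name` and `age` fields of the schema above.

```dart
import 'dart:convert';

/// Typed view of an object with a string `name` and an integer `age`.
/// `Person` is a hypothetical class for illustration.
class Person {
  Person({required this.name, required this.age});

  final String name;
  final int age;

  /// Decodes a JSON string such as {"name": "Ada", "age": 36}.
  factory Person.fromJson(String source) {
    final map = jsonDecode(source) as Map<String, dynamic>;
    return Person(name: map['name'] as String, age: map['age'] as int);
  }
}
```

With this in place, `Person.fromJson(response)` gives you `person.name` and `person.age` with static types. The casts would throw if the fields were missing, which the schema constraint is there to prevent.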
### Custom grammars (advanced)
For cases where JSON schema and regex are not expressive enough, you can supply a custom grammar.
`constrainWithGrammar` accepts both **Lark** syntax and **GBNF** (llama.cpp format) —
NobodyWho automatically converts GBNF to Lark before passing it to the inference engine.
**Lark syntax** (recommended):
```dart
final sampler = nobodywho.SamplerPresets.constrainWithGrammar(grammar: """
start: record (NEWLINE record)* NEWLINE?
record: field ("," field)*
field: /[^,"\\n\\r]+/
NEWLINE: /\\r?\\n/
""");
```
**GBNF syntax** (also accepted):
```dart
final sampler = nobodywho.SamplerPresets.constrainWithGrammar(grammar: """
file ::= record (newline record)* newline?
record ::= field ("," field)*
field ::= /[^,"\\n\\r]+/
newline ::= "\\r\\n" | "\\n"
""");
```
See the [Lark documentation](https://lark-parser.readthedocs.io/en/latest/grammar.html) and the
[GBNF specification](https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md) for the
full grammar syntax.
:::info
The older `SamplerPresets.grammar()` method is deprecated. Use
`SamplerPresets.constrainWithGrammar()` instead — it accepts both Lark and GBNF strings.
:::
## Defining your own samplers
Sampler presets abstract away some control that you might want, for example if you
want to chain samplers or change more advanced parameters. For that use case,
we provide the `SamplerBuilder` class:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final chat = await nobodywho.Chat.fromPath(
modelPath: "./model.gguf",
sampler: nobodywho.SamplerBuilder()
.temperature(temperature: 0.8)
.topK(topK: 5)
.dist()
);
```
With `SamplerBuilder` you can chain multiple steps together and then select how you
want to sample from the distribution. Keep in mind that `SamplerBuilder` provides two
types of methods: ones which modify the distribution (returning the `SamplerBuilder`
instance again) and ones which sample from the distribution (returning a `SamplerConfig`).
To keep the sampler working properly and avoid type errors, always end the chain
with one of the sampling steps (e.g. `dist()`, `greedy()`, `mirostatV2()`, etc.).
For reproducible output, set the RNG seed with `.seed(seed: value)` anywhere in the chain.
It is consumed by every random sampler in the chain — `dist`, `mirostatV1`, `mirostatV2`,
and the `xtc` shift step. `greedy` ignores it. If unset, a default seed is used.
```dart
final sampler = nobodywho.SamplerBuilder()
.temperature(temperature: 0.8)
.topK(topK: 5)
.seed(seed: 42)
.dist();
```
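To build intuition for what such a chain does, here is a toy version of "temperature, then top-k, then a seeded random draw" in plain Dart. `sampleToken` is a hypothetical function written for this illustration; NobodyWho's real samplers are more sophisticated and may differ in detail.

```dart
import 'dart:math' as math;

/// Toy sampling chain: temperature -> top-k -> seeded random draw.
/// Hypothetical illustration of the concepts; not NobodyWho's implementation.
int sampleToken(
  List<double> logits, {
  required double temperature,
  required int topK,
  required int seed,
}) {
  if (logits.isEmpty || topK < 1 || temperature <= 0) {
    throw ArgumentError('Need logits, topK >= 1 and temperature > 0');
  }
  // Step 1 (modifies the distribution): scale the logits by the temperature.
  final scaled = [for (final l in logits) l / temperature];

  // Step 2 (modifies the distribution): keep only the k highest-scoring tokens.
  final indices = List<int>.generate(scaled.length, (i) => i)
    ..sort((a, b) => scaled[b].compareTo(scaled[a]));
  final kept = indices.take(topK).toList();

  // Unnormalized softmax weights of the kept tokens.
  final maxScaled = scaled[kept.first];
  final weights = [for (final i in kept) math.exp(scaled[i] - maxScaled)];
  final total = weights.reduce((a, b) => a + b);

  // Step 3 (samples): draw one token index with a seeded random generator.
  var r = math.Random(seed).nextDouble() * total;
  for (var j = 0; j < kept.length; j++) {
    r -= weights[j];
    if (r <= 0) return kept[j];
  }
  return kept.last; // Guards against floating-point rounding.
}
```

Calling the function twice with the same logits and the same seed returns the same token index, which is the reproducibility that a fixed seed provides. With `topK: 1` the draw is no longer random at all and behaves like greedy sampling.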
You can also change the sampler configuration on an existing chat instance:
```dart
final chat = await nobodywho.Chat.fromPath(modelPath: "./model.gguf");
final sampler = nobodywho.SamplerBuilder()
.temperature(temperature: 0.8)
.topK(topK: 5)
.dist();
await chat.setSamplerConfig(sampler);
```
---
# Tool Calling
To give your LLM the ability to interact with the outside world, you will need tool calling.
:::info
Note that **not every model** supports tool calling. If the model does not have
such an option, it might not call your tools.
For reliable tool calling, we recommend trying the [Qwen](https://huggingface.co/Qwen/models) family of models.
:::
## Declaring a tool
A tool can be created from any Dart function that returns a `String` or a `Future<String>`.
To perform the conversion, wrap the function in the `Tool()` constructor. To get
a good sense of what such a tool can look like, consider this geometry example:
```dart
import 'dart:math' as math;
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final circleAreaTool = nobodywho.Tool(
name: "circle_area",
description: "Calculates the area of a circle given its radius",
function: ({ required double radius }) {
final area = math.pi * radius * radius;
return "Circle with radius $radius has area ${area.toStringAsFixed(2)}";
}
);
```
As you can see, every `Tool()` call needs a function, a name, and a description
of what the tool does. To let your LLM use it, simply add it when creating the `Chat`:
```dart continuation
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
tools: [circleAreaTool]
);
```
NobodyWho then figures out the right tool calling format, inspects the names and types of the parameters,
and configures the sampler.
Naturally, more tools can be defined and the model can chain the calls for them:
```dart
import 'dart:io';
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final getCurrentDirTool = nobodywho.Tool(
name: "get_current_dir",
description: "Gets path of the current directory",
function: () => Directory.current.path
);
final listFilesTool = nobodywho.Tool(
name: "list_files",
description: "Lists files in the given directory.",
function: ({required String path}) {
final dir = Directory(path);
final files = dir.listSync()
.where((entity) => entity is File)
.map((file) => file.path.split('/').last)
.toList();
return "Files: ${files.join(', ')}";
},
parameterDescriptions: {"path": "The path to the directory you want to list. Must be a valid path."}
);
final getFileSizeTool = nobodywho.Tool(
name: "get_file_size",
description: "Gets the size of a file in bytes.",
function: ({required String filepath}) async {
final file = File(filepath);
final size = await file.length();
return "File size: $size bytes";
},
parameterDescriptions: {"filepath": "The path to the file whose size you want to know. Must be a valid path."}
);
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
tools: [getCurrentDirTool, listFilesTool, getFileSizeTool],
templateVariables: {"enable_thinking": false}
);
final response = await chat.ask('What is the biggest file in my current directory?').completed();
print(response); // The largest file in your current directory is `model.gguf`.
```
## Pre-packaged tools
We ship NobodyWho with two built-in tools that are general enough for many use cases: the [monty](https://github.com/pydantic/monty) Python interpreter
and the [bashkit](https://github.com/everruns/bashkit) Bash interpreter. Both serve a similar purpose: to give your small LLM a better chance of answering
questions that require precise reasoning or some kind of computation, possibly on a big context.
The usage is straightforward. Use the `Tool.python()` and `Tool.bash()` factory constructors:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final chat = await nobodywho.Chat.fromPath(
modelPath: './model.gguf',
tools: [nobodywho.Tool.python(), nobodywho.Tool.bash()],
);
```
Lastly, keep in mind that for most use cases it is reasonable to constrain the tools with limits on memory and computation time,
so that you don't end up executing, for example, an infinite loop. For this, `Tool.python()` provides `maxDuration`, `maxMemoryBytes` and `maxRecursionDepth`,
and `Tool.bash()` provides `maxCommands`.
## Tool calling and the context
As with everything made to improve response quality, using tool calls fills up the context faster than simply chatting with an LLM. So be aware that you might need to use a larger context size than expected when using tools.
---
# Vision & Hearing
Easily provide image and audio information to your LLM.
## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts for making this work:
1. A multimodal LLM, so the LLM can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens
To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
Usually, the projection model then includes `mmproj` in its name.
If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.
With the downloaded GGUFs, you can simply add the projection model when loading the model:
```dart
import 'package:nobodywho/nobodywho.dart' as nobodywho;
final model = await nobodywho.Model.load(
modelPath: "./multimodal-model.gguf",
projectionModelPath: "./mmproj.gguf",
);
final chat = nobodywho.Chat(
model: model,
systemPrompt: "You are a helpful assistant, that can hear and see stuff!",
);
```
:::info
The language model and projection model have to **fit** together, as they are trained together!
Unfortunately, you can't just take a projection model and an LLM that you like and expect them
to work together.
:::
## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through `askWithPrompt`, which accepts a `Prompt` containing a list of `PromptPart` values.
```dart continuation
final response = await chat.askWithPrompt(nobodywho.Prompt([
nobodywho.TextPart("Tell me what you see in the image and what you hear in the audio."),
nobodywho.ImagePart("./dog.png"),
nobodywho.AudioPart("./sound.mp3"),
])).completed(); // It's a dog and a penguin!
```
## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try to mess around with the order of supplying the text
and the multimodal files, or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.
```dart continuation
await chat.resetHistory();
final response2 = await chat.askWithPrompt(nobodywho.Prompt([
nobodywho.TextPart("Tell me what you see in the image."),
nobodywho.ImagePart("./dog.png"),
nobodywho.TextPart("Also tell me what you hear in the audio"),
nobodywho.AudioPart("./sound.mp3"),
])).completed();
```
Also, there is still a lot of variance between how the models internally process the images.
This, for example, causes differences in how quickly the model consumes context - for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, they scale with the size of the image. In that case, you can increase the context size if the resources allow:
```dart continuation
final chat2 = nobodywho.Chat(
model: model,
systemPrompt: "You are a helpful assistant.",
contextSize: 8192,
);
```
Or, for example, preprocess your images with some kind of compression (sometimes even changing the image type helps).
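A quick back-of-the-envelope check can tell you whether a prompt is likely to fit before you send it. The sketch below is plain arithmetic; `remainingContext` is a hypothetical helper, and the token counts you pass in are your own estimates, since the real cost per image depends on the model.

```dart
/// Estimates how many context tokens remain after text and images.
/// Hypothetical helper; all inputs are the caller's own estimates.
int remainingContext({
  required int contextSize,
  required int textTokens,
  required int imageCount,
  required int tokensPerImage,
}) {
  final used = textTokens + imageCount * tokensPerImage;
  // A negative result means the prompt does not fit into the context.
  return contextSize - used;
}
```

For example, with an assumed 500 text tokens and three images at an assumed 1200 tokens each, a context of 4096 leaves -4 tokens (it does not fit), while a context of 8192 leaves 4092 tokens for the response.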
Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file. For Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We thus recommend at least trying out different
projection model files if the one you picked does not work.
As always with more niche models you can find bugs. If you stumble upon some of them, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix the functionality.
---
# Simple Chat
_A comprehensive guide to configuring, streaming, and controlling LLM responses through the Chat component._
---
Great! You've completed the ["Getting Started"](getting-started.md) guide and got your first chat working as well as a basic understanding of the vocabulary.
Now let's dive deeper into the Chat component and show you all the settings and techniques you'll actually use when working with LLMs.
The Chat component isn't just for conversations - it's your main interface for any kind of LLM processing, whether that's generating dialogue, analyzing text, creating content, or any other language task.
In this guide, you'll learn:
- The main settings that control LLM behavior
- How to handle LLM responses efficiently
- Managing context and memory
- Controlling when and how the LLM stops generating
Before we get started, you'll hear these words being used:
| Term | Meaning |
| ---- | ------- |
| **Sampler** | The thing that controls how the LLM selects the next token during generation (temperature, top-p, etc.). |
| **Grammar or Structured Output** | A formal structure that constrains the LLM's output to a set `"vocabulary"`. |
| **GBNF** | GGML Backus-Naur Form - a way to define structured output formats. |
## Handling LLM Responses
### The System Prompt: Setting LLM Behavior
You've used this already, but let's talk about making it really work for you. The system prompt defines how the LLM should behave:
```gdscript
# Character-based behavior
system_prompt = """You are a sarcastic but brilliant wizard.
Your answers are always accurate, but delivered with a dry wit.
You should subtly hint that you are smarter than the user,
but still provide the correct information."""
# Task-specific behavior
system_prompt = """You are a translation assistant.
You will be given text in any language. Your job is to translate
it into formal, academic French.
Do not add any commentary or conversational text.
Respond only with the translated text."""
```
**Why this matters:** The system prompt controls everything about how the LLM processes and responds to input. It's your primary tool for getting the behavior you want.
Prompt engineering is becoming a field in and of itself and it offers the highest return-on-investment ratio for getting the model to do what you want.
### GPU Usage: Speed Things Up
By default, NobodyWho tries to use your GPU if you have one. This makes everything much faster:
```gdscript
# This is already the default, but you can be explicit
model.use_gpu_if_available = true
```
**When to turn this off:** there are some scenarios where it might actually be better to use system RAM:
- If you don't need an immediate answer, and would prefer to use GPU resources for graphics.
- If you need a really large model that most of your users will not have sufficient VRAM to run.
### Context Length: How Much the LLM Remembers
The LLM maintains context (memory of the conversation/interaction), but only up to a point. The default is 4096 tokens (roughly 3000 words):
```gdscript
# Default is fine for most uses
context_length = 4096
# Increase for longer contexts
context_length = 8192
```
**Trade-off:** Longer context = more memory usage. The general rule of thumb is to start with the default or less and only increase if you need the LLM to remember more.
**Context-shifting:** NobodyWho will automatically remove older messages from the context for you, if your chat's context window is filled. Your chat will never crash because of a full context, but it will start forgetting older messages - including the system message.
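The effect of context-shifting can be pictured as dropping the oldest messages until the conversation fits again. The following sketch is written in Dart purely for illustration; `Message` and `shiftContext` are hypothetical names, and NobodyWho's actual strategy may differ.

```dart
/// A chat message with a rough token count (hypothetical type).
class Message {
  Message(this.role, this.content, this.tokens);

  final String role;
  final String content;
  final int tokens;
}

/// Drops the oldest messages until the total fits into `contextLength`.
/// Illustrative only; NobodyWho's actual context-shifting may differ.
List<Message> shiftContext(List<Message> history, int contextLength) {
  final kept = List<Message>.of(history); // Leave the caller's list untouched.
  var total = kept.fold<int>(0, (sum, m) => sum + m.tokens);
  while (kept.isNotEmpty && total > contextLength) {
    total -= kept.removeAt(0).tokens; // The oldest message goes first.
  }
  return kept;
}
```

In this simplified model the system message is the oldest entry, so it is the first thing to be dropped once the context overflows. This is why a long-running chat can gradually stop following its original instructions.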
### Streaming Responses vs Waiting for Complete Output
You have two main approaches for handling LLM responses, and choosing the right one depends on your use case:
**Streaming** gives you each token as it's generated - good for user interfaces where you want immediate feedback.
**Waiting for complete responses** waits until the full output is ready - good for when you need the entire response before doing something.
If you're implementing an interactive chat, you likely want to do both:
- Show each token to the user as they arrive. This will make the chat feel a lot faster.
- Wait for the completion of the entire response, before re-enabling text areas, and allowing the user to send a new message.
```gdscript
var current_response = ""

func _on_response_updated(token: String):
    current_response += token
    # Good for: UI updates, real-time feedback
    ui_label.text = current_response

func _on_response_finished(response: String):
    # Good for: Final processing, logging, triggering next actions
    print(response)
    # "{player}" is a hypothetical placeholder token used for illustration
    response = response.replace("{player}", player.name)
    trigger_next_game_event()
```
**When to use streaming:**
- Interactive dialogue where users expect immediate feedback
- Long responses where you want to show progress
**When to wait for complete responses:**
- When you need to make decisions based on the full LLM output
- Content generation where partial results are useless (like JSON or structured output answers).
You will most likely end up using both: `response_updated` to stream tokens to your UI, and `response_finished` to trigger the next step in your program once you have the full response.
## Managing Context and Memory
Sometimes you need to reset the LLM's memory or manage what it remembers.
### Starting Fresh
```gdscript
# Clear all context, it will still have all the settings that you
# have set up before (including the system prompt)
reset_context()
```
This is useful when:
- Starting a new task that's unrelated to previous ones, where the previous history is irrelevant
- The LLM gets confused because its context has shifted too much
### Advanced Context Management
If you need more control over what the LLM remembers:
```gdscript
# See what's in the context
var messages = await get_chat_history()
for message in messages:
    print(message.role, ": ", message.content)

# Set a custom context (useful for templates or saved states)
var task_context = [
    {"role": "user", "content": "Analyze the following data:", "assets": []},
    {"role": "assistant", "content": "I'm ready to analyze data. Please provide it.", "assets": []},
    {"role": "user", "content": "Here's the data: " + data_to_analyze, "assets": []}
]
await set_chat_history(task_context)
```
### Structured Output & Sampling
You can control how the model picks tokens and constrain its output format. See the [Sampling](sampling.md) guide for sampler presets (temperature, JSON, grammar constraints) and the [Structured Output](structured-output.md) guide for a full GBNF grammar tutorial.
## Performance and Memory Tips
### Start the Worker Early
In a real-time application, you don't want the user's first interaction to trigger a long loading time. Starting the worker early, like during a splash screen or initial setup, pre-loads the model into memory so the first response is fast.
```gdscript
# In your _ready() function, set up everything before the app starts.
func _ready():
    # 1. Configure the chat behavior
    self.system_prompt = "You are a helpful assistant."
    self.model_node = get_node("../SharedModel")
    # 2. Start the worker *before* the user can interact.
    # This pre-loads the model so the first interaction isn't slow.
    start_worker()
    # 3. Now other setup can happen
    print("Assistant chat is ready.")
```
**Why:** Starting the worker loads the model into memory. It's slow the first time, but then all LLM operations are much faster.
Choose the moment for this initial load carefully (a splash or loading screen works well) so that it does not hurt the user experience.
### Share Models Between Components
An application might need to use an LLM for several different tasks. Instead of loading the same heavy model multiple times, you can have multiple `Chat` components that all share a single `Model` component. Each `Chat` can have its own system prompt and configuration, directing it to perform a different task.
```gdscript
# An application with multiple LLM-powered behaviors, all sharing one model.
func _ready():
    # 1. Get the single, shared model
    var shared_model = get_node("../SharedModel")
    # 2. Configure a chat component for general conversation
    var casual_chat = get_node("CasualChat")
    casual_chat.model_node = shared_model
    casual_chat.system_prompt = "You are a friendly and helpful assistant. Keep your answers concise."
    casual_chat.start_worker()
    # 3. Configure another chat component for structured data extraction
    var extractor_chat = get_node("ExtractorChat")
    extractor_chat.model_node = shared_model
    extractor_chat.system_prompt = "Extract the key information from the user's text and provide it in JSON format."
    # This one would likely use a grammar to enforce JSON output.
    extractor_chat.start_worker()
    # Now you can use both for different tasks without loading two models!
    casual_chat.ask("Can you tell me about your capabilities?")
    extractor_chat.ask("My name is Jane Doe and my email is jane@example.com.")
```
**Memory savings:** Instead of loading multiple models, you load one and share it. Much more efficient!
---
# Downloading models
_How NobodyWho downloads, caches, and inspects GGUF models in Godot._
---
NobodyWho can either load a model from a path on disk or download it for you on first use, caching it for subsequent runs. This page covers the available model path formats, how to observe a download in progress, how to access gated/private models, and how to inspect what's already in the local cache.
## Supported model path formats
The `model_path` field on `NobodyWhoModel` (and `projection_model_path` for vision models) accepts several forms:
| Form | Example | Notes |
| ---- | ------- | ----- |
| Godot resource path | `res://models/my-model.gguf` | Bundled with your game export |
| User data path | `user://downloaded.gguf` | Written by your game at runtime |
| Absolute filesystem path | `/opt/models/foo.gguf` | Local file |
| HuggingFace reference | `hf:owner/repo/file.gguf` | Downloaded and cached on first use |
| HTTPS URL | `https://example.com/model.gguf` | Downloaded and cached on first use |
The HuggingFace prefix is case-insensitive and the `//` is optional — `hf:`, `hf://`, `huggingface:`, and `huggingface://` all mean the same thing. Remote models are downloaded to the platform cache directory on first load and re-used on subsequent runs. Downloads happen on a background thread — the Godot main loop stays responsive while a multi-GB model is fetched.
## Showing download progress
`NobodyWhoModel` emits a `download_progress(downloaded, total)` signal while a remote model is downloading, throttled to roughly 10 Hz with a guaranteed final emit on completion. Connect it if you'd like to drive a progress bar:
```gdscript
model.download_progress.connect(func(downloaded: int, total: int):
    print("%d / %d bytes" % [downloaded, total])
)
```
The signal is not emitted for local files or already-cached downloads.
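To drive an actual progress bar, the signal can be wired to a Godot `ProgressBar`. The sketch below is a minimal example. The node paths (`../ChatModel`, `../DownloadProgress`) are hypothetical, and it assumes an unknown total size would be reported as `0`.

```gdscript
extends Control

# Assumed scene layout: a NobodyWhoModel and a ProgressBar as siblings of this node.
@onready var model = $"../ChatModel"
@onready var progress_bar: ProgressBar = $"../DownloadProgress"

func _ready():
    model.download_progress.connect(_on_download_progress)

func _on_download_progress(downloaded: int, total: int) -> void:
    # Skip updates while the total size is unknown (assumed to arrive as 0).
    if total <= 0:
        return
    progress_bar.max_value = total
    progress_bar.value = downloaded
```

Because the signal is throttled and ends with a final emit, the bar should settle at its maximum once the download completes.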
## Downloading a gated model
Some HuggingFace models are private or gated by a license you need to accept. In both cases you need to be authorized to download the model weights.
You can manually download the GGUF file via your web browser and then point your `NobodyWhoModel` at the local path.
Alternatively, use the `NobodyWhoDownloader` node, which lets you pass an authorization header:
```gdscript
var dl = NobodyWhoDownloader.new()
dl.model_path = "huggingface:NobodyWho/Qwen_Qwen3-0.6B-GGUF/Qwen_Qwen3-0.6B-Q4_K_M.gguf"
dl.headers = {"Authorization": "Bearer your_hf_token"}
dl.download_complete.connect(func(local_path: String):
    get_node("../ChatModel").model_path = local_path
)
dl.download_failed.connect(func(error: String):
    push_error("Download failed: " + error)
)
dl.start_download()
add_child(dl)
```
You can generate a HuggingFace token in [your account settings](https://huggingface.co/settings/tokens).
## Inspecting the model cache
`NobodyWhoModel.get_cached_models()` is a static function that returns every `.gguf` model in NobodyWho's cache directory, paired with its size in bytes. This is the same cache used by `NobodyWhoDownloader` and by `NobodyWhoModel`'s `huggingface:` paths.
```gdscript
for entry in NobodyWhoModel.get_cached_models():
    print("%s: %d bytes" % [entry["path"], entry["size"]])
```
Each entry is a `Dictionary` with two keys:
- `"path"` — absolute path to the cached `.gguf` file
- `"size"` — size in bytes
The array is empty if nothing has been downloaded yet. On error the function returns `null` and logs a Godot error to the console.
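Since the function can return `null`, it is safer to check the result before iterating. The following sketch (the helper name `print_cache_summary` is hypothetical) sums the cached sizes, using only the `"size"` key described above:

```gdscript
# Hypothetical helper: prints how many models are cached and their combined size.
func print_cache_summary() -> void:
    var entries = NobodyWhoModel.get_cached_models()
    if entries == null:
        # The function has already logged an error; nothing to summarize.
        return
    var total_bytes = 0
    for entry in entries:
        total_bytes += entry["size"]
    print("%d cached model(s), %.1f MiB in total" % [entries.size(), total_bytes / 1048576.0])
```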
---
# Embeddings & RAG
_Using embeddings for semantic text comparison and retrieval-augmented generation._
---
## Understanding Text with Embeddings
Cool, you've got the basics of chat working! Now let's explore embeddings, which let you understand what text means rather than just matching exact words.
Embeddings are like a smart way to measure how similar two pieces of text are, even if they use completely different words.
Instead of looking for exact matches, embeddings understand meaning.
For example, "Hand me the red potion" and "Give me the scarlet flask" would be recognized as very similar, even though they share no common words.
Here are the key terms for working with embeddings:
| Term | Meaning |
| ---- | ------- |
| **Embedding Model (GGUF)** | A specialized `*.gguf` file trained to convert text into numerical vectors that represent meaning. |
| **Embedding** | A list of numbers (vector) that represents the meaning of a piece of text. |
| **Cosine Similarity** | A mathematical way to compare how similar two embeddings are. It returns a value between -1 and 1, where values near 1 mean nearly identical meaning and values near 0 mean unrelated text. |
| **Semantic Search** | Finding text that means the same thing, even if the words are different. |
| **Vector** | The array of numbers that represents your text's meaning. |
Let's show you how to use embeddings to understand what your players really mean when they type commands.
### Download an Embedding Model
Embedding models are different from chat models. You need a model specifically trained for embeddings.
We normally use [bge-small-en-v1.5-q8_0.gguf](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/resolve/main/bge-small-en-v1.5-q8_0.gguf).
### Practical Example: Quest & Reputation System
A good way to visualize the practicality of embeddings is through an example.
In this example we will show you how to trigger a quest or lower the player's reputation based on what they say.
We'll build it step by step, but for the impatient, the complete script can be copied from the bottom of the page.
#### Step 1: Set up your basic structure and variables
The first step is to set up our components. We will add some statements for quests and some for hostile behavior. These are not exhaustive lists.
**Do note** that embedding a lot of sentences takes longer (depending on model and hardware, of course), so depending on how complex your statements need to be,
you might be better off having a handful and tuning the sensitivity of the trigger instead.
First, create your script that extends `NobodyWhoEncoder` and define your statement categories:
```gdscript
extends NobodyWhoEncoder
var quest_triggers = [
"I know where the dragon rests",
"The druid told me the proper way to meet the dragon",
"I discovered the ritual needed to gain the dragon's audience",
"I know about the sacred grove"
]
var hostile_statements = [
"I want to kill the dragon",
"I'm going to destroy everything",
"I hate this place and everyone in it",
"I will burn down the village",
"Everyone here deserves to die"
]
var helpful_embeddings = []
var hostile_embeddings = []
var player_reputation = 0
```
#### Step 2: Initialize the embedding system
Set up the embedding model and start the worker:
```gdscript
func _ready():
    # Create and configure the embedding model
    var embedding_model = NobodyWhoModel.new()
    embedding_model.model_path = "res://models/bge-small-en-v1.5-q8_0.gguf"
    get_parent().add_child(embedding_model)
    # Link to the embedding model
    self.model_node = embedding_model
    self.encoding_finished.connect(_on_encoding_finished)
    self.start_worker()
    # Pre-generate embeddings for all statement types
    precompute_all_embeddings()
```
#### Step 3: Precompute reference embeddings
Generate embeddings for all your reference statements:
```gdscript
func precompute_all_embeddings():
    # Generate embeddings for helpful statements
    for statement in quest_triggers:
        encode(statement)
        var embedding = await self.encoding_finished
        helpful_embeddings.append(embedding)
    # Generate embeddings for hostile statements
    for statement in hostile_statements:
        encode(statement)
        var embedding = await self.encoding_finished
        hostile_embeddings.append(embedding)
```
#### Step 4: Add input handling for testing
Add a simple test trigger using the enter key:
```gdscript
func _input(event):
    # Handle enter key press to send hardcoded test message
    if event is InputEventKey and event.pressed:
        if event.keycode == KEY_ENTER:
            var test_message = "I know the location of the dragon"
            print("Sending test message: ", test_message)
            analyze_player_statement(test_message)
```
#### Step 5: Analyze player statements
Compare the player's message against your reference embeddings:
```gdscript
func analyze_player_statement(player_text: String):
    # Generate embedding for player input
    encode(player_text)
    var player_embedding = await self.encoding_finished
    # Compare against both categories
    var best_helpful_similarity = get_best_similarity(player_embedding, helpful_embeddings)
    var best_hostile_similarity = get_best_similarity(player_embedding, hostile_embeddings)
    print("Helpful similarity: ", best_helpful_similarity)
    print("Hostile similarity: ", best_hostile_similarity)
    # Use similarity threshold of 0.8 and compare categories
    if best_helpful_similarity > 0.8 and best_helpful_similarity > best_hostile_similarity:
        handle_helpful_information(player_text)
    elif best_hostile_similarity > 0.8 and best_hostile_similarity > best_helpful_similarity:
        handle_hostile_intent(player_text)
    else:
        print("Unclear intent - no strong match found")
```
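The snippet above calls `get_best_similarity`, which is not defined in this section. One possible implementation is sketched below in plain GDScript. It assumes each embedding is an indexable array of floats of equal length, and the `cosine_similarity_manual` helper is hypothetical (the plugin may ship its own similarity function, which would be preferable).

```gdscript
# Hypothetical helper: cosine similarity between two equal-length vectors.
# Assumption: embeddings are indexable arrays of floats.
func cosine_similarity_manual(a, b) -> float:
    var dot = 0.0
    var norm_a = 0.0
    var norm_b = 0.0
    for i in range(mini(a.size(), b.size())):
        dot += a[i] * b[i]
        norm_a += a[i] * a[i]
        norm_b += b[i] * b[i]
    # A zero vector has no direction, so treat it as unrelated to everything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (sqrt(norm_a) * sqrt(norm_b))

# Returns the highest similarity between `target` and any reference embedding.
func get_best_similarity(target, references: Array) -> float:
    var best = -1.0
    for reference in references:
        best = maxf(best, cosine_similarity_manual(target, reference))
    return best
```

With an empty reference list this returns `-1.0`, which stays below the `0.8` threshold used above.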
#### Step 6: Handle the results
Trigger appropriate game systems based on detected intent:
```gdscript
func handle_helpful_information(text: String):
    # Trigger game systems based on detected intent
    print("Triggering quest: 'Audience with the Ancient Dragon'!")

func handle_hostile_intent(text: String):
    player_reputation -= 15
    print("Player expressed hostile intent! Reputation -15 (now: ", player_reputation, ")")
```
---
## Adding Long-Term Memory (RAG)
Great! You've got chat and embeddings working. Now let's add something useful: the ability to look up specific lore, dialogues, questlines etc.
### Why Your Game Needs Smart Document Search
Picture this: your player is 40 hours into your RPG and asks an NPC, "Where do I find that crystal for the sword upgrade?"
Your LLM, without reranking, might give a generic answer or, worse, make something up, leading to a bad player experience.
There are several ways to combat this. One is to load a lot of information into the context (i.e. the system prompt), but with a limited context the model might 'forget' the important information
or be confused by too much of it. Instead, we want to add a "long-term memory" module to our language model.
In the LLM space this is done with RAG (retrieval-augmented generation): we enrich the knowledge of the LLM by letting it search through a database of information we have given it.
There are many ways to do this. NobodyWho currently exposes two major ones. The first is embeddings: convert a sentence to a vector, then find the vectors that are closest to it.
This is powerful because you can save the vectors to a database or a file beforehand and then compare them using the fast and cheap cosine similarity. The second, more expensive but more accurate, is a cross-encoder, which figures out the relationship between the question and the document rather than just how similar they are.
This approach is often called reranking because it is typically used as a second step, for sorting and filtering large knowledge databases accessed by LLMs. We'll call it ranking, as our dataset is small enough that we do not need a first pass to filter out irrelevant info.
Take this example:
```
Query: "Where do I find crystals for my sword upgrade?"
Documents: [
"You asked the blacksmith: Where do I find crystals for my sword upgrade?",
"The blacksmith said: Magic crystals are found in the Northern Mountains.",
"You heard in the tavern: Magic crystals are not found in the Southern Desert."
]
```
If we rely only on comparing the query with the documents using cosine similarity (as we did with the embeddings), we get back the document "You asked the blacksmith: Where do I find crystals for my sword upgrade?", as it is the sentence most similar to our query. This gives us no useful information, and we have just wasted valuable context.
With ranking, however, the cross-encoder model has been trained to recognize that the answer to a question is not the question itself, and thus ranks the document "The blacksmith said: Magic crystals are found in the Northern Mountains." the highest.
Here are the key terms you'll need:
| Term | Meaning |
| ---- | ------- |
| **Document Ranking** | Sorting text documents by how well they match or answer a question. |
| **RAG (Retrieval-Augmented Generation)** | A system that finds relevant documents first, then uses them to generate better LLM responses. |
| **Cross-encoder** | The type of model used for reranking - it reads both the query and document together to score relevance. |
Let's show you how to build smart search systems for your game.
### Download a Reranker Model
Reranking models are different from chat and embedding models. You need one specifically trained for document ranking.
We recommend [bge-reranker-v2-m3-Q8_0.gguf](https://huggingface.co/gpustack/bge-reranker-v2-m3-GGUF/resolve/main/bge-reranker-v2-m3-Q8_0.gguf) - it works well for most games and supports multiple languages.
Note that the current Qwen3 reranker does not work, because its template is missing some fields.
### Practical Example: Smart NPC with Knowledge Base
Let's build a tavern keeper NPC that can answer player questions by searching through their personal knowledge. This NPC knows about the local area, quests, and rumors - perfect for creating more immersive and helpful characters.
We'll build it step by step, but for the impatient - the complete script is at the bottom.
#### Step 1: Set up your NPC's knowledge base
First, let's create a knowledge base for our tavern keeper - everything this specific NPC would realistically know:
```gdscript
extends NobodyWhoChat
@onready var reranker = $"../Rerank"
@onready var chat_model = $"../ChatModel"
# The tavern keeper's knowledge: ~50 pieces of local information, way more than could fit in a standard 4096-sized context.
var tavern_keeper_knowledge = PackedStringArray([
"The lake contains a special clay that blacksmiths use to forge superior weapons.",
"Ancient oak trees in the sacred grove provide wood that naturally resists dark magic.",
"Silver veins run through the mountain caves, valuable for crafting blessed weapons.",
"Rare moonflowers bloom in the ruins only once per season and have powerful magical properties.",
"The mill pond contains perfect stones for sharpening blades to razor sharpness.",
"Wild honey from forest bees makes potions more potent when used as a base ingredient.",
"A hooded stranger was seen asking questions about the old castle ruins last week.",
"Someone has been leaving fresh flowers at the grave of the village's first mayor.",
"Strange animal tracks were found near the well that don't match any known creature.",
"The church bell rang by itself three nights ago at exactly midnight.",
"Farmers found crop circles in their wheat fields after the last thunderstorm.",
"A merchant claims he saw lights moving through the abandoned mine from the hill road.",
"Children report hearing music coming from the forest when they play near the edge of town.",
"The weather has been unusually warm this winter, and the old-timers are worried.",
"Someone broke into the general store but only stole a map of the local cave systems.",
"A wolf with unusual blue eyes has been spotted watching the town from the tree line.",
"Old Sarah runs the bakery and makes the best apple pies in three kingdoms. Her grandson Tom went missing last week.",
"Blacksmith Gareth is always looking for quality iron ore and magic crystals. He pays double for rare materials.",
"Merchant Elena travels between towns selling exotic spices and silk. She arrives every second Tuesday.",
"Father Benedict runs the small chapel and knows ancient blessings that can ward off evil spirits.",
"Widow Martha owns the general store and knows every piece of gossip in town within hours.",
"Young apprentice Jake works for the blacksmith but dreams of becoming an adventurer himself.",
"Doctor Thorne treats injuries and illnesses. He keeps rare healing herbs in his back garden.",
"Stable master Owen knows every horse in the region and can track animals through the wilderness.",
"Mayor Thompson inherited his position from his father and struggles with the town's growing problems.",
"The old mine north of town has been abandoned for years. Strange sounds echo from deep inside at night.",
"The forest path to the east is safe during the day, but wolves hunt there after sunset.",
"Crystal Mines to the south produce valuable gems but have become dangerous recently.",
"The ancient stone bridge over Miller's Creek was built by dwarves centuries ago and still stands strong.",
"Darkwood Forest harbors bandits who prey on merchant caravans traveling the main road.",
"The Whispering Caves get their name from the wind that creates eerie sounds through the rock formations.",
"Lake Serenity freezes solid in winter, making it possible to cross on foot to the northern settlements.",
"The old watchtower on Crow's Hill offers a view of the entire valley but hasn't been manned in decades.",
"Sacred Grove is where the druids once practiced their rituals before they disappeared from the region.",
"The ruins of Castle Blackrock still stand on the mountain, though none dare venture there anymore.",
"Trader Gareth's caravan was attacked by bandits hiding somewhere in Darkwood Forest.",
"Tom the baker's grandson disappeared near the Crystal Mines while collecting rare stones.",
"Strange lights have been appearing in the Whispering Caves during moonless nights.",
"Farmers report their livestock going missing near the edge of Darkwood Forest.",
"The old mill wheel stopped working after something large damaged it upstream.",
"Merchants complain about increased bandit activity on the eastern trade route.",
"Several townsfolk have reported seeing ghostly figures near the abandoned mine at midnight.",
"The village well's water tastes strange since the earthquake last month.",
"Wild animals have been acting aggressively and fleeing deeper into the mountains.",
"Ancient runes appeared overnight on the sacred standing stones outside town.",
"The town was founded by refugees fleeing the Great Dragon War three hundred years ago.",
"Legend says a powerful wizard once lived in the castle ruins and cursed the land before vanishing.",
"The crystal mines were discovered when a shepherd boy fell through a sinkhole and found glowing stones.",
"Local folklore claims the Whispering Caves connect to an underground realm of spirits.",
"The stone bridge was payment from dwarf king Thorin for safe passage through human lands.",
"Bards sing of a hidden treasure buried somewhere within the sacred grove by ancient druids.",
"The watchtower was built to watch for dragon attacks during the old wars.",
"Village elders say the standing stones mark the boundary between the mortal world and fairy realm.",
"The lake got its name from a tragic love story between a knight and a water nymph.",
"Old maps show secret tunnels connecting the mine, caves, and castle ruins underground.",
"Red mushrooms grow near the village well and are perfect for brewing healing potions.",
"The finest iron ore comes from the abandoned northern mine, though it's dangerous to retrieve.",
"Magic crystals form naturally in the southern mines but require special tools to extract safely.",
"Medicinal herbs grow wild in the forest but should only be picked during the full moon.",
])
var ranked_docs = []
```
#### Step 2: Configure your components
```gdscript
func _ready():
    # Set up the chat for generating helpful responses
    self.model_node = chat_model
    reranker.connect("ranking_finished", func(result): ranked_docs = result)
    reranker.start_worker()
    self.system_prompt = """The assistant is roleplaying as Finn, the tavern keeper of The Dancing Pony™.
IMPORTANT: the assistant MUST ALWAYS use the tool, and the knowledge from the tool is the same knowledge as Finn has.
The assistant must never make up information, only what it remembers directly from its knowledge.
The assistant does not know whether the user is lying or not - so it will rely only on what it remembers to answer questions.
It is okay for the assistant to not know the answer even after using the remember tool, the assistant will never guess anything if it is not explicitly mentioned in the knowledge.
The assistant must always speak like a tavern keeper.
"""
    # Add the tool to remember stuff
    self.add_tool(remember, "The assistant can use this tool to remember its limited knowledge about the ingame world.")
    self.connect("response_finished", func(response: String): print("Finn says: ", response))
    start_worker()
```
#### Step 3: Set up a simple input system
```gdscript
func _process(delta):
    if Input.is_action_just_pressed("enter"):
        var test_question = "Where is strider?"
        print("Player asks Finn: ", test_question)
        ask(test_question)
```
#### Step 4: Use ranked results to generate smart answers
```gdscript
# Tool function that the LLM can call to search the knowledge base
func remember(query: String) -> String:
    print("Finn searches his memory for: ", query)
    ranked_docs = "\n".join(reranker.rank_sync(query, tavern_keeper_knowledge, 3))
    print("Finn found: ", ranked_docs)
    return ranked_docs

func _ready():
    # ... (setup from Step 2) ...
    # Add the tool to the llm to allow it to search for the most relevant lore.
    self.add_tool(remember, "The assistant can use this tool to remember its limited knowledge about the ingame world.")
    self.connect("response_finished", func(response: String): print("Finn says: ", response))
    start_worker()
```
When we run this and ask Finn about trading in the area, we get a response like the following:
>Finn says: <think>
>Okay, let's see. The user asked about trading in the area.
```
Finn searches his memory for: What is trading like in the area?
.... goes through and ranks all the documents
Finn found:
Merchants complain about increased bandit activity on the eastern trade route.
Merchant Elena travels between towns selling exotic spices and silk. She arrives every second Tuesday.
Trader Gareth's caravan was attacked by bandits hiding somewhere in Darkwood Forest.
```
> From the tool response, I remember a few key points. First, merchants are complaining about bandits on the eastern trade route. That's a problem for traders. Then there's Merchant Elena who sells spices and silk every second Tuesday. That's a specific thing to mention. Also, Trader Gareth's caravan was attacked in Darkwood Forest. So the answer should include these points.
>
>I need to present this information clearly. Start by mentioning the bandit issue, then note Elena's schedule, and mention Gareth's attack. Make sure it's in a friendly, tavern keeper style. Check if there's anything else, but the tool response only gave these three points. Don't add anything extra. Alright, structure the response with those details.
></think>
and when good ole Finn is done thinking:
>Trading in the area is... if you'll forgive the blunt truth, *dangerous*. Merchants swear the eastern trade route is littered with bandits, and last week a caravan was ambushed in Darkwood Forest. But there are still opportunities! Merchant Elena brings rare spices and silk every second Tuesday—she's a reliable seller. Just be wary of the roads. And if you spot a caravan with a single rider, don't engage. They're probably bandits.
---
### Complete Scripts
Complete Godot script:
```gdscript
extends NobodyWhoChat
@onready var reranker = $"../Rerank"
@onready var chat_model = $"../ChatModel"
# The tavern keeper's knowledge: ~50 pieces of local information, way more than could fit in a standard 4096-sized context.
var tavern_keeper_knowledge = PackedStringArray([
"The lake contains a special clay that blacksmiths use to forge superior weapons.",
"Ancient oak trees in the sacred grove provide wood that naturally resists dark magic.",
"Silver veins run through the mountain caves, valuable for crafting blessed weapons.",
"Rare moonflowers bloom in the ruins only once per season and have powerful magical properties.",
"The mill pond contains perfect stones for sharpening blades to razor sharpness.",
"Wild honey from forest bees makes potions more potent when used as a base ingredient.",
"A hooded stranger was seen asking questions about the old castle ruins last week.",
"Someone has been leaving fresh flowers at the grave of the village's first mayor.",
"Strange animal tracks were found near the well that don't match any known creature.",
"The church bell rang by itself three nights ago at exactly midnight.",
"Farmers found crop circles in their wheat fields after the last thunderstorm.",
"A merchant claims he saw lights moving through the abandoned mine from the hill road.",
"Children report hearing music coming from the forest when they play near the edge of town.",
"The weather has been unusually warm this winter, and the old-timers are worried.",
"Someone broke into the general store but only stole a map of the local cave systems.",
"A wolf with unusual blue eyes has been spotted watching the town from the tree line.",
"Old Sarah runs the bakery and makes the best apple pies in three kingdoms. Her grandson Tom went missing last week.",
"Blacksmith Gareth is always looking for quality iron ore and magic crystals. He pays double for rare materials.",
"Merchant Elena travels between towns selling exotic spices and silk. She arrives every second Tuesday.",
"Father Benedict runs the small chapel and knows ancient blessings that can ward off evil spirits.",
"Widow Martha owns the general store and knows every piece of gossip in town within hours.",
"Young apprentice Jake works for the blacksmith but dreams of becoming an adventurer himself.",
"Doctor Thorne treats injuries and illnesses. He keeps rare healing herbs in his back garden.",
"Stable master Owen knows every horse in the region and can track animals through the wilderness.",
"Mayor Thompson inherited his position from his father and struggles with the town's growing problems.",
"The old mine north of town has been abandoned for years. Strange sounds echo from deep inside at night.",
"The forest path to the east is safe during the day, but wolves hunt there after sunset.",
"Crystal Mines to the south produce valuable gems but have become dangerous recently.",
"The ancient stone bridge over Miller's Creek was built by dwarves centuries ago and still stands strong.",
"Darkwood Forest harbors bandits who prey on merchant caravans traveling the main road.",
"The Whispering Caves get their name from the wind that creates eerie sounds through the rock formations.",
"Lake Serenity freezes solid in winter, making it possible to cross on foot to the northern settlements.",
"The old watchtower on Crow's Hill offers a view of the entire valley but hasn't been manned in decades.",
"Sacred Grove is where the druids once practiced their rituals before they disappeared from the region.",
"The ruins of Castle Blackrock still stand on the mountain, though none dare venture there anymore.",
"Trader Gareth's caravan was attacked by bandits hiding somewhere in Darkwood Forest.",
"Tom the baker's grandson disappeared near the Crystal Mines while collecting rare stones.",
"Strange lights have been appearing in the Whispering Caves during moonless nights.",
"Farmers report their livestock going missing near the edge of Darkwood Forest.",
"The old mill wheel stopped working after something large damaged it upstream.",
"Merchants complain about increased bandit activity on the eastern trade route.",
"Several townsfolk have reported seeing ghostly figures near the abandoned mine at midnight.",
"The village well's water tastes strange since the earthquake last month.",
"Wild animals have been acting aggressively and fleeing deeper into the mountains.",
"Ancient runes appeared overnight on the sacred standing stones outside town.",
"The town was founded by refugees fleeing the Great Dragon War three hundred years ago.",
"Legend says a powerful wizard once lived in the castle ruins and cursed the land before vanishing.",
"The crystal mines were discovered when a shepherd boy fell through a sinkhole and found glowing stones.",
"Local folklore claims the Whispering Caves connect to an underground realm of spirits.",
"The stone bridge was payment from dwarf king Thorin for safe passage through human lands.",
"Bards sing of a hidden treasure buried somewhere within the sacred grove by ancient druids.",
"The watchtower was built to watch for dragon attacks during the old wars.",
"Village elders say the standing stones mark the boundary between the mortal world and fairy realm.",
"The lake got its name from a tragic love story between a knight and a water nymph.",
"Old maps show secret tunnels connecting the mine, caves, and castle ruins underground.",
"Red mushrooms grow near the village well and are perfect for brewing healing potions.",
"The finest iron ore comes from the abandoned northern mine, though it's dangerous to retrieve.",
"Magic crystals form naturally in the southern mines but require special tools to extract safely.",
"Medicinal herbs grow wild in the forest but should only be picked during the full moon.",
])
var ranked_docs = []
func _ready():
    # Set up the chat for generating helpful responses
    self.model_node = chat_model
    reranker.connect("ranking_finished", func(result): ranked_docs = result)
    reranker.start_worker()
    self.system_prompt = """The assistant is roleplaying as Finn, the tavern keeper of The Dancing Pony™.
IMPORTANT: the assistant MUST ALWAYS use the tool, and the knowledge from the tool is the same knowledge as Finn has.
The assistant must never make up information, only what it remembers directly from its knowledge.
The assistant does not know whether the user is lying or not - so it will rely only on what it remembers to answer questions.
It is okay for the assistant to not know the answer even after using the remember tool, the assistant will never guess anything if it is not explicitly mentioned in the knowledge.
The assistant must always speak like a tavern keeper.
"""
    # Add the tool to remember stuff
    self.add_tool(remember, "The assistant can use this tool to remember its limited knowledge about the ingame world.")
    self.connect("response_finished", func(response: String): print("Finn says: ", response))
    start_worker()

func _process(delta):
    if Input.is_action_just_pressed("enter"):
        var test_question = "Where is strider?"
        print("Player asks Finn: ", test_question)
        ask(test_question)

# Tool function that the LLM can call to search the knowledge base
func remember(query: String) -> String:
    print("Finn searches his memory for: ", query)
    ranked_docs = "\n".join(reranker.rank_sync(query, tavern_keeper_knowledge, 3))
    print("Finn found: ", ranked_docs)
    return ranked_docs
```
### Performance Tips
#### Limit Results
Don't add needless context. Usually 1-5 relevant documents are enough:
```gdscript
# Good: usually sufficient
ranked_docs = ",".join(reranker.rank_sync(query, tavern_keeper_knowledge, 3))
# Usually too much: a limit of -1 returns ALL documents
ranked_docs = ",".join(reranker.rank_sync(query, tavern_keeper_knowledge, -1))
```
Note that this does not make the ranking itself faster, but the less text Finn has to read, the faster he can respond.
#### Use embeddings to narrow the relevant docs to start with
This technique is what puts the `re` in reranker. In the RAG industry it is common practice to do a first pass over your documents with cosine similarity on embeddings, narrowing down the number of results the reranker has to process each time. This makes it feasible to have databases with millions of entries without worrying too much about performance.
Depending on the specs you are targeting, I would not recommend ranking more than 100 results at a time.
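As a rough sketch of that first pass, the snippet below scores every document against the query with cosine similarity and keeps only the closest ones. It assumes you have already computed one embedding per document and one for the query. The function names `cosine_similarity` and `prefilter` are made up for this example and are not part of NobodyWho.

```gdscript
extends Node

# Cosine similarity between two embedding vectors of equal length.
func cosine_similarity(a: PackedFloat32Array, b: PackedFloat32Array) -> float:
    var dot := 0.0
    var norm_a := 0.0
    var norm_b := 0.0
    for i in range(a.size()):
        dot += a[i] * b[i]
        norm_a += a[i] * a[i]
        norm_b += b[i] * b[i]
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (sqrt(norm_a) * sqrt(norm_b))

# Returns up to `limit` documents whose embeddings are closest to the query.
# `doc_embeddings[i]` is assumed to be the embedding of `docs[i]`.
func prefilter(query_embedding: PackedFloat32Array, docs: PackedStringArray, doc_embeddings: Array, limit: int) -> PackedStringArray:
    var scored: Array = []
    for i in range(docs.size()):
        scored.append([cosine_similarity(query_embedding, doc_embeddings[i]), i])
    # Highest similarity first
    scored.sort_custom(func(x, y): return x[0] > y[0])
    var result := PackedStringArray()
    for j in range(mini(limit, scored.size())):
        result.append(docs[scored[j][1]])
    return result
```

With the candidates narrowed down to a few dozen documents, a call like `reranker.rank_sync(query, candidates, 3)` only has to look at those.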
## What's Next?
Now you can build smart search systems for your game! Check out:
- **[Tool Calling](tool-calling.md)** for letting the LLM trigger game actions
---
# FAQ
## Frequently Asked Questions
### Where do I find good models to use?
New language models are coming out at a breakneck pace. If you search the web for "best language models for roleplay" or something similar, you'll probably find results that are several months or years old. You want to use something newer.
Selecting the best model for your use-case is mostly about finding the right trade-off between speed, memory usage and quality of the responses.
Bigger models generally yield better responses, but they raise the minimum system requirements and slow down generation.
Have a look at our [model selection guide](/docs/model-selection) for more in-depth recommendations.
### Once I export my Godot project, it can no longer find the model file.
Exports are a bit weird for now: Llama.cpp expects a path to a GGUF file on your filesystem, while Godot really wants to package everything in one big .pck file.
The solution (for now) is to manually copy your chosen GGUF file into the export directory (the folder with your exported game executable).
If you're exporting for Android, you can't reliably pass a `res://` path to the model node. The best workaround is to use `user://` instead.
If your model is sufficiently small, you might get away with copying it from `res://` into `user://`. If using double the storage isn't acceptable, consider downloading it at runtime, or find some other way of distributing your model as a file.
We're looking into solutions for including this file automatically.
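Until then, the `res://` to `user://` copy mentioned above could look roughly like this. It is a sketch using only Godot's built-in `FileAccess` class; the helper name `copy_file_if_missing`, the file paths, and the `$Model` node name are placeholders. It also assumes the GGUF file actually ends up in your export, which typically means adding a pattern such as `*.gguf` to the export preset's non-resource file filter.

```gdscript
extends Node

# Copies `source_path` to `target_path` in chunks unless the target already exists.
# Returns true when the target file is available afterwards.
func copy_file_if_missing(source_path: String, target_path: String) -> bool:
    if FileAccess.file_exists(target_path):
        return true
    var source := FileAccess.open(source_path, FileAccess.READ)
    if source == null:
        push_error("Could not open %s" % source_path)
        return false
    var target := FileAccess.open(target_path, FileAccess.WRITE)
    if target == null:
        push_error("Could not create %s" % target_path)
        return false
    # Copy in 16 MiB chunks so a large model is never held in memory all at once.
    var chunk_size := 16 * 1024 * 1024
    while source.get_position() < source.get_length():
        target.store_buffer(source.get_buffer(chunk_size))
    source.close()
    target.close()
    return true

func _ready():
    # Placeholder paths and node name: adjust them to your project.
    if copy_file_if_missing("res://models/your-model.gguf", "user://your-model.gguf"):
        $Model.model_path = "user://your-model.gguf"
```

The copy only happens on the first launch. Afterwards the file in `user://` is reused.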
### NobodyWho-Godot makes Godot crash on Arch Linux / Manjaro
The Godot build currently in the Arch Linux repositories does not work with GDExtension at all.
The solution for Arch users is to install Godot from elsewhere. The binary distributed on the godotengine.org website works great.
Other distribution methods like nix, flatpak, or building from source also seem to work great.
If anyone knows how to report this issue and to whom, feel free to do so. At this point I have met many Arch Linux users who have this issue.
### NobodyWho-Godot fails to load on NixOS
If you use a Godot engine from nixpkgs together with NobodyWho binaries from the Godot Asset Library, it will most likely fail to look up dynamic dependencies (libgomp, vulkan-loader, etc.).
The reason is that the dynamic library .so files from the Godot Asset Library are compiled for generic Linux and expect to find their dependencies in FHS directories like /lib, which on NixOS do not contain any dynamic libraries.
There are two good solutions for this:
1. The easy way: run the godot editor using steam-run: `steam-run godot4 --editor`
2. The Nix way: compile NobodyWho using Nix. This repo contains a flake, so it's fairly simple to do (if you have nix with nix-command and flakes enabled): `nix build github:nobodywho-ooo/nobodywho`. Remember to move the dynamic libraries into the right directory afterwards.
---
# Getting Started
_A minimal, end-to-end example showing how to load a model and perform a single chat interaction._
---
One of the most important components of NobodyWho is the Chat node. It handles all the conversation logic between the user and the LLM.
When you use the chat, you first pick a model and tell it what kind of answers you want.
When you send a message, the chat remembers what you said and sends it off to get an answer.
The model will then start reading and generating a response.
You can choose to wait for the full answer to generate or get the response in a stream.
Here are the key terms you'll see throughout this guide:
| Term | Meaning |
| ---- | ------- |
| **Model (GGUF)** | A `*.gguf` file that holds the weights of a large‑language model. |
| **System prompt** | Text that sets the ground rules for the model. |
| **Token** | The smallest chunk of text the model emits (a word, part of a word, or a punctuation mark). |
| **Chat** | The node/component that owns the context, sends user input to the worker, and keeps conversation state in sync with the LLM. |
| **Context** | The message history and metadata passed to the model each turn; it lives inside the Chat. |
| **Worker** | NobodyWho's background task for a single conversation — it keeps the model ready and acts as a communication layer between the program and the model. Each Chat has its own worker. |
Let's show you how to use the plugin to get a large language model to answer you.
## Download a GGUF Model
The first step is to get a model.
If you're in a hurry, just download [Qwen3 0.6B Q4_K_M](https://huggingface.co/NobodyWho/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf).
It's super small and fast, and works well for simple use-cases.
Otherwise, check out our [recommended models](/docs/model-selection) or if you have a non-standard use case, shoot us a question in Discord.
## Load the GGUF model
At this point you should have downloaded the model and put it into your project folder.
Add a `NobodyWhoModel` node to your scene tree.
Set the model path to point to your GGUF model.

`model_path` accepts local paths (`res://`, `user://`, or an absolute filesystem path) as well as `huggingface:` / `hf://` references and `https://` URLs that are downloaded and cached on first use. See [Downloading models](./downloading-models) for the full list of supported formats, download progress signals, gated/private models, and inspecting the local cache.
### Knowing when the worker is ready
`start_worker()` returns immediately. The worker finishes loading in the background (including any download). Connect to the new signals if your game logic needs to wait:
```gdscript
chat.worker_started.connect(func():
    print("Ready to chat!")
)
chat.worker_failed.connect(func(err):
    push_error("Model load failed: " + err)
)
chat.start_worker()
```
You can also call `ask()` straight away — prompts issued before the worker is ready are queued and dispatched as soon as loading completes. The same applies to `NobodyWhoEncoder.encode()` and `NobodyWhoCrossEncoder.rank()`.
## Create a new Chat
The next step is adding a Chat to our scene.
Add a `NobodyWhoChat` node to your scene tree.
Then add a script to the node:
```gdscript
extends NobodyWhoChat

func _ready():
    # configure the node (feel free to do this in the UI)
    self.system_prompt = "You are an evil wizard."
    self.model_node = get_node("../ChatModel")
    # connect signals to signal handlers
    self.response_updated.connect(_on_response_updated)
    self.response_finished.connect(_on_response_finished)
    # Start the worker, this is not required, but recommended to do in
    # the beginning of the program to make sure it is ready
    # when the user prompts the chat the first time. This will be called
    # under the hood when you use `ask()` as well.
    self.start_worker()
    self.ask("How are you?")

func _on_response_updated(token):
    # this will print every time a new token is generated
    print(token)

func _on_response_finished(response):
    # this will print when the entire response is finished
    print(response)
```
## Testing Your Setup
That's it! You now have a working chat system that can talk to a language model. When you run your scene, the chat will automatically send a test message and you should see the model's response appearing in your console.
You should see tokens appearing one by one as the model generates its response, followed by the complete answer. If you see the evil wizard responding with curses (or whatever system prompt you chose), everything is working correctly!
**If nothing happens:**
- Make sure your model file path is correct
- Verify that your Chat node is properly connected to your Model node
- Look for any error messages in the console
- Start your editor through the command line and check the stdout logs.
Now you're ready to build more complex conversations and integrate the chat system into your game!
---
# Installation
_How to install NobodyWho and start building._
---
### Via Asset Library
- **Open Godot 4.5** (or any newer 4.x release).
- Switch to the **Asset Library** tab.
- Search for **“NobodyWho”** and select the entry.
- Click **Download**, tick **Ignore asset root**, then choose **Install**.
- Godot puts the plugin in `res://addons/nobodywho`. Open *Create Node* and you should see **`NobodyWhoChat`**. If it’s missing, restart Godot and try again.

### Via GitHub
- Download the latest ZIP from the [GitHub releases](https://github.com/nobodywho-ooo/nobodywho/releases).
- In Godot, open **AssetLib ▸ Import** and pick the ZIP.
- Tick **Ignore asset root** and finish the import.

---
After installation, NobodyWho’s nodes should appear in your editor. If not, retrace your steps above or reach out on Discord or GitHub - we are there to help.
---
# Sampling
_Controlling how the model picks tokens and constraining output format._
---
The model does not produce tokens directly but rather a probability distribution over all possible tokens. We must then choose how to pick the next token from that distribution. This is the job of a **sampler**. With NobodyWho you can freely configure the sampler to get better-quality output or to constrain the output to a known format (e.g. JSON).
## Sampler Presets
NobodyWho offers several built-in presets you can apply to your `NobodyWhoChat` node:
### JSON Output
Force the model to always produce valid JSON:
```gdscript
chat.set_sampler_preset_json()
chat.system_prompt = "Generate a character with name, weapon, and armor properties."
chat.ask("Create a fantasy character")
# Output will always be valid JSON, e.g.:
# {"name": "Eldara", "weapon": "enchanted bow", "armor": "leather vest"}
```
### Temperature
Control the "creativity" of the model. Lower values make the model more deterministic:
```gdscript
chat.set_sampler_preset_temperature(0.2) # More focused/deterministic
chat.set_sampler_preset_temperature(1.5) # More creative/random
```
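To build some intuition for what the temperature value does, here is a simplified sketch in plain GDScript. It is not how NobodyWho implements sampling internally. `softmax_with_temperature` is a hypothetical helper that turns raw scores (logits) into probabilities after dividing them by the temperature.

```gdscript
extends Node

# Converts raw scores (logits) into probabilities that sum to 1.
# `temperature` must be greater than 0.
func softmax_with_temperature(logits: Array, temperature: float) -> Array:
    var scaled: Array = []
    var max_value: float = -INF
    for logit in logits:
        var scaled_logit: float = float(logit) / temperature
        scaled.append(scaled_logit)
        max_value = maxf(max_value, scaled_logit)
    var probabilities: Array = []
    var total := 0.0
    for entry in scaled:
        # Subtracting the maximum keeps exp() from overflowing.
        var weight: float = exp(float(entry) - max_value)
        probabilities.append(weight)
        total += weight
    for i in range(probabilities.size()):
        probabilities[i] /= total
    return probabilities

func _ready():
    print(softmax_with_temperature([2.0, 1.0, 0.0], 0.2))
    print(softmax_with_temperature([2.0, 1.0, 0.0], 1.5))
```

With these example logits, a temperature of 0.2 puts almost all of the probability on the first token, while 1.5 spreads it much more evenly across all three.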
### Greedy
Always pick the most probable token:
```gdscript
chat.set_sampler_preset_greedy()
```
## Defining your own samplers
Presets cover the common cases, but when you want to chain multiple shift
steps, set a seed for reproducible output, or use Mirostat, build a sampler
with `NobodyWhoSamplerBuilder`:
```gdscript
var cfg = NobodyWhoSamplerBuilder.new() \
.top_k(40) \
.temperature(0.8) \
.dist()
chat.set_sampler_config(cfg)
```
`NobodyWhoSamplerBuilder` has two kinds of methods: **shift steps** that transform the
probability distribution (returning the builder for further chaining) and
**terminal steps** that finalize the chain into a `NobodyWhoSamplerConfig`. Always end
the chain with one of the terminals: `dist()`, `greedy()`, `mirostat_v1(...)`,
or `mirostat_v2(...)`.
For reproducible output, set the RNG seed anywhere in the chain. The seed is
consumed by every random sampler in the chain — `dist`, `mirostat_v1`,
`mirostat_v2`, and the `xtc` shift step. `greedy` ignores it. If unset, a
default seed is used.
```gdscript
var cfg = NobodyWhoSamplerBuilder.new() \
.top_k(40) \
.temperature(0.8) \
.seed(42) \
.dist()
chat.set_sampler_config(cfg)
```
## Structured Output
One of the most powerful features is constraining the model to produce output in a specific format. This gives you a hard guarantee that the output matches your format, rather than relying on the model to get it right on its own.
### Grammar Constraints
You can constrain the model's output using a GBNF grammar:
```gdscript
var grammar = """
root ::= greeting " " name
greeting ::= "Hello" | "Hi" | "Hey"
name ::= "World" | "Friend" | "There"
"""
chat.set_sampler_preset_constrain_with_grammar(grammar)
```
This makes it **impossible** for the model to generate anything outside your defined format.
For a comprehensive tutorial on writing GBNF grammars, including JSON generation, compact formats, and practical game examples, see the [Structured Output](structured-output.md) guide.
### JSON Schema Constraints
Force the model to produce JSON matching a specific schema:
```gdscript
var schema = JSON.stringify({
"type": "object",
"properties": {
"name": {"type": "string"},
"level": {"type": "integer"},
"class": {"type": "string", "enum": ["Warrior", "Mage", "Rogue"]}
},
"required": ["name", "level", "class"]
})
chat.set_sampler_preset_constrain_with_json_schema(schema)
```
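Because the output is guaranteed to match the schema, you can parse it directly. The following is a minimal sketch using Godot's built-in `JSON.parse_string`. The helper name `parse_character` is made up for this example, and it assumes the response follows the schema above.

```gdscript
extends Node

# Parses a response produced under the character schema.
# Returns an empty Dictionary if the text is not a JSON object.
func parse_character(response: String) -> Dictionary:
    var data = JSON.parse_string(response)
    if typeof(data) != TYPE_DICTIONARY:
        return {}
    # Godot's JSON parser returns numbers as floats, so convert the level.
    data["level"] = int(data["level"])
    return data
```

You could call it from a `response_finished` handler, for example `chat.response_finished.connect(func(response): print(parse_character(response)))`.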
### Regex Constraints
For simpler patterns, constrain the output with a regular expression:
```gdscript
# Force the model to answer with exactly "yes" or "no"
chat.set_sampler_preset_constrain_with_regex("yes|no")
```
## Changing Samplers Mid-Conversation
You can change the sampler at any point during a conversation. The new sampler will take effect on the next `ask()` call:
```gdscript
# Start with free-form chat
chat.ask("Tell me about yourself")
var response = await chat.response_finished
# Switch to structured output for the next question
chat.set_sampler_preset_json()
chat.ask("Now describe your stats as JSON")
var json_response = await chat.response_finished
```
---
# Structured Output
_Getting reliable, structured responses from your models_
---
Congratulations - you have understood the basics of having a large language model generate text for you.
You are now ready for some more juicy and complex options.
Here are the key terms you should know:
| Term | Meaning |
| ---- | ------- |
| **GBNF** | GGML Backus-Naur Form - a way to define strict rules for output format |
| **Grammar** | The set of rules that define what valid output looks like |
| **Token** | A piece of text (word, punctuation, etc.) that the model generates, generally 1 to 4 characters long |
| **Encoder** | Translates text into tokens that the model can understand |
## My model is so stupid that it can not even write json
Yeah, most models will fail to generate valid JSON at some point if you just ask them to.
But fret not dear friend, the solution you are looking for is called **STRUCTURED OUTPUT**.
It is pretty much what it claims to be: a system that constrains the model's output to a format that you determine.
This can be useful for a myriad of things, from forcing the LLM to never use modern words, to using the LLM
as the engine for your own procedural dungeon room generator.
This section will take you through creating your own grammar that the model will have to use.
### Why GBNF Beats Prompt Engineering
You've probably tried this before:
```
""" Please respond in JSON format with name, level, and class fields
Only use those fields.
Only use valid json.
All json attributes should have " around them.
Please do not deviate from the instructions.
You will lose 10 points if you use other fields than level, class and name.
Do not write a message just json.
If you do not respond in valid json I will lose my job and my kids will starve.
"""
```
And got back something like:
```
Sure! Here's a character: {"name": "Eldara", "level": 15, "class": Wizard} - hope this helps!
```
Notice the problems? Missing quotes around "Wizard", extra text before and after. Your JSON parser explodes. 💥
GBNF fixes this by making it **impossible** for the model to generate anything except the format you define:
```json
{"name": "Eldara", "level": 15, "class": "Wizard"}
```
Valid. Every. Single. Time.
## Understanding GBNF Grammar Rules
### The Absolute Basics
A GBNF grammar is made up of **rules**. Each rule says "this thing can be made from these parts":
```
rule-name ::= what-it-can-be
```
### Your First Grammar: Hello World
Let's start with the simplest possible grammar:
```
root ::= "Hello World"
```
This says: "The output must be exactly the text 'Hello World'". That's it. The model can't say anything else.
Try this and the model will always output: `Hello World`
### Adding Choices with `|`
What if we want some variety? Use `|` (pipe) to give options:
```
root ::= "Hello World" | "Hi there" | "Greetings"
```
Now the model can choose between these three options, but nothing else.
### Building Blocks with Multiple Rules
Here's where it gets interesting. You can break things into smaller pieces:
```
root ::= greeting " " name
greeting ::= "Hello" | "Hi" | "Hey"
name ::= "World" | "Friend" | "There"
```
This creates outputs like:
- `Hello World`
- `Hi Friend`
- `Hey There`
The model picks one option from `greeting`, adds a space, then picks one option from `name`.
### Character Classes
Instead of listing every letter, use character classes:
```
root ::= letter letter letter
letter ::= [a-z]
```
`[a-z]` means "any lowercase letter from a to z", so `letter letter letter` produces a three-letter combination such as `cat`, `how`, or `dog`.
Common character classes:
- `[a-z]` - lowercase letters
- `[A-Z]` - uppercase letters
- `[0-9]` - digits
- `[a-zA-Z]` - any letter
- `[a-zA-Z0-9]` - letters and numbers
### Repetitions
This quickly becomes tedious if you want to create either long words or just any word. This is where repetitions come in:
- `*` means "zero or more"
- `+` means "one or more"
- `?` means "optional (zero or one)"
- `{n}` means "exactly n times"
- `{n,}` means "at least n times"
- `{n,m}` means "at least n and at most m times"
```
root ::= letter+
letter ::= [a-z]
```
This means "one or more lowercase letters" - so you get words like `hello`, `a`, `supercalifragilisticexpialidocious`.
```
root ::= [a-z]+ [0-9]*
```
This means "letters followed by optional numbers" - so you get `hello`, `test123`, `word`.
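If you already know regular expressions, the operators above behave the same way. As a small sanity check, this sketch uses Godot's built-in `RegEx` class to test which strings the grammar `[a-z]+ [0-9]*` would accept. The helper name `matches_grammar` is made up for this example. The anchors `^` and `$` are needed because a regex search otherwise also matches substrings, whereas a grammar always describes the whole output.

```gdscript
extends Node

# Returns true if `text` consists of one or more lowercase letters
# followed by zero or more digits, like the grammar `root ::= [a-z]+ [0-9]*`.
func matches_grammar(text: String) -> bool:
    var regex := RegEx.create_from_string("^[a-z]+[0-9]*$")
    return regex.search(text) != null

func _ready():
    for candidate in ["hello", "test123", "123", "Hello"]:
        print(candidate, " -> ", matches_grammar(candidate))
```

Here `hello` and `test123` are accepted, while `123` (no leading letters) and `Hello` (uppercase letter) are rejected.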
### Building JSON Step by Step
Now that you have been tricked into learning the basics of regex, we should build a small JSON generator. Start simple:
```
root ::= "{" "}"
```
This only generates: `{}`
Add one field:
```
root ::= "{" "\"name\"" ":" string "}"
string ::= "\"" [a-zA-Z]+ "\""
```
This generates: `{"name":"Bob"}` (where Bob is any sequence of letters)
Add more fields:
```
root ::= "{" "\"name\"" ":" string "," "\"level\"" ":" number "}"
string ::= "\"" [a-zA-Z]+ "\""
number ::= [0-9]+
```
This generates: `{"name":"Alice","level":25}`
### Making It Flexible
Use repetition to handle variable numbers of fields:
```
root ::= "{" pair ("," pair)* "}"
pair ::= word ":" word
word ::= "\"" [a-zA-Z]+ "\""
```
The `("," pair)*` means "zero or more additional pairs, each preceded by a comma". This generates:
- `{"name":"Bob"}`
- `{"name":"Alice","job":"Wizard"}`
- `{"name":"Charlie","job":"Knight","weapon":"Sword"}`
### Whitespace: Making It Readable
Add optional whitespace to make output prettier:
```
root ::= "{" ws pair (ws "," ws pair)* ws "}"
pair ::= string ws ":" ws string
string ::= "\"" [a-zA-Z ]+ "\""
ws ::= [ \t\n]*
```
The `ws` rule means "whitespace" - zero or more spaces, tabs, or newlines. Now you get nicely formatted JSON.
### Advanced: Specific Values
Control exactly what values are allowed:
```
root ::= "{" "\"class\"" ":" class-type "}"
class-type ::= "\"Warrior\"" | "\"Mage\"" | "\"Rogue\"" | "\"Cleric\""
```
This only allows those four specific classes - no hallucinated "Tank-operator" in your neolithic era game!
### Nested Structures
Build complex nested data:
```
root ::= "{" "\"character\"" ":" character-object "}"
character-object ::= "{" "\"name\"" ":" string "," "\"stats\"" ":" stats-object "}"
stats-object ::= "{" "\"hp\"" ":" number "," "\"mp\"" ":" number "}"
string ::= "\"" [a-zA-Z ]+ "\""
number ::= [0-9]+
```
This creates nested JSON like:
```json
{"character":{"name":"Gandalf","stats":{"hp":100,"mp":200}}}
```
## Performance Optimization: Compact Formats
Now that you understand GBNF with JSON, let's talk optimization. JSON is verbose and every token costs time. For high-performance applications, you can create much more compact formats.
### Why Compact Formats Matter
**JSON Format:**
```json
{"name":"Gandalf","level":15,"class":"Mage","hp":100,"mp":80}
```
*61 characters, ~38 tokens*
**Compact Format:**
```
Gandalf|High|Mage|Low|High
```
*26 characters, ~10 tokens*
**That's roughly 4 times fewer tokens to generate, while conveying much the same information!**
### Building Compact Formats
Start with pipe-separated values:
```
root ::= [A-Z][a-z]+ "|" [1-9][0-9]? "|" class-type
class-type ::= "Warrior" | "Mage" | "Rogue" | "Cleric"
```
This generates: `Gandalf|15|Mage` (semantically clear - no ambiguity about what "Mage" means!)
**Why not single letters?** If you used `"W" | "M" | "R" | "C"`, the LLM has no inherent knowledge that "M" means "Mage" rather than "Monk" or "Mercenary". The model generates tokens based on semantic understanding, not arbitrary mappings.
### Different delimiters for different levels
Use different separators for different levels:
```
root ::= character ("|" character)*
character ::= [A-Z][a-z]+ ":" stats ":" equipment
stats ::= stats-range "," stats-range "," stats-range
stats-range ::= "low" | "medium" | "high"
equipment ::= weapon-type "," armor-type
weapon-type ::= "Sword" | "Axe" | "Staff" | "Dagger"
armor-type ::= "Leather" | "Robes" | "Chain" | "Plate"
```
This generates: `Gandalf:high,low,low:Staff,Robes|Aragorn:low,high,medium:Sword,Plate` which in JSON would be:
```json
[
{
"name": "Gandalf",
"stats": {
"hp": "high",
"mp": "low",
"level": "low"
},
"equipment": {
"weapon": "Staff",
"armor": "Robes"
}
},
{
"name": "Aragorn",
"stats": {
"hp": "low",
"mp": "high",
"level": "medium"
},
"equipment": {
"weapon": "Sword",
"armor": "Plate"
}
}
]
```
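Turning the compact output back into structured data on the game side only takes a few `split()` calls. This is a sketch that assumes the output matches the grammar above. The function name `parse_characters` and the field names (`hp`, `mp`, `level`, and so on) are illustrative and follow the JSON shown above.

```gdscript
extends Node

# Parses "Name:stat,stat,stat:Weapon,Armor|Name:..." into an Array of Dictionaries.
func parse_characters(output: String) -> Array:
    var characters: Array = []
    for entry in output.split("|"):
        # Level 1: name, stats and equipment are separated by ":"
        var parts: PackedStringArray = entry.split(":")
        # Level 2: individual values are separated by ","
        var stats: PackedStringArray = parts[1].split(",")
        var equipment: PackedStringArray = parts[2].split(",")
        characters.append({
            "name": parts[0],
            "stats": {"hp": stats[0], "mp": stats[1], "level": stats[2]},
            "equipment": {"weapon": equipment[0], "armor": equipment[1]},
        })
    return characters
```

Since the grammar guarantees the number and order of fields, the parser does not need to handle malformed input.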
### Semantic Soundness
One advantage of using JSON is the hints it gives the LLM.
If it sees `"name": "Gandalf"` instead of just `Gandalf`, it might be more inclined to generate a wizard class or give the character a staff.
The same goes for numbers: the LLM does not inherently understand what a good number for a high level or mana pool is, but it understands high vs low.
When designing compact formats:
✅ **Good:** `"Warrior" | "Mage" | "Rogue"`
✅ **Good:** `"Sword" | "Staff" | "Dagger"`
✅ **Good:** `"Leather" | "Robes" | "Chain"`
✅ **Good:** `"Low" | "Medium" | "High"`
❌ **Bad:** `"WAR" | "MAG" | "ROG"` - abbreviated and potentially ambiguous
❌ **Bad:** `"W" | "M" | "R"` - arbitrary single letters
❌ **Bad:** `"1" | "2" | "3"` - numeric values
The LLM generates text based on semantic understanding. Use full words that match how the concepts are normally written in natural language.
You should additionally provide the right context and use one-shot or few-shot prompting to make it more robust.
### Underscores footgun
The GBNF format does not support `_` in rule names. According to [the GBNF format documentation](https://github.com/ggml-org/llama.cpp/tree/master/grammars#json-schemas--gbnf), only lowercase characters and dashes are allowed for naming nonterminals.
## Practical Example: Legendary Weapon Generator
Let's build a weapon generation system that creates legendary weapons for your RPG. We'll start simple and add complexity step by step, showing you how GBNF grammars work in practice.
### Why Use GBNF for Weapon Generation?
Traditional random generators often create nonsensical combinations like "Flaming Sword of Ice" with 8 fire damage, a generic random backstory, and an ice ability. More advanced systems exist, but they rely on lookup tables, which become tedious to maintain very quickly.
LLMs constrained with GBNF keep things semantically coherent: they'll generate "Flamebrand, Ancient Sword of Solar Wrath" instead, with 8 fire damage, a meaningful backstory based on how you got the weapon or on the lore of your game, and an ability chosen to fit the backstory, damage and name.
### Step 1: Dynamic Weapon Name Generator
Let's start with a weapon generator that builds weapon names:
**Grammar:**
```
root ::= weapon-name " (" weapon-type ")"
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
```
```gdscript
extends Node

@onready var model = $Model # Your NobodyWhoModel node
@onready var chat = $Chat # Your NobodyWhoChat node

# The grammar shown above, stored as a string
var grammar_string = """
root ::= weapon-name " (" weapon-type ")"
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
"""

func _ready():
    # Configure the weapon generator
    model.model_path = "res://models/your-model.gguf"
    chat.model_node = model
    chat.system_prompt = "You are a legendary weapon generator for a fantasy RPG."
    # Start the worker so it's ready
    chat.start_worker()
    # Connect to handle responses
    chat.response_finished.connect(_on_weapon_generated)

func _input(event):
    if event is InputEventKey and event.pressed and event.keycode == KEY_SPACE:
        generate_weapon()

func generate_weapon():
    chat.set_sampler_preset_constrain_with_grammar(grammar_string)
    # Reset the context so new weapons are not influenced by already generated ones.
    chat.reset_context()
    chat.ask("Generate a weapon:")

func _on_weapon_generated(weapon_name: String):
    print(weapon_name)
    # Here you could add the weapon to inventory, display it in UI, etc.
```
**Output examples:**
- `Flamebrand (Sword)`
- `Shadowfang (Dagger)`
- `Stormcall (Staff)`
- `Darkward (Bow)`
This is more or less just a random number generator, but more GPU expensive...
### Step 2: Adding Weapon Stats
Let's add damage and abilities to make weapons more interesting for gameplay. This is where we move from a random weapon generator to a semantic weapon generator:
**Grammar:**
```
root ::= weapon-name " (" weapon-type ") - " damage-level " damage, " ability-name " ability. " backstory
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
damage-level ::= "Low" | "Medium" | "High" | "Legendary"
ability-name ::= "Flame Strike" | "Frost Bite" | "Shadow Step" | "Lightning Bolt" | "Healing Aura" | "Poison Cloud"
backstory ::= [a-zA-Z0-9 ]+ "."
```
Be careful not to allow too many symbols in your backstory. Because the character class above excludes `.`, the first `.` the model writes ends the output, which makes it more likely to stop after one sentence instead of writing paragraph upon paragraph of text.
```gdscript
func generate_weapon():
    chat.set_sampler_preset_constrain_with_grammar(grammar_string)
    # Reset the context so new weapons are not influenced by already generated ones.
    chat.reset_context()
    chat.ask("Generate a weapon:")

func _on_weapon_generated(weapon_data: String):
    print(weapon_data)
```
**Output examples:**
- `Shadowfang (Sword) - Legendary damage, Shadow Step ability. Shadowfang is a legendary sword that was forged by the ancient shadow realm.`
Notice how the model keeps the pieces consistent: a name like `Flamebrand` tends to be paired with a sword and the Flame Strike ability, and `Shadowfang` above gets Shadow Step and a shadow-themed backstory. It feels like there is intent behind the creation of this weapon.
### Step 3: Enhanced Backstories
Let's expand the backstory system to allow for richer, more detailed weapon lore:
**Grammar:**
```
root ::= weapon-name " (" weapon-type ") - " damage-level " damage, " ability-name " ability. Story: " backstory
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
damage-level ::= "Low" | "Medium" | "High" | "Legendary"
ability-name ::= "Flame Strike" | "Frost Bite" | "Shadow Step" | "Lightning Bolt" | "Healing Aura" | "Poison Cloud"
backstory ::= [a-zA-Z0-9 ]{50,200} "."
```
When doing this we also want to inject some of our lore. We will borrow from The Lord of the Rings here, so replace it with your own lore.
```gdscript
func _ready():
    # Configure the weapon generator
    chat.model_node = model
    chat.system_prompt = "Generate a weapon and a backstory in the LOTR universe"
    # ... rest of the setup

func generate_weapon():
    chat.set_sampler_preset_constrain_with_grammar(grammar_string)
    # Reset the context so new weapons are not influenced by already generated ones.
    chat.reset_context()
    chat.ask("The party just found a new weapon after travelling through the mines of Moria:")

func _on_weapon_generated(weapon_data: String):
    print(weapon_data)
```
**Output examples:**
- `Shadowfang (Sword) - Legendary damage, Shadow Step ability. The sword is made from the dark shards that were once part of the Balrog`
- `Flamebrand (Sword) - High damage, Flame Strike ability. Backstory involves a fallen dwarf lord named Drakon who was corrupted by the Balrogs and used the sword to slay an enemy.`
### Step 4: Compact Format for Performance
For games that generate many weapons or even very complex weapons, you want maximum efficiency. Let's create a compact pipe-separated format:
**Grammar:**
```
root ::= weapon-name "|" weapon-type "|" damage-level "|" ability-name "|" weight "|" throwable "|" damage-type "|" durability "|" rarity "|" enchantment "|" material "|" short-story
weapon-name ::= name-prefix name-suffix
name-prefix ::= "Flame" | "Frost" | "Shadow" | "Storm" | "Light" | "Dark"
name-suffix ::= "brand" | "fang" | "bane" | "call" | "ward" | "rend"
weapon-type ::= "Sword" | "Axe" | "Dagger" | "Staff" | "Bow" | "Hammer"
damage-level ::= "Low" | "Medium" | "High" | "Legendary"
ability-name ::= "Flame Strike" | "Frost Bite" | "Shadow Step" | "Lightning Bolt" | "Healing Aura" | "Poison Cloud"
weight ::= "Heavy" | "Light"
throwable ::= "Throwable" | "Non-throwable"
damage-type ::= "Sharp" | "Pierce" | "Blunt"
durability ::= "Fragile" | "Sturdy" | "Unbreakable"
rarity ::= "Common" | "Rare" | "Epic" | "Legendary"
enchantment ::= "Glowing" | "Humming" | "Pulsing" | "Silent"
material ::= "Steel" | "Mithril" | "Obsidian" | "Crystal"
short-story ::= [a-zA-Z0-9 ]{50,200} "."
```
**Note:** So-called "thinking" or "reasoning" models will strongly prefer to start every generation with a block of text inside `<think>` tags. If your grammar doesn't naturally allow the output to be prefixed with a "thinking" section like this, the model will try to squeeze it into free-text sections (e.g. the story section in the example above). If you rely a lot on structured generation, you may prefer to use a "non-thinking" model. If you prefer to keep the "thinking" ability, you could begin your grammar with a section like `"<think>" [a-zA-Z0-9 ]{10,1000} "." "</think>"` to allow the model to get its reasoning section out of the way.
Furthermore, the current implementation of GBNF has some performance issues with specific repetition ranges (e.g. `word{10,20}`), so it might be smarter to have an unconstrained model generate the short story.
**Output examples:**
- `Flamebrand|Sword|High|Flame Strike|Heavy|Non-throwable|Sharp|Sturdy|Epic|Glowing|Steel|Forged by fire elementals in ancient volcano`
Or, with thinking models (showing that the "thinking" section gets squeezed in wherever possible):
- `Shadowfang|Axe|Legendary|Shadow Step|Light|Throwable|Sharp|Sturdy|Epic|Silent|Steel|The Shadowfang is a legendary axe that is said to have been forged in the depths of the Shadowspire Mountains by the elusive Night Hunter.`
- `Stormcall|Staff|Legendary|Lightning Bolt|Light|Non-throwable|Blunt|Unbreakable|Legendary|Pulsing|Crystal|The user wants me to generate a short story for the weapon. I will think...`
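
A compact line like the ones above still has to be turned back into named fields before your game can use it. The following is a minimal sketch using only Godot built-ins. The helper name `parse_weapon` and the field keys are assumptions made for illustration and are not part of NobodyWho.

```gdscript
extends Node

# Field order mirrors the `root` rule of the grammar above.
const WEAPON_FIELDS := [
    "name", "type", "damage_level", "ability", "weight", "throwable",
    "damage_type", "durability", "rarity", "enchantment", "material", "backstory",
]

# Hypothetical helper: splits one generated line into a Dictionary.
# Returns an empty Dictionary if the line does not contain exactly twelve fields.
func parse_weapon(line: String) -> Dictionary:
    var parts := line.strip_edges().split("|")
    if parts.size() != WEAPON_FIELDS.size():
        return {}
    var weapon := {}
    for i in WEAPON_FIELDS.size():
        weapon[WEAPON_FIELDS[i]] = parts[i]
    return weapon

func _ready():
    var line := "Flamebrand|Sword|High|Flame Strike|Heavy|Non-throwable|Sharp|Sturdy|Epic|Glowing|Steel|Forged by fire elementals in ancient volcano"
    var weapon := parse_weapon(line)
    print(weapon["name"], " deals ", weapon["damage_type"], " damage")
```

Because the grammar never allows a `|` inside a field, a plain split is enough. The size check catches truncated generations, for example when the model runs out of context.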
---
This is quite a powerful system for procedurally generating anything: weapons, levels, questlines, or whatever else you can think of. Even better, you get to influence the generation meaningfully with the prompt you send, while keeping the variety offered by the system.
This complete system generates weapons with all the attributes your game systems might need, from combat mechanics (damage type, weight) to visual effects (enchantment, material) and lore (backstory).
---
# Tool Calling
_Triggering actions from within the model._
---
Welcome to the tool calling page!
Now that you understand some of the basics (if not, please read [Chat](chat.md)),
we can move on to adding one of the truly powerful and fun components to our model: tool calling, also known as function calling.
Tool calling is a way to give your model actions to perform in your game world.
The model can:
* Check data - "What's my health?"
* Change the world - "Open the north gate."
* Run helper logic - damage rolls, crafting math, random loot.
We'll start with a small and simple tool, add arguments, then increase accuracy using schema and adding constraints.
**Note that not all models support tool calling.**
---
## A simple tool
This is an example of how to give the model access to a function we have created that gets the player's current stats (health, mana, gold).
```gdscript
extends NobodyWhoChat

func get_player_stats() -> String:
    var player = GameManager.get_local_player()
    return JSON.stringify({
        "health": player.health,
        "mana": player.mana,
        "gold": player.gold
    })

func _ready():
    # Expose the function to the model, with a description of what it returns.
    add_tool(get_player_stats, "Returns the local player's health, mana, and gold.")
```
Ask "How hurt am I?" - the model calls your tool and answers with real numbers.
---
## But I need arguments, you say:
Sure - that is possible, but only primitives are currently implemented in NobodyWho:
Allowed primitive types: `int`, `float`, `bool`, `String`/`string`, `Array`/`string[]`
Models operate with JSON as an abstraction layer, rather than a specific language (like GDScript), when calling tools.
When NobodyWho receives a function or a delegate, it deconstructs the name and parameters and uses them
to construct a JSON schema that is passed to the model.
For the example below, the generated JSON will look something like this:
```json
{
  "type": "object",
  "properties": {
    "amount": {
      "type": "integer",
      "description": ""
    }
  },
  "required": ["amount"]
}
```
This is then used to construct a lazy-loadable GBNF grammar, so the model always passes the correct number and set of arguments.
A limitation of this approach is that NobodyWho cannot extract a description for each argument (note the empty `description` above).
Therefore it can be advantageous to write your own schema for maximum precision.
```gdscript
func heal_player(amount: int) -> String:
    GameManager.get_local_player().heal(amount)
    return "Healed %d HP" % amount

func _ready():
    add_tool(heal_player, "Heals the local player by a number of hit-points")
```
*NobodyWho auto-builds the JSON schema from the type hints.*
Therefore you must make sure that every parameter has a type hint and that the method declares its return type.
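
To get a feel for what this conversion involves, here is a rough GDScript sketch that derives a similar schema from a method's type hints using Godot's reflection (`Object.get_method_list()`). This is not NobodyWho's actual implementation. The helper name `schema_for`, the type mapping, and the fallback to `"string"` are assumptions made for illustration.

```gdscript
extends Node

# Illustrative mapping from Godot Variant types to JSON Schema type names.
const JSON_TYPES := {
    TYPE_INT: "integer",
    TYPE_FLOAT: "number",
    TYPE_BOOL: "boolean",
    TYPE_STRING: "string",
    TYPE_ARRAY: "array",
}

func heal_player(amount: int) -> String:
    return "Healed %d HP" % amount

# Hypothetical helper: builds a schema Dictionary from a method's type hints.
func schema_for(callable: Callable) -> Dictionary:
    var properties := {}
    var required := []
    for method in callable.get_object().get_method_list():
        if method["name"] != callable.get_method():
            continue
        for arg in method["args"]:
            properties[arg["name"]] = {
                # Unknown types fall back to "string" in this sketch.
                "type": JSON_TYPES.get(arg["type"], "string"),
                # Type hints carry no description, hence the empty string.
                "description": "",
            }
            required.append(arg["name"])
        break
    return {"type": "object", "properties": properties, "required": required}

func _ready():
    print(JSON.stringify(schema_for(heal_player), "  "))
```

The sketch also shows why untyped parameters are a problem: without a type hint, the reflection data has nothing useful to map to a schema type.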
---
## Your model is now ready to interact with the world
Have the model open a door.
```gdscript
func open_door(door_id: String) -> String:
    DoorManager.open(door_id)
    return "Opened door %s" % door_id

func _ready():
    add_tool(open_door, "Opens a door in the world by id")
    ask("Can you open the door?")
```
The model pauses generation until the tool has finished running.
---
## Multiple Tools & Resetting
You can add as many tools as you like, but you need to reset the context before they are taken into account.
```gdscript
add_tool(get_player_stats, "Player stats")
add_tool(open_door, "Open a door")
reset_context()
```
---
## But I don't want it to hallucinate random strings
Don't worry, we've got you.
As mentioned before, we use the JSON Schema specification, which looks like this:
```jsonschema
{
  "type": "object",
  "properties": {
    "color": {
      "type": "string",
      "description": "A specific color for the button",
      "enum": ["red", "blue", "green"]
    }
  },
  "required": ["color"]
}
```
The top-level type must always be `object`. The properties are a dictionary in which each key is a parameter name and each value describes that parameter: `type` determines whether it is a string, a list or something else, and `description` describes how the parameter is used.
Properties that are not part of the `required` list are treated by the model as optional parameters.
```gdscript
# `press_button_schema` holds the JSON shown above.
func press_button(color: String) -> String:
    ButtonManager.press(color)
    return "Pressed %s button" % color

func _ready():
    add_tool_with_schema(
        press_button,
        press_button_schema,
        "Press one of the three coloured buttons (red, blue, green)"
    )
```
Result: the model **cannot** request any color other than *red*, *blue*, or *green*. Use the same pattern for item rarities, quest tiers, etc.
**Heads-up** – NobodyWho turns that schema into a GBNF grammar using the open-source [`richardanaya/gbnf`](https://github.com/richardanaya/gbnf) converter. It currently supports the common bits: primitive types, `enum`, `required`, flat `oneOf`, and simple arrays. Exotic keywords (`minimum`, `pattern`, deeply-nested refs) may be ignored until the library grows.
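
One way to produce the `press_button_schema` placeholder used above is to build it as a Dictionary and serialize it with `JSON.stringify`, so the list of allowed values lives in one place and can be shared with your game code. This is a sketch under assumptions: the names `BUTTON_COLORS` and `make_press_button_schema` are hypothetical, and whether `add_tool_with_schema` expects JSON text or a Dictionary is not stated here, so check the API reference for your version.

```gdscript
extends Node

# Single source of truth for the allowed values (hypothetical constant).
const BUTTON_COLORS := ["red", "blue", "green"]

# Hypothetical helper: returns the schema from this section as a Dictionary.
func make_press_button_schema() -> Dictionary:
    return {
        "type": "object",
        "properties": {
            "color": {
                "type": "string",
                "description": "A specific color for the button",
                "enum": BUTTON_COLORS,
            },
        },
        "required": ["color"],
    }

func _ready():
    # ASSUMPTION: the schema is handed over as JSON text.
    var press_button_schema := JSON.stringify(make_press_button_schema())
    # Sanity check: the text must parse back into a Dictionary.
    assert(JSON.parse_string(press_button_schema) is Dictionary)
    print(press_button_schema)
```

Building the schema in code also avoids small JSON mistakes, such as a trailing comma, that are easy to make when writing the schema by hand.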
---
A note on descriptions:
The description helps the model pick the right tool and pass the right arguments. Be explicit. Explain when to use the tool, explain what the tool does.
Bad: **"Door"**
Good: **"Use this function when the assistant is blocked or needs to close a door. This tool opens or closes the door with the given id. If -1 is given, the nearest door will be interacted with."**
---
## Pre-packaged tools
We ship NobodyWho with two built-in tools that are general enough for multiple use-cases: the [monty](https://github.com/pydantic/monty) Python interpreter
and the [bashkit](https://github.com/everruns/bashkit) Bash interpreter. Both serve a similar purpose: to give your small LLM a better chance of answering
questions that require precise reasoning or some kind of computation, possibly over a large context.
The usage is straightforward. Use `add_python_tool()` and `add_bash_tool()`:
```gdscript
func _ready():
    add_python_tool()
    add_bash_tool()
```
Lastly, keep in mind that for most use-cases it is reasonable to constrain these tools with limits on memory and computation time,
so that you don't end up executing, say, an infinite loop. For this, `add_python_tool()` provides `max_duration_secs`, `max_memory_bytes` and `max_recursion_depth`,
and `add_bash_tool()` provides `max_commands`.
---
# Vision & Hearing
_Enabling models to ingest images and audio._
---
A picture is worth a thousand words (or at least a thousand tokens).
With NobodyWho, you can easily provide image information to your LLM.
## Choosing a model
Not all models have built-in image and audio capabilities. Generally, you will
need two parts to make this work:
1. A multimodal LLM, so the LLM can consume image tokens and/or audio tokens
2. A projection model, which converts images to image tokens and/or audio to audio tokens
To find such a model, refer to the [HuggingFace Image-Text-to-Text](https://huggingface.co/models?pipeline_tag=image-text-to-text&library=gguf&sort=likes) section
and [Audio-Text-to-Text](https://huggingface.co/models?pipeline_tag=audio-text-to-text&sort=trending). Some models like Gemma 4 even manage both!
The projection model usually has `mmproj` in its name.
If you are unsure which ones to pick, or just want a reasonable default, you can try [Gemma 4](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf?download=true) with its [BF16 projection model](https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/mmproj-BF16.gguf?download=true),
which can do both image and audio.
With the downloaded GGUFs, you can set the projection model on your `NobodyWhoModel` node.
In the editor, set the `projection_model_path` property to point to your projection model file.
Alternatively, you can set it in GDScript:
```gdscript
$ChatModel.projection_model_path = "res://mmproj.gguf"
```
> **Note:** The language model and projection model have to **fit** together, as they are trained together!
> Unfortunately, you can't just take a projection model and an LLM that you like and expect them
> to work together.
## Composing a prompt object
With the model configured, all that is left is to compose the prompt and send it to the model.
That is done through the `NobodyWhoPrompt` object.
```gdscript
extends NobodyWhoChat

func _ready():
    self.model_node = get_node("../ChatModel")
    self.system_prompt = "You are a helpful assistant, that can hear and see stuff!"

    # Build a multimodal prompt from text, an image, and an audio file.
    var prompt = NobodyWhoPrompt.new()
    prompt.add_text("Tell me what you see in the image and what you hear in the audio.")
    prompt.add_image("res://dog.png")
    prompt.add_audio("res://sound.mp3")

    ask(prompt)
    var response = await response_finished # It's a dog and a penguin!
```
## Tips for multimodality
As with textual prompts, the format in which you supply the multimodal prompt can matter in certain
scenarios. If the model performs poorly, try to mess around with the order of supplying the text
and the multimodal files, or the descriptions you supply. For example, the following prompt may perform better than the previously presented one.
```gdscript
var prompt = NobodyWhoPrompt.new()
prompt.add_text("Tell me what you see in the image.")
prompt.add_image("res://dog.png")
prompt.add_text("Also tell me what you hear in the audio.")
prompt.add_audio("res://sound.mp3")
```
Also, there is still a lot of variance between how the models internally process the images.
This, for example, causes differences in how quickly the model consumes context - for some models like Gemma 3, the number of tokens per image is constant; for others like Qwen 3, they scale with the size of the image. In that case, you can increase the context size if the resources allow:
```gdscript
self.context_length = 8192
```
Or, for example, preprocess your images with some kind of compression (sometimes even changing the image type helps).
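
As a rough illustration of such preprocessing, the sketch below downscales an image with Godot's built-in `Image` class and saves it as a JPEG. The helper name `shrink_image`, the 768-pixel limit, and the JPEG quality are assumptions, not recommendations from NobodyWho. Whether `add_image` accepts the resulting path (for example a `user://` path) is also an assumption you should verify.

```gdscript
extends Node

# Hypothetical helper: shrinks an image so its longest side is at most `max_side`
# pixels and saves it as a JPEG. Returns `dst_path`, or "" if loading or saving fails.
func shrink_image(src_path: String, dst_path: String, max_side: int = 768) -> String:
    var image := Image.load_from_file(src_path)
    if image == null:
        return ""
    var longest := maxi(image.get_width(), image.get_height())
    if longest > max_side:
        # Scale both sides by the same factor to keep the aspect ratio.
        var factor := float(max_side) / longest
        image.resize(
            maxi(1, roundi(image.get_width() * factor)),
            maxi(1, roundi(image.get_height() * factor)),
            Image.INTERPOLATE_LANCZOS
        )
    # JPEG is lossy; 0.85 is an arbitrary quality setting for this sketch.
    if image.save_jpg(dst_path, 0.85) != OK:
        return ""
    return dst_path
```

For models whose token count scales with image size, a smaller image should consume less context. For models with a constant token count per image, the main benefit is likely a smaller file to load.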
Moreover, audio ingestion also seems to depend heavily on the data type of the projection model file. For Gemma 4,
ingesting audio works best with BF16, while other types reportedly struggle. We thus recommend at least trying out different
projection model files if the one you picked does not work.
As always with more niche models, you may run into bugs. If you stumble upon any, please be sure to [report them](https://github.com/nobodywho-ooo/nobodywho/issues), so we can fix them.
---