| Title: | Interface for Large Language Models via 'llama.cpp' |
|---|---|
| Description: | Provides 'R' bindings to 'llama.cpp' for running Large Language Models ('LLMs') locally with optional 'Vulkan' GPU acceleration via 'ggmlR'. Supports model loading, text generation, 'tokenization', token-to-piece conversion, 'embeddings' (single and batch), encoder-decoder inference, low-level batch management, chat templates, 'LoRA' adapters, explicit backend/device selection, multi-GPU split, and 'NUMA' optimization. Includes a high-level 'ragnar'-compatible embedding provider ('embed_llamar'). Built on top of 'ggmlR' for efficient tensor operations. |
| Authors: | Yuri Baramykov [aut, cre] (ORCID: <https://orcid.org/0009-0000-7627-4217>), Georgi Gerganov [cph] (Author of the 'llama.cpp' library included in src/) |
| Maintainer: | Yuri Baramykov <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.4 |
| Built: | 2026-06-01 21:04:07 UTC |
| Source: | https://github.com/zabis13/llamar |
Returns an ellmer Chat object backed by a local GGUF model,
so the whole ellmer / ragnar toolchain (turns, tools, streaming,
structured output, ragnar_register_tool_retrieve(), …) works
against local inference. Transport is the OpenAI-compatible HTTP API
from llama_serve_openai; this function is a thin
chat_vllm wrapper over it. (We use the vLLM provider
because it speaks /v1/chat/completions — the de-facto standard our
server implements — whereas ellmer's chat_openai/
chat_openai_compatible target OpenAI's newer /v1/responses.)
chat_llamar( model_path = NULL, base_url = NULL, port = 11434L, n_ctx = 4096L, n_gpu_layers = -1L, model_id = NULL, system_prompt = NULL, timeout = 180, ... )chat_llamar( model_path = NULL, base_url = NULL, port = 11434L, n_ctx = 4096L, n_gpu_layers = -1L, model_id = NULL, system_prompt = NULL, timeout = 180, ... )
model_path |
Path to a GGUF model file. Spawns a server (mode A).
Mutually exclusive with |
base_url |
Base URL of a running OpenAI-compatible server, e.g.
|
port |
Port for the spawned server (mode A only). Default
|
n_ctx, n_gpu_layers
|
Passed to |
model_id |
Model identifier reported to ellmer. Defaults to the
model file's base name in mode A; |
system_prompt |
Optional system prompt for the chat. |
timeout |
Seconds to wait for a spawned server to accept
connections before erroring (mode A only). Default |
... |
Passed on to |
Two modes, picked by which argument you pass (DBI-style — like
DBI::dbConnect() accepting either connection parameters or a
ready connection):
base_urlConnect to a server you already started (e.g.
llama_serve_openai() in another process, or a worker pool).
No process is spawned.
model_pathSpin up llama_serve_openai() in a
background R process (via callr), wait for it to come up, and
return a Chat pointed at it. The server process's lifetime is
tied to the returned object: when it is garbage-collected (or R
exits), the process is killed. Stop it eagerly with
chat_llamar_stop.
Exactly one of base_url or model_path must be supplied.
An ellmer Chat object. In mode A it additionally
carries the background process handle (see chat_llamar_stop).
The server is single-sequence (one request at a time); see
llama_serve_openai. For parallel sessions, run a pool of
servers on different ports and create one chat_llamar(base_url=)
per worker.
Tool calling and structured output are mediated by the OpenAI protocol,
so they work only as far as the server implements them. The current
server does not emit tool_calls yet (see TODO), so ellmer tools
registered on the returned chat will not be invoked by the model.
[llama_serve_openai], [chat_llamar_stop]
## Not run: # Mode A: spawn a server for this model and chat with it. chat <- chat_llamar(model_path = "model.gguf") chat$chat("Why is the sky blue?") chat_llamar_stop(chat) # or let GC do it # Mode B: connect to a server you already run. llama_serve_openai("model.gguf", port = 11434L) # in another process chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1") chat$chat("Hello!") ## End(Not run)## Not run: # Mode A: spawn a server for this model and chat with it. chat <- chat_llamar(model_path = "model.gguf") chat$chat("Why is the sky blue?") chat_llamar_stop(chat) # or let GC do it # Mode B: connect to a server you already run. llama_serve_openai("model.gguf", port = 11434L) # in another process chat <- chat_llamar(base_url = "http://127.0.0.1:11434/v1") chat$chat("Hello!") ## End(Not run)
Kills the background llama_serve_openai process that
chat_llamar started in mode A. A no-op for chats created
in mode B (base_url=), which own no process. Safe to call more
than once.
chat_llamar_stop(chat)chat_llamar_stop(chat)
chat |
A |
Invisibly TRUE if a process was killed, FALSE
otherwise.
[chat_llamar]
Computes embeddings using a local GGUF model. When called without x,
returns a function suitable for passing to ragnar_store_create(embed = ...).
embed_llamar( x, model, n_gpu_layers = 0L, n_ctx = 512L, n_threads = parallel::detectCores(), embedding = FALSE, normalize = TRUE )embed_llamar( x, model, n_gpu_layers = 0L, n_ctx = 512L, n_threads = parallel::detectCores(), embedding = FALSE, normalize = TRUE )
x |
Character vector of texts to embed, a data.frame with a |
model |
Either a path to a |
n_gpu_layers |
Number of layers to offload to GPU (0 = CPU only, -1 = all).
Ignored when |
n_ctx |
Context window size for the embedding context. Defaults to 512,
typical for embedding models. Ignored when |
n_threads |
Number of CPU threads. Ignored when |
embedding |
Logical; if |
normalize |
Logical; if |
If x is missing or NULL: a function function(x) that
returns a list of numeric vectors (one per input string), suitable for ragnar.
If x is a character vector: a numeric matrix with nrow = length(x)
and ncol = n_embd.
If x is a data.frame: the same data.frame with an added embedding
column (list of numeric vectors).
## Not run: # --- Partial application for ragnar --- store <- ragnar_store_create( "my_store", embed = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1) ) # --- Direct use with path --- mat <- embed_llamar(c("hello", "world"), model = "embedding-model.gguf") # --- Direct use with pre-loaded model --- mdl <- llama_load_model("embedding-model.gguf", n_gpu_layers = -1) mat <- embed_llamar(c("hello", "world"), model = mdl) ## End(Not run)## Not run: # --- Partial application for ragnar --- store <- ragnar_store_create( "my_store", embed = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1) ) # --- Direct use with path --- mat <- embed_llamar(c("hello", "world"), model = "embedding-model.gguf") # --- Direct use with pre-loaded model --- mdl <- llama_load_model("embedding-model.gguf", n_gpu_layers = -1) mat <- embed_llamar(c("hello", "world"), model = mdl) ## End(Not run)
Returns a data.frame of all compute devices (CPU, GPU, etc.) detected
by the ggml backend. Use device names from this list in the devices
parameter of llama_load_model.
llama_backend_devices()llama_backend_devices()
A data.frame with columns name, description, and
type (one of "cpu", "gpu", "igpu", "accel").
# List available compute devices and pick GPU names for llama_load_model() devs <- llama_backend_devices() print(devs) gpu_names <- devs$name[devs$type == "GPU"]# List available compute devices and pick GPU names for llama_load_model() devs <- llama_backend_devices() print(devs) gpu_names <- devs$name[devs$type == "GPU"]
llama_batch_init()
Free a llama batch allocated with llama_batch_init()
llama_batch_free(batch)llama_batch_free(batch)
batch |
An external pointer returned by |
NULL invisibly.
## Not run: batch <- llama_batch_init(512L) llama_batch_free(batch) ## End(Not run)## Not run: batch <- llama_batch_init(512L) llama_batch_free(batch) ## End(Not run)
Allocates a llama_batch that can hold up to n_tokens tokens.
Use llama_batch_free() to release the memory when done.
llama_batch_init(n_tokens, embd = 0L, n_seq_max = 1L)llama_batch_init(n_tokens, embd = 0L, n_seq_max = 1L)
n_tokens |
Maximum number of tokens in the batch. |
embd |
Embedding size; 0 means token-ID mode (normal inference). |
n_seq_max |
Maximum number of sequences per token. |
An external pointer to the allocated batch.
## Not run: batch <- llama_batch_init(512L) llama_batch_free(batch) ## End(Not run)## Not run: batch <- llama_batch_init(512L) llama_batch_free(batch) ## End(Not run)
Formats a conversation using the specified chat template. This is essential for instruct/chat models to work correctly.
llama_chat_apply_template( messages, template = NULL, add_generation_prompt = TRUE )llama_chat_apply_template( messages, template = NULL, add_generation_prompt = TRUE )
messages |
List of messages, each with 'role' and 'content' elements. Roles are typically "system", "user", "assistant". |
template |
Template string (from [llama_chat_template]) or NULL to use default |
add_generation_prompt |
Whether to add the assistant prompt prefix at the end |
A character scalar containing the formatted prompt string, ready
to be passed to llama_generate.
## Not run: model <- llama_load_model("llama-3.2-instruct.gguf") tmpl <- llama_chat_template(model) messages <- list( list(role = "system", content = "You are a helpful assistant."), list(role = "user", content = "What is R?") ) prompt <- llama_chat_apply_template(messages, template = tmpl) cat(prompt) ctx <- llama_new_context(model) response <- llama_generate(ctx, prompt) ## End(Not run)## Not run: model <- llama_load_model("llama-3.2-instruct.gguf") tmpl <- llama_chat_template(model) messages <- list( list(role = "system", content = "You are a helpful assistant."), list(role = "user", content = "What is R?") ) prompt <- llama_chat_apply_template(messages, template = tmpl) cat(prompt) ctx <- llama_new_context(model) response <- llama_generate(ctx, prompt) ## End(Not run)
Returns a character vector of all chat template names supported by llama.cpp.
llama_chat_builtin_templates()llama_chat_builtin_templates()
A character vector of built-in template names.
# See which chat template formats are supported out of the box templates <- llama_chat_builtin_templates() head(templates)# See which chat template formats are supported out of the box templates <- llama_chat_builtin_templates() head(templates)
Returns the chat template string embedded in the model file, if any. Common templates include ChatML, Llama, Mistral, etc.
llama_chat_template(model, name = NULL)llama_chat_template(model, name = NULL)
model |
Model handle returned by [llama_load_model] |
name |
Optional template name (NULL for default) |
A character scalar with the chat template string, or NULL if
the model does not contain a built-in template.
## Not run: model <- llama_load_model("llama-3.2-instruct.gguf") tmpl <- llama_chat_template(model) cat(tmpl) ## End(Not run)## Not run: model <- llama_load_model("llama-3.2-instruct.gguf") tmpl <- llama_chat_template(model) cat(tmpl) ## End(Not run)
Detokenize token IDs back to text
llama_detokenize(ctx, tokens)llama_detokenize(ctx, tokens)
ctx |
Context handle returned by [llama_new_context] |
tokens |
Integer vector of token IDs (as returned by [llama_tokenize]) |
A character scalar containing the decoded text.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) # Round-trip: text -> tokens -> text original <- "Hello, world!" tokens <- llama_tokenize(ctx, original, add_special = FALSE) restored <- llama_detokenize(ctx, tokens) identical(original, restored) # TRUE ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) # Round-trip: text -> tokens -> text original <- "Hello, world!" tokens <- llama_tokenize(ctx, original, add_special = FALSE) restored <- llama_detokenize(ctx, tokens) identical(original, restored) # TRUE ## End(Not run)
Computes embeddings for a character vector of texts in a single decode pass
using per-sequence pooling. This is more efficient than calling
llama_embeddings in a loop when embedding many texts.
llama_embed_batch(ctx, texts)llama_embed_batch(ctx, texts)
ctx |
Context handle returned by [llama_new_context] |
texts |
Character vector of texts to embed |
Requires a model that supports pooled embeddings (e.g. embedding models like nomic-embed, bge, etc.). The context must have enough capacity for the total number of tokens across all texts. Causal attention is automatically disabled during computation.
A numeric matrix with nrow = length(texts) and
ncol = n_embd.
## Not run: model <- llama_load_model("embedding-model.gguf") ctx <- llama_new_context(model, n_ctx = 2048L) llama_set_causal_attn(ctx, FALSE) mat <- llama_embed_batch(ctx, c("hello world", "foo bar", "test")) # mat is a 3 x n_embd matrix ## End(Not run)## Not run: model <- llama_load_model("embedding-model.gguf") ctx <- llama_new_context(model, n_ctx = 2048L) llama_set_causal_attn(ctx, FALSE) mat <- llama_embed_batch(ctx, c("hello world", "foo bar", "test")) # mat is a 3 x n_embd matrix ## End(Not run)
Runs the model in embeddings mode and returns the hidden-state vector of the last token. Note: meaningful only for models that support embeddings.
llama_embeddings(ctx, text)llama_embeddings(ctx, text)
ctx |
Context handle returned by [llama_new_context] |
text |
Character string to embed |
A numeric vector of length n_embd (the model's embedding
dimension) containing the hidden-state representation of the input text.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) emb1 <- llama_embeddings(ctx, "Hello world") emb2 <- llama_embeddings(ctx, "Hi there") # Cosine similarity similarity <- sum(emb1 * emb2) / (sqrt(sum(emb1^2)) * sqrt(sum(emb2^2))) cat("Similarity:", similarity, "\n") ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) emb1 <- llama_embeddings(ctx, "Hello world") emb2 <- llama_embeddings(ctx, "Hi there") # Cosine similarity similarity <- sum(emb1 * emb2) / (sqrt(sum(emb1^2)) * sqrt(sum(emb2^2))) cat("Similarity:", similarity, "\n") ## End(Not run)
Runs the encoder pass for encoder-decoder architectures (e.g. T5, BART). The encoder output is stored internally and used by subsequent decoder calls.
llama_encode(ctx, tokens)llama_encode(ctx, tokens)
ctx |
A context pointer (llama_context). |
tokens |
Integer vector of token IDs to encode. |
Integer return code (0 = success, negative = error).
## Not run: model <- llama_load_model("t5-model.gguf") ctx <- llama_new_context(model) tokens <- llama_tokenize(ctx, "Hello world") llama_encode(ctx, tokens) ## End(Not run)## Not run: model <- llama_load_model("t5-model.gguf") ctx <- llama_new_context(model) tokens <- llama_tokenize(ctx, "Hello world") llama_encode(ctx, tokens) ## End(Not run)
Free an inference context
llama_free_context(ctx)llama_free_context(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects. Releases the memory associated with the inference context.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) # ... use context ... llama_free_context(ctx) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) # ... use context ... llama_free_context(ctx) ## End(Not run)
Free a loaded model
llama_free_model(model)llama_free_model(model)
model |
Model handle returned by [llama_load_model] |
No return value, called for side effects. Releases the memory associated with the model.
## Not run: model <- llama_load_model("model.gguf") # ... use model ... llama_free_model(model) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") # ... use model ... llama_free_model(model) ## End(Not run)
Sets up sampling and prefills the prompt, returning an opaque state handle that is pulled one chunk at a time with [llama_gen_next]. This is the streaming counterpart to [llama_generate]: same sampler chain and the same output for a given seed, but text arrives incrementally so it can be pushed into an SSE stream as it is produced.
llama_gen_begin( ctx, prompt, max_new_tokens = 256L, temp = 0.8, top_k = 50L, top_p = 0.9, seed = 42L, min_p = 0, typical_p = 1, repeat_penalty = 1, repeat_last_n = 64L, frequency_penalty = 0, presence_penalty = 0, mirostat = 0L, mirostat_tau = 5, mirostat_eta = 0.1, grammar = NULL )llama_gen_begin( ctx, prompt, max_new_tokens = 256L, temp = 0.8, top_k = 50L, top_p = 0.9, seed = 42L, min_p = 0, typical_p = 1, repeat_penalty = 1, repeat_last_n = 64L, frequency_penalty = 0, presence_penalty = 0, mirostat = 0L, mirostat_tau = 5, mirostat_eta = 0.1, grammar = NULL )
ctx |
Context handle returned by [llama_new_context] |
prompt |
Character string prompt |
max_new_tokens |
Maximum number of tokens to generate |
temp |
Sampling temperature. 0 = greedy decoding. |
top_k |
Top-K filtering (0 = disabled) |
top_p |
Top-P (nucleus) filtering (1.0 = disabled) |
seed |
Random seed for sampling |
min_p |
Min-P filtering threshold (0.0 = disabled) |
typical_p |
Locally typical sampling threshold (1.0 = disabled) |
repeat_penalty |
Repetition penalty (1.0 = disabled) |
repeat_last_n |
Number of last tokens to penalize (0 = disabled, -1 = context size) |
frequency_penalty |
Frequency penalty (0.0 = disabled) |
presence_penalty |
Presence penalty (0.0 = disabled) |
mirostat |
Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
mirostat_tau |
Mirostat target entropy (tau parameter) |
mirostat_eta |
Mirostat learning rate (eta parameter) |
grammar |
GBNF grammar string for constrained generation (NULL = disabled) |
Typical loop:
st <- llama_gen_begin(ctx, prompt)
repeat {
chunk <- llama_gen_next(st)
if (is.null(chunk)) break
cat(chunk)
}
cat(llama_gen_end(st)) # flush any held-back trailing bytes
Only one streaming generation may be active per context at a time: each
call to llama_gen_begin clears the context KV cache.
An external pointer holding the generation state. Pass it to [llama_gen_next] and [llama_gen_end]. The underlying sampler is freed automatically by the garbage collector.
[llama_gen_next], [llama_gen_end], [llama_generate]
Marks the generation done and returns any bytes still held in the internal
UTF-8 carry buffer (the tail of an unfinished character, if generation
stopped mid-character). Concatenating every [llama_gen_next] chunk followed
by the llama_gen_end result reproduces the full [llama_generate]
output for the same seed and parameters. Safe to call more than once.
llama_gen_end(state)llama_gen_end(state)
state |
Generation state handle from [llama_gen_begin]. |
A length-1 UTF-8 character vector with any remaining buffered text
(often "").
[llama_gen_begin], [llama_gen_next]
Advances a generation started with [llama_gen_begin] by one token and
returns the next chunk of decoded text. A possibly-incomplete trailing
UTF-8 character is held back until enough bytes arrive, so every returned
chunk is valid UTF-8 (the chunk may be "" when the only new byte is
part of an unfinished character).
llama_gen_next(state)llama_gen_next(state)
state |
Generation state handle from [llama_gen_begin]. |
A length-1 UTF-8 character vector with the next chunk, or
NULL when generation has finished (end-of-generation token reached
or max_new_tokens exhausted). After NULL, call
[llama_gen_end] to flush any remaining bytes.
[llama_gen_begin], [llama_gen_end]
Tokenizes the prompt, runs the full autoregressive decode loop with sampling, and returns the generated text (excluding the original prompt).
llama_generate( ctx, prompt, max_new_tokens = 256L, temp = 0.8, top_k = 50L, top_p = 0.9, seed = 42L, min_p = 0, typical_p = 1, repeat_penalty = 1, repeat_last_n = 64L, frequency_penalty = 0, presence_penalty = 0, mirostat = 0L, mirostat_tau = 5, mirostat_eta = 0.1, grammar = NULL, with_timings = FALSE )llama_generate( ctx, prompt, max_new_tokens = 256L, temp = 0.8, top_k = 50L, top_p = 0.9, seed = 42L, min_p = 0, typical_p = 1, repeat_penalty = 1, repeat_last_n = 64L, frequency_penalty = 0, presence_penalty = 0, mirostat = 0L, mirostat_tau = 5, mirostat_eta = 0.1, grammar = NULL, with_timings = FALSE )
ctx |
Context handle returned by [llama_new_context] |
prompt |
Character string prompt |
max_new_tokens |
Maximum number of tokens to generate |
temp |
Sampling temperature. 0 = greedy decoding. |
top_k |
Top-K filtering (0 = disabled) |
top_p |
Top-P (nucleus) filtering (1.0 = disabled) |
seed |
Random seed for sampling |
min_p |
Min-P filtering threshold (0.0 = disabled) |
typical_p |
Locally typical sampling threshold (1.0 = disabled) |
repeat_penalty |
Repetition penalty (1.0 = disabled) |
repeat_last_n |
Number of last tokens to penalize (0 = disabled, -1 = context size) |
frequency_penalty |
Frequency penalty (0.0 = disabled) |
presence_penalty |
Presence penalty (0.0 = disabled) |
mirostat |
Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
mirostat_tau |
Mirostat target entropy (tau parameter) |
mirostat_eta |
Mirostat learning rate (eta parameter) |
grammar |
GBNF grammar string for constrained generation (NULL = disabled) |
with_timings |
If TRUE, attach a named numeric vector of per-stage timings (in ms) as attribute "timings" of the returned text. Stages: tokenize, build_sampler, kv_clear, prefill_dispatch, prefill_sync, gpu_sync (cumulative across decode-loop iterations), sample (cumulative), decode_dispatch (cumulative), detokenize, plus n_iterations and t_total_ms. Adds llama_synchronize calls inside the loop, so it is intended for profiling and may slightly slow generation. |
A character scalar containing the generated text (excluding the original prompt).
## Not run: model <- llama_load_model("model.gguf", n_gpu_layers = -1L) ctx <- llama_new_context(model, n_ctx = 2048L) # Basic generation result <- llama_generate(ctx, "Once upon a time") cat(result) # Greedy decoding (deterministic) result <- llama_generate(ctx, "The answer is", temp = 0) # More creative output result <- llama_generate(ctx, "Write a poem about R:", max_new_tokens = 100L, temp = 1.0, top_p = 0.95) # With repetition penalty result <- llama_generate(ctx, "List items:", repeat_penalty = 1.1, repeat_last_n = 64L) # JSON output with grammar result <- llama_generate(ctx, "Output JSON:", grammar = 'root ::= "{" "}" ') ## End(Not run)## Not run: model <- llama_load_model("model.gguf", n_gpu_layers = -1L) ctx <- llama_new_context(model, n_ctx = 2048L) # Basic generation result <- llama_generate(ctx, "Once upon a time") cat(result) # Greedy decoding (deterministic) result <- llama_generate(ctx, "The answer is", temp = 0) # More creative output result <- llama_generate(ctx, "Write a poem about R:", max_new_tokens = 100L, temp = 1.0, top_p = 0.95) # With repetition penalty result <- llama_generate(ctx, "List items:", repeat_penalty = 1.1, repeat_last_n = 64L) # JSON output with grammar result <- llama_generate(ctx, "Output JSON:", grammar = 'root ::= "{" "}" ') ## End(Not run)
Runs continuous batching: all prompts share the same decode loop, so each
iteration dispatches one matmul over all still-running sequences. This
converts decode from memory-bound vector ops into compute-bound matrix ops
on the GPU and lifts throughput compared to calling llama_generate
in a loop.
llama_generate_batch( ctx, prompts, max_new_tokens = 256L, temp = 0.8, top_k = 50L, top_p = 0.9, seed = 42L, min_p = 0, typical_p = 1, repeat_penalty = 1, repeat_last_n = 64L, frequency_penalty = 0, presence_penalty = 0, grammar = NULL )llama_generate_batch( ctx, prompts, max_new_tokens = 256L, temp = 0.8, top_k = 50L, top_p = 0.9, seed = 42L, min_p = 0, typical_p = 1, repeat_penalty = 1, repeat_last_n = 64L, frequency_penalty = 0, presence_penalty = 0, grammar = NULL )
ctx |
Context handle returned by [llama_new_context], created with
sufficient |
prompts |
Character vector of prompts, one per parallel sequence. |
max_new_tokens, temp, top_k, top_p, seed, min_p, typical_p, repeat_penalty, repeat_last_n, frequency_penalty, presence_penalty, grammar
|
Sampling parameters; see |
The context must be created with n_seq_max >= length(prompts) and
n_ctx large enough to hold every prompt plus its generated tokens
simultaneously. As a rule of thumb:
n_ctx >= sum(prompt_lengths) + length(prompts) * max_new_tokens.
Each sequence gets its own sampler chain seeded with seed + seq_index,
so identical prompts still produce diverse outputs at temp > 0
(useful for self-consistency sampling). Sampler hyperparameters are shared
across sequences in this version.
Stop conditions per sequence: end-of-generation token (model-defined) or
max_new_tokens reached. Mirostat and with_timings are not
supported here yet — use llama_generate for those.
A list of length length(prompts), in the same order as the
input. Each element is a list with fields:
text: character scalar with the generated text
n_tokens: integer count of tokens generated
finished_reason: "eos" or "max_tokens"
## Not run: model <- llama_load_model("model.gguf", n_gpu_layers = -1L) # 4 parallel sequences, up to 256 new tokens each ctx <- llama_new_context(model, n_ctx = 4096L, n_seq_max = 4L, flash_attn = "on") # Batch classification prompts <- c("Classify: 'great movie' as positive/negative.", "Classify: 'awful service' as positive/negative.", "Classify: 'just okay' as positive/negative.", "Classify: 'loved every minute' as positive/negative.") out <- llama_generate_batch(ctx, prompts, max_new_tokens = 16L, temp = 0) vapply(out, `[[`, character(1), "text") # Self-consistency sampling: same prompt repeated samples <- llama_generate_batch(ctx, rep("2 + 2 =", 4L), max_new_tokens = 8L, temp = 0.7) ## End(Not run)## Not run: model <- llama_load_model("model.gguf", n_gpu_layers = -1L) # 4 parallel sequences, up to 256 new tokens each ctx <- llama_new_context(model, n_ctx = 4096L, n_seq_max = 4L, flash_attn = "on") # Batch classification prompts <- c("Classify: 'great movie' as positive/negative.", "Classify: 'awful service' as positive/negative.", "Classify: 'just okay' as positive/negative.", "Classify: 'loved every minute' as positive/negative.") out <- llama_generate_batch(ctx, prompts, max_new_tokens = 16L, temp = 0) vapply(out, `[[`, character(1), "text") # Self-consistency sampling: same prompt repeated samples <- llama_generate_batch(ctx, rep("2 + 2 =", 4L), max_new_tokens = 8L, temp = 0.7) ## End(Not run)
Returns a matrix of shape n_outputs × n_embd containing the raw
embedding vectors for all tokens whose logits flag was set in the batch.
Only works when pooling_type == "none" (generative models or embedding
contexts without pooling). For pooled embeddings use [llama_get_embeddings_seq].
llama_get_embeddings(ctx, n_outputs)llama_get_embeddings(ctx, n_outputs)
ctx |
Context handle returned by [llama_new_context] |
n_outputs |
Number of outputs requested in the last decode call
(i.e. how many tokens had |
A numeric matrix with n_outputs rows and n_embd columns.
Returns the embedding vector for a specific token position after a decode call with embeddings enabled. Negative indices count from the end (-1 = last token).
llama_get_embeddings_ith(ctx, i)llama_get_embeddings_ith(ctx, i)
ctx |
Context handle returned by [llama_new_context] |
i |
Integer index of the token (0-based, or negative for reverse indexing) |
A numeric vector of length n_embd.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) llama_generate(ctx, "Hello world", max_new_tokens = 1L) # Get the embedding of the last decoded token emb <- llama_get_embeddings_ith(ctx, -1L) cat("Embedding dim:", length(emb), "\n") ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) llama_generate(ctx, "Hello world", max_new_tokens = 1L) # Get the embedding of the last decoded token emb <- llama_get_embeddings_ith(ctx, -1L) cat("Embedding dim:", length(emb), "\n") ## End(Not run)
Returns the pooled embedding vector for a given sequence ID after a batch decode. Only works when the model supports pooling (embedding models).
llama_get_embeddings_seq(ctx, seq_id)llama_get_embeddings_seq(ctx, seq_id)
ctx |
Context handle returned by [llama_new_context] with
|
seq_id |
Integer sequence ID (0-based) |
A numeric vector of length n_embd.
## Not run: # Get pooled embedding for sequence 0 (requires embedding context) model <- llama_load_model("nomic-embed.gguf") ctx <- llama_new_context(model, embedding = TRUE) mat <- llama_embed_batch(ctx, "Hello world") emb <- llama_get_embeddings_seq(ctx, 0L) cat("Pooled embedding dim:", length(emb), "\n") ## End(Not run)## Not run: # Get pooled embedding for sequence 0 (requires embedding context) model <- llama_load_model("nomic-embed.gguf") ctx <- llama_new_context(model, embedding = TRUE) mat <- llama_embed_batch(ctx, "Hello world") emb <- llama_get_embeddings_seq(ctx, 0L) cat("Pooled embedding dim:", length(emb), "\n") ## End(Not run)
Returns the raw logit vector (unnormalized log-probabilities) from the last token position after a decode operation.
llama_get_logits(ctx)llama_get_logits(ctx)
ctx |
Context handle returned by [llama_new_context] |
A numeric vector of length n_vocab containing the logits.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) result <- llama_generate(ctx, "The capital of France is", max_new_tokens = 1L) logits <- llama_get_logits(ctx) # Find top token top_id <- which.max(logits) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) result <- llama_generate(ctx, "The capital of France is", max_new_tokens = 1L) logits <- llama_get_logits(ctx) # Find top token top_id <- which.max(logits) ## End(Not run)
Returns the logit vector for token at index i in the last decoded batch.
Use i = -1 to get the logits for the last token.
llama_get_logits_ith(ctx, i)llama_get_logits_ith(ctx, i)
ctx |
Context handle returned by [llama_new_context] |
i |
Integer index into the last batch (0-based). Use |
A numeric vector of length n_vocab.
Returns the model handle that was used to create this context. The returned object is the same R external pointer that was passed to [llama_new_context] — no new allocation occurs.
llama_get_model(ctx)llama_get_model(ctx)
ctx |
Context handle returned by [llama_new_context] |
A model handle (external pointer), equivalent to the original handle returned by [llama_load_model].
Get current verbosity level
llama_get_verbosity()llama_get_verbosity()
An integer scalar indicating the current verbosity level (0 = silent, 1 = errors only, 2 = normal, 3 = verbose).
# Save current level, suppress output, then restore old <- llama_get_verbosity() llama_set_verbosity(0) # ... noisy operations ... llama_set_verbosity(old)# Save current level, suppress output, then restore old <- llama_get_verbosity() llama_set_verbosity(0) # ... noisy operations ... llama_set_verbosity(old)
Removes cached model files. Can clear the entire cache or only files from a specific repository.
llama_hf_cache_clear(repo_id = NULL, confirm = TRUE, cache_dir = NULL)llama_hf_cache_clear(repo_id = NULL, confirm = TRUE, cache_dir = NULL)
repo_id |
Character or |
confirm |
Logical. If |
cache_dir |
Character or |
Invisible NULL. Called for its side effect of deleting
cached files.
llama_hf_cache_clear(confirm = FALSE)llama_hf_cache_clear(confirm = FALSE)
Returns the path to the directory where models downloaded from Hugging Face are cached. The directory is created if it does not exist.
llama_hf_cache_dir()llama_hf_cache_dir()
A character string containing the absolute path to the cache
directory. The path follows the R user directory convention via
R_user_dir.
llama_hf_cache_dir()llama_hf_cache_dir()
Lists all cached model files with their sizes and download metadata.
llama_hf_cache_info(cache_dir = NULL)llama_hf_cache_info(cache_dir = NULL)
cache_dir |
Character or |
A data frame with columns:
Character. The Hugging Face repository identifier.
Character. The model file name.
Numeric. File size in bytes.
Character. Human-readable file size.
Character. Absolute path to the cached file.
Character. Timestamp of when the file was downloaded.
Returns an empty data frame with the same columns if the cache is empty.
llama_hf_cache_info()llama_hf_cache_info()
Downloads a GGUF model file from a Hugging Face repository. Files are cached locally so subsequent calls return the cached path without re-downloading.
llama_hf_download( repo_id, filename = NULL, pattern = NULL, tag = NULL, token = NULL, cache_dir = NULL, revision = "main", force = FALSE )llama_hf_download( repo_id, filename = NULL, pattern = NULL, tag = NULL, token = NULL, cache_dir = NULL, revision = "main", force = FALSE )
repo_id |
Character. Hugging Face repository in |
filename |
Character or |
pattern |
Character or |
tag |
Character or |
token |
Character or |
cache_dir |
Character or |
revision |
Character. Git revision (branch/tag/commit). Defaults to
|
force |
Logical. If |
Exactly one of filename, pattern, or tag must be
specified to identify which file to download.
A character string containing the absolute path to the downloaded (or cached) GGUF model file.
## Not run: path <- llama_hf_download("TheBloke/Llama-2-7B-GGUF", pattern = "*q2_k*") print(path) ## End(Not run)## Not run: path <- llama_hf_download("TheBloke/Llama-2-7B-GGUF", pattern = "*q2_k*") print(path) ## End(Not run)
Queries the Hugging Face API for GGUF model files in the specified repository. Returns a data frame with file names, sizes, and detected quantization levels.
llama_hf_list(repo_id, token = NULL, pattern = NULL)llama_hf_list(repo_id, token = NULL, pattern = NULL)
repo_id |
Character. Hugging Face repository in |
token |
Character or |
pattern |
Character or |
A data frame with columns:
Character. The file name within the repository.
Numeric. File size in bytes.
Character. Human-readable file size.
Character. Detected quantization level (e.g. "Q4_K_M")
or NA if not detected.
files <- llama_hf_list("TheBloke/Llama-2-7B-GGUF") print(files)files <- llama_hf_list("TheBloke/Llama-2-7B-GGUF") print(files)
Load a GGUF model file
llama_load_model( path, n_gpu_layers = -1L, devices = NULL, split_mode = "layer", use_mmap = TRUE, use_mlock = FALSE )llama_load_model( path, n_gpu_layers = -1L, devices = NULL, split_mode = "layer", use_mmap = TRUE, use_mlock = FALSE )
path |
Path to the .gguf model file |
n_gpu_layers |
Number of layers to offload to GPU
( |
devices |
Character vector of device names or types to use for offloading.
|
split_mode |
Multi-GPU split strategy: |
use_mmap |
Logical; map model file into memory (default |
use_mlock |
Logical; force the OS to keep model pages resident
(default |
An external pointer (class externalptr) wrapping the loaded
model. This handle is required by llama_new_context,
llama_model_info, and other model-level functions.
Freed automatically by the garbage collector or manually via
llama_free_model.
## Not run: # Default: full GPU offload (falls back to CPU if no GPU) model <- llama_load_model("model.gguf") # Force CPU-only model <- llama_load_model("model.gguf", n_gpu_layers = 0L) # Explicit CPU-only backend model <- llama_load_model("model.gguf", devices = "cpu") # Specific GPU device (see llama_backend_devices()) model <- llama_load_model("model.gguf", n_gpu_layers = -1L, devices = "Vulkan0") # Multi-GPU: use two devices model <- llama_load_model("model.gguf", n_gpu_layers = -1L, devices = c("Vulkan0", "Vulkan1")) ## End(Not run)## Not run: # Default: full GPU offload (falls back to CPU if no GPU) model <- llama_load_model("model.gguf") # Force CPU-only model <- llama_load_model("model.gguf", n_gpu_layers = 0L) # Explicit CPU-only backend model <- llama_load_model("model.gguf", devices = "cpu") # Specific GPU device (see llama_backend_devices()) model <- llama_load_model("model.gguf", n_gpu_layers = -1L, devices = "Vulkan0") # Multi-GPU: use two devices model <- llama_load_model("model.gguf", n_gpu_layers = -1L, devices = c("Vulkan0", "Vulkan1")) ## End(Not run)
Convenience function that downloads a GGUF model from Hugging Face (if not
already cached) and loads it via llama_load_model.
llama_load_model_hf(repo_id, ..., n_gpu_layers = 0L)llama_load_model_hf(repo_id, ..., n_gpu_layers = 0L)
repo_id |
Character. Hugging Face repository in |
... |
Additional arguments passed to |
n_gpu_layers |
Integer. Number of layers to offload to GPU.
Use |
An external pointer to the loaded model, as returned by
llama_load_model.
## Not run: model <- llama_load_model_hf("TheBloke/Llama-2-7B-GGUF", pattern = "*q2_k*") ## End(Not run)## Not run: model <- llama_load_model_hf("TheBloke/Llama-2-7B-GGUF", pattern = "*q2_k*") ## End(Not run)
Activates a loaded LoRA adapter for the given context. Multiple LoRA adapters can be applied simultaneously.
llama_lora_apply(ctx, lora, scale = 1)llama_lora_apply(ctx, lora, scale = 1)
ctx |
Context handle returned by [llama_new_context] |
lora |
LoRA adapter handle from [llama_lora_load] |
scale |
Scaling factor for the adapter (1.0 = full effect, 0.5 = half effect) |
No return value, called for side effects. Activates the LoRA adapter for the given context.
## Not run: model <- llama_load_model("base-model.gguf") lora <- llama_lora_load(model, "adapter.gguf") ctx <- llama_new_context(model) # Apply with full strength llama_lora_apply(ctx, lora, scale = 1.0) # Or apply with reduced effect llama_lora_apply(ctx, lora, scale = 0.5) ## End(Not run)## Not run: model <- llama_load_model("base-model.gguf") lora <- llama_lora_load(model, "adapter.gguf") ctx <- llama_new_context(model) # Apply with full strength llama_lora_apply(ctx, lora, scale = 1.0) # Or apply with reduced effect llama_lora_apply(ctx, lora, scale = 0.5) ## End(Not run)
Deactivates all LoRA adapters from the context, returning to base model behavior.
llama_lora_clear(ctx)llama_lora_clear(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects. Removes all active LoRA adapters from the context.
## Not run: # Apply multiple LoRAs llama_lora_apply(ctx, lora1) llama_lora_apply(ctx, lora2) # Remove all at once llama_lora_clear(ctx) ## End(Not run)## Not run: # Apply multiple LoRAs llama_lora_apply(ctx, lora1) llama_lora_apply(ctx, lora2) # Remove all at once llama_lora_clear(ctx) ## End(Not run)
Loads a LoRA (Low-Rank Adaptation) adapter file that can be applied to modify the model's behavior without changing the base weights.
llama_lora_load(model, path)llama_lora_load(model, path)
model |
Model handle returned by [llama_load_model] |
path |
Path to the LoRA adapter file (.gguf or .bin) |
An external pointer (class externalptr) wrapping the loaded
LoRA (Low-Rank Adaptation) adapter. Pass this handle to
llama_lora_apply to activate the adapter.
## Not run: model <- llama_load_model("base-model.gguf") lora <- llama_lora_load(model, "fine-tuned-adapter.gguf") ctx <- llama_new_context(model) llama_lora_apply(ctx, lora, scale = 1.0) # Now generation uses the LoRA-modified model result <- llama_generate(ctx, "Hello") ## End(Not run)## Not run: model <- llama_load_model("base-model.gguf") lora <- llama_lora_load(model, "fine-tuned-adapter.gguf") ctx <- llama_new_context(model) llama_lora_apply(ctx, lora, scale = 1.0) # Now generation uses the LoRA-modified model result <- llama_generate(ctx, "Hello") ## End(Not run)
Deactivates a specific LoRA adapter from the context.
llama_lora_remove(ctx, lora)llama_lora_remove(ctx, lora)
ctx |
Context handle returned by [llama_new_context] |
lora |
LoRA adapter handle to remove |
An integer scalar: 0 on success, -1 if the adapter was not applied to this context.
## Not run: # Remove a specific adapter while keeping others active llama_lora_remove(ctx, lora) result <- llama_generate(ctx, "Without adapter: ", max_new_tokens = 20L) ## End(Not run)## Not run: # Remove a specific adapter while keeping others active llama_lora_remove(ctx, lora) result <- llama_generate(ctx, "Without adapter: ", max_new_tokens = 20L) ## End(Not run)
Get maximum number of devices
llama_max_devices()llama_max_devices()
An integer scalar: the maximum number of compute devices available.
# Query the maximum number of devices supported by the backend n <- llama_max_devices() cat("Max devices:", n, "\n")# Query the maximum number of devices supported by the backend n <- llama_max_devices() cat("Max devices:", n, "\n")
Prints a debug summary of how model weights are distributed across compute devices (CPU, GPU layers). Useful for diagnosing memory allocation with partial GPU offload.
llama_memory_breakdown_print(ctx)llama_memory_breakdown_print(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects.
Check if the KV cache supports shifting
llama_memory_can_shift(ctx)llama_memory_can_shift(ctx)
ctx |
Context handle returned by [llama_new_context] |
A logical scalar: TRUE if the memory supports position shifting.
## Not run: if (llama_memory_can_shift(ctx)) { message("Context shifting is supported") } ## End(Not run)## Not run: if (llama_memory_can_shift(ctx)) { message("Context shifting is supported") } ## End(Not run)
Removes all tokens from the KV cache. Call this before starting a new generation from scratch.
llama_memory_clear(ctx)llama_memory_clear(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects.
## Not run: # Clear the KV cache to start a fresh conversation llama_memory_clear(ctx) result <- llama_generate(ctx, "New topic: ", max_new_tokens = 50L) ## End(Not run)## Not run: # Clear the KV cache to start a fresh conversation llama_memory_clear(ctx) result <- llama_generate(ctx, "New topic: ", max_new_tokens = 50L) ## End(Not run)
Adds a position delta to all tokens in the given sequence within [p0, p1). This is useful for implementing context shifting (sliding window).
llama_memory_seq_add(ctx, seq_id, p0, p1, delta)llama_memory_seq_add(ctx, seq_id, p0, p1, delta)
ctx |
Context handle returned by [llama_new_context] |
seq_id |
Sequence ID |
p0 |
Start position (inclusive) |
p1 |
End position (exclusive) |
delta |
Position shift amount (can be negative) |
No return value, called for side effects.
## Not run: # Shift positions left by 100 for context window management llama_memory_seq_add(ctx, seq_id = 0L, p0 = 100L, p1 = -1L, delta = -100L) ## End(Not run)## Not run: # Shift positions left by 100 for context window management llama_memory_seq_add(ctx, seq_id = 0L, p0 = 100L, p1 = -1L, delta = -100L) ## End(Not run)
Copies cached tokens from one sequence to another in the position range [p0, p1).
llama_memory_seq_cp(ctx, seq_id_src, seq_id_dst, p0 = -1L, p1 = -1L)llama_memory_seq_cp(ctx, seq_id_src, seq_id_dst, p0 = -1L, p1 = -1L)
ctx |
Context handle returned by [llama_new_context] |
seq_id_src |
Source sequence ID |
seq_id_dst |
Destination sequence ID |
p0 |
Start position (inclusive, -1 for beginning) |
p1 |
End position (exclusive, -1 for end) |
No return value, called for side effects.
## Not run: # Copy sequence 0 to sequence 1 llama_memory_seq_cp(ctx, seq_id_src = 0L, seq_id_dst = 1L, p0 = -1L, p1 = -1L) ## End(Not run)## Not run: # Copy sequence 0 to sequence 1 llama_memory_seq_cp(ctx, seq_id_src = 0L, seq_id_dst = 1L, p0 = -1L, p1 = -1L) ## End(Not run)
Divides all token positions in the range [p0, p1) for the given
sequence by d. Use p0 = -1 and p1 = -1 for the full range.
Useful for implementing sliding-window context compression.
llama_memory_seq_div(ctx, seq_id, p0, p1, d)llama_memory_seq_div(ctx, seq_id, p0, p1, d)
ctx |
Context handle returned by [llama_new_context] |
seq_id |
Sequence ID |
p0 |
Start position (inclusive). Use -1 for beginning. |
p1 |
End position (exclusive). Use -1 for end. |
d |
Divisor (positive integer) |
No return value, called for side effects.
Removes all sequences except the specified one from the KV cache.
llama_memory_seq_keep(ctx, seq_id)llama_memory_seq_keep(ctx, seq_id)
ctx |
Context handle returned by [llama_new_context] |
seq_id |
Sequence ID to keep |
No return value, called for side effects.
## Not run: llama_memory_seq_keep(ctx, seq_id = 0L) ## End(Not run)## Not run: llama_memory_seq_keep(ctx, seq_id = 0L) ## End(Not run)
Returns the minimum and maximum token positions for a given sequence in the KV cache.
llama_memory_seq_pos_range(ctx, seq_id)llama_memory_seq_pos_range(ctx, seq_id)
ctx |
Context handle returned by [llama_new_context] |
seq_id |
Sequence ID |
A named integer vector with elements min and max.
## Not run: range <- llama_memory_seq_pos_range(ctx, seq_id = 0L) cat("Positions:", range["min"], "to", range["max"], "\n") ## End(Not run)## Not run: range <- llama_memory_seq_pos_range(ctx, seq_id = 0L) cat("Positions:", range["min"], "to", range["max"], "\n") ## End(Not run)
Removes cached tokens for the given sequence in the position range [p0, p1). Use p0 = -1 and p1 = -1 to remove all tokens for the sequence.
llama_memory_seq_rm(ctx, seq_id, p0 = -1L, p1 = -1L)llama_memory_seq_rm(ctx, seq_id, p0 = -1L, p1 = -1L)
ctx |
Context handle returned by [llama_new_context] |
seq_id |
Sequence ID (integer) |
p0 |
Start position (inclusive, -1 for beginning) |
p1 |
End position (exclusive, -1 for end) |
A logical scalar: TRUE if tokens were successfully removed.
## Not run: # Remove all tokens from sequence 0 llama_memory_seq_rm(ctx, seq_id = 0L, p0 = -1L, p1 = -1L) ## End(Not run)## Not run: # Remove all tokens from sequence 0 llama_memory_seq_rm(ctx, seq_id = 0L, p0 = -1L, p1 = -1L) ## End(Not run)
Get model metadata
llama_model_info(model)llama_model_info(model)
model |
Model handle returned by [llama_load_model] |
A named list with fields: - 'n_ctx_train': context size the model was trained with - 'n_embd': embedding dimension - 'n_vocab': vocabulary size - 'n_layer': number of layers - 'n_head': number of attention heads - 'n_head_kv': number of key-value attention heads (GQA) - 'desc': human-readable model description string - 'size': model size in bytes - 'n_params': number of parameters - 'has_encoder': whether the model has an encoder - 'has_decoder': whether the model has a decoder - 'is_recurrent': whether the model is recurrent (e.g. Mamba)
## Not run: model <- llama_load_model("model.gguf") info <- llama_model_info(model) cat("Model:", info$desc, "\n") cat("Layers:", info$n_layer, "\n") cat("Context:", info$n_ctx_train, "\n") cat("Size:", info$size / 1e9, "GB\n") ## End(Not run)## Not run: model <- llama_load_model("model.gguf") info <- llama_model_info(model) cat("Model:", info$desc, "\n") cat("Layers:", info$n_layer, "\n") cat("Context:", info$n_ctx_train, "\n") cat("Size:", info$size / 1e9, "GB\n") ## End(Not run)
Returns all key-value metadata pairs stored in the GGUF model file.
llama_model_meta(model)llama_model_meta(model)
model |
Model handle returned by [llama_load_model] |
A named character vector where names are metadata keys and values are the corresponding metadata values.
## Not run: model <- llama_load_model("model.gguf") meta <- llama_model_meta(model) print(meta) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") meta <- llama_model_meta(model) print(meta) ## End(Not run)
Get a single model metadata value by key
llama_model_meta_val(model, key)llama_model_meta_val(model, key)
model |
Model handle returned by [llama_load_model] |
key |
Character string metadata key (e.g. "general.name", "general.architecture") |
A character scalar with the metadata value, or NULL if the key
does not exist.
## Not run: model <- llama_load_model("model.gguf") llama_model_meta_val(model, "general.name") llama_model_meta_val(model, "general.architecture") ## End(Not run)## Not run: model <- llama_load_model("model.gguf") llama_model_meta_val(model, "general.name") llama_model_meta_val(model, "general.architecture") ## End(Not run)
Get logical batch size
llama_n_batch(ctx)llama_n_batch(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: the logical batch size (max tokens per 'llama_decode' call).
Get context window size
llama_n_ctx(ctx)llama_n_ctx(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: the context window size (number of tokens).
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model, n_ctx = 4096L) llama_n_ctx(ctx) # 4096 ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model, n_ctx = 4096L) llama_n_ctx(ctx) # 4096 ## End(Not run)
Get per-sequence context window size
llama_n_ctx_seq(ctx)llama_n_ctx_seq(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: maximum context size per sequence.
Get maximum number of sequences
llama_n_seq_max(ctx)llama_n_seq_max(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: maximum number of concurrent sequences.
Get number of threads for single-token generation
llama_n_threads(ctx)llama_n_threads(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: current thread count for generation.
Get number of threads for batch processing
llama_n_threads_batch(ctx)llama_n_threads_batch(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: current thread count for prompt encoding.
Get physical micro-batch size
llama_n_ubatch(ctx)llama_n_ubatch(ctx)
ctx |
Context handle returned by [llama_new_context] |
An integer scalar: the physical micro-batch size.
Create an inference context
llama_new_context( model, n_ctx = 2048L, n_threads = NULL, n_threads_batch = NULL, n_batch = 2048L, n_ubatch = 512L, n_seq_max = 1L, flash_attn = "auto", embedding = FALSE )llama_new_context( model, n_ctx = 2048L, n_threads = NULL, n_threads_batch = NULL, n_batch = 2048L, n_ubatch = 512L, n_seq_max = 1L, flash_attn = "auto", embedding = FALSE )
model |
Model handle returned by [llama_load_model] |
n_ctx |
Context window size (number of tokens). 0 means use the model's trained value. |
n_threads |
Number of CPU threads for single-token decode. |
n_threads_batch |
Number of CPU threads for batch (prompt) processing.
|
n_batch |
Logical maximum batch size submitted to a single decode call
(tokens). Default |
n_ubatch |
Physical micro-batch size used inside decode. Larger values
improve prefill throughput on GPU at the cost of memory. Default |
n_seq_max |
Maximum number of parallel sequences the context can hold
simultaneously (KV cache is partitioned across them). Default |
flash_attn |
One of |
embedding |
Logical; if |
An external pointer (class externalptr) wrapping the inference
context. This handle is required by generation, tokenization, and embedding
functions. Freed automatically by the garbage collector or manually via
llama_free_context.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model, n_ctx = 4096L, n_threads = 8L) # ... use context for generation ... llama_free_context(ctx) llama_free_model(model) # Tune for GPU prefill throughput ctx <- llama_new_context(model, n_ctx = 4096L, n_ubatch = 2048L, flash_attn = "on") # Embedding mode emb_ctx <- llama_new_context(model, n_ctx = 512L, embedding = TRUE) mat <- llama_embed_batch(emb_ctx, c("hello", "world")) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model, n_ctx = 4096L, n_threads = 8L) # ... use context for generation ... llama_free_context(ctx) llama_free_model(model) # Tune for GPU prefill throughput ctx <- llama_new_context(model, n_ctx = 4096L, n_ubatch = 2048L, flash_attn = "on") # Embedding mode emb_ctx <- llama_new_context(model, n_ctx = 512L, embedding = TRUE) mat <- llama_embed_batch(emb_ctx, c("hello", "world")) ## End(Not run)
Call once for better performance on NUMA systems.
llama_numa_init(strategy = "disabled")llama_numa_init(strategy = "disabled")
strategy |
NUMA strategy: |
No return value, called for side effects.
## Not run: # On multi-socket servers, distribute memory across NUMA nodes # for better memory bandwidth during inference llama_numa_init("distribute") # Call before loading any models — affects all subsequent allocations model <- llama_load_model("model.gguf", n_gpu_layers = 0L) ## End(Not run)## Not run: # On multi-socket servers, distribute memory across NUMA nodes # for better memory bandwidth during inference llama_numa_init("distribute") # Call before loading any models — affects all subsequent allocations model <- llama_load_model("model.gguf", n_gpu_layers = 0L) ## End(Not run)
Returns timing and count statistics for the current context, including prompt processing time, token generation time, and counts.
llama_perf(ctx)llama_perf(ctx)
ctx |
Context handle returned by [llama_new_context] |
A named list with fields: - 't_load_ms': model load time in milliseconds - 't_p_eval_ms': prompt processing time in milliseconds - 't_eval_ms': token generation time in milliseconds - 'n_p_eval': number of prompt tokens processed - 'n_eval': number of tokens generated - 'n_reused': number of reused compute graphs
## Not run: result <- llama_generate(ctx, "Hello world") perf <- llama_perf(ctx) cat("Prompt speed:", perf$n_p_eval / (perf$t_p_eval_ms / 1000), "tok/s\n") cat("Generation speed:", perf$n_eval / (perf$t_eval_ms / 1000), "tok/s\n") ## End(Not run)## Not run: result <- llama_generate(ctx, "Hello world") perf <- llama_perf(ctx) cat("Prompt speed:", perf$n_p_eval / (perf$t_p_eval_ms / 1000), "tok/s\n") cat("Generation speed:", perf$n_eval / (perf$t_eval_ms / 1000), "tok/s\n") ## End(Not run)
Prints a formatted summary of timing and throughput statistics for the context (load time, prompt processing speed, generation speed). Output goes to the R console via the llama.cpp logging callback.
llama_perf_print(ctx)llama_perf_print(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects.
Resets the timing and token count statistics for the context.
llama_perf_reset(ctx)llama_perf_reset(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects.
## Not run: # Reset counters before benchmarking a specific generation llama_perf_reset(ctx) result <- llama_generate(ctx, "Benchmark prompt", max_new_tokens = 100L) perf <- llama_perf(ctx) cat("Generation:", perf$n_eval / (perf$t_eval_ms / 1000), "tok/s\n") ## End(Not run)## Not run: # Reset counters before benchmarking a specific generation llama_perf_reset(ctx) result <- llama_generate(ctx, "Benchmark prompt", max_new_tokens = 100L) perf <- llama_perf(ctx) cat("Generation:", perf$n_eval / (perf$t_eval_ms / 1000), "tok/s\n") ## End(Not run)
Get pooling type
llama_pooling_type(ctx)llama_pooling_type(ctx)
ctx |
Context handle returned by [llama_new_context] |
A character string: one of '"none"', '"mean"', '"cls"', '"last"', '"rank"', '"unspecified"'.
Loads a GGUF model once and exposes it over an OpenAI-compatible HTTP
API so any OpenAI client (OpenCode, ellmer, the 'openai' Python SDK, …)
can talk to it. Implements 'GET /v1/models' and
'POST /v1/chat/completions' (both blocking and 'stream = true'). The
HTTP/SSE layer is provided by drogonR; generation runs through
llamaR's streaming API (llama_gen_begin /
llama_gen_next / llama_gen_end).
llama_serve_openai( model_path, port = 11434L, n_ctx = 4096L, n_gpu_layers = -1L, model_id = NULL, host = "127.0.0.1", template = NULL, max_tokens = 512L, ... )llama_serve_openai( model_path, port = 11434L, n_ctx = 4096L, n_gpu_layers = -1L, model_id = NULL, host = "127.0.0.1", template = NULL, max_tokens = 512L, ... )
model_path |
Path to a GGUF model file. |
port |
Port to listen on. Default |
n_ctx |
Context size for the loaded model. |
n_gpu_layers |
Layers to offload to GPU ( |
model_id |
Identifier reported in |
host |
Address to bind. Default |
template |
Chat template string, or |
max_tokens |
Default |
... |
Reserved for future options. |
The server is single-sequence: requests are handled one at a time on the main R thread (each streamed token is one event-loop pump). This is meant for a single local user/agent, not concurrent load.
drogonR is an optional dependency (Suggests); install it
with install.packages("drogonR") (or from its repository) before
calling this function.
Invisibly NULL. Blocks serving until drogonR::dr_stop()
is called (typically from another process or an interrupt).
[llama_gen_begin], [llama_generate]
## Not run: llama_serve_openai("model.gguf", port = 11434L) # In another shell, point any OpenAI client at # http://127.0.0.1:11434/v1 # e.g. GET /v1/models and POST /v1/chat/completions ## End(Not run)## Not run: llama_serve_openai("model.gguf", port = 11434L) # In another shell, point any OpenAI client at # http://127.0.0.1:11434/v1 # e.g. GET /v1/models and POST /v1/chat/completions ## End(Not run)
Registers an R function that is called periodically during generation. If the function returns 'TRUE', the current decode operation is aborted. Pass 'NULL' to remove the callback.
llama_set_abort_callback(ctx, fn)llama_set_abort_callback(ctx, fn)
ctx |
Context handle returned by [llama_new_context] |
fn |
A zero-argument R function returning a logical scalar, or 'NULL' to clear. |
Note: only one callback is active globally — setting a new one replaces the previous one across all contexts.
No return value, called for side effects.
## Not run: # Abort after 2 seconds deadline <- Sys.time() + 2 llama_set_abort_callback(ctx, function() Sys.time() > deadline) result <- llama_generate(ctx, "Tell me a long story", max_new_tokens = 500L) llama_set_abort_callback(ctx, NULL) ## End(Not run)## Not run: # Abort after 2 seconds deadline <- Sys.time() + 2 llama_set_abort_callback(ctx, function() Sys.time() > deadline) result <- llama_generate(ctx, "Tell me a long story", max_new_tokens = 500L) llama_set_abort_callback(ctx, NULL) ## End(Not run)
When disabled, the model uses full (bidirectional) attention. This is useful for embedding models.
llama_set_causal_attn(ctx, causal)llama_set_causal_attn(ctx, causal)
ctx |
Context handle returned by [llama_new_context] |
causal |
Logical; |
No return value, called for side effects.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) llama_set_causal_attn(ctx, FALSE) # for embeddings ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) llama_set_causal_attn(ctx, FALSE) # for embeddings ## End(Not run)
Set the number of threads for a context
llama_set_threads(ctx, n_threads, n_threads_batch = n_threads)llama_set_threads(ctx, n_threads, n_threads_batch = n_threads)
ctx |
Context handle returned by [llama_new_context] |
n_threads |
Number of threads for single-token generation |
n_threads_batch |
Number of threads for batch processing (prompt encoding).
Defaults to the same value as |
No return value, called for side effects.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) llama_set_threads(ctx, n_threads = 8L) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) llama_set_threads(ctx, n_threads = 8L) ## End(Not run)
Controls how much diagnostic output is printed during model loading and inference.
llama_set_verbosity(level)llama_set_verbosity(level)
level |
Integer verbosity level: - 0: Silent (no output) - 1: Errors only (default) - 2: Normal (warnings and info) - 3: Verbose (all debug messages) |
No return value, called for side effects. Sets the global verbosity level used by the underlying 'llama.cpp' library.
# Suppress all output llama_set_verbosity(0) # Show only errors llama_set_verbosity(1) # Verbose output for debugging llama_set_verbosity(3)# Suppress all output llama_set_verbosity(0) # Show only errors llama_set_verbosity(1) # Verbose output for debugging llama_set_verbosity(3)
When 'warmup = TRUE', the context runs in warmup mode which pre-caches model weights in GPU memory without producing meaningful outputs. Call with 'warmup = FALSE' to return to normal inference mode.
llama_set_warmup(ctx, warmup)llama_set_warmup(ctx, warmup)
ctx |
Context handle returned by [llama_new_context] |
warmup |
Logical; 'TRUE' to enable warmup mode, 'FALSE' to disable. |
No return value, called for side effects.
Returns the number of bytes required to serialize the current context state (KV cache + sampling state). Use before allocating a buffer for raw state I/O.
llama_state_get_size(ctx)llama_state_get_size(ctx)
ctx |
Context handle returned by [llama_new_context] |
A numeric scalar (size in bytes).
Restores a previously saved context state (including KV cache).
llama_state_load(ctx, path)llama_state_load(ctx, path)
ctx |
Context handle returned by [llama_new_context] |
path |
File path to load state from |
A logical scalar: TRUE on success (errors on failure).
## Not run: llama_state_load(ctx, "state.bin") # Continue generation from saved state result <- llama_generate(ctx, "") ## End(Not run)## Not run: llama_state_load(ctx, "state.bin") # Continue generation from saved state result <- llama_generate(ctx, "") ## End(Not run)
Saves the full context state (including KV cache) to a binary file. This allows resuming generation later from the exact same state.
llama_state_save(ctx, path)llama_state_save(ctx, path)
ctx |
Context handle returned by [llama_new_context] |
path |
File path to save state to |
A logical scalar: TRUE on success (errors on failure).
## Not run: llama_state_save(ctx, "state.bin") ## End(Not run)## Not run: llama_state_save(ctx, "state.bin") ## End(Not run)
Returns 'TRUE' if at least one GPU backend (e.g. Vulkan) was detected at runtime. Use the result to decide whether to pass 'n_gpu_layers != 0' to [llama_load_model].
llama_supports_gpu()llama_supports_gpu()
A logical scalar: TRUE if at least one GPU backend
(e.g. Vulkan) is available, FALSE otherwise.
if (llama_supports_gpu()) { message("GPU available, will use Vulkan backend") } else { message("GPU not available, using CPU only") }if (llama_supports_gpu()) { message("GPU available, will use Vulkan backend") } else { message("GPU not available, using CPU only") }
Check whether memory locking is supported
llama_supports_mlock()llama_supports_mlock()
A logical scalar: TRUE if mlock is supported.
# Check if memory locking is available (prevents swapping model to disk) if (llama_supports_mlock()) { message("mlock available — model weights can be pinned in RAM") }# Check if memory locking is available (prevents swapping model to disk) if (llama_supports_mlock()) { message("mlock available — model weights can be pinned in RAM") }
Check whether memory-mapped file I/O is supported
llama_supports_mmap()llama_supports_mmap()
A logical scalar: TRUE if mmap is supported.
# Check memory-mapping support before loading large models if (llama_supports_mmap()) { message("mmap available — large models will load faster") }# Check memory-mapping support before loading large models if (llama_supports_mmap()) { message("mmap available — large models will load faster") }
Check whether RPC backend is available
llama_supports_rpc()llama_supports_rpc()
A logical scalar: 'TRUE' if the RPC backend is compiled in.
Blocks until all pending GPU/async operations for this context are complete. Normally not needed — 'llama_decode' and 'llama_generate' are synchronous — but useful when using low-level batch APIs in async mode.
llama_synchronize(ctx)llama_synchronize(ctx)
ctx |
Context handle returned by [llama_new_context] |
No return value, called for side effects.
Returns a string with information about the system capabilities detected by llama.cpp (SIMD support, etc.).
llama_system_info()llama_system_info()
A character scalar with system capability information.
cat(llama_system_info(), "\n")cat(llama_system_info(), "\n")
Get current time in microseconds
llama_time_us()llama_time_us()
A numeric scalar with the current time in microseconds.
# Measure elapsed time for an operation t0 <- llama_time_us() Sys.sleep(0.01) elapsed_ms <- (llama_time_us() - t0) / 1000 cat("Elapsed:", round(elapsed_ms, 1), "ms\n")# Measure elapsed time for an operation t0 <- llama_time_us() Sys.sleep(0.01) elapsed_ms <- (llama_time_us() - t0) / 1000 cat("Elapsed:", round(elapsed_ms, 1), "ms\n")
Convert a single token ID to its text piece
llama_token_to_piece(ctx, token, special = FALSE)llama_token_to_piece(ctx, token, special = FALSE)
ctx |
A context pointer (llama_context). |
token |
Integer token ID. |
special |
Logical. If TRUE, render special tokens (e.g. |
A character string — the text piece for the token.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) # Inspect individual tokens from tokenizer output tokens <- llama_tokenize(ctx, "Hello world") pieces <- vapply(tokens, function(t) llama_token_to_piece(ctx, t), "") cat(paste(pieces, collapse = "|"), "\n") ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) # Inspect individual tokens from tokenizer output tokens <- llama_tokenize(ctx, "Hello world") pieces <- vapply(tokens, function(t) llama_token_to_piece(ctx, t), "") cat(paste(pieces, collapse = "|"), "\n") ## End(Not run)
Tokenize text into token IDs
llama_tokenize(ctx, text, add_special = TRUE, parse_special = FALSE)llama_tokenize(ctx, text, add_special = TRUE, parse_special = FALSE)
ctx |
Context handle returned by [llama_new_context] |
text |
Character string to tokenize |
add_special |
Whether to add special tokens (BOS/EOS) as configured by the model |
parse_special |
Whether to parse control/special tokens (e.g. Mistral's
|
An integer vector of token IDs as used by the model's vocabulary.
## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) tokens <- llama_tokenize(ctx, "Hello, world!") print(tokens) # [1] 1 15043 29892 3186 29991 # Without special tokens tokens <- llama_tokenize(ctx, "Hello", add_special = FALSE) # Parse a templated prompt's role markers as control tokens prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi"))) tokens <- llama_tokenize(ctx, prompt, parse_special = TRUE) ## End(Not run)## Not run: model <- llama_load_model("model.gguf") ctx <- llama_new_context(model) tokens <- llama_tokenize(ctx, "Hello, world!") print(tokens) # [1] 1 15043 29892 3186 29991 # Without special tokens tokens <- llama_tokenize(ctx, "Hello", add_special = FALSE) # Parse a templated prompt's role markers as control tokens prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi"))) tokens <- llama_tokenize(ctx, prompt, parse_special = TRUE) ## End(Not run)
Returns the log-probability score stored in the vocabulary (used by SPM/UGM tokenizers).
llama_vocab_get_score(model, token)llama_vocab_get_score(model, token)
model |
Model handle returned by [llama_load_model] |
token |
Integer token ID (0-based) |
A numeric scalar.
Returns the raw text string stored in the vocabulary for a given token ID. Unlike [llama_token_to_piece], this does not apply any special rendering — it returns exactly what is stored in the GGUF vocabulary table.
llama_vocab_get_text(model, token)llama_vocab_get_text(model, token)
model |
Model handle returned by [llama_load_model] |
token |
Integer token ID (0-based) |
A character string, or 'NULL' if the token has no text entry.
Returns the token IDs for special tokens (BOS, EOS, etc.) and fill-in-middle (FIM) tokens used by the model's vocabulary. A value of -1 indicates the token is not defined.
llama_vocab_info(model)llama_vocab_info(model)
model |
Model handle returned by [llama_load_model] |
A named integer vector with token IDs for: bos, eos,
eot, sep, nl, pad, fim_pre,
fim_suf, fim_mid, fim_rep, fim_sep.
A value of -1 means the token is not defined by the model.
## Not run: model <- llama_load_model("model.gguf") vocab <- llama_vocab_info(model) cat("BOS token:", vocab["bos"], "\n") cat("EOS token:", vocab["eos"], "\n") ## End(Not run)## Not run: model <- llama_load_model("model.gguf") vocab <- llama_vocab_info(model) cat("BOS token:", vocab["bos"], "\n") cat("EOS token:", vocab["eos"], "\n") ## End(Not run)
Check if a token is a control token
llama_vocab_is_control(model, token)llama_vocab_is_control(model, token)
model |
Model handle returned by [llama_load_model] |
token |
Integer token ID (0-based) |
A logical scalar.
Returns 'TRUE' for EOS, EOT, and other tokens that signal end of output. Useful for implementing custom generation loops.
llama_vocab_is_eog(model, token)llama_vocab_is_eog(model, token)
model |
Model handle returned by [llama_load_model] |
token |
Integer token ID (0-based) |
A logical scalar.
Get vocabulary type
llama_vocab_type(model)llama_vocab_type(model)
model |
Model handle returned by [llama_load_model] |
A character string: one of '"spm"' (LLaMA/SentencePiece BPE), '"bpe"' (GPT-2 BPE), '"wpm"' (BERT WordPiece), '"ugm"' (T5 Unigram), '"rwkv"', '"plamo2"', or '"none"'.