- llama_gen_begin() / llama_gen_next() / llama_gen_end() — token-by-token generation matching llama_generate() output, with valid-UTF-8 chunks.
- llama_serve_openai() — serve a local GGUF model over an OpenAI-compatible HTTP API (/v1/models, /v1/chat/completions, streaming and blocking) via the optional drogonR package.
- chat_llamar() — returns an ellmer::Chat backed by a local model, connecting to a running server (base_url=) or spawning one (model_path=); chat_llamar_stop() stops a spawned server.
- llama_n_batch()-sized chunks (was GGML_ASSERT(n_tokens_all <= cparams.n_batch)).
- llama_n_ctx_seq() — per-sequence context window size.
- llama_n_batch() — logical batch size (max tokens per llama_decode call).
- llama_n_ubatch() — physical micro-batch size.
- llama_n_seq_max() — maximum number of concurrent sequences.
- llama_n_threads() / llama_n_threads_batch() — read back thread counts set via llama_set_threads().
- llama_pooling_type() — pooling type of the context as a string ("none", "mean", "cls", "last", "rank").
- Removed the fflush macro from r_llama_compat.h
  that broke std::fflush in <fstream> (Apple clang / libc++).
- llama_get_logits_ith() — logit vector for a specific token position in the last decoded batch. Supports negative indexing (-1 = last token).
- embed_llamar() — high-level embedding provider compatible with
  ragnar_store_create(embed = ...). Supports partial application (lazy model
  loading), direct call returning a matrix, and data.frame input. L2
  normalization on by default.
- llama_embed_batch() — embed multiple texts in one call. Uses true pooled
  batch decode (llama_get_embeddings_seq) for embedding models, with automatic
  fallback to sequential last-token decode for generative models.
- llama_get_embeddings_ith() — get embedding vector for the i-th token
  (supports negative indexing).
- llama_get_embeddings_seq() — get pooled embedding for a sequence ID.
- llama_new_context() gains embedding parameter. When TRUE, sets
  cparams.embeddings = true and disables causal attention at creation time.
  llama_embed_batch() uses this flag to choose the optimal code path.
- llama_load_model() gains devices parameter for explicit backend selection.
  Accepts device names from llama_backend_devices(), type keywords ("cpu",
  "gpu"), or numeric indices. Multiple devices enable multi-GPU split.
- llama_backend_devices() — list all available compute devices (CPU, GPU,
  iGPU, accelerator) as a data.frame.
- llama_numa_init() — NUMA optimization with strategies: disabled, distribute,
  isolate, numactl, mirror.
- llama_time_us() — current time in microseconds.
- llama_token_to_piece() — convert a single token ID to its text piece.
- llama_encode() — run the encoder pass for encoder-decoder models (e.g. T5, BART).
- llama_batch_init() / llama_batch_free() — low-level batch allocation and release
  with automatic GC finalizer.
- Removed the extern "C" block wrapping #include <R.h> in r_llama_compat.h
  (C++ templates cannot appear inside extern "C" linkage).
- Fixed a conflict between the Rinternals.h #define length(x) macro and
  std::codecvt::length() in r_llama_interface.cpp:
  C++ standard headers are now included before R headers, followed by
  #undef length.
- llama_token_to_piece, llama_batch_init,
  llama_batch_free, and llama_encode, including GPU context variants.
- llama_hf_list() — list GGUF files in a Hugging Face repository.
- llama_hf_download() — download a GGUF model with local caching.
  Supports exact filename, glob pattern, or Ollama-style tag selection.
- llama_load_model_hf() — download and load a model in one step.
- llama_hf_cache_dir() — get the cache directory path.
- llama_hf_cache_info() — inspect cached models.
- llama_hf_cache_clear() — clear the model cache.
- Added jsonlite and utils to Imports.
- configure.win and Makevars.win.in.
- ggmlR is built with GPU support.
- Added exit() / _Exit() overrides to r_llama_compat.h to prevent
  process termination (redirects to Rf_error()).
- ggmlR >= 0.5.4.
- ggmlR).
- ggmlR.
- Added \value tags to all exported functions describing
  return class, structure, and meaning.
- Replaced \dontrun{} with \donttest{} in all examples.
- cph) for bundled
  'llama.cpp' code.
- Included NEWS.md in the package tarball (removed from .Rbuildignore).
- cran-comments.md.
- .Rbuildignore.

Full LLM inference cycle is now available from R:
- llama_load_model() / llama_free_model() — load and free GGUF models
- llama_new_context() / llama_free_context() — context management
- llama_tokenize() / llama_detokenize() — tokenization and detokenization
- llama_generate() — text generation with temperature, top_k, top_p, greedy support
- llama_embeddings() — embedding extraction
- llama_model_info() — model metadata

Model and context are wrapped as ExternalPtr with automatic GC finalizers. The context holds a reference to the model ExternalPtr, preventing premature collection.
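
The cycle listed above might be driven from R roughly as follows. This is a minimal sketch, not the package's documented usage: the wrapper name run_inference_cycle() is hypothetical, the package is assumed to be installed under the name llamaR, and the positional arguments (a file path for llama_load_model(), the model for llama_new_context(), the context and then the prompt for llama_generate()) are assumptions to check against the help pages. The wrapper returns NULL when the package or the model file is unavailable.

```r
# Hypothetical helper: load a GGUF model, generate text, release resources.
# Assumptions: the package is installed as "llamaR" and the functions accept
# the positional arguments used here; verify against the package help pages.
run_inference_cycle <- function(model_path, prompt = "Hello") {
  if (!requireNamespace("llamaR", quietly = TRUE) || !file.exists(model_path)) {
    return(NULL)  # nothing to run without the package and a model file
  }

  model <- llamaR::llama_load_model(model_path)
  # Explicit release; assumed to be safe alongside the GC finalizer.
  on.exit(llamaR::llama_free_model(model), add = TRUE)

  ctx <- llamaR::llama_new_context(model)
  # after = FALSE runs this handler first, so the context is freed before the model.
  on.exit(llamaR::llama_free_context(ctx), add = TRUE, after = FALSE)

  llamaR::llama_generate(ctx, prompt)
}

# The path below is a placeholder, so this call returns NULL.
run_inference_cycle(file.path(tempdir(), "no-such-model.gguf"))
```

Since both objects carry GC finalizers, the explicit free calls in the sketch only make the release point deterministic; dropping them and letting the garbage collector reclaim the pointers should work as well.
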
llama_generate() runs the full pipeline in a single C++ call: prompt
tokenization → encode → autoregressive decode loop with a sampler chain →
detokenization of generated tokens.
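
The sampler chain inside the decode loop can be pictured with a toy example in plain R. It illustrates one common ordering (temperature, then top-k, then top-p, then a random draw) and is not the package's C++ implementation: the function name sample_next_token() and its default values are made up for this sketch, and the order and defaults that llama_generate() actually uses are not stated in these notes.

```r
# Toy sampler chain over a logit vector: temperature -> top-k -> top-p -> draw.
# Returns the 1-based index of the chosen token. Hypothetical helper.
sample_next_token <- function(logits, temperature = 0.8, top_k = 40L, top_p = 0.9) {
  # Greedy decoding: a non-positive temperature picks the arg-max directly.
  if (temperature <= 0) return(which.max(logits))

  # Temperature scaling followed by a numerically stable softmax.
  scaled <- logits / temperature
  probs <- exp(scaled - max(scaled))
  probs <- probs / sum(probs)

  # Top-k: keep the k most probable tokens.
  ord <- order(probs, decreasing = TRUE)
  keep <- ord[seq_len(min(top_k, length(ord)))]

  # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
  p_keep <- probs[keep] / sum(probs[keep])
  n_nucleus <- which(cumsum(p_keep) >= top_p)[1]
  if (is.na(n_nucleus)) n_nucleus <- length(keep)
  keep <- keep[seq_len(n_nucleus)]

  # Renormalize over the survivors and draw one token.
  p_final <- probs[keep] / sum(probs[keep])
  keep[sample.int(length(keep), size = 1L, prob = p_final)]
}

logits <- c(2.0, 1.0, 0.1, -1.0)
sample_next_token(logits, temperature = 0)  # greedy: index of the largest logit
set.seed(1)
sample_next_token(logits, top_k = 2L)       # draws from the two most likely tokens
```

In a real decode loop the chosen index would be appended to the sequence and fed back into the next decode step until an end-of-sequence token or a length limit is reached.
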
19 assertions across 7 test blocks, all passing.
- libggml.a from ggmlR package
- ggml_build_forward_select replaced with simplified branch selection