Changes in version 0.2.4 (2026-05-27) Streaming generation - llama_gen_begin() / llama_gen_next() / llama_gen_end() — token-by-token generation matching llama_generate() output, with valid-UTF-8 chunks. OpenAI-compatible server - llama_serve_openai() — serve a local GGUF model over an OpenAI-compatible HTTP API (/v1/models, /v1/chat/completions, streaming and blocking) via the optional drogonR package. ellmer integration - chat_llamar() — returns an ellmer::Chat backed by a local model, connecting to a running server (base_url=) or spawning one (model_path=); chat_llamar_stop() stops a spawned server. Bug fixes - Long prompts no longer abort: prefill is now split into llama_n_batch()-sized chunks (was GGML_ASSERT(n_tokens_all <= cparams.n_batch)). Changes in version 0.2.3 (2026-04-06) Context getters - llama_n_ctx_seq() — per-sequence context window size. - llama_n_batch() — logical batch size (max tokens per llama_decode call). - llama_n_ubatch() — physical micro-batch size. - llama_n_seq_max() — maximum number of concurrent sequences. - llama_n_threads() / llama_n_threads_batch() — read back thread counts set via llama_set_threads(). - llama_pooling_type() — pooling type of the context as a string ("none", "mean", "cls", "last", "rank"). Bug fixes - Fixed macOS compilation error: removed fflush macro from r_llama_compat.h that broke std::fflush in (Apple clang / libc++). Logits - llama_get_logits_ith() — logit vector for a specific token position in the last decoded batch. Supports negative indexing (-1 = last token). Changes in version 0.2.2 (2026-03-05) ragnar integration - embed_llamar() — high-level embedding provider compatible with ragnar_store_create(embed = ...). Supports partial application (lazy model loading), direct call returning a matrix, and data.frame input. L2 normalization on by default. Batch embeddings - llama_embed_batch() — embed multiple texts in one call. Uses true pooled batch decode (llama_get_embeddings_seq) for embedding models, with automatic fallback to sequential last-token decode for generative models. - llama_get_embeddings_ith() — get embedding vector for the i-th token (supports negative indexing). - llama_get_embeddings_seq() — get pooled embedding for a sequence ID. Context embedding mode - llama_new_context() gains embedding parameter. When TRUE, sets cparams.embeddings = true and disables causal attention at creation time. llama_embed_batch() uses this flag to choose the optimal code path. Backend & device selection - llama_load_model() gains devices parameter for explicit backend selection. Accepts device names from llama_backend_devices(), type keywords ("cpu", "gpu"), or numeric indices. Multiple devices enable multi-GPU split. - llama_backend_devices() — list all available compute devices (CPU, GPU, iGPU, accelerator) as a data.frame. Hardware & system - llama_numa_init() — NUMA optimization with strategies: disabled, distribute, isolate, numactl, mirror. - llama_time_us() — current time in microseconds. Tests - 40+ new test blocks covering all new functions. - Total: 143 passing, 4 expected skips. Changes in version 0.2.1 New functions - llama_token_to_piece() — convert a single token ID to its text piece. - llama_encode() — run the encoder pass for encoder-decoder models (e.g. T5, BART). - llama_batch_init() / llama_batch_free() — low-level batch allocation and release with automatic GC finalizer. Bug fixes - Fixed compilation failure on macOS with Apple clang 17 / Xcode 16.4: removed extern "C" block wrapping #include in r_llama_compat.h (C++ templates cannot appear inside extern "C" linkage). - Fixed macro conflict between Rinternals.h #define length(x) and std::codecvt::length() in r_llama_interface.cpp: C++ standard headers are now included before R headers, followed by #undef length. Tests - Added 9 new test blocks covering llama_token_to_piece, llama_batch_init, llama_batch_free, and llama_encode, including GPU context variants. - Total: 103 passing, 4 expected skips. Changes in version 0.2.0 Hugging Face integration New functions - llama_hf_list() — list GGUF files in a Hugging Face repository. - llama_hf_download() — download a GGUF model with local caching. Supports exact filename, glob pattern, or Ollama-style tag selection. - llama_load_model_hf() — download and load a model in one step. - llama_hf_cache_dir() — get the cache directory path. - llama_hf_cache_info() — inspect cached models. - llama_hf_cache_clear() — clear the model cache. Dependencies - Added jsonlite and utils to Imports. Changes in version 0.1.3 GPU and build system improvements Vulkan GPU support on Windows - Added Vulkan linking support to configure.win and Makevars.win.in. - Windows builds now link with Vulkan when ggmlR is built with GPU support. CRAN compliance - Added exit() / _Exit() overrides to r_llama_compat.h to prevent process termination (redirects to Rf_error()). Dependencies - Requires ggmlR >= 0.5.4. - Bumped minimum R version to 4.1.0 (matches ggmlR). DESCRIPTION - Updated description to mention Vulkan GPU support via ggmlR. Changes in version 0.1.2 CRAN compliance fixes Documentation - Expanded all acronyms in DESCRIPTION (LLMs, GPU). - Added detailed \value tags to all exported functions describing return class, structure, and meaning. - Replaced \dontrun{} with \donttest{} in all examples. DESCRIPTION - Added Georgi Gerganov as copyright holder (cph) for bundled 'llama.cpp' code. Packaging - Included NEWS.md in the package tarball (removed from .Rbuildignore). - Created cran-comments.md. - Cleaned up duplicate entries in .Rbuildignore. Changes in version 0.1.1 R interface — first working release Full LLM inference cycle is now available from R: - llama_load_model() / llama_free_model() — load and free GGUF models - llama_new_context() / llama_free_context() — context management - llama_tokenize() / llama_detokenize() — tokenization and detokenization - llama_generate() — text generation with temperature, top_k, top_p, greedy support - llama_embeddings() — embedding extraction - llama_model_info() — model metadata Memory management Model and context are wrapped as ExternalPtr with automatic GC finalizers. The context holds a reference to the model ExternalPtr, preventing premature collection. Generation internals llama_generate() runs the full pipeline in a single C++ call: prompt tokenization → encode → autoregressive decode loop with a sampler chain → detokenization of generated tokens. Tests 19 assertions across 7 test blocks, all passing. Changes in version 0.1.0 Initial Release - Basic package structure with llama.cpp integration - Links against libggml.a from ggmlR package - Includes all llama.cpp model implementations (~100 architectures) - Vulkan GPU support (optional) Dependencies - Requires ggmlR >= 0.5.1 for static library export Known Limitations - ggml_build_forward_select replaced with simplified branch selection