Package: llamaR 0.2.5

llamaR: Interface for Large Language Models via 'llama.cpp'

Provides R bindings to 'llama.cpp' for running large language models locally, with optional GPU acceleration via 'ggmlR'. Supports text generation, embeddings, chat-based workflows, tool calling, and multimodal (vision) inference. Includes 'OpenAI'- and 'Anthropic'-compatible HTTP servers for serving local models, along with device selection and multi-GPU support.

Authors: Yuri Baramykov [aut, cre], Georgi Gerganov [cph]

llamaR_0.2.5.tar.gz
llamaR_0.2.5.zip (r-4.7-x86_64), llamaR_0.2.5.zip (r-4.6-x86_64), llamaR_0.2.5.zip (r-4.5-x86_64)
llamaR_0.2.5.tgz (r-4.6-x86_64), llamaR_0.2.5.tgz (r-4.6-arm64), llamaR_0.2.5.tgz (r-4.5-x86_64), llamaR_0.2.5.tgz (r-4.5-arm64)
llamaR_0.2.5.tar.gz (r-4.7-arm64), llamaR_0.2.5.tar.gz (r-4.7-x86_64), llamaR_0.2.5.tar.gz (r-4.6-arm64), llamaR_0.2.5.tar.gz (r-4.6-x86_64)
manual.pdf | manual.html
DESCRIPTION | NEWS
card.svg | card.png
llamaR/json (API)

# Install 'llamaR' in R:

install.packages('llamaR', repos = c('https://zabis13.r-universe.dev', 'https://cloud.r-project.org'))

Bug tracker: https://github.com/zabis13/llamar/issues

Uses libs:

c++: GNU Standard C++ Library v3
openmp: GCC OpenMP (GOMP) support library

On CRAN:

cpp openmp

6.93 score, 6 stars, 72 scripts, 364 downloads, 99 exports, 4 dependencies

Last updated from: 7af1093184. Checks: 12 OK, 1 FAIL. Indexed: yes.

Target	Result	Time
linux-devel-arm64	OK	450
linux-devel-x86_64	OK	561
source / vignettes	OK	842
linux-release-arm64	OK	460
linux-release-x86_64	OK	423
macos-release-arm64	OK	321
macos-release-x86_64	OK	625
macos-oldrel-arm64	OK	322
macos-oldrel-x86_64	OK	664
windows-devel	OK	645
windows-release	OK	679
windows-oldrel	OK	699
wasm-release	FAIL	355

Exports: chat_llamar chat_llamar_stop embed_llamar llama_backend_devices llama_batch_free llama_batch_init llama_chat_apply_template llama_chat_build llama_chat_builtin_templates llama_chat_parse llama_chat_template llama_detokenize llama_embed_batch llama_embeddings llama_encode llama_free_context llama_free_model llama_gen_begin llama_gen_begin_at llama_gen_end llama_gen_next llama_generate llama_generate_batch llama_get_embeddings llama_get_embeddings_ith llama_get_embeddings_seq llama_get_logits llama_get_logits_ith llama_get_model llama_get_verbosity llama_hf_cache_clear llama_hf_cache_dir llama_hf_cache_info llama_hf_download llama_hf_list llama_image_eval llama_image_load llama_load_model llama_load_model_hf llama_lora_apply llama_lora_clear llama_lora_load llama_lora_remove llama_max_devices llama_memory_breakdown_print llama_memory_can_shift llama_memory_clear llama_memory_seq_add llama_memory_seq_cp llama_memory_seq_div llama_memory_seq_keep llama_memory_seq_pos_range llama_memory_seq_rm llama_model_info llama_model_meta llama_model_meta_val llama_mtmd_load llama_mtmd_marker llama_mtmd_set_verbosity llama_mtmd_support_audio llama_mtmd_support_vision llama_n_batch llama_n_ctx llama_n_ctx_seq llama_n_seq_max llama_n_threads llama_n_threads_batch llama_n_ubatch llama_new_context llama_numa_init llama_perf llama_perf_print llama_perf_reset llama_pooling_type llama_serve_anthropic llama_serve_openai llama_set_abort_callback llama_set_causal_attn llama_set_threads llama_set_verbosity llama_set_warmup llama_state_get_size llama_state_load llama_state_save llama_supports_gpu llama_supports_mlock llama_supports_mmap llama_supports_rpc llama_synchronize llama_system_info llama_time_us llama_token_to_piece llama_tokenize llama_vocab_get_score llama_vocab_get_text llama_vocab_info llama_vocab_is_control llama_vocab_is_eog llama_vocab_type

Dependencies: generics ggmlR jsonlite R6

Multi-GPU: Splits, Replicas and Benchmarking

Rendered from multi-gpu.Rmd using knitr::rmarkdown

Last update: 2026-07-12
Started: 2026-07-12

Chat and Agents

Rendered from chat-and-agents.Rmd using knitr::rmarkdown

Last update: 2026-06-27
Started: 2026-05-27

Getting Started with llamaR

Rendered from getting-started.Rmd using knitr::rmarkdown

Last update: 2026-05-27
Started: 2026-05-27

Readme and manuals

Help Manual

Help page	Topics
Chat with a local model through an ellmer::Chat object	chat_llamar
Stop the server spawned by chat_llamar()	chat_llamar_stop
Embedding provider for ragnar / standalone use	embed_llamar
List available backend devices	llama_backend_devices
Free a llama batch allocated with 'llama_batch_init()'	llama_batch_free
Initialise a llama batch	llama_batch_init
Apply chat template to messages	llama_chat_apply_template
Build a tool-aware chat prompt and its parsing grammar	llama_chat_build
List built-in chat templates	llama_chat_builtin_templates
Parse raw model output into content and tool calls	llama_chat_parse
Get model's built-in chat template	llama_chat_template
Detokenize token IDs back to text	llama_detokenize
Batch embeddings for multiple texts	llama_embed_batch
Extract embeddings for a text	llama_embeddings
Encode tokens using the encoder (encoder-decoder models only)	llama_encode
Free an inference context	llama_free_context
Free a loaded model	llama_free_model
Begin a streaming (token-by-token) generation	llama_gen_begin
Begin streaming generation from an already-prefilled context	llama_gen_begin_at
Finish a streaming generation	llama_gen_end
Pull the next chunk of a streaming generation	llama_gen_next
Generate text from a prompt	llama_generate
Generate completions for multiple prompts in parallel	llama_generate_batch
Get all output token embeddings as a matrix	llama_get_embeddings
Get embeddings for the i-th token in the batch	llama_get_embeddings_ith
Get pooled embeddings for a sequence	llama_get_embeddings_seq
Get logits from the last decode step	llama_get_logits
Get logits for a specific token position	llama_get_logits_ith
Get the model associated with a context	llama_get_model
Get current verbosity level	llama_get_verbosity
Clear the model cache	llama_hf_cache_clear
Get the cache directory for downloaded models	llama_hf_cache_dir
Show information about the model cache	llama_hf_cache_info
Download a GGUF model from Hugging Face	llama_hf_download
List GGUF files in a Hugging Face repository	llama_hf_list
Evaluate an image + prompt into a llama context	llama_image_eval
Load an image file into an mtmd bitmap	llama_image_load
Load a GGUF model file	llama_load_model
Load a model directly from Hugging Face	llama_load_model_hf
Apply a LoRA adapter to context	llama_lora_apply
Remove all LoRA adapters from context	llama_lora_clear
Load a LoRA adapter	llama_lora_load
Remove a LoRA adapter from context	llama_lora_remove
Get maximum number of devices	llama_max_devices
Print memory breakdown by device	llama_memory_breakdown_print
Check if the KV cache supports shifting	llama_memory_can_shift
Clear the KV cache	llama_memory_clear
Shift token positions in a sequence	llama_memory_seq_add
Copy a sequence in the KV cache	llama_memory_seq_cp
Integer-divide token positions in a sequence	llama_memory_seq_div
Keep only one sequence in the KV cache	llama_memory_seq_keep
Get position range for a sequence	llama_memory_seq_pos_range
Remove tokens from a sequence in the KV cache	llama_memory_seq_rm
Get model metadata	llama_model_info
Get all model metadata as a named character vector	llama_model_meta
Get a single model metadata value by key	llama_model_meta_val
Load a multimodal projector (mmproj)	llama_mtmd_load
Media marker string for multimodal prompts	llama_mtmd_marker
Set verbosity of the multimodal subsystem	llama_mtmd_set_verbosity
Does this multimodal context support audio?	llama_mtmd_support_audio
Does this multimodal context support vision (images)?	llama_mtmd_support_vision
Get logical batch size	llama_n_batch
Get context window size	llama_n_ctx
Get per-sequence context window size	llama_n_ctx_seq
Get maximum number of sequences	llama_n_seq_max
Get number of threads for single-token generation	llama_n_threads
Get number of threads for batch processing	llama_n_threads_batch
Get physical micro-batch size	llama_n_ubatch
Create an inference context	llama_new_context
Initialize NUMA optimization	llama_numa_init
Get performance statistics	llama_perf
Print performance statistics to the console	llama_perf_print
Reset performance counters	llama_perf_reset
Get pooling type	llama_pooling_type
Serve an Anthropic Messages API-compatible endpoint for a local model	llama_serve_anthropic
Serve an OpenAI-compatible HTTP API for a local model	llama_serve_openai
Set or clear the abort callback	llama_set_abort_callback
Set causal attention mode	llama_set_causal_attn
Set the number of threads for a context	llama_set_threads
Set logging verbosity level	llama_set_verbosity
Set warmup mode	llama_set_warmup
Get the size of the serialized context state in bytes	llama_state_get_size
Load context state from file	llama_state_load
Save context state to file	llama_state_save
Check whether GPU offloading is available	llama_supports_gpu
Check whether memory locking is supported	llama_supports_mlock
Check whether memory-mapped file I/O is supported	llama_supports_mmap
Check whether RPC backend is available	llama_supports_rpc
Synchronize asynchronous computation	llama_synchronize
Get system information string	llama_system_info
Get current time in microseconds	llama_time_us
Convert a single token ID to its text piece	llama_token_to_piece
Tokenize text into token IDs	llama_tokenize
Get the score of a token	llama_vocab_get_score
Get the text representation of a token	llama_vocab_get_text
Get vocabulary special token IDs	llama_vocab_info
Check if a token is a control token	llama_vocab_is_control
Check if a token is an end-of-generation token	llama_vocab_is_eog
Get vocabulary type	llama_vocab_type