- GGML_BACKEND_DEVICE_TYPE_META device type (tensor-parallel meta backend).
- use_ref flag, Hadamard FWHT op, rewritten Flash Attention (split-KV + tiled).
- quantize_q1_0(), quantize_nvfp4(), dequantize_row_q1_0(), dequantize_row_nvfp4().
- copyin overhead in single-backend decode (Ministral-3B, targeting ~3×).
- R6 moved from Suggests to Imports; package now loads correctly when R6 is not pre-installed.
- LearnerClassifGGML and LearnerRegrGGML R6 class definitions are now deferred until mlr3/R6/paradox are available, preventing namespace load failure in environments without these optional packages.
- ggml_conv_2d (IM2COL+GEMM) with ggml_conv_2d_direct (GGML_OP_CONV_2D) in onnx_ggml.c: SuperResolution GPU time 344 ms → 5 ms (~70×).
- wg512 pipeline threshold lowered from >1024 to >=512; improves attention softmax at seq_len 512–1024.
- benchmark_ops.R (36-op CPU/GPU micro-benchmark), profile_onnx_superres_gpu.R (GPU profiler for SuperResolution).
- USE_SUBGROUP_NO_SHMEM path added to mul_mmq.comp: on wavefront-64 devices (RDNA4, subgroup_size=64) the block_a weight tile is loaded directly into registers via subgroupShuffle / subgroupBroadcast, eliminating the shared-memory round-trip in block_a_to_shmem → block_a_to_registers.
  Measured on RX 9070: Flux 768×768 sampling 22.38s → 20.80s (~7% end-to-end; sampling is not pure matmul so the gain on isolated Q4_K GEMM is higher).
- subgroup_no_shmem: ggml_vulkan_device_caps() now returns this flag (logical), indicating whether the shuffle mmq path is active.
- GL_EXT_shader_subgroup_extended_types_float16 added to mul_mmq.comp under #ifdef USE_SUBGROUP_NO_SHMEM && FLOAT16; required for subgroupShuffle on float16_t components of f16vec2.
- ggml_vulkan_device_caps() extended: wavefronts_per_simd and arch fields added; all 14 fields now documented.
- pipeline_dequant_mul_mat_mat_q8_1_no_shmem: registered in device struct; selected at dispatch when subgroup_size == 64 and src0 is Q4_K / Q5_K / Q6_K; falls back to standard mmq pipeline gracefully when not compiled.
- GGML_TYPE_Q2_K, Q3_K, Q4_K, Q5_K, Q6_K exported: these constants were defined in tensors.R but missing from NAMESPACE; roxygen2::roxygenise() now includes them.
- inst/examples/vulkan_caps.R extended with a new section that shows USE_SUBGROUP_NO_SHMEM: ACTIVE/INACTIVE with explanation of conditions.
- tests/testthat/test-vulkan.R adds smoke tests for Q4_K / Q5_K / Q6_K quantized matmul via Vulkan (no NaN/Inf, correct shape); test-vulkan-caps.R asserts integer_dot_product=TRUE on RDNA4.
- get_device_architecture() now identifies RDNA4 by wavefrontsPerSimd == 16 (distinct from RDNA3's 8 and RDNA1's 20). Previously GFX1201 fell through to AMD_RDNA3 due to identical subgroup size range (min=32, max=64).
- VK_AMD_shader_core_properties queried at device init: wavefronts_per_simd is now stored in vk_device_struct and read once during ggml_vk_get_device(), not just inside get_device_architecture().
- SHADERGEN_DEFINES propagated to C++ compiler: configure now appends SHADERGEN_DEFINES (which includes -DGGML_VULKAN_COOPMAT_GLSLC_SUPPORT) to VULKAN_CPPFLAGS.
  Previously these defines were only passed to vulkan-shaders-gen, so all #if defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT) blocks in ggml-vulkan.cpp were dead code at runtime.
- ggml_backend_vk_get_device_caps() extended: now returns subgroup_min_size, subgroup_max_size, wavefronts_per_simd, and arch (string) in addition to the original 5 fields. R function ggml_vulkan_device_caps() exposes all 9 fields.
- coopmat_support=YES, coopmat1_fa_support=YES: KHR cooperative matrix GEMM and flash-attention paths now active.
- GGML_OP_FLASH_ATTN_EXT now accepts K/V tensors in Q4_K format on Vulkan. Previously Q4_K fell back to CPU; now it runs fully on GPU via both the scalar and cooperative-matrix (KHR) paths.
- dequantize4_q4k() added to flash_attn_base.glsl: decodes 4 consecutive Q4_K elements from a block_q4_K_packed16 block by reconstructing the 6-bit scale and min for the sub-block, reading two consecutive uint16 from qs[], and extracting four nibbles. Works for both K and V bindings.
- flash_attn.comp (FA_SCALAR) and flash_attn_cm1.comp (FA_COOPMAT1) now compiled with DATA_A_Q4_K / BLOCK_SIZE=QUANT_K_Q4_K=256. Four SPIR-V variants generated: f32acc and f16acc for each path.
- vulkan-shaders-gen.cpp: q4_k added to the FA scalar and coopmat1 generation conditions.
- ggml-vulkan.cpp: CREATE_FA(GGML_TYPE_Q4_K, ...) added for FA_SCALAR and FA_COOPMAT1; GGML_TYPE_Q4_K added to the supported-types switch in ggml_backend_vk_device_supports_op.
- HSK) is a multiple of 256 (e.g. DeepSeek-V2/V3 MLA). For HSK=128 (Llama, Mistral) the shader is functionally correct but pads the inner loop to 256.
- "ggml" engine for parsnip::mlp(): registers a "ggml" engine for both classification and regression modes. After library(ggmlR) (with parsnip installed), use:
mlp(hidden_units = 64, epochs = 100) |>
  set_engine("ggml", batch_size = 32, backend = "auto") |>
  set_mode("classification")
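As a hedged sketch of how such a spec might be used end to end, the following fits it on iris and predicts on a few rows. It assumes ggmlR and parsnip are both installed and that the "ggml" engine is registered on library(ggmlR) as described here; fit() and predict() are the standard parsnip generics, and the object names spec, fitted_model and preds are illustrative only.

```r
library(parsnip)
library(ggmlR)  # assumed to register the "ggml" engine on load

spec <- mlp(hidden_units = 64, epochs = 100) |>
  set_engine("ggml", batch_size = 32, backend = "auto") |>
  set_mode("classification")

# Standard parsnip formula interface; Species is the factor outcome.
fitted_model <- fit(spec, Species ~ ., data = iris)

# predict() on a parsnip model returns one row per row of new_data.
preds <- predict(fitted_model, new_data = head(iris))
print(preds)
```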
  Engine arguments: batch_size, backend, verbose, validation_split, optimizer, callbacks. All mlp() parameters (hidden_units, epochs, dropout, activation, learn_rate) are mapped through.
- backend = "gpu" in parsnip: "gpu" is now correctly translated to "vulkan" inside ggmlr_parsnip_fit_classif() and ggmlr_parsnip_fit_regr(). Previously the string was passed through and caused an unknown backend error.
- learn_rate callback: the learn_rate argument from mlp() is applied via an internal on_epoch_begin callback that sets the optimizer learning rate at the start of epoch 1. Works for both "adam" and "sgd" optimizers.
- Suggests: parsnip, tibble, rlang, dials.
- inst/examples/tidymodels_integration.R: CPU vs GPU comparison for iris classification and mtcars regression using the parsnip engine.
- LearnerClassifGGML / LearnerRegrGGML always defined: R6 class definitions are now unconditional (no longer wrapped in if (requireNamespace("mlr3"))). This ensures the classes are always present in the ggmlR namespace, so ggmlR:::.register_mlr3() can be called reliably from vignettes and tests regardless of package load order.
- .onLoad() no longer uses mlr3misc::register_namespace_callback() (which had a bug in v0.21.0 causing R CMD check warning namespace can be unloaded cleanly). Registration now uses isNamespaceLoaded() + setHook() directly, covering both "mlr3 already loaded" and "mlr3 loads after ggmlR" scenarios.
- mlr3misc removed from Suggests; no longer needed.
- inst/examples/mlr3_integration.R: CPU vs GPU comparison for iris classification and mtcars regression, plus 3-fold CV.
- marshal_model.* / unmarshal_model.* S3 methods no longer appear in NAMESPACE as S3method(mlr3::marshal_model, ...); this caused Error: namespace 'marshal_model' not found on package load.
  Methods are now registered exclusively via registerS3method() in .onLoad().
- test-parsnip.R: new tests: learn_rate applied without error; backend="gpu" accepted and converted to "vulkan" (skipped when Vulkan unavailable).
- test-mlr3-learner.R: explicit ggmlR:::.register_mlr3() call at top of file for reliable registration in R CMD check test process.
- Rcpp::asis vignette engine. No rendering on CRAN runners.
- rmarkdown from Suggests (no longer needed).
- ggml_graph_print() output captured in test-graph-utils.R; C-level broadcast warnings captured in ONNX broadcast and resize-broadcast tests.
- gguf_load(path): opens a GGUF file (v2/v3) and reads all metadata and tensor descriptors. Returns an S3 object of class "gguf".
- gguf_metadata(x): returns all key-value metadata pairs as a named list (architecture, tokenizer config, quantization info, etc.).
- gguf_tensor_names(x): lists all tensor names in the file.
- gguf_tensor_info(x, name): returns shape, type, and size in bytes for a single tensor.
- gguf_tensor_data(x, name): dequantizes (if needed) and returns tensor weights as an R numeric array with correct dimensions.
- gguf_free(x): explicitly frees GGUF context (also called by GC).
- print.gguf() method shows file version, tensor count, and metadata count.
- VK_KHR_push_descriptor): unchanged; when the extension is available and maxPushDescriptors >= 12, descriptor sets are pushed directly into the command buffer via pushDescriptorSetKHR(), eliminating descriptor pool overhead.
  Falls back to the traditional descriptor pool path on hardware without the extension.
- fit() now accepts a callbacks parameter for sequential models (passed through to ggml_fit_sequential()).
- test-gguf.R, test-graph-utils.R, test-inplace-ops.R, test-keras-api.R, test-misc-ops.R, test-model-ops.R, test-print-methods.R, test-tensor-utils.R, test-threading.R, test-autograd-missing.R, test-nn-functional-missing.R, test-quants-missing.R.
- src/ and inst/include/ headers: configure and configure.win now automatically sync all public headers from src/ to inst/include/ at install time. Previously, changes to GGML_MAX_DIMS (4→5) and other structs in src/ggml.h were not propagated to the exported headers, causing segfaults in downstream packages (e.g. sd2R).
- tests/testthat/test-headers-sync.R to verify that inst/include/ headers remain in sync with src/ headers and that GGML_MAX_DIMS is consistent.
- ggml_view_5d(): new API function for creating 5D views with explicit strides, extending the existing 1D–4D view family. Uses the existing ggml_view_impl() internally.
- ggml_repeat_5d(): new API function for tiling tensors up to 5D. CPU kernels (ggml_compute_forward_repeat_f32, ggml_compute_forward_repeat_f16) updated with a 5th loop dimension. Vulkan dispatch collapses dim3×dim4 into push constants transparently (no shader changes needed; push constants remain at 128 bytes).
- onnx_ggml.c (~20 sites):
- ne[GGML_MAX_DIMS] arrays, switch with case 5: new_tensor_5d.
- onnx_broadcast_align): all reshape/new_tensor calls use dimension-aware helpers.
- onnx_reshape_nd().
- ggml_repeat_5d().
- tmap_put_nd() and slice_fill arrays updated to GGML_MAX_DIMS.
- onnx_reshape_nd(), onnx_new_tensor_nd(), ne_product(): eliminate switch/case duplication.
- ggml_permute API limitation).
- ConstantOfShape read the value TensorProto attribute as float regardless of data_type. When data_type=7 (INT64), the 8-byte int64 was reinterpreted as a 4-byte float, producing garbage values (~1.4e-45 instead of 1). This broke attention mask generation (fill=0 instead of 1) and position ID generation (NonZero on zeros = empty).
- ConstantOfShape now checks data_type and correctly handles INT64, INT32, DOUBLE, and FLOAT value attributes.
- ggml_get_rows which only supports 2D data. For axis=0 on rank>2 (e.g. CaiT QKV split on [48,576,6,3]), the tensor is now reshaped to 2D, gathered, and reshaped back.
- GGML_OP_SCATTER_ELEMENTS added to the ggml engine with both CPU kernel and Vulkan compute shader.
- scatter_elements.comp): two variants compiled at install time, scatter_elements_none (overwrite) and scatter_elements_add (atomicAdd via GL_EXT_shader_atomic_float). Data is copied to output via vkCmdCopyBuffer with a pipeline barrier before the scatter dispatch.
- ScatterElements op with axis=0 and reduction="none"/"add" attributes. Indices cast to I32, updates/data cast to F32 automatically.
- ggml_map_custom3 op.
  The CPU kernel computes 2D relative position bias directly: bias[b,hq,wq,hk,wk] = dot(x, W_h) + dot(x_transposed, W_w).
- detect_pos_embed_blocks() identifies contiguous node ranges with /pos_embed/ in output names, extracts W_h/W_w initializer shapes to determine H, W, C, validates F32 data type.
- onnx_ggml_run(), input data is copied into pinned memory before ggml_backend_tensor_set(); the Vulkan driver detects the pinned source pointer and performs direct DMA transfer to VRAM, bypassing the internal staging copy.
- ggml_backend_vk_host_buffer_type() returns NULL or buffer is too small, the standard staging path is used transparently.
- onnx_device_info(): added NULL guards for ctx->graph and n_nodes == 0 edge cases that caused segfault when called on models before first inference run.
- ggml_predict() with stochastic dropout: nn_build_graph() now receives training = FALSE during inference, so stochastic Bernoulli dropout is disabled at predict time. Previously, stochastic = TRUE dropout layers applied random masks during inference, degrading accuracy.
- ggml_fit() return value: the return value of ggml_fit() must be assigned back to model to obtain trained weights (model <- ggml_fit(...)). This is now clarified in all examples and documentation. Using history <- ggml_fit(...) without reassigning model leaves the model with untrained weights.
- ggml_evaluate() return value: now includes n_samples in addition to loss and accuracy. Metrics are computed on all samples without truncation (via ggml_predict() internally).
- inst/examples/titanic_classification.R: new end-to-end binary classification example on the Titanic dataset. Demonstrates feature engineering (Title, FamilySize, IsAlone), stratified train/val split, one-hot encoding, dropout regularization, and manual validation metrics (accuracy, precision, recall, F1, confusion matrix). Achieves ~82% val accuracy.
- weight_buf and never re-transferred between runs.
  Previous architecture reloaded all weights before every onnx_run() call; eliminated entirely.
- ctx_weight / ctx contexts: weight tensors live in a permanent GPU buffer that the scheduler never aliases; compute tensors are managed by ggml_backend_sched independently.
- onnx_device_info(): scheduler diagnostic: number of splits, GPU/CPU op counts, CPU-only op list.
- inst/examples/benchmark_onnx.R): proper VRAM cleanup between models via rm() + gc().
- onnx_load(path, device, input_shapes): load an ONNX model file, build a ggml computation graph, and allocate tensors on Vulkan GPU or CPU. Weights are loaded via memory-mapped file (zero-copy where possible).
- onnx_run(model, inputs): run inference on a loaded ONNX model with named input data.
- onnx_inputs(model): list expected input tensor names and shapes.
- onnx_summary(model): return model metadata (IR version, opset, producer, ops used).
- print.onnx_model(): formatted summary of a loaded ONNX model.
- input_shapes parameter for models with dynamic dimensions: specify fixed shapes at load time (e.g. input_shapes = list(image = c(1L, 3L, 224L, 224L))).
- auto_pad attribute (SAME_UPPER, SAME_LOWER) supported for Conv and pooling ops.
- input_shapes (Conv, Reshape, Transpose)
- input_shapes (1180 nodes)
- input_shapes (482 nodes: MatMul, LayerNorm, GELU, Softmax)
- inst/lib/libggml.a, breaking static linking from dependent packages (e.g. llamaR).
- dp_train(make_model, data, loss_fn, forward_fn, target_fn, n_gpu, n_iter, lr, max_norm, verbose): data-parallel training across multiple replicas. Weights are broadcast from replica 0 before the first step; gradients are averaged across replicas each iteration; weights are re-broadcast after each optimizer update.
  Returns list(params, loss_history, model).
- ag_mul and ag_sub now support CPU broadcast: [d×s] * [1×s] and [d×s] * [d×1] shapes work correctly with proper gradient reduction.
- ag_softmax_cross_entropy_loss accepts integer target vectors (0-based class indices) and converts them to one-hot automatically.
- ggml_sum_rows f16 on Vulkan: F16→F16 dispatch now supported natively (no CPU fallback).
- ag_tensor() / ag_param(): environment-backed tensors with reference semantics; in-place optimizer updates visible to all references.
- with_grad_tape({ ... }): enables the global gradient tape for the enclosed forward pass.
- backward(loss): reverse-mode automatic differentiation; returns a gradient environment keyed by tensor id.
- ag_matmul, ag_add (with bias broadcast), ag_sub, ag_mul, ag_scale.
- ag_relu, ag_sigmoid, ag_tanh, ag_softmax.
- ag_sum, ag_mean, ag_log, ag_exp, ag_pow, ag_clamp.
- ag_reshape, ag_transpose.
- ag_mse_loss, ag_cross_entropy_loss, ag_softmax_cross_entropy_loss (numerically-stable fused).
- optimizer_sgd(): SGD with optional momentum.
- optimizer_adam(): Adam with bias-corrected moment estimates.
- ag_linear(): Glorot-initialised dense layer (closure-based, returns $forward, $params()).
- ag_gradcheck(): central finite-difference gradient checker (like torch.autograd.gradcheck).
- ag_sequential(...): ordered layer container; collects all parameters for the optimizer.
- ag_dropout(rate): inverted dropout; identity in eval mode.
- ag_batch_norm(num_features): batch normalisation with running statistics and learnable γ/β.
- ag_embedding(vocab_size, dim): token lookup with scatter-add backward.
- ag_train(model) / ag_eval(model): switch all sub-layers between train and eval mode.
- ag_dataloader(x, y, batch_size, shuffle, col_major): mini-batch iterator with shuffle and $epoch() helper.
- lr_scheduler_step(optimizer, step_size, gamma): step-decay learning rate.
- lr_scheduler_cosine(optimizer, T_max, lr_min, restart): cosine-annealing (with optional SGDR warm restarts).
- clip_grad_norm(params, grads, max_norm): clips all gradients by global L2 norm in-place.
- ggml_layer_lstm(): LSTM recurrent layer (unrolled BPTT).
- ggml_layer_gru(): GRU recurrent layer (unrolled BPTT).
- ggml_layer_global_max_pooling_2d(): reduces [H,W,C] to [C] via max pooling.
- ggml_layer_global_average_pooling_2d(): reduces [H,W,C] to [C] via average pooling.
- ggml_save_model(): saves full model (architecture + weights) to RDS file.
- ggml_load_model(): restores a model saved with ggml_save_model().
- ggml_dense(), ggml_conv_2d(), ggml_conv_1d(), ggml_batch_norm(), ggml_embedding(), ggml_lstm(), ggml_gru(): layer object constructors returning a reusable ggml_layer object.
- ggml_apply(tensor, layer): applies a ggml_layer object to a tensor node; shared weights by object identity.
- ggml_layer_dropout(): dropout with deterministic or stochastic (per-epoch Bernoulli mask) mode.
- ggml_layer_embedding(): token embedding lookup for integer inputs.
- ggml_input() gains dtype argument ("float32" or "int32").
- ggml_model() and ggml_predict().
- ggml_input(): declare a symbolic input tensor node (Functional API).
- ggml_model(): assemble a ggml_functional_model from input/output nodes.
- ggml_layer_add(): element-wise addition of tensor nodes (residual connections).
- ggml_layer_concatenate(): concatenate tensor nodes along an axis.
- ggml_layer_*() functions now accept a ggml_tensor_node as first argument (Functional API mode).
- ggml_compile(), ggml_fit(), ggml_evaluate(), ggml_predict() are now S3 generics with methods for ggml_functional_model.
- ggml_fit_opt(): low-level optimizer loop with callbacks and learning-rate control.
- ggml_callback_early_stopping(): stops training when a metric stagnates.
- ggml_schedule_step_decay(): step learning-rate decay.
- ggml_schedule_cosine_decay(): cosine learning-rate annealing.
- ggml_schedule_reduce_on_plateau(): reduces LR when metric stops improving.
- ggml_opt_init_for_fit(), ggml_opt_set_lr(), ggml_opt_get_lr(): learning-rate control without recreating the optimizer context.
- configure.win.
- ggml_layer_conv_1d(): 1D convolution layer.
- ggml_layer_batch_norm(): batch normalization layer.
- ggml_predict_classes(): argmax wrapper returning 1-based class indices.
- summary.ggml_sequential_model(): detailed model summary with parameter counts.
- ggml_fit() now returns model$history (class ggml_history) with print and plot methods.
- ggml_model_sequential(), ggml_layer_dense(), ggml_layer_conv_2d(), ggml_layer_max_pooling_2d(), ggml_layer_flatten(), ggml_compile(), ggml_fit(), ggml_evaluate(), ggml_predict(), ggml_save_weights(), ggml_load_weights().
- ggml_timestep_embedding(): sinusoidal timestep embeddings.
- ggml_set_f32_nd(), ggml_get_f32_nd(), ggml_set_i32_nd(), ggml_get_i32_nd().
- ggml_tensor_nb(), ggml_tensor_num(), ggml_tensor_copy(), ggml_tensor_set_f32_scalar(), ggml_get_first_tensor(), ggml_get_next_tensor().
- libggml.a exported for linking by dependent packages.
- gguf.cpp added for GGUF file format support.
- inst/include/ for LinkingTo.
- ggml_opt_init(), ggml_opt_free(), ggml_opt_fit(), ggml_opt_epoch(), ggml_opt_eval().
- ggml_opt_dataset_init(), ggml_opt_dataset_data(), ggml_opt_dataset_labels(), ggml_opt_dataset_shuffle().
- ggml_opt_result_init(), ggml_opt_result_loss(), ggml_opt_result_accuracy(), ggml_opt_result_pred().
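
The cosine learning-rate annealing listed for ggml_schedule_cosine_decay() can be illustrated in base R. This is a minimal sketch of the standard cosine schedule, not the package's implementation; the helper name cosine_lr and its arguments (epoch, n_epochs, lr_max, lr_min) are hypothetical and chosen only for this illustration.

```r
# Standard cosine annealing: the rate falls from lr_max at epoch 0
# to lr_min at epoch n_epochs along half a cosine period.
# cosine_lr is a hypothetical helper, not part of ggmlR.
cosine_lr <- function(epoch, n_epochs, lr_max, lr_min = 0) {
  stopifnot(n_epochs > 0, epoch >= 0, epoch <= n_epochs)
  lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / n_epochs))
}

# Vectorised over epochs: one learning rate per epoch 0..10.
epochs <- 0:10
schedule <- cosine_lr(epochs, n_epochs = 10, lr_max = 0.1, lr_min = 0.001)
print(round(schedule, 4))
```

In a training loop, the rate computed for each epoch could then be applied through ggml_opt_set_lr(), whose exact arguments are not shown in this changelog.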