                        Changes in version 0.7.9                        

  - Single-cell GPU integration (Seurat): new adapter layer that runs
    GPU-accelerated operations directly on Seurat objects, with no hard
    dependency on Seurat (it stays in Suggests).
      - RunGGML() — Seurat-style, pipe-friendly entry point (object in,
        object out); mirrors RunPCA(). First operation is "embed" (PCA):
        the gene-by-gene covariance multiply runs on the Vulkan GPU, the
        eigendecomposition on the CPU.
      - Layered architecture: ggml_extract() (extraction, handles Seurat
        v4 GetAssayData vs v5 LayerData, sparse dgCMatrix → dense),
        ggml_run() (dispatch, auto GPU/CPU with transparent CPU
        fallback), ggml_inject() (writes the result back as a Seurat
        reduction via CreateDimReducObject).
      - Contract objects ggml_task() / ggml_result() and an
        introspectable ggml_ops_registry() for capability checks before
        dispatch.
      - Seurat, SeuratObject, Matrix added to Suggests; package remains
        R CMD check-clean without them.
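
The "embed" operation described above splits PCA into a covariance
multiply and an eigendecomposition. A minimal base-R sketch of that split
follows, purely for illustration: it runs entirely on the CPU, calls
neither ggmlR nor Seurat, and embed_pca is a hypothetical name rather than
a package function.

```r
# Sketch only: PCA as "covariance multiply + eigendecomposition" in base R.
# embed_pca() is a hypothetical helper, not part of ggmlR.
embed_pca <- function(x, n_components = 2L) {
  # x: cells in rows, genes in columns
  xc <- scale(x, center = TRUE, scale = FALSE)  # center each gene
  cov_gg <- crossprod(xc) / (nrow(xc) - 1)      # gene-by-gene covariance
  eig <- eigen(cov_gg, symmetric = TRUE)        # eigendecomposition
  k <- seq_len(n_components)
  loadings <- eig$vectors[, k, drop = FALSE]
  list(
    embeddings = xc %*% loadings,               # cell coordinates
    loadings = loadings,
    sdev = sqrt(pmax(eig$values[k], 0))
  )
}

set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)
pca <- embed_pca(x, n_components = 3L)
ref <- prcomp(x, center = TRUE, scale. = FALSE)  # reference result
```

Up to the sign of each component, pca$embeddings should agree with the
first three columns of ref$x.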

                 Changes in version 0.7.8 (2026-06-04)                  

  - Re-enabled the GGML_BACKEND_DEVICE_TYPE_META device type
    (tensor-parallel meta backend).

                 Changes in version 0.7.7 (2026-05-25)                  

  - ggml-0.11.0 migration complete (vendored library upgraded
    from 0.9.5); all ggmlR features and optimizations preserved (5D
    indexing, Q4_K Flash Attention, RDNA4 subgroup-shuffle MMQ,
    Vulkan 1.4 push constants).
  - CPU backend updated to 0.11.0: use_ref flag, Hadamard FWHT op,
    rewritten Flash Attention (split-KV + tiled).
  - Vulkan shaders synchronized with 0.11.0 across all stages.
  - New quantization types Q1_0 and NVFP4 — full GPU support.
  - New exports: quantize_q1_0(), quantize_nvfp4(),
    dequantize_row_q1_0(), dequantize_row_nvfp4().
  - In progress: GPU inference speedup — fixing scheduler copyin
    overhead in single-backend decode (Ministral-3B, targeting ~3×).

                 Changes in version 0.7.6 (2026-04-22)                  

  - Fix: R6 moved from Suggests to Imports — package now loads correctly
    when R6 is not pre-installed.
  - Fix: LearnerClassifGGML and LearnerRegrGGML R6 class definitions are
    now deferred until mlr3/R6/paradox are available, preventing
    namespace load failure in environments without these optional
    packages.

                 Changes in version 0.7.5 (2026-04-20)                  

  - Vulkan 1.4 Support: integrated push constants raised to 256 bytes,
    targeting 5D tensor operations — enables larger parameter blocks in
    compute shaders without staging buffers.
  - Architecture Update: refactored core file structure for improved
    project organization.

                        Changes in version 0.7.4                        

  - ONNX Conv: replaced ggml_conv_2d (IM2COL+GEMM) with
    ggml_conv_2d_direct (GGML_OP_CONV_2D) in onnx_ggml.c —
    SuperResolution GPU time 344 ms → 5 ms (~70×).
  - Vulkan softmax: wg512 pipeline threshold lowered from >1024 to >=512
    — improves attention softmax at seq_len 512–1024.
  - New examples: benchmark_ops.R (36-op CPU/GPU micro-benchmark),
    profile_onnx_superres_gpu.R (GPU profiler for SuperResolution).

                        Changes in version 0.7.3                        

Vulkan: subgroup-shuffle mmq for Q4_K / Q5_K / Q6_K (wavefront-64 devices)

  - USE_SUBGROUP_NO_SHMEM path added to mul_mmq.comp — on wavefront-64
    devices (RDNA4, subgroup_size=64) the block_a weight tile is loaded
    directly into registers via subgroupShuffle / subgroupBroadcast,
    eliminating the shared-memory round-trip in block_a_to_shmem →
    block_a_to_registers. Measured on RX 9070: Flux 768×768
    sampling 22.38s → 20.80s (~7% end-to-end; sampling is not pure
    matmul so the gain on isolated Q4_K GEMM is higher).
  - New device capability field subgroup_no_shmem —
    ggml_vulkan_device_caps() now returns this flag (logical),
    indicating whether the shuffle mmq path is active.
  - GL_EXT_shader_subgroup_extended_types_float16 added to mul_mmq.comp
    under #ifdef USE_SUBGROUP_NO_SHMEM && FLOAT16 — required for
    subgroupShuffle on float16_t components of f16vec2.
  - ggml_vulkan_device_caps() extended — wavefronts_per_simd and arch
    fields added; all 14 fields now documented.
  - New pipeline pipeline_dequant_mul_mat_mat_q8_1_no_shmem — registered
    in device struct; selected at dispatch when subgroup_size == 64 and
    src0 is Q4_K / Q5_K / Q6_K; falls back to standard mmq pipeline
    gracefully when not compiled.
  - GGML_TYPE_Q2_K, Q3_K, Q4_K, Q5_K, Q6_K exported — these constants
    were defined in tensors.R but missing from NAMESPACE;
    roxygen2::roxygenise() now includes them.
  - inst/examples/vulkan_caps.R extended — new section shows
    USE_SUBGROUP_NO_SHMEM: ACTIVE/INACTIVE with explanation of
    conditions.
  - Tests — tests/testthat/test-vulkan.R adds smoke tests for Q4_K /
    Q5_K / Q6_K quantized matmul via Vulkan (no NaN/Inf, correct shape);
    test-vulkan-caps.R asserts integer_dot_product=TRUE on RDNA4.

                 Changes in version 0.7.2 (2026-04-15)                  

Vulkan: RDNA4 (RX 9000) cooperative matrix support

  - AMD RDNA4 (GFX12xx) detected correctly — get_device_architecture()
    now identifies RDNA4 by wavefrontsPerSimd == 16 (distinct from
    RDNA3's 8 and RDNA1's 20). Previously GFX1201 fell through to
    AMD_RDNA3 due to identical subgroup size range (min=32, max=64).
  - VK_AMD_shader_core_properties queried at device init —
    wavefronts_per_simd is now stored in vk_device_struct and read once
    during ggml_vk_get_device(), not just inside
    get_device_architecture().
  - SHADERGEN_DEFINES propagated to C++ compiler — configure now appends
    SHADERGEN_DEFINES (which includes
    -DGGML_VULKAN_COOPMAT_GLSLC_SUPPORT) to VULKAN_CPPFLAGS. Previously
    these defines were only passed to vulkan-shaders-gen, so all #if
    defined(GGML_VULKAN_COOPMAT_GLSLC_SUPPORT) blocks in ggml-vulkan.cpp
    were dead code at runtime.
  - ggml_backend_vk_get_device_caps() extended — now returns
    subgroup_min_size, subgroup_max_size, wavefronts_per_simd, and arch
    (string) in addition to the original 5 fields. R function
    ggml_vulkan_device_caps() exposes all 9 fields.
  - Result on RX 9070 (RADV GFX1201): coopmat_support=YES,
    coopmat1_fa_support=YES — KHR cooperative matrix GEMM and
    flash-attention paths now active.

Vulkan: Q4_K flash attention (FA_SCALAR + FA_COOPMAT1)

  - Q4_K in flash attention — GGML_OP_FLASH_ATTN_EXT now accepts K/V
    tensors in Q4_K format on Vulkan. Previously Q4_K fell back to CPU;
    now it runs fully on GPU via both the scalar and cooperative-matrix
    (KHR) paths.
  - dequantize4_q4k() added to flash_attn_base.glsl — decodes 4
    consecutive Q4_K elements from a block_q4_K_packed16 block:
    reconstructs the 6-bit scale and min for the sub-block, reads two
    consecutive uint16 from qs[], and extracts four nibbles. Works for
    both K and V bindings.
  - flash_attn.comp (FA_SCALAR) and flash_attn_cm1.comp (FA_COOPMAT1)
    now compiled with DATA_A_Q4_K / BLOCK_SIZE=QUANT_K_Q4_K=256. Four
    SPIR-V variants generated: f32acc and f16acc for each path.
  - vulkan-shaders-gen.cpp — q4_k added to the FA scalar and coopmat1
    generation conditions.
  - ggml-vulkan.cpp — CREATE_FA(GGML_TYPE_Q4_K, ...) added for FA_SCALAR
    and FA_COOPMAT1; GGML_TYPE_Q4_K added to the supported-types switch
    in ggml_backend_vk_device_supports_op.
  - Note: most efficient when head dimension (HSK) is a multiple of 256
    (e.g. DeepSeek-V2/V3 MLA). For HSK=128 (Llama, Mistral) the shader
    is functionally correct but pads the inner loop to 256.
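
The nibble extraction described for dequantize4_q4k() can be illustrated
in plain R. This is a rough sketch, not the shader code: extract4 and
dequant4 are hypothetical names, the scale and min are passed in already
decoded, and the choice between low and high nibbles is modelled as a
shift of 0 or 4 bits, which is an assumption about the packing.

```r
# Sketch only: pull four 4-bit values out of two 16-bit words, then apply
# the Q4_K-style affine rule value = d * scale * q - dmin * min.
extract4 <- function(w_lo, w_hi, shift) {
  # shift = 0 selects the low nibble of each byte, shift = 4 the high nibble
  nib <- function(w, s) (w %/% 2^s) %% 16
  c(nib(w_lo, shift), nib(w_lo, shift + 8),
    nib(w_hi, shift), nib(w_hi, shift + 8))
}

dequant4 <- function(q, d, scale, dmin, min) {
  d * scale * q - dmin * min
}

low  <- extract4(0x4321L, 0x8765L, 0)   # bytes 0x21 0x43 0x65 0x87
high <- extract4(0x4321L, 0x8765L, 4)
vals <- dequant4(low, d = 0.5, scale = 3, dmin = 0.25, min = 2)
```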

                        Changes in version 0.7.1                        

tidymodels / parsnip integration

  - "ggml" engine for parsnip::mlp() — registers a "ggml" engine for
    both classification and regression modes. After library(ggmlR) (with
    parsnip installed), use:
    mlp(hidden_units = 64, epochs = 100) |>
      set_engine("ggml", batch_size = 32, backend = "auto") |>
      set_mode("classification")
    Engine arguments: batch_size, backend, verbose, validation_split,
    optimizer, callbacks. All mlp() parameters (hidden_units, epochs,
    dropout, activation, learn_rate) are mapped through.
  - backend = "gpu" in parsnip — "gpu" is now correctly translated to
    "vulkan" inside ggmlr_parsnip_fit_classif() and
    ggmlr_parsnip_fit_regr(). Previously the string was passed through
    and caused an unknown backend error.
  - learn_rate callback — the learn_rate argument from mlp() is applied
    via an internal on_epoch_begin callback that sets the optimizer
    learning rate at the start of epoch 1. Works for both "adam" and
    "sgd" optimizers.
  - New Suggests: parsnip, tibble, rlang, dials.
  - New example: inst/examples/tidymodels_integration.R — CPU vs GPU
    comparison for iris classification and mtcars regression using the
    parsnip engine.

mlr3 integration

  - LearnerClassifGGML / LearnerRegrGGML always defined — R6 class
    definitions are now unconditional (no longer wrapped in if
    (requireNamespace("mlr3"))). This ensures the classes are always
    present in the ggmlR namespace, so ggmlR:::.register_mlr3() can be
    called reliably from vignettes and tests regardless of package load
    order.
  - Registration robustness — .onLoad() no longer uses
    mlr3misc::register_namespace_callback() (which had a bug in v0.21.0
    that triggered the R CMD check warning for "whether the namespace
    can be unloaded cleanly").
    Registration now uses isNamespaceLoaded() + setHook() directly,
    covering both "mlr3 already loaded" and "mlr3 loads after ggmlR"
    scenarios.
  - mlr3misc removed from Suggests — no longer needed.
  - New example: inst/examples/mlr3_integration.R — CPU vs GPU
    comparison for iris classification and mtcars regression,
    plus 3-fold CV.
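
The isNamespaceLoaded() + setHook() registration pattern mentioned above
can be sketched with base R alone. The helper name register_when_loaded
and the counter are illustrative assumptions; in a package the call would
typically sit inside .onLoad(), and the actual ggmlR code may differ.

```r
# Sketch only: run `register` now if `pkg` is loaded, otherwise defer it
# to the package's onLoad hook. register_when_loaded() is hypothetical.
register_when_loaded <- function(pkg, register) {
  if (isNamespaceLoaded(pkg)) {
    register()                                   # pkg already loaded
  } else {
    setHook(packageEvent(pkg, "onLoad"),
            function(...) register(),
            action = "append")                   # pkg loads later
  }
  invisible(NULL)
}

calls <- new.env()
calls$n <- 0L
bump <- function() calls$n <- calls$n + 1L

register_when_loaded("base", bump)                 # runs immediately
register_when_loaded("notARealPackage123", bump)   # only stores a hook
n_hooks <- length(getHook(packageEvent("notARealPackage123", "onLoad")))
```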

Bug fixes

  - marshal_model.* / unmarshal_model.* S3 methods no longer appear in
    NAMESPACE as S3method(mlr3::marshal_model, ...), which caused the
    error "namespace 'marshal_model' not found" on package load. Methods
    are now registered exclusively via registerS3method() in .onLoad().

Tests

  - test-parsnip.R — new tests: learn_rate applied without error;
    backend="gpu" accepted and converted to "vulkan" (skipped when
    Vulkan unavailable).
  - test-mlr3-learner.R — explicit ggmlR:::.register_mlr3() call at top
    of file for reliable registration in R CMD check test process.

                 Changes in version 0.7.0 (2026-04-06)                  

Vignettes: prebuilt HTML via Rcpp::asis

  - Seven vignettes (Autograd Engine, Data Parallel Training, Embedding
    ggmlR, GPU Vulkan Backend, Keras-like API, ONNX Import,
    Quantization) are now shipped as prebuilt HTML using the Rcpp::asis
    vignette engine. No rendering on CRAN runners.
  - Removed rmarkdown from Suggests (no longer needed).

Test suite

  - Suppressed spurious stdout/stderr output from tests:
    ggml_graph_print() output captured in test-graph-utils.R; C-level
    broadcast warnings captured in ONNX broadcast and resize-broadcast
    tests.

                        Changes in version 0.6.9                        

GGUF file reader

  - gguf_load(path) — opens a GGUF file (v2/v3) and reads all metadata
    and tensor descriptors. Returns an S3 object of class "gguf".
  - gguf_metadata(x) — returns all key-value metadata pairs as a named
    list (architecture, tokenizer config, quantization info, etc.).
  - gguf_tensor_names(x) — lists all tensor names in the file.
  - gguf_tensor_info(x, name) — returns shape, type, and size in bytes
    for a single tensor.
  - gguf_tensor_data(x, name) — dequantizes (if needed) and returns
    tensor weights as an R numeric array with correct dimensions.
  - gguf_free(x) — explicitly frees GGUF context (also called by GC).
  - Supports all ggml quantization types (F32, F16, Q4_0, Q8_0,
    K-quants, etc.) with automatic dequantization to F32.
  - print.gguf() method shows file version, tensor count, and metadata
    count.
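
For readers curious what a GGUF reader sees first, the sketch below parses
the fixed-size file header (magic, version, tensor count, metadata count)
with base R. It is not how gguf_load() is implemented; read_u64 and
read_gguf_header are hypothetical helpers, and the file is a synthetic
header written to a temporary path.

```r
# Sketch only: read the GGUF header fields with readBin().
read_u64 <- function(con) {
  # little-endian uint64 read as two 32-bit halves, returned as a double
  halves <- readBin(con, "integer", n = 2L, size = 4L, endian = "little")
  halves <- ifelse(halves < 0, halves + 2^32, halves)  # undo signed wrap
  halves[1] + halves[2] * 2^32
}

read_gguf_header <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  magic <- rawToChar(readBin(con, "raw", n = 4L))
  if (magic != "GGUF") stop("not a GGUF file")
  version <- readBin(con, "integer", n = 1L, size = 4L, endian = "little")
  list(version = version,
       tensor_count = read_u64(con),
       metadata_kv_count = read_u64(con))
}

# Write a synthetic header (nothing follows it) and read it back.
path <- tempfile(fileext = ".gguf")
con <- file(path, "wb")
writeBin(charToRaw("GGUF"), con)
writeBin(3L, con, size = 4L, endian = "little")         # version
writeBin(c(2L, 0L), con, size = 4L, endian = "little")  # tensor_count = 2
writeBin(c(5L, 0L), con, size = 4L, endian = "little")  # metadata_kv_count = 5
close(con)

hdr <- read_gguf_header(path)
```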

Vulkan backend: revert to Vulkan 1.2 + Push Descriptors

  - Vulkan API version capped at 1.2 (was 1.3). Requesting a Vulkan 1.3
    instance implicitly enables Synchronization2 (core in 1.3), which
    causes significant performance degradation on RADV (Mesa) drivers —
    particularly on newer AMD hardware (RX 9070 and similar). Capping
    at 1.2 avoids the implicit promotion while retaining all
    functionality.
  - Push Descriptors (VK_KHR_push_descriptor): unchanged — when the
    extension is available and maxPushDescriptors >= 12, descriptor sets
    are pushed directly into the command buffer via
    pushDescriptorSetKHR(), eliminating descriptor pool overhead. Falls
    back to the traditional descriptor pool path on hardware without the
    extension.

Keras-compatible API

  - fit() now accepts a callbacks parameter for sequential models
    (passed through to ggml_fit_sequential()).

Test suite

  - New test files: test-gguf.R, test-graph-utils.R, test-inplace-ops.R,
    test-keras-api.R, test-misc-ops.R, test-model-ops.R,
    test-print-methods.R, test-tensor-utils.R, test-threading.R,
    test-autograd-missing.R, test-nn-functional-missing.R,
    test-quants-missing.R.

                        Changes in version 0.6.8                        

Bug fixes

  - Fixed ABI mismatch between src/ and inst/include/ headers: configure
    and configure.win now automatically sync all public headers from
    src/ to inst/include/ at install time. Previously, changes to
    GGML_MAX_DIMS (4→5) and other structs in src/ggml.h were not
    propagated to the exported headers, causing segfaults in downstream
    packages (e.g. sd2R).
  - Added tests/testthat/test-headers-sync.R to verify that
    inst/include/ headers remain in sync with src/ headers and that
    GGML_MAX_DIMS is consistent.
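
A header-sync check of the kind described above might look roughly like
the following. This is an assumption about the approach, not the contents
of test-headers-sync.R: headers_in_sync is a hypothetical helper, and the
two directories are temporary stand-ins for src/ and inst/include/.

```r
# Sketch only: compare checksums of exported headers against their sources.
headers_in_sync <- function(src_dir, include_dir) {
  exported <- list.files(include_dir, pattern = "\\.h$")
  if (length(exported) == 0L) return(TRUE)
  src_sum <- tools::md5sum(file.path(src_dir, exported))
  inc_sum <- tools::md5sum(file.path(include_dir, exported))
  # md5sum() yields NA for a missing file, which counts as out of sync
  all(!is.na(src_sum)) && all(unname(src_sum) == unname(inc_sum))
}

src <- tempfile("src_")
inc <- tempfile("include_")
dir.create(src)
dir.create(inc)
writeLines("#define GGML_MAX_DIMS 5", file.path(src, "ggml.h"))
writeLines("#define GGML_MAX_DIMS 4", file.path(inc, "ggml.h"))  # stale copy
before <- headers_in_sync(src, inc)

invisible(file.copy(file.path(src, "ggml.h"), file.path(inc, "ggml.h"),
                    overwrite = TRUE))
after <- headers_in_sync(src, inc)
```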

                 Changes in version 0.6.7 (2026-03-29)                  

ggml engine: native 5D tensor support

  - ggml_view_5d() — new API function for creating 5D views with
    explicit strides, extending the existing 1D–4D view family. Uses the
    existing ggml_view_impl() internally.
  - ggml_repeat_5d() — new API function for tiling tensors up to 5D. CPU
    kernels (ggml_compute_forward_repeat_f32,
    ggml_compute_forward_repeat_f16) updated with a 5th loop dimension.
    Vulkan dispatch collapses dim3×dim4 into push constants
    transparently (no shader changes needed — push constants remain
    at 128 bytes).
  - ONNX tensor pipeline upgraded from hardcoded 4D to 5D throughout
    onnx_ggml.c (~20 sites):
      - Initializers, inputs, Constant, ConstantOfShape:
        ne[GGML_MAX_DIMS] arrays, switch with case 5: new_tensor_5d.
      - Broadcast (onnx_broadcast_align): all reshape/new_tensor calls
        use dimension-aware helpers.
      - Softmax: reshape-back via generic onnx_reshape_nd().
      - Reshape op: collapse threshold raised from >4D to >5D.
      - Slice: 5D view/offset support, generic stride-based cval
        propagation and deferred fill.
      - Split: 5D view support.
      - Expand: 5D broadcast with rank promotion.
      - Tile: uses ggml_repeat_5d().
      - Gather axis=0: generic reshape-back for any rank.
      - tmap_put_nd() and slice_fill arrays updated to GGML_MAX_DIMS.
  - New internal helpers: onnx_reshape_nd(), onnx_new_tensor_nd(),
    ne_product() — eliminate switch/case duplication.
  - Resize/Interpolate remains 4D (spatial op, 5D not relevant).
    Transpose/Permute remains 4D (ggml_permute API limitation).

ONNX: ConstantOfShape INT64/INT32/DOUBLE value fix

  - roberta-9 model now loads and runs (was producing NaN in softmax).
    Root cause: ConstantOfShape read the value TensorProto attribute as
    float regardless of data_type. When data_type=7 (INT64), the 8-byte
    int64 was reinterpreted as a 4-byte float, producing garbage values
    (~1.4e-45 instead of 1). This broke attention mask generation
    (fill=0 instead of 1) and position ID generation (NonZero on zeros =
    empty).
  - Fix: ConstantOfShape now checks data_type and correctly handles
    INT64, INT32, DOUBLE, and FLOAT value attributes.
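
The root cause is easy to reproduce outside the package. The sketch below
takes the little-endian bytes of the int64 value 1 and decodes them twice:
once as a 4-byte float (the old behaviour) and once as an integer. The
variable names are illustrative only.

```r
# int64 value 1, little-endian: 01 00 00 00 00 00 00 00
bytes <- as.raw(c(1, 0, 0, 0, 0, 0, 0, 0))

# Buggy decode: treat the first 4 bytes as a float32
as_float <- readBin(bytes[1:4], "numeric", n = 1L, size = 4L,
                    endian = "little")

# Type-aware decode: combine two 32-bit halves
# (sufficient for small non-negative values)
halves <- readBin(bytes, "integer", n = 2L, size = 4L, endian = "little")
as_int64 <- halves[1] + halves[2] * 2^32
```

Here as_float is a tiny denormal on the order of 1.4e-45, while as_int64
recovers the intended fill value of 1.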

ONNX: Gather axis=0 on rank>2 tensors

  - Gather on 4D tensors no longer asserts. Previous code always used
    ggml_get_rows which only supports 2D data. For axis=0 on rank>2
    (e.g. CaiT QKV split on [48,576,6,3]), the tensor is now reshaped
    to 2D, gathered, and reshaped back.
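
The reshape, gather, reshape-back idea can be shown on an ordinary R
array. This is an analogy rather than the onnx_ggml.c code: R arrays are
column-major and 1-indexed, so "axis 0" is mapped to R's first dimension
here, and gather_axis0 is a hypothetical name.

```r
# Sketch only: gather along the first dimension of a rank-4 array by
# collapsing the trailing dimensions into one.
gather_axis0 <- function(x, idx) {
  d <- dim(x)
  flat <- x
  dim(flat) <- c(d[1], prod(d[-1]))     # collapse trailing dims to 2D
  rows <- flat[idx, , drop = FALSE]     # plain 2D row gather
  dim(rows) <- c(length(idx), d[-1])    # restore trailing dims
  rows
}

x <- array(seq_len(4 * 3 * 2 * 2), dim = c(4, 3, 2, 2))
idx <- c(3L, 1L, 3L)
out <- gather_axis0(x, idx)
direct <- x[idx, , , , drop = FALSE]    # reference: direct N-d indexing
```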

ONNX: ScatterElements op (GPU + CPU)

  - New GGML_OP_SCATTER_ELEMENTS added to the ggml engine with both CPU
    kernel and Vulkan compute shader.
  - Vulkan shader (scatter_elements.comp): two variants compiled at
    install time — scatter_elements_none (overwrite) and
    scatter_elements_add (atomicAdd via GL_EXT_shader_atomic_float).
    Data is copied to output via vkCmdCopyBuffer with a pipeline barrier
    before the scatter dispatch.
  - CPU kernel: single-threaded scatter with memcpy (overwrite) or
    element-wise addition (reduce=add).
  - ONNX mapper: ScatterElements op with axis=0 and
    reduction="none"/"add" attributes. Indices cast to I32, updates/data
    cast to F32 automatically.
  - This unblocks sageconv (GNN message passing with scatter-add).
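
For reference, the ScatterElements semantics for axis=0 on a 2D tensor can
be written in a few lines of R. This is a sketch of the ONNX operator
definition, not of the CPU kernel or the Vulkan shader;
scatter_elements_axis0 is a hypothetical name and negative indices are not
handled.

```r
# Sketch only: out[indices[i, j], j] receives updates[i, j] (axis = 0).
scatter_elements_axis0 <- function(data, indices, updates,
                                   reduction = c("none", "add")) {
  reduction <- match.arg(reduction)
  out <- data
  for (i in seq_len(nrow(indices))) {
    for (j in seq_len(ncol(indices))) {
      r <- indices[i, j] + 1L             # ONNX indices are 0-based
      out[r, j] <- if (reduction == "add") {
        out[r, j] + updates[i, j]         # accumulate
      } else {
        updates[i, j]                     # overwrite
      }
    }
  }
  out
}

indices <- matrix(c(1L, 0L, 2L,
                    0L, 2L, 1L), nrow = 2, byrow = TRUE)
updates <- matrix(c(1.0, 1.1, 1.2,
                    2.0, 2.1, 2.2), nrow = 2, byrow = TRUE)
res_none <- scatter_elements_axis0(matrix(0, 3, 3), indices, updates)

res_add <- scatter_elements_axis0(
  matrix(0, 2, 2),
  matrix(c(0L, 0L, 0L, 1L), nrow = 2, byrow = TRUE),
  matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE),
  reduction = "add"
)
```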

Model count

  - 12/15 ONNX Model Zoo models now pass (was 11/15). New: roberta-9.
  - Remaining failures: sageconv (ScatterElements shape mismatch needs
    further work), cait_xs24_384 (reshape size mismatch),
    MaskRCNN-12-int8 (spatial broadcast mismatch), xcit_tiny (broadcast
    dim mismatch).

                        Changes in version 0.6.6                        

ONNX: BoTNet RelPosBias2D fused custom op

  - botnet26t_256 model now loads and runs (was failing on 5D Transpose
    in pos_embed subgraph). Three pos_embed subgraphs (~60-80 ONNX nodes
    each) are detected via pre-pass scanner and replaced with a single
    fused ggml_map_custom3 op. The CPU kernel computes 2D relative
    position bias directly: bias[b,hq,wq,hk,wk] = dot(x, W_h) +
    dot(x_transposed, W_w).
  - Pre-pass scanner: detect_pos_embed_blocks() identifies contiguous
    node ranges with /pos_embed/ in output names, extracts W_h/W_w
    initializer shapes to determine H, W, C, validates F32 data type.
  - Model count: 13/15 ONNX Model Zoo models now pass (was 12/15).

ONNX: pinned staging buffer for GPU input transfer

  - When Vulkan GPU is available, a host-visible pinned memory buffer is
    allocated at model load time for ONNX input data. In
    onnx_ggml_run(), input data is copied into pinned memory before
    ggml_backend_tensor_set() — the Vulkan driver detects the pinned
    source pointer and performs direct DMA transfer to VRAM, bypassing
    the internal staging copy.
  - Fallback: if ggml_backend_vk_host_buffer_type() returns NULL or
    buffer is too small, the standard staging path is used
    transparently.

Bug fixes

  - onnx_device_info(): added guards for a NULL ctx->graph and for
    n_nodes == 0, edge cases that caused a segfault when the function
    was called on a model before its first inference run.

                        Changes in version 0.6.5                        

Bug fixes

  - ggml_predict() with stochastic dropout: nn_build_graph() now
    receives training = FALSE during inference, so stochastic Bernoulli
    dropout is disabled at predict time. Previously, stochastic = TRUE
    dropout layers applied random masks during inference, degrading
    accuracy.
  - ggml_fit() return value: the return value of ggml_fit() must be
    assigned back to model to obtain trained weights (model <-
    ggml_fit(...)). This is now clarified in all examples and
    documentation. Using history <- ggml_fit(...) without reassigning
    model leaves the model with untrained weights.
  - ggml_evaluate() return value: now includes n_samples in addition to
    loss and accuracy. Metrics are computed on all samples without
    truncation (via ggml_predict() internally).

Examples

  - inst/examples/titanic_classification.R — new end-to-end binary
    classification example on the Titanic dataset. Demonstrates feature
    engineering (Title, FamilySize, IsAlone), stratified train/val
    split, one-hot encoding, dropout regularization, and manual
    validation metrics (accuracy, precision, recall, F1, confusion
    matrix). Achieves ~82% val accuracy.

ONNX inference: dedicated weight buffer architecture

  - Zero-overhead repeated inference: weights are loaded to GPU (or CPU)
    once via a dedicated weight_buf and never re-transferred between
    runs. Previous architecture reloaded all weights before every
    onnx_run() call — eliminated entirely.
  - Separate ctx_weight / ctx contexts: weight tensors live in a
    permanent GPU buffer that the scheduler never aliases; compute
    tensors are managed by ggml_backend_sched independently.
  - GPU speedups from eliminated weight reload (vs 0.6.3):
      - SuperResolution: 354 ms → 7 ms (48x)
      - BERT: 100 ms → 15 ms (7x)
      - Inception V3 Op18: 106 ms → 14 ms (7x)
      - Inception V3: 24 ms → 14 ms (1.7x)
      - EmotionFerPlus: 4.7 ms → 1.7 ms (2.8x)
      - BAT-ResNeXt: 14 ms → 9 ms (1.6x)
  - onnx_device_info() — scheduler diagnostic: number of splits, GPU/CPU
    op counts, CPU-only op list.
  - GPT-NeoX model now loads and runs successfully (was failing on shape
    propagation).
  - Benchmark script (inst/examples/benchmark_onnx.R): proper VRAM
    cleanup between models via rm() + gc().

                 Changes in version 0.6.3 (2026-03-18)                  

ONNX model import

  - onnx_load(path, device, input_shapes) — load an ONNX model file,
    build a ggml computation graph, and allocate tensors on Vulkan GPU
    or CPU. Weights are loaded via memory-mapped file (zero-copy where
    possible).
  - onnx_run(model, inputs) — run inference on a loaded ONNX model with
    named input data.
  - onnx_inputs(model) — list expected input tensor names and shapes.
  - onnx_summary(model) — return model metadata (IR version, opset,
    producer, ops used).
  - print.onnx_model() — formatted summary of a loaded ONNX model.
  - Built-in zero-dependency protobuf parser: no external libraries or
    Python required.
  - input_shapes parameter for models with dynamic dimensions: specify
    fixed shapes at load time (e.g. input_shapes = list(image =
    c(1L, 3L, 224L, 224L))).
  - 40+ supported ONNX ops: Add, Sub, Mul, Div, MatMul, Gemm, Conv
    (1D/2D), ConvTranspose (1D/2D), Relu, Sigmoid, Tanh, GELU, SiLU,
    LeakyRelu, Elu, Softmax, MaxPool, AveragePool, GlobalAveragePool,
    BatchNormalization, LayerNormalization, GroupNormalization,
    RMSNormalization, Reshape, Transpose, Concat, Flatten, Squeeze,
    Unsqueeze, Gather, Pad, Clip, Cast, Constant, ConstantOfShape,
    Shape, Expand, Slice, Split, Where, Erf, Pow, Sqrt, Exp, Log, Abs,
    Neg, Floor, Ceil, ReduceMean, ReduceSum, Resize/Upsample, Identity,
    Dropout.
  - auto_pad attribute (SAME_UPPER, SAME_LOWER) supported for Conv and
    pooling ops.
  - Numpy-style broadcast for binary ops (Add/Sub/Mul/Div): handles
    mismatched ranks and dimensions, with left-align, right-align, and
    greedy dim-matching strategies.
  - Scalar Constant tensors (0-dimensional TensorProto) correctly
    handled.
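
As a reminder of the standard numpy-style rule, the sketch below computes
the broadcast result shape for two operands by right-aligning their
shapes. It covers only that one rule; the left-align and greedy
dim-matching strategies listed above are specific to the importer and are
not reproduced. broadcast_shape is a hypothetical helper.

```r
# Sketch only: right-aligned broadcast of two shapes.
broadcast_shape <- function(a, b) {
  n <- max(length(a), length(b))
  a <- c(rep(1L, n - length(a)), a)    # pad the shorter shape with leading 1s
  b <- c(rep(1L, n - length(b)), b)
  ok <- a == b | a == 1L | b == 1L
  if (!all(ok)) stop("shapes are not broadcast-compatible")
  pmax(a, b)
}

s1 <- broadcast_shape(c(3L, 1L, 5L), c(4L, 5L))
s2 <- broadcast_shape(c(8L), c(2L, 1L))
err <- tryCatch(broadcast_shape(c(2L, 3L), c(4L, 3L)),
                error = function(e) "error")
```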

Tested real-world ONNX models (13/15 from ONNX Model Zoo)

  - mnist-8 — OK (12 nodes)
  - squeezenet1.0-8 — OK (66 nodes: Conv, Relu, MaxPool, Concat,
    Dropout, GlobalAveragePool, Softmax)
  - adv_inception_v3 Opset 17/18 — OK (215 nodes)
  - super-resolution-10 — OK with input_shapes (Conv, Reshape,
    Transpose)
  - bert Opset 17 — OK (533 nodes: MatMul, Add, LayerNorm, GELU/Erf,
    Softmax, Shape, Gather, Cast, Where, ConstantOfShape)
  - emotion-ferplus-8 — OK (52 nodes: Conv, Relu, MaxPool, Reshape,
    Gemm, Constant)
  - sageconv Opset 16 — OK (24 nodes: MatMul, Add, Mul, Sigmoid,
    ReduceSum)
  - roberta-sequence-classification-9 — OK with input_shapes (1180
    nodes)
  - bat_resnext26ts Opset 18 — OK (570 nodes: Conv, BatchNorm, SiLU,
    Concat, Expand, Split)
  - gptneox Opset 18 — OK with input_shapes (482 nodes: MatMul,
    LayerNorm, GELU, Softmax)
  - xcit_tiny — OK (436 nodes: MatMul, LayerNorm, Softmax, Concat,
    Transpose)
  - MaskRCNN-12-int8 — OK (937 nodes: QLinearConv, DequantizeLinear,
    Resize, Concat, Reshape)
  - botnet26t_256 (Opset 16) — OK (RelPosBias2D fused custom op, 3
    pos_embed blocks replaced)
  - Remaining failures: cait_xs24_384 (batched matmul 3D+).

                        Changes in version 0.6.2                        

  - Fixed Windows cleanup script that removed inst/lib/libggml.a,
    breaking static linking from dependent packages (e.g. llamaR).

                 Changes in version 0.6.1 (2026-02-22)                  

  - dp_train(make_model, data, loss_fn, forward_fn, target_fn, n_gpu,
    n_iter, lr, max_norm, verbose) — data-parallel training across
    multiple replicas. Weights are broadcast from replica 0 before the
    first step; gradients are averaged across replicas each iteration;
    weights are re-broadcast after each optimizer update. Returns
    list(params, loss_history, model).
  - ag_mul and ag_sub now support CPU broadcast: [d×s] * [1×s] and [d×s]
    * [d×1] shapes work correctly with proper gradient reduction.
  - ag_softmax_cross_entropy_loss accepts integer target vectors
    (0-based class indices) and converts them to one-hot automatically.
  - ggml_sum_rows f16 on Vulkan: F16→F16 dispatch now supported natively
    (no CPU fallback).
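
The gradient-averaging scheme that dp_train() is described as using can be
illustrated on a toy least-squares problem. This sketch does not call
dp_train() and involves no GPUs; the "replicas" are simply equal-sized
shards of the data, and all names are illustrative.

```r
# Sketch only: data-parallel SGD with gradient averaging, in base R.
set.seed(42)
n <- 64
p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- as.vector(x %*% c(1, -2, 0.5) + rnorm(n, sd = 0.1))

mse_grad <- function(w, x, y) {
  # gradient of mean((x %*% w - y)^2) with respect to w
  as.vector(2 * crossprod(x, x %*% w - y) / nrow(x))
}

n_replicas <- 2L
shards <- split(seq_len(n), rep(seq_len(n_replicas), each = n / n_replicas))
w <- rep(0, p)                  # same initial weights on every replica
lr <- 0.1
for (iter in seq_len(200)) {
  grads <- lapply(shards, function(idx) {
    mse_grad(w, x[idx, , drop = FALSE], y[idx])
  })
  g <- Reduce(`+`, grads) / n_replicas   # average across replicas
  w <- w - lr * g                        # one update shared by all replicas
}
```

With equal shard sizes the averaged gradient equals the full-batch
gradient, so w should approach the ordinary least-squares solution.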

                        Changes in version 0.6.0                        

Dynamic autograd engine (PyTorch-style training)

  - ag_tensor() / ag_param() — environment-backed tensors with reference
    semantics; in-place optimizer updates visible to all references.
  - with_grad_tape({ ... }) — enables the global gradient tape for the
    enclosed forward pass.
  - backward(loss) — reverse-mode automatic differentiation; returns a
    gradient environment keyed by tensor id.
  - Differentiable ops: ag_matmul, ag_add (with bias broadcast), ag_sub,
    ag_mul, ag_scale.
  - Activations: ag_relu, ag_sigmoid, ag_tanh, ag_softmax.
  - Reduction / math ops: ag_sum, ag_mean, ag_log, ag_exp, ag_pow,
    ag_clamp.
  - Shape ops: ag_reshape, ag_transpose.
  - Loss functions: ag_mse_loss, ag_cross_entropy_loss,
    ag_softmax_cross_entropy_loss (numerically-stable fused).
  - optimizer_sgd() — SGD with optional momentum.
  - optimizer_adam() — Adam with bias-corrected moment estimates.
  - ag_linear() — Glorot-initialised dense layer (closure-based, returns
    $forward, $params()).
  - ag_gradcheck() — central finite-difference gradient checker (like
    torch.autograd.gradcheck).
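
The idea behind a central finite-difference gradient check is compact
enough to sketch in base R. This is not the ag_gradcheck() implementation
or signature; numeric_grad and gradcheck are hypothetical helpers that
compare an analytic gradient against a numerical one.

```r
# Sketch only: central finite differences versus an analytic gradient.
numeric_grad <- function(f, x, eps = 1e-5) {
  vapply(seq_along(x), function(i) {
    xp <- x
    xm <- x
    xp[i] <- xp[i] + eps
    xm[i] <- xm[i] - eps
    (f(xp) - f(xm)) / (2 * eps)        # central difference
  }, numeric(1))
}

gradcheck <- function(f, grad_f, x, eps = 1e-5, tol = 1e-6) {
  max(abs(numeric_grad(f, x, eps) - grad_f(x))) < tol
}

f <- function(x) sum(x^2) + prod(x)
grad_f <- function(x) 2 * x + prod(x) / x   # valid when no element is zero
x0 <- c(0.5, -1.5, 2)

ok <- gradcheck(f, grad_f, x0)
bad <- gradcheck(f, function(x) 2 * x, x0)  # deliberately wrong gradient
```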

Layer objects (environment-based, train/eval modes)

  - ag_sequential(...) — ordered layer container; collects all
    parameters for the optimizer.
  - ag_dropout(rate) — inverted dropout; identity in eval mode.
  - ag_batch_norm(num_features) — batch normalisation with running
    statistics and learnable γ/β.
  - ag_embedding(vocab_size, dim) — token lookup with scatter-add
    backward.
  - ag_train(model) / ag_eval(model) — switch all sub-layers between
    train and eval mode.
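
Inverted dropout, as named in the ag_dropout() entry, rescales the kept
activations during training so that nothing needs to change at eval time.
A minimal base-R sketch under that definition follows; the dropout
function and its training flag are illustrative and do not reflect the
package's layer object API.

```r
# Sketch only: inverted dropout on a numeric vector.
dropout <- function(x, rate, training) {
  if (!training || rate == 0) return(x)     # eval mode: identity
  keep <- rbinom(length(x), size = 1L, prob = 1 - rate)
  x * keep / (1 - rate)                     # rescale so the mean is preserved
}

set.seed(7)
x <- rep(1, 10000)
train_out <- dropout(x, rate = 0.25, training = TRUE)
eval_out <- dropout(x, rate = 0.25, training = FALSE)
```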

Training utilities

  - ag_dataloader(x, y, batch_size, shuffle, col_major) — mini-batch
    iterator with shuffle and $epoch() helper.
  - lr_scheduler_step(optimizer, step_size, gamma) — step-decay learning
    rate.
  - lr_scheduler_cosine(optimizer, T_max, lr_min, restart) —
    cosine-annealing (with optional SGDR warm restarts).
  - clip_grad_norm(params, grads, max_norm) — clips all gradients by
    global L2 norm in-place.
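
The formulas behind these utilities are standard and can be sketched as
pure functions. The names and signatures below are illustrative and differ
from the package's; in particular this clip helper returns a new list
instead of modifying gradients in place, and the cosine schedule omits
warm restarts.

```r
# Sketch only: step decay, cosine annealing, and global-norm clipping.
step_decay_lr <- function(lr0, epoch, step_size, gamma) {
  lr0 * gamma^(epoch %/% step_size)
}

cosine_lr <- function(lr0, epoch, t_max, lr_min = 0) {
  lr_min + 0.5 * (lr0 - lr_min) * (1 + cos(pi * epoch / t_max))
}

clip_by_global_norm <- function(grads, max_norm) {
  total <- sqrt(sum(vapply(grads, function(g) sum(g^2), numeric(1))))
  if (total > max_norm) {
    grads <- lapply(grads, function(g) g * (max_norm / total))
  }
  grads
}

lr_step <- step_decay_lr(0.1, epoch = 25, step_size = 10, gamma = 0.5)
lr_cos <- cosine_lr(0.1, epoch = c(0, 50, 100), t_max = 100, lr_min = 0.01)
clipped <- clip_by_global_norm(list(c(3, 0), c(0, 4)), max_norm = 1)
```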

                        Changes in version 0.5.9                        

  - ggml_layer_lstm() — LSTM recurrent layer (unrolled BPTT).
  - ggml_layer_gru() — GRU recurrent layer (unrolled BPTT).
  - ggml_layer_global_max_pooling_2d() — reduces [H,W,C] to [C] via max
    pooling.
  - ggml_layer_global_average_pooling_2d() — reduces [H,W,C] to [C] via
    average pooling.
  - ggml_save_model() — saves full model (architecture + weights) to RDS
    file.
  - ggml_load_model() — restores a model saved with ggml_save_model().
  - ggml_dense(), ggml_conv_2d(), ggml_conv_1d(), ggml_batch_norm(),
    ggml_embedding(), ggml_lstm(), ggml_gru() — layer object
    constructors returning a reusable ggml_layer object.
  - ggml_apply(tensor, layer) — applies a ggml_layer object to a tensor
    node; shared weights by object identity.

                        Changes in version 0.5.7                        

  - ggml_layer_dropout() — dropout with deterministic or stochastic
    (per-epoch Bernoulli mask) mode.
  - ggml_layer_embedding() — token embedding lookup for integer inputs.
  - ggml_input() gains dtype argument ("float32" or "int32").
  - Multi-output support in ggml_model() and ggml_predict().

                        Changes in version 0.5.6                        

  - ggml_input() — declare a symbolic input tensor node (Functional
    API).
  - ggml_model() — assemble a ggml_functional_model from input/output
    nodes.
  - ggml_layer_add() — element-wise addition of tensor nodes (residual
    connections).
  - ggml_layer_concatenate() — concatenate tensor nodes along an axis.
  - All ggml_layer_*() functions now accept a ggml_tensor_node as first
    argument (Functional API mode).
  - ggml_compile(), ggml_fit(), ggml_evaluate(), ggml_predict() are now
    S3 generics with methods for ggml_functional_model.

                        Changes in version 0.5.5                        

  - ggml_fit_opt() — low-level optimizer loop with callbacks and
    learning-rate control.
  - ggml_callback_early_stopping() — stops training when a metric
    stagnates.
  - ggml_schedule_step_decay() — step learning-rate decay.
  - ggml_schedule_cosine_decay() — cosine learning-rate annealing.
  - ggml_schedule_reduce_on_plateau() — reduces LR when metric stops
    improving.
  - ggml_opt_init_for_fit(), ggml_opt_set_lr(), ggml_opt_get_lr() —
    learning-rate control without recreating the optimizer context.

                        Changes in version 0.5.4                        

  - Vulkan GPU backend support on Windows via configure.win.
  - Vulkan auto-detected at build time on Linux and Windows.

                        Changes in version 0.5.3                        

  - ggml_layer_conv_1d() — 1D convolution layer.
  - ggml_layer_batch_norm() — batch normalization layer.
  - ggml_predict_classes() — argmax wrapper returning 1-based class
    indices.
  - summary.ggml_sequential_model() — detailed model summary with
    parameter counts.
  - ggml_fit() now returns model$history (class ggml_history) with print
    and plot methods.
  - Sequential API: ggml_model_sequential(), ggml_layer_dense(),
    ggml_layer_conv_2d(), ggml_layer_max_pooling_2d(),
    ggml_layer_flatten(), ggml_compile(), ggml_fit(), ggml_evaluate(),
    ggml_predict(), ggml_save_weights(), ggml_load_weights().
  - Vulkan GPU backend covering 90%+ of ML operations.

                        Changes in version 0.5.2                        

  - ggml_timestep_embedding() — sinusoidal timestep embeddings.
  - N-D tensor access: ggml_set_f32_nd(), ggml_get_f32_nd(),
    ggml_set_i32_nd(), ggml_get_i32_nd().
  - Tensor utilities: ggml_tensor_nb(), ggml_tensor_num(),
    ggml_tensor_copy(), ggml_tensor_set_f32_scalar(),
    ggml_get_first_tensor(), ggml_get_next_tensor().
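
A sinusoidal timestep embedding can be sketched in base R as follows. This
is the common diffusion-model formulation, not a transcript of
ggml_timestep_embedding(): the function name, the max_period default, and
the cos-then-sin column layout are assumptions, and odd dimensions are not
handled.

```r
# Sketch only: sinusoidal embedding of scalar timesteps.
timestep_embedding <- function(t, dim, max_period = 10000) {
  half <- dim %/% 2L
  freqs <- exp(-log(max_period) * (seq_len(half) - 1) / half)
  args <- outer(t, freqs)            # one row per timestep
  cbind(cos(args), sin(args))        # [cos | sin] layout is an assumption
}

emb <- timestep_embedding(c(0, 10, 500), dim = 8L)
```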

                 Changes in version 0.5.1 (2026-02-09)                  

  - Static library libggml.a exported for linking by dependent packages.
  - gguf.cpp added for GGUF file format support.
  - Headers exported via inst/include/ for LinkingTo.

                        Changes in version 0.5.0                        

  - Full optimization/training API: ggml_opt_init(), ggml_opt_free(),
    ggml_opt_fit(), ggml_opt_epoch(), ggml_opt_eval().
  - Dataset management: ggml_opt_dataset_init(),
    ggml_opt_dataset_data(), ggml_opt_dataset_labels(),
    ggml_opt_dataset_shuffle().
  - Training results: ggml_opt_result_init(), ggml_opt_result_loss(),
    ggml_opt_result_accuracy(), ggml_opt_result_pred().
  - Extended backend API: device management, registry, async operations,
    graph planning, buffer management (~50 new functions).
  - Loss functions: MSE, cross-entropy. Optimizers: AdamW, SGD.

                        Changes in version 0.4.0                        

  - Multi-GPU backend scheduler API.
  - Vulkan GPU backend support.

                        Changes in version 0.2.0                        

  - Initial release: R bindings for GGML tensor library.
  - Core tensor operations, neural network ops, activation functions,
    quantization (Q4_0, Q4_1, Q8_0), OpenMP parallelization, computation
    graph API.
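
To give a feel for what block quantization does, here is a base-R sketch
of Q8_0-style quantization of one 32-element block: a single scale per
block plus one signed 8-bit value per element. It does not call the
package's quantization functions, and it keeps the scale as a double
rather than the half-precision value stored in the real format.

```r
# Sketch only: quantize and dequantize one 32-element block, Q8_0 style.
quantize_q8_0_block <- function(x) {
  stopifnot(length(x) == 32L)
  d <- max(abs(x)) / 127                       # per-block scale
  q <- if (d > 0) as.integer(round(x / d)) else integer(32)
  list(d = d, q = q)
}

dequantize_q8_0_block <- function(block) block$d * block$q

set.seed(3)
x <- rnorm(32)
blk <- quantize_q8_0_block(x)
x_hat <- dequantize_q8_0_block(blk)
max_err <- max(abs(x - x_hat))                 # bounded by about d / 2
```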