                        Changes in version 0.2.1                        

  - FLUX.2 (Klein 4B) support via model_type = "flux2", with
    auto-detection from tensors/filename.
  - New llm_path argument in sd_ctx() for LLM text encoders (Qwen3 for
    FLUX.2 Klein, Mistral-Small for full FLUX.2).
  - Inpainting: new mask argument in sd_img2img() regenerates only the
    masked region. Accepts a PNG path, a numeric matrix, or an SD image
    (white = generate). Works on plain SD, SDXL, FLUX.1, and FLUX.2
    weights via the denoise mask. New helper sd_load_mask().
  - Shiny GUI now shares sd_generate()'s auto-routing (CFG, VAE tiling,
    highres-fix), fixing FLUX.2 VAE-decode crashes.
  - New meta_backend argument in sd_ctx(): runs the diffusion model
    through the ggml meta backend for multi-GPU tensor split (a single
    model sharded across all GPUs). Requires ggmlR >= 0.7.8; falls back
    to the normal single-backend path otherwise. The Shiny GUI enables
    it automatically for FLUX.2.
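
    As an illustration of the numeric-matrix form of the mask, the
    sketch below builds one in base R. It assumes that rows index
    height, columns index width, and that values lie in [0, 1] with
    1 meaning white (regenerate); none of these conventions are
    confirmed by the notes above, so check the sd_img2img() help page
    before relying on them.

```r
# Build a 512 x 512 mask: 0 = keep (black), 1 = regenerate (white).
# Assumption: rows index height, columns index width, values lie in [0, 1].
height <- 512L
width  <- 512L
mask <- matrix(0, nrow = height, ncol = width)

# Mark a centered 256 x 256 square as the region to regenerate.
rows <- 129:384
cols <- 129:384
mask[rows, cols] <- 1

# Fraction of pixels that will be regenerated.
masked_fraction <- mean(mask)

# Hypothetical call: only the `mask` argument name comes from the notes
# above; `ctx`, `prompt`, and `init_image` are assumed names.
# out <- sd_img2img(ctx, prompt = "a red door", init_image = init, mask = mask)
```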

                        Changes in version 0.2.0                        

Performance: VAE Decode

  - vae_conv_direct now defaults to TRUE — VAE decoder uses
    GGML_OP_CONV_2D (direct convolution via conv2d_mm.comp) instead of
    the legacy IM2COL + MUL_MAT path.
  - On RX 9070 (RDNA4) with Vulkan coopmat (KHR), VAE decode for
    768×768 drops from 12.6 s to 0.5 s.
  - All convolutions now run through the coopmat cm1 path (~16-17
    TFLOPS) when coopmat_support is available; a scalar FMA fallback
    is used otherwise.

                 Changes in version 0.1.9 (2026-04-09)                  

Shiny GUI

  - New sd_app() launches an interactive Shiny application for image
    generation.
      - Auto-detection of model architecture (Flux, SD3, SDXL, SD1/2)
        from filenames in the models folder — no manual configuration
        needed.
      - Non-blocking async generation via C++ std::thread: the UI
        remains responsive during image generation, with a live progress
        bar and ETA display.
      - Automatic role assignment for multi-file models (diffusion, VAE,
        CLIP-L, T5-XXL).
      - Prevents loading incompatible model combinations (e.g. SD1.5 +
        Flux).

Async C++ Generation API

  - New internal functions for non-blocking generation from R:
      - sd_generate_async() — launches generation in a background C++
        thread.
      - sd_generate_poll() — checks completion status (atomic flags).
      - sd_generate_result() — retrieves results after completion.
  - Progress callback writes JSON to a temp file (step, steps, pct,
    elapsed, eta_sec), read by Shiny via later::later() polling.
  - R API calls (Rprintf, R_CheckUserInterrupt) are suppressed in the
    worker thread to prevent stack corruption.
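
    The progress file described above could be read along these
    lines. This is a sketch only: the field names come from the notes,
    but the file location, the use of jsonlite, the 0-100 scale of
    pct, and the read_progress() helper are all assumptions made for
    illustration.

```r
library(jsonlite)

# Simulate the progress file the worker thread would write.
# Field names follow the notes above; the file path is hypothetical.
progress_file <- tempfile(fileext = ".json")
writeLines(
  '{"step": 5, "steps": 20, "pct": 25, "elapsed": 3.2, "eta_sec": 9.6}',
  progress_file
)

# Read the file defensively: a poll can land while the writer is
# mid-update, so treat a parse failure as "no new data yet".
read_progress <- function(path) {
  if (!file.exists(path)) return(NULL)
  tryCatch(fromJSON(path), error = function(e) NULL)
}

p <- read_progress(progress_file)
if (!is.null(p)) {
  message(sprintf("step %d/%d (%.0f%%), ETA %.1f s",
                  p$step, p$steps, p$pct, p$eta_sec))
}
```

    In a Shiny session a helper like this would typically be called
    from a later::later() callback that reschedules itself until the
    job reports completion.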

Build System

  - tools/patch_sd_sources.sh rewritten: all sed calls replaced with
    perl -pi -e for cross-platform compatibility (macOS BSD sed + Linux
    GNU sed).

                        Changes in version 0.1.8                        

Bug Fixes

  - Fixed the "undefined symbol: ggml_backend_vk_get_device_count" load
    error on CRAN Fedora (clang and gcc). Root cause: ggmlR's shared
    library (ggmlR.so) was built with Vulkan, but the static library
    (libggml.a) shipped without Vulkan objects. The old configure relied
    on ggml_vulkan_status(), which queries ggmlR.so; it reported
    "AVAILABLE", causing sd2R to compile with -DSD_USE_VULKAN against a
    libggml.a that lacked the symbols. configure now runs nm on
    libggml.a and looks for a defined (T) symbol directly, ignoring the
    runtime ggmlR check entirely.
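
    The idea behind the nm check can be sketched as follows. The real
    configure script is shell; this R version is only an illustration,
    and both the has_defined_symbol() helper and the choice of symbol
    to look for are assumptions. The sample nm output is made up.

```r
# Return TRUE if `nm` output lists `symbol` as defined in the text section.
# nm prints "<address> <type> <name>"; type "T" means defined, "U" undefined.
# The optional leading underscore covers platforms that prefix C symbols.
has_defined_symbol <- function(nm_lines, symbol) {
  pattern <- paste0("^[0-9A-Fa-f]+ T _?", symbol, "$")
  any(grepl(pattern, nm_lines))
}

# Example nm output (invented for illustration).
with_vulkan <- c(
  "0000000000000000 T ggml_backend_vk_get_device_count",
  "                 U vkCreateInstance"
)
without_vulkan <- c(
  "                 U ggml_backend_vk_get_device_count"
)

sym <- "ggml_backend_vk_get_device_count"
found_with    <- has_defined_symbol(with_vulkan, sym)
found_without <- has_defined_symbol(without_vulkan, sym)

# In a real check the lines would come from something like:
# nm_lines <- system2("nm", "path/to/libggml.a", stdout = TRUE)
```

    An undefined (U) reference is not enough: it only shows that some
    object wants the symbol, not that the archive provides it.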

                 Changes in version 0.1.7 (2026-03-30)                  

Multi-GPU Model Parallelism

  - New device_layout parameter in sd_ctx(): distribute sub-models
    across multiple Vulkan GPUs without separate processes.
      - "mono" — all on one GPU (default, backward-compatible).
      - "split_encoders" — CLIP/T5 on GPU 1, diffusion + VAE on GPU 0.
      - "split_vae" — CLIP/T5 + VAE on GPU 1, diffusion on GPU 0.
      - "encoders_cpu" — text encoders on CPU, diffusion + VAE on GPU.
  - Low-level diffusion_gpu, clip_gpu, vae_gpu integer arguments for
    manual device assignment (override presets).
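
    A minimal sketch of selecting a preset, using only argument names
    that appear in these notes. The file paths are placeholders, and
    whether sd_ctx() needs further arguments for a given model is not
    covered here.

```r
# Paths are placeholders; point them at real model files.
ctx_args <- list(
  diffusion_model_path = "models/flux1-dev.gguf",
  vae_path             = "models/ae.safetensors",
  clip_l_path          = "models/clip_l.safetensors",
  t5xxl_path           = "models/t5xxl.safetensors",
  # Preset from the list above: CLIP/T5 on GPU 1, diffusion + VAE on GPU 0.
  device_layout        = "split_encoders"
)

# Only build the context when the package and all files are present.
paths <- unlist(ctx_args[c("diffusion_model_path", "vae_path",
                           "clip_l_path", "t5xxl_path")])
if (requireNamespace("sd2R", quietly = TRUE) && all(file.exists(paths))) {
  ctx <- do.call(sd2R::sd_ctx, ctx_args)
}
```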

Profiling

  - New profiling API for per-stage timing of image generation:
      - sd_profile_start() / sd_profile_stop() — control event capture.
      - sd_profile_get() — raw event data frame.
      - sd_profile_summary() — formatted summary with durations and
        percentages.
  - Stages tracked: text_encode (with text_encode_clip and
    text_encode_t5 sub-stages), sampling, vae_decode, vae_encode, model
    loading.
  - Pretty-printed output via print.sd_profile().
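
    One way to wrap a generation call with the profiling functions is
    sketched below. It assumes that sd_profile_start(),
    sd_profile_stop(), and sd_profile_summary() can be called without
    arguments, which these notes do not state; profile_run() is a
    hypothetical helper, not part of the package.

```r
# Run `run()` with profiling enabled and return its result together
# with the summary. Assumption: the three profiling functions take no
# required arguments.
profile_run <- function(run) {
  sd2R::sd_profile_start()
  # Always stop event capture, even if generation fails.
  result <- tryCatch(run(), finally = sd2R::sd_profile_stop())
  list(result = result, summary = sd2R::sd_profile_summary())
}

# Hypothetical usage; `ctx` and the sd_generate() arguments are assumed:
# out <- profile_run(function() sd2R::sd_generate(ctx, prompt = "a lighthouse"))
# print(out$summary)
```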

                        Changes in version 0.1.6                        

Pipeline Graph API

  - New sd_pipeline() / sd_node() — sequential graph-based pipeline.
    Node types: "txt2img", "img2img", "upscale", "save".
  - sd_run_pipeline(pipeline, ctx) — execute pipeline with a single
    context.
  - sd_save_pipeline() / sd_load_pipeline() — JSON serialization.

                        Changes in version 0.1.5                        

Flux Support

  - Flux model family (flux1-dev, etc.) fully supported: text-to-image,
    image-to-image, highres fix, tiled sampling, multi-GPU.
  - Separate model paths: diffusion_model_path, vae_path, clip_l_path,
    t5xxl_path in sd_ctx().
  - cfg_scale auto-defaults to 1.0 for Flux (guidance-distilled models).

img2img Improvements

  - sd_generate() now defaults width/height to init image dimensions
    when not specified explicitly.

                        Changes in version 0.1.4                        

Build System

  - configure.win rewritten to use template approach (Makevars.win.in →
    Makevars.win), matching ggmlR pattern.

                        Changes in version 0.1.3                        

Unified sd_generate() Entry Point

  - New sd_generate() — single function for all generation modes.
    Automatically selects the optimal strategy (direct, tiled sampling,
    or highres fix) based on output resolution and available VRAM.
  - vram_gb parameter in sd_ctx(): set once, auto-routing handles the
    rest.

Multi-GPU

  - New sd_generate_multi_gpu() — parallel generation across multiple
    Vulkan GPUs via callr, one process per GPU, with progress reporting.

Performance

  - Batch compute optimization for tiled sampling: pre-allocated compute
    context buffer eliminates ~110 MB malloc/free per UNet call.

                        Changes in version 0.1.2                        

Highres Fix

  - New sd_highres_fix() — classic two-pass highres pipeline: txt2img at
    native resolution → upscale → tiled img2img refinement.
  - hr_strength parameter (default 0.4) controls refinement intensity.

Tiled img2img

  - New sd_img2img_tiled() — img2img with MultiDiffusion tiled sampling
    for large images.

                        Changes in version 0.1.1                        

VAE Tiling

  - New vae_mode parameter: "normal", "tiled", "auto" (default).
    Auto-tiles when image area exceeds threshold.
  - vae_tile_rel_x / vae_tile_rel_y for adaptive tile sizing.

High-Resolution Pipeline

  - New sd_txt2img_highres() — patch-based generation for 2K, 4K+
    images.
  - model_type parameter in sd_ctx(): "sd1", "sd2", "sdxl", "flux",
    "sd3".

Tiled Sampling (MultiDiffusion)

  - New sd_txt2img_tiled() — tiled diffusion sampling at any resolution.
    VRAM bounded by tile size, not output resolution.
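
    To see why memory depends on the tile rather than the output, it
    helps to look at how an image is covered by overlapping tiles.
    The sketch below is a generic illustration, not the package's
    actual tiling code; tile_starts() and the tile and overlap values
    are assumptions.

```r
# Start offsets of overlapping tiles covering `size` pixels along one axis.
tile_starts <- function(size, tile, overlap) {
  stopifnot(tile > overlap, size >= tile)
  stride <- tile - overlap
  starts <- seq(0, size - tile, by = stride)
  # Add a final tile flush with the edge if the regular grid falls short.
  if (tail(starts, 1) + tile < size) starts <- c(starts, size - tile)
  starts
}

xs <- tile_starts(2048, tile = 512, overlap = 64)
ys <- tile_starts(1024, tile = 512, overlap = 64)

# Each tile is processed on its own, so the working set is one
# 512 x 512 tile no matter how many tiles the output needs.
n_tiles <- length(xs) * length(ys)
```

    Doubling the output size increases n_tiles but leaves the
    per-tile cost unchanged, which is the property the entry above
    describes.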

                        Changes in version 0.1.0                        

Core

  - Text-to-image generation via stable-diffusion.cpp (C++ backend).
  - Support for SD 1.x, SD 2.x, SDXL model versions.
  - SafeTensors and GGUF model format loading.
  - Vulkan GPU backend via ggmlR.
  - Samplers: Euler, Euler A, Heun, DPM2, DPM++ (2M), LCM, DDIM, TCD.
  - Schedulers: Discrete, Karras, Exponential, Simple, SGM Uniform, AYS,
    LCM.

R API

  - sd_ctx() — create model context.
  - sd_generate() — unified entry point.
  - sd_txt2img(), sd_img2img() — low-level generation.
  - sd_save_image(), sd_system_info().