Breaking changes in the backend (transparent to R users):

- Migrated from the llama_kv_self_* API to the llama_memory_* API
- Supports heterogeneous model architectures (see the sketch after this list):
  - Standard Transformers (LLaMA, Qwen, Mistral, etc.)
  - Mamba/RWKV (State Space Models)
  - Hybrid models (Jamba, LFM2)
  - Sliding Window Attention (Qwen2-MLA)
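Since the architecture handling lives entirely in the C++ backend, non-Transformer models are loaded through the same R calls as before. A minimal sketch, assuming you have a Mamba-architecture GGUF file on disk (the file name below is a placeholder, not a bundled model):

```r
library(localLLM)

backend_init()

# The same loader call works for Transformer, Mamba/RWKV, and hybrid GGUF
# files; the backend selects the appropriate memory layout per architecture.
model <- model_load("mamba-1.4b-q4_k_m.gguf")   # placeholder path
ctx   <- context_create(model, n_ctx = 512)

cat(generate(ctx, "State space models are", max_tokens = 20))
```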
Key improvements:

- Better memory management and automatic defragmentation
- Enhanced support for parallel inference with shared prefixes
- Improved reproducibility of generation results (see the sketch after the compatibility example below)
- More efficient batch processing
Additional backend changes:

- Batch handling migrated from llama_batch_get_one() to llama_batch_init() + common_batch_add() + llama_batch_free()
- Every generate() call starts from a clean state (demonstrated below)
- New n_threads_batch parameter for batch processing

No changes to the R-level API - all existing R code continues to work without modification:

```r
library(localLLM)
backend_init()
model <- model_load("model.gguf")
ctx <- context_create(model, n_ctx = 512)
result <- generate(ctx, "Hello", max_tokens = 10)
# All existing code works exactly the same
```
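As noted under the key improvements, each generate() call now starts from a clean state, which also makes results easier to reproduce across calls. A minimal sketch (the model path is a placeholder; whether the two outputs match exactly still depends on your sampling settings):

```r
library(localLLM)

backend_init()
model <- model_load("model.gguf")
ctx   <- context_create(model, n_ctx = 512)

# Two identical calls: no state from the first generation leaks into the
# second, so with deterministic sampling settings the outputs should match.
first  <- generate(ctx, "The capital of France is", max_tokens = 8)
second <- generate(ctx, "The capital of France is", max_tokens = 8)

identical(first, second)
```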
Backend build script: backend/llama.cpp/build_localllm.sh

Updated files:

- custom_files/localllm_capi.cpp (10 locations modified)
  - Memory API migration (8 locations)
  - Batch API modernization (2 locations)
  - Error handling improvements
  - Thread configuration updates
Unchanged:

- custom_files/localllm_capi.h (C API interface)
- All R layer code (R/*.R)
- Proxy layer (src/proxy.cpp)
- Test suite (tests/testthat/*.R)
- Documentation
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM()  # Will download the new b7825 backend
```

If upgrading from a previous version:

```r
remove.packages("localLLM")
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM(force = TRUE)  # Force reinstall backend
```
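After reinstalling, a quick way to confirm that the new backend was downloaded and works end to end is to rerun the quick-start calls from above; a minimal smoke test (the model path is a placeholder for any GGUF file you already have locally):

```r
library(localLLM)

backend_init()   # should now load the freshly installed b7825 backend

model <- model_load("model.gguf")            # placeholder: any local GGUF file
ctx   <- context_create(model, n_ctx = 512)

cat(generate(ctx, "Hello", max_tokens = 10))
```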
New technical documentation:

- UPGRADE_COMPLETE.md - Complete upgrade report
- CRITICAL_CHANGES_REQUIRED.md - Detailed change checklist
- MIGRATION_ANALYSIS_b5421_to_b7785.md - Full migration analysis
- Architecture deep-dive in planning documents
Potential optimizations for future releases:

- Flash Attention support for improved performance
- Unified Buffer optimization for multi-sequence inference
- SWA (Sliding Window Attention) for ultra-long contexts (128K+)
For more information about llama.cpp, see:

- llama.cpp releases
- llama.cpp documentation