localLLM 1.2.0

Major Changes

Backend Upgrade: llama.cpp b5421 → b7825

Core Architecture Migration: KV Cache → Unified Memory API

Breaking changes in the backend (transparent to R users):
- Migrated from the llama_kv_self_* API to the llama_memory_* API
- Supports heterogeneous model architectures:
  - Standard Transformers (LLaMA, Qwen, Mistral, etc.)
  - Mamba/RWKV (State Space Models)
  - Hybrid models (Jamba, LFM2)
  - Sliding Window Attention (Qwen2-MLA)
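
For illustration, a minimal C++ sketch of what this call-site migration looks like, using names from llama.cpp's public C API (llama_get_memory, llama_memory_seq_rm). This is an assumed example under those API names, not code taken from the package's own source:

#include "llama.h"

// Old style (b5421 era): per-sequence cache state was manipulated through
// the llama_kv_self_* family, e.g.
//   llama_kv_self_seq_rm(ctx, seq_id, p0, p1);

// New style (b7825 era): the context exposes an opaque memory handle that
// covers KV caches, recurrent (Mamba/RWKV) state, and hybrid layouts alike.
void reset_sequence(llama_context * ctx, llama_seq_id seq_id) {
    llama_memory_t mem = llama_get_memory(ctx);
    // Negative p0/p1 means "the whole cached range" for this sequence.
    llama_memory_seq_rm(mem, seq_id, /*p0=*/ -1, /*p1=*/ -1);
}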

Key improvements (see the sketch below for the shared-prefix case):
- Better memory management and automatic defragmentation
- Enhanced support for parallel inference with shared prefixes
- Improved reproducibility of generation results
- More efficient batch processing
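
As a concrete illustration of shared-prefix reuse, here is a hedged C++ sketch built on the same assumed llama.cpp C API names (llama_get_memory, llama_memory_seq_cp), not the package's actual implementation:

#include "llama.h"

// Sketch: reuse an already-evaluated prompt prefix for a second sequence so
// that parallel completions do not re-evaluate the shared tokens.
void fork_prefix(llama_context * ctx, llama_seq_id src, llama_seq_id dst) {
    llama_memory_t mem = llama_get_memory(ctx);
    // Copy the full cached range of `src` into `dst`; both sequences can then
    // be decoded independently from the same prefix state.
    llama_memory_seq_cp(mem, src, dst, /*p0=*/ -1, /*p1=*/ -1);
}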

Batch API Modernization

Improvements

Memory Management

Error Handling

Performance

API Compatibility

No changes to the R-level API: all existing R code continues to work without modification.

library(localLLM)

backend_init()
model <- model_load("model.gguf")
ctx <- context_create(model, n_ctx = 512)
result <- generate(ctx, "Hello", max_tokens = 10)
# All existing code works exactly the same

Backend Library Changes

Compilation

File Modifications

Updated files:
- custom_files/localllm_capi.cpp (10 locations modified)
  - Memory API migration (8 locations)
  - Batch API modernization (2 locations)
  - Error handling improvements
  - Thread configuration updates

Unchanged:
- custom_files/localllm_capi.h (C API interface)
- All R layer code (R/*.R)
- Proxy layer (src/proxy.cpp)
- Test suite (tests/testthat/*.R)
- Documentation

Testing

Installation Notes

First-time Installation

install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM()  # Will download the new b7825 backend

Upgrading from 1.1.0

remove.packages("localLLM")
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM(force = TRUE)  # Force reinstall backend

Documentation

New technical documentation:
- UPGRADE_COMPLETE.md - Complete upgrade report
- CRITICAL_CHANGES_REQUIRED.md - Detailed change checklist
- MIGRATION_ANALYSIS_b5421_to_b7785.md - Full migration analysis
- Architecture deep-dive in planning documents

Known Issues

Future Enhancements

Potential optimizations for future releases:
- Flash Attention support for improved performance
- Unified Buffer optimization for multi-sequence inference
- SWA (Sliding Window Attention) for ultra-long contexts (128K+)

Contributors


For more information about llama.cpp, see:
- llama.cpp releases
- llama.cpp documentation

localLLM 1.1.0

Previous release notes (if any) would go here…
