diffcp is forSuppose you have a convex optimization problem whose data — the constraint matrix, the offset, the objective coefficients — depend on some upstream quantity (a parameter, a model weight, a hyperparameter, the output of a previous computation). You solve the problem and get an optimum. How does that optimum change when the data changes?
diffcp answers that question. It treats the optimal
solution as an implicit function of the problem data, and
returns the derivative of that function — both in the forward direction
(perturb the data, get the predicted change in the optimum) and as its
adjoint (start from a downstream loss on the optimum and backpropagate
to a gradient on the data).
Three concrete things you can do with it:
Problem$backward() and Problem$derivative()
API, which internally calls diffcp.The implementation is a port of the Python diffcp
package (Agrawal, Barratt, Boyd, Busseti, Moursi, 2019); the numerical
core (cone projections, Jacobians, the M operator, LSQR) is
direct C++ from the upstream repository, called from R via
RcppEigen.
diffcp differentiates the primal–dual conic
problem
\[ \begin{aligned} \text{minimize} &\quad c^\top x \\ \text{subject to} &\quad A x + s = b, \quad s \in K, \end{aligned} \]
where \(K\) is a Cartesian product of the standard cones (zero, the non-negative orthant, second-order, positive semidefinite, exponential, and the exponential dual). The dual variable attached to the cone constraint is \(y\); together \((x, y, s)\) is the primal–dual triple.
Differentiating “the optimum” turns out to mean something precise:
This is the construction in Busseti, Moursi, and Boyd (2019);
diffcp packages it into two callables.
Calling solve_and_derivative(A, b, c, cone_dict) returns
a list with five entries:
x, y, s — the primal–dual
optimum.D(dA, db, dc) — the forward
derivative. Given a perturbation of the data, returns
(dx, dy, ds): the predicted change in the optimum, to first
order.DT(dx, dy, ds) — the adjoint of
D. Given a perturbation of the optimum (or, more usefully,
a gradient of a downstream loss with respect to the optimum), returns
(dA, db, dc): the gradient of that loss with respect to the
data.Both are functions, not matrices. The dense mode materializes the
Jacobian as an Eigen matrix and factors it once; the LSQR mode applies
the Jacobian and its transpose matrix-free, never forming the dense
object. The trade-off is per-call cost (dense is faster
once M has been factored) versus memory and asymptotic
scaling (lsqr is the only viable choice for very large
problems).
The smallest illustrative case: pick the cheapest non-negative mixture of three goods that sums to one,
\[ \min_{x \in \mathbb{R}^3} \; c^\top x \quad \text{subject to} \quad \mathbf{1}^\top x = 1,\; x \ge 0. \]
In SCS standard form \(Ax + s = b\), \(s \in K = \{0\} \times \mathbb{R}^3_+\) — one row for the equality (zero cone, \(s = 0\)) followed by three rows for the non-negativity constraints (the slack absorbs the inequality):
A <- sparseMatrix(i = c(1, 2, 1, 3, 1, 4),
j = c(1, 1, 2, 2, 3, 3),
x = c(1, -1, 1, -1, 1, -1),
dims = c(4, 3))
b <- c(1, 0, 0, 0)
c <- c(1, 2, 3)
cone_dict <- list(z = 1L, l = 3L)
res <- solve_and_derivative(A, b, c, cone_dict, mode = "lsqr",
tol_gap_abs = 1e-10,
tol_gap_rel = 1e-10,
tol_feas = 1e-10)
res$x
#> [1] 1.000000e+00 1.562233e-11 5.329854e-12The optimum is \(x \approx (1, 0, 0)\) — all weight on the cheapest coordinate. Now the derivative. Suppose we slightly increase the cost of the first coordinate, holding everything else fixed:
dA <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
dims = c(4, 3))
db <- numeric(4)
dc <- c(0.001, 0, 0)
d <- res$D(dA, db, dc)
d$dx
#> [1] -2.274114e-16 -6.059404e-16 4.396755e-15D reports that dx is essentially zero — the
optimum is robust to a small bump in \(c_1\), because the constraint \(\mathbf{1}^\top
x = 1\) pins coordinate 1 against \(x
\ge 0\). (You’d see a non-trivial dy if you printed
the dual perturbation.) The forward derivative agrees with
finite-differencing the perturbed solve to within solver tolerance:
res_pert <- solve_only(A, b, c + dc, cone_dict,
tol_gap_abs = 1e-10,
tol_gap_rel = 1e-10,
tol_feas = 1e-10)
max(abs(d$dx - (res_pert$x - res$x)))
#> [1] 1.453855e-14The agreement is at the level of the Clarabel solve tolerance, not the LSQR tolerance — the per-call derivative is the exact implicit-function-theorem answer at this point.
A geometric example: project the simplex centroid onto a Lorentz cone,
\[ \min_{t,\, x} \; t \quad \text{subject to} \quad \|x\|_2 \le t,\; \mathbf{1}^\top x = 1. \]
The variables are \((t, x_1, x_2, x_3)\). The block \((t, x)\) lives in a single 4-dimensional second-order cone; the equality goes in a zero cone of size 1. Since \(t\) is being minimized against a norm constraint, the optimum lies on the cone boundary \(\|x\|_2 = t\), which is exactly the kind of active constraint that makes derivatives non-trivial.
A <- sparseMatrix(
i = c(2, 1, 3, 1, 4, 1, 5),
j = c(1, 2, 2, 3, 3, 4, 4),
x = c(-1, 1, -1, 1, -1, 1, -1),
dims = c(5, 4))
b <- c(1, 0, 0, 0, 0)
c <- c(1, 0, 0, 0)
cone_dict <- list(z = 1L, q = 4L)
res <- solve_and_derivative(A, b, c, cone_dict, mode = "dense",
tol_gap_abs = 1e-10,
tol_gap_rel = 1e-10,
tol_feas = 1e-10)
res$x
#> [1] 0.5773503 0.3333333 0.3333333 0.3333333The optimum is \(x = (1/3, 1/3,
1/3)\), \(t = \sqrt{3}/3\), and
the SOC constraint \(\|x\|_2 = t\)
holds with equality. The cone projection’s Jacobian at a boundary point
is the technical heart of diffcp — it’s a smoothed
projector that interpolates between the identity (interior of the cone)
and zero (interior of the polar cone). The package handles all the
boundary geometry automatically; from the R side you just call
D and DT.
The minimum-eigenvalue SDP: find a unit-trace PSD matrix that minimizes \(\langle C, X \rangle\) for given \(C\). With \(C = \mathrm{diag}(1, 0, 0, 0, -1)\), the optimum is rank-one and puts all weight on the eigenvector of \(C\) with most-negative eigenvalue.
PSD support requires a vectorization convention. diffcp
uses the SCS convention: lower-triangular column-major, with
off-diagonals scaled by \(\sqrt{2}\).
The package exports vec_symm() and
unvec_symm() for round-trips between symmetric matrices and
their packed vector form. The trace-coefficient row in the equality
constraint is sparse — only the diagonal entries contribute, and we
build it explicitly:
DIM <- 5L
n <- DIM * (DIM + 1L) %/% 2L # 15
trace_row <- numeric(n)
pos <- 1L
for (col in seq_len(DIM)) {
trace_row[pos] <- 1.0
pos <- pos + (DIM - col + 1L) # next diagonal in column-major lower-tri
}
A <- rbind(
Matrix(matrix(trace_row, nrow = 1L), sparse = TRUE),
-Diagonal(n))
A <- as(A, "CsparseMatrix")
b <- c(1.0, numeric(n))
C <- matrix(0, DIM, DIM)
C[1, 1] <- 1; C[DIM, DIM] <- -1
c_vec <- vec_symm(C)
cone_dict <- list(z = 1L, s = DIM)
res <- solve_only(A, b, c_vec, cone_dict,
tol_gap_abs = 1e-10,
tol_gap_rel = 1e-10,
tol_feas = 1e-10)
X <- unvec_symm(res$x, DIM)
sum(diag(X))
#> [1] 1
range(eigen(X, symmetric = TRUE, only.values = TRUE)$values)
#> [1] -2.080584e-12 1.000000e+00X has trace 1 (to solver tolerance) and is PSD (smallest
eigenvalue \(\ge -10^{-9}\)). Behind
the scenes, when the solve goes through Clarabel — which packs PSD
entries in upper-triangular column-major order, opposite to SCS —
diffcp permutes the rows of \(A\) and the entries of \(b\) on the way in, and inverse-permutes the
dual \(y\) and slack \(s\) on the way out, so the user-facing
convention stays SCS throughout.
Here is the canonical use case: shadow prices. Take
a small LP — a profit-maximizing production plan with a labor and a
materials budget — and ask, given the optimum, by how much would the
profit change per extra unit of labor? That number is the dual
variable on the labor constraint. diffcp lets you compute
the same sensitivity for any perturbation of the data,
including ones that have no closed-form dual interpretation.
We solve
\[ \max_{x_1, x_2 \ge 0}\; 3 x_1 + 5 x_2 \quad \text{s.t.}\quad 2 x_1 + x_2 \le 100,\; x_1 + 3 x_2 \le 90, \]
and ask: how does the optimum respond to a small change in the labor-budget right-hand side, \(b_1 = 100\)?
We rewrite the maximization as a minimization of \(-c^\top x\), and encode the two budget inequalities and the two non-negativity inequalities as four rows of a non-negative orthant slack constraint:
A_lp <- sparseMatrix(i = c(1, 2, 3, 1, 2, 4),
j = c(1, 1, 1, 2, 2, 2),
x = c(2, 1, -1, 1, 3, -1),
dims = c(4, 2))
b_lp <- c(100, 90, 0, 0)
c_lp <- c(-3, -5)
cones_lp <- list(l = 4L)
res <- solve_and_derivative(A_lp, b_lp, c_lp, cones_lp, mode = "dense",
tol_gap_abs = 1e-10, tol_gap_rel = 1e-10,
tol_feas = 1e-10)
res$x # optimal production plan
#> [1] 42 16
-sum(c_lp * res$x) # max profit
#> [1] 206The optimum makes \(x_1 = 42\),
\(x_2 = 16\), profit \(= 206\). Now use D to ask how
the optimal \(x\) would change if we
got one extra unit of labor (\(\Delta b_1 =
1\)):
dA <- sparseMatrix(i = integer(0), j = integer(0), x = numeric(0),
dims = c(4, 2))
db <- c(1, 0, 0, 0)
dc <- numeric(2)
d <- res$D(dA, db, dc)
d$dx # change in optimal production
#> [1] 0.6 -0.2
-sum(c_lp * d$dx) # change in profit
#> [1] 0.8d$dx says that an extra unit of labor increases \(x_1\) by 0.6 and decreases \(x_2\) by 0.2, and the corresponding profit
change is \(0.8\) per unit of labor:
the labor-budget shadow price, recovered from D rather than
from the dual variable \(y\).
The advantage of going through D instead of the dual: it
generalizes. We can ask the same question about non-trivial
perturbations of \(A\) (a productivity
gain on a particular input), c (a price change on a
particular product), or any combination. The dual variable answers only
the canonical RHS-perturbation question; D answers all of
them through the same single call.
DT answers the opposite question. Suppose we
have a downstream loss \(L(x^\ast)\) —
say, a smooth function of the optimum with known gradient \(\partial L / \partial x^\ast\). The adjoint
computes the gradient of \(L\) with
respect to every data entry \((A, b,
c)\) in one shot:
## Suppose dL/dx* is given (e.g. from autodiff in a larger model).
dL_dx <- c(1.0, -0.5)
dL_dy <- numeric(length(res$y))
dL_ds <- numeric(length(res$s))
da <- res$DT(dL_dx, dL_dy, dL_ds)
da$db # gradient of L w.r.t. each entry of b
#> [1] 7.000000e-01 -4.000000e-01 -6.162066e-12 -2.615830e-12
da$dc # gradient of L w.r.t. each entry of c
#> [1] 3.130882e-10 1.741558e-10In a training loop you’d feed da$db and
da$dc straight into a parameter update.
Both D and DT access the same underlying
Jacobian, but their cost profile differs:
D(dA, db, dc) is right-multiplication
by the Jacobian. Cost is one solve of the same linear system per call.
Use D when you have a small number of perturbation
directions (e.g., parameter sweeps) and want to forecast the effect
on the optimum.DT(dx, dy, ds) is right-multiplication
by the Jacobian’s transpose. Same per-call cost, but the answer is a
gradient with respect to all data entries simultaneously. Use
DT when you have an upstream gradient on the optimum
(typically from an autodiff system) and need to propagate it through the
conic optimization.mode = "dense" factors M once on the first
call and amortizes the cost across subsequent D /
DT invocations on the same problem, which is the right
choice when you’ll evaluate either operator many times.
dense versus lsqrsolve_and_derivative accepts two modes for the inner
linear-algebra step:
mode = "lsqr" (default): matrix-free
conjugate-gradient- style iterative solve against the M
operator. Memory is \(O(\text{nnz}(A))\); iteration count grows
with the conditioning of \(M\).
Appropriate for large sparse problems.mode = "dense": build \(M\) as a dense matrix, factor with Eigen’s
LDLT, and reuse the factorization across D /
DT calls. Cost is \(O(N^3)\) once, \(O(N^2)\) per call, where \(N = m + n + 1\). Appropriate for small
problems and for workflows that re-apply D /
DT many times.Both modes produce the same answer to within solver tolerance. The
dense path matches Python’s \((M^\top
M).\mathrm{ldlt}().\mathrm{solve}(M^\top \cdot
\mathrm{rhs})\) bit-for-bit; the LSQR path agrees with the dense
path to roughly \(10^{-6}\) at the
default atol = btol = 1e-8.
vec_symm() packs
the lower triangle in column-major order with off-diagonals scaled by
\(\sqrt{2}\) (the SCS convention).
unvec_symm() is the inverse. This scaling makes the inner
product \(\langle \mathrm{vec}(A),
\mathrm{vec}(B) \rangle\) equal \(\mathrm{tr}(AB)\) for symmetric \(A, B\), which is what the cone-projection
Jacobian expects.diffcp permutes rows of \(A\) and entries of \(b\) to Clarabel order on the way in and
inverse-permutes the dual \(y\) and
slack \(s\) on the way out. The
user-facing convention is SCS throughout.ep cone
consumes three rows of \(A\) (one cone
per count); the dual ed cone is handled
separately or recovered via Moreau’s decomposition, \(\Pi_{K^\ast}(x) = x + \Pi_K(-x)\).solve_only(P = P) accepts a positive-semidefinite
P for the forward solve. The derivative path through
solve_and_derivative does not support
P directly; CVXR’s reduction chain canonicalizes QPs into
auxiliary-variable conic problems before reaching the solver, so the
loss is invisible at the CVXR layer.R users typically don’t call diffcp directly. CVXR (≥
1.8.2) exposes a higher-level API:
library(CVXR)
p <- Parameter(); x <- Variable(); value(p) <- 3
prob <- Problem(Minimize((x - 2 * p)^2), list(x >= 0))
psolve(prob, requires_grad = TRUE)
backward(prob)
gradient(p) # 2 (since x* = 2p)
delta(p) <- 1e-3
derivative(prob)
delta(x) # 0.002The requires_grad = TRUE flag routes the solve through
the DIFFCP solver wrapper, which calls
diffcp::solve_and_derivative under the hood. CVXR then
handles the chain rule through any problem reductions (DGP
log-transform, Complex2Real splitting, parameter canonicalization) and
reports gradients keyed by Parameter rather than by raw
data entries.
When you want the lower-level data-perturbation interface — for
research, for fixtures that bypass CVXR canonicalization, or for
problems already in conic standard form — diffcp is the
right entry point.
If you use diffcp in academic work, please cite
both the R package and the original paper. See
citation("diffcp") for the canonical BibTeX entries.
R package
Narasimhan, B.; Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.; Moursi, W. diffcp: Differentiating Through Cone Programs. R package, 2026. https://github.com/bnaras/diffcp
Original paper
Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.; Moursi, W. “Differentiating through a cone program.” Journal of Applied and Numerical Optimization, 1(2), 107–115, 2019.