When you know the derivatives of your optimization problem, it is frequently more efficient to use them directly instead of relying on automatic differentiation. For that purpose, we have created ManualNLPModels. The package is still fairly bare-bones, since demand has been low, but let us know if you need more functionality.
For instance, in the logistic regression problem, we have a model \(h_{\beta}(x) = \sigma(\hat{x}^T \beta) = \sigma(\beta_0 + x^T\beta_{1:p})\), where \(\hat{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}\). The value of \(\beta\) is found by minimizing the negative of the log-likelihood function:
\[\ell(\beta) = -\frac{1}{n} \sum_{i=1}^n \Big[ y_i \ln \big(h_{\beta}(x_i)\big) + (1 - y_i) \ln\big(1 - h_{\beta}(x_i)\big) \Big].\]We will provide the gradient of this function manually. It is given by
\[\nabla \ell(\beta) = \frac{-1}{n} \sum_{i=1}^n \big(y_i - h_{\beta}(x_i)\big) \hat{x}_i = \frac{1}{n} \begin{bmatrix} e^T \\ X^T \end{bmatrix} (h_{\beta}(X) - y),\]where \(e\) is the vector with all components equal to 1.
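For reference, this expression follows from the chain rule together with the sigmoid identity \(\sigma'(t) = \sigma(t)\big(1 - \sigma(t)\big)\), which gives \(\nabla_\beta h_{\beta}(x_i) = h_{\beta}(x_i)\big(1 - h_{\beta}(x_i)\big)\hat{x}_i\), so each summand differentiates to
\[\nabla_\beta \Big[ y_i \ln h_{\beta}(x_i) + (1 - y_i) \ln\big(1 - h_{\beta}(x_i)\big) \Big] = \big(y_i - h_{\beta}(x_i)\big) \hat{x}_i.\]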
using ManualNLPModels
using LinearAlgebra, Random
Random.seed!(0)
sigmoid(t) = 1 / (1 + exp(-t))
h(β, X) = sigmoid.(β[1] .+ X * β[2:end])
n, p = 500, 50
X = randn(n, p)
β = randn(p + 1)                        # ground-truth coefficients (intercept first)
y = round.(h(β, X) .+ randn(n) * 0.1)   # noisy 0/1 labels
function myfun(β, X, y)
  n = length(y)
  @views hβ = sigmoid.(β[1] .+ X * β[2:end])
  out = sum(
    yᵢ * log(ŷᵢ + 1e-8) + (1 - yᵢ) * log(1 - ŷᵢ + 1e-8)  # 1e-8 guards against log(0)
    for (yᵢ, ŷᵢ) in zip(y, hβ)
  )
  return -out / n + 0.5e-4 * norm(β)^2  # small ℓ₂ regularization term
end
function mygrad(out, β, X, y)
  n = length(y)
  @views δ = (sigmoid.(β[1] .+ X * β[2:end]) - y) / n  # (h_β(X) - y) / n
  out[1] = sum(δ) + 1e-4 * β[1]                  # intercept component plus regularization
  @views out[2:end] .= X' * δ + 1e-4 * β[2:end]
  return out
end
nlp = NLPModel(
  zeros(p + 1),
  β -> myfun(β, X, y),
  grad=(out, β) -> mygrad(out, β, X, y),
)
ManualNLPModels.NLPModel{Float64, Vector{Float64}}
Problem name: Generic
All variables: ████████████████████ 51 All constraints: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
free: ████████████████████ 51 free: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 lower: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 upper: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 low/upp: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 fixed: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 infeas: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nnzh: (100.00% sparsity) 0 linear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nonlinear: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
nnzj: (------% sparsity)
Counters:
obj: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 grad: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 cons: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
cons_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 cons_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jcon: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
jgrad: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jac: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jac_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
jac_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jprod_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
jprod_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jtprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jtprod_lin: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
jtprod_nln: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 hess: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 hprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
jhess: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0 jhprod: ⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅⋅ 0
Notice that the grad function must modify its first argument in place, so you don't waste memory creating new arrays. Only the obj, grad, and grad! functions will be defined for this model, so you need to choose your solver carefully. We'll use lbfgs from JSOSolvers.jl.
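Before solving, it is worth sanity-checking the manual gradient against a finite-difference approximation. The snippet below is not part of the original example: the test point, the step size 1e-6, and the helper vector e₁ are arbitrary choices for illustration.
using NLPModels

βtest = randn(p + 1)
g = zeros(p + 1)      # preallocated buffer reused by grad!
grad!(nlp, βtest, g)  # fills g in place; no new arrays are allocated

# central finite difference in the first coordinate
ϵ = 1e-6
e₁ = [i == 1 ? ϵ : 0.0 for i = 1:p+1]
fd = (obj(nlp, βtest + e₁) - obj(nlp, βtest - e₁)) / (2ϵ)
abs(g[1] - fd)        # should be close to zero if mygrad is correct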
using JSOSolvers
output = lbfgs(nlp)
βsol = output.solution
ŷ = round.(h(βsol, X))  # predicted labels
sum(ŷ .== y) / n        # classification accuracy on the training data
1.0
We can compare against other approaches.
using BenchmarkTools
using Logging
@benchmark begin
  nlp = NLPModel(
    zeros(p + 1),
    β -> myfun(β, X, y),
    grad=(out, β) -> mygrad(out, β, X, y),
  )
  output = with_logger(NullLogger()) do
    lbfgs(nlp)
  end
end
BenchmarkTools.Trial: 1999 samples with 1 evaluation.
Range (min … max): 2.291 ms … 5.625 ms ┊ GC (min … max): 0.00% … 45.53%
Time (median): 2.351 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.498 ms ± 514.270 μs ┊ GC (mean ± σ): 3.50% ± 8.91%
██▅▆▂ ▁▁
█████▇▄▄▅▄▃▄▄▁███▇▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▃▆▅▆▅▆▆▅▄▄▆▆ █
2.29 ms Histogram: log(frequency) by time 5.16 ms <
Memory estimate: 1.73 MiB, allocs estimate: 2655.
using ADNLPModels
@benchmark begin
  adnlp = ADNLPModel(β -> myfun(β, X, y), zeros(p + 1))
  output = with_logger(NullLogger()) do
    lbfgs(adnlp)
  end
end
BenchmarkTools.Trial: 17 samples with 1 evaluation.
Range (min … max): 299.460 ms … 307.731 ms ┊ GC (min … max): 0.00% … 0.60%
Time (median): 301.345 ms ┊ GC (median): 0.60%
Time (mean ± σ): 301.179 ms ± 1.899 ms ┊ GC (mean ± σ): 0.39% ± 0.30%
█ ██
█▁▁▆▆▁▁▁▁▁▁▁▁██▆▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
299 ms Histogram: frequency by time 308 ms <
Memory estimate: 30.52 MiB, allocs estimate: 3786.
using JuMP
using NLPModelsJuMP
@benchmark begin
  model = Model()
  @variable(model, modelβ[1:p+1])
  @NLexpression(model,
    xᵀβ[i=1:n],
    modelβ[1] + sum(modelβ[j + 1] * X[i,j] for j = 1:p)
  )
  @NLexpression(
    model,
    hβ[i=1:n],
    1 / (1 + exp(-xᵀβ[i]))
  )
  @NLobjective(model, Min,
    -sum(y[i] * log(hβ[i] + 1e-8) + (1 - y[i]) * log(1 - hβ[i] + 1e-8) for i = 1:n) / n +
    0.5e-4 * sum(modelβ[i]^2 for i = 1:p+1)
  )
  jumpnlp = MathOptNLPModel(model)
  output = with_logger(NullLogger()) do
    lbfgs(jumpnlp)
  end
end
BenchmarkTools.Trial: 33 samples with 1 evaluation.
Range (min … max): 144.085 ms … 158.153 ms ┊ GC (min … max): 4.32% … 8.17%
Time (median): 151.985 ms ┊ GC (median): 5.21%
Time (mean ± σ): 152.054 ms ± 4.129 ms ┊ GC (mean ± σ): 5.85% ± 1.51%
▃ ▃ ▃█ ▃ ▃ ▃ █ ▃ ▃
█▁▁▇▁▁▁▁▁▇▁▁▁▇▁▁▁▁▁▁▁█▁██▇▁▁█▁▁▇▁▁▇▁▇█▇▁▁▁▁▁▁▁▁▇▁█▇▁▇▁█▁▁▁█▁█ ▁
144 ms Histogram: frequency by time 158 ms <
Memory estimate: 98.01 MiB, allocs estimate: 436900.
Or we can compare just the grad calls:
using NLPModels
@benchmark grad(nlp, β)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.100 μs … 3.577 ms ┊ GC (min … max): 0.00% … 91.27%
Time (median): 14.100 μs ┊ GC (median): 0.00%
Time (mean ± σ): 15.244 μs ± 47.394 μs ┊ GC (mean ± σ): 4.14% ± 1.34%
▃▆██▇▅▅▂▁
▂▃▅███████████▇▅▅▄▄▄▃▃▃▄▄▄▄▅▄▄▃▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▄
13.1 μs Histogram: frequency by time 19.3 μs <
Memory estimate: 18.19 KiB, allocs estimate: 8.
adnlp = ADNLPModel(β -> myfun(β, X, y), zeros(p + 1))
@benchmark grad(adnlp, β)
BenchmarkTools.Trial: 1084 samples with 1 evaluation.
Range (min … max): 4.565 ms … 6.848 ms ┊ GC (min … max): 0.00% … 29.84%
Time (median): 4.580 ms ┊ GC (median): 0.00%
Time (mean ± σ): 4.611 ms ± 181.170 μs ┊ GC (mean ± σ): 0.29% ± 2.42%
██▄
▃█████▅▃▂▂▂▂▂▂▂▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▃▃▂▂ ▃
4.57 ms Histogram: frequency by time 4.79 ms <
Memory estimate: 471.91 KiB, allocs estimate: 42.
model = Model()
@variable(model, modelβ[1:p+1])
@NLexpression(model,
  xᵀβ[i=1:n],
  modelβ[1] + sum(modelβ[j + 1] * X[i,j] for j = 1:p)
)
@NLexpression(
  model,
  hβ[i=1:n],
  1 / (1 + exp(-xᵀβ[i]))
)
@NLobjective(model, Min,
  -sum(y[i] * log(hβ[i] + 1e-8) + (1 - y[i]) * log(1 - hβ[i] + 1e-8) for i = 1:n) / n +
  0.5e-4 * sum(modelβ[i]^2 for i = 1:p+1)
)
jumpnlp = MathOptNLPModel(model)
@benchmark grad(jumpnlp, β)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 205.299 μs … 434.699 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 208.300 μs ┊ GC (median): 0.00%
Time (mean ± σ): 209.617 μs ± 4.853 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃█▅▆▅ ▁
▂▃▄████████▆▇▅▅▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▃▂▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
205 μs Histogram: frequency by time 227 μs <
Memory estimate: 496 bytes, allocs estimate: 1.
Take these benchmarks with a grain of salt: they are run on a GitHub Actions server, and the benchmarked expressions reference global variables. If you want to make an informed decision, you should consider running your own benchmarks.
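If you do run your own, interpolating the globals into the benchmarked expression with $ (a standard BenchmarkTools recommendation) removes the overhead of untyped global variables. A minimal sketch, reusing the nlp and β defined above:
# $ interpolation makes BenchmarkTools treat nlp and β as locals
@benchmark grad($nlp, $β)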