Generalized CP (GCP) Tensor Decomposition
This document outlines usage and examples for the generalized CP (GCP) tensor decomposition implemented in gcp_opt. GCP allows alternate objective functions besides the sum of squared errors, which is the standard for CP. The code supports both dense and sparse input tensors, but sparse input tensors require the randomized optimization methods. For some examples, see also GCP-OPT and Amino Acids Dataset.
GCP is described in greater detail in the manuscripts:
- D. Hong, T. G. Kolda, J. A. Duersch, Generalized Canonical Polyadic Tensor Decomposition, SIAM Review, 62:133-163, 2020, https://doi.org/10.1137/18M1203626
- T. G. Kolda, D. Hong, Stochastic Gradients for Large-Scale Tensor Decomposition. SIAM J. Mathematics of Data Science, 2:1066-1095, 2020, https://doi.org/10.1137/19m1266265
Contents
- Basic Usage
- Manually specifying the loss function
- Choice of Optimization Method
- Specifying Missing or Incomplete Data Using the Mask Option
- Other Options
- Specifying L-BFGS-B Parameters
- Specifying SGD and ADAM Parameters
- Example on Gaussian distributed data
- Create an example Rayleigh tensor model and data instance.
- Boolean tensor.
- Create and test a Poisson count tensor.
- Loss function for Poisson negative log likelihood with identity link.
Basic Usage
The idea of GCP is to use alternative objective functions. As such, the most important thing to specify is the objective function.
The command M = gcp_opt(X,R,'type',type) computes an estimate of the best rank-R generalized CP (GCP) decomposition of the tensor X for the generalized loss function specified by type. The input X can be a tensor or sparse tensor. The result M is a Kruskal tensor. Some options for the objective function are:
- 'binary' - Bernoulli distribution for binary data
- 'count' - Poisson distribution for count data (see also cp_apr)
- 'normal' - Gaussian distribution (see also cp_als and cp_opt)
- 'huber (0.25)' - Similar to Gaussian but robust to outliers
- 'rayleigh' - Rayleigh distribution for nonnegative data
- 'gamma' - Gamma distribution for nonnegative data
See Function Types for GCP for a complete list of options.
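For instance, assuming X is a count-valued data tensor already in the workspace, a rank-5 Poisson model could be fit as follows (a minimal sketch; the rank and variable names are illustrative):

% Sketch: fit a rank-5 GCP model with the Poisson loss for count data
M = gcp_opt(X, 5, 'type', 'count');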
Manually specifying the loss function
Rather than specifying a type, the user has the option to explicitly specify the objective function, gradient, and lower bounds using the following options:
- 'func' - Objective function handle, e.g., @(x,m) (m-x).^2
- 'grad' - Gradient function handle, e.g., @(x,m) 2.*(m-x)
- 'lower' - Lower bound, e.g., 0 or -Inf
Note that the function must be able to work on vectors of x and m values.
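For example, the built-in 'normal' loss could be specified manually as follows (a sketch that mirrors the default Gaussian objective and gradient printed in the examples later in this document):

% Sketch: Gaussian (sum-of-squared-errors) loss specified by hand
fh = @(x,m) (m-x).^2;     % elementwise objective
gh = @(x,m) 2.*(m-x);     % elementwise gradient
M = gcp_opt(X, R, 'func', fh, 'grad', gh, 'lower', -Inf);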
Choice of Optimization Method
The default optimization method is L-BFGS-B (bound-constrained limited-memory BFGS), which requires installing the third-party L-BFGS-B software. The L-BFGS-B software can only be used for dense tensors. The other choice is a stochastic optimization method, either stochastic gradient descent (SGD) or ADAM, which can be used for dense or sparse tensors.
The command M = gcp_opt(X,R,...,'opt',opt) specifies the optimization method where opt is one of the following strings:
- 'lbfgsb' - Bound-constrained limited-memory BFGS
- 'sgd' - Stochastic gradient descent (SGD)
- 'adam' - Momentum-based SGD method
Each method has parameters, which are described below.
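For example, to use ADAM rather than the default L-BFGS-B on a dense tensor (a sketch; X and R are assumed to be defined):

% Sketch: select the ADAM stochastic optimizer
M = gcp_opt(X, R, 'type', 'normal', 'opt', 'adam');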
Specifying Missing or Incomplete Data Using the Mask Option
If some entries of the tensor are unknown, the method can mask off that data during the fitting process. To do so, specify a mask tensor W that is the same size as the data tensor X. The mask tensor should be 1 if the entry in X is known and 0 otherwise. The call is M = gcp_opt(X,R,'type',type,'mask',W).
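As a sketch, a mask marking a randomly chosen 20% of the entries as missing might be built like this (the fraction and variable names are illustrative):

% Sketch: mask tensor W with roughly 20% of entries marked unknown (0)
W = tensor(double(rand(size(X)) > 0.2));
M = gcp_opt(X, R, 'type', 'normal', 'mask', W);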
Other Options
A few common options are as follows:
- 'maxiters' - Maximum number of outer iterations {1000}
- 'init' - Initialization for factor matrices {'rand'}
- 'printitn' - Print every n iterations; 0 for no printing {1}
- 'state' - Random state, to re-create the same outcome {[]}
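For example, several of these options can be combined in a single call (illustrative values):

% Sketch: limit iterations, use random initialization, print every 10 iterations
M = gcp_opt(X, R, 'type', 'count', 'maxiters', 500, 'init', 'rand', 'printitn', 10);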
Specifying L-BFGS-B Parameters
In addition to the options above, there are two options used to modify the L-BFGS-B behavior.
- 'factr' - Tolerance on the change on the objective value. Defaults to 1e7, which is multiplied by machine epsilon.
- 'pgtol' - Projected gradient tolerance, defaults to 1e-4 times total tensor size.
It can sometimes be useful to increase or decrease pgtol depending on the objective function and size of the tensor.
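For instance, the tolerances could be tightened like so (a sketch with illustrative values):

% Sketch: tighter L-BFGS-B stopping tolerances than the defaults
M = gcp_opt(X, R, 'type', 'normal', 'factr', 1e5, 'pgtol', 1e-6*prod(size(X)));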
Specifying SGD and ADAM Parameters
There are a number of parameters that can be adjusted for SGD and ADAM.
Stochastic Gradient. There are three different sampling methods for computing the stochastic gradient:
- Uniform - Entries are selected uniformly at random. Default for dense tensors.
- Stratified - Zeros and nonzeros are sampled separately, which is recommended for sparse tensors. Default for sparse tensors.
- Semi-Stratified - Modification to stratified sampling that avoids rejection sampling for better efficiency at the cost of potentially higher variance.
The options corresponding to these are as follows.
- 'sampler' - Type of sampling to use for the stochastic gradient. Defaults to 'uniform' for dense tensors and 'stratified' for sparse tensors. The third option is 'semi-stratified'.
- 'gsamp' - Number of samples for stochastic gradient. This should generally be O(sum(sz)*r). For the stratified or semi-stratified sampler, this can be two numbers. The first is the number of nonzero samples and the second is the number of zero samples. If only one number is specified, then this is used as the number for both nonzeros and zeros, and the total number of samples is 2x what is specified.
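As an illustration, a sparse tensor could be fit with the semi-stratified sampler and explicit nonzero/zero sample counts (a sketch; the counts are illustrative):

% Sketch: semi-stratified gradient sampling with 5000 nonzero and 5000 zero samples
M = gcp_opt(X, R, 'type', 'count', 'opt', 'adam', ...
    'sampler', 'semi-stratified', 'gsamp', [5000 5000]);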
Estimating the Function. We also use sampling to estimate the function value.
- 'fsampler' - This can be 'uniform' (default for dense), 'stratified' (default for sparse), or a custom function handle. The custom function handle is primarily useful for reusing the same sampled elements across different tests. For instance, we might create such a sampler by calling the hidden sampling function and saving its outputs:
[xsubs, xvals, wghts] = tt_sample_uniform(X, 10000);
fsampler = @() deal(xsubs, xvals, wghts);
- 'fsamp' - Number of samples to estimate function. This should generally be somewhat large since we want this sample to generate a reliable estimate of the true function value.
The 'stratified' sampler has an extra option:
- 'oversample' - Factor to oversample when implicitly sampling zeros in the sparse case. Defaults to 1.1. Only adjust for very small tensors.
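Continuing the fsampler example above, the saved sampler can then be passed in directly (a sketch; X and R are assumed to be defined):

% Sketch: reuse the precomputed function-estimate sampler across runs
M = gcp_opt(X, R, 'type', 'count', 'opt', 'adam', 'fsampler', fsampler);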
There are some other options needed for SGD: the learning rate and a decrease schedule. Our schedule is very simple - we decrease the rate if there is no improvement in the approximate function value after an epoch. After a specified number of decreases ('maxfails'), we quit.
- 'rate' - Initial learning rate. Defaults to 1e-3.
- 'decay' - How much to decrease the learning rate once progress stagnates, i.e., no decrease in objective function between epochs. Default to 0.1.
- 'maxfails' - How many times to decrease the learning rate. Can be zero. Defaults to 1.
- 'epciters' - Iterations per epoch. Defaults to 1000.
- 'festtol' - Quit if the function estimate goes below this level. Defaults to -Inf.
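Putting the rate and schedule options together (a sketch with illustrative values):

% Sketch: larger initial rate, allow two rate decreases, shorter epochs
M = gcp_opt(X, R, 'type', 'count', 'opt', 'sgd', ...
    'rate', 1e-2, 'decay', 0.1, 'maxfails', 2, 'epciters', 500);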
There are some options that are specific to ADAM and generally needn't change:
- 'beta1' - Defaults to 0.9
- 'beta2' - Defaults to 0.999
- 'epsilon' - Defaults to 1e-8
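They can nevertheless be overridden if desired (a sketch with illustrative values):

% Sketch: override ADAM's moment-decay and epsilon parameters (rarely needed)
M = gcp_opt(X, R, 'type', 'count', 'opt', 'adam', ...
    'beta1', 0.95, 'beta2', 0.99, 'epsilon', 1e-7);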
Example on Gaussian distributed data
We set up the example with known low-rank structure. Here nc is the rank and sz is the size.
clear
rng('default')
nc = 2;
sz = [50 60 70];
info = create_problem('Size',sz,'Num_Factors',nc);
X = info.Data;
M_true = info.Soln;
whos
  Name        Size            Bytes  Class      Attributes

  M_true      50x60x70         3544  ktensor
  X           50x60x70      1680360  tensor
  info        1x1           1684240  struct
  nc          1x1                 8  double
  sz          1x3                24  double
Run GCP-OPT
tic, [M1,M0,out] = gcp_opt(X,nc,'type','normal','printitn',10); toc
fprintf('Final fit: %e (for comparison to f in CP-ALS)\n',1 - norm(X-full(M1))/norm(X));
fprintf('Score: %f\n',score(M1,M_true));
GCP-OPT-LBFGSB (Generalized CP Tensor Decomposition)
Tensor size: 50 x 60 x 70 (210000 total entries)
GCP rank: 2
Generalized function Type: normal
Objective function: @(x,m)(m-x).^2
Gradient function: @(x,m)2.*(m-x)
Lower bound of factor matrices: -Inf
Optimization method: lbfgsb
Max iterations: 1000
Projected gradient tolerance: 21
Begin Main loop
Iter 10, f(x) = 6.435693e+04, ||grad||_infty = 2.48e+03
Iter 20, f(x) = 9.151275e+02, ||grad||_infty = 1.88e+01
End Main Loop
Final objective: 9.1513e+02
Setup time: 0.09 seconds
Main loop time: 0.32 seconds
Outer iterations: 20
Total iterations: 50
L-BFGS-B Exit message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL.
Elapsed time is 0.510884 seconds.
Final fit: 9.004887e-01 (for comparison to f in CP-ALS)
Score: 0.999247
Compare to CP-ALS, which should usually be faster
tic, M2 = cp_als(X,nc,'init',tocell(M0),'printitn',1); toc
fprintf('Objective function: %e (for comparison to f(x) in GCP-OPT)\n', norm(X-full(M2))^2/prod(size(X)));
fprintf('Score: %f\n',score(M2,M_true));
CP_ALS:
 Iter  1: f = 3.345448e-01 f-delta = 3.3e-01
 Iter  2: f = 5.734164e-01 f-delta = 2.4e-01
 Iter  3: f = 6.242275e-01 f-delta = 5.1e-02
 Iter  4: f = 7.508386e-01 f-delta = 1.3e-01
 Iter  5: f = 8.939489e-01 f-delta = 1.4e-01
 Iter  6: f = 9.005545e-01 f-delta = 6.6e-03
 Iter  7: f = 9.005707e-01 f-delta = 1.6e-05
 Final f = 9.005707e-01
Elapsed time is 0.180171 seconds.
Objective function: 4.350572e-03 (for comparison to f(x) in GCP-OPT)
Score: 0.999932
Now let's try it with the ADAM functionality
tic, [M3,~,out] = gcp_opt(X,nc,'type','normal','opt','adam','init',M0,'printitn',1); toc
fprintf('Final fit: %e (for comparison to f in CP-ALS)\n',1 - norm(X-full(M1))/norm(X));
fprintf('Score: %f\n',score(M3,M_true));
GCP-OPT-ADAM (Generalized CP Tensor Decomposition)
Tensor size: 50 x 60 x 70 (210000 total entries)
GCP rank: 2
Generalized function Type: normal
Objective function: @(x,m)(m-x).^2
Gradient function: @(x,m)2.*(m-x)
Lower bound of factor matrices: -Inf
Optimization method: adam
Max iterations (epochs): 1000
Iterations per epoch: 1000
Learning rate / decay / maxfails: 0.001 0.1 1
Function Sampler: uniform with 210000 samples
Gradient Sampler: uniform with 2100 samples
Begin Main loop
Initial f-est: 1.867360e+05
Epoch  1: f-est = 9.595092e+04, step = 0.001
Epoch  2: f-est = 9.371435e+04, step = 0.001
Epoch  3: f-est = 8.795032e+04, step = 0.001
Epoch  4: f-est = 3.244018e+04, step = 0.001
Epoch  5: f-est = 1.251749e+03, step = 0.001
Epoch  6: f-est = 9.170513e+02, step = 0.001
Epoch  7: f-est = 9.149259e+02, step = 0.001
Epoch  8: f-est = 9.162387e+02, step = 0.001, nfails = 1 (resetting to solution from last epoch)
Epoch  9: f-est = 9.130911e+02, step = 0.0001
Epoch 10: f-est = 9.130145e+02, step = 0.0001
Epoch 11: f-est = 9.131056e+02, step = 0.0001, nfails = 2 (resetting to solution from last epoch)
End Main Loop
Final f-est: 9.1301e+02
Setup time: 0.05 seconds
Main loop time: 12.19 seconds
Total iterations: 11000
Elapsed time is 12.250667 seconds.
Final fit: 9.004887e-01 (for comparison to f in CP-ALS)
Score: 0.999870
Create an example Rayleigh tensor model and data instance.
Consider a tensor that is Rayleigh-distributed. This means its entries are all nonnegative. First, we generate such a tensor with low-rank structure.
clear
rng(65)
nc = 3;
sz = [50 60 70];
nd = length(sz);
% Create factor matrices that correspond to smooth sinusoidal factors
U = cell(1,nd);
for k = 1:nd
    V = 1.1 + cos(bsxfun(@times, 2*pi/sz(k)*(0:sz(k)-1)', 1:nc));
    U{k} = V(:,randperm(nc));
end
M_true = normalize(ktensor(U));
X = tenfun(@raylrnd, full(M_true));
Visualize the true solution
viz(M_true, 'Figure', 1)
ktensor/viz: Normalizing factors and sorting components according to the 2-norm.
ans =
  struct with fields:
              height: 0.2933
               width: [3×1 double]
          GlobalAxis: [1×1 Axes]
          FactorAxes: [3×3 Axes]
    ModeTitleHandles: [3×1 Text]
    CompTitleHandles: [3×1 Text]
         PlotHandles: {3×3 cell}
Run GCP-OPT
tic, [M1,~,out] = gcp_opt(X,nc,'type','rayleigh','printitn',10); toc
fprintf('Score: %f\n',score(M1,M_true));
GCP-OPT-LBFGSB (Generalized CP Tensor Decomposition)
Tensor size: 50 x 60 x 70 (210000 total entries)
GCP rank: 3
Generalized function Type: rayleigh
Objective function: @(x,m)2*log(m+1e-10)+(pi/4)*(x./(m+1e-10)).^2
Gradient function: @(x,m)2./(m+1e-10)-(pi/2)*x.^2./(m+1e-10).^3
Lower bound of factor matrices: 0
Optimization method: lbfgsb
Max iterations: 1000
Projected gradient tolerance: 21
Begin Main loop
Iter 10, f(x) = 9.142571e+05, ||grad||_infty = 1.41e+03
Positive dir derivative in projection
Using the backtracking step
Iter 20, f(x) = 8.450604e+05, ||grad||_infty = 1.89e+03
Iter 30, f(x) = 7.770233e+05, ||grad||_infty = 1.41e+03
Iter 40, f(x) = 7.632798e+05, ||grad||_infty = 1.80e+03
Iter 50, f(x) = 7.580042e+05, ||grad||_infty = 1.10e+03
Iter 60, f(x) = 7.573270e+05, ||grad||_infty = 2.52e+02
Iter 70, f(x) = 7.572930e+05, ||grad||_infty = 7.99e+01
End Main Loop
Final objective: 7.5729e+05
Setup time: 0.02 seconds
Main loop time: 1.41 seconds
Outer iterations: 70
Total iterations: 165
L-BFGS-B Exit message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL.
Elapsed time is 1.429504 seconds.
Score: 0.795733
Visualize the solution from GCP-OPT
viz(M1, 'Figure', 2)
ktensor/viz: Normalizing factors and sorting components according to the 2-norm.
ans =
  struct with fields:
              height: 0.2933
               width: [3×1 double]
          GlobalAxis: [1×1 Axes]
          FactorAxes: [3×3 Axes]
    ModeTitleHandles: [3×1 Text]
    CompTitleHandles: [3×1 Text]
         PlotHandles: {3×3 cell}
Now let's try it with the stochastic (ADAM) optimization method.
tic, [M2,~,out] = gcp_opt(X,nc,'type','rayleigh','opt','adam'); toc
fprintf('Final fit: %e (for comparison to f in CP-ALS)\n',1 - norm(X-full(M1))/norm(X));
fprintf('Score: %f\n',score(M2,M_true));
GCP-OPT-ADAM (Generalized CP Tensor Decomposition)
Tensor size: 50 x 60 x 70 (210000 total entries)
GCP rank: 3
Generalized function Type: rayleigh
Objective function: @(x,m)2*log(m+1e-10)+(pi/4)*(x./(m+1e-10)).^2
Gradient function: @(x,m)2./(m+1e-10)-(pi/2)*x.^2./(m+1e-10).^3
Lower bound of factor matrices: 0
Optimization method: adam
Max iterations (epochs): 1000
Iterations per epoch: 1000
Learning rate / decay / maxfails: 0.001 0.1 1
Function Sampler: uniform with 210000 samples
Gradient Sampler: uniform with 2100 samples
Begin Main loop
Initial f-est: 2.715221e+06
Epoch  1: f-est = 1.017441e+06, step = 0.001
Epoch  2: f-est = 9.204216e+05, step = 0.001
Epoch  3: f-est = 8.791755e+05, step = 0.001
Epoch  4: f-est = 8.496629e+05, step = 0.001
Epoch  5: f-est = 8.276276e+05, step = 0.001
Epoch  6: f-est = 8.053227e+05, step = 0.001
Epoch  7: f-est = 7.839439e+05, step = 0.001
Epoch  8: f-est = 7.710536e+05, step = 0.001
Epoch  9: f-est = 7.653168e+05, step = 0.001
Epoch 10: f-est = 7.619699e+05, step = 0.001
Epoch 11: f-est = 7.600227e+05, step = 0.001
Epoch 12: f-est = 7.590060e+05, step = 0.001
Epoch 13: f-est = 7.585602e+05, step = 0.001
Epoch 14: f-est = 7.583133e+05, step = 0.001
Epoch 15: f-est = 7.582559e+05, step = 0.001
Epoch 16: f-est = 7.582295e+05, step = 0.001
Epoch 17: f-est = 7.582587e+05, step = 0.001, nfails = 1 (resetting to solution from last epoch)
Epoch 18: f-est = 7.581745e+05, step = 0.0001
Epoch 19: f-est = 7.581687e+05, step = 0.0001
Epoch 20: f-est = 7.581605e+05, step = 0.0001
Epoch 21: f-est = 7.581473e+05, step = 0.0001
Epoch 22: f-est = 7.581537e+05, step = 0.0001, nfails = 2 (resetting to solution from last epoch)
End Main Loop
Final f-est: 7.5815e+05
Setup time: 0.06 seconds
Main loop time: 25.78 seconds
Total iterations: 22000
Elapsed time is 25.836094 seconds.
Final fit: 5.380785e-01 (for comparison to f in CP-ALS)
Score: 0.797088
Visualize the solution from ADAM
viz(M2, 'Figure', 3)
ktensor/viz: Normalizing factors and sorting components according to the 2-norm.
ans =
  struct with fields:
              height: 0.2933
               width: [3×1 double]
          GlobalAxis: [1×1 Axes]
          FactorAxes: [3×3 Axes]
    ModeTitleHandles: [3×1 Text]
    CompTitleHandles: [3×1 Text]
         PlotHandles: {3×3 cell}
Boolean tensor.
The model will predict the odds of observing a 1. Recall that the odds are related to the probability as follows: if p is the probability and r is the odds, then r = p / (1 - p). Higher odds indicate a higher probability of observing a one.
clear
rng(7639)
nc = 3;             % Number of components
sz = [50 60 70];    % Tensor size
nd = length(sz);    % Number of dimensions
We assume that the underlying model tensor has factor matrices with only a few "large" entries in each column. The small entries should correspond to a low but nonzero probability of observing a 1, while the largest entries, if multiplied together, should correspond to a very high likelihood of observing a 1.
probrange = [0.01 0.99];    % Absolute min and max of probabilities
oddsrange = probrange ./ (1 - probrange);
smallval = nthroot(min(oddsrange)/nc,nd);
largeval = nthroot(max(oddsrange)/nc,nd);
A = cell(nd,1);
for k = 1:nd
    A{k} = smallval * ones(sz(k), nc);
    nbig = 5;
    for j = 1:nc
        p = randperm(sz(k));
        A{k}(p(1:nbig),j) = largeval;
    end
end
M_true = ktensor(A);
Convert the ktensor to an observed tensor. Get the model values, which correspond to the odds of observing a 1.
Mfull = full(M_true);
% Convert odds to probabilities
Mprobs = Mfull ./ (1 + Mfull);
% Flip a coin for each entry, with the probability of observing a one
% dictated by Mprobs
Xfull = 1.0*(tensor(@rand, sz) < Mprobs);
% Convert to sparse tensor, real-valued 0/1 tensor since it was constructed
% to be sparse
X = sptensor(Xfull);
fprintf('Proportion of nonzeros in X is %.2f%%\n', 100*nnz(X) / prod(sz));
Proportion of nonzeros in X is 8.42%
Just for fun, let's visualize the distribution of the probabilities in the model tensor.
histogram(Mprobs(:))
Call GCP_OPT on the full tensor
[M1,~,out] = gcp_opt(Xfull, nc, 'type', 'binary','printitn',25);
fprintf('Final score: %f\n', score(M1,M_true));
GCP-OPT-LBFGSB (Generalized CP Tensor Decomposition)
Tensor size: 50 x 60 x 70 (210000 total entries)
GCP rank: 3
Generalized function Type: binary
Objective function: @(x,m)log(m+1)-x.*log(m+1e-10)
Gradient function: @(x,m)1./(m+1)-x./(m+1e-10)
Lower bound of factor matrices: 0
Optimization method: lbfgsb
Max iterations: 1000
Projected gradient tolerance: 21
Begin Main loop
Iter 25, f(x) = 4.498422e+04, ||grad||_infty = 7.66e+01
Iter 50, f(x) = 4.323442e+04, ||grad||_infty = 1.27e+02
Iter 62, f(x) = 4.309850e+04, ||grad||_infty = 1.95e+01
End Main Loop
Final objective: 4.3098e+04
Setup time: 0.05 seconds
Main loop time: 0.71 seconds
Outer iterations: 62
Total iterations: 140
L-BFGS-B Exit message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL.
Final score: 0.739944
Call GCP-OPT on the sparse tensor
[M2,~,out] = gcp_opt(X, nc, 'type', 'binary');
fprintf('Final score: %f\n', score(M2,M_true));
GCP-OPT-ADAM (Generalized CP Tensor Decomposition)
Tensor size: 50 x 60 x 70 (210000 total entries)
Sparse tensor: 17690 (8.4%) Nonzeros and 192310 (91.58%) Zeros
GCP rank: 3
Generalized function Type: binary
Objective function: @(x,m)log(m+1)-x.*log(m+1e-10)
Gradient function: @(x,m)1./(m+1)-x./(m+1e-10)
Lower bound of factor matrices: 0
Optimization method: adam
Max iterations (epochs): 1000
Iterations per epoch: 1000
Learning rate / decay / maxfails: 0.001 0.1 1
Function Sampler: stratified with 17690 nonzero and 17690 zero samples
Gradient Sampler: stratified with 1000 nonzero and 1000 zero samples
Begin Main loop
Initial f-est: 7.428218e+04
Epoch  1: f-est = 4.717953e+04, step = 0.001
Epoch  2: f-est = 4.654388e+04, step = 0.001
Epoch  3: f-est = 4.622779e+04, step = 0.001
Epoch  4: f-est = 4.556995e+04, step = 0.001
Epoch  5: f-est = 4.513884e+04, step = 0.001
Epoch  6: f-est = 4.497412e+04, step = 0.001
Epoch  7: f-est = 4.478373e+04, step = 0.001
Epoch  8: f-est = 4.417746e+04, step = 0.001
Epoch  9: f-est = 4.361799e+04, step = 0.001
Epoch 10: f-est = 4.347086e+04, step = 0.001
Epoch 11: f-est = 4.341842e+04, step = 0.001
Epoch 12: f-est = 4.338054e+04, step = 0.001
Epoch 13: f-est = 4.341225e+04, step = 0.001, nfails = 1 (resetting to solution from last epoch)
Epoch 14: f-est = 4.336923e+04, step = 0.0001
Epoch 15: f-est = 4.336528e+04, step = 0.0001
Epoch 16: f-est = 4.335960e+04, step = 0.0001
Epoch 17: f-est = 4.336279e+04, step = 0.0001, nfails = 2 (resetting to solution from last epoch)
End Main Loop
Final f-est: 4.3360e+04
Setup time: 0.11 seconds
Main loop time: 21.92 seconds
Total iterations: 17000
Final score: 0.836891
Create and test a Poisson count tensor.
nc = 3;
sz = [80 90 100];
nd = length(sz);
paramRange = [0.5 60];
factorRange = paramRange.^(1/nd);
minFactorRatio = 95/100;
lambdaDamping = 0.8;
rng(21);
info = create_problem('Size', sz, ...
    'Num_Factors', nc, ...
    'Factor_Generator', @(m,n)factorRange(1)+(rand(m,n)>minFactorRatio)*(factorRange(2)-factorRange(1)), ...
    'Lambda_Generator', @(m,n)ones(m,1)*(lambdaDamping.^(0:n-1)'), ...
    'Sparse_Generation', 0.2);
M_true = normalize(arrange(info.Soln));
X = info.Data;
viz(M_true, 'Figure',3);
ktensor/viz: Normalizing factors and sorting components according to the 2-norm.
Loss function for Poisson negative log likelihood with identity link.
% Call GCP_OPT on sparse tensor
[M1,M0,out] = gcp_opt(X, nc, 'type', 'count','printitn',25);
fprintf('Final score: %f\n', score(M1,M_true));
GCP-OPT-ADAM (Generalized CP Tensor Decomposition)
Tensor size: 80 x 90 x 100 (720000 total entries)
Sparse tensor: 123856 (17%) Nonzeros and 596144 (82.80%) Zeros
GCP rank: 3
Generalized function Type: count
Objective function: @(x,m)m-x.*log(m+1e-10)
Gradient function: @(x,m)1-x./(m+1e-10)
Lower bound of factor matrices: 0
Optimization method: adam
Max iterations (epochs): 1000
Iterations per epoch: 1000
Learning rate / decay / maxfails: 0.001 0.1 1
Function Sampler: stratified with 100000 nonzero and 100000 zero samples
Gradient Sampler: stratified with 1000 nonzero and 1000 zero samples
Begin Main loop
Initial f-est: 4.721309e+05
Epoch 14: f-est = 3.448798e+05, step = 0.001, nfails = 1 (resetting to solution from last epoch)
Epoch 18: f-est = 3.447516e+05, step = 0.0001, nfails = 2 (resetting to solution from last epoch)
End Main Loop
Final f-est: 3.4475e+05
Setup time: 0.05 seconds
Main loop time: 25.09 seconds
Total iterations: 18000
Final score: 0.954415