sgdml package

sgdml.predict module

This module contains all routines for evaluating GDML and sGDML models.

class sgdml.predict.GDMLPredict(model, batch_size=None, num_workers=1, max_processes=None)[source]
_predict_bulk(R)[source]
Predict energy and forces for multiple geometries.

Parameters:
    R (numpy.ndarray) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom of M molecules.

Returns:
    numpy.ndarray – Energies stored in a 1D array of size M.
    numpy.ndarray – Forces stored in a 2D array of size M x 3N.
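The M x 3N layout can be sketched as follows; this is an illustrative NumPy snippet showing how per-atom Cartesian coordinates map onto the flattened rows that the prediction routines expect (all variable names here are examples, not part of the sgdml API):

```python
import numpy as np

# Illustrative sketch of the M x 3N layout: M geometries of N atoms,
# each geometry flattened row-wise as atom 0 (x, y, z), atom 1 (x, y, z), ...
M, N = 4, 3  # four geometries of a three-atom molecule
geometries = np.random.rand(M, N, 3)  # Cartesian coordinates, shape (M, N, 3)

R = geometries.reshape(M, 3 * N)  # bulk-prediction layout

print(R.shape)  # (4, 9)
```

Energies then come back as one value per row of R, and forces in the same M x 3N layout as the input.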

predict(r)[source]
Predict energy and forces for multiple geometries.

Note
The order of the atoms in r is not arbitrary and must be the same as used for training the model.

Parameters:
    r (numpy.ndarray) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom of M molecules.

Returns:
    numpy.ndarray – Energies stored in a 1D array of size M.
    numpy.ndarray – Forces stored in a 2D array of size M x 3N.

set_batch_size(batch_size=None)[source]
Set the chunk size for each process.

The chunk size determines how much of a process's workload is passed to Python's underlying low-level routines at once. This parameter is highly hardware dependent. A chunk is a subset of the training set of the model.

Note
This parameter can be optimally determined using set_opt_num_workers_and_batch_size_fast.

Parameters:
    batch_size (int) – Chunk size (maximum value is set if None).
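The chunking idea can be sketched in a few lines; this is a minimal, hedged illustration of how a process's share of the training set is split into fixed-size chunks (the helper name is illustrative, not an sgdml internal):

```python
# Minimal sketch of the chunking behind batch_size: a process's workload
# is handed to the low-level routines in fixed-size pieces, not all at once.
def iter_chunks(n_points, batch_size):
    """Yield (start, stop) index pairs covering range(n_points)."""
    for start in range(0, n_points, batch_size):
        yield start, min(start + batch_size, n_points)

print(list(iter_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```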

set_num_workers(num_workers=None)[source]
Set the number of processes to use during prediction.

This number should not exceed the number of available CPU cores.

Note
This parameter can be optimally determined using set_opt_num_workers_and_batch_size_fast.

Parameters:
    num_workers (int) – Number of processes (maximum value is set if None).

set_opt_num_workers_and_batch_size_fast(n_bulk=1, n_reps=3)[source]
Determine the optimal number of processes and chunk size to use when evaluating the loaded model.

This routine runs a benchmark in which the prediction routine is repeatedly called n_reps times with varying parameter configurations, while the runtime is measured for each one. The optimal parameters are then set automatically.

Note
Depending on the parameter n_reps, this routine takes some seconds to complete, which is why it only makes sense to call it before running a large number of predictions.

Parameters:
    n_bulk (int) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).
    n_reps (int) – Number of repetitions (bigger value: more accurate, but also slower).

Returns:
    Force and energy prediction speed in geometries per second.

Return type:
    int
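The benchmarking strategy described above can be sketched as follows; this is a hedged illustration (not the sgdml implementation) of timing a callable under each candidate configuration for n_reps repetitions and keeping the fastest:

```python
import time

# Sketch of the benchmark idea: run each candidate configuration n_reps
# times, keep the one with the smallest best-of-n_reps runtime.
def pick_fastest(configs, run_once, n_reps=3):
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        times = []
        for _ in range(n_reps):
            start = time.perf_counter()
            run_once(cfg)  # e.g. a bulk prediction with this configuration
            times.append(time.perf_counter() - start)
        if min(times) < best_time:
            best_cfg, best_time = cfg, min(times)
    return best_cfg

# Toy example: "configurations" that simply sleep for different durations.
cfg = pick_fastest([0.02, 0.001, 0.01], time.sleep)
print(cfg)  # 0.001
```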


sgdml.predict._predict_wkr(wkr_start_stop, chunk_size, r_desc)[source]
Compute part of a prediction.

The workload is processed in chunks of chunk_size.

Parameters:
    wkr_start_stop (tuple of int) – Indices of the first and last (exclusive) sum element.
    chunk_size (int) – Chunk size.
    r_desc (numpy.ndarray) – 1D array containing the descriptor for the query geometry.

Returns:
    Partial prediction of all force components and energy (appended to the array as its last element).

Return type:
    numpy.ndarray

Return a ctypes array allocated from shared memory, with data copied from a NumPy array of type float.

Parameters:
    arr_np (numpy.ndarray) – NumPy array.

Return type:
    array of ctype
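The shared-memory pattern mentioned above can be sketched like this; it mirrors the general idea (copying a NumPy float array into a ctypes array backed by shared memory so worker processes can read it without repeated serialization), not the exact sgdml helper:

```python
import ctypes
import multiprocessing as mp
import numpy as np

# Hedged sketch: allocate a shared ctypes buffer and copy NumPy data in.
def share_array_sketch(arr_np):
    arr_shared = mp.RawArray(ctypes.c_double, arr_np.size)
    # View the shared buffer as a NumPy array and copy the data over.
    np.frombuffer(arr_shared, dtype=np.float64)[:] = arr_np.ravel()
    return arr_shared

a = np.arange(6, dtype=np.float64)
shared = share_array_sketch(a)
print(shared[3])  # 3.0
```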
sgdml.train module

This module contains all routines for training GDML and sGDML models.

class sgdml.train.GDMLTrain(max_processes=None)[source]
_assemble_kernel_mat(R_desc, R_d_desc, n_perms, tril_perms_lin, sig, use_E_cstr=False, progr_callback=None)[source]
Compute the force field kernel matrix.

The Hessian of the Matern kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters:
    R_desc (numpy.ndarray) – Array containing the descriptor for each training point.
    R_d_desc (numpy.ndarray) – Array containing the gradient of the descriptor for each training point.
    n_perms (int) – Number of individual permutations encoded in tril_perms_lin.
    tril_perms_lin (numpy.ndarray) – 1D array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.
    sig (int) – Hyperparameter \(\sigma\) (kernel length scale).
    use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML kernel.
    progr_callback (callable, optional) – Kernel assembly progress function that takes three arguments:
        current (int) – Current progress (number of completed entries).
        total (int) – Task size (total number of entries to create).
        duration_s (float or None, optional) – Once complete, this parameter contains the time it took to assemble the kernel (seconds).

Returns:
    Force field kernel matrix.

Return type:
    numpy.ndarray
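The row-and-column assembly pattern described above can be sketched with a scalar kernel; this is illustrative only, with a simple Gaussian (RBF) kernel standing in for the matrix-valued Matern-Hessian blocks that sGDML actually uses:

```python
import numpy as np

# Illustrative: build a symmetric kernel matrix one row/column at a time,
# exploiting symmetry to fill the mirrored entries.
def assemble_kernel(X, sig):
    n = X.shape[0]
    K = np.empty((n, n))
    for j in range(n):
        # Interaction of training point j with all others.
        k_j = np.exp(-np.sum((X - X[j]) ** 2, axis=1) / (2 * sig**2))
        K[j, :] = k_j
        K[:, j] = k_j
    return K

X = np.random.default_rng(0).random((5, 3))
K = assemble_kernel(X, sig=1.0)
print(K.shape)  # (5, 5)
```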

_recov_int_const(model, task)[source]
Estimate the integration constant for a force field model.

The offset between the energies predicted for the original training data and the true energy labels is computed in the least-squares sense. Furthermore, common issues with the user-provided datasets are self-diagnosed here.

Parameters:
    model (dict) – Data structure of custom type model.
    task (dict) – Data structure of custom type task.

Returns:
    Estimate for the integration constant.

Return type:
    float

Raises:
    ValueError – If the sign of the force labels in the dataset from which the model emerged is switched (e.g. gradients instead of forces).
    ValueError – If inconsistent/corrupted energy labels are detected in the provided dataset.
    ValueError – If different scales in energy vs. force labels are detected in the provided dataset.
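The least-squares estimate of a constant offset reduces to the mean residual, which can be shown in a toy sketch (the arrays are made-up example data, not sgdml internals): for E_true ≈ E_pred + c, minimizing the squared error over c gives c = mean(E_true - E_pred).

```python
import numpy as np

# Toy illustration of recovering an integration constant in the
# least-squares sense: the optimal constant offset is the mean residual.
E_pred = np.array([1.0, 2.0, 3.0])     # energies known only up to a constant
E_true = np.array([11.1, 11.9, 13.0])  # reference energy labels

c = np.mean(E_true - E_pred)  # least-squares estimate of the offset
print(round(float(c), 2))  # 10.0
```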
create_task(train_dataset, n_train, valid_dataset, n_valid, sig, lam=1e-15, use_sym=True, use_E=True, use_E_cstr=False, use_cprsn=False)[source]
Create a data structure of custom type task.

These data structures serve as recipes for model creation, summarizing the configuration of one particular training run. Training and test points are sampled from the provided dataset without replacement. If the same dataset is given for training and testing, the subsets are drawn without overlap.

Each task also contains a choice for the hyperparameters of the training process and the MD5 fingerprints of the used datasets.

Parameters:
    train_dataset (dict) – Data structure of custom type dataset containing the training dataset.
    n_train (int) – Number of training points to sample.
    valid_dataset (dict) – Data structure of custom type dataset containing the validation dataset.
    n_valid (int) – Number of validation points to sample.
    sig (int) – Hyperparameter \(\sigma\) (kernel length scale).
    lam (float, optional) – Hyperparameter \(\lambda\) (regularization strength).
    use_sym (bool, optional) – True: include symmetries (sGDML), False: GDML.
    use_E (bool, optional) – True: reconstruct the force field with the corresponding potential energy surface, False: ignore energies during training, even if energy labels are available in the dataset. The trained model will still be able to predict energies up to an unknown integration constant. Note that the accuracy of the energy predictions will be untested.
    use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML.
    use_cprsn (bool, optional) – True: compress the kernel matrix along symmetric degrees of freedom, False: train using the full kernel matrix.

Returns:
    Data structure of custom type task.

Return type:
    dict

Raises:
    ValueError – If a reconstruction of the potential energy surface is requested, but the energy labels are missing in the dataset.
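The dataset fingerprinting mentioned above can be sketched as follows; the field names in this task-like dict are illustrative examples, not the exact sgdml task layout:

```python
import hashlib
import numpy as np

# Illustrative sketch: record hyperparameters and an MD5 fingerprint of
# the dataset in a task-like dict. Field names are examples only.
R = np.arange(12, dtype=np.float64).reshape(2, 6)  # toy coordinates

task = {
    "n_train": 2,
    "sig": 10,      # kernel length scale
    "lam": 1e-15,   # regularization strength
    "md5_train": hashlib.md5(R.tobytes()).hexdigest(),
}
print(len(task["md5_train"]))  # 32
```

The fingerprint lets a task (and any model derived from it) be checked later against the dataset it was created from.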

draw_strat_sample(T, n, excl_idxs=None)[source]
Draw a sample from a dataset that preserves its original distribution.

The distribution is estimated from a histogram where the bin size is determined using the Freedman-Diaconis rule. This rule is designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution. A reduced histogram is then constructed by sampling uniformly in each bin. It is intended to populate all bins with at least one sample in the reduced histogram, even for small training sizes.

Parameters:
    T (numpy.ndarray) – Dataset to sample from.
    n (int) – Number of examples.
    excl_idxs (numpy.ndarray, optional) – Array of indices to exclude from the sample.

Returns:
    Array of indices that form the sample.

Return type:
    numpy.ndarray
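The scheme described above can be sketched with NumPy's built-in Freedman-Diaconis estimator; this is a hedged illustration of the idea (bin the values, then draw uniformly within each bin, visiting every non-empty bin first), not the sgdml implementation:

```python
import numpy as np

# Hedged sketch of stratified sampling: Freedman-Diaconis binning
# (numpy's bins="fd"), then round-robin uniform draws within each bin so
# that every non-empty bin contributes at least one sample early.
def strat_sample_sketch(T, n, seed=0):
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(T, bins="fd")
    bin_of = np.clip(np.digitize(T, edges) - 1, 0, len(edges) - 2)
    idxs = []
    while len(idxs) < n:  # assumes n <= len(T)
        for b in np.unique(bin_of):
            members = np.setdiff1d(np.where(bin_of == b)[0], idxs)
            if members.size and len(idxs) < n:
                idxs.append(rng.choice(members))
    return np.array(idxs)

T = np.random.default_rng(1).normal(size=200)
sample = strat_sample_sketch(T, 20)
print(sample.shape)  # (20,)
```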
train(task, cprsn_callback=None, ker_progr_callback=None, solve_callback=None)[source]
Train a model based on a training task.

Parameters:
    task (dict) – Data structure of custom type task.
    cprsn_callback (callable, optional) – Symmetry compression status function that takes two arguments:
        n_atoms (int) – Total number of atoms.
        n_atoms_kept (float or None, optional) – Number of atoms kept after compression.
    ker_progr_callback (callable, optional) – Kernel assembly progress function that takes three arguments:
        current (int) – Current progress (number of completed entries).
        total (int) – Task size (total number of entries to create).
        duration_s (float or None, optional) – Once complete, this parameter contains the time it took to assemble the kernel (seconds).
    solve_callback (callable, optional) – Linear system solver status function that takes two arguments:
        done (bool) – False when the solver starts, True when it finishes.
        duration_s (float or None, optional) – Once done, this parameter contains the runtime of the solver (seconds).

Returns:
    Data structure of custom type model.

Return type:
    dict
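Callbacks matching the signatures described above can be written as plain functions; the message strings here are illustrative (returning rather than printing them keeps the sketch easy to test):

```python
# Hedged examples of callbacks with the argument lists described above.
def ker_progress(current, total, duration_s=None):
    """Kernel assembly progress callback."""
    if duration_s is not None:
        return f"kernel assembled in {duration_s:.1f} s"
    return f"kernel: {current}/{total}"

def solver_status(done, duration_s=None):
    """Linear system solver status callback."""
    return "solver finished" if done else "solver started"

print(ker_progress(50, 100))  # kernel: 50/100
```

Such functions would be passed as the ker_progr_callback and solve_callback arguments, e.g. to drive a progress bar.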


sgdml.train._assemble_kernel_mat_wkr(j, n_perms, tril_perms_lin, sig, use_E_cstr=False)[source]
Compute one row and column of the force field kernel matrix.

The Hessian of the Matern kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters:
    j (int) – Index of the training point.
    n_perms (int) – Number of individual permutations encoded in tril_perms_lin.
    tril_perms_lin (numpy.ndarray) – 1D array (int) containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.
    sig (int) – Hyperparameter \(\sigma\).

Returns:
    Number of kernel matrix blocks created, divided by 2 (symmetric blocks are always created together).

Return type:
    int
Return a ctypes array allocated from shared memory, with data copied from a NumPy array.

Parameters:
    arr_np (numpy.ndarray) – NumPy array.
    typecode_or_type (char or ctype) – Either a ctypes type or a one-character typecode of the kind used by the Python array module.

Return type:
    array of ctype