sgdml package

Submodules

sgdml.predict module

This module contains all routines for evaluating GDML and sGDML models.

class sgdml.predict.GDMLPredict(model, batch_size=None, num_workers=1, max_processes=None)[source]
_predict_bulk(R)[source]

Predict energy and forces for multiple geometries.

Parameters:R (numpy.ndarray) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom of M molecules.
Returns:
  • numpy.ndarray – Energies stored in a 1D array of size M.
  • numpy.ndarray – Forces stored in a 2D array of size M x 3N.
predict(r)[source]

Predict energy and forces for multiple geometries.

Note

The order of the atoms in r is not arbitrary and must be the same as used for training the model.

Parameters:r (numpy.ndarray) – A 2D array of size M x 3N containing the Cartesian coordinates of each atom of M molecules.
Returns:
  • numpy.ndarray – Energies stored in a 1D array of size M.
  • numpy.ndarray – Forces stored in a 2D array of size M x 3N.
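
A minimal usage sketch (the model file name and the random geometry are placeholders for a trained model and a real molecular geometry):

    import numpy as np
    from sgdml.predict import GDMLPredict

    model = np.load('m_ethanol.npz')  # placeholder: a trained model file
    gdml = GDMLPredict(model)

    # One geometry with 9 atoms (3N = 27); atom order must match training.
    r = np.random.rand(1, 27)
    e, f = gdml.predict(r)
    print(e.shape, f.shape)  # (1,), (1, 27)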
set_batch_size(batch_size=None)[source]

Set chunk size for each process.

The chunk size determines how much of a process's workload will be passed to Python's underlying low-level routines at once. This parameter is highly hardware dependent. A chunk is a subset of the training set of the model.

Note

This parameter can be optimally determined using set_opt_num_workers_and_batch_size_fast.

Parameters:batch_size (int) – Chunk size (maximum value is set if None).
set_num_workers(num_workers=None)[source]

Set number of processes to use during prediction.

This number should not exceed the number of available CPU cores.

Note

This parameter can be optimally determined using set_opt_num_workers_and_batch_size_fast.

Parameters:num_workers (int) – Number of processes (maximum value is set if None).
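
Both parameters can also be set manually, as in this minimal sketch (the values are arbitrary examples; `gdml` is a GDMLPredict instance):

    gdml.set_batch_size(250)  # chunk size per process (hardware dependent)
    gdml.set_num_workers(4)   # should not exceed the number of CPU cores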
set_opt_num_workers_and_batch_size_fast(n_bulk=1, n_reps=3)[source]

Determine the optimal number of processes and chunk size to use when evaluating the loaded model.

This routine runs a benchmark in which the prediction routine is repeatedly called n_reps times with varying parameter configurations, while the runtime is measured for each one. The optimal parameters are then set automatically.

Note

Depending on the parameter n_reps, this routine takes several seconds to complete, which is why it only makes sense to call it before running a large number of predictions.

Parameters:
  • n_bulk (int) – Number of geometries that will be passed to the predict function in each call (performance will be optimized for that exact use case).
  • n_reps (int) – Number of repetitions (bigger value: more accurate, but also slower).
Returns:

Force and energy prediction speed in geometries per second.

Return type:

int
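
A usage sketch, assuming `gdml` is a GDMLPredict instance that will later be called with batches of 100 geometries:

    # Benchmark once, then reuse the tuned settings for many predictions.
    gps = gdml.set_opt_num_workers_and_batch_size_fast(n_bulk=100)
    print('{} geometries/s'.format(gps))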

sgdml.predict._predict_wkr(wkr_start_stop, chunk_size, r_desc)[source]

Compute part of a prediction.

The workload will be processed in chunks of chunk_size training points.

Parameters:
  • wkr_start_stop (tuple of int) – Indices of the first and last (exclusive) sum element.
  • chunk_size (int) – Chunk size (number of training points to process at once).
  • r_desc (numpy.ndarray) – 1D array containing the descriptor for the query geometry.
Returns:

Partial prediction of all force components and energy (appended to array as last element).

Return type:

numpy.ndarray

sgdml.predict.share_array(arr_np)[source]

Return a ctypes array allocated from shared memory with data from a NumPy array of type float.

Parameters:arr_np (numpy.ndarray) – NumPy array.
Returns:ctypes array allocated from shared memory.
Return type:array of ctype
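
A minimal sketch of the pattern such a helper typically follows (the function name and the float64 assumption are illustrative, not the library's exact implementation):

    import multiprocessing as mp
    import numpy as np

    def share_array_sketch(arr_np):
        # Allocate shared memory ('d' = C double) and copy the data in.
        arr = mp.RawArray('d', arr_np.size)
        np.frombuffer(arr)[:] = arr_np.ravel()
        return arr, arr_np.shape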

sgdml.train module

This module contains all routines for training GDML and sGDML models.

class sgdml.train.GDMLTrain(max_processes=None)[source]
_assemble_kernel_mat(R_desc, R_d_desc, n_perms, tril_perms_lin, sig, use_E_cstr=False, progr_callback=None)[source]

Compute force field kernel matrix.

The Hessian of the Matérn kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters:
  • R_desc (numpy.ndarray) – Array containing the descriptor for each training point.

  • R_d_desc (numpy.ndarray) – Array containing the gradient of the descriptor for each training point.

  • n_perms (int) – Number of individual permutations encoded in tril_perms_lin.

  • tril_perms_lin (numpy.ndarray) – 1D array containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.

  • sig (int) – Hyper-parameter \(\sigma\) (kernel length scale).

  • use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML kernel.

  • progr_callback (callable, optional) – Kernel assembly progress function that takes three arguments:

    current : int

    Current progress (number of completed entries).

    total : int

    Task size (total number of entries to create).

    duration_s : float or None, optional

    Once complete, this parameter contains the time it took to assemble the kernel (seconds).

Returns:

Force field kernel matrix.

Return type:

numpy.ndarray

_recov_int_const(model, task)[source]

Estimate the integration constant for a force field model.

The offset between the energies predicted for the original training data and the true energy labels is computed in the least-squares sense. Furthermore, common issues with the user-provided datasets are automatically diagnosed here.

Parameters:
  • model (dict) – Data structure of custom type model.
  • task (dict) – Data structure of custom type task.
Returns:

Estimate for the integration constant.

Return type:

float

Raises:
  • ValueError – If the sign of the force labels in the dataset from which the model emerged is switched (e.g. gradients instead of forces).
  • ValueError – If inconsistent/corrupted energy labels are detected in the provided dataset.
  • ValueError – If different scales in energy vs. force labels are detected in the provided dataset.
create_task(train_dataset, n_train, valid_dataset, n_valid, sig, lam=1e-15, use_sym=True, use_E=True, use_E_cstr=False, use_cprsn=False)[source]

Create a data structure of custom type task.

These data structures serve as recipes for model creation, summarizing the configuration of one particular training run. Training and validation points are sampled from the provided datasets without replacement. If the same dataset is given for training and validation, the subsets are drawn without overlap.

Each task also contains a choice for the hyper-parameters of the training process and the MD5 fingerprints of the used datasets.

Parameters:
  • train_dataset (dict) – Data structure of custom type dataset containing train dataset.

  • n_train (int) – Number of training points to sample.

  • valid_dataset (dict) – Data structure of custom type dataset containing validation dataset.

  • n_valid (int) – Number of validation points to sample.

  • sig (int) – Hyper-parameter \(\sigma\) (kernel length scale).

  • lam (float, optional) – Hyper-parameter lambda (regularization strength).

  • use_sym (bool, optional) – True: include symmetries (sGDML), False: GDML.

  • use_E (bool, optional) – True: reconstruct force field with corresponding potential energy surface, False: ignore energy during training, even if energy labels are available in the dataset. The trained model will still be able to predict energies up to an unknown integration constant. Note that the accuracy of the energy predictions will be untested.

  • use_E_cstr (bool, optional) – True: include energy constraints in the kernel, False: default (s)GDML.

  • use_cprsn (bool, optional) – True: compress kernel matrix along symmetric degrees of freedom, False: train using full kernel matrix.

Returns:

Data structure of custom type task.

Return type:

dict

Raises:

ValueError – If a reconstruction of the potential energy surface is requested, but the energy labels are missing in the dataset.
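
A minimal training sketch (file names are placeholders; here the same dataset is used for training and validation, so the subsets are drawn without overlap):

    import numpy as np
    from sgdml.train import GDMLTrain

    dataset = np.load('d_ethanol.npz')  # placeholder: a dataset file

    gdml_train = GDMLTrain()
    task = gdml_train.create_task(dataset, 200, dataset, 1000, sig=20)
    model = gdml_train.train(task)
    np.savez_compressed('m_ethanol.npz', **model)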

draw_strat_sample(T, n, excl_idxs=None)[source]

Draw sample from dataset that preserves its original distribution.

The distribution is estimated from a histogram where the bin size is determined using the Freedman-Diaconis rule. This rule is designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution. A reduced histogram is then constructed by sampling uniformly from each bin. The sampling aims to populate every bin with at least one sample, even for small training sizes.

Parameters:
  • T (numpy.ndarray) – Dataset to sample from.
  • n (int) – Number of examples.
  • excl_idxs (numpy.ndarray, optional) – Array of indices to exclude from sample.
Returns:

Array of indices that form the sample.

Return type:

numpy.ndarray
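
A usage sketch, assuming `gdml_train` and `dataset` from the example above; stratifying over the energy labels is a common choice:

    # Draw 200 indices whose energy distribution matches the full dataset.
    idxs = gdml_train.draw_strat_sample(dataset['E'], 200)
    R_sub, E_sub = dataset['R'][idxs], dataset['E'][idxs]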

train(task, cprsn_callback=None, ker_progr_callback=None, solve_callback=None)[source]

Train a model based on a training task.

Parameters:
  • task (dict) – Data structure of custom type task.

  • cprsn_callback (callable, optional) –

    Symmetry compression status.
    n_atoms : int

    Total number of atoms.

    n_atoms_kept : float or None, optional

    Number of atoms kept after compression.

  • ker_progr_callback (callable, optional) – Kernel assembly progress function that takes three arguments:

    current : int

    Current progress (number of completed entries).

    total : int

    Task size (total number of entries to create).

    duration_s : float or None, optional

    Once complete, this parameter contains the time it took to assemble the kernel (seconds).

  • solve_callback (callable, optional) –

    Linear system solver status.
    done : bool

    False when solver starts, True when it finishes.

    duration_s : float or None, optional

    Once done, this parameter contains the runtime of the solver (seconds).

Returns:

Data structure of custom type model.

Return type:

dict
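
A sketch of a kernel progress callback matching the three-argument signature described above (the printing logic is illustrative):

    def ker_progr(current, total, duration_s=None):
        # Called repeatedly during kernel assembly; duration_s is set once done.
        if duration_s is None:
            print('assembling kernel: {}/{}'.format(current, total))
        else:
            print('kernel assembled in {:.1f} s'.format(duration_s))

    model = gdml_train.train(task, ker_progr_callback=ker_progr)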

sgdml.train._assemble_kernel_mat_wkr(j, n_perms, tril_perms_lin, sig, use_E_cstr=False)[source]

Compute one row and column of the force field kernel matrix.

The Hessian of the Matérn kernel is used with n = 2 (twice differentiable). Each row and column consists of matrix-valued blocks, which encode the interaction of one training point with all others. The result is stored in shared memory (a global variable).

Parameters:
  • j (int) – Index of training point.
  • n_perms (int) – Number of individual permutations encoded in tril_perms_lin.
  • tril_perms_lin (numpy.ndarray) – 1D array (int) containing all recovered permutations expanded as one large permutation to be applied to a tiled copy of the object to be permuted.
  • sig (int) – Hyper-parameter \(\sigma\).
Returns:

Number of kernel matrix blocks created, divided by 2 (symmetric blocks are always created together).

Return type:

int

sgdml.train._share_array(arr_np, typecode_or_type)[source]

Return a ctypes array allocated from shared memory with data from a NumPy array.

Parameters:
  • arr_np (numpy.ndarray) – NumPy array.
  • typecode_or_type (char or ctype) – Either a ctypes type or a one character typecode of the kind used by the Python array module.
Returns:

Return type:

array of ctype

Module contents