Force Field Reconstruction

To reconstruct an sGDML force field, we need a dataset file, which can be generated from various file formats (see Data Preparation). For a quick start, we can also continue with one of the included benchmark datasets.

The easiest way to create a model is with the automated model creation assistant:

$ sgdml all ethanol.npz <n_train> <n_validate> <n_test>

The parameters <n_train>, <n_validate> and <n_test> specify the sample sizes for the training, validation and test datasets, respectively. All of them are taken from the provided bulk dataset ethanol.npz, without overlap.
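To make "without overlap" concrete, here is a minimal sketch of drawing three disjoint index sets from one bulk dataset. It assumes plain uniform random sampling, which is not necessarily how sgdml selects its points; the function name `split_indices` and the dataset size of 10000 are made up for illustration.

```python
import random


def split_indices(n_total, n_train, n_validate, n_test, seed=0):
    """Draw three disjoint index sets from range(n_total) (hypothetical helper)."""
    n_needed = n_train + n_validate + n_test
    if n_needed > n_total:
        raise ValueError(
            f"requested {n_needed} samples, but the dataset has only {n_total}"
        )
    # Sampling without replacement guarantees that no index is drawn twice.
    picked = random.Random(seed).sample(range(n_total), n_needed)
    train = picked[:n_train]
    validate = picked[n_train:n_train + n_validate]
    test = picked[n_train + n_validate:]
    return train, validate, test


# Illustrative sizes only; 10000 is an assumed number of geometries.
train_idx, valid_idx, test_idx = split_indices(10000, 200, 1000, 5000)
```

Because all three sets are slices of a single draw without replacement, no geometry can end up in more than one of them. The sum of the three sample sizes must not exceed the number of points in the bulk dataset.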

The size of the training set is the most important parameter: a larger value will yield a more accurate model, but at the cost of increased training time and memory requirement. Increasing the size of the validation and test dataset carries no such penalty. In fact, it is desirable to use large test datasets to get a reliable estimate for the generalization performance of the trained model.
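To get a feeling for how quickly the memory requirement grows, the following back-of-the-envelope sketch estimates the size of a dense kernel matrix. It rests on two assumptions that are not stated above: that the solver stores one row and one column per Cartesian force component of every training geometry, and that entries are double-precision floats. The helper name `kernel_matrix_gib` is hypothetical, and a solver that never builds the full matrix would need less memory.

```python
def kernel_matrix_gib(n_train, n_atoms, bytes_per_entry=8):
    """Estimated size in GiB of a dense, square kernel matrix.

    Assumption: one row/column per force component (3 per atom)
    of every training geometry.
    """
    n_rows = 3 * n_atoms * n_train
    return n_rows ** 2 * bytes_per_entry / 2 ** 30


# Ethanol (C2H6O) has 9 atoms.
for n_train in (200, 1000, 5000):
    size = kernel_matrix_gib(n_train, n_atoms=9)
    print(f"n_train={n_train:5d}: {size:8.1f} GiB")
```

Under these assumptions the estimate grows quadratically: doubling the training set size quadruples the memory needed.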

There is not much more to it: the command line call above performs all steps necessary to reconstruct an sGDML force field. Once the program finishes, it outputs a fully trained and tested model file.


Model training is a memory-intensive task that is best executed on a powerful computer.


If the reconstruction process is terminated prematurely, the same command can simply be reissued to resume it.

Manual reconstruction

In the example above, we have used the sgdml command all to automate force field reconstruction. However, each training step can also be called individually as described here.

Python API

The same functionality is also exposed via the Python API, which is particularly useful when developing new models based on the existing sGDML implementation.

Here is how to train a single model for one particular choice of hyper-parameters, sig = 10 and lam = 1e-15. Unlike the assistant used above, this performs no cross-validation or testing:

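import sys
import numpy as np
from sgdml.train import GDMLTrain

# Bulk dataset from which training and validation points are drawn.
dataset = np.load('d_ethanol.npz')
n_train = 200

gdml_train = GDMLTrain()
# Describe the training task for one fixed hyper-parameter choice.
task = gdml_train.create_task(dataset, n_train,
        valid_dataset=dataset, n_valid=1000,
        sig=10, lam=1e-15)

try:
    model = gdml_train.train(task)
except Exception as err:
    # Abort with the error message if training fails.
    sys.exit(err)
else:
    # Save the model only if training succeeded.
    np.savez_compressed('m_ethanol.npz', **model)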
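The call np.savez_compressed('m_ethanol.npz', **model) unpacks the model into keyword arguments, so every entry is stored as a named array in the archive. The sketch below shows that round trip with a stand-in dictionary. The keys and shapes of `toy_model` are invented for illustration and do not reflect the actual contents of an sGDML model file.

```python
import os
import tempfile

import numpy as np

# Stand-in for a trained model: a plain dict of named values.
# These keys and shapes are made up; a real model file will differ.
toy_model = {'sig': 10, 'lam': 1e-15, 'weights': np.zeros((200, 27))}

path = os.path.join(tempfile.mkdtemp(), 'm_toy.npz')
np.savez_compressed(path, **toy_model)  # each dict key becomes one named array

# Reopen the archive and list what was stored under which name.
with np.load(path) as archive:
    summary = {name: archive[name].shape for name in archive.files}

print(summary)
```

Scalars such as `sig` come back as zero-dimensional arrays, so their shape is the empty tuple. The same loop can be pointed at a real model or dataset file to see which arrays it contains.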