# QM7 Dataset

## Description

This dataset is a subset of GDB-13 (a database of nearly 1 billion stable and synthetically accessible organic molecules) composed of all molecules of up to 23 atoms (including up to 7 heavy atoms C, N, O, and S), totalling 7165 molecules. We provide the Coulomb matrix representation of these molecules and their atomization energies computed similarly to the FHI-AIMS implementation of the Perdew-Burke-Ernzerhof hybrid functional (PBE0). This dataset features a large variety of molecular structures such as double and triple bonds, cycles, carboxy, cyanide, amide, alcohol and epoxy groups. The Coulomb matrix is defined as

\[
C_{ij} =
\begin{cases}
0.5\, Z_i^{2.4} & \text{for } i = j, \\
\dfrac{Z_i Z_j}{|R_i - R_j|} & \text{for } i \neq j,
\end{cases}
\]

where \(Z_i\) is the nuclear charge of atom \(i\) and \(R_i\) is its position. The Coulomb matrix has built-in invariance to translation and rotation of the molecule. The atomization energies are given in kcal/mol and range from -800 to -2000 kcal/mol.
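As an illustration, the following minimal sketch builds a zero-padded Coulomb matrix from nuclear charges and positions, using the standard definition (diagonal entries \(0.5\,Z_i^{2.4}\), off-diagonal entries \(Z_i Z_j / |R_i - R_j|\)). The function name `coulomb_matrix` is hypothetical, and the assumption that positions are given in atomic units (Bohr) is ours, not stated on this page.

```python
import math


def coulomb_matrix(Z, R, size=23):
    """Return a size x size Coulomb matrix as nested lists, zero-padded.

    Z: nuclear charges of the n atoms.
    R: their 3D positions (assumed here to be in Bohr).
    """
    n = len(Z)
    if n > size:
        raise ValueError("molecule has more atoms than the matrix size")
    C = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: fitted self-interaction term of the free atom.
                C[i][j] = 0.5 * Z[i] ** 2.4
            else:
                # Off-diagonal: Coulomb repulsion between nuclei i and j.
                C[i][j] = Z[i] * Z[j] / math.dist(R[i], R[j])
    return C
```

Because only interatomic distances enter the off-diagonal terms, translating or rotating the molecule leaves the matrix unchanged.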

The dataset is composed of three multidimensional arrays **X** (7165 x 23 x 23), **T** (7165) and **P** (5 x 1433) representing the inputs (Coulomb matrices), the labels (atomization energies) and the splits for cross-validation, respectively. The dataset also contains two additional multidimensional arrays **Z** (7165 x 23) and **R** (7165 x 23 x 3) representing the atomic charge and the Cartesian coordinates of each atom in the molecules.
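A minimal sketch of how the split array **P** could be used for cross-validation is given below. It assumes that each of the five rows of **P** holds the molecule indices of one fold and that the chosen fold is held out for testing; the helper name `cv_indices` is hypothetical.

```python
def cv_indices(P, split):
    """Return (train, test) index lists for one cross-validation split.

    P: sequence of 5 folds, each a sequence of molecule indices
       (a NumPy array of shape (5, 1433) works as well).
    split: index of the fold held out for testing (0 to 4).
    """
    test = [int(i) for i in P[split]]
    train = [int(i) for k, fold in enumerate(P) if k != split for i in fold]
    return train, test
```

With SciPy installed, the arrays can be read with `scipy.io.loadmat("qm7.mat")`, which returns a dictionary keyed by array name; depending on how the file was written, **T** may need to be flattened before indexing.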

## Download

- Matlab version: `data/qm7.mat` (17.9 MB)

## Benchmark results

The mean absolute error (MAE), averaged over the five cross-validation splits of the dataset, is reported below.

Reference | Method | MAE
---|---|---
Rupp et al., PRL, 2012 | Kernel ridge regression with Gaussian kernel on Coulomb matrix sorted eigenspectrum | 9.9 kcal/mol
Montavon et al., NIPS, 2012 | Multilayer perceptron with binarized random Coulomb matrices | 3.5 kcal/mol

## Code

`code/nn-qm7.tar.gz`: Simple multilayer perceptron trained on the QM7 dataset with error backpropagation, yielding errors in the range of 3-4 kcal/mol. Train the network by running

`$ python nntrain.py [split]`

where `[split]` is a number between `0` and `4`. Training runs in the background (warning: training can take up to two days depending on the machine). To test the current performance, open another terminal and run

`$ python nntest.py [split]`

where `[split]` has the same value as for training. The command returns the current training and test errors in terms of MAE and RMSE.
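For reference, the two error measures can be computed as in the sketch below. This is not taken from the `nn-qm7` code; the function names `mae` and `rmse` are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))
```

RMSE penalizes large individual errors more strongly than MAE, so the RMSE is never smaller than the MAE on the same predictions.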

## How to cite

When using this dataset, please make sure to cite the following two papers:

- L. C. Blum, J.-L. Reymond, 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13, J. Am. Chem. Soc., 131:8732, 2009.
- M. Rupp, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld, Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning, Physical Review Letters, 108(5):058301, 2012.

## Related papers

- A. K. Rappé, C. J. Casewit, K. S. Colwell, W. A. Goddard III, W. M. Skiff, UFF, a Full Periodic Table Force Field for Molecular Mechanics and Molecular Dynamics Simulations, J. Am. Chem. Soc., 114:10024, 1992.
- R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C. Steinbeck, J. K. Wegner and E. Willighagen, The Blue Obelisk - Interoperability in Chemical Informatics, J. Chem. Inf. Model., 46:991, 2006.
- G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, O. A. von Lilienfeld, K.-R. Müller, Learning Invariant Representations of Molecules for Atomization Energy Prediction, Advances in Neural Information Processing Systems (NIPS), 2012.

# QM7b Dataset

## Description

This dataset is an extension of the QM7 dataset for multitask learning where 13 additional properties (e.g. polarizability, HOMO and LUMO eigenvalues, excitation energies) have to be predicted at different levels of theory (ZINDO, SCS, PBE0, GW). Additional molecules comprising chlorine atoms are also included, totalling 7211 molecules.

The dataset is composed of two multidimensional arrays `X (7211 x 23 x 23)` and `T (7211 x 14)` representing the inputs (Coulomb matrices) and the labels (molecular properties), and one array `names` of size 14 listing the names of the different properties.

## Download

- Matlab version: `data/qm7b.mat` (16.1 MB)

## Benchmark results

We report the test error with 5000 training samples drawn randomly from the dataset and the remaining 2211 samples used for testing. For conciseness, we report only MAE values for Polarizability-PBE0, HOMO-GW and IP-ZINDO as a vector of numbers (measured in Å^{3} and eV).

Reference | Method | MAE (Polarizability-PBE0, HOMO-GW, IP-ZINDO)
---|---|---
Montavon et al., New J. Phys. **15** 095003, 2013 | Multitask MLP with binarized random Coulomb matrices and binarized outputs | 0.11, 0.16, 0.17
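A random 5000/2211 partition of the 7211 molecules, as used for the benchmark above, could be drawn as in the following sketch. The seed and the function name `random_split` are assumptions for illustration; the split used in the cited paper is not reproduced by this code.

```python
import random


def random_split(n_samples=7211, n_train=5000, seed=0):
    """Shuffle sample indices and return (train, test) index lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]
```

Fixing the seed keeps the partition stable across runs, which matters when errors from different models are compared.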

## How to cite

When using this dataset, please make sure to cite the following two papers:

- L. C. Blum, J.-L. Reymond, 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13, J. Am. Chem. Soc., 131:8732, 2009.
- G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, O. A. von Lilienfeld, Machine Learning of Molecular Electronic Properties in Chemical Compound Space, New J. Phys. **15** 095003, 2013.

## Related papers

- A. K. Rappé, C. J. Casewit, K. S. Colwell, W. A. Goddard III, W. M. Skiff, UFF, a Full Periodic Table Force Field for Molecular Mechanics and Molecular Dynamics Simulations, J. Am. Chem. Soc., 114:10024, 1992.
- R. Guha, M. T. Howard, G. R. Hutchison, P. Murray-Rust, H. Rzepa, C. Steinbeck, J. K. Wegner and E. Willighagen, The Blue Obelisk - Interoperability in Chemical Informatics, J. Chem. Inf. Model., 46:991, 2006.

# QM9 Dataset

## Abstract

Computational de novo design of new drugs and materials requires rigorous and unbiased exploration of chemical compound space. However, large uncharted territories persist due to its size scaling combinatorially with molecular size. We report computed geometric, energetic, electronic, and thermodynamic properties for 134k stable small organic molecules made up of CHONF. These molecules correspond to the subset of all 133,885 species with up to nine heavy atoms (CONF) out of the GDB-17 chemical universe of 166 billion organic molecules. We report geometries minimal in energy, corresponding harmonic frequencies, dipole moments, polarizabilities, along with energies, enthalpies, and free energies of atomization. All properties were calculated at the B3LYP/6-31G(2df,p) level of quantum chemistry. Furthermore, for the predominant stoichiometry, C_{7}H_{10}O_{2}, there are 6,095 constitutional isomers among the 134k molecules. We report energies, enthalpies, and free energies of atomization at the more accurate G4MP2 level of theory for all of them. As such, this data set provides quantum chemical properties for a relevant, consistent, and comprehensive chemical space of small organic molecules. This database may serve the benchmarking of existing methods, development of new methods, such as hybrid quantum mechanics/machine learning, and systematic identification of structure-property relationships.

## Download

Available via figshare.

## How to cite

When using this dataset, please make sure to cite the following two papers:

- L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. **52**, 2864–2875, 2012.
- R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.

# QM8 Dataset

## Abstract

Due to its favorable computational efficiency, time-dependent (TD) density functional theory(DFT) enables the prediction of electronic spectra in a high-throughput manner across chemical space. Its predictions, however, can be quite inaccurate. We resolve this issue with machine learning models trained on deviations of reference second-order approximate coupled-cluster (CC2) singles and doubles spectra from TDDFT counterparts, or even from DFT gap. We applied this approach to low-lying singlet-singlet vertical electronic spectra of over 20 000 synthetically feasible small organic molecules with up to eight CONF atoms. The prediction errors decay monotonously as a function of training set size. For a training set of 10 000 molecules, CC2 excitation energies can be reproduced to within ±0.1 eV for the remaining molecules. Analysis of our spectral database via chromophore counting suggests that even higher accuracies can be achieved. Based on the evidence collected, we discuss open challenges associated with data-driven modeling of high-lying spectra and transition intensities.

## Download

Available via EPAPS (FTP).

## How to cite

When using this dataset, please make sure to cite the following two papers:

- L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. **52**, 2864–2875, 2012.
- R. Ramakrishnan, M. Hartmann, E. Tapavicza, O. A. von Lilienfeld, Electronic Spectra from TDDFT and Machine Learning in Chemical Space, J. Chem. Phys. 143 084111, 2015.

# MD Trajectories of small molecules

## Description

The molecular dynamics (MD) datasets in this package range in size from 150k to nearly 1M conformational geometries. All trajectories are calculated at a temperature of 500 K and a resolution of 0.5 fs. The molecules have different sizes and the molecular PESs exhibit different levels of complexity. The energy range across all data points within a set spans from 20 to 48 kcal/mol. Force components range from 266 to 570 kcal/mol/Å. The total energy and force labels for each dataset were computed using the PBE+vdW-TS electronic structure method. All geometries are in Angstrom; energies and forces are given in kcal/mol and kcal/mol/Å, respectively.

The data is provided in xyz format with one file per conformation. The energy and force labels for each geometry are included in the comment line. This way the files remain valid xyz files. Positions are given in Angstroms, energies are given in kcal/mol.
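The following minimal sketch reads one such xyz file from a string. The exact layout of the energy and force labels inside the comment line is not specified here, so the sketch returns the comment line unparsed; the function name `read_xyz` is hypothetical.

```python
def read_xyz(text):
    """Parse a single-geometry xyz file given as a string.

    Returns (comment, symbols, positions), where positions are
    (x, y, z) tuples in the units used by the file.
    """
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])   # first line: number of atoms
    comment = lines[1]        # second line: comment (holds the labels)
    symbols, positions = [], []
    for line in lines[2:2 + n_atoms]:
        parts = line.split()
        symbols.append(parts[0])
        positions.append(tuple(float(v) for v in parts[1:4]))
    return comment, symbols, positions
```

To read a file from disk, pass the result of `open(path).read()`; parsing of the labels can then be added once the comment-line layout of the downloaded files has been inspected.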

## Benchmarks

### Energies

Mean absolute errors (in meV) for energy prediction based on training sets of size N with different methods for each dataset.

Method | Benzene | Uracil | Naphthalene | Aspirin | Salicylic acid | Malonaldehyde | Ethanol | Toluene
---|---|---|---|---|---|---|---|---
DTNN (N=50k) | 1.7 | n/a | n/a | n/a | 21.7 | 8.2 | n/a | 7.8
GDML (N=1k) | 3.0 | 4.8 | 5.2 | 11.7 | 5.2 | 6.9 | 6.5 | 5.2
GDML (N=50k) | 3.2 | 4.6 | 5.1 | 5.6 | 4.8 | 3.3 | 2.3 | 4.1
sGDML (N=1k) | 4.3 | 4.8 | 5.2 | 8.2 | 5.2 | 4.3 | 3.0 | 4.3
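The benchmark errors are reported in meV, whereas the dataset labels are given in kcal/mol. A small conversion sketch follows; it assumes the thermochemical calorie (4.184 J) and the Faraday constant 96485.33212 C/mol, and the function names are illustrative.

```python
FARADAY_C_PER_MOL = 96485.33212  # charge of one mole of electrons
JOULE_PER_KCAL = 4184.0          # thermochemical kilocalorie


def kcal_per_mol_to_mev(value):
    """Convert an energy from kcal/mol to meV (per molecule)."""
    return value * JOULE_PER_KCAL / FARADAY_C_PER_MOL * 1000.0


def mev_to_kcal_per_mol(value):
    """Convert an energy from meV (per molecule) to kcal/mol."""
    return value * FARADAY_C_PER_MOL / JOULE_PER_KCAL / 1000.0
```

With these constants, 1 kcal/mol corresponds to about 43.4 meV, so an error of 3.0 meV is roughly 0.07 kcal/mol.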

### Forces

Mean absolute errors (in meV/Å) for each force component based on training sets of size N for each dataset.

Method | Benzene | Uracil | Naphthalene | Aspirin | Salicylic acid | Malonaldehyde | Ethanol | Toluene
---|---|---|---|---|---|---|---|---
GDML (N=1k) | 10.0 | 10.4 | 10.0 | 42.9 | 12.1 | 34.7 | 34.3 | 18.6
GDML (N=50k) | 10.2 | 1.2 | 1.3 | 1.0 | 1.5 | 3.3 | 4.0 | 2.1
sGDML (N=1k) | 2.6 | 10.4 | 4.8 | 29.5 | 12.1 | 17.8 | 14.3 | 6.1

## Download

- Available here: www.quantum-machine.org/gdml

## How to cite

When using this dataset, please make sure to cite the following papers:

- S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, K.-R. Müller, Machine Learning of Accurate Energy-Conserving Molecular Force Fields, Sci. Adv. **3(5)**, e1603015, 2017.
- K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller, A. Tkatchenko, Quantum-Chemical Insights from Deep Tensor Neural Networks, Nat. Commun. **8**, 13890, 2017.
- S. Chmiela, H. E. Sauceda, K.-R. Müller, A. Tkatchenko, Towards Exact Molecular Dynamics Simulations with Machine-Learned Force Fields, Nat. Commun. **9**, 3887, 2018.

# MD Trajectories of C7O2H10

## Description

This data set consists of molecular dynamics trajectories of 113 randomly selected C7O2H10 isomers calculated at a temperature of 500 K and a resolution of 1 fs using density functional theory with the PBE exchange-correlation potential.

C7O2H10 forms the largest set of isomers in QM9. Identifiers used in this data set agree with those used in the QM9 isomer subset.

Each trajectory is stored in an xyz file named `id.xyz` with corresponding energies in `id.energy.dat`. Additionally, consistent energy calculations of all isomers in equilibrium (according to QM9) are provided in `c7o2h10_equilibrium.dat`.

## Benchmark

Mean abs. errors for energy prediction using the equilibrium energies as well as 50% of each trajectory as reference calculations for training (in eV).

Method | MAE [eV]
---|---
Deep Tensor Neural Network | 0.07

## Download

- Available here: `data/c7o2h10_md.tar.gz` (146.8 MB)

# Datasets including densities

## Description

These datasets contain not only molecular geometries and energies but also valence densities.

For each dataset, the energies are given in energies.txt (in kcal/mol, one line per molecular geometry). The densities are given in densities.txt (in Fourier basis coefficients, one line per molecular geometry). The structures are given in structures.xyz (with positions in Bohr).

For details on how these datasets were generated please refer to the publication.

## How to cite

When using any of these datasets, please make sure to cite the following paper:

- F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, K.-R. Müller, By-passing the Kohn-Sham equations with machine learning.

## Download

- H2: `h2.tar.gz` (1.1 GB)
- Water: `water.tar.gz` (2.1 GB)
- Benzene 300K: `benzene_300K.tar.gz` (2.5 GB)
- Benzene 300K + 350K: `benzene_300K-350K.tar.gz` (2.5 GB)
- Benzene 300K + 400K: `benzene_300K-400K.tar.gz` (2.5 GB)
- Benzene 300K test: `benzene_300K-test.tar.gz` (250 MB)
- Ethane 300K: `ethane_300K.tar.gz` (2.5 GB)
- Ethane 300K + 350K: `ethane_300K-350K.tar.gz` (2.5 GB)
- Ethane 300K + 400K: `ethane_300K-400K.tar.gz` (2.5 GB)
- Ethane 300K test: `ethane_300K-test.tar.gz` (252 MB)
- Malonaldehyde 300K: `malonaldehyde_300K.tar.gz` (2.5 GB)
- Malonaldehyde 300K test: `malonaldehyde_300K-test.tar.gz` (501 MB)

# ISO17 - MD Trajectories of C7O2H10 with total energies and atomic forces

## Description

The molecules were randomly drawn from the largest set of isomers in the QM9 dataset [1], which consists of molecules with a fixed composition of atoms (C7O2H10) arranged in different chemically valid structures. It is an extension of the isomer MD data used in [2].

The database was generated from molecular dynamics simulations using the Fritz-Haber Institute ab initio simulation package (FHI-aims)[3]. The simulations were carried out using the standard quantum chemistry computational method density functional theory (DFT) in the generalized gradient approximation (GGA) with the Perdew-Burke-Ernzerhof (PBE) functional[4] and the Tkatchenko-Scheffler (TS) van der Waals correction method [5].

The database consists of 129 molecules, each containing 5,000 conformational geometries, energies and forces with a resolution of 1 femtosecond in the molecular dynamics trajectories.

## Format

The data is stored in ASE sqlite format with the total energy in eV under the key `total_energy` and the atomic forces in eV/Å under the key `atomic_forces`.

The following Python snippet iterates over the first 10 entries of the dataset located at `path_to_db`:

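```
from ase.db import connect

# path_to_db must point to one of the database files of the dataset
with connect(path_to_db) as conn:
    for row in conn.select(limit=10):
        print(row.toatoms())              # ASE Atoms object of the geometry
        print(row['total_energy'])        # total energy in eV
        print(row.data['atomic_forces'])  # atomic forces in eV/Ang
```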

## Partitions

The data is partitioned as used in the SchNet paper [6]:

- reference.db - 80% of steps of 80% of MD trajectories
- reference_eq.db - equilibrium conformations of those molecules
- test_within.db - remaining 20% unseen steps of reference trajectories
- test_other.db - remaining 20% unseen MD trajectories
- test_eq.db - equilibrium conformations of test trajectories

In the paper, we split the reference data (reference.db) into 400k training examples and 4k validation examples. The indices are given in the files train_ids.txt and validation_idx.txt, respectively.

## Benchmarks

Model | Energy (within) [eV] | Force (within) [eV/Å] | Energy (other) [eV] | Force (other) [eV/Å]
---|---|---|---|---
SchNet [6] | 0.016 | 0.043 | 0.104 | 0.095

## Download

- Available here:
`data/iso17.tar.gz`

(799.7 MB)

## How to cite

When using this dataset, please make sure to cite the following papers:

- K.T. Schütt, P.-J. Kindermans, H.E. Sauceda, S. Chmiela, A. Tkatchenko, K.-R. Müller. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems. 2017.
- K.T. Schütt, F. Arbabzadah, S. Chmiela, K.R. Müller, A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, **8**, 13890. 2017.
- R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, **1**, 2014.

## References

[1] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, **1**, 2014.

[2] Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R., & Tkatchenko, A. (2017). Quantum-chemical insights from deep tensor neural networks. Nature Communications, **8**, 13890.

[3] Blum, V.; Gehrke, R.; Hanke, F.; Havu, P.; Havu, V.; Ren, X.; Reuter, K.; Scheffler, M. Ab Initio Molecular Simulations with Numeric Atom-Centered Orbitals. Comput. Phys. Commun. 2009, **180** (11), 2175–2196.

[4] Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, **77** (18), 3865–3868.

[5] Tkatchenko, A.; Scheffler, M. Accurate Molecular Van Der Waals Interactions from Ground-State Electron Density and Free-Atom Reference Data. Phys. Rev. Lett. 2009, **102** (7), 73005.

[6] Schütt, K. T., Kindermans, P. J., Sauceda, H. E., Chmiela, S., Tkatchenko, A., & Müller, K. R. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in Neural Information Processing Systems (accepted). 2017.

# SN2 reactions - Chemical reactions of methyl halides with halide anions

## Description

This dataset probes chemical reactions of methyl halides with halide anions, i.e. X- + CH3Y -> CH3X + Y-, and contains structures for all possible combinations of X,Y = F, Cl, Br, I. The dataset also includes various structures for several smaller molecules that can be formed in fragmentation reactions, such as CH3X, HX, CHX or CH2X- as well as geometries for H2, CH2, CH3+ and XY interhalogen compounds. In total, the dataset provides reference energies, forces, and dipole moments for 452709 structures calculated at the DSD-BLYP-D3(BJ)/def2-TZVP level of theory [1-4] using the ORCA 4.0.1 code [5,6].

## References

[1] Grimme, S.; Antony, J.; Ehrlich, S. and Krieg, H. J. Chem. Phys. **132**, 154104 (2010).

[2] Kozuch, S.; Gruzman, D. and Martin, J. M. J. Phys. Chem. C **114**, 20801-20808 (2010).

[3] Grimme, S.; Ehrlich, S. and Goerigk, L. J. Comput. Chem. **32**, 1456-1465 (2011).

[4] Weigend, F. and Ahlrichs, R. Phys. Chem. Chem. Phys. **7**, 3297-3305 (2005).

[5] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. **2**, 73-78 (2012).

[6] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. **8**, e1327 (2018).

## How to cite

When using this dataset, please cite the following paper: Unke, O. T. and Meuwly, M. "PhysNet: A Neural Network for Predicting Energies, Forces, Dipole Moments and Partial Charges" arxiv:1902.08408 (2019).

and the digital object identifier (DOI): Unke, O.T. and Meuwly, M. (2019). SN2 reactions dataset. Zenodo. http://doi.org/10.5281/zenodo.2605341.

## Format

The dataset is stored as a Python dictionary in a compressed NumPy binary file (.npz). The dictionary contains seven NumPy arrays:

- `R` (num_data, max_atoms, 3): Cartesian coordinates of nuclei (in Angstrom [A])
- `Q` (num_data,): Total charge (in elementary charges [e])
- `D` (num_data, 3): Dipole moment vector with respect to the origin (in elementary charges times Angstrom [eA])
- `E` (num_data,): Potential energy with respect to free atoms (in electronvolt [eV])
- `F` (num_data, max_atoms, 3): Forces acting on the nuclei (in electronvolt per Angstrom [eV/A])
- `Z` (num_data, max_atoms): Nuclear charges/atomic numbers of nuclei
- `N` (num_data,): Number of atoms in each structure (structures consisting of fewer than max_atoms entries are zero-padded)

Please note that the potential energy is given with respect to free atoms (i.e. total atomization). The following constants were subtracted from the original values for each occurrence of the corresponding elements:

H : -13.579407869766147 eV

C : -1028.9362774711024 eV

F : -2715.578463075019 eV

Cl: -12518.663203367176 eV

Br: -70031.09203874589 eV

I : -8096.587166328217 eV

In order to recover the original values, simply add the constants back.
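Adding the constants back can be sketched as follows for a single structure. The per-element values are copied from the list above and keyed by atomic number; the names `ATOM_ENERGIES_EV` and `total_energy` are hypothetical, and treating zero entries in `Z` as padding is an assumption based on the zero-padding described for this format.

```python
# Per-atom reference energies in eV, keyed by atomic number.
ATOM_ENERGIES_EV = {
    1: -13.579407869766147,    # H
    6: -1028.9362774711024,    # C
    9: -2715.578463075019,     # F
    17: -12518.663203367176,   # Cl
    35: -70031.09203874589,    # Br
    53: -8096.587166328217,    # I
}


def total_energy(energy, atomic_numbers):
    """Recover the original total energy (eV) of one structure.

    energy: entry of E for the structure (relative to free atoms).
    atomic_numbers: entry of Z for the structure; zeros are skipped
    because they are assumed to be padding.
    """
    reference = sum(
        ATOM_ENERGIES_EV[int(z)] for z in atomic_numbers if int(z) != 0
    )
    return energy + reference
```

For example, `total_energy(data["E"][0], data["Z"][0])` would return the original value for the first structure once the file has been loaded.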

To read the dataset, load the dictionary with Python (with NumPy imported as `np`):

`data = np.load("sn2_reactions.npz")`

and access individual entries with the appropriate dictionary key, e.g. "Z" for the nuclear charges:

`nuclear_charges = data["Z"]`

See also "read_data.py" for a more comprehensive example.

## Download

The dataset is available here: https://zenodo.org/record/2605341

# Solvated protein fragments

The solvated protein fragments dataset probes many-body intermolecular interactions between "protein fragments" and water molecules, which are important for the description of many biologically relevant condensed phase systems. It contains structures for all possible "amons" [1] (hydrogen-saturated covalently bonded fragments) of up to eight heavy atoms (C, N, O, S) that can be derived from chemical graphs of proteins containing the 20 natural amino acids connected via peptide bonds or disulfide bridges. For amino acids that can occur in different charge states due to (de-)protonation (i.e. carboxylic acids that can be negatively charged or amines that can be positively charged), all possible structures with up to a total charge of ±2e are included. In total, the dataset provides reference energies, forces, and dipole moments for 2731180 structures calculated at the revPBE-D3(BJ)/def2-TZVP level of theory [2-5] using the ORCA 4.0.1 code [6,7].

For more details, see https://arxiv.org/abs/1902.08408.

[1] Huang, B. and von Lilienfeld, O. A. arXiv:1707.04146 (2017).

[2] Grimme, S.; Antony, J.; Ehrlich, S. and Krieg, H. J. Chem. Phys. 132, 154104 (2010).

[3] Grimme, S.; Ehrlich, S. and Goerigk, L. J. Comput. Chem. 32, 1456-1465 (2011).

[4] Weigend, F. and Ahlrichs, R. Phys. Chem. Chem. Phys. 7, 3297-3305 (2005).

[5] Zhang, Y. and Yang, W. Phys. Rev. Lett. 80, 890 (1998).

[6] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2, 73-78 (2012).

[7] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. 8, e1327 (2018).

When using this dataset, please cite the following paper: Unke, O. T. and Meuwly, M. "PhysNet: A Neural Network for Predicting Energies, Forces, Dipole Moments and Partial Charges" arxiv:1902.08408 (2019).

and the digital object identifier (DOI): Unke, O.T. and Meuwly, M. (2019). Solvated protein fragments dataset. Zenodo. http://doi.org/10.5281/zenodo.2605372.

## Format

The dataset is stored as a Python dictionary in a compressed NumPy binary file (.npz). The dictionary contains seven NumPy arrays:

- `R` (num_data, max_atoms, 3): Cartesian coordinates of nuclei (in Angstrom [A])
- `Q` (num_data,): Total charge (in elementary charges [e])
- `D` (num_data, 3): Dipole moment vector with respect to the origin (in elementary charges times Angstrom [eA])
- `E` (num_data,): Potential energy with respect to free atoms (in electronvolt [eV])
- `F` (num_data, max_atoms, 3): Forces acting on the nuclei (in electronvolt per Angstrom [eV/A])
- `Z` (num_data, max_atoms): Nuclear charges/atomic numbers of nuclei
- `N` (num_data,): Number of atoms in each structure (structures consisting of fewer than max_atoms entries are zero-padded)

Please note that the potential energy is given with respect to free atoms (i.e. total atomization). The following constants were subtracted from the original values for each occurrence of the corresponding elements:

H: -13.717939590030356 eV

C: -1029.831662730747 eV

N: -1485.40806126101 eV

O: -2042.7920344362644 eV

S: -10831.264715514206 eV

In order to recover the original values, simply add the constants back.

To read the dataset, load the dictionary with Python (with NumPy imported as `np`):

`data = np.load("solvated_protein_fragments.npz")`

and access individual entries with the appropriate dictionary key, e.g. "Z" for the nuclear charges:

`nuclear_charges = data["Z"]`

See also "read_data.py" for a more comprehensive example.
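Because structures with fewer than max_atoms atoms are zero-padded, per-atom arrays usually need to be trimmed with `N` before use. The sketch below shows one way to do this for a single structure; the helper name `unpad` is hypothetical and is not part of "read_data.py".

```python
def unpad(R, Z, F, N, index):
    """Return positions, atomic numbers and forces of one structure
    with the zero-padded entries removed.

    R, Z, F, N: the arrays of the dataset (NumPy arrays or nested lists).
    index: position of the structure along the first axis.
    """
    n_atoms = int(N[index])  # number of real atoms in this structure
    return R[index][:n_atoms], Z[index][:n_atoms], F[index][:n_atoms]
```

After loading the file, `unpad(data["R"], data["Z"], data["F"], data["N"], 0)` would give the unpadded arrays of the first structure.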

## Download

The dataset is available here: https://zenodo.org/record/2605372

# Molecules generated with G-SchNet

## Description

Molecules generated with the G-SchNet architecture trained on 50k randomly selected structures from QM9. Both databases only contain generated molecules which cannot be found in QM9. They are relaxed using the B3LYP functional with the 6-31G(2d,p) basis set, as implemented in the ORCA quantum chemistry package. We exclude molecules where the relaxation did not converge or where the bonding pattern changed during relaxation. We report the electronic energy, the dipole moment, the energies of the HOMO and LUMO orbitals, as well as the HOMO-LUMO gap.

Generated: 9074 unique molecules that pass our valency check and do not match any QM9 molecule.

Gap: 3619 unique molecules with a HOMO-LUMO gap of 5 eV or smaller that pass our valency check and do not match any QM9 molecule. Before generation, we fine-tuned the pre-trained G-SchNet model with ~3k molecules from QM9 that have a HOMO-LUMO gap of 4.5 eV or smaller. For details on the generative model and how these datasets were generated please refer to the publication.

## Format

The datasets are stored in a binarized ASE sqlite database. A utility class for loading the datasets can be found in SchNetPack (schnetpack.data.AtomsData). Information on the names of the properties, units and the reference energies of the free atoms, as well as further details on the settings used for the reference computations is provided in the database metadata. The following example code can be used to load the gap and energy data of the molecule with the index idx from a database:

`from schnetpack.data import AtomsData`

`database = AtomsData(path_to_db, required_properties=['gap', 'energy'])`

`atoms, properties = database.get_properties(idx)`

`atoms` is an ASE `Atoms` object containing the positions and atom types, while `properties` is a dictionary containing all requested properties.

## Notes on the reference method

In order to be consistent with the original QM9 computations, the Gaussian parametrization of B3LYP was used. Because d-orbitals are handled differently in the Gaussian and ORCA codes, differences in the total electronic energies can still arise between our datasets and the original QM9 data. It is therefore strongly recommended to subtract the energies of the free atoms before comparison. These are provided in the metadata of the datasets.

## Download

## How to cite

N. Gebauer, M. Gastegger, and K. Schütt. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7564–7576. Curran Associates, Inc., 2019.

# QMspin database

This data set contains a sample of approx. 5k singlet state and approx. 8k triplet state carbene structures drawn from a systematically constructed large chemical space of approx. 300k carbenes. The initial carbene structures are obtained from a randomly selected sample of QM9 closed shell molecules by a double hydrogen abstraction at a saturated carbon. Starting from the QM9 geometry, the carbene triplet state structure is obtained by a restricted open shell B3LYP/def2-TZVP geometry optimization. It is verified that no bond breaking occurs during the geometry optimization. A state-averaged (SA) CASSCF(2e,2o)/cc-pVDZ-F12 single point calculation on the optimized structure is used to verify that the active space orbitals are well-localized on the carbene center and the singlet-triplet vertical spin gap is computed at MRCISD+Q-F12/cc-pVDZ-F12 level of theory using the SA-CASSCF(2e,2o) wave function as reference. The singlet state structures are optimized at CASSCF(2e,2o)/cc-pVDZ-F12 level of theory starting from the triplet state optimized carbene structure. Verification of the carbene character and computation of the vertical spin gap for the singlet state structures is done in the same way as for the triplet state structure.

## Usage of the "QMspin database Part 1"

Details on the construction of the "QMspin database Part 1" and its scientific discussion are published in: "Large yet bounded: Spin gap ranges in carbenes", Max Schwilk, Diana N. Tahchieva, O. Anatole von Lilienfeld, https://arxiv.org/abs/2004.10600. We would greatly appreciate it if this publication is acknowledged when making use of the QMspin database. Once it is published in a peer-reviewed scientific journal, we would prefer that the journal article be cited instead.

## Nomenclature of the QMspin structures:

The structures have the same primary index as their QM9 database parent molecule. See "Quantum chemistry structures and properties of 134 kilo molecules" Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, O. Anatole von Lilienfeld, Sci. Data 1, 140022 (2014) for details on the QM9 data base. A secondary index enumerates the carbenes derived from a certain parent QM9 molecule. Note that there might be "gaps" in this index as not all possible carbenes for a given QM9 molecule have been retained in the data base. A third index indicates the optimized spin state ("t" or "s"). There are 7958 triplet state structures and 5022 singlet state structures. Among all these, for 3775 molecules both spin state structures are available, for which in consequence adiabatic spin gaps are available. For 269 singlet state structures and 6255 triplet state structures harmonic frequencies have been computed, all of which are real-valued, i. e. describe a local minimum on the PES. The triplet state structures are optimized with open shell restricted B3LYP/def2-TZVP using the QM9 molecule as initial geometries. The singlet state structures are optimized with CASSCF(2e,2o)/vdz-f12 using the triplet state structures as initial geometries. The single point spin gap energies are computed with MRCISD+Q-F12/vdz-f12 using state-averaged CASSCF(2e,2o) as a reference wave function. For all structures it has been verified that the two non-bonding active space orbitals are well localized on the carbene center. For details see "Large yet bounded: Spin gap ranges in carbenes" Max Schwilk, Diana N. Tahchieva, O. Anatole von Lilienfeld https://arxiv.org/abs/2004.10600.

## Structure of the QMspin database:

The database is provided as gzipped tarball. It contains the following folders and files:

`qmspin_data_overview.csv`

:

`MOL`

: Molecule index`CVS`

: CASSCF vertical spin gap of the singlet state structure in kcal/mol`MVS`

: MRCI vertical spin gap of the singlet state structure in kcal/mol`CVT`

: CASSCF vertical spin gap of the triplet state structure in kcal/mol`MVT`

: MRCI vertical spin gap of the triplet state structure in kcal/mol`CA`

: CASSCF adiabatic spin gap in kcal/mol`MA`

: MRCI adiabatic spin gap in kcal/mol`FS`

: "yes"/"no" if singlet state structure harmonic frequencies have been computed, the wave number in cm^{-1} of the lowest frequency is given if it is below 50 (the latter implies "yes")`FT`

: "yes"/"no" if triplet state structure harmonic frequencies have been computed, the wave number in cm^{-1} of the lowest frequency is given if it is below 50 (the latter implies "yes")

`outputs_singlet_sp`

: Raw outputs of the singlet state structure single point spin gap calculations. A few of the outputs also contain the preceding CASSCF geometry optimization.

`outputs_triplet_sp`

: Raw outputs of the triplet state structure single point spin gap calculations.

`outputs_singlet_freq`

: Raw outputs of the singlet state frequency calculations.

`outputs_triplet_freq`

: Raw outputs of the triplet state frequency calculations.

`geometries_singlet`

: xyz files of the singlet state optimized structures with "," as delimiter. The comment line contains the computed absolute energies in Hartree of the structure at the CASSCF and MRCI levels of theory, along with the basis sets used.

`geometries_triplet`

: Analogous to `geometries_singlet`.

`frequencies_singlet`

: xyz-style files of the singlet state optimized structures that contain the Cartesian coordinates in the first three columns, followed by additional columns with the vibrational modes in Angstrom (i.e. not weighted by the square root of the atomic mass), given in the same Cartesian coordinate basis. Hence there are three columns per normal mode. The column delimiter is ",". Frequencies and intensities (as computed from the nuclear transition dipole moment of the fundamental vibrational excitation v=0 to v=1) of the normal modes are given with units specified in the file, ordered by increasing wave number. Modes with wave numbers below 50 cm^{-1} are not given. Further details on the frequency calculations, including dipole moments and first-order derivatives of the dipole moments, are given in the corresponding output files in the folder `outputs_singlet_freq`.

`frequencies_triplet`

: Analogous to `frequencies_singlet`.
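As a rough illustration of how the overview table and the comma-delimited geometry files might be read, the sketch below uses only the Python standard library. The column names `FS` and `FT` come from the description above. Everything else is an assumption to be checked against the actual files: the helper names (`parse_freq_flag`, `read_overview`, `parse_xyz`), a comma-separated CSV with a header row carrying exactly these column names, and an xyz layout of atom count, comment line, then `symbol,x,y,z` per atom.

```python
import csv
import io


def parse_freq_flag(value):
    """Interpret an FS/FT entry: "yes", "no", or a wave number below 50 cm^-1.

    Returns (computed, lowest_wavenumber); the wave number is None unless given.
    """
    text = value.strip().lower()
    if text == "yes":
        return True, None
    if text == "no":
        return False, None
    # A numeric entry implies that frequencies were computed.
    return True, float(text)


def read_overview(handle):
    """Read the overview table (assumed: comma-separated with a header row)."""
    rows = []
    for row in csv.DictReader(handle):
        row["FS"] = parse_freq_flag(row["FS"])
        row["FT"] = parse_freq_flag(row["FT"])
        rows.append(row)
    return rows


def parse_xyz(text):
    """Parse a comma-delimited xyz file (assumed layout, see the text above)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0].strip())
    comment = lines[1]  # holds the absolute energies in Hartree
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = [field.strip() for field in line.split(",")[:4]]
        atoms.append((symbol, float(x), float(y), float(z)))
    return comment, atoms


# Made-up sample content, only to show the call pattern.
sample_csv = "MOL,CVS,MVS,CVT,MVT,CA,MA,FS,FT\nexample,1.0,2.0,3.0,4.0,5.0,6.0,no,31.5\n"
rows = read_overview(io.StringIO(sample_csv))
comment, atoms = parse_xyz("2\ncomment\nC, 0.0, 0.0, 0.0\nH, 0.0, 0.0, 1.1\n")
```

Returning the `FS`/`FT` entries as a `(computed, lowest_wavenumber)` pair keeps the "yes"/"no" information and the optional wave number separate, which makes filtering for structures with computed frequencies straightforward.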

## Download

QMspin_Part1_wo_outputs.tar.gz

## How to cite

M. Schwilk, D. N. Tahchieva, O. A. von Lilienfeld, Large yet bounded: Spin gap ranges in carbenes, arXiv:2004.10600, 2020.

# Molecular Hamiltonians and overlap matrices

## Description

Datasets containing structures, energies, forces, Hamiltonians (Fock or Kohn-Sham matrices) and overlap matrices for water, ethanol, malondialdehyde and uracil used for training the SchNOrb model. The datasets use Hartree as units of energy and Angstrom as units of length. The overlap matrices are dimensionless. All calculations, with the exception of the schnorb_hamiltonian_ethanol_hf.tgz dataset, were carried out at the PBE/def2-SVP level of theory [1,2] using the ORCA quantum chemistry package (4.0.1.2) [3]. The schnorb_hamiltonian_ethanol_hf.tgz dataset was computed at the HF/def2-SVP level. For details on how the individual datasets were generated please refer to the publication.

## Format

The datasets are stored in a binarized ASE sqlite database provided in a gzip compressed tar archive. A utility class for loading the datasets can be found in SchNetPack (schnetpack.data.AtomsData). Information on the names of the properties, units and the reference energies of the free atoms, as well as further details on the settings used for the reference computations is provided in the database metadata.

The following example code can be used to load the Hamiltonian and overlap matrix data of the molecule with the index idx from a database:

`from schnetpack.data import AtomsData`

`database = AtomsData(path_to_db, required_properties=['hamiltonian', 'overlap'])`

`atoms, properties = database.get_properties(idx)`

`atoms` is an ASE `Atoms` object containing the positions and atom types, while `properties` is a dictionary containing all requested properties.
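One common use of a Hamiltonian and overlap matrix pair is to recover orbital energies by solving the generalized eigenvalue problem \(HC = SC\epsilon\). The following is a minimal NumPy sketch using Löwdin symmetric orthogonalization. The function name `orbital_energies` and the 2 x 2 toy matrices are illustrative only; they are not part of the datasets or of SchNetPack.

```python
import numpy as np


def orbital_energies(hamiltonian, overlap):
    """Solve H C = S C eps for symmetric H and symmetric positive definite S."""
    # Build S^(-1/2) from the eigendecomposition of the overlap matrix.
    s_vals, s_vecs = np.linalg.eigh(overlap)
    s_inv_sqrt = s_vecs @ np.diag(s_vals ** -0.5) @ s_vecs.T
    # Transform the Hamiltonian to an orthonormal basis and diagonalize it.
    h_ortho = s_inv_sqrt @ hamiltonian @ s_inv_sqrt
    eps, c_ortho = np.linalg.eigh(h_ortho)
    # Back-transform the coefficients to the original (non-orthogonal) basis.
    coefficients = s_inv_sqrt @ c_ortho
    return eps, coefficients


# Toy 2x2 example (made-up numbers in Hartree, not taken from any dataset).
H = np.array([[-1.0, -0.2], [-0.2, -1.0]])
S = np.array([[1.0, 0.25], [0.25, 1.0]])
eps, C = orbital_energies(H, S)
```

The returned energies are sorted in ascending order and the coefficients satisfy \(C^T S C = 1\). Depending on how the properties are returned by the loader, the matrices may first need to be converted to NumPy arrays.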

## Download

Water: schnorb_hamiltonian_water.tgz

Ethanol (HF): schnorb_hamiltonian_ethanol_hf.tgz

Ethanol (DFT): schnorb_hamiltonian_ethanol_dft.tgz

Malondialdehyde: schnorb_hamiltonian_malondialdehyde.tgz

Uracil: schnorb_hamiltonian_uracil.tgz

## How to cite

K. T. Schütt, M. Gastegger, A. Tkatchenko, K. R. Müller, R. J. Maurer, Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions, Nature Commun. 10(1), 1-10, 2019.

## References

[1] Perdew, J. P.; Burke, K.; Ernzerhof, M. Phys. Rev. Lett. 77 (18), 3865–3868 (1996).

[2] Weigend, F.; Ahlrichs, R. Phys. Chem. Chem. Phys. 7, 3297-3305 (2005).

[3] Neese, F. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2, 73-78 (2012).

# tmQM Dataset

## Description

The transition metal quantum mechanics dataset (tmQM) contains the geometries and properties of a chemical compound space comprising 86,665 mononuclear complexes extracted from the Cambridge Structural Database. tmQM includes Werner, bioinorganic and organometallic complexes based on a large variety of organic ligands and 30 transition metals (the 3d, 4d and 5d from groups 3 to 12). All complexes are closed-shell, with a formal charge in the range {+1, 0, -1}e. The tmQM dataset provides the Cartesian coordinates of all metal complexes optimized at the DFTB(GFN2-xTB) level, and their molecular size, stoichiometry, and metal node degree. The quantum properties were computed at the DFT(TPSSh-D3BJ/def2-SVP) level, and include the electronic and dispersion energies, HOMO and LUMO orbital energies, HOMO-LUMO gap, dipole moment, and natural charge of the metal center; DFTB(GFN2-xTB) polarizabilities are also provided. Pairwise representations showed the low correlation between these properties, providing nearly continuous maps with unusual regions of the chemical space; e.g. complexes combining low-energy HOMO orbitals with electron-rich metal centers. The tmQM dataset can be exploited in the data-driven discovery of new metal complexes, including predictive models based on machine learning. These models can have a strong impact on the fields in which transition metal chemistry plays a key role; e.g. catalysis, organic synthesis, and materials science.

## How to cite?

D. Balcells, B. B. Skjelstad (2020): The tmQM Dataset - Quantum Geometries and Properties of 86k Transition Metal Complexes. ChemRxiv. Preprint.

## Format

`tmQM_X.xyz`

contains the Cartesian coordinates of all metal complexes optimized at the DFTB(GFN2-xTB) level in the xyz format. The molecular size, CSD code, charge, spin, stoichiometry and metal node degree are also included.

`tmQM_y.csv`

contains the quantum properties, including the electronic and dispersion energies, dipole moment, natural charge of the metal center, HOMO-LUMO gap, HOMO energy, LUMO energy and polarizability, all computed at the DFT(TPSSh/def2-SVP) level of theory, except the polarizability, which is computed at the DFTB(GFN2-xTB) level. All properties are given in atomic units, except the dipole moment, which is given in Debye (D).
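Since the orbital energies are stored in atomic units, a unit conversion is often the first step when working with these values. The following is a minimal sketch, assuming the CODATA 2018 value of the Hartree in eV. The function name `homo_lumo_gap_ev` and the example orbital energies are hypothetical and not taken from the dataset.

```python
HARTREE_TO_EV = 27.211386245988  # CODATA 2018 conversion factor


def homo_lumo_gap_ev(homo_hartree, lumo_hartree):
    """Return the HOMO-LUMO gap in eV from orbital energies in Hartree."""
    return (lumo_hartree - homo_hartree) * HARTREE_TO_EV


# Made-up orbital energies in Hartree, only to show the call pattern.
gap = homo_lumo_gap_ev(-0.20, -0.05)
```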

## Download

# Densities dataset

## Description

These datasets contain molecular geometries, coupled-cluster energies, DFT energies, and valence electron densities based on the PBE functional. For molecules with geometries and energies taken from the MD17 dataset, only the valence electron densities and corresponding DFT energies are reported.

For each dataset, structures are given with positions in Bohr and the energies are given in kcal/mol (one line per molecular geometry). The densities are given in Fourier basis coefficients (one line per molecular geometry).
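When combining these data with datasets that use Angstrom and Hartree, the positions and energies need to be converted. The following is a minimal sketch, assuming CODATA 2018 conversion factors; the function names are hypothetical and the input values are made up.

```python
BOHR_TO_ANGSTROM = 0.529177210903      # CODATA 2018
KCAL_MOL_PER_HARTREE = 627.5094740631  # derived from CODATA 2018 constants


def positions_to_angstrom(positions_bohr):
    """Convert a list of [x, y, z] positions from Bohr to Angstrom."""
    return [[c * BOHR_TO_ANGSTROM for c in atom] for atom in positions_bohr]


def energy_to_hartree(energy_kcal_mol):
    """Convert an energy from kcal/mol to Hartree."""
    return energy_kcal_mol / KCAL_MOL_PER_HARTREE


# Made-up values, only to show the call pattern.
positions = positions_to_angstrom([[1.0, 0.0, 2.0]])
energy = energy_to_hartree(627.5094740631)
```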

For details on how these datasets were generated please refer to the publication.

## How to cite?

M. Bogojeski, L. Vogt-Maranto, M. E. Tuckerman, K.-R. Müller, K. Burke. Quantum chemical accuracy from density functional approximations via machine learning.