
# DESCRIPTION OF NOTEBOOKS

`Spectra_To_Model_Data.ipynb` reads in the XY files and converts them into train / test / validation datasets in .npy format, which can be used as input to the models.

`Train_Run_Models.ipynb` loads the .npy files and trains Random Forest models for every system. By default, it generates the same figures that were featured in the paper.




# DESCRIPTION OF DIRECTORIES

`Model data` contains .npy files holding the spectra (the model inputs, x), featurized as either pointwise or polynomial spectra, together with the corresponding labels (coordination number, mean nearest neighbor distance, and Bader charge).

`Spectra_to_Model_Data.ipynb` will convert spectral data to model data.

`Train_Run_Models` trains and runs the models and generates the result figures used in the publication.

`spectral_data` contains the `M_XY.json` files (where M is a metal). Each file holds dictionaries of the processed spectra. These dictionaries have a large number of keys, some of which are holdovers from things that we tried but did not ultimately use in the paper.

The keys are:

SPECTRA KEYS:
- `E`, `mu`, `E0`, `k`, `mu0`, and `chi` are all outputs from FEFF.
- `mu_norm` is the absorption normalized to 1.
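
To check which of the spectra keys a given entry actually carries, a short inspection script can help. The following is a minimal sketch, not code from the notebooks. It assumes that a `M_XY.json` file parses to either a list of per-spectrum dictionaries or a dictionary of them; the real top-level layout may differ. The helper names `load_entries` and `missing_spectra_keys` and the demo file are hypothetical.

```python
import json
import os
import tempfile

# Spectra keys documented in this README.
SPECTRA_KEYS = ["E", "mu", "E0", "k", "mu0", "chi", "mu_norm"]


def load_entries(path):
    """Return a list of per-spectrum dicts from a JSON file.

    Assumption: the file parses to a list of dicts, or to a dict whose
    values are dicts. The real M_XY.json layout may differ.
    """
    with open(path) as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return list(data.values())
    return list(data)


def missing_spectra_keys(entry):
    """List the documented spectra keys that an entry does not carry."""
    return [key for key in SPECTRA_KEYS if key not in entry]


# Demo on a tiny synthetic file (values are made up, not real FEFF output).
demo_entry = {"E": [0.0, 1.0], "mu": [0.1, 0.9], "mu_norm": [0.1, 1.0]}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_XY.json")
    with open(demo_path, "w") as handle:
        json.dump([demo_entry], handle)
    entries = load_entries(demo_path)

print(missing_spectra_keys(entries[0]))
```

To use this on the repository data, point `load_entries` at one of the files in `spectral_data` and adjust the layout handling if the file is organized differently.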

LABEL KEYS:
- `guessed_oxy` is the guessed oxidation number of the absorbing atom based on pymatgen's guess oxidation number functionality.

- `coord_vector` is a vector of coordination similarity computed using pymatgen and the CrystalNNFingerprint featurizer in matminer. This method was shown by Chen and Zheng to work very well as a coordination label in https://doi.org/10.1016/j.patter.2020.100013 . (We computed this after the preceding paper was published as a preprint for some internal testing, but did not use it within the manuscript we submitted.)
- `coordination` describes the coordination number.
- `one_hot_coord` is a length-three vector corresponding to four-, five-, or six-fold coordination.
- `nn_indexes` are the indexes within the structure that are nearest neighbors.
- `nn_species` are the species of atoms coordinating the absorbing atom.
- `nn_dists` are the distances of the nearest neighbor atoms.
- `avg_nn_dists` is the mean of `nn_dists`.
- `nn_min-max` is the difference between the furthest and closest of the nearest neighbor atoms. (This was not used as a label in the paper.)
- `mp_baders` are Bader charge values from the Materials Project.
- `oqmd_baders` are the same, from the Open Quantum Materials Database.
- `valid_bader` indicates whether the Bader charges present are to be used.
- `bader` is a single Bader charge value associated with the absorbing atom.
- `metadata` is metadata about the spectrum. A "feff" origin means it was computed by TRI. Otherwise, it cites that the spectrum came from the Materials Project, with its MP-id; 'scrape' means we used the Materials Project API.
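
Several of the label keys can be derived from one another. The following is a minimal sketch based only on the descriptions above, not the code used to build the JSON files. The function names are hypothetical, and the ordering of the one-hot vector (four, five, six) is an assumption.

```python
from statistics import mean


def one_hot_coordination(coordination):
    """Map a coordination number of 4, 5, or 6 to a length-three one-hot list.

    Assumption: the vector is ordered as (four, five, six).
    """
    classes = [4, 5, 6]
    if coordination not in classes:
        raise ValueError(f"unsupported coordination number: {coordination}")
    return [1 if coordination == c else 0 for c in classes]


def distance_labels(nn_dists):
    """Derive the mean and the spread of the nearest neighbor distances."""
    return {
        "avg_nn_dists": mean(nn_dists),
        "nn_min-max": max(nn_dists) - min(nn_dists),
    }


# Toy distances (made up, in arbitrary length units).
nn_dists = [2.0, 2.0, 2.5, 2.5]
labels = distance_labels(nn_dists)
print(one_hot_coordination(len(nn_dists)), labels)
```

Here the coordination number is taken as the number of nearest neighbors, which is itself an assumption about how `coordination` relates to `nn_dists`.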

DATA AUGMENTATION KEYS:
- `mu_stretch` and `mu_squeeze` are spectra which were expanded or contracted by 5% for data augmentation purposes.
- `mu_norm_stretch` and `mu_norm_squeeze` are the same as above but for a normalized mu.
- `mu_dilate` is the spectrum randomly stretched or squeezed by the factor stored in `dilate_factor`.
- `mu_p1`, `mu_p2`, `mu_m1`, and `mu_m2` are the spectra shifted by ±1 or ±2 eV (p for plus, m for minus).
- `mu_rand_shift` is the spectrum randomly shifted by the amount stored in `rand_shift`.
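
The shift and stretch augmentations can be illustrated by resampling a spectrum on its original energy grid. The following is a minimal sketch under stated assumptions, not the implementation used to produce the keys above: it uses linear interpolation via `numpy.interp`, stretches about the first grid point, and holds the end values constant outside the grid. The function names are hypothetical.

```python
import numpy as np


def shift_spectrum(energy, mu, delta_ev):
    """Shift a spectrum by delta_ev along the energy axis.

    The shifted curve satisfies mu_shifted(E) = mu(E - delta_ev) and is
    resampled on the original grid. np.interp holds the end values
    constant for points that fall outside the grid.
    """
    return np.interp(energy - delta_ev, energy, mu)


def dilate_spectrum(energy, mu, factor):
    """Stretch (factor > 1) or squeeze (factor < 1) a spectrum.

    Assumption: the dilation is anchored at the first grid point.
    """
    anchor = energy[0]
    source = anchor + (energy - anchor) / factor
    return np.interp(source, energy, mu)


# Toy spectrum: a linear ramp on a 1 eV grid (not real data).
energy = np.linspace(0.0, 10.0, 11)
mu = energy.copy()

mu_p1 = shift_spectrum(energy, mu, 1.0)         # analogous to `mu_p1`
mu_m2 = shift_spectrum(energy, mu, -2.0)        # analogous to `mu_m2`
mu_stretch = dilate_spectrum(energy, mu, 1.05)  # 5% stretch
mu_squeeze = dilate_spectrum(energy, mu, 0.95)  # 5% squeeze
```

A random variant in the spirit of `mu_dilate` or `mu_rand_shift` would draw `factor` or `delta_ev` from a random number generator and store the drawn value alongside the spectrum.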

