AI Training Data
What kind of data is used in AI training in drug discovery?
The success of AI in drug discovery hinges on the quality and diversity of its training data. AI models learn from a vast array of information, which can be broadly categorized into molecular data (the chemistry), biological data (the biology), and other key modalities that bridge the gap between the two.
How this data is fed into an AI model is critical. Common model architectures, and the representations they work with, include:
Graph Neural Networks (GNNs): Represent atoms as nodes and bonds as edges, excelling at structure-activity relationship prediction.
Transformers: Learn chemical and biological "syntax" by processing SMILES strings or protein sequences.
Diffusion Models: Generate novel 3D molecular structures by reversing a noise-adding process.
Multimodal Models: Integrate diverse data types (e.g., 2D structures, 3D conformations, and text) to build robust molecular representations.
All of this data is static: fixed structures, SMILES strings, 2D and 3D molecular graphs, and amino acid sequences.
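To make the "atoms as nodes, bonds as edges" idea concrete, here is a minimal sketch in plain Python. The class name MolGraph, its methods, and the element vocabulary are illustrative assumptions, not part of any particular library; real pipelines typically derive the graph from a SMILES string with a cheminformatics toolkit rather than writing it by hand.

```python
from dataclasses import dataclass


@dataclass
class MolGraph:
    """Hypothetical heavy-atom molecular graph: atoms are nodes, bonds are edges."""
    atoms: list[str]                    # element symbol per node
    bonds: list[tuple[int, int, int]]   # (atom i, atom j, bond order)

    def adjacency(self) -> dict[int, list[int]]:
        """Undirected neighbor lists, the structure a GNN passes messages over."""
        neighbors = {i: [] for i in range(len(self.atoms))}
        for i, j, _order in self.bonds:
            neighbors[i].append(j)
            neighbors[j].append(i)
        return neighbors

    def node_features(self, elements: list[str]) -> list[list[int]]:
        """One-hot encode each atom's element over a fixed vocabulary."""
        return [[int(atom == e) for e in elements] for atom in self.atoms]


# Ethanol (SMILES "CCO"), written by hand with hydrogens left implicit.
ethanol = MolGraph(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])
print(ethanol.adjacency())
print(ethanol.node_features(["C", "N", "O"]))
```

Note that every feature in this graph is static: nothing in it describes how the atoms move.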
What are the challenges with AI Training Data?
Despite the availability of data, significant hurdles remain:
Poor Quality & Inconsistency: Many public datasets use varied protocols and lack detailed metadata, hindering reproducibility.
Data Scarcity & Noise: High-quality experimental data is limited, and high-throughput screening (HTS) data can be noisy. The pharmaceutical industry has even been called data-rich but insight-poor.
Distribution Shift: Models trained on biased public data often fail on novel targets or chemotypes.
Negative Data Gap: Most public data lacks "true negatives," making it hard for models to learn what doesn't work.
How can Atomic Participation Factors make a difference?
Atomic Participation Factors are intrinsic properties of a molecule. An example of a sorted Atomic Participation Factor Spectrum for a small molecule is illustrated below. The circled atoms are those that drive the molecule’s dynamics as well as its biological function.
These spectra are obtained by processing Molecular Dynamics simulations. They therefore condense significant information about the motion of each individual atom, going beyond static 2D or 3D representations. Incorporating AP Spectra into AI training sets injects a new type of information with substantial dynamic content.
AP Spectra do more than add new, information-rich data to AI training: they can potentially help accelerate the training process itself.
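As a rough illustration of what adding a per-atom dynamic quantity to a training set could look like, the sketch below appends one scalar per atom to static node features and ranks atoms by that scalar. It does not show how Atomic Participation Factors are computed. The function names are hypothetical, the numeric values are made-up placeholders rather than real AP data, and sorting in descending order is an assumption.

```python
def sorted_spectrum(per_atom_values: list[float]) -> list[tuple[int, float]]:
    """Rank atoms by their per-atom value, largest first (ordering is an assumption)."""
    return sorted(enumerate(per_atom_values), key=lambda pair: pair[1], reverse=True)


def augment_node_features(
    static_features: list[list[float]],
    per_atom_values: list[float],
) -> list[list[float]]:
    """Append one dynamic scalar to each atom's static feature vector."""
    if len(static_features) != len(per_atom_values):
        raise ValueError("need exactly one dynamic value per atom")
    return [
        list(row) + [float(value)]
        for row, value in zip(static_features, per_atom_values)
    ]


# Static one-hot features for three heavy atoms over the vocabulary (C, O).
static = [[1, 0], [1, 0], [0, 1]]

# Placeholder per-atom values, invented for illustration only.
placeholder_values = [0.2, 0.5, 0.3]

augmented = augment_node_features(static, placeholder_values)
ranking = sorted_spectrum(placeholder_values)
print(augmented)
print(ranking)
```

The design choice here is deliberately simple: the dynamic value becomes one extra input column per atom, so an existing graph-based model can consume it without architectural changes.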
How can Atomic Participation Spectra Catalogs be obtained?
Atomic Participation Spectra are computed by OPTIMUS™, our flagship software product. BioDynLab can provide AP Catalogs for any set of small molecules, as well as Amino Acid Participation Spectra for peptides and proteins.