Deep learning-based molecular generation has extensive applications in many fields, particularly drug discovery. However, the majority of current deep generative models (DGMs) are ligand-based and do not consider chemical knowledge in the molecular generation process, often resulting in a relatively low success rate. We herein propose a structure-based molecular generative framework with chemical knowledge explicitly considered (named PocketFlow), which generates novel ligand molecules inside protein binding pockets. In various computational evaluations, PocketFlow showed state-of-the-art performance, with generated molecules being 100% chemically valid and highly drug-like. Ablation experiments demonstrate the critical role of chemical knowledge in ensuring the validity and drug-likeness of the generated molecules. We applied PocketFlow to two new target proteins that are related to epigenetic regulation, HAT1 and YTHDC1, and successfully obtained wet-lab validated bioactive compounds. The binding modes of the active compounds with the target proteins are close to those predicted by molecular docking and were further confirmed by the X-ray crystal structure. All the results suggest that PocketFlow is a useful deep generative model, capable of generating innovative bioactive molecules from scratch given a protein binding pocket.
conda env create -f environment.yml
Molecules can be generated by running the following command. The pocket PDB file and the model parameter file are required; the remaining parameters are optional:
python main_generate.py -pkt test_samples/test_pocket10/1bvr_C_rec_pocket10-surf.pdb --ckpt ckpt/ZINC-pretrained-255000.pt -n 100 -d cuda:0 --root_path gen_results --name 1bvr -at 1.0 -bt 1.0 --max_atom_num 35 -ft 0.5 -cm True --with_print True
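When several pockets need to be processed, it can be convenient to assemble the same command from Python. The following is a minimal sketch, not part of PocketFlow: the helper `build_generate_command` and the rule for deriving `--name` from the file name are assumptions made for illustration. It uses only flags that appear in the command above, and leaves all other options at their defaults.

```python
import shlex
from pathlib import Path


def build_generate_command(pocket_pdb, ckpt, num_gen=100, device="cuda:0",
                           root_path="gen_results"):
    """Return the argument list for one main_generate.py run (hypothetical helper)."""
    # Assumption: name the run after the first underscore-separated token
    # of the pocket file name, e.g. "1bvr_C_rec_pocket10-surf.pdb" -> "1bvr".
    name = Path(pocket_pdb).stem.split("_")[0]
    return [
        "python", "main_generate.py",
        "-pkt", str(pocket_pdb),
        "--ckpt", str(ckpt),
        "-n", str(num_gen),
        "-d", device,
        "--root_path", root_path,
        "--name", name,
    ]


cmd = build_generate_command(
    "test_samples/test_pocket10/1bvr_C_rec_pocket10-surf.pdb",
    "ckpt/ZINC-pretrained-255000.pt",
)
# Print the command so it can be inspected before anything is launched.
print(shlex.join(cmd))
```

To actually start generation, the list could be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a single string avoids shell-quoting problems with unusual file names.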
All parameters of generation:
usage: main_generate.py [-h] [-pkt POCKET] [--ckpt CKPT] [-n NUM_GEN] [--name NAME] [-d DEVICE] [-at ATOM_TEMPERATURE] [-bt BOND_TEMPERATURE] [--max_atom_num MAX_ATOM_NUM] [-ft FOCUS_THRESHOLD] [-cm CHOOSE_MAX]
[--min_dist_inter_mol MIN_DIST_INTER_MOL] [--bond_length_range BOND_LENGTH_RANGE] [-mdb MAX_DOUBLE_IN_6RING] [--with_print WITH_PRINT] [--root_path ROOT_PATH] [--readme README]
optional arguments:
-h, --help show this help message and exit
-pkt POCKET, --pocket POCKET
the pdb file of pocket in receptor
--ckpt CKPT the path of saved model
-n NUM_GEN, --num_gen NUM_GEN
the number of generateive molecule
--name NAME receptor name
-d DEVICE, --device DEVICE
cuda:x or cpu
-at ATOM_TEMPERATURE, --atom_temperature ATOM_TEMPERATURE
temperature for atom sampling
-bt BOND_TEMPERATURE, --bond_temperature BOND_TEMPERATURE
temperature for bond sampling
--max_atom_num MAX_ATOM_NUM
the max atom number for generation
-ft FOCUS_THRESHOLD, --focus_threshold FOCUS_THRESHOLD
the threshold of probility for focus atom
-cm CHOOSE_MAX, --choose_max CHOOSE_MAX
whether choose the atom that has the highest prob as focus atom
--min_dist_inter_mol MIN_DIST_INTER_MOL
inter-molecular dist cutoff between protein and ligand.
--bond_length_range BOND_LENGTH_RANGE
the range of bond length for mol generation.
-mdb MAX_DOUBLE_IN_6RING, --max_double_in_6ring MAX_DOUBLE_IN_6RING
--with_print WITH_PRINT
whether print SMILES in generative process
--root_path ROOT_PATH
the root path for saving results
--readme README, -rm README
description of this genrative task
Based on the pose of the ligand, the pocket structure can be split from the protein structure:
from pocket_flow import SplitPocket, Protein, Ligand
pro = Protein('/path/to/protein.pdb')
lig = Ligand('/path/to/ligand.sdf')
dist_cutoff = 10
pocket_block, _ = SplitPocket._split_pocket_with_surface_atoms(pro, lig, dist_cutoff)
with open('/path/to/pocket.pdb', 'w') as f:
    f.write(pocket_block)
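To make the role of `dist_cutoff` more concrete, here is a rough standalone sketch of distance-based pocket selection. It is not the implementation behind `SplitPocket`: the function `residues_within_cutoff`, the `(residue_id, coordinate)` data layout, and the choice to select whole residues are all assumptions for illustration, and it ignores the surface-atom handling that the method name above suggests. The cutoff is presumably in angstroms, as PDB coordinates are.

```python
import math


def residues_within_cutoff(protein_atoms, ligand_coords, dist_cutoff=10.0):
    """Return sorted ids of residues with at least one atom within
    dist_cutoff of any ligand atom (hypothetical helper).

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples
    ligand_coords: iterable of (x, y, z) tuples
    """
    ligand_coords = list(ligand_coords)
    selected = set()
    for res_id, coord in protein_atoms:
        if res_id in selected:
            continue  # residue already kept, no need to re-check its other atoms
        if any(math.dist(coord, lig) <= dist_cutoff for lig in ligand_coords):
            selected.add(res_id)
    return sorted(selected)


# Toy coordinates, invented purely for this example.
protein_atoms = [
    ("ALA1", (0.0, 0.0, 0.0)),
    ("ALA1", (1.5, 0.0, 0.0)),
    ("GLY2", (12.0, 0.0, 0.0)),
    ("SER3", (30.0, 0.0, 0.0)),
]
ligand_coords = [(3.0, 0.0, 0.0), (4.0, 1.0, 0.0)]

pocket_residues = residues_within_cutoff(protein_atoms, ligand_coords, dist_cutoff=10.0)
print(pocket_residues)  # ['ALA1', 'GLY2']
```

A larger cutoff keeps more of the protein around the ligand, which gives the model more context at the cost of a larger input.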
The raw CrossDocked2020 dataset is large and needs about 50 GB of disk space. You can download the processed data from Pocket2Mol.
from pocket_flow import CrossDocked2020
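# Read the last whitespace-separated field of each non-empty line;
# skipping blank lines avoids an IndexError on a trailing newline.
with open('data/unexcept_element_sample_new.csv') as f:
    unexpected_sample = [line.split()[-1] for line in f if line.strip()]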
cs2020 = CrossDocked2020(
'./data/crossdocked_pocket10/',
'./data/crossdocked_pocket10/index.pkl',
unexpected_sample=unexpected_sample
)
cs2020.run(
dataset_name='crossdocked_pocket10_processed_35Atoms.lmdb',
max_ligand_atom=35,
only_backbone=False,
lmdb_path='./data/'
)
The pretraining dataset of PocketFlow was selected from ZINC 3D. You can download ZINC 3D and then use make_pretrain_data.py to produce the pretraining dataset.