Targeted protein degradation in Mycobacterium tuberculosis

Discovery of potential degraders (BacPROTACs) for essential tRNA synthetases in Mycobacterium tuberculosis.

Background

This work is part of the GC-ADDA4TB project, led by Prof. Erick Strauss. The project builds upon the BacPROTAC technology, in which bifunctional small-molecule degraders are designed to harness the bacterial proteolytic machinery for targeted protein degradation (TPD) of essential proteins in Mycobacterium tuberculosis (Mtb). This involves linking (a) a ligand that binds to a protein of interest (POI) to (b) a molecule that recruits the mycobacterial protease machinery, such as the ClpC:ClpP complex (ClpCP).

A CRISPR genetic screen identified several tRNA synthetases as highly vulnerable targets in Mtb. Here, targeting these tRNA synthetases with TPD is proposed as a novel therapeutic strategy.

The main objective of this project is to prioritise a set of purchasable or easily synthesizable compounds with multi-target properties against the tRNA synthetase family. This will be achieved in three consecutive steps:

Structural annotation and binding pocket comparison across the tRNA synthetase family.
Large-scale virtual screening for compounds with strong predicted binding scores across multiple tRNA synthetase binding sites.
Final shortlisting using additional criteria, such as the ligand's amenability to being extended with a linker to dCymM without disrupting the binding pose.

Domain-specific requirements

We recommend creating a Conda environment to run this code.

conda create -n adda4tb python=3.10
conda activate adda4tb
pip install -r requirements.txt

In addition, the following tools are required:

Open-source PyMOL for protein visualization.
PDB2PQR for protonation at pH 7.0.
PyRosetta for protein structure relaxation. PyRosetta can be installed with the PyRosetta Installer.
P2Rank for pocket detection.

To run P2Rank, Java is required:

conda activate adda4tb
conda install -c conda-forge openjdk

Progress reporting

This repository is a work in progress, as summarized in the following progress report meetings:

Check-in meeting 1 (2025/02/04). Selection and structural annotation of tRNA synthetases.
Check-in meeting 2 (scheduled mid March 2025)

Below, we describe the progress made in chronological order.

Sequence and structure annotation of tRNA synthetases

Based on the results of the CRISPR genetic screen (Bosch et al., 2021; Figure 5), we selected 21 essential tRNA synthetases. The UniProt AC, name, protein sequence and EC number were obtained from UniProt (Mtb H37Rv proteome).

Protein structures

We used the following servers and databases to obtain structural data for the selected tRNA synthetases. To simplify querying some of these resources, we generated a FASTA file for each tRNA synthetase using the scripts/00_generate_fasta_files.py script.

PDBe: Experimental structures (when available) can be found in this subfolder. Note that these structures are often presented in multimeric form, and do not necessarily have full sequence coverage.
AlphaFold2 database: Structures predicted with AF2 were downloaded from the AF2-EBI database and stored in this subfolder. All AF2 structures had 100% sequence coverage and are monomeric. Only one model was downloaded per protein.
AlphaFold3 server: We predicted structures with the AF3 server and downloaded them into this subfolder. Five models are available per query.
Chai-1 server: Likewise, we predicted structures with the Chai-1 server, ticking the MSA option. Results are stored in this subfolder. Five models per query were returned.
AlphaFill: The AlphaFill resource was used to obtain AF2 structures along with ligands. We used the scripts/01_download_alphafill.py script to programmatically download the structures and store them in this subfolder.
Swiss-Model: The Swiss-Model server was used to obtain homology models for each sequence. They can be found in this folder. Note that full coverage is not guaranteed; we required a minimum sequence coverage of 80%. A variable number of models per query were returned.
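The FASTA generation step mentioned above can be illustrated with a minimal sketch. This is not the code of scripts/00_generate_fasta_files.py: the function name `to_fasta`, the 60-character line width and the toy accession and sequence are assumptions made only for illustration.

```python
from textwrap import wrap


def to_fasta(uniprot_ac: str, sequence: str, width: int = 60) -> str:
    """Return one FASTA record with the sequence wrapped at `width` characters."""
    # Remove whitespace that may come from copy-pasted sequences.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    lines = wrap(cleaned, width)
    return f">{uniprot_ac}\n" + "\n".join(lines) + "\n"


# Hypothetical toy entry: accession and sequence are placeholders, not real data.
record = to_fasta("P00000", "MKTAYIAKQR" * 7)
print(record)
```

Writing one such record per file (for example, named after the UniProt AC) gives inputs that can be pasted or uploaded to the prediction servers listed above.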

The resulting structure files were organized in the processed data subfolder and stored in both .cif and .pdb formats. This was done with the scripts/02_organize_structures.py script, which ensures that only one chain is saved per file and that sequences are not chunked. Note that we omitted the PDBe files from this automated processing. Moreover, to simplify visualization, we aligned all structures using the scripts/03_align_structures.py script. Based on RMSD, we removed structures that deviated strongly from the rest.
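To make the RMSD-based removal more concrete, here is a minimal sketch of how outliers could be flagged once structures are superposed. It assumes that each model has already been aligned to the reference and reduced to a matched list of coordinates (for example, one Cα atom per residue). The function names and the 5.0 Å cut-off are hypothetical; the cut-off actually used is not stated here.

```python
import math

Coord = tuple[float, float, float]


def rmsd(a: list[Coord], b: list[Coord]) -> float:
    """RMSD between two equally sized, already superposed coordinate sets."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def flag_outliers(
    reference: list[Coord], models: dict[str, list[Coord]], cutoff: float = 5.0
) -> list[str]:
    """Return the names of models whose RMSD to the reference exceeds `cutoff`."""
    return [name for name, coords in models.items() if rmsd(reference, coords) > cutoff]


# Toy coordinates, for illustration only.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
print(flag_outliers(ref, {"same": ref, "far": shifted}, cutoff=2.0))
```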

Then, we prepared these structures for docking by protonating them with PDB2PQR and relaxing them with PyRosetta, using the scripts/04_relax_structures.py script. This procedure is computationally intensive.

Afterwards, the relaxed structures were aligned again with the scripts/05_align_relaxed_structures.py script, using their unrelaxed counterparts as the reference for all alignments. At this stage, no structures were removed, not even those with a high RMSD against the unrelaxed structures.

Sequence data

We downloaded protein family and domain annotations from InterPro. Files can be found here. Sequence annotation data was processed using the scripts/05_sequence_annotation.py script.

Ligands data

As a first step, we fetched data from ChEMBL using the UniProt AC identifiers. This was done with the scripts/06_fetch_from_chembl.py script. We only found data for 3 of the 21 tRNA synthetases.

Aggregated data

An aggregated file containing one row per processed structure is available here. This file contains the following information:

Field	Description
`file_name`	Name of the processed PDB structure file
`uniprot_ac`	UniProt AC identifier
`n_residues`	Number of residues
`start_resid`	First residue of the sequence available in the PDB file, numbered (starting at 1) with respect to the full UniProt sequence
`end_resid`	Last residue of the sequence available in the PDB file, numbered (starting at 1) with respect to the full UniProt sequence
`coverage`	Percentage sequence coverage
`sequence_structure`	Sequence found in the PDB file
`full_sequence`	Sequence found in UniProt
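The relationship between these fields can be sketched as follows. This is an illustration rather than the repository's code: it assumes that the structure sequence is one contiguous fragment of the UniProt sequence (consistent with sequences not being chunked) and that coverage is the fragment length divided by the full length. The function name is hypothetical.

```python
def coverage_fields(sequence_structure: str, full_sequence: str, start_resid: int) -> dict:
    """Derive n_residues, end_resid and coverage for a contiguous fragment."""
    n_residues = len(sequence_structure)
    end_resid = start_resid + n_residues - 1  # residue numbering starts at 1
    # Sanity check: the fragment must match the full sequence at that position.
    if full_sequence[start_resid - 1:end_resid] != sequence_structure:
        raise ValueError("structure sequence does not match the UniProt sequence")
    return {
        "n_residues": n_residues,
        "start_resid": start_resid,
        "end_resid": end_resid,
        "coverage": 100.0 * n_residues / len(full_sequence),
    }


# Toy sequences, for illustration only.
print(coverage_fields("TAYIA", "MKTAYIAKQR", start_resid=3))
```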

Pocket detection

After completing the structural and sequence characterization of the tRNA synthetases, we detected pockets in the AF2, AF3, Chai-1 and SwissModel predicted models. Following the authors' recommendations, for each structure we kept pockets with a probability above 0.2 (K), while retaining at least the top 3 pockets (N) per structure. We then discarded pockets containing at least one residue with pLDDT < 65 (AF2, AF3, Chai-1) or QSQE < 0.65 (SwissModel), which removed about 25-30% of the pockets. These cut-offs are arbitrary: the usual recommendations are 70 and 0.7, so we have been slightly less restrictive.
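The two filtering steps can be sketched as below. This reflects one reading of the rule: pockets above the probability threshold are kept, and if fewer than three pass, the three highest-ranked pockets are kept instead. The dictionary layout, function names and toy values are assumptions; the actual pipeline may organise the data differently.

```python
def select_pockets(pockets: list[dict], min_prob: float = 0.2, top_n: int = 3) -> list[dict]:
    """Keep pockets with probability > min_prob, but never fewer than top_n if available."""
    ranked = sorted(pockets, key=lambda p: p["probability"], reverse=True)
    above = [p for p in ranked if p["probability"] > min_prob]
    return above if len(above) >= top_n else ranked[:top_n]


def passes_confidence(pocket: dict, confidence: dict[str, float], cutoff: float) -> bool:
    """True only if every residue in the pocket has confidence >= cutoff."""
    return all(confidence[res] >= cutoff for res in pocket["residues"])


def filter_pockets(pockets: list[dict], confidence: dict[str, float], cutoff: float = 65.0) -> list[dict]:
    """Apply the probability/top-N selection, then the per-residue confidence filter."""
    return [p for p in select_pockets(pockets) if passes_confidence(p, confidence, cutoff)]


# Toy example: residue keys follow an assumed "chain_resnum" convention.
pockets = [
    {"name": "p1", "probability": 0.90, "residues": ["A_1", "A_3"]},
    {"name": "p2", "probability": 0.15, "residues": ["A_1", "A_2"]},
    {"name": "p3", "probability": 0.10, "residues": ["A_3"]},
    {"name": "p4", "probability": 0.05, "residues": ["A_3"]},
]
plddt = {"A_1": 90.0, "A_2": 60.0, "A_3": 80.0}
print([p["name"] for p in filter_pockets(pockets, plddt)])
```

For SwissModel structures the same function would be called with the 0.65 cut-off instead of 65.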

A summary file containing one row per detected pocket and structure is available here. This file contains the following information:

Field	Description
`Uniprot_AC`	UniProt AC identifier
`File name`	PDB file in which pockets have been detected
`Prediction type`	The method used for protein structure prediction
`Full path`	The full directory path where the PDB file is stored
`Pocket number`	The identified pocket number within the structure
`Pocket score`	The score assigned to the detected pocket
`Pocket probability`	The probability value indicating confidence in pocket detection
`Pocket centroid coordinate (x y z)`	The (x, y, z) coordinates of the pocket’s centroid
`Pocket residues (chain_resn)`	List of residues forming the pocket, with chain and residue number
`B-factors`	Confidence measures: pLDDT (AF2, AF3, Chai-1) or QSQE (SwissModel)
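Since the confidence values are read from the B-factor column, a short sketch of how they could be extracted from a PDB file may be useful. It relies only on the fixed-column PDB ATOM record layout. Taking the Cα atom as the representative of each residue and the "chain_resnum" key format are assumptions made for illustration.

```python
def residue_confidence(pdb_text: str) -> dict[str, float]:
    """Map 'chain_resnum' to the B-factor value stored on the residue's CA atom."""
    confidence = {}
    for line in pdb_text.splitlines():
        # Only ATOM records long enough to contain the B-factor field.
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        if line[12:16].strip() != "CA":  # atom name, columns 13-16
            continue
        chain = line[21]                 # chain identifier, column 22
        resnum = line[22:26].strip()     # residue number, columns 23-26
        confidence[f"{chain}_{resnum}"] = float(line[60:66])  # B-factor, columns 61-66
    return confidence
```

The resulting dictionary can then be looked up with the residue identifiers listed for each pocket.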

PyMOL visualization

PyMOL session files (.pse) have been prepared to facilitate the visualization of detected pockets and their corresponding residues. These were generated using the scripts/09_prepare_pymol_visualizations.py script as step 09 in the pipeline.

Each PyMOL session (one per protein) includes the following elements:

Element	Description	Displayed?
Reference structure (AF2)	Wheat color, surface + cartoon representation	✅ Yes
Pockets detected in reference structure (AF2)	Sky-blue spheres with arbitrary size (pocket detection provides a single 3D coordinate)	✅ Yes
Residues defining each pocket in AF2	Orange color, surface + cartoon representation	✅ Yes
Aligned structures (all but AF2)	Gray color, cartoon representation	❌ No
Pockets detected in aligned structures	Gray-colored points	❌ No
InterPro annotations	Includes conserved sites, domains, families, homologous superfamilies, etc. (red color, surface representation)	❌ No
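As an illustration of how the displayed elements in the table map to PyMOL commands, the sketch below writes a PyMOL command script (.pml) as plain text. It is not the code of scripts/09_prepare_pymol_visualizations.py: the function name, object names and input layout are hypothetical, and only the reference structure, pocket centroids and pocket residues are covered.

```python
from collections import defaultdict


def build_pml(reference_pdb: str, pockets: list[dict], session_path: str) -> str:
    """Return a PyMOL script; each pocket has 'centroid' (x, y, z) and 'residues' ['A_12', ...]."""
    cmds = [
        f"load {reference_pdb}, ref",
        "hide everything, ref",
        "show cartoon, ref",
        "show surface, ref",
        "color wheat, ref",
    ]
    for i, pocket in enumerate(pockets, start=1):
        x, y, z = pocket["centroid"]
        # The pocket is a single 3D point, so it is drawn as a pseudoatom sphere.
        cmds.append(f"pseudoatom pocket{i}, pos=[{x}, {y}, {z}]")
        cmds.append(f"show spheres, pocket{i}")
        cmds.append(f"color skyblue, pocket{i}")
        # Group residue numbers by chain to build a selection expression.
        by_chain = defaultdict(list)
        for res in pocket["residues"]:
            chain, resnum = res.split("_")
            by_chain[chain].append(resnum)
        selection = " or ".join(
            f"(chain {chain} and resi {'+'.join(nums)})" for chain, nums in by_chain.items()
        )
        if selection:
            cmds.append(f"color orange, ref and {selection}")
    cmds.append(f"save {session_path}")
    return "\n".join(cmds) + "\n"


# Toy input, for illustration only.
print(build_pml("ref.pdb", [{"centroid": (1.0, 2.0, 3.0), "residues": ["A_10", "A_12"]}], "out.pse"))
```

The generated text can be saved to a file and run in PyMOL's command-line mode (for example, `pymol -cq script.pml`) to produce the session file.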

About the Ersilia Open Source Initiative

This repository is developed by the Ersilia Open Source Initiative. Ersilia develops AI/ML tools to support drug discovery research in the Global South. To learn more about us, please visit our GitBook Documentation and our GitHub profile.
