Targeted protein degradation in Mycobacterium tuberculosis

Discovery of potential degraders (BacPROTACs) for essential tRNA synthetases in Mycobacterium tuberculosis.

Background

This work is part of the GC-ADDA4TB project, led by Prof. Erick Strauss. It builds upon the BacPROTACs technology, in which small-molecule bifunctional degraders are designed to harness the proteolytic machinery for targeted protein degradation (TPD) of essential proteins in Mycobacterium tuberculosis (Mtb). This involves linking (a) a ligand that binds to a protein of interest (POI) to (b) a molecule that recruits the mycobacterial protease machinery, such as the ClpC:ClpP complex (ClpCP).

A CRISPR genetic screen identified several tRNA synthetases as highly vulnerable in Mtb. Here, targeting these tRNA synthetases with TPD is proposed as a novel therapeutic strategy.

The main objective of this project is to prioritise a set of purchasable or easily synthesizable compounds with multi-target properties against the tRNA synthetase family. This will be achieved in three consecutive steps:

  1. Structural annotation and binding pocket comparison across the tRNA synthetase family.
  2. Large-scale virtual screening for compounds with strong predicted binding scores across multiple tRNA synthetase binding sites.
  3. Final shortlisting using additional criteria, such as the ligand's amenability to being extended with a linker to dCymM without disrupting the binding pose.

Domain-specific requirements

We recommend creating a Conda environment to run this code.

conda create -n adda4tb python=3.10
conda activate adda4tb
pip install -r requirements.txt

In addition, P2Rank requires Java, which can be installed in the same environment:

conda activate adda4tb
conda install -c conda-forge openjdk

Progress reporting

This repository is a work in progress, as summarized in the following progress report meetings:

  • Check-in meeting 1 (2025/02/04). Selection and structural annotation of tRNA synthetases.
  • Check-in meeting 2 (scheduled mid March 2025)

Below, we explain the progress made chronologically.

Sequence and structure annotation of tRNA synthetases

Based on the results of the CRISPR genetic screen (Bosch et al., 2021; Figure 5), we have selected 21 essential tRNA synthetases. The UniProt AC, name, protein sequence and EC number have been obtained from UniProt (Mtb H37Rv proteome).

Protein structures

We have used the following servers or databases to obtain structural data for the selected tRNA synthetases. To ease the query of some resources, we have generated FASTA files for each tRNA synthetase using the scripts/00_generate_fasta_files.py script.

  • PDBe: Experimental structures (when available) can be found in this subfolder. Note that these structures are often presented in multimeric form, and do not necessarily have full sequence coverage.
  • AlphaFold2 database: Structures predicted with AF2 were downloaded from the AF2-EBI database and stored in this subfolder. All AF2 structures had 100% sequence coverage and are monomeric. Only one model was downloaded per protein.
  • AlphaFold3 server: We predicted structures with the AF3 server and downloaded them into this subfolder. Five models are available per query.
  • Chai-1 server: Likewise, we predicted structures with the Chai-1 server, ticking the MSA option. Results are stored in this subfolder. Five models per query were returned.
  • AlphaFill: The AlphaFill resource was used to obtain AF2 structures along with ligands. We used the /scripts/01_download_alphafill.py script to programmatically download the structures and store them in this subfolder.
  • Swiss-Model: The Swiss-Model server was used to obtain homology models for each sequence. They can be found in this folder. Note that full coverage is not guaranteed; we required a minimum sequence coverage of 80%. A variable number of models per query was returned.
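The per-protein FASTA files mentioned above are simple to produce. The following is a minimal sketch of that step, not the content of scripts/00_generate_fasta_files.py: the function name `to_fasta`, the header layout and the placeholder accession and sequence are all assumptions made for illustration.

```python
def to_fasta(uniprot_ac: str, name: str, sequence: str, width: int = 60) -> str:
    """Format one protein as a FASTA record with fixed-width sequence lines."""
    # Header layout (accession followed by name) is an assumption
    header = f">{uniprot_ac} {name}"
    # Remove any whitespace or line breaks carried over from the source table
    clean = "".join(sequence.split()).upper()
    # Slice the sequence into lines of at most `width` residues
    lines = [clean[i:i + width] for i in range(0, len(clean), width)]
    return "\n".join([header, *lines]) + "\n"


# Placeholder record: the accession and sequence are not real data
record = to_fasta("X00000", "example tRNA synthetase", "MKT" * 30)
print(record)
```

One file per protein keeps each query to the prediction servers independent, which is convenient when a server accepts a single sequence at a time.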

The multiple structure files were organized in the processed data subfolder and stored in both .cif and .pdb formats. This was done with the /scripts/02_organize_structures.py script, which ensures that only one chain is saved per file and that sequences are not chunked. Note that we omitted the PDBe files from this automated processing. Moreover, to simplify visualization, we aligned all structures using the /scripts/03_align_structures.py script. Based on RMSD, we removed structures that deviated strongly from the rest.
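To make the RMSD-based removal concrete, here is a small sketch under stated assumptions. It does not reproduce scripts/03_align_structures.py: the coordinates are assumed to be already superposed and paired atom by atom, the file names are placeholders, and the cutoff value is hypothetical since no threshold is stated for this step.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) tuples.

    Assumes the two coordinate sets are already superposed and paired.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def split_by_rmsd(rmsd_by_file, cutoff):
    """Return (kept, removed) file names; `cutoff` (in angstroms) is hypothetical."""
    kept = sorted(f for f, value in rmsd_by_file.items() if value <= cutoff)
    removed = sorted(f for f, value in rmsd_by_file.items() if value > cutoff)
    return kept, removed


# Toy coordinates: the second model is shifted by 3 angstroms along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
values = {
    "model_a.pdb": rmsd(reference, reference),
    "model_b.pdb": rmsd(reference, shifted),
}
kept, removed = split_by_rmsd(values, cutoff=2.0)
print(kept, removed)
```

A fixed cutoff is the simplest choice; a rule relative to the median RMSD of each protein's models would be an alternative when structures differ a lot in size.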

Then, we prepared these structures for docking by protonating them with PDB2PQR and relaxing them with PyRosetta, using the scripts/04_relax_structures.py script. This procedure is computationally intensive.

Afterwards, the relaxed structures were aligned again with the scripts/05_align_relaxed_structures.py script, using their unrelaxed counterparts as the reference for all alignments. At this stage, no structures were removed, even those with a high RMSD against the unrelaxed structures.

Sequence data

We downloaded protein family and domain annotations from InterPro. Files can be found here. Sequence annotation data was processed using the scripts/05_sequence_annotation.py script.

Ligands data

As a first step, we fetched data from ChEMBL using the UniProt AC identifiers. This was done with the scripts/06_fetch_from_chembl.py script. We only found data for 3 of the 21 tRNA synthetases.

Aggregated data

An aggregated file containing one row per processed structure is available here. This file contains the following information:

| Field | Description |
| --- | --- |
| file_name | Name of the processed PDB structure file |
| uniprot_ac | UniProt AC identifier |
| n_residues | Number of residues |
| start_resid | First residue number (1-based) of the sequence available in the PDB file, with respect to the full UniProt sequence |
| end_resid | Last residue number (1-based) of the sequence available in the PDB file, with respect to the full UniProt sequence |
| coverage | Percentage sequence coverage |
| sequence_structure | Sequence found in the PDB file |
| full_sequence | Sequence found in UniProt |
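The relationship between the fields above can be sketched as follows. This is an illustrative reconstruction and not the repository's code: the function name `coverage_row` is hypothetical, the accession and sequences are toy values, and the sketch assumes the structure sequence is one contiguous stretch of the UniProt sequence.

```python
def coverage_row(file_name, uniprot_ac, sequence_structure, full_sequence):
    """Build one row of the aggregated table from a structure and its full sequence."""
    if not sequence_structure:
        raise ValueError("structure sequence is empty")
    # Locate the structure sequence inside the full sequence (0-based offset)
    offset = full_sequence.find(sequence_structure)
    if offset < 0:
        raise ValueError("structure sequence is not a contiguous part of the full sequence")
    n_residues = len(sequence_structure)
    return {
        "file_name": file_name,
        "uniprot_ac": uniprot_ac,
        "n_residues": n_residues,
        "start_resid": offset + 1,          # 1-based numbering
        "end_resid": offset + n_residues,   # inclusive, 1-based
        "coverage": 100.0 * n_residues / len(full_sequence),
        "sequence_structure": sequence_structure,
        "full_sequence": full_sequence,
    }


# Toy example: the model lacks the first and last residue of a 10-residue protein
row = coverage_row("toy_model.pdb", "X00000", "CDEFGHIK", "ACDEFGHIKL")
print(row["start_resid"], row["end_resid"], row["coverage"])
```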

Pocket detection

After completing the structural and sequence characterization of the tRNA synthetases, we detected pockets in the AF2, AF3, Chai-1 and SwissModel predicted models. Following the authors' recommendations, for each structure we kept the pockets with a probability above 0.2 (K), and in any case at least the top 3 pockets (N). We then discarded pockets containing at least one residue with pLDDT < 65 (AF2, AF3, Chai-1) or QSQE < 0.65 (SwissModel), which removed about 25-30% of the pockets. These cut-offs are arbitrary; the usual recommendations are 70 and 0.7, so our filter is slightly less restrictive.
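The two-stage selection can be illustrated with a short sketch. It reflects one reading of the rule (keep pockets above the probability threshold, and fall back to the top-ranked pockets when fewer than three pass); that reading, the function names, the dictionary keys and the toy values are assumptions, not the repository's implementation.

```python
def select_pockets(pockets, min_probability=0.2, min_count=3):
    """Keep pockets above the probability threshold, but never fewer than the top `min_count`."""
    ranked = sorted(pockets, key=lambda p: p["probability"], reverse=True)
    selected = [p for p in ranked if p["probability"] > min_probability]
    if len(selected) < min_count:
        # Fall back to the top-ranked pockets (fewer if the structure has fewer)
        selected = ranked[:min_count]
    return selected


def drop_low_confidence(pockets, cutoff):
    """Discard pockets that contain at least one residue below the confidence cutoff."""
    return [
        p for p in pockets
        if all(value >= cutoff for value in p["residue_confidence"])
    ]


# Toy pockets with per-residue pLDDT-like values (not real data)
pockets = [
    {"pocket": 1, "probability": 0.81, "residue_confidence": [92.0, 88.5, 79.1]},
    {"pocket": 2, "probability": 0.15, "residue_confidence": [71.0, 60.2]},
    {"pocket": 3, "probability": 0.09, "residue_confidence": [85.0, 90.3]},
    {"pocket": 4, "probability": 0.04, "residue_confidence": [95.0]},
]
selected = select_pockets(pockets)
confident = drop_low_confidence(selected, cutoff=65)  # 0.65 would apply to QSQE values
print([p["pocket"] for p in selected], [p["pocket"] for p in confident])
```

In this toy case only one pocket passes the probability threshold, so the top three are kept, and pocket 2 is then discarded because one of its residues falls below 65.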

A summary file containing one row per detected pocket and structure is available here. This file contains the following information:

| Field | Description |
| --- | --- |
| Uniprot_AC | UniProt AC identifier |
| File name | PDB file in which pockets have been detected |
| Prediction type | The method used for protein structure prediction |
| Full path | The full directory path where the PDB file is stored |
| Pocket number | The identified pocket number within the structure |
| Pocket score | The score assigned to the detected pocket |
| Pocket probability | The probability value indicating confidence in pocket detection |
| Pocket centroid coordinate (x y z) | The (x, y, z) coordinates of the pocket's centroid |
| Pocket residues (chain_resn) | List of residues forming the pocket, with chain and residue number |
| B-factors | Confidence measures: pLDDT (AF2, AF3, Chai-1) or QSQE (SwissModel) |

PyMOL visualization

PyMOL session files (.pse) have been prepared to facilitate the visualization of detected pockets and their corresponding residues. These were generated using the scripts/09_prepare_pymol_visualizations.py script as step 09 in the pipeline.

Each PyMOL session (one per protein) includes the following elements:

| Element | Description | Displayed? |
| --- | --- | --- |
| Reference structure (AF2) | Wheat color, surface + cartoon representation | ✅ Yes |
| Pockets detected in reference structure (AF2) | Sky-blue spheres with arbitrary size (pocket detection provides a single 3D coordinate) | ✅ Yes |
| Residues defining each pocket in AF2 | Orange color, surface + cartoon representation | ✅ Yes |
| Aligned structures (all but AF2) | Gray color, cartoon representation | ❌ No |
| Pockets detected in aligned structures | Gray-colored points | ❌ No |
| InterPro annotations | Includes conserved sites, domains, families, homologous superfamilies etc (red color, surface representation) | ❌ No |

About the Ersilia Open Source Initiative

This repository is developed by the Ersilia Open Source Initiative. Ersilia develops AI/ML tools to support drug discovery research in the Global South. To learn more about us, please visit our GitBook Documentation and our GitHub profile.