Add prepare option to remove transcripts with the same intron chain #270
Comments
Hi @swarbred , |
…d on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
872dcff has the required changes. As stated in the commit message, Mikado will now remove redundant transcripts based on their intron chains / monoexonic overlap, rather than simply checking the start/end. Exceptions: two transcripts with similar coordinates but different strand or CDS content.
Hi @swarbred Commit bc0774e should add all the features that you request.
The first three fields are mandatory. A transcript will be considered as redundant to a template if:
Please note that the last point implies that some transcripts will slip through: the check on the strandedness for multiexonic transcripts happens downstream of the redundancy test (as it is quite an expensive operation).
* For EI-CoreBioinformatics#270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
* For EI-CoreBioinformatics#270: this commit should introduce all the features asked by @swarbred (tagging also @ljyanesm)
* Issue #280: now mikado serialise has been refactored so that:
  * loading of BLAST XMLs should be faster thanks to using Cython on the most time-expensive function
  * mikado now accepts also *tabular* BLAST data (custom format, we need the `ppos` and `btop` extra fields)
  * `daijin` now automatically generates *tabular* rather than XML BLAST results
  * `mikado` will now use `spawn` as the default multiprocessing method. This avoids memory accounting problems in e.g. SLURM (sometimes `fork` leads the HPC management system to think that the shared memory is duplicated, massively and falsely inflating the accounting of memory usage).
* Issue #270: now Mikado will remove redundancy based on intron chains
* For #270: now `mikado prepare` will remove redundant transcripts based on their *intron chains* / *monoexonic span overlap*, rather than start/end. Exact CDS match still applies.
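The `spawn` note in the commit message above can be illustrated with a small, generic sketch. It is not Mikado's code: `spawn_context` and `run_with_spawn` are hypothetical helper names, and the only library calls used are the standard `multiprocessing.get_context` and `Pool.map`.

```python
import math
import multiprocessing as mp


def spawn_context():
    """Return a context that starts workers with `spawn` instead of `fork`."""
    # get_context leaves the interpreter-wide default start method untouched.
    return mp.get_context("spawn")


def run_with_spawn(func, values, processes=2):
    """Map `func` over `values` in freshly spawned worker processes.

    `func` must be importable by the workers (picklable by reference),
    because `spawn` starts a new interpreter rather than copying the parent.
    """
    with spawn_context().Pool(processes=processes) as pool:
        return pool.map(func, values)


if __name__ == "__main__":
    # The guard matters with `spawn`: workers re-import the main module.
    print(run_with_spawn(math.factorial, [3, 4, 5]))
```

Because each worker starts from a clean interpreter, it does not share the parent's memory pages the way a forked child does, which is the property the commit message relies on to avoid inflated memory accounting.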
Fixed in b3f3d7a |
the mikado configure --help needs to be updated with the additional column for keep_redundant
|
True. I thought I had done so already, will do so tomorrow. |
This was from the --help from 4ab8cf5 |
Sorry, I'm slightly confused. When I read this originally I thought you had added an extra column, "keep_redundant", to the list file: a boolean for whether to collapse transcripts contained in another model, with the default being TRUE, i.e. only collapsing exactly matching transcripts.
Looking at the prepare --help description for list, this indicates
<score(optional)> <is_reference(optional)> <always_keep(optional)>
whereas the help from an earlier version, before this change, shows
<score(optional)> <always_keep(optional)>
So the new column is "is_reference", as we had always_keep in the earlier version. What you called "keep_redundant" looks to map to <always_keep(optional)>; if so, it seems we have switched the meaning of always_keep in the earlier version to now be is_reference.
is_reference = flag marking which models are to be regarded as reference transcripts (this seems clear from the name, and I prefer this as a column name over always_keep)
always_keep = I guess if you take always_keep to be short for "always keep contained" then it makes sense
What confused me is that we seem to have switched always_keep to be is_reference. I understand that what matters is the column order (and the marking of reference transcripts is still col 5 across versions). As long as the doc makes clear the reason for the different columns, and that the collapsing of contained models is separate from the marking of reference transcripts, with the default NOT to collapse contained models, then this seems fine.
You are right that the documentation is confusing. This will be the main target for modifications before releasing 2.0 ... But onto how it should work:
Slightly incorrect. The choice is between either not collapsing (
Yes, that is correct. Internally I still use the "reference" moniker; I will amend accordingly. Moreover, I just realised that unfortunately the parsing of the input list of files (either in
Yes, we will amend the documentation. I have also changed things so that the default now is to not reduce redundancy unless specified. |
To clarify: the option of removing exactly matching transcripts is not present. Either all transcripts are left in or we remove redundant intron chains. My fault, I understood that we wanted to supersede the behaviour, rather than having three possible options instead of two (keep everything, remove identical transcripts, remove redundant intron chains).
This will not happen in any circumstance. Mikado removes transcripts that have the same intron chain. So if a transcript with
Yes, I fear so. Unfortunately, and very embarrassingly for me, I realised this morning that the parsing of the list file was broken.
The |
This is what I was envisaging: no changes to is_reference, i.e. any model marked as reference is retained (strictly we may have some exceptions, e.g. where models fail some criteria such as a CDS with internal stops; whatever we are currently doing seems fine).
Outside of reference models I can't see a reason why we would ever want to keep truly identical transcripts (as far as I can see there is no benefit in mikado pick if the models are really identical), so I wasn't thinking we would change this behaviour (i.e. I expected these always to be removed unless the model is a reference) and the user wouldn't need any control over this.
I was thinking we would just add a column to the list file to enable control over removing redundant intron chains. So the default behaviour would be as in previous versions, i.e. we only remove identical models unless marked as reference, but we would have the option to be more aggressive and remove redundant intron chains (again excluding models marked as reference).
My example would be like this: two adjacent genes, one single-exon and one multiexonic, as shown in the REF annotation (i.e. what is the correct annotation). But the RNA assemblies generated look like the three transcript assemblies (in transcript 2, there is low coverage between the two genes, which results in a single fused transcript). Transcripts 2 and 3 have the same relationship as your A and A` example.
REF annotation XXXXXX XXX---XXX---XXXX
In this case I believe that if you run with the new collapse intron chain logic we will end up with just two transcripts, 1 and 2. So we had the "correct" answer in the input but lose one of these correct transcripts in the prepare output if we collapse on intron chains. This would be one case where collapsing on intron chain would be detrimental (hence why I wouldn't do this as default for illumina assemblies). The above is going to be less of an issue when you have full-length transcripts rather than assembled RNA-seq.
Sorry my attempt at the diagram gets messed up on posting, I will post the image |
I see your point. I would hope that this specific instance would be caught by chimera splitting, but that's just quibbling - I can see the reason to keep the behaviour as keeping redundancies.
Yes, this is achievable. I will reopen the |
b552dfc and 657b9c7 should address this problem. Now redundant transcripts (ie, non-reference identical copies, down to the base and including a comparison of the CDS) are always de-duplicated by keeping only one copy (giving preference to reference transcripts first, to higher-scoring sources second). |
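The selection rule described in the comment above (one copy per identical model, reference transcripts preferred first, higher-scoring sources second) can be sketched as follows. This is only an illustration under stated assumptions: the `Transcript` record and `deduplicate` function are hypothetical and are not Mikado's classes, "identical" is taken to mean same chromosome, strand, exons and CDS, and ties are broken by input order.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


@dataclass(frozen=True)
class Transcript:
    """Hypothetical, simplified transcript record (not Mikado's own class)."""
    tid: str
    chrom: str
    strand: str
    exons: Tuple[Interval, ...]
    cds: Tuple[Interval, ...] = ()
    is_reference: bool = False
    score: float = 0.0

    def identity_key(self) -> tuple:
        # Assumption: identical "down to the base" means same location,
        # strand, exon structure and CDS.
        return (self.chrom, self.strand, self.exons, self.cds)


def deduplicate(transcripts: List[Transcript]) -> List[Transcript]:
    """Keep one copy per identical model: reference first, then best score."""
    best: Dict[tuple, Transcript] = {}
    for tr in transcripts:
        key = tr.identity_key()
        current = best.get(key)
        # Tuples compare element by element, so is_reference outranks score.
        if current is None or (tr.is_reference, tr.score) > (
            current.is_reference,
            current.score,
        ):
            best[key] = tr
    return list(best.values())
```

Comparing `(is_reference, score)` tuples keeps the two-level preference in one expression; a reference copy with a low score still wins over a non-reference copy with a high one.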
@lucventurini |
I'm just coming back to this after some time (we had HPC issues at the time of the original email and the 2.0prc2 version I ran didn't complete the prepare stage).
It looks like the modified behaviour is on by default in 2.0prc2, e.g.
487445 seqs after prepare with d094f99
275429 seqs after prepare with 2.0prc2
My intention was to have this as an option for the special cases when you need to more aggressively remove models with the same intron chain, i.e. very large data sets (thinking mainly of pacbio CCS or ONT reads), not as a default for the more common case of normal illumina assemblies. It certainly cuts down the model numbers, so it speeds up downstream steps, but I suspect it is detrimental for illumina assemblies.
Whether it's default on or off, I want to know how to switch to the original behaviour. Did we make a change to NOT have the new behaviour as default after the version I'm running, or is this still the default?
To clarify, is this how it currently works:
with --keep-redundant TRUE, we remove only fully redundant models (i.e. match end to end) unless they are marked as reference, i.e. always keep (the old behaviour)
with --keep-redundant FALSE, we remove models with redundant intron chains unless they are marked as reference (the new behaviour)
From the configuration.toml file that was generated, the default appears to be keep_redundant = false, which would fit with the above.
Hi @swarbred Yes, your understanding is exactly right. I can switch back the behaviour so that the flag can be "--exclude-redundant" if that is the preferred case. It will require a bit of fiddling as some automated tests rely on the redundancy removal. Best |
Hi @lucventurini, when you have access see JIRA ticket GENANNO-480 and my comment from 28th Sept 12:06 and the browser link which is shown in the comment below that. There will be some cases where the more aggressive removal results in an improved selection, but I think there are more times where this is undesirable (specifically when dealing with illumina assemblies), e.g. the screenshot below, where the more aggressive redundancy removal results in only the over-extended model being selected (this is not stranded data). The top track is with --keep-redundant FALSE, the second track with TRUE. So I would favour changing to --exclude-redundant
Hi @swarbred, I am proceeding to change the interface as you requested. Hopefully I should be done by this afternoon. As the only change will be the direction of the switch, not the behaviour itself, it should be possible to quickly test this and then merge into master? |
…ant' has become 'exclude_redundant' and the default behaviour is, like in Mikado 1.0, to keep redundant models unless specified otherwise. Also changed the minimum biopython version (1.78) and made small adjustments as the new library version has removed support for 'Alphabet'.
Travis is testing the latest commit in the branch (https://travis-ci.org/github/lucventurini/mikado/jobs/733038131), which should function for Python 3.6, 3.7 and 3.8. I will sign off here and leave this commit, 2a61631, for you to test. Please let me know if there is anything else I can do on this.
@lucventurini can confirm 2a61631 works as expected |
Excellent! Merging into master and closing! |
…I-CoreBioinformatics#291. Ready to roll into Conda etc.
@lucventurini for possible future consideration (post 2.0)
With us now starting to use CCS reads or raw pacbio / nanopore alignments in mikado, the number of reads is getting much larger.
In the context of the transcriptome module I've discussed with @ljyanesm some pragmatic solutions (e.g. stringtie2 to filter and collapse the data, the downside being that this assembles reads, OR collapsing with GFFread pre or post mikado).
Obviously this can be done outside of mikado but I wanted to check if it was easily feasible to do it as part of the prepare step.
The idea would be to remove transcripts with identical intron chains (keeping I guess the version with the most extended terminal exons, e.g. as gffread -M), but NOT removing contained fragments of partial intron chains, i.e. as gffread -M -K
That would not be advisable for illumina RNA-Seq assembled transcripts but it's less dangerous for long reads and would substantially reduce the number of transcripts for orf / blast analysis.
Ultimately there are better solutions to filter long reads prior to mikado but the above would be a useful feature (if you already hold the info required during prepare).
Note this would need to be tied to specific labels to allow this to be applied to just certain transcript sets.
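A minimal sketch of the proposal above, under stated assumptions: transcripts are plain dictionaries with hypothetical keys (`id`, `label`, `chrom`, `strand`, `exons`), "most extended terminal exons" is approximated by the longest genomic span, monoexonic models are left untouched, and only the listed labels are collapsed. It reproduces neither Mikado's nor gffread's actual implementation.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exons."""
    ordered = sorted(exons)
    return tuple((left[1] + 1, right[0] - 1)
                 for left, right in zip(ordered, ordered[1:]))


def span(exons):
    """Genomic length covered from the first exon start to the last exon end."""
    return max(end for _, end in exons) - min(start for start, _ in exons) + 1


def collapse_identical_chains(transcripts, labels):
    """Collapse transcripts sharing an identical intron chain.

    Only transcripts whose label is in `labels` are collapsed. Fragments with
    a partial (contained) intron chain have a different chain and are kept.
    """
    groups = defaultdict(list)
    kept = []
    for tr in transcripts:
        chain = intron_chain(tr["exons"])
        if tr["label"] in labels and chain:
            groups[(tr["chrom"], tr["strand"], chain)].append(tr)
        else:
            # Other labels and monoexonic models pass through unchanged.
            kept.append(tr)
    for members in groups.values():
        # Keep the member with the widest span as the representative.
        kept.append(max(members, key=lambda tr: span(tr["exons"])))
    return kept
```

Restricting the grouping to selected labels is what would let long-read sets be collapsed while Illumina assemblies in the same run are left as they are.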