Adding UTRs to transcripts in the reference update mode #208
…. So far:

- Now "=", "_" and "n" will be automatically added to the valid class codes if padding is activated
- the Locus class will remove redundancies (i.e. transcripts that are strictly equal) after padding
- added three tests to make sure that the class behaves as expected

Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, merging them into a single one, and losing all ASEs. This aspect will require careful thought.
This issue will be worked on in the branch issue-208.
In the aa1687e commit for the branch, I started working on this. Basically, in this version, whenever padding is activated, normally "redundant" transcripts will be considered as well. Furthermore, after padding, for any couple of identical transcripts (identical down to the base, i.e. nF1 == 100, given that we are talking about padded transcripts), only one will be kept (reference transcripts and higher-score transcripts first). The problem with this approach is that, implemented like this, we risk losing bunches of ASEs, as Mikado will integrate all minor variants of transcripts into the locus ... and there is a hard limit on the number of ASEs. I think, in general, that these changes would be useful, as they allow Mikado to properly integrate RNASeq and ab initio predictions, with UTR expansion and all--even though this is not ready for prime time.
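The post-padding pruning described here - among identical padded transcripts (nF1 == 100), keep one, preferring reference status and then score - can be sketched as follows. This is an illustrative sketch only; the names and data structures are hypothetical, not Mikado's actual API.

```python
# Hypothetical sketch of post-padding redundancy pruning: among transcripts
# that are identical after padding, keep only one, preferring reference
# transcripts first and higher scores second.
from collections import namedtuple

Transcript = namedtuple("Transcript", ["tid", "exons", "score", "is_reference"])

def prune_redundant(transcripts):
    """Keep one transcript per identical exon structure after padding."""
    kept = {}
    for t in transcripts:
        key = tuple(t.exons)  # identical structure => same key
        best = kept.get(key)
        if best is None:
            kept[key] = t
        # Reference transcripts win over non-reference ones,
        # then higher score breaks ties (tuple comparison).
        elif (t.is_reference, t.score) > (best.is_reference, best.score):
            kept[key] = t
    return list(kept.values())

padded = [
    Transcript("t1", [(1, 100), (200, 300)], 10.0, False),
    Transcript("t2", [(1, 100), (200, 300)], 8.0, True),   # identical, but reference
    Transcript("t3", [(1, 100), (200, 400)], 9.0, False),  # distinct structure survives
]
surviving = prune_redundant(padded)
print(sorted(t.tid for t in surviving))  # → ['t2', 't3']
```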
…cally added to the valid ASE codes **only when executing a reference-update run**. This should prevent runaway behaviour in normal runs.
…s#183 In this pull request, there are multiple enhancements to the code base. This does **not** close EI-CoreBioinformatics#208.

General improvements:

- The superlocus algorithms have been revamped:
  - the transcript graph definition should now have O(n log n) complexity, rather than O(n^2). This should reduce runtimes considerably, and potentially memory usage as well, in complex loci.
  - removed the "reduce_method_three" method for complex loci, which was untested and probably hardly ever used. The other two remaining methods have been updated and properly tested.
  - The first reduction method, in particular, should now be faster as well.

For EI-CoreBioinformatics#208:

- Now "=", "_" and "n" will be automatically added to the valid class codes **only when executing a reference-update run**. This should prevent runaway behaviour in normal runs.
- the Locus class will remove redundancies (i.e. transcripts that are strictly equal) after padding

Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, merging them into a single one, and losing all ASEs. This aspect will require careful thought.

Various:

- Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled.
- Solved a bug in the expansion of transcripts (the template having a non-terminal exon with the same start coordinate as the last exon of the transcript to expand)
- Python 3.7.4 does **not** produce functional .so files with cythonize. Blacklisting it.
- Added and configured a proper .coveragerc file; fixed .travis.yml and the environment YAML file
This PR covers quite a bit of ground.

General improvements

- Now the superlocus class has been revamped a bit:
  - made the definition of transcript graphs in superloci an O(n log n) rather than O(n^2) algorithm for multiexonic transcripts.
  - removed the third method to reduce complex loci
  - rewrote method one to reduce complex loci for speed
  - now both reduction methods will consider whether a transcript is a reference

Issue EI-CoreBioinformatics#208

Now, when doing a reference update, the locus class will automatically consider redundant class codes for padding (n, =, _). After padding, completely redundant models (nF1 == 100%) will be pruned. Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, merging them into a single one, and losing all ASEs. This aspect will require careful thought.

Mixed

- Increased unit test coverage
- added a proper .coveragerc file to avoid considering certain files (e.g. all of the test files)
- Solved a bug in the expansion of transcripts (the template having a non-terminal exon with the same start coordinate as the last exon of the transcript to expand)
- Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled.
- Python 3.7.4 does **not** produce valid cythonised files. Blacklisted it in TRAVIS and for Conda.
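The O(n log n) graph construction mentioned in this PR can be illustrated by the general idea of indexing multiexonic transcripts by their introns and only linking transcripts that fall in the same intron bucket, instead of comparing every pair. This is a hedged sketch of the idea under that assumption, not Mikado's actual implementation.

```python
# Sketch: instead of an O(n^2) all-pairs comparison, group transcripts by
# shared introns and only draw edges within each bucket.
from collections import defaultdict
from itertools import combinations

def build_graph(introns_by_tid):
    """introns_by_tid: dict mapping tid -> set of (start, end) introns."""
    bucket = defaultdict(set)
    for tid, introns in introns_by_tid.items():
        for intron in introns:
            bucket[intron].add(tid)  # transcripts sharing this intron
    edges = set()
    for tids in bucket.values():
        # Only transcripts sharing at least one intron can be linked
        edges.update(combinations(sorted(tids), 2))
    return edges

graph = build_graph({
    "a": {(100, 200), (300, 400)},
    "b": {(300, 400)},           # shares an intron with "a"
    "c": {(500, 600)},           # isolated transcript, no edges
})
print(sorted(graph))  # → [('a', 'b')]
```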
… a reference update with padding. This is not ready for prime time yet.
…ence update to go forward is to *remove the constraint on the maximum number of isoforms*. The only check will be whether a transcript meets the minimum score threshold.
This sounds great. I was just looking for something like this. Presently, I have been using Mikado to merge different transcript evidence, and then used that further in Funannotate together with mapped proteins (with GenomeThreader) and let Funannotate control training of Augustus (funannotate predict). However, I do believe that I have lost UTRs after this. Funannotate has its own method of updating the annotation using Kallisto to find the best transcripts and PASA and EVM to add the UTRs. This is quite cumbersome, and I have not gotten it to work properly. If I can use Mikado to add UTRs to the annotation from Funannotate, that would be perfect. Is it usable yet? Thank you. Ole
Hi Ole, it should be. I am currently on holiday, but next week I can write out the detailed instructions (I need to do it anyway for the documentation). It's not perfect or well tested yet, so I would really appreciate the third-party testing!
We definitely have a use for updating a reference annotation, adding additional splice variants and extending UTRs. Marking a set of transcripts as "reference" allowed us to incorporate additional splice variants (SV), but we did this in a way that didn't require many changes to mikado (to save you work). So when we want to add SV to an existing annotation we do the following
This doesn't guarantee all reference transcripts are carried through, but all the primary models (file a) should be present. So this kind of works, but it is convoluted. The "reference" assignment was there to ensure models don't get filtered prior to the pick, and there are times when you want to assign a set of transcripts as reference but not use the --only-reference-update function, i.e. if you want to update and add to a set of models.

On the UTR update: "I amended this so that this automatic addition would happen only when executing a "reference-update"" - that might be ok, but there might be situations where you either want to do this without --only-reference-update being enabled, or not do it when that is enabled. I would consider a separate flag for the UTR behaviour (and good documentation to explain all this).

"The problem with the approach is that implemented like this, we risk losing bunches of ASEs as Mikado will integrate all minor variants of transcripts into the locus ... and there is a hard limit on the number of ASEs." - I don't really follow this; shouldn't the limit on ASE be applied after your nF1 == 100 filter?

Making some minor tweaks to allow some form of UTR extension is useful; however, to be honest, there would probably be a better way to do the SV addition and UTR extension of an existing annotation, i.e. a separate mikado tool, but obviously that's a lot more work.
No, as the logic is written, the maximum number of isoforms is used as a cutoff before we even do the padding. It is basically a switch that says to Mikado "just stop adding" once a critical number of isoforms has been reached. It is not the most elegant solution, hence why I suggested maybe removing it.
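The ordering concern discussed in these two comments can be demonstrated with a toy example (the data and function names are made up for illustration): if the hard cap on isoforms acts as a "just stop adding" switch before the nF1 == 100 redundancy filter, minor variants of one transcript can crowd out genuinely distinct splice events; filtering first avoids that.

```python
# Toy demonstration of why the order of cap vs. redundancy filter matters.

def cap_then_filter(models, cap):
    """Stop adding at the cap, THEN drop redundant copies (current behaviour)."""
    return sorted(set(models[:cap]))

def filter_then_cap(models, cap):
    """Drop redundant copies first, THEN apply the cap (suggested ordering)."""
    deduped = list(dict.fromkeys(models))  # dedupe while preserving order
    return sorted(deduped[:cap])

# Three padded minor variants of "A" arrive before two distinct events, cap = 3:
models = ["A", "A", "A", "B", "C"]
print(cap_then_filter(models, 3))   # → ['A']            (B and C are lost)
print(filter_then_cap(models, 3))   # → ['A', 'B', 'C']  (all events kept)
```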
One that could still be done with the current infrastructure of the program, but yes, it would probably take a couple of weeks of dedicated programming to get it properly right. What would be your preferred course of action with the changes I put in this branch? There are some (regarding making the construction of Mikado transcript graphs faster) that I think should go into the main branch no matter what, but the rest is a bit more of a grey area.
…ure that there will be at most X isoforms in the locus, while doing all the padding.
The caveat means that there might be instances where the percentage threshold is not respected in the final version of the locus.
To clarify, Mikado does not ensure that the primary transcript has a stable score during the AS selection stage.
* This should put a nail into #208: now we can ensure that there will be at most X isoforms in the locus, while doing all the padding.
* BF for the `pick` command line.
* Bfix for previous commit.
* Solved a bug which deleted attributes from transcripts when serialising them.
@lucventurini just to comment while I have this in my head: if we want to "fully" enable updates to a reference annotation, i.e. new splice variants and UTR extensions, then we might need to look at allowing some non-valid models. This ties in with #226, i.e. if you wanted to update, say, the downloaded human Ensembl gene file, you would find many models that will not meet Mikado's validation, i.e. pseudogenes with annotated CDS containing stops, annotation based on haplotypes again containing stops, or other features annotated in ways that Mikado would not accept.

Currently mikado prepare would clean these up (which is obviously desirable in a normal use case, and I assume still true even if the models are marked as reference), but in the context of wanting to update an existing annotation this is probably not desirable. So potentially you could have a lenient mode for prepare, stats, pick etc. that passes through these models unchanged. For the pick, I guess the working solution would be to ignore any loci containing these transcripts, i.e. no updates are made; it's a set of loci that, for pick purposes and the generation of metrics, are simply ignored.

I'm not suggesting we do this, as I realise it's likely not straightforward, but just commenting, as you will see more issues like issues/226 if people use mikado with an existing annotation.
Hi @swarbred,
Hi @lucventurini. I see that you changed the milestone to 2.1 a month ago or so. I guess this means that there is no straightforward way of just adding UTRs to existing gene models (and only to valid models, for instance, ignoring all non-valid ones). Thank you.
I think this is now completely integrated in Mikado 2, due to progress on other issues. |
In #207, @fmarletaz asked me whether we have a way to add UTRs to transcripts derived from e.g. Augustus. This made me realise that currently this is not the case, not even in reference update mode.
The problem here is that Mikado explicitly disallows having transcripts that are identical inside a locus (i.e. the `=` and `n` class codes are explicitly forbidden as valid ASE class codes in `pick`). This means that, in the not-so-hypothetical scenario in which we have a perfectly supported ab initio transcript lacking UTR, and its companion cDNA/assembly from RNASeq which would add the UTR itself, we will not have a method to add it.

Moreover, at the moment, when doing the reference update, the only thing that happens is that Mikado will discard any locus that has no reference transcript. In subsequent stages, however, there is no guarantee that those transcripts will pass through at all.
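The behaviour eventually adopted in this thread - admitting the "=", "_" and "n" class codes as valid ASE codes only during a reference-update run - boils down to a conditional extension of the valid-code set. A minimal sketch, with illustrative names and a hypothetical default set (not Mikado's actual configuration keys):

```python
# Sketch: the redundant class codes are only admitted during a
# reference-update run, so normal runs keep their usual behaviour.

DEFAULT_VALID_CCODES = {"j", "J", "G", "h"}   # hypothetical default ASE codes
REDUNDANT_CCODES = {"=", "_", "n"}            # identical/contained matches

def valid_ccodes(reference_update: bool) -> set:
    codes = set(DEFAULT_VALID_CCODES)
    if reference_update:
        # Allow "identical" matches so that e.g. a UTR-bearing cDNA can pad
        # an ab initio model that it otherwise duplicates.
        codes |= REDUNDANT_CCODES
    return codes

print("=" in valid_ccodes(reference_update=False))  # → False
print("=" in valid_ccodes(reference_update=True))   # → True
```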
There are ways to work around this (e.g. avoid doing a proper mikado run and instead use `mikado compare` to drive an ancillary script to expand transcripts), but I would like to hear your opinions on the matter.