Adding UTRs to transcripts in the reference update mode #208

lucventurini · 2019-08-09T09:32:02Z

In #207, @fmarletaz asked me whether we have a way to add UTRs to transcripts derived from eg. Augustus. This made me realise that currently this is not the case, not even in reference update mode.

The problem here is that Mikado explicitly disallows having transcripts that are identical inside a locus (ie.: the = and n class codes are explicitly forbidden as valid ASE class codes in pick). This means that, in the not-so-hypothetical scenario in which we have a perfectly supported ab initio transcript lacking UTR, and its companion cDNA/assembly from RNASeq which would add the UTR itself, we will not have a method to add it.

Moreover, at the moment when doing the reference update, the only thing that happens is that Mikado will discard any locus that has no reference transcript. In subsequent stages, however, there is no guarantee that those transcripts will pass through at all.

There are ways to work around this (e.g.: avoid doing a proper mikado run and use instead mikado compare to drive an ancillary script to expand transcripts) but I would like to hear your opinions on the matter.

The text was updated successfully, but these errors were encountered:

…. So far: - Now "=", "_" and "n" will be automatically added to the valid class codes if padding is activated - the Locus class will remove redundancies (ie transcripts that are strictly equal) after padding - added three tests to make sure that the class behaves as expected Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one, and lose all ASEs. This aspect will require careful thought.

lucventurini · 2019-08-09T15:01:22Z

This issue will be worked on in the branch issue-208.

lucventurini · 2019-08-09T15:20:17Z

In the aa1687e commit for the branch, I started working on this. Basically, in this version, whenever padding is activated normally "redundant" transcripts (=, _, n) will be considered valid ASEs. In ec31f37, I amended this so that this automatic addition would happen only when executing a "reference-update" run.

Furthermore, after padding, for any couple of identical transcripts (to the base, ie nF1 == 100, given that we are talking about padded transcripts), only one will be kept (reference transcripts and higher score transcripts first).

The problem with the approach is that implemented like this, we risk losing bunches of ASEs as Mikado will integrate all minor variants of transcripts into the locus ... and there is a hard limit on the number of ASEs. I think, in general, that they would be useful as they allow to properly integrate RNASeq and ab initio prediction, with UTR expansion and all--even though this is not ready for prime time.

…cally added to the valid ASE codes **only when executing a reference-update run**. This should prevent runaway behaviour in normal runs.

…s#183 In this pull request, there are multiple enhancements to the code base. This does **not** close EI-CoreBioinformatics#208. General improvements: - The superlocus algorithms have been revamped: - the transcript graph definition should now have O(nlogn) complexity, rather than O(n^2). This should reduce a lot runtimes and even potentially memory usage in complex loci. - removed the "reduce_method_three" method for complex loci, which was untested and probably hardly ever used. The other two remaining methods have been updated and properly tested. - The first reduction method, in particular, should now be faster as well. For EI-CoreBioinformatics#208: - Now "=", "_" and "n" will be automatically added to the valid class codes **only when executing a reference-update run**. This should prevent runaway behaviour in normal runs. - the Locus class will remove redundancies (ie transcripts that are strictly equal) after padding Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one, and lose all ASEs. This aspect will require careful thought. Various: - Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled. - Solved a bug in the expansion of transcripts (template having a non-terminal exon with the same start coordinate of the last exon of the transcript to expand) - Python 3.7.4 does **not** produce functional .so files with cythonize. Blacklisting it. - Added and configured a proper .coveragerc file; fixed .travis.yml and the environment YAML file

This PR covers quite a bit of ground. General improvements - Now the superlocus class has been revamped a bit: - made the definition of transcript graphs in superloci a O(nlogn) rather than O(n**2) algorithm for multiexonic transcripts. - removed the third method to reduce complex loci - rewrote for speed the method one to reduce complex loci - now both reduction methods will consider whether a transcript is reference Issue EI-CoreBioinformatics#208 Now, when doing a reference update, the locus class will automatically consider redundant class codes for padding (n, =, _). After padding, completely redundant models (nF1 == 100%) will be pruned. Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one, and lose all ASEs. This aspect will require careful thought. Mixed - Increased unit test coverage - added a proper .coveragerc file to avoid considering certain files (e.g. all of the test files) - Solved a bug in the expansion of transcripts (template having a non-terminal exon with the same start coordinate of the last exon of the transcript to expand) - Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled. - Python 3.7.4 does **not** produce valid cythonised files. Blacklisted it in TRAVIS and for Conda.

… a reference update with padding. This is not ready for prime time yet.

…ence update to go forward is to *remove the constraint on the maximum number of isoforms*. The only check will be whether a transcript meets the minimum score threshold.

olekto · 2019-08-23T09:26:10Z

This sounds great. I was just looking for something like this. Presently, I have been using Mikado to merge different transcript evidence, and then used that further in Funannotate together with mapped proteins (with GenomeThreader) and let Funannotate control training of Augustus (funannotate predict). However, I do believe that I have lost UTRs after this. Funannotate has its own method of updating the annotation using Kallisto to find the best transcripts and PASA and EVM to add the UTRs. This is quite cumbersome, and I have not gotten it to work properly.

If I can use Mikado to add UTRs to the annotation from Funannotate, that would be perfect. Is it usable yet?

Thank you.

Ole

lucventurini · 2019-08-23T11:55:25Z

Hi Ole, it should be. I am currently on holiday, but next week I can write out the detailed instructions (I need to do it anyway for the documentation).

It's not perfect or well tested yet so I would really appreciate the third party testing!

swarbred · 2019-08-23T14:39:22Z

Hi @lucventurini

We definitely have a use for updating a reference annotation adding additional splice variants and extending UTRs. Marking a set of transcripts as "reference" allowed us to incorporate additional splice variants (SV), but we did this in a way that didn't require many changes to mikado (to save you work). So when we want to add SV to an exiting annotation we do the following

split the reference annotation into two files, file (a) with just the primary models and file (b) with the non primary models.
in the list.txt file both file a and b are marked as reference (note I see in mikado configure --help we call this field always_keep rather than reference, which might confuse as it only really means alwasy keep for the prepare step). Any other models (which you want to add SV from) are not reference.
Also in the list.txt file you need to add a score against file a and file b, file a needs to be higher than b (e.g. 1000 and 500), this way you ensure that the same model is selected by pick as the primary, and that one of your reference models wins out at the pick stage
set min_score_perc in the config to 0.0 otherwise you will exclude all other SV (given the high base scores we gave each of our two reference transcript files)
run pick with the --only-reference-update enabled

This doesn't guarantee all reference transcripts are carried through but all the primary models (file a) should be present.

So this kind of works but it is convoluted. The "reference" assignment was there to ensure models dont get filtered prior to the pick and there are times when you want to assign a set of transcripts as reference but not use the --only-reference-update function i.e. if you want to update and add to a set of models.

On the UTR update

"I amended this so that this automatic addition would happen only when executing a "reference-update"

That might be ok but there might be situations where you either want to do this without --only-reference-update being enabled or not do when that is enabled. I would consider a separate flag for the UTR behaviour (and good documentation to explain all this).

"The problem with the approach is that implemented like this, we risk losing bunches of ASEs as Mikado will integrate all minor variants of transcripts into the locus ... and there is a hard limit on the number of ASEs."

I don't really follow this, shouldn't the limit on ASE be applied after your nF1 == 100 filter.

Making some minor tweeks to allow some form of UTR extension is useful, however to be honest there would probably be a better way to do the SV addition and UTR extension of an existing annotation i.e. a separate mikado tool but obviously thats a lot more work.

lucventurini · 2019-08-26T20:15:04Z

I don't really follow this, shouldn't the limit on ASE be applied after your nF1 == 100 filter.

No, as the logic is written, the maximum number of isoforms is used as a cutoff before we even do the padding. It is basically a switch that says to Mikado "just stop adding" once a critical number of isoforms has been reached. It is not the most elegant solution, hence why I suggested maybe removing it.

Making some minor tweeks to allow some form of UTR extension is useful, however to be honest there would probably be a better way to do the SV addition and UTR extension of an existing annotation i.e. a separate mikado tool but obviously thats a lot more work.

One that could still be done with the current infrastructure of the program, but yes, it would probably take a couple of weeks of dedicated programming to get it properly right.

What would your preferred course of action with the changes I put in this branch? There are some (regarding making the construction of Mikado transcript graphs faster) that I think should go into the main branch no matter what, but the rest is a bit more of a grey area.

…ure that there will be at most X isoforms in the locus, while doing all the padding.

lucventurini · 2019-09-13T13:07:38Z

Hi @swarbred, @gemygk:

I have merged most of the changes back into master. I have tested extensively and they seem to function.
I am now testing (see 6c350ec) a change in the definition of alternative splicing which will allow to pad, remove redundancies, and still keep a meaningful number of maximum isoforms. See the commit for the changes. The caveat is that the threshold for keeping or discarding a transcript from the final gene is now calculated when all candidates are present, only once. Before, the threshold was recalculated every time (see the commit for details).

The caveat means that there might be instances where the percentage threshold is not respected, in the final version of the locus.

lucventurini · 2019-09-13T18:48:45Z

To clarify, Mikado does not ensure that the primary transcript has a stable score during the AS selection stage.
To the contrary, both with the previous method and the current, there are instances where the primary transcript gets superseded by other transcripts in the locus for the final scoring. This is due to the fact that of course there are many metrics that hugely depend on the transcripts that are present in the locus at a given time (e.g. proportion_verified_introns_in_locus) and which therefore will vary as we add or remove transcripts from a locus.
Even for stable metrics, the associated score will change (e.g. for scoring on the transcript with the greatest number of transcripts).
Calculating all possible combinations would probably be prohibitive, therefore Mikado uses as criterion for the addition only the initial scores.

* This should put a nail into #208: now we can ensure that there will be at most X isoforms in the locus, while doing all the padding. * BF for the `pick` command line. * Bfix for previous commit. * Solved a bug which deleted attributes from transcripts when serialising them.

swarbred · 2019-10-03T10:16:13Z

@lucventurini just to comment while I have this in my head if we want to "fully" enable updates to reference annotation ie new splice variants and UTR extensions then we might need to look at allowing some non-valid models

This ties in with #226

i.e. if you wanted to update say the downloaded human ensembl gene file you would find many models that will not meet mikados validation i.e. pseudogenes with annotated CDS containing stops , annotation based on haplotypes again containing stops or other features annotated in ways that mikado would not except.

Currently mikado prepare would clean these up (which is obviously desirable in a normal use case and I assume still true even if the models are marked as reference) but in the context of wanting to update an existing annotation probably not desirable.

So potentially you could have a lenient mode that for prepare, stats, pick etc that passes through these models unchanged. For the pick I guess the working solution would be to ignore any loci containing these transcripts i.e. no updates are made i.e. it's a set of loci that for pick purposes and the generation of metrics are simply ignored.

I'm not suggesting we do this as I realise it's likely not straight forward but just commenting as you will see more issues like issues/226 if people use mikado with existing annotation.

lucventurini · 2019-10-03T14:32:36Z

Hi @swarbred ,
yes, I get your point.
Regarding the issue in #226, actually, paradoxically mikado prepare would not have a problem with that ... at that stage the program only collects exon/CDS/UTR features. So a model like the one causing the issue there (ie having only exons and pseudogene, but no transcript, defined) would probably pass through fine.

olekto · 2020-03-27T13:52:37Z

Hi @lucventurini.

I see that you've changed the milestone to 2.1 a month ago or so. I guess this means that there is no straight-forward way of just adding UTRs to existing gene models (and only to valid models for instance, ignoring all non-valid).

Thank you.

lucventurini · 2021-01-08T10:42:56Z

I think this is now completely integrated in Mikado 2, due to progress on other issues.

This PR covers quite a bit of ground. General improvements - Now the superlocus class has been revamped a bit: - made the definition of transcript graphs in superloci a O(nlogn) rather than O(n**2) algorithm for multiexonic transcripts. - removed the third method to reduce complex loci - rewrote for speed the method one to reduce complex loci - now both reduction methods will consider whether a transcript is reference Issue EI-CoreBioinformatics#208 Now, when doing a reference update, the locus class will automatically consider redundant class codes for padding (n, =, _). After padding, completely redundant models (nF1 == 100%) will be pruned. Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one, and lose all ASEs. This aspect will require careful thought. Mixed - Increased unit test coverage - added a proper .coveragerc file to avoid considering certain files (e.g. all of the test files) - Solved a bug in the expansion of transcripts (template having a non-terminal exon with the same start coordinate of the last exon of the transcript to expand) - Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled. - Python 3.7.4 does **not** produce valid cythonised files. Blacklisted it in TRAVIS and for Conda.

…lass codes when doing a reference update with padding. This is not ready for prime time yet.

* This should put a nail into EI-CoreBioinformatics#208: now we can ensure that there will be at most X isoforms in the locus, while doing all the padding. * BF for the `pick` command line. * Bfix for previous commit. * Solved a bug which deleted attributes from transcripts when serialising them.

lucventurini added enhancement question EI-Internal labels Aug 9, 2019

lucventurini added this to the 2.0 milestone Aug 9, 2019

lucventurini assigned cschu, swarbred, gemygk and lucventurini Aug 9, 2019

lucventurini changed the title ~~Adding UTRs to transcripts in the reference update mode~~ [Adding UTRs to transcripts in the reference update mode](https://github.com/lucventurini/mikado/tree/issue-208) Aug 9, 2019

lucventurini changed the title ~~[Adding UTRs to transcripts in the reference update mode](https://github.com/lucventurini/mikado/tree/issue-208)~~ Adding UTRs to transcripts in the reference update mode Aug 9, 2019

lucventurini added a commit that referenced this issue Aug 16, 2019

#208: disabled automatic addition of redundant class codes when doing…

09e507c

… a reference update with padding. This is not ready for prime time yet.

lucventurini added a commit to lucventurini/mikado that referenced this issue Sep 13, 2019

This should put a nail into EI-CoreBioinformatics#208: now we can ens…

6c350ec

…ure that there will be at most X isoforms in the locus, while doing all the padding.

lucventurini modified the milestones: 2.0, 2.1 Sep 30, 2019

lucventurini modified the milestones: 2.1, 2.x Feb 27, 2020

lucventurini mentioned this issue Oct 14, 2020

intron fraction is null #324

Closed

lucventurini closed this as completed Jan 8, 2021

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

EI-CoreBioinformatics#208: disabled automatic addition of redundant c…

da5b7d5

…lass codes when doing a reference update with padding. This is not ready for prime time yet.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding UTRs to transcripts in the reference update mode #208

Adding UTRs to transcripts in the reference update mode #208

lucventurini commented Aug 9, 2019

lucventurini commented Aug 9, 2019

lucventurini commented Aug 9, 2019 •

edited

Loading

olekto commented Aug 23, 2019

lucventurini commented Aug 23, 2019

swarbred commented Aug 23, 2019

lucventurini commented Aug 26, 2019

lucventurini commented Sep 13, 2019

lucventurini commented Sep 13, 2019

swarbred commented Oct 3, 2019

lucventurini commented Oct 3, 2019

olekto commented Mar 27, 2020

lucventurini commented Jan 8, 2021

Adding UTRs to transcripts in the reference update mode #208

Adding UTRs to transcripts in the reference update mode #208

Comments

lucventurini commented Aug 9, 2019

lucventurini commented Aug 9, 2019

lucventurini commented Aug 9, 2019 • edited Loading

olekto commented Aug 23, 2019

lucventurini commented Aug 23, 2019

swarbred commented Aug 23, 2019

lucventurini commented Aug 26, 2019

lucventurini commented Sep 13, 2019

lucventurini commented Sep 13, 2019

swarbred commented Oct 3, 2019

lucventurini commented Oct 3, 2019

olekto commented Mar 27, 2020

lucventurini commented Jan 8, 2021

lucventurini commented Aug 9, 2019 •

edited

Loading