Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding UTRs to transcripts in the reference update mode #208

Closed
lucventurini opened this issue Aug 9, 2019 · 12 comments
Closed

Adding UTRs to transcripts in the reference update mode #208

lucventurini opened this issue Aug 9, 2019 · 12 comments

Comments

@lucventurini
Copy link
Collaborator

In #207, @fmarletaz asked me whether we have a way to add UTRs to transcripts derived from eg. Augustus. This made me realise that currently this is not the case, not even in reference update mode.

The problem here is that Mikado explicitly disallows having transcripts that are identical inside a locus (ie.: the = and n class codes are explicitly forbidden as valid ASE class codes in pick). This means that, in the not-so-hypothetical scenario in which we have a perfectly supported ab initio transcript lacking UTR, and its companion cDNA/assembly from RNASeq which would add the UTR itself, we will not have a method to add it.

Moreover, at the moment when doing the reference update, the only thing that happens is that Mikado will discard any locus that has no reference transcript. In subsequent stages, however, there is no guarantee that those transcripts will pass through at all.

There are ways to work around this (e.g.: avoid doing a proper mikado run and use instead mikado compare to drive an ancillary script to expand transcripts) but I would like to hear your opinions on the matter.

@lucventurini lucventurini added this to the 2.0 milestone Aug 9, 2019
lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 9, 2019
…. So far:

- Now "=", "_" and "n" will be automatically added to the valid class codes if padding is activated
- the Locus class will remove redundancies (ie transcripts that are strictly equal) after padding
- added three tests to make sure that the class behaves as expected
Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief
given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one,
and lose all ASEs. This aspect will require careful thought.
@lucventurini
Copy link
Collaborator Author

This issue will be worked on in the branch issue-208.

@lucventurini lucventurini changed the title Adding UTRs to transcripts in the reference update mode [Adding UTRs to transcripts in the reference update mode](https://github.com/lucventurini/mikado/tree/issue-208) Aug 9, 2019
@lucventurini lucventurini changed the title [Adding UTRs to transcripts in the reference update mode](https://github.com/lucventurini/mikado/tree/issue-208) Adding UTRs to transcripts in the reference update mode Aug 9, 2019
@lucventurini
Copy link
Collaborator Author

lucventurini commented Aug 9, 2019

In the aa1687e commit for the branch, I started working on this. Basically, in this version, whenever padding is activated normally "redundant" transcripts (=, _, n) will be considered valid ASEs. In ec31f37, I amended this so that this automatic addition would happen only when executing a "reference-update" run.

Furthermore, after padding, for any couple of identical transcripts (to the base, ie nF1 == 100, given that we are talking about padded transcripts), only one will be kept (reference transcripts and higher score transcripts first).

The problem with the approach is that implemented like this, we risk losing bunches of ASEs as Mikado will integrate all minor variants of transcripts into the locus ... and there is a hard limit on the number of ASEs. I think, in general, that they would be useful as they allow to properly integrate RNASeq and ab initio prediction, with UTR expansion and all--even though this is not ready for prime time.

lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 9, 2019
…cally added to the valid ASE codes **only when executing a reference-update run**. This should prevent runaway behaviour in normal runs.
lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 12, 2019
…s#183

In this pull request, there are multiple enhancements to the code base. This does **not** close EI-CoreBioinformatics#208.

General improvements:
- The superlocus algorithms have been revamped:
  - the transcript graph definition should now have O(nlogn) complexity, rather than O(n^2). This should reduce a lot runtimes and even potentially memory usage in complex loci.
  - removed the "reduce_method_three" method for complex loci, which was untested and probably hardly ever used. The other two remaining methods have been updated and properly tested.
  - The first reduction method, in particular, should now be faster as well.

For EI-CoreBioinformatics#208:
- Now "=", "_" and "n" will be automatically added to the valid class codes **only when executing a reference-update run**. This should prevent runaway behaviour in normal runs.
- the Locus class will remove redundancies (ie transcripts that are strictly equal) after padding

Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief
given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one,
and lose all ASEs. This aspect will require careful thought.

Various:
- Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled.
- Solved a bug in the expansion of transcripts (template having a non-terminal exon with the same start coordinate of the last exon of the transcript to expand)
- Python 3.7.4 does **not** produce functional .so files with cythonize. Blacklisting it.
- Added and configured a proper .coveragerc file; fixed .travis.yml and the environment YAML file
lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 12, 2019
This PR covers quite a bit of ground.

General improvements
- Now the superlocus class has been revamped a bit:
  - made the definition of transcript graphs in superloci a O(nlogn) rather than O(n**2) algorithm for multiexonic transcripts.
  - removed the third method to reduce complex loci
  - rewrote for speed the method one to reduce complex loci
  - now both reduction methods will consider whether a transcript is reference

Issue EI-CoreBioinformatics#208
Now, when doing a reference update, the locus class will automatically consider redundant class codes for padding (n, =, _). After padding, completely redundant models (nF1 == 100%) will be pruned.
Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief
given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one, and lose all ASEs. This aspect will require careful thought.

Mixed
- Increased unit test coverage
- added a proper .coveragerc file to avoid considering certain files (e.g. all of the test files)
- Solved a bug in the expansion of transcripts (template having a non-terminal exon with the same start coordinate of the last exon of the transcript to expand)
- Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled.
- Python 3.7.4 does **not** produce valid cythonised files. Blacklisted it in TRAVIS and for Conda.
lucventurini added a commit that referenced this issue Aug 16, 2019
… a reference update with padding. This is not ready for prime time yet.
lucventurini added a commit to lucventurini/mikado that referenced this issue Aug 18, 2019
…ence update to go forward is to *remove the constraint on the maximum number of isoforms*. The only check will be whether a transcript meets the minimum score threshold.
@olekto
Copy link

olekto commented Aug 23, 2019

This sounds great. I was just looking for something like this. Presently, I have been using Mikado to merge different transcript evidence, and then used that further in Funannotate together with mapped proteins (with GenomeThreader) and let Funannotate control training of Augustus (funannotate predict). However, I do believe that I have lost UTRs after this. Funannotate has its own method of updating the annotation using Kallisto to find the best transcripts and PASA and EVM to add the UTRs. This is quite cumbersome, and I have not gotten it to work properly.

If I can use Mikado to add UTRs to the annotation from Funannotate, that would be perfect. Is it usable yet?

Thank you.

Ole

@lucventurini
Copy link
Collaborator Author

Hi Ole, it should be. I am currently on holiday, but next week I can write out the detailed instructions (I need to do it anyway for the documentation).

It's not perfect or well tested yet so I would really appreciate the third party testing!

@swarbred
Copy link
Collaborator

Hi @lucventurini

We definitely have a use for updating a reference annotation adding additional splice variants and extending UTRs. Marking a set of transcripts as "reference" allowed us to incorporate additional splice variants (SV), but we did this in a way that didn't require many changes to mikado (to save you work). So when we want to add SV to an exiting annotation we do the following

  1. split the reference annotation into two files, file (a) with just the primary models and file (b) with the non primary models.
  2. in the list.txt file both file a and b are marked as reference (note I see in mikado configure --help we call this field always_keep rather than reference, which might confuse as it only really means alwasy keep for the prepare step). Any other models (which you want to add SV from) are not reference.
  3. Also in the list.txt file you need to add a score against file a and file b, file a needs to be higher than b (e.g. 1000 and 500), this way you ensure that the same model is selected by pick as the primary, and that one of your reference models wins out at the pick stage
  4. set min_score_perc in the config to 0.0 otherwise you will exclude all other SV (given the high base scores we gave each of our two reference transcript files)
  5. run pick with the --only-reference-update enabled

This doesn't guarantee all reference transcripts are carried through but all the primary models (file a) should be present.

So this kind of works but it is convoluted. The "reference" assignment was there to ensure models dont get filtered prior to the pick and there are times when you want to assign a set of transcripts as reference but not use the --only-reference-update function i.e. if you want to update and add to a set of models.

On the UTR update

"I amended this so that this automatic addition would happen only when executing a "reference-update"

That might be ok but there might be situations where you either want to do this without --only-reference-update being enabled or not do when that is enabled. I would consider a separate flag for the UTR behaviour (and good documentation to explain all this).

"The problem with the approach is that implemented like this, we risk losing bunches of ASEs as Mikado will integrate all minor variants of transcripts into the locus ... and there is a hard limit on the number of ASEs."

I don't really follow this, shouldn't the limit on ASE be applied after your nF1 == 100 filter.

Making some minor tweeks to allow some form of UTR extension is useful, however to be honest there would probably be a better way to do the SV addition and UTR extension of an existing annotation i.e. a separate mikado tool but obviously thats a lot more work.

@lucventurini
Copy link
Collaborator Author

I don't really follow this, shouldn't the limit on ASE be applied after your nF1 == 100 filter.

No, as the logic is written, the maximum number of isoforms is used as a cutoff before we even do the padding. It is basically a switch that says to Mikado "just stop adding" once a critical number of isoforms has been reached. It is not the most elegant solution, hence why I suggested maybe removing it.

Making some minor tweeks to allow some form of UTR extension is useful, however to be honest there would probably be a better way to do the SV addition and UTR extension of an existing annotation i.e. a separate mikado tool but obviously thats a lot more work.

One that could still be done with the current infrastructure of the program, but yes, it would probably take a couple of weeks of dedicated programming to get it properly right.

What would your preferred course of action with the changes I put in this branch? There are some (regarding making the construction of Mikado transcript graphs faster) that I think should go into the main branch no matter what, but the rest is a bit more of a grey area.

lucventurini added a commit to lucventurini/mikado that referenced this issue Sep 13, 2019
…ure that there will be at most X isoforms in the locus, while doing all the padding.
@lucventurini
Copy link
Collaborator Author

Hi @swarbred, @gemygk:

  • I have merged most of the changes back into master. I have tested extensively and they seem to function.
  • I am now testing (see 6c350ec) a change in the definition of alternative splicing which will allow to pad, remove redundancies, and still keep a meaningful number of maximum isoforms. See the commit for the changes. The caveat is that the threshold for keeping or discarding a transcript from the final gene is now calculated when all candidates are present, only once. Before, the threshold was recalculated every time (see the commit for details).

The caveat means that there might be instances where the percentage threshold is not respected, in the final version of the locus.

@lucventurini
Copy link
Collaborator Author

To clarify, Mikado does not ensure that the primary transcript has a stable score during the AS selection stage.
To the contrary, both with the previous method and the current, there are instances where the primary transcript gets superseded by other transcripts in the locus for the final scoring. This is due to the fact that of course there are many metrics that hugely depend on the transcripts that are present in the locus at a given time (e.g. proportion_verified_introns_in_locus) and which therefore will vary as we add or remove transcripts from a locus.
Even for stable metrics, the associated score will change (e.g. for scoring on the transcript with the greatest number of transcripts).
Calculating all possible combinations would probably be prohibitive, therefore Mikado uses as criterion for the addition only the initial scores.

lucventurini added a commit that referenced this issue Sep 16, 2019
* This should put a nail into #208: now we can ensure that there will be at most X isoforms in the locus, while doing all the padding.

* BF for the `pick` command line.

* Bfix for previous commit.

* Solved a bug which deleted attributes from transcripts when serialising them.
@lucventurini lucventurini modified the milestones: 2.0, 2.1 Sep 30, 2019
@swarbred
Copy link
Collaborator

swarbred commented Oct 3, 2019

@lucventurini just to comment while I have this in my head if we want to "fully" enable updates to reference annotation ie new splice variants and UTR extensions then we might need to look at allowing some non-valid models

This ties in with #226

i.e. if you wanted to update say the downloaded human ensembl gene file you would find many models that will not meet mikados validation i.e. pseudogenes with annotated CDS containing stops , annotation based on haplotypes again containing stops or other features annotated in ways that mikado would not except.

Currently mikado prepare would clean these up (which is obviously desirable in a normal use case and I assume still true even if the models are marked as reference) but in the context of wanting to update an existing annotation probably not desirable.

So potentially you could have a lenient mode that for prepare, stats, pick etc that passes through these models unchanged. For the pick I guess the working solution would be to ignore any loci containing these transcripts i.e. no updates are made i.e. it's a set of loci that for pick purposes and the generation of metrics are simply ignored.

I'm not suggesting we do this as I realise it's likely not straight forward but just commenting as you will see more issues like issues/226 if people use mikado with existing annotation.

@lucventurini
Copy link
Collaborator Author

Hi @swarbred ,
yes, I get your point.
Regarding the issue in #226, actually, paradoxically mikado prepare would not have a problem with that ... at that stage the program only collects exon/CDS/UTR features. So a model like the one causing the issue there (ie having only exons and pseudogene, but no transcript, defined) would probably pass through fine.

@lucventurini lucventurini modified the milestones: 2.1, 2.x Feb 27, 2020
@olekto
Copy link

olekto commented Mar 27, 2020

Hi @lucventurini.

I see that you've changed the milestone to 2.1 a month ago or so. I guess this means that there is no straight-forward way of just adding UTRs to existing gene models (and only to valid models for instance, ignoring all non-valid).

Thank you.

@lucventurini
Copy link
Collaborator Author

I think this is now completely integrated in Mikado 2, due to progress on other issues.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
This PR covers quite a bit of ground.

General improvements
- Now the superlocus class has been revamped a bit:
  - made the definition of transcript graphs in superloci a O(nlogn) rather than O(n**2) algorithm for multiexonic transcripts.
  - removed the third method to reduce complex loci
  - rewrote for speed the method one to reduce complex loci
  - now both reduction methods will consider whether a transcript is reference

Issue EI-CoreBioinformatics#208
Now, when doing a reference update, the locus class will automatically consider redundant class codes for padding (n, =, _). After padding, completely redundant models (nF1 == 100%) will be pruned.
Still problematic: adding "=", "_" and "n" as valid class codes could theoretically cause grief
given the hard limit on the number of transcripts per locus. Briefly, we risk only adding minor variants of the same transcript, uniform them into a single one, and lose all ASEs. This aspect will require careful thought.

Mixed
- Increased unit test coverage
- added a proper .coveragerc file to avoid considering certain files (e.g. all of the test files)
- Solved a bug in the expansion of transcripts (template having a non-terminal exon with the same start coordinate of the last exon of the transcript to expand)
- Solved a serious bug in `mikado prepare` - in multiprocessing mode, the CDS was not appropriately kept when strip-cds was disabled.
- Python 3.7.4 does **not** produce valid cythonised files. Blacklisted it in TRAVIS and for Conda.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…lass codes when doing a reference update with padding. This is not ready for prime time yet.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
* This should put a nail into EI-CoreBioinformatics#208: now we can ensure that there will be at most X isoforms in the locus, while doing all the padding.

* BF for the `pick` command line.

* Bfix for previous commit.

* Solved a bug which deleted attributes from transcripts when serialising them.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants