
can mikado serialise use multi-CPU? #177

Closed
shenweima opened this issue May 26, 2019 · 12 comments

@shenweima

I have an ORF BED file containing 141,655,528 lines. I found that mikado serialise is very slow and uses only a single CPU when importing data into the SQLite3 database. Can this step use multiple CPUs?
This is my command:
mikado serialise -p 12 --start-method spawn --json-conf configuration.yaml --xml mikado_prepared_blastx.xml.gz --orfs mikado_prepared_orfs.bed --blast_targets /data2/user_data/blastx/plant_protein.fa --max_target_seqs 6

@lucventurini
Collaborator

Dear @shenweima,
Unfortunately, no. I have only ever imported BED files much smaller than that, so that section never posed much of a problem. Internally, Mikado checks each BED entry against the corresponding transcript (to determine whether the ORF is complete, and whether it would be advisable to shrink it to find a Met codon). That, together with the lack of multiprocessing, causes the slowness.

I will have a look at how difficult it would be to implement this and will get back through this issue.
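As a toy illustration of the per-entry check mentioned above (this is not Mikado's actual code, just the idea): if an ORF does not begin with a start codon, one can scan the in-frame codons downstream and shrink the ORF so it starts at the first ATG, if any.

```python
# Toy sketch of the "shrink to find a Met codon" check described above.
# NOT Mikado's implementation: scan in-frame codons and trim the ORF so
# that it begins at the first ATG, returning None if there is none.
def shrink_to_met(cds: str):
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3].upper() == "ATG":
            return cds[i:]
    return None  # no in-frame start codon found

print(shrink_to_met("CCCATGAAATGA"))  # ATGAAATGA
```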

@lucventurini lucventurini self-assigned this May 26, 2019
@lucventurini lucventurini added this to the 1.5 milestone May 26, 2019
@shenweima
Author

shenweima commented May 26, 2019

Thanks. Actually, I have about two thousand RNA-Seq datasets, and I have been running the Mikado pipeline for about half a year.

@lucventurini
Collaborator

Moving this issue to the next release. I fully intend to fix this, but unfortunately it cannot hold up the already very belated version 1.5. I will dedicate myself to it as soon as possible.

@lucventurini lucventurini modified the milestones: 1.5, 1.6 Jun 6, 2019
lucventurini added a commit that referenced this issue Jun 18, 2019
* This should address #173 (both configuration file and docs) and #158

* Fix #181 and small bug fix for parsing Mikado annotations.

* Progress for #142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.

* Fixed previous commit (always for #142)

* #142: corrected and tested the issue with one-off exons, for padding.

* This should fix and test #142 for good.

* Removed spurious warning/error messages

* #142: solved a bug which caused truncated transcripts at the 5' end not to be padded.

* #142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.

* #142: fixing previous commit

* Pushing the fix for #182 onto the development branch

* Fix #183

* Fix #183 and previous commit

* #183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.

* #177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.

* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.
lucventurini added a commit that referenced this issue Jun 18, 2019
* Solved a small bug in the Gene class

* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue #166).

* Updated the CHANGELOG.

* Slight improvements to the generic GFLine class and to the to_gff wrapper

* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.

* Now Mikado util stats will only return one value for the mode, making the table parsable

* Solved some small bugs introduced by changing the mode for mikado util stats

* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.

* Updating the conda environment to reflect that only Python>=3.6 is now accepted

* Various fixes for managing correctly BED12 files.

* Fix for the previous commit breaking TRAVIS

* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.

* Fixed previous commit

* Fixed travis bug

* Refactoring of check_index for Mikado compare (#166) and fix for #172

* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover

* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (#137) potentially also fixing #172.

* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs

* Fixed previous breakage

* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.

* Minor edit to assigner

* Fixing previously broken commit

* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.

* Adding the GZI index to the tests directory to avoid permission errors. Addressing #175

* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)

* Adding a maximum intron length for the default scoring configuration files.

* BROKEN. Proceeding on #142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.

* Issue #174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.

* #174: this should provide a solution to the issue, which is however only temporary. To be tested.

* #174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.

* #174: peppered the failing block with try-except statements.

* #174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.

* Fixed #176

* BROKEN. Progress on #142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**

* Closing #155.

* #174: Now Mikado pick will die informatively if the SQLite3 database has not been found.

* #166: fixed some issues with self-compare

* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.

* The padding now should be tested and correct.

* Fixed previous commit. This should fix #142.

* Development (#178)

* Update Singularity.centos.def

Changed python to python3 during %post, otherwise it will use the system python2.7...

* Fixed small bug in external metrics handling

* Update Singularity.centos.def

* Development (#184)

lucventurini added a commit that referenced this issue Jun 19, 2019
@benhg

benhg commented Jul 27, 2019

I would be happy to take a stab at implementing this, if it would be of use.

@lucventurini
Collaborator

@benhg , please, be my guest! The relevant code should be here:

https://github.com/lucventurini/mikado/blob/master/Mikado/serializers/orf.py

@benhg

benhg commented Aug 5, 2019

I think I've got something working... Is there some data I can use to test with? Unfortunately, I'm not super familiar with the codebase.
Thanks,
Ben

@lucventurini
Collaborator

Hi @benhg, you can use the public data for the article. You can find it [here](https://figshare.com/articles/Input_assemblies/5688016). I would be curious, of course, to see the changes you made!

@benhg

benhg commented Aug 5, 2019

I will push it to my fork (and open a PR) as soon as I can verify it works! I'm too self-conscious to push an untested version, haha!

@lucventurini
Collaborator

lucventurini commented Sep 4, 2019

Update: after testing on the data for the original article, it is evident that the bulk of the time spent serialising the ORFs goes into calling the BioPython Bio.Seq.Seq.translate method.
This accounts for half of the total time the program spends serialising the ORFs.
As this is a third-party module, there are only two alternatives:

  • Use multiprocessing to speed things up. Due to locking, inter-process communication, etc., this will most probably yield considerably less-than-linear speed-ups as cores are added.
  • Replace the method with something more performant.

I attach the .stats file (generated with cProfile and visualised with SnakeViz).

orfs.stats.zip
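For the second alternative, a plain dictionary lookup over the standard genetic code is one lightweight option (a sketch only, not necessarily what Mikado ended up using): it avoids the per-call object overhead of Bio.Seq.Seq.translate.

```python
from itertools import product

# Standard genetic code built from the classic TCAG ordering; '*' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cds: str) -> str:
    """Translate an in-frame CDS; ambiguous/unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - len(cds) % 3, 3))

print(translate("ATGGCCTAA"))  # MA*
```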

lucventurini added a commit to lucventurini/mikado that referenced this issue Sep 6, 2019
…improving the XML serialisation and using a more performant function for translating CDS into protein sequences. This should hopefully make Mikado serialise capable of dealing with large ORF databases.
lucventurini added a commit to lucventurini/mikado that referenced this issue Sep 6, 2019
… reaching the limit of what can be achieved with pure Python code.
lucventurini added a commit to lucventurini/mikado that referenced this issue Sep 8, 2019
@lucventurini
Collaborator

Dear @shenweima, the latest commit introduces proper multiprocessing for serialising ORFs. I expect it to be somewhat more memory-hungry, since it uses a queue, but it should still be very lightweight.

Tomorrow I'll test it with a properly challenging dataset. If it works, I'll close the issue.
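The queue-based layout described above can be sketched roughly as follows (function names are hypothetical, not Mikado's actual code): worker processes pull ORF records from a task queue and push results to a second queue, so only one process ever feeds the database and memory stays bounded.

```python
import multiprocessing as mp

def check_orf(line: str) -> str:
    # Stand-in (hypothetical) for the real per-ORF validation work.
    return line.strip().upper()

def worker(tasks, results):
    while True:
        line = tasks.get()
        if line is None:        # poison pill: shut this worker down
            break
        results.put(check_orf(line))

def serialise(lines, procs: int = 2):
    tasks, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(tasks, results))
               for _ in range(procs)]
    for w in workers:
        w.start()
    for line in lines:          # feed the work queue
        tasks.put(line)
    for _ in workers:           # one pill per worker
        tasks.put(None)
    out = [results.get() for _ in lines]
    for w in workers:
        w.join()
    return out

if __name__ == "__main__":      # multiprocessing needs the main guard
    print(sorted(serialise(["orf1", "orf2", "orf3"])))
```

Results arrive in nondeterministic order, hence the `sorted()` in the demo; real code would carry an identifier with each record.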

@lucventurini
Collaborator

Dear @shenweima, I can confirm that the new method works. Specifically, mikado serialise can now load 954,447 ORFs in about 3 minutes using 28 parallel processes; at least one minute of this was spent loading the 696,889 transcripts into the "query" table.
Memory was also manageable, with a peak of about 8 GB.
I will close the issue; it will be completely solved once I merge everything back into master.

Again, thank you for reporting this and for prodding me to solve a long-standing inefficiency.
@benhg, many thanks for trying to solve this. If you would like to see my final implementation, please see this commit: 335e6a4. Please ignore the changes to Mikado/parsers/__init__.py; I reverted them as inefficient. The rest stands.

lucventurini added a commit that referenced this issue Sep 11, 2019
Improvements in this commit:
* Daijin can now use conda environments (provided in a system directory). This should allow for more reproducible runs.
* Switched from the unsupported `ujson` to the equally rapid and maintained `rapidjson`.
* Cleaned up the configuration file a little.
* The subloci output in `pick` now has to be **explicitly** requested by the user. Producing it is an expensive operation in `mikado pick`, and it should therefore not be on by default.
* Speed-ups in `mikado serialise` (closes #177):
  - XML serialising improved
  - ORF translation improved
  - ORF serialisation is now completely parallelised
* Speed-ups in `mikado pick`:
  - Introduced a class method for printing GFAnnotation objects, so as to avoid building a whole GtfLine/GffLine object when printing transcripts.
  - In the main process, the GTF file is parsed using a novel, ad-hoc, lighter parsing method. This prevents the parsing from being the main bottleneck. Likewise, transcripts are now sent to the inter-process database in batches, rather than one at a time: the `commit` operation is too expensive to be performed per transcript.
  - `mikado pick` now has fully parallelised printing of the output files. This prevents the last stage (the file merging) from becoming the bottleneck.
* In Daijin, BLAST chunks will now always be at least equal to the number of threads (to ensure proper parallelism)
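On the batching point in particular, the cost difference between committing per row and committing per batch is easy to reproduce with plain `sqlite3` (the table and column names below are made up for illustration):

```python
import sqlite3

def load(rows, batch_size=None):
    """Insert rows, committing per row (batch_size=None) or per batch."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tx (name TEXT, score REAL)")
    if batch_size is None:
        for row in rows:                       # one commit per transcript
            con.execute("INSERT INTO tx VALUES (?, ?)", row)
            con.commit()
    else:
        for i in range(0, len(rows), batch_size):
            con.executemany("INSERT INTO tx VALUES (?, ?)",
                            rows[i:i + batch_size])
            con.commit()                       # one commit per batch
    n = con.execute("SELECT COUNT(*) FROM tx").fetchone()[0]
    con.close()
    return n

rows = [("tx_%d" % i, float(i)) for i in range(5000)]
print(load(rows), load(rows, batch_size=500))  # 5000 5000
```

On an on-disk database each commit forces a journal sync, so the per-row version is dramatically slower; batching amortises that fixed cost across many inserts.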
@shenweima
Author

Thanks very much. I will test it on my dataset.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
* Solved a small bug in the Gene class

* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue EI-CoreBioinformatics#166).

* Updated the CHANGELOG.

* Slight improvements to the generic GFLine class and to the to_gff wrapper

* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.

* Now Mikado util stats will only return one value for the mode, making the table parsable

* Solved some small bugs introduced by changing the mode for mikado util stats

* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.

* Updating the conda environment to reflect that only Python>=3.6 is now accepted

* Various fixes for managing correctly BED12 files.

* Fix for the previous commit breaking TRAVIS

* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.

* Fixed previous commit

* Fixed travis bug

* Refactoring of check_index for Mikado compare (EI-CoreBioinformatics#166) and fix for EI-CoreBioinformatics#172

* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover

* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (EI-CoreBioinformatics#137) potentially also fixing EI-CoreBioinformatics#172.

* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs

* Fixed previous breakage

* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.

* Minor edit to assigner

* Fixing previously broken commit

* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.

* Adding the GZI index to the tests directory to avoid permission errors. Addressing EI-CoreBioinformatics#175

* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)

* Adding a maximum intron length for the default scoring configuration files.

* BROKEN. Proceeding on EI-CoreBioinformatics#142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.

* Issue EI-CoreBioinformatics#174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.

* EI-CoreBioinformatics#174: this should provide a solution to the issue, which is however only temporary. To be tested.

* EI-CoreBioinformatics#174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.

* EI-CoreBioinformatics#174: peppered the failing block with try-except statements.

* EI-CoreBioinformatics#174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.

* Fixed EI-CoreBioinformatics#176

* BROKEN. Progress on EI-CoreBioinformatics#142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**

* Closing EI-CoreBioinformatics#155.

* EI-CoreBioinformatics#174: Now Mikado pick will die informatively if the SQLite3 database has not been found.

* EI-CoreBioinformatics#166: fixed some issues with self-compare

* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.

* The padding now should be tested and correct.

* Fixed previous commit. This should fix EI-CoreBioinformatics#142.

* Development (EI-CoreBioinformatics#178)

* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.

* Fixed previous commit

* Fixed travis bug

* Refactoring of check_index for Mikado compare (EI-CoreBioinformatics#166) and fix for EI-CoreBioinformatics#172

* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover

* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (EI-CoreBioinformatics#137) potentially also fixing EI-CoreBioinformatics#172.

* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs

* Fixed previous breakage

* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.

* Minor edit to assigner

* Fixing previously broken commit

* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.

* Adding the GZI index to the tests directory to avoid permission errors. Addressing EI-CoreBioinformatics#175

* Corrected some testing. Moreover, now Mikado supports the BED12+1 format (ie gffread --bed output)

* Adding a maximum intron length for the default scoring configuration files.

* BROKEN. Proceeding on EI-CoreBioinformatics#142. Now the padding algorithm is aware of where a transcript finishes (intron vs exon). Moreover, we need to change the data structure for padding to a *directional* graph and keep in mind the distance needed to pad a transcript, to solve ambiguous cases in a deterministic (rather than random) way.

* Issue EI-CoreBioinformatics#174: modification to the abstractlocus.py file, to try to solve the issue found by @cschuh.

* EI-CoreBioinformatics#174: this should provide a solution to the issue, which is however only temporary. To be tested.

* EI-CoreBioinformatics#174: making the implicit "for" cycle explicit. Hopefully this should help pinpoint the error better.

* EI-CoreBioinformatics#174: peppered the failing block with try-except statements.

* EI-CoreBioinformatics#174: this should solve it. Now missing external scores in the database will cause Mikado to explicitly fail.

* Fixed EI-CoreBioinformatics#176

* BROKEN. Progress on EI-CoreBioinformatics#142, the code runs, but the tests are broken. **This might be legitimate as we changed the behaviour of the code.**

* Closing EI-CoreBioinformatics#155.

* EI-CoreBioinformatics#174: Now Mikado pick will die informatively if the SQLite3 database has not been found.

* EI-CoreBioinformatics#166: fixed some issues with self-compare

* BROKEN. We have to verify that the padding functions also on the 5' end, but we need to make a new test for that. The test development is in progress.

* The padding now should be tested and correct.

* Fixed previous commit. This should fix EI-CoreBioinformatics#142.

* Update Singularity.centos.def

Changed python to python3 during %post, otherwise it will use the system python2.7...

* Fixed small bug in external metrics handling

* Update Singularity.centos.def

* Development (EI-CoreBioinformatics#184)

* This should address EI-CoreBioinformatics#173 (both configuration file and docs) and EI-CoreBioinformatics#158

* Fix EI-CoreBioinformatics#181 and small bug fix for parsing Mikado annotations.

* Progress for EI-CoreBioinformatics#142 - this should fix the wrong ORF calculation for cases when the CDS was open at the 5' end.

* Fixed previous commit (always for EI-CoreBioinformatics#142)

* EI-CoreBioinformatics#142: corrected and tested the issue with one-off exons, for padding.

* This should fix and test EI-CoreBioinformatics#142 for good.

* Removed spurious warning/error messages

* EI-CoreBioinformatics#142: solved a bug which caused truncated transcripts at the 5' end not to be padded.

* EI-CoreBioinformatics#142: solved a problem which caused a false abort for transcripts on the - strand with changed stop codon.

* EI-CoreBioinformatics#142: fixing previous commit

* Pushing the fix for EI-CoreBioinformatics#182 onto the development branch

* Fix EI-CoreBioinformatics#183

* Fix EI-CoreBioinformatics#183 and previous commit

* EI-CoreBioinformatics#183: now Mikado configure will set a seed when generating the configuration file. The seed will be explicitly mentioned in the log.

* EI-CoreBioinformatics#177: made ORF loading slightly faster with pysam. Also made XML serialisation much faster using SQL sessions and multiprocessing.Pool instead of queues.

* Solved annoying bug that caused Mikado to crash with TAIR GFF3s.
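The EI-CoreBioinformatics#177 bullet above credits much of the serialisation speed-up to swapping queue-based workers for a `multiprocessing.Pool`. Below is a minimal sketch of that pattern, assuming a hypothetical `serialise_chunk` worker and a simplified one-table schema; it is illustrative only, not Mikado's actual code:

```python
import multiprocessing as mp
import sqlite3

def serialise_chunk(records):
    # Hypothetical worker: turn a chunk of parsed records into row
    # tuples ready for a bulk INSERT. The CPU-bound work happens here,
    # in parallel across the pool's workers.
    return [(name, start, end) for (name, start, end) in records]

def serialise(records, db_path, procs=4, chunk_size=10000):
    # Split the input into chunks so each worker gets a sizeable unit
    # of work, instead of paying IPC costs per record as with queues.
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orf (name TEXT, start INT, end INT)")
    with mp.Pool(procs) as pool:
        # imap yields results as chunks complete; only the main process
        # touches the database, so no locking is required.
        for rows in pool.imap(serialise_chunk, chunks):
            conn.executemany("INSERT INTO orf VALUES (?, ?, ?)", rows)
    conn.commit()  # a single commit at the end, not one per row
    conn.close()
```

With queues, each worker pushes individual results through a pipe and a consumer inserts them one by one; the pool version amortises both the inter-process communication and the transaction overhead over whole chunks.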
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
* Solved a small bug in the Gene class

* This commit should fix some of the performance issues found in Mikado compare when testing in the all vs all (issue EI-CoreBioinformatics#166).

* Updated the CHANGELOG.

* Slight improvements to the generic GFLine class and to the to_gff wrapper

* Solved some assorted bugs, from stop_codon parsing in GTF2 (for Augustus) to avoiding a very costly pragma check on MIDX databases.

* Now Mikado util stats will only return one value for the mode, making the table parsable

* Solved some small bugs introduced by changing the mode for mikado util stats

* Dropping automated support for Python3.5. The conda environment cannot be created successfully, too many packages have not been updated in the original repositories.

* Updating the conda environment to reflect that only Python>=3.6 is now accepted

* Various fixes for managing correctly BED12 files.

* Fix for the previous commit breaking TRAVIS

* Switched to PySam for loading and fetching from genome files. Also, improved massively the speed of tests.

* Fixed previous commit

* Fixed travis bug

* Refactoring of check_index for Mikado compare (EI-CoreBioinformatics#166) and fix for EI-CoreBioinformatics#172

* Now Mikado will merge touching (NOT overlapping) exons coming from BED12 files. This should fix an issue with halLiftover

* This commit should fix a bunch of tests for when Mikado is installed with SUDO privileges (EI-CoreBioinformatics#137) potentially also fixing EI-CoreBioinformatics#172.

* Corrected a bug in the printing of transcriptomic BED12 files, corrected a bug in the serialisation of ORFs

* Fixed previous breakage

* Moved the code for checking the index into gene_dict. Also, now GeneDict allows access to positions as well.

* Minor edit to assigner

* Fixing previously broken commit

* Solving a bug which rendered the exclude_utr/protein_coding flags of mikado compare useless.

* Development (EI-CoreBioinformatics#178)

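One of the bullets above notes that Mikado now merges *touching* (not overlapping) exons coming from BED12 files, the halLiftover fix. In BED's half-open coordinates two blocks touch when the first ends exactly where the second starts; the function below is a sketch of the idea, not Mikado's implementation:

```python
def merge_touching_exons(exons):
    """Merge exons that touch (end of one == start of the next, in
    half-open BED coordinates), while leaving true gaps intact."""
    merged = []
    for start, end in sorted(exons):
        if merged and merged[-1][1] == start:  # touching, not overlapping
            merged[-1] = (merged[-1][0], end)  # extend the previous exon
        else:
            merged.append((start, end))
    return merged
```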
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
Improvements in this commit:
* Daijin can now use conda environments (provided in a system directory). This should allow for more reproducible runs.
* Switched from the unsupported `ujson` to the equally fast and actively maintained `rapidjson`
* Cleaning up a little bit the configuration file
* Now the subloci output in `pick` has to be **explicitly** requested by the user. Producing it is an expensive operation in `mikado pick`, and it should therefore not be on by default.
* Speed-up in `mikado serialise` (closes EI-CoreBioinformatics#177):
  - XML serialising improved
  - ORF translation improved
  - ORF serialisation is now completely parallelised
* Speed ups in `mikado pick`:
  - Introduced a class method for printing GFAnnotation objects, so as to avoid building a whole GtfLine/GffLine object when printing transcripts.
  - In the main process, the GTF file will be parsed using a new, lighter ad-hoc parsing method. This prevents the parsing from being the main bottleneck. Likewise, transcripts will now be sent to the inter-process database in batches, rather than one at a time: the `commit` operation is too expensive to be performed per transcript.
  - Now `mikado pick` has fully parallelised printing of the output files. This prevents the last stage (the file merging) from becoming the bottleneck.
* Now in Daijin BLAST chunks will always be at least equal to the number of threads (to ensure proper parallelism)
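The batching point above — `commit` being too expensive to issue per transcript — can be sketched as follows. The table layout and batch size are assumptions for illustration, not Mikado's actual schema:

```python
import sqlite3

def store_batched(conn, transcripts, batch_size=1000):
    """Insert transcripts in batches, committing once per batch instead
    of once per row: every commit forces a (slow) sync to disk."""
    cur = conn.cursor()
    batch = []
    for tid, chrom, start, end in transcripts:
        batch.append((tid, chrom, start, end))
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO transcripts VALUES (?, ?, ?, ?)", batch)
            conn.commit()
            batch.clear()
    if batch:  # flush the final, partial batch
        cur.executemany("INSERT INTO transcripts VALUES (?, ?, ?, ?)", batch)
        conn.commit()
```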