Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concordance: tokenization and small fixes #232

Merged
merged 4 commits into from
May 26, 2017

Conversation

ajdapretnar
Copy link
Collaborator

@ajdapretnar ajdapretnar commented Apr 7, 2017

Issue

Fixes #208.

Description of changes
  1. Concordance computes tokens internally to distinguish from corpus.tokens.
  2. Input and output set to Corpus.
  3. Widget doesn't crash upon deleting the connection.
Includes
  • Code changes
  • Tests
  • Documentation

@ajdapretnar ajdapretnar changed the title Concordance tokens Concordance: tokenization and small fixes Apr 7, 2017
@ajdapretnar
Copy link
Collaborator Author

@janezd Crucial bug: when deleting a connection from, say, Corpus to Concordance (with a selection in Concordance), the widget crashes. Problem in modelReset.emit(). How to go about fixing this?

@codecov-io
Copy link

codecov-io commented Apr 7, 2017

Codecov Report

Merging #232 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #232   +/-   ##
=======================================
  Coverage   93.81%   93.81%           
=======================================
  Files          32       32           
  Lines        1600     1600           
  Branches      294      294           
=======================================
  Hits         1501     1501           
  Misses         60       60           
  Partials       39       39

if self.corpus.has_tokens() else 'n/a'
len(self.corpus))
self.n_tokens = sum(map(len, self.model.tokens)) \
if self.model.tokens is not None else 'n/a'
self.n_types = len(self.corpus.dictionary) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.corpus.dictionary should be replaced with the number of types from internal concordance tokenization.

Currently, we show number of tokens from internal concordances' tokenization and number of types from corpus' tokenization. But more serious problem than this mismatch is that the call to self.corpus.dictionary runs default preprocessor if the corpus is not yet preprocessed. Hence, when passing unpreprocessed data to concordance (e.g. Corpus -> Concordance) preprocessing is run twice. Once for concordances (as it should) and once since we call self.corpus.dictionary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add something like self.n_types = len(set(tokens)) to ConcordanceModel's set_tokens method and use that instead.

@ajdapretnar ajdapretnar force-pushed the concordance-tokens branch from 017924d to 809dc5b Compare May 25, 2017 08:58
@ajdapretnar
Copy link
Collaborator Author

Yaaay, after rebasing I have inadvertedly lost @janezd tests. 😞 How do I fix this and why on Earth does Git override someone else's commit??? That's bs. 😠

@ajdapretnar
Copy link
Collaborator Author

@nikicc After a lot of painful edits, this is ready to merge. :)

@nikicc nikicc merged commit e77756f into biolab:master May 26, 2017
@ajdapretnar ajdapretnar deleted the concordance-tokens branch June 29, 2017 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants