Concordance: tokenization and small fixes #232
Conversation
@janezd Crucial bug: when deleting a connection from, say, Corpus to Concordance (with a selection active in Concordance), the widget crashes. The problem is in modelReset.emit(). How should we go about fixing this?
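One common fix for this class of Qt crash is to wrap the data swap in beginResetModel()/endResetModel() rather than emitting modelReset directly, so attached views and selection models are invalidated before the underlying data disappears. A minimal sketch (the set_corpus method and the toy row/column counts here are assumptions for illustration, not the patch this PR actually lands):

```python
from AnyQt.QtCore import QAbstractTableModel, QModelIndex

class ConcordanceModel(QAbstractTableModel):
    """Toy stand-in for the widget's model; only the reset logic matters."""

    def __init__(self):
        super().__init__()
        self.corpus = None

    def rowCount(self, parent=QModelIndex()):
        return 0 if self.corpus is None else len(self.corpus)

    def columnCount(self, parent=QModelIndex()):
        return 3

    def set_corpus(self, corpus):
        # begin/endResetModel() notify attached views *around* the data
        # swap, so selection models drop stale indices instead of
        # dereferencing rows that no longer exist. Emitting modelReset
        # directly skips the "about to reset" half, which can crash when
        # the input connection is removed while a selection is active.
        self.beginResetModel()
        self.corpus = corpus
        self.endResetModel()
```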
Codecov Report
@@           Coverage Diff           @@
##           master     #232   +/-   ##
=======================================
  Coverage   93.81%   93.81%
=======================================
  Files          32       32
  Lines        1600     1600
  Branches      294      294
=======================================
  Hits         1501     1501
  Misses         60       60
  Partials       39       39
    len(self.corpus))
self.n_tokens = sum(map(len, self.model.tokens)) \
    if self.model.tokens is not None else 'n/a'
self.n_types = len(self.corpus.dictionary) \
    if self.corpus.has_tokens() else 'n/a'
self.corpus.dictionary should be replaced with the number of types from the internal concordance tokenization.
Currently we show the number of tokens from the internal concordance's tokenization but the number of types from the corpus's tokenization. A more serious problem than this mismatch, however, is that the call to self.corpus.dictionary runs the default preprocessor if the corpus is not yet preprocessed. Hence, when unpreprocessed data is passed to Concordance (e.g. Corpus -> Concordance), preprocessing runs twice: once for the concordance (as it should) and once more because of the call to self.corpus.dictionary.
We should probably add something like self.n_types = len(set(tokens)) to ConcordanceModel's set_tokens method and use that instead, as in the sketch below.
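A minimal sketch of that idea, assuming tokens is a list of per-document token lists (the set_tokens body here is illustrative, not the PR's final code). Note that a literal len(set(tokens)) would raise TypeError on a list of lists, so the documents are flattened first:

```python
from itertools import chain

class ConcordanceModel:
    # ... rest of the model omitted ...

    def set_tokens(self, tokens):
        """Store tokenized documents and cache summary statistics."""
        self.tokens = tokens
        if tokens is None:
            self.n_tokens = self.n_types = None
            return
        # Total number of tokens across all documents.
        self.n_tokens = sum(map(len, tokens))
        # Number of distinct types, taken from the concordance's own
        # tokenization rather than corpus.dictionary, so no second
        # preprocessing pass is triggered.
        self.n_types = len(set(chain.from_iterable(tokens)))
```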
Force-pushed from 017924d to 809dc5b
Yaaay, after rebasing I have inadvertently lost @janezd's tests. 😞 How do I fix this, and why on Earth does Git overwrite someone else's commits??? That's bs. 😠
@nikicc After a lot of painful edits, this is ready to merge. :)
Issue
Fixes #208.
Description of changes
Includes