Add sentence enumeration to paragraph identification when sentences are split #12
If Bifixer resplits sentences, the position of each split sentence will be appended to the paragraph identifier
Is it expected to add two ids?
No, it is not expected. I just trusted the deferred implementation, since the behaviour for the paragraph identification is very similar. I thought that line Line 207 in 276cc2c
echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long. Sentence that is going to be splitted ? \tp0s1\t0123456789" | monofixer --scol 1 --sparagraphid 2 --sdeferredcol 3 - - en
you can see that the output will be:
For the deferred column, the output omits the subsequent split lines, but I didn't add the mentioned line since I thought this was due to the deferred reconstructor, as I said. Maybe @lpla can say whether this is intended or whether it shouldn't happen this way.
When sentence was being split, the paragraph identifier was being wrongly printed multiple times
The problem should now be fixed, at least for the provided example, but the problem with the deferred still remains. I've checked the code a bit and I don't know if line Line 201 in 0c168cc
I've checked that the same problem applies to Bifixer as well. In the case of the deferred, only the first line is printed when a sentence is resplit, and in the case of the paragraphs, multiple identifiers are added. Still, I really think that line Line 228 in 0c168cc
copy.copy or copy.deepcopy, depending on the elements, at line Line 201 in 0c168cc
but @ZJaume should clarify whether this array is intended to be a reference or whether it doesn't matter if it is a copy (I think a copy is fine as long as no processing is applied inside the loop over the resplit segments)
I think that Line 201 in 0c168cc
wasn't intended to be a reference, but it didn't cause any issue because parts was not being modified. If you really need to query the parts variable as it was prior to modifications, please feel free to use deepcopy.
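The reference-versus-copy distinction discussed above can be illustrated with a minimal sketch. The contents of parts here are hypothetical stand-ins for the columns of one input line, not the actual Bifixer data structure; the sketch only shows how plain assignment, copy.copy and copy.deepcopy behave when the original list is modified afterwards.

```python
import copy

# Hypothetical stand-in for the columns read from one input line.
parts = ["A long line. Split here?", "p0s1", "0123456789"]

alias = parts                # plain assignment: both names point to the same list
snapshot = copy.copy(parts)  # shallow copy: a new list holding the same elements

# Modify the original, as a loop over resplit segments might do.
parts[1] = "p0s1#1"

print(alias[1])     # p0s1#1 (the alias sees the modification)
print(snapshot[1])  # p0s1   (the copy keeps the earlier value)

# With nested mutable elements, only deepcopy isolates the inner objects.
nested = [["p0s1"], "text"]
shallow = copy.copy(nested)
deep = copy.deepcopy(nested)
nested[0].append("#1")

print(shallow[0])   # ['p0s1', '#1'] (inner list is shared)
print(deep[0])      # ['p0s1']       (inner list was copied)
```

If the elements are immutable strings, a shallow copy should be enough; deepcopy only matters when the elements themselves are mutable.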
The issues discussed above should now be solved by the previous commits:
If paragraph identification data is provided in the input file and sentences are split, we would like to keep the sentence enumeration in order to reconstruct the split sentences.
The new behaviour is very similar to deferred (flags --sdeferredcol and --tdeferredcol). So, flags --sparagraphid and --tparagraphid are introduced in order to specify the columns of the source and target paragraph identification data.
If sentences are split, the value #{no. sentence} will be added to source and target paragraph identifiers, like it is done with deferred.
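As a rough illustration of the enumeration described above, the sketch below appends #{no. sentence} to a paragraph identifier only when a sentence has been split. The function name enumerate_paragraph_id is hypothetical, and starting the numbering at 1 is an assumption; this is not the code from the pull request.

```python
def enumerate_paragraph_id(paragraph_id, segments):
    """Pair each segment with its paragraph identifier.

    An unsplit sentence keeps its identifier unchanged. When the sentence
    was split, '#<n>' is appended, with n counting from 1 (an assumption).
    """
    if len(segments) <= 1:
        return [(segment, paragraph_id) for segment in segments]
    return [
        (segment, f"{paragraph_id}#{n}")
        for n, segment in enumerate(segments, start=1)
    ]


# One identifier per output line, never repeated within a line.
for text, pid in enumerate_paragraph_id("p0s1", ["First part.", "Second part?"]):
    print(f"{text}\t{pid}")
```

Building the identifier once per segment, rather than appending to a shared value inside the loop, avoids the repeated identifiers reported earlier in the conversation.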