Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sentence enumeration to paragraph identification when sentences are split #12

Merged
merged 5 commits into from
Feb 15, 2022

Conversation

cgr71ii
Copy link
Contributor

@cgr71ii cgr71ii commented Feb 7, 2022

If paragraph identification data is provided in the input file and sentences are split, we would like to keep the sentence enumeration in order to reconstruct the split sentences.

The new behaviour is very similar to deferred (flags --sdeferredcol and --tdeferredcol). So, flags --sparagraphid and --tparagraphid are introduced in order to specify the columns of the source and target paragraph identification data.

If sentences are split, the value #{no. sentence} will be added to source and target paragraph identifiers, like it is done with deferred.

If Bifixer resplite sentences, the position of the split sentences
will be added to the paragraph identification
@ZJaume
Copy link
Member

ZJaume commented Feb 14, 2022

Is it expected to add two ids?

echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a  long. Sentence that is going to be splitted ? \tp0s1" | monofixer --scol 1 --sparagraphid 2 - - en 
2022-02-14 17:39:19,039 - INFO - Arguments processed.
2022-02-14 17:39:19,040 - INFO - Executing main program...
2022-02-14 17:39:19,040 - INFO - Starting fixing text
hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long.      p0s1#1  02cc36dcd3f171e8        1
Sentence that is going to be splitted?  p0s1#1#2        9cc706c938726cf6        1
2022-02-14 17:39:19,049 - INFO - Text fixing finished
2022-02-14 17:39:19,049 - INFO - Finished
2022-02-14 17:39:19,049 - INFO - Input lines: 1 rows
2022-02-14 17:39:19,049 - INFO - Output lines: 2 rows
2022-02-14 17:39:19,049 - INFO - Elapsed time 0.01 s
2022-02-14 17:39:19,049 - INFO - Troughput: 108 rows/s
2022-02-14 17:39:19,049 - INFO - Output file: <stdout>
2022-02-14 17:39:19,049 - INFO - Program finished

@cgr71ii
Copy link
Contributor Author

cgr71ii commented Feb 15, 2022

No, it is not expected. I just trusted the deferred implementation since the behaviour for the paragraphs identification is very similar. I though that line

if "#" in parts[args.sdeferredcol-1]:
was due to the deferred reconstructor, but now I'm thinking that the reason for that line might be to avoid the problem that you are saying (i.e. add multiple identifiers). The problem is that, if you execute

echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a  long. Sentence that is going to be splitted ? \tp0s1\t0123456789" | monofixer --scol 1 --sparagraphid 2 --sdeferredcol 3 - - en

you can see that the output will be:

hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long.      p0s1#1  0123456789#1    02cc36dcd3f171e8 1

The output skips the output of the following split lines for the deferred, but I didn't added the mentioned line since I though it was due to the deferred reconstructor as I have said. Maybe @lpla might say if this is intended or this shouldn't happen this way.

When sentence was being split, the paragraph identifier was being
wrongly printed multiple times
@cgr71ii
Copy link
Contributor Author

cgr71ii commented Feb 15, 2022

The problem now should be fixed, at least for the provided example. But the problem with the deferred still remains. I've checked the code a bit and I don't know if line

new_parts = parts
is intended, since it is a reference, which throughout the code is being modified, so when a line is split and modified due to, at least, deferred or paragraph identification, the next time which should be modified, it will already contain the previous change, what leaded to the initial problem.

@cgr71ii
Copy link
Contributor Author

cgr71ii commented Feb 15, 2022

I've checked out that the same problem applies to Bifixer as well. In the case of the deferred, only the first line is printed when resplit, and multiple identifiers are added in the case of the paragraphs. Still I really think that line

if sent_num != int(parts[args.sdeferredcol - 1].split('#')[1]):
it is intended to work with the deferred reconstructor (I would need feedback from @lpla to be sure). In that case, these bugs might be fixed using copy.copy or copy.deepcopy, depending on the elements, at line
new_parts = parts

but @ZJaume should clarify if this array it is intended to be a reference or doesn't matter if it is a copy (I think that doesn't matter to be a copy if no processing is applied through the resplit segments loop)

@ZJaume
Copy link
Member

ZJaume commented Feb 15, 2022

I think that

new_parts = parts

wasn't intended to be a reference, but it didn't gave any issue because parts was not being modified. If you really need to query the parts variable as it was prior to modifications, please feel free to do deepcopy.

@cgr71ii
Copy link
Contributor Author

cgr71ii commented Feb 15, 2022

The commented issues should be solved now with the previous commits:

  • Resplit sentences are now printed when deferred is enabled.
  • Resplit sentences are now right identified when paragraph identification is enabled.

@ZJaume ZJaume merged commit 7b1203a into master Feb 15, 2022
@cgr71ii cgr71ii deleted the paragraph_identification branch March 2, 2022 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants