Add sentence enumeration to paragraph identification when sentences are split #12

cgr71ii · 2022-02-07T15:28:53Z

If paragraph identification data is provided in the input file and sentences are split, we would like to keep the sentence enumeration in order to reconstruct the split sentences.

The new behaviour is very similar to deferred (flags --sdeferredcol and --tdeferredcol). So, flags --sparagraphid and --tparagraphid are introduced in order to specify the columns of the source and target paragraph identification data.

If sentences are split, the value #{no. sentence} will be appended to the source and target paragraph identifiers, as is done with deferred.
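As a minimal sketch of the enumeration described above (this is not Bifixer's actual code; `enumerate_segments` and its arguments are hypothetical names, and leaving unsplit sentences untouched is an assumption), each segment produced by a split could receive the original identifier plus `#` and its 1-based position:

```python
def enumerate_segments(segments, paragraph_id):
    """Pair each segment with its paragraph identifier.

    Hypothetical helper: when a sentence was split into several segments,
    append '#<position>' (1-based) to the identifier of each one.
    """
    if len(segments) <= 1:
        # Assumption: an unsplit sentence keeps its identifier unchanged.
        return [(seg, paragraph_id) for seg in segments]
    return [
        (seg, f"{paragraph_id}#{position}")
        for position, seg in enumerate(segments, start=1)
    ]


if __name__ == "__main__":
    for text, ident in enumerate_segments(["First part.", "Second part?"], "p0s1"):
        print(text, ident, sep="\t")
```

With identifiers of this shape, the split sentences can later be put back in order by grouping on the part before `#` and sorting on the part after it.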

If Bifixer resplits sentences, the position of the split sentences will be added to the paragraph identification

ZJaume · 2022-02-14T16:40:03Z

Is it expected to add two ids?

echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a  long. Sentence that is going to be splitted ? \tp0s1" | monofixer --scol 1 --sparagraphid 2 - - en

2022-02-14 17:39:19,039 - INFO - Arguments processed.
2022-02-14 17:39:19,040 - INFO - Executing main program...
2022-02-14 17:39:19,040 - INFO - Starting fixing text
hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long.      p0s1#1  02cc36dcd3f171e8        1
Sentence that is going to be splitted?  p0s1#1#2        9cc706c938726cf6        1
2022-02-14 17:39:19,049 - INFO - Text fixing finished
2022-02-14 17:39:19,049 - INFO - Finished
2022-02-14 17:39:19,049 - INFO - Input lines: 1 rows
2022-02-14 17:39:19,049 - INFO - Output lines: 2 rows
2022-02-14 17:39:19,049 - INFO - Elapsed time 0.01 s
2022-02-14 17:39:19,049 - INFO - Troughput: 108 rows/s
2022-02-14 17:39:19,049 - INFO - Output file: <stdout>
2022-02-14 17:39:19,049 - INFO - Program finished

cgr71ii · 2022-02-15T07:45:46Z

No, it is not expected. I just trusted the deferred implementation, since the behaviour for paragraph identification is very similar. I thought that the line

bifixer/bifixer/monofixer.py

Line 207 in 276cc2c

if "#" in parts[args.sdeferredcol-1]:

was there because of the deferred reconstructor, but now I think its purpose might be to avoid the problem you describe (i.e. adding multiple identifiers). The problem is that, if you execute

echo -ne "hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a  long. Sentence that is going to be splitted ? \tp0s1\t0123456789" | monofixer --scol 1 --sparagraphid 2 --sdeferredcol 3 - - en

you can see that the output will be:

hey very aa a a a a a a a a a a a a a a a a a a a aa a a a a long.      p0s1#1  0123456789#1    02cc36dcd3f171e8 1

The remaining split lines are skipped in the output when deferred is used, but I didn't add the mentioned line because, as I said, I thought it was there for the deferred reconstructor. Maybe @lpla can say whether this is intended or whether it shouldn't happen this way.

When a sentence was being split, the paragraph identifier was being wrongly printed multiple times

cgr71ii · 2022-02-15T08:05:17Z

The problem should now be fixed, at least for the provided example, but the problem with deferred still remains. I've looked at the code a bit and I don't know whether the line

bifixer/bifixer/monofixer.py

Line 201 in 0c168cc

new_parts = parts

is intended: it is a reference to the same list, which is modified throughout the code. So when a line is split and modified because of, at least, deferred or paragraph identification, the next segment to be modified already contains the previous change, which led to the initial problem.
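To illustrate the aliasing described here, a reduced sketch (not the real monofixer loop; `buggy_resplit` and the two-column layout are assumptions made for the example) shows how assigning without copying makes the identifier accumulate suffixes:

```python
def buggy_resplit(parts, segments, id_col=1):
    """Reproduce the aliasing problem in isolation (hypothetical helper)."""
    ids = []
    for sent_num, segment in enumerate(segments, start=1):
        new_parts = parts  # alias: both names point to the same list
        new_parts[0] = segment
        # This also mutates `parts`, so the next iteration starts from an
        # identifier that already carries the previous suffix.
        new_parts[id_col] += f"#{sent_num}"
        ids.append(new_parts[id_col])
    return ids


if __name__ == "__main__":
    print(buggy_resplit(["whole line", "p0s1"], ["first.", "second?"]))
    # prints ['p0s1#1', 'p0s1#1#2']
```

The second identifier has the same shape as the `p0s1#1#2` value reported in the output earlier in this conversation.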

cgr71ii · 2022-02-15T09:10:54Z

I've checked, and the same problem applies to Bifixer as well. In the case of deferred, only the first line is printed on resplit, and in the case of paragraphs, multiple identifiers are added. Still, I really think that the line

bifixer/bifixer/bifixer.py

Line 228 in 0c168cc

if sent_num != int(parts[args.sdeferredcol - 1].split('#')[1]):

is intended to work with the deferred reconstructor (I would need feedback from @lpla to be sure). In that case, these bugs might be fixed using copy.copy or copy.deepcopy, depending on the elements, at the line

bifixer/bifixer/monofixer.py

Line 201 in 0c168cc

new_parts = parts

but @ZJaume should clarify whether this array is intended to be a reference or whether it doesn't matter if it is a copy (I think it doesn't matter if it is a copy, as long as no processing is applied across the resplit segments loop).

ZJaume · 2022-02-15T10:15:00Z

I think that

bifixer/bifixer/monofixer.py

Line 201 in 0c168cc

new_parts = parts

wasn't intended to be a reference, but it didn't give any issues because parts was not being modified. If you really need to query the parts variable as it was prior to modifications, please feel free to use deepcopy.
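A sketch of the copy-based approach suggested in this thread might look as follows. It is an illustration under assumptions, not the code from the merged commits: `resplit_rows` and the column layout are hypothetical, and a shallow `copy.copy` is used because the list is assumed to hold only strings (`copy.deepcopy` would be the safer choice if the elements were mutable).

```python
import copy


def resplit_rows(parts, segments, id_col=1):
    """Build one output row per segment without mutating `parts`."""
    rows = []
    for sent_num, segment in enumerate(segments, start=1):
        # Shallow copy: enough for a flat list of immutable strings.
        new_parts = copy.copy(parts)
        new_parts[0] = segment
        # Always derive the identifier from the untouched original.
        new_parts[id_col] = f"{parts[id_col]}#{sent_num}"
        rows.append(new_parts)
    return rows


if __name__ == "__main__":
    original = ["whole line", "p0s1"]
    for row in resplit_rows(original, ["first.", "second?"]):
        print("\t".join(row))
    print(original)  # still ['whole line', 'p0s1']
```

Because every row starts from a fresh copy, `parts` can still be queried as it was before any modification, which is the property asked for above.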

cgr71ii · 2022-02-15T13:08:24Z

The issues discussed above should now be solved by the latest commits:

Resplit sentences are now printed when deferred is enabled.
Resplit sentences are now correctly identified when paragraph identification is enabled.

cgr71ii added 2 commits February 4, 2022 15:41

Paragraph identification in bifixer

11a204d

If Bifixer resplits sentences, the position of the split sentences will be added to the paragraph identification

Paragraph identification in monofixer

276cc2c

This was referenced Feb 7, 2022

Add headers to input and output files #11

Merged

Paragraph identification bitextor/bitextor#225

Merged

lpla approved these changes Feb 14, 2022

mbanon assigned ZJaume Feb 14, 2022

Fix paragraphs

0c168cc

When a sentence was being split, the paragraph identifier was being wrongly printed multiple times

cgr71ii added 2 commits February 15, 2022 11:34

Fix resplit (deferred and paraid)

bcb8b62

Fix deferred reconstructor

27d0aa3

ZJaume merged commit 7b1203a into master Feb 15, 2022

cgr71ii deleted the paragraph_identification branch March 2, 2022 09:10
