Issue 102766 - patch to improve German hyphenation patterns
Summary: patch to improve German hyphenation patterns
Status: CLOSED FIXED
Alias: None
Product: lingucomponent
Classification: Code
Component: other
Version: DEV300m50
Hardware: All All
Importance: P3 Trivial
Target Milestone: OOo 3.2
Assignee: stefan.baltzer
QA Contact: issues@lingucomponent
URL:
Keywords:
Duplicates: 79559
Depends on:
Blocks:
 
Reported: 2009-06-14 19:34 UTC by lohmaier
Modified: 2017-05-20 09:01 UTC
CC List: 6 users

See Also:
Issue Type: PATCH
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
patch to enhance the german hyphenation patterns (1.24 KB, patch)
2009-06-14 19:36 UTC, lohmaier
second version of patch, now also covers issue 79559 (1.34 KB, patch)
2009-06-15 01:08 UTC, lohmaier

Description lohmaier 2009-06-14 19:34:17 UTC
from issue 94231

String "Endanwender" is incorrectly hyphenated as "En-dan-wen-der" instead of
"End-an-wen-der"

When having a look at the patterns, I noticed that a few other strings were
affected, like Aben=dan=dacht or Aben=dan=zü=ge instead of Abend=an=dacht or
Abend=an=zü=ge.

The attached patch solves the issue. (Note that a few lines of the existing
patterns were moved around when running substrings.pl; the last 8 lines are
the actual addition.)

The result from hyphenating the unmunched de_DE-myspell dictionaries (around
325800 unique words) with the current and the modified patterns:
$ diff orighyphenated.txt newhyphenated.txt 
796,799c796,799
< aben=dan=dacht
< aben=dan=dach=ten
< aben=dan=zü=ge
< aben=dan=zü=gen
---
> abend=an=dacht
> abend=an=dach=ten
> abend=an=zü=ge
> abend=an=zü=gen
87724c87724
< en=d=ab=neh=mer
---
> end=ab=neh=mer
87727,87728c87727,87728
< en=dab=schal=tung
< en=dan=wen=der
---
> end=ab=schal=tung
> end=an=wen=der

If you have a larger wordlist, go ahead and compare/check for regressions.
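
A regression check like the one above can also be scripted. The following is a minimal sketch, assuming both files hold one hyphenated word per line with "=" as the break marker (as in the diff output above). The function names and the sample data are illustrative only and are not part of the patch.

```python
def index_by_word(lines):
    """Map each plain word to its hyphenated form ('=' marks a break)."""
    result = {}
    for line in lines:
        hyphenated = line.strip()
        if hyphenated:
            result[hyphenated.replace("=", "")] = hyphenated
    return result


def changed_hyphenations(old_lines, new_lines):
    """Return (old, new) pairs for words whose hyphenation differs."""
    old = index_by_word(old_lines)
    new = index_by_word(new_lines)
    return [
        (old[word], new[word])
        for word in sorted(old)
        if word in new and old[word] != new[word]
    ]


# Sample data copied from the diff above, plus one made-up unchanged word.
old_sample = ["en=d=ab=neh=mer", "en=dan=wen=der", "haus=tür"]
new_sample = ["end=ab=neh=mer", "end=an=wen=der", "haus=tür"]

for before, after in changed_hyphenations(old_sample, new_sample):
    print(f"{before:<20} | {after}")
```

For real runs, the two lists would be read from files such as orighyphenated.txt and newhyphenated.txt. Keying by the plain word instead of the line number keeps the comparison stable if the two lists are sorted differently.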
Comment 1 lohmaier 2009-06-14 19:36:38 UTC
Created attachment 62980
patch to enhance the german hyphenation patterns
Comment 2 lohmaier 2009-06-15 01:07:04 UTC
*** Issue 79559 has been marked as a duplicate of this issue. ***
Comment 3 lohmaier 2009-06-15 01:08:11 UTC
Created attachment 62987
second version of patch, now also covers issue 79559
Comment 4 lohmaier 2009-06-15 01:10:37 UTC
changes compared to the current pattern (Note: Hundehaufen and Vollernter have
not been in the unmunched wordlist, I added them manually to the hyphenation-input):

aben=dan=dacht             |  abend=an=dacht
aben=dan=dach=ten          |  abend=an=dach=ten
aben=dan=zü=ge             |  abend=an=zü=ge
aben=dan=zü=gen            |  abend=an=zü=gen
baum=wol=lern=te           |  baum=woll=ern=te
en=d=ab=neh=mer            |  end=ab=neh=mer
en=dab=schal=tung          |  end=ab=schal=tung
en=dan=wen=der             |  end=an=wen=der
                           >  hun=de=hau=fen
ob=jek=t=ori=en=tiert      |  ob=jekt=ori=en=tiert
ob=jek=t=ori=en=tier=te    |  ob=jekt=ori=en=tier=te
ob=jek=t=ori=en=tier=tem   |  ob=jekt=ori=en=tier=tem
ob=jek=t=ori=en=tier=ten   |  ob=jekt=ori=en=tier=ten
ob=jek=t=ori=en=tier=ter   |  ob=jekt=ori=en=tier=ter
ob=jek=t=ori=en=tier=tes   |  ob=jekt=ori=en=tier=tes
pro=gram=mab=bruch         |  pro=gramm=ab=bruch
pro=gram=m=ab=hän=gig      |  pro=gramm=ab=hän=gig
pro=gram=m=ab=hän=gi=ge    |  pro=gramm=ab=hän=gi=ge
pro=gram=m=ab=hän=gi=gem   |  pro=gramm=ab=hän=gi=gem
pro=gram=m=ab=hän=gi=gen   |  pro=gramm=ab=hän=gi=gen
pro=gram=m=ab=hän=gi=ger   |  pro=gramm=ab=hän=gi=ger
pro=gram=m=ab=hän=gi=ges   |  pro=gramm=ab=hän=gi=ges
pro=gram=ma=b=lauf         |  pro=gramm=ab=lauf
pro=gram=ma=b=lau=fes      |  pro=gramm=ab=lau=fes
pro=gram=ma=b=lauf=plan    |  pro=gramm=ab=lauf=plan
pro=gram=ma=b=lauf=plans   |  pro=gramm=ab=lauf=plans
pro=gram=ma=b=laufs        |  pro=gramm=ab=laufs
pro=gram=mab=läu=fe        |  pro=gramm=ab=läu=fe
pro=gram=mab=läu=fen       |  pro=gramm=ab=läu=fen
pro=gram=m=ana=ly=se       |  pro=gramm=ana=ly=se
pro=gram=m=ana=ly=sen      |  pro=gramm=ana=ly=sen
pro=gram=m=aus=füh=rung    |  pro=gramm=aus=füh=rung
te=le=gram=man=schrift     |  te=le=gramm=an=schrift
te=le=gram=man=schrif=ten  |  te=le=gramm=an=schrif=ten
                           >  voll=ern=ter

As before: If you have big wordlists to check against, please do and check for
regressions.
Comment 5 Mathias_Bauer 2009-06-15 09:47:56 UTC
The hyphenation patterns included in OOo are taken from 

http://www.tug.org/tex-archive/language/hyphenation/dehyphn.tex

I wonder where would be the right place to "upstream" that patch.
Comment 6 lohmaier 2009-06-15 10:04:14 UTC
I doubt there is one. Those old patterns don't get updated; they are still at
revision level 31 from 2001 (the version that the OOo patterns are based on).
Instead, there is now an experimental German hyphenation package, probably aimed
at replacing the old patterns someday.
http://ctan.tug.org/tex-archive/language/hyphenation/dehyph-exptl/
Comment 7 Mathias_Bauer 2009-06-15 11:57:14 UTC
"Experimental" doesn't sound like it could be something we want now. But what
about updating to a newer version of the "non-experimental" patterns?

Should we start a discussion about that on dev@de?
Comment 8 lohmaier 2009-06-15 12:56:54 UTC
> But what about updating to a newer version of the "non-experimental" patterns?

That's what I tried to explain: There is no newer version. The revision that is
available is the same as it was back in 2001 (revision level 31). That is what
the current ones were based on, and it is the most recent version that is
available. All that apparently was updated is an additional exception list to
blacklist some words.
Comment 9 Mathias_Bauer 2009-06-15 13:41:16 UTC
Ah, yes, I did not look at the date entries carefully enough; the "2008-07-09" I
saw was only for the README.

OK, so it seems that we can apply the fixes in our repository only.
Comment 10 thomas.lange 2009-10-08 12:32:29 UTC
I applied the new entries
+en6d5an
+en6d5ab
+eben7d6a
+ebe2n1d
+pen7d8an
+pe2n1d
+ten7d8an
+te2n1d
+gram4m5a2
+gra2m1m
+gram5m6a3t
+de1h6a
+jek2t3o
+je2k1t
+ojek3t4o
+oje2k1t
+ol2l1ernt
+ol1ler
+olle2rn

to the hyphenation dictionaries of de_DE, de_AT, de_CH.

However, if we do not make sure these changes get applied to the *frami*
de_* extensions as well, they will be lost upon the next automatic
update of those extensions. Thus I'm going to write a mail to the extension
provider about this patch.
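
To illustrate how entries like the ones listed above act on a word, here is a minimal sketch of naive Liang-style pattern matching. It is not the state machine used by the hyphenation library in OOo, and it loads only the new entries, so it shows only the breaks those entries influence rather than the full hyphenation of a word. The function names and the left_min/right_min defaults are assumptions made for illustration.

```python
# New entries quoted in this comment (leading '+' from the diff removed).
NEW_ENTRIES = """
en6d5an en6d5ab eben7d6a ebe2n1d pen7d8an pe2n1d ten7d8an te2n1d
gram4m5a2 gra2m1m gram5m6a3t de1h6a jek2t3o je2k1t ojek3t4o oje2k1t
ol2l1ernt ol1ler olle2rn
""".split()


def parse_pattern(pattern):
    """Split 'en6d5an' into letters 'endan' and inter-letter levels."""
    letters, points = [], [0]
    for ch in pattern:
        if ch.isdigit():
            points[-1] = int(ch)   # level for the gap before the next letter
        else:
            letters.append(ch)
            points.append(0)       # gap after this letter, default level 0
    return "".join(letters), points


PATTERNS = dict(parse_pattern(p) for p in NEW_ENTRIES)


def hyphenate(word, left_min=2, right_min=2):
    """Insert '=' at gaps whose highest matching level is odd.

    left_min/right_min are assumed minimum fragment lengths.
    """
    padded = "." + word.lower() + "."
    levels = [0] * (len(padded) + 1)   # levels[i]: gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            points = PATTERNS.get(padded[start:end])
            if points:
                for offset, value in enumerate(points):
                    idx = start + offset
                    levels[idx] = max(levels[idx], value)
    pieces, last = [], 0
    for pos in range(left_min, len(word) - right_min + 1):
        if levels[pos + 1] % 2 == 1:   # odd level allows a break
            pieces.append(word[last:pos])
            last = pos
    pieces.append(word[last:])
    return "=".join(pieces)


for sample in ("endanwender", "objektorientiert", "programmablauf"):
    print(sample, "->", hyphenate(sample))
```

In "en6d5an", the even level 6 forbids a break between "n" and "d", while the odd level 5 permits one between "d" and "a", which is what turns "en=dan" into "end=an". The remaining breaks (such as "wen=der") come from patterns already in the dictionary and therefore do not appear in this sketch.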


Comment 11 thomas.lange 2009-10-08 12:33:01 UTC
.
Comment 12 thomas.lange 2009-10-08 12:33:30 UTC
.
Comment 13 thomas.lange 2009-10-09 11:38:52 UTC
.
Comment 14 stefan.baltzer 2009-10-12 14:01:26 UTC
Verified in CWS sw32bf05.