Apache OpenOffice (AOO) Bugzilla – Issue 102766
patch to improve German hyphenation pattern
Last modified: 2017-05-20 09:01:28 UTC
From issue 94231: the string "Endanwender" is incorrectly hyphenated as "En-dan-wen-der" instead of "End-an-wen-der".

When having a look at the patterns, I noticed that a few other strings were affected, like Aben=dan=dacht or Aben=dan=zü=ge instead of Abend=an=dacht or Abend=an=zü=ge.

The attached patch solves the issue. (Note that a few lines of the existing patterns were moved around when running substrings.pl; the last 8 lines are the actual addition.)

The result from hyphenating the unmunched de_DE-myspell dictionaries (around 325800 unique words) with the current and the modified patterns:

$ diff orighyphenated.txt newhyphenated.txt
796,799c796,799
< aben=dan=dacht
< aben=dan=dach=ten
< aben=dan=zü=ge
< aben=dan=zü=gen
---
> abend=an=dacht
> abend=an=dach=ten
> abend=an=zü=ge
> abend=an=zü=gen
87724c87724
< en=d=ab=neh=mer
---
> end=ab=neh=mer
87727,87728c87727,87728
< en=dab=schal=tung
< en=dan=wen=der
---
> end=ab=schal=tung
> end=an=wen=der

If you have a larger wordlist, go ahead and compare/check for regressions.
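For anyone who wants to run the same kind of regression check on a larger wordlist, here is a minimal Python sketch. It is not part of the attached patch; the function names are made up for illustration, and it assumes both inputs hold one hyphenated word per line with "=" as the break marker, as in the diff output above. It pairs entries by their unhyphenated spelling, so it still works if the two files are not in the same order.

```python
def index_by_word(lines, sep="="):
    """Map each unhyphenated word to its hyphenated form."""
    table = {}
    for raw in lines:
        entry = raw.strip()
        if entry:
            table[entry.replace(sep, "")] = entry
    return table


def changed_hyphenations(old_lines, new_lines, sep="="):
    """Return (old, new) pairs for words whose break points differ."""
    old = index_by_word(old_lines, sep)
    new = index_by_word(new_lines, sep)
    shared = sorted(old.keys() & new.keys())
    return [(old[w], new[w]) for w in shared if old[w] != new[w]]


if __name__ == "__main__":
    # Tiny in-memory sample; with real files, pass open(...) objects instead.
    before = ["aben=dan=dacht", "en=dan=wen=der", "haus=tür"]
    after = ["abend=an=dacht", "end=an=wen=der", "haus=tür"]
    for old_form, new_form in changed_hyphenations(before, after):
        print(f"{old_form} | {new_form}")
```

Every reported pair still has to be reviewed by hand; the script only narrows the list down to words whose hyphenation actually changed.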
Created attachment 62980 [details] patch to enhance the german hyphenation patterns
*** Issue 79559 has been marked as a duplicate of this issue. ***
Created attachment 62987 [details] second version of patch, now also covers issue 79559
Changes compared to the current pattern (Note: Hundehaufen and Vollernter were not in the unmunched wordlist, I added them manually to the hyphenation input):

aben=dan=dacht | abend=an=dacht
aben=dan=dach=ten | abend=an=dach=ten
aben=dan=zü=ge | abend=an=zü=ge
aben=dan=zü=gen | abend=an=zü=gen
baum=wol=lern=te | baum=woll=ern=te
en=d=ab=neh=mer | end=ab=neh=mer
en=dab=schal=tung | end=ab=schal=tung
en=dan=wen=der | end=an=wen=der
> hun=de=hau=fen
ob=jek=t=ori=en=tiert | ob=jekt=ori=en=tiert
ob=jek=t=ori=en=tier=te | ob=jekt=ori=en=tier=te
ob=jek=t=ori=en=tier=tem | ob=jekt=ori=en=tier=tem
ob=jek=t=ori=en=tier=ten | ob=jekt=ori=en=tier=ten
ob=jek=t=ori=en=tier=ter | ob=jekt=ori=en=tier=ter
ob=jek=t=ori=en=tier=tes | ob=jekt=ori=en=tier=tes
pro=gram=mab=bruch | pro=gramm=ab=bruch
pro=gram=m=ab=hän=gig | pro=gramm=ab=hän=gig
pro=gram=m=ab=hän=gi=ge | pro=gramm=ab=hän=gi=ge
pro=gram=m=ab=hän=gi=gem | pro=gramm=ab=hän=gi=gem
pro=gram=m=ab=hän=gi=gen | pro=gramm=ab=hän=gi=gen
pro=gram=m=ab=hän=gi=ger | pro=gramm=ab=hän=gi=ger
pro=gram=m=ab=hän=gi=ges | pro=gramm=ab=hän=gi=ges
pro=gram=ma=b=lauf | pro=gramm=ab=lauf
pro=gram=ma=b=lau=fes | pro=gramm=ab=lau=fes
pro=gram=ma=b=lauf=plan | pro=gramm=ab=lauf=plan
pro=gram=ma=b=lauf=plans | pro=gramm=ab=lauf=plans
pro=gram=ma=b=laufs | pro=gramm=ab=laufs
pro=gram=mab=läu=fe | pro=gramm=ab=läu=fe
pro=gram=mab=läu=fen | pro=gramm=ab=läu=fen
pro=gram=m=ana=ly=se | pro=gramm=ana=ly=se
pro=gram=m=ana=ly=sen | pro=gramm=ana=ly=sen
pro=gram=m=aus=füh=rung | pro=gramm=aus=füh=rung
te=le=gram=man=schrift | te=le=gramm=an=schrift
te=le=gram=man=schrif=ten | te=le=gramm=an=schrif=ten
> voll=ern=ter

As before: if you have big wordlists to check against, please do so and check for regressions.
The hyphenation patterns included in OOo are taken from http://www.tug.org/tex-archive/language/hyphenation/dehyphn.tex
I wonder where the right place would be to "upstream" that patch.
I doubt there is one. Those old patterns don't get updated: the file is still at revision level 31 from 2001 (the version that the OOo patterns are based on). Instead there is now an experimental German hyphenation package, probably aimed at replacing the old patterns someday: http://ctan.tug.org/tex-archive/language/hyphenation/dehyph-exptl/
"Experimental" doesn't sound like it could be something we want now. But what about updating to a newer version of the "non-experimental" patterns? Should we start a discussion about that on dev@de?
> But what about updating to a newer version of the "non-experimental" patterns?

That's what I tried to explain: there is no newer version. The revision that is available is the same as it was back in 2001, revision level 31. That is what the current patterns were based on, and it is the most recent version available. All that apparently was updated is an additional exception list to blacklist some words.
Ah, yes, I did not look at the date entries carefully enough; the "2008-07-09" I saw was only for the README. OK, so it seems that we can apply the fixes in our repository only.
I applied the new entries

+en6d5an
+en6d5ab
+eben7d6a
+ebe2n1d
+pen7d8an
+pe2n1d
+ten7d8an
+te2n1d
+gram4m5a2
+gra2m1m
+gram5m6a3t
+de1h6a
+jek2t3o
+je2k1t
+ojek3t4o
+oje2k1t
+ol2l1ernt
+ol1ler
+olle2rn

to the hyphenation dictionaries of de_DE, de_AT, de_CH. However, if we do not make sure these changes get applied to the *frami* de_* extensions as well, they will go to waste upon the next automatic update to those extensions. Thus I'm going to write a mail to the extension provider about this patch.
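To illustrate how an entry such as en6d5an changes the result, here is a minimal Python sketch of Liang-style pattern matching, the scheme these TeX-derived patterns follow. It is not the code OOo uses. The function names, the left_min/right_min defaults of 2, and the stand-in pattern n1d are assumptions for illustration only; n1d is not claimed to be in the real pattern file. Digits sit between letters as weights, the highest weight at each gap wins, and an odd weight allows a break while an even weight forbids it.

```python
def parse_pattern(pattern):
    """Split 'en6d5an' into its letters and the weights of the gaps around them."""
    letters = []
    weights = [0]  # weights[k] is the gap before letters[k]; last entry is after the end
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def hyphenate(word, patterns, left_min=2, right_min=2, sep="="):
    """Apply Liang-style patterns to one word and join the pieces with sep."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[j] is the gap before padded[j]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for k, weight in enumerate(weights):
                scores[start + k] = max(scores[start + k], weight)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # odd weight: break allowed before word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return sep.join(pieces)


if __name__ == "__main__":
    toy_old = ["n1d"]                # hypothetical stand-in for the old behaviour
    toy_new = ["n1d", "en6d5an"]     # plus one entry from the list above
    print(hyphenate("endanwender", toy_old))
    print(hyphenate("endanwender", toy_new))
```

With only the stand-in pattern the toy run gives en=danwen=der; adding en6d5an gives end=anwen=der, because the 6 overrides the 1 between "n" and "d" and the 5 permits the break after "end". The full result shown earlier in this issue depends on the complete pattern file, not on this toy subset.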
Verified in CWS sw32bf05.