Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing
Description
lars
2003-09-14 16:39:13 UTC
Since the thesaurus is used for several languages, it is not a good idea to add language-specific code, like removing "ing" or adding plurals, to the code itself. There is no end in sight once you start this. I think the words should instead be available in the word list of the thesaurus itself.

TL->Kevin: Can you please take over?

TL->OH: Please submit a separate bug for the dialog size.

Hi,

Yes, the thesaurus needs a lot of work to group synonyms by meaning (which will "fix" the dialog problem) and to greatly expand the wordlist to handle more words. All of this is in the works but requires a lot of volunteer help and time.

Changing this to started ...

Kevin

See also issue 19584 (thesaurus dialogue size improvement), issue 19586 (present thesaurus results more organized) and issue 19647 (on-the-fly thesaurus).

> requires a lot of volunteer help and time.

Yes; perhaps such lists (databases) already exist somewhere on the internet, or some groups are already working on them. One could build on their work or work together with them or, if no such groups exist, work together with other projects that benefit from such work, e.g. a Mozilla spellcheck or thesaurus group (does a Mozilla thesaurus group exist? Hmm, I can't find one on their projects page). Actually every (non-)human (:-)) benefits from this work. (Don't hit me :) Compare vocabulary with Lego and Duplo (Lego Creator): more vocabulary is like having more, and more varied, Lego bricks, which allow building more than big Duplo bricks do. And if others "understand" these pieces, one can play together (uhui......).) So a new (one-and-only) reference vocabulary database for all languages could be started :-)

A "group" which can help in this field is http://dict.leo.org (for English and German), Sun-sponsored by the way.

FYI: It will be possible to solve this, not only for -ing, with hunstem (part of hunspell).
*** Issue 51889 has been marked as a duplicate of this issue. ***

Target of Hunspell and Thesaurus integration: 2.0.2 (with morphological generation, for example: making -> doing). We also need a better American English dictionary based on real affixes. (It seems the British one is good.)

Target: 2.0.3 nemeth?

nemeth->pjanik: I hope, 2.0.4.

*** Issue 76272 has been marked as a duplicate of this issue. ***

Any news? I suggest moving the target to 3.x, as otherwise we need a fix ASAP.

New target: 3.0

Very good news: the latest Hunspell release has language-independent stemming and morphological generation functions for this task, tested with the new Hungarian spelling and morphological dictionary. In the coming months I will work on a Hungarian thesaurus project, and I plan to fix this issue, too. Any help would be welcome, especially for the Hunspell-OOo thesaurus integration.

This would be really great; for Slovenian, a thesaurus without this capability is useless. I hope you make it for 3.0. Thanks and good luck! I will gladly test with Slovenian files whether it works as it should.

Filmsi: Thanks for your kind words. Next week I will release a test version and mail it to the lingu-dev mailing list, and also make a CWS. Most of the stemming will work without modification of the spelling dictionary, if the affix file contains real affixes.

Fixed in the CWS hunspell4thesaurus.

Test data for stemming: Press Ctrl-F7 (thesaurus) on "facts" in Writer. The MyThes thesaurus will stem "facts" using the UNO interface of the spellchecker component, and show the synonyms of "fact".

Issue Type: DEFECT. (Maybe morphological generation is an enhancement, but stemming is a bug fix and a basic competitive feature of the thesaurus.)
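The stemming fallback described for "facts" can be illustrated with a minimal Python sketch. This is not the MyThes or Hunspell code: the `stem` function is a hypothetical stand-in for the stemming service of the spellchecker component, the suffix stripping is deliberately naive, and the thesaurus content is made-up toy data.

```python
# Toy thesaurus data (assumption for illustration, not the real MyThes file).
THESAURUS = {
    "fact": ["information", "reality", "truth"],
}

def stem(word):
    """Hypothetical stand-in for the spellchecker's stemming service.

    A real stemmer would use the affix rules of the spelling dictionary;
    this one only strips a few English plural endings.
    """
    candidates = []
    if word.endswith("ies") and len(word) > 3:
        candidates.append(word[:-3] + "y")
    if word.endswith("es") and len(word) > 2:
        candidates.append(word[:-2])
    if word.endswith("s") and len(word) > 1:
        candidates.append(word[:-1])
    return candidates

def lookup(word):
    """Return (entry used, synonyms); fall back to stems if the word has no entry."""
    if word in THESAURUS:
        return word, THESAURUS[word]
    for candidate in stem(word):
        if candidate in THESAURUS:
            return candidate, THESAURUS[candidate]
    return word, []

print(lookup("facts"))  # ('fact', ['information', 'reality', 'truth'])
```

The point of the design is that the thesaurus data stays free of inflected forms; the language-specific knowledge lives in the spelling dictionary's affix rules.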
Test build: http://hunspell.sourceforge.net/OOo_3.0.0_080603_LinuxIntel_install.tar.gz

en_US dictionary patch for affixation test (attached also in universal diff format):

-----------en_US.dic.diff------------------
5561c5561
< bet/MS
---
> bet/MS ts:nom
8945c8945
< cat/SMRZ
---
> cat/SMRZ ts:nom
30871c30871
< kitty/SM
---
> kitty/SM ts:nom
33932c33932
< mammal/SM
---
> mammal/SM ts:nom
43289c43289
< pool/MDSG
---
> pool/MDSG ts:nom
44947c44947
< pussy/TRSM
---
> pussy/TRSM ts:nom
------------en_US.aff.diff----------------------
92,95c92,95
< SFX S y ies [^aeiou]y
< SFX S 0 s [aeiou]y
< SFX S 0 es [sxzh]
< SFX S 0 s [^sxzhy]
---
> SFX S y ies [^aeiou]y is:pl
> SFX S 0 s [aeiou]y is:pl
> SFX S 0 es [sxzh] is:pl
> SFX S 0 s [^sxzhy] is:pl
------------------------------------------------

Test with the patched OOo and en_US.dic:
1. See "kitties"+Ctrl-F7 in Writer. Thesaurus dialog shows "polls" and "bets" synonyms instead of "poll" and "bet" in the first meaning "poll".
2. Choose "kitty-cat" meaning in the dialog. It has "pussies", "domestic cats" and "house cats" synonyms instead of "pussy", "domestic cat" and "house cat".
3. Choose "domestic cats" synonym with double click; the shown "house cat" meaning has "house cats", "cats" and "true cats" synonyms instead of "house cat", "cat" and "true cat".

Created attachment 54217
en_US.dic patch
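As a rough illustration of what the patch enables, the following Python sketch applies the four `SFX S` plural rules from the en_US.aff diff to re-inflect synonyms of a plural word. It is a simplified model under stated assumptions, not Hunspell's implementation: `pluralize`, `analyze` and `suggest` are hypothetical helper names, the noun list stands in for the entries tagged `ts:nom`, and the synonym table is toy data.

```python
import re

# The four plural rules of the patched en_US.aff as (strip, add, condition).
# "0" in the affix file means "strip nothing"; here it is the empty string.
PLURAL_RULES = [
    ("y", "ies", r"[^aeiou]y$"),
    ("", "s", r"[aeiou]y$"),
    ("", "es", r"[sxzh]$"),
    ("", "s", r"[^sxzhy]$"),
]

# Stand-in for dictionary entries tagged ts:nom in the patch.
NOUNS = {"bet", "cat", "kitty", "mammal"}

# Toy synonym table (assumption for illustration only).
SYNONYMS = {"kitty": ["bet", "cat"], "cat": ["kitty", "mammal"]}

def pluralize(stem):
    """Apply the first plural rule whose condition matches the end of the stem."""
    for strip, add, condition in PLURAL_RULES:
        if re.search(condition, stem):
            return stem[:len(stem) - len(strip)] + add
    return None

def analyze(word):
    """Return the noun whose generated plural equals `word`, or None."""
    for noun in NOUNS:
        if pluralize(noun) == word:
            return noun
    return None

def suggest(word):
    """Look up synonyms; for a plural input, return the synonyms in plural form."""
    if word in SYNONYMS:
        return list(SYNONYMS[word])
    stem = analyze(word)
    if stem is None:
        return []
    return [pluralize(s) if s in NOUNS else s for s in SYNONYMS.get(stem, [])]

print(suggest("kitties"))  # ['bets', 'cats']
print(suggest("cats"))     # ['kitties', 'mammals']
```

The `is:pl` tag on the affix rules is what lets the same morphological category be regenerated on each synonym, and the `ts:nom` tag marks which entries may take it.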
Created attachment 54218
en_US.aff patch, see Hunspell manual for the morphological notation (http://sourceforge.net/docman/display_doc.php?docid=29374&group_id=143754)

Reassigned for QA

Old Linux test build for dictionary developers: http://hunspell.sourceforge.net/OOo_3.0.0_080603_LinuxIntel_install.tar.gz

Linux test build (generated on Ubuntu 8.04): http://hunspell.sourceforge.net/OOo_3.0.0_080702_LinuxIntel_install.tar.gz

Note: After installation on Ubuntu 8.04, I ran it with the command

LD_LIBRARY_PATH=/usr/lib /opt/openoffice.org3/program/soffice

because of a symbol lookup error (/usr/lib/libcairo.so.2: undefined symbol: FT_Library_SetLcdFilter).

Created attachment 54888
patched en_US dictionary files (no need to apply the previous patches)
The new Windows test build contains Hunspell 1.2.6 with affix condition matching fixes: hunspell.sourceforge.net/Windows080715/en-US.zip

en_GB test word for the affix condition fix: "entertained" (it is accepted by the new build).

The CWS is on target 3.0.1, therefore I changed the issue to the corresponding target.

New test build: ftp://ftp.fsf.hu/OpenOffice.org_hu/devel/

Changes: Affixation of multiple-word expressions is forbidden now.

Steps of the verification:
0. Install attached en_US dictionary.
1. See "kitties"+Ctrl-F7 in Writer. Thesaurus dialog shows "polls" and "bets" synonyms instead of "poll" and "bet" in the first meaning "poll".
2. Choose "kitty-cat" meaning in the dialog. It has "pussies" instead of "pussy".
3. Choose "pussies" synonym with double click, and the "kitty" meaning has a "kitties" suggestion instead of "kitty".

cws at target 3.1 -> issue at target 3.1

Created attachment 58734
Wordlist Hunspell en_US, en_CA spelling and morphological dictionaries
Created attachment 58739
improved suggestions for "astronauts": spacemen, cosmonauts, travelers (screenshot, note: the mostly British "traveller" and its plural form aren't in the en_US spelling and morphological dictionary)
Attached en_US, en_CA spelling and morphological dictionaries

They are extended equivalents of the last Wordlist Hunspell dictionaries (version 2008-12-05). Tested with Hunspell 1.1.12 (OOo 3.0), too. Only ordinal number checking doesn't work with Hunspell 1.1.12. (COMPOUNDRULE didn't handle numerical flags in older Hunspell versions.)

Attached screenshot: morphological dictionary test in the test build (ftp://ftp.fsf.hu/OpenOffice.org_hu/devel/)

Tests for dictionary equivalence:

$ unmunch <(sed -n '24,$p' en_US.dic) en_US.aff | sort | uniq >/tmp/en_US.wordlist
$ cat <(echo badword) /tmp/en_US.wordlist | hunspell -d hunspell-en-morph-20081212/en_US -l
badword badword
$ unmunch <(sed -n '24,$p' en_CA.dic) en_CA.aff | sort | uniq >/tmp/en_CA.wordlist
$ cat <(echo badword) /tmp/en_CA.wordlist | hunspell -d hunspell-en-morph-20081212/en_CA -l
badword

Reverse:

$ awk 'FILENAME~/en_CA_notaliascomp[.]aff$/{if (NF==4){n[$2]=$4; i=0; next};i++;s[$2,i,1]=($3=="0"?0:length($3));s[$2,i,2]=($4=="0"?"":$4);s[$2,i,3]=$5"$"; next}!/\//{print $1;next}{split($1,a,"/");print a[1];l=split(a[2],b,",");for(i=1;i<=l;i++){ m=n[b[i]]; for(j=1;j<=m;j++){if(a[1] ~ s[b[i], j, 3])print substr(a[1], 1, length(a[1])-s[b[i], j, 1]) s[b[i], j, 2]}}}' en_CA_notaliascomp.{aff,dic} | sort | uniq | sed -n '3,$p' >/tmp/en_morf.wl0
$ diff /tmp/en_CA.wordlist /tmp/en_morf.wl0
0a1,22
> 1
> 1st
> 1th
> 2
> 2nd
> 2th
> 3
> 3rd
> 3th
> 4
> 4th
> 5
> 53000
> 5th
> 6
> 6th
> 7
> 7th
> 8
> 8th
> 9
> 9th

(only extra words)

Could someone working on this please create a document or a wiki page explaining what this is all about? I am working with the Slovenian OOo localization team as lead translator and am also working on the Slovenian thesaurus at www.tezaver.si. I do not know what needs to be done for other languages, and I do not know whether the Slovenian spelling and hyphenation dictionaries used in OOo have all the attributes necessary for what you are trying to do. How do I check whether the Slovenian dictionary has the right form? If it does not, what do I need to do with this dictionary, and what form should it use? Etc., etc. So please do explain this to the other localization teams, so that not only English and two or three other languages benefit, but all localization teams can concurrently work on their languages, making OOo better for everyone. Thanks.

Created attachment 58765
Dictionaries, release 2 (fixed morphological codes of comparative affixes)
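For readers who find the awk one-liner in the "Reverse" test hard to follow, here is a rough Python sketch of the same idea: expand each dictionary entry with the suffix rules of its flags, then compare the result against a reference wordlist. It is an approximation under assumptions, not a replacement for unmunch: the function names are hypothetical, flags are treated as single characters (the awk script splits numeric flags on commas instead), and prefixes, cross products and compounding are ignored. The sample data is a toy extract.

```python
import re

def parse_suffix_rules(aff_lines):
    """Collect suffix rules per flag: flag -> list of (strip, add, condition)."""
    rules = {}
    for line in aff_lines:
        fields = line.split()
        # Rule lines: SFX <flag> <strip> <add> <condition> [morphological fields]
        # Header lines (SFX <flag> <Y/N> <count>) have only four fields.
        if len(fields) < 5 or fields[0] != "SFX":
            continue
        _, flag, strip, add, condition = fields[:5]
        strip = "" if strip == "0" else strip
        add = "" if add == "0" else add
        rules.setdefault(flag, []).append((strip, add, re.compile(condition + "$")))
    return rules

def expand(dic_lines, rules):
    """Return the sorted set of stems plus all suffixed forms the rules allow."""
    words = set()
    for line in dic_lines:
        fields = line.split()
        if not fields:
            continue
        stem, _, flags = fields[0].partition("/")
        words.add(stem)
        for flag in flags:  # assumption: one character per flag
            for strip, add, condition in rules.get(flag, []):
                if condition.search(stem):
                    words.add(stem[:len(stem) - len(strip)] + add)
    return sorted(words)

# Toy extract in the style of the patched en_US files.
AFF = [
    "SFX S Y 4",
    "SFX S y ies [^aeiou]y is:pl",
    "SFX S 0 s [aeiou]y is:pl",
    "SFX S 0 es [sxzh] is:pl",
    "SFX S 0 s [^sxzhy] is:pl",
]
DIC = ["bet/S ts:nom", "kitty/S ts:nom", "badword"]

expanded = expand(DIC, parse_suffix_rules(AFF))
print(expanded)  # ['badword', 'bet', 'bets', 'kitties', 'kitty']

# Equivalence check: words present in only one of the two lists.
reference = ["bet", "bets", "kitties", "kitty"]
print(sorted(set(expanded) - set(reference)))  # ['badword']
```

An empty difference in both directions would indicate that the two dictionaries accept the same word forms, which is what the diff in the comment checks.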
filmsi: most of the stemming issues will work with the recent dictionaries. For irregular dictionary items (affixed words), you can use the "st:" field to add the stem (use a tabulator instead of a space for backward compatibility) to the dictionary item:

best st:good

For morphological generation, you need to specify the morphological categories of the affixes and dictionary items by "ds:", "is:", "ts:" fields, or allomorphs by the "al:" items, like in the attached patches. An example for the "al" items:

best st:good is:comp2
better st:good is:comp1
good al:better al:best ts:0

A wiki is a good idea for more explanation. I will use it.

Thanks,
László

The newest versions of the spelling and morphological dictionaries were attached to Issue 97403.

Created attachment 58925
English spelling and morphological dictionary conversion script
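To make the "st:" and "al:" fields from the explanation above more concrete, here is a small Python sketch that parses such dictionary lines and uses them for stemming and generation of irregular forms. It only models the data layout shown in the comment; the function names are hypothetical, and the lookup logic is an illustrative assumption rather than Hunspell's actual algorithm.

```python
def parse_entry(line):
    """Split a dictionary line into (word, flags, fields).

    `fields` maps a field name such as "st", "is" or "al" to its list of values.
    """
    parts = line.split()  # handles both tab and space separators
    word, _, flags = parts[0].partition("/")
    fields = {}
    for item in parts[1:]:
        key, sep, value = item.partition(":")
        if sep:
            fields.setdefault(key, []).append(value)
    return word, flags, fields

# The example entries from the comment above.
ENTRIES = [
    "best st:good is:comp2",
    "better st:good is:comp1",
    "good al:better al:best ts:0",
]

PARSED = {word: fields for word, _, fields in map(parse_entry, ENTRIES)}

def stem_of(word):
    """Return the stem given by the st: field, or the word itself."""
    return PARSED.get(word, {}).get("st", [word])[0]

def generate(stem, category):
    """Find the allomorph of `stem` whose is: field matches `category`."""
    for allomorph in PARSED.get(stem, {}).get("al", []):
        if category in PARSED.get(allomorph, {}).get("is", []):
            return allomorph
    return None

print(stem_of("best"))            # good
print(generate("good", "comp1"))  # better
print(generate("good", "comp2"))  # best
```

With entries like these, a thesaurus lookup on "best" can use the synonyms of "good" and then ask for the "comp2" form of each synonym.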
Created attachment 59630
Test extension (en_US dictionaries, but for the hu_HU locale)
I have attached a test dictionary. It contains en_US dictionaries, but is installed for the hu_HU locale to avoid a collision (it is not possible to switch off a default dictionary extension in the extension manager). The extension contains a full en_US spelling dictionary and a minimal thesaurus for the verification.

Steps of the verification:
1. Install attached extension.
2. Change the document language to Hungarian.
3. See "kitties"+Ctrl-F7 in Writer. Thesaurus dialog shows "polls" and "bets" synonyms instead of "poll" and "bet" in the first meaning "poll".
4. Choose "kitty-cat" meaning in the dialog. It has "pussies" instead of "pussy".
5. Choose "pussies" synonym with double click, and the "kitty" meaning has a "kitties" suggestion instead of "kitty".

Verified in CWS hunspell4thesaurus.

Sorry, should this already work in m40 (i.e. 3.1)? I downloaded a Pavel Janik Slovenian build of m40, but couldn't make it work with the Slovenian thesaurus.

sba -> filmsi: CWS hunspell4thesaurus is not yet nominated/integrated. To track the progress: http://eis.services.openoffice.org/EIS2/cws.ShowCWS?Path=DEV300%2Fhunspell4thesaurus

This issue is closed automatically and wasn't rechecked in a current version of OOo. The fixed issue should have been integrated in OOo for more than half a year. If you think this issue isn't fixed in a current version (OOo 3.1), please reopen it and change the field 'Target Milestone' accordingly.
If you want to download a current version of OOo => http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues => http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues

Sorry, this issue was wrongly closed. It will be reopened automatically and then set back to fixed/verified.

Set to state 'fixed'.

Set back to state 'verified/fixed'. Again. Sorry for the mass of mails.

This issue is closed automatically. It should be fixed in a version which has been available for longer than half a year (OOo 3.1). If you think this issue isn't fixed in the current version (OOo 3.2), please reopen it. But then please pay attention to the field 'target milestone'. The closure was approved by the Release Status Meeting on 22nd of February 2010 and is based on the issue handling guideline for fixed/verified issues: http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues