萬國碼漢字檢字系統改善計畫


Please use this identifier to cite or link to this item: http://140.128.103.80:8080/handle/310901/14616

Title:	萬國碼漢字檢字系統改善計畫
Other Titles:	Enhancement of Unicode Han Character Lookup System
Authors:	林正偉 Lin, Jeng-Wei
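Contributors:	National Science Council, Executive Yuan (行政院國家科學委員會); Department of Information Management, Tunghai University (東海大學資訊管理學系)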
Keywords:	Unicode (萬國碼); Han character lookup (檢字); Han character construction expression (構字式); Han character variant (異體字); meaning items (義項)
Date:	2010
Issue Date:	2012-04-27T03:06:24Z (UTC)
Abstract:	This project continues the work of our 2009 project, the construction of a Unicode Han character lookup system. It aims to provide more convenient lookup methods and related character information, so that users can find the Han character they actually want among the many Han characters in Unicode.

Unicode 3.1 includes 70,195 Han characters, and the repertoire is expected to grow beyond 100,000 in the near future. However, most systems, with their default fonts and input methods, support only the 20,902 Han characters originally encoded in the CJK Unified Ideographs block; the more than 40,000 characters added later are usable on only a few systems. As technology advances, more systems support 32-bit Unicode characters, and thus the Han characters in CJK Unified Ideographs Extensions A and B, and a few font vendors have begun to release fonts covering more than 70,000 Han characters. With suitable fonts installed, users can display every character in the Unicode repertoire, and it is foreseeable that more information processing in the globalized digital world will adopt Unicode as its base character set. Most of the newly added characters are rare, though, and most widely used Chinese input methods cannot enter them; in many cases they can be entered only by code point.

When a user needs a Han character of a particular shape and cannot enter it with an ordinary input method, he or she needs a lookup system to determine whether the character is in Unicode. If it is, the system should return its code point; if it is not, the character is a missing character and must be handled with a missing-character system. The shape of a Han character may differ between regions. Because of Unicode's Han unification rules, a single code point may stand for several regional forms with similar appearances, while a character with several similar variant shapes may have been assigned several code points, one per shape. The same code point may also look different on screen or in print depending on the font in use. As a result, most existing lookup systems require users to understand character encoding and character structure to some degree, and are not suitable for general users. If a character is already in Unicode but the user cannot find it through an input method or a lookup system, the user must fall back on a legacy user-defined character system. This undermines the purpose of adopting Unicode, and the resulting data still faces the interchange and retrieval problems of user-defined characters.

In August 2009 we began building a Unicode Han character lookup system that supports lookup by similar shape. We estimate the similarity of two Han characters by computing the edit distance between their Han character construction expressions, so a user can enter a similar-looking character to find a character whose shape he or she cannot describe exactly. In 2010 we plan to add two lookup methods: (1) lookup by multiple similar components, and (2) lookup through Han character variant relationships. For each character returned, we also plan to show its meaning items and its appearance under different fonts, so that users can more easily verify whether it is the character they want.
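The similarity measure described in the abstract can be illustrated with a minimal sketch. The record does not give the project's construction expression format or its data, so this example assumes each expression is a flat list of tokens (Unicode Ideographic Description Characters such as ⿰ and ⿱ plus component characters) and uses a plain Levenshtein distance. The `EXPRESSIONS` table and the `edit_distance` and `lookup` functions are hypothetical names introduced only for illustration; they are not the project's implementation.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical construction expressions (assumption: IDS-like token lists).
EXPRESSIONS: Dict[str, List[str]] = {
    "明": ["⿰", "日", "月"],
    "朋": ["⿰", "月", "月"],
    "林": ["⿰", "木", "木"],
    "昌": ["⿱", "日", "日"],
    "森": ["⿱", "木", "⿰", "木", "木"],
}


def lookup(query: str, limit: int = 3) -> List[Tuple[str, int]]:
    """Return up to `limit` characters ranked by distance to `query`."""
    q = EXPRESSIONS[query]
    scored = [(ch, edit_distance(q, expr))
              for ch, expr in EXPRESSIONS.items() if ch != query]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:limit]


if __name__ == "__main__":
    for ch, dist in lookup("明"):
        print(ch, dist)
```

With this toy table, a user who can type 明 but is looking for 朋 would see 朋 ranked first, because the two expressions differ in a single component. A real system would need a full expression database and probably component-aware costs rather than unit costs.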
Relation:	Project No.: NSC99-2221-E029-035; Project period: 2010-08 to 2011-07
Appears in Collections:	[Department of Information Management] National Science Council Research Reports

Files in This Item:

File	Description	Size	Format
萬國碼漢字檢字系統改善計畫.PDF		1622Kb	Adobe PDF
