Tuesday, June 4, 2019
Automatic Encoding Detection And Unicode Conversion Engine Computer Science Essay
In computers, characters are represented using numbers. Initially the encoding schemes were designed to support the English alphabet, which has a limited number of symbols. Later the requirement for a worldwide character encoding scheme to support multilingual computing was identified. The solution was to come up with a 16-bit encoding scheme to represent a character so that it can support a large character set. The current Unicode version contains 107,000 characters covering 90 scripts. In the current context, operating systems such as Windows 7 and UNIX-based operating systems, applications such as word processors, and information exchange technologies support this standard, enabling internationalization in the IT industry. Even though Unicode has become the de facto standard, certain applications still use proprietary encoding schemes to represent data. As an example, well-known Sinhala news sites still do not adopt Unicode standard based fonts to represent their content. This causes issues such as the requirement of downloading proprietary fonts and browser dependencies, making the efforts of the Unicode standard in vain. In addition to web site content, there are collections of information included in documents such as PDFs in non-Unicode fonts, making them difficult to search through search engines unless the search term is entered in that particular font encoding.

This has given rise to the requirement of automatically detecting the encoding and transforming the text into the Unicode encoding of the corresponding language, so that the problems mentioned are avoided.
In the case of web sites, a browser plug-in implementation to support automatic non-Unicode to Unicode conversion would eliminate the requirement of downloading legacy fonts, which use proprietary character encodings. Although some web sites provide the source font information, there are certain web applications which do not give this information, making the auto-detection process more difficult. Hence it is required to detect the encoding first, before the text is fed to the transformation process. This has given rise to a research area of auto-detecting the language encoding for a given text based on language characteristics.

This problem will be addressed based on a statistical language encoding detection mechanism. The technique will be demonstrated with support for all the Sinhala non-Unicode encodings. The implementation for the demonstration will ensure that it is an extensible solution, so that support for any other language can be added based on a future requirement.

Since the beginning of the computer age, many encoding schemes have been created to represent various writing scripts/characters for computerized data. With the advent of globalization and the development of the Internet, information exchanges crossing both language and regional boundaries are becoming ever more important. However, the existence of multiple coding schemes presents a significant barrier. Unicode has provided a universal coding scheme, but it has not so far replaced existing regional coding schemes for a variety of reasons. Thus, today's global software applications are required to handle multiple encodings in addition to supporting Unicode.

In computers, characters are encoded as numbers. A typeface is the scheme of letterforms, and the font is the computer file or program which physically embodies the typeface. Legacy fonts use different encoding systems for assigning the numbers for characters.
This leads to two legacy font encodings defining different numbers for the same character. This may lead to conflicts in how characters are encoded in different systems and will require maintaining multiple encoding fonts. The requirement of having a standard for unique character identification was satisfied with the introduction of Unicode. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering.

Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The latest Unicode has more than 107,000 characters covering 90 scripts, presented as a set of code charts. The Unicode Consortium co-ordinates Unicode's development, and the goal is to eventually replace existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes. This standard is being supported in many recent technologies, including programming languages and modern operating systems. All W3C recommendations have used Unicode as their document character set since HTML 4.0. Web browsers have supported Unicode, especially UTF-8, for many years [4, 5].

Sinhala Legacy Font Conversion Requirement for Web Content

Sinhala language usage in computer technology has been present since the 1980s, but the lack of standards in the character representation system resulted in proprietary fonts. Sinhala was added to Unicode in 1998 with the intention of overcoming the limitations of proprietary character encodings. Dinamina, DinaminaUniWeb, Iskoola Pota, KandyUnicode, KaputaUnicode, Malithi Web and Potha are some Sinhala Unicode fonts which were developed so that the numbers assigned to the characters are the same. Still, some major news sites which display Sinhala content have not adopted the Unicode standard.
Legacy font encoding schemes are used instead, causing conflicts in content representation. In order to minimize the problems, font families were created in which only the shape of the characters differs while the encoding remains the same. The FM font family and the DL font family are examples where a font family concept is used to group Sinhala fonts with similar encodings [1, 2].

Adoption of non-Unicode encodings causes a lot of compatibility issues when content is viewed in different browsers and operating systems. Operating systems such as Windows Vista and Windows 7 come with Sinhala Unicode support and do not require external fonts to be installed to read Sinhala script. GNU/Linux distributions such as Debian and Ubuntu also provide Sinhala Unicode support. Enabling non-Unicode applications, especially web content, with support for Unicode fonts will allow users to view content without installing the legacy fonts.

Non Unicode PDF Documents

In addition to content on the web, there exist a large number of government documents which are in PDF format but whose contents are encoded with legacy fonts. Those documents are not searchable through search engines by entering search terms in Unicode. In order to overcome the problem, it is important to convert such documents into a Unicode font so that they are searchable and their data can be used by other applications consistently, irrespective of the font. As another part of the project, this problem will be addressed through a converter tool, which creates a Unicode version of existing PDF documents that are currently in a legacy font.

The Problem

Sections 1.3, 1.4 describe two domains in which non-Unicode to Unicode conversion is required. The conversion involves identifying non-Unicode content and replacing it with the corresponding Unicode content.
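As a rough illustration of the replacement step, the following Python sketch converts legacy-encoded text with a greedy longest-match lookup. It is only a sketch under stated assumptions: the function name `convert_legacy` and the table `HYPOTHETICAL_TABLE` are invented for this example, and the table entries are made up and do not reproduce any real Sinhala legacy font mapping.

```python
def convert_legacy(text, mapping):
    """Replace legacy code sequences with Unicode strings, longest match first."""
    # Try longer keys first so a multi-character legacy sequence wins over
    # its own prefix; empty keys are ignored to avoid an endless loop.
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical table (assumption): NOT the mapping of any real legacy font.
HYPOTHETICAL_TABLE = {
    "k": "\u0d9a",         # Sinhala letter ka
    "ki": "\u0d9a\u0dd2",  # ka followed by the vowel sign i
}

print(convert_legacy("ki k!", HYPOTHETICAL_TABLE))
```

Longest-match-first is one simple way to handle legacy sequences that overlap; a real converter would also need reordering rules for vowel signs that are typed before the consonant in a legacy font.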
The content replacement requires a mapping engine, which performs proper segmentation of the input text and maps it to the corresponding Unicode code. The mapping engine can perform the mapping task only if it knows the source text encoding. In general, the encoding is specified along with the content so that the mapping engine can use it directly. However, in certain cases the encoding is not specified along with the content. Hence detecting the encoding through an encoding detection engine provides a research area, especially for non-Unicode content. In addition, incorporating the detection engine along with a conversion engine is another part of the problem, to address the application areas in 1.3, 1.4.

Project Scope

The system will initially target Sinhala fonts used by local sites. Later the same mechanism will be extended to support other languages and scripts (Tamil, Devanagari).

Deliverables and outcomes

Web service/plug-in for local language web site font conversion, which automatically converts website contents from legacy fonts to Unicode.

PDF document conversion tool to convert legacy fonts to Unicode.

In both implementations, language encoding detection will use the proposed encoding detection mechanism. It can be considered the core of the implementations, in addition to the translation engine which performs the non-Unicode to Unicode mapping.

Literature Review

Character Encodings

Character Encoding Schemes

Encoding refers to the process of representing information in some form. Human language is an encoding system by which information is represented in terms of sequences of lexical units, and those in terms of sound or gesture sequences.
Written language is a derivative system of encoding by which those sequences of lexical units, sounds or gestures are represented in terms of the graphical symbols that make up some writing system.

A character encoding is an algorithm for presenting characters in digital form as sequences of octets. There are hundreds of encodings, and many of them have different names. There is a standardized procedure for registering an encoding. A primary name is assigned to an encoding, and possibly some alias names. For example, ASCII, US-ASCII, ANSI_X3.4-1986, and ISO646-US are different names for the same encoding. There are also many unregistered encodings and names that are used widely. Character encoding names are not case sensitive, and hence ASCII and Ascii are equivalent [25].

Figure 2.1 Character encoding example

Single Octet Encodings

When a character repertoire contains at most 256 characters, assigning a number in the range 0 to 255 to each character and using an octet with that value to represent that character is the simplest and most obvious way. Such encodings, called single-octet or 8-bit encodings, are widely used and will remain important [22].

Multi-Octet Encodings

In multi-octet encodings more than one octet is used to represent a single character. A simple two-octet encoding is sufficient for a character repertoire that contains at most 65,536 characters. Two-octet schemes are uneconomical if the text mostly consists of characters that could be presented in a single-octet encoding. On the other hand, the objective of supporting a universal character set is not achievable with just 65,536 unique codes. Thus, encodings that use a variable number of octets per character are more common. The most widely used among such encodings is UTF-8 (UTF stands for Unicode Transformation Format), which uses one to four octets per character.

Principles of Unicode Standard

Unicode has been used as the universal encoding standard to encode characters in all living languages.
To this end, it follows a set of fundamental principles. The Unicode standard is simple and consistent. It does not depend on states or modes for encoding special characters.

The Unicode standard incorporates the character sets of many existing standards. For example, it includes the Latin-1 character set as its first 256 characters. It includes the repertoire of characters from numerous other corporate, national and international standards as well.

Modern businesses need to handle characters from a wide variety of languages at the same time. With Unicode, a single internationalization process can produce code that handles the requirements of all the world markets at the same time. Data corruption problems do not occur, since Unicode has a single definition for each character. Since it handles the characters for all the world markets in a uniform way, it avoids the complexities of different character code architectures. All modern operating systems, from PCs to mainframes, support Unicode now or are actively developing support for it. The same is true of databases.

There are 10 design principles associated with Unicode.

Universality

Unicode is designed to be universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange. Unicode needs to encompass a variety of essentially different collections of characters and writing systems. For example, it cannot assume that all text is written left to right, or that all letters have uppercase and lowercase forms, or that text can be split into words separated by spaces or other whitespace.

Efficiency

Software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. A fixed character code allows for efficient sorting, searching, display, and editing of text.
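To make the contrast between fixed-width and variable-width forms concrete, the short Python sketch below counts the octets needed for single characters in UTF-8, UTF-16 and UTF-32. The helper name `octet_counts` is an assumption made for this illustration; the big-endian codec variants are used only so that no byte order mark is added to the counts.

```python
def octet_counts(text):
    """Return how many octets `text` occupies in three Unicode encoding forms."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # "-be" avoids a byte order mark
        "utf-32": len(text.encode("utf-32-be")),
    }


# A Latin letter, a Sinhala letter (U+0D9A), and a character outside the
# first 65,536 code points (U+1F600).
for ch in ("A", "\u0d9a", "\U0001F600"):
    print("U+%04X" % ord(ch), octet_counts(ch))
```

UTF-32 always spends four octets, which is what makes indexing trivial, while UTF-8 spends between one and four depending on the character.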
But with Unicode efficiency there are certain tradeoffs, especially in storage requirements, with a fixed-width form needing four octets for each character. Certain representation forms, such as UTF-8, require linear processing of the data stream in order to identify characters. Unicode contains a large number of characters and features that have been included only for compatibility with other standards. This may require preprocessing that deals with compatibility characters and with different Unicode representations of the same character (e.g., a letter as a single character or as two characters).

Characters, not glyphs

Unicode assigns code points to characters as abstractions, not to visual appearances. A character in Unicode represents an abstract concept rather than its manifestation as a particular form or glyph. As shown in Figure 2.2, the glyphs of many fonts that render the Latin character A all correspond to the same abstract character a.

Figure 2.2 Abstract Latin Letter a and Style Variants

Another example is the Arabic presentation form. An Arabic character may be written in up to four different shapes. Figure 2.3 shows an Arabic character written in its isolated form, and at the beginning, in the middle, and at the end of a word. According to the design principle of encoding abstract characters, these presentation variants are all represented by one Unicode character.

Figure 2.3 Arabic character with four representations

The relationship between characters and glyphs is rather simple for languages like English: mostly, each character is presented by one glyph, taken from a font that has been chosen. For other languages, the relationship can be much more complex, routinely combining several characters into one glyph.

Semantics

Characters have well-defined meanings.
When the Unicode standard refers to semantics, it often means the properties of characters, such as spacing, combinability, and directionality, rather than what the character really means.

Plain text

Unicode deals with plain text, i.e., strings of characters without formatting or structuring information (except for things like line breaks).

Logical order

The default representation of Unicode data uses the logical order of data, as opposed to approaches that handle writing direction by changing the order of characters.

Unification

The principle of uniqueness was also applied to decide that certain characters should not be encoded separately. Unicode encodes duplicates of a character as a single code point if they belong to the same script but different languages. For example, the letter denoting a particular vowel in German is treated as the same as the letter in Spanish.

The Unicode standard uses Han unification to unite Chinese, Korean, and Japanese ideographs. Han unification is the process of assigning the same code point to characters historically perceived as being the same character but represented as unique in more than one East Asian ideographic character standard. This results in a group of ideographs shared by several cultures and significantly reduces the number of code points needed to encode them. The Unicode Consortium chose to represent shared ideographs only once because the goal of the Unicode standard was to encode characters independent of the languages that use them. Unicode makes no distinctions based on pronunciation or meaning; higher-level operating systems and applications must take that responsibility. Through Han unification, Unicode assigned about 21,000 code points to ideographic characters instead of the 120,000 that would be required if the Asian languages were treated separately.
It is true that the same character might look slightly different in Chinese than in Japanese, but that difference in appearance is a font issue, not a uniqueness issue.

Figure 2.4 Han Unification example

The Unicode standard allows for character composition in creating marked characters. It encodes each character and diacritic or vowel mark separately, and allows the characters to be combined to create a marked character. It provides single codes for marked characters when necessary to comply with preexisting character standards.

Dynamic composition

Characters with diacritic marks can be composed dynamically, using characters designated as combining marks.

Equivalent sequences

Unicode has a large number of characters that are precomposed forms. They have decompositions that are declared as equivalent to the precomposed form. An application may still treat the precomposed form and the decomposition differently, since as strings of encoded characters they are distinct.

Convertibility

Character data can be accurately converted between Unicode and other character standards and specifications.

South Asian Scripts

The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign [17].

Another designation is preferred in some languages. As an example, in Hindi the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed.
The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.

Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, and from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature.

The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-European languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati languages, though it was also the source for scripts for non-Indo-European languages such as Tibetan, Mongolian, and Lepcha.

The South Indian scripts are also derived from Brahmi and, therefore, share many structural characteristics. These scripts were first used to write Pali and Sanskrit but were later adapted for use in writing non-Indo-European languages, including the Dravidian family of southern India and Sri Lanka.

Sinhala Language

Characteristics of Sinhala

The Sinhala script, also known as Sinhalese, is used to write the Sinhala language, the majority language of Sri Lanka. It is also used to write the Pali and Sanskrit languages.
The script is a descendant of Brahmi and resembles the scripts of South India in form and structure. Sinhala differs from other languages of the region in that it has a series of prenasalized stops that are distinguished from the combination of a nasal followed by a stop. In other words, both forms occur and are written differently [23].

Figure 2.5 Example for prenasalized stop in Sinhala

In addition, Sinhala has separate distinct signs for both a short and a long low front vowel sounding similar to the initial vowel of the English word apple, usually represented in IPA as U+00E6 Latin small letter ae (ash). The independent forms of these vowels are encoded at U+0D87 and U+0D88.

Because of these extra letters, the encoding for Sinhala does not precisely follow the pattern set up for the other Indic scripts (for example, Devanagari). It does use the same general structure, making use of phonetic order, matra reordering, and use of the virama (U+0DCA sinhala sign al-lakuna) to indicate conjunct consonant clusters. Sinhala does not use half-forms in the Devanagari manner, but does use many ligatures.

Sinhala Writing System

The Sinhala writing system can be called an abugida, as each consonant has an inherent vowel (/a/), which can be changed with the different vowel signs. Thus, for example, the basic form of the letter k is ka. For ki, a small arch is placed over the letter. This replaces the inherent /a/ by /i/. It is also possible to have no vowel following a consonant. In order to produce such a pure consonant, a special marker, the hal kirima, has to be added. This marker suppresses the inherent vowel.

Figure 2.6 Character associative Symbols in Sinhala

Historical Symbols. Neither U+0DF4 sinhala punctuation kunddaliya nor the Sinhala numerals are in general use today, having been replaced by Western-style punctuation and Western digits. The kunddaliya was formerly used as a full stop or period. It is included for scholarly use.
The Sinhala numerals are not presently encoded.

Sinhala and Unicode

In 1997, Sri Lanka submitted a proposal for the Sinhala character code at the Unicode working group meeting in Crete, Greece. This proposal competed with proposals from the UK, Ireland and the USA. The Sri Lankan draft was finally accepted with slight modifications. This was ratified at the 1998 meeting of the working group held in Seattle, USA, and the Sinhala code chart was included in Unicode Version 3.0 [2].

It has been suggested by the Unicode Consortium that ZWJ and ZWNJ should be introduced in orthographic languages like Sinhala to achieve the following:

1. ZWJ joins two or more consonants to form a single unit (conjunct consonants).
2. ZWJ can also convert the shape of preceding consonants (cursiveness of the consonant).
3. ZWNJ can be used to disjoin a single ligature into two or more units.

Encoding Auto Detection

Browser and auto-detection

Algorithms designed to auto-detect the encodings of web pages depend on the following assumptions about the input data [24].

Input text is composed of words/sentences readable to readers of a particular language.

Input text is from typical web pages on the Internet and is not in an ancient dead language.

The input text may contain extraneous noise which has no relation to its encoding, e.g. HTML tags, non-native words (e.g. English words in Chinese documents), spaces and other format/control characters.

Methods of auto detection

The paper [24] discusses 3 different methods for detecting the encoding of text data.

Coding Scheme Method

In any of the multi-byte coding schemes, not all possible code points are used. If an illegal byte or byte sequence (i.e. an unused code point) is encountered when confirming a certain encoding, it is possible to immediately conclude that this is not the right guess.
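The elimination idea can be sketched in a few lines of Python. This is a minimal sketch and not the detector from the paper [24]: it assumes Python's built-in strict decoders can stand in for hand-written validators, and the function name `surviving_encodings` and the candidate list are invented for this example.

```python
def surviving_encodings(data, candidates):
    """Return the candidate encodings under which `data` has no illegal byte sequence."""
    survivors = []
    for name in candidates:
        try:
            # Strict decoding raises as soon as an illegal sequence is met.
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # this guess is ruled out immediately
        survivors.append(name)
    return survivors


# The word "Sinhala" written in Sinhala script, encoded as UTF-8.
sample = "\u0dc3\u0dd2\u0d82\u0dc4\u0dbd".encode("utf-8")
print(surviving_encodings(sample, ["ascii", "utf-8", "latin-1"]))
```

A single-octet encoding such as Latin-1 accepts every byte value, so it can never be eliminated this way. That is one reason a coding scheme check alone is not enough for legacy single-octet fonts and statistical evidence is needed as well.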
An efficient algorithm for detecting the character set using the coding scheme through a parallel state machine is discussed in the paper [24].

For each coding scheme, a state machine is implemented to verify a byte sequence for this particular encoding. For each byte the detector receives, it feeds that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. In a typical example, one state machine will eventually provide a positive answer and all others will provide a negative answer.

Character Distribution Method

In any given language, some characters are used more often than other characters. This fact can be used to devise a data model for each language script. This is particularly useful for languages with a large number of characters, such as Chinese, Japanese and Korean. The tests were carried out with data for simplified Chinese encoded in GB2312, traditional Chinese encoded in Big5, Japanese and Korean. It was observed that a rather small set of code points covers a significant percentage of the characters used.

A parameter called the Distribution Ratio was defined and used for the purpose of separating the two encodings.

Distribution Ratio = the number of occurrences of the 512 most frequently used characters divided by the number of occurrences of the rest of the characters.

Two-Char Sequence Distribution Method

In languages that only use a small number of characters, we need to go further than counting the occurrences of each single character. Combinations of characters reveal more language-characteristic information. A 2-char sequence is defined as 2 characters appearing immediately one after another in the input text, and the order is significant in this case.
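Both statistics are simple to compute, as the following Python sketch shows. It is an illustration under stated assumptions, not the implementation from the paper [24]: the function names are invented here, and the set of frequent characters is passed in by the caller, whereas the method described above takes the 512 most frequently used characters from a reference corpus.

```python
from collections import Counter


def distribution_ratio(text, frequent_chars):
    """Occurrences of frequent characters divided by occurrences of all other characters."""
    hits = sum(1 for ch in text if ch in frequent_chars)
    rest = len(text) - hits
    # With no "other" characters the ratio is unbounded.
    return hits / rest if rest else float("inf")


def two_char_sequences(text):
    """Count every ordered pair of adjacent characters in `text`."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


print(distribution_ratio("aabbc", {"a", "b"}))
print(two_char_sequences("abab").most_common())
```

Because the pairs are ordered, "ab" and "ba" are counted separately. A detector would compare such counts for the text decoded under each candidate encoding against counts collected from reference text in the target language.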
Just as not all characters are used equally frequently in a language, the 2-char sequence distribution also turns out to be extremely language/encoding dependent.

Current Approaches to Solve Encoding Problems

Siyabas Script

SiyabasScript is an attempt to develop a browser plug-in which solves the problem of legacy font use in Sinhala news sites [6]. It is an extension to the Mozilla Firefox and Google Chrome web browsers. This solution was specifically designed for a limited number of target web sites which used specific fonts. The solution had the limitation of having to re-engineer the plug-in if a new version of the browser is released. The solution was not global, since it did not have the ability to support a new site which uses a Sinhala legacy font. In order to overcome that, the proposed solution will identify the font and encoding based on the content rather than on the site. There is a chance that the solution might not work if a site decides to adopt another legacy font, as it cannot detect encoding scheme changes. There is a significant delay in the conversion process. The user would notice the content displayed with characters in the legacy font before they get converted to Unicode. This performance delay can also be identified as an area to improve in the solution. The conversion process does not provide an exact conversion, especially when characters need to be combined in Unicode. Certain words can be mentioned as examples of such conversion issues.

The plug-in supports Sinhala Unicode conversion for the sites www.lankadeepa.lk, www.lankaenews.com and www.lankascreen.com. But the other websites mentioned in the paper do not get properly converted to Sinhala with Firefox version 3.5.17.

Aksharamukha Asian Script Converter

Aksharamukha is a South and South-East Asian script converter tool. It supports transliteration between Brahmi-derived Asian scripts.
It also has the functionality to transliterate web pages from Indic scripts to other scripts. The converter scrapes the HTML page, then transliterates the Indic scripts and displays the HTML. There are certain issues in the tool when it comes to alignment with the original web page. Misalignments, missing images and unconverted hyperlinks are some of them.

Figure 2.7 Aksharamukha Asian Script Converter

Corpus-based Sinhala Lexicon

The lexicon of a language is its vocabulary, including higher order constructs such as words and expressions. This can be used as a supporting tool in detecting the encoding of a given text. The corpus-based Sinhala lexicon has nearly 35,000 entries based on a corpus consisting of 10 million words from diverse genres such as technical writing, creative writing and news reportage [7, 9]. The text distribution across genres is given in Table 2.1.

Table 2.1 Distribution of Words across Genres [7]

Genre               Number of words   Percentage of words
Creative Writing    2,340,999         23%
Technical Writing   4,357,680         43%
News Reportage      3,433,772         34%

N-gram-based language, script, and encoding scheme detection

N-gram refers to N-character sequences and is a well-established technique for classifying the language of text documents. The method detects the language, script, and encoding scheme of a target text document by checking how many byte sequences of the target match the byte sequences that can appear in texts belonging to a language, script, and encoding scheme. N-grams are extracted from a string, or a document, by a sliding window that shifts one character at a time.

Sinhala Enabled Mobile Browser for J2ME Phones

Mobile phone usage is rapidly increasing throughout the world as well as in Sri Lanka. It has become the most ubiquitous communication device. Accessing the Internet through the mobile phone has become a common activity, especially for messaging and news items. In J2ME enabled phones, Sinhala Unicode support is yet to be developed.
They do not allow installation of external fonts. Hence those devices will not be able to display Unicode content, especially on the web, until Unicode is supported by the platform. Integrating Unicode viewing support will provide a good chance to carry the technology to remote areas if it can be presented in the native language. If this is facilitated, in addition to the urban crowd, people from rural areas will be able to subscribe to a daily newspaper with their mobile. One major advantage of such an application is that it will provide a phone-model-independent solution which supports any Java enabled phone.

Cillion is a mini browser software which shows Unicode content in J2ME phones. This software is an application developed with the fonts integrated wh