Jardic Pro - Dictionary Implementation Details

Dictionary File Structure

Dictionary File Logical Structure

Jardic Pro dictionary file consists from the set of article records that can be referenced from different indexes. A size of one article record is unlimited. Format of one article record can vary depending on its origination (Jardic, EDICT, DSL, Eijiro and so on).

Indexes are used to store lists of keys. From the user point of view they are lists of words. Index entries point to records. Keys could be complex and they could consist from subkeys like: <reading><word><priority>.

Jardic Pro supports following indexes:

Index of all records with a key: <record_position>
Index of supplemental records with a key: <record_type>
Index of record sequential numerswith a key: <record_number>
Index of book record sequence with a key: <record_position><previous_record><next_record>
Index of readings with a key: <reading><kanji_word>
Index of kanji with a key: <kanji_word><reading>
Index of words with a key: <word><weight>
Lexicon. List of all words and kanji with a key: <word_or_kanji>
List of word and kanji usage with a key: <number_in_lexicon>
Index of latin word stems with a key: <stem>

Subkeys can have variable length. Indexes are implemented in a form of multilevel balanced trees. Jardic Pro indexes have a feature that allow fast search of keys using sequential numbers of key values. Jardic Pro uses this feature to display lists of words, to scroll lists of words, and to emulate virtual merged indexes for multiple dictionary files.

Low-level Organization of Dictionary File

Jardic Pro dictionary file consists from pages. Following types of pages are supported:

File Header page
Page Allocation Table pages
Data pages
Index pages

Pages are identified by numbers. Relationship between page number and physical placement of a page is defined using Page Allocation Table (PAT). Pages of one file have fixed length. Data pages and index pages could be compressed. Due to page compression positions of pages in the file are not easily calculated. Jardic Pro uses page numbers and PAT to convert virtual data pointers into the pair of values: position of compressed page, and data displacement on the page.

Data Management System

Jardic Pro Data Management System (DMS) performs functions that are typical to Index Management Systems.

Page Access

Jardic Pro DMS is implemented as a 2-tier system. Low-level tier supports an access to pages using specified page numbers. High-level tier supports external requests coming from application program.

Low-level tier supports following functions:

Calculation of page location
Compression and uncompression of pages
Page caching
Management of free space

DMS External functions

Jardic Pro DMS external functions support dictionary data search and data update. DMS functions could be divided into following groups:

Management
Search in indexes
Reading of data
Key insertion
Bulk key insertion
Update of keys and records
Key comparison

Following are the lists of functions for separate groups.

Management: Connect, Disconnect, CreateCursor, DestroyCursor, UseRecord, UseIndex, Compress, SetProgressProc and some other.

Search in indexes: FindEQ, FindGE, FindLE, FindLT, FindGT, FindFirst, FindLast, FindNext, FindPrevious, FindByCounter, FindByPercentage. Results of search functions are: record pointer (article pointer), and a sequence number of a found key value in the index.

Reading of data: GetRecord.

Key insertion: InsertRecord, InsertKey, BeginBulkInsertKey, EndBulkInsertKey, StopBulkInsertKey.

Key comparison: CompareKeys, GetCollator, GetCollatorVersion, CreateCollator0, CreateCollator1. Key comparison functions are exported from DMS to application program, because the application program should use the same string comparison functions as DMS.

Jardic Pro 4.3 does not support update functions. They are not currently implemented because there is no user demand, and due to the complexity of full-text search index update. Future version of Jardic Pro could have implementation of this set of function for some types of dictionaries.

Jardic Pro full text search indexes are stored in dictionary files. Depending on dictionary size such indexes could contain up to some millions of entries. To build large dictionary indexes Jardic Pro uses optimized bulk insert functions. Our bulk insert functions does not use preliminary data sorting (e.g. like in Microsoft SQL Server). That is due to the fact the time of preliminary data sorting could be comparable to the time of pure key insertion into the index. Instead of that, Jardic Pro uses optimization based on on-the-fly calculation of key distribution histogram. The histogram intervals are set to such a value that the volumes of key values hit into one interval are comparable to DMS memory cache size. Jardic Pro bulk insert functions support insertion of about 0.7 million keys per 1 minute for 1.6 GHz CPU. Such insertion includes index filling, index page compression, and writing index pages on HDD.

Sorting

DMS functions that work with indexes intensively use key comparison and key sorting functions. Sorting of keys with text strings is based on the internal implementation of Unicode Collation Algorithm (UCA) and Default Unicode Collation Element Table (DUCET). This implementation supports correct "dictionary" sorting of words with small and capital letters, diacritics and with internal punctuation (apostrophes, dashes and so on).

Comparison of text strings according to UCA is implemented in following steps:

Conversion of UTF-16 Unicode characters into array of Unicode Points
Creation of arrays of Collation Elements (CE)
Variable weights handling
Comparison of CE arrays

At the first step (conversion to Unicode Points) the program performs "tailoring": it handles Unicode surrogates, replaces Japanese repetition characters to proper kana and kanji characters, decomposes Hangul into Jamo and so on. The program does not perform tailoring for Western languages because it builds common word lists for all Western languages and "correct" collation for them is contradictory.

3-level Collation Elements (CE's) are built using Default Unicode Collation Element Table (DUCET).

For correct sorting of text string with blanks, dashes, apostrophes and other punctuation characters the program uses CE's with variable weights.

Dictionary Shell

Under the term "dictionary shell" we mean the program that interacts with a user. It is the dictionary shell that is accepted by the user as "Jardic Pro". Dictionary shell performs following base functions:

Loading of dictionaries and emulation of virtual merged lists of words
Displaying of multicolumn lists of words
Displaying of dictionary articles
Positioning of lists of words according to entered text
Finding word translation under the cursor on-the-fly
Import and conversion of dictionaries

Jardic Pro can work with multiple opened dictionaries simultaneously. A user can see merged lists for all dictionaries opened at the moment. Merged lists of words are virtual, i.e., they are not built in the memory or elsewhere, but they are emulated algorithmically.

All the following word lists displayed in Jardic Pro are virtual:

Lists of words by reading
Lists of words by kanji
Lists of usual (Western) words
Lists of all words for full text search
Lists of all kanji substrings for full text search

One of Jardic Pro features is its ability to display lists of words with multiple columns. For example, a list of words by reading contains 3 columns: Reading, Kanji, and Article (first characters). To display such lists the program issues DMS requests to search keys in indexes and to read articles. Due to effective DMS implementation, DMS performs such requests quickly enough so that manual scrolling of lists with millions of words runs without noticeable delay.

In one of its windows Jardic Pro displays dictionary article for the current list item. Jardic Pro can work with imported dictionaries containing articles in different format. To display articles of dictionaries with different format the program uses separate formatting functions (e.g., for EDICT, DSL, Eijiro and so on).

The main function of each electronic dictionary is searching for entered words. This function is implemented in Jardic Pro by performing a search in virtual lists that are common for all opened dictionaries.

Jardic Pro can search for translation of words under the cursor on-the-fly in Microsoft Word, Internet Explorer, HTML Help and some other programs. The access to MS Word objects is implemented using COM Automation, specifically through Running Object Table (ROT). An access to Internet Explorer and HTML objects is implemented using IHTMLDocument2 interface obtained through Microsoft Active Accessibility (MSAA).

When translating words on-the fly Jardic Pro analyzes a type of text under the cursor. If the text contains kanji, the program tries to find translation for words starting from current kanji under the cursor and with some next characters. If a text under the cursor contains kana words, the program converts word endings to "normal" form. Then the program repeats those steps for the next character to the left, then one more character to the left and so on. Search results (found word translations) are accumulated. Jardic Pro displays found word with maximum length, but it can show all intermediate results including translation of separate kanji characters.

If the text under the cursor does not contain kanji the program looks for word boundaries at the left and at the right position of the cursor. Search for word boundaries is implemented using Unicode default boundary search algorithm. Then the program searches for translation of the selected word limited by found boundaries.

Import of Dictionaries

When importing dictionaries Jardic Pro performs following steps:

Loads dictionary articles
Builds the index of words by reading
Builds the index of words by kanji
Builds the index of usual (Western) words
Builds the index of all words for full text search
Builds the index of all kanji substrings for full text search

Depending on the type of imported dictionary some of steps could be skipped. For example, the program does not build lists of kanji when importing Western dictionaries. Jardic Pro builds indexes of all words and all kanji by default, but for some dictionaries (e.g. for Chinese-Chinese dictionaries) the size of all kanji index can be very large (10-100 millions of entries). When some indexes could be huge, the program advices a user to refuse from them.

Jardic pro builds dictionary indexes using DMS bulk insert functions: BeginBulkInsertKey, InsertKey, EndBulkInsertKey. The program extracts words from articles for full text search indexes using Unicode default boundary search algorithm.

Jardic Pro is written in C++. Source code consists from more than 160,000 lines.

Jardic Pro Implementation Details