[NEXT] Supplementary Data
[PREVIOUS] Numbers
[CONTENTS]
[EESTI]


8 SORTING

Sorting is the operation most often used in information systems. Different applications use different sorting algorithms. Sorting is based on weights assigned to characters. For Estonian texts, five types of weights are used (see Table 8.1):

AW -- Cultural Alphanumeric Weight. Only letters and digits have this weight. Other characters, e.g. quotation marks, have no AW-weight and are treated in the sorting process as carrying less weight. Capital and small letters have equal weights. Letters with diacritics that are not part of the Estonian alphabet have, as a rule, the same weight as the corresponding letters without diacritics: e.g. à, A, and a all have the AW-weight 91. According to international agreements, the weights 91-132 are reserved for the letters of the Latin alphabet.

DW -- Cultural Diacritic Weight. In Estonian, texts containing characters with diacritics are sorted, to a first approximation, in the same way as texts without diacritics. To order two letters that share the same base letter but carry different diacritics (e.g. ã and à), DW-weights are assigned to the characters. For the Latin alphabet, the following weights are used according to international agreements:

32	No diacritic
33 	Variant
34 	Ligature
35 	Acute
36 	Grave
37 	Breve
38 	Circumflex
39 	Caron, hacek
40 	Overcircle
41 	Diaeresis or umlaut
42 	Diaeresis and acute
43 	Double acute
44 	Tilde
45 	Overdot
46 	Macron
47 	(reserved)
48 	Middle dot
49 	Stroke
50 	Cedilla
51 	Ogonek
52 	Underdot
53 	Underline
54 	(reserved)
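
The DW-weights listed above can be used to order letters that differ only in their diacritic. The following Python sketch is an illustration only: the mapping from Unicode combining marks to the rows of the list is an assumption of this sketch (the source names the diacritics but not their code points), and `diacritic_weight` is a hypothetical helper name. Rows without a single combining mark (variant, ligature, stroke and so on) are left out.

```python
import unicodedata

DW_NO_DIACRITIC = 32

# Assumed correspondence between Unicode combining marks and the DW list above.
DW_BY_COMBINING_MARK = {
    "\u0301": 35,  # acute
    "\u0300": 36,  # grave
    "\u0306": 37,  # breve
    "\u0302": 38,  # circumflex
    "\u030C": 39,  # caron, hacek
    "\u030A": 40,  # overcircle (ring above)
    "\u0308": 41,  # diaeresis or umlaut
    "\u030B": 43,  # double acute
    "\u0303": 44,  # tilde
    "\u0307": 45,  # overdot
    "\u0304": 46,  # macron
    "\u0327": 50,  # cedilla
    "\u0328": 51,  # ogonek
    "\u0323": 52,  # underdot
}


def diacritic_weight(letter):
    """Return the DW-weight of a single letter carrying at most one diacritic."""
    decomposed = unicodedata.normalize("NFD", letter)
    marks = decomposed[1:]
    if not marks:
        return DW_NO_DIACRITIC
    return DW_BY_COMBINING_MARK[marks[0]]


# a-grave (36) is ordered before a-tilde (44).
letters = sorted(["\u00e3", "\u00e0"], key=diacritic_weight)
print([diacritic_weight(ch) for ch in letters])
```

With these weights, à (grave, 36) is placed before ã (tilde, 44) when the two letters are otherwise equal.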
CW -- Cultural Case Weight. For the Estonian language, the standard weights applied in countries using the Latin alphabet are used:

1 	(reserved)
2 	(reserved)
3 	Small letters, digits and remaining characters
4 	Small letters and digits as subscript (lower index)
5 	Small letters and digits as superscript (upper index)
6 	(reserved)
7 	Capitals
8 	Capitals as subscript (lower index)
9 	Capitals as superscript (upper index)
11 	Fraction 1/8
12 	Fraction 1/4
13 	Fraction 3/8
14 	Fraction 1/2
15 	Fraction 5/8
16 	Fraction 3/4
17 	Fraction 7/8
18 	(reserved)
SW -- Cultural Special Weight.

In the Estonian language, letters and digits do not have an SW-weight; only the other characters do. The SW-weight determines the sorting order of texts that include special characters.

SH -- Shared Weight.

For the Estonian language, the SH-weight is close to the AW-weight. There are differences for special characters as well as for the letter B. Consequently, sorting by SH-weights alone does not guarantee a correct order.

For text ordering, four methods can be used. In every case, a sort key is formed on the basis of the text using the weights described above. The order of texts is determined by the lexicographic order of the sort keys.
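
As a small illustration of ordering by sort keys, the following Python sketch sorts two words by the text keys that this section derives for them under method III. The dictionary `TEXT_KEYS` and the helper `sort_by_key` are hypothetical names for this sketch; it relies on the fact that Python `bytes` objects compare lexicographically, byte by byte.

```python
# Text keys quoted from the worked examples under method III in this section.
TEXT_KEYS = {
    "cot\u00e9": bytes.fromhex("5E7077612300"),  # the word "coté"
    "Co-op": bytes.fromhex("5E70707207000307"),
}


def sort_by_key(words, keys):
    """Order words by the lexicographic (byte-by-byte) order of their sort keys."""
    return sorted(words, key=lambda word: keys[word])


ordered = sort_by_key(["cot\u00e9", "Co-op"], TEXT_KEYS)
print(ordered)
```

Here "Co-op" is placed first: the two keys agree in their first two bytes and differ in the third, where X'70' is smaller than X'77'.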

I. Sort key with unique weights.
The method achieves a "quick and dirty" order. Every character is assigned a unique weight, numbered consecutively from 0 (SP) to 190.
II. Sort key with shared weights.
The method guarantees a fairly accurate order. Every character is assigned an SH-weight according to the last column of Table 8.1.
III. Text key.
The method guarantees an order that exactly meets the requirements of the Estonian language. The sort key is built as follows.
  1. An AW-string is formed. The AW-weight of every character is taken from Table 8.1 (column AW); a character without an AW-weight, such as the hyphen, contributes nothing. E.g., the AW-string of the word "Co-op" is X'5E707072', and of the word "coté" X'5E707761'.
  2. A preparatory DW-string is formed. The DW-weight of every character is taken from Table 8.1 (column DW). The preparatory DW-string for the word "Co-op" is X'20202020', and for the word "coté" X'20202023'.
  3. The preparatory DW-string is reversed. For the word "coté" this gives X'23202020'.
  4. Codes are removed from the end of the resulting string, up to the first code that differs from X'20'. The final DW-string for the word "Co-op" is X'', and for the word "coté" X'23'. This step is optional.
  5. A preparatory CW-string is formed. The CW-weight of every character is taken from Table 8.1 (column CW). The preparatory CW-string of the word "Co-op" is X'07030303', and of the word "coté" X'03030303'.
  6. Codes are removed from the end of this string, up to the first code that differs from X'03'. The final CW-string for the word "Co-op" is X'07', and for the word "coté" X''. This step is optional.
  7. An SW-string is formed. The SW-weight of every character is taken from Table 8.1 (column SW). A character that has no SW-weight in the table is skipped. Every SW-weight is preceded by the position of the character in the text. Thus, the SW-string for the word "Co-op" is X'0307' (the character "-" is in position 3), and for the word "coté" X''.
  8. The text key is formed by concatenating the AW-string, the DW-string, the CW-string, X'00', and the SW-string. Thus, for the word "Co-op" the text key is X'5E70707207000307', and for the word "coté" X'5E7077612300'.
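
The eight steps above can be sketched in Python as follows. This is a minimal sketch, not a full implementation: Table 8.1 is not reproduced here, so the weight dictionaries contain only the values that can be read off the two worked examples, and any other character raises `KeyError`. The names `text_key` and `strip_trailing`, and the use of Unicode NFD decomposition to separate a letter from its diacritic, are assumptions of this sketch.

```python
import unicodedata

# Excerpt of weights recovered from the worked examples only (not Table 8.1).
AW = {"c": 0x5E, "e": 0x61, "o": 0x70, "p": 0x72, "t": 0x77}
SW = {"-": 0x07}
DW_NONE, DW_ACUTE = 0x20, 0x23
CW_SMALL, CW_CAPITAL = 0x03, 0x07
COMBINING_ACUTE = "\u0301"  # the only diacritic this sketch recognises


def strip_trailing(weights, filler):
    """Drop filler weights from the end of a list (steps 4 and 6)."""
    end = len(weights)
    while end > 0 and weights[end - 1] == filler:
        end -= 1
    return weights[:end]


def text_key(text):
    """Build the method III text key for a short text (positions below 256)."""
    aw, dw, cw, sw = [], [], [], []
    composed = unicodedata.normalize("NFC", text)
    for position, ch in enumerate(composed, start=1):
        if ch in SW:
            # Step 7: the position of the special character, then its SW-weight.
            sw += [position, SW[ch]]
            continue
        # Assumption: NFD splits a letter into its base and combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        aw.append(AW[base.lower()])  # step 1
        dw.append(DW_ACUTE if COMBINING_ACUTE in decomposed else DW_NONE)  # step 2
        cw.append(CW_CAPITAL if base.isupper() else CW_SMALL)  # step 5
    dw = strip_trailing(dw[::-1], DW_NONE)  # steps 3 and 4
    cw = strip_trailing(cw, CW_SMALL)  # step 6
    return bytes(aw + dw + cw + [0x00] + sw)  # step 8


print(text_key("Co-op").hex().upper())      # 5E70707207000307
print(text_key("cot\u00e9").hex().upper())  # 5E7077612300
```

Both printed keys agree with the text keys given in step 8. Storing the position in a single byte limits this sketch to short texts.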
IV. String key.
For some applications (where special characters, the position of texts in columns, etc. must be taken into account), this method gives a more convenient order. Here the SH-string serves as the primary key, and the SW-string is omitted. The key is built as follows.
  1. An SH-string is formed. The SH-weight of every character is taken from Table 8.1 (column SH). Thus, the SH-string for the word "Co-op" is X'4854055455', and for the word "coté" X'4854594A'.
  2. A preparatory DW-string is formed. The DW-weight of every character is taken from Table 8.1 (column DW). The preparatory DW-string for the word "Co-op" is X'20202020', and for the word "coté" X'20202023'.
  3. The preparatory DW-string is reversed. For the word "coté" this gives X'23202020'.
  4. Codes are removed from the end of the resulting string, up to the first code that differs from X'20'. The final DW-string for the word "Co-op" is X'', and for the word "coté" X'23'.
  5. A preparatory CW-string is formed. The CW-weight of every character is taken from Table 8.1 (column CW). The preparatory CW-string for the word "Co-op" is X'07030303', and for the word "coté" X'03030303'.
  6. Codes are removed from the end of this string, up to the first code that differs from X'03'. The final CW-string for the word "Co-op" is X'07', and for the word "coté" X''.
  7. The SW-string is not formed.
  8. The string key is formed by concatenating the SH-string, X'00', the DW-string, and the CW-string. Thus, for the word "Co-op" the string key is X'48540554550007', and for the word "coté" X'4854594A0023'.
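
The string key can be sketched in the same way. As before, this is only an illustration under stated assumptions: the SH-weights are the handful of values visible in the worked examples rather than the full Table 8.1, `string_key` and `strip_trailing` are hypothetical names, and the sketch assumes that the hyphen contributes to the SH-string but not to the DW- and CW-strings, which is what the four-code preparatory DW-string for "Co-op" suggests.

```python
import unicodedata

# Excerpt of SH-weights recovered from the worked examples only (not Table 8.1).
SH = {"-": 0x05, "c": 0x48, "e": 0x4A, "o": 0x54, "p": 0x55, "t": 0x59}
DW_NONE, DW_ACUTE = 0x20, 0x23
CW_SMALL, CW_CAPITAL = 0x03, 0x07
COMBINING_ACUTE = "\u0301"  # the only diacritic this sketch recognises


def strip_trailing(weights, filler):
    """Drop filler weights from the end of a list (steps 4 and 6)."""
    end = len(weights)
    while end > 0 and weights[end - 1] == filler:
        end -= 1
    return weights[:end]


def string_key(text):
    """Build the method IV string key: SH-string, X'00', DW-string, CW-string."""
    sh, dw, cw = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        sh.append(SH[base.lower()])  # step 1
        if not base.isalnum():
            # Assumption: special characters get no DW- or CW-weight.
            continue
        dw.append(DW_ACUTE if COMBINING_ACUTE in decomposed else DW_NONE)  # step 2
        cw.append(CW_CAPITAL if base.isupper() else CW_SMALL)  # step 5
    dw = strip_trailing(dw[::-1], DW_NONE)  # steps 3 and 4
    cw = strip_trailing(cw, CW_SMALL)  # step 6
    return bytes(sh + [0x00] + dw + cw)  # step 8 (no SW-string, step 7)


print(string_key("Co-op").hex().upper())      # 48540554550007
print(string_key("cot\u00e9").hex().upper())  # 4854594A0023
```

Both printed keys agree with the string keys given in step 8.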
