[NEXT] Supplementary Data
[PREVIOUS] Numbers
[CONTENTS]
[EESTI]
8 SORTING
Sorting is one of the most frequently used operations in information
systems, and different applications use different sorting algorithms.
Sorting is based on weights assigned to characters. For Estonian texts,
five types of weights are used (see Table 8.1):
AW -- Cultural Alphanumeric Weight. Only letters and digits have an
AW-weight. Other characters, e.g. quotation marks, have no AW-weight and
are treated in the sorting process as carrying less weight.
Capital and small letters have equal weights. Letters with
diacritics that are not included in the Estonian alphabet have, as a rule,
the same weight as the corresponding letters without diacritics: e.g. à, A,
and a have the AW-weight 91. According to international agreements, the
weights 91-132 are reserved for the letters of the Latin alphabet.
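The rule that accented letters outside the Estonian alphabet share the weight of their base letter can be sketched in Python. This is only an illustration under assumptions: the helper name `aw_base_letter` is hypothetical, Unicode decomposition stands in for the actual lookup in Table 8.1, and the Estonian letters with diacritics (š, ž, õ, ä, ö, ü) are kept apart because they are separate letters of the alphabet.

```python
import unicodedata

# Letters with diacritics that belong to the Estonian alphabet. They stay
# separate letters (assumption: the real weights come from Table 8.1).
ESTONIAN_OWN = set(unicodedata.normalize("NFC", "šžõäöü"))

def aw_base_letter(ch: str) -> str:
    """Return the letter whose AW-weight `ch` would share (illustrative)."""
    lower = unicodedata.normalize("NFC", ch).lower()
    if lower in ESTONIAN_OWN:
        return lower
    # NFD splits e.g. "à" into "a" + combining grave; keep the base letter.
    return unicodedata.normalize("NFD", lower)[0]

print([aw_base_letter(c) for c in "àAa"])  # ['a', 'a', 'a']
print(aw_base_letter("õ"))                 # õ
```

Capital and small letters collapse to one value here because the AW-weight ignores case. Case is distinguished only by the CW-weight.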
DW -- Cultural Diacritic Weight. In Estonian, texts that include
characters with diacritics are sorted, as a first approximation, in the
same way as texts without diacritics. To compare two letters with
diacritics that have the same base letter (e.g. ã and à), DW-weights have
to be assigned to the characters. For the Latin alphabet, the following
weights are used according to international agreements:
32 No diacritic
33 Variant
34 Ligature
35 Acute
36 Grave
37 Breve
38 Circumflex
39 Caron, hacek
40 Overcircle
41 Diaeresis or umlaut
42 Diaeresis and acute
43 Double acute
44 Tilde
45 Overdot
46 Macron
47 (reserved)
48 Middle dot
49 Stroke
50 Cedilla
51 Ogonek
52 Underdot
53 Underline
54 (reserved)
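The worked examples in this section write weights as hexadecimal codes (X'20', X'23'), while the list above is decimal. A small Python sketch of the correspondence, using a hand-copied excerpt of the list. The names `DW` and `hex_code` are illustrative only.

```python
# Excerpt of the DW list above (decimal weight by diacritic name).
DW = {"No diacritic": 32, "Acute": 35, "Grave": 36, "Tilde": 44}

def hex_code(weight: int) -> str:
    """Format a weight in the X'..' hexadecimal notation of the examples."""
    return f"X'{weight:02X}'"

print(hex_code(DW["No diacritic"]))  # X'20'
print(hex_code(DW["Acute"]))         # X'23'

# Two letters with the same base letter are told apart by DW-weight:
# grave (36) is lower than tilde (44).
print(DW["Grave"] < DW["Tilde"])     # True
```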
CW -- Cultural Case Weight. Estonian uses the standard weights applied
in countries that use the Latin alphabet:
1 (reserved)
2 (reserved)
3 Small letters, digits and remaining characters
4 Small letters and digits as subscript (lower index)
5 Small letters and digits as superscript (upper index)
6 (reserved)
7 Capitals
8 Capitals as subscript (lower index)
9 Capitals as superscript (upper index)
11 Fraction 1/8
12 Fraction 1/4
13 Fraction 3/8
14 Fraction 1/2
15 Fraction 5/8
16 Fraction 3/4
17 Fraction 7/8
18 (reserved)
SW -- Cultural Special Weight. In the Estonian language, letters and
digits do not have an SW-weight; only the other characters do. The
SW-weight determines the sorting order of texts that include special
characters.
SH -- Shared Weight. For the Estonian language, the SH-weight is close to
the AW-weight. The two differ for special characters as well as for the
letter B, so the SH-weight does not guarantee a correct order.
For text ordering, four methods can be used. In every case, a sort key is formed on the
basis of the text using the weights described above. The order of texts is determined by
the lexicographic order of the sort keys.
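Lexicographic comparison of sort keys corresponds to byte-wise comparison, which Python's `bytes` type provides directly. The sketch below uses the two text keys worked out under method III in this section. The variable names are illustrative.

```python
# Text keys for "Co-op" and "coté" as given under method III.
key_coop = bytes.fromhex("5E70707207000307")
key_cote = bytes.fromhex("5E7077612300")

# bytes objects compare lexicographically, byte by byte.
print(key_coop < key_cote)  # True: the third byte X'70' is less than X'77'

words = {"coté": key_cote, "Co-op": key_coop}
print(sorted(words, key=words.get))  # ['Co-op', 'coté']
```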
- I. Sort key with unique weights.
- This method gives a "quick and dirty" order. Every character is
assigned a unique weight, numbered consecutively from 0 (SP) to 190.
- II. Sort key with shared weights.
- This method gives a fairly accurate order. Every character is assigned
an SH-weight according to the last column of Table 8.1.
- III. Text key.
- This method gives an order that exactly meets the requirements of the
Estonian language. The sort key is built as follows.
- A string is formed according to the AW-weights. The AW-weight of
every character is taken from Table 8.1 (column AW). E.g., the
AW-string of the word "Co-op" is X'5E707072' (the hyphen has no
AW-weight and is skipped), for the word "coté" X'5E707761'.
- A preparatory DW-string is formed. The DW-weight of every character
is taken from Table 8.1 (column DW). The preparatory DW-string for the
word "Co-op" is X'20202020', for the word
"coté" X'20202023'.
- The preparatory DW-string is reversed. For the word "coté" this
gives X'23202020'.
- From the resulting string, codes are removed starting from the
end of the string until the first code that differs from X'20'. The final
DW-string for the word "Co-op" is X'', for the word
"coté" X'23'. This step is optional.
- A preparatory CW-string is formed. The CW-weight of every character is
taken from Table 8.1 (column CW). The preparatory CW-string of the
word "Co-op" is X'07030303', for the word
"coté" X'03030303'.
- From the resulting string, codes are removed starting from the
end of the string until the first code that differs from X'03'. The final
CW-string for the word "Co-op" is X'07', for the word
"coté" X''. This step is optional.
- A string is formed according to the SW-weights. The SW-weight of
every character is taken from Table 8.1 (column SW). If a
character does not have an SW-weight in the table, the
character is excluded. Every SW-weight in the string is preceded by the
position (order) of the character in the text. Thus, the SW-string for the
word "Co-op" is X'0307' (the character "-" in position
3), for the word "coté" X''.
- The text key is formed by concatenating the AW-string, DW-string,
CW-string, X'00' and SW-string. Thus, for the word "Co-op" we have the
text key X'5E70707207000307', for the word "coté"
X'5E7077612300'.
- IV. String key.
- For some applications (where special characters, the location of texts in
columns, etc. are to be considered), this method gives a more convenient
order. Here the SH-string serves as the first key, and the SW-string is
omitted. The key is built as follows.
- A string is formed in accordance with the SH-weights. The SH-weight
of every character is taken from Table 8.1 (column SH). Thus, the
SH-string for the word "Co-op" is X'4854055455', for the word
"coté" X'4854594A'.
- A preparatory DW-string is formed. The DW-weight of every character
is taken from Table 8.1 (column DW). The preparatory DW-string for
the word "Co-op" is X'20202020', for the word "coté"
X'20202023'.
- The preparatory DW-string is reversed.
- From the resulting string, the codes starting from the end until
the first code that differs from X'20' are removed. The final DW-string
for the word "Co-op" is X'', for the word "coté"
X'23'.
- A preparatory CW-string is formed. The CW-weight of every character is
taken from Table 8.1 (column CW). The preparatory CW-string for the
word "Co-op" is X'07030303', for the word "coté"
X'03030303'.
- From the resulting string, the codes starting from the end until
the first code that differs from X'03' are removed. The final CW-string
for the word "Co-op" is X'07', for the word
"coté" X''.
- The SW-string is not formed.
- The string key is formed by concatenating the SH-string, X'00',
DW-string and CW-string. Thus, for the word "Co-op" we get the
string key X'48540554550007', for the word "coté"
X'4854594A0023'.
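The steps for methods III and IV can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `TABLE` excerpt is hypothetical and contains only the weights that can be read off the worked examples for "Co-op" and "coté" (the full values belong to Table 8.1), all function and class names are invented for the sketch, and character positions are assumed to fit in one byte.

```python
from typing import NamedTuple, Optional

class Weights(NamedTuple):
    aw: Optional[int]  # alphanumeric weight
    dw: Optional[int]  # diacritic weight
    cw: Optional[int]  # case weight
    sw: Optional[int]  # special weight
    sh: int            # shared weight

# Hypothetical excerpt of Table 8.1, inferred from the worked examples only.
TABLE = {
    "C": Weights(0x5E, 0x20, 0x07, None, 0x48),
    "c": Weights(0x5E, 0x20, 0x03, None, 0x48),
    "o": Weights(0x70, 0x20, 0x03, None, 0x54),
    "p": Weights(0x72, 0x20, 0x03, None, 0x55),
    "t": Weights(0x77, 0x20, 0x03, None, 0x59),
    "\u00e9": Weights(0x61, 0x23, 0x03, None, 0x4A),  # e with acute
    "-": Weights(None, None, None, 0x07, 0x05),
}

def _column(text: str, field: str) -> bytes:
    """Collect one weight column, skipping characters without that weight."""
    values = (getattr(TABLE[ch], field) for ch in text)
    return bytes(v for v in values if v is not None)

def dw_string(text: str) -> bytes:
    # Reverse the preparatory DW-string, then drop trailing X'20' codes.
    return _column(text, "dw")[::-1].rstrip(b"\x20")

def cw_string(text: str) -> bytes:
    # Drop trailing X'03' codes from the preparatory CW-string.
    return _column(text, "cw").rstrip(b"\x03")

def sw_string(text: str) -> bytes:
    # Each SW-weight is preceded by the character's 1-based position
    # (assumption: positions fit in one byte).
    out = bytearray()
    for pos, ch in enumerate(text, start=1):
        sw = TABLE[ch].sw
        if sw is not None:
            out += bytes([pos, sw])
    return bytes(out)

def text_key(text: str) -> bytes:
    """Method III: AW-string, DW-string, CW-string, X'00', SW-string."""
    return (_column(text, "aw") + dw_string(text) + cw_string(text)
            + b"\x00" + sw_string(text))

def string_key(text: str) -> bytes:
    """Method IV: SH-string, X'00', DW-string, CW-string."""
    return _column(text, "sh") + b"\x00" + dw_string(text) + cw_string(text)

for word in ("Co-op", "cot\u00e9"):
    print(word, text_key(word).hex().upper(), string_key(word).hex().upper())
# Co-op 5E70707207000307 48540554550007
# coté 5E7077612300 4854594A0023
```

The printed keys match the values given in the text. A character missing from the excerpt raises `KeyError`, which is acceptable for a sketch but would need the complete table in practice.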