Technical detail

Sorting and Grouping

If you want to create or fix sort_order_xx.py, read this section.

Inside sort_order_xx.py, internal sort keys are generated so that they can be sorted in Unicode code point order. Make the function get_string_to_sort return the internal sort key.

Once the internal sort keys have been sorted in Unicode code point order, the grouping of the terms should also be complete. For example, if you want the symbols to form one group instead of appearing on both sides of the US-ASCII letters, the function should return the string with a prefix such as unichr(127). Because the US-ASCII letters lie within 0x41-0x7A, all symbols then sort after the English letters.
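The prefixing idea can be sketched as follows. This is only a minimal illustration in Python 3, where chr(127) plays the role of unichr(127). The one-argument signature of get_string_to_sort and the rule "a symbol is any first character that is not alphanumeric" are assumptions made for this example, not the actual implementation.

```python
# Assumption: U+007F is used as the group prefix for symbols.
SYMBOL_PREFIX = chr(127)  # written unichr(127) in Python 2


def get_string_to_sort(term):
    """Return an internal sort key for `term` (hypothetical signature)."""
    # Assumption: a term is a "symbol" when its first character is
    # neither a letter nor a digit.
    if term and not term[0].isalnum():
        return SYMBOL_PREFIX + term
    return term


terms = ["~tilde", "apple", "_under", "Zebra", "!bang"]

# Plain code point order scatters the symbols around the letters.
plain_order = sorted(terms)
# → ['!bang', 'Zebra', '_under', 'apple', '~tilde']

# With the internal sort key, all symbols land after the letters.
grouped_order = sorted(terms, key=get_string_to_sort)
# → ['Zebra', 'apple', '!bang', '_under', '~tilde']
```

This sketch only groups symbols relative to US-ASCII letters. Non-ASCII letters keep their own code points and would still sort after the prefixed symbols.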

The internal sort keys don't have to be human-readable, since they are used only internally. Of course, readable keys make debugging easier.

Canonical strings do not sort in a human-acceptable order

Consider any language other than English and look at the order of its characters in the Unicode tables!

  • Characters with diacritical marks are all placed after the ones without diacritical marks.
  • CJK(V) ideographs are sorted by radical.
    • But Japanese words are generally sorted by their Yomigana.
    • There are several CJK areas: Unified Ideographs, Extensions A to D, and Compatibility Ideographs.
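The first point is easy to reproduce. The snippet below is a small illustration in Python 3, not part of sort_order_xx.py. The helper strip_marks is a hypothetical name, and dropping combining marks is just one simple way to build a friendlier key.

```python
import unicodedata

# "\u00e9tude" is "étude" with a precomposed e-acute (U+00E9).
words = ["zebra", "\u00e9tude", "eagle"]

# Plain code point order: U+00E9 sorts after every US-ASCII letter.
by_code_point = sorted(words)
# → ['eagle', 'zebra', 'étude']


def strip_marks(word):
    """Hypothetical key: decompose the word and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# With the marks removed from the key, "étude" sorts among the e-words.
by_base_letter = sorted(words, key=strip_marks)
# → ['eagle', 'étude', 'zebra']
```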

Using Unicode characters directly in HTML

With HTML 4, only US-ASCII can be used in a URL, but an appendix of the specification explains how to convert Unicode characters to a percent-encoded string[HTML4.01Appendix_B2].

With HTML5, almost the same method seems to apply, as defined in the current specification[HTML5URL].

Otherwise, we could not use our own characters to point to any Wikipedia article outside the English version!
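As a rough illustration of that conversion, the Python 3 standard library can percent-encode the UTF-8 bytes of a string. The article title and the URL below are only examples chosen for this sketch.

```python
from urllib.parse import quote, unquote

title = "日本語"  # example article title

# quote() encodes the text as UTF-8 and percent-escapes each byte.
encoded = quote(title)
# → '%E6%97%A5%E6%9C%AC%E8%AA%9E'

# The encoded form is pure US-ASCII, so it is safe to put in a URL.
url = "https://ja.wikipedia.org/wiki/" + encoded

# unquote() reverses the conversion.
decoded = unquote(encoded)
```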

Yomigana vs. Sort-keys and Ruby Annotation

Yomigana and Ruby Annotation are very similar, but there are a few differences between them.

Yomigana is required for ALL words written in CJK ideographs so that they can be sorted. Ruby, in contrast, is generally required ONLY for words that are very rare or difficult to read.

They might be unified. In that case, some Ruby/Yomigana would be shown and the rest hidden, depending on the intent of the writer of each document.

On the other hand, the function of Yomigana is closer to the sort key feature implemented in MediaWiki, which is used by Wikipedia and several other projects. But the two are not completely the same, because the sort order of the sort key feature is limited to Unicode code point order.
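To make the role of Yomigana as a sort key concrete, here is a small Python 3 sketch. The (word, reading) pairs are sample data invented for this example. Sorting the readings by code point is itself a simplification that ignores details such as voiced or small kana.

```python
# Each entry pairs a word with its Yomigana (sample data, hypothetical).
entries = [
    ("漢字", "かんじ"),
    ("東京", "とうきょう"),
    ("大阪", "おおさか"),
]

# Sorting the words themselves follows the ideograph code points.
by_code_point = sorted(word for word, _ in entries)
# → ['大阪', '東京', '漢字']

# Sorting by the reading gives the order a Japanese reader expects.
by_yomigana = [word for word, _ in sorted(entries, key=lambda e: e[1])]
# → ['大阪', '漢字', '東京']
```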