Task
Resolution: Unresolved
Normal
Two Unicode characters are competing for the role of hyphen (in words such as style-related or names such as John-Jack Doe): U+002D HYPHEN-MINUS and U+2010 HYPHEN. As there is no practical distinction between them, we should decide on one and implement that choice in the input sanitization code.
Background
In the beginning, there was ASCII, and ASCII only had a single character for “horizontal lines at middle height”. That character was consequently used for the hyphen, the minus sign, and various dashes (sometimes doubled as -- to look more like an en or em dash). When Unicode was created, separate characters were introduced for each role (such as U+2010 HYPHEN, U+2212 MINUS SIGN, U+2013 EN DASH), and the original ASCII character was retained as U+002D HYPHEN-MINUS.
The intention was that, as with the typewriter quotation marks, HYPHEN-MINUS would only be used in legacy texts converted from ASCII and awaiting cleanup, while the appropriate new character would be used in new or updated texts. However, unlike the typewriter quotation marks, U+2010 HYPHEN and U+002D HYPHEN-MINUS look the same, so there was no perceived need to replace a U+002D HYPHEN-MINUS used as a hyphen. (When it was used as, e.g., an en dash, there was of course an incentive to convert it to the appropriate new character.) The de facto standard for representing a hyphen, on the web and elsewhere, has therefore evolved to be U+002D HYPHEN-MINUS.
Discussion
Following the official standard, U+2010 HYPHEN should be used; following the de facto standard, U+002D HYPHEN-MINUS should be used. Using U+2010 HYPHEN may cause interoperability issues with other software (e.g. when copying and pasting text from MB into another application).
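To make the two options concrete, here is a minimal Python sketch of what the sanitization step could look like. It is not the existing sanitization code: the function name `normalize_hyphens` and its `target` parameter are hypothetical, and the default target of U+002D is only an assumption, since this ticket has not decided between the two characters.

```python
HYPHEN_MINUS = "\u002D"  # U+002D HYPHEN-MINUS (the ASCII character)
HYPHEN = "\u2010"        # U+2010 HYPHEN


def normalize_hyphens(text: str, target: str = HYPHEN_MINUS) -> str:
    """Replace the non-chosen hyphen candidate with the chosen one.

    Hypothetical helper: the default target is an assumption, not a decision.
    """
    if target not in (HYPHEN_MINUS, HYPHEN):
        raise ValueError("target must be U+002D or U+2010")
    source = HYPHEN if target == HYPHEN_MINUS else HYPHEN_MINUS
    return text.replace(source, target)


mixed = "style\u2010related, John-Jack Doe"

# Option A: follow the de facto standard (everything becomes U+002D).
as_ascii = normalize_hyphens(mixed)

# Option B: follow the official standard (everything becomes U+2010).
as_unicode = normalize_hyphens(mixed, target=HYPHEN)

# Interoperability illustration: code that looks for the ASCII character
# does not see U+2010, so the word is not split here.
parts = "style\u2010related".split("-")
```

Note that the two directions are not symmetric. Converting U+2010 to U+002D is a plain replacement, whereas converting every U+002D to U+2010 would also rewrite characters that are acting as minus signs or as doubled dashes (--), so option B would need extra context checks that this sketch does not attempt.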