Incorrect unicode escapes

Angus Duggan
Sorry, bashbug didn't work under cygwin...

BASH_VERSION=4.4.12(3)-release
uname -a: CYGWIN_NT-6.1 xxxxxxx 2.8.0(0.309/5/3) 2017-04-01 20:47 x86_64 Cygwin

The function u32toutf16() in lib/sh/unicode.c incorrectly implements
surrogate pairs. \uff08 (Full Width Left Paren) is encoded to the invalid
surrogate pair d7ff df08.

Unicode code points in the range 0xe000-0xffff should be encoded as a single
16-bit code unit.
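
For illustration, the reported values can be reproduced by applying the surrogate-pair arithmetic to a code point below 0x10000. This is only a sketch, not the bash source: surrogate_math() and check_surrogate_math() are hypothetical names, and the sketch assumes the faulty path subtracts 0x10000 from an unsigned 32-bit value without first checking that the value is at least 0x10000.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not from bash: applies the surrogate-pair
   arithmetic unconditionally, as if no range check had been done. */
void
surrogate_math (uint32_t c, uint16_t s[2])
{
  c -= 0x10000;                             /* wraps around when c < 0x10000 */
  s[0] = (uint16_t) ((c >> 10) + 0xd800);   /* high bits lost in the cast */
  s[1] = (uint16_t) ((c & 0x3ff) + 0xdc00);
}

/* Checks the BMP case from this report and one supplementary code point. */
void
check_surrogate_math (void)
{
  uint16_t s[2];

  surrogate_math (0xff08, s);       /* BMP: this arithmetic does not apply */
  assert (s[0] == 0xd7ff && s[1] == 0xdf08);

  surrogate_math (0x1f600, s);      /* supplementary plane: a valid pair */
  assert (s[0] == 0xd83d && s[1] == 0xde00);
}
```

For 0xff08 the subtraction wraps to 0xffffff08, and truncating the results to 16 bits gives exactly d7ff df08.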

To repeat (Windows 64-bit, cygwin):

  export LANG=en_us.UTF-8
  echo $'\uff08' | hexdump -C

This prints:

00000000  ed 9f bf ed bc 88 0a                              |.......|
00000007

This is the UTF-8 encoding of the two 16-bit values 0xd7ff and 0xdf08. It is
not valid UTF-8: surrogate code units must not be UTF-8 encoded.
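
The bytes in the hexdump can be decoded by hand to see which 16-bit values they carry. A minimal sketch follows; decode_utf8_3byte(), is_surrogate() and check_hexdump_bytes() are hypothetical helpers written for this note, and the decoder deliberately skips validity checks so that out-of-range values can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes consecutive 3-byte UTF-8 sequences
   (1110xxxx 10xxxxxx 10xxxxxx) from buf into out, stopping at the first
   byte that does not start such a sequence.  Returns the count decoded. */
size_t
decode_utf8_3byte (const unsigned char *buf, size_t len,
                   uint16_t *out, size_t max)
{
  size_t i = 0, n = 0;

  while (i + 2 < len && n < max
         && (buf[i] & 0xf0) == 0xe0
         && (buf[i + 1] & 0xc0) == 0x80
         && (buf[i + 2] & 0xc0) == 0x80)
    {
      out[n++] = (uint16_t) (((buf[i] & 0x0f) << 12)
                             | ((buf[i + 1] & 0x3f) << 6)
                             | (buf[i + 2] & 0x3f));
      i += 3;
    }
  return n;
}

/* Returns nonzero if v lies in the surrogate range 0xd800-0xdfff. */
int
is_surrogate (uint16_t v)
{
  return v >= 0xd800 && v <= 0xdfff;
}

/* Decodes the bytes shown in the hexdump above. */
void
check_hexdump_bytes (void)
{
  const unsigned char bytes[] = { 0xed, 0x9f, 0xbf, 0xed, 0xbc, 0x88, 0x0a };
  uint16_t out[4];

  assert (decode_utf8_3byte (bytes, sizeof bytes, out, 4) == 2);
  assert (out[0] == 0xd7ff && out[1] == 0xdf08);
  assert (is_surrogate (out[1]));
}
```

The second value, 0xdf08, falls in the low-surrogate range, which is what makes the byte sequence invalid.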

The fix is simple: either add tests for the e000-ffff range, or invert the
test order and add a test for dfff (CAVEAT EMPTOR! THIS IS UNTESTED!):

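    if (c >= 0x10000 && c <= 0x10ffff)
    {
      /* supplementary planes: encode as a surrogate pair */
      c -= 0x10000;
      s[0] = (unsigned short)((c >> 10) + 0xd800);
      s[1] = (unsigned short)((c & 0x3ff) + 0xdc00);
      l = 2;
    }
    else if (c < 0xd800 || (c > 0xdfff && c <= 0xffff))
    {
      /* BMP, excluding surrogates: a single 16-bit code unit */
      s[0] = (unsigned short) (c & 0xFFFF);
      l = 1;
    }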
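
To make the suggestion easier to try out, here is a self-contained sketch with the same two-branch structure. It is not the bash function: the name u32toutf16_sketch(), its signature, and the check function are assumptions made for this note, and the single-unit branch is bounded at 0xffff so that values above 0x10ffff are rejected rather than truncated.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only (hypothetical name and signature).  Writes 1 or 2 UTF-16
   code units to s and returns how many were written; returns 0 for
   surrogate code points and for values above 0x10ffff. */
int
u32toutf16_sketch (uint32_t c, uint16_t s[2])
{
  int l = 0;

  if (c >= 0x10000 && c <= 0x10ffff)
    {
      c -= 0x10000;
      s[0] = (uint16_t) ((c >> 10) + 0xd800);    /* high surrogate */
      s[1] = (uint16_t) ((c & 0x3ff) + 0xdc00);  /* low surrogate */
      l = 2;
    }
  else if (c < 0xd800 || (c > 0xdfff && c <= 0xffff))
    {
      s[0] = (uint16_t) c;                       /* single code unit */
      l = 1;
    }
  return l;
}

/* Exercises the case from this report plus the boundary cases. */
void
check_u32toutf16_sketch (void)
{
  uint16_t s[2];

  assert (u32toutf16_sketch (0xff08, s) == 1 && s[0] == 0xff08);
  assert (u32toutf16_sketch (0x1f600, s) == 2
          && s[0] == 0xd83d && s[1] == 0xde00);
  assert (u32toutf16_sketch (0xd800, s) == 0);    /* lone surrogate */
  assert (u32toutf16_sketch (0x110000, s) == 0);  /* beyond Unicode */
}
```

With this ordering, 0xff08 comes out as the single code unit ff08.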

a.




Re: Incorrect unicode escapes

Chet Ramey
On 6/26/17 9:12 AM, Angus Duggan wrote:
> Sorry, bashbug didn't work under cygwin...
>
> BASH_VERSION=4.4.12(3)-release
> uname -a: CYGWIN_NT-6.1 xxxxxxx 2.8.0(0.309/5/3) 2017-04-01 20:47 x86_64 Cygwin
>
> The function u32toutf16() in lib/sh/unicode.c incorrectly implements
> surrogate pairs. \uff08 (Full Width Left Paren) is encoded to the invalid
> surrogate pair d7ff df08.

Thanks for the report.  This was reported and fixed back in November, 2016
as the result of

http://lists.gnu.org/archive/html/bug-bash/2016-11/msg00039.html

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/
