Re: Fwd: Non-upstream patches for bash (2014)

Eduardo A. Bustamante López
I was looking through this old thread:
http://seclists.org/oss-sec/2014/q3/851

It looks like the issue reported in there is still there:

  dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK bash
  �\
  dualbus@debian:~$ LANG=en_US.UTF8 printf 'echo \u4e57\n' |LANG=en_US.UTF8 bash
  乗
  dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK mksh
  �
  dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK ksh
  �\
  dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK zsh
  �
(In the case that your font doesn't render the glyph for U+4E57, it's:
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e57)

  dualbus@debian:~$ LANG=zh_CN.GBK printf '\u4e57' | od -tx1 -An
   81 5c

It looks like it doesn't detect that \x81\x5c is a single character, and
instead treats the multibyte character as separate characters.

--
Eduardo Bustamante
https://dualbus.me/
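For readers without a GBK locale installed, the byte-level fact behind this report can be checked directly. The sketch below hard-codes the two bytes that od printed above (0x81 0x5c, written as octal escapes so that any POSIX printf accepts them) and compares the second byte with an ASCII backslash. The helper names gbk_u4e57 and hex are made up for this illustration.

```shell
# Hypothetical helpers for illustration; not part of bash or of any tool in this thread.

# The GBK encoding of U+4E57 as reported by od above: 0x81 0x5c.
# Octal escapes are used because they are portable: \201 = 0x81, \134 = 0x5c.
gbk_u4e57() { printf '\201\134'; }

# Print stdin as a compact lowercase hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

gbk_u4e57 | hex; echo      # the two-byte GBK character
printf '\\' | hex; echo    # a plain ASCII backslash
```

This prints 815c and then 5c: the trail byte of the character has the same value as a backslash, which is why a parser that works byte by byte can mistake it for one.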

tetsujin
On Sat, 2017-06-24 at 12:41 -0500, Eduardo A. Bustamante López wrote:

> I was looking through this old thread:
> http://seclists.org/oss-sec/2014/q3/851
>
> It looks like the issue reported in there is still there:
>
>   dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK bash
>   �\
>   dualbus@debian:~$ LANG=en_US.UTF8 printf 'echo \u4e57\n' |LANG=en_US.UTF8 bash
>   乗
>   dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK mksh
>   �
>   dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK ksh
>   �\
>   dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK zsh
>   �
> (In the case that your font doesn't render the glyph for U+4E57, it's:
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4e57)
>
>   dualbus@debian:~$ LANG=zh_CN.GBK printf '\u4e57' | od -tx1 -An
>    81 5c
>
> It looks like it doesn't detect that \x81\x5c is a single character, and
> instead treats the multibyte character as separate characters.
>
I'm not seeing the problem here (at least, not in Bash or ksh - mksh and zsh seem to have gotten it wrong...)

Bash and ksh (in GBK locale) are outputting $'\u4E57' as a two-byte sequence (0x81, 0x5C), and then you're reading that back into bash and ksh
(respectively) under the same locale, and that same two-byte sequence is being retained. If your terminal were in a GBK or GB18030 locale, the
character would be displayed correctly, too.

That said, this does not seem to work as well:
$ LANG=zh_CN.GBK printf "echo \$'%s'" $'\u4e57n' |LANG=zh_CN.GBK bash

The test checks to see whether the 0x5c byte (as part of a multi-byte character) is treated like a backslash in the $'..' quoting syntax - in this
case, it is treated as a backslash, and the '\n' sequence is turned into a newline.
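
The effect described here can be reproduced without a GBK locale, because inside $'..' only the byte values matter. The following sketch builds the one-line script printf %s $'<0x81><0x5c>n' from octal escapes and feeds it to bash under LC_ALL=C. Running under LC_ALL=C is an assumption made so the example works anywhere; the thread reports the same result under zh_CN.GBK. The function name is hypothetical.

```shell
# Sketch: feed bash a script whose $'..' string contains the raw bytes 0x81 0x5c 'n'.
# In the format string, \047 is a single quote, \201 is 0x81, \134 is 0x5c,
# and %% becomes the literal % that the inner printf needs.
dollar_quote_raw() {
    printf 'printf %%s $\047\201\134n\047\n' |
        LC_ALL=C bash |
        od -An -tx1 | tr -d ' \n'
}

dollar_quote_raw; echo
```

If 0x5c were kept as part of the character, the dump would read 815c6e. Instead it reads 810a: the 0x5c byte was taken as a backslash and combined with the following n into a newline.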

Eduardo A. Bustamante López
On Sat, Jun 24, 2017 at 04:46:47PM -0400, George wrote:
[...]
> I'm not seeing the problem here (at least, not in Bash or ksh - mksh and zsh seem to have gotten it wrong...)

You are right. I should get some sleep. FWIW, the original claim is that
having a locale-dependent parser is a problem.

--
Eduardo Bustamante
https://dualbus.me/

Chet Ramey
In reply to this post by Eduardo A. Bustamante López
On 6/24/17 1:41 PM, Eduardo A. Bustamante López wrote:
> I was looking through this old thread:
> http://seclists.org/oss-sec/2014/q3/851
>
> It looks like the issue reported in there is still there:
>
>   dualbus@debian:~$ LANG=zh_CN.GBK printf 'echo \u4e57\n' |LANG=zh_CN.GBK bash
>   �\
>   dualbus@debian:~$ LANG=en_US.UTF8 printf 'echo \u4e57\n' |LANG=en_US.UTF8 bash
>   乗

This shows that if it's a valid character in the current locale, bash will
convert it and read it back.  `printf' takes the Unicode encoding (in this
case, a three-byte character) and runs it through iconv to try to convert
it to a valid multibyte character in the current locale.

>   dualbus@debian:~$ LANG=zh_CN.GBK printf '\u4e57' | od -tx1 -An
>    81 5c
>
> It looks like it doesn't detect that \x81\x5c is a single character, and
> instead treats the multibyte character as separate characters.

It's apparently not a single character in that locale.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/
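
The conversion step described in this message can be approximated from the command line with iconv(1). This is only a sketch of the idea, not bash's internal code path. It assumes an iconv binary that can convert to a charset named GBK (glibc's can when its conversion modules are installed; other implementations may not), and the function name is made up.

```shell
# Sketch: convert the UTF-8 bytes of U+4E57 (e4 b9 97) to GBK with iconv(1).
# Assumes the local iconv supports a target charset called "GBK".
u4e57_to_gbk() {
    printf '\344\271\227' |          # e4 b9 97 written as octal escapes
        iconv -f UTF-8 -t GBK |
        od -An -tx1 | tr -d ' \n'
}

# Only run the conversion if this iconv can actually target GBK.
if printf 'x' | iconv -f UTF-8 -t GBK >/dev/null 2>&1; then
    u4e57_to_gbk; echo
else
    echo 'iconv here does not support GBK; skipping'
fi
```

With glibc's iconv this should print 815c, matching the od output quoted earlier in the thread.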

tetsujin
On Sun, 2017-06-25 at 12:23 -0400, Chet Ramey wrote:

> On 6/24/17 1:41 PM, Eduardo A. Bustamante López wrote:
>
> >
> >   dualbus@debian:~$ LANG=zh_CN.GBK printf '\u4e57' | od -tx1 -An
> >    81 5c
> >
> > It looks like it doesn't detect that \x81\x5c is a single character, and
> > instead treats the multibyte character as separate characters.
> It's apparently not a single character in that locale.
>
Yes it is!

https://en.wikipedia.org/wiki/GBK
\x81 \x5C is a two-byte character from level GBK/3.

But unless I've misunderstood something, it seems to be behaving correctly already. At least, with the exception of within $'..' quotes.

Chet Ramey
On 6/25/17 11:08 PM, George wrote:

> On Sun, 2017-06-25 at 12:23 -0400, Chet Ramey wrote:
>> On 6/24/17 1:41 PM, Eduardo A. Bustamante López wrote:
>>
>>> dualbus@debian:~$ LANG=zh_CN.GBK printf '\u4e57' | od -tx1 -An
>>>  81 5c
>>>
>>> It looks like it doesn't detect that \x81\x5c is a single character, and
>>> instead treats the multibyte character as separate characters.
>>
>>
>> It's apparently not a single character in that locale.
>>
>
> Yes it is!
>
> https://en.wikipedia.org/wiki/GBK
> \x81 \x5C is a two-byte character from level GBK/3.

OK. The terminal emulator I'm using simply doesn't render the glyph.

> But unless I've misunderstood something, it seems to be behaving correctly
> already. At least, with the exception of within $'..' quotes.

It is behaving correctly. $'...' works using bytes.  You can get it to
expand a byte sequence to a multibyte character using \u or \x, but it
works on bytes and always has, just like in C.  Since 0x5c introduces an
escape sequence, that's how it's treated.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/
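
A small sketch of the distinction drawn in this message: escape sequences such as \x inside $'...' are expanded to bytes in a single pass, and the resulting bytes are not interpreted again. The script is passed to bash through a quoted here-document and run under LC_ALL=C (an assumption, so that no GBK locale is needed); the function name is hypothetical.

```shell
# Sketch: \x81 and \x5c are escape sequences here, so the 0x5c byte they produce
# is plain data and does not start a new escape with the following 'n'.
dollar_quote_escaped() {
    LC_ALL=C bash <<'EOF' | od -An -tx1 | tr -d ' \n'
printf %s $'\x81\x5cn'
EOF
}

dollar_quote_escaped; echo
```

This prints 815c6e. By contrast, as described earlier in the thread, a raw 0x5c byte followed by n inside $'...' is read as the escape \n and becomes a newline.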

tetsujin

It seems a strange inconsistency, though: Double-quoted strings (and,
really, pretty much all other Bash syntax as far as I have seen)
recognize 0x81 0x5C as a two-byte character rather than treating 0x5C
as a backslash within the quoting syntax, but $'..' strings
unconditionally treat 0x5C as a backslash...  Is there any reason a
disparity like that would be desirable?
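
To make the two behaviours being contrasted concrete, here is a toy tokenizer in plain POSIX shell. It is not bash's parser. The function scan is hypothetical, it takes bytes as two-digit lowercase hex words, and its gbk mode uses a simplified rule (any byte in 0x81-0xfe is a lead byte that consumes the next byte) rather than the full GBK trail-byte ranges.

```shell
# scan MODE BYTE...
#   MODE "bytes": every 5c byte is a backslash (the byte-wise view).
#   MODE "gbk":   a lead byte 0x81-0xfe swallows the next byte (simplified GBK view).
scan() {
    mode=$1; shift
    out=
    while [ $# -gt 0 ]; do
        b=$1; shift
        if [ "$mode" = gbk ] && [ $((0x$b)) -ge 129 ] && [ $((0x$b)) -le 254 ] && [ $# -gt 0 ]; then
            out="${out}[${b}${1}]"      # lead byte plus trail byte form one character
            shift
        elif [ "$b" = 5c ]; then
            out="${out}<backslash>"     # a standalone 0x5c is a real backslash
        else
            out="${out}[${b}]"
        fi
    done
    printf '%s\n' "$out"
}

scan bytes 81 5c 6e
scan gbk 81 5c 6e
```

The first call prints [81]<backslash>[6e] and the second prints [815c][6e]. The same three input bytes give a different token stream depending on whether the scanner decodes characters first.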

----- Original Message -----
From: [hidden email]
To: "George" <[hidden email]>, "Eduardo_A._Bustamante_López" <[hidden email]>, <[hidden email]>
Cc: <[hidden email]>
Sent: Mon, 26 Jun 2017 11:04:42 -0400
Subject: Re: Fwd: Non-upstream patches for bash (2014)

On 6/25/17 11:08 PM, George wrote:
> On Sun, 2017-06-25 at 12:23 -0400, Chet Ramey wrote:
>> On 6/24/17 1:41 PM, Eduardo A. Bustamante López wrote:
>>
>>> dualbus@debian:~$ LANG=zh_CN.GBK printf '\u4e57' | od -tx1 -An
>>>  81 5c
>>>
>>> It looks like it doesn't detect that \x81\x5c is a single character, and
>>> instead treats the multibyte character as separate characters.
>>
>> It's apparently not a single character in that locale.
>
> Yes it is!
>
> https://en.wikipedia.org/wiki/GBK
> \x81 \x5C is a two-byte character from level GBK/3.

OK. The terminal emulator I'm using simply doesn't render the glyph.

> But unless I've misunderstood something, it seems to be behaving correctly
> already. At least, with the exception of within $'..' quotes.

It is behaving correctly. $'...' works using bytes. You can get it to
expand a byte sequence to a multibyte character using \u or \x, but it
works on bytes and always has, just like in C. Since 0x5c introduces an
escape sequence, that's how it's treated.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU [hidden email] http://cnswww.cns.cwru.edu/~chet/

Chet Ramey
On 6/26/17 12:46 PM, [hidden email] wrote:
>
> It seems a strange inconsistency, though: Double-quoted strings (and,
> really, pretty much all other Bash syntax as far as I have seen) recognize
> 0x81 0x5C as a two-byte character rather than treating 0x5C as a backslash
> within the quoting syntax, but $'..' strings unconditionally treat 0x5C as
> a backslash...  Is there any reason a disparity like that would be desirable?

Because that's how C works, and that's how all the shells that currently
implement it work (and have always worked).

Posix has considered this feature on and off (2010, 2015, 2016)[1], albeit
with a lot of tangents.  If it ever gets standardized, I expect this
question to be resolved. When it does, and if it's required, I'll change
Posix mode to match and then we'll see.

[1] http://austingroupbugs.net/view.php?id=249

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/

tetsujin

That's not a reason why this disparity is "desirable" IMO.

Generally, the rule is that the shell interprets the shell script it's
given according to the character set of the active locale. It may not
allow any given sequence of characters in any given context, but in
terms of how the parser translates the sequence of input bytes into
input characters, it's always in terms of that input locale. Why would
this be true for virtually every facet of the language except for one
particular quoting syntax? The character set of a script is its most
fundamental layer of abstraction, and the shell is generally expected
to respect that abstraction. The idea that 0x5C in this context is
still a "backslash" is a fiction created by mis-interpreting the
script.

I'd argue that having this kind of inconsistency in the parser is
undesirable, and retaining it only causes harm. If other shells also
implement it this way, then those other shells are also broken, and
I'll submit this as a bug to their projects, too. :)

----- Original Message -----
From: [hidden email]
To: <[hidden email]>, "Eduardo_A._Bustamante_López" <[hidden email]>, <[hidden email]>
Cc: <[hidden email]>
Sent: Mon, 26 Jun 2017 14:34:32 -0400
Subject: Re: Fwd: Non-upstream patches for bash (2014)

On 6/26/17 12:46 PM, [hidden email] wrote:
>
> It seems a strange inconsistency, though: Double-quoted strings (and,
> really, pretty much all other Bash syntax as far as I have seen) recognize
> 0x81 0x5C as a two-byte character rather than treating 0x5C as a backslash
> within the quoting syntax, but $'..' strings unconditionally treat 0x5C as
> a backslash... Is there any reason a disparity like that would be desirable?

Because that's how C works, and that's how all the shells that currently
implement it work (and have always worked).

Posix has considered this feature on and off (2010, 2015, 2016)[1], albeit
with a lot of tangents. If it ever gets standardized, I expect this
question to be resolved. When it does, and if it's required, I'll change
Posix mode to match and then we'll see.

[1] http://austingroupbugs.net/view.php?id=249

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU [hidden email] http://cnswww.cns.cwru.edu/~chet/


Chet Ramey
On 6/26/17 4:35 PM, [hidden email] wrote:
>
> That's not a reason why this disparity is "desirable" IMO.

It's not intended to be anything but an explanation of why things are
the way they are: compatibility and backwards compatibility.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/
