Quantcast

readline display bug with UTF-8 and char insertion

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

readline display bug with UTF-8 and char insertion

Egmont Koblinger-2
Hi,

On a fully UTF-8 environment, in certain easily reproducible cases, the
command line becomes messed up (the cursor ends up in the wrong column, and
from then on it's tons of cumulative mistakes).  I'm using bash-4.2.42, but
the bug also affects other readline-based applications.

To reproduce:  Save the attached file as your .bash_history.  (Notice the
UTF-8 accented characters in the middle line.  Also notice the trailing
space in each line.)  Use either the Up arrow, or enter a common prefix of
the lines and press a hotkey bound to history-search-backward.  You'll see
the command line getting messed up.

Please read on for my findings on what goes on, and my proposed fix.

The bug occurs only when the readline's insert_some_chars() method (the one
which inserts columns pushing some characters to the right) is involved.
It also needs UTF-8 characters to be present.

My first finding was that display.c line 1684, namely:
  twidth = _rl_col_width (nfd+lendiff, 0, temp-col_lendiff, 1);
is clearly broken.  _rl_col_width() converts logical byte offsets into
columns, and it is passed a column where a byte offset should be present,
this is the typical case of "code smell", semantic incorrectness.  Changing
the argument from "temp-col_lendiff" to "temp-lendiff" fixes the case where
you go back from the last history entry to the previous; however it breaks
going back from that one to the first :)

My second finding, trying to fix the newly introduced bug, was that the
concept of the insert_some_chars() is broken in UTF-8 environment.  Here's
why.

We know lendiff (the growth of the command line in bytes) and col_lendiff
(the growth in columns).  However, the code tries to print the new text in
two runs: first the segment that corresponds to the newly created columns
(this happens within insert_some_chars()), and then the ones that overwrite
the old characters (in update_line()).  In order to do that correctly we
would need to know how many bytes would produce the given number of
columns, (kinda the opposite of _rl_col_width()), but we have no method for
that, so no wonder the code does it wrong.  It's not even possible to have
such a method, e.g. when simple English letter is be replaced by a CJK
glyph then coldiff==1, so you shift the remaining characters to the right
by 1, and you'd need to print the left half and the right half of that CJK
character in separate steps.  Also note that another problem with the
source line quoted above is that nfd+lendiff might point to the middle of a
UTF-8 character, and computing width from there doesn't make sense.

Hence the fix would be to abandon the concept of printing new characters in
two runs, separately for the new columns and separately for overwriting the
old ones.  The correct behavior can only be implemented if, after
allocating the proper amount of columns, all the differing characters are
printed in a single run.  This is what my proposed patch does.
insert_some_chars() is made simpler: it only inserts new colums, but
doesn't print any text, that responsibility is given back to its caller
update_line() which prints the whole stuff at the location where it used to
print the first part only.  Printing the second run is then simply removed.

Please see the attached proof-of-concept patch.  I've stress-tested it
against my actual entries in my real .bash_history, with many mp3 filenames
containing accented letters (and trailing space due to completion), and it
seems to work correctly.

Note that my patch does *not* address the non-UTF-8 branch, I believe I
don't have the knowledge and understanding of the source to test all
possible scenarios.  The three "else" branches starting at line 1645 also
have to be modified to print the whole difference, not just the part
filling the new columns.  I'm kindly asking you to do this piece of work, I
guess it should be very easy for you based on my findings, and I'm sure
you'll want to make cosmetic changes to my code anyway.

Thanks very much,

egmont

.bash_history (94 bytes) Download Attachment
bash-4.2.42-cursor-position-proof-of-concept.patch (4K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: readline display bug with UTF-8 and char insertion

Egmont Koblinger-2
Hi Chet,

Friendly ping - could you please look at this bug and patch, too?  I know
it's bit more complicated, and I haven't finished the non-UTF part, but
based on my findings I believe this should be really straightforward for
you.

I know bash's UTF-8 support is contributed code and is quite complex, and I
know it improved a lot since the beginning, but unfortunately there are
still some problems with it, and it really frustrates me that handling
accented characters still causes problems in 2013, I'd really love to see
these problems disappear for good.  Please let me know if there's anything
more I could do to get this one fixed.

thanks,
egmont

On Mon, Jan 21, 2013 at 9:16 PM, Egmont Koblinger <[hidden email]> wrote:

> Hi,
>
> On a fully UTF-8 environment, in certain easily reproducible cases, the
> command line becomes messed up (the cursor ends up in the wrong column, and
> from then on it's tons of cumulative mistakes).  I'm using bash-4.2.42, but
> the bug also affects other readline-based applications.
>
> To reproduce:  Save the attached file as your .bash_history.  (Notice the
> UTF-8 accented characters in the middle line.  Also notice the trailing
> space in each line.)  Use either the Up arrow, or enter a common prefix of
> the lines and press a hotkey bound to history-search-backward.  You'll see
> the command line getting messed up.
>
> Please read on for my findings on what goes on, and my proposed fix.
>
> The bug occurs only when the readline's insert_some_chars() method (the
> one which inserts columns pushing some characters to the right) is
> involved.  It also needs UTF-8 characters to be present.
>
> My first finding was that display.c line 1684, namely:
>   twidth = _rl_col_width (nfd+lendiff, 0, temp-col_lendiff, 1);
> is clearly broken.  _rl_col_width() converts logical byte offsets into
> columns, and it is passed a column where a byte offset should be present,
> this is the typical case of "code smell", semantic incorrectness.  Changing
> the argument from "temp-col_lendiff" to "temp-lendiff" fixes the case where
> you go back from the last history entry to the previous; however it breaks
> going back from that one to the first :)
>
> My second finding, trying to fix the newly introduced bug, was that the
> concept of the insert_some_chars() is broken in UTF-8 environment.  Here's
> why.
>
> We know lendiff (the growth of the command line in bytes) and col_lendiff
> (the growth in columns).  However, the code tries to print the new text in
> two runs: first the segment that corresponds to the newly created columns
> (this happens within insert_some_chars()), and then the ones that overwrite
> the old characters (in update_line()).  In order to do that correctly we
> would need to know how many bytes would produce the given number of
> columns, (kinda the opposite of _rl_col_width()), but we have no method for
> that, so no wonder the code does it wrong.  It's not even possible to have
> such a method, e.g. when simple English letter is be replaced by a CJK
> glyph then coldiff==1, so you shift the remaining characters to the right
> by 1, and you'd need to print the left half and the right half of that CJK
> character in separate steps.  Also note that another problem with the
> source line quoted above is that nfd+lendiff might point to the middle of a
> UTF-8 character, and computing width from there doesn't make sense.
>
> Hence the fix would be to abandon the concept of printing new characters
> in two runs, separately for the new columns and separately for overwriting
> the old ones.  The correct behavior can only be implemented if, after
> allocating the proper amount of columns, all the differing characters are
> printed in a single run.  This is what my proposed patch does.
> insert_some_chars() is made simpler: it only inserts new colums, but
> doesn't print any text, that responsibility is given back to its caller
> update_line() which prints the whole stuff at the location where it used to
> print the first part only.  Printing the second run is then simply removed.
>
> Please see the attached proof-of-concept patch.  I've stress-tested it
> against my actual entries in my real .bash_history, with many mp3 filenames
> containing accented letters (and trailing space due to completion), and it
> seems to work correctly.
>
> Note that my patch does *not* address the non-UTF-8 branch, I believe I
> don't have the knowledge and understanding of the source to test all
> possible scenarios.  The three "else" branches starting at line 1645 also
> have to be modified to print the whole difference, not just the part
> filling the new columns.  I'm kindly asking you to do this piece of work, I
> guess it should be very easy for you based on my findings, and I'm sure
> you'll want to make cosmetic changes to my code anyway.
>
> Thanks very much,
>
> egmont
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: readline display bug with UTF-8 and char insertion

Chet Ramey
In reply to this post by Egmont Koblinger-2
On 1/21/13 3:16 PM, Egmont Koblinger wrote:

> Hi,
>
> On a fully UTF-8 environment, in certain easily reproducible cases, the
> command line becomes messed up (the cursor ends up in the wrong column, and
> from then on it's tons of cumulative mistakes).  I'm using bash-4.2.42, but
> the bug also affects other readline-based applications.
>
> To reproduce:  Save the attached file as your .bash_history.  (Notice the
> UTF-8 accented characters in the middle line.  Also notice the trailing
> space in each line.)  Use either the Up arrow, or enter a common prefix of
> the lines and press a hotkey bound to history-search-backward.  You'll see
> the command line getting messed up.

Thanks for the report.  This is a great job of investigating, and your
patch is pretty much right on.  Openining up space and then writing over
it is clearly the right way to go.  The only issue is what to do when a
terminal has insert mode (im/ei) but not insert-char (IC/ic): you have
to insert the spaces yourself and then back up over them.  Mac OS X
Terminal is such a terminal, for example.

Thanks again for the patch.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/

Loading...