parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour


dirk
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' -DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu' -DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H   -I.  -I../. -I.././include -I.././lib  -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall -no-pie -Wno-parentheses -Wno-format-security
uname output: Linux dilbert 4.10.0-41-generic #45~16.04.1-Ubuntu SMP Fri Nov 24 15:06:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.4
Patch Level: 12
Release Status: release

Description:
  I'm sanitising urls from advertisement crap. As described below I'm getting a wrong resolution of parenthesised expression defined with non-greedy operator '?'.

  The test url is: http://toolbox.contentspread.net/container/medimops/track/xxxxxxxxxx.dyn?csRdu=https://www.medimops.de/?anid=M9999999999&cl=details&wdm=M9999999999&utm_source=CRM&utm_medium=email&utm_campaign=OS

  The regular expression is: https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*

  As I understand the specification, and as I verified with 'visual regexp' and https://regex101.com/, the result should be:

    1 →  container/medimops/track/xxxxxxxxxx.dyn?csRdu
    2 → https://www.medimops.de/?anid=M9999999999

  Running the script below I got instead:
    1 → container/medimops/track/xxxxxxxxxx.dyn?csRdu=https://www.medimops.de/?anid=M9999999999&cl=details&wdm=M9999999999&utm_source=CRM&utm_medium
    2 → email


Repeat-By:

  Test script:
#!/bin/bash

url='http://toolbox.contentspread.net/container/medimops/track/xxxxxxxxxx.dyn?csRdu=https://www.medimops.de/?anid=M9999999999&cl=details&wdm=M9999999999&utm_source=CRM&utm_medium=email&utm_campaign=OS'
re='https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*'

if [[ ${url} =~ ${re} ]]
then
    echo "0 → ${BASH_REMATCH[0]}"
    echo "1 → ${BASH_REMATCH[1]}"
    echo "2 → ${BASH_REMATCH[2]}"
fi


Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

Greg Wooledge
On Fri, Dec 01, 2017 at 06:40:35PM +0100, [hidden email] wrote:
>   I'm sanitising urls from advertisement crap. As described below I'm getting a wrong resolution of parenthesised expression defined with non-greedy operator '?'.

> re='https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*'
>
> if [[ ${url} =~ ${re} ]]

Bash's =~ operator uses POSIX Extended Regular Expressions (EREs).  An
ERE has no non-greedy quantifiers (.*? or .+?).  Those are a Perl extension.

Also, you don't need to escape / but you *do* need to escape dots.
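
For what it's worth, a non-greedy group can usually be emulated in an ERE with a negated bracket expression. Here is a minimal sketch, assuming the first capture should stop at the first '=' and the second at the first '&'. The ^ anchor and the variable names first and second are illustrative choices, not part of the original report.

```shell
#!/usr/bin/env bash
# Sketch: emulate non-greedy captures in a POSIX ERE with negated bracket expressions.
url='http://toolbox.contentspread.net/container/medimops/track/xxxxxxxxxx.dyn?csRdu=https://www.medimops.de/?anid=M9999999999&cl=details&wdm=M9999999999&utm_source=CRM&utm_medium=email&utm_campaign=OS'

# [^=]* cannot cross an '=', so the first group ends at the first '='.
# [^&]+ cannot cross an '&', so the second group ends at the first '&'.
# Dots in the host name are escaped; slashes need no escaping.
re='^https?://toolbox\.contentspread\.net/([^=]*)=([^&]+)&'

if [[ $url =~ $re ]]; then
    first=${BASH_REMATCH[1]}
    second=${BASH_REMATCH[2]}
    echo "1 → $first"
    echo "2 → $second"
fi
```

With the test URL from the report, this should print the two values the original poster expected.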


Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

Chet Ramey
In reply to this post by dirk
On 12/1/17 12:40 PM, [hidden email] wrote:

> Bash Version: 4.4
> Patch Level: 12
> Release Status: release
>
> Description:
>   I'm sanitising urls from advertisement crap. As described below I'm getting a wrong resolution of parenthesised expression defined with non-greedy operator '?'.
>
>   The test url is: http://toolbox.contentspread.net/container/medimops/track/xxxxxxxxxx.dyn?csRdu=https://www.medimops.de/?anid=M9999999999&cl=details&wdm=M9999999999&utm_source=CRM&utm_medium=email&utm_campaign=OS
>
>   The regular expression is: https?:\/\/toolbox.contentspread.net\/(.*?)=(.+?)&.*

The Bash =~ operator uses POSIX extended regexps (EREs) as defined in
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04.
There's no concept of a 'non-greedy' operator in the POSIX ERE definition.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://cnswww.cns.cwru.edu/~chet/


Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

dirk
From the two replies I understand that the implementation in bash is
correct according to the "official" standard.

I have solved the issue in my own script, but the regular expressions
I developed for my problem are, without the non-greedy operator, more
difficult to read and maintain. From that point of view it would be an
improvement for bash to implement the non-greedy operator.

From a "normal developer's" perspective, I also think this is a common
pitfall, since many testing resources and regexp implementations
support the non-greedy operator.

Maybe a switch/option to enable the non-greedy operator could be added
in a future release.

So please feel free to change the "bug report" to a "feature request"
;-)

Best Regards,

H.-Dirk Schmitt





Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

Chet Ramey
On 12/4/17 1:42 PM, H.-Dirk Schmitt wrote:
> From the two replies I understand that the implementation in bash is
> correct according to the "official" standard.
>
> I have solved the issue in my own script, but the regular expressions
> I developed for my problem are, without the non-greedy operator, more
> difficult to read and maintain. From that point of view it would be an
> improvement for bash to implement the non-greedy operator.

The thing is, bash doesn't "implement" its regular expressions, per se.
Bash uses the POSIX standard library functions (regcomp/regexec) if they
are available in the C library when it's configured and built.  I'm not
wild about adding a dependency on PCRE, or a configure test for it, just
to have two varieties of regular expressions available.

Chet
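
For readers who want non-greedy behaviour without any regex library, plain parameter expansion can split the URL. The # and % operators strip the shortest match, while ## and %% strip the longest. This is a rough sketch, assuming the URL layout from the original report; the variable names rest, key and value are hypothetical.

```shell
#!/bin/sh
# Sketch: split the URL at the first '=' and the first '&' using parameter expansion only.
url='http://toolbox.contentspread.net/container/medimops/track/xxxxxxxxxx.dyn?csRdu=https://www.medimops.de/?anid=M9999999999&cl=details&wdm=M9999999999&utm_source=CRM&utm_medium=email&utm_campaign=OS'

rest=${url#*://toolbox.contentspread.net/}  # drop scheme and host (shortest matching prefix)
key=${rest%%=*}      # strip the longest suffix starting at '=': text before the first '='
value=${rest#*=}     # strip the shortest prefix ending in '=': text after the first '='
value=${value%%&*}   # strip the longest suffix starting at '&': text before the first '&'

printf '1 → %s\n' "$key"
printf '2 → %s\n' "$value"
```

This does not validate the URL the way a regular expression would; it only cuts at the first delimiters, which is all the non-greedy groups were doing here.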


Re: parenthesised regular expressions and non-greedy operator ? - non standard bash behaviour

dirk
On Mon, 2017-12-04 at 16:49 -0500, Chet Ramey wrote:

> The thing is, bash doesn't "implement" its regular expressions, per se.
> Bash uses the POSIX standard library functions (regcomp/regexec) if they
> are available in the C library when it's configured and built.  I'm not
> wild about adding a dependency on PCRE, or a configure test for it, just
> to have two varieties of regular expressions available.
>
> Chet

OK, so please close this as "not a bug".

