speeding up hash_search?

classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

speeding up hash_search?

George Jones
It looks like hash_search just does a linear walk if array entries to find
elements in a list.   This slows down (order N?) new inserts when the
number of entries gets large.

Would there be any interest in merging a patch to add an option for making
this faster (maybe using b-trees?)

My analysis here https://eludom.github.io/blog/20200418/


Thanks,
---george jones
Reply | Threaded
Open this post in threaded view
|

Re: speeding up hash_search?

Chet Ramey
On 4/19/20 10:53 AM, George Jones wrote:
> It looks like hash_search just does a linear walk if array entries to find
> elements in a list.   This slows down (order N?) new inserts when the
> number of entries gets large.

Well, that's the collision handling mechanism. You have to have one. If the
hash table is sized right -- and it's entirely possible that the default
size of the hash table for associative arrays is too small for your use
case -- it's not much overhead.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

[PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Koichi MURASE
In reply to this post by George Jones
2020-04-19 23:54 George Jones <[hidden email]>:
> It looks like hash_search just does a linear walk if array entries
> to find elements in a list.

https://eludom.github.io/blog/20200418/

> and there it is, the linear search walking the list in hash_search()
>
> ```
> [...]
>
>   bucket = HASH_BUCKET (string, table, hv);
>
>   for (list = table->bucket_array ? table->bucket_array[bucket] : 0; list; list = list->next)
>     {
> [...]
> ```
The associative arrays in `hashlib.c' are implemented by hash tables
as is clear from its name.  The main lookup of hash table algorithm is
done by the following line

  bucket = HASH_BUCKET (string, table, hv);

but not by the subsequent linear search.  The linear search is just a
workaround for the collision of hashes.  As far as the load factor of
the hash table is maintained properly, the linear search is O(1)
because the length of the list is O(1).

2020-04-19 23:54 George Jones <[hidden email]>:
> This slows down (order N?) new inserts when the number of entries
> gets large.

I looked into `hashlib.c' and found that rehashing is actually not
implemented.  I just added a function to perform rehashing, and
the performance has been improved.

> Would there be any interest in merging a patch to add an option for
> making this faster (maybe using b-trees?)

https://eludom.github.io/blog/20200418/
> TODO Look for appropriate in-memory hash insert/lookup functions
> -  btrees ?

The B-tree is not the most appropriate option here.  The hash table is
more appropriate.  The B-tree can be used in the case that we want to
keep the ordering of the keys of associative arrays (e.g. we want to
enumerate items in ascending/descending order).  Bash associative
arrays do not ensure the ordering of the items, so the hash table can
be used as a more efficient choice and is already implemented in
`hashlib.c'.  We can simply add rehashing.

------------------------------------------------------------------------

I attach a patch `0001-hashlib-Implement-rehash.patch' which
implements the rehashing.

I also tested the performance.  The attached `test1.png' shows the
computational time of insertion of items before and after the fix.
The lines are fitted by the function `Time = C Size^alpha' where alpha
is the exponent.  Before the fix, there are two regimes depending on
the number of items: linear regime O(N) (alpha ~ 1) and quadratic
regime O(N^2) (alpha ~ 2).  The quadratic regime signals the linear
scaling of a single insertion.  After the fix, the prformance has been
improved, and the computational time scales linearly in entire region.
I also attach a script `test1.sh' which was used to measure the time.

--
Koichi

test1.png (57K) Download Attachment
0001-hashlib-Implement-rehash.patch (3K) Download Attachment
test1.sh (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

George Jones
Thank you.   Patch applied and (performance) tested with come tests I was
working on
https://github.com/eludom/snippits/tree/master/bash/tests .... bottom line:

Before:
...

lines_per_sec 993.37 LINES 2700000 ELAPSED 2718
lines_per_sec 955.95 LINES 2800000 ELAPSED 2929
lines_per_sec 921.51 LINES 2900000 ELAPSED 3147
lines_per_sec 889.41 LINES 3000000 ELAPSED 3373
lines_per_sec 859.43 LINES 3100000 ELAPSED 3607


After:

...

lines_per_sec 50000.00 LINES 2600000 ELAPSED 52
lines_per_sec 50000.00 LINES 2700000 ELAPSED 54
lines_per_sec 50000.00 LINES 2800000 ELAPSED 56
lines_per_sec 50000.00 LINES 2900000 ELAPSED 58
lines_per_sec 50000.00 LINES 3000000 ELAPSED 60
lines_per_sec 50000.00 LINES 3100000 ELAPSED 62


---George Jones


On Sun, Apr 19, 2020 at 3:52 PM Koichi Murase <[hidden email]>
wrote:

> 2020-04-19 23:54 George Jones <[hidden email]>:
> > It looks like hash_search just does a linear walk if array entries
> > to find elements in a list.
>
> https://eludom.github.io/blog/20200418/
> > and there it is, the linear search walking the list in hash_search()
> >
> > ```
> > [...]
> >
> >   bucket = HASH_BUCKET (string, table, hv);
> >
> >   for (list = table->bucket_array ? table->bucket_array[bucket] : 0;
> list; list = list->next)
> >     {
> > [...]
> > ```
>
> The associative arrays in `hashlib.c' are implemented by hash tables
> as is clear from its name.  The main lookup of hash table algorithm is
> done by the following line
>
>   bucket = HASH_BUCKET (string, table, hv);
>
> but not by the subsequent linear search.  The linear search is just a
> workaround for the collision of hashes.  As far as the load factor of
> the hash table is maintained properly, the linear search is O(1)
> because the length of the list is O(1).
>
> 2020-04-19 23:54 George Jones <[hidden email]>:
> > This slows down (order N?) new inserts when the number of entries
> > gets large.
>
> I looked into `hashlib.c' and found that rehashing is actually not
> implemented.  I just added a function to perform rehashing, and
> the performance has been improved.
>
> > Would there be any interest in merging a patch to add an option for
> > making this faster (maybe using b-trees?)
>
> https://eludom.github.io/blog/20200418/
> > TODO Look for appropriate in-memory hash insert/lookup functions
> > -  btrees ?
>
> The B-tree is not the most appropriate option here.  The hash table is
> more appropriate.  The B-tree can be used in the case that we want to
> keep the ordering of the keys of associative arrays (e.g. we want to
> enumerate items in ascending/descending order).  Bash associative
> arrays do not ensure the ordering of the items, so the hash table can
> be used as a more efficient choice and is already implemented in
> `hashlib.c'.  We can simply add rehashing.
>
> ------------------------------------------------------------------------
>
> I attach a patch `0001-hashlib-Implement-rehash.patch' which
> implements the rehashing.
>
> I also tested the performance.  The attached `test1.png' shows the
> computational time of insertion of items before and after the fix.
> The lines are fitted by the function `Time = C Size^alpha' where alpha
> is the exponent.  Before the fix, there are two regimes depending on
> the number of items: linear regime O(N) (alpha ~ 1) and quadratic
> regime O(N^2) (alpha ~ 2).  The quadratic regime signals the linear
> scaling of a single insertion.  After the fix, the prformance has been
> improved, and the computational time scales linearly in entire region.
> I also attach a script `test1.sh' which was used to measure the time.
>
> --
> Koichi
>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Koichi MURASE
2020-04-20 10:00 George Jones <[hidden email]>:
> Thank you.  Patch applied and (performance) tested with come tests
> I was working on
> https://github.com/eludom/snippits/tree/master/bash/tests
> .... bottom line:

Thank you for the measurements.

Also, I am sorry that I disturbed your plan for contributing to Bash.
I actually initially doubted that the insertion with the current
implementation is O(N), so I created the test first and then found
that it is an easy fix rather than reimplementing it by B-tree or
other data structures.  I couldn't stop my interest in how much it is
improved by the easy fix.

Nevertheless, I have not tuned the parameters of rehashing.  Actually
it is a tradeoff between the memory consumption and the computational
time, so it is a matter of preference to some extent.  I attach an
updated patch which exposes some parameters.  If you have an interest,
you can play by changing the value of the parameters
`HASH_REHASH_MULTIPLIER' and `HASH_MAX_LOADFACTOR' defined in
`hashlib.h'.

--
Koichi

0001-hashlib-Implement-rehash.v2.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Koichi MURASE
I'm sorry, there was a bug. This is the fixed patch (v3).
--
Koichi

0001-hashlib-Implement-rehash.v3.patch (3K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Greg Wooledge
In reply to this post by Koichi MURASE
On Mon, Apr 20, 2020 at 06:48:44PM +0900, Koichi Murase wrote:
> Also, I am sorry that I disturbed your plan for contributing to Bash.
> I actually initially doubted that the insertion with the current
> implementation is O(N), so I created the test first and then found
> that it is an easy fix rather than reimplementing it by B-tree or
> other data structures.  I couldn't stop my interest in how much it is
> improved by the easy fix.

This should in no way make the OP feel that they didn't contribute.
Spotting and diagnosing problems is important work, even if their
proposed patch wasn't selected as the best solution.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Chet Ramey
On 4/20/20 8:49 AM, Greg Wooledge wrote:

> On Mon, Apr 20, 2020 at 06:48:44PM +0900, Koichi Murase wrote:
>> Also, I am sorry that I disturbed your plan for contributing to Bash.
>> I actually initially doubted that the insertion with the current
>> implementation is O(N), so I created the test first and then found
>> that it is an easy fix rather than reimplementing it by B-tree or
>> other data structures.  I couldn't stop my interest in how much it is
>> improved by the easy fix.
>
> This should in no way make the OP feel that they didn't contribute.
> Spotting and diagnosing problems is important work, even if their
> proposed patch wasn't selected as the best solution.

This is quite true.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Koichi MURASE
2020-04-20 23:05 Chet Ramey <[hidden email]>:

> On 4/20/20 8:49 AM, Greg Wooledge wrote:
> > On Mon, Apr 20, 2020 at 06:48:44PM +0900, Koichi Murase wrote:
> >> Also, I am sorry that I disturbed your plan for contributing to Bash.
> >> I actually initially doubted that the insertion with the current
> >> implementation is O(N), so I created the test first and then found
> >> that it is an easy fix rather than reimplementing it by B-tree or
> >> other data structures.  I couldn't stop my interest in how much it is
> >> improved by the easy fix.
> >
> > This should in no way make the OP feel that they didn't contribute.
> > Spotting and diagnosing problems is important work, even if their
> > proposed patch wasn't selected as the best solution.
>
> This is quite true.

Dear All,

Yes, thank you for clarification and sorry for confusing writing.  I
did not mean George's contribution disappeared, but just I have
disturbed his five-step plan on his blog where the remaining four
steps are now canceled.  I have to add that identifying the problem is
definitely the most non-trivial part because it is relatively
straightforward to fix the problem once the problem is identified :).
Also, I would like to thank George for additional testing of my patch.

Of course, this also applies to my patches.  I think I have sent more
than ten patches so far and many of them have been applied, but I am
anyway happy if the problems are solved.  You can always reject my
patches when you have better solutions.

--
Koichi

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

George Jones
No problem.   Glad you fixed it.   It's been a LONG time since I've
actually written C, so probably best if someone current did it.

On the parameters, I suggest you consider exposing the at user level as a
switch or env var.   My use case was pathologically large (and would have
been better on, e.g. Spark if that were and option in the environment).
 Even the old behavior was probably good enough for most people.  It's only
when you start abusing bash to to "big data" that the problem shows up.

Thanks again, and impressive quick work !

---george jones



On Mon, Apr 20, 2020, 10:48 AM Koichi Murase <[hidden email]> wrote:

> 2020-04-20 23:05 Chet Ramey <[hidden email]>:
> > On 4/20/20 8:49 AM, Greg Wooledge wrote:
> > > On Mon, Apr 20, 2020 at 06:48:44PM +0900, Koichi Murase wrote:
> > >> Also, I am sorry that I disturbed your plan for contributing to Bash.
> > >> I actually initially doubted that the insertion with the current
> > >> implementation is O(N), so I created the test first and then found
> > >> that it is an easy fix rather than reimplementing it by B-tree or
> > >> other data structures.  I couldn't stop my interest in how much it is
> > >> improved by the easy fix.
> > >
> > > This should in no way make the OP feel that they didn't contribute.
> > > Spotting and diagnosing problems is important work, even if their
> > > proposed patch wasn't selected as the best solution.
> >
> > This is quite true.
>
> Dear All,
>
> Yes, thank you for clarification and sorry for confusing writing.  I
> did not mean George's contribution disappeared, but just I have
> disturbed his five-step plan on his blog where the remaining four
> steps are now canceled.  I have to add that identifying the problem is
> definitely the most non-trivial part because it is relatively
> straightforward to fix the problem once the problem is identified :).
> Also, I would like to thank George for additional testing of my patch.
>
> Of course, this also applies to my patches.  I think I have sent more
> than ten patches so far and many of them have been applied, but I am
> anyway happy if the problems are solved.  You can always reject my
> patches when you have better solutions.
>
> --
> Koichi
>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Chet Ramey
In reply to this post by Koichi MURASE
On 4/19/20 3:51 PM, Koichi Murase wrote:

>
> The associative arrays in `hashlib.c' are implemented by hash tables
> as is clear from its name.  The main lookup of hash table algorithm is
> done by the following line
>
>   bucket = HASH_BUCKET (string, table, hv);
>
> but not by the subsequent linear search.  The linear search is just a
> workaround for the collision of hashes.  As far as the load factor of
> the hash table is maintained properly, the linear search is O(1)
> because the length of the list is O(1).
>
> 2020-04-19 23:54 George Jones <[hidden email]>:
>> This slows down (order N?) new inserts when the number of entries
>> gets large.
>
> I looked into `hashlib.c' and found that rehashing is actually not
> implemented.  I just added a function to perform rehashing, and
> the performance has been improved.

Thanks for the patch. This will be in the next devel branch push.

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Chet Ramey
In reply to this post by George Jones
On 4/20/20 11:16 AM, George Jones wrote:
> No problem.   Glad you fixed it.   It's been a LONG time since I've
> actually written C, so probably best if someone current did it.
>
> On the parameters, I suggest you consider exposing the at user level as a
> switch or env var.   My use case was pathologically large (and would have
> been better on, e.g. Spark if that were and option in the environment).
>  Even the old behavior was probably good enough for most people.  It's only
> when you start abusing bash to to "big data" that the problem shows up.

I've been considering how to provide a way to let users indicate the size
of an array when they declare it. `declare' accepts `declare -a name[size]'
and `declare -A name[size]' for compatibility with ksh, but has never done
anything with `size'.

I could see allowing a number, or maybe an arithmetic expression, for SIZE
to set the initial size of an associative array. Or maybe an additional
option to `declare' would work better. I don't like using a shell variable
for this.

What does the group think?

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

George Jones
No real opinion on syntax.

Using something existing:

    declare -A foo[SIZE]

seems sensible, especially if there was no semantic meaning (I'm not a fan
of syntax without semantics .... clutter).

Big thing is that the new stuff for fringe new pathologic use cases (mine)
should not have negative impact (huge buffer preallocation) on the existing
?30 years? of users/scripts.

Some docs on how the SIZE is used (hint for preallocation of hash table
size, not hard limit on number of entries) probably also in order.

Thanks,
---george jones


On Mon, Apr 20, 2020, 4:23 PM Chet Ramey <[hidden email]> wrote:

> On 4/20/20 11:16 AM, George Jones wrote:
> > No problem.   Glad you fixed it.   It's been a LONG time since I've
> > actually written C, so probably best if someone current did it.
> >
> > On the parameters, I suggest you consider exposing the at user level as a
> > switch or env var.   My use case was pathologically large (and would have
> > been better on, e.g. Spark if that were and option in the environment).
> >  Even the old behavior was probably good enough for most people.  It's
> only
> > when you start abusing bash to to "big data" that the problem shows up.
>
> I've been considering how to provide a way to let users indicate the size
> of an array when they declare it. `declare' accepts `declare -a name[size]'
> and `declare -A name[size]' for compatibility with ksh, but has never done
> anything with `size'.
>
> I could see allowing a number, or maybe an arithmetic expression, for SIZE
> to set the initial size of an associative array. Or maybe an additional
> option to `declare' would work better. I don't like using a shell variable
> for this.
>
> What does the group think?
>
> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
>                  ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/
>
Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Greg Wooledge
On Mon, Apr 20, 2020 at 05:12:28PM -0400, George Jones wrote:
> No real opinion on syntax.
>
> Using something existing:
>
>     declare -A foo[SIZE]
>
> seems sensible, especially if there was no semantic meaning (I'm not a fan
> of syntax without semantics .... clutter).

That's pretty C-like, and I don't have any strong dislike of it, but I
feel I should point out that users will need to quote the final argument
if it contains square brackets, just like with unset 'a[i]'.

Another choice would be a more shell-like syntax:

declare -s size -A foo=(...)

I'm curious whether the size has to be specified up front when the array
is declared, or can be adjusted on the fly.  The shell-like syntax feels
more natural if the size is being adjusted, since you can write

declare -s new_size foo

without needing to specify the -A again.  But it's not a huge difference.

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Chet Ramey
In reply to this post by George Jones
On 4/20/20 5:12 PM, George Jones wrote:
> No real opinion on syntax.
>
> Using something existing:
>
>     declare -A foo[SIZE]
>
> seems sensible, especially if there was no semantic meaning (I'm not a fan
> of syntax without semantics .... clutter).

It's been there forever for ksh88 compatibility, but bash doesn't use it
for anything (and I suspect ksh93 doesn't either).

> Big thing is that the new stuff for fringe new pathologic use cases (mine)
> should not have negative impact (huge buffer preallocation) on the existing
> ?30 years? of users/scripts.

I don't think there are very many, if any, scripts out there using that
syntax. It's never had any semantic meaning.

> Some docs on how the SIZE is used (hint for preallocation of hash table
> size, not hard limit on number of entries) probably also in order.

Sure, if the size argument ever meant something, I would document it. :-)

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/

Reply | Threaded
Open this post in threaded view
|

Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?)

Chet Ramey
In reply to this post by Greg Wooledge
On 4/20/20 5:50 PM, Greg Wooledge wrote:

> On Mon, Apr 20, 2020 at 05:12:28PM -0400, George Jones wrote:
>> No real opinion on syntax.
>>
>> Using something existing:
>>
>>     declare -A foo[SIZE]
>>
>> seems sensible, especially if there was no semantic meaning (I'm not a fan
>> of syntax without semantics .... clutter).
>
> That's pretty C-like, and I don't have any strong dislike of it, but I
> feel I should point out that users will need to quote the final argument
> if it contains square brackets, just like with unset 'a[i]'.

If the code to skip over it didn't already exist, I wouldn't consider it.

> Another choice would be a more shell-like syntax:
>
> declare -s size -A foo=(...)

That's definitely more consistent with the rest of declare's options.

> I'm curious whether the size has to be specified up front when the array
> is declared, or can be adjusted on the fly.

It could be adjusted on the fly, I suppose, but there would never really
be a reason to do it unless the size were being reduced. Otherwise, it's
ok to just let the array grow automatically.

> The shell-like syntax feels
> more natural if the size is being adjusted, since you can write
>
> declare -s new_size foo
>
> without needing to specify the -A again.  But it's not a huge difference.

This is a good point.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    [hidden email]    http://tiswww.cwru.edu/~chet/