Improved Sliced Array implementation
If you don’t know what sliced array is, I suggest that you first read the previous post, where I presented the sliced array. There I presented the data structure that is similar to classical dynamic array, but is more efficient for deletes and inserts. Also, adding elements is a bit faster, because memory allocation is more efficient. But the downside was that accessing item by index was slower then with classical dynamic array. And because accessing the item by index is the most important feature of such a structure, I was not satisfied with the results. So as usual, I looked for a way to do it better while still being efficient for deletes and inserts. And I did it, I found a way. First lets see what changed from previous implementation.
As you can see a lot changed. I threw out most of the pointers for doubly linked lists that were actually not needed. I also changed how each slice is implemented. The biggest problem with the old implementation was, that the slices were of variable sizes. That proved it more difficult to access the slice by random index. I could find it very fast due to the tricks I wrote about, but not fast enough. 2-3 times slower access by index is not something I was happy about. Furthermore using the helper pointer to remember last accessed slice made even reading thread unsafe. That was unacceptable. So I though of another solution. The slices need to be of equal size that is a must, no way around that if I want to be super efficient. But how do we then solve deletes and inserts. If I have to maintain equal slice size, that means I have to move items in all slices if I delete or insert one item. Well at least in slices above the targeted one. That way I gain nothing. Then I thought about a neat solution, I remembered how a typewriter works
What if we add some buffer space bellow and above the slice limits. This way if we insert or delete an item, we only have to move one item per slice. To demonstrate, if we insert an item into the N-th slice, we have to move all the items in that slice for one item upwards. Then, we have to move the last item of the N-th slice to the next slice to maintain equal item count. But because we have some buffer space in each slice, we can simply append the item before next slice items and we don’t have to move other items at all! Then just repeat that for each slice to the end. So lets say we have 100 slices. if we insert an item into the 10-th slice we only have to move items in that slice and then just move one item in each of the 90 slices upwards. No need to move items in all 90 slices. I we run out of buffer space in one of the slices, then we move all items to the middle again to reposition them. The bigger the buffer the more operations we can do, before we must move again. Just like the typewriter. It shifts to the right and when we come to the end of the paper we reposition it again. We don’t reposition for each letter.
It turns out this is very efficient. We don’t have many slices. For instance typical setup I use in testing is for one million items. Each slice has 10.000 items, so I have 100 slices. Each slice has for 1000 items of buffer space, that is 10% and it means I have 500 items buffer bellow and above slice limits. I tested with 500 items buffer and the speed was still mostly the same. We don’t need a lot of buffer space. Its simple math really. If you have only 10 items of buffer on each side you already are 10 time more efficient on deletes and inserts. The record structure for the slice now come down to this:
TArraySlice = record
Last: Integer; Start: Integer; Index: Integer; Data: TAnyValues; end; |
And the lookup control is now simply:
PSliceData = ^TSliceData; TSliceData = array of PArraySlice; |
So quite simplified. The tests show great improvement. The next picture shows the same test as last time, for one million items (10.000 items per slice, 1.000 items buffer per slice). I only added sort test to see some real-time performance and changed Integer list with TValue list because TValue is something we compete against.
Nice, isn’t it. We retained around 10 times faster deletes and insert and we are now also very, very fast on indexed access. We got down from approximately 270 ms to around 110 ms. For 1.000.000 items look-up we are now only around 20 ms slower then pure dynamic array (eg. TList). You have to take into account that that 113 ms is mostly calling Random function for random indexed access. The true difference is seen in the iteration by index test. There it is 38 vs 29. I say this is more then acceptable. If that difference bothers you then well…
We can also note how slow variants and especially TValue are on deletes and inserts. 65 seconds on TValue against 0.1 with my implementation. Staggering difference. That is because TAnyValue is only 9 bytes in size on both 32 and 64 bit and TValue is just huge. That is why memory footprint is also very important. Few recognize it but when you move data around it shows. Its not just the pure memory consumption that hits you its also all operations. You need to work with so many more bytes in memory.
I have also done a stress and regression test for IAnyArray. I ran all basic operations in a loop for 5 hours, while checking the content of the array against TList after each cycle. If there was an inconsistency, I was informed. I ironed out all the major bugs I hope. Download section now has a separate AnyValue download entry, so you can download only AnyValue units. Also the updated code is already there, with all the tests and demos. Just don’t expect that you will know how to use the StressTest. That is only meant for me to test for bugs.
I truly think that this structure is better then TList in almost any way. The only downside is, it consumes more memory, but even that is not a given. I allocate memory in chunks of 10.000 items or whatever your slice size is and TList just doubles the size of the array each time it runs out of space. So it will quickly be less efficient even for that, at large item numbers. If you need an array with powerful interface and efficiency for large number of items then this is it.
If you find any bugs or have any thoughts on any of what I wrote, don’t be shy, drop a comment. Also I dare you all out there, to make it even faster if you can. What about a little competition? Take my code and lets see if you can squeeze something more out of it.


Thomas Mueller wrote,
Just an idea: If you run out of space in one slice because of too many inserts, couldn’t you split that one slice into two rather than moving items around? That’s how databases work. Of course how efficient that is, depends on the implementation of your lookup control array.
(btw: “slower then” should be “slower than”)
Link | March 13th, 2013 at 11:04 am
Iztok Kacin wrote,
@Thomas
I don’t know about split. Probably not, because each slice needs to have a constant number of elements. But I could increase the buffer size if needed. If you get a lot of inserts and deletes I could increase the buffer so then you have fewer relocates of the content. But I did not complicate things further for now, because I had to release the code and it is fast enough as it is.
Link | March 13th, 2013 at 12:23 pm
A. Bouchez wrote,
A problem with the current implementation, using hooking interception, is that it will work for x86 only.
In fact, the Cromis.Detour is expected to work only with x86 opcodes.
Perhaps current code will work, due to some overlap of x86 and x64 opcodes for the JMP which is currently implemented in System.pas… but this could be by chance.
Are you sure the System.@FinalizeRecord hooking is working as expected, for instance?
Now that your nice set of units is growing, a source code repository is mandatory !!
I’m thinking of using the very same algorithm for our SynBigTable library, which is dynamic array based, so has some performance problems when handling a lot of small elements.
Nice ideas, here!
Link | March 13th, 2013 at 2:13 pm
Iztok Kacin wrote,
Yea I am aware of the problems with the Detours library. It works for me on XE3 compiled to 64 bit but it is not safe for sure. The problem is I don’t have enough knowledge to make it 64 bit safe. I will have to put a define to control if hooking is present or not. This will unfortunately have two effects
1. AnyValue will be slow
2. There will be a lot of IFDEF in the code
But it is probably mandatory for two reasons
1. To give option to those who value stability over speed
2. To compile under x64 until a good solution is found.
Is there anyone willing to help with Detours, to make it 64 bit compatible?
Link | March 13th, 2013 at 5:17 pm
Iztok Kacin wrote,
Oh I almost forgot. Code repository will be made, as I promised
I just need to find a good way to sync parts of my repository with a public one.
Link | March 13th, 2013 at 5:17 pm
ObjectMethodology.com wrote,
Cool stuff, I like it!
Link | March 13th, 2013 at 9:12 pm
Eric wrote,
Nice!
You may want to add some benchmarks on small lists (a few dozen items), as these can also be common.
TList isn’t exactly a dynamic array, it suffers from poorly implemented bounds checking and also has a built-in change notification mechanism, which adds overhead. Its enumerator implementation is also sub-optimal.
Also the TList Get methods (for indexed access or enumeration) introduce a copy overhead for indexed accesses that a dynamic array doesn’t have.
Link | March 14th, 2013 at 11:57 am
Iztok Kacin wrote,
It is already there. In the speed test there are two tests, one for large and one for small number of elements. Ok small is 5000 in my case I think but I don’t think it makes any difference. If number of elements fit into one slice (and by default slice is 10.000 large) the Sliced Array behaves just like the ordinary dynamic array. You only have the overhead of one function call. It look like this
function TAnyArray.GetSliceByIndex(const Index: Integer; var SliceIndex: Integer): PArraySlice;
begin
if FArrayMode = amDynamicArray then
begin
Result := FSliceData[0];
SliceIndex := Index;
Exit;
end;
if FSliceCount = 1 then
begin
SliceIndex := FSliceData[0].Start + Index;
Result := FSliceData[0];
Exit;
end;
// calculate correct slice and index
Result := FSliceData[Index div FSliceSize];
SliceIndex := Result.Start + Index mod FSliceSize;
end;
If you expect to have small number of elements and want to have as fast indexed access as possible then you can switch to amDynamicArray mode that behaves just as ordinary array. It always uses only one slice and does as little as possible on slice search as you can see. Only one function call of overhead which is basically none.
Link | March 14th, 2013 at 12:40 pm
Iztok Kacin wrote,
Just tested it. For 5000 elements in array with my structure in amDynamicArray mode I only am 5 ms slover (40 vs 35) for 1.000.000 random index accesses. If I eliminate the extra function call by checking for amDynamicArray in GetItem then I am as fast as TList. Faster if I use RawItem function to only get pointer to TAnyValue (30 vs 35, gaining 10 ms, but TAnyValue has highly optimized record copy anyway).
But I don’t think 5ms is worth to make the code uglier
Link | March 14th, 2013 at 12:51 pm
Petar Georgiev wrote,
Offtopic: I cannot reach you via contact form (hangs on sending). It’s about CRON and possible memory leak:
Here are the steps to reproduce the issue:
1) Manually create named TScheduledEvent
2) DO NOT call run method
3) free the TScheduledEvent
The handle for FTermEvent is not closed in the Destructor. You close it only if “FRunning” is true. But in this case TScheduledEvent was never run and the handle leaks.
I found it with MadExcept. It’s resource leak detection finds handle leaks too
Regards!
Link | May 7th, 2013 at 12:13 am