Introduction

Brownies-Collections complements the Java Collections Framework by providing general purpose implementations of the List interface which are both fast and powerful and introducing collections which feature the concept of keys and collections.

Performance

  • GapList is the fastest known implementation for reasonably sized lists as it combines the strengths of both ArrayList and LinkedList. It is implemented to offer both efficient random access to elements by index (as ArrayList does) and at the same time efficient adding and removing elements to and from beginning and end (as LinkedList does). It also exploits the locality of reference and access patterns in applications to further improve performance.

  • BigList is a list optimized for handling large number of elements. It stores the elements in fixed size blocks which are maintained in a tree for fast access and split or merged as needed. Copying a BigList is very efficient as it is implemented using a copy-on-write approach.

Memory Usage

  • There are specialized List implementations for primitive data types. As the storage is realized using arrays of primitives, memory is saved. These classes are named IntGapList/IntBigList, etc.

  • There are also additional classes which allow you to access the primitive data through the standard List interface. With this approach you can save memory without changing the interface to your collections. These classes are named IntObjGapList/IntObjBigList, etc.

Development Efficiency

  • Both GapList and BigList have been designed to be used as drop-in replacement for ArrayList, LinkedList, or ArrayDeque by implementing all their interfaces and offering all their methods.

  • All list implementations (GapList, BigList, KeyList, etc.) inherit many powerful methods provided through the common abstract class IList. This often makes it possible to replace the costly and clumsy use of stream() with a single method.

  • There are additional collections featuring the integration of keys and constraints in an orthogonal and declarative way to increase developer productivity. These features are offered by classes implementing the Collection interface (classes KeyCollection, Key1Collection, Key2Collection), the Set interface (classes KeySet, Key1Set, Key2Set), and the List interface (classes KeyList, Key1List, Key2List).

 

Benchmark First

Probably the most important aspect of collections is the performance of the operations. If an operation does not scale, it can become effectively become not feasible any more if the collection becomes too large.

We therefore jump directly into the benchmarks with the following nominees:

Type

Library

Description

GapList

brownies-collections

Fastest list implementation known. Fast access by index and fast insertion/removal at end and at beginning.

BigList

brownies-collections

List optimized for storing large number of elements. Elements stored in fixed size blocks which are maintained in a tree.

ArrayList

Java

Maintains elements in a single array. Fast access by index, fast insertion/removal at end, but slow at beginning.

LinkedList

Java

Elements stored using a linked list. Slow access by index. Memory overhead for each element stored.

TreeList

commons-collections

Elements stored in a tree. General overhead, but no very slow operations. Memory overhead for each element stored.

 

Performance

As the size of the collection typically heavily influences the performance of the operations, all benchmarks have been run against collections with 100, 10'000 and 1'000'000 elements.

The first table shows the result for collections with 100 elements. The different colors in the result table indicate the relative performance in relation to the best implementation (green: best performance; blue good performance, max 2 times slower; yellow: moderate performance, max 10 times slower; red: bad performance, more than 10 times slower)

 

As the list is quite small, the results are close to each other. However we can already see that TreeList seems to add some overhead to all operations.

The results of with size 10'000 is then more interesting:

We can see, that ArrayList and LinkedList only perform well for a quite a restricted subset of operations:

  • ArrayList: all get operations, add/remove last

  • LinkedList: get first/last, remove first/last

  • On the other hand, both GapList and BigList perform well for all operations

Finally the results for 1'000'000 elements:

As we could expect, only BigList scales well for all operations:

  • GapList: does not not scale if add/remove operations happen completely random, so no locality of reference can be exploited.

  • BigList: scales well for all operations, the only moderate result is for getting elements in a totally random order. This could be expected as there is no locality of reference which can be exploited.

Summary

GapList has been designed as drop-in replacement for ArrayList, LinkedList, and ArrayDequeue and offers fast access by index and fast insertion/removal at the beginning and at the end at the same time. It is the fastest known list implementation for reasonably sized lists.

Use BigList if your list can hold more elements. BigList has been designed to execute all operations in constant time (with the exception of copying). Both GapList and BigList use memory sparingly and provide also specialized implementations supporting primitives to save more memory.

The table below shows the complexity of the operations for the different implementations (with O(1)+ meaning constant amortized time).

The benchmarks have been realized using JMH and great care has been taken in making them reliable. But even with JMH benchmarking is tricky, especially in the case of add/delete operations where you will not reach a steady state and garbage collections kicks in. Nevertheless I'm confident that the reported numbers represent the reality as close as possible.

Memory

Let's also have a look the memory consumption of the different lists:

We can see that BigList, GapList, and ArrayList only add small overhead to the stored elements, where as Linkedlist needs twice the memory and TreeList even more. However big savings are realized if you use an implementation storing primitives instead of objects.

 

GapList – a lightning-fast List implementation

Unfortunately Java does not offer a general purpose of the List interface. GapList combines the strengths of both ArrayList and LinkedList to achieve this. We will show how it is implemented to offer efficient random access to elements by index (like ArrayList) and at the same time efficient adding and removing elements to and from head and tail of the list (like LinkedList).

Review of ArrayList

To understand the benefit of GapList, we first have a look at the implementation of ArrayList to see why the performance of some operations tends to be bad.

The following illustration shows how data is stored in an ArrayList.

All elements are stored in a Java array. The first element is at index 0, the second at index 1 and so on. To prevent new allocation on every insert, the allocated array is typically larger than the stored number of elements and a size variable is used to keep track of the used elements.

If we look at the illustration, it becomes quite obvious why different operations have different performance characteristics: operations at the end are very fast because no data must be copied, whereas operations at the beginning need an expensive copy operation.

If we analyze some operations in a row, we see that subsequent operations cannot profit from previous operations. So in the following example, the elements right from the insertion point have to be moved twice for two subsequent insert operations.

As a consequence, ArrayList is not suited for iterating over a list and adding or removing elements.

Introduction to GapList

To solve the issues brought out, we introduce GapList as another implementation of the java.util.List interface. As main features, GapList provides

  • Efficient access to elements by index

  • Constant time insertion and deletion at head and tail of list

  • Exploit the locality of reference and access patterns often seen in applications

Let's see how GapList is implemented to offer these features.

If we compare how the different kind of inserts are handled by ArrayList, we can quickly come up with a solution to guarantee fast insertion both at the beginning and at the end of the list. Instead of moving all elements to gain space at index 0, we leave the existing elements in place and write the elements at the end of the allocated array if there is space left. So we basically use the array as a kind of rotating buffer.

For accessing the elements in the right order, we have to remember the start position of the first element and use a modulo operation to calculate the physical index from the logical one:

physIndex = (start + index) % capacity

To exploit the locality of reference, we allow a gap to be included in the storage of the list elements. The gap formed by the unused slots in the backing array can be anywhere in the list. There is at most one gap, but there can also be none.

This gap helps you to take advantage of the locality of reference to the list, so if you add an element to the middle of the list, a subsequent addition to the middle will be fast.

If a GapList has no gap, one is created if needed. If the gap is at a wrong place, it is moved. Obviously if the operations happen near to each other, only few elements will have to be copied.

Per default, the gap is placed in such a way that several additions after the specified elements can be done without need to move the gap again. However the access pattern is also analyzed and exploited: if several add operations occur at the exactly same position, the gap is repositioned once more. This means that inserts everywhere will be as fast as operations at the end or the beginning if they occur repeated at the same position.

GapList always allows removal of elements at the beginning and at the end without any moving of elements.

Removals in the middle are handled similar to insertions: an existing gap may be moved or vanish if no longer needed.

Summary of GapList

GapList has been designed as drop-in replacement for both ArrayList, LinkedList, and ArrayDeque:

  • implements all needed interfaces (java.util.List, java.util.Deque)

  • implements all public methods (ensureCapacty, trimToSize)

  • is not thread-safe

  • features specialized implementations using all Java primitives for storage to save memory, either unrelated (e.g. IntGapList) or still integrated into the collections framework by implementing the List interface (e.g. IntObjGapList).

 

BigList – a Scalable High-Perforance List for Java

GapList is tremendously fast for reasonably sized lists, but it will not scale if the size of the collections grows over a certain limit. Reason is that the operations to expand the collections or move the gap will become more costly if more elements have to be handled.

BigList has been designed for handling large collections where large means that all data still fit completely in the heap memory. It addresses the following requirements:

  • Avoid copying large data blocks: If the list grows or shrinks, only a small part of the data must be copied around, as this operation becomes expensive and needs the same amount of memory again.

  • Data sharing: copying collections is a frequent operation which should be efficiently possible even if the collection is large. An efficient implementation requires some sort of data sharing as copying all elements is per se a costly operation.

  • Sparing use of memory: The list should need little memory for its own implementation so memory can be used for storing application data.

  • Specialized versions for primitives: It must be possible to store common primitives like ints in a memory saving way.

If an implementation does not offer these features, some operations will not only be slow for really large collections, but will become just not feasible because memory or CPU usage will be too exhaustive.

Introduction to BigList

The implementation of BigList stores the collection elements in fixed size blocks. Add or remove operations are then implemented to use only data from one block.

Copying is fast by maintaining a reference count on the fixed size blocks which allows to implement a copy-on-write approach. For efficient access to the fixed size blocks, they are maintained in a specialized tree structure.

Each BigList instance stores the following information:

  • Elements are stored in in fixed-size blocks

  • A single block is implemented as GapList with a reference count for sharing

  • All blocks are maintained in a tree for fast access

  • Access information for the current block is cached for better performance

The following illustration shows these details for two instances of BigList which share one block.

Use of Blocks

Elements are stored in in fixed-size blocks with a default block size of 1000. Where this default may look pretty small, it is most of the time a good choice because it guarantees that write operation only need to move few elements. Read operations will profit from locality of reference by using the currently cached block to be fast.

It is however possible to specify the block size for each created BigList instance. All blocks except the first are allocated with this fixed size and will not grow or shrink. The first block will grow to the specified block size to save memory for small lists.

If a block has reached its maximum size and more data must be stored within, the block needs to be split up in two blocks before more elements can be stored. If elements are added to the head or tail of the list, the block will only be filled up to a threshold of 95%. This allows inserts into the block without the immediate need for split operations.

To save memory, blocks are also merged. This happens automatically if two adjacent blocks are both filled less than 35% after a remove operation.

Locality of Reference

For each operation on BigList, the affected block must be determined first. As predicted by locality of reference, most of the time the affected block will be the same as for the last operation.

The implementation of BigList has therefore been designed to profit from locality of reference which makes common operations like iterating over a list very efficient.

Instead of always traversing the block tree to determine the block needed for an operation, lower and upper index of the last used block are cached. So if the next operation happens near to the previous one, the same block can be used again without need to traverse the tree.

Reference Counting

To support a copy-on-write approach, BigList stores a reference count for each fixed size blocks indicating whether this block is private or shared.

Initially all blocks are private having a reference count of 0, so that modification are allowed. If a list is copied, the reference count is incremented which prohibits further modifications. Before a modification then can be made, the block must be copied decrementing the block's reference count and setting the reference count of the copy to 0. The reference count of a block is then decremented by the finalizer of BigList.

Summary of BigList

BigList is a scalable high-performance list for storing large collections. Its design guarantees that all operations will be predictable and efficient both in term of performance and memory consumption, even copying large collections is tremendous fast.

 

Key Collections – Boost Productivity with Keys and Constraints

The key collections will increase developer productivity. To achieve this, the concepts of keys and constraints are integrated into collections and can be used in an orthogonal and declarative way.

Sounds interesting, but too theoretical? Let's start with an example.

A First Example

Let's say we need to represent the list of columns in a table with the following requirements:

  • columns are ordered

  • column names must be unique

  • efficient access to the columns by both index and name

So what collection should we choose? A list gives us efficient access by index. If you are confident that the number of elements remains small, we could implement access by name (which is also needed to check for the uniqueness of the column names) by just iterating through the list, but it clearly needs coding for a solution which would not scale. For a scalable solution, we would need to synchronize a list and a map. This is not an impossible task, but plenty of work for a quite common requirement.

Enter our key collections. They allow the creation of the needed collection with one single statement:

Key1List<Column,String> columns = new Key1List.Builder<Column,String>.
    withKey1Map(Column::getName).withKey1Null(false).
    withKey1Duplicates(false).build()

The code is easy to understand: Create a list to store column objects and allow access to them by a string key defined by the specified mapper. This key must not be null and duplicates are not allowed.

To get the complete picture, see the following definition of class and mapper:

class Column {
    private String name;
    public String getName() { return name; }
}

You might first think that the creation of the collection look clumsy, but think at the amount of code you would have to write otherwise.

Keys

Let's now look in more detail at the definition and functionality of keys: A key is a value which is extracted from an element stored in a collection. It can be used for access and for defining constraints. We call all extracted key values of a collection a key map. There can be one or more key maps for a collection and also the element itself can be used as key - we call this the element set.

For each key map, the following behavior can be defined:

  • Null values: can be allowed (default) or prohibited (withKeyNull())

  • Duplicate values: can be allowed (default) or prohibited (withKeyDuplicates()). There is also a mode where duplicate values are prohibited, but null values can occur without limitation

  • Sort order: the order of keys can be random (default) or sorted. For sorting either the natural order or any order defined by a comparator can be used (withKeySort())

If you define a sort order on a key map, you have also the possibility to specify that the whole collection should use this order (withKeyOrderBy()).

If you have a database background, these concepts will sound familiar to you. In our example, we defined the column name to be the primary key of the collection. To make this more obvious and also shorter, we could also use the method withPrimaryKey(). There is also a second method withUniqueKey() which defines a key which can be null but must be unique if not null.

Note that the values used for keys should be immutable. This is necessary as the key maps are populated upon adding elements and later changes will not be detected. This is however not a problem specific to our key collections, also key used in sets or maps must not change if the element is contained in a collection. For the rare case where this may be necessary, there are methods to invalidate an element.

Constraints

The possibility to restrict null and duplicate values already showed some of the power of constraints. Unfortunately the concepts of constraint collections is missing from the JDK classes. While some classes like Set have intrinsic constraints, there is no general way to apply a constraint to collections. What's more, the classes have also not been designed for extensibility, so implementing a constraint involves overwriting a lot of methods.

We now show how this feature is essential for providing a good API. Let's consider the case that our class should maintain a list of tasks which may not be null. Client code should be able to read and write these tasks using the provided API but it must be guaranteed that only non-null tasks can be added to the list. Clearly

class Task {
    List<Task> tasks;

    List<Task> getTasks() { return tasks; }
}

is dangerous: The client has unrestricted write access to the list, so she can easily insert null values. We can use a slightly modified version to get a good read-only interface:

List<Task> getTasks() { return Collections.unmodifiableList(tasks); }

But we still have to fix the write part of the API. There are different options:

The lazy developer will offer only a minimalistic API:

void addTask(int index, Task task) {
    checkTask(task); tasks.add(index, task); }
void setTask(int index, Task task) { 
    checkTask(task); tasks.set(index, task); }

The hardworking developer could implement all mutator methods (twice add(), twice addAll(), set()). Or she offers only a single method:

void setTasks(List<Task> tasks) { checkTasks(tasks); this.tasks = tasks; }

With this approach, we have to validate all tasks after every change. This may not be a problem if the number of elements is low and the check is as easy as checking for null, but obviously this approach will not scale.

With constrained collections however, our API becomes again as simple as

List<Task> getTasks() { return constraintTasks; }

because the list itself is able to check the elements for validity.

Therefore the key collections offer a general concept to apply constraints to the elements stored in the collections. Per default, key collections allow null and duplicate values like Collection and List do. It is however easily to possible to prohibit null (withElemNull()) and duplicate values (withElemDuplicates()).

You can also define a predicate constraint which must evaluate to true for an element to be added to the collection (withConstraint()). If the predicate evaluation fails, operation is canceled by throwing an exception.

There is also the possibility to define trigger like handlers (withBeforeInsert(), withBeforeDelete()) where you can prohibit the action to be executed by throwing an exception.

There is also a last kind of constraint ready to use: you can restrict the number of entries allowed in the collection by defining a maximum size (withMaxSize()): an attempt to add more elements will then be rejected with an exception. If your collection is of type list, you can also define a window size (withWindowSize()): this also means that the list will store at maximum the defined number of elements, but if another element is added, the first one is automatically removed.

Classes

The functionality of keys and constraints is provided by the following classes:

Class

Description

KeyList<E>

Maintains a list of elements. It optionally provides fast access to the element itself.

Key1List<E,K1>

A key list with one additional key. It provides fast access to the key and optionally to the element itself.

Key2List<E,K1,K2>

A key list with two additional keys. It provides fast access to the key and optionally to the element itself.

KeyCollection<E>

Maintains a collection of elements. It will provide fast access to the elements.

Key1Collection<E,K1>

A key collection with one additional key. It provides fast access to the key and optionally to the element itself.

Key2Collection<E,K1,K2>

A key collection with two additional keys. It provides fast access to the key and optionally to the element itself.

KeySet<E>

Maintains a set of elements. It will provide fast access to the elements.

Key1Set<E,K1>

A key set with one additional key. It provides fast access to the key and to the element itself.

Key2Set<E,K1,K2>

A key set with two additional keys. It provides fast access to the key and to the element itself.

 

The key collections use the JDK collections as building blocks for their implementation: The element set is realized using HashSet/TreeSet, for the key maps HashMap/TreeMap are used.

The following illustration shows how the key collections use different components to store some elements with two keys:

The classes KeyCollection, Key1Collection, and Key2Collection implement the Collection interface whereas KeySet, Key1Set, and Key2Set implement the Set interface. The order of elements is per default determined by the Collection/Set component, it can however also be changed to use one of the key maps.

The classes KeyList, Key1List, and Key2List implement the List interface, so the order of elements is always determined by the list. The list order can however be specified to reflect the order of the set or of a key map resulting in a sorted list.

To access the values in the key maps, specific methods are available: containsKey(), getByKey(), getAllByKey(), getCountByKey(), getDistinctKeys(), removeByKey(), removeAllByKey(), and indexOfKey() (for lists only). For the elements set, the following methods are provided: getAll(), getCount(), removeAll(), getDistinct(). Additionally, you can also use asSet() and asMap() to access the keys.

Key Lists

All key lists (except if sorted) have the problem that removal of an element by key is slow as the element must be found by iterating through the list. You should therefore only use a key list if the number of elements remains small or you remove element by index only.

Sorted Key Lists

There are numerous discussions why Java does not have sorted lists. There are basically two arguments against: The first, puristic one is that a sorted list would violate the contract designed by List.add() because the element is not added at the end. The second, more practical one is that is not possible to implement a sorted list which performs well in adding elements in random order.

While both arguments are true, a sorted key list can nevertheless be created because it can be handy. You must however be aware that adding elements in random order to a sorted list will always be an order of magnitude slower that using a sorted map because the elements must be moved in memory during addition.

More Examples

Here are a few more examples which show the power of key collections:

A collection of tickets with a mandatory internal and an optional external id:

Key2Collection<Ticket,String,String> coll = 
    new Key2Collection.Builder<Ticket,String,String>.
        withKey1Map(Ticket.dMapper).withPrimaryKey().
        withKey2Map(Ticket.ExtIdMapper).withUniqueKey2().build()

A column list sorted:

Key1List<Column,String> list = new Key1List<Column,String>.
    withKey1Map(Column.Mapper).withKey1Sort(true).
    withKey1OrderBy(true).withPrimaryKey1().build()

A sorted list of primitive int values (saving a lot of memory):

KeyList<Integer> list = new KeyList<Integer>.
    withElemOrderBy(int.class).build()

A constraint list:

KeyList<String> = new KeyList<String>.
    withConstraint(uppercasePredicate).build()

A constraint set:

KeyCollection<String> = new KeyCollection<String>.
    withConstraint(uppercasePredicate).build().asSet()

A constraint map:

Key1Collection<Column,String> = new Key1Collection<Column,String>.
    withKey1Map(nameMapper).withConstraint(uppercasePredicate).
    build().asMap1()

 

A Real-Life Example

Let's look at more complex example:

  • maintain a bidirectional mapping between zip code and city name

  • zip codes are unique, but one city name can be associated with several zip codes

  • store both zip codes and city names sorted

So our collection will then store data like this:

  • Zip 4000 → City "Basel"

  • Zip 5000 → City "Aarau"

  • Zip 5001 → City "Aarau"

The key collections do not support a special BiMap interface, but allow access to collections elements by exposed keys. So to store the needed data, we define a simple class:

class ZipCity {
        int zip;
        String city;

        public ZipCity(int zip, String city) {
            this.zip = zip;
            this.city = city;
        }
        public int getZip() {
            return zip;
        }
        public String getCity() {
            return city;
        }
    }

Our collection will then store instances of this class which we need to access by two keys: zip code (an integer) and city name (a string). We therfore create a collection with two keys, hence

  • Key2Collection<ZipCity, Integer, String>

For these two keys, we define the mapping used to extract the key values out of the collection element:

  • withKey1Map(ZipCity::getZip)

  • withKey2Map(ZipCity::getCity)

With this declaration, we already set up the synchronization between object and keys. Let's add the other features as well.

  • Zip code must be non null and unique: withPrimaryKey1()

  • City name must be non null: withKey2Null(false)

  • Both zip codes and names must be sorted: withKey1Sort(true), withKey2Sort(true)

And that's all! How does this work internally? The key collections are using standard Java collections under the hood to implement the functionality. In our example, the Key2Collection contains one HashSet (for the elements) and two TreeMaps which map the sorted keys to the elements in the set. This can be seen on the following illustration:

 

Listing

    static class ZipCity {
        int zip;
        String city;

        public ZipCity(int zip, String city) {
            this.zip = zip;
            this.city = city;
        }
        public int getZip() {
            return zip;
        }
        public String getCity() {
            return city;
        }
    }

    static void testZipCity() {
        Key2Collection<ZipCity,Integer,String> zipCities = new Key2Collection.Builder<ZipCity,Integer,String>().
            withKey1Map(ZipCity::getZip).
            withPrimaryKey1().
            withKey1Sort(true).
            withKey2Map(ZipCity::getCity).
            withKey2Null(false).
            withKey2Sort(true).build();

        zipCities.add(new ZipCity(4000, "Basel"));
        zipCities.add(new ZipCity(5000, "Aarau"));
        zipCities.add(new ZipCity(5001, "Aarau"));

        // Get all zips sorted
        Set<Integer> allZips = zipCities.asMap1().keySet();

        // Get all cities sorted
        Set<String> allCities = zipCities.asMap2().keySet();

        // Get city by zip
        String city = zipCities.getByKey1(5000).getCity();

        // Get all zips by city
        GapList<Integer> zips = zipCities.getAllByKey2("Aarau").map(ZipCity::getZip);
    }

 

 
© Magicwerk.org