Preallocate Java's ArrayList And HashMap For Better Performance

TL;DR Preallocation avoids unnecessary memory reallocations due to dynamic resizing, copying of the data, rehashing for maps and puts less pressure on GC¹. Although we will only look at Java, the same behavior exists in c++, rust and many other languages as well.

ArrayList

There are two ways to create an ArrayList .

Method 1: Created with zero argument constructor. This uses a default capacity of size 10².
```
var arr = new ArrayList<String>(); 

arr.add(elem) // add something later
```

Method 2: Provide an initial capacity with int argument constructor.

// created with a capacity of 100. 
// No dynamic resizing needs to happen till 101th element is added
var arr = new ArrayList<String>(100); 

arr.add(elem) // add something later

Let’s measure the performance under different scenarios for different input sizes:

N (Size of the input)	Vector With No PreAllocation (Method 1)	Vector With PreAllocation (Method 2)	Improvement
10	0.037 μs	0.037 μs	0%
100	0.403 μs	0.350 μs	15.14%
1000	4.448 μs	2.903 μs	53.22 %
10000	52.361 μs	29.089 μs	80%
100000	468.255 μs	279.901 μs	67.29%

Why `Vector` preallocation is efficient:

Whenever an ArrayList runs out of its internal capacity to hold additional elements, it needs to reallocate more space. Generally, most implementations double the existing size. So, a new array of larger size is created and existing elements are copied to this new array³. Later, whenever GC runs, the old array is garbage collected. At the time of writing, when 100000 elements are added to the arraylist with no preallocation, the array reallocation happens at the following sizes in openjdk 17 codebase:

10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 71140⁴.

Note: The capacity growth policy is an implementation detail and can change across different jdk versions.

HashMap

Although we only talk about HashMap, the same argument applies to LinkedHashMap and other kinds of maps which take in slots/capacity as arguments.

HashMap can also be constructed with custom a loadFactor but we will stick with using the default loadFactor of0.75 for now . The capacity can be computed with formula:

$$\frac{count\space of \space expected \space elements}{loadFactor} = \frac{count\space of \space expected \space elements}{0.75} \approx 1.33 \times {count\space of \space expected \space elements}$$

The reason we allocate 1.33 times more than the expected elements count is because of hash collisions. In a perfect world, the number of buckets/slots we need to allocate will be equal to the number of elements to add. However, in reality, collisions will happen and we need to account for that.

Consider the following two methods of constructing a HashMap:

Method 1 - With No Preallocation : Created with 0 argument constructor. This constructs an instance where capacity is 16.
```
 var map = new HashMap<String, Integer>();
```

Method 2 - With Preallocation: Created with required capacity (N)

// here N is the number of elements we intend to add to the HashMap
var capacity = (int)Math.ceil(N/0.75);
var map = new HashMap<String, Integer>(capacity)

Let’s measure the hashmap performance under different scenarios for different input sizes (N):

N (Size of the input)	HashMap No PreAllocation (Method 1)	HashMap With PreAllocation (Method 2)	Improvement
10	0.245 μs	0.233 μs	4.89%
100	2.624 μs	1.699 μs	35.25%
1000	30.244 μs	20.187 μs	33.25%
10000	477.449 μs	336.069 μs	29.61%
100000	11013.597 μs	8875.078 μs	19.42%

Why `HashMap` PreAllocation is Efficient

Preallocation of HashMap is efficient as this avoids unnecessary rehashing. At the time of writing, rehashing happens when map.size() > (capacity() * loadFactor). Rehashing involves the following process:

Allocate a new slot table of larger size. In openjdk 17, the size increases by approximately 1.5⁵
Rip through each entry in the existing map, rehash the entry and find its new slot in the new array

When you try to add 100K elements, the hashmap is rehashed at the following sizes: 10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 71140. When HashMap is preallocated, all these rehashing is eliminated.

Drawbacks of Over Allocation

Hashmap/Vector have no way of knowing how many elements we intend to add and thus, generally are conservative in nature and initially, start with a small capacity. If we are very aggressive and set a very large capacity even though we never utilize the full capacity will result in memory wastage. In a GC language like Java, this would result in unnecessary GC pauses.If you are using an architecture like lambda or cloud functions where you are charged depending on how much memory you use, then your cloud bill should be higher if you waste space.

Unnecessary space in Vector can be claimed by invoking trimToSize method.

How to Determine the Capacity

Generally, all applications typically interact with external systems like databases, handle income rest or grpc requests, listen to incoming messages on an event bus, interface with external apis etc. The clue comes from carefully examining how we interact with such an external system.

Database: Consider a hypothetical example of retrieving the top 100 rated posts globally using a sql: select post_id from top_posts order by position desc limit 100 . We know that in almost all cases, we get 100 rows from the above sql query. So, we can initialize the container with a capacity of 100.

Creating an HashMap: Consider an external api that returns a list of User and you need to convert to a map of userId to User

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16


class User {
  // not a complete class. Only relevant method is shown for illustration
  public int getId() {
    return id;
  }
}

// Usage
List<User> users = getUsers(); 
var capacity = (int)Math.ceil(users.size()/0.75);
Map<Integer, User> userIdMap = users.stream()
    .collect(Collectors.toMap(user -> user.getId(), // key
                              Function.identity(), // value
                              (u1, u2) -> u1, // since we expect no duplicates, just pick any user Object for merge operation
                              () -> new HashMap<>(capacity)) // HashMap capacity is pre allocated
            );

Garbage Collection (GC) pressure only applies to garbaged collected languages like Java, go etc but not c++, rust. ↩︎
At the time of writing, the default size in open jdk 17 is 10 . In fact, a shared array of size 0 is used until the first addition. This avoids unnecessary space allocation if ArrayList is created and no elements were to be ever added. ↩︎
In fact, this is the worst case scenario where addition of element doesn’t take constant time. ↩︎
Java instead of doubling the capacity, increases the capacity by approprimately 1.5 times ↩︎
Typical implementations double the size of existing capacity. ↩︎

ArrayList#

Why Vector preallocation is efficient:#

HashMap#

Why HashMap PreAllocation is Efficient#

Drawbacks of Over Allocation#

How to Determine the Capacity#