TL;DR Preallocation avoids unnecessary memory reallocations due to dynamic resizing, copying of the data, rehashing for maps and puts less pressure on GC1. Although we will only look at Java, the same behavior exists in
c++
,rust
and many other languages as well.
ArrayList
There are two ways to create an ArrayList
.
-
Method 1: Created with zero argument constructor. This uses a default capacity of size 102.
var arr = new ArrayList<String>(); arr.add(elem) // add something later
-
Method 2: Provide an initial capacity with int argument constructor.
// created with a capacity of 100. // No dynamic resizing needs to happen till 101th element is added var arr = new ArrayList<String>(100); arr.add(elem) // add something later
Let’s measure the performance under different scenarios for different input sizes:
N (Size of the input) |
Vector With No PreAllocation (Method 1) |
Vector With PreAllocation (Method 2) |
Improvement |
---|---|---|---|
10 | 0.037 μs | 0.037 μs | 0% |
100 | 0.403 μs | 0.350 μs | 15.14% |
1000 | 4.448 μs | 2.903 μs | 53.22 % |
10000 | 52.361 μs | 29.089 μs | 80% |
100000 | 468.255 μs | 279.901 μs | 67.29% |
Why Vector
preallocation is efficient:
Whenever an ArrayList
runs out of its internal capacity to hold additional elements, it needs to reallocate more space. Generally, most implementations double the existing size. So, a new array of larger size is created and existing elements are copied to this new array3. Later, whenever GC runs, the old array is garbage collected. At the time of writing, when 100000
elements are added to the arraylist with no preallocation, the array reallocation happens at the following sizes in openjdk 17 codebase:
10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 711404.
Note: The capacity growth policy is an implementation detail and can change across different jdk versions.
HashMap
Although we only talk about HashMap, the same argument applies to LinkedHashMap and other kinds of maps which take in slots/capacity as arguments.
HashMap
can also be constructed with custom a loadFactor
but we will stick with using the default loadFactor
of0.75
for now . The capacity can be computed with formula:
$$\frac{count\space of \space expected \space elements}{loadFactor} = \frac{count\space of \space expected \space elements}{0.75} \approx 1.33 \times {count\space of \space expected \space elements}$$
The reason we allocate 1.33
times more than the expected elements count is because of hash collisions. In a perfect world, the number of buckets/slots we need to allocate will be equal to the number of elements to add. However, in reality, collisions will happen and we need to account for that.
Consider the following two methods of constructing a HashMap
:
-
Method 1 - With No Preallocation : Created with 0 argument constructor. This constructs an instance where capacity is 16.
var map = new HashMap<String, Integer>();
-
Method 2 - With Preallocation: Created with required capacity (N)
// here N is the number of elements we intend to add to the HashMap var capacity = (int)Math.ceil(N/0.75); var map = new HashMap<String, Integer>(capacity)
Let’s measure the hashmap performance under different scenarios for different input sizes (N):
N (Size of the input) |
HashMap No PreAllocation (Method 1) |
HashMap With PreAllocation (Method 2) |
Improvement |
---|---|---|---|
10 | 0.245 μs | 0.233 μs | 4.89% |
100 | 2.624 μs | 1.699 μs | 35.25% |
1000 | 30.244 μs | 20.187 μs | 33.25% |
10000 | 477.449 μs | 336.069 μs | 29.61% |
100000 | 11013.597 μs | 8875.078 μs | 19.42% |
Why HashMap
PreAllocation is Efficient
Preallocation of HashMap is efficient as this avoids unnecessary rehashing. At the time of writing, rehashing happens when map.size() > (capacity() * loadFactor)
. Rehashing involves the following process:
- Allocate a new slot table of larger size. In openjdk 17, the size increases by approximately
1.5
5 - Rip through each entry in the existing map, rehash the entry and find its new slot in the new array
When you try to add 100K
elements, the hashmap is rehashed at the following sizes: 10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 71140. When HashMap
is preallocated, all these rehashing is eliminated.
Drawbacks of Over Allocation
Hashmap/Vector
have no way of knowing how many elements we intend to add and thus, generally are conservative in nature and initially, start with a small capacity. If we are very aggressive and set a very large capacity even though we never utilize the full capacity will result in memory wastage. In a GC language like Java, this would result in unnecessary GC pauses.If you are using an architecture like lambda or cloud functions where you are charged depending on how much memory you use, then your cloud bill should be higher if you waste space.
Unnecessary space in
Vector
can be claimed by invoking trimToSize method.
How to Determine the Capacity
Generally, all applications typically interact with external systems like databases, handle income rest or grpc requests, listen to incoming messages on an event bus, interface with external apis etc. The clue comes from carefully examining how we interact with such an external system.
-
Database: Consider a hypothetical example of retrieving the top 100 rated posts globally using a sql:
select post_id from top_posts order by position desc limit 100
. We know that in almost all cases, we get 100 rows from the above sql query. So, we can initialize the container with a capacity of100
. -
Creating an HashMap: Consider an external api that returns a list of
User
and you need to convert to a map ofuserId
toUser
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
class User { // not a complete class. Only relevant method is shown for illustration public int getId() { return id; } } // Usage List<User> users = getUsers(); var capacity = (int)Math.ceil(users.size()/0.75); Map<Integer, User> userIdMap = users.stream() .collect(Collectors.toMap(user -> user.getId(), // key Function.identity(), // value (u1, u2) -> u1, // since we expect no duplicates, just pick any user Object for merge operation () -> new HashMap<>(capacity)) // HashMap capacity is pre allocated );
-
Garbage Collection (GC) pressure only applies to garbaged collected languages like Java, go etc but not c++, rust. ↩︎
-
At the time of writing, the default size in open jdk 17 is 10 . In fact, a shared array of size 0 is used until the first addition. This avoids unnecessary space allocation if
ArrayList
is created and no elements were to be ever added. ↩︎ -
In fact, this is the worst case scenario where addition of element doesn’t take constant time. ↩︎
-
Java instead of doubling the capacity, increases the capacity by approprimately 1.5 times ↩︎
-
Typical implementations double the size of existing capacity. ↩︎