TL;DR: Preallocation avoids unnecessary memory reallocations caused by dynamic resizing, copying of data, and rehashing for maps, and it puts less pressure on the GC [1]. Although we will only look at Java, the same behavior exists in C++, Rust, and many other languages as well.

ArrayList

There are two ways to create an ArrayList.

  • Method 1: Created with the zero-argument constructor. This uses a default capacity of 10 [2].

    var arr = new ArrayList<String>();

    arr.add(elem); // add something later
    
  • Method 2: Provide an initial capacity with the int-argument constructor.

    // created with a capacity of 100.
    // No dynamic resizing needs to happen until the 101st element is added
    var arr = new ArrayList<String>(100);

    arr.add(elem); // add something later
    

Let’s measure the performance under different scenarios for different input sizes:

| N (size of the input) | Vector With No PreAllocation (Method 1) | Vector With PreAllocation (Method 2) | Improvement |
| --- | --- | --- | --- |
| 10 | 0.037 μs | 0.037 μs | 0% |
| 100 | 0.403 μs | 0.350 μs | 15.14% |
| 1000 | 4.448 μs | 2.903 μs | 53.22% |
| 10000 | 52.361 μs | 29.089 μs | 80% |
| 100000 | 468.255 μs | 279.901 μs | 67.29% |

Why Vector preallocation is efficient:

Whenever an ArrayList runs out of internal capacity to hold additional elements, it needs to reallocate more space. Most implementations roughly double the existing size: a new, larger array is created and the existing elements are copied into it [3]. The old array is then garbage collected the next time GC runs. At the time of writing, when 100000 elements are added to an ArrayList with no preallocation, array reallocation happens at the following sizes in the OpenJDK 17 codebase:

10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 71140 [4].

Note: The capacity growth policy is an implementation detail and can change across JDK versions.
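The growth sequence above can be checked empirically: it is just repeated application of OpenJDK 17's rule `newCapacity = oldCapacity + (oldCapacity >> 1)`, i.e. roughly 1.5x growth. A minimal sketch (the `growthSequence` helper is mine, for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class GrowthDemo {
    // Simulates ArrayList's growth rule from OpenJDK 17:
    // newCapacity = oldCapacity + (oldCapacity >> 1), i.e. roughly 1.5x.
    static List<Integer> growthSequence(int start, int limit) {
        List<Integer> sizes = new ArrayList<>();
        int cap = start;
        while (cap < limit) {
            sizes.add(cap);
            cap = cap + (cap >> 1); // grow by ~1.5x
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Default capacity 10, adding up to 100000 elements
        System.out.println(growthSequence(10, 100_000));
        // prints 10, 15, 22, ... 47427, 71140
    }
}
```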

HashMap

Although we only talk about HashMap, the same argument applies to LinkedHashMap and other kinds of maps whose constructors take slots/capacity as arguments.

HashMap can also be constructed with a custom loadFactor, but we will stick with the default loadFactor of 0.75 for now. The capacity can be computed with the formula:

$$capacity = \frac{\text{count of expected elements}}{loadFactor} = \frac{\text{count of expected elements}}{0.75} \approx 1.33 \times \text{count of expected elements}$$

We allocate 1.33 times the expected element count because of the load factor: a HashMap resizes once its size exceeds capacity × loadFactor, so to hold N elements without a resize we need a capacity of at least N / 0.75. The load factor itself exists because of hash collisions. In a perfect world, the number of buckets/slots we need to allocate would equal the number of elements to add; in reality, collisions will happen and we need spare slots to account for that.
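As a minimal sketch of this computation (the `capacityFor` helper name is mine, not a JDK API):

```java
public class CapacityCalc {
    // Smallest capacity such that `expected` entries can be inserted
    // without crossing the default load-factor threshold of 0.75.
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(100));  // 134
        System.out.println(capacityFor(1000)); // 1334
    }
}
```

Note that, at the time of writing, HashMap internally rounds the requested capacity up to the next power of two, so asking for 134 slots actually yields a table of 256. Newer JDKs (19 and later) also ship `HashMap.newHashMap(int numMappings)`, which performs this computation for you.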

Consider the following two methods of constructing a HashMap:

  • Method 1 - With No Preallocation: Created with the zero-argument constructor. This constructs an instance with a default capacity of 16.

     var map = new HashMap<String, Integer>();
    
  • Method 2 - With Preallocation: Created with required capacity (N)

    // here N is the number of elements we intend to add to the HashMap
    var capacity = (int) Math.ceil(N / 0.75);
    var map = new HashMap<String, Integer>(capacity);
    

Let’s measure HashMap performance under different scenarios for different input sizes (N):

| N (size of the input) | HashMap No PreAllocation (Method 1) | HashMap With PreAllocation (Method 2) | Improvement |
| --- | --- | --- | --- |
| 10 | 0.245 μs | 0.233 μs | 4.89% |
| 100 | 2.624 μs | 1.699 μs | 35.25% |
| 1000 | 30.244 μs | 20.187 μs | 33.25% |
| 10000 | 477.449 μs | 336.069 μs | 29.61% |
| 100000 | 11013.597 μs | 8875.078 μs | 19.42% |

Why HashMap PreAllocation is Efficient

Preallocating a HashMap is efficient because it avoids unnecessary rehashing. At the time of writing, rehashing happens when map.size() > (capacity() * loadFactor). Rehashing involves the following process:

  • Allocate a new slot table of larger size. In OpenJDK 17, the size increases by approximately 1.5 [5]
  • Walk through each entry in the existing map, rehash it, and find its new slot in the new table

When you try to add 100K elements with no preallocation, the HashMap is rehashed at the following sizes: 10, 15, 22, 33, 49, 73, 109, 163, 244, 366, 549, 823, 1234, 1851, 2776, 4164, 6246, 9369, 14053, 21079, 31618, 47427, 71140. When the HashMap is preallocated, all of this rehashing is eliminated.
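As a quick sanity check, here is a minimal sketch (not a rigorous benchmark; measurements like the table above would typically come from a harness such as JMH) showing that both construction styles produce identical maps and differ only in the resizing work performed along the way:

```java
import java.util.HashMap;

public class PreallocDemo {
    static final int N = 100_000;

    static HashMap<Integer, Integer> fill(HashMap<Integer, Integer> map) {
        for (int i = 0; i < N; i++) {
            map.put(i, i);
        }
        return map;
    }

    public static void main(String[] args) {
        // Method 1: default capacity (16); rehashes repeatedly while growing.
        var noPrealloc = fill(new HashMap<>());
        // Method 2: sized up front; no rehash happens during the N puts.
        int capacity = (int) Math.ceil(N / 0.75);
        var prealloc = fill(new HashMap<>(capacity));
        // Identical contents; only the intermediate work differs.
        System.out.println(noPrealloc.equals(prealloc)); // true
    }
}
```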

Drawbacks of Over Allocation

HashMap/Vector have no way of knowing how many elements we intend to add, so they are generally conservative and start with a small capacity. If we are overly aggressive and set a very large capacity that is never fully utilized, the result is wasted memory. In a GC language like Java, this can also translate into unnecessary GC pauses. And if you are on an architecture like Lambda or Cloud Functions, where you are charged by memory usage, wasted space can show up directly in your cloud bill.

Unnecessary space in a Vector can be reclaimed by invoking the trimToSize method.
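A short sketch of trimToSize in action, assuming we over-allocated and ended up using only a fraction of the capacity:

```java
import java.util.ArrayList;

public class TrimDemo {
    public static void main(String[] args) {
        // Over-allocated: capacity 100_000, but only 10 elements used.
        var arr = new ArrayList<Integer>(100_000);
        for (int i = 0; i < 10; i++) {
            arr.add(i);
        }
        // Shrinks the backing array to arr.size(), releasing the
        // unused slots to the garbage collector.
        arr.trimToSize();
        System.out.println(arr.size()); // 10
    }
}
```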

How to Determine the Capacity

Generally, applications interact with external systems: they query databases, handle incoming REST or gRPC requests, listen for messages on an event bus, call external APIs, and so on. The clue comes from carefully examining how we interact with such an external system.

  • Database: Consider a hypothetical example of retrieving the top 100 rated posts globally using SQL: select post_id from top_posts order by position desc limit 100. We know that in almost all cases the query returns 100 rows, so we can initialize the container with a capacity of 100.
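A sketch of this pattern, with a hypothetical fetchTopPostIds standing in for the real JDBC/ORM call:

```java
import java.util.ArrayList;
import java.util.List;

public class TopPostsDemo {
    static final int PAGE_SIZE = 100; // matches the LIMIT in the SQL query

    // Hypothetical row fetch; in a real application this loop would
    // iterate a JDBC ResultSet for:
    //   select post_id from top_posts order by position desc limit 100
    static List<Long> fetchTopPostIds() {
        // The LIMIT tells us the result size up front, so preallocate.
        List<Long> postIds = new ArrayList<>(PAGE_SIZE);
        for (long id = 1; id <= PAGE_SIZE; id++) { // stand-in for rs.next()
            postIds.add(id);
        }
        return postIds;
    }

    public static void main(String[] args) {
        System.out.println(fetchTopPostIds().size()); // 100
    }
}
```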

  • Creating a HashMap: Consider an external API that returns a list of User objects that you need to convert into a map from userId to User:

    
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    class User {
      // not a complete class. Only the relevant method is shown for illustration
      public int getId() {
        return id;
      }
    }

    // Usage
    List<User> users = getUsers();
    var capacity = (int) Math.ceil(users.size() / 0.75);
    Map<Integer, User> userIdMap = users.stream()
        .collect(Collectors.toMap(User::getId,         // key
                                  Function.identity(), // value
                                  (u1, u2) -> u1,      // we expect no duplicates, so pick either User for the merge
                                  () -> new HashMap<>(capacity))); // HashMap capacity is preallocated
    




  1. Garbage collection (GC) pressure only applies to garbage-collected languages like Java, Go, etc., but not C++ or Rust.

  2. At the time of writing, the default capacity in OpenJDK 17 is 10. In fact, a shared array of size 0 is used until the first addition. This avoids unnecessary space allocation if an ArrayList is created and no elements are ever added.

  3. In fact, this is the worst-case scenario, where the addition of an element doesn’t take constant time.

  4. Instead of doubling the capacity, Java increases it by approximately 1.5 times.

  5. Typical implementations double the size of the existing capacity.