Mechanism and characteristics of Collection implementation class often used in Java

I wrote an article "Significance of Interfaces Learned from Java Collection". Since this article focused on the Collection interface, this time we will focus on the implementation classes that are often used, and talk about the internal implementation, features, and usage of each implementation class.

Implementation class explained this time

This is a Collection class that you will probably use often.

HashMap is an implementation class of Map, not Collection, but since it is often used and has a high relationship with HashSet, I will explain it together with HashSet.

Class diagram image

The role of the interface (Click here for details (http://qiita.com/frost_star/items/14a12d64ccbe85a8ac3f))

interface role
List An ordered group of elements. Basically allow duplication.
Set Group that does not allow duplicate elements(set).. The order depends on the implementation class.

ArrayList ~ List by array ~

ArrayList, as the name implies, is an implementation of List by Array. It has an array internally, and stores, references, and inserts data in the array. Therefore, in order to know the characteristics of ArrayList, it is necessary to know the characteristics of the array.

What is an array in the first place?

An array reserves a contiguous area in memory. Its best feature is that it can be referenced by subscripts at high speed. Since the area is continuous, you can find the address you want to refer to by the following formula as long as you know the start address, subscript, and data size per one.

Address to refer to=Start address+Subscript x data size per one

image

Internal processing of ArrayList

Since the array needs to have such a continuous area, it cannot be changed from the originally determined number of elements. However, ArrayList allows you to add elements dynamically. ArrayList automatically reallocates an array when you add elements and run out of arrays. It is easy to say that it is re-secured, but in reality it is a very heavy process because it processes a new array with 1.5 times the number of elements as the original size and copies the data from the original array. Become. It is said that it is better to decide the initial capacity (constructor argument) in ArrayList because it reduces the frequency of executing this reallocation process by deciding the size of the array to be allocated first.

Also, arrays are very vulnerable to insertion. This is because the area where the data is stored is fixed, so the process of shifting the location cannot be performed. In ArrayList, the process of inserting data to an arbitrary location is implemented by the ʻadd` method, but this internal also reallocates the array, and the data after the insertion position is inserted by shifting the index and copying. We are processing to make room.

LinkedList ~ List by linear list ~

Have you ever heard of a data structure called a linear list? LinkedList is an implementation of List based on the structure of a linear list.

What is a linear list?

A linear list is a data structure that treats data and links (references to the next element) as one object (node), and can handle data columns by concatenating the nodes.

image

The advantage of this data structure is that you can access the elements by following the links within each node, as long as you know the root (reference to the first node). Therefore, it is not necessary for each node to exist in a contiguous area like an array. Also, when inserting data, you only need to change the references of the nodes before and after, so you do not need a large-scale copy process like ArrayList.

image

Internal processing of LinkedList

The downside of linear lists is slow random access. For example, in order to access the 2nd element, follow the link from the root semi-repeatingly like [Root]-> [0th element]-> [1st element]-> [2nd element]. I need to go. In LinkedList, in order to speed up random access as much as possible, we have devised such things as making the link bidirectional and holding a reference to the last element. However, the more elements there are, the slower the random access is inevitable.

Also, because each node has a reference as a field in addition to data, it uses more memory than an ArrayList with the same number of elements.

HashSet ~ Set using hash value ~

HashSet is an implementation class of Set, unlike the previous two Lists. That is, it does not allow duplicate elements and cannot be randomly accessed. Also, HashSet does not preserve order.

Not allowing duplicates means that when you add an element, you need to determine if the element already exists in the Set. HashSet makes good use of arrays, linear lists, and hash values to achieve fast existence verification.

What is a hash value?

The hash value is a value calculated from the original data by calculation based on a specific formula. The same hash value can be calculated from the same data, but it is designed so that if the data is slightly different, the values will be significantly different. Also, even if it is irreversible and the hash value can be calculated from the data, the data cannot be restored from the hash value. The hash value itself is widely used in the information processing world such as authentication, validity check, and encryption.

Hash value in Java

The hash value in Java is a value for identifying an instance, and is an int type integer calculated by the hashCode method. The hashCode method is defined as an Object type. Based on the characteristic that "the same hash value can be calculated from the same data of hash value", the same hash value must be returned between instances where the ʻequals` method returns true, and conversely, if the data is different, it is the same as much as possible. It should not be a value.

Internal processing of HashSet

HashSet realizes high-speed existence confirmation by making good use of this hash value. HashSet reserves an array of size s when instantiated. When storing an instance ʻe, HashSet first finds the hash value with ʻe.hashCode () and then calculates where it should be stored. Find the remainder (division remainder) of ʻe.hashCode ()ands and store ʻe in that location.

array[ e.hashCode() % s ] = e;

Since the storage location is calculated from ʻe.hashCode ()% s`, it is not necessary to search the array one by one when checking the existence, and the hash value of the given instance is calculated and the instance is there. You can confirm the existence by checking if there is any.

In case of collision

That is just an ideal theory. Actually, the size s of the array is a small number compared to the hash value, so the phenomenon that the data already exists in the place where you tried to store it occurs. This is called a collision. In case of a collision, HashSet stores data in a data structure that has a link to the next element like a linear list when storing data. Then, when a collision occurs, the data is connected as the next element of the element that already exists. This makes it possible to check the existence by searching only the groups that have the same value of ʻe.hashCode ()% s`, even if it is not a single reference.

image

resize

The more data you have, the more likely you are to collide. For example, if s = 10 and 11 data are stored, a collision will definitely occur (pigeonhole principle). Therefore, when the number of data increases, the array is re-allocated with a large capacity and the data is reinserted. The data reinsertion here is not a simple copy, but the data structure is not disrupted because the data reinsertion is performed so that the correspondence between the array subscript and ʻe.hashCode ()% s` is not disrupted.

hashCode override

HashSet determines the storage location based on the value of hashCode. Therefore, the performance of the hashCode expression is directly linked to the collision probability of the HashSet. In an extreme case, if the contents of hashCode are processed to always return a constant likereturn 0;, a collision will occur every time data is stored, so the search performance will be lower than the Linked List. Therefore, it is important to override the appropriate hashCode method for the class of the element you are storing. However, there is a high possibility of collision with Oreore hashCode. The best is to use the ʻObjects.hashCode` method.

Objects.hashCode(Field 1,Field 2,Field 3);

Since the argument is a variable argument, you can pass data of multiple Object types. However, since the final hash value is calculated using the hash value obtained by hashCode of each field, it is necessary to override the hashCode method in each field class as well.

HashSet and HashMap

So far, I've talked about the internal implementation of HashSet, but it's actually a lie. As I wrote in Another article, the internal implementation of HashSet is actually realized by HashMap. So, the internal implementation we've talked about so far was actually the internal implementation of HashMap. However, since the implementation of internal processing only depends on HashMap, the behavior is the same for both.

HashMap story

As for HashMap, HashMap is an implementation class of Map that holds values in two data pairs, Key and Value. Key corresponds to the data part of HashSet explained earlier. Since the data is stored by the hash value of the Key instance, it is possible to search the data from the Key at high speed. Value is simply the value associated with Key and is stored with Key. In HashSet, by setting a static value as Value, it is implemented using HashMap without consuming extra memory.

Comparison of each implementation class

In summary, let's compare the performance of each implementation class in order notation. If you don't understand the order notation, * O * (n) is slower than * O * (1).

Performance comparison

Implementation class add to Insert/Delete Search Random access memory usage
ArrayList O(1)※ O(n) O(n) O(1) Few
LinkedList O(1) O(1) O(n) O(n) During ~
HashSet O(1)※ O(1) O(1) impossible Many

*: Resize may occur

By comparison, you can see the characteristics of each implementation class. For example, if there are many inserts, LinkedList, if there are many random accesses, ArrayList, etc. Which class is suitable depends on the processing content, so select an appropriate implementation class.

Recommended Posts

Mechanism and characteristics of Collection implementation class often used in Java
[Java] Comparator of Collection class
Implementation of gzip in java
Use of Abstract Class and Interface properly in Java
Implementation of tri-tree in Java
[Java] Set structure of collection class (about HashSet and TreeSet)
Syntax examples often used in Java
StringBuffer and StringBuilder Class in Java
Implementation of like function in Java
Implementation of DBlayer in Java (RDB, MySQL)
[Java] Inheritance and structure of HttpServlet class
[Java] Contents of Collection interface and List interface
Discrimination of Enums in Java 7 and above
This and that of the implementation of date judgment within the period in Java
Comparison of thread implementation methods in Java and lambda expression description method
Review of "strange Java" and Java knowledge that is often forgotten in Java Bronze
I compared the characteristics of Java and .NET
Basics of threads and Callable in Java [Beginner]
A quick review of Java learned in class
Java and Swift comparison (3) Class implementation / Class inheritance / Class design
[Java] Where is the implementation class of annotation that exists in Bean Validation?
Summary of frequently used commands in Rails and Docker
Expired collection of java
Interpreter implementation in Java
A quick review of Java learned in class part4
Browse class objects in Kotlin (instead of Java class name.class)
Write a class in Kotlin and call it in Java
Boyer-Moore implementation in Java
Heapsort implementation (in java)
Personal summary of the guys often used in JUnit 4
A quick review of Java learned in class part3
List of frequently used Java instructions (for beginners and beginners)
A quick review of Java learned in class part2
[Java] Handling of character strings (String class and StringBuilder class)
Summarize the additional elements of the Optional class in Java 9
[For beginners] Explanation of classes, instances, and statics in Java
Implement Thread in Java and try using anonymous class, lambda
Implement Java Interface in JRuby class and call it from Java
Background and mechanism of Fabric-loader
[Implementation] Java Process class notes
[Java] Implementation of Faistel Network
Java class definition and instantiation
Gem often used in Rails
Summary of Java Math class
Advantages and disadvantages of Java
Matcher often used in RSpec
Deep copy collection in Java
Implementation of HashMap in kotlin
[Java] Get the dates of the past Monday and Sunday in order
[Java8] Proper use of Comparable and Comparator in terms of employee sorting
Summary of ORM "uroboroSQL" that can be used in enterprise Java
Let's create a TODO application in Java 6 Implementation of search function
Handle business logic for a set of Entity in Java class
Collection of programming selection tasks to make and remember (Java basics)
Examine the list of timezone IDs available in the Java ZoneId class
Read the first 4 bytes of the Java class file and output CAFEBABE
About methods often used in devise
Various methods of Java String class
About fastqc of Biocontainers and Java
Java 15 implementation and VS Code preferences
Test API often used in AssertJ