Search basics and Java implementation / measurement

TL;DR

Major search types and images

--Linear search --Binary search --Hash search

Consider an example of searching for a customer by specifying mail_address.

Linear search

--Refer to the target data in order to check if the search key mail_address matches.

out.gif

[Computational complexity](https://ja.wikipedia.org/wiki/%E3%83%A9%E3%83%B3%E3%83%80%E3%82%A6%E3%81%AE%E8% A8% 98% E5% 8F% B7 #% E4% B8% 80% E8% 88% AC% E7% 9A% 84% E3% 81% AA% E3% 82% AA% E3% 83% BC% E3% 83 % 80% E3% 83% BC)

If there are 1024 data, it will be referenced 1024 times. Stupid.

O(n)

Binary search

--Sort the whole by mail_address --Refer to the data in the middle to see if they match. --If they don't match, discard half and refer to the middle of the other half --A kind of divide-and-conquer law

out.gif

[Computational complexity](https://ja.wikipedia.org/wiki/%E3%83%A9%E3%83%B3%E3%83%80%E3%82%A6%E3%81%AE%E8% A8% 98% E5% 8F% B7 #% E4% B8% 80% E8% 88% AC% E7% 9A% 84% E3% 81% AA% E3% 82% AA% E3% 83% BC% E3% 83 % 80% E3% 83% BC)

1024 cases = 2 ^ 10 cases = 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 cases are discarded in half, so You can find it by referring to at most 10 data. smart.

O(\log n)

Hash method

--Pass the mail_address of all the target data to the hash function and convert it according to a certain rule to get the hash value. --The search key mail_address is also passed to the same hash function and converted according to the same rule to get the hash value. --Refer to the data with matching hash value --No additional processing is required to find the specified hash value xxx from the hash value block of the target data (Implementation example: Hash table % 8F% E3% 83% 83% E3% 82% B7% E3% 83% A5% E3% 83% 86% E3% 83% BC% E3% 83% 96% E3% 83% AB))

out.gif

[Computational complexity](https://ja.wikipedia.org/wiki/%E3%83%A9%E3%83%B3%E3%83%80%E3%82%A6%E3%81%AE%E8% A8% 98% E5% 8F% B7 #% E4% B8% 80% E8% 88% AC% E7% 9A% 84% E3% 81% AA% E3% 82% AA% E3% 83% BC% E3% 83 % 80% E3% 83% BC)

Even with 1024 cases, you can directly refer to the target data by going through the hash function once.

O(1)

Disadvantage

In general, the algorithm is a trade-off between reduced execution time and reduced memory consumption (https://en.wikipedia.org/wiki/%E6%99%82%E9%96%93%E3%81%A8% E7% A9% BA% E9% 96% 93% E3% 81% AE% E3% 83% 88% E3% 83% AC% E3% 83% BC% E3% 83% 89% E3% 82% AA% E3% 83% 95) There are implementations that are both bad, and you can make one, There will be no algorithm that gives the best performance at the same time for each The hash method probably does not leak and consumes memory.

What is used in the database?

PostgreSQL execution plan (explain, explain analyze)

--Seq Scan: Linear search --Index Scan: Something like a binary search --Hash ~: Hash method

[PostgreSQL Index Only Scan Struggle Part 3 | TECHSCORE BLOG](https://www.techscore.com/blog/2013/06/07/postgresql-index-only-scan-%E5%A5%AE%E9%97 % 98% E8% A8% 98-% E3% 81% 9D% E3% 81% AE3 /) [SQL] Read the execution plan from zero knowledge and improve performance --Qiita EXPLAIN --PostgreSQL 10.5 document

Java implementation / measurement

Before implementing the search

キャプチャ.PNG

Even a simple processing branch takes 0.61 seconds if executed 2 billion times.

Source code to base on

Prepare 100,000 customers and search your email address 1000 times. https://ideone.com/w0hDzg

It is 0.9 seconds when the place (TODO) to put the search process is not implemented.

import java.util.*;
import java.lang.*;
import java.io.*;
 
class Ideone
{
	public static void main (String[] args) throws java.lang.Exception
	{
		//Create data for n people
		int n = 100000;
		Random rand = new Random();
		String[][] customer = new String[n][2];
		for (int i = 0; i < n; i++)
			customer[i][0] = String.valueOf((char) (rand.nextInt(26)+'a')) + i + "@ab.com";
		for (int i = 0; i < n; i++)
			customer[i][1] = i + "Mr.";
 
		//Sort by email address
		Arrays.sort(customer, (c1, c2) -> c1[0].compareTo(c2[0]));
 
		//We will also create a hash map with the key as the email address and the value as the name.
		Map<String, String> hashmap = new HashMap<>(n);
		for (String[] c : customer)
			hashmap.put(c[0], c[1]);
 
		//Same customer(101st)Can take the name of
		System.out.println(customer[101][0] + "Is the name" + customer[101][1]);
		//You can also get it by email address from the hash map
		System.out.println(customer[101][0] + "Is the name" + hashmap.get(customer[101][0]));
		System.out.println("Confirmation finished");
 
		//Then, select the email addresses for m people appropriately
		int m = 1000;
		String[] keys = new String[m];
		for (int j = 0; j < m; j++)
			keys[j] = customer[rand.nextInt(n)][0];
 
		//One by one
		for (int j = 0; j < m; j++){
			String key = keys[j];
			//Let's search
 
			// TODO
			String found = null;
 
			if (j % 100 == 0) {
				System.out.println(key + "Is the name" + Objects.toString(found, "unknown"));
			}
		}
	}
}

Linear search

It's 3.9 seconds. slow! https://ideone.com/GIle7V

//Where it was TODO
String found = null;
for (int i = 0; i < n; i++) {
    if (Objects.equals(key, customer[i][0])) {
        found = customer[i][1];
    }
}

Binary search

It's 1 second. In the first place, the data preparation took 0.9 seconds, so the search itself took almost no time. https://ideone.com/eF7Txv

//Where it was TODO
String found = null;
keyCustomer[0] = key;
int index = Arrays.binarySearch(customer, keyCustomer, (c1, c2) -> c1[0].compareTo(c2[0]));
found = customer[index][1];

Hash method

0.93 seconds. Almost no time https://ideone.com/67PzQP

//Where it was TODO
String found = null;
found = hashmap.get(key);

Binary search (1024 times)

Since the difference between the binary search and the hash method was the error level, Let's compare by multiplying the number of searches (m) by 1024. The binary search is 1.69 seconds. https://ideone.com/ppG31d

Hash method (1024 times)

Let's also increase the number of searches (m) by the hash method by 1024 times. It's 1.16 seconds. https://ideone.com/ppG31d Compared to binary search, the increase in execution time is slower as the number of search keys increases. It is the difference in the amount of calculation

Is exploration actually used in development?

Although there are few opportunities for beginners to be aware of it, they are using it. Consider the situation where you search the order history list in the database. In SQL, combine the customer customer table (more than 10,000 items) with the order order table (more than 10,000 items) and the order_detail order details table (more than 10,000 items) to get the details, and then the product product table (more than 10,000 items). Combine super) to get the product name. If you execute this SQL and finish in a realistic time, the DB administrator sets the appropriate primary key constraint / unique constraint / index, etc., and the DB performs a binary search based on it (like something). The tree structure data used for) and the hash table used for the hash method are prepared in advance and are used when executing SQL. In the field, processing is slow because the DB definition is missing / only linear search is possible because the SQL table join condition is complicated, and it is slow / you do not notice those problems until you run it in the production environment / there is room for improvement There are many unfortunate situations such as being left unattended. Please be aware that there are other suspicious points outside of your responsibility.

Hope this helps.


Other references

[What is Linear Search-IT Glossary e-Words] (http://e-words.jp/w/%E7%B7%9A%E5%BD%A2%E6%8E%A2%E7%B4%A2.html) [What is binary search (2-minute search)? --IT Glossary e-Words](http://e-words.jp/w/%E4%BA%8C%E5%88%86%E6%8E%A2% E7% B4% A2.html) [What is hash method (hash search)? --IT Glossary e-Words](http://e-words.jp/w/%E3%83%8F%E3%83%83%E3%82%B7%E3 % 83% A5% E6% B3% 95.html) Linear Search-Wikipedia Binary Search-Wikipedia

Recommended Posts

Search basics and Java implementation / measurement
Perceptron basics and implementation
Explanation and implementation of SocialFoceModel
Normalizing Flow theory and implementation
I compared Java and Python!
WordNet structure and synonym search
Python basics: conditions and iterations
Maxout description and implementation (Python)
[Python] Depth-first search and breadth-first search
[Day 5] View and template basics
Crawling with Python and Twitter API 2-Implementation of user search function
Mathematical explanation of binary search and ternary search and implementation method without bugs