Hello, this is day 23 of the ABEJA Advent Calendar.
Do you know a library called faiss? Facebook Resarch's neighborhood search library is very easy to use and has a lot of tutorials, so I use it frequently for both business and hobbies. I will. However, such a convenient library also has its weaknesses. When I try to do something a little niche, I just say, "Well, explain the details and get a feel for the test code (-д ☆)." I learned how to use it by actually reading the test code and sometimes reading the C ++ code of the main body, so I will write it with a reminder for myself in the future.
In rare cases, you want to get the features that faiss holds, rather than searching for them. For example, with faiss.IndexFlatL2
, the feature amount itself can be obtained from the attribute xb
. You can also convert it to numpy.ndarray
by using the function faiss.vector_float_to_array
.
>>> d = 2
>>> nb = 10
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>>
>>> index = faiss.IndexFlatL2(d)
>>> index.add(xb)
>>>
>>> index.xb
<faiss.swigfaiss.FloatVector; proxy of <Swig Object of type 'std::vector< float > *' at 0x7f7ff26dede0> >
>>> faiss.vector_float_to_array(index.xb)
array([0.70434606, 0.8172881 , 0.27514696, 0.04918063, 0.16584012,
0.58303493, 0.83627784, 0.7318148 , 0.91633004, 0.16084996,
0.6760264 , 0.65586466, 0.45432937, 0.35858378, 0.0895841 ,
0.3424391 , 0.6606455 , 0.7392444 , 0.07704416, 0.13714503],
dtype=float32)
You can check from which attribute the feature itself can be obtained from the header such as faiss / IndexFlat.h.
=>
float * x` conversion[faiss / c_api /] to use swig interface functions not implemented in faiss / python / faiss.py If you look at IndexFlat_c.h](https://github.com/facebookresearch/faiss/blob/master/c_api/IndexFlat_c.h), you may encounter float * x
. If you use faiss.swig_ptr
for numpy.ndarray
, you can convert it to a pointer of swig, so you can use such a little niche function.
>>> x = np.random.random(10).astype(np.float32)
>>> faiss.swig_ptr(x)
<Swig Object of type 'faiss::IndexReplicasTemplate< faiss::Index >::component_t *' at 0x7fe8a57c3b10>
Until now, many basic functions have been introduced, but from now on, it will be a little more advanced. About the handling of subspace in faiss.
First is the distance calculation in the subspace. You can use compute_distance_subset
to do the following:
>>> def compute_distance_subset(index, xq, subset):
... n, _ = xq.shape
... _, k = subset.shape
... distances = np.empty((n, k), dtype=np.float32)
... index.compute_distance_subset(
... n, faiss.swig_ptr(xq),
... k, faiss.swig_ptr(distances),
... faiss.swig_ptr(subset)
... )
... return distances
...
>>>
>>> d = 2
>>> nb = 10000
>>> nq = 10
>>> k = 5
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>> xq = np.random.random((nq, d)).astype(np.float32)
>>> subset = np.random.choice(range(nb), (nq, k))
>>>
>>> index = faiss.IndexFlatL2(d)
>>> index.add(xb)
>>>
>>> compute_distance_subset(index, xq, subset)
array([[0.04731181, 0.1585833 , 0.4276843 , 0.02083743, 0.14153683],
[0.55289364, 0.19499591, 0.24127454, 0.16293366, 0.02044217],
[0.61750704, 0.48981428, 0.51042193, 0.12334089, 0.55514073],
[0.5959296 , 0.6604827 , 0.20945217, 0.10136123, 0.01619768],
[0.13882631, 0.16818088, 0.01572821, 0.17454663, 0.03992677],
[0.46265444, 0.70609426, 0.49902472, 0.730565 , 0.09248901],
[1.1638596 , 1.1041545 , 0.73789394, 0.60920525, 0.21328084],
[0.02405633, 0.00557276, 0.6880306 , 0.821055 , 0.0421453 ],
[0.08726364, 0.33441633, 0.15067156, 0.28792596, 0.30785137],
[0.04219329, 0.747885 , 0.01912764, 0.19305223, 0.51132184]],
dtype=float32)
Note that compute_distance_subset
only calculates the distance of the subset, and does not calculate the K-nearest neighbors.
Next is how to create the subspace itself. This can be achieved by using copy_subset_to
. If you look at Splitting and merging indexes, you can roughly understand it, so I will write only the notes.
The first point is written in the document, but it can only be used with ʻIndex IVF`.
The second point is that there are some difficulties in how to create a subspace. As you can see from the Source Code, only continuous and periodic copies are supported.
faiss/IndexIVF.h
/** copy a subset of the entries index to the other index
*
* if subset_type == 0: copies ids in [a1, a2)
* if subset_type == 1: copies ids if id % a1 == a2
* if subset_type == 2: copies inverted lists such that a1
* elements are left before and a2 elements are after
*/
virtual void copy_subset_to (IndexIVF & other, int subset_type,
idx_t a1, idx_t a2) const;
By default, consecutive IDs are assigned, but you can use your own ID system by using ʻIndexIDMap`. If you look at Faiss ID mapping, you can see it, but the unique ID is as follows. You can use the system.
>>> d = 2
>>> nb = 10000
>>> nq = 10
>>> k = 2
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>> xq = np.random.random((nq, d)).astype(np.float32)
>>> ids = np.arange(nb) * 10
>>>
>>> index = faiss.IndexFlatL2(d)
>>> custom_index = faiss.IndexIDMap(index)
>>> custom_index.add_with_ids(xb, ids)
>>> custom_index.search(xq, k)
(array([[1.97887421e-05, 2.86698341e-05],
[1.38282776e-05, 6.81877136e-05],
[4.52995300e-06, 1.12056732e-05],
[8.55922699e-05, 9.26256180e-05],
[1.41859055e-05, 1.38044357e-04],
[2.40206718e-05, 4.58657742e-05],
[6.55651093e-06, 3.46302986e-05],
[5.24520874e-06, 1.35898590e-05],
[2.90870667e-05, 3.90410423e-05],
[2.38418579e-07, 2.86102295e-05]], dtype=float32), array([[38210, 66060],
[51500, 97890],
[17100, 97780],
[42300, 51430],
[ 3790, 63660],
[19050, 26470],
[22070, 45900],
[39140, 4190],
[10040, 7850],
[14390, 48690]]))
There is a caveat here. Since faiss.IndexIDMap
only maps the original index and ID system, if you are using an older version of faiss, you will get a segmentation fault in the situation where the index can be GC as shown below. The latest version of faiss (1.5.3) seems to solve this problem.
>>> index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
>>> index.add_with_ids(xb, ids)
Segmentation fault (core dumped)
By the way, there is not much use, but you can also search for the original index.
>>> index.search(xq, k)
(array([[1.97887421e-05, 2.86698341e-05],
[1.38282776e-05, 6.81877136e-05],
[4.52995300e-06, 1.12056732e-05],
[8.55922699e-05, 9.26256180e-05],
[1.41859055e-05, 1.38044357e-04],
[2.40206718e-05, 4.58657742e-05],
[6.55651093e-06, 3.46302986e-05],
[5.24520874e-06, 1.35898590e-05],
[2.90870667e-05, 3.90410423e-05],
[2.38418579e-07, 2.86102295e-05]], dtype=float32), array([[3821, 6606],
[5150, 9789],
[1710, 9778],
[4230, 5143],
[ 379, 6366],
[1905, 2647],
[2207, 4590],
[3914, 419],
[1004, 785],
[1439, 4869]]))
I introduced a little niche function of faiss, which is a neighborhood search library of Facebook Resarch, with a reminder to myself. It seems that it will be a niche article halfway and who will read it ... but I hope I can play the role of one person. I will continue to add it if I try a little niche function of faiss.
Recommended Posts