You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
With the current "faiss in a shadow table" storage strategy, there's one drawback: every time data in a vss0 table is inserted or updated and a COMMIT happens, the entire Faiss index has to be written to the DB. So if your index as 1 million vectors and you insert 1 more, then 1,000,001 vectors will have to be written to the DB.
This is a limitation of Faiss's IO tools: I believe only entire writes are supported (unless you mmap which isn't possible with in-db storage).
Proposal: A non-faiss index storage option
We should created our own be-spoke index storage option that allows for incremental, streaming writes without re-writing the entire index. This would be outside of Faiss, but can still use some of Faiss's C++ API for querying.
For example, we could have:
create virtual table vss_demo using vss0(
index_storage="incremental",
chunk_embeddings(768)
);
insert into demo(rowid, chunk_embeddings)
select rowid, chunk_embeddings from demo;
-- inserting 1 row should be fast and not re-write the entire indexinsert into demo(rowid, chunk_embeddings)
values (1234, :embedding);
The index_storage="incremental" option is the key: that signals that instead of the default faiss-shadow-table index storage strategy, the vss_demo table should instead store vectors in this new "incremental" storage format that's not Faiss-based.
We can still use Faiss for actual KNN and range searches, through knn_L2sqr_by_idx () and range_search_L2sqr(). But we'll need to design our own SQLite-based vector storage solution.
Design notes
This is a WIP, will likely change.
createtablevss_demo_config(
chunk_size integer
);
createtablevss_demo_streaming_chunks(
-- blob of size `chunk_size * sizeof(int64)`, storing the rowid values of the corresponding vectors in `vectors`
rowids blob,
-- bitmap of size `chunk_size / 8` that tracks if the i-th vector is valid/not-deleted
valid blob,
-- blob of size `chunk_size * sizeof(float), containing the raw float vectors
vectors blob
);
At creation time, an all-zero chunk of size chunk_size is inserted into vss_demo_streaming_chunks. At insert time, each query and their rowid is stored in the rowids and vectors column using SQLite's BLOB I/O API, and the i-th bit in the valid bitmap is updated. At query time, the vectors and rowids columns are read into memory, the range_search_L2sqr() function is called to find the K-nearest vectors in that chunk, and the chunks are de-allocated and the next chunk is worked on. In the end, we re-sort the the results and get the true K-nearest vectors.
Possibly could be a new operation="optimize" option that'll re-arrange these chunks for delete-heavy workflows, to save space.
insert into vss_demo(operation) values ('optimize');
Disadvantages
This index will likely be slower to query, since each query will have to read from the disk. However, it'll be reading from the already-opened SQLite database and not external files, and small BLOBs in SQLite are extremely fast, so this may not be noticeable in most applications.
This probably can only support flat indexes, ie it'll be an exhaustive search. Possibly could incorporate something like HNSW or IVF, but that'll be complex and further in the future.
The text was updated successfully, but these errors were encountered:
With the current "faiss in a shadow table" storage strategy, there's one drawback: every time data in a
vss0
table is inserted or updated and aCOMMIT
happens, the entire Faiss index has to be written to the DB. So if your index as 1 million vectors and you insert 1 more, then 1,000,001 vectors will have to be written to the DB.This is a limitation of Faiss's IO tools: I believe only entire writes are supported (unless you mmap which isn't possible with in-db storage).
Proposal: A non-faiss index storage option
We should created our own be-spoke index storage option that allows for incremental, streaming writes without re-writing the entire index. This would be outside of Faiss, but can still use some of Faiss's C++ API for querying.
For example, we could have:
The
index_storage="incremental"
option is the key: that signals that instead of the default faiss-shadow-table index storage strategy, thevss_demo
table should instead store vectors in this new "incremental" storage format that's not Faiss-based.We can still use Faiss for actual KNN and range searches, through
knn_L2sqr_by_idx ()
andrange_search_L2sqr()
. But we'll need to design our own SQLite-based vector storage solution.Design notes
This is a WIP, will likely change.
At creation time, an all-zero chunk of size
chunk_size
is inserted intovss_demo_streaming_chunks
. At insert time, each query and their rowid is stored in therowids
andvectors
column using SQLite's BLOB I/O API, and the i-th bit in thevalid
bitmap is updated. At query time, thevectors
androwids
columns are read into memory, therange_search_L2sqr()
function is called to find the K-nearest vectors in that chunk, and the chunks are de-allocated and the next chunk is worked on. In the end, we re-sort the the results and get the true K-nearest vectors.Possibly could be a new
operation="optimize"
option that'll re-arrange these chunks for delete-heavy workflows, to save space.Disadvantages
The text was updated successfully, but these errors were encountered: