An inverted index is a powerful data structure widely used in information retrieval systems and search engines to enable fast and efficient full-text searches.
It reverses the traditional document-to-word mapping by creating a word-to-document index, allowing for quick identification of documents containing specific terms. This approach significantly reduces search time and improves query performance, especially when dealing with large volumes of text data.
The inverted index works by breaking down documents into individual terms, removing stop words, and applying techniques like stemming to normalize the text.
It then creates a mapping where each unique term points to a list of documents or locations where that term appears. This structure enables rapid look-ups and facilitates complex queries, making it an essential component in modern search technologies and document retrieval systems.
DDL
Here’s how I store my inverted index in the database. Document size and count aren’t needed for searching with the inverted index, I use them for ranking purposes and it’s handy to have in the same table.
CREATE TABLE searchengine.inverted_index
(
`term` String CODEC(ZSTD(9)),
`document_id` FixedString(32),
`doc_size` UInt64 CODEC(T64('bit'), ZSTD(9)),
`count` UInt64 CODEC(T64('bit'), ZSTD(9))
)
ENGINE = ReplacingMergeTree
ORDER BY (term, document_id)