FM-index

Introduction

The FM-index is a compressed full-text, substring index based on the Burrows-Wheeler transform, with some additional information. It was created by Paolo Ferragina and Giovanni Manzini, who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The FM-index is a type of suffix array, a data structure used in high-performance text searching algorithms.

An image showing a simplified representation of an FM-index structure

History and Development

The FM-index was first introduced in a paper by Paolo Ferragina and Giovanni Manzini in 2000. The authors were researching ways to improve upon existing methods of text compression and searching. The result was the FM-index, which combined the Burrows-Wheeler transform with additional data to create a compressed, searchable index. This development was a significant advancement in the field of text searching algorithms, as it allowed for efficient storage and retrieval of data.

Structure and Function

The FM-index is built using a combination of the Burrows-Wheeler transform (BWT), prefix sums, and a compressed bitvector. The BWT is a reversible algorithm that rearranges a string into runs of similar characters. This is useful for compression, as it tends to produce a string with a lot of runs of the same character.

The prefix sums are used to provide a quick way to find the position of a character in the sorted list of characters of the text. The compressed bitvector, also known as the rank dictionary, is used to determine the occurrence of a character in the prefix of the BWT.

The FM-index can be used to efficiently find the number of occurrences of a pattern in the text, as well as locate the position of each occurrence. This makes it a valuable tool in fields such as bioinformatics, where it is used for tasks like genome sequencing.

Algorithm

The FM-index uses a backward search algorithm to find a pattern in the text. This algorithm starts at the end of the pattern and works its way backwards. It uses the BWT and the additional data structures to quickly find the range of suffixes that match the current prefix of the pattern.

The algorithm can be broken down into the following steps: 1. Initialize the top and bottom pointers to the start and end of the BWT. 2. For each character in the pattern, starting from the end:

  a. Use the prefix sum to find the new top pointer.
  b. Use the rank dictionary to find the new bottom pointer.

3. If the top pointer is less than or equal to the bottom pointer, the pattern is in the text. The number of occurrences is the difference between the bottom and top pointers plus one.

Applications

The FM-index has found wide application in the field of bioinformatics, particularly in genome sequencing. It is used in software like Bowtie and BWA, which are tools for aligning sequencing reads to long reference sequences such as the human genome.

It is also used in data compression, as it allows the original data to be reconstructed from the index. This property is useful in applications where space is at a premium, such as in embedded systems or in transmitting data over a network.

Advantages and Limitations

One of the main advantages of the FM-index is its space efficiency. It allows the storage of large amounts of data in a compressed format, while still allowing for efficient search and retrieval. This makes it particularly useful in fields like bioinformatics, where large volumes of data are common.

However, the FM-index also has some limitations. The construction of the index can be time-consuming, particularly for large texts. Additionally, while the index allows for efficient search, it does not support efficient arbitrary text retrieval or editing operations. This means that while it is excellent for tasks like searching a genome for a particular sequence, it is less suitable for tasks like text editing.