Burrows Wheeler Transform


Reading time: about 6 minutes
In [1]:
get_ipython().ast_node_interactivity = 'all'

import string
In [2]:
lipsum = """
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Nulla tempus convallis
accumsan. Suspendisse ac euismod lectus.
Class aptent taciti sociosqu ad litora
torquent per conubia nostra, per inceptos
himenaeos. Aliquam sit amet urna accumsan,
porttitor sem id, aliquam sapien. Morbi
ullamcorper erat eget auctor viverra.
Nam rutrum eget justo id fermentum. Proin
accumsan condimentum dolor, non bibendum
lorem placerat quis. Sed ultrices, nisl
non varius feugiat, purus eros faucibus
quam, vel laoreet quam turpis laoreet nibh.
Etiam tincidunt massa volutpat ligula
pharetra faucibus. Curabitur malesuada erat
orci, ac aliquet nibh vestibulum tincidunt.
Nulla gravida erat neque, tristique aliquet
erat egestas at. Vivamus tristique tristique
nisl, convallis faucibus nisi dignissim sed.
Cras id erat sed sapien rutrum imperdiet.
Maecenas vestibulum mi libero, non iaculis
nunc viverra sed. Donec massa felis, tincidunt
at ligula ut, ultrices pulvinar libero. Aenean
lectus ipsum, porta sit amet sapien quis, fermentum non.""".replace("\n", " ")
In [3]:
alphabet = []

BEGIN = "(BEGIN)"
END = "(END)"

alphabet.append(BEGIN)
alphabet.append(END)

for letter in string.ascii_lowercase:
    alphabet.append(letter)
In [4]:
def all_shifts(l):
    copy = list(l)
    
    while True:
        copy.insert(0, copy.pop())
        yield list(copy)
        
        if copy == l:
            return
In [5]:
def bwt(symbols):
    result = []
    
    for perm in sorted(all_shifts(symbols)):
        result.append(perm[-1])
    
    return result
In [6]:
msg = "BANANA|"
msg = list(msg)

bwt(msg)
Out [6]:
['B', 'N', 'N', '|', 'A', 'A', 'A']

Let’s see if we can revert it.

In [7]:
def ibwt(symbols, end=None):
    if not end:
        end = END
    results = []
    
    for _ in symbols:
        results.append([])
    
    for _ in symbols:
        for res, s in zip(results, symbols):
            res.insert(0, s)
        results.sort()
    
    for res in results:
        if res[-1] == end:
            return res
In [8]:
ibwt("BNN|AAA", "|")
Out [8]:
['B', 'A', 'N', 'A', 'N', 'A', '|']
In [9]:
message = []

for c in "nana nana nana nana nana nana nana batmaaaan":
    message.append(c)

message.append(END)

"".join(message)
Out [9]:
'nana nana nana nana nana nana nana batmaaaan(END)'
In [10]:
message = bwt(message)

"".join(message)
Out [10]:
'aaaaaaannnnnnnnmaaannnnnnnb taaaaaaaa      (END)a'
In [11]:
message = ibwt(message)

"".join(message)
Out [11]:
'nana nana nana nana nana nana nana batmaaaan(END)'
In [12]:
lipsum[:70]
len(lipsum)
Out [12]:
' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla tempus'
Out [12]:
1026
In [13]:
message = bwt(list(lipsum) + [END])
"".join(message)
Out [13]:
'.......(END).........,enasur,ectttststnn,rs,immttmgtraadscssaa,dsasnsommrmstslndirttdmrctmttts,e,,lmastt,am,,s,tsntmntdmtmaramrita,mma,esmi,dtan,shcra.idtlssroamntsetmossdsdhtnattmsntn                 nddrslilrtlrlsrrr     lti u Mnm   vvNuuiuul   vsessllsss nhvtrnlmmrrp rrrri fff riiiiiru aiiiiiaaenaaaenaiiruuunnnsoa     meueeaccciioaeeieeiar nan  niiisuuuunnasllsSssrr   vf srrtiiimcpbAutmmmacnppp    c pffbb vvccgvvlgugueemmmirtf        neeeui iibbp smtbcb tgbllnnttcccrr   vcccpppddllshd ov tttccd  lltttllllplulupnnnudnrrrssslctlbrVvv esslulupl  C  a    aAaalluee uuuaaoo uuuuouuuueueuuueaaiaauauau   aiirraaa sieuuuaioaoaeeoaeaoreieuiiiioeeeo i     g     ouueeueeeo ootrrsmrddvnnnncDcccctltltM aaLlctpprteints r m aaari   iiae  msii  iiierii  ueeeauooootrotruCeeeeeegooeooooatttttaeeueePeouoeeoottu  isauuiouoauiuieuuaieiouuissmmm   isn    is   iii uoaaaiieeeiiiuoeppiiineaaeeaanaeeaneaeaneeaueniar s pciEss   ssstsici puesll   uurnnnei ccqsqqqqnaaaaqqqqqqqeqqeggcNN   bbpstrrdltlstcccndddttC tppritbbmtbSj lrrnni  ii  al   '
In [14]:
message = ibwt(message)

"".join(message)
Out [14]:
' Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla tempus convallis accumsan. Suspendisse ac euismod lectus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Aliquam sit amet urna accumsan, porttitor sem id, aliquam sapien. Morbi ullamcorper erat eget auctor viverra. Nam rutrum eget justo id fermentum. Proin accumsan condimentum dolor, non bibendum lorem placerat quis. Sed ultrices, nisl non varius feugiat, purus eros faucibus quam, vel laoreet quam turpis laoreet nibh. Etiam tincidunt massa volutpat ligula pharetra faucibus. Curabitur malesuada erat orci, ac aliquet nibh vestibulum tincidunt. Nulla gravida erat neque, tristique aliquet erat egestas at. Vivamus tristique tristique nisl, convallis faucibus nisi dignissim sed. Cras id erat sed sapien rutrum imperdiet. Maecenas vestibulum mi libero, non iaculis nunc viverra sed. Donec massa felis, tincidunt at ligula ut, ultrices pulvinar libero. Aenean lectus ipsum, porta sit amet sapien quis, fermentum non.(END)'

The following pages link here

Citation

If you find this work useful, please cite it as:
@article{yaltirakliwikiburrowswheelertransform,
  title   = "Burrows Wheeler Transform",
  author  = "Yaltirakli, Gokberk",
  journal = "gkbrk.com",
  year    = "2024",
  url     = "https://www.gkbrk.com/wiki/burrows-wheeler-transform/"
}
Not using BibTeX? Click here for more citation styles.
IEEE Citation
Gokberk Yaltirakli, "Burrows Wheeler Transform", November, 2024. [Online]. Available: https://www.gkbrk.com/wiki/burrows-wheeler-transform/. [Accessed Nov. 12, 2024].
APA Style
Yaltirakli, G. (2024, November 12). Burrows Wheeler Transform. https://www.gkbrk.com/wiki/burrows-wheeler-transform/
Bluebook Style
Gokberk Yaltirakli, Burrows Wheeler Transform, GKBRK.COM (Nov. 12, 2024), https://www.gkbrk.com/wiki/burrows-wheeler-transform/

Comments

© 2024 Gokberk Yaltirakli