Training your RBM on sequences of Protein

Training your RBM on sequences of Protein#

We would like you to train your RBM on a sequence of proteins across a large number of creatures. This closely follows this paper by Francesco Zamponi.

The protein sequences you are given are strings of 20 amino acids (plus a - to indicate no protein).

To get the data, use the following code:

!pip install bio
import pylab as plt
import random
from Bio import SeqIO
import numpy as np
import time
get_bin = lambda x, n: format(x, 'b').zfill(n)

p=dict()
count=0
input_file='PF00014_mgap6.fasta'
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
    name, sequence = fasta.id, str(fasta.seq)
    for s in sequence:
        if s not in p.keys():
            p[s]=[0 if k=='0' else 1 for k in get_bin(count,5)]
            count=count+1
            
def SeqToIsing(sequence):
    visible=[]
    for s in sequence:
        visible=visible+p[s]
    return np.array(visible)

input_file='PF00014_mgap6.fasta'
fasta_sequences = list(SeqIO.parse(open(input_file),'fasta'))

sequences=[]
for seq in fasta_sequences:
    sequences.append(SeqToIsing(seq))
sequences=np.array(sequences)

This will convert the 21 symbols into binary numbers.

Please train your RBM on these sequences. Note: You can do this with your standard RBM. In practice it will work “ok” but really what you want to do for it to work best is to not do an Ising model (with an up or down spin) but do a model where each site has 21 different spins. This is called a Potts model. You don’t have to do this, but if you do and make it work correctly, you get an extra 10 points (so 20 total points of extra credit for this section).

Analyze your results#

  • Plot the free energy as a function of epoch

  • Produce 10 sequences that are 1 Gibbs sample away from already existing sequences. See if those sequences seem reasonable. In the real world, what we would do is take these sequences and try to produce them.

  • Produce many many sequences and compare the probability of seeing a given amino acid on site 3 with the histogram of that probability in your dataset.