Cornerstone Sponsor  AMD  
  About NewsletterWorking GroupsWikiFAQsPresentationsSponsor OpenFPGA
  

OpenFPGA
Mission & Objectives
News & Activities
Newsletter
Web Resource Index
Mailing List
Forums
Presentations
Working Groups
Technology Roundtable
   Discussions Schedule

FAQs
OpenFPGA Wiki
Meeting Minutes
Board of Directors
Bootstrap Contributors
OpenFPGA Benchmarks
Sponsor OpenFPGA

Membership
Membership Information
Membership FAQs
Apply for Membership
Organizations
   with Current
   OpenFPGA Members

 

OpenFPGA Benchmarks

Benchmark #1 Smith-Waterman

The Smith-Waterman algorithm is a dynamic program used to find regions of similarity between two sequences. It is used to compute local alignment between nucleotide sequences, or between amino acid sequences. For this benchmark, we will assume that one of the inputs is the alphabet used is specified at run time (max 32 characters) and could be converted to a sequence of 5-bit integers as part of the preprocessing (in the CPU) before the data is passed to the FPGA. Also, one of these sequences, the reference sequence, is no longer than 10M characters long. The second sequence, the test sequence, is unbounded, and is expected to be streamed into the FPGA. The similarity metric allows not only replacement of characters, but also insertion and deletions (called gaps, since they represent a gap in one of the two sequences). It is usually considered less significant to extend a gap, once in a gap, than to initially enter the gap. Thus, two penalties are provided.

The expected inputs for this benchmark are:

  1. An alphabet representing the characters to be found in the two sequences. (Need to also specify if case represents different characters.)
  2. A reference sequence consisting of one or more initial lines (comment lines) followed by the sequence. The comment lines (at least one required) start with the character “>”. No comment lines are allowed inside the sequence. Maximum length 10M characters.
  3. Gap enter and gap extend penalties. (These are negative integers.)
  4. A similarity matrix for the alphabet used. Note that this should be such that the expected value of a random path is negative. (These are small integers.)
  5. Threshold of the minimum match score needed before reporting a match region.
  6. A test sequence, in the same format as the reference sequence, but potentially much longer.

The expected output will be a file containing for each reported match, the start and end in each sequence, along with the match score. (Since the detailed match path is not required, this eliminates the need for backtracking (the start point can be carried as part of the state). The exact path can be computed off-line, in the rare case that a match is identified, and thus is not part of the benchmark.)

Let R[i] be the i’th character in the reference sequence, and T[j] be the j’th character in the test sequence. Let S[i,j] be the state at position i,j. This state will contain the following information:
P[i,j] is the score for the best path to state S[i,j]
PS[i,j] is the start for this path.
G[i,j] is the score for the best path to state S[i,j] whose last step is a gap step.
Properly, there should be two values here, one for a vertical step, and a separate one for a horizontal step, but these are usually combined into a single score.
GS[i,j] is the start for this gap path.
Let GP be the gap penalty, and GE the gap extend penalty.
Let K[i,j] be the similarity of R[i] and T[j]

The values in S[i,j] can be defined as follows:

For i=0 or j=0:
	  P[i,j] = G[i,j]=0
	  PS[i,j]= GS[i,j]=i,j
Else:  G[i,j] = max  of
	  0 => GS[i,j]=i,j
	  G[i-1,j]+GE=> GS[i,j]=GS[i-1,j]
	  P[i-1,j]+GP=> GS[i,j]=PS[i-1,j]
	  G[i,j-1]+GE=> GS[i,j]=GS[i,j-1]
	  P[i,j-1]+GP=> GS[i,j]=PS[i,j-1]
P[i,j] = max  of  		
	  G[i,j]  => PS[i,j]=GS[i,j]  		
	  P[i-1,j-1]+K[i,j]=> PS[i,j]=PS[i-1,j-1]

Implementation note: Note that the calculation of S[i,j] and S[i+1,j-1] can be calculated at the same time, since they do not depend on each other. Thus the usual approach is to calculate at step t S[i,t-i] for all i.