InterProScan 
Introduction to InterPro:

Databases of protein domains and functional sites have become vital resources for the prediction of protein functions. During the last decade, several signature-recognition methods have evolved to address different sequence analysis problems, resulting in rather different and, for the most part, independent databases. Diagnostically, these resources have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them. InterPro (The InterPro Consortium 2001) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site. The InterPro database integrates the PROSITE (Hofmann, K. et al. 1999), PRINTS (Attwood, T. K. et al. 2000), Pfam (Bateman, A. et al. 2000), ProDom (Corpet, F. et al. 1999), SMART (Schultz, J. et al. 2000) and TIGRFAMs (Haft, D. H. et al. 2001) databases, and the addition of others is scheduled. InterPro data is distributed in XML format and is freely available under the InterPro Consortium copyright. The InterPro project home page is available at http://www.ebi.ac.uk/interpro.

 

Any queries should be emailed to InterHelp@ebi.ac.uk.

InterPro member databases and scanning methods:

Legend: each database is listed by name with a short description, followed by its associated scanning tools.

PROSITE patterns.

Some biologically significant amino acid patterns can be summarised in the form of regular expressions.

ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk), Ppsearch (Fuchs, R. 1994).

PROSITE profile.

There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence; the use of techniques based on weight matrices (also known as profiles) allows the detection of such domains.

     pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).

PRINTS.
The PRINTS database houses a collection of protein family fingerprints. These are groups of motifs that together are diagnostically more potent than single motifs by making use of the biological context inherent in a multiple-motif method.

     FingerPRINTScan (Scordis, P. et al. 1999).

 PFAM.
Pfam is a database of protein domain families. Pfam contains curated multiple sequence alignments for each family and corresponding profile hidden Markov models (HMMs).

     hmmpfam from the HMMER2.1 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu), or the DeCypher™ (TimeLogic) implementation of HMM search.

 PRODOM.
ProDom families are built by an automated process based on a recursive use of PSI-BLAST homology searches.

     BlastProDom.pl (by Florence Servant, fservant@toulouse.inra.fr) - a filter on top of the Blast package (Altschul, S. F. et al. 1997).

 SMART.
SMART domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, and the corresponding hidden Markov models (HMMs) are then constructed from them.

     hmmpfam from the HMMER2.1 package.

TIGRFAMs.
TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, hidden Markov models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see below), where achievable, complements classification by orthologs, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large-scale genome sequencing projects.

     hmmpfam from the HMMER2.1 package.

Optionally, predictions of coiled-coil regions, signal peptide cleavage sites (SignalP v2) and transmembrane (TM) helices (TMHMM v2) are supported.

InterProScan:

InterProScan is a tool that combines different protein signature recognition methods into one resource. The number of signature databases and their associated scanning tools as well as the further refinement procedures increase the complexity of the problem. InterProScan is more than a simple wrapping of sequence analysis applications since it requires performing considerable data look-ups from some databases and program outputs.  The Perl-based InterProScan is intended to be an extensible and scalable system optimised to cope with bulk data processing. The need for production scale efficiency and an easy extensibility require a robust and efficient (parallel) internal architecture that can benefit from network distributed computing with the support of UNIX queuing systems. In the package a Perl-based simple data retrieval system was introduced to provide the required data look-up efficiency and easy extensibility.

Features:

1)      The most important feature of InterProScan is its ability to perform underlying processes in a parallel mode and to recover in case of failure in an intermediate step. To do so InterProScan relies on the GNU Make utility.

2)      Another important feature of InterProScan is the distributed execution of scanning jobs. The integrated applications are executed using Unix rsh on the configured network hosts. The job can either be directly executed on a remote host or can be submitted from the host to a Unix queuing system like LSF, which can redirect it further.

3)      As a wrapper InterProScan has a modular structure with a simple "one Perl module per database" organisation. This allows reusing the Perl modules for a particular database in other independent Perl scripts.

4)      Each of the Perl modules provides an object-oriented interface to the underlying database entry attributes. The parsing of data into memory objects happens only once and is done upon request, implementing so-called lazy parsing.

5)      Parsing routines are implemented using the Recursive Descent approach (Parse-RecDescent package by Damian Conway) and are described as SRS-like parsing rules.

6)      To speed up the required data look-up, InterProScan indexes the corresponding databases. Fast data retrieval is implemented based on Perl native B-trees indexing (DB_File.pm by Paul Marquess, based on Berkeley DB).

7)      InterProScan makes the results available in four formats:

a)      raw format - a basic tab-delimited format, useful for uploading the data into a relational database or for concatenating different runs.

b)      xml format - a self-descriptive, computer-readable format compatible with the distribution XML format of InterProMatches.

c)      txt format - a condensed plain-text representation of the results.

d)      html format - conforms to the html3 standard and is viewable in Internet browsers. This format is enhanced by a graphical representation of the identified matches as well as by hyperlinks to the corresponding InterPro entries, the signature entries of the InterPro member databases, the scanned protein sequences and the original output of the underlying applications. It also provides links to the applications' home pages.

8)      The InterProScan package includes optional support for a Web user interface with a script for basic retrieval of local data.

9)      You can submit nucleic acid sequences that will be translated in all 6 frames and piped into the analysis programs.

10)  The InterProScan package implements additional filtering of the results based on family specific cut-offs.
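
The six-frame translation mentioned in feature 9 can be illustrated with a minimal Python sketch. This is not the package's Perl implementation; it assumes the standard genetic code only, and all function names are illustrative.

```python
# Sketch of six-frame translation (standard genetic code only; names are illustrative).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[pos:pos + 3], "X")
        for pos in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    forward, reverse = dna.upper(), reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

Each of the six translated strings would then be treated as a separate protein query; stop codons appear as '*' in this sketch.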

Availability:

InterProScan and the underlying applications are freely available under the GNU licence agreement from the EBI's ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/).

System requirements:

        The InterProScan package has been developed in Perl5 under UNIX

        DB_File.pm (interface to Berkeley DB, which is a part of standard Perl5 distribution)

        GNU Make utility (http://www.gnu.org/manual/make/)

        Binaries of signature recognition methods provided for the following UNIX platforms:

       iprscan_bin_IRIX64.tar.gz - the executables for SGI

       iprscan_bin_Linux.tar.gz  - ... for Linux PC

       iprscan_bin_OSF1.tar.gz - ... for DEC Alpha

       iprscan_bin_SunOS.tar.gz - ... for SUN

       The full installation (with binaries for all platforms) takes about 400 MB.

       For distributed computing:

       InterProScan relies on UNIX rsh. This means you have to be able to rlogin to the hosts you are going to use.

       The installation should be on a shared file system (e.g. over NFS) that is accessible from all hosts (in the queue) you are going to use.

       The installation step implies that you are able to execute such commands as 'ls', 'pwd', 'rsh', 'uname'.

Installation:

1) Download, unzip and untar the following files

from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/ e.g.:

        % ncftp ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/iprscan_XXX_.tar.gz

        % gunzip -c iprscan_XXX_.tar.gz | tar xvf -

a) iprscan_vX.tar.gz - the InterProScan itself

b) iprscan_bin_XXX.tar.gz - executables of signature scanning applications specific to the UNIX platform(s) you are going to use (see System requirements).

c) iprscan_data.tar.gz  - databases used by the InterProScan, including all required indexes of the data.

2) From the root of the installation (where you created the 'iprscan' directory) run

 

        % perl CONFIG.pl

 

and follow the prompts.

3) You can test the installation by querying the test sequence (test.seq) included:

 

        % ./InterProScan.pl test.seq +ipr

 

where '+ipr' requests that the corresponding InterPro references be looked up and added to the output.

The program will prepare a temporary directory (something like 'tmp/jsmith_13-Sep-2000_21893') and print the following command, which the user must execute in order to start the scanning phase:

cd tmp/jsmith_13-Sep-2000_21893; gmake {raw/htm/xml/txt} -jX -k

To start the job, choose one of the available formats: raw, htm, xml or txt. X defines the maximum number of jobs that make runs in parallel. For example, to obtain output in raw format using two parallel jobs, type:

           % cd tmp/jsmith_13-Sep-2000_21893

           % gmake raw -j2 -k

When the scanning is finished the results will be in a file called 'merged.raw'. To check that everything works correctly you can compare your results with the 'test.raw' file included in the distribution.
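
The comparison of the two files can also be scripted. The following Python sketch is one possible way to do it and is not part of the package: it compares the files as unordered sets of lines, on the assumption that line order may differ between runs. The function names are illustrative.

```python
def read_raw_lines(path):
    """Return the set of non-empty lines, ignoring trailing whitespace and order."""
    with open(path) as handle:
        return {line.rstrip() for line in handle if line.strip()}


def compare_raw(result_path, reference_path):
    """Return (missing, unexpected) lines relative to the reference file."""
    result = read_raw_lines(result_path)
    reference = read_raw_lines(reference_path)
    return sorted(reference - result), sorted(result - reference)


# Example call (paths are assumptions; adjust them to your installation):
# missing, unexpected = compare_raw("merged.raw", "test.raw")
```

Two empty lists indicate that both files contain the same matches. Differences may also simply reflect a newer data release than the one used to produce 'test.raw'.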

DATA Update:

1) If you are updating InterPro, Pfam or PRINTS, put the new files in the 'data/' directory, either preserving the original names or changing the names in the corresponding files in the 'conf/' directory. If you intend to update PROSITE or PRODOM, edit the location of the data in 'bin/index_data.pl' (marked by '#EDIT the following lines !!!').

If you are using PPSEARCH, note that it creates platform-dependent(!) indexes, so edit the names of the UNIX hosts on which you are going to run it in 'bin/index_data.pl'.

2) Run:

           % gmake

Distributed DATA and Applications:

InterPro protein signature databases.

PROSITE  : ftp://ftp.expasy.ch/databases/prosite/ prosite.dat & prosite.doc

PROFILE  : ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/ prosite_prerelease.prf

PRODOM   : ftp://ftp.toulouse.inra.fr/pub/prodom/current_release/ prodom2000.1.forblast.gz

InterPro : ftp://ftp.ebi.ac.uk/pub/databases/interpro/ interpro.xml.gz

Pfam     : ftp://ftp.sanger.ac.uk/pub/databases/Pfam/ Pfam.gz

PRINTS   : ftp://ftp.bioinf.man.ac.uk/pub/fingerPRINTScan/ printsXX_0.pval_blos62.gz

TIGRFAMs : ftp://ftp.tigr.org/pub/data/TIGRFAMs/

Scanning applications.

FingerPRINTScan       : ftp://proline.sbc.man.ac.uk/pub/fingerPRINTScan/binaries/

ScanRegExp & PPsearch : ftp://ftp.ebi.ac.uk/pub/software/unix/

ProfileScan           : http://www.isrec.isb-sib.ch/profile/profile.html

HMMPfam               : http://hmmer.wustl.edu/

NCBI Blast            : ftp://ncbi.nlm.nih.gov/blast/

~~~ not public ~~~

TMHMM (v. 2.0)        : http://www.cbs.dtu.dk/services/TMHMM/

SignalP V2.0          : http://www.cbs.dtu.dk/services/SignalP-2.0/

SMART                 : http://smart.embl-heidelberg.de/

Architecture review:

As mentioned above, the Perl-based InterProScan was designed for bulk sequence analysis. The architecture does not have any internal limitations on the number of submitted sequences and has been tested on runs with more than 100000 sequences. The general approach is to split the original input file into smaller parts with a pre-configured number of sequences in each. Later the jobs can be done in parallel at the level of the smaller parts and of the different methods. Internally it is implemented using the GNU Make utility (http://www.gnu.org/manual/make/).
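
The splitting step described above can be sketched in Python as follows. This only illustrates the idea and is not the package's Perl code; the function names and the way the chunk size is passed are assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_records(records, chunk_size):
    """Group records into lists of at most chunk_size entries."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Each chunk can then be processed independently, which is what allows make to schedule the chunks and the different methods in parallel.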

InterProScan is more than a simple wrapper of protein sequence analysis applications. In addition, it has to perform considerable data look-ups from some databases and is able to parse and retrieve program outputs. The system has a modular structure and is designed in an SRS-like fashion. Each of the data description modules defines the data schema of the source text data and the parsing rules. The corresponding Perl module provides an object-oriented interface to the underlying entry attributes. The parsing of the source data into memory objects happens only once and is done upon request, implementing so-called lazy parsing. Hierarchical parsing rules are implemented using the recursive-descent approach (Parse-RecDescent package). Fast data retrieval is implemented using Perl native B-tree indexing (DB_File.pm, based on Berkeley DB). The simple 'one Perl module per data source' organisation makes it possible to reuse the modules in other stand-alone ad hoc solutions. The Perl-based InterProScan is capable of providing post-processed, integrated results in several formats, and it can be used as a simple retrieval system for the underlying data.
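
The lazy-parsing idea can be shown with a small Python sketch. It is not a port of the Perl modules: the class name, the 'KEY value' line format and the parse_count attribute are hypothetical and exist only to demonstrate that parsing happens once, on first request.

```python
class LazyEntry:
    """Holds the raw text of an entry and parses it only on first attribute access."""

    def __init__(self, raw_text):
        self._raw_text = raw_text
        self._fields = None   # filled on demand
        self.parse_count = 0  # for demonstration only

    def _parse(self):
        # Hypothetical "KEY   value" line format, used here only for illustration.
        fields = {}
        for line in self._raw_text.splitlines():
            key, _, value = line.partition(" ")
            if key:
                fields.setdefault(key, []).append(value.strip())
        self.parse_count += 1
        return fields

    def get(self, key):
        """Return all values stored under key, parsing the entry if needed."""
        if self._fields is None:  # parse once, upon first request
            self._fields = self._parse()
        return self._fields.get(key, [])
```

Entries that are retrieved but never inspected are never parsed, which is where the saving comes from in bulk runs.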

Architecture Details


Implementation details:

Each installation has the following directories:

* 'data' directory contains all databases and required indices (run 'make' command from the installation root after data update to create indices; see DATA Update);

* 'tmp' directory is used to store temporary user sessions (and temporary jobs outputs);

* 'bin' directory contains some Perl scripts and platform specific binaries of scanning programs;

* 'lib' directory contains all Perl modules for each database/application used (they are not supposed to be edited directly(!) since your changes can be overwritten later by configuration scripts) and some for general use;

* 'conf' directory contains configuration files for each database/application used (since for simplicity these are not full Perl modules, run 'make' command from the installation root to make changes active).

The job itself is started and controlled by the make utility using the generated 'Makefile'. Since make prints all executed commands and their outputs, it is a good idea to redirect BOTH stdout and stderr to a log file (or use the 'script' command) to be able to trace any problems encountered. Note that make's terminating error message says little about what went wrong in the procedure. If the job crashes, you can try to fix the problem and then just run 'make' again to continue (or remove some outputs to have them regenerated).
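
If the job is launched from a script rather than an interactive shell, capturing both output streams could look like the following Python sketch. The helper name and the log file name are assumptions; only standard-library calls are used.

```python
import subprocess


def run_logged(command, log_path, cwd=None):
    """Run command, appending both stdout and stderr to log_path; return the exit code."""
    with open(log_path, "a") as log:
        completed = subprocess.run(
            command, cwd=cwd, stdout=log, stderr=subprocess.STDOUT
        )
    return completed.returncode


# Example call (session directory and log name are illustrative):
# run_logged(["gmake", "raw", "-j2", "-k"], "iprscan.log",
#            cwd="tmp/jsmith_13-Sep-2000_21893")
```

Because the log is opened in append mode, re-running the same call after fixing a problem keeps the earlier output for comparison.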

Results filtering / Match status:

Method cut-offs:

InterProScan is based on scanning methods native to the InterPro member databases. It is distributed with pre-configured method cut-offs that are recommended by the member database experts and are believed to report relevant matches. All cut-offs are defined in configuration files (see the 'conf' directory). Matches of Pfam and SMART signatures obtained with the fixed cut-off are subject to the following filtering.

* PFAM filtering:

Each Pfam HMM has its own cut-off scores for each domain match and for the total model match. These bit-score cut-offs are defined in the GA lines of the Pfam database. Initial results are obtained with quite a high common cut-off, and then the matches (of the signature or some of its domains) that score lower than the family-specific cut-offs are dropped.
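
A minimal Python sketch of this second filtering pass is given below. The field names, the layout of the cut-off table and the decision to keep matches for models without a known cut-off are assumptions made for illustration; the sketch does not parse GA lines itself.

```python
def filter_pfam_matches(matches, gathering_cutoffs):
    """Keep domain matches that pass both family-specific bit-score cut-offs.

    matches: list of dicts with 'model', 'seq_score' (whole-model score)
             and 'dom_score' (per-domain score); hypothetical field names.
    gathering_cutoffs: {model: (seq_cutoff, dom_cutoff)} in bits.
    """
    kept = []
    for match in matches:
        cutoffs = gathering_cutoffs.get(match["model"])
        if cutoffs is None:
            # Assumption: without a family-specific cut-off, the match that
            # passed the common cut-off is kept unchanged.
            kept.append(match)
            continue
        seq_cutoff, dom_cutoff = cutoffs
        if match["seq_score"] >= seq_cutoff and match["dom_score"] >= dom_cutoff:
            kept.append(match)
    return kept
```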

* PRINTS filtering:

There is a test version of PRINTS family-specific p-value cut-offs. All matches with a p-value greater than p_min for the signature are dropped.

* SMART filtering:

The publicly distributed version of InterProScan has a common e-value cut-off corresponding to the reference database size. A more sophisticated scoring model is used on the SMART web server and in the production of pre-calculated InterProMatches data. Exact scoring thresholds for domain assignments are proprietary data that can be obtained directly from the SMART team.

[The InterProMatches data production procedure uses the additional thresholds.txt (note, that the given cut-offs are e-values - the number of expected random hits and they are valid only in the context of reference database size) and descriptions.txt data files (which are available from the SMART team) to filter out results obtained with higher cut-off. It implements the following logic:

1. If the E-value of found match is higher than the 'cut_low' the match is dropped.

2. If the E-value of found match is higher than the 'family' cut-off it is reported as the family hit with unknown status.

3. If the E-value of found match is less than the 'family' cut-off and higher than the 'cutoff' it is reported as the family member with true status.

4. If the 'family' cut-off is undefined and the E-value of the match is higher than the 'cutoff' but less than the 'cut_low' it is reported as a domain match with unknown status.

5. If the E-value of the found match is less than the 'cutoff' it is reported as a domain match with true status.]
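
The five rules above can be written as a single decision function. The Python sketch below follows the rules in the order listed; the function name, the return convention and the behaviour when an E-value exactly equals a threshold are assumptions. The status symbols 'T' (true) and '?' (unknown) follow the convention used for match status in this document.

```python
def classify_smart_match(evalue, cutoff, cut_low, family=None):
    """Return (kind, status), or None when the match is dropped.

    kind is 'family' or 'domain'; status is 'T' (true) or '?' (unknown).
    family is None when no 'family' cut-off is defined for the domain.
    """
    if evalue > cut_low:
        return None                 # rule 1: too weak, dropped
    if family is not None:
        if evalue > family:
            return ("family", "?")  # rule 2: family hit, unknown status
        if evalue > cutoff:
            return ("family", "T")  # rule 3: family member, true status
    elif evalue > cutoff:
        return ("domain", "?")      # rule 4: domain match, unknown status
    return ("domain", "T")          # rule 5: domain match, true status
```

The sketch assumes that the thresholds are ordered as cutoff <= family <= cut_low, which is what the rules imply.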

* PROSITE patterns CONFIRMation:

ScanRegExp is able to verify PROSITE matches using corresponding statistically significant CONFIRM patterns. The default status of the PROSITE matches is unknown (?) and the true positive (T) status is assigned if the corresponding CONFIRM patterns match as well. The CONFIRM patterns were generated based on the true positive Swiss-Prot PROSITE matches using eMOTIF software with a stringency of 10e-9 P-value.
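
The status assignment can be illustrated with a short Python sketch. It is not ScanRegExp: the conversion covers only the basic PROSITE pattern syntax, the CONFIRM patterns are assumed to be available as ordinary regular expressions, and a single matching CONFIRM pattern anywhere in the sequence is assumed to be sufficient. The example patterns in the usage note are invented.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE pattern, e.g. 'C-x(2,4)-C-x(3)-[LIVMFYWC]', to a regex."""
    regex = pattern.strip().rstrip(".")      # drop the terminating period
    regex = regex.replace("-", "")           # element separators
    regex = regex.replace("x", ".")          # x matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} means "not P"
    regex = regex.replace("(", "{").replace(")", "}")   # repetition counts
    regex = regex.replace("<", "^").replace(">", "$")   # terminus anchors
    return regex


def match_status(sequence, pattern, confirm_regexes):
    """Return None if the pattern does not match, else 'T' or '?'."""
    if not re.search(prosite_to_regex(pattern), sequence):
        return None
    confirmed = any(re.search(rx, sequence) for rx in confirm_regexes)
    return "T" if confirmed else "?"
```

For example, with the invented pattern '<A-x(2)-[ST]-{P}-G.' the sequence 'AKLSAG' would be reported with status 'T' if one of the supplied CONFIRM expressions also matches, and with status '?' otherwise.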

Programs:

CONFIG.pl

is provided to make the installation and reconfiguration of InterProScan easy. Most of the prompts have some explanations and provide default suggestions in [].

InterProScan.pl input_file [+ipr [+go]] [+scr] [-tr_T:NN] [-tr_L:NN]

is the program that initiates an InterProScan job. It creates a temporary user directory and prepares all required infrastructure for the scanning:

1. the input file is checked to confirm FASTA format, and split into configured portions each in its own 'cnk_NN' directory

2. a local 'bin' directory is created, containing generated scripts to launch individual scanning steps and parsing scripts

3. each 'cnk_NN' directory gets its own Makefile that describes how to run all scanning and parsing

4. the top Makefile controls the final assembling of all results. At this point previously configured parameters such as required applications, their command line parameters, queue names and so on become fixed.

+ipr switches on look-up of the corresponding InterPro annotation

+go switches on look-up of the corresponding Gene Ontology annotation

+scr switches on reporting of the scores of the found matches

-tr_T:NN and -tr_L:NN specify the translation table code and the transcript length threshold, respectively, for nucleic acid to protein sequence translation (based on CodonTable.pm by Heikki Lehvaslaiho <heikki@ebi.ac.uk>).

prep_pm.pl  < conf/XX > lib/XX.pm

is used internally by make to extend configuration files in the 'conf' directory to proper Perl modules in the 'lib' directory.

bin/meter.pl tmp/user_NNN/

reports the progress of a job in 'user_NNN' session.

bin/getit.pl <lib_name>:<attr> or -libs

retrieves an entry of <lib_name> corresponding to indexed query <attr> (looks for CaSe sensitive exact match);

-libs reports all available databanks and status of their indices.

bin/indexer.pl <lib_name>:(<attr2index>[,<attr2index>[,<attr2index>]] or '-')

indexes (extending the current indices if the file exists) up to 3 specified entry attributes (<attr2index>) of the <lib_name> database; <lib_name>:- shows all declared attributes for the database.

bin/index_data.pl

checks and updates all required indices (see DATA Update).

bin/converter.pl format ./merged.raw  > merged.format

is used to reformat results from raw into html, xml or txt format.

bin/filterProDom.pl

is used to filter out ProDom entries that have a corresponding InterPro entry.

bin/iterator.pl mfasta.seq final_out ..cmd..should_print_to_stdout

Some programs (like BlastProDom.pl and ScanProfile) take a single sequence as their input, so this script takes a multiple-sequence file and applies the provided command to each sequence in turn.
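
The per-sequence iteration can be sketched in Python as follows. This is not iterator.pl: the function names are illustrative, and passing each record to the command on standard input is an assumption made for the sketch.

```python
import subprocess


def split_records(fasta_text):
    """Split multi-FASTA text into one string per record."""
    records, current = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def iterate_command(fasta_text, command):
    """Run command once per record (record on stdin) and concatenate the outputs."""
    outputs = []
    for record in split_records(fasta_text):
        completed = subprocess.run(
            command, input=record, capture_output=True, text=True, check=True
        )
        outputs.append(completed.stdout)
    return "".join(outputs)
```

With check=True a failing command raises an exception instead of silently producing a partial result file.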

bin/demo_iprscan.pl

was written (and is no longer supported) to show how running the scans consecutively on one sequence looks in Perl.

What's new:

since v1.x

* crc64 calculation for FastaSeq object (raw format changed: seq_CRC64 and seq_Length fields inserted after seq_ID);

* match status handling (raw format changed: status mark inserted after match location, which is considered to be true (T) unless parser reports something);

* NULL reference to InterPro reported if the corresponding InterPro entry was not found;

* PRODOM cut-off fixed (E-value/dbsize are frozen);

* cleanup of Blast core dumps (into 'blast.core') on low complexity sequences;

* FPrintScan partial match extrapolation trimmed to the submitted sequence length;

* Smart HMM search;

* [Smart family specific cut-off filtering (data is not distributed);]

* progress report (see bin/meter.pl);

* indexer (allows indexing of a specified attribute of an available database);

* html output split into chunks;

* query html form;

* [getit.pl regexp querying (commented out in 'lib/utils.ph' since it can cause problems by returning ambiguous results)]

since v2.0

* restructured CONFIG.pl

* parsers and data updated to InterPro release v3.1

since v2.1

* [Coiled-Coil search & display;]

* [TMHMM search & display (not distributed);]

* [SignalP search & display (not distributed);]

* Interpro2GO parsing;

* raw format changed (with +ipr switch): GO terms separated by ';' added at the end of lines;

* converter.pl - shows GO in txt & html formats;

* TIGRFAM search;

* scores parsing;

* implemented PRINTS family specific cut-offs for FingerPRINTScan;

* 6 frame translation for nucleic acid sequence input added.

since v3.1

* Sequences are chopped to feed SignalP program with only the N-terminal region (first 100 amino acids)
* Rewrite of the nucleic acid sequence translation script to fix a problem
* Low complexity region search with SEG
* HMMPfam decypher module
* Data updated to InterPro release v5.0
* Data updated to InterPro release v5.1
* Data updated to InterPro release v5.2
* Data updated to InterPro release v5.3

Any comments and suggestions are very welcome (InterHelp@EBI.ac.uk).

References:

1. The InterPro Consortium (*R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman, E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin, L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn, A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni, F.Servant, C.J.A.Sigrist, E.M.Zdobnov), "The InterPro database, an integrated documentation resource for protein families, domains and functional sites", Nucleic Acids Research, 2001. vol 29(1):37-40.

2. Hofmann K., Bucher P., Falquet L., and Bairoch A., "The Prosite Database, Its Status in 1999". Nucleic Acids Res, 1999. 27(1): p. 215-9.

3. Attwood T.K., Croning M.D., Flower D.R., Lewis A.P., Mabey J.E., Scordis P., Selley J.N., and Wright W., "Prints-S: The Database Formerly Known as Prints". Nucleic Acids Res, 2000. 28(1): p. 225-7.

4. Bateman A., Birney E., Durbin R., Eddy S.R., Howe K.L., and Sonnhammer E.L., "The Pfam Protein Families Database". Nucleic Acids Res, 2000. 28(1): p. 263-6.

5. Corpet F., Gouzy J., and Kahn D., "Recent Improvements of the Prodom Database of Protein Domain Families". Nucleic Acids Res, 1999. 27(1): p. 263-7.

6. Schultz J., Copley R.R., Doerks T., Ponting C.P., and Bork P., "Smart: A Web-Based Tool for the Study of Genetically Mobile Domains". Nucleic Acids Res, 2000. 28(1): p. 231-4.

7. Bucher P., Karplus K., Moeri N., and Hofmann K., "A Flexible Motif Search Technique Based on Generalised Profiles". Comput Chem, 1996. 20(1): p. 3-23.

8. Scordis P., Flower D.R., and Attwood T.K., "Fingerprintscan: Intelligent Searching of the Prints Motif Database". Bioinformatics, 1999. 15(10): p. 799-806.

9. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J., "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs". Nucleic Acids Res, 1997. 25(17): p. 3389-402.

11. Haft D.H., Loftus B.J., Richardson D.L., Yang F., Eisen J.A., Paulsen I.T., and White O., "TIGRFAMs: A Protein Family Resource for the Functional Identification of Proteins". Nucleic Acids Res, 2001. 29(1): p. 41-3.

12. Eddy, S.R. "HMMER: Profile hidden Markov models for biological sequence analysis". WWW, 2001. http://hmmer.wustl.edu/

How to cite:

Zdobnov E.M. and Apweiler R. "InterProScan - an integration platform for the signature-recognition methods in InterPro" Bioinformatics, 2001, 17(9): p. 847-8.