
STRUCTURAL BIOINFORMATICS

(Toward a High-Resolution Understanding of Biology)

Objectives of Lecture
- Structural Bioinformatics
- What is 3D Structure Prediction
- Significance of 3D Structure Prediction
- Central Dogma
- Fundamentals of Protein Structure
- Protein Data Bank (PDB)
- To be aware of a number of Structure Prediction methods:
  - Homology Modeling
  - Fold Recognition/Threading
  - Ab initio Protein Folding Approaches
- Applications of Structural Bioinformatics
  - Analog-Based Design
  - Structure-Based Design

Structural Bioinformatics
Structural Bioinformatics is a subset of Bioinformatics concerned with the use of biological structures (proteins, DNA, RNA, ligands, and complexes thereof) to further our understanding of biological systems.

What is protein structure prediction?


A prediction of the (relative) spatial position of each atom in the tertiary structure, generated from knowledge only of the primary structure (sequence).

Significance of Protein Structure Prediction

- In evolutionarily related proteins, structure is much better preserved than sequence.
- 3D protein structure offers much more information than just the amino acid sequence.
- By comparison with known structures, we can infer probable biological functions of new proteins.
- By mapping residue conservation onto the structure, we can infer active sites and possibly the molecular function.
- We can also identify regions involved in protein-protein interactions.
- We can reconstruct (at least partially) the structure of protein complexes identified by other experimental methods.
- We can build homology models.

The central dogma


DNA {A, C, G, T} (adenine, cytosine, guanine, thymine)
  -> RNA {A, C, G, U} (T is replaced by U)
  -> Protein {A, D, .., Y}
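The two arrows of the central dogma can be sketched in a few lines of Python. This is only an illustration: the example gene is made up, the function names are hypothetical, and the codon table is the standard genetic code built from the conventional TCAG ordering.

```python
# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """DNA coding strand -> mRNA: every T is replaced by U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """mRNA -> protein (one-letter codes), stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(rna) - 2, 3):
        codon = rna[pos:pos + 3].replace("U", "T")  # table is keyed on DNA letters
        residue = CODON_TABLE[codon]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)


dna = "ATGGAATCTTCTTAA"  # hypothetical example gene
rna = transcribe(dna)
print(rna)             # AUGGAAUCUUCUUAA
print(translate(rna))  # MESS
```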

Fundamentals of Protein Structure

Terminology
Primary Structure: the sequence of amino acid residues in the protein,
e.g. MESSTHEDRKVLDL

Amino acids and the peptide bond

Cα atoms
Cβ: first side chain carbon (except for glycine).

Secondary Structure
- A first-level description of 3D structure.
- The peptide backbone of a protein has areas of partial positive charge and partial negative charge.
- These areas can interact with one another to form hydrogen bonds.
- These hydrogen bonds give rise to two types of structure:
  - alpha helices
  - beta pleated sheets

Secondary Structure I: The α-Helix

Secondary Structure II: The β-Strand

(About 3.4 Å)

Several β-strands assemble into a β-sheet (a tertiary structural element)

Antiparallel β-Sheets

Parallel β-Sheets

Mixed β-Sheets

Tertiary Structure: The Global Three Dimensional Structure

- Secondary structure elements pack together to form a structural core.
- Tertiary structure results from the folding of alpha helices and beta pleated sheets.
- Factors influencing tertiary structure include:
  - Hydrophobic/hydrophilic interactions
  - Hydrogen bonding
  - Disulfide linkages
  - Folding by chaperone proteins

Tertiary Structure: Different Representations

(Richardson-style) ribbon diagrams are traces of the protein backbone emphasizing the 3-D arrangement of α-helices and β-strands. This arrangement is called the protein fold or the protein folding topology.

Tertiary Structure: Different Representations

This is much more like what other molecules see when they encounter a protein! It is a representation of the molecular surface (van der Waals surface) of a hemagglutinin domain with bound sialic acid.

Super-secondary Structures: Between Secondary and Tertiary Structure

For example:
- β-alpha-β (above)
- β-hairpin (left)

Quaternary Structure
Association of Multiple Polypeptide Chains.
- Quaternary structure results from the interaction of independent polypeptide chains.
- Factors influencing quaternary structure include:
  - Hydrophobic/hydrophilic interactions
  - Hydrogen bonding
  - The shape and charge distribution on associating polypeptides


Side Chain Properties


- Hydrophobic amino acids tend to stay in the interior of a protein.
- Hydrophilic ones tend to stay on the exterior of a protein.
- Oppositely charged amino acids can form salt bridges.
- Polar amino acids can participate in hydrogen bonding.
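One common way to put numbers on "hydrophobic" versus "hydrophilic" is a hydropathy scale. The sketch below uses the Kyte-Doolittle scale, which is a choice made here for illustration and is not named in the lecture. The window size and the zero cut-off between "likely buried" and "likely exposed" are likewise assumptions, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic, negative = hydrophilic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=5):
    """Mean hydropathy of every run of `window` consecutive residues."""
    values = [KYTE_DOOLITTLE[res] for res in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


seq = "MESSTHEDRKVLDL"  # the primary-structure example used earlier in the lecture
for start, score in enumerate(hydropathy_profile(seq)):
    # Assumed cut-off: a positive mean suggests a buried, hydrophobic stretch.
    label = "likely buried" if score > 0 else "likely exposed"
    print(seq[start:start + 5], round(score, 2), label)
```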

Domain, Motif, Fold

- Domain: a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
- Most proteins have multiple domains.
- Fold: the overall shape of a domain. There are only a few thousand possible folds.
- Motif (super-secondary structure): a structural pattern that occurs frequently among multiple proteins, which do not necessarily have similar folds.

Determination of protein structures

- X-ray Crystallography
- NMR (Nuclear Magnetic Resonance)
- EM (Electron Microscopy)

Protein Data Bank (PDB)

- A repository for 3-D biological macromolecular structures.
- Established in 1971 at Brookhaven National Lab (7 structures).
- It includes proteins, nucleic acids and viruses.
- Structures are obtained by X-ray crystallography (80%) or NMR spectroscopy (16%).
- Submitted by biologists and biochemists from around the world.

Other sites:
EBI: msd.ebi.ac.uk
MMDB (NCBI): www.ncbi.nlm.nih.gov/Structure/

Growth of Protein Data Bank (PDB): The Motivation

[Figure: growth of the PDB; legend: old fold, new fold]

- The number of unique folds in nature is fairly small (possibly a few thousand).
- 90% of new structures submitted to the PDB in the past three years have folds similar to ones already in the PDB.

Protein Structure Prediction Methods

- Comparative Modeling Method:
  - Homology Modeling Method
  - Threading Method
- Ab initio Folding Method

Protein structure prediction flowchart

[Flowchart: Sequence -> Database searching -> Homolog with an experimental structure?
  YES -> Homology modeling
  NO  -> Protein threading or ab initio method]

Homology Modeling
- Predicts the three-dimensional structure of a given protein sequence (TARGET) based on an alignment to one or more known protein structures (TEMPLATES).
- If similarity between the TARGET sequence and the TEMPLATE sequence is detected, structural similarity can be assumed.
- In general, at least 30% sequence identity is required for generating useful models.
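The 30% rule of thumb can be checked directly on a pairwise alignment. The sketch below assumes that identity is counted over columns where neither sequence has a gap; other denominators are also in use, so treat this as one possible convention. The function names are hypothetical.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (an assumed convention)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def is_usable_template(aligned_target, aligned_template, threshold=30.0):
    """Apply the lecture's 30% sequence-identity rule of thumb."""
    return percent_identity(aligned_target, aligned_template) >= threshold


# Toy alignment: 4 comparable columns, 3 of them identical.
print(percent_identity("ACDE-", "ACDFG"))    # 75.0
print(is_usable_template("ACDE-", "ACDFG"))  # True
```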

7 Steps In Homology Modeling

Step 1: ID Homologues in PDB

Query sequence:
PRTEINSEQENCEPRTEINSEQUENC EPRTEINSEQNCEQWERYTRASDFHG TREWQIYPASDFGHKLMCNASQERWW PRETWQLKHGFDSADAMNCVCNQWER GFDHSDASFWERQWK

[Figure: the query sequence is searched against the sequences in the PDB, returning Hit #1 and Hit #2]

Step 2: Align Sequences

      G    E    N    E    S    I    S
G    10    0    0    0    0    0    0
E     0   10    0    0    0    0    0
N     0    0   10    0    0    0    0
E     0   10    0   10    0    0    0
T     0    0    0    0    0    0    0
I     0    0    0    0    0   10    0
C     0    0    0    0    0    0    0
S     0    0    0    0   10    0   10

      G    E    N    E    S    I    S
G    60   40   30   20   20   10    0
E    40   50   30   20   20   10    0
N    30   30   40   20   20   10    0
E    20   30   20   30   20   10    0
T    20   20   20   20   20   10    0
I     0    0    0   10    0   20    0
C    10   10   10   10   10   10    0
S     0    0    0    0   10    0   10

Dynamic Programming

Alignment
- Key step in Homology Modeling.
- Global (Needleman-Wunsch) alignment is absolutely required.
- A small error in the alignment can lead to a big error in the structural model.
- Multiple alignments are usually better than pairwise alignments.
- The alignment is prepared by superimposing all template structures.
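A minimal sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch. It fills the table from the top-left corner, which is the common textbook formulation and may differ from the layout of the matrices shown for Step 2. The scoring values (10 for a match, 0 for a mismatch, 0 per gap) are assumptions chosen to mirror those matrices; real applications use a substitution matrix and gap penalties.

```python
def needleman_wunsch(a, b, match=10, mismatch=0, gap=0):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("GENETICS", "GENESIS")
print(best)  # 60
print(row_a)
print(row_b)
```

With these assumed scores the result for GENETICS against GENESIS is 60, the same as the top-left value of the summed matrix in Step 2. Several alignments tie for that score, so the printed rows are just one of them.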

Two zones of sequence alignment

Step 3: Find SCRs

Query   ACDEFGHIKLMNPQRST--FGHQWERT-----TYREWYEG
Hit #1  ASDEYAHLRILDPQRSTVAYAYE--KSFAPPGSFKWEYEA
Hit #2  MCDEYAHIRLMNPERSTVAGGHQWERT----GSFKEWYAA

Marked regions: SCR#1, SCR#2

Structurally Conserved Regions (SCRs)

- Correspond to the most stable structures or regions (usually interior) of the protein.
- Correspond to sequence regions with the lowest level of gapping and the highest level of sequence conservation.
- Usually correspond to secondary structures.
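The "lowest level of gapping" criterion can be sketched as a search for gap-free column blocks in the Step 3 alignment. This is a deliberately crude stand-in for real SCR detection: it ignores sequence conservation and secondary structure, the minimum block length of 5 is an assumption, and the function name is hypothetical.

```python
# The three aligned sequences from Step 3.
ALIGNMENT = {
    "Query":  "ACDEFGHIKLMNPQRST--FGHQWERT-----TYREWYEG",
    "Hit #1": "ASDEYAHLRILDPQRSTVAYAYE--KSFAPPGSFKWEYEA",
    "Hit #2": "MCDEYAHIRLMNPERSTVAGGHQWERT----GSFKEWYAA",
}


def gap_free_blocks(rows, min_length=5, gap="-"):
    """Return (start, end) column ranges, end exclusive, with no gap in any row."""
    length = len(rows[0])
    blocks, start = [], None
    for col in range(length + 1):
        # The extra iteration at col == length closes a block that runs to the end.
        has_gap = col == length or any(row[col] == gap for row in rows)
        if not has_gap and start is None:
            start = col
        elif has_gap and start is not None:
            if col - start >= min_length:
                blocks.append((start, col))
            start = None
    return blocks


blocks = gap_free_blocks(list(ALIGNMENT.values()))
print(blocks)  # [(0, 17), (32, 40)]
for start, end in blocks:
    print(ALIGNMENT["Query"][start:end])
```

The columns outside these blocks contain gaps or only short gap-free runs, which makes them candidates for the variable regions treated in Step 4.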

Step 4: Find SVRs

Query   ACDEFGHIKLMNPQRST--FGHQWERT-----TYREWYEG
Hit #1  ASDEYAHLRILDPQRSTVAYAYE--KSFAPPGSFKWEYEA
Hit #2  MCDEYAHIRLMNPERSTVAGGHQWERT----GSFKEWYAA
        HHHHHHHHHHHHHCCCCCCCCCCCCCCCCCCBBBBBBBBB

Marked region: SVR (loop)

Structurally Variable Regions (SVRs)

- Correspond to the least stable or most flexible regions (usually exterior) of the protein.
- Correspond to sequence regions with the highest level of gapping and the lowest level of sequence conservation.
- Usually correspond to loops and turns.

Step 5: Side Chain Modeling

- Rotamer placement and positioning is done via a superposition algorithm using rotamers.

Step 6: Model Optimization

- Efficient way of polishing and shining your protein model
- Removes atomic overlaps and unnatural strains in the structure
- Stabilizes or reinforces strong hydrogen bonds, breaks weak ones
- Brings the protein to its lowest energy in about 1-2 minutes of CPU time
- Several freeware options to choose from:
  - XPLOR (Axel Brunger, Yale)
  - GROMACS (Groningen, The Netherlands)
  - AMBER (Peter Kollman, UCSF)
  - CHARMM (Martin Karplus, Harvard)
  - TINKER (Jay Ponder, Wash U)
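To give a feel for what "removing atomic overlaps" means, here is a toy steepest-descent minimization of a single Lennard-Jones pair. It is not how the packages listed above work internally: they minimize full force fields over thousands of atoms. The reduced units (epsilon = sigma = 1), the step size, and the function names are all assumptions made for this sketch.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lennard_jones_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative of the pair energy with respect to r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r


def minimize_distance(r_start, step=0.01, tolerance=1e-8, max_steps=10000):
    """Steepest descent: move against the gradient until it nearly vanishes."""
    r = r_start
    for _ in range(max_steps):
        grad = lennard_jones_gradient(r)
        if abs(grad) < tolerance:
            break
        r -= step * grad
    return r


r0 = 1.0  # two atoms placed too close together (repulsive region)
r_min = minimize_distance(r0)
print(round(r_min, 4))                 # 1.1225, i.e. 2**(1/6) * sigma
print(round(lennard_jones(r_min), 4))  # -1.0, i.e. -epsilon
```

Starting from the overlapping distance, the pair relaxes to the energy minimum, which for this potential lies at 2^(1/6) sigma.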

Step 7: Model Validation


PROCHECK http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html

PROSA II http://lore.came.sbg.ac.at/People/mo/Prosa/prosa.html

VADAR http://www.pence.ualberta.ca/ftp/vadar/

DSSP http://www.embl-heidelberg.de/dssp/

Homology Modeling On Web

http://www.expasy.ch/swissmod/SWISS-MODEL.html

http://www.cmbi.kun.nl:1100/WIWWWI/

http://cl.sdsc.edu/hm.html

Raw Sequence

Use templates to build the structure of the homologous sequence

Predicted structure

Use of SwissPDB Viewer to build the structure of the following sequence

MQQPMNYPCP QIFWVDSSAT SSWAPPGSVF PCPSCGPRGP DQRRPPPPPP PVSPLPPPSQ PLPLPPLTPL KKKDHNTNLW LPVVFFMVLV ALVGMGLGMY QLFHLQKELA ELREFTNQSL KVSSFEKQIA NPSTPSEKKE PRSVAHLTGN PHSRSIPLEW EDTYGTALIS GVKYKKGGLV INETGLYFVY SKVYFRGQSC NNQPLNHKVY MRNSKYPEDL VLMEEKRLNY CTTGQIWAHS SYLGAVFNLT SADHLYVNIS QLSLINFEES KTFFGLYKL

DOGB

1TNRA

After magic fit

Activate the raw sequence

The Preliminary Result

Protein Threading
- Makes structure predictions by identifying a good sequence-structure fit.
- Protein threading can predict only the backbone structure of a protein (side chains have to be predicted using other methods).

Predicted

Actual

Ab Initio 3D structure prediction

- Aims to predict tertiary structure from basic physico-chemical properties.
- It is used when Homology Modeling and Threading have failed (no homologies are evident).
- Does not rely on any detection of similarity to a sequence of known structure.
- As yet very unreliable for practical predictions.

Applications
- Structural Bioinformatics can facilitate the discovery, design, and optimization of new chemical entities.
- Computer-aided drug design (CADD) or computer-aided molecular design (CAMD) follows two strategies:
  - Analog-based design (ligand based)
  - Structure-based design (target based)

Analog-Based Design

- The analog-based approach mainly uses pharmacophoric maps and Quantitative Structure Activity Relationships (QSAR) to identify or modify a lead in the absence of a known 3D structure of the receptor.

Structure-Based Design

- The structure-based approach starts with the structure of the receptor site, such as the active site in a protein.
- Docking comes under this category of design.

Quantitative Structure Activity Relationship (QSAR)

- QSAR is a series of mathematical models built to predict the biological and physicochemical behavior of molecules based on their chemical structures.
- It alleviates the need to measure the activity of hundreds of similar compounds individually, which would take large amounts of resources.
- The underlying premise of QSAR is that the biological activity of a molecule is correlated with its physicochemical parameters:
  BA = f (Biological + Chemical + Physical)
- Biological activity can be any measured quantity, such as IC50 or ED50.

QSAR Table

Structure   Bioproperty   Structural properties
Comp.1      Bio1          P1   P2   P3   P4
Comp.2      Bio2          "    "    "    "
Comp.3      Bio3          "    "    "    "
Comp.4      Bio4          "    "    "    "

BA = k1P1 + k2P2 + k3P3 + ...
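The coefficients k1, k2, k3, ... in the equation above can be estimated by ordinary least squares. The sketch below solves the normal equations in plain Python. The descriptor values and activities are invented for illustration: the activities were generated from k = (2, -1, 0.5), so a correct fit should recover those numbers. The function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_qsar(descriptors, activities):
    """Least-squares fit of BA = k1*P1 + k2*P2 + ... (no intercept term)."""
    n_feat = len(descriptors[0])
    # Normal equations: (X^T X) k = X^T y
    xtx = [
        [sum(row[i] * row[j] for row in descriptors) for j in range(n_feat)]
        for i in range(n_feat)
    ]
    xty = [
        sum(row[i] * y for row, y in zip(descriptors, activities))
        for i in range(n_feat)
    ]
    return solve_linear_system(xtx, xty)


# Hypothetical data: five compounds, three descriptors (P1, P2, P3) each.
descriptors = [
    [1.0, 0.5, 2.0],
    [2.0, 1.0, 0.5],
    [0.5, 2.0, 1.0],
    [1.5, 1.5, 1.5],
    [3.0, 0.2, 0.7],
]
activities = [2.5, 3.25, -0.5, 2.25, 6.15]

coefficients = fit_linear_qsar(descriptors, activities)
print([round(k, 3) for k in coefficients])  # [2.0, -1.0, 0.5]
```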

EXTERNAL VALIDATION OF QSAR MODELS

Entire dataset
  -> Training set -> Model development (q2)
  -> Test set     -> Prediction of the test set (R2)
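The validation scheme can be sketched with a one-descriptor model. Here q2 is taken to be the leave-one-out cross-validated coefficient of determination on the training set, and R2 the coefficient of determination for predictions on the held-out test set. Both definitions are common but are assumptions about what the slide intends. The data values and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def leave_one_out_q2(xs, ys):
    """Refit without each compound in turn and predict the one left out."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return r_squared(ys, predictions)


# Hypothetical dataset, already split into a training set and a test set.
train_x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
train_y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
test_x = [0.8, 1.8, 2.8]
test_y = [1.5, 3.7, 5.5]

q2 = leave_one_out_q2(train_x, train_y)        # model development
slope, intercept = fit_line(train_x, train_y)  # final model on the training set
test_pred = [slope * x + intercept for x in test_x]
r2 = r_squared(test_y, test_pred)              # prediction of the test set
print(round(q2, 3), round(r2, 3))
```

The test set plays no part in fitting, so its R2 is an estimate of how the model behaves on compounds it has not seen.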

Thank You
